Expression recognition method, device, equipment and storage medium for intelligent monitoring


1. An expression recognition method for intelligent monitoring is characterized by comprising the following steps:

acquiring an image sequence; wherein the image sequence contains a target person;

obtaining a face region in the image sequence through a face detection model;

obtaining expression information in the face area through an expression recognition model;

generating an initial expression sequence according to the time sequence of the image sequence and the expression information;

and correcting through a prediction model according to the initial expression sequence to obtain a facial expression sequence.

2. The expression recognition method according to claim 1, wherein the face detection model is a YOLOv3 face recognition model; the expression recognition model is a VGG16 expression classification model;

the expression information comprises x classes; wherein the x classes comprise neutral, serious, panic, curious, surprise, happiness and despise;

generating an initial expression sequence according to the time sequence of the image sequence and the expression information, and specifically:

generating a time sequence T according to the time information of each frame in the image sequence;

and sequencing the expression information according to the time sequence to obtain the initial expression sequence I.

3. The expression recognition method of claim 1, wherein the predictive model is an LSTM model; the input length of the LSTM model is n, and the characteristics of unit length comprise x types;

according to the initial expression sequence, correcting through a prediction model to obtain the facial expression sequence of the character, specifically:

dividing the initial expression sequence into input sequences with the length of n;

inputting the input sequence into the predictive model to obtain an output sequence of length n;

and obtaining the facial expression sequence according to the output sequence.

4. The expression recognition method according to claim 3, wherein the input length n is 11 frames.

5. An expression recognition apparatus for intelligent monitoring, characterized by comprising:

a sequence module for acquiring an image sequence; wherein the image sequence comprises images of a person;

the region module is used for obtaining a face region in the image sequence through a face detection model;

the expression module is used for obtaining expression information in the face area through an expression recognition model;

the initial module is used for generating an initial expression sequence according to the time sequence of the image sequence and the expression information;

and the final module is used for correcting through a prediction model according to the initial expression sequence so as to obtain a facial expression sequence.

6. The expression recognition apparatus according to claim 5, wherein the face detection model is a YOLOv3 face recognition model; the expression recognition model is a VGG16 expression classification model;

the expression information comprises x classes; wherein the x classes comprise neutral, serious, panic, curious, surprise, happiness and despise;

the initial module specifically comprises:

the time unit is used for generating a time sequence T according to the time information of each frame in the image sequence;

and the initial unit is used for sequencing the expression information according to the time sequence so as to obtain the initial expression sequence I.

7. The expression recognition apparatus according to claim 5, wherein the prediction model is an LSTM model; the input length of the LSTM model is n, and the characteristics of unit length comprise x types;

the final module specifically includes:

the input unit is used for dividing the initial expression sequence into an input sequence with the length of n;

an output unit for inputting the input sequence to the prediction model to obtain an output sequence of length n;

and the final unit is used for obtaining the facial expression sequence according to the output sequence.

8. The expression recognition apparatus according to claim 7, wherein the input length n is 11 frames.

9. An intelligently monitored expression recognition device comprising a processor, a memory, and a computer program stored in the memory; the computer program is executable by the processor to implement the intelligently monitored expression recognition method of any one of claims 1 to 4.

10. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the method according to any one of claims 1 to 4.

Background

To detect accidents involving the elderly or children in time, cameras are often installed in the places where they move about, to record in real time. Meanwhile, in order to know in time whether an elderly person or a child has had an accident, the footage captured by the cameras can be analyzed in real time by devices such as a local server or a cloud server, and an alarm is generated to notify the relevant personnel when the target person is judged to have had an accident.

Specifically, in the prior art, the expression of the target person can be analyzed to determine whether the target person shows an expression associated with an accident, such as pain or anger, thereby determining whether an accident has occurred. However, the expression recognition accuracy in the prior art is not high, which easily causes false alarms and unnecessary trouble for the relevant personnel.

Disclosure of Invention

The invention provides an expression recognition method, device, equipment and storage medium for intelligent monitoring, which aim to solve the problem of inaccurate expression recognition in the related art.

In a first aspect, an embodiment of the present invention provides an expression recognition method for intelligent monitoring, which includes the following steps:

S3B0, acquiring an image sequence; wherein the image sequence contains a target person;

S3B1, obtaining a face region in the image sequence through a face detection model;

S3B2, obtaining expression information in the face area through an expression recognition model;

S3B3, generating an initial expression sequence according to the time sequence of the image sequence and the expression information;

and S3B4, correcting through a prediction model according to the initial expression sequence to obtain a facial expression sequence.

Optionally, the face detection model is a YOLOv3 face recognition model; the expression recognition model is a VGG16 expression classification model;

optionally, the expression information includes x classes; wherein the x classes include neutral, serious, panic, curious, surprise, happiness and despise;

optionally, the step S3B3 specifically includes:

S3B31, generating a time sequence T according to the time information of each frame in the image sequence;

and S3B32, sequencing the expression information according to the time sequence to obtain the initial expression sequence I.

Optionally, the predictive model is an LSTM model; the input length of the LSTM model is n, and the characteristics of unit length comprise x types;

optionally, the step S3B4 specifically includes:

S3B41, dividing the initial expression sequence into input sequences with the length of n;

S3B42, inputting the input sequence into the prediction model to obtain an output sequence with the length of n;

and S3B43, obtaining the facial expression sequence according to the output sequence.

Optionally, the input length n is 11 frames.

In a second aspect, an embodiment of the present invention provides an expression recognition apparatus for intelligent monitoring, including:

a sequence module for acquiring an image sequence; wherein the image sequence comprises images of a person;

the region module is used for obtaining a face region in the image sequence through a face detection model;

the expression module is used for obtaining expression information in the face area through an expression recognition model;

the initial module is used for generating an initial expression sequence according to the time sequence of the image sequence and the expression information;

and the final module is used for correcting through a prediction model according to the initial expression sequence so as to obtain a facial expression sequence.

Optionally, the face detection model is a YOLOv3 face recognition model; the expression recognition model is a VGG16 expression classification model;

optionally, the expression information includes x classes; wherein the x classes include neutral, serious, panic, curious, surprise, happiness and despise;

optionally, the initial module specifically includes:

the time unit is used for generating a time sequence T according to the time information of each frame in the image sequence;

and the initial unit is used for sequencing the expression information according to the time sequence so as to obtain the initial expression sequence I.

Optionally, the predictive model is an LSTM model; the input length of the LSTM model is n, and the characteristics of unit length comprise x types;

optionally, the final module specifically includes:

the input unit is used for dividing the initial expression sequence into an input sequence with the length of n;

an output unit for inputting the input sequence to the prediction model to obtain an output sequence of length n;

and the final unit is used for obtaining the facial expression sequence according to the output sequence.

Optionally, the input length n is 11 frames.

In a third aspect, an embodiment of the present invention provides an intelligent monitoring expression recognition apparatus, which includes a processor, a memory, and a computer program stored in the memory; the computer program is executable by the processor to implement the intelligently monitored expression recognition method according to the first aspect.

In a fourth aspect, an embodiment of the present invention provides that the computer-readable storage medium includes a stored computer program, where when the computer program runs, the apparatus in which the computer-readable storage medium is located is controlled to execute the expression recognition method for intelligent monitoring according to the first aspect.

By adopting the technical scheme, the invention can obtain the following technical effects:

according to the above embodiments, the face image is extracted by the face detection model and expression recognition is then performed, which greatly improves recognition efficiency; after the initial recognition, the recognition result is corrected through the prediction model, which greatly improves recognition accuracy. This has good practical significance.

In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.

Fig. 1 is a schematic flow chart of a security monitoring method according to a first embodiment of the present invention.

Fig. 2 is a schematic view of a camera layout of a target area.

FIG. 3 is a block diagram of the structure of the LSTM model.

Fig. 4 is a block diagram of the structure of the SSD model.

Fig. 5 is a flow chart of a security monitoring method according to a first embodiment of the present invention.

Figure 6 is a schematic view of a human skeletal model.

Fig. 7 is a schematic structural diagram of a security monitoring device according to a second embodiment of the present invention.

Fig. 8 is a schematic flow chart of a security monitoring method according to a fifth embodiment of the present invention.

Fig. 9 is a schematic structural diagram of a safety monitoring device according to a sixth embodiment of the present invention.

Fig. 10 is a schematic flow chart of a security monitoring method according to a ninth embodiment of the present invention.

Fig. 11 is a schematic structural diagram of a safety monitoring device according to a tenth embodiment of the present invention.

The labels in the figure are: the system comprises a 0-sequence module, a 1-video module, a 2-image module, a 3-coefficient module, a 4-grade module, a 5-human body model module, a 6-human body coordinate module, a 7-human body parameter module, an 8-human body posture module, a 9-region module, a 10-expression module, an 11-initial module, a 12-final module, a 13-model module, a 14-detection module and a 15-classification module.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

For better understanding of the technical solutions of the present invention, the following detailed descriptions of the embodiments of the present invention are provided with reference to the accompanying drawings.

It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments.

The invention is described in further detail below with reference to the following detailed description and accompanying drawings:

The first embodiment:

Referring to fig. 1 to 6, the security monitoring method according to the first embodiment of the present invention can be executed by a security monitoring device, and particularly, executed by one or more processors in the security monitoring device to implement steps S1 to S4.

And S1, receiving a plurality of monitoring videos of different angles of the target area.

Specifically, the safety monitoring device is electrically connected to the monitoring system of the target area and can receive and analyze the monitoring footage shot by the monitoring system. As shown in fig. 2, the monitoring system has at least three cameras installed in the target area, and the cameras are installed at a height of 2.5 m or more above the ground, so that the angle between the cameras' line of sight and the persons in the target area is not more than 45°. The at least three cameras are respectively arranged at different angles of the target area.

It should be noted that the security monitoring device may be a cloud server, or a local computer, which is not limited in this respect.

And S2, respectively acquiring image sequences of the persons in the target area according to the plurality of monitoring videos.

Specifically, since the monitoring system has multiple cameras shooting from different angles, the surveillance videos include video streams of each person in the target area from different angles. The image data of each person that is most suitable for the subsequent analysis must therefore be selected from these videos.

Based on the above embodiments, in an alternative embodiment of the present invention, step S2 specifically includes steps S21 to S23.

And S21, obtaining skeleton information of each character in the target area at different angles through an OpenPose model according to the multiple monitoring videos.

And S22, acquiring the image areas of the persons according to the skeleton information. The image region is a region where an image having the largest skeleton area of each person is located.

S23, extracting image sequences of the respective persons from the plurality of monitored videos based on the image regions.

It should be noted that, when a person enters the target area, the OpenPose model identifies the same person in the multiple surveillance video streams and continuously tracks the person. The OpenPose model can identify the skeleton information of people in a video stream.

In this embodiment, the area occupied by the skeleton is used as the area of the human image captured by the camera. And for each person in the target area, only extracting the image collected by the camera in the direction with the largest skeleton area as the analysis basis. That is, image information having the largest skeleton area for each person is extracted from a plurality of video streams based on the skeleton information, and the extracted image information is sorted into an image sequence based on the time order of the video streams.
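As an illustration of this selection rule, the following Python sketch (not the patented implementation; the `detections` structure and keypoint format are assumptions) picks, for each person at a time step, the camera view whose skeleton bounding box has the largest area.

```python
import numpy as np

def skeleton_area(keypoints):
    """Area of the axis-aligned box enclosing the confidently detected 2D keypoints."""
    pts = np.asarray([(x, y) for (x, y, conf) in keypoints if conf > 0.1])
    if len(pts) < 2:
        return 0.0
    w = pts[:, 0].max() - pts[:, 0].min()
    h = pts[:, 1].max() - pts[:, 1].min()
    return float(w * h)

def select_best_view(detections):
    """detections: {camera_id: {person_id: keypoints}} for one time step.

    Returns {person_id: camera_id}, choosing for each tracked person the camera
    in which that person's skeleton covers the largest area.
    """
    best = {}
    for cam, people in detections.items():
        for pid, kps in people.items():
            area = skeleton_area(kps)
            if pid not in best or area > best[pid][1]:
                best[pid] = (cam, area)
    return {pid: cam for pid, (cam, _) in best.items()}
```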

It can be understood that the image with the largest skeleton area of a person in the image is often the front of the person. Therefore, the extracted image sequence includes the facial expression information and the body gesture information of each person.

In other embodiments, the extracted image sequence can be further ensured to be the image sequence of the front face of the person in the target area by combining with the face recognition model, so as to ensure the validity of the information.

And S3, acquiring the body posture, the facial expression sequence and the gesture sequence of each character according to the image sequence, and performing regression analysis to obtain the safety factor of each character.

Specifically, the state of a person can be determined from the person's body posture, facial expression and gesture sequence, for example whether the person is in an agitated state of anger and violent movement or in a calm, normal state. Regression analysis is then performed according to the collected states and a preset sequence of the states to obtain the current safety factor.

Based on the above embodiments, in an alternative embodiment of the present invention, the step S3 specifically includes the steps S3A1 to S3A4.

And S3A1, acquiring joint point data according to the image sequence, and establishing a human skeleton model. Wherein the joint point data includes the head, neck joint, trunk joint, right shoulder joint, right elbow joint, right wrist joint, left shoulder joint, left elbow joint, left wrist joint, right hip joint, right knee joint, right ankle joint, left hip joint, left knee joint, and left ankle joint.

Specifically, the OpenPose model has already identified and tracked each joint point of each person. As shown in fig. 6, in this embodiment, the above 15 joint points are selected from the joint points located by the OpenPose model, so as to establish a human skeleton model that is sufficient to express body language, has relatively few joint points, and is convenient for calculation. In other embodiments, more or fewer joints may be selected, which is not specifically limited by the present invention.

S3A2, establishing a human body dynamic coordinate system according to the human body skeleton model, with the trunk joint as the origin, the direction from the trunk joint to the neck joint as the Z axis, the direction from the left shoulder joint to the right shoulder joint as the X axis, and the facing direction of the human body as the Y axis, as shown in fig. 6.

And S3A3, normalizing the coordinates of each joint by height in the human body dynamic coordinate system, and then calculating the body parameters. Wherein the parameters include height, a first distance from the head to the X axis, a second distance from the right foot to the X axis, a third distance from the left foot to the X axis, a body tilt angle, a foot angular velocity, a shoulder center angular velocity, and moment information.

Specifically, in order to further analyze the body language information of each person in the target area, a coordinate system needs to be established for each person in each image sequence to analyze the position information of each joint point, so as to further analyze the body information.

It should be noted that, in order to adapt to differences between scenes and improve the accuracy of body language judgment, in this embodiment the coordinate information in the human body dynamic coordinate system is normalized based on the height information of each person. In other embodiments, the normalization process may be skipped to reduce the amount of calculation, which the present invention does not specifically limit. After normalization, the body parameters of the persons in the image sequence are calculated from the coordinate information of each joint point. The normalization process is prior art and is not described herein.
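The following Python sketch illustrates, under assumed joint names and 3D joint positions, the body-centred dynamic coordinate frame and height normalization described above; the head-to-ankle height proxy is an assumption rather than the patent's definition.

```python
import numpy as np

def unit(v):
    n = np.linalg.norm(v)
    return v / n if n > 1e-8 else v

def to_body_frame(joints):
    """joints: dict mapping joint name -> 3D position (numpy array of shape (3,))."""
    origin = joints["trunk"]
    z = unit(joints["neck"] - joints["trunk"])                    # trunk -> neck
    x = unit(joints["right_shoulder"] - joints["left_shoulder"])  # left -> right shoulder
    y = unit(np.cross(z, x))                                      # approximate facing direction
    x = unit(np.cross(y, z))                                      # re-orthogonalise the X axis
    R = np.stack([x, y, z])                                       # rows are the body axes
    # Crude height proxy (an assumption): head-to-left-ankle distance.
    height = np.linalg.norm(joints["head"] - joints["left_ankle"])
    return {name: R @ (p - origin) / max(height, 1e-8) for name, p in joints.items()}
```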

And S3A4, classifying through an SVM model according to the body parameters to obtain the body posture of the person in the image sequence.

Specifically, the body parameters are input into an SVM model, and the body language of the person in the image sequence is obtained through the SVM model, for example: standing still, walking slowly at a uniform speed, pushing and retracting the arms at a uniform speed, swinging the arms in the horizontal direction, swinging the arms in the vertical direction, and the like. Analyzing the behavior characteristics of the human body through an SVM model belongs to the prior art and is not repeated herein.
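As a hedged illustration of this classification step, the sketch below fits a generic scikit-learn SVM on placeholder body-parameter vectors; the feature layout and posture labels are illustrative assumptions, not the patent's training data.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder training data: each row would hold the body parameters
# [height, head-to-X distance, right-foot distance, left-foot distance,
#  tilt angle, foot angular velocity, shoulder angular velocity, ...].
X_train = np.random.rand(200, 8)
y_train = np.random.randint(0, 5, size=200)   # placeholder posture labels

posture_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
posture_clf.fit(X_train, y_train)

print(posture_clf.predict(np.random.rand(1, 8)))   # predicted posture class for a new sample
```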

In addition to the above embodiments, in an alternative embodiment of the present invention, the step S3 further includes steps S3B1 to S3B4.

And S3B1, obtaining a face region in the image sequence through a face detection model.

Specifically, the image sequence acquired in step S2 includes the front face information of the person. Therefore, a region of a human face is detected from the image sequence by a face detection model, and then an image of the region is extracted, thereby further analyzing the expression information of the person.

Preferably, the face detection model is the YOLOv3 face recognition model. The YOLOv3 face recognition model has good transferability, multi-target recognition capability and small-object recognition capability, and can accurately identify the face region from the image sequence. Training a YOLOv3 face recognition model capable of recognizing faces is a conventional technical means for those skilled in the art and is not described herein. In other embodiments, other face recognition models may be used, which is not specifically limited in the present invention.
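A minimal sketch of the detection stage follows; `yolo_face_detector` is a stand-in for any trained YOLOv3 face detector returning integer pixel boxes with scores, and is an assumption rather than a specific library call.

```python
def extract_face_regions(image_sequence, yolo_face_detector, score_thr=0.5):
    """Return, per frame, the face crops found by the detector.

    `yolo_face_detector(frame)` is assumed to return a list of
    (x1, y1, x2, y2, score) boxes with integer pixel coordinates.
    """
    faces_per_frame = []
    for frame in image_sequence:                         # frame: H x W x 3 array
        boxes = yolo_face_detector(frame)
        crops = [frame[y1:y2, x1:x2]
                 for (x1, y1, x2, y2, score) in boxes if score >= score_thr]
        faces_per_frame.append(crops)
    return faces_per_frame
```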

And S3B2, obtaining expression information in the face area through the expression recognition model.

Preferably, the expression recognition model is a VGG16 expression classification model. The expression information includes x classes, namely neutral, serious, panic, curious, surprise, happiness and despise. Training a VGG16 expression classification model capable of recognizing the x classes of expression information is a conventional technical means for those skilled in the art and is not described herein. Specifically, after the face region is recognized by the YOLOv3 face recognition model, the face region is extracted and input into the VGG16 expression classification model to obtain the expression information of the person in the image sequence. In other embodiments, other expression recognition models may be used, and the present invention is not limited in this respect.
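The sketch below shows one plausible way to realize such a 7-class VGG16 classifier with PyTorch/torchvision (recent torchvision assumed); the preprocessing choices and the untrained weights are placeholders, not the patent's trained model.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

CLASSES = ["neutral", "serious", "panic", "curious", "surprise", "happiness", "despise"]

expr_model = models.vgg16(weights=None)                   # load trained weights in practice
expr_model.classifier[6] = nn.Linear(4096, len(CLASSES))  # replace the 1000-way head with 7 classes
expr_model.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def classify_expression(face_crop):
    """face_crop: H x W x 3 uint8 array -> (class name, per-class probabilities)."""
    x = preprocess(face_crop).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(expr_model(x), dim=1)[0]
    return CLASSES[int(probs.argmax())], probs
```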

And S3B3, generating an initial expression sequence according to the time sequence and the expression information of the image sequence.

Based on the above embodiments, in an alternative embodiment of the present invention, the step S3B3 specifically includes steps S3B31 and S3B32.

S3B31, generating a time sequence T according to the time information of each frame in the image sequence.

And S3B32, sorting the expression information according to the time sequence to obtain an initial expression sequence I.

Specifically, the recognition results of the VGG16 expression classification model are sorted in time order to obtain the initial expression sequence I. The time order affects the prediction effect of the prediction model; therefore, generating the initial expression sequence in time order provides good input data for the subsequent prediction model. This has good practical significance.
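A small sketch of building T and I by sorting the per-frame results by timestamp (the tuple layout of `frame_results` is an assumption):

```python
def build_initial_sequence(frame_results):
    """frame_results: list of (timestamp, expression_label) pairs in arbitrary order."""
    ordered = sorted(frame_results, key=lambda r: r[0])
    T = [t for t, _ in ordered]          # time sequence
    I = [e for _, e in ordered]          # initial expression sequence
    return T, I
```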

And S3B4, correcting through a prediction model according to the initial expression sequence to obtain a facial expression sequence.

Specifically, the expression information in the image sequence can be rapidly identified through the face recognition model and the expression classification model. However, since each frame is recognized directly, individual frames may be recognized incorrectly. To avoid such recognition errors, in the present embodiment the recognition result of the VGG16 expression classification model is corrected by the LSTM prediction model, thereby avoiding expression recognition errors. The LSTM prediction model is shown in fig. 3; the input length of the LSTM model is n, and the features per unit length include x classes. These x features correspond to the x-class expression information output by the preceding VGG16 expression classification model, i.e., the probability of each of the x expressions. In other embodiments, other existing prediction models may be used, and the present invention is not limited thereto.
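A hedged PyTorch sketch of such a correcting LSTM follows: it takes windows of n per-frame probability vectors (x = 7 features) and outputs a per-step class; the hidden size is an illustrative assumption.

```python
import torch
import torch.nn as nn

class ExpressionCorrector(nn.Module):
    """LSTM over windows of n per-frame class-probability vectors (x = 7 features)."""

    def __init__(self, num_classes=7, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_classes, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):                  # x: (batch, n, num_classes)
        out, _ = self.lstm(x)              # per-step hidden states
        return self.head(out)              # per-step corrected class logits

corrector = ExpressionCorrector()
window = torch.rand(1, 11, 7)              # one window of length n = 11
print(corrector(window).argmax(dim=-1))    # corrected class index for each frame
```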

Based on the above embodiments, in an alternative embodiment of the present invention, the step S3B4 specifically includes steps S3B41 to S3B43.

And S3B41, dividing the initial expression sequence into input sequences of length n. Wherein the input length n is 11 frames. In other embodiments, the input length may be another number of frames, which is not specifically limited in the present invention.

Specifically, the change of facial expression over time is usually piecewise continuous rather than discrete; that is, the probability of an abrupt expression change within a short time period is low. Therefore, the initial expression sequence is divided into input sequences of length n for analysis, and each sequence is corrected according to the change of the detection results within this small neighborhood, which improves accuracy. To obtain the optimal value of n, the inventors analyzed the whole sample set and found that the duration of a single expression, as segmented by VGG16, falls in the interval [11, 58] frames; the optimal n was therefore searched for between 10 and 15, and the optimal sequence length was finally determined to be n = 11. In this embodiment, 11 is thus selected as the number of frames corrected by the LSTM in one pass, which improves the expression recognition capability.
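The windowing itself can be sketched as follows; `corrector` is assumed to be a model like the one above, and frames in an incomplete trailing window are left uncorrected in this sketch.

```python
import torch

def correct_in_windows(prob_sequence, corrector, n=11):
    """prob_sequence: (L, 7) tensor of per-frame class probabilities.

    Splits the sequence into non-overlapping windows of length n and corrects
    each window with the LSTM; frames in an incomplete trailing window are
    kept as the raw per-frame argmax in this sketch.
    """
    corrected = []
    for start in range(0, prob_sequence.shape[0], n):
        window = prob_sequence[start:start + n]
        if window.shape[0] == n:
            with torch.no_grad():
                corrected.append(corrector(window.unsqueeze(0))[0].argmax(dim=-1))
        else:
            corrected.append(window.argmax(dim=-1))    # leftover frames, uncorrected
    return torch.cat(corrected) if corrected else torch.empty(0, dtype=torch.long)
```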

And S3B42, inputting the input sequence into a prediction model to obtain an output sequence with the length of n.

And S3B43, obtaining a facial expression sequence according to the output sequence.

Specifically, the recognition results of the VGG16 expression classification model are sorted in time order; the sorted expression sequence is denoted I, with I(i) representing the i-th expression in the sequence. Let the corresponding time sequence be T, with T(i) denoting the time at which the i-th expression appears. T and I are input into the LSTM prediction model, which outputs the final classification result (1: neutral, 2: serious, 3: panic, 4: curious, 5: surprise, 6: happiness, 7: despise), denoted F_A, where F_A(t) is the classification result of the image appearing at time t.

In an alternative embodiment of the present invention, the step S3 further includes the steps S3C1 to S3C3:

and S3C1, constructing an object detection model based on the terminal lightweight neural network model.

Specifically, the image sequence contains the image information of the entire person, so directly performing gesture recognition on it requires a large amount of calculation. In this embodiment, a hand image is identified and extracted from the image sequence by building an object detection model, thereby improving recognition speed and accuracy. In this embodiment, the terminal lightweight model is a MnasNet model and the object detection model is an SSD model. In other embodiments, the terminal lightweight model and the object detection model may be other existing models, which is not specifically limited in the present invention.

Based on the above embodiments, in an alternative embodiment of the present invention, step S3C1 includes step S3C11.

And S3C11, constructing an SSD model with the MnasNet model as a backbone network. As shown in fig. 4, the backbone network sequentially includes: 1 layer of 3x3 Conv, 1 layer of 3x3 SepConv, 2 layers of 3x3 MBConv, 3 layers of 5x5 MBConv, 4 layers of 3x3 MBConv, 2 layers of 3x3 MBConv, 3 layers of 5x5 MBConv, 1 layer of 3x3 MBConv, and 1 layer of Pooling or 1 layer of FC.

Specifically, the backbone network of the existing SSD model is replaced, from the VGG16 convolutional backbone to MnasNet, which reduces the amount of calculation in the target detection process and greatly increases the target detection speed. This has good practical significance.

And S3C2, extracting hand images in the image sequence through the object detection model, and generating the hand image sequence according to the time of the image sequence.

Based on the above embodiments, in an alternative embodiment of the present invention, the step S3C2 includes steps S3C21 to S3C23.

S3C21, as shown in fig. 4, the image sequence is input into the backbone network frame by frame, so that the backbone network convolutes the image layer by layer.

S3C22, extracting five intermediate layers with scales of 112 × 112 × 16, 56 × 56 × 24, 28 × 28 × 40, 14 × 14 × 112 and 7 × 7 × 100 in the convolution process and performing regression analysis to obtain the region of the hand image.

S3C23, extracting the hand image from the image according to the region.

In this example, the inventors selected the five intermediate layers of MnasNet with scales 112 × 112 × 16, 56 × 56 × 24, 28 × 28 × 40, 14 × 14 × 112 and 7 × 7 × 100 as candidate regions under the SSD framework, and then performed regression analysis on these candidate regions in the classical SSD manner to obtain the final localization result.
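The sketch below illustrates the idea of harvesting multi-scale intermediate feature maps from a MnasNet trunk, using torchvision's mnasnet1_0 as a stand-in (recent torchvision assumed); its exact stage shapes may differ from the five scales quoted above, and the SSD regression heads are omitted.

```python
import torch
from torchvision import models

backbone = models.mnasnet1_0(weights=None).layers    # the convolutional trunk only
backbone.eval()

def multiscale_features(image_batch, wanted_sizes=(112, 56, 28, 14, 7)):
    """Run the trunk layer by layer and keep the deepest feature map at each wanted scale."""
    feats = {}
    x = image_batch
    with torch.no_grad():
        for layer in backbone:
            x = layer(x)
            s = x.shape[-1]
            if s in wanted_sizes:
                feats[s] = x              # overwrite: the deepest map at this scale wins
    return [feats[s] for s in wanted_sizes if s in feats]

maps = multiscale_features(torch.rand(1, 3, 224, 224))
print([tuple(m.shape) for m in maps])     # candidate-region feature maps for the SSD heads
```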

And S3C3, classifying the hand image sequence through an image classification model to obtain a gesture sequence.

Because the structure of MnasNet is simpler than that of VGG16, it is not sufficient to complete the classification task while performing localization. To classify gestures, the region selected by the final bounding box is therefore further cropped; the cropped image is preprocessed into an image matching the input scale of the VGG16 network and input to VGG16 for computation, and VGG16 finally completes the hand-posture classification. Compared with the traditional SSD framework using the classic VGG16 network as the backbone, replacing the candidate-region backbone with the MnasNet structure and using VGG16 only to classify the final selected region yields a lighter network with fewer parameters and correspondingly faster operation. The numbers of parameters of the two models were compared on this basis.

Specifically, each frame of the image sequence is input into the MnasNet-based SSD network, MnasNet performs the convolution operations, and the five intermediate layers of MnasNet with scales 112 × 112 × 16, 56 × 56 × 24, 28 × 28 × 40, 14 × 14 × 112 and 7 × 7 × 100 are extracted during convolution as candidate regions of the area where the hand is located; regression analysis is then performed on these five candidate regions in the classical SSD manner. The hand area identified by the region with the highest confidence among the five is taken as the candidate area where the hand is located, the hand position of this candidate area is mapped back to the original image, the corresponding region is cropped from the original image, and it is sent to VGG16 for hand-posture classification to obtain the final result.
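A brief sketch of this final step, with `ssd_predict` and `gesture_vgg16` as assumed components standing in for the detector and the gesture classifier:

```python
def classify_hand(frame, ssd_predict, gesture_vgg16):
    """ssd_predict(frame) is assumed to return (x1, y1, x2, y2, score) boxes with
    integer pixel coordinates; gesture_vgg16(crop) returns a gesture label."""
    boxes = ssd_predict(frame)
    if not boxes:
        return None
    x1, y1, x2, y2, _ = max(boxes, key=lambda b: b[4])   # highest-confidence candidate
    return gesture_vgg16(frame[y1:y2, x1:x2])            # classify the mapped-back crop
```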

In the present embodiment, the image classification model is a VGG16 classification model. In other embodiments, the image classification model may be other classification/recognition models, and the invention is not limited in this respect.

And S4, generating corresponding safety alarm levels according to the safety factors.

Based on the above embodiments, in an alternative embodiment of the present invention, step S4 includes steps S41 to S43.

S41, according to the safety factors, calculating the number of people in the target area whose safety factor is below a preset safety factor threshold, a first average value of the safety factors, and a second average value of the safety factors of adjacent scenes.

And S42, arranging the number of people, the first average value and the second average value into a time sequence characteristic according to the time sequence.

And S43, according to the time sequence characteristics, predicting through a prediction model to obtain the safety alarm level.

Specifically, a feature vector is first calculated for each individual in the scene; it consists of the body posture, facial expression sequence and gesture sequence obtained in step S3. Regression analysis is then performed on these feature vectors to calculate the individual's safety factor, and the individual's safety factor is combined with the body posture, facial expression sequence and gesture sequence feature vectors to obtain the individual's overall feature vector.

The safety factor is calculated as follows: first, pictures of individuals in dangerous states are acquired and manually evaluated with a safety factor score of 1-10; the scenes are then fed into each model in step S3 to obtain the three feature vectors V1, V2 and V3 of each scene, and the three feature vectors, the final scene result and the manual score are combined to generate a sample. Linear regression analysis is performed on these samples, and a new scene to be scored is evaluated with the resulting regression function, i.e., its safety factor is obtained by inputting the feature vectors V1, V2 and V3 of the new scene into the function.
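A hedged scikit-learn sketch of this regression, with placeholder data and an assumed concatenation of V1, V2 and V3:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder annotated scenes: rows are [V1 | V2 | V3], targets are manual 1-10 scores.
V = np.random.rand(300, 24)
scores = np.random.uniform(1, 10, size=300)

safety_reg = LinearRegression().fit(V, scores)

def safety_factor(v1, v2, v3):
    """v1, v2, v3: 1-D feature vectors whose concatenated length matches the training data."""
    return float(safety_reg.predict(np.concatenate([v1, v2, v3])[None, :])[0])
```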

After the feature vectors of all the people in the target area are calculated, features such as the number of people below the safety factor threshold, the first average value and the second average value of the adjacent scene are computed from the feature vectors; these features are arranged into a time-series feature according to acquisition time, the time-series feature is finally fed into the LSTM for safety-level evaluation, and an alarm of the corresponding level is issued according to the rating result and the specific use scene. The adjacent scene refers to the area beside the target area.
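A minimal PyTorch sketch of this alarm-level stage, assuming three features per time step ([count below threshold, first average, second average]) and an illustrative number of alarm levels:

```python
import torch
import torch.nn as nn

class AlarmLevelModel(nn.Module):
    """LSTM over per-step features [count below threshold, first average, second average]."""

    def __init__(self, num_levels=4, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=3, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_levels)

    def forward(self, x):                  # x: (batch, time_steps, 3)
        _, (h, _) = self.lstm(x)
        return self.head(h[-1])            # one alarm-level score vector per sequence

model = AlarmLevelModel()
series = torch.rand(1, 30, 3)              # 30 time steps of the three features
print(model(series).argmax(dim=-1))        # predicted alarm level
```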

According to the embodiment of the invention, an image sequence of each person at an angle suitable for analysis is extracted from the multi-angle monitoring videos; the body posture, facial expression sequence and gesture sequence of each person in the target area are then analyzed, the safety factor of each person is obtained by regression analysis of this information, and an alarm of the corresponding level is generated according to the safety factors. Nobody needs to watch the monitoring feed in real time, yet alarm conditions can still be discovered in time, which has good practical significance.

In a second embodiment, an embodiment of the present invention provides a security monitoring apparatus, which includes:

the video module 1 is used for receiving a plurality of monitoring videos of different angles of a target area.

And the image module 2 is used for respectively acquiring image sequences of all people in the target area according to the plurality of monitoring videos.

And the coefficient module 3 is used for acquiring the body posture, the facial expression sequence and the gesture sequence of each character according to the image sequence, and performing regression analysis to obtain the safety coefficient of each character.

And the grade module 4 is used for generating corresponding safety alarm grades according to the safety factors.

Optionally, the image module 2 specifically includes:

and the skeleton unit is used for acquiring skeleton information of each person in the target area at different angles through an OpenPose model according to the plurality of monitoring videos.

And the area unit is used for acquiring the image area of each person according to the skeleton information. The image region is a region where an image having the largest skeleton area of each person is located.

And the image unit is used for extracting image sequences of all people from the plurality of monitored videos respectively in the image area.

Optionally, the coefficient module 3 comprises:

and the human body model module 5 is used for acquiring joint point data according to the image sequence and establishing a human body skeleton model. Wherein the joint point data includes a head, a neck joint, a trunk joint, a right shoulder joint, a right elbow joint, a right wrist joint, a left shoulder joint, a left elbow joint, a left wrist joint, a right ankle joint, a left knee joint, a left ankle joint, a left hip joint, a left knee joint, and a left ankle joint.

And the human body coordinate module 6 is used for establishing a human body dynamic coordinate system according to the human body skeleton model, with the trunk joint as the origin, the direction from the trunk joint to the neck joint as the Z axis, the direction from the left shoulder joint to the right shoulder joint as the X axis, and the facing direction of the human body as the Y axis.

And the human body parameter module 7 is used for normalizing the coordinates of each joint by height in the human body dynamic coordinate system and then calculating the body parameters. Wherein the parameters include height, a first distance from the head to the X axis, a second distance from the right foot to the X axis, a third distance from the left foot to the X axis, a body tilt angle, a foot angular velocity, a shoulder center angular velocity, and moment information.

And the human body posture module 8 is used for classifying through an SVM (support vector machine) model according to the body parameters so as to obtain the body posture of the person in the image sequence.

Optionally, the coefficient module 3 further includes:

and the region module 9 is configured to obtain a face region in the image sequence through the face detection model.

The expression module 10 is configured to obtain expression information in the face area through the expression recognition model.

And the initial module 11 is configured to generate an initial expression sequence according to the time sequence of the image sequence and the expression information.

And a final module 12, configured to correct, according to the initial expression sequence, through the prediction model, to obtain a facial expression sequence.

Optionally, the face detection model is a YOLOv3 face recognition model. The expression recognition model is a VGG16 expression classification model.

Optionally, the expression information includes x classes. Wherein the x classes include neutral, serious, panic, curious, surprise, happiness and despise.

Optionally, the initial module 11 comprises:

and the time unit is used for generating a time sequence T by using the time information of each frame in the image sequence.

And the initial unit is used for sequencing the expression information according to the time sequence so as to obtain an initial expression sequence I.

Optionally, the predictive model is an LSTM model. The input length of the LSTM model is n, and the features per unit length include x classes.

Optionally, the final module 12 comprises:

and the input unit is used for dividing the input sequence into input sequences with the length of n according to the initial expression sequence.

An output unit for inputting the input sequence to the prediction model to obtain an output sequence of length n.

And the final unit is used for obtaining the facial expression sequence according to the output sequence.

Optionally, the input length n is 11 frames.

Optionally, the coefficient module 3 further includes:

and the model module 13 is used for constructing an object detection model based on the terminal lightweight neural network model.

And the detection module 14 is configured to extract a hand image in the image sequence through the object detection model, and generate the hand image sequence according to the time of the image sequence.

And the classification module 15 is configured to perform classification through an image classification model according to the hand image sequence to obtain a gesture sequence.

Optionally, the terminal lightweight model is a MnasNet model. The object detection model is an SSD model.

Optionally, the model module 13 is specifically configured to:

and constructing an SSD model taking the MnasNet model as a backbone network. Wherein, the backbone network includes in proper order: the 1-layer convolution kernel is 3x3 Conv, the 1-layer convolution kernel is 3x3 SpeConv, the 2-layer convolution kernel is 3x3 MBConv, the 3-layer convolution kernel is 5x5 MBConv, the 4-layer convolution kernel is 3x3 MBConv, the 2-layer convolution kernel is 3x3 MBConv, the 3-layer convolution kernel is 5x5 MBConv, the 1-layer convolution kernel is 3x3 MBConv, and the 1-layer Poo l i ng or the 1-layer FC.

Optionally, the detection module 14 comprises:

and the convolution unit is used for inputting the image sequence into the main network frame by frame so as to ensure that the main network convolutes the image layer by layer.

And the analysis unit is used for extracting five middle layers with the scale of 112 × 112 × 16, 56 × 56 × 24, 28 × 28 × 40, 14 × 14 × 112 and 7 × 7 × 100 in the convolution process to perform regression analysis to obtain the region of the hand image.

And the extraction unit is used for extracting the hand image from the image according to the area.

Optionally, the ranking module 4 comprises:

and the threshold unit is used for calculating the number of people in the target area, which is smaller than a preset safety coefficient threshold, a first average value of the safety coefficient and a second average value of the safety coefficient of the adjacent scene according to each safety coefficient.

And the time sequence unit is used for arranging the number of people, the first average value and the second average value into time sequence characteristics according to the time sequence.

And the grade unit is used for predicting through the prediction model according to the time sequence characteristics so as to obtain the safety alarm grade.

In a third embodiment, an embodiment of the present invention provides a security monitoring device, which includes a processor, a memory, and a computer program stored in the memory. The computer program is executable by a processor to implement the security monitoring method as defined in the first aspect.

In a fourth embodiment, the present invention provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, and when the computer program runs, the apparatus in which the computer-readable storage medium is located is controlled to execute the security monitoring method according to the first aspect.

Fifth embodiment, the expression recognition method of the present embodiment has the same implementation principle and technical effect as the first embodiment, and the present embodiment is briefly described. Where not mentioned in this embodiment, reference may be made to embodiment one.

Referring to fig. 8, an embodiment of the present invention provides an expression recognition method for intelligent monitoring, which can be executed by an expression recognition device or a security monitoring device for intelligent monitoring. In particular, the method is performed by one or more processors within an expression recognition device or a security monitoring device to implement at least steps S3B 0-S3B 4.

And S3B0, acquiring the image sequence. Wherein, the image sequence contains a target person.

And S3B1, obtaining a face region in the image sequence through a face detection model.

And S3B2, obtaining expression information in the face area through the expression recognition model.

And S3B3, generating an initial expression sequence according to the time sequence and the expression information of the image sequence.

And S3B4, correcting through a prediction model according to the initial expression sequence to obtain a facial expression sequence.

Optionally, the face detection model is a YOLOv3 face recognition model. The expression recognition model is a VGG16 expression classification model.

Optionally, the expression information includes x classes. Wherein the x classes include neutral, serious, panic, curious, surprise, happiness and despise.

Optionally, step S3B3 specifically includes:

S3B31, generating a time sequence T according to the time information of each frame in the image sequence.

And S3B32, sorting the expression information according to the time sequence to obtain an initial expression sequence I.

Optionally, the predictive model is an LSTM model. The input length of the LSTM model is n, and the features per unit length include x classes.

Optionally, step S3B4 specifically includes:

and S3B41, dividing the input sequence into input sequences with the length of n according to the initial expression sequence.

And S3B42, inputting the input sequence into a prediction model to obtain an output sequence with the length of n.

And S3B43, obtaining a facial expression sequence according to the output sequence.

Optionally, the input length n is 11 frames.

Optionally, the step S3B0 specifically includes steps S1 and S2:

and S1, receiving a plurality of monitoring videos of different angles of the target area.

And S2, respectively acquiring image sequences of the persons in the target area according to the plurality of monitoring videos.

Optionally, step S2 specifically includes:

and S21, obtaining skeleton information of each character in the target area at different angles through an OpenPose model according to the multiple monitoring videos.

And S22, acquiring the image areas of the persons according to the skeleton information. The image region is a region where an image having the largest skeleton area of each person is located.

S23, extracting image sequences of the respective persons from the plurality of monitored videos based on the image regions.

Step S3B4 is followed by the steps of:

and S3, performing regression analysis according to the facial expression sequence to obtain the safety factor of each character.

And S4, generating corresponding safety alarm levels according to the safety factors.

Based on the above embodiments, in an alternative embodiment of the present invention, step S4 includes steps S41 to S43.

S41, according to the safety factors, calculating the number of people in the target area whose safety factor is below a preset safety factor threshold, a first average value of the safety factors, and a second average value of the safety factors of adjacent scenes.

And S42, arranging the number of people, the first average value and the second average value into a time sequence characteristic according to the time sequence.

And S43, according to the time sequence characteristics, predicting through a prediction model to obtain the safety alarm level.

Please refer to the first embodiment. In the present embodiment, in order to reduce the amount of calculation and increase the recognition speed, the portions related to the body posture and the gesture sequence are omitted. In other embodiments, only one of the body posture and the gesture sequence may be omitted.

In a sixth embodiment, referring to fig. 9, an embodiment of the present invention provides an intelligent monitoring expression recognition apparatus, including:

and the sequence module 0 is used for acquiring an image sequence. Wherein, the image sequence contains the human image.

And the region module 9 is configured to obtain a face region in the image sequence through the face detection model.

The expression module 10 is configured to obtain expression information in the face area through the expression recognition model.

And the initial module 11 is configured to generate an initial expression sequence according to the time sequence of the image sequence and the expression information.

And a final module 12, configured to correct, according to the initial expression sequence, through the prediction model, to obtain a facial expression sequence.

Optionally, the face detection model is a YOLOv3 face recognition model. The expression recognition model is a VGG16 expression classification model.

Optionally, the expression information includes x classes. Wherein the x classes include neutral, serious, panic, curious, surprise, happiness and despise.

Optionally, the initial module 11 specifically includes:

and the time unit is used for generating a time sequence T according to the time information of each frame in the image sequence.

And the initial unit is used for sequencing the expression information according to the time sequence so as to obtain an initial expression sequence I.

Optionally, the predictive model is an LSTM model. The input length of the LSTM model is n, and the features per unit length include x classes.

Optionally, the final module 12 specifically includes:

and the input unit is used for dividing the initial expression sequence into input sequences of length n.

An output unit for inputting the input sequence to the prediction model to obtain an output sequence of length n.

And the final unit is used for obtaining the facial expression sequence according to the output sequence.

Optionally, the input length n is 11 frames.

Optionally, the sequence module 0 specifically includes:

the receiving unit is used for receiving a plurality of monitoring videos of different angles of the target area.

And the skeleton unit is used for acquiring skeleton information of each person in the target area at different angles through an OpenPose model according to the plurality of monitoring videos.

And the area unit is used for acquiring the image area of each person according to the skeleton information. The image region is a region where an image having the largest skeleton area of each person is located.

And the image unit is used for respectively extracting image sequences of all people from the plurality of monitoring videos according to the image areas.

Alternatively, the sequence module 0 includes the video module and the image module of the first embodiment, wherein:

and the video module is used for receiving a plurality of monitoring videos of different angles of the target area.

And the image module is used for respectively acquiring the image sequences of all people in the target area according to the plurality of monitoring videos.

Optionally, the image module comprises:

and the skeleton unit is used for acquiring skeleton information of each person in the target area at different angles through an OpenPose model according to the plurality of monitoring videos.

And the area unit is used for acquiring the image area of each person according to the skeleton information. The image region is a region where an image having the largest skeleton area of each person is located.

And the image unit is used for respectively extracting image sequences of all people from the plurality of monitoring videos according to the image areas.

The expression recognition apparatus further includes:

and the coefficient module 3 is used for performing regression analysis according to the facial expression sequence to obtain the safety coefficient of each character.

And the grade module 4 is used for generating corresponding safety alarm grades according to the safety factors.

On the basis of the above embodiment, in an alternative embodiment of the present invention, the level module 4 includes:

And the threshold unit is used for calculating, according to each safety factor, the number of people in the target area whose safety factor is below a preset safety factor threshold, a first average value of the safety factors, and a second average value of the safety factors of the adjacent scene.

And the time sequence unit is used for arranging the number of people, the first average value and the second average value into time sequence characteristics according to the time sequence.

And the grade unit is used for predicting through the prediction model according to the time sequence characteristics so as to obtain the safety alarm grade.

Seventh embodiment, an embodiment of the present invention provides an intelligent monitoring expression recognition device, which includes a processor, a memory, and a computer program stored in the memory. The computer program can be executed by the processor to implement the intelligently monitored expression recognition method according to the fifth embodiment.

Eighth embodiment, the present invention provides a computer-readable storage medium, which includes a stored computer program, where when the computer program runs, the apparatus in which the computer-readable storage medium is located is controlled to execute the expression recognition method for intelligent monitoring according to fifth embodiment.

Ninth embodiment, the implementation principle and the generated technical effects of the gesture recognition method of the present embodiment are the same as those of the first embodiment, and the present embodiment is briefly described. Where not mentioned in this embodiment, reference may be made to embodiment one.

Referring to fig. 10, an embodiment of the present invention provides an intelligent monitoring gesture recognition method, which can be executed by an intelligent monitoring gesture recognition device or a security monitoring device. In particular, the method may be performed by one or more processors within a gesture recognition device or a security monitoring device to implement at least steps S3C0 through S3C 3.

S3C0, acquiring the image sequence, wherein the image sequence contains a target person.

S3C1, constructing an object detection model based on a terminal lightweight neural network model.

S3C2, extracting hand images from the image sequence through the object detection model, and generating a hand image sequence according to the timing of the image sequence.

S3C3, classifying the hand image sequence through an image classification model to obtain a gesture sequence.
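A minimal sketch of step S3C3 follows, assuming the hand crops have already been extracted and ordered by frame time; this embodiment does not fix the classifier architecture, so an untrained ResNet-18 and the gesture class names below stand in as placeholders only.

```python
import numpy as np
import torch
import torchvision.transforms as T
from torchvision.models import resnet18

# Illustration of step S3C3: classify each cropped hand image in frame order to obtain
# a gesture sequence. The classifier architecture is not fixed by this embodiment, so
# an untrained ResNet-18 and the GESTURE_CLASSES names below are placeholders only.

GESTURE_CLASSES = ["open_palm", "fist", "point", "wave"]

preprocess = T.Compose([T.ToPILImage(), T.Resize((224, 224)), T.ToTensor()])
classifier = resnet18(num_classes=len(GESTURE_CLASSES)).eval()

def classify_hand_sequence(hand_crops):
    """hand_crops: list of HxWx3 uint8 arrays, already ordered by frame time.
    Returns one gesture label per frame."""
    gestures = []
    with torch.no_grad():
        for crop in hand_crops:
            batch = preprocess(crop).unsqueeze(0)           # 1 x 3 x 224 x 224
            label = classifier(batch).argmax(dim=1).item()  # most likely class index
            gestures.append(GESTURE_CLASSES[label])
    return gestures

# Example with random crops standing in for real hand images.
fake_crops = [np.random.randint(0, 255, (120, 90, 3), dtype=np.uint8) for _ in range(5)]
print(classify_hand_sequence(fake_crops))
```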

Optionally, the terminal lightweight neural network model is a MnasNet model, and the object detection model is an SSD model.

Optionally, step S3C1 specifically includes:

S3C11, constructing an SSD model with the MnasNet model as the backbone network, wherein the backbone network sequentially includes: 1 layer of 3x3 Conv, 1 layer of 3x3 SepConv, 2 layers of 3x3 MBConv, 3 layers of 5x5 MBConv, 4 layers of 3x3 MBConv, 2 layers of 3x3 MBConv, 3 layers of 5x5 MBConv, 1 layer of 3x3 MBConv, and 1 layer of Pooling or 1 layer of FC.
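For readability, the backbone layer listing above can be restated as a small configuration table; the variable name is hypothetical and the table simply mirrors the text.

```python
# The backbone layer listing of step S3C11 restated as a configuration table; each entry
# is (number of layers, kernel size, block type). The variable name is hypothetical and
# the table simply mirrors the text, ending with the pooling / fully connected stage.
MNASNET_BACKBONE_CONFIG = [
    (1, "3x3", "Conv"),
    (1, "3x3", "SepConv"),
    (2, "3x3", "MBConv"),
    (3, "5x5", "MBConv"),
    (4, "3x3", "MBConv"),
    (2, "3x3", "MBConv"),
    (3, "5x5", "MBConv"),
    (1, "3x3", "MBConv"),
    (1, "-",   "Pooling or FC"),
]
```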

Optionally, step S3C2 specifically includes:

S3C21, inputting the image sequence frame by frame into the backbone network so that the backbone network convolves each image layer by layer.

S3C22, extracting, during the convolution, five intermediate feature maps with scales of 112 × 112 × 16, 56 × 56 × 24, 28 × 28 × 40, 14 × 14 × 112 and 7 × 7 × 100, and performing regression analysis on them to obtain the region of the hand image.

S3C23, extracting the hand image from the image according to that region.
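The multi-scale feature collection of step S3C22 can be sketched with torchvision's MnasNet, as shown below. The five spatial scales (112, 56, 28, 14 and 7 for a 224 x 224 input) match the text, but mnasnet1_0 yields 16, 24, 40, 96 and 1280 channels at those scales rather than the 16/24/40/112/100 figures above, and the regression head that turns these maps into the hand region is omitted.

```python
import torch
import torchvision

# Sketch of the multi-scale feature collection in step S3C22 using torchvision's MnasNet
# as the backbone. The five spatial scales (112, 56, 28, 14, 7 for a 224 x 224 input)
# match the text, but mnasnet1_0 yields 16, 24, 40, 96 and 1280 channels at those
# scales rather than the 16/24/40/112/100 figures in the text; the regression head is omitted.

backbone = torchvision.models.mnasnet1_0().layers.eval()  # sequential backbone stages

def multi_scale_features(image_batch):
    """Run the backbone layer by layer, keeping the deepest feature map at each resolution."""
    feats = []
    x = image_batch
    for layer in backbone:
        x = layer(x)
        if feats and x.shape[-1] == feats[-1].shape[-1]:
            feats[-1] = x      # same resolution: keep the deeper map
        else:
            feats.append(x)    # a new, smaller resolution has been reached
    return feats

with torch.no_grad():
    maps = multi_scale_features(torch.randn(1, 3, 224, 224))
for m in maps:
    print(tuple(m.shape))      # five maps at 112, 56, 28, 14 and 7 pixels
```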

Optionally, the step S3C0 specifically includes steps S1 and S2:

S1, receiving a plurality of monitoring videos of the target area captured from different angles.

S2, acquiring, from the plurality of monitoring videos, an image sequence of each person in the target area.

Optionally, step S2 specifically includes:

S21, obtaining, from the plurality of monitoring videos, skeleton information of each person in the target area at the different angles through an OpenPose model.

S22, acquiring the image area of each person according to the skeleton information, the image area being the region of the image in which that person's skeleton area is largest among the different angles.

S23, extracting the image sequence of each person from the plurality of monitoring videos according to the corresponding image area.

After step S3C3, the method further includes:

S3, performing regression analysis on the gesture sequence to obtain a safety coefficient for each person.

S4, generating a corresponding safety alarm level according to the safety coefficients.

Based on the above embodiments, in an alternative embodiment of the present invention, step S4 includes steps S41 to S43.

S41, calculating, from the individual safety coefficients, the number of people in the target area whose safety coefficient is below a preset threshold, a first average value of the safety coefficients in the target area, and a second average value of the safety coefficients of the adjacent scene.

S42, arranging the number of people, the first average value and the second average value into a time-series feature in chronological order.

S43, predicting, through a prediction model and according to the time-series feature, the safety alarm level.

For further details, refer to the first embodiment. In the present embodiment, in order to reduce the amount of computation and increase the recognition speed, the parts concerning the body posture sequence and the facial expression sequence are omitted. In other embodiments, only one of the body posture sequence and the facial expression sequence may be omitted.

In a tenth embodiment, referring to fig. 11, an embodiment of the present invention provides an intelligent monitoring gesture recognition apparatus, which includes:

and the sequence module 0 is used for acquiring an image sequence. Wherein, the image sequence contains a target person.

And the model module 13 is used for constructing an object detection model based on the terminal lightweight neural network model.

And the detection module 14 is configured to extract a hand image in the image sequence through the object detection model, and generate the hand image sequence according to the time of the image sequence.

And the classification module 15 is configured to perform classification through an image classification model according to the hand image sequence to obtain a gesture sequence.

Optionally, the terminal lightweight neural network model is a MnasNet model, and the object detection model is an SSD model.

Optionally, the model module 13 is specifically configured to:

construct an SSD model with the MnasNet model as the backbone network, wherein the backbone network sequentially includes: 1 layer of 3x3 Conv, 1 layer of 3x3 SepConv, 2 layers of 3x3 MBConv, 3 layers of 5x5 MBConv, 4 layers of 3x3 MBConv, 2 layers of 3x3 MBConv, 3 layers of 5x5 MBConv, 1 layer of 3x3 MBConv, and 1 layer of Pooling or 1 layer of FC.

Optionally, the detection module 14 includes:

The convolution unit is used for inputting the image sequence frame by frame into the backbone network so that the backbone network convolves each image layer by layer.

The analysis unit is used for extracting, during the convolution, five intermediate feature maps with scales of 112 × 112 × 16, 56 × 56 × 24, 28 × 28 × 40, 14 × 14 × 112 and 7 × 7 × 100, and performing regression analysis on them to obtain the region of the hand image.

The extraction unit is used for extracting the hand image from the image according to that region.

Optionally, as in the first embodiment, the sequence module 0 includes a video module and an image module, which comprise:

The receiving unit is used for receiving a plurality of monitoring videos of the target area captured from different angles.

The skeleton unit is used for obtaining, from the plurality of monitoring videos, skeleton information of each person in the target area at the different angles through an OpenPose model.

The area unit is used for acquiring the image area of each person according to the skeleton information, the image area being the region of the image in which that person's skeleton area is largest among the different angles.

The image unit is used for extracting the image sequence of each person from the plurality of monitoring videos according to the corresponding image area.

The gesture recognition apparatus further includes:

The coefficient module 3 is used for performing regression analysis on the gesture sequence to obtain a safety coefficient for each person.

The grade module 4 is used for generating a corresponding safety alarm level according to the safety coefficients.

On the basis of the above embodiment, in an alternative embodiment of the present invention, the level module 4 includes:

The threshold unit is used for calculating, from the individual safety coefficients, the number of people in the target area whose safety coefficient is below a preset threshold, a first average value of the safety coefficients in the target area, and a second average value of the safety coefficients of the adjacent scene.

The time sequence unit is used for arranging the number of people, the first average value and the second average value into a time-series feature in chronological order.

The grade unit is used for predicting, through the prediction model and according to the time-series feature, the safety alarm level.

In an eleventh embodiment, the present invention provides an intelligently monitored gesture recognition device, which includes a processor, a memory, and a computer program stored in the memory. The computer program can be executed by the processor to implement the gesture recognition method according to the ninth embodiment.

In a twelfth embodiment, the present invention provides a computer-readable storage medium containing a stored computer program; when the computer program runs, the device in which the computer-readable storage medium is located is controlled to execute the intelligent monitoring gesture recognition method according to the ninth embodiment.

The above description covers only preferred embodiments of the present invention and is not intended to limit the present invention; various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
