Video-based face recognition method and device and storage medium
1. A face recognition method based on video is characterized by comprising the following steps:
importing a video data set, the video data set comprising a plurality of video data;
respectively converting each video data to obtain a plurality of video frames corresponding to the video data;
respectively extracting the features of each video frame to obtain a face feature vector and a weight corresponding to each video frame, and collecting all the face feature vectors to obtain a face feature vector set;
and evaluating and analyzing the face feature vector set and all weights to obtain an optimal feature vector, and taking the optimal feature vector as a face recognition result.
2. The video-based face recognition method according to claim 1, wherein the process of respectively performing feature extraction on each video frame to obtain the face feature vector and the weight corresponding to each video frame comprises:
and respectively extracting the characteristics of each video frame through a convolutional neural network (SSD) to obtain a face characteristic vector and a weight corresponding to each video frame.
3. The video-based face recognition method according to claim 1, wherein the process of performing evaluation analysis on the face feature vector set and all weights to obtain an optimal feature vector comprises:
respectively calculating the information quantity of each weight to obtain a plurality of information quantities corresponding to each video frame;
respectively calculating the total information quantity of the plurality of information quantities to obtain the total information quantity corresponding to each video frame;
calculating feature evaluation functions of the face feature vector set, any two face feature vectors in the face feature vector set and total information quantity corresponding to the two face feature vectors according to the weights respectively to obtain feature evaluation functions corresponding to the weights;
and screening the minimum value of all the feature evaluation functions to obtain the minimum feature evaluation function, and taking the face feature vector corresponding to the minimum feature evaluation function as the optimal feature vector.
4. The video-based face recognition method according to claim 3, wherein the process of calculating the information amount of each weight respectively to obtain a plurality of information amounts corresponding to each video frame comprises:
and calculating the information content of each weight respectively through a first formula to obtain a plurality of information contents corresponding to each video frame, wherein the first formula is as follows:
wherein h(x_j) is the information amount, x_j is the j-th neuron, and w_i is the i-th weight.
5. The video-based face recognition method according to claim 3, wherein the step of calculating the total information amount of each of the plurality of information amounts to obtain the total information amount corresponding to each of the video frames comprises:
respectively calculating the total information quantity of the plurality of information quantities through a second formula to obtain the total information quantity corresponding to each video frame, wherein the second formula is as follows:
wherein H(X) is the total information amount, and h(x_j) is the effective information amount.
6. The video-based face recognition method according to claim 5, wherein the step of calculating feature evaluation functions for the face feature vector set, any two face feature vectors in the face feature vector set, and the total information amount corresponding to the two face feature vectors according to the weights respectively to obtain the feature evaluation functions corresponding to the weights comprises:
respectively calculating feature evaluation functions of the face feature vector set, any two face feature vectors in the face feature vector set and total information amount corresponding to the two face feature vectors according to a third formula and each weight to obtain the feature evaluation function corresponding to each weight, wherein the third formula is as follows:
wherein D_error(a_i, a_j, â) = |D(â, a_i) − D(â, a_j)| if H(X_i) < H(X_j),
wherein â = W^T A,
wherein F(â) is the feature evaluation function, k is the number of face feature vectors in the face feature vector set, a_i and a_j are any two face feature vectors in the face feature vector set, â is the face feature vector to be learned, X_i is the total information amount corresponding to the face feature vector a_i, X_j is the total information amount corresponding to the face feature vector a_j, A is the face feature vector set, and W^T is the weight.
7. A video-based face recognition apparatus, comprising:
the data set importing module is used for importing a video data set, and the video data set comprises a plurality of video data;
the data conversion module is used for respectively converting the video data to obtain a plurality of video frames corresponding to the video data;
the feature extraction module is used for respectively extracting features of the video frames to obtain face feature vectors and weights corresponding to the video frames, and collecting all the face feature vectors to obtain a face feature vector set;
and the recognition result obtaining module is used for evaluating and analyzing the face feature vector set and all weights to obtain an optimal feature vector, and taking the optimal feature vector as a face recognition result.
8. The video-based face recognition device of claim 7, wherein the feature extraction module is specifically configured to:
and respectively extracting the characteristics of each video frame through a convolutional neural network (SSD) to obtain a face characteristic vector and a weight corresponding to each video frame.
9. A video-based face recognition apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that when the computer program is executed by the processor, the video-based face recognition method according to any one of claims 1 to 6 is implemented.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a video-based face recognition method according to any one of claims 1 to 6.
Background
Image-based face recognition technology has made great progress, and current algorithms achieve nearly 100% accuracy on the LFW data set, but video-based face recognition remains unsatisfactory. Since some application scenarios cannot constrain the recognition target, directly extracting a single frame of a face video for recognition does not yield good results, so video face recognition has high application value. Compared with traditional feature extraction methods, face feature extraction based on convolutional neural networks obtains more discriminative feature information, and the key to video face recognition is how to represent a group of features rather than a single face image.
Video face recognition can be regarded as a feature fusion process; the most common feature fusion strategies are max pooling, average pooling, and score pooling. The first two fuse feature values, while the last fuses feature comparison results. Although all three methods are easy to implement, they cannot handle problems such as face pose and illumination variation. The embedded media pooling method for face verification and clustering adds media-number information to average pooling, but it is only suitable for the IJB-A data set and lacks generalization ability. Work on unconstrained face recognition using set-to-set distance metrics over deep learning features proposes a K-nearest-neighbor average pooling method based on average pooling. Because only the score of the most similar target is considered during feature comparison, this method can shorten the intra-class feature distance, but it also shortens the inter-class feature distance and is easily disturbed by noise samples; it therefore fails to obtain satisfactory results on the YTF and IQIYI data sets, where a single video contains a large number of face image frames. Meanwhile, most of the prior art requires reference evaluation and additional training of an evaluation model.
Disclosure of Invention
The invention aims to solve the technical problem of the prior art and provides a method, a device and a storage medium for face recognition based on video.
The technical scheme for solving the technical problems is as follows: a face recognition method based on video comprises the following steps:
importing a video data set, the video data set comprising a plurality of video data;
respectively converting each video data to obtain a plurality of video frames corresponding to the video data;
respectively extracting the features of each video frame to obtain a face feature vector and a weight corresponding to each video frame, and collecting all the face feature vectors to obtain a face feature vector set;
and evaluating and analyzing the face feature vector set and all weights to obtain an optimal feature vector, and taking the optimal feature vector as a face recognition result.
Another technical solution of the present invention for solving the above technical problems is as follows: a video-based face recognition apparatus, comprising:
the data set importing module is used for importing a video data set, and the video data set comprises a plurality of video data;
the data conversion module is used for respectively converting the video data to obtain a plurality of video frames corresponding to the video data;
the feature extraction module is used for respectively extracting features of the video frames to obtain face feature vectors and weights corresponding to the video frames, and collecting all the face feature vectors to obtain a face feature vector set;
and the recognition result obtaining module is used for evaluating and analyzing the face feature vector set and all weights to obtain an optimal feature vector, and taking the optimal feature vector as a face recognition result.
Another technical solution of the present invention for solving the above technical problems is as follows: a video-based face recognition apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the computer program, when executed by the processor, implementing a video-based face recognition method as described above.
Another technical solution of the present invention for solving the above technical problems is as follows: a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a video-based face recognition method as set out above.
The invention has the beneficial effects that: the method has the advantages that a plurality of video frames are obtained through conversion of video data, the features of the video frames are extracted respectively to obtain the face feature vector set and the weights, the face feature vector set and all the weights are evaluated and analyzed to obtain a face recognition result, interference of noise samples is avoided, reference evaluation and an additional training evaluation model are not needed, processing steps are simplified, dependency on data quantity is reduced, correlation between the recognition result and feature expression capacity is achieved, and accuracy of face recognition in the video is improved.
Drawings
Fig. 1 is a schematic flow chart of a video-based face recognition method according to an embodiment of the present invention;
fig. 2 is a block diagram of a video-based face recognition apparatus according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a schematic flow chart of a video-based face recognition method according to an embodiment of the present invention.
As shown in fig. 1, a face recognition method based on video includes the following steps:
importing a video data set, the video data set comprising a plurality of video data;
respectively converting each video data to obtain a plurality of video frames corresponding to the video data;
respectively extracting the features of each video frame to obtain a face feature vector and a weight corresponding to each video frame, and collecting all the face feature vectors to obtain a face feature vector set;
and evaluating and analyzing the face feature vector set and all weights to obtain an optimal feature vector, and taking the optimal feature vector as a face recognition result.
It will be appreciated that the video data set may comprise a large number of videos; the faces in them may undergo angle and illumination changes, may appear rotated by 90 degrees, and the videos may be blurred to varying degrees.
In the embodiment, a plurality of video frames are obtained by converting each video data, the features of each video frame are respectively extracted to obtain the face feature vector set and the weights, the face feature vector set and all the weights are evaluated and analyzed to obtain the face recognition result, the interference of noise samples is avoided, a reference evaluation model and an additional training evaluation model are not needed, the processing steps are simplified, the dependence on the data quantity is reduced, the correlation between the recognition result and the feature expression capability is realized, and the accuracy of face recognition in the video is improved.
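The four steps described above can be sketched as a minimal pipeline. In this sketch the frame decoding and the SSD feature extractor are replaced by stand-ins (the patent uses a convolutional neural network, SSD, for extraction), and the final evaluation step is simplified to selecting the feature whose weights carry the most information; the vector and weight dimensions are illustrative, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_feature(frame):
    """Stand-in for the SSD extractor: returns a (feature vector, weight vector)
    pair per frame, as in step 3 of the claimed method."""
    return rng.standard_normal(128), rng.random(16)

def recognize(videos_as_frames):
    """Run steps 3-4: per-frame extraction, then evaluation analysis."""
    features, weights = [], []
    for frames in videos_as_frames:        # step 2 output: frames per video
        for frame in frames:               # step 3: per-frame extraction
            a, w = extract_feature(frame)
            features.append(a)
            weights.append(w)
    # step 4 (simplified stand-in): pick the feature whose weights carry
    # the most information; the patent's actual criterion is the minimum
    # of the feature evaluation function F
    info = [float(np.abs(w).sum()) for w in weights]
    return features[int(np.argmax(info))]

best = recognize([[None] * 3, [None] * 2])  # two "videos" of dummy frames
print(best.shape)                           # (128,)
```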
Optionally, as an embodiment of the present invention, the process of respectively performing feature extraction on each of the video frames to obtain a face feature vector and a weight corresponding to each of the video frames includes:
and respectively extracting the characteristics of each video frame through a convolutional neural network (SSD) to obtain a face characteristic vector and a weight corresponding to each video frame.
It should be understood that the collected video frames are input into the SSD one by one, all features corresponding to the faces appearing in the video (i.e. the face feature vectors) are output, and the set of all possible feature vectors of a person in a given video is denoted A.
It should be understood that face target detection is performed by the convolutional neural network SSD (Single Shot Detector); because the SSD algorithm has no candidate-box generation stage, its detection speed is high.
In the embodiment, the face feature vectors and the weights corresponding to the video frames are obtained by respectively extracting the features of the video frames through the convolutional neural network SSD, so that the speed of feature extraction is increased, and the accuracy of face recognition in the video is improved.
Optionally, as an embodiment of the present invention, the process of performing evaluation analysis on the face feature vector set and all weights to obtain an optimal feature vector includes:
respectively calculating the information quantity of each weight to obtain a plurality of information quantities corresponding to each video frame;
respectively calculating the total information quantity of the plurality of information quantities to obtain the total information quantity corresponding to each video frame;
calculating feature evaluation functions of the face feature vector set, any two face feature vectors in the face feature vector set and total information quantity corresponding to the two face feature vectors according to the weights respectively to obtain feature evaluation functions corresponding to the weights;
and screening the minimum value of all the feature evaluation functions to obtain the minimum feature evaluation function, and taking the face feature vector corresponding to the minimum feature evaluation function as the optimal feature vector.
It should be understood that finding the optimal feature â of the feature set A (i.e. the face feature vector set) through the feature evaluation function first requires judging whether a feature vector can effectively express the input image; evaluating the "quality" of the feature vectors sets a learning target for the algorithm.
Specifically, the steps for evaluating the amount of information available are as follows:
firstly, defining the effective information quantity of backward propagation of a neuron; and secondly, defining the information quantity transmitted in the whole characteristic mapping process.
Through the above research on the effective-information evaluation method for features, the face feature vector in A with the largest effective information amount can be considered the most effective representation of the face.
In the embodiment, the optimal feature vector is obtained by evaluating and analyzing the face feature vector set and all weights, so that the interference of noise samples is avoided, reference evaluation and an additional training evaluation model are not needed, the correlation between the recognition result and the feature expression capability is realized, and the accuracy of face recognition in a video is improved.
Optionally, as an embodiment of the present invention, the step of calculating information quantities of the weights respectively to obtain a plurality of information quantities corresponding to the video frames includes:
and calculating the information content of each weight respectively through a first formula to obtain a plurality of information contents corresponding to each video frame, wherein the first formula is as follows:
wherein h(x_j) is the information amount, x_j is the j-th neuron, and w_i is the i-th weight.
It should be understood that x_j may be the j-th neuron of the second-to-last layer in the convolutional neural network.
It should be understood that X is the set of neurons after the convolutional feature map is flattened, X = [x_1, x_2, …, x_n], m denotes the number of neurons in the next layer connected to neuron x_j, and w denotes the connection weight between neurons (i.e. the weight).
In the embodiment, the information amount of each weight is calculated respectively through the first type to obtain a plurality of information amounts corresponding to each video frame, so that a basis is provided for subsequent processing, and the accuracy of face recognition in a video is improved.
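The first formula itself is not reproduced in this text. A common form consistent with the surrounding description (a neuron's response magnitude scaled by its summed outgoing connection weights) would be h(x_j) = |x_j| · Σ_{i=1}^{m} |w_i|; the sketch below assumes that form, with illustrative values for the responses and weights.

```python
import numpy as np

def info_amount(x_j, w):
    """Information amount of neuron x_j under its m outgoing weights w
    (assumed form: |x_j| times the summed absolute connection weights)."""
    return float(abs(x_j) * np.abs(w).sum())

x = np.array([0.9, 0.1, -0.4])      # responses of three flattened neurons
W = np.array([[0.5, -0.2],          # outgoing weights, m = 2 per neuron
              [0.1,  0.3],
              [-0.6,  0.4]])
h = [info_amount(x[j], W[j]) for j in range(len(x))]
print([round(v, 2) for v in h])     # [0.63, 0.04, 0.4]
```

A strongly responding neuron with large connection weights (the first one here) passes on the most information, which matches the interpretation given below.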
Optionally, as an embodiment of the present invention, the calculating of the total information amount for each of the plurality of information amounts to obtain the total information amount corresponding to each of the video frames includes:
respectively calculating the total information quantity of the plurality of information quantities through a second formula to obtain the total information quantity corresponding to each video frame, wherein the second formula is as follows:
wherein H(X) is the total information amount, and h(x_j) is the effective information amount.
It should be understood that a larger h(x_j) means the input image responds more strongly to the convolution operation; the value passed on to the next layer of neurons is larger, and the amount of information ultimately contributed to the feature is larger. The convolution kernels in the network are trained on clear face images; if a face image suffers angle or illumination interference, its response to the convolution kernels is weaker, |x_j| is smaller, and therefore H(X) is smaller.
In the embodiment, the total information amount corresponding to each video frame is obtained by calculating the total information amount of the plurality of information amounts through the second formula, reference evaluation is not needed, an additional training evaluation model is not needed, meanwhile, correlation between the recognition result and the feature expression capability is realized, and the accuracy of face recognition in the video is improved.
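The second formula is likewise not reproduced here; the natural reading of the description is that H(X) aggregates the per-neuron amounts, e.g. H(X) = Σ_j h(x_j). Assuming that form, the sketch below illustrates the interpretation above: a weakly responding (degraded) input yields a smaller total information amount than a clear one.

```python
import numpy as np

def total_info(x, W):
    """Total information H(X) as the sum of per-neuron information amounts
    (assumed form; the patent's second formula is elided)."""
    return sum(float(abs(x[j]) * np.abs(W[j]).sum()) for j in range(len(x)))

W = np.array([[0.5, 0.3],
              [0.2, 0.4]])
clear    = np.array([1.2, 0.8])   # strong response to the convolution kernels
degraded = np.array([0.3, 0.1])   # weak response (angle/illumination interference)
print(total_info(clear, W) > total_info(degraded, W))  # True
```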
Optionally, as an embodiment of the present invention, the calculating a feature evaluation function on the face feature vector set, any two face feature vectors in the face feature vector set, and a total information amount corresponding to the two face feature vectors according to each of the weights respectively to obtain the feature evaluation function corresponding to each of the weights includes:
respectively calculating feature evaluation functions of the face feature vector set, any two face feature vectors in the face feature vector set and total information amount corresponding to the two face feature vectors according to a third formula and each weight to obtain the feature evaluation function corresponding to each weight, wherein the third formula is as follows:
wherein D_error(a_i, a_j, â) = |D(â, a_i) − D(â, a_j)| if H(X_i) < H(X_j),
wherein â = W^T A,
wherein F(â) is the feature evaluation function, k is the number of face feature vectors in the face feature vector set, a_i and a_j are any two face feature vectors in the face feature vector set, â is the face feature vector to be learned, X_i is the total information amount corresponding to the face feature vector a_i, X_j is the total information amount corresponding to the face feature vector a_j, A is the face feature vector set, and W^T is the weight.
It should be understood that the any two face feature vectors used in the subsequent processing are taken from the face feature vector set.
It should be understood that the feature evaluation function is obtained by calculating the distance error between the face feature vector â to be learned and each face feature vector.
Specifically, the optimal feature a* is the one whose information amount, i.e. its H(X), is the largest in the face feature vector set A. Therefore, for any two face feature vectors a_1, a_2 in A:
D(a*, a_1) > D(a*, a_2) if H(X_1) < H(X_2),
The feature vector evaluation algorithm calculates the effective information amount of a feature by studying the feature-map neurons of the last convolutional layer and analyzing the neurons and connection weights in the feature extraction model; combining the effective information amount with the distance relations between different features, it proposes a relational distance-error evaluation function, thereby realizing evaluation of feature vectors in the feature space.
In the above embodiment, the feature evaluation functions corresponding to the weights are obtained by calculating the feature evaluation functions of the third formula and the weights for any two face feature vectors in the face feature vector set and the total information amount corresponding to the two face feature vectors, so that the feature vectors in the feature space are evaluated without reference evaluation and additional training of an evaluation model, and meanwhile, the correlation between the recognition result and the feature expression capability is realized, and the accuracy of face recognition in the video is improved.
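The evaluation and screening step above can be sketched as follows. The third formula's full aggregation is elided in this text, so the sketch assumes F(â) averages the pairwise distance errors D_error over all ordered pairs satisfying H(X_i) < H(X_j), uses the Euclidean distance for D, and, for simplicity, restricts the candidate â to the vectors already in A rather than learning â = W^T A; all data values are illustrative.

```python
import numpy as np

def D(a, b):
    """Distance between two feature vectors (Euclidean, as an assumption)."""
    return float(np.linalg.norm(a - b))

def F(a_hat, A, H):
    """Assumed aggregate of the pairwise distance errors D_error:
    |D(a_hat, a_i) - D(a_hat, a_j)| counted when H(X_i) < H(X_j)."""
    k = len(A)
    total = 0.0
    for i in range(k):
        for j in range(k):
            if H[i] < H[j]:                 # condition from the third formula
                total += abs(D(a_hat, A[i]) - D(a_hat, A[j]))
    return total / (k * k)

rng = np.random.default_rng(1)
A = [rng.standard_normal(4) for _ in range(5)]  # face feature vector set
H = [0.2, 1.5, 0.9, 0.4, 1.1]                   # total information per vector
scores = [F(a, A, H) for a in A]                # one F value per candidate
best = A[int(np.argmin(scores))]                # minimum F -> optimal vector
print(len(best))                                # 4
```

Screening the minimum of all evaluation values, as in the claims, then yields the optimal feature vector used as the recognition result.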
Fig. 2 is a block diagram of a video-based face recognition apparatus according to an embodiment of the present invention.
Optionally, as another embodiment of the present invention, as shown in fig. 2, a video-based face recognition apparatus includes:
the data set importing module is used for importing a video data set, and the video data set comprises a plurality of video data;
the data conversion module is used for respectively converting the video data to obtain a plurality of video frames corresponding to the video data;
the feature extraction module is used for respectively extracting features of the video frames to obtain face feature vectors and weights corresponding to the video frames, and collecting all the face feature vectors to obtain a face feature vector set;
and the recognition result obtaining module is used for evaluating and analyzing the face feature vector set and all weights to obtain an optimal feature vector, and taking the optimal feature vector as a face recognition result.
Optionally, as an embodiment of the present invention, the feature extraction module is specifically configured to:
and respectively extracting the characteristics of each video frame through a convolutional neural network (SSD) to obtain a face characteristic vector and a weight corresponding to each video frame.
Optionally, another embodiment of the present invention provides a video-based face recognition apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the computer program is executed by the processor, the video-based face recognition method as described above is implemented. The device may be a computer or the like.
Alternatively, another embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the video-based face recognition method as described above.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.