Multi-modal child emotion recognition fusion model based on video image facial expressions and voice
1. A multi-modal child emotion recognition fusion model based on video image facial expressions and voice, characterized by comprising the following steps:
step (A), enhancing facial textures of the facial expression training data with Gabor filtering;
step (B), training a densely connected convolutional neural network on the texture-enhanced facial expression training data to obtain an image emotion recognition model;
step (C), performing feature fusion on the MFCC features and the GFCC features of the speech training data set;
step (D), inputting the fused features into CGRU, a model formed by fusing a convolutional neural network (CNN) with a gated recurrent unit (GRU) network, and into an SVM for training, and combining the CGRU and the SVM into an ensemble to obtain a speech emotion recognition model;
step (E), performing decision fusion on the image emotion recognition model and the speech emotion recognition model to obtain a bimodal child emotion recognition model.
2. The multi-modal child emotion recognition fusion model based on video image facial expressions and voice of claim 1, wherein step (A), enhancing facial textures of the facial expression training data with Gabor filtering, comprises the following steps:
(A1) constructing a Gabor filter bank with six wavelength values of 2, 3, 4, 5, 6 and 7, the filter at each wavelength being designed with 4 orientations of 0, π/4, 2π/4 and 3π/4;
(A2) convolving the facial expression training data with the constructed Gabor filters to obtain texture-enhanced Gabor images.
3. The multi-modal child emotion recognition fusion model based on video image facial expressions and voice of claim 1, wherein step (B), training a densely connected convolutional neural network on the texture-enhanced facial expression training data to obtain an image emotion recognition model, comprises the following steps:
(B1) acquiring the texture-enhanced training samples, the training samples comprising 5582 facial expression images;
(B2) training on the training samples with the densely connected convolutional neural network to obtain the image emotion recognition model, the network comprising 4 dense blocks containing 6, 12, 24 and 16 bottleneck layers, respectively.
4. The multi-modal child emotion recognition fusion model based on video image facial expressions and voice of claim 1, wherein step (C), performing feature fusion on the MFCC features and the GFCC features of the speech training data set, comprises the following steps:
(C1) preprocessing the speech emotion data, the preprocessing comprising normalization, pre-emphasis, framing and windowing;
(C2) extracting MFCC features and GFCC features of the speech emotion data;
(C3) fusing the MFCC features with the GFCC features.
5. The multi-modal child emotion recognition fusion model based on video image facial expressions and voice of claim 1, wherein step (D), inputting the fused features into CGRU, a model formed by fusing a convolutional neural network (CNN) with a gated recurrent unit (GRU) network, and into an SVM for training, and combining the CGRU and the SVM into an ensemble to obtain a speech emotion recognition model, comprises the following steps:
(D1) fusing a convolutional neural network (CNN), which captures frequency-domain features well, with a gated recurrent unit (GRU) network, which extracts temporal features well, to form the CGRU;
(D2) acquiring the fused feature set of the MFCC and GFCC features;
(D3) training the CGRU and the SVM separately on the training samples to obtain a CGRU model and an SVM model;
(D4) combining the CGRU model and the SVM model into an ensemble to obtain the speech emotion recognition model.
6. The multi-modal child emotion recognition fusion model based on video image facial expressions and voice of claim 1, wherein step (E) performs decision fusion on the image emotion recognition model and the speech emotion recognition model to obtain the bimodal child emotion recognition model.
Background
Emotion is the outward expression of a person's inner world, a psychological activity mediated by individual will and needs, and emotion regulation is therefore closely tied to the mental health of children and adolescents. However, a child's ability to defuse emotions and respond appropriately to different emotional states falls far short of an adult's, and it is difficult for a guardian to notice a child's emotional condition, so the child often cannot be helped to relieve and disperse emotions in time. This can lead to mood disorders, anxiety and other mental health problems in children and adolescents.
Current mainstream emotion recognition solutions adopt a human-computer interaction approach: effective features are screened out by analyzing children's speech or facial expressions under different emotions, and recognition models are trained on those features. However, these methods do not account for children's sharper, higher-frequency voices, and they ignore the fact that children's facial textures are subtler than adults'. An emotion recognition method for children that overcomes these problems is therefore urgently needed.
Disclosure of Invention
The invention aims to provide a multi-modal child emotion recognition fusion model based on video image facial expressions and voice, so as to solve the problems described in the background above.
To achieve this aim, the invention adopts the following technical scheme: a multi-modal child emotion recognition fusion model based on video image facial expressions and voice, comprising the following steps:
step (A), enhancing facial textures of the facial expression training data with Gabor filtering;
step (B), training a densely connected convolutional neural network on the texture-enhanced facial expression training data to obtain an image emotion recognition model;
step (C), performing feature fusion on the MFCC features and the GFCC features of the speech training data set;
step (D), inputting the fused features into CGRU, a model formed by fusing a convolutional neural network (CNN) with a gated recurrent unit (GRU) network, and into an SVM for training, and combining the CGRU and the SVM into an ensemble to obtain a speech emotion recognition model;
step (E), performing decision fusion on the image emotion recognition model and the speech emotion recognition model to obtain a bimodal child emotion recognition model.
In the aforementioned multi-modal child emotion recognition fusion model based on video image facial expressions and voice, step (A), enhancing facial textures of the facial expression training data with Gabor filtering, comprises the following steps:
(A1) constructing a Gabor filter bank with six wavelength values of 2, 3, 4, 5, 6 and 7, the filter at each wavelength being designed with 4 orientations of 0, π/4, 2π/4 and 3π/4;
(A2) convolving the facial expression training data with the constructed Gabor filters to obtain texture-enhanced Gabor images.
the multi-modal child emotion recognition fusion model based on the video image facial expressions and the voice is characterized in that: step (B), inputting the facial expression training data after texture enhancement into a dense connection convolution neural network for training to obtain an image emotion recognition model, and the method comprises the following steps:
(B1) acquiring the texture-enhanced training samples, the training samples comprising 5582 facial expression images;
(B2) training on the training samples with the densely connected convolutional neural network to obtain the image emotion recognition model, the network comprising 4 dense blocks containing 6, 12, 24 and 16 bottleneck layers, respectively.
In the aforementioned multi-modal child emotion recognition fusion model based on video image facial expressions and voice, step (C), performing feature fusion on the MFCC features and the GFCC features of the speech training data set, comprises the following steps:
(C1) preprocessing the speech emotion data, the preprocessing comprising normalization, pre-emphasis, framing and windowing;
(C2) extracting MFCC features and GFCC features of the speech emotion data;
(C3) fusing the MFCC features with the GFCC features.
In the aforementioned multi-modal child emotion recognition fusion model based on video image facial expressions and voice, step (C3) fuses the MFCC features with the GFCC features according to formula (1):

M_mix = [M_MFCC, M_GFCC]    (1)

where M_MFCC denotes the extracted MFCC features, M_GFCC denotes the extracted GFCC features, and M_mix denotes the fused features.
In the aforementioned multi-modal child emotion recognition fusion model based on video image facial expressions and voice, step (D), inputting the fused features into CGRU, a model formed by fusing a convolutional neural network (CNN) with a gated recurrent unit (GRU) network, and into an SVM for training, and combining the CGRU and the SVM into an ensemble to obtain a speech emotion recognition model, comprises the following steps:
(D1) fusing a convolutional neural network (CNN), which captures frequency-domain features well, with a gated recurrent unit (GRU) network, which extracts temporal features well, to form the CGRU;
(D2) acquiring the fused feature set of the MFCC and GFCC features;
(D3) training the CGRU and the SVM separately on the training samples to obtain a CGRU model and an SVM model;
(D4) combining the CGRU model and the SVM model into an ensemble to obtain the speech emotion recognition model.
In the aforementioned multi-modal child emotion recognition fusion model based on video image facial expressions and voice, step (E) performs decision fusion on the image emotion recognition model and the speech emotion recognition model to obtain a bimodal child emotion recognition model, the decision fusion formula being formula (2):

E = α·P_m + β·P_v    (2)

where E denotes the recognized emotion result, P_m denotes the classification result of the video image channel, P_v denotes the classification result of the voice channel, and α and β denote the weights of the two channels, with α = 0.62 and β = 0.38.
The invention has the following beneficial effects. The invention provides a multi-modal child emotion recognition fusion model based on video image facial expressions and voice. In the facial expression modality, the images are processed with Gabor filtering to enhance the fine facial texture features of children, and the texture-enhanced facial expression data are trained with DenseNet, which can extract subtler implicit features, to construct a facial expression recognition model. In the speech modality, GFCC features, which are more robust to high-frequency speech signals, and MFCC features, which resist noise well in low-frequency speech signals, form a fused feature; the fused features are input into CGRU, a model built by fusing a convolutional neural network (CNN) with a gated recurrent unit (GRU) network, and into an SVM for training, and the CGRU and the SVM are then combined into an ensemble to obtain a speech emotion recognition model. Finally, the facial expression modality and the speech modality are fused at the decision layer according to a weighting criterion to obtain the multi-modal child emotion recognition model. The technical scheme of the invention addresses children's fine facial textures and high voice frequencies, effectively improves the accuracy of child emotion recognition, and has strong promotion value.
Drawings
FIG. 1 is a schematic block diagram of the process flow of the present invention.
FIG. 2 is a block diagram of a CGRU constructed in accordance with the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. The described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides the following technical solution. The method for constructing the multi-modal child emotion recognition fusion model based on video image facial expressions and voice comprises the following steps.
Step (A), enhancing facial textures of the facial expression training data with Gabor filtering, comprises the following steps:
(A1) constructing a Gabor filter bank with six wavelength values of 2, 3, 4, 5, 6 and 7, the filter at each wavelength being designed with 4 orientations of 0, π/4, 2π/4 and 3π/4;
(A2) convolving the facial expression training data with the constructed Gabor filters to obtain texture-enhanced Gabor images.
Step (B), training the densely connected convolutional neural network on the texture-enhanced facial expression training data to obtain an image emotion recognition model, comprises the following steps:
(B1) acquiring the texture-enhanced training samples, the training samples comprising 5582 facial expression images;
(B2) constructing a bottleneck layer consisting of batch normalization (BN), an activation function, a 1×1 convolutional layer, BN, an activation function, and a 3×3 convolutional layer;
(B3) constructing a transition layer consisting of a 1×1 convolutional layer and a 2×2 average pooling layer;
(B4) constructing the densely connected convolutional neural network, which consists of convolutional layers, dense blocks, transition layers, pooling layers, a global average pooling layer and an output layer, ten layers in total, the first two being a 7×7 convolutional layer and a 3×3 max pooling layer. Dense blocks and transition layers are then stacked alternately: the whole DenseNet consists of N dense blocks, each comprising M bottleneck layers that perform the basic construction operation and the feature-map concatenation operation. Taking N = 4 and M = [6, 12, 24, 16] gives 116 convolution operations in total, with 256, 512, 1024 and 1024 output channels for the respective dense blocks. Each dense block is followed by a transition layer that compresses the number of feature channels to alleviate the information redundancy caused by an excessive channel count; after compression the feature channels are 128, 256 and 512 in sequence, and the last dense block is not compressed. Finally, a global pooling layer is followed by a classification output layer that outputs the emotion class, completing the construction of the facial expression recognition model.
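For concreteness, a minimal PyTorch sketch of the building blocks just described follows; the growth rate of 32 and the bias-free convolutions are common DenseNet conventions assumed here rather than values stated in the description:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """(B2): BN -> ReLU -> 1x1 conv -> BN -> ReLU -> 3x3 conv."""
    def __init__(self, in_ch: int, growth: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, 4 * growth, kernel_size=1, bias=False),
            nn.BatchNorm2d(4 * growth), nn.ReLU(inplace=True),
            nn.Conv2d(4 * growth, growth, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # Dense connectivity: concatenate the input with the new feature maps.
        return torch.cat([x, self.body(x)], dim=1)

class Transition(nn.Module):
    """(B3): 1x1 conv to compress channels, then 2x2 average pooling."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.AvgPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.body(x)

def dense_block(in_ch: int, n_layers: int, growth: int = 32) -> nn.Sequential:
    """(B4): a dense block of M bottleneck layers (M = 6, 12, 24 or 16)."""
    layers, ch = [], in_ch
    for _ in range(n_layers):
        layers.append(Bottleneck(ch, growth))
        ch += growth
    return nn.Sequential(*layers)
```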
Step (C), performing feature fusion on the MFCC features and the GFCC features of the speech training data set, comprises the following steps:
(C1) preprocessing the speech emotion data, the preprocessing comprising normalization, pre-emphasis, framing and windowing;
(C2) extracting MFCC features and GFCC features of the speech emotion data;
(C3) fusing the MFCC features with the GFCC features according to formula (1):

M_mix = [M_MFCC, M_GFCC]    (1)

where M_MFCC denotes the extracted MFCC features, M_GFCC denotes the extracted GFCC features, and M_mix denotes the fused features.
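A minimal Python sketch of steps (C1) to (C3) follows; the 0.97 pre-emphasis coefficient, the Hamming window, the frame sizes and the 13 coefficients per feature are common defaults assumed here, MFCC extraction uses librosa, and extract_gfcc is a hypothetical stand-in for a gammatone-cepstral front end that the description does not specify:

```python
import numpy as np
import librosa

def preprocess(signal: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """(C1): normalization, pre-emphasis, framing and windowing."""
    signal = signal / (np.max(np.abs(signal)) + 1e-8)               # normalization
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
    n_frames = 1 + max(len(signal) - frame_len, 0) // hop           # framing
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)                           # windowing

def fused_features(path: str, n_coeff: int = 13) -> np.ndarray:
    """(C2)-(C3): extract MFCC and GFCC and concatenate them as in eq. (1)."""
    y, sr = librosa.load(path, sr=None)
    m_mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_coeff)   # (n_coeff, T)
    m_gfcc = extract_gfcc(y, sr, n_coeff)   # hypothetical GFCC helper, same shape
    return np.concatenate([m_mfcc, m_gfcc], axis=0)             # M_mix, (2*n_coeff, T)
```

With 13 coefficients per feature, the fused feature has 26 rows per frame, which is the input width assumed by the CGRU sketch further below.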
Step (D), inputting the fused features into CGRU, a model formed by fusing a convolutional neural network (CNN) with a gated recurrent unit (GRU) network, and into an SVM for training, and combining the CGRU and the SVM into an ensemble to obtain a speech emotion recognition model, comprises the following steps (a sketch of the resulting model follows this list):
(D1) first constructing 3 modules each consisting of a 3×3 convolutional layer, a BN layer and an activation function, then connecting a 1×1 convolutional layer to adjust the number of channels, and finally connecting a max pooling layer to complete the convolution module;
(D2) connecting three GRU modules to form the GRU network;
(D3) finally converting the extracted multi-dimensional features to one dimension with a Flatten layer and feeding them into a fully connected layer, completing the construction of the CGRU model;
(D4) training the CGRU and the SVM separately on the training samples to obtain a CGRU model and an SVM model;
(D5) combining the CGRU model and the SVM model into an ensemble to obtain the speech emotion recognition model.
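A minimal PyTorch sketch of the CGRU of (D1) to (D3) follows; the channel widths, GRU hidden size, the 7-class output and the use of the last GRU time step in place of an explicit Flatten layer are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CGRU(nn.Module):
    def __init__(self, n_classes: int = 7, feat_dim: int = 26):
        super().__init__()
        def conv_block(cin, cout):
            # (D1): 3x3 conv + BN + activation
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=3, padding=1),
                nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.cnn = nn.Sequential(
            conv_block(1, 32), conv_block(32, 64), conv_block(64, 128),
            nn.Conv2d(128, 64, kernel_size=1),       # 1x1 conv adjusts channels
            nn.MaxPool2d(kernel_size=2, stride=2))   # max pooling closes (D1)
        # (D2): three stacked GRU layers
        self.gru = nn.GRU(input_size=64 * (feat_dim // 2), hidden_size=128,
                          num_layers=3, batch_first=True)
        self.fc = nn.Linear(128, n_classes)          # (D3): fully connected output

    def forward(self, x):                    # x: (batch, 1, time, feat_dim)
        z = self.cnn(x)                      # (batch, 64, time/2, feat_dim/2)
        z = z.permute(0, 2, 1, 3).flatten(2) # (batch, time/2, 64 * feat_dim/2)
        out, _ = self.gru(z)
        return self.fc(out[:, -1])           # last time step -> class logits
```

Under the same caveat, the ensemble of (D4) and (D5) could be realized as simple soft voting; the equal-weight averaging rule and the names train_feats and train_labels are assumptions, since the patent only states that the two models form an ensemble:

```python
import numpy as np
from sklearn.svm import SVC

# train_feats (flattened fusion features) and train_labels are assumed to exist.
svm = SVC(probability=True).fit(train_feats, train_labels)

def ensemble_predict(cgru_probs: np.ndarray, feats: np.ndarray) -> np.ndarray:
    """Average the two models' class probabilities and take the argmax."""
    return np.argmax(0.5 * cgru_probs + 0.5 * svm.predict_proba(feats), axis=1)
```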
Step (E), performing decision fusion on the image emotion recognition model and the speech emotion recognition model, yields the bimodal child emotion recognition model; the decision fusion formula is formula (2):

E = α·P_m + β·P_v    (2)

where E denotes the recognized emotion result, P_m denotes the classification result of the video image channel, P_v denotes the classification result of the voice channel, and α and β denote the weights of the two channels, with α = 0.62 and β = 0.38.
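Formula (2) amounts to a weighted soft-voting rule; a few lines of numpy, assuming both channels output per-class probability vectors, make this concrete:

```python
import numpy as np

ALPHA, BETA = 0.62, 0.38   # channel weights from formula (2)

def fuse_decision(p_image: np.ndarray, p_voice: np.ndarray) -> int:
    """Combine the two channels' class probabilities per formula (2)
    and return the index of the recognized emotion E."""
    return int(np.argmax(ALPHA * p_image + BETA * p_voice))
```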
The performance comparison of the different models and methods is shown in Table 1. In the speech modality, the fused GFCC and MFCC features improve the recognition rate to a certain extent, and the CGRU + SVM recognition method improves accuracy compared with an LSTM. In the facial expression modality, the recognition accuracy of the proposed Gabor + DenseNet model reaches 79.6%, which is superior to the other recognition models. The final accuracy of the fused modality reaches 83.4%, superior to the single-modality recognition models. In child emotion recognition, the multi-modal fusion strategy therefore improves recognition accuracy over any single modality.
TABLE 1 comparison of recognition results by different methods
While the present invention has been described with reference to the above embodiments, its specific implementation is not limited to them. Any person skilled in the art can readily conceive of changes and substitutions within the technical scope disclosed in the present application, and variations such as changing the data set, the number of emotion classes or the weight parameters are all covered by the protection scope of the present application. The protection scope of the present application shall therefore be subject to the protection scope of the claims.