Speech emotion recognition method
1. A speech emotion recognition method, characterized by comprising the following steps:
acquiring audio files containing recorded conversations to construct a speech emotion database, and classifying and storing the audio files according to preset emotion categories;
slicing the audio files based on preset segmentation parameters to form speech segments;
extracting features of the speech segments based on preset feature descriptions;
concatenating and fusing the extracted features based on preset functions and standardizing them to obtain fused features;
training a preset convolutional neural network model with the fused features to predict emotion;
and performing emotion recognition on a target speech file, or on speech segments of the target speech file, using the trained preset convolutional neural network model.
2. The speech emotion recognition method of claim 1, wherein the preset emotion categories comprise four categories: positive, negative, exciting, and neutral.
3. The speech emotion recognition method of claim 1, wherein the audio files are mono wav files with a sampling rate of 8000 Hz.
4. The speech emotion recognition method of claim 1, wherein the length of a speech segment is determined by the preset segmentation parameter, with a minimum granularity of 1 s, and when the last remaining part of the speech file is shorter than the length specified by the parameter, the remaining part forms a segment on its own.
5. The speech emotion recognition method according to claim 1, wherein extracting features of the speech segments based on the preset feature descriptions comprises:
extracting five different features used to characterize emotion, specifically: root mean square (RMS) energy of the signal frames, 12th-order Mel-frequency cepstral coefficients, zero-crossing rate of the time signal, harmonic-to-noise ratio, and fundamental frequency calculated from the cepstrum.
6. The speech emotion recognition method according to claim 1, wherein concatenating and fusing the extracted features based on preset functions and standardizing them to obtain fused features comprises: applying 12 functionals to the features for concatenation and fusion, the 12 functionals respectively computing the mean, standard deviation, kurtosis, skewness, minimum, maximum, relative position, range, slope of the linear approximation of the contour, offset of the linear approximation of the contour, and the difference between the linear approximation and the actual contour; after first-order differencing, a preliminary fused feature vector containing 384 attributes in total is obtained.
7. The speech emotion recognition method according to claim 6, wherein concatenating and fusing the extracted features based on preset functions and standardizing them to obtain fused features further comprises:
standardizing the preliminary fused features by subtracting the mean and dividing by the standard deviation.
8. The speech emotion recognition method of claim 1, wherein training the preset convolutional neural network model with the fused features to predict emotion comprises:
the preset convolutional neural network model comprises two one-dimensional convolution layers, in which the number of convolution kernels and the kernel size are set to 64 and 5 respectively, each convolution layer is followed by a normalization layer and a dropout layer, and the last layer is a softmax layer.
9. The speech emotion recognition method according to any one of claims 1 to 8, wherein performing emotion recognition on the target speech file or the speech segments of the target speech file using the trained preset convolutional neural network model comprises:
when the speech segments of the target speech file are recognized, obtaining the corresponding predicted label and the confidence score for each label, and analyzing and merging the results accordingly.
Background
With the wide application of deep learning in artificial intelligence, and with the emergence of many interactive intelligent robots acting as customer service agents, people have begun to pay attention to whether robots can perceive emotion. Human emotion changes constantly, so for a customer service robot to provide a comfortable interaction environment, remove barriers between robot and human, and serve customers better, the intelligent robot needs to know how the customer's emotion changes; the solution to this problem is emotion recognition. Emotional change manifests itself in expressions, behavior, body temperature, heart rate, voice, language, organs, the nervous system, and other aspects, all of which can serve as monitoring signals for emotion recognition research. Among these signals, the emotional change carried in the voice is difficult to hide and easy to perceive, so speech emotion recognition is of great significance for promoting harmonious human-computer interaction.
Emotion in speech is conveyed by speech parameters, and these parameters serve as the emotional features for emotion recognition. Through continuing research by scholars at home and abroad, the emotional features currently extracted fall into the following types: prosodic features, spectral features, and voice quality features. Prosodic features include fundamental frequency, formants, energy, and so on; spectrum-related features mainly include Linear Prediction Cepstral Coefficients (LPCC) and Mel-Frequency Cepstral Coefficients (MFCC); parameters such as the long-term average spectrum, harmonic-to-noise ratio, and spectral centroid moments belong to the category of voice quality features. In recent years, many methods for emotional feature extraction based on deep learning have also emerged, including: 1. extracting Mel-frequency cepstral coefficients from the audio as input to a convolutional neural network, which extracts further features; 2. extracting features directly from the spectrogram of the speech with a deep neural network. The extracted features are finally discriminated by a classifier; commonly used classifiers include the Support Vector Machine (SVM), random forests, and the like.
Although emotion recognition based on a single feature is relatively mature, the limitations of any single feature make such methods hard to generalize; they cannot reach a high recognition rate, easily ignore emotion changes that may occur within a piece of speech, and yield a final result that is rather fuzzy and therefore not accurate enough.
Disclosure of Invention
To solve the problems of low recognition rate and low accuracy in the prior art, the invention provides a speech emotion recognition method characterized by high accuracy and a high recognition rate.
The speech emotion recognition method according to the embodiment of the invention comprises the following steps:
acquiring audio files containing recorded conversations to construct a speech emotion database, and classifying and storing the audio files according to preset emotion categories;
slicing the audio files based on preset segmentation parameters to form speech segments;
extracting features of the speech segments based on preset feature descriptions;
concatenating and fusing the extracted features based on preset functions and standardizing them to obtain fused features;
training a preset convolutional neural network model with the fused features to predict emotion;
and performing emotion recognition on a target speech file, or on speech segments of the target speech file, using the trained preset convolutional neural network model.
Further, the preset emotion categories include four categories: positive, negative, exciting, and neutral.
Further, the audio files are mono wav files with a sampling rate of 8000 Hz.
Further, the length of a speech segment is determined by the preset segmentation parameter, with a minimum granularity of 1 s; when the last remaining part of the speech file is shorter than the length specified by the parameter, the remaining part forms a segment on its own.
Further, extracting features of the speech segments based on the preset feature descriptions includes:
extracting five different features used to characterize emotion, specifically: root mean square (RMS) energy of the signal frames, 12th-order Mel-frequency cepstral coefficients, zero-crossing rate of the time signal, harmonic-to-noise ratio, and fundamental frequency calculated from the cepstrum.
Further, concatenating and fusing the extracted features based on preset functions and standardizing them to obtain fused features includes: applying 12 functionals to the features for concatenation and fusion, the 12 functionals respectively computing the mean, standard deviation, kurtosis, skewness, minimum, maximum, relative position, range, slope of the linear approximation of the contour, offset of the linear approximation of the contour, and the difference between the linear approximation and the actual contour; after first-order differencing, a preliminary fused feature vector containing 384 attributes in total is obtained.
Further, concatenating and fusing the extracted features based on preset functions and standardizing them to obtain fused features further includes:
standardizing the preliminary fused features by subtracting the mean and dividing by the standard deviation.
Further, training the preset convolutional neural network model with the fused features to predict emotion includes:
the preset convolutional neural network model comprises two one-dimensional convolution layers, in which the number of convolution kernels and the kernel size are set to 64 and 5 respectively, each convolution layer is followed by a normalization layer and a dropout layer, and the last layer is a softmax layer.
Further, performing emotion recognition on the target speech file or the speech segments of the target speech file using the trained preset convolutional neural network model includes:
when the speech segments of the target speech file are recognized, obtaining the corresponding predicted label and the confidence score for each label, and analyzing and merging the results accordingly.
The invention has the following beneficial effects. The input audio signal is first segmented; features capable of expressing emotional information are then extracted by an audio feature extraction method; each feature is processed by a set of functionals and preliminarily concatenated and fused; and the fused features are fed into the constructed one-dimensional convolutional neural network model for training and recognition. Fusing multiple emotion-expressing features overcomes the one-sidedness and limited expressive power of a single feature: speech emotion information is captured from different angles and levels and described more comprehensively, so the system achieves a higher recognition rate and better robustness. Recognizing the audio signal in segments captures emotion changes more accurately and avoids the problem that whole-audio recognition, which returns only the label with the highest probability, ignores emotion changes.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present invention, and that other drawings can be obtained from them by a person skilled in the art without creative effort.
FIG. 1 is a flow diagram of a method of speech emotion recognition provided in accordance with an exemplary embodiment;
FIG. 2 is another flow diagram of a method of speech emotion recognition provided in accordance with an exemplary embodiment;
FIG. 3 is a structural diagram of a preset convolutional neural network model provided in accordance with an exemplary embodiment.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described in detail below. It should be understood that the described embodiments are merely exemplary of the invention and not restrictive of its full scope. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort fall within the scope of the present invention.
Referring to fig. 1, an embodiment of the present invention provides a speech emotion recognition method, which specifically includes the following steps:
101. Acquiring audio files containing recorded conversations to construct a speech emotion database, and classifying and storing the audio files according to preset emotion categories.
The speech emotion database can be constructed from recordings of conversations between customer service agents and customers; the database contains a plurality of emotion categories, and the speech collected under each emotion category comes from audio files of different speakers.
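By way of illustration only, the following Python sketch shows one way such a corpus might be organized and indexed; the directory layout, the four category names, and the helper name build_manifest are assumptions made for the example, not part of the described method.

```python
import os

# Hypothetical layout: one sub-directory per emotion category, e.g.
# corpus/positive/*.wav, corpus/negative/*.wav, corpus/exciting/*.wav, corpus/neutral/*.wav
EMOTIONS = ["positive", "negative", "exciting", "neutral"]

def build_manifest(root):
    """Collect (wav_path, emotion_label) pairs from an emotion-labelled corpus."""
    manifest = []
    for label in EMOTIONS:
        folder = os.path.join(root, label)
        if not os.path.isdir(folder):
            continue
        for name in sorted(os.listdir(folder)):
            if name.lower().endswith(".wav"):
                manifest.append((os.path.join(folder, name), label))
    return manifest
```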
102. Slicing the audio files based on preset segmentation parameters to form speech segments.
Speech segmentation means slicing a long piece of speech according to the set segmentation parameter, the length of each segment being determined by that parameter.
103. Extracting features of the speech segments based on preset feature descriptions.
Feature extraction is performed on the segmented speech in the speech database, and the corresponding features are extracted from each speech segment.
104. Concatenating and fusing the extracted features based on preset functions and standardizing them to obtain fused features.
105. Training a preset convolutional neural network model with the fused features to predict emotion.
106. Performing emotion recognition on the target speech file, or on speech segments of the target speech file, using the trained preset convolutional neural network model.
In this way, the input audio signal is segmented; features capable of expressing emotional information are extracted by an audio feature extraction method; each feature is then processed by a set of functionals and preliminarily concatenated and fused; the fused features are fed into the constructed one-dimensional convolutional neural network model for deeper fusion; and finally the corresponding classifier determines the emotion category. Multi-feature fusion overcomes the one-sidedness and limited expressive power of a single feature, capturing speech emotion information from different angles and levels and describing it more comprehensively, so the system achieves a higher recognition rate and better robustness. Recognizing the audio signal in segments captures emotion changes more accurately and avoids the problem that whole-audio recognition, which returns only the label with the highest probability, ignores emotion changes.
Referring to fig. 2, as a possible implementation of the above embodiment, the database contains four emotion categories, namely positive, negative, exciting, and neutral; the speech collected under each emotion category comes from different speakers, and the final audio format is a mono wav file with a sampling rate of 8000 Hz.
The length of each segment is determined by the set parameter; the minimum granularity of segmentation is 1 s, and when the final remaining part of the speech is shorter than the length specified by the parameter, the remaining part forms a segment on its own. Whether to apply the segmentation operation at all can also be selected by changing the parameter. The invention does not limit the specific segmentation parameters, which those skilled in the art can set according to the actual application.
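As a non-authoritative illustration of the segmentation step, the sketch below loads a mono 8000 Hz wav file and slices it into fixed-length segments, keeping any shorter remainder as a segment of its own; the use of librosa, the 3 s segment length, and the function name slice_audio are assumptions of the example.

```python
import librosa

SR = 8000  # sampling rate used by the method (mono, 8000 Hz)

def slice_audio(path, segment_seconds=3):
    """Slice a recording into fixed-length segments; the remainder forms its own segment.

    segment_seconds stands in for the segmentation parameter of the method
    (minimum granularity 1 s); the value 3 is only an illustration.
    """
    signal, _ = librosa.load(path, sr=SR, mono=True)
    hop = max(int(segment_seconds), 1) * SR
    segments = [signal[start:start + hop] for start in range(0, len(signal), hop)]
    return segments  # the last element may be shorter than segment_seconds
```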
Feature extraction is then performed on the segmented speech in the speech database. Five different features used to characterize emotion are extracted. The first is the root mean square energy (RMS energy) of the signal frame, a measure of the loudness of the audio signal; a change in loudness is an important cue for a new sound event and can be used for audio segmentation. Since emotional change also affects the loudness of the produced speech signal, a change in loudness can likewise be regarded as an important cue for a new emotional event, so RMS energy is used as one of the emotion recognition features for detecting boundaries between different emotions. The second feature is the 12th-order Mel-frequency cepstral coefficients, cepstral parameters extracted in the Mel-scale frequency domain, which describes the nonlinear characteristics of human auditory frequency perception. The extraction process comprises pre-emphasis, framing, windowing, Fast Fourier Transform (FFT), passing through a Mel filter bank, taking the logarithm, discrete cosine transform, and extraction of differential dynamic parameters; the cepstral parameters are extracted in the Mel-scale frequency domain, and only the first twelve are retained. The third feature is the zero-crossing rate of the time signal. The fourth feature is the harmonic-to-noise ratio, the ratio of harmonic to noise components, commonly used as a psychoacoustic feature that varies with emotion: for high-arousal positive emotions such as joy the harmonic-to-noise ratio is relatively high, while for negative emotions such as sadness or anger its value is relatively low. The fifth feature is the fundamental frequency calculated from the cepstrum.
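The following sketch illustrates one possible frame-level extraction of the five features using librosa; the frame and hop sizes, the HPSS-based harmonic-to-noise approximation, and the pYIN fundamental-frequency estimator are substitutions assumed for the example and are not specified by the method (which computes F0 from the cepstrum).

```python
import numpy as np
import librosa

def extract_llds(segment, sr=8000, frame_length=400, hop_length=160):
    """Frame-level low-level descriptors: RMS energy, 12 MFCCs, ZCR, an HNR
    approximation, and F0.  Frame/hop sizes (50 ms / 20 ms at 8 kHz) are
    illustrative choices, not values fixed by the method."""
    rms = librosa.feature.rms(y=segment, frame_length=frame_length, hop_length=hop_length)
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=12, hop_length=hop_length)
    zcr = librosa.feature.zero_crossing_rate(segment, frame_length=frame_length,
                                             hop_length=hop_length)

    # HNR stand-in: ratio of harmonic to residual frame energy after HPSS
    # (the text does not specify the estimator; this is only an approximation).
    harmonic = librosa.effects.harmonic(segment)
    residual = segment - harmonic
    h_rms = librosa.feature.rms(y=harmonic, frame_length=frame_length, hop_length=hop_length)
    r_rms = librosa.feature.rms(y=residual, frame_length=frame_length, hop_length=hop_length)
    hnr = 10.0 * np.log10((h_rms ** 2 + 1e-10) / (r_rms ** 2 + 1e-10))

    # F0 via pYIN as a substitute for the cepstrum-based estimate in the text.
    f0, _, _ = librosa.pyin(segment, fmin=60, fmax=400, sr=sr, hop_length=hop_length)
    f0 = np.nan_to_num(f0)[np.newaxis, :]

    n = min(x.shape[1] for x in (rms, mfcc, zcr, hnr, f0))
    # 1 + 12 + 1 + 1 + 1 = 16 low-level descriptors per frame
    return np.vstack([rms[:, :n], mfcc[:, :n], zcr[:, :n], hnr[:, :n], f0[:, :n]])
```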
Twelve functionals are then applied to each feature for concatenation and fusion. These functionals compute the mean, standard deviation, kurtosis, skewness, minimum, maximum, relative position, range, two linear regression coefficients with mean square error (MSE), namely the slope (m) and the offset (t) of the linear approximation of the contour, and the quadratic error, i.e., the difference between the linear approximation and the actual contour.
Finally, after first-order differencing, the total feature vector of each speech segment contains 16 x 2 x 12 = 384 attributes, and the preliminarily fused features are then standardized. The main purpose of subtracting the mean and dividing by the standard deviation is to make features on different scales comparable, so that their influence on the objective function is reflected in the geometric distribution rather than in the raw values; this step does not change the distribution of the original data. The transformation is x' = (x - μ) / σ, where μ denotes the mean and σ denotes the standard deviation.
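A minimal sketch of the functional-based fusion and standardization is given below; the exact choice of the last two functionals and the way the first-order differences are formed are interpretations of the description above, so the code should be read as an assumption-laden illustration rather than a reference implementation.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def functionals(contour):
    """Apply the 12 statistical functionals to one frame-level contour."""
    t = np.arange(len(contour))
    slope, offset = np.polyfit(t, contour, 1)        # linear approximation of the contour
    approx = slope * t + offset
    return np.array([
        contour.mean(), contour.std(), kurtosis(contour), skew(contour),
        contour.min(), contour.max(),
        contour.argmax() / max(len(contour) - 1, 1),  # relative position of the maximum
        contour.max() - contour.min(),                # range
        slope, offset,
        np.mean((approx - contour) ** 2),             # quadratic error (MSE) of the approximation
        np.mean(np.abs(approx - contour)),            # difference to the actual contour (assumed)
    ])

def fuse_features(llds):
    """16 LLDs and their first-order differences x 12 functionals = 384 attributes."""
    deltas = np.diff(llds, axis=1, prepend=llds[:, :1])
    contours = np.vstack([llds, deltas])              # (32, n_frames)
    return np.concatenate([functionals(c) for c in contours])  # (384,)

def standardize(matrix):
    """Z-normalisation: subtract the mean and divide by the standard deviation."""
    mu, sigma = matrix.mean(axis=0), matrix.std(axis=0) + 1e-8
    return (matrix - mu) / sigma
```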
Referring to the structural diagram of the convolutional neural network model (CNN1D) shown in fig. 3, the model is trained with the preliminarily processed features; the final model performs deep feature fusion on the one hand and predicts emotion on the other. The specific network structure consists of two one-dimensional convolution layers, in which the number of convolution kernels and the kernel size are set to 64 and 5 respectively; each convolution layer is followed by a batch normalization layer and a dropout layer (with the dropout parameter set to 0.5) to prevent overfitting; the activation functions are all relu; and the last layer is a softmax layer for the final emotion category decision, with each output node corresponding to one category. During training, the learning rate of the CNN1D model is set to 0.0002, the batch size to 32, and the number of iterations to 100; the cross-entropy loss (CE) is used as the optimization objective, and Adam is used as the optimizer.
The softmax output of the i-th node is Si = exp(zi) / Σ_{j=1}^{C} exp(zj), where zi is the output value of the i-th node and C is the number of output nodes, i.e., the number of classes to be distinguished. The softmax function converts the multi-class output values into a probability distribution over the range [0, 1].
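The network described above might be sketched in Keras as follows; the pooling layer before the classifier, the "same" padding, and the sparse categorical cross-entropy form of the loss are assumptions added to make the example runnable.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn1d(num_classes=4, input_dim=384):
    """Two Conv1D layers (64 kernels of size 5), each followed by batch
    normalisation and dropout (0.5), relu activations, softmax output."""
    model = models.Sequential([
        # the 384-attribute fused feature vector is treated as a 1-D sequence,
        # i.e. inputs are reshaped to (n_samples, 384, 1) before training
        layers.Input(shape=(input_dim, 1)),
        layers.Conv1D(64, 5, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.5),
        layers.Conv1D(64, 5, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.5),
        layers.GlobalAveragePooling1D(),  # pooling before the classifier is an assumption
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training settings stated in the text: batch size 32, 100 iterations, e.g.
# model.fit(x_train, y_train, batch_size=32, epochs=100)
```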
In some embodiments of the present invention, the features are sent to the model for prediction, and there are two options in the prediction phase. The first is segment-level emotion recognition: each segment receives a predicted label and a confidence score for each label, and for such segmented audio the results eventually need to be analyzed and merged. There are two cases when merging segmented audio: 1. if the prediction of the previous segment and that of the current segment are the same, the results are merged and their confidence scores are averaged; 2. if the predictions differ, no merging is needed and each result is recorded separately. Whether merged or not, each result is positioned in time within the whole audio, with the start and end time nodes of the corresponding segment given. The final result has the format start time, end time, duration, confidence, and emotion category, as shown in the following table:
Parameter name   Type     Parameter description
Start            float    Emotion start time, in s
End              float    Emotion end time, in s
Duration         float    Duration
Confidence       float    Confidence score
Emotion          string   Emotion category
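A simplified merging routine consistent with the two cases described above might look as follows; it assumes equal-length segments for time positioning and represents each merged row in the Start/End/Duration/Confidence/Emotion format of the table, which is an illustrative simplification rather than the exact output of the method.

```python
def merge_segments(segment_results, segment_seconds):
    """Merge consecutive segment predictions that share the same label, averaging
    their confidence scores; segment_results is a list of (label, confidence)
    tuples in temporal order."""
    merged = []
    for index, (label, confidence) in enumerate(segment_results):
        start = index * segment_seconds
        end = start + segment_seconds
        if merged and merged[-1]["Emotion"] == label:
            prev = merged[-1]
            n = prev["_count"]
            prev["Confidence"] = (prev["Confidence"] * n + confidence) / (n + 1)
            prev["End"] = end
            prev["Duration"] = prev["End"] - prev["Start"]
            prev["_count"] = n + 1
        else:
            merged.append({"Start": start, "End": end, "Duration": segment_seconds,
                           "Confidence": confidence, "Emotion": label, "_count": 1})
    for row in merged:
        row.pop("_count")
    return merged
```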
The other option is whole-utterance emotion recognition: the input audio skips the segmentation operation, overall multi-feature fusion is performed, and recognition follows; the corresponding result gives only the label with the highest predicted probability and the score for each label, returned in the form of emotion category (emotion) and score (score). The performance of the final model is shown in the following table:
Emotion category   Precision    Recall       F1
positive           0.86313466   0.85371179   0.85839737
negative           0.88361045   0.87529412   0.87943262
exciting           0.89537713   0.87410926   0.88461538
neutral            0.878125     0.93355482   0.90499195
Precision, Recall, and F1 denote the precision, recall, and F1 score respectively; the model is evaluated and comprehensively analyzed with these three indices. The final recognition rate of the model over the four emotions reaches 88%, which shows that the multi-feature fusion method achieves a high speech emotion recognition rate and that adding segmentation lets the system track emotion changes in the audio more accurately.
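For reference, per-class precision, recall, and F1 scores of the kind reported above can be computed with scikit-learn as sketched below; the variable names y_true and y_pred are hypothetical placeholders for the ground-truth and predicted labels of a held-out test set.

```python
from sklearn.metrics import classification_report

def evaluate(y_true, y_pred):
    """Print per-class precision, recall, and F1 for the four emotion categories."""
    # y_true and y_pred are sequences of labels such as "positive", "negative",
    # "exciting", "neutral"; digits=4 roughly matches the precision of the table.
    return classification_report(y_true, y_pred, digits=4)
```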
It is understood that those skilled in the art may use other forms of neural network models to identify the emotion categories, and the invention is not limited in this respect.
According to the speech emotion recognition method provided by the embodiment of the invention, the input audio signal is segmented; five kinds of features capable of expressing emotional information (root mean square energy, 12th-order Mel-frequency cepstral coefficients, zero-crossing rate, harmonic-to-noise ratio, and fundamental frequency) are extracted by an audio feature extraction method; functionals are then used to compute, for each feature, the mean, standard deviation, kurtosis, skewness, minimum, maximum, relative position, range, two linear regression coefficients with mean square error (MSE), and the quadratic error; and after first-order differencing the results are preliminarily concatenated and fused. The fused features are fed into the constructed one-dimensional convolutional neural network model for deeper fusion, and the emotion category is finally determined by the softmax classifier. This addresses the problems that a single feature effectively simplifies only certain aspects of the data and has limited expressive power; using a convolutional neural network at the back end lets the prediction model automatically select discriminative features according to the implicit characteristics of the speech data, effectively improving the recognition rate. Segmenting the audio signal before recognition solves the problem that whole-utterance recognition cannot accurately capture emotion changes.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be implemented by program instructions directing the associated hardware; the program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification or the claims is intended to mean a "non-exclusive or".
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.