Voice identity verification method based on long-term formant measurement
1. A voice identity verification method based on long-term formant measurement, characterized by comprising the following steps:
given a known voice file from a single speaker, calculating the distance between the long-term formant data of any two speech segments in the known voice file, and obtaining an upper limit distance $D^*_{\max}$ and a lower limit distance $D^*_{\min}$;
when a test speech is collected, calculating the long-term formant distance $D$ between the test speech and the known voice file, and making the following judgment:
when $D < D^*_{\min}$, judging that the test speech and the known voice file have identity, i.e. come from the same speaker;
when $D > D^*_{\max}$, judging that the test speech and the known voice file do not have identity, i.e. come from different speakers;
when $D^*_{\min} \le D \le D^*_{\max}$, using a hypothesis test to verify identity.
2. The method of claim 1, wherein: the upper limit distance $D^*_{\max}$ and the lower limit distance $D^*_{\min}$ are calculated as follows:
let the 4 long-term formant measurement data of 2 speech segments in the known voice file be X1 and Y1, where
$$X1=\begin{bmatrix}x_{F11}&x_{F12}&\cdots&x_{F1m}\\x_{F21}&x_{F22}&\cdots&x_{F2m}\\x_{F31}&x_{F32}&\cdots&x_{F3m}\\x_{F41}&x_{F42}&\cdots&x_{F4m}\end{bmatrix},\qquad Y1=\begin{bmatrix}y_{F11}&y_{F12}&\cdots&y_{F1n}\\y_{F21}&y_{F22}&\cdots&y_{F2n}\\y_{F31}&y_{F32}&\cdots&y_{F3n}\\y_{F41}&y_{F42}&\cdots&y_{F4n}\end{bmatrix}$$
where $x_{F11},\dots,x_{F1m}$ are the first to $m$-th formant data at the first frequency of the first speech segment, $x_{F21},\dots,x_{F2m}$ are the first to $m$-th formant data at the second frequency of the first speech segment, $x_{F31},\dots,x_{F3m}$ are the first to $m$-th formant data at the third frequency of the first speech segment, and $x_{F41},\dots,x_{F4m}$ are the first to $m$-th formant data at the fourth frequency of the first speech segment; $y_{F11},\dots,y_{F1n}$ are the first to $n$-th formant data at the first frequency of the second speech segment, $y_{F21},\dots,y_{F2n}$ are the first to $n$-th formant data at the second frequency of the second speech segment, $y_{F31},\dots,y_{F3n}$ are the first to $n$-th formant data at the third frequency of the second speech segment, and $y_{F41},\dots,y_{F4n}$ are the first to $n$-th formant data at the fourth frequency of the second speech segment; the first to fourth frequencies are sequentially increasing or sequentially decreasing;
the columns of each long-term formant measurement matrix form formant vectors $x_i=[x_{F1i}\ x_{F2i}\ x_{F3i}\ x_{F4i}]$ and $y_i=[y_{F1i}\ y_{F2i}\ y_{F3i}\ y_{F4i}]$; the center positions of the $m$ vectors of the first speech segment and of the $n$ vectors of the second speech segment are calculated separately; let $x_c=[x_{F1c}\ x_{F2c}\ x_{F3c}\ x_{F4c}]$ be the center of the X1 matrix and $y_c=[y_{F1c}\ y_{F2c}\ y_{F3c}\ y_{F4c}]$ be the center of the Y1 matrix; according to the clustering principle, the distance from $x_c$ to the vectors $x_i$ is to be minimized, so $x_c$ and $y_c$ are obtained by solving the following minimization problems:
$$x_c=\arg\min_{x}\sum_{i=1}^{m}d(x,x_i),\qquad y_c=\arg\min_{y}\sum_{i=1}^{n}d(y,y_i)$$
where $d(\cdot,\cdot)$ denotes the distance between two formant vectors;
on the basis of $x_c$ and $y_c$, the Euclidean distance between the centers is calculated as the long-term formant distance $D^*$ of the two speech segments:
$$D^*=\lVert x_c-y_c\rVert_2=\sqrt{\sum_{k=1}^{4}\left(x_{Fkc}-y_{Fkc}\right)^2}$$
the distance between every two speech segments in the known voice file is calculated according to the above method, and the maximum and minimum values are taken as the upper limit distance $D^*_{\max}$ and the lower limit distance $D^*_{\min}$, respectively.
3. The method of claim 2, wherein: the long-term formant distance D of the test speech is calculated by the same method as the long-term formant distance D* of two speech segments in the known voice file.
4. The method of claim 3, wherein: the hypothesis test is a t-test, with the following specific steps:
let the 4 long-term formant measurement data of the test speech be Z1, where
$$Z1=\begin{bmatrix}z_{F11}&z_{F12}&\cdots&z_{F1j}\\z_{F21}&z_{F22}&\cdots&z_{F2j}\\z_{F31}&z_{F32}&\cdots&z_{F3j}\\z_{F41}&z_{F42}&\cdots&z_{F4j}\end{bmatrix}$$
where $z_{F11},\dots,z_{F1j}$ are the first to $j$-th formant data at the first frequency of the test speech, $z_{F21},\dots,z_{F2j}$ are the first to $j$-th formant data at the second frequency of the test speech, $z_{F31},\dots,z_{F3j}$ are the first to $j$-th formant data at the third frequency of the test speech, and $z_{F41},\dots,z_{F4j}$ are the first to $j$-th formant data at the fourth frequency of the test speech;
let $x_{F21},x_{F22},x_{F23},\dots,x_{F2m}$ obey the normal distribution $N(u,\sigma^2)$ and $z_{F21},z_{F22},z_{F23},\dots,z_{F2j}$ obey the normal distribution $N(v,\sigma^2)$; according to statistical theory, the formant data at the second frequency are then distributed as follows:
$$\frac{(x_{F2mean}-z_{F2mean})-(u-v)}{\sqrt{\dfrac{(m-1)S_x^2+(j-1)S_z^2}{m+j-2}}\sqrt{\dfrac{1}{m}+\dfrac{1}{j}}}\sim t(m+j-2)$$
where $x_{F2mean}$ and $S_x$ are the mean and standard deviation of $x_{F21},x_{F22},x_{F23},\dots,x_{F2m}$, and $z_{F2mean}$ and $S_z$ are the mean and standard deviation of $z_{F21},z_{F22},z_{F23},\dots,z_{F2j}$;
given a significance level $\alpha$, when
$$\frac{\left|x_{F2mean}-z_{F2mean}\right|}{\sqrt{\dfrac{(m-1)S_x^2+(j-1)S_z^2}{m+j-2}}\sqrt{\dfrac{1}{m}+\dfrac{1}{j}}}<t_{\alpha}(m+j-2)$$
the test speech is judged to have identity with the known voice file; otherwise, the test speech is judged not to have identity with the known voice file.
5. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein: the processor, when executing the computer program, performs the steps of the method for verifying speech identity based on long-term formant measurements according to any one of claims 1 to 4.
6. A non-transitory computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program when being executed by a processor realizes the steps of a method for verifying speech identity based on long-term formant measurements according to any one of claims 1 to 4.
Background
Formants are an important feature in voiceprint identification: they not only provide a reference for distinguishing consonants and vowels but also carry individual characteristics of the speaker. Formant frequencies are affected by the length of the vocal tract: a longer vocal tract results in lower vowel formants, and the relative proportions of the different parts of the vocal tract also affect formant frequencies.
There are many ways to measure formant frequencies. The most classical is to measure the center frequency values of the formants of different vowels. However, the formant frequencies of different vowels, and the different formants themselves, are not sufficiently correlated with one another, which reduces the accuracy of identification. Another approach to studying formants is dynamic analysis: when individuals speak, they leave traces of their specific movement patterns, and these traces reflect the individual characteristics of the speaker. However, formant dynamics are affected by both segmental and prosodic context, so this method still requires further study of the differences between speaking contexts.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a voice identity verification method based on long-term formant measurement that can improve verification accuracy.
The technical solution adopted by the invention to solve the above technical problem is as follows: a voice identity verification method based on long-term formant measurement, comprising the following steps:
given a known voice file from a single speaker, calculating the distance between the long-term formant data of any two speech segments in the known voice file, and obtaining an upper limit distance $D^*_{\max}$ and a lower limit distance $D^*_{\min}$;
when a test speech is collected, calculating the long-term formant distance $D$ between the test speech and the known voice file, and making the following judgment:
when $D < D^*_{\min}$, judging that the test speech and the known voice file have identity, i.e. come from the same speaker;
when $D > D^*_{\max}$, judging that the test speech and the known voice file do not have identity, i.e. come from different speakers;
when $D^*_{\min} \le D \le D^*_{\max}$, using a hypothesis test to verify identity.
According to the above method, the upper limit distance $D^*_{\max}$ and the lower limit distance $D^*_{\min}$ are calculated as follows:
let the 4 long-term formant measurement data of 2 speech segments in the known voice file be X1 and Y1, where
$$X1=\begin{bmatrix}x_{F11}&x_{F12}&\cdots&x_{F1m}\\x_{F21}&x_{F22}&\cdots&x_{F2m}\\x_{F31}&x_{F32}&\cdots&x_{F3m}\\x_{F41}&x_{F42}&\cdots&x_{F4m}\end{bmatrix},\qquad Y1=\begin{bmatrix}y_{F11}&y_{F12}&\cdots&y_{F1n}\\y_{F21}&y_{F22}&\cdots&y_{F2n}\\y_{F31}&y_{F32}&\cdots&y_{F3n}\\y_{F41}&y_{F42}&\cdots&y_{F4n}\end{bmatrix}$$
where $x_{F11},\dots,x_{F1m}$ are the first to $m$-th formant data at the first frequency of the first speech segment, $x_{F21},\dots,x_{F2m}$ are the first to $m$-th formant data at the second frequency of the first speech segment, $x_{F31},\dots,x_{F3m}$ are the first to $m$-th formant data at the third frequency of the first speech segment, and $x_{F41},\dots,x_{F4m}$ are the first to $m$-th formant data at the fourth frequency of the first speech segment; $y_{F11},\dots,y_{F1n}$ are the first to $n$-th formant data at the first frequency of the second speech segment, $y_{F21},\dots,y_{F2n}$ are the first to $n$-th formant data at the second frequency of the second speech segment, $y_{F31},\dots,y_{F3n}$ are the first to $n$-th formant data at the third frequency of the second speech segment, and $y_{F41},\dots,y_{F4n}$ are the first to $n$-th formant data at the fourth frequency of the second speech segment; the first to fourth frequencies are sequentially increasing or sequentially decreasing;
the columns of each long-term formant measurement matrix form formant vectors $x_i=[x_{F1i}\ x_{F2i}\ x_{F3i}\ x_{F4i}]$ and $y_i=[y_{F1i}\ y_{F2i}\ y_{F3i}\ y_{F4i}]$; the center positions of the $m$ vectors of the first speech segment and of the $n$ vectors of the second speech segment are calculated separately; let $x_c=[x_{F1c}\ x_{F2c}\ x_{F3c}\ x_{F4c}]$ be the center of the X1 matrix and $y_c=[y_{F1c}\ y_{F2c}\ y_{F3c}\ y_{F4c}]$ be the center of the Y1 matrix; according to the clustering principle, the distance from $x_c$ to the vectors $x_i$ is to be minimized, so $x_c$ and $y_c$ are obtained by solving the following minimization problems:
$$x_c=\arg\min_{x}\sum_{i=1}^{m}d(x,x_i),\qquad y_c=\arg\min_{y}\sum_{i=1}^{n}d(y,y_i)$$
where $d(\cdot,\cdot)$ denotes the distance between two formant vectors;
on the basis of $x_c$ and $y_c$, the Euclidean distance between the centers is calculated as the long-term formant distance $D^*$ of the two speech segments:
$$D^*=\lVert x_c-y_c\rVert_2=\sqrt{\sum_{k=1}^{4}\left(x_{Fkc}-y_{Fkc}\right)^2}$$
the distance between every two different speech segments in the known voice file is calculated according to the above method, and the maximum and minimum values are taken as the upper limit distance $D^*_{\max}$ and the lower limit distance $D^*_{\min}$, respectively.
According to the above method, the long-term formant distance D of the test speech is calculated by the same method as the long-term formant distance D* of two speech segments in the known voice file.
According to the above method, the hypothesis test is a t-test, with the following specific steps:
let the 4 long-term formant measurement data of the test speech be Z1, where
$$Z1=\begin{bmatrix}z_{F11}&z_{F12}&\cdots&z_{F1j}\\z_{F21}&z_{F22}&\cdots&z_{F2j}\\z_{F31}&z_{F32}&\cdots&z_{F3j}\\z_{F41}&z_{F42}&\cdots&z_{F4j}\end{bmatrix}$$
where $z_{F11},\dots,z_{F1j}$ are the first to $j$-th formant data at the first frequency of the test speech, $z_{F21},\dots,z_{F2j}$ are the first to $j$-th formant data at the second frequency of the test speech, $z_{F31},\dots,z_{F3j}$ are the first to $j$-th formant data at the third frequency of the test speech, and $z_{F41},\dots,z_{F4j}$ are the first to $j$-th formant data at the fourth frequency of the test speech;
let $x_{F21},x_{F22},x_{F23},\dots,x_{F2m}$ obey the normal distribution $N(u,\sigma^2)$ and $z_{F21},z_{F22},z_{F23},\dots,z_{F2j}$ obey the normal distribution $N(v,\sigma^2)$; according to statistical theory, the formant data at the second frequency are then distributed as follows:
$$\frac{(x_{F2mean}-z_{F2mean})-(u-v)}{\sqrt{\dfrac{(m-1)S_x^2+(j-1)S_z^2}{m+j-2}}\sqrt{\dfrac{1}{m}+\dfrac{1}{j}}}\sim t(m+j-2)$$
where $x_{F2mean}$ and $S_x$ are the mean and standard deviation of $x_{F21},x_{F22},x_{F23},\dots,x_{F2m}$, and $z_{F2mean}$ and $S_z$ are the mean and standard deviation of $z_{F21},z_{F22},z_{F23},\dots,z_{F2j}$;
given a significance level $\alpha$, when
$$\frac{\left|x_{F2mean}-z_{F2mean}\right|}{\sqrt{\dfrac{(m-1)S_x^2+(j-1)S_z^2}{m+j-2}}\sqrt{\dfrac{1}{m}+\dfrac{1}{j}}}<t_{\alpha}(m+j-2)$$
the test speech is judged to have identity with the known voice file; otherwise, the test speech is judged not to have identity with the known voice file.
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method for voice identity verification based on long-term formant measurements when executing the computer program.
A non-transitory computer readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for voice identity verification based on long-term formant measurements.
The beneficial effects of the invention are as follows: voice identity is verified by acquiring the long-term formants of the voice files and combining the long-term formant distance with a hypothesis test, so that verification accuracy can be improved.
Drawings
FIG. 1 shows the frequencies of the formants LTF2 and LTF3 at vowels in different contexts of speech.
FIG. 2 is a formant spectrum.
FIG. 3 is a plot of formant F1-F3 frequency versus time.
FIG. 4 is a frequency distribution plot of formants F1-F3.
FIG. 5 is a graph of the long-term formant LTF2 and LTF3 distribution for different speakers.
FIG. 6 is a graph of the long term formant LTF2 and LTF3 distribution for the same speaker.
FIG. 7 is a t-test confidence interval distribution diagram.
FIG. 8 is a flowchart of a method according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following specific examples and figures.
FIG. 1 depicts the frequency variation of LTF2 and LTF3 for multiple test subjects in two contexts, natural speaking and reading. It can be seen that the mean frequencies of a speaker's LTF2 and LTF3 vary very little between the two contexts. LTF4 is more strongly affected by telephone communication bandwidth, so the present invention selects LTF2 and LTF3 as the basis for voiceprint identification.
As shown in FIG. 2, for a voice file to be identified, the positions of the vowel formants F1-F4 are determined by combining linear predictive analysis with manual correction; the curves are labeled F1 to F4 in order from low frequency to high frequency. Because formant F4 is unstable, it is not used as a basis for identification. The time-varying curves of formants F1-F3 are shown in FIG. 3, and the long-term formant F1-F3 frequency distribution curves shown in FIG. 4 can be drawn from the frequency of each formant and its probability of occurrence.
In terms of these long-term formant frequency distributions, different speakers have different distributions of LTF2 and LTF3. FIG. 5 depicts the distributions of vowel LTF2 and LTF3 for 2 test subjects: the two solid lines are the LTF2 distributions of the two subjects, and the two dashed lines are their LTF3 distributions. It can be seen from the figure that the LTF2 and LTF3 of the 2 subjects differ not only in mean frequency but also considerably in the interval covered by the distribution curve and in the shape of the curve.
The distributions of vowel LTF2 and LTF3 measured for the same speaker in different contexts are shown in FIG. 6: the two solid lines are the vowel LTF2 distributions measured in the different contexts, and the two dashed lines are the corresponding vowel LTF3 distributions. It can be seen from the figure that, for the same speaker in different contexts, the long-term formants LTF2 and LTF3 show only a small change in mean frequency, and the intervals and shapes of their distribution curves are very similar. Therefore, a hypothesis test can be performed on the measured LTF2 and LTF3 data using a probabilistic method to determine whether the test speech sample comes from the target speaker.
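The construction of a long-term formant distribution from per-frame formant frequencies can be sketched as a normalized histogram. This is only an illustration: the function name `ltf_distribution`, the 50 Hz bin width, and the sample F2 values are assumptions and do not come from the source, which does not state how the curves of FIG. 4 are computed.

```python
from collections import Counter


def ltf_distribution(freqs_hz, bin_width_hz=50.0):
    """Return {bin_start_hz: relative frequency} for pooled formant measurements.

    freqs_hz: formant frequencies (Hz) pooled over all vowel frames of a recording.
    bin_width_hz: histogram bin width (hypothetical default, not from the source).
    """
    if not freqs_hz:
        raise ValueError("need at least one formant measurement")
    # Assign every measurement to the bin whose lower edge is k * bin_width_hz.
    counts = Counter(int(f // bin_width_hz) for f in freqs_hz)
    total = len(freqs_hz)
    return {k * bin_width_hz: c / total for k, c in sorted(counts.items())}


# Hypothetical F2 values (Hz) measured on successive vowel frames.
f2_track = [1480.0, 1510.0, 1525.0, 1560.0, 1610.0, 1490.0]
dist = ltf_distribution(f2_track)
```

Plotting the returned relative frequencies against the bin positions gives a curve of the kind described for FIG. 4; a kernel density estimate would be a smoother alternative.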
Based on the above principle and research, the present invention provides a voice identity verification method based on long-term formant measurement, as shown in fig. 8, the method includes:
S1. Given a known voice file from a single speaker, the distance between the long-term formant data of any two speech segments in the known voice file is calculated, and an upper limit distance $D^*_{\max}$ and a lower limit distance $D^*_{\min}$ are obtained.
The upper limit distance $D^*_{\max}$ and the lower limit distance $D^*_{\min}$ are calculated as follows:
let the 4 long-term formant measurement data of 2 speech segments in the known voice file be X1 and Y1, where
$$X1=\begin{bmatrix}x_{F11}&x_{F12}&\cdots&x_{F1m}\\x_{F21}&x_{F22}&\cdots&x_{F2m}\\x_{F31}&x_{F32}&\cdots&x_{F3m}\\x_{F41}&x_{F42}&\cdots&x_{F4m}\end{bmatrix},\qquad Y1=\begin{bmatrix}y_{F11}&y_{F12}&\cdots&y_{F1n}\\y_{F21}&y_{F22}&\cdots&y_{F2n}\\y_{F31}&y_{F32}&\cdots&y_{F3n}\\y_{F41}&y_{F42}&\cdots&y_{F4n}\end{bmatrix}$$
where $x_{F11},\dots,x_{F1m}$ are the first to $m$-th formant data at the first frequency of the first speech segment, $x_{F21},\dots,x_{F2m}$ are the first to $m$-th formant data at the second frequency of the first speech segment, $x_{F31},\dots,x_{F3m}$ are the first to $m$-th formant data at the third frequency of the first speech segment, and $x_{F41},\dots,x_{F4m}$ are the first to $m$-th formant data at the fourth frequency of the first speech segment; $y_{F11},\dots,y_{F1n}$ are the first to $n$-th formant data at the first frequency of the second speech segment, $y_{F21},\dots,y_{F2n}$ are the first to $n$-th formant data at the second frequency of the second speech segment, $y_{F31},\dots,y_{F3n}$ are the first to $n$-th formant data at the third frequency of the second speech segment, and $y_{F41},\dots,y_{F4n}$ are the first to $n$-th formant data at the fourth frequency of the second speech segment; the first to fourth frequencies are sequentially increasing or sequentially decreasing;
the columns of each long-term formant measurement matrix form formant vectors $x_i=[x_{F1i}\ x_{F2i}\ x_{F3i}\ x_{F4i}]$ and $y_i=[y_{F1i}\ y_{F2i}\ y_{F3i}\ y_{F4i}]$; the center positions of the $m$ vectors of the first speech segment and of the $n$ vectors of the second speech segment are calculated separately; let $x_c=[x_{F1c}\ x_{F2c}\ x_{F3c}\ x_{F4c}]$ be the center of the X1 matrix and $y_c=[y_{F1c}\ y_{F2c}\ y_{F3c}\ y_{F4c}]$ be the center of the Y1 matrix; according to the clustering principle, the distance from $x_c$ to the vectors $x_i$ is to be minimized, so $x_c$ and $y_c$ are obtained by solving the following minimization problems:
$$x_c=\arg\min_{x}\sum_{i=1}^{m}d(x,x_i),\qquad y_c=\arg\min_{y}\sum_{i=1}^{n}d(y,y_i)$$
where $d(\cdot,\cdot)$ denotes the distance between two formant vectors;
on the basis of $x_c$ and $y_c$, the Euclidean distance between the centers is calculated as the long-term formant distance $D^*$ of the two speech segments:
$$D^*=\lVert x_c-y_c\rVert_2=\sqrt{\sum_{k=1}^{4}\left(x_{Fkc}-y_{Fkc}\right)^2}$$
The distance between every two different speech segments in the known voice file is calculated according to the above method, and the maximum and minimum values are taken as the upper limit distance $D^*_{\max}$ and the lower limit distance $D^*_{\min}$, respectively.
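As a hedged illustration of step S1, the following Python sketch computes the center of each speech segment, the Euclidean distance between two centers, and the minimum and maximum over all segment pairs. It assumes that the center is the arithmetic mean of the formant vectors, which is the minimizer when the clustering criterion is the sum of squared Euclidean distances; the source does not preserve the exact criterion. The function names and the sample formant values are hypothetical.

```python
import math
from itertools import combinations


def centre(vectors):
    """Arithmetic mean of a list of [F1, F2, F3, F4] vectors.

    Assumption: the centre minimises the summed squared Euclidean distance,
    whose minimiser is the component-wise mean.
    """
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def ltf_distance(seg_a, seg_b):
    """Euclidean distance between the centres of two speech segments."""
    return math.dist(centre(seg_a), centre(seg_b))


def distance_limits(segments):
    """Return (lower limit, upper limit) over all pairs of known segments."""
    if len(segments) < 2:
        raise ValueError("need at least two segments from the known speaker")
    dists = [ltf_distance(a, b) for a, b in combinations(segments, 2)]
    return min(dists), max(dists)


# Hypothetical formant vectors (Hz); each inner list is one column of X1 or Y1.
seg1 = [[500.0, 1500.0, 2500.0, 3500.0], [520.0, 1520.0, 2520.0, 3520.0]]
seg2 = [[513.0, 1514.0, 2510.0, 3510.0]]
seg3 = [[510.0, 1510.0, 2510.0, 3522.0]]
d_lower, d_upper = distance_limits([seg1, seg2, seg3])
```

With these sample values the three pairwise distances are 5, 12 and 13 Hz, so the lower and upper limits are 5 and 13. If the intended criterion were the sum of unsquared distances, `centre` would have to compute a geometric median instead of the mean.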
S2. When a test speech is collected, the long-term formant distance $D$ between the test speech and the known voice file is calculated; $D$ is calculated by the same method as the long-term formant distance $D^*$ of two speech segments in the known voice file.
The following judgment is then made: when $D < D^*_{\min}$, the test speech and the known voice file are judged to have identity, i.e. to come from the same speaker; when $D > D^*_{\max}$, the test speech and the known voice file are judged not to have identity, i.e. to come from different speakers; when $D^*_{\min} \le D \le D^*_{\max}$, a hypothesis test is used to verify identity.
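The three-way judgment of step S2 can be sketched as a small function. This is a minimal illustration under stated assumptions: the function name, the returned labels, and the treatment of the boundary values (a distance equal to either limit goes to the hypothesis test) are choices made here and are not confirmed by the source.

```python
def judge_by_distance(d, d_lower, d_upper):
    """Classify a test speech by its long-term formant distance d.

    d_lower, d_upper: lower and upper limit distances from the known voice file.
    Returns "same", "different" or "hypothesis_test" (hypothetical labels).
    """
    if d_lower > d_upper:
        raise ValueError("lower limit must not exceed upper limit")
    if d < d_lower:
        return "same"        # closer than any pair of known segments
    if d > d_upper:
        return "different"   # farther than any pair of known segments
    return "hypothesis_test"  # inconclusive: fall back to the t-test


# Hypothetical limits (Hz) and test distances.
results = [judge_by_distance(d, 5.0, 13.0) for d in (3.0, 9.0, 20.0)]
```

Only the middle case triggers the t-test, so the more expensive statistical check runs only when the distance alone is inconclusive.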
The hypothesis test is a t-test, with the following specific steps:
let the 4 long-term formant measurement data of the test speech be Z1, where
$$Z1=\begin{bmatrix}z_{F11}&z_{F12}&\cdots&z_{F1j}\\z_{F21}&z_{F22}&\cdots&z_{F2j}\\z_{F31}&z_{F32}&\cdots&z_{F3j}\\z_{F41}&z_{F42}&\cdots&z_{F4j}\end{bmatrix}$$
where $z_{F11},\dots,z_{F1j}$ are the first to $j$-th formant data at the first frequency of the test speech, $z_{F21},\dots,z_{F2j}$ are the first to $j$-th formant data at the second frequency of the test speech, $z_{F31},\dots,z_{F3j}$ are the first to $j$-th formant data at the third frequency of the test speech, and $z_{F41},\dots,z_{F4j}$ are the first to $j$-th formant data at the fourth frequency of the test speech;
let $x_{F21},x_{F22},x_{F23},\dots,x_{F2m}$ obey the normal distribution $N(u,\sigma^2)$ and $z_{F21},z_{F22},z_{F23},\dots,z_{F2j}$ obey the normal distribution $N(v,\sigma^2)$; according to statistical theory, the formant data at the second frequency are then distributed as follows:
$$\frac{(x_{F2mean}-z_{F2mean})-(u-v)}{\sqrt{\dfrac{(m-1)S_x^2+(j-1)S_z^2}{m+j-2}}\sqrt{\dfrac{1}{m}+\dfrac{1}{j}}}\sim t(m+j-2)$$
where $x_{F2mean}$ and $S_x$ are the mean and standard deviation of $x_{F21},x_{F22},x_{F23},\dots,x_{F2m}$, and $z_{F2mean}$ and $S_z$ are the mean and standard deviation of $z_{F21},z_{F22},z_{F23},\dots,z_{F2j}$.
There are 2 hypotheses, $H_0: u=v$ and $H_1: u\neq v$. If $H_0$ holds, then:
$$\frac{x_{F2mean}-z_{F2mean}}{\sqrt{\dfrac{(m-1)S_x^2+(j-1)S_z^2}{m+j-2}}\sqrt{\dfrac{1}{m}+\dfrac{1}{j}}}\sim t(m+j-2)$$
When testing $H_0$ against $H_1$, a significance level $\alpha$ is given; when
$$\frac{\left|x_{F2mean}-z_{F2mean}\right|}{\sqrt{\dfrac{(m-1)S_x^2+(j-1)S_z^2}{m+j-2}}\sqrt{\dfrac{1}{m}+\dfrac{1}{j}}}<t_{\alpha}(m+j-2)$$
the test speech is judged to have identity with the known voice file, i.e. $H_0$ is accepted; otherwise, the test speech is judged not to have identity with the known voice file, i.e. $H_0$ is rejected.
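The t-test on the second-formant data can be sketched in Python as the standard pooled-variance two-sample statistic, which is what the equal-variance assumption $N(u,\sigma^2)$, $N(v,\sigma^2)$ leads to. The function names and sample values are hypothetical, and the critical value is passed in by the caller (for example from a t table) rather than computed, because the sketch uses only the standard library.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(x, z):
    """Two-sample t statistic with pooled variance, m + j - 2 degrees of freedom.

    x: second-formant values of the known speech (length m >= 2).
    z: second-formant values of the test speech (length j >= 2).
    """
    m, j = len(x), len(z)
    # Pooled variance from the two sample variances (n - 1 denominators).
    pooled_var = ((m - 1) * variance(x) + (j - 1) * variance(z)) / (m + j - 2)
    return (mean(x) - mean(z)) / math.sqrt(pooled_var * (1 / m + 1 / j))


def accept_h0(x, z, t_crit):
    """Accept H0 (same speaker) when |t| is below the supplied critical value."""
    return abs(pooled_t_statistic(x, z)) < t_crit


# Hypothetical F2 samples (Hz); 2.776 is the tabulated two-sided 5% critical
# value of the t distribution with 4 degrees of freedom.
known = [1500.0, 1520.0, 1540.0]
close = [1510.0, 1530.0, 1550.0]
far = [1700.0, 1720.0, 1740.0]
t_close = pooled_t_statistic(known, close)
same_close = accept_h0(known, close, t_crit=2.776)
same_far = accept_h0(known, far, t_crit=2.776)
```

Here `t_close` is about -0.61, so $H_0$ is accepted for the close sample, while the far sample gives a statistic of about -11 and is rejected. Whether the source's $t_{\alpha}$ denotes a one-sided or two-sided critical value is not recoverable, so the choice of `t_crit` is left to the caller.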
As shown in FIG. 7, for two test materials to be considered as coming from the same speaker at the 95% confidence level, the long-term formants measured from the two files must satisfy the following inequality:
$$\left|x_{F2mean}-z_{F2mean}\right|<c$$
where
$$c=t_{0.05}(m+j-2)\sqrt{\dfrac{(m-1)S_x^2+(j-1)S_z^2}{m+j-2}}\sqrt{\dfrac{1}{m}+\dfrac{1}{j}}$$
and $t_{0.05}(m+j-2)$ is the value of the t-distribution variable corresponding to $m+j-2$ degrees of freedom, i.e. to $\alpha=0.05$. As can be seen from FIG. 7, the larger $1-\alpha$ is, the greater the confidence that $H_0$ holds. Since the t distribution is symmetric about the vertical axis, let $2\beta=1-\alpha$, so that $\beta$ corresponds to the half of the acceptance region lying on one side of the vertical axis.
When the hypothesis test of identity is applied to two samples, upper and lower limits of $\beta$ can be determined by comparison with samples in order to fix a reasonable value range for $\beta$. Depending on where $\beta$ lies relative to these limits, the test material is considered to have identity, is rejected as having identity, or, in the remaining case, requires a comprehensive judgment in combination with the distance D.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of the voice identity verification method based on the long-term formant measurement when executing the computer program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for voice identity verification based on long-term formant measurements.
The above embodiments are only used for illustrating the design idea and features of the present invention, and the purpose of the present invention is to enable those skilled in the art to understand the content of the present invention and implement the present invention accordingly, and the protection scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes and modifications made in accordance with the principles and concepts disclosed herein are intended to be included within the scope of the present invention.