Audio signal real-time tracking and comparison method based on voiceprint technology
1. An audio signal real-time tracking and comparison method based on voiceprint technology, characterized by comprising the following steps:
P1: calculating voiceprints: defining the broadcast audio as the source audio and the air-received audio as the target audio, preprocessing both, obtaining the source voiceprint vector and the target voiceprint vector respectively, and placing them into corresponding matrix caches;
P2: calculating vector distances: taking the target voiceprint as the axis and stepping in minimum units of one second, calculating the Euclidean distance between the source voiceprint and the target voiceprint at each step to obtain the Euclidean distance matrix Dxs;
P3: preliminarily determining the delay time: determining, from the Euclidean distance matrix Dxs, the last delay time t_last, the delay time t_min corresponding to the minimum value of Dxs, and the delay time t_line corresponding to the minimum of the arithmetic means of the Dxs rows;
P4: refining the delay time determination: calculating the similarity corresponding to each of t_last, t_min and t_line, and performing a secondary delay-time judgment according to the similarity;
P5: transition judgment: introducing a delayed-processing mechanism for transitions between the similar and dissimilar states of the delay time;
P6: aligning the audio: outputting the delay time to align the source audio with the target audio;
P7: calculating the online channel indicators corresponding to the source audio and the target audio;
P8: repeating P1-P7 to cyclically and dynamically track the source audio and the target audio.
2. The method for real-time tracking and comparing audio signals based on the voiceprint technology as claimed in claim 1, wherein the preprocessing comprises the following steps:
S1: pre-emphasis, compensating the high-frequency part;
S2: framing, grouping a number of sampling points into frames;
S3: windowing, applying a Hamming window to smooth the signal and to reduce side-lobe magnitude and spectral leakage after the FFT;
S4: fast Fourier transform, converting time-domain features into a frequency-domain distribution;
S5: squaring the amplitude spectrum to obtain the power spectrum;
S6: Mel band-pass filtering, smoothing the spectrum, eliminating harmonics and highlighting formants;
S7: logarithmic power, adding the logarithmic energy of each frame;
S8: discrete cosine transform, applying a DCT to the log energies of the Mel filters and taking the low-frequency part to obtain L-order MFCC coefficients;
S9: dynamic difference extraction (first-order and second-order differences), describing the dynamic characteristics of the speech by the difference spectrum of the static features;
S10: calculating the voiceprint, stacking the MFCC coefficients with the first-order and second-order differences to form the final voiceprint.
3. The method for real-time tracking and comparing audio signals based on the voiceprint technology as claimed in claim 2, wherein P1 is the starting point of the comparison method: the source audio is preprocessed at a sampling frequency of 16 kHz with hop = 500, so that 1 second of MFCC feature information yields a {13 x 32} matrix; the three vector groups of MFCC features, Deltas first-order differential coefficients and Delta-Deltas second-order acceleration coefficients are stacked into {39 x 32}; with 20 s as the analysis time slot, a {39 x 640} matrix is output; the target audio is likewise preprocessed to obtain 39-dimensional target voiceprint feature vectors, outputting a {39 x 640} matrix.
4. The method for real-time tracking and comparing audio signals based on the voiceprint technology as claimed in claim 1, wherein for the Euclidean distance matrix Dxs of P2, the Euclidean distance to the source audio is calculated second by second starting from the middle n/2 seconds of the target audio, for n/2 seconds in total; the target audio step is set to 1 second, corresponding to an audio offset of Rate/hop, and the calculation is repeated to obtain the next group, until the target audio has stepped to the last second of the matrix, finally generating the diagonal matrix Dxs.
5. The method as claimed in claim 1, wherein in the P3 preliminary determination of the delay time, the last delay time t_last, the delay time t_min corresponding to the minimum value of Dxs, and the delay time t_line corresponding to the minimum of the arithmetic means of the Dxs rows are judged; if the delays corresponding to all three are consistent, no secondary refined judgment is needed, and if the calculated delay exceeds a threshold, the audio is judged to be dissimilar.
6. The method as claimed in claim 5, wherein in the P4 refined determination of the delay time, the last delay time t_last, the delay time t_min corresponding to the minimum value of Dxs, and the delay time t_line corresponding to the minimum of the arithmetic means of the Dxs rows are each substituted into the audio waveform, and the cosine similarity of the delay-aligned audio is calculated for each; if a similarity is greater than the standard value, the delay corresponding to the highest cosine similarity is selected, and if all similarities are less than the standard value, the secondary judgment is dissimilar.
7. The method as claimed in claim 1, wherein in the P5 transition judgment, for a transition from the similar to the dissimilar state or from the dissimilar to the similar state, a lower threshold sim_min and an upper threshold sim_max are defined, corresponding to the two transitions respectively; the state transition succeeds only if the similarity is below or above the corresponding threshold n consecutive times.
8. The method according to claim 1, wherein in P6 the audio is aligned: if the source audio and the target audio have similar waveforms, the delay time is output and the source audio and the target audio are aligned; the online channel indicators are then calculated from the aligned waveforms.
Background
In the safe-broadcasting monitoring of a broadcast relay station, in order to objectively analyze and measure the performance indicators of a transmitter, the program signal fed to the transmitter (hereinafter the broadcast signal) needs to be compared with the off-air return signal received after transmission (hereinafter the air-received signal).
Broadcast signals, especially medium-wave signals, are highly susceptible to weather, environmental influences and interference, such as sunspot activity and atmospheric changes, so the air-received signal can differ greatly from the broadcast signal. Aligning the broadcast signal with the air-received signal has therefore always been a difficult problem: traditional methods based on audio-envelope comparison and energy-value comparison can achieve dynamic alignment to a certain extent, but as interference increases, alignment synchronization is easily lost.
Disclosure of Invention
The invention aims to provide a real-time audio signal tracking and comparison method based on voiceprint technology, so as to solve the technical problem that the alignment and synchronization of the broadcast signal and the air-received signal are lost when environmental interference is large.
In order to solve the above technical problems, the specific technical solution of the audio signal real-time tracking comparison method based on voiceprint technology of the present invention is as follows:
An audio signal real-time tracking and comparison method based on voiceprint technology comprises the following steps:
P1: calculating voiceprints: defining the broadcast audio as the source audio and the air-received audio as the target audio, preprocessing both, obtaining the source voiceprint vector and the target voiceprint vector respectively, and placing them into corresponding matrix caches;
P2: calculating vector distances: taking the target voiceprint as the axis and stepping in minimum units of one second, calculating the Euclidean distance between the source voiceprint and the target voiceprint at each step to obtain the Euclidean distance matrix Dxs;
P3: preliminarily determining the delay time: determining, from the Euclidean distance matrix Dxs, the last delay time t_last, the delay time t_min corresponding to the minimum value of Dxs, and the delay time t_line corresponding to the minimum of the arithmetic means of the Dxs rows;
P4: refining the delay time determination: calculating the similarity corresponding to each of t_last, t_min and t_line, and performing a secondary delay-time judgment according to the similarity;
P5: transition judgment: introducing a delayed-processing mechanism for transitions between the similar and dissimilar states of the delay time;
P6: aligning the audio: outputting the delay time to align the source audio with the target audio;
P7: calculating the online channel indicators corresponding to the source audio and the target audio;
P8: repeating P1-P7 to cyclically and dynamically track the source audio and the target audio.
Further, the preprocessing comprises the following steps:
S1: pre-emphasis, compensating the high-frequency part;
S2: framing, grouping a number of sampling points into frames;
S3: windowing, applying a Hamming window to smooth the signal and to reduce side-lobe magnitude and spectral leakage after the FFT;
S4: fast Fourier transform, converting time-domain features into a frequency-domain distribution;
S5: squaring the amplitude spectrum to obtain the power spectrum;
S6: Mel band-pass filtering, smoothing the spectrum, eliminating harmonics and highlighting formants;
S7: logarithmic power, adding the logarithmic energy of each frame;
S8: discrete cosine transform, applying a DCT to the log energies of the Mel filters and taking the low-frequency part to obtain L-order MFCC coefficients;
S9: dynamic difference extraction (first-order and second-order differences), describing the dynamic characteristics of the speech by the difference spectrum of the static features;
S10: calculating the voiceprint, stacking the MFCC coefficients with the first-order and second-order differences to form the final voiceprint.
Further, P1 is the starting point of the comparison method: the source audio is preprocessed at a sampling frequency of 16 kHz with hop = 500, so that 1 second of MFCC feature information yields a {13 x 32} matrix; the three vector groups of MFCC features, Deltas first-order differential coefficients and Delta-Deltas second-order acceleration coefficients are stacked into {39 x 32}; with 20 s as the analysis time slot, a {39 x 640} matrix is output; the target audio is likewise preprocessed to obtain 39-dimensional target voiceprint feature vectors, outputting a {39 x 640} matrix.
Further, for the Euclidean distance matrix Dxs of P2, the Euclidean distance to the source audio is calculated second by second starting from the middle n/2 seconds of the target audio, for n/2 seconds in total; the target audio step is set to 1 second, corresponding to an audio offset of Rate/hop, and the calculation is repeated to obtain the next group, until the target audio has stepped to the last second of the matrix, finally generating the diagonal matrix Dxs.
Further, in the P3 preliminary determination of the delay time, the last delay time t_last, the delay time t_min corresponding to the minimum value of Dxs, and the delay time t_line corresponding to the minimum of the arithmetic means of the Dxs rows are judged; if the delays corresponding to all three are consistent, no secondary refined judgment is needed, and if the calculated delay exceeds a threshold, the audio is judged to be dissimilar.
Further, in the P4 refined determination of the delay time, t_last, t_min and t_line are each substituted into the audio waveform, and the cosine similarity of the delay-aligned audio is calculated for each; if a similarity is greater than the standard value, the delay corresponding to the highest cosine similarity is selected, and if all similarities are less than the standard value, the secondary judgment is dissimilar.
Further, in the P5 transition judgment, for a transition from the similar to the dissimilar state or from the dissimilar to the similar state, a lower threshold sim_min and an upper threshold sim_max are defined, corresponding to the two transitions respectively; the state transition succeeds only if the similarity is below or above the corresponding threshold n consecutive times.
Further, in P6 the audio is aligned: if the source audio and the target audio have similar waveforms, the delay time is output and the source audio and the target audio are aligned; the online channel indicators are then calculated from the aligned waveforms.
The real-time audio signal tracking and comparing method based on the voiceprint technology has the following advantages:
the invention is based on the voiceprint technology, utilizes cepstrum analysis, can continuously and dynamically align the broadcast signal and the empty receiving signal when the environmental interference is large, and calculates the delay amount of the broadcast signal and the empty receiving signal.
Drawings
FIG. 1 is a block diagram of a cepstrum analysis process of the present invention;
FIG. 2 is a flowchart of a real-time tracking and comparing method for audio signals based on voiceprint technology according to the present invention;
fig. 3 is a flowchart illustrating an exemplary application of the method for tracking and comparing audio signals in real time based on voiceprint technology.
Detailed Description
In order to better understand the purpose, structure and function of the present invention, the following describes an audio signal real-time tracking comparison method based on voiceprint technology in detail with reference to the accompanying drawings.
The invention is based on voiceprint technology. According to the theoretical model of speech production, a speech signal is generated by convolving an excitation signal with the vocal-tract impulse response, and deconvolution is used to separate the components of the convolved signal. The method adopts voiceprint vector technology, which is essentially non-parametric deconvolution (known as homomorphic deconvolution), i.e., cepstrum analysis.
The method uses Mel-scale Frequency Cepstral Coefficients (MFCC). MFCC feature extraction comprises two key steps: the Mel spectrogram and cepstrum analysis.
To compute the Mel spectrogram, the time-domain signal is first Fourier-transformed into the frequency domain, the frequency-domain signal is then partitioned by a filter bank on the Mel frequency scale, and each frequency band finally corresponds to one numerical value.
The Mel scale is based on the auditory characteristic that the human ear perceives equal pitch (pitch) distances as equal intervals; its relation to frequency is:
m = 2595 × log10(1 + f/700)
where m is the Mel frequency and f is the frequency in Hz.
The spectrum consists of a spectral envelope and spectral details; cepstrum analysis aims to separate the envelope from the spectrum, since the frequency-domain envelope of a sound is key information for distinguishing it and serves as the speech feature. Cepstrum analysis first takes the logarithm of the Mel spectrogram, then applies a Discrete Cosine Transform (DCT), and retains the first 13 coefficients as the MFCC feature values.
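As an illustration, this log-plus-DCT step can be sketched as follows; the function name mel_to_mfcc and the input layout (n_mels x n_frames Mel energies) are assumptions for the example, not part of the patent:

```python
import numpy as np
from scipy.fft import dct

def mel_to_mfcc(mel_spec, n_coeff=13):
    """mel_spec: (n_mels, n_frames) Mel filter-bank energies."""
    log_mel = np.log(mel_spec + 1e-10)                     # log separates envelope from detail
    cepstrum = dct(log_mel, type=2, axis=0, norm="ortho")  # DCT along the Mel axis
    return cepstrum[:n_coeff]                              # keep the first 13 coefficients
```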
The MFCC captures the energy-spectrum envelope of a single frame of speech; adding the dynamic information of the speech signal improves recognition capability, noise robustness and interference resistance. The first-order difference (Deltas) and the second-order difference (Delta-Deltas) represent the differential coefficients and acceleration coefficients, computed as:
d_t = ( Σ_{n=1..N} n × (c_{t+n} − c_{t−n}) ) / ( 2 × Σ_{n=1..N} n² )
where t is the frame index, c_t is the static coefficient of frame t, and N is the size of the difference window.
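A minimal sketch of this difference computation, assuming the standard regression form above with N as the half-width of the difference window (N = 2 is a common choice):

```python
import numpy as np

def delta(coeffs, N=2):
    """coeffs: (n_dims, n_frames) static features; returns same-shape differentials."""
    n_frames = coeffs.shape[1]
    padded = np.pad(coeffs, ((0, 0), (N, N)), mode="edge")   # replicate edge frames
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    out = np.zeros(coeffs.shape, dtype=float)
    for n in range(1, N + 1):
        out += n * (padded[:, N + n:N + n + n_frames] - padded[:, N - n:N - n + n_frames])
    return out / denom

# First-order difference: delta(mfcc); second-order acceleration: delta(delta(mfcc)).
```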
First, the speech signal is preprocessed; fig. 1 shows the preprocessing flow of the present invention, which is used to obtain the voiceprints of the broadcast signal and the air-received signal. The flow mainly comprises the following steps:
S1: pre-emphasis, compensating the high-frequency part;
S2: framing, grouping a number of sampling points into frames;
S3: windowing, applying a Hamming window to smooth the signal and to reduce side-lobe magnitude and spectral leakage after the FFT;
S4: fast Fourier transform, converting time-domain features into a frequency-domain distribution;
S5: squaring the amplitude spectrum to obtain the power spectrum;
S6: Mel band-pass filtering, smoothing the spectrum, eliminating harmonics and highlighting formants;
S7: logarithmic power; since volume is also an important characteristic of speech, the logarithmic energy of each frame is added;
S8: discrete cosine transform, applying a DCT to the log energies of the Mel filters and taking the low-frequency part to obtain L-order MFCC coefficients;
S9: dynamic difference extraction (first-order and second-order differences), describing the dynamic characteristics of the speech by the difference spectrum of the static features and improving the recognition performance of the system;
S10: calculating the voiceprint, stacking the MFCC coefficients with the first-order and second-order differences to form the final voiceprint.
In S1, pre-emphasis applies a high-pass filter H(z) = 1 − μz⁻¹ with μ = 0.97; the implementation uses the formula y(n) = x(n) − 0.97 × x(n − 1).
In S2, framing: in audio monitoring the sampling frequency is 16 kHz and the frame length is 512 sampling points, so the frame duration is 512/16000 × 1000 = 32 ms.
In S3, the Hamming window is applied; the implementation uses the formula w(n) = 0.54 − 0.46 × cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1, where N is the size of the frame.
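Steps S1-S3 can be sketched as follows under the parameters stated above (coefficient 0.97, 512-sample frames, hop = 500, Hamming window); the function names are illustrative:

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    return np.append(x[0], x[1:] - alpha * x[:-1])      # y(n) = x(n) - 0.97 * x(n-1)

def frame_and_window(x, frame_len=512, hop=500):
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)                      # w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return frames * window                              # taper each frame to curb leakage
```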
In S4-S5, the fast Fourier transform converts the time-domain signal into a frequency-domain signal for analysis; the result is first taken as an amplitude spectrum and then squared into a power spectrum.
In S6, Mel filtering uses triangular filters. The audio sampling rate is 16 kHz, the lowest frequency is 0 Hz, fmax = 8 kHz, the number of filters is 26, the frame size is 512, and the number of Fourier transform points is 512. Frequencies are converted to the Mel scale using m = 2595 × log10(1 + f/700): the lowest Mel frequency is 0, the highest is 2840.02, and the center-frequency spacing is (2840.02 − 0)/(26 + 1) = 105.19, giving the Mel filter bank center frequencies [0, 105.19, 210.38, ..., 2840.02]; finally, the FFT point indices corresponding to the actual frequencies are calculated: [0, 2, 4, 7, 10, 13, 16, ..., 256].
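This filter-bank arithmetic can be reproduced with a short script; the floor-based bin mapping is an assumption consistent with the index group listed above:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

sr, n_fft, n_filters, fmax = 16000, 512, 26, 8000.0
mel_max = hz_to_mel(fmax)                              # ~2840.02 Mel
mel_points = np.linspace(0.0, mel_max, n_filters + 2)  # spacing (2840.02 - 0)/27 ~ 105.19
hz_points = mel_to_hz(mel_points)
fft_bins = np.floor((n_fft + 1) * hz_points / sr).astype(int)
print(round(mel_max, 2), fft_bins[:7], fft_bins[-1])   # 2840.02 [ 0  2  4  7 10 13 16] 256
```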
In S8, the discrete cosine transform is applied to the logarithmic energy of each filter, and L-order MFCC coefficients are taken; in this method L = 13.
In S9-S10, the voiceprint vector is formed by stacking the MFCC vector with its first-order and second-order difference coefficients, yielding a 39-dimensional vector: an N-dimensional voiceprint vector = (N/3 MFCC coefficients + N/3 first-order difference parameters + N/3 second-order difference parameters), with N = 39.
As shown in fig. 2, the audio signal dynamic tracking comparison method of the present invention mainly comprises the following steps:
P1: calculate the voiceprints: define the broadcast audio as the source audio and the air-received audio as the target audio, obtain the source and target voiceprint vectors by the preprocessing algorithm, and place them into the corresponding matrix caches;
P2: calculate vector distances: taking the target voiceprint as the axis and stepping in minimum units of one second, calculate the Euclidean distance between the source and target voiceprints at each step to obtain the Euclidean distance matrix Dxs;
P3: preliminarily determine the delay time: from the Euclidean distance matrix Dxs, determine the last delay time t_last, the delay time t_min corresponding to the minimum value of Dxs, and the delay time t_line corresponding to the minimum of the arithmetic means of the Dxs rows;
P4: refine the delay time determination: calculate the similarity corresponding to each of t_last, t_min and t_line, and perform the secondary delay-time judgment according to the similarity;
P5: transition judgment: for transitions between the similar and dissimilar states of the delay time, introduce a delayed-processing mechanism to improve system stability;
P6: align the audio: align the source audio with the target audio by outputting the delay time;
P7: calculate the online channel indicators of the source audio and the target audio, and track dynamically in a loop.
P1 is the starting point of the comparison method: the source audio is preprocessed at a sampling frequency of 16 kHz with hop = 500, so that 1 second of MFCC feature information yields a {13 x 32} matrix; the three vector groups of MFCC features, Deltas first-order differential coefficients and Delta-Deltas second-order acceleration coefficients are stacked into {39 x 32}; with 20 s as the analysis time slot, a {39 x 640} matrix is output. The target audio is likewise preprocessed to obtain 39-dimensional target voiceprint feature vectors, outputting a {39 x 640} matrix.
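One possible implementation of P1 uses the librosa library (an assumption; the patent does not name a library). The exact column count per slot depends on the library's padding convention, so roughly 640 columns are produced for a 20 s slot:

```python
import numpy as np
import librosa

def voiceprint(y, sr=16000, hop=500, n_mfcc=13):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=hop)   # ~{13 x 32} per second
    d1 = librosa.feature.delta(mfcc, order=1)   # Deltas: first-order differential coefficients
    d2 = librosa.feature.delta(mfcc, order=2)   # Delta-Deltas: second-order acceleration
    return np.vstack([mfcc, d1, d2])            # ~{39 x 640} for a 20 s analysis slot

# src_vp = voiceprint(y_src)   # y_src: 20 s of broadcast (source) audio at 16 kHz
# tgt_vp = voiceprint(y_tgt)   # y_tgt: 20 s of air-received (target) audio
```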
For the Euclidean distance matrix Dxs in P2, since the target audio lags behind the source audio (it can never lead it), the Euclidean distance to the source audio is calculated second by second starting from the middle n/2 seconds of the target audio, for n/2 seconds in total. The target audio step is set to 1 second, corresponding to an audio offset of Rate/hop, and the calculation is repeated to obtain the next group, until the target audio has stepped to the last second of the matrix, finally generating the diagonal matrix Dxs.
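A sketch of one plausible reading of this stepping scheme follows; the exact row/column indexing is an assumption, with rows indexed by candidate delay in seconds and fps = Rate/hop = 32 voiceprint columns per second:

```python
import numpy as np

def distance_matrix(src_vp, tgt_vp, n=20, fps=32):
    """Rows: candidate delay d in seconds (target lagging the source); columns: seconds."""
    half = n // 2
    Dxs = np.zeros((half + 1, half))
    for d in range(half + 1):                  # candidate delays: 0 .. n/2 seconds
        for s in range(half):                  # compare n/2 one-second blocks per candidate
            src_blk = src_vp[:, s * fps:(s + 1) * fps]
            tgt_blk = tgt_vp[:, (s + d) * fps:(s + d + 1) * fps]
            Dxs[d, s] = np.linalg.norm(src_blk - tgt_blk)   # Euclidean distance of the blocks
    return Dxs
```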
In the P3 preliminary determination of the delay time, the last delay time t_last, the delay time t_min corresponding to the minimum value of Dxs, and the delay time t_line corresponding to the minimum of the arithmetic means of the Dxs rows are judged: if the delays corresponding to all three are consistent, no secondary refined judgment is needed; if the calculated delay exceeds the threshold, the audio is judged dissimilar.
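The three candidates can be read off Dxs as below; the layout (rows = candidate delays, columns = compared seconds) follows the sketch above:

```python
import numpy as np

def delay_candidates(Dxs):
    t_last = int(np.argmin(Dxs[:, -1]))                  # best delay for the most recent second
    t_min = int(np.unravel_index(np.argmin(Dxs), Dxs.shape)[0])  # delay row of global minimum
    t_line = int(np.argmin(Dxs.mean(axis=1)))            # delay row with the smallest row mean
    return t_last, t_min, t_line

# If t_last == t_min == t_line, the delay is accepted without refinement (P4 is skipped).
```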
In the P4 refined determination, the three delay times are each substituted into the audio waveform and the cosine similarity of the delay-aligned audio is calculated for each; if a similarity is greater than the standard value, the delay corresponding to the highest cosine similarity is selected; if all are below the standard value, the secondary judgment is dissimilar.
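A sketch of this refinement, with the standard value sim_std = 0.8 chosen purely for illustration:

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def refine_delay(src, tgt, candidates, sr=16000, sim_std=0.8):
    best_delay, best_sim = None, -1.0
    for d in candidates:                       # d in whole seconds: t_last, t_min, t_line
        shifted = tgt[d * sr:]                 # remove the target's lag of d seconds
        m = min(len(src), len(shifted))
        sim = cosine_sim(src[:m], shifted[:m])
        if sim > best_sim:
            best_delay, best_sim = d, sim
    if best_sim > sim_std:
        return best_delay, best_sim            # keep the delay with the highest similarity
    return None, best_sim                      # below the standard value: judged dissimilar
```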
In the P5 transition judgment, for transitions from the similar to the dissimilar state and from the dissimilar to the similar state, a lower threshold sim_min and an upper threshold sim_max are defined, corresponding to the two transitions respectively; only if the similarity is below or above the corresponding threshold n consecutive times does the state transition succeed.
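This delayed transition can be sketched as a small hysteresis state machine; the threshold values and n = 3 are illustrative assumptions:

```python
class TransitionGuard:
    """Flip between 'similar' and 'dissimilar' only after n consecutive confirmations."""

    def __init__(self, sim_min=0.5, sim_max=0.8, n=3):
        self.sim_min, self.sim_max, self.n = sim_min, sim_max, n
        self.state, self.count = "similar", 0

    def update(self, sim):
        crossing = sim < self.sim_min if self.state == "similar" else sim > self.sim_max
        self.count = self.count + 1 if crossing else 0
        if self.count >= self.n:               # n consecutive readings past the threshold
            self.state = "dissimilar" if self.state == "similar" else "similar"
            self.count = 0
        return self.state
```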
In P6, if the source audio and the target audio have similar waveforms, the delay time is output and the two are aligned; the online channel indicators are then calculated from the aligned waveforms.
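A sketch of the alignment step; deviation_db is one hypothetical example of a channel indicator, since the patent leaves the exact metrics unspecified:

```python
import numpy as np

def align(src, tgt, delay_s, sr=16000):
    shifted = tgt[int(delay_s * sr):]          # trim the target's lag
    m = min(len(src), len(shifted))
    return src[:m], shifted[:m]

def deviation_db(src_a, tgt_a):
    err = src_a - tgt_a                        # residual between the aligned waveforms
    return 10.0 * np.log10(np.sum(src_a ** 2) / (np.sum(err ** 2) + 1e-12))
```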
Through this loop, online dynamic real-time tracking and comparison is achieved.
As shown in fig. 3, the present invention has been put into practical use at the Xinchang relay station, monitoring the broadcast signals of three medium-wave frequencies (Voice of Zhejiang, Voice of China and China Economic Voice) in real time while simultaneously monitoring the channel indicators.
It is to be understood that the present invention has been described with reference to certain embodiments, and that various changes in the features and embodiments, or equivalent substitutions may be made therein by those skilled in the art without departing from the spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.