Echo cancellation method and device for joint dereverberation
1. A method for joint dereverberation echo cancellation, comprising the steps of:
S1, acquiring an analog microphone signal and an analog reference signal by using a microphone array and an audio playing device, respectively, and converting the analog microphone signal and the analog reference signal into a digital microphone signal and a digital reference signal, respectively, through an ADC (analog-to-digital converter);
S2, decomposing the time domain signal of each channel of the digital microphone signal and the digital reference signal into frequency domain signals at a plurality of frequency points through a short-time Fourier transform, to obtain frequency domain microphone signals and frequency domain reference signals;
S3, carrying out the following operations frame by frame:
storing the frequency domain microphone signals and frequency domain reference signals of a current frame and of a plurality of frames preceding the current frame into a joint prediction buffer as cached frequency domain signals;
S4, filtering the frequency domain signals cached in the joint prediction buffer through an adaptive filter, and performing echo cancellation on the frequency domain microphone signals in the joint prediction buffer;
the filtering adopts a joint prediction adaptive filter W(l, k) = P(l, k) * R(l, k);
P(l, k) is the inverse of the autocorrelation matrix of the frequency domain transposed signal Buff_T(l, k) of the kth frequency point of the l-th frame;
R(l, k) is the cross-correlation matrix of the frequency domain microphone signal Mic(l, k) of the kth frequency point of the l-th frame and the cached frequency domain signal Buff(l, k);
outputting a target voice frequency domain signal Y:
the target voice frequency domain signal of the kth frequency point of the l-th frame is Y(l, k) = Mic(l, k) - W^H(l, k) * Buff_T(l, k);
where the superscript H denotes the conjugate transpose, and Mic(l, k) is the frequency domain microphone signal of the kth frequency point of the l-th frame.
2. The joint dereverberation echo cancellation method according to claim 1, wherein R(l, k) = λR(l-1, k) + (1-λ) Buff_T(l, k) conj(Mic(l, k));
λ is a forgetting factor, and R(l-1, k) is the cross-correlation matrix of the frequency domain microphone signal Mic(l-1, k) of the kth frequency point of the (l-1)-th frame and the cached frequency domain signal Buff(l-1, k); Buff_T(l, k) is the frequency domain transposed signal of the kth frequency point of the l-th frame; and conj denotes conjugation.
3. The joint dereverberation echo cancellation method of claim 1, wherein P (l, k) in step S4 is calculated as follows:
;
λ is a forgetting factor; P(l-1, k) is the inverse of the autocorrelation matrix of the frequency domain transposed signal Buff_T(l-1, k) of the kth frequency point of the (l-1)-th frame, and P(l, k) is defined in the same way; Buff_T(l, k) is the frequency domain transposed signal of the kth frequency point of the l-th frame, and the superscript H denotes the conjugate transpose; Kal(l, k) is the Kalman gain matrix of the kth frequency point of the l-th frame.
4. The joint dereverberation echo cancellation method as claimed in claim 1, wherein in said step S3, the frequency domain reference signals of the current frame and of the (L_P - 1) frames preceding it, together with the frequency domain microphone signals of the current frame and of the (V - 1) frames preceding it, are stored into the joint prediction buffer as cached frequency domain signals; wherein L_P is the reference signal linear prediction length and V is the microphone signal linear prediction length.
5. The echo cancellation device for joint dereverberation is characterized by comprising a joint prediction buffer and two preprocessing branches connected with the joint prediction buffer, wherein the preprocessing branches comprise an analog-to-digital converter and a short-time Fourier transform module which are connected in series;
the joint prediction buffer is also connected with a transposition module, the transposition module is also connected with a joint prediction adaptive filter, and a short-time Fourier transform module of a branch which is used for being connected with a microphone in the two preprocessing branches is connected with an echo canceller and the joint prediction adaptive filter; the joint prediction adaptive filter is also connected with a conjugate device module; the conjugate device module and the transposition module are connected with a multiplier together, the multiplier is connected with the echo canceller, and the echo canceller is also connected with a short-time Fourier transform module.
Background
In recent years, artificial intelligence and Internet of Things technologies have been rapidly deployed in products, and voice recognition, as an important means of man-machine interaction, plays a key role in them. In actual deployment, however, poor user experience in complex acoustic environments limits the effective adoption of products; in particular, for an audio system with a loudspeaker and a microphone, the echo cancellation quality directly affects the user's voice recognition interaction experience. Therefore, how to effectively improve the echo cancellation effect is a key problem for improving voice recognition interaction quality.
Existing common echo cancellation methods estimate the echo path with various adaptive filters and then cancel the echo. These methods are effective to a degree, but they ignore the reverberation present in the environment, which often degrades the adaptive filter's estimate in practical applications. Echo cancellation methods based on deep neural networks have also been proposed; they are clearly effective against nonlinear distortion in echo cancellation, but they place demanding requirements on training samples, and in actual deployment they are constrained by factors such as the computing power and memory of the product.
Disclosure of Invention
In order to overcome the technical defects of the existing echo cancellation method, the invention discloses an echo cancellation method and device for jointly removing reverberation.
The echo cancellation method of the joint dereverberation comprises the following steps:
S1, acquiring an analog microphone signal and an analog reference signal by using a microphone array and an audio playing device, respectively, and converting the analog microphone signal and the analog reference signal into a digital microphone signal and a digital reference signal, respectively, through an ADC (analog-to-digital converter);
S2, decomposing the time domain signal of each channel of the digital microphone signal and the digital reference signal into frequency domain signals at a plurality of frequency points through a short-time Fourier transform, to obtain frequency domain microphone signals and frequency domain reference signals;
S3, carrying out the following operations frame by frame:
storing the frequency domain microphone signals and frequency domain reference signals of a current frame and of a plurality of frames preceding the current frame into a joint prediction buffer as cached frequency domain signals;
S4, filtering the frequency domain signals cached in the joint prediction buffer through an adaptive filter, and performing echo cancellation on the frequency domain microphone signals in the joint prediction buffer;
the filtering adopts a joint prediction adaptive filter W(l, k) = P(l, k) * R(l, k);
P(l, k) is the inverse of the autocorrelation matrix of the frequency domain transposed signal Buff_T(l, k) of the kth frequency point of the l-th frame;
R(l, k) is the cross-correlation matrix of the frequency domain microphone signal Mic(l, k) of the kth frequency point of the l-th frame and the cached frequency domain signal Buff(l, k);
outputting a target voice frequency domain signal Y:
the target voice frequency domain signal of the kth frequency point of the l-th frame is Y(l, k) = Mic(l, k) - W^H(l, k) * Buff_T(l, k);
where the superscript H denotes the conjugate transpose, and Mic(l, k) is the frequency domain microphone signal of the kth frequency point of the l-th frame.
Preferably, in step S4,
R(l, k) = λR(l-1, k) + (1-λ) Buff_T(l, k) conj(Mic(l, k));
λ is a forgetting factor, and R(l-1, k) is the cross-correlation matrix of the frequency domain microphone signal Mic(l-1, k) of the kth frequency point of the (l-1)-th frame and the cached frequency domain signal Buff(l-1, k); Buff_T(l, k) is the frequency domain transposed signal of the kth frequency point of the l-th frame; and conj denotes conjugation.
Preferably, the calculation manner of P (l, k) in step S4 is as follows:
;
λ is a forgetting factor; P(l-1, k) is the inverse of the autocorrelation matrix of the frequency domain transposed signal Buff_T(l-1, k) of the kth frequency point of the (l-1)-th frame, and P(l, k) is defined in the same way; Buff_T(l, k) is the frequency domain transposed signal of the kth frequency point of the l-th frame, and the superscript H denotes the conjugate transpose; Kal(l, k) is the Kalman gain matrix of the kth frequency point of the l-th frame.
Preferably, in step S3, the frequency domain reference signals of the current frame and of the (L_P - 1) frames preceding it, together with the frequency domain microphone signals of the current frame and of the (V - 1) frames preceding it, are stored into the joint prediction buffer as cached frequency domain signals; wherein L_P is the reference signal linear prediction length and V is the microphone signal linear prediction length.
The invention also discloses an echo cancellation device for joint dereverberation, which comprises a joint prediction buffer and two preprocessing branches connected with the joint prediction buffer, wherein the preprocessing branches comprise an analog-to-digital converter and a short-time Fourier transform module which are connected in series;
the joint prediction buffer is also connected with a transposition module, the transposition module is also connected with a joint prediction adaptive filter, and a short-time Fourier transform module of a branch which is used for being connected with a microphone in the two preprocessing branches is connected with an echo canceller and the joint prediction adaptive filter; the joint prediction adaptive filter is also connected with a conjugate device module; the conjugate device module and the transposition module are connected with a multiplier together, the multiplier is connected with the echo canceller, and the echo canceller is also connected with a short-time Fourier transform module.
The scheme of the invention utilizes the echo cancellation algorithm of joint dereverberation, can effectively improve the echo cancellation effect and improve the voice interaction quality.
Drawings
Fig. 1 is a schematic flow chart of an echo cancellation method according to the present invention;
Fig. 2 is a schematic diagram of an embodiment of an echo cancellation device according to the present invention;
Fig. 3 is a waveform diagram of an input signal according to an embodiment of the present invention;
Fig. 4 is a schematic diagram showing a comparison of the waveforms of the output signals after echo cancellation of the input signals of Fig. 3 using the prior art and the method of the present invention;
The abscissa of the waveform diagrams in Figs. 3 and 4 is time, and the ordinate is voltage.
Detailed Description
The following provides a more detailed description of the present invention.
The echo cancellation method of the present invention can be implemented by the following steps:
S1, acquiring an analog microphone signal and an analog reference signal by using a microphone array and an audio playing device, respectively, and converting the analog microphone signal and the analog reference signal into a digital microphone signal and a digital reference signal, respectively, through an analog-to-digital converter (ADC).
The microphone array contains N microphones, with N not less than 2; when N = 1, the system is a single-microphone system. The array formed by the microphones may be a regular geometric array or an irregular array. The number Q of loudspeakers in the audio playing device is not less than 1. The number of reference signal channels depends on the sound source: for example, a stereo sound source has two channels, so the reference it generates comprises two sound source signals; when the sound quality requirement is not strict or the transmission bandwidth is limited, only one signal of the sound source may be taken, or the sound source may provide only one signal.
For further details of the embodiment, a minimum system is taken as an example, i.e. a single-microphone single-speaker system is adopted for description, where N =1 and Q = 1.
The analog microphone signal is obtained from the electrical signal output end of the microphone array, and the analog reference signal is the electrical analog signal input to the audio playing device.
As shown in Fig. 2, the analog reference signal emitted by the audio source is converted into an acoustic signal by the loudspeaker and played; the played sound is mixed with other sounds, such as external environment noise, and is then received by the microphone and converted into an electrical signal, which serves as the analog microphone signal.
S2, decomposing the time domain signal of each channel of the received digital microphone signal and digital reference signal into frequency domain signals at K1 frequency points through a short-time Fourier transform.
In a specific embodiment, a 512-point Fourier transform is adopted in the short-time Fourier transform, in which case the number of frequency points is K1 = 257.
That is, the time domain microphone signal of the current l-th frame of the digital microphone signal is converted into the frequency domain microphone signal Mic(l) = [Mic_1(l), Mic_2(l), …, Mic_N(l)], where N is the number of microphones; in the exemplary minimum system, the dimension of Mic(l) is K1 × 1.
The time domain reference signal of the digital reference signal is likewise converted into the frequency domain reference signal
Ref(l) = [Ref_1(l), Ref_2(l), …, Ref_Q(l)], where Q is the number of loudspeakers; in the exemplary minimum system, the dimension of Ref(l) is K1 × 1.
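The relation between the 512-point transform and K1 = 257 can be checked with a minimal sketch. The Hann window, the hop of 256 samples, the 16000-sample random input and the helper name stft_frames are assumptions made for illustration only; the text specifies none of them.

```python
import numpy as np

def stft_frames(x, n_fft=512, hop=256):
    """Return one-sided spectra of windowed frames, shape (frames, n_fft // 2 + 1).

    Window type and hop size are illustrative assumptions, not taken from the text.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)   # a real 512-point FFT keeps 512/2 + 1 = 257 bins

rng = np.random.default_rng(0)
mic = rng.standard_normal(16000)         # stand-in for a digital microphone signal
Mic = stft_frames(mic)                   # Mic[l] plays the role of Mic(l) for N = 1
print(Mic.shape)
```

Each row of the result corresponds to one frame l and each column to one frequency point k.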
S3, carrying out the following operations frame by frame:
storing the frequency domain microphone signals and frequency domain reference signals of the current frame and of a plurality of frames preceding it into a joint prediction buffer as cached frequency domain signals. The joint prediction buffer holds the actual data, which is then fed to the filter that carries out the algorithm; because the buffer is abstracted from the filter in this way, it can also be used with other filter schemes.
Linear prediction is applied to the buffered data. Prediction requires known data from past frames, and the prediction order (length) determines how many past frames are needed; the number of past frames is therefore usually determined by the linear prediction length.
The data Buff(l) stored in the joint prediction buffer can be expressed by the following formulas:
Buff(l) = [Buff_Ref(l), Buff_Mic(l)]
Buff_Ref(l) = [Ref_1(l-L_P+1), …, Ref_1(l), …, Ref_Q(l-L_P+1), …, Ref_Q(l)]
Buff_Mic(l) = [Mic_1(l-V+1), …, Mic_1(l), …, Mic_N(l-V+1), …, Mic_N(l)]
For example, assume the reference signal linear prediction length L_P = 4 and the microphone signal linear prediction length V = 6. The dimension of Buff(l) is then (number of frequency points) × (L_P + V) = 257 × (4 + 6). In this process, the frequency domain reference signals of 4 frames (the past 3 frames and the current frame) and the frequency domain microphone signals of 6 frames (the past 5 frames and the current frame) are stored to obtain the joint prediction buffer Buff(l) of the current frame; combined with the linear prediction method, the same adaptive filter is then used for both dereverberation and linear echo cancellation.
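The buffer layout for the minimum system (N = 1, Q = 1) can be sketched as follows. The function name build_joint_buffer, the random test spectra, and the zero-filling of frames before the start of the signal are assumptions made for illustration; the text does not say how the first frames are handled.

```python
import numpy as np

def build_joint_buffer(Ref, Mic, l, LP=4, V=6):
    """Return Buff(l) = [Buff_Ref(l), Buff_Mic(l)] with shape (K1, LP + V).

    Ref and Mic are (frames, K1) complex arrays for a single loudspeaker and a
    single microphone. Frames with a negative index are zero-filled (an assumption).
    """
    K1 = Mic.shape[1]

    def history(X, length):
        cols = []
        for j in range(l - length + 1, l + 1):   # oldest frame first, current frame last
            cols.append(X[j] if j >= 0 else np.zeros(K1, dtype=complex))
        return np.stack(cols, axis=1)

    return np.concatenate([history(Ref, LP), history(Mic, V)], axis=1)

rng = np.random.default_rng(0)
Ref = rng.standard_normal((20, 257)) + 1j * rng.standard_normal((20, 257))
Mic = rng.standard_normal((20, 257)) + 1j * rng.standard_normal((20, 257))
Buff = build_joint_buffer(Ref, Mic, l=10)
print(Buff.shape)   # 257 frequency points by (4 + 6) buffered frames
```

Row k of the result is the cached signal Buff(l, k) used in the per-frequency-point calculation.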
S4, filtering the cached frequency domain data Buff(l) through an adaptive filter, and performing echo cancellation on the frequency domain microphone signal.
Specifically, the calculation for each frequency point k is as follows:
to simplify the implementation of the algorithm, the cached frequency domain signal Buff(l, k) of the kth frequency point of the l-th frame is transposed:
frequency domain transposed signal Buff_T(l, k) = Buff^T(l, k)
wherein the dimension of Buff^T(l, k) is (4 × 1 + 6 × 1), and the superscript T denotes transposition;
computing the Kalman gain matrix Kal(l, k):
wherein λ is a forgetting factor with a value between 0 and 1, which can be set to λ = 0.999; the dimension of Kal(l, k) in the current minimum system is (4 × 1 + 6 × 1), and the superscript H denotes the conjugate transpose;
P(l-1, k) is the inverse of the autocorrelation matrix of the frequency domain transposed signal Buff_T(l-1, k) of the kth frequency point of the (l-1)-th frame;
to avoid a costly explicit matrix inversion, the inverse P(l, k) of the autocorrelation matrix of Buff_T(l, k) is calculated recursively:
wherein the dimension of P(l, k) is (4 × 1 + 6 × 1) × (4 × 1 + 6 × 1);
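The equations for Kal(l, k) and P(l, k) are not reproduced in this text. As a sketch only, one recursion that uses exactly the symbols defined here is the exponentially weighted recursive-least-squares form below. It assumes that the autocorrelation matrix is averaged with the same (1-λ) weighting as the cross-correlation matrix R(l, k), and it follows from the matrix inversion lemma under that assumption; the original equations may differ, for example by omitting the (1-λ) factors.

```latex
\mathrm{Kal}(l,k) =
  \frac{(1-\lambda)\,P(l-1,k)\,\mathrm{Buff\_T}(l,k)}
       {\lambda + (1-\lambda)\,\mathrm{Buff\_T}^{H}(l,k)\,P(l-1,k)\,\mathrm{Buff\_T}(l,k)}

P(l,k) = \lambda^{-1}\left[\,P(l-1,k) - \mathrm{Kal}(l,k)\,\mathrm{Buff\_T}^{H}(l,k)\,P(l-1,k)\,\right]
```

With this form, P(l, k) is the inverse of λΦ(l-1, k) + (1-λ) Buff_T(l, k) Buff_T^H(l, k), where Φ denotes the autocorrelation matrix whose inverse is P.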
further calculating the cross-correlation matrix R(l, k) of the frequency domain microphone signal Mic(l, k) and the cached frequency domain signal Buff(l, k):
R(l, k) = λR(l-1, k) + (1-λ) Buff_T(l, k) conj(Mic(l, k))
wherein the dimension of R(l, k) is (L_P × Q + V × N) × N, and conj denotes conjugation;
N is the number of microphones, and Q is the number of loudspeakers;
L_P is the linear prediction length of the reference signal, and V is the linear prediction length of the microphone signal;
computing the joint prediction adaptive filter W(l, k):
W(l, k) = P(l, k) * R(l, k)
wherein the dimension of W(l, k) is (L_P × Q + V × N) × N, and W(l, k) is the joint prediction filter of the kth frequency point of the l-th frame;
carrying out echo cancellation processing and calculating the target voice frequency domain signal Y(l, k):
Y(l, k) = Mic(l, k) - W^H(l, k) * Buff_T(l, k)
the dimension of Y(l, k) is N, and Y(l, k) is the target voice frequency domain signal of the kth frequency point of the l-th frame after filtering by the echo canceller;
the echo cancellation module filters out the linear echoes of the system.
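The per-frequency-point calculation of step S4 can be sketched for N = 1 as a single update function. The updates of R, W and Y follow the formulas given in the text. The gain and P updates use an assumed recursive-least-squares form, since those equations are not reproduced in this text. The function name joint_filter_step, the initial value of P, the synthetic input vector x (which stands in for Buff_T(l, k)) and the value λ = 0.99 used to shorten the demonstration are all illustrative assumptions.

```python
import numpy as np

def joint_filter_step(P, R, x, d, lam=0.999):
    """One update at a single frequency point; x is Buff_T(l,k), d is Mic(l,k)."""
    Px = P @ x
    # Assumed gain form (not given in the text), matched to the (1 - lam) weighting of R.
    kal = (1 - lam) * Px / (lam + (1 - lam) * np.vdot(x, Px))
    P = (P - np.outer(kal, x.conj() @ P)) / lam      # assumed recursion for the inverse
    R = lam * R + (1 - lam) * x * np.conj(d)         # cross-correlation, as in the text
    W = P @ R                                        # W(l,k) = P(l,k) * R(l,k)
    Y = d - np.vdot(W, x)                            # Y(l,k) = Mic(l,k) - W^H(l,k) Buff_T(l,k)
    return P, R, W, Y

rng = np.random.default_rng(0)
D = 10                                               # L_P*Q + V*N = 4 + 6 in the minimum system
w0 = rng.standard_normal(D) + 1j * rng.standard_normal(D)   # hypothetical true filter
P = 1e3 * np.eye(D, dtype=complex)                   # assumed initial value
R = np.zeros(D, dtype=complex)
for _ in range(2000):
    x = (rng.standard_normal(D) + 1j * rng.standard_normal(D)) / np.sqrt(2)
    d = np.vdot(w0, x)                               # synthetic signal that is fully predictable from x
    P, R, W, Y = joint_filter_step(P, R, x, d, lam=0.99)
print(abs(Y))
```

In this synthetic case the microphone value is an exact linear function of the buffer vector, so the filter converges to w0 and the residual Y approaches zero; with real signals the residual would contain the target speech.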
The frequency domain signal obtained after echo cancellation can be converted into a time domain signal by an inverse short-time Fourier transform (ISTFT) module, and the output time domain signal can be passed directly by the system to the next processing module.
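The conversion back to the time domain can be sketched as overlap-add synthesis. The Hann window, the hop of 256 samples, and the normalization by the summed window are assumptions made for illustration; the text does not specify how the ISTFT module is built.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Windowed one-sided spectra, shape (frames, n_fft // 2 + 1)."""
    w = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    return np.stack([np.fft.rfft(x[i * hop:i * hop + n_fft] * w) for i in range(n_frames)])

def istft(Y, n_fft=512, hop=256):
    """Overlap-add the inverse FFT of each frame and divide by the summed window."""
    w = np.hanning(n_fft)
    n_frames = Y.shape[0]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for i in range(n_frames):
        out[i * hop:i * hop + n_fft] += np.fft.irfft(Y[i], n=n_fft)
        norm[i * hop:i * hop + n_fft] += w
    return out / np.maximum(norm, 1e-8)   # guard against the near-zero window edges

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
y = istft(stft(x))                        # round trip without any filtering
```

Away from the first and last frame, where the summed window is close to zero, the round trip reproduces the input samples.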
The echo cancellation method of the present invention can be performed by using an echo cancellation device for joint dereverberation, as shown in fig. 2, the echo cancellation device for joint dereverberation includes a joint prediction buffer and two preprocessing branches connected thereto, and the preprocessing branches include an analog-to-digital converter and a short-time fourier transform module connected in series;
the joint prediction buffer is also connected with a transposition module, the transposition module is also connected with a joint prediction adaptive filter, and a short-time Fourier transform module of a branch which is used for being connected with a microphone in the two preprocessing branches is connected with an echo canceller and the joint prediction adaptive filter; the joint prediction adaptive filter is also connected with a conjugate device module; the conjugate device module and the transposition module are connected with a multiplier together, the multiplier is connected with the echo canceller, and the echo canceller is also connected with a short-time Fourier transform module. The two preprocessing branches are respectively used for connecting the output end of the microphone array and the electrical signal input end of the loudspeaker and processing the analog microphone signal and the analog reference signal, and other modules can realize other steps of the echo cancellation method.
Compared with traditional echo cancellation methods, the scheme of the invention uses a joint dereverberation echo cancellation algorithm, which can effectively improve the echo cancellation effect and the voice interaction quality. Figs. 3 and 4 show a specific embodiment of the present invention based on the echo cancellation device shown in Fig. 2. The audio source sends the signal shown in the lower channel of Fig. 3, which is played through the loudspeaker, while a target person speaks a command word at a position 3 m away from the microphone. The waveform of the input signal obtained at the input end of the microphone array is shown in the upper channel of Fig. 3; its components mainly include ambient noise, the sound played by the loudspeaker, and the voice of the target person.
Fig. 4 shows the waveforms obtained after echo cancellation of the signal of Fig. 3. The lower channel of Fig. 4 is the output waveform after processing by the echo cancellation device of Fig. 2 according to the present invention; the upper channel is the output waveform after processing by a prior-art echo cancellation method based on RLS (recursive least squares). In both channels of Fig. 4, the parts with larger voltage amplitude are the target speech and the parts with smaller voltage amplitude are the echo residual. As can be seen from Fig. 4, the difference between the target speech and the echo residual is larger in the output processed by the present invention, that is, the signal-to-noise ratio of the processed speech signal is higher, which indicates that the present invention has a better echo cancellation effect.
The foregoing describes preferred embodiments of the present invention. Provided they are not obviously contradictory and none is a precondition for another, the preferred embodiments may be combined in any manner. The specific parameters in the embodiments and examples are given only to clearly illustrate the inventor's verification process and are not intended to limit the scope of protection of the invention, which is defined by the claims; equivalent structural changes made on the basis of the description and drawings of the present invention are likewise included within the scope of protection.