Voice separation method and device
1. A method of speech separation, comprising:
decomposing the obtained target noisy audio signal to M frequency points in a preset frequency domain range to obtain decomposed signals of the M frequency points;
merging the decomposed signals of the M frequency points into P preset frequency sub-bands based on the auditory perception characteristic of human ears;
every interval of preset duration, performing framing processing on the decomposed signals of the frequency points included in each preset frequency sub-band to obtain N start-stop time periods corresponding to each preset frequency sub-band;
estimating a target ratio corresponding to each analysis unit in P analysis units of the same start-stop time period corresponding to the P preset frequency sub-bands;
based on the P target ratios, carrying out noise elimination processing on the target noisy audio signal corresponding to the same start-stop time period to obtain a target speech signal corresponding to the same start-stop time period;
wherein P and M are positive integers, P is smaller than M, N is an integer greater than or equal to 2, the same start-stop time period is any one of the N start-stop time periods, the target ratio is a first ratio or a second ratio, the first ratio is a ratio of speech energy to total energy of the analysis unit, and the second ratio is a ratio of noise energy to total energy of the analysis unit.
2. The method according to claim 1, wherein the estimating of the target ratio value corresponding to each of the P analysis units in the same start-stop time period corresponding to the P preset frequency subbands comprises:
extracting the acoustic features of each analysis unit in the P analysis units;
inputting P acoustic features of P analysis units into a target model, so as to output the target ratio corresponding to the ith analysis unit in the P analysis units through the target model;
wherein i is greater than or equal to 1 and less than or equal to P, and the target model is obtained by training a pre-constructed preset model based on P acoustic feature samples of P analysis unit samples of noisy audio signal samples.
3. The method according to claim 2, wherein the performing noise cancellation processing on the target noisy audio signal corresponding to the same start-stop time period based on the P target ratios to obtain the target speech signal corresponding to the same start-stop time period comprises:
under the condition that the target ratio is a first ratio, obtaining frequency domain voice signals corresponding to the same start-stop time period based on the product of the ith first ratio in the P first ratios and a target noisy audio signal corresponding to each frequency point in frequency points included in a target preset frequency sub-band, and performing time domain transformation on the frequency domain voice signals corresponding to the same start-stop time period to obtain target voice signals corresponding to the same start-stop time period;
or, under the condition that the target model outputs the second ratio, calculating a difference value between a preset ratio and the second ratio, obtaining a frequency domain voice signal corresponding to the same start-stop time period based on a product of an ith difference value of the P difference values and a target noisy audio signal corresponding to each of frequency points included in a target preset frequency sub-band, and performing time domain transformation on the frequency domain voice signal corresponding to the same start-stop time period to obtain a target voice signal corresponding to the same start-stop time period;
and the target preset frequency sub-band is a preset frequency sub-band corresponding to the ith analysis unit.
4. The method according to claim 1 or 2, wherein the P preset frequency subbands include m low frequency subbands, n mid-frequency subbands and s high frequency subbands, and wherein the bandwidth of the mid-frequency subbands is greater than the bandwidth of the low frequency subbands and less than the bandwidth of the high frequency subbands.
5. The method according to claim 1, wherein before said decomposing the obtained target noisy audio signal to M frequency points in a preset frequency domain range to obtain decomposed signals of the M frequency points, the method further comprises:
acquiring a plurality of noisy audio signals acquired by a plurality of microphones of a microphone array, wherein the microphone array is an annular array formed by the plurality of microphones;
aligning the plurality of noisy audio signals, and adding the aligned plurality of noisy audio signals to obtain the target noisy audio signal.
6. A speech separation apparatus, comprising:
the decomposition module is used for decomposing the acquired target noisy audio signal into M frequency points in a preset frequency domain range to obtain decomposition signals of the M frequency points;
the merging module is used for merging the decomposed signals of the M frequency points into P preset frequency sub-bands based on the auditory perception characteristic of human ears;
the frame dividing module is used for performing frame dividing processing on the decomposition signals of the frequency points included in each preset frequency sub-band at intervals of preset time length to obtain N starting and stopping time periods corresponding to each preset frequency sub-band;
the estimation module is used for estimating a target ratio corresponding to each analysis unit in P analysis units of the same starting and stopping time period corresponding to the P preset frequency sub-bands;
the processing module is used for carrying out noise elimination processing on the target noisy audio signal corresponding to the same start-stop time period based on the P target ratios to obtain a target speech signal corresponding to the same start-stop time period;
wherein P and M are positive integers, P is smaller than M, N is an integer greater than or equal to 2, the same start-stop time period is any one of the N start-stop time periods, the target ratio is a first ratio or a second ratio, the first ratio is a ratio of speech energy to total energy of the analysis unit, and the second ratio is a ratio of noise energy to total energy of the analysis unit.
7. The apparatus according to claim 6, wherein the estimation module comprises:
an extraction unit, configured to extract an acoustic feature of each of the P analysis units;
the output unit is used for inputting P acoustic characteristics of the P analysis units into a target model so as to output the target ratio corresponding to the ith analysis unit in the P analysis units through the target model;
wherein i is greater than or equal to 1 and less than or equal to P, and the target model is obtained by training a pre-constructed preset model based on P acoustic feature samples of P analysis unit samples of noisy audio signal samples.
8. The apparatus according to claim 7, wherein the processing module is specifically configured to, when the target ratio is a first ratio, obtain the frequency domain speech signals corresponding to the same start-stop time period based on a product of an ith first ratio of the P first ratios and a target noisy audio signal corresponding to each frequency point in frequency points included in a target preset frequency sub-band, and perform time domain transformation on the frequency domain speech signals corresponding to the same start-stop time period to obtain the target speech signals corresponding to the same start-stop time period;
or, under the condition that the target model outputs the second ratio, calculating a difference value between a preset ratio and the second ratio, obtaining a frequency domain voice signal corresponding to the same start-stop time period based on a product of an ith difference value of the P difference values and a target noisy audio signal corresponding to each of frequency points included in a target preset frequency sub-band, and performing time domain transformation on the frequency domain voice signal corresponding to the same start-stop time period to obtain a target voice signal corresponding to the same start-stop time period;
and the target preset frequency sub-band is a preset frequency sub-band corresponding to the ith analysis unit.
9. The apparatus according to claim 6 or 7, wherein the P preset frequency subbands include m low frequency subbands, n mid frequency subbands and s high frequency subbands, and wherein the bandwidth of the mid frequency subbands is greater than the bandwidth of the low frequency subbands and less than the bandwidth of the high frequency subbands.
10. The apparatus of claim 6, further comprising:
a first acquisition module, used for acquiring a plurality of noisy audio signals acquired by a plurality of microphones of a microphone array, wherein the microphone array is an annular array formed by the plurality of microphones;
and a second acquisition module, used for aligning the plurality of noisy audio signals and adding the aligned plurality of noisy audio signals to obtain the target noisy audio signal.
Background
Separating speech in a noisy environment has long been a difficult problem in speech system applications and an active research topic. Speech is especially hard to separate from far-field audio, in which the target speaker is far away, the noise is of many types and high in energy, the environment is complex, and reverberation is strong.
At present, using a microphone array is a common way to enhance speech quality in far-field and complex noise environments. Existing multi-microphone processing methods mainly use multiple sensors: for each noisy audio frame that includes the target speech, the source direction of the target speech is determined by a sound source localization algorithm based on, for example, energy, time delay, or sound characteristics, and the noisy audio frame is spatially filtered according to that direction to obtain the target speech. However, such methods are sensitive to signal frequency, so for speech that spans a wide frequency range the resulting speech is more distorted and its quality is poorer.
Disclosure of Invention
The embodiment of the invention aims to provide a voice separation method and a voice separation device, so as to solve the problems of large distortion of obtained voice and poor voice quality effect caused by the existing voice separation method. The specific technical scheme is as follows:
In a first aspect of the present invention, there is provided a speech separation method, including:
decomposing the obtained target noisy audio signal to M frequency points in a preset frequency domain range to obtain decomposed signals of the M frequency points;
merging the decomposed signals of the M frequency points into P preset frequency sub-bands based on the auditory perception characteristic of human ears;
every interval of preset duration, performing framing processing on the decomposed signals of the frequency points included in each preset frequency sub-band to obtain N start-stop time periods corresponding to each preset frequency sub-band;
estimating a target ratio corresponding to each analysis unit in P analysis units of the same start-stop time period corresponding to the P preset frequency sub-bands;
based on the P target ratios, carrying out noise elimination processing on the target noisy audio signal corresponding to the same start-stop time period to obtain a target speech signal corresponding to the same start-stop time period;
wherein P and M are positive integers, P is smaller than M, N is an integer greater than or equal to 2, the same start-stop time period is any one of the N start-stop time periods, the target ratio is a first ratio or a second ratio, the first ratio is a ratio of speech energy to total energy of the analysis unit, and the second ratio is a ratio of noise energy to total energy of the analysis unit.
In a second aspect of the present invention, there is also provided a speech separation apparatus, including:
the decomposition module is used for decomposing the acquired target noisy audio signal into M frequency points in a preset frequency domain range to obtain decomposition signals of the M frequency points;
the merging module is used for merging the decomposed signals of the M frequency points into P preset frequency sub-bands based on the auditory perception characteristic of human ears;
the frame dividing module is used for performing frame dividing processing on the decomposition signals of the frequency points included in each preset frequency sub-band at intervals of preset time length to obtain N starting and stopping time periods corresponding to each preset frequency sub-band;
the estimation module is used for estimating a target ratio corresponding to each analysis unit in P analysis units of the same starting and stopping time period corresponding to the P preset frequency sub-bands;
the processing module is used for carrying out noise elimination processing on the target noisy audio signal corresponding to the same start-stop time period based on the P target ratios to obtain a target speech signal corresponding to the same start-stop time period;
wherein P and M are positive integers, P is smaller than M, N is an integer greater than or equal to 2, the same start-stop time period is any one of the N start-stop time periods, the target ratio is a first ratio or a second ratio, the first ratio is a ratio of speech energy to total energy of the analysis unit, and the second ratio is a ratio of noise energy to total energy of the analysis unit.
The speech separation method provided in this embodiment decomposes an acquired target noisy audio signal to M frequency points in a preset frequency domain range to obtain decomposed signals of the M frequency points; merges the decomposed signals of the M frequency points into P preset frequency sub-bands based on the auditory perception characteristic of human ears; performs framing processing, at intervals of a preset duration, on the decomposed signals of the frequency points included in each preset frequency sub-band to obtain N start-stop time periods corresponding to each preset frequency sub-band; estimates a target ratio corresponding to each of the P analysis units of the same start-stop time period corresponding to the P preset frequency sub-bands; and performs noise elimination processing on the target noisy audio signal corresponding to the same start-stop time period based on the P target ratios to obtain a target speech signal corresponding to that start-stop time period. Because the target speech signal is obtained by estimating the noise energy ratio or the speech energy ratio of each analysis unit, fuzzified noise reduction of the noisy audio signal is performed according to the speech energy ratio, which reduces distortion of the target speech signal and improves speech quality.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flow chart illustrating steps of a method for separating speech according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a speech separation apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
Referring to fig. 1, fig. 1 is a flowchart illustrating steps of a speech separation method according to an embodiment of the present invention. The method can be executed by a computer, a server and other electronic devices. The method comprises the following steps:
Step 101, decomposing the acquired target noisy audio signal to M frequency points in a preset frequency domain range to obtain decomposed signals of the M frequency points.
In this embodiment, a noisy audio signal, that is, an audio signal including both noise and speech, may be collected by a microphone array. The microphone array may include 2, 4, or 6 microphones; in the present embodiment, for example, a ring-shaped microphone array composed of 6 microphones is used. The more microphones there are, the richer the information in the collected audio signals, the more speech energy is collected, and the better the noise reduction effect. The noisy audio signals collected by the plurality of microphones of the microphone array can be aligned, and the aligned noisy audio signals are then added to obtain the target noisy audio signal.
The preset frequency domain range is determined according to the sampling frequency of the microphone array. For example, if the sampling frequency f of the microphone array is 32 kHz, the maximum frequency of the preset frequency domain range is one half of the sampling frequency, i.e. the preset frequency domain range is 0 to 16 kHz. A Fast Fourier Transform (FFT) with 512 sampling points can be used to decompose the acquired target noisy audio signal to 256 frequency points in the preset frequency domain range, with the same frequency spacing between every two adjacent frequency points. That is, with a 512-point FFT, M is equal to half the number of sampling points, i.e. M is equal to 256. A 256-point or 128-point FFT may also be employed.
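The decomposition of step 101 can be illustrated with a short sketch. It uses the 32 kHz sampling frequency and the 512-point FFT from the example above; the radix-2 `fft` helper, the `decompose` function, and the 1 kHz test tone are illustrative assumptions and are not taken from the embodiment.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

SAMPLE_RATE = 32000                  # Hz, from the example in the text
FFT_SIZE = 512                       # FFT length, from the example in the text
M = FFT_SIZE // 2                    # 256 frequency points covering 0 to 16 kHz
BIN_WIDTH = SAMPLE_RATE / FFT_SIZE   # spacing between adjacent frequency points

def decompose(frame):
    """Return the M frequency-point values of one FFT_SIZE-sample frame."""
    if len(frame) != FFT_SIZE:
        raise ValueError("frame must contain FFT_SIZE samples")
    return fft(frame)[:M]

# Hypothetical input: a 1 kHz tone, chosen because it falls exactly on one point.
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(FFT_SIZE)]
spectrum = decompose(tone)
peak_bin = max(range(M), key=lambda k: abs(spectrum[k]))
peak_frequency = peak_bin * BIN_WIDTH
```

With these values the spacing between adjacent frequency points is 62.5 Hz, so the strongest point of the test tone is the one at 1000 Hz.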
Step 102, merging the decomposed signals of the M frequency points into P preset frequency sub-bands based on the auditory perception characteristic of human ears.
The auditory perception characteristic of human ears is that the resolution of human hearing is high for low-frequency sound and low for high-frequency sound. According to this characteristic, the decomposed signals of the M frequency points are merged into P preset frequency sub-bands, where P is, for example, 16; that is, the decomposed signals of 256 frequency points are merged into 16 preset frequency sub-bands. The 16 preset frequency sub-bands are, for example: the direct current signal (0 Hz), 80 Hz, 156 Hz, 250 Hz, 366 Hz, 512 Hz, 693 Hz, 919 Hz, 1.2 kHz, 1.5 kHz, 2 kHz, 3.2 kHz, 4 kHz, 5.1 kHz, 6.3 kHz, and 8 kHz. The 1st of the 16 preset frequency sub-bands is 0 Hz, the 2nd is 80 Hz, and so on. The decomposed signals of the frequency points with frequencies of 0 to 40 Hz can be merged into the 0 Hz sub-band, those with frequencies of 40 Hz to 118 Hz into the 80 Hz sub-band, those with frequencies of 118 Hz to 206 Hz into the 156 Hz sub-band, and so on, so that the decomposed signals of the 256 frequency points are merged into the 16 preset frequency sub-bands.
In this embodiment, the decomposed signals of the M frequency points are merged into the P preset frequency subbands, and the number of the preset frequency subbands is smaller than that of the frequency points, so that the calculation complexity and the calculation amount in the subsequent steps can be reduced.
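As a rough illustration of step 102, the sketch below assigns each of the 256 frequency points to one of the 16 sub-bands listed above. The embodiment gives only the first few band edges, so assigning each point to the nearest listed sub-band frequency is an assumption, as is summing squared magnitudes to represent a merged sub-band; the names `band_of_bin`, `BIN_TO_BAND`, and `merge_energy` are hypothetical.

```python
# Sub-band frequencies listed in the embodiment, in Hz.
CENTRES_HZ = [0, 80, 156, 250, 366, 512, 693, 919,
              1200, 1500, 2000, 3200, 4000, 5100, 6300, 8000]
BIN_WIDTH = 62.5   # Hz, for a 512-point FFT at a 32 kHz sampling frequency
M = 256            # number of frequency points

def band_of_bin(k):
    """Assumption: a frequency point belongs to the nearest listed sub-band."""
    freq = k * BIN_WIDTH
    return min(range(len(CENTRES_HZ)), key=lambda b: abs(CENTRES_HZ[b] - freq))

BIN_TO_BAND = [band_of_bin(k) for k in range(M)]

def merge_energy(spectrum):
    """Assumption: a merged sub-band is represented by its summed energy."""
    energy = [0.0] * len(CENTRES_HZ)
    for k, value in enumerate(spectrum):
        energy[BIN_TO_BAND[k]] += abs(value) ** 2
    return energy

# Number of frequency points that fall into each sub-band.
bins_per_band = [BIN_TO_BAND.count(b) for b in range(len(CENTRES_HZ))]
flat_energy = merge_energy([1.0] * M)
```

Under this assignment the low-frequency sub-bands each hold only a few frequency points while the highest sub-band holds many, which matches the stated goal of keeping finer resolution at low frequencies while reducing the number of units to process.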
Through steps 101 and 102, the sound is decomposed into small time frequency units, and the time frequency units are classified, added and aggregated according to the auditory field of human ears.
Step 103, performing framing processing on the decomposed signals of the frequency points included in each preset frequency sub-band at intervals of a preset duration to obtain N start-stop time periods corresponding to each preset frequency sub-band.
The preset duration is, for example, 20 milliseconds: every 20 milliseconds, the decomposed signals of the frequency points included in each preset frequency sub-band are subjected to framing processing to obtain N start-stop time periods corresponding to each of the 16 preset frequency sub-bands. If the target noisy audio signal is 1000 milliseconds long, each preset frequency sub-band corresponds to 50 start-stop time periods, and therefore to 50 analysis units. Taking the 50 start-stop time periods corresponding to the 1st preset frequency sub-band as an example, the 1st start-stop time period is 0-20 milliseconds, the 2nd start-stop time period is 20-40 milliseconds, the 3rd start-stop time period is 40-60 milliseconds, and so on. Since one start-stop time period corresponds to one analysis unit, each preset frequency sub-band corresponds to N analysis units.
Step 104, estimating a target ratio corresponding to each analysis unit in the P analysis units of the same start-stop time period corresponding to the P preset frequency sub-bands.
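The framing arithmetic above can be checked with a few lines. The sketch assumes consecutive, non-overlapping frames and drops any trailing partial frame; the embodiment does not state either point, and the function name `start_stop_periods` is hypothetical.

```python
def start_stop_periods(total_ms, frame_ms=20):
    """Return the (start, stop) pairs, in milliseconds, of consecutive frames."""
    count = total_ms // frame_ms   # a trailing partial frame is dropped (assumption)
    return [(i * frame_ms, (i + 1) * frame_ms) for i in range(count)]

periods = start_stop_periods(1000)
n_periods = len(periods)           # N for a 1000 ms signal with 20 ms frames
```

Each returned pair is one start-stop time period, and, together with one sub-band, identifies one analysis unit.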
In step 104, the target ratio corresponding to each analysis unit in the P analysis units in the same start-stop time period corresponding to the P preset frequency subbands is estimated by the following steps:
extracting the acoustic features of each analysis unit in the P analysis units;
inputting P acoustic characteristics of the P analysis units into a target model, and outputting a target ratio corresponding to the ith analysis unit in the P analysis units through the target model;
wherein i is greater than or equal to 1 and less than or equal to P, and the target model is obtained by training a pre-constructed preset model based on P acoustic feature samples of P analysis unit samples of noisy audio signal samples.
The acoustic features include, for example, MFCC, GFCC, and PLP acoustic features. Mel-frequency cepstral coefficients (MFCCs) are cepstral parameters extracted in the Mel-scale frequency domain, where the Mel scale describes the non-linear frequency characteristics of human hearing. Gammatone frequency cepstral coefficients (GFCCs) extract acoustic energy features using a time-domain, non-uniform, multi-sub-band filter bank that mimics the auditory system. Perceptual linear prediction (PLP) features obtain cepstral coefficients using a linear-prediction autoregressive model.
The target model is obtained by training a pre-constructed preset model based on P acoustic feature samples of P analysis unit samples of noisy audio signal samples. The preset model is, for example, a Bayesian classifier, a Support Vector Machine (SVM) classifier, a Deep Neural Network (DNN), or a Recurrent Neural Network (RNN). Using a deep learning algorithm, the preset model can learn the acoustic features of analysis unit samples drawn from a large number of sounds in a sound library and a noise library, and from sounds with different signal-to-noise ratios obtained by mixing sounds from the two libraries, so as to acquire statistical prior knowledge and thereby yield the target model. After training, in the application stage, the target model can be used to estimate the target ratio corresponding to each analysis unit in the same start-stop time period. For example, an RNN with 3 layers of Gated Recurrent Units (GRUs) may be used to perform deep learning on the acoustic features, and the learned model may be used to estimate the noise energy ratio or the speech energy ratio of each analysis unit of the noisy speech.
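The training data described above are built by mixing sounds from the two libraries at different signal-to-noise ratios. The sketch below shows one common way to scale a noise clip so that a mixture reaches a requested SNR; the embodiment does not say how the mixing is done, so the function `mix_at_snr` and the toy sample values are assumptions for illustration only.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that speech energy / noise energy matches snr_db, then add."""
    speech_energy = sum(s * s for s in speech)
    noise_energy = sum(v * v for v in noise)
    # SNR in dB is 10 * log10(speech_energy / scaled_noise_energy).
    scale = math.sqrt(speech_energy / (noise_energy * 10 ** (snr_db / 10)))
    scaled_noise = [scale * v for v in noise]
    mixture = [s + v for s, v in zip(speech, scaled_noise)]
    return mixture, scaled_noise

# Hypothetical toy clips standing in for entries of the two libraries.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=0)
scaled_noise_energy = sum(v * v for v in scaled_noise)
```

Because the speech and the scaled noise are known separately for every mixture, the energy ratio of each analysis unit can be computed and used as the training label.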
The target ratio corresponding to the ith analysis unit is the ith target ratio, and one analysis unit corresponds to one target ratio. The target ratio is a first ratio or a second ratio, the first ratio is the ratio of the voice energy to the total energy of the analysis unit, and the second ratio is the ratio of the noise energy to the total energy of the analysis unit. The second ratio may also be referred to as a masking value, and refers to the ratio of the noise energies that need to be masked. The value range of the target ratio is 0-1.
In this embodiment, P analysis units correspond to P preset frequency subbands. The 1 st preset frequency sub-band corresponds to the 1 st analysis unit, the 2 nd preset frequency sub-band corresponds to the 2 nd analysis unit, the 3 rd preset frequency sub-band corresponds to the 3 rd analysis unit, and so on. For example, if there are 16 preset frequency subbands, there are 16 first ratios, and the 1 st first ratio corresponds to the 1 st preset frequency subband.
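The two ratios can be written out for a single analysis unit as follows. The sketch assumes that the total energy of the unit is the sum of its speech energy and noise energy, so that the two ratios add up to 1, and it treats a silent unit as all noise; both choices and the function name `target_ratios` are assumptions not stated in the embodiment.

```python
def target_ratios(speech_energy, noise_energy):
    """Return (first ratio, second ratio) for one analysis unit."""
    total = speech_energy + noise_energy   # assumption: energies are additive
    if total == 0:
        return 0.0, 1.0                    # assumption: a silent unit is fully masked
    first = speech_energy / total          # speech energy / total energy
    second = 1.0 - first                   # noise energy / total energy
    return first, second

first_ratio, second_ratio = target_ratios(3.0, 1.0)
```

Both values lie in the range 0 to 1, in line with the value range given for the target ratio.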
Step 105, based on the P target ratios, performing noise elimination processing on the target noisy audio signal corresponding to the same start-stop time period to obtain a target speech signal corresponding to the same start-stop time period.
P and M are positive integers, P is less than M, N is an integer greater than or equal to 2, and the same start-stop time period is any one of the N start-stop time periods.
In this embodiment, the noise elimination processing of step 105, which is performed on the target noisy audio signal corresponding to the same start-stop time period based on the P target ratios to obtain the target speech signal corresponding to that start-stop time period, may be implemented as follows:
under the condition that the target ratio is a first ratio, taking the product of the ith first ratio in the P first ratios and a target noisy audio signal corresponding to each frequency point in frequency points included in a target preset frequency sub-band as a frequency domain voice signal corresponding to the same start-stop time period, and performing time domain transformation on the frequency domain voice signal corresponding to the same start-stop time period to obtain a target voice signal corresponding to the same start-stop time period;
or under the condition that the target model outputs the second ratio, calculating a difference value between the preset ratio and the second ratio, taking a product of an ith difference value in the P difference values and a target noisy audio signal corresponding to each frequency point in frequency points included in a target preset frequency sub-band as a frequency domain voice signal corresponding to the same start-stop time period, and performing time domain transformation on the frequency domain voice signal corresponding to the same start-stop time period to obtain a target voice signal corresponding to the same start-stop time period;
wherein i is an integer greater than or equal to 1 and less than or equal to P, and the target preset frequency sub-band is the preset frequency sub-band corresponding to the ith analysis unit.
For example, when the target ratio is the first ratio, the frequency domain speech signal corresponding to the same start-stop period, for example, the 1 st start-stop period is obtained based on the product of the 1 st first ratio of the 16 first ratios and the target noisy audio signal corresponding to each frequency point in the frequency points included in the 1 st preset frequency sub-band, the product of the 2 nd first ratio and the target noisy audio signal corresponding to each frequency point in the frequency points included in the 2 nd preset frequency sub-band, the product of the 3 rd first ratio and the target noisy audio signal corresponding to each frequency point in the frequency points included in the 3 rd preset frequency sub-band, … …, the product of the 16 th first ratio and the target noisy audio signal corresponding to each frequency point in the frequency points included in the 16 th preset frequency sub-band, and performing time domain transformation on the frequency domain voice signal corresponding to the 1 st start-stop time period to obtain a target voice signal corresponding to the 1 st start-stop time period. According to the process, the target voice signal corresponding to the 2 nd start-stop time period, the target voice signal corresponding to the 3 rd start-stop time period and the like can be obtained.
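The per-sub-band multiplication and the time domain transformation described above can be sketched as follows. The sketch assumes a preset ratio of 1 for the second-ratio branch, uses an inverse FFT as the time domain transformation, and applies the same gain to mirrored negative-frequency points so that the output stays real; none of these details is stated in the embodiment, and `fft`, `denoise_frame`, and the 8-sample toy frame are hypothetical.

```python
import cmath
import math

def fft(x, inverse=False):
    """Recursive radix-2 transform (unnormalised); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def denoise_frame(frame, bin_to_band, ratios, ratios_are_noise=False,
                  preset_ratio=1.0):
    """Scale each frequency point by its sub-band gain; return time-domain samples."""
    n = len(frame)
    half = n // 2
    # A first ratio is used directly; a second ratio becomes preset_ratio - ratio.
    gains = [preset_ratio - r if ratios_are_noise else r for r in ratios]
    spectrum = fft(frame)
    for k in range(n):
        folded = k if k <= half else n - k        # mirror negative frequencies
        band = bin_to_band[min(folded, half - 1)]  # Nyquist point reuses last band
        spectrum[k] *= gains[band]
    return [value.real / n for value in fft(spectrum, inverse=True)]

# Toy frame: a low tone (kept) plus a high tone (masked), with 2 sub-bands.
frame = [math.cos(2 * math.pi * i / 8) + math.cos(2 * math.pi * 3 * i / 8)
         for i in range(8)]
bin_to_band = [0, 0, 1, 1]
cleaned = denoise_frame(frame, bin_to_band, [1.0, 0.0])
cleaned_from_noise = denoise_frame(frame, bin_to_band, [0.0, 1.0],
                                   ratios_are_noise=True)
```

In the toy frame both calls keep the low tone and remove the high tone, which shows that the first-ratio branch and the second-ratio branch produce the same gains when the two ratios of a unit add up to the preset ratio.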
In this embodiment, through steps 103 to 105, the noise energy ratio or the voice energy ratio is estimated for the aggregated unit, and the target voice is retained and the noise is masked according to the estimation result of the noise energy ratio or the voice energy ratio.
When the scheme of separating voice by spatial filtering is adopted in the prior art, the broadband frequency domain is divided into a plurality of frequency points for processing, and the gain of each frequency point is different, so that the distortion of the separated voice signal is larger.
The speech separation method provided by the embodiment of the invention decomposes the acquired target noisy audio signal to M frequency points in a preset frequency domain range to obtain decomposed signals of the M frequency points; merges the decomposed signals of the M frequency points into P preset frequency sub-bands based on the auditory perception characteristic of human ears; performs framing processing, at intervals of a preset duration, on the decomposed signals of the frequency points included in each preset frequency sub-band to obtain N start-stop time periods corresponding to each preset frequency sub-band; estimates a target ratio corresponding to each of the P analysis units of the same start-stop time period corresponding to the P preset frequency sub-bands; and performs noise elimination processing on the target noisy audio signal corresponding to the same start-stop time period based on the P target ratios to obtain the target speech signal corresponding to that start-stop time period. Because the target speech signal is obtained by estimating the noise energy ratio or the speech energy ratio of each analysis unit, fuzzified noise reduction of the noisy audio signal is performed according to the speech energy ratio, which reduces distortion of the target speech signal and improves speech quality.
It should be noted that the speech separation method provided in the embodiment of the present invention does not need to know the type of the noise, for example whether it is the sound of running water, an air conditioner, or a sweeper. Nor do the characteristics of the noise, such as its frequency and amplitude, need to be estimated; only the energy ratio needs to be estimated. By contrast, estimating the noise and the speech signal with a blocking matrix filter or the like is not suitable for non-stationary, time-varying noise; the resulting noise estimation error is large, which leads to poor noise filtering and considerable damage to the speech.
The spatial filtering scheme provided by the prior art needs to determine the direction of the source of the voice, and when the direction of the source of the noise is consistent with the direction of the source of the voice, the voice and the noise cannot be distinguished. The speech separation method provided by the embodiment does not need to depend on the direction of the source of the speech, and therefore, compared with the spatial filtering scheme, the speech separation method provided by the embodiment can separate the speech even if the direction of the source of the noise is consistent with the direction of the source of the speech.
Optionally, the P preset frequency subbands include m low-frequency subbands, n intermediate-frequency subbands, and s high-frequency subbands, and a bandwidth of the intermediate-frequency subband is greater than a bandwidth of the low-frequency subbands and smaller than a bandwidth of the high-frequency subbands.
In this embodiment, because the bandwidth of the middle-frequency sub-bands is greater than that of the low-frequency sub-bands and smaller than that of the high-frequency sub-bands, a low-frequency sub-band contains fewer frequency points than a middle-frequency sub-band, which in turn contains fewer frequency points than a high-frequency sub-band. Each frequency point in a low-frequency sub-band therefore carries more weight than a frequency point in a middle-frequency sub-band, which in turn carries more weight than a frequency point in a high-frequency sub-band. As a result, more detail of the low-frequency signal is preserved and more voice energy is captured, so that more of the target voice signal is retained according to the voice energy ratio, distortion of the target voice signal is reduced, and noise is masked.
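The relationship between sub-band bandwidth and per-frequency-point weight can be sketched as follows. This is only an illustration: the band sizes are invented, and interpreting the weight of a frequency point as its share of the sub-band (1 divided by the number of points in the band) is an assumption, not something the source states.

```python
def merge_into_subbands(bin_energies, band_sizes):
    """Sum per-frequency-point energies into consecutive sub-bands.

    band_sizes lists how many frequency points each sub-band holds, ordered
    from low to high frequency. The sizes are illustrative assumptions.
    """
    if sum(band_sizes) != len(bin_energies):
        raise ValueError("band sizes must cover all M frequency points")
    band_energies = []
    per_point_weight = []
    start = 0
    for size in band_sizes:
        band_energies.append(sum(bin_energies[start:start + size]))
        # Assumed notion of weight: share of one point within its sub-band.
        per_point_weight.append(1.0 / size)
        start += size
    return band_energies, per_point_weight


# M = 14 frequency points merged into P = 6 sub-bands:
# m = 2 low bands (1 point each), n = 2 middle bands (2 points each),
# s = 2 high bands (4 points each).
band_sizes = [1, 1, 2, 2, 4, 4]
bin_energies = [1.0] * 14
band_energies, weights = merge_into_subbands(bin_energies, band_sizes)
```

With these sizes a low-frequency point accounts for its whole sub-band, while a high-frequency point accounts for a quarter of its sub-band, which matches the ordering of weights described above.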
Optionally, before decomposing the acquired target noisy audio signal into M frequency points in a preset frequency domain range to obtain decomposed signals of the M frequency points in step 101, the method may further include the following steps:
acquiring a plurality of noisy audio signals acquired by a plurality of microphones of a microphone array, wherein the microphone array is an annular array formed by the plurality of microphones;
aligning the plurality of noisy audio signals and adding the aligned plurality of noisy audio signals to obtain a target noisy audio signal.
In this implementation, the behavior of a listener turning to face a speaker is simulated: the sound signals are aligned and added by combining the correlation and energy of the multi-microphone signals, so that the target sound energy across the multiple sensors is obtained and random noise is eliminated.
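A minimal sketch of the align-and-add idea is given below. It assumes integer-sample delays estimated by maximising the cross-correlation with the first channel; the source also mentions signal energy and an annular array geometry, neither of which is modelled here. The function names and the `max_lag` parameter are hypothetical.

```python
def best_lag(reference, signal, max_lag):
    """Return the lag (in samples) maximising cross-correlation with reference."""
    def corr(lag):
        total = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(signal):
                total += r * signal[j]
        return total
    return max(range(-max_lag, max_lag + 1), key=corr)


def align_and_sum(channels, max_lag):
    """Shift every channel onto the first one and add them sample by sample."""
    reference = channels[0]
    out = [0.0] * len(reference)
    for ch in channels:
        lag = best_lag(reference, ch, max_lag)
        for i in range(len(out)):
            j = i + lag
            if 0 <= j < len(ch):
                out[i] += ch[j]
    return out


pulse = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]
delayed = [0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # same pulse, 2 samples later
target = align_and_sum([pulse, delayed], max_lag=3)
```

After alignment the common component adds coherently, whereas uncorrelated noise on the individual microphones would add incoherently and so grow more slowly than the target signal.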
Referring to fig. 2, fig. 2 is a schematic structural diagram of a voice separating apparatus according to an embodiment of the present invention, where the apparatus 200 includes:
the decomposition module 210 is configured to decompose the acquired target noisy audio signal to M frequency points in a preset frequency domain range, so as to obtain decomposed signals of the M frequency points;
a merging module 220, configured to merge the decomposed signals of the M frequency points into P preset frequency subbands based on human auditory perception characteristics;
a framing module 230, configured to perform framing processing on the decomposed signals of the frequency points included in each preset frequency sub-band at intervals of a preset time duration to obtain N start-stop time periods corresponding to each preset frequency sub-band;
an estimating module 240, configured to estimate a target ratio corresponding to each analysis unit in P analysis units of the same start-stop time period corresponding to the P preset frequency subbands;
a processing module 250, configured to perform noise cancellation processing on the target noisy audio signal corresponding to the same start-stop time period based on the P target ratios to obtain a target audio signal corresponding to the same start-stop time period;
wherein P and M are positive integers, P is smaller than M, N is an integer greater than or equal to 2, the same start-stop time period is any one of the N start-stop time periods, the target ratio is a first ratio or a second ratio, the first ratio is the ratio of speech energy to the total energy of the analysis unit, and the second ratio is the ratio of noise energy to the total energy of the analysis unit.
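The two ratios defined above can be illustrated numerically. The sketch below assumes that the total energy of an analysis unit is the sum of its speech energy and noise energy, which holds only if speech and noise are uncorrelated within the unit; the function name and the handling of a silent unit are choices made for this example.

```python
def target_ratios(speech_energy, noise_energy):
    """Return (first_ratio, second_ratio) for one analysis unit.

    Assumes total energy = speech energy + noise energy (uncorrelated
    speech and noise); this is an assumption of the sketch.
    """
    total = speech_energy + noise_energy
    if total <= 0.0:
        return 0.0, 0.0  # silent unit: convention chosen for this sketch
    return speech_energy / total, noise_energy / total


first, second = target_ratios(speech_energy=3.0, noise_energy=1.0)
```

Under this assumption the first and second ratios sum to 1, so either one determines the other.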
The voice separation device provided by the embodiment of the invention decomposes the acquired target noisy audio signal into M frequency points in a preset frequency domain range to obtain decomposed signals of the M frequency points; merges the decomposed signals of the M frequency points into P preset frequency sub-bands based on the auditory perception characteristics of the human ear; performs framing processing, at intervals of a preset duration, on the decomposed signals of the frequency points included in each preset frequency sub-band to obtain N start-stop time periods corresponding to each preset frequency sub-band; estimates the target ratio corresponding to each of the P analysis units of the same start-stop time period corresponding to the P preset frequency sub-bands; and, based on the P target ratios, performs noise elimination processing on the target noisy audio signal corresponding to the same start-stop time period to obtain the target voice signal corresponding to that start-stop time period. Because the target voice signal is obtained by estimating the noise energy ratio or the voice energy ratio of each analysis unit, fuzzified noise reduction of the noisy audio signal is performed according to the voice energy ratio, which reduces distortion of the target voice signal and improves voice quality.
Optionally, the estimating module 240 specifically includes:
an extraction unit, configured to extract an acoustic feature of each of the P analysis units;
the output unit is used for inputting P acoustic characteristics of the P analysis units into a target model so as to output the target ratio corresponding to the ith analysis unit in the P analysis units through the target model;
wherein 1 ≤ i ≤ P, and the target model is obtained by training a pre-constructed preset model based on P acoustic feature samples of P analysis unit samples of a noisy audio signal sample.
Optionally, the processing module 250 is specifically configured to: when the target ratio is the first ratio, obtain the frequency domain voice signal corresponding to the same start-stop time period based on the product of the ith first ratio of the P first ratios and the target noisy audio signal at each of the frequency points included in a target preset frequency sub-band, and perform time domain transformation on that frequency domain voice signal to obtain the target voice signal corresponding to the same start-stop time period;
or, when the target model outputs the second ratio, calculate the difference between a preset ratio and the second ratio, obtain the frequency domain voice signal corresponding to the same start-stop time period based on the product of the ith difference of the P differences and the target noisy audio signal at each of the frequency points included in the target preset frequency sub-band, and perform time domain transformation on that frequency domain voice signal to obtain the target voice signal corresponding to the same start-stop time period;
and the target preset frequency sub-band is a preset frequency sub-band corresponding to the ith analysis unit.
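The two branches of the processing module can be sketched as a per-sub-band gain applied to the noisy spectrum of one start-stop time period. In this sketch the preset ratio is assumed to be 1, which the source does not state; the function name, argument names and example values are hypothetical, and the subsequent time domain transformation is omitted.

```python
def apply_ratios(noisy_bins, band_sizes, ratios, ratio_kind="first",
                 preset_ratio=1.0):
    """Scale every frequency point by the gain of the sub-band containing it.

    noisy_bins: complex spectrum of one start-stop time period (M values).
    band_sizes: number of frequency points in each of the P sub-bands.
    ratios: P target ratios, one per analysis unit.
    preset_ratio=1.0 is an assumption; the source does not give its value.
    """
    if len(ratios) != len(band_sizes) or sum(band_sizes) != len(noisy_bins):
        raise ValueError("inconsistent sizes")
    if ratio_kind == "first":
        gains = list(ratios)                        # speech share used directly
    elif ratio_kind == "second":
        gains = [preset_ratio - r for r in ratios]  # preset ratio minus noise share
    else:
        raise ValueError("ratio_kind must be 'first' or 'second'")
    speech_bins = []
    start = 0
    for size, gain in zip(band_sizes, gains):
        speech_bins.extend(gain * x for x in noisy_bins[start:start + size])
        start += size
    return speech_bins


noisy = [2 + 0j, 4 + 0j, 4 + 4j]  # M = 3 frequency points, P = 2 sub-bands
from_first = apply_ratios(noisy, [1, 2], [0.5, 0.25], "first")
from_second = apply_ratios(noisy, [1, 2], [0.5, 0.75], "second")
```

When the preset ratio is 1 and the two ratios of each analysis unit sum to 1, both branches yield the same frequency domain voice signal, as the example values show.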
Optionally, the P preset frequency subbands include m low-frequency subbands, n intermediate-frequency subbands, and s high-frequency subbands, and a bandwidth of the intermediate-frequency subband is greater than a bandwidth of the low-frequency subbands and smaller than a bandwidth of the high-frequency subbands.
Optionally, the apparatus further includes:
a first acquisition module, configured to acquire a plurality of noisy audio signals collected by a plurality of microphones of a microphone array, wherein the microphone array is an annular array formed by the plurality of microphones;
a second acquisition module, configured to align the plurality of noisy audio signals and add the aligned noisy audio signals to obtain the target noisy audio signal.
The embodiments in the present specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for parts that are the same or similar the embodiments may be referred to one another.
As those skilled in the art will readily appreciate, any combination of the above embodiments is possible, and any such combination is therefore an embodiment of the present invention; for reasons of space, these combinations are not described one by one herein.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, this manner of disclosure should not be interpreted as reflecting an intention that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names.