Voice signal enhancement processing method, device, equipment and storage medium
1. A method for enhancing a speech signal, the method comprising:
acquiring a target voice signal;
enhancing the target voice signal by adopting a reference voice enhancement mode to obtain a reference enhanced signal;
determining a target voice enhancement mode according to the reference enhancement signal;
and enhancing the target voice signal by adopting the target voice enhancement mode.
2. The method of claim 1, wherein the reference speech enhancement mode comprises a first speech enhancement mode and a second speech enhancement mode, and wherein a sampling rate of the second speech enhancement mode is smaller than a sampling rate of the first speech enhancement mode;
the enhancing the target speech signal by adopting a reference speech enhancing mode to obtain a reference enhanced signal, comprising:
enhancing the target voice signal by adopting the first voice enhancement mode to obtain a first enhanced signal;
performing down-sampling processing on the target voice signal to obtain a down-sampled voice signal;
enhancing the downsampled voice signal by adopting the second voice enhancement mode to obtain a second enhanced signal;
wherein the reference enhanced signal comprises the first enhanced signal and the second enhanced signal.
3. The method of claim 2, wherein determining the target speech enhancement mode based on the reference enhancement signal comprises:
extracting a third enhanced signal from the first enhanced signal according to the frequency range of the second enhanced signal, wherein the frequency range of the third enhanced signal is the same as the frequency range of the second enhanced signal;
and determining the target voice enhancement mode according to the third enhancement signal and the second enhancement signal.
4. The method of claim 3, wherein the target speech signal comprises a first signal portion and a second signal portion, the second signal portion having a frequency range that is the same as a frequency range of the second enhancement signal, the first signal portion having a frequency range that is other than the frequency range of the second signal portion in the frequency range of the target speech signal;
the determining the target speech enhancement mode according to the third enhancement signal and the second enhancement signal includes:
calculating a correlation coefficient of the third enhanced signal and the second enhanced signal;
determining that the target speech enhancement mode comprises applying the first speech enhancement mode to the first signal portion and applying the second speech enhancement mode to the second signal portion if the correlation coefficient is greater than a first threshold;
determining the target speech enhancement mode comprises applying the first speech enhancement mode to the target speech signal if the correlation coefficient is less than a first threshold.
5. The method of claim 2, wherein the target speech signal comprises a first signal portion and a second signal portion, the second signal portion having a frequency range that is the same as a frequency range of the second enhancement signal, the first signal portion having a frequency range that is other than the frequency range of the second signal portion in the frequency range of the target speech signal;
the determining a target speech enhancement mode according to the reference enhancement signal includes:
obtaining a target frequency range, wherein the target frequency range comprises at least one frequency;
for a first frequency of the at least one frequency, determining a gain of the first enhanced signal at the first frequency and a gain of the second enhanced signal at the first frequency;
adjusting the value of a gain count according to the magnitude relationship between the gain of the first enhancement signal at the first frequency and the gain of the second enhancement signal at the first frequency; wherein if the gain of the first enhanced signal at the first frequency is greater than the gain of the second enhanced signal at the first frequency, adding one to the gain count; decrementing the gain count if the gain of the first enhanced signal at the first frequency is less than the gain of the second enhanced signal at the first frequency;
determining that the target speech enhancement mode includes employing the first speech enhancement mode for the first signal portion and employing the second speech enhancement mode for the second signal portion, if a value of a gain count for completing an adjustment process is greater than zero;
and under the condition that the value of the gain count in the adjustment process is less than zero, determining the target voice enhancement mode comprises adopting the first voice enhancement mode for the target voice signal.
6. The method of claim 2, wherein the sampling rate of the second speech enhancement mode is one-half of the sampling rate of the first speech enhancement mode.
7. The method according to claim 2, wherein the first speech enhancement mode comprises speech enhancement based on a gated recurrent unit (GRU); the second speech enhancement mode comprises speech enhancement based on a long short-term memory network (LSTM).
8. The method of claim 1, wherein the enhancing the target speech signal by adopting a reference speech enhancement mode to obtain a reference enhanced signal comprises:
performing down-sampling processing on the target voice signal to obtain a down-sampled voice signal;
and enhancing the downsampled voice signal by adopting the reference voice enhancement mode to obtain the reference enhanced signal.
9. The method of claim 8, wherein determining the target speech enhancement mode based on the reference enhancement signal comprises:
performing pitch period estimation on the reference enhanced signal to obtain a pitch period of the reference enhanced signal;
and determining the target voice enhancement mode according to the pitch period of the reference enhancement signal.
10. The method according to claim 9, wherein the target speech signal comprises a first signal portion and a second signal portion, the second signal portion having a frequency range that is the same as the frequency range of the reference enhancement signal, the first signal portion having a frequency range that is other than the frequency range of the second signal portion in the frequency range of the target speech signal;
the determining the target speech enhancement mode according to the pitch period of the reference enhancement signal includes:
determining a pitch of the reference enhanced signal according to a pitch period of the reference enhanced signal;
in the case that the pitch of the reference enhancement signal is greater than a second threshold, determining the target speech enhancement mode comprises applying a first speech enhancement mode to the first signal portion and applying a second speech enhancement mode to the second signal portion; wherein the sampling rate of the second speech enhancement mode is smaller than that of the first speech enhancement mode;
in the event that the pitch of the reference enhancement signal is less than a second threshold, determining the target speech enhancement mode comprises employing a first speech enhancement mode on the target speech signal.
11. The method of claim 1,
the reference voice enhancement mode comprises a first voice enhancement mode and a second voice enhancement mode, and the sampling rate of the second voice enhancement mode is smaller than that of the first voice enhancement mode; the enhancing the target speech signal by adopting a reference speech enhancing mode to obtain a reference enhanced signal, comprising:
enhancing the target voice signal by adopting a first voice enhancement mode to obtain a first enhanced signal;
performing down-sampling processing on the target voice signal to obtain a down-sampled voice signal;
enhancing the downsampled voice signal by adopting a second voice enhancement mode to obtain a second enhanced signal;
wherein the reference enhanced signal comprises the first enhanced signal and the second enhanced signal;
the target speech signal comprises a first signal portion and a second signal portion, the frequency range of the second signal portion being the same as the frequency range of the second enhancement signal, the frequency range of the first signal portion being a frequency range of the target speech signal other than the frequency range of the second signal portion; the determining a target speech enhancement mode according to the reference enhancement signal includes:
extracting a third enhanced signal from the first enhanced signal according to the frequency range of the second enhanced signal, wherein the frequency range of the third enhanced signal is the same as the frequency range of the second enhanced signal;
calculating a correlation coefficient of the third enhanced signal and the second enhanced signal;
determining that the target speech enhancement mode comprises applying the first speech enhancement mode to the first signal portion and applying the second speech enhancement mode to the second signal portion if the correlation coefficient is greater than a first threshold;
under the condition that the correlation coefficient is smaller than a first threshold value, pitch period estimation is carried out on the second enhanced signal to obtain the pitch period of the second enhanced signal;
determining a pitch of the second enhanced signal according to a pitch period of the second enhanced signal;
in the event that the pitch of the second enhancement signal is greater than a second threshold, determining the target speech enhancement mode comprises employing the first speech enhancement mode for the first signal portion and employing the second speech enhancement mode for the second signal portion;
obtaining a target frequency range in the case that a pitch of the second enhancement signal is less than a second threshold, the target frequency range including at least one frequency;
for a first frequency of the at least one frequency, determining a gain of the third enhanced signal at the first frequency and a gain of the second enhanced signal at the first frequency;
adjusting the value of a gain count according to the magnitude relationship between the gain of the third enhanced signal at the first frequency and the gain of the second enhanced signal at the first frequency; wherein if the gain of the third enhanced signal at the first frequency is greater than the gain of the second enhanced signal at the first frequency, adding one to the gain count; decrementing the gain count if the gain of the third enhanced signal at the first frequency is less than the gain of the second enhanced signal at the first frequency;
determining that the target speech enhancement mode includes employing the first speech enhancement mode for the first signal portion and employing the second speech enhancement mode for the second signal portion, if a value of a gain count for completing an adjustment process is greater than zero;
and under the condition that the value of the gain count in the adjustment process is less than zero, determining the target voice enhancement mode comprises adopting the first voice enhancement mode for the target voice signal.
12. The method of any of claims 1 to 11, wherein the target speech signal comprises an ultra-wideband speech signal.
13. An apparatus for enhancing a speech signal, the apparatus comprising:
the voice signal acquisition module is used for acquiring a target voice signal;
the reference signal determining module is used for enhancing the target voice signal by adopting a reference voice enhancing mode to obtain a reference enhanced signal;
the enhancement mode determining module is used for determining a target voice enhancement mode according to the reference enhancement signal;
and the voice signal enhancement module is used for enhancing the target voice signal by adopting the target voice enhancement mode.
14. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by the processor to implement the method of enhanced processing of a speech signal according to any one of claims 1 to 12.
15. A computer-readable storage medium, having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the method of enhanced processing of a speech signal according to any one of claims 1 to 12.
Background
People often acquire a large number of voice signals in situations such as work, life, entertainment and the like. For example, speech signals are involved in teleconferencing, video telephony, live concerts, and the like.
Although the volume of voice signals is growing explosively and voice signals have become an indispensable part of people's work and life, the quality of voice signals from different sources and of different types is uneven, and most voice signals contain noise. To suppress the noise in a speech signal and enhance the useful signal in it, the speech signal may be enhanced. In the related art, the enhancement modes for a voice signal include a wideband enhancement mode and an ultra-wideband enhancement mode. The wideband enhancement mode enhances low-frequency signals well and is generally used for voice signals with a bandwidth of 0 to 8 kHz (kilohertz); the ultra-wideband enhancement mode enhances high-frequency signals well and is generally used for voice signals with a bandwidth of 8 to 16 kHz. Therefore, when a speech signal includes both a low-frequency signal portion and a high-frequency signal portion, that is, when the bandwidth of the speech signal is 0 to 16 kHz, the related art enhances it as follows: the wideband enhancement mode is applied to the low-frequency signal portion with a bandwidth of 0 to 8 kHz, and the ultra-wideband enhancement mode is applied to the high-frequency signal portion with a bandwidth of 8 to 16 kHz.
However, if the useful signal in the low-frequency signal portion of a speech signal is almost drowned in noise, it is difficult to distinguish the useful signal from the noise accurately and effectively. If the speech enhancement method of the related art is used, that is, if the wideband enhancement mode is applied directly to the low-frequency signal portion, noise is likely to be mistaken for a useful signal and enhanced. This defeats the purpose of suppressing noise and enhancing the useful signal, and hinders accurate enhancement of the speech signal.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a storage medium for enhancing a voice signal, which can accurately and effectively enhance the voice signal and improve the enhancement effect of the voice signal. The technical scheme is as follows:
in one aspect, an embodiment of the present application provides a method for enhancing a speech signal, where the method includes:
acquiring a target voice signal;
enhancing the target voice signal by adopting a reference voice enhancement mode to obtain a reference enhanced signal;
determining a target voice enhancement mode according to the reference enhancement signal;
and enhancing the target voice signal by adopting the target voice enhancement mode.
In another aspect, an embodiment of the present application provides an apparatus for enhancing a speech signal, where the apparatus includes:
the voice signal acquisition module is used for acquiring a target voice signal;
the reference signal determining module is used for enhancing the target voice signal by adopting a reference voice enhancing mode to obtain a reference enhanced signal;
the enhancement mode determining module is used for determining a target voice enhancement mode according to the reference enhancement signal;
and the voice signal enhancement module is used for enhancing the target voice signal by adopting the target voice enhancement mode.
In yet another aspect, an embodiment of the present application provides a computer device, where the computer device includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the method for processing enhancement of a speech signal.
In yet another aspect, an embodiment of the present application provides a computer-readable storage medium, in which at least one instruction, at least one program, a code set, or a set of instructions is stored, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the method for enhancing processing of a speech signal.
In yet another aspect, embodiments of the present application provide a computer program product or a computer program, where the computer program product or the computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to enable the computer device to execute the method for enhancing the voice signal.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
After the voice signal is obtained, it is first enhanced by a reference voice enhancement mode to obtain a reference enhanced signal; the voice enhancement mode to be used in the actual enhancement processing is then determined based on the reference enhanced signal, and the voice signal is enhanced by that mode. The reference enhanced signal reflects the signal characteristics of the initially acquired voice signal, for example, whether the voice signal is a clearly discernible signal or one that contains much noise, so the voice enhancement mode actually used can be chosen specifically for those characteristics. Compared with the related art, which uses a fixed voice enhancement mode and cannot treat different conditions of the voice signal differently, the embodiments of the present application fully consider the signal characteristics of the voice signal during enhancement, which helps to enhance the voice signal accurately and effectively and improves the enhancement effect.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present application;
FIG. 2 is a flowchart of a method for enhancing a speech signal according to an embodiment of the present application;
FIG. 3 is a diagram illustrating a method for enhancing a speech signal according to an embodiment of the present application;
FIG. 4 is a schematic illustration of a speech enhancement effect provided by an embodiment of the present application;
FIG. 5 is a block diagram of an apparatus for enhancing a speech signal according to an embodiment of the present application;
FIG. 6 is a block diagram of an apparatus for enhancing a speech signal according to another embodiment of the present application;
FIG. 7 is a block diagram of a computer device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application clearer, embodiments of the present application are described in further detail below with reference to the accompanying drawings.
The technical scheme provided by the embodiment of the application is suitable for any service scene with the requirement of enhancing the voice signal, such as a voice conference, a video conference, voice recording, video recording and other service scenes.
Referring to fig. 1, a schematic diagram of an application scenario provided in an embodiment of the present application is shown. The application scenario can be implemented as a cloud video conference system, which is a video conference platform based on cloud technology.
Cloud technology refers to a hosting technology that unifies hardware, software, network and other resources in a wide area network or a local area network to realize the calculation, storage, processing and sharing of data. Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like applied on the basis of the cloud computing business model; these can form a resource pool that is used on demand, which is flexible and convenient. Cloud computing technology will become an important support. Background services of technical network systems, such as video websites, picture websites and portal websites, require a large amount of computing and storage resources. With the rapid development and application of the internet industry, each item may have its own identification mark, which needs to be transmitted to a background system for logical processing; data of different levels are processed separately, and all kinds of industry data require strong system background support, which can only be realized through cloud computing.
As shown in fig. 1, the cloud video conference system may include: a terminal 10 and a server 20.
The number of the terminals 10 may be one or more. The terminal 10 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like.
The server 20 may be an independent physical server, a server cluster or distributed system composed of a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a CDN (Content Delivery Network), and big data and artificial intelligence platforms.
The terminal 10 and the server 20 may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present application.
In one example, a client running a target application, such as an application providing a video conference function, is installed in the terminal 10. The server 20 may be a background server for the target application for providing background services to clients of the target application.
In the method for enhancing a voice signal provided in the embodiments of the present application, the execution subject of each step may be the terminal 10 (for example, the client of the target application installed and running in the terminal 10) or the server 20. Alternatively, the terminal 10 and the server 20 may execute the method in interactive cooperation, that is, some steps of the method are executed by the terminal 10 and the other steps are executed by the server 20.
For convenience of description, in the method embodiments described below the execution subject of each step is described simply as a computer device. The computer device refers to an electronic device with data calculation, processing and storage capabilities, such as the terminal 10 or the server 20, which is not limited in the embodiments of the present application.
Referring to fig. 2, a flowchart of a method for enhancing a speech signal according to an embodiment of the present application is shown. The method may be applied in a computer device, such as the above-mentioned terminal 10 or server 20. The method comprises the following steps (steps 210-240):
Step 210, obtaining a target voice signal.
The target speech signal refers to a speech signal that needs noise suppression or useful signal enhancement, and may be a speech signal collected by an audio collecting device (such as a microphone) in a real environment. In general, the target speech signal contains noise, which may be ambient noise, howling, or other noise, and the embodiment of the present application does not limit the type of the noise. For example, in a cloud video conference scenario, a microphone collects a voice signal generated when a participant speaks, and at the same time, the microphone may also collect a noise signal due to environment, equipment, and the like, in which case, the noise signal collected by the microphone and the voice signal generated when the participant speaks together constitute a target voice signal.
In one example, the target speech signal comprises an ultra-wideband speech signal. For example, in a cloud video conference scenario, a microphone acquires a target voice signal with a sampling frequency of 32 kHz, and the bandwidth of the target voice signal is 0 to 16 kHz. It should be noted that, as the technology evolves, the ultra-wideband voice signal may have a larger bandwidth, or the name of a voice signal with a larger bandwidth may change; it should be understood that these cases also fall within the protection scope of the present application.
Step 220, performing enhancement processing on the target voice signal by adopting a reference voice enhancement mode to obtain a reference enhanced signal.
As described in the background above, the related art enhances a given speech signal with a fixed speech enhancement mode. For example, when the speech signal is an ultra-wideband speech signal, the fixed mode used in the related art is: the wideband enhancement mode for the low-frequency signal portion of the speech signal and the ultra-wideband enhancement mode for the high-frequency signal portion of the speech signal. However, such a fixed speech enhancement mode cannot treat different conditions of the speech signal differently, which hinders accurate and effective enhancement of the speech signal.
Based on this, in the embodiment of the present application, after obtaining the target speech signal, the computer device does not directly enhance the target speech signal and output the enhanced speech signal, but first performs enhancement processing on the target speech signal by using a reference speech enhancement method to obtain a reference enhancement signal, then further determines a speech enhancement method, that is, a target speech enhancement method, which is used in actual enhancement processing based on the reference enhancement signal, and then performs enhancement processing on the target speech signal by using the target speech enhancement method. The reference enhancement signal can reflect the signal characteristics of the target voice signal, so that the target voice enhancement mode can be determined in a targeted manner according to the reference enhancement signal, the signal characteristics of the target voice signal are fully considered in the target voice signal enhancement process, the target voice signal can be accurately and effectively enhanced, and the enhancement effect of the target voice signal is improved.
The embodiment of the present application does not limit the type of the reference voice enhancement mode. Optionally, the reference voice enhancement mode includes a wideband enhancement mode, such as a voice enhancement mode with a sampling rate of 16 kHz; or the reference voice enhancement mode includes an ultra-wideband enhancement mode, such as a voice enhancement mode with a sampling rate of 32 kHz; or the reference voice enhancement mode includes both a wideband enhancement mode and an ultra-wideband enhancement mode. The embodiment of the present application also does not limit the specific content of the reference voice enhancement mode. Optionally, the reference voice enhancement mode includes at least one of the following: LSTM (Long Short-Term Memory) with a sampling rate of 16 kHz, LSTM with a sampling rate of 32 kHz, GRU (Gated Recurrent Unit) with a sampling rate of 16 kHz, and GRU with a sampling rate of 32 kHz. For further description of the reference voice enhancement mode and of how it is used to enhance the target speech signal to obtain the reference enhanced signal, please refer to the following method embodiments, which are not repeated herein.
Step 230, determining a target speech enhancement mode according to the reference enhancement signal.
The computer equipment can further determine the signal characteristics of the target voice signal according to the acquired reference enhancement signal, and determine a target voice enhancement mode according to the signal characteristics of the target voice signal. For example, in the case that the reference enhancement mode includes a wideband enhancement mode, the computer device performs enhancement processing on the target speech signal in the wideband enhancement mode to obtain a reference enhancement signal, and then further processes the reference enhancement signal to determine that the pitch of the reference enhancement signal is small, so as to determine that the useful signal in the low-frequency part of the target speech signal is submerged in the noise, and the computer device should avoid performing enhancement processing on the low-frequency part of the target speech signal in the wideband enhancement mode to avoid that the noise in the low-frequency part is enhanced to violate the purpose of noise suppression, at this time, the computer device may perform enhancement on the low-frequency part in the ultra wideband enhancement mode. For further description of how the computer device determines the target speech enhancement mode according to the reference enhancement signal, please refer to the following method embodiments, which are not repeated herein.
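The pitch-based decision in the example above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the pitch period is estimated with a plain autocorrelation search, "pitch" is taken to mean the fundamental frequency (sampling rate divided by the pitch period), and the function names, the 60 to 400 Hz search range, the 100 Hz threshold and the handling of a pitch exactly equal to the threshold are all hypothetical choices that the embodiments do not specify.

```python
import math

def estimate_pitch_period(samples, sample_rate, min_hz=60.0, max_hz=400.0):
    """Return the lag (in samples) with the largest autocorrelation.

    The search range min_hz..max_hz is an assumption for this sketch.
    """
    min_lag = int(sample_rate / max_hz)
    max_lag = min(int(sample_rate / min_hz), len(samples) - 1)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[n] * samples[n + lag]
                    for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def choose_mode_by_pitch(pitch_hz, threshold_hz):
    """Map the pitch of the reference enhanced signal to a mode label.

    Above the threshold: wideband mode on the low band and ultra-wideband
    mode on the high band. Otherwise: ultra-wideband mode on the whole
    signal (treating equality this way is an assumption).
    """
    if pitch_hz > threshold_hz:
        return "split_by_band"
    return "first_mode_only"

# Toy reference signal: a 200 Hz tone sampled at 16 kHz.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(1600)]
period = estimate_pitch_period(tone, sample_rate)
pitch_hz = sample_rate / period
mode = choose_mode_by_pitch(pitch_hz, threshold_hz=100.0)
```

In a real system the reference enhanced signal would be the output of the wideband enhancement mode rather than a synthetic tone, and a more robust pitch estimator would likely be preferred.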
And 240, enhancing the target voice signal by adopting a target voice enhancement mode.
The target speech enhancement mode is an enhancement mode actually adopted when enhancement processing is performed on the target speech signal. Optionally, the target speech enhancement mode may be a single enhancement mode, such as an ultra-wideband enhancement mode; the method may also be a fusion of multiple enhancement modes, for example, a wideband enhancement mode is adopted for a low-frequency signal portion in the target speech signal, and an ultra-wideband enhancement mode is adopted for a high-frequency signal portion in the target speech signal, and the specific content of the target speech enhancement mode is not limited in the embodiment of the present application. After the computer equipment determines the target voice enhancement mode, the target voice signal is enhanced according to the target voice enhancement mode, and the enhanced voice signal can be obtained. Optionally, the computer device may send the enhanced speech signal to an audio output device (e.g., a speaker) so that the audio output device outputs the enhanced speech signal, thereby improving the signal quality of the speech signal.
To sum up, according to the technical solution provided by the embodiment of the present application, after a speech signal is acquired, the speech signal is enhanced in a reference speech enhancement mode to obtain a reference enhanced signal, the speech enhancement mode to be used in the actual enhancement processing is determined based on the reference enhanced signal, and the speech signal is then enhanced in the speech enhancement mode so determined. The reference enhanced signal reflects the signal characteristics of the initially acquired speech signal, such as whether the speech signal is clearly noisy, so that the speech enhancement mode actually adopted can be determined in a targeted manner according to those signal characteristics. Compared with the prior art, which adopts a fixed speech enhancement mode and therefore cannot treat different conditions of the speech signal differently, the embodiment of the present application fully considers the signal characteristics of the speech signal during enhancement, which helps to enhance the speech signal accurately and effectively and improves the enhancement effect.
In one example, the reference speech enhancement mode includes a first speech enhancement mode and a second speech enhancement mode.
Because speech enhancement modes are varied, the computer device may process the target speech signal in several speech enhancement modes to obtain reference enhanced signals, and then process the obtained reference enhanced signals to determine the speech enhancement mode actually adopted, i.e., the target speech enhancement mode. In order to form an effective contrast between the reference enhanced signals, in the embodiment of the present application, the computer device may enhance the target speech signal in speech enhancement modes with different sampling rates to obtain the reference enhanced signals. Based on this, the reference speech enhancement mode may include a first speech enhancement mode and a second speech enhancement mode, wherein the sampling rate of the second speech enhancement mode is smaller than the sampling rate of the first speech enhancement mode. Optionally, the sampling rate of the second speech enhancement mode is one half of the sampling rate of the first speech enhancement mode. Illustratively, the second speech enhancement mode is a wideband speech enhancement mode with a sampling rate of 16 kHz, for example, LSTM-based speech enhancement with a sampling rate of 16 kHz; the first speech enhancement mode is an ultra-wideband speech enhancement mode with a sampling rate of 32 kHz, for example, GRU-based speech enhancement with a sampling rate of 32 kHz.
Based on this, the step 220 includes the following steps:
Step 221, performing enhancement processing on the target speech signal by using a first speech enhancement mode to obtain a first enhanced signal.
Optionally, in order to facilitate fast determination of the reference speech enhancement mode, the sampling rate of the first speech enhancement mode is the same as the sampling rate of the target speech signal. Therefore, the computer device can directly adopt the first voice enhancement mode to carry out enhancement processing on the target voice signal so as to obtain a first enhanced signal.
Step 223, down-sampling the target speech signal to obtain a down-sampled speech signal.
The sampling rate of the first speech enhancement mode is the same as that of the target speech signal, and the sampling rate of the first speech enhancement mode is greater than that of the second speech enhancement mode, so that the sampling rate of the second speech enhancement mode is less than that of the target speech signal. Therefore, before the computer device performs enhancement processing on the target speech signal by using the second speech enhancement mode, the computer device needs to perform down-sampling processing on the target speech signal first to reduce the sampling rate of the target speech signal, so that the sampling rate of the down-sampled speech signal is the same as that of the second speech enhancement mode.
Step 225, enhancing the down-sampled speech signal by adopting the second speech enhancement mode to obtain a second enhanced signal.
After the down-sampled voice signal is obtained, the computer device can perform enhancement processing on the down-sampled voice signal by adopting a second voice enhancement mode to obtain a second enhanced signal. Thus, the reference enhancement signal comprises the first enhancement signal and the second enhancement signal described above.
For example, the target speech signal is a speech signal with a bandwidth of 0 to 16 kHz and a sampling rate of 32 kHz, the first speech enhancement mode includes GRU-based speech enhancement at 32 kHz, and the second speech enhancement mode includes LSTM-based speech enhancement at 16 kHz. The computer device directly enhances the target speech signal in the first speech enhancement mode to obtain a first enhanced signal. The computer device also down-samples the target speech signal, reducing its sampling rate to 16 kHz to obtain a down-sampled speech signal, and then enhances the down-sampled speech signal in the second speech enhancement mode to obtain a second enhanced signal.
It should be noted that, in the embodiment of the present application, the execution sequence between step 221 and steps 223 and 225 is not limited, and optionally, step 221 is executed before steps 223 and 225; alternatively, step 221 is performed simultaneously with steps 223 and 225; alternatively, step 221 is performed after step 223 and step 225. It should be understood that these are all intended to fall within the scope of the present application.
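Steps 221 to 225 can be sketched as follows. This is an illustration under stated assumptions only: `first_mode` and `second_mode` are hypothetical callables standing in for the 32 kHz and 16 kHz enhancement models, and the pair-averaging down-sampler is a deliberately crude stand-in for a proper anti-aliasing filter followed by decimation.

```python
def downsample_by_two(samples):
    """Halve the sampling rate by averaging each pair of adjacent samples.

    The pair average acts as a very crude low-pass filter; a real system
    would use a proper anti-aliasing filter before discarding samples.
    """
    return [(samples[i] + samples[i + 1]) / 2.0
            for i in range(0, len(samples) - 1, 2)]


def reference_enhance(target, first_mode, second_mode):
    """Return (first_enhanced, second_enhanced) for a target signal."""
    first_enhanced = first_mode(target)            # step 221
    downsampled = downsample_by_two(target)        # step 223
    second_enhanced = second_mode(downsampled)     # step 225
    return first_enhanced, second_enhanced


# Placeholder "enhancers" that only scale the signal, for illustration.
target = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
first, second = reference_enhance(
    target,
    first_mode=lambda x: [0.5 * v for v in x],
    second_mode=lambda x: [0.5 * v for v in x],
)
```

Because the two branches share no intermediate results, the first branch (step 221) and the second branch (steps 223 and 225) could equally be run in the opposite order or concurrently.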
Based on the above steps 221 to 225, in an example, the above step 230 includes the following steps:
Step 232, extracting a third enhanced signal from the first enhanced signal according to the frequency range of the second enhanced signal, wherein the frequency range of the third enhanced signal is the same as the frequency range of the second enhanced signal.
Since the sampling rate of the second speech enhancement mode is smaller than the sampling rate of the first speech enhancement mode, the frequency range of the second enhanced signal obtained by the second speech enhancement mode will also be smaller than the frequency range of the first enhanced signal obtained by the first speech enhancement mode. If the processing such as comparison and calculation is performed based on the first enhancement signal and the second enhancement signal, it is necessary to compare portions of the enhancement signals having the same frequency range to improve the accuracy of the processing result.
Therefore, according to the frequency range of the second enhanced signal, the computer device needs to extract from the first enhanced signal the portion whose frequency range is the same as that of the second enhanced signal, i.e., the third enhanced signal. For example, if the frequency range of the first enhanced signal is 0 to 16 kHz and the frequency range of the second enhanced signal is 0 to 8 kHz, the computer device needs to extract the portion of the first enhanced signal with a frequency range of 0 to 8 kHz as the third enhanced signal.
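A minimal sketch of this extraction is given below. It assumes the first enhanced signal is available as a list of per-bin spectral values with a known bin spacing; the representation, the function name `extract_band` and its parameters are assumptions made for illustration only.

```python
def extract_band(spectrum, bin_hz, max_hz):
    """Keep the bins of `spectrum` whose frequency is below max_hz."""
    return [value for k, value in enumerate(spectrum) if k * bin_hz < max_hz]


# First enhanced signal: 8 bins spaced 2 kHz apart, covering 0 to 16 kHz.
first_spectrum = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]

# Third enhanced signal: the 0 to 8 kHz portion of the first enhanced signal.
third_spectrum = extract_band(first_spectrum, bin_hz=2000, max_hz=8000)
```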
Step 234, determining the target speech enhancement mode according to the third enhanced signal and the second enhanced signal.
In the embodiment of the present application, the computer device determines the actually used speech enhancement mode based on two enhancement signals with the same frequency range, namely, the third enhancement signal and the second enhancement signal.
Optionally, the target speech signal comprises a first signal portion and a second signal portion, the frequency range of the second signal portion being the same as the frequency range of the second enhancement signal, the frequency range of the first signal portion being a frequency range of the target speech signal other than the frequency range of the second signal portion; the step 234 includes the following steps:
(1) Calculating the correlation coefficient of the third enhanced signal and the second enhanced signal.
The degree of correlation between two signals can be reflected by their correlation coefficient. In the embodiment of the present application, the computer device may calculate the correlation coefficient of the third enhanced signal and the second enhanced signal to determine the degree of correlation between them. Illustratively, assume that the gain corresponding to the third enhanced signal is g1 and the gain corresponding to the second enhanced signal is g2; the correlation coefficient corr of the third enhanced signal and the second enhanced signal is then calculated from g1 and g2.
(2) In the case where the correlation coefficient is greater than the first threshold, determining that the target speech enhancement mode includes adopting the first speech enhancement mode for the first signal portion and adopting the second speech enhancement mode for the second signal portion.
Generally, the larger the correlation coefficient, the higher the degree of correlation between the two signals. In the embodiment of the application, a first threshold is set, and when the correlation coefficient is greater than the first threshold, it is determined that the third enhancement signal and the second enhancement signal have strong correlation; in the case where the correlation coefficient is smaller than the first threshold value, it is determined that the correlation between the third enhanced signal and the second enhanced signal is weak. Optionally, the first threshold is 0.05, or 0.06, or 0.04, and in the application process, the value of the first threshold may be actually determined according to the requirement of calculation accuracy, and the like, and the value of the first threshold is not limited in this embodiment of the application.
And under the condition that the correlation coefficient is larger than the first threshold value, the computer equipment determines that the third enhancement signal and the second enhancement signal have stronger correlation, so that the target speech signal can be enhanced by adopting a mode of fusing two speech enhancement modes. In the embodiment of the present application, in the case that the correlation coefficient is greater than the first threshold, the computer device performs enhancement processing on a high-frequency signal portion (i.e., a first signal portion) in the target speech signal by using a first speech enhancement mode, and performs enhancement processing on a low-frequency signal portion (i.e., a second signal portion) in the target speech signal by using a second speech enhancement mode.
(3) In the case where the correlation coefficient is smaller than the first threshold, determining that the target speech enhancement mode includes adopting the first speech enhancement mode for the target speech signal.
In case the correlation coefficient is smaller than the first threshold, the computer device determines that the correlation between the third enhancement signal and the second enhancement signal is weak, which may be due to the presence of more noise in the low frequency signal portion of the target speech signal. In order to avoid enhancing noise against the purpose of noise suppression, in the embodiment of the present application, in the case that the correlation coefficient is smaller than the first threshold, the computer device may perform enhancement processing on the target speech signal in the first speech enhancement mode.
It should be noted that, in the case where the correlation coefficient is equal to the first threshold, the computer device may follow the processing for the case where the correlation coefficient is smaller than the first threshold, that is, determine that the target speech enhancement mode includes adopting the first speech enhancement mode for the target speech signal; it may also follow the processing for the case where the correlation coefficient is greater than the first threshold, that is, determine that the target speech enhancement mode includes adopting the first speech enhancement mode for the first signal portion and adopting the second speech enhancement mode for the second signal portion. It should be understood that both of these approaches fall within the scope of the present application.
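Cases (1) to (3) can be sketched as follows. The sketch assumes, as an illustration and not as the formula of the embodiment, that corr is the Pearson correlation coefficient of the per-frequency gain sequences g1 and g2; the function names and the returned labels are hypothetical, and the default threshold of 0.05 is one of the optional values mentioned above.

```python
import math


def correlation(g1, g2):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(g1)
    mean1 = sum(g1) / n
    mean2 = sum(g2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(g1, g2))
    var1 = sum((a - mean1) ** 2 for a in g1)
    var2 = sum((b - mean2) ** 2 for b in g2)
    if var1 == 0.0 or var2 == 0.0:
        return 0.0  # assumed convention for constant sequences
    return cov / math.sqrt(var1 * var2)


def choose_mode_by_correlation(g1, g2, first_threshold=0.05):
    """Return "fused" for case (2) and "first_only" for case (3)."""
    if correlation(g1, g2) > first_threshold:
        return "fused"       # first mode on high band, second mode on low band
    return "first_only"      # equality is treated like the "smaller" case here
```

Treating equality like the "smaller" case is only one of the two permitted options; swapping `>` for `>=` gives the other.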
Based on the above steps 221 to 225, in another example, the target speech signal includes a first signal portion and a second signal portion, the frequency range of the second signal portion is the same as the frequency range of the second enhancement signal, and the frequency range of the first signal portion is a frequency range other than the frequency range of the second signal portion in the frequency range of the target speech signal; the step 230 includes the following steps:
Step 231, obtaining a target frequency range, the target frequency range including at least one frequency.
In the enhancement processing of a speech signal, the larger the gain of the speech signal after the enhancement processing is, the worse the noise suppression effect of the enhancement processing is. Thus, in embodiments of the present application, the computer device may compare the gain of the first enhancement signal to the gain of the second enhancement signal to determine the actual speech enhancement mode employed. In consideration of various factors such as processing overhead and accuracy, in the embodiment of the present application, the computer device compares gains corresponding to the first enhancement signal and the second enhancement signal respectively at least one frequency within a certain frequency range, and then determines the target speech enhancement mode according to the final gain count.
Therefore, the computer device first needs to determine the frequency range of the gain comparison. As can be seen from the above description, since the sampling rate of the second speech enhancement mode is smaller than that of the first speech enhancement mode, the frequency range of the second enhancement signal will also be smaller than that of the first enhancement signal. In the process of gain comparison, in order to improve the accuracy, the gain comparison needs to be performed based on the same frequency range, so that the target frequency range is determined based on the frequency range of the second enhancement signal in the embodiment of the present application.
Optionally, the computer device may directly take the frequency range of the second enhanced signal as the target frequency range; alternatively, the computer device takes a part of the frequency range of the second enhanced signal as the target frequency range. The size of the target frequency range is not limited in the embodiment of the present application; in practice, it may be determined by weighing factors such as calculation accuracy and the processing overhead of the computer device. For example, if the frequency range of the second enhanced signal is 0 to 8 kHz, the target frequency range may be 0 to 8 kHz, or 0.6 to 1.5 kHz.
Step 233, for a first frequency of the at least one frequency, determining the gain of the first enhanced signal at the first frequency and the gain of the second enhanced signal at the first frequency.
The target frequency range includes at least one frequency. The manner in which the at least one frequency is selected within the target frequency range is not limited in the embodiment of the present application. Optionally, the at least one frequency is associated with the sampling points, that is, each sampling point corresponds to one frequency in the target frequency range; alternatively, the at least one frequency is chosen randomly within the target frequency range.
The computer device will compare the gains of the first and second enhanced signals at each of the at least one frequency. Thus, the computer device needs to first determine the gain of the first and second enhancement signals at each of the at least one frequency. Taking a first frequency of the at least one frequency as an example, the computer device needs to determine the gain of the first enhanced signal at the first frequency and the gain of the second enhanced signal at the first frequency, respectively.
Step 235, adjusting the value of the gain count according to the magnitude relationship between the gain of the first enhanced signal at the first frequency and the gain of the second enhanced signal at the first frequency.
At each of the at least one frequency, the computer device compares the gain of the first enhanced signal with the gain of the second enhanced signal and adjusts the value of the gain count according to the comparison result. In the embodiment of the present application, the gain count is adjusted by incrementing or decrementing it. Optionally, the gain count is incremented by one when the gain of the first enhanced signal is greater than the gain of the second enhanced signal, and decremented by one when the gain of the first enhanced signal is less than the gain of the second enhanced signal. Taking a first frequency of the at least one frequency as an example, if the gain of the first enhanced signal at the first frequency is greater than the gain of the second enhanced signal at the first frequency, the gain count is incremented by one; if the gain of the first enhanced signal at the first frequency is less than the gain of the second enhanced signal at the first frequency, the gain count is decremented by one.
It should be noted that steps 237 and 239 below are described using the example in which the gain count is incremented by one when the gain of the first enhanced signal is greater than the gain of the second enhanced signal. In addition, in the embodiment of the present application, when the gain of the first enhanced signal is equal to the gain of the second enhanced signal, the gain count may be incremented, decremented, or left unchanged. It should be understood that these all fall within the scope of the present application.
For example, assume that the target frequency range is 0.6 to 1.5 kHz, that i denotes a frequency within the target frequency range corresponding to a sampling point in 0.6 to 1.5 kHz, that the gain of the first enhanced signal is g1, and that the gain of the second enhanced signal is g2. The adjustment process of the gain count is then as follows.
count = 0
for each frequency i with 0.6 kHz ≤ i ≤ 1.5 kHz:
    if g1[i] > g2[i]:
        count = count + 1
    else:
        count = count - 1
Step 237, in the case where the value of the gain count after the adjustment process is greater than zero, determining that the target speech enhancement mode includes adopting the first speech enhancement mode for the first signal portion and adopting the second speech enhancement mode for the second signal portion.
After the adjustment process of the gain count is completed, the computer device determines the target speech enhancement mode according to the resulting value of the gain count. As described above, the larger the signal gain after the enhancement processing, the worse the noise suppression effect. The embodiment of the present application takes as an example incrementing the gain count by one when the gain of the first enhanced signal is greater than the gain of the second enhanced signal. Therefore, a value of the gain count greater than zero after the adjustment process indicates that the gain of the first enhanced signal exceeds the gain of the second enhanced signal at more of the compared frequencies than not; that is, within the frequency range corresponding to the sampling rate of the second speech enhancement mode, the noise suppression effect of the first speech enhancement mode is worse than that of the second speech enhancement mode. In this case, the computer device enhances the high-frequency signal portion (i.e., the first signal portion) of the target speech signal in the first speech enhancement mode, and enhances the low-frequency signal portion (i.e., the second signal portion) of the target speech signal in the second speech enhancement mode.
Step 239, in the case where the value of the gain count after the adjustment process is less than zero, determining that the target speech enhancement mode includes adopting the first speech enhancement mode for the target speech signal.
A value of the gain count less than zero after the adjustment process indicates that the gain of the first enhanced signal is smaller than the gain of the second enhanced signal at more of the compared frequencies than not; that is, within the frequency range corresponding to the sampling rate of the second speech enhancement mode, the noise suppression effect of the first speech enhancement mode is better than that of the second speech enhancement mode. In this case, the computer device enhances the target speech signal in the first speech enhancement mode.
It should be noted that, in the case where the value of the gain count after the adjustment process is equal to zero, the computer device may follow the processing for the case where the value is less than zero, that is, determine that the target speech enhancement mode includes adopting the first speech enhancement mode for the target speech signal; it may also follow the processing for the case where the value is greater than zero, that is, determine that the target speech enhancement mode includes adopting the first speech enhancement mode for the first signal portion and adopting the second speech enhancement mode for the second signal portion. It should be understood that both of these approaches fall within the scope of the present application.
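Steps 231 to 239 can be sketched as follows, under the assumption that the gains g1 and g2 are given per frequency bin with a known bin spacing. The function names, `bin_hz` and the returned labels are hypothetical. In this sketch equal gains leave the count unchanged, and a final count of zero is handled like the "less than zero" case; both are choices among the options the description permits.

```python
def gain_count(g1, g2, bin_hz, low_hz=600.0, high_hz=1500.0):
    """Count how often g1 exceeds g2 inside the target frequency range."""
    count = 0
    for k, (a, b) in enumerate(zip(g1, g2)):
        freq = k * bin_hz
        if not (low_hz <= freq <= high_hz):
            continue            # outside the target frequency range
        if a > b:
            count += 1          # first mode suppresses less at this frequency
        elif a < b:
            count -= 1          # first mode suppresses more at this frequency
        # equal gains: count left unchanged (one of the permitted options)
    return count


def choose_mode_by_gain_count(count):
    """Step 237 if count > 0, otherwise step 239 (zero treated as "less")."""
    return "fused" if count > 0 else "first_only"


# Toy gains for bins at 0, 300, 600, 900, 1200, 1500 and 1800 Hz.
g1 = [0.9, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
g2 = [0.1, 0.1, 0.5, 0.9, 0.3, 0.2, 0.4]
count = gain_count(g1, g2, bin_hz=300.0)
mode = choose_mode_by_gain_count(count)
```

Only the bins at 600, 900, 1200 and 1500 Hz take part in the comparison here; the bins outside 0.6 to 1.5 kHz are ignored regardless of their gains.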
In summary, according to the technical solution provided in the embodiment of the present application, after the reference enhanced signals are obtained, the correlation coefficient of the reference enhanced signals is determined, and the speech enhancement mode adopted in the actual enhancement processing is then determined according to the magnitude of that correlation coefficient. Since the correlation coefficient reflects the degree of correlation between the reference enhanced signals, signal characteristics of the speech signal, such as whether the low-frequency signal portion of the speech signal is too noisy, can be inferred from that degree of correlation. The embodiment of the present application determines the speech enhancement mode actually adopted according to the correlation coefficient of the reference enhanced signals, which fully considers the signal characteristics of the speech signal and improves the enhancement effect.
In addition, according to the technical scheme provided by the embodiment of the application, after the reference enhanced signal is obtained, the gain of the reference enhanced signal is compared on at least one frequency in a specific frequency range, the value of the gain count is adjusted according to the magnitude relation between gains, and the speech enhancement mode adopted in the actual enhancement processing is further determined according to the value of the gain count. The larger the signal gain after enhancement processing is, the worse the noise suppression effect is, and by comparing the gain of the reference enhancement signal, the noise suppression effect of each reference enhancement mode can be determined, so that a reference is provided for the computer equipment to determine the actually adopted voice enhancement mode, and the computer equipment is facilitated to select the voice enhancement mode with the better noise suppression effect.
In another example, the step 220 includes the following steps:
Step 22A, performing down-sampling processing on the target speech signal to obtain a down-sampled speech signal.
In general, a wideband enhancement mode is adopted for a low-frequency signal to achieve a better voice enhancement effect, so that the wideband enhancement mode can be firstly adopted to enhance the low-frequency signal to obtain an enhanced voice signal, and then the enhanced voice signal is analyzed to determine whether the low-frequency signal is over-noisy or not, whether the wideband enhancement mode obviously enhances the noise or not, and the like.
In the embodiment of the present application, the target speech signal is enhanced in a reference speech enhancement mode with a lower sampling rate. To make this possible, before enhancing the target speech signal in the reference speech enhancement mode, the computer device first needs to down-sample the target speech signal to reduce its sampling rate, so that the sampling rate of the down-sampled speech signal is the same as that of the reference speech enhancement mode. For example, if the sampling rate of the target speech signal is 32 kHz and the sampling rate of the reference speech enhancement mode is 16 kHz, the sampling rate of the target speech signal needs to be reduced to 16 kHz.
Step 22B, enhancing the down-sampled speech signal by adopting the reference speech enhancement mode to obtain a reference enhanced signal.
After the down-sampled voice signal is obtained, the computer device can perform enhancement processing on the down-sampled voice signal by adopting a reference voice enhancement mode to obtain a reference enhanced signal.
Based on the above steps 22A and 22B, the above step 230 includes the following steps:
Step 23A, performing pitch period estimation on the reference enhanced signal to obtain a pitch period of the reference enhanced signal.
From the pitch period of the signal, it can be determined whether the signal carries excessive noise. Thus, the computer device may first perform a pitch period estimation on the reference enhanced signal to obtain a pitch period of the reference enhanced signal. The pitch period estimation method is not limited in the embodiment of the present application, and optionally, the pitch period estimation includes any one of the following methods: time domain autocorrelation method, frequency domain transformation method.
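As an illustration of the time domain autocorrelation method, the sketch below picks the lag with the largest autocorrelation within a search range. It is a minimal, unnormalized variant written for clarity rather than robustness; the function name, the lag bounds and the synthetic test tone are assumptions introduced for this example.

```python
import math


def estimate_pitch_period(samples, min_lag, max_lag):
    """Return the lag (in samples) with the largest autocorrelation."""
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Autocorrelation at this lag: sum of products of the signal
        # with a copy of itself delayed by `lag` samples.
        value = sum(samples[n] * samples[n - lag]
                    for n in range(lag, len(samples)))
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag


# Synthetic tone: a 200 Hz sine sampled at 16 kHz, so one period spans
# 16000 / 200 = 80 samples.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
period = estimate_pitch_period(tone, min_lag=40, max_lag=120)
```

Restricting the search to a lag range matters in practice: without a lower bound, lag zero always wins, and without an upper bound, multiples of the true period can compete with it.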
Step 23B, determining the target speech enhancement mode according to the pitch period of the reference enhanced signal.
After obtaining the pitch period of the reference enhanced signal, the computer device may determine the target speech enhancement mode directly according to the pitch period, for example, by comparing the pitch period of the reference enhanced signal with a period threshold and determining the target speech enhancement mode according to the comparison result. The computer device may also determine the target speech enhancement mode indirectly, for example, by further obtaining the pitch or the pitch frequency of the reference enhanced signal from its pitch period and then determining the target speech enhancement mode according to that pitch or pitch frequency.
The following description takes as an example the case in which the computer device further processes the pitch period of the reference enhanced signal and then determines the target speech enhancement mode according to the processing result.
Optionally, the target speech signal comprises a first signal portion and a second signal portion, the frequency range of the second signal portion being the same as the frequency range of the reference enhanced signal, and the frequency range of the first signal portion being the part of the frequency range of the target speech signal other than the frequency range of the second signal portion. The step 23B includes: determining a pitch of the reference enhanced signal according to the pitch period of the reference enhanced signal; in the case where the pitch of the reference enhanced signal is greater than a second threshold, determining that the target speech enhancement mode includes adopting a first speech enhancement mode for the first signal portion and adopting a second speech enhancement mode for the second signal portion; in the case where the pitch of the reference enhanced signal is less than the second threshold, determining that the target speech enhancement mode includes adopting the first speech enhancement mode for the target speech signal.
From the pitch period of the reference enhanced signal, the computer device may further determine the pitch of the reference enhanced signal. Generally, the higher the pitch of a signal, the larger the useful signal component in the signal; the lower the pitch of a signal, the larger the noise component in the signal. Therefore, in the embodiment of the present application, a second threshold is set: if the pitch of the reference enhanced signal is greater than the second threshold, the useful signal component of the reference enhanced signal is relatively large; if the pitch of the reference enhanced signal is smaller than the second threshold, the noise component of the reference enhanced signal is relatively large. The specific value of the second threshold is not limited in the embodiment of the present application. Optionally, the value of the second threshold is 50, or 60, or 80; in practice, the value of the second threshold may be determined in view of factors such as calculation accuracy.
Because the sampling rate of the reference speech enhancement mode is less than that of the target speech signal, if the pitch of the reference enhancement signal obtained by the reference speech enhancement mode is higher, the reference speech enhancement mode achieves a better speech enhancement effect on the target speech signal, and therefore, the low-frequency signal part of the target speech signal can be enhanced by adopting the reference speech enhancement mode. Based on this, in this embodiment of the present application, when the pitch of the reference enhanced signal is greater than the second threshold, the computer device determines that the target speech enhancement mode includes applying a first speech enhancement mode to the first signal portion and applying a second speech enhancement mode to the second signal portion, where the second speech enhancement mode is the reference speech enhancement mode, and a sampling rate of the second speech enhancement mode is less than a sampling rate of the first speech enhancement mode; in the event that the pitch of the reference enhancement signal is less than a second threshold, the computer device determines the target speech enhancement mode includes employing a first speech enhancement mode on the target speech signal.
It should be noted that, in the case where the pitch of the reference enhanced signal is equal to the second threshold, the computer device may perform the processing manner as in the case where the pitch of the reference enhanced signal is smaller than the second threshold, that is, determining the target speech enhancement manner includes applying the first speech enhancement manner to the target speech signal; it is also possible to perform the processing as in the case where the pitch of the reference enhancement signal is larger than a second threshold, i.e. determining the target speech enhancement mode comprises applying a first speech enhancement mode to the first signal portion and applying a second speech enhancement mode to the second signal portion. It should be understood that both of these approaches are intended to fall within the scope of the present application.
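The threshold rule above can be sketched as follows. This is a minimal illustration, not the method of the application: it assumes a plain autocorrelation search for the pitch period, assumes that the pitch is the reciprocal of the pitch period expressed in Hz, and takes 50 as the second threshold from the optional values mentioned above. The function names, the mode labels, and the 50 to 400 Hz search range are hypothetical.

```python
import math


def estimate_pitch_period(frame, sample_rate, min_f0=50.0, max_f0=400.0):
    """Return the lag (in samples) that maximizes the autocorrelation of frame.

    Assumption: a plain autocorrelation search; the text does not say
    which pitch period estimator is used.
    """
    min_lag = int(sample_rate / max_f0)
    max_lag = min(int(sample_rate / min_f0), len(frame) - 1)
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag


def pitch_from_period(period_samples, sample_rate):
    # Assumption: the pitch is taken as the fundamental frequency in Hz.
    return sample_rate / period_samples


def choose_mode_by_pitch(pitch, second_threshold=50.0):
    # "split": first mode on the first signal portion, second mode on the second.
    # "first_only": first mode on the whole target speech signal.
    # The tie case may follow either branch; here it is grouped with "first_only".
    if pitch > second_threshold:
        return "split"
    return "first_only"


# Toy reference enhanced signal: a 200 Hz tone sampled at 16 KHz (50 ms).
sample_rate = 16000
frame = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
period = estimate_pitch_period(frame, sample_rate)
pitch = pitch_from_period(period, sample_rate)
mode = choose_mode_by_pitch(pitch)
```

Grouping the tie case with the "smaller than" branch is only one of the two options the text allows; swapping `>` for `>=` gives the other.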
In summary, according to the technical solution provided in the embodiment of the present application, after the reference enhanced signal is obtained, the pitch period of the reference enhanced signal is estimated, and then the speech enhancement mode adopted in the actual enhancement processing is determined according to the estimated pitch period. Because the signal characteristics of the reference enhancement signal, such as the magnitude relation between the noise component and the useful signal component of the reference enhancement signal, can be made clear by the pitch period, the computer equipment can determine the noise suppression effect of the reference enhancement mode, thereby providing a reference for determining the actually adopted speech enhancement mode and being beneficial to the computer equipment to effectively and accurately select the speech enhancement mode.
While the above embodiments specifically describe three schemes for determining a target speech enhancement mode according to a reference enhancement signal, it should be understood that the three schemes may be combined to determine the target speech enhancement mode in practical applications. The embodiment of the present application does not limit the combination manner and the combination order of the above three schemes; one possible combination manner and combination order is described below.
In one example, the reference speech enhancement mode includes a first speech enhancement mode and a second speech enhancement mode, and a sampling rate of the second speech enhancement mode is smaller than a sampling rate of the first speech enhancement mode.
Based on this, the step 220 includes: enhancing the target voice signal by adopting a first voice enhancement mode to obtain a first enhanced signal; performing down-sampling processing on a target voice signal to obtain a down-sampled voice signal; enhancing the downsampled voice signal by adopting a second voice enhancement mode to obtain a second enhanced signal; wherein the reference enhanced signal comprises a first enhanced signal and a second enhanced signal.
Based on this, in one example, the target speech signal comprises a first signal portion and a second signal portion, the frequency range of the second signal portion being the same as the frequency range of the second enhancement signal, the frequency range of the first signal portion being a frequency range of the target speech signal other than the frequency range of the second signal portion; the step 230 includes:
(1) extracting a third enhanced signal from the first enhanced signal according to the frequency range of the second enhanced signal, wherein the frequency range of the third enhanced signal is the same as that of the second enhanced signal; calculating a correlation coefficient of the third enhanced signal and the second enhanced signal; in the case where the correlation coefficient is greater than the first threshold, determining the target speech enhancement mode includes applying a first speech enhancement mode to the first signal portion and applying a second speech enhancement mode to the second signal portion.
(2) Under the condition that the correlation coefficient is smaller than a first threshold value, pitch period estimation is carried out on the second enhanced signal to obtain the pitch period of the second enhanced signal; determining a pitch of the second enhanced signal according to a pitch period of the second enhanced signal; in the event that the pitch of the second enhancement signal is greater than a second threshold, determining the target speech enhancement mode includes employing a first speech enhancement mode for the first signal portion and employing a second speech enhancement mode for the second signal portion.
(3) Acquiring a target frequency range under the condition that the pitch of the second enhanced signal is smaller than a second threshold, wherein the target frequency range comprises at least one frequency; for a first frequency of the at least one frequency, determining a gain of the third enhanced signal at the first frequency and a gain of the second enhanced signal at the first frequency; adjusting the value of the gain count according to the magnitude relationship between the gain of the third enhanced signal at the first frequency and the gain of the second enhanced signal at the first frequency; wherein if the gain of the third enhanced signal at the first frequency is greater than the gain of the second enhanced signal at the first frequency, adding one to the gain count; if the gain of the third enhanced signal at the first frequency is less than the gain of the second enhanced signal at the first frequency, subtracting one from the gain count; determining a target speech enhancement mode under the condition that the value of the gain count for completing the adjustment process is larger than zero, wherein the target speech enhancement mode comprises a first speech enhancement mode adopted for a first signal part and a second speech enhancement mode adopted for a second signal part; and under the condition that the value of the gain count in the adjustment process is less than zero, determining the target voice enhancement mode comprises adopting a first voice enhancement mode for the target voice signal.
For steps and terms that are not described in this example, reference may be made to the description of the above embodiments, which are not repeated herein.
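The gain comparison count in step (3) can be sketched as follows. The sketch assumes that the gains of the third and second enhanced signals at each frequency of the target frequency range have already been computed and are passed in as equal-length lists; how a gain is computed is not shown here. Leaving the count unchanged when the two gains are equal is an assumption, since the text only covers the greater-than and less-than cases. The function name and the sample values are hypothetical.

```python
def gain_comparison_count(third_gains, second_gains):
    """Count +1 where the third enhanced signal has the larger gain, -1 where smaller."""
    if len(third_gains) != len(second_gains):
        raise ValueError("both gain lists must cover the same frequencies")
    count = 0
    for third_gain, second_gain in zip(third_gains, second_gains):
        if third_gain > second_gain:
            count += 1
        elif third_gain < second_gain:
            count -= 1
        # Equal gains: count left unchanged (assumption, not stated in the text).
    return count


# Made-up gains at four frequencies of the target frequency range.
third_gains = [0.9, 0.5, 0.7, 0.4]
second_gains = [0.6, 0.6, 0.2, 0.4]
count = gain_comparison_count(third_gains, second_gains)
```

A count greater than zero selects the first speech enhancement mode for the first signal portion and the second speech enhancement mode for the second signal portion; a count less than zero selects the first speech enhancement mode for the whole target speech signal.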
To sum up, according to the technical scheme provided by the embodiment of the application, after the voice signal is obtained, the voice signal is enhanced by adopting a reference voice enhancement mode to obtain a reference enhancement signal, and then the voice enhancement mode adopted in the actual enhancement processing is further determined by combining multiple modes based on the reference enhancement signal, so that the voice enhancement mode actually adopted is determined from multiple dimensions, and the accuracy of the determination of the voice enhancement mode actually adopted is further improved.
Hereinafter, the technical solution of the present application will be described by taking as an example a case where the reference enhancement mode includes a GRU with a sampling rate of 32KHz (hereinafter, abbreviated as "GRU with 32 KHz") and an LSTM with a sampling rate of 16KHz (hereinafter, abbreviated as "LSTM with 16 KHz"), and the target voice signal is a voice signal with a sampling rate of 32KHz (hereinafter, abbreviated as "voice signal with 32 KHz"). Referring to fig. 3, a schematic diagram of a method for enhancing a speech signal according to an embodiment of the present application is shown, where the method includes the following steps:
after the computer equipment acquires the 32KHz voice signal, on one hand, the 32KHz voice signal is enhanced by adopting the 32KHz GRU to obtain a 32KHz enhanced signal; on the other hand, the voice signal of 32KHz is subjected to down-sampling processing to obtain a down-sampled voice signal, and then the down-sampled voice signal is subjected to enhancement processing by adopting LSTM of 16KHz to obtain an enhanced signal of 16 KHz.
Wherein, the bandwidth of the enhanced signal of 32KHz is 0 to 16KHz, and the bandwidth of the enhanced signal of 16KHz is 0 to 8 KHz. The computer device calculates the correlation coefficient corr of the signal part with the bandwidth of 0 to 8KHz in the enhanced signal of 32KHz and the enhanced signal of 16KHz, namely, performs the signal cross-correlation calculation.
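One way to picture this comparison is sketched below. It is only an illustration under stated assumptions: the text does not say how the 0 to 8 KHz part is extracted or in which domain the correlation is computed. Here a naive DFT gives the magnitude spectrum of one frame of each enhanced signal, the spectrum of the 32 KHz frame is truncated to the bins at or below 8 KHz, and a Pearson correlation coefficient is computed between the two spectra. All function names, the frame length, and the test tones are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0 .. len(frame) // 2 (adequate for short frames)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def low_band_correlation(enhanced_32k, enhanced_16k):
    """Correlate the 0-8 KHz spectrum of the 32 KHz frame with the 16 KHz frame.

    Assumes both frames cover the same time span, so the 32 KHz frame has
    twice as many samples and the bins of the two spectra line up.
    """
    if len(enhanced_32k) != 2 * len(enhanced_16k):
        raise ValueError("frames must cover the same time span")
    spec_16k = magnitude_spectrum(enhanced_16k)
    spec_32k_low = magnitude_spectrum(enhanced_32k)[: len(spec_16k)]
    return pearson(spec_32k_low, spec_16k)


# Toy frames: a 1 KHz tone in both signals, plus a 12 KHz tone only at 32 KHz.
enhanced_32k = [
    math.sin(2 * math.pi * 1000 * n / 32000) + 0.5 * math.sin(2 * math.pi * 12000 * n / 32000)
    for n in range(64)
]
enhanced_16k = [math.sin(2 * math.pi * 1000 * m / 16000) for m in range(32)]
corr = low_band_correlation(enhanced_32k, enhanced_16k)
```

Because the Pearson coefficient is invariant to scale, the different frame lengths of the two spectra do not need to be normalized before the comparison.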
As shown in fig. 3, in the case that the calculated correlation coefficient corr is greater than or equal to 0.05, the computer device determines that the voice enhancement mode actually adopted for the 32KHz voice signal includes: LSTM at 16KHz is used for the signal part with the bandwidth of 0 to 8KHz in the voice signal at 32KHz, and GRU at 32KHz is used for the signal part with the bandwidth of 8 to 16KHz in the voice signal at 32 KHz.
Under the condition that the calculated correlation coefficient corr is less than 0.05, the computer equipment further carries out pitch period estimation on the 16KHz enhanced signal and further processes the estimated pitch period to obtain the pitch of the 16KHz enhanced signal. As shown in fig. 3, in the case that the pitch is greater than 50, the computer device determines that the voice enhancement mode actually adopted for the 32KHz voice signal includes: LSTM at 16KHz is used for the signal part with the bandwidth of 0 to 8KHz in the voice signal at 32KHz, and GRU at 32KHz is used for the signal part with the bandwidth of 8 to 16KHz in the voice signal at 32 KHz.
As shown in fig. 3, in the case that the pitch is less than or equal to 50, the computer device further compares the gain of the enhanced signal of 32KHz with the gain of the enhanced signal of 16KHz, and adjusts the value of the gain count, that is, the computer device performs the gain comparison count process. For the detailed process of the gain comparison count processing, please refer to the description of the above embodiments, which is not repeated herein. As shown in fig. 3, in the case that the value of the gain count is greater than 0 after the adjustment process is completed, the computer device determines that the voice enhancement mode actually adopted for the 32KHz voice signal includes: LSTM at 16KHz is used for the signal part with the bandwidth of 0 to 8KHz in the voice signal at 32KHz, and GRU at 32KHz is used for the signal part with the bandwidth of 8 to 16KHz in the voice signal at 32 KHz. As shown in fig. 3, in the case that the value of the gain count after the adjustment process is less than or equal to 0, the computer device determines that the voice enhancement mode actually adopted for the 32KHz voice signal includes: a GRU of 32KHz is used for the voice signal of 32 KHz.
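The decision flow of fig. 3 described above can be summarized in a short sketch. The thresholds 0.05, 50, and 0 and the handling of the boundary cases follow this example; the function name, the mode labels, and the use of callables to defer the pitch and gain count computations until they are needed are assumptions made for illustration.

```python
SPLIT = "LSTM 16KHz on 0-8KHz, GRU 32KHz on 8-16KHz"
GRU_ONLY = "GRU 32KHz on the whole 32KHz signal"


def select_enhancement_mode(corr, get_pitch, get_gain_count,
                            corr_threshold=0.05, pitch_threshold=50):
    """Apply the three checks in order; later checks run only if earlier ones fail."""
    # Check 1: correlation of the 0-8KHz part of the 32KHz enhanced signal
    # with the 16KHz enhanced signal.
    if corr >= corr_threshold:
        return SPLIT
    # Check 2: pitch of the 16KHz enhanced signal.
    if get_pitch() > pitch_threshold:
        return SPLIT
    # Check 3: gain comparison count.
    if get_gain_count() > 0:
        return SPLIT
    return GRU_ONLY


calls = []


def fake_pitch():
    # Hypothetical stand-in for pitch estimation on the 16KHz enhanced signal.
    calls.append("pitch")
    return 42


def fake_gain_count():
    # Hypothetical stand-in for the gain comparison count process.
    calls.append("gain")
    return 3


mode_high_corr = select_enhancement_mode(0.20, fake_pitch, fake_gain_count)
calls_after_first = list(calls)
mode_low_corr = select_enhancement_mode(0.01, fake_pitch, fake_gain_count)
```

Passing the pitch and gain count as callables mirrors the order of the flow: the pitch is estimated only when the correlation check fails, and the gain comparison runs only when the pitch check also fails.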
Please refer to fig. 4, which illustrates a schematic diagram of a speech enhancement effect provided by an embodiment of the present application. Fig. 4(a) is an enhanced signal obtained by enhancing a voice signal using 16KHz LSTM, fig. 4(b) is an enhanced signal obtained by enhancing a voice signal using 32KHz GRU, and as can be seen from fig. 4(a) and fig. 4(b), a high-frequency signal portion of a voice signal cannot be effectively enhanced using 16KHz LSTM, and a low-frequency signal portion of a voice signal cannot be accurately and effectively enhanced using 32KHz GRU. Fig. 4(c) is an enhanced signal obtained by performing enhancement processing on a speech signal by using the technical solution provided by the embodiment of the present application, and comparing fig. 4(c) with fig. 4(a) and fig. 4(b), it can be seen that the speech signal can be accurately and effectively enhanced by using the technical solution provided by the embodiment of the present application.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 5, a block diagram of an apparatus for enhancing a speech signal according to an embodiment of the present application is shown. The device has the function of implementing the above-mentioned enhancement processing method example of the voice signal, and the function can be implemented by hardware, and also can be implemented by hardware executing corresponding software. The apparatus may be the computer device described above, or may be provided in a computer device. The apparatus 500 may comprise: a speech signal acquisition module 510, a reference signal determination module 520, an enhancement mode determination module 530, and a speech signal enhancement module 540.
The voice signal obtaining module 510 is configured to obtain a target voice signal.
A reference signal determining module 520, configured to perform enhancement processing on the target speech signal in a reference speech enhancement manner, so as to obtain a reference enhanced signal.
An enhancement mode determining module 530, configured to determine a target speech enhancement mode according to the reference enhancement signal.
And the voice signal enhancement module 540 is configured to perform enhancement processing on the target voice signal in the target voice enhancement mode.
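The cooperation of the four modules can be sketched as a small pipeline. This is a structural illustration only: the class name, the callables passed to it, and the toy stand-ins used below are hypothetical and do not represent any actual enhancement model.

```python
class SpeechEnhancementApparatus:
    """Wires four callables in the order of modules 510, 520, 530 and 540."""

    def __init__(self, acquire, reference_enhance, determine_mode, enhance):
        self.acquire = acquire                      # role of module 510
        self.reference_enhance = reference_enhance  # role of module 520
        self.determine_mode = determine_mode        # role of module 530
        self.enhance = enhance                      # role of module 540

    def run(self):
        target = self.acquire()
        reference = self.reference_enhance(target)
        mode = self.determine_mode(reference)
        return self.enhance(target, mode)


# Toy stand-ins (assumptions): a fixed signal, a halving "enhancer",
# a peak-based mode rule, and an enhancer that only reports the chosen mode.
apparatus = SpeechEnhancementApparatus(
    acquire=lambda: [0.5, -0.5, 0.25],
    reference_enhance=lambda signal: [0.5 * x for x in signal],
    determine_mode=lambda ref: "split" if max(abs(x) for x in ref) > 0.2 else "first_only",
    enhance=lambda signal, mode: (mode, list(signal)),
)
result = apparatus.run()
```

The point of the sketch is the data flow: the mode is chosen from the reference enhanced signal, while the final enhancement is applied to the original target signal.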
In one example, the reference speech enhancement mode comprises a first speech enhancement mode and a second speech enhancement mode, wherein the sampling rate of the second speech enhancement mode is smaller than that of the first speech enhancement mode; the reference signal determination module 520 is configured to: enhancing the target voice signal by adopting the first voice enhancement mode to obtain a first enhanced signal; performing down-sampling processing on the target voice signal to obtain a down-sampled voice signal; enhancing the downsampled voice signal by adopting the second voice enhancement mode to obtain a second enhanced signal; wherein the reference enhanced signal comprises the first enhanced signal and the second enhanced signal.
In one example, as shown in fig. 6, the enhancement mode determining module 530 includes: a reference signal extracting unit 532, configured to extract a third enhanced signal from the first enhanced signal according to a frequency range of the second enhanced signal, where the frequency range of the third enhanced signal is the same as the frequency range of the second enhanced signal; an enhancement mode determining unit 534, configured to determine the target speech enhancement mode according to the third enhancement signal and the second enhancement signal.
In one example, the target speech signal comprises a first signal portion and a second signal portion, the second signal portion having a frequency range that is the same as the frequency range of the second enhancement signal, the first signal portion having a frequency range that is other than the frequency range of the second signal portion in the frequency range of the target speech signal; as shown in fig. 6, the enhancement mode determining unit 534 is configured to: calculating a correlation coefficient of the third enhanced signal and the second enhanced signal; determining that the target speech enhancement mode comprises applying the first speech enhancement mode to the first signal portion and applying the second speech enhancement mode to the second signal portion if the correlation coefficient is greater than a first threshold; determining the target speech enhancement mode comprises applying the first speech enhancement mode to the target speech signal if the correlation coefficient is less than a first threshold.
In one example, the target speech signal comprises a first signal portion and a second signal portion, the second signal portion having a frequency range that is the same as the frequency range of the second enhancement signal, the first signal portion having a frequency range that is other than the frequency range of the second signal portion in the frequency range of the target speech signal; as shown in fig. 6, the enhancement mode determining module 530 includes: a frequency range module unit 531, configured to obtain a target frequency range, where the target frequency range includes at least one frequency; a signal gain determining unit 533, configured to determine, for a first frequency of the at least one frequency, a gain of the first enhanced signal at the first frequency and a gain of the second enhanced signal at the first frequency; a gain count adjusting unit 535, configured to adjust a value of a gain count according to a magnitude relationship between a gain of the first enhancement signal at the first frequency and a gain of the second enhancement signal at the first frequency; wherein if the gain of the first enhanced signal at the first frequency is greater than the gain of the second enhanced signal at the first frequency, adding one to the gain count; decrementing the gain count if the gain of the first enhanced signal at the first frequency is less than the gain of the second enhanced signal at the first frequency; an enhancement mode determining unit 537, configured to determine, when a value of the gain count in the adjustment process is greater than zero, that the target speech enhancement mode includes adopting the first speech enhancement mode for the first signal portion and adopting the second speech enhancement mode for the second signal portion; the enhancement mode determining unit 537 is further configured to determine that the target speech enhancement mode includes adopting the first speech enhancement mode for the target speech signal when the value of the gain count in the adjustment process is less than zero.
In one example, the sampling rate of the second speech enhancement mode is one-half of the sampling rate of the first speech enhancement mode.
In one example, the first speech enhancement mode comprises speech enhancement based on a GRU; the second speech enhancement mode comprises speech enhancement based on LSTM.
In one example, the reference signal determination module 520 is configured to: performing down-sampling processing on the target voice signal to obtain a down-sampled voice signal; and enhancing the downsampled voice signal by adopting the reference voice enhancement mode to obtain the reference enhanced signal.
In one example, as shown in fig. 6, the enhancement mode determining module 530 includes: a pitch period determining unit 53A, configured to perform pitch period estimation on the reference enhanced signal to obtain a pitch period of the reference enhanced signal; an enhancement mode determining unit 53B, configured to determine the target speech enhancement mode according to the pitch period of the reference enhanced signal.
In one example, the target speech signal comprises a first signal portion and a second signal portion, the second signal portion having a frequency range that is the same as the frequency range of the reference enhancement signal, the first signal portion having a frequency range that is other than the frequency range of the second signal portion in the frequency range of the target speech signal; as shown in fig. 6, the enhancement mode determining unit 53B is configured to: determining a pitch of the reference enhanced signal according to a pitch period of the reference enhanced signal; in the case that the pitch of the reference enhancement signal is greater than a second threshold, determining the target speech enhancement mode comprises applying a first speech enhancement mode to the first signal portion and applying a second speech enhancement mode to the second signal portion; wherein the second speech enhancement mode is the reference speech enhancement mode, and the sampling rate of the second speech enhancement mode is smaller than that of the first speech enhancement mode; in the event that the pitch of the reference enhancement signal is less than a second threshold, determining the target speech enhancement mode comprises employing a first speech enhancement mode on the target speech signal.
In one example, the reference speech enhancement mode comprises a first speech enhancement mode and a second speech enhancement mode, wherein the sampling rate of the second speech enhancement mode is smaller than that of the first speech enhancement mode; the reference signal determination module 520 is configured to: enhancing the target voice signal by adopting a first voice enhancement mode to obtain a first enhanced signal; performing down-sampling processing on the target voice signal to obtain a down-sampled voice signal; enhancing the downsampled voice signal by adopting a second voice enhancement mode to obtain a second enhanced signal; wherein the reference enhanced signal comprises the first enhanced signal and the second enhanced signal. The target speech signal comprises a first signal portion and a second signal portion, the frequency range of the second signal portion being the same as the frequency range of the second enhancement signal, the frequency range of the first signal portion being a frequency range of the target speech signal other than the frequency range of the second signal portion; the enhancement mode determining module 530 is configured to: extracting a third enhanced signal from the first enhanced signal according to the frequency range of the second enhanced signal, wherein the frequency range of the third enhanced signal is the same as the frequency range of the second enhanced signal; calculating a correlation coefficient of the third enhanced signal and the second enhanced signal; determining that the target speech enhancement mode comprises applying the first speech enhancement mode to the first signal portion and applying the second speech enhancement mode to the second signal portion if the correlation coefficient is greater than a first threshold; under the condition that the correlation coefficient is smaller than a first threshold value, pitch period estimation is carried out on the second enhanced signal to obtain the pitch period of the second enhanced signal; determining a pitch of the second enhanced signal according to a pitch period of the second enhanced signal; in the event that the pitch of the second enhancement signal is greater than a second threshold, determining the target speech enhancement mode comprises employing the first speech enhancement mode for the first signal portion and employing the second speech enhancement mode for the second signal portion; obtaining a target frequency range in the case that a pitch of the second enhancement signal is less than a second threshold, the target frequency range including at least one frequency; for a first frequency of the at least one frequency, determining a gain of the third enhanced signal at the first frequency and a gain of the second enhanced signal at the first frequency; adjusting the value of a gain count according to the magnitude relationship between the gain of the third enhanced signal at the first frequency and the gain of the second enhanced signal at the first frequency; wherein if the gain of the third enhanced signal at the first frequency is greater than the gain of the second enhanced signal at the first frequency, adding one to the gain count; decrementing the gain count if the gain of the third enhanced signal at the first frequency is less than the gain of the second enhanced signal at the first frequency; determining that the target speech enhancement mode includes employing the first speech enhancement mode for the first signal portion and employing the second speech enhancement mode for the second signal portion, if a value of a gain count for completing an adjustment process is greater than zero; and under the condition that the value of the gain count in the adjustment process is less than zero, determining the target voice enhancement mode comprises adopting the first voice enhancement mode for the target voice signal.
In one example, the target speech signal comprises an ultra-wideband speech signal.
To sum up, according to the technical scheme provided by the embodiment of the application, after the voice signal is obtained, the voice signal is enhanced by using a reference voice enhancement mode to obtain a reference enhancement signal, the voice enhancement mode used in the actual enhancement processing is further determined based on the reference enhancement signal, and then the determined voice enhancement mode is used to enhance the voice signal. The reference enhanced signal can reflect the signal characteristics of the initially acquired voice signal, such as whether the voice signal contains a large amount of noise, so that the actually adopted voice enhancement mode can be determined in a targeted manner according to the reference enhanced signal and the signal characteristics of the voice signal. Compared with the prior art, which adopts a fixed voice enhancement mode and therefore cannot distinguish and handle the different conditions of voice signals, the technical scheme provided by the embodiment of the application fully considers the signal characteristics of the voice signals in the voice signal enhancement process, helps to accurately and effectively enhance the voice signals, and improves the enhancement effect of the voice signals.
It should be noted that, when the apparatus provided in the foregoing embodiment implements the functions thereof, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus and method embodiments provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
Referring to fig. 7, a block diagram of a computer device according to an embodiment of the present application is shown. The computer device may be a terminal or a server. Specifically:
The computer device 700 includes a Central Processing Unit (CPU) 701, a system Memory 704 including a Random Access Memory (RAM) 702 and a Read Only Memory (ROM) 703, and a system bus 705 connecting the system Memory 704 and the CPU 701. The computer device 700 also includes a basic Input/Output system (I/O) 706, which facilitates transfer of information between devices within the computer, and a mass storage device 707 for storing an operating system 713, application programs 714, and other program modules 715.
The basic input/output system 706 includes a display 708 for displaying information and an input device 709, such as a mouse or a keyboard, for a user to input information. The display 708 and the input device 709 are both connected to the central processing unit 701 through an input/output controller 710 connected to the system bus 705. The basic input/output system 706 may also include the input/output controller 710 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 710 may also provide output to a display screen, a printer, or other type of output device.
The mass storage device 707 is connected to the central processing unit 701 through a mass storage controller (not shown) connected to the system bus 705. The mass storage device 707 and its associated computer-readable media provide non-volatile storage for the computer device 700. That is, the mass storage device 707 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM (Compact disk Read-Only Memory) drive.
Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash Memory or other solid state Memory technology, CD-ROM, DVD (Digital Video Disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 704 and mass storage device 707 described above may be collectively referred to as memory.
According to various embodiments of the present application, the computer device 700 may also operate by connecting, through a network such as the Internet, to a remote computer on the network. That is, the computer device 700 may be connected to the network 712 through the network interface unit 711 connected to the system bus 705, or may be connected to other types of networks or remote computer systems (not shown) using the network interface unit 711.
The memory also includes a computer program stored in the memory and configured to be executed by the one or more processors to implement the method of enhanced processing of speech signals described above.
In an exemplary embodiment, there is also provided a computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which when executed by a processor, implements the above-described method of enhanced processing of a speech signal.
Optionally, the computer-readable storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a Solid State Drive (SSD), or an optical disc. The Random Access Memory may include a resistive Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM).
In an exemplary embodiment, a computer program product or computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to enable the computer device to execute the method for enhancing the voice signal.
It should be understood that reference to "a plurality" herein means two or more. "And/or" describes the association relationship of the associated objects and means that there may be three relationships; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. In addition, the step numbers described herein only exemplarily show one possible execution sequence among the steps; in some other embodiments, the steps may also be executed out of the numbering sequence, for example, two steps with different numbers are executed simultaneously, or two steps with different numbers are executed in a reverse order to the order shown in the figure, which is not limited by the embodiment of the present application.
The above description is only exemplary of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the present application should be included in the protection scope of the present application.