Voice signal processing method, device, storage medium and equipment

Document No.: 9805  Published: 2021-09-17

1. A speech signal processing method, comprising:

acquiring an original voice signal to be processed;

separating the original voice signal to obtain an effective voice signal in the original voice signal;

extracting the characteristics of the original voice signal to obtain the characteristic information of the original voice signal, and generating an enhancement coefficient of the effective voice signal according to the characteristic information of the original voice signal;

and according to the enhancement coefficient of the effective voice signal and the original voice signal, carrying out enhancement processing on the effective voice signal to obtain an enhanced target voice signal.

2. The method according to claim 1, wherein said extracting features of the original speech signal to obtain feature information of the original speech signal, and generating enhancement coefficients of the effective speech signal according to the feature information of the original speech signal comprises:

dividing the original voice signal to obtain at least two original voice signal segments, and dividing the effective voice signal to obtain at least two effective voice signal segments, wherein one original voice signal segment corresponds to one effective voice signal segment;

performing feature extraction on each original voice signal segment in the at least two original voice signal segments to obtain feature information of each original voice signal segment;

generating an enhancement coefficient of a corresponding effective voice signal segment in the at least two effective voice signal segments according to the characteristic information of each original voice signal segment;

and taking the enhancement coefficient corresponding to each effective speech signal segment in the at least two effective speech signal segments as the enhancement coefficient of the effective speech signal.

3. The method according to claim 2, wherein said generating enhancement coefficients for a corresponding valid speech signal segment of the at least two valid speech signal segments according to the feature information of each original speech signal segment comprises:

determining the data volume ratio of the effective voice signals included in each original voice signal segment according to the characteristic information of each original voice signal segment;

and generating the enhancement coefficient of the corresponding effective voice signal segment in the at least two effective voice signal segments by adopting the data volume ratio.

4. The method according to claim 3, wherein the determining the data volume ratio of the valid speech signal included in each original speech signal segment according to the feature information of each original speech signal segment comprises:

determining the data volume of the effective voice signal included in each original voice signal segment according to the characteristic information of each original voice signal segment;

acquiring the total data volume of the original voice signal;

and acquiring the ratio of the data quantity of the effective voice signals included in each original voice signal segment to the total data quantity of the original voice signals to obtain the data quantity ratio of the effective voice signals included in each original voice signal segment.

5. The method of claim 3, wherein the at least two original speech signal segments include a target original speech signal segment, and wherein the at least two valid speech signal segments include a target valid speech signal segment corresponding to the target original speech signal segment;

the enhancing the effective speech signal according to the enhancement coefficient of the effective speech signal and the original speech signal to obtain an enhanced target speech signal, including:

if the enhancement coefficient of the target effective voice signal segment is larger than a first enhancement coefficient threshold value and smaller than a second enhancement coefficient threshold value, extracting an original voice signal sub-segment of a target data volume from the target original voice signal segment, and carrying out fusion processing on the original voice signal sub-segment and the target effective voice signal segment to obtain an enhanced target effective voice signal segment; the target data volume is determined according to the data volume proportion of the effective voice information included in the target original voice signal segment, and the first enhancement coefficient threshold is smaller than the second enhancement coefficient threshold;

if the enhancement coefficient of the target effective voice signal segment is larger than or equal to the second enhancement coefficient threshold value, taking the target original voice signal segment as an enhanced target effective voice signal segment;

and splicing the enhanced target effective voice signal segments to obtain an enhanced target voice signal.

6. The method according to claim 5, wherein said generating enhancement coefficients for corresponding ones of said at least two valid speech signal segments using said data volume fraction comprises:

if the data volume ratio corresponding to the target original voice signal segment is larger than a first data volume ratio threshold and smaller than a second data volume ratio threshold, determining a first enhancement coefficient as the enhancement coefficient of the target effective voice signal segment; the first enhancement coefficient is greater than the first enhancement coefficient threshold and less than the second enhancement coefficient threshold;

if the data volume ratio corresponding to the target original voice signal segment is larger than the second data volume ratio threshold, determining a second enhancement coefficient as the enhancement coefficient of the target effective voice signal segment; the second enhancement coefficient is greater than or equal to the second enhancement coefficient threshold.

7. The method of claim 1, wherein the separating the original speech signal to obtain a valid speech signal in the original speech signal comprises:

according to the characteristic information of the original voice signal, performing mask processing on the original voice signal to obtain a mask matrix corresponding to the original voice signal;

and separating the effective voice signal from the original voice signal according to the mask matrix corresponding to the original voice signal.

8. A speech signal processing apparatus, comprising:

the acquisition module is used for acquiring an original voice signal to be processed;

the separation processing module is used for separating the original voice signal to obtain an effective voice signal in the original voice signal;

the generating module is used for extracting the characteristics of the original voice signal to obtain the characteristic information of the original voice signal and generating the enhancement coefficient of the effective voice signal according to the characteristic information of the original voice signal;

and the enhancement processing module is used for enhancing the effective voice signal according to the enhancement coefficient of the effective voice signal and the original voice signal to obtain an enhanced target voice signal.

9. A computer device, comprising: a processor and a memory;

wherein the memory is configured to store program code and the processor is configured to invoke the program code to perform the method of any of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, perform the steps of the method according to any one of claims 1 to 7.

Background

Voice separation is a technique for separating an effective voice signal from a voice signal so as to filter out background interference signals, where the effective voice signal is the part of the signal that has utilization value. For example, in a conference, the effective voice signal may be the speech of a main participant, which helps the user understand the main content of the conference; in a concert, the effective voice signal may be the singing voice of a singer, which helps provide a good listening experience for the user. Voice separation algorithms therefore have great practical value.

At present, voice signals are mainly separated with time domain processing methods. These methods can separate part of the effective voice signal, but background interference signals remain in the separated effective voice signal, and part of the effective voice signal is lost.

Disclosure of Invention

The technical problem to be solved by the embodiments of the present application is to provide a method, an apparatus, a storage medium, and a device for processing a voice signal, which can effectively avoid loss of an effective voice signal and improve a signal-to-noise ratio of the effective voice signal.

An aspect of the embodiments of the present application provides a speech signal processing method, including:

acquiring an original voice signal to be processed;

separating the original voice signal to obtain an effective voice signal in the original voice signal;

extracting the characteristics of the original voice signal to obtain the characteristic information of the original voice signal, and generating an enhancement coefficient of the effective voice signal according to the characteristic information of the original voice signal;

and according to the enhancement coefficient of the effective voice signal and the original voice signal, carrying out enhancement processing on the effective voice signal to obtain an enhanced target voice signal.

Wherein, the extracting the characteristics of the original voice signal to obtain the characteristic information of the original voice signal, and generating the enhancement coefficient of the effective voice signal according to the characteristic information of the original voice signal includes:

dividing the original voice signal to obtain at least two original voice signal segments, and dividing the effective voice signal to obtain at least two effective voice signal segments, wherein one original voice signal segment corresponds to one effective voice signal segment;

performing feature extraction on each original voice signal segment in the at least two original voice signal segments to obtain feature information of each original voice signal segment;

generating an enhancement coefficient of a corresponding effective voice signal segment in the at least two effective voice signal segments according to the characteristic information of each original voice signal segment;

and taking the enhancement coefficient corresponding to each effective speech signal segment in the at least two effective speech signal segments as the enhancement coefficient of the effective speech signal.
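The segment division described above can be sketched as follows. This is a minimal illustration, not the method of the application: the function names, the fixed segment length, and the requirement that both signals have the same number of samples are assumptions introduced for the example.

```python
from typing import List, Sequence, Tuple


def split_into_segments(signal: Sequence[float], segment_len: int) -> List[List[float]]:
    """Split a signal into consecutive segments; the last one may be shorter."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [list(signal[i:i + segment_len]) for i in range(0, len(signal), segment_len)]


def paired_segments(
    original: Sequence[float], valid: Sequence[float], segment_len: int
) -> List[Tuple[List[float], List[float]]]:
    """Pair each original segment with the valid-speech segment at the same position."""
    # Assumption: the separated (valid) signal is sample-aligned with the original.
    if len(original) != len(valid):
        raise ValueError("original and valid signals must have the same length")
    return list(
        zip(split_into_segments(original, segment_len),
            split_into_segments(valid, segment_len))
    )


pairs = paired_segments([1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                        [0.5, 1.0, 1.5, 2.0, 2.5, 3.0], segment_len=3)
```

Dividing both signals at the same boundaries is one simple way to guarantee that each original segment corresponds to exactly one effective segment.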

Wherein, the generating the enhancement coefficient of the corresponding effective speech signal segment in the at least two effective speech signal segments according to the feature information of each original speech signal segment includes:

determining the data volume ratio of the effective voice signals included in each original voice signal segment according to the characteristic information of each original voice signal segment;

and generating the enhancement coefficient of the corresponding effective voice signal segment in the at least two effective voice signal segments by adopting the data volume ratio.

Wherein, the determining the data volume ratio of the effective voice signal included in each original voice signal segment according to the characteristic information of each original voice signal segment includes:

determining the data volume of the effective voice signal included in each original voice signal segment according to the characteristic information of each original voice signal segment;

acquiring the total data volume of the original voice signal;

and acquiring the ratio of the data quantity of the effective voice signals included in each original voice signal segment to the total data quantity of the original voice signals to obtain the data quantity ratio of the effective voice signals included in each original voice signal segment.
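As a sketch of the ratio computation above: the text does not say how the data volume of effective voice is derived from the feature information, so the example assumes a hypothetical per-sample activity flag (True where a sample is judged to belong to the effective voice signal) as a stand-in for that feature information.

```python
from typing import List


def valid_data_ratios(activity_per_segment: List[List[bool]]) -> List[float]:
    """Ratio of valid-speech samples in each segment to the total number of
    samples in the whole original signal."""
    # Total data volume of the original signal = samples across all segments.
    total = sum(len(seg) for seg in activity_per_segment)
    if total == 0:
        raise ValueError("the original signal contains no samples")
    # sum() over booleans counts the True flags, i.e. the valid samples.
    return [sum(seg) / total for seg in activity_per_segment]


ratios = valid_data_ratios([
    [True, True, False, False],   # 2 valid samples out of 8 in total
    [True, True, True, True],     # 4 valid samples out of 8 in total
])
```

Note that the denominator is the total data volume of the original signal, as stated in the text, not the length of the individual segment.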

The at least two original voice signal segments comprise a target original voice signal segment, and the at least two effective voice signal segments comprise a target effective voice signal segment corresponding to the target original voice signal segment;

the enhancing the effective speech signal according to the enhancement coefficient of the effective speech signal and the original speech signal to obtain an enhanced target speech signal, including:

if the enhancement coefficient of the target effective voice signal segment is larger than a first enhancement coefficient threshold value and smaller than a second enhancement coefficient threshold value, extracting an original voice signal sub-segment of a target data volume from the target original voice signal segment, and carrying out fusion processing on the original voice signal sub-segment and the target effective voice signal segment to obtain an enhanced target effective voice signal segment; the target data volume is determined according to the data volume proportion of the effective voice information included in the target original voice signal segment, and the first enhancement coefficient threshold is smaller than the second enhancement coefficient threshold;

if the enhancement coefficient of the target effective voice signal segment is larger than or equal to the second enhancement coefficient threshold value, taking the target original voice signal segment as an enhanced target effective voice signal segment;

and splicing the enhanced target effective voice signal segments to obtain an enhanced target voice signal.
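The threshold-based enhancement and splicing steps can be illustrated with the sketch below. Several details are assumptions, because the text leaves them open: the threshold values, the fusion rule (averaging the original and separated samples over the extracted sub-segment), taking the sub-segment from the start of the segment, and leaving a segment unchanged when its coefficient does not exceed the first threshold.

```python
from typing import List, Sequence, Tuple


def enhance_segment(original_seg: Sequence[float], valid_seg: Sequence[float],
                    coeff: float, fraction: float,
                    t1: float = 0.3, t2: float = 0.7) -> List[float]:
    """Enhance one valid-speech segment using its enhancement coefficient."""
    if coeff >= t2:
        # Coefficient at or above the second threshold: use the original segment.
        return list(original_seg)
    if t1 < coeff < t2:
        # Target data volume: a share of the segment given by `fraction`.
        n = min(len(original_seg), int(round(fraction * len(original_seg))))
        fused = list(valid_seg)
        for i in range(n):
            # Hypothetical fusion rule: average original and separated samples.
            fused[i] = 0.5 * (original_seg[i] + valid_seg[i])
        return fused
    # Assumption: coefficients at or below the first threshold leave the segment as is.
    return list(valid_seg)


def enhance_signal(pairs: Sequence[Tuple[Sequence[float], Sequence[float]]],
                   coeffs: Sequence[float], fractions: Sequence[float]) -> List[float]:
    """Enhance every segment pair and splice the results in order."""
    out: List[float] = []
    for (orig, valid), c, f in zip(pairs, coeffs, fractions):
        out.extend(enhance_segment(orig, valid, c, f))
    return out


target = enhance_signal(
    [([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]), ([2.0, 2.0], [0.0, 0.0])],
    coeffs=[0.5, 1.0], fractions=[0.5, 1.0],
)
```

In this example the first segment is fused over half its length and the second is replaced by its original segment, so the spliced result is `[0.5, 0.5, 0.0, 0.0, 2.0, 2.0]`.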

Wherein, the generating the enhancement coefficient of the corresponding effective speech signal segment in the at least two effective speech signal segments by adopting the data volume ratio comprises:

if the data volume ratio corresponding to the target original voice signal segment is larger than a first data volume ratio threshold and smaller than a second data volume ratio threshold, determining a first enhancement coefficient as the enhancement coefficient of the target effective voice signal segment; the first enhancement coefficient is greater than the first enhancement coefficient threshold and less than the second enhancement coefficient threshold;

if the data volume ratio corresponding to the target original voice signal segment is larger than the second data volume ratio threshold, determining a second enhancement coefficient as the enhancement coefficient of the target effective voice signal segment; the second enhancement coefficient is greater than or equal to the second enhancement coefficient threshold.
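The mapping from data volume ratio to enhancement coefficient can be sketched as below. All numeric values are hypothetical. The text does not cover a ratio at or below the first ratio threshold, or a ratio exactly equal to the second ratio threshold; the sketch assumes a low coefficient of 0.0 for the former and treats the latter like the upper case.

```python
def enhancement_coefficient(ratio: float,
                            ratio_lo: float = 0.1, ratio_hi: float = 0.3,
                            first_coeff: float = 0.5, second_coeff: float = 1.0,
                            low_coeff: float = 0.0) -> float:
    """Map a segment's valid-data ratio to an enhancement coefficient.

    With hypothetical coefficient thresholds 0.3 and 0.7, first_coeff (0.5)
    lies strictly between them and second_coeff (1.0) is >= the second one.
    """
    if ratio_lo < ratio < ratio_hi:
        return first_coeff
    if ratio >= ratio_hi:  # assumption: equality is handled as the upper case
        return second_coeff
    return low_coeff       # assumption: case not specified in the text


coeffs = [enhancement_coefficient(r) for r in (0.05, 0.2, 0.5)]
```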

Wherein, the separating the original voice signal to obtain an effective voice signal in the original voice signal includes:

according to the characteristic information of the original voice signal, performing mask processing on the original voice signal to obtain a mask matrix corresponding to the original voice signal;

and separating the effective voice signal from the original voice signal according to the mask matrix corresponding to the original voice signal.
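A common way to separate a signal with a mask matrix is to multiply a feature representation of the signal element-wise by a mask of the same shape. The sketch below shows only that step and assumes the mask has already been estimated, with values between 0 and 1; the application does not confirm that this is the exact operation it uses.

```python
from typing import List

Matrix = List[List[float]]


def apply_mask(features: Matrix, mask: Matrix) -> Matrix:
    """Element-wise product of a feature matrix and a mask of the same shape."""
    if len(features) != len(mask) or any(
        len(f_row) != len(m_row) for f_row, m_row in zip(features, mask)
    ):
        raise ValueError("features and mask must have the same shape")
    return [[f * m for f, m in zip(f_row, m_row)]
            for f_row, m_row in zip(features, mask)]


# A mask value of 1.0 keeps an entry, 0.0 removes it, 0.5 attenuates it.
separated = apply_mask([[1.0, 2.0], [3.0, 4.0]],
                       [[1.0, 0.0], [0.5, 1.0]])
```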

An aspect of the embodiments of the present application provides a speech signal processing apparatus, including:

the acquisition module is used for acquiring an original voice signal to be processed;

the separation processing module is used for separating the original voice signal to obtain an effective voice signal in the original voice signal;

the generating module is used for extracting the characteristics of the original voice signal to obtain the characteristic information of the original voice signal and generating the enhancement coefficient of the effective voice signal according to the characteristic information of the original voice signal;

and the enhancement processing module is used for enhancing the effective voice signal according to the enhancement coefficient of the effective voice signal and the original voice signal to obtain an enhanced target voice signal.

Wherein, the generating module comprises:

the dividing processing unit is used for dividing the original voice signal to obtain at least two original voice signal segments and dividing the effective voice signal to obtain at least two effective voice signal segments, wherein one original voice signal segment corresponds to one effective voice signal segment;

the feature extraction unit is used for extracting features of each original voice signal segment of the at least two original voice signal segments to obtain feature information of each original voice signal segment;

the generating unit is used for generating an enhancement coefficient of a corresponding effective speech signal segment in the at least two effective speech signal segments according to the characteristic information of each original speech signal segment;

a first determining unit, configured to use an enhancement coefficient corresponding to each of the at least two valid speech signal segments as an enhancement coefficient of the valid speech signal.

Wherein, the generating unit is specifically configured to:

determining the data volume ratio of the effective voice signals included in each original voice signal segment according to the characteristic information of each original voice signal segment;

and generating the enhancement coefficient of the corresponding effective voice signal segment in the at least two effective voice signal segments by adopting the data volume ratio.

Wherein, the generating unit is further specifically configured to:

determining the data volume of the effective voice signal included in each original voice signal segment according to the characteristic information of each original voice signal segment;

acquiring the total data volume of the original voice signal;

and acquiring the ratio of the data quantity of the effective voice signals included in each original voice signal segment to the total data quantity of the original voice signals to obtain the data quantity ratio of the effective voice signals included in each original voice signal segment.

The at least two original voice signal segments comprise a target original voice signal segment, and the at least two effective voice signal segments comprise a target effective voice signal segment corresponding to the target original voice signal segment;

the enhancement processing module comprises:

a fusion processing unit, configured to, if an enhancement coefficient of the target valid speech signal segment is greater than a first enhancement coefficient threshold and smaller than a second enhancement coefficient threshold, extract an original speech signal sub-segment of a target data amount from the target original speech signal segment, and perform fusion processing on the original speech signal sub-segment and the target valid speech signal segment to obtain an enhanced target valid speech signal segment; the target data volume is determined according to the data volume proportion of the effective voice information included in the target original voice signal segment, and the first enhancement coefficient threshold is smaller than the second enhancement coefficient threshold;

a second determining unit, configured to, if an enhancement coefficient of the target valid speech signal segment is greater than or equal to the second enhancement coefficient threshold, take the target original speech signal segment as an enhanced target valid speech signal segment;

and the splicing unit is used for splicing the enhanced target effective voice signal segments to obtain an enhanced target voice signal.

Wherein, the generating unit is further specifically configured to:

if the data volume ratio corresponding to the target original voice signal segment is larger than a first data volume ratio threshold and smaller than a second data volume ratio threshold, determining a first enhancement coefficient as the enhancement coefficient of the target effective voice signal segment; the first enhancement coefficient is greater than the first enhancement coefficient threshold and less than the second enhancement coefficient threshold;

if the data volume ratio corresponding to the target original voice signal segment is larger than the second data volume ratio threshold, determining a second enhancement coefficient as the enhancement coefficient of the target effective voice signal segment; the second enhancement coefficient is greater than or equal to the second enhancement coefficient threshold.

Wherein, the separation processing module includes:

the mask processing unit is used for performing mask processing on the original voice signal according to the characteristic information of the original voice signal to obtain a mask matrix corresponding to the original voice signal;

and the separation unit is used for separating the effective voice signal from the original voice signal according to the mask matrix corresponding to the original voice signal.

One aspect of the present application provides a computer device, comprising: a processor and a memory;

wherein, the memory is used for storing computer programs, and the processor is used for calling the computer programs to execute the following steps:

acquiring an original voice signal to be processed;

separating the original voice signal to obtain an effective voice signal in the original voice signal;

extracting the characteristics of the original voice signal to obtain the characteristic information of the original voice signal, and generating an enhancement coefficient of the effective voice signal according to the characteristic information of the original voice signal;

and according to the enhancement coefficient of the effective voice signal and the original voice signal, carrying out enhancement processing on the effective voice signal to obtain an enhanced target voice signal.

An aspect of the embodiments of the present application provides a computer-readable storage medium, where a computer program is stored, where the computer program includes program instructions, and the program instructions, when executed by a processor, perform the following steps:

acquiring an original voice signal to be processed;

separating the original voice signal to obtain an effective voice signal in the original voice signal;

extracting the characteristics of the original voice signal to obtain the characteristic information of the original voice signal, and generating an enhancement coefficient of the effective voice signal according to the characteristic information of the original voice signal;

and according to the enhancement coefficient of the effective voice signal and the original voice signal, carrying out enhancement processing on the effective voice signal to obtain an enhanced target voice signal.

In the embodiments of the present application, an original voice signal to be processed is acquired and separated to obtain the effective voice signal in the original voice signal; feature extraction is performed on the original voice signal to obtain its feature information, and an enhancement coefficient of the effective voice signal is generated according to that feature information. The effective voice signal is then enhanced according to its enhancement coefficient and the original voice signal to obtain an enhanced target voice signal. This can effectively avoid information loss in the effective voice signal, that is, reduce the performance damage to the effective voice signal; it also reduces background interference signals in the effective voice signal and can improve the signal-to-noise ratio of the effective voice signal.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a block diagram of a speech signal processing system according to the present application;

FIG. 2 is a flow chart of a speech signal processing method provided by the present application;

FIG. 3 is a schematic diagram of a method for separating a speech signal by using a Conv-TasNet pure time domain processing method according to an embodiment of the present application;

FIG. 4 is a diagram illustrating separation of an original speech signal according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a 1 × D convolution processing module according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a method for generating enhancement coefficients of a valid speech signal according to an embodiment of the present application;

FIG. 7 is a schematic diagram of a system for obtaining an enhanced target speech signal according to an embodiment of the present application;

FIG. 8 is a schematic flow chart diagram of another speech signal processing method provided herein;

FIG. 9 is a schematic structural diagram of a speech signal processing apparatus according to an embodiment of the present application;

FIG. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present application.

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operating/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.

The key technologies of speech technology include automatic speech recognition (ASR), speech synthesis (text to speech, TTS), and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising human-computer interaction modes. In the present application, speech technology can be used to separate the original voice signal to obtain the effective voice signal in the original voice signal, and feature extraction is then performed on the original voice signal to obtain its feature information. An enhancement coefficient of the effective voice signal is generated according to the feature information of the original voice signal, and the effective voice signal is enhanced according to the enhancement coefficient and the original voice signal to obtain an enhanced target voice signal. In this way, the signal-to-noise ratio of the target voice signal can be significantly improved, and the performance damage to the separated target voice signal is reduced.

Referring to FIG. 1, FIG. 1 is a schematic structural diagram of a speech signal processing system according to an embodiment of the present application. As shown in FIG. 1, the voice signal processing system may include a server 10 and a user terminal cluster. The user terminal cluster may include one or more user terminals; the number of user terminals is not limited here. As shown in FIG. 1, the cluster may specifically include a user terminal 100a, a user terminal 100b, a user terminal 100c, ..., and a user terminal 100n. Each of these user terminals may be connected to the server 10 via a network, so that each user terminal can interact with the server 10 via the network.

Each user terminal in the user terminal cluster may be an intelligent terminal with a service data processing function, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a wearable device, a smart home device, or a head-mounted device. It should be understood that each user terminal in the user terminal cluster shown in FIG. 1 may have a target application (i.e., an application client) installed, and when the application client runs in a user terminal, it may exchange data with the server 10 shown in FIG. 1.

As shown in fig. 1, the server 10 may be configured to separate an original voice signal to obtain an effective voice signal in the original voice signal, generate an enhancement coefficient of the effective voice signal according to feature information of the original voice signal, and perform enhancement processing on the effective voice signal according to the enhancement coefficient of the effective voice signal and the original voice signal to obtain an enhanced target voice signal; the server 10 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like.

For convenience of understanding, in the embodiment of the present application, one user terminal may be selected as a target user terminal from the plurality of user terminals shown in fig. 1. For example, the user terminal 100a shown in fig. 1 may be used as the target user terminal, and a target application (i.e., an application client) having a service data processing function may be integrated in the target user terminal. The target user terminal may then implement data interaction with the server 10 through the service data platform corresponding to the application client. For example, the target user terminal can send the original voice signal to the server 10; the server 10 can separate the original voice signal to obtain an effective voice signal, enhance the effective voice signal to obtain the target voice signal, and then send the target voice signal to the target user terminal.

Please refer to fig. 2, which is a flowchart illustrating a speech signal processing method according to an embodiment of the present application. The method may be performed by a computer device, which may be the server 10 or any terminal in fig. 1. As shown in fig. 2, the voice signal processing method may include steps S101-S104.

S101, acquiring an original voice signal to be processed.

The original voice signal to be processed may be acquired by recording with a voice acquisition module. For example, it may be acquired by recording a voice request from a host with a smart television, or by recording the voice signal in a video, uploaded by a user, that comments on something.

S102, separating the original voice signal to obtain an effective voice signal in the original voice signal.

The original voice signal may contain an interference signal, that is, any signal other than the effective voice signal. For example, if the voice signal of a target user in a segment of video data is taken as the effective voice signal, then other sounds such as vehicle horns are interference signals. After the original voice signal is obtained, it can be separated by either of two processing methods, time-frequency domain processing or pure time domain processing, to obtain the effective voice signal in the original voice signal. That is, the effective voice signal and the interference signal in the original voice signal can be separated based on a time-frequency domain processing method, or they can be separated based on a pure time domain processing method. Pure time domain processing operates directly on the original voice signal and can retain the phase information of the original voice signal, and can therefore achieve better performance. The time-frequency domain processing method converts the original voice signal into a representation over a time axis and a representation over a frequency axis, so as to analyze the original voice signal and separate the effective voice signal from it.

Optionally, mask processing is performed on the original voice signal according to the feature information of the original voice signal, so as to obtain a mask matrix corresponding to the original voice signal. And separating the effective voice signal from the original voice signal according to the mask matrix corresponding to the original voice signal.

That is, the sound characteristics of different sound sources can be distinguished based on the feature information of the original voice signal, so as to construct a mask matrix corresponding to the different sound sources. The mask matrix includes a mask vector for each sound source, and the effective voice signal is separated from the original voice signal according to the mask matrix corresponding to the original voice signal. For any sound source, the time domain voice feature of that sound source can be obtained by converting the time domain voice features with the mask vector of that sound source, so that the voice signals of the different sound sources are separated and output as the voice separation result.
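The mask step above can be illustrated with a minimal sketch. It assumes that "converting" the features with a mask vector means element-wise multiplication, which is common in mask-based separation but is not stated in this embodiment; the function name `apply_masks` and the numeric values are hypothetical.

```python
def apply_masks(features, mask_matrix):
    """Return one masked copy of `features` per sound source.

    features:    list of feature values of the original voice signal
    mask_matrix: list of mask vectors, one per sound source, each with the
                 same length as `features`
    Assumption: a mask is applied by element-wise multiplication.
    """
    separated = []
    for mask_vector in mask_matrix:
        if len(mask_vector) != len(features):
            raise ValueError("each mask vector must match the feature length")
        separated.append([m * f for m, f in zip(mask_vector, features)])
    return separated


# Hypothetical features and a two-source mask matrix (speech, interference).
features = [0.8, 0.2, 0.5, 1.0]
mask_matrix = [
    [1.0, 0.0, 0.5, 1.0],  # mask vector of the effective voice source
    [0.0, 1.0, 0.5, 0.0],  # mask vector of the interference source
]
speech, interference = apply_masks(features, mask_matrix)
```

In this sketch the two mask vectors sum to 1 at every position, so the two separated outputs add back up to the input features; the embodiment does not require this property.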

Optionally, the speech separation model may be obtained by training in advance. Specifically, the speech separation model may be trained in the following manner. First, a candidate voice separation model, a sample voice signal, and a labeled voice separation result corresponding to the sample voice signal are obtained. The sample voice signal is input into the candidate voice separation model and separated to obtain a predicted voice separation result. A model loss value is calculated according to the labeled voice separation result and the predicted voice separation result, and the candidate voice separation model is adjusted according to the model loss value until it meets the convergence condition. The candidate voice separation model meeting the convergence condition is taken as the target voice separation model, and the original voice signal is separated according to the target voice separation model.

Fig. 3 is a schematic diagram of separating a speech signal with the Conv-TasNet pure time domain processing method provided in this embodiment of the present application. As shown in fig. 3, Conv-TasNet is a fully convolutional time domain audio separation network mainly composed of an Encoder, a Separator, and a Decoder. The Encoder directly encodes the original speech signal in the time domain, converting each segment of the time domain waveform into a corresponding representation in an intermediate feature space. The Separator is composed of a series of TCNs (temporal convolutional networks) and estimates a mask from the output of the Encoder; the mask acts on the output of the Encoder to keep the useful signal and remove interference. Finally, the Decoder reconstructs the masked output to obtain the separated effective voice signal. In other words, after an original voice signal is input into the Conv-TasNet model, the Encoder converts each segment of the time domain waveform into a corresponding representation in an intermediate feature space, performing feature extraction on the original voice signal to obtain its feature information. The Separator generates a mask matrix according to the feature information of the original voice signal output by the Encoder, and a single voice spectrogram is obtained. The Decoder reconstructs the masked output, that is, decodes it, to obtain the separated effective voice signal.

Fig. 4 is a schematic diagram of separating an original speech signal according to an embodiment of the present application. As shown in fig. 4, the original speech signal is input into the speech separation model, and 1 × 1 convolution processing is performed in the encoding module. The output of the encoding module is then used as the input of the separation module, which performs 1 × 1 convolution processing and multiple rounds of 1 × D convolution processing on the original voice signal, classifies the original voice signal after the convolution processing, and separates out the effective voice signal. Finally, the separated effective voice signal is decoded to obtain the effective voice signal in the original voice signal.

Fig. 5 is a schematic diagram of a 1 × D convolution processing module provided in an embodiment of the present application. As shown in fig. 5, the 1 × D convolution processing module first performs 1 × 1 convolution processing on its input and applies activation and normalization to the result. It then performs D convolution processing, again followed by activation and normalization, and finally performs 1 × 1 convolution processing to produce the output and the skip connection. Skip connections are generally used in residual networks; their function is to mitigate the exploding-gradient and vanishing-gradient problems that arise during training in deeper networks.

S103, extracting the characteristics of the original voice signal to obtain the characteristic information of the original voice signal, and generating the enhancement coefficient of the effective voice signal according to the characteristic information of the original voice signal.

The feature information of the original speech signal may include the data volume ratio of the effective speech signal in the original speech signal, the number of different speech signals in the original speech signal, and the type of the original speech signal (e.g., the speech signal of a voice actor recorded in a recording studio, the speech signal of a reporter recorded in bad weather, etc.). An enhancement coefficient of the effective voice signal is then generated according to the feature information of the original voice signal.

Fig. 6 is a schematic diagram of a method for generating enhancement coefficients of a valid speech signal according to an embodiment of the present application, and as shown in fig. 6, the method for generating enhancement coefficients of a valid speech signal includes steps S21-S24.

S21, the original speech signal may be divided to obtain at least two original speech signal segments, and the effective speech signal may be divided to obtain at least two effective speech signal segments.

In an optional embodiment, one original speech signal segment corresponds to one effective speech signal segment. The original speech signal may be divided to obtain at least two original speech signal segments; for example, the original speech signal is divided into T original speech signal segments, where T is a positive integer greater than or equal to 2. The segments may all have the same length, for example L, where L is a natural number greater than 0; the segments may also have unequal lengths, which is not limited in this embodiment. After the original voice signal is separated to obtain the effective voice signal in the original voice signal, the effective voice signal can be divided in the same way as the original voice signal to obtain at least two effective voice signal segments. For example, the third original speech signal segment of the at least two original speech signal segments corresponds to the third effective speech signal segment of the at least two effective speech signal segments, and the two segments have the same length and position.
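As a minimal sketch of step S21, the snippet below divides an original signal and its separated effective signal with the same rule, so that segment i of one corresponds to segment i of the other. The helper name `split_into_segments` and the sample values are illustrative only; equal-length segmentation is just one of the options the embodiment allows.

```python
def split_into_segments(signal, segment_length):
    """Split `signal` into consecutive, non-overlapping segments.

    The last segment is shorter when len(signal) is not a multiple of
    `segment_length` (the embodiment permits unequal segment lengths).
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be greater than 0")
    return [
        signal[start:start + segment_length]
        for start in range(0, len(signal), segment_length)
    ]


# Hypothetical signals with N = 6 sampling points, divided with L = 2.
original_signal = [1, 2, 3, 4, 5, 6]
effective_signal = [1, 0, 3, 0, 5, 0]

original_segments = split_into_segments(original_signal, 2)
effective_segments = split_into_segments(effective_signal, 2)

# Segment i of the original signal corresponds to segment i of the
# effective signal: same position, same length.
pairs = list(zip(original_segments, effective_segments))
```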

S22, extracting the characteristic of each original voice signal segment in at least two original voice signal segments to obtain the characteristic information of each original voice signal segment.

Similarly, the feature information of each original voice signal segment includes the data volume ratio of the effective voice signal in the original voice signal segment, the number of different voice signals in the original voice signal segment, the type of the original voice signal segment (such as the voice signal of a voice actor recorded in a recording studio, or the voice signal of a reporter recorded in severe weather), and the like.

S23, generating an enhancement coefficient of the corresponding effective speech signal segment in the at least two effective speech signal segments according to the characteristic information of each original speech signal segment.

S24, using the enhancement coefficient corresponding to each of the at least two valid speech signal segments as the enhancement coefficient of the valid speech signal.

Then, the enhancement coefficient of the corresponding effective speech signal segment in the at least two effective speech signal segments is generated according to the feature information of each original speech signal segment, that is, the enhancement coefficient of the corresponding effective speech signal segment in the at least two effective speech signal segments can be generated according to at least one of the data volume ratio of the effective speech signal in the original speech signal segment, the number of different speech signals in the original speech signal segment and the type of the original speech signal segment.

Optionally, when the enhancement coefficient of the corresponding effective speech signal segment in the at least two effective speech signal segments is generated according to the feature information of each original speech signal segment, the data amount ratio of the effective speech signal included in each original speech signal segment may be determined according to the feature information of each original speech signal segment, and the enhancement coefficient of the corresponding effective speech signal segment in the at least two effective speech signal segments is generated by using the data amount ratio.

The data volume ratio of the effective speech signal included in each original speech signal segment, that is, how much of each original speech signal segment is effective voice signal, can be determined according to the feature information of that segment. The more effective voice signal an original voice signal segment contains, the larger the corresponding data volume ratio; the less it contains, the smaller the ratio. The data volume ratio is used to generate the enhancement coefficient of the corresponding effective voice signal segment in the at least two effective voice signal segments. The enhancement coefficient reflects the signal-to-noise ratio of the effective voice signal: if the signal-to-noise ratio of the effective voice signal is higher, the enhancement coefficient of the effective voice signal is higher, and the strength of the enhancement processing can be reduced; if the signal-to-noise ratio is lower, the enhancement coefficient is lower, and the strength of the enhancement processing can be increased. The enhancement coefficient is used to determine whether part of the original voice signal needs to be extracted from the original voice signal and fused into the effective voice signal to improve the performance of the effective voice signal. The larger the data volume ratio of the effective voice signal in an original voice signal segment, the larger the enhancement coefficient of the corresponding effective voice signal segment; the smaller the ratio, the smaller the enhancement coefficient.

The data volume of the effective voice signal differs from position to position in the original voice signal, that is, the degree to which interference is mixed in differs from position to position. The original voice signal is therefore divided into at least two original voice signal segments, and the enhancement coefficient of each effective voice signal segment is generated according to the feature information of the corresponding original voice signal segment. In this way each effective voice signal segment can be enhanced more accurately, which improves the signal-to-noise ratio and the performance of the enhanced target voice signal.

Optionally, the enhancement coefficients of the corresponding effective speech signal segments in the at least two effective speech signal segments may be generated according to the number of different speech signals in the original speech signal segment. A larger number of different voice signals indicates that the original voice signal is more mixed, so the enhancement coefficient is smaller; a smaller number indicates that the original speech signal is cleaner, so the enhancement coefficient is larger. The enhancement coefficients may also be generated according to the type of the original speech signal segment. For example, the speech signal of a voice actor recorded in a recording studio is relatively clean and has little interference, so the corresponding enhancement coefficient is larger; the speech signal of a reporter recorded in severe weather is mixed and has more interference, so the corresponding enhancement coefficient is lower.

And after obtaining the enhancement coefficient of each effective voice signal segment in the at least two effective voice signal segments, taking the enhancement coefficient corresponding to each effective voice signal segment in the at least two effective voice signal segments as the enhancement coefficient of the effective voice signal.

Optionally, when determining the data volume ratio of the effective speech signal included in each original speech signal segment according to the feature information of each original speech signal segment, the data volume of the effective speech signal included in each original speech signal segment may first be determined according to that feature information. The total data volume of the original voice signal is then acquired, and the ratio of the data volume of the effective voice signal included in each original voice signal segment to the total data volume of the original voice signal is taken as the data volume ratio of the effective voice signal included in that segment.

The data volume of the effective voice signal included in each original voice signal segment can be determined according to the characteristic information of each original voice signal segment, and then the total data volume of the original voice signal is obtained. Then, the ratio of the data quantity of the effective voice signal included in each original voice signal segment to the total data quantity of the original voice signal is calculated, so that the data quantity ratio of the effective voice signal included in each original voice signal segment can be obtained. Therefore, the data volume proportion of the effective voice signal contained in each original voice signal segment can be known more accurately, an enhancement coefficient is generated according to the data volume proportion of the effective voice signal contained in each original voice signal segment, the effective voice signal segment corresponding to the original voice signal segment is enhanced, the signal-to-noise ratio of the enhanced target voice signal can be improved, and the performance of the enhanced target voice signal can be improved.
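The ratio described above is a simple division, sketched below. The sketch assumes that data volume is measured as a number of sampling points and that the per-segment effective data volumes have already been determined from the feature information; how they are determined is not specified here, and the function name and numbers are hypothetical.

```python
def data_volume_ratios(effective_volumes, total_volume):
    """Ratio of each segment's effective data volume to the total data
    volume of the original voice signal.

    effective_volumes: effective-voice data volume of each original segment
                       (assumed here to be a count of sampling points)
    total_volume:      total data volume of the original voice signal
    """
    if total_volume <= 0:
        raise ValueError("total_volume must be positive")
    return [volume / total_volume for volume in effective_volumes]


# Hypothetical example: three segments of an original signal that holds
# 1200 sampling points in total.
ratios = data_volume_ratios([300, 0, 150], 1200)
```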

S104, enhancing the effective voice signal according to the enhancement coefficient of the effective voice signal and the original voice signal to obtain an enhanced target voice signal.

After the enhancement coefficient of the effective voice signal is obtained, the effective voice signal can be enhanced according to the enhancement coefficient of the effective voice signal and the original voice signal, so that an enhanced target voice signal is obtained. Therefore, the signal-to-noise ratio of the target speech signal after enhancement processing can be improved, and the performance damage of the effective speech signal after separation from the original speech signal can be reduced, namely the performance of the target speech signal after enhancement processing is improved.

Optionally, if the enhancement coefficient of the target effective speech signal segment is greater than the first enhancement coefficient threshold and smaller than the second enhancement coefficient threshold, an original speech signal sub-segment of the target data volume is extracted from the target original speech signal segment, and the original speech signal sub-segment is fused with the target effective speech signal segment to obtain an enhanced target effective speech signal segment. The target data volume is determined according to the data volume ratio of the effective voice signal included in the target original voice signal segment, and the first enhancement coefficient threshold is smaller than the second enhancement coefficient threshold. If the enhancement coefficient of the target effective voice signal segment is greater than or equal to the second enhancement coefficient threshold, the target original voice signal segment is taken as the enhanced target effective voice signal segment. The enhanced target effective voice signal segments are then spliced to obtain the enhanced target voice signal.

The at least two original voice signal segments comprise a target original voice signal segment, which is any one of the at least two original voice signal segments, and the at least two effective voice signal segments comprise a target effective voice signal segment corresponding to the target original voice signal segment. If the enhancement coefficient of the target effective speech signal segment is greater than the first enhancement coefficient threshold and smaller than the second enhancement coefficient threshold, the corresponding target original speech signal segment contains a certain data volume of effective speech signal and a certain data volume of interference signal, that is, the target original speech signal segment is a mixed speech signal. Because the target effective voice signal segment is separated from the target original voice signal segment and therefore suffers performance impairment, an original voice signal sub-segment of the target data volume can be extracted from the target original voice signal segment and fused with the target effective voice signal segment to obtain the enhanced target effective voice signal segment. This can markedly improve the signal-to-noise ratio of the separated target effective voice signal segment and reduce its performance impairment. The target data volume is determined according to the data volume ratio of the effective voice signal included in the target original voice signal segment, and the first enhancement coefficient threshold is smaller than the second enhancement coefficient threshold.

If the enhancement coefficient of the target effective speech signal segment is greater than or equal to the second enhancement coefficient threshold, the corresponding target original speech signal segment contains only the effective speech signal, that is, the target original speech signal segment is a pure clean speech signal. Although the target original speech signal segment contains no interference signal, a target effective speech signal segment separated from it would still suffer performance impairment. Therefore, the target original voice signal segment can be used as the enhanced target effective voice signal segment, so that the enhanced target effective voice signal segment has no performance impairment. After the enhanced target effective voice signal segment is obtained for each of the at least two effective voice signal segments, the enhanced target effective voice signal segments are spliced to obtain the enhanced target voice signal.
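The threshold logic above can be sketched as follows. Several details are assumptions made only for illustration and are not defined by this embodiment: the fusion operation (`blend`, a sample-wise average), the rule that turns the data volume ratio into a target data volume (a leading sub-segment of `int(ratio * length)` samples), and the handling of coefficients at or below the first threshold, which the passage above does not cover (the sketch keeps the separated segment unchanged).

```python
def blend(original_sub, effective_sub):
    """Hypothetical fusion: sample-wise average of the two sub-segments."""
    return [(o + e) / 2 for o, e in zip(original_sub, effective_sub)]


def enhance_segment(original, effective, coefficient, ratio, low, high, fuse=blend):
    """Enhance one effective segment using its enhancement coefficient.

    low / high are the first / second enhancement coefficient thresholds.
    """
    if coefficient >= high:
        # Pure clean speech: output the original segment itself.
        return list(original)
    if low < coefficient < high:
        # Mixed speech: fuse a sub-segment of the original into the
        # separated segment. The size rule below is an assumption.
        target_count = int(ratio * len(original))
        fused = fuse(original[:target_count], effective[:target_count])
        return fused + list(effective[target_count:])
    # Not covered by the passage above; kept unchanged in this sketch.
    return list(effective)


def enhance_signal(original_segments, effective_segments, coefficients, ratios, low, high):
    """Enhance every segment, then splice the results in order."""
    enhanced = []
    for original, effective, coefficient, ratio in zip(
        original_segments, effective_segments, coefficients, ratios
    ):
        enhanced.extend(enhance_segment(original, effective, coefficient, ratio, low, high))
    return enhanced


# Hypothetical two-segment example with thresholds 0.0 and 1.0.
target = enhance_signal(
    original_segments=[[1.0, 1.0], [4.0, 2.0]],
    effective_segments=[[0.5, 0.5], [2.0, 1.0]],
    coefficients=[1.0, 0.5],
    ratios=[1.0, 0.5],
    low=0.0,
    high=1.0,
)
```

Here the first segment is passed through as the original segment, and the first sample of the second segment is fused, giving `[1.0, 1.0, 3.0, 1.0]`.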

Optionally, when the data volume ratio is used to generate the enhancement coefficient of the corresponding effective speech signal segment in the at least two effective speech signal segments, if the data volume ratio corresponding to the target original voice signal segment is greater than the first data volume ratio threshold and smaller than the second data volume ratio threshold, the first enhancement coefficient is determined as the enhancement coefficient of the target effective voice signal segment, where the first enhancement coefficient is greater than the first enhancement coefficient threshold and smaller than the second enhancement coefficient threshold. If the data volume ratio corresponding to the target original voice signal segment is greater than the second data volume ratio threshold, the second enhancement coefficient is determined as the enhancement coefficient of the target effective voice signal segment, where the second enhancement coefficient is greater than or equal to the second enhancement coefficient threshold.

If the data volume ratio corresponding to the target original voice signal segment is greater than the first data volume ratio threshold and smaller than the second data volume ratio threshold, the target original voice signal segment contains a certain data volume of effective voice signal and a certain data volume of interference signal, that is, the target original voice signal segment is a mixed voice signal. A first enhancement coefficient may then be determined as the enhancement coefficient of the target effective voice signal segment, the first enhancement coefficient being greater than the first enhancement coefficient threshold and smaller than the second enhancement coefficient threshold. If the data volume ratio corresponding to the target original voice signal segment is greater than the second data volume ratio threshold, the target original voice signal segment contains only the effective voice signal, that is, it is a pure clean voice signal. A second enhancement coefficient may then be determined as the enhancement coefficient of the target effective voice signal segment, the second enhancement coefficient being greater than or equal to the second enhancement coefficient threshold. If the data volume ratio corresponding to the target original voice signal segment is smaller than the first data volume ratio threshold, the target original voice signal segment contains no effective voice signal and only the interference signal, and an all-zero signal, which carries no signal information, can be output. In this way the interference signal can be completely removed, and the signal-to-noise ratio of the target voice signal can be improved.

For example, the first data volume ratio threshold may be 0 and the second data volume ratio threshold may be 1. If the data volume ratio corresponding to the target original voice signal segment is greater than the first data volume ratio threshold and smaller than the second data volume ratio threshold, the target original voice signal segment includes not only the effective voice signal but also other voice signals, that is, it is a mixed voice signal. A first enhancement coefficient may then be determined as the enhancement coefficient of the target effective voice signal segment, the first enhancement coefficient being greater than the first enhancement coefficient threshold and smaller than the second enhancement coefficient threshold. The value of the first enhancement coefficient may be determined according to the data volume ratio of the effective voice signal in the target original voice signal segment: the higher the data volume ratio, the higher the value of the first enhancement coefficient; the lower the data volume ratio, the lower the value. When the enhancement coefficient of the target effective voice signal segment is the second enhancement coefficient, the target original voice signal segment contains only the effective voice signal.
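A minimal sketch of this mapping, using the example thresholds 0 and 1, is given below. Two points are assumptions rather than part of the embodiment: the first enhancement coefficient is obtained by linear interpolation (the text only requires that it grow with the data volume ratio and lie between the two coefficient thresholds), and ratios exactly equal to a threshold are assigned to the pure-interference or pure-clean case. The function name, the default coefficient thresholds, and the use of `None` to signal the all-zero output are hypothetical.

```python
def enhancement_coefficient(ratio, ratio_low=0.0, ratio_high=1.0,
                            coeff_low=0.0, coeff_high=1.0):
    """Map a data volume ratio to an enhancement coefficient.

    ratio_low / ratio_high: first / second data volume ratio thresholds
    coeff_low / coeff_high: first / second enhancement coefficient thresholds
    Returns None for a pure-interference segment, for which an all-zero
    signal is output instead of an enhanced segment.
    """
    if ratio >= ratio_high:
        # Pure clean speech: second enhancement coefficient (>= coeff_high).
        return coeff_high
    if ratio > ratio_low:
        # Mixed speech: first enhancement coefficient, strictly between the
        # two coefficient thresholds. Linear interpolation is an assumption.
        fraction = (ratio - ratio_low) / (ratio_high - ratio_low)
        return coeff_low + fraction * (coeff_high - coeff_low)
    # Pure interference: no coefficient, the output is an all-zero signal.
    return None


mixed = enhancement_coefficient(0.5)
clean = enhancement_coefficient(1.0)
noise_only = enhancement_coefficient(0.0)
```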

Fig. 7 is a schematic diagram of a system for obtaining an enhanced target speech signal according to an embodiment of the present application. As shown in fig. 7, the system includes two modules. Module one is a speech separation model. Its input is the original speech signal X of length N, that is, containing N sampling points. It separates the original voice signal and outputs a separated effective voice signal S of the same length N. The speech separation model of module one may be a Conv-TasNet network. Module two comprises four parts: a segmentation part, an encoding part, a self-attention network part, and a classification part. First, the segmentation part divides the input original speech signal into at least two original speech signal segments; for example, the original speech signal is divided into T non-overlapping original speech signal segments of length L, where T = N/L, giving an original speech signal X of dimension T × L. The separated effective voice signal is then divided in the same way into at least two effective voice signal segments; for example, it is divided into T non-overlapping segments of length L, where T = N/L, giving an effective voice signal S of dimension T × L. Next, the at least two original voice signal segments are input to the encoding part, that is, the original voice signal X is sent to an encoder composed of a 1D-Conv for encoding.

The one-dimensional convolution network in the encoder has a convolution kernel size of L and a stride of L, so the convolution operation is non-overlapping. The number of input channels is 1 and the number of output channels is D, which is the dimension of the encoded feature. After processing, a T × D-dimensional encoding feature is obtained, that is, a 1 × D-dimensional encoding feature for each original voice signal segment, so that the feature information of each of the at least two original voice signal segments is obtained.
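The non-overlapping encoder convolution described above (kernel size L, stride L, 1 input channel, D output channels) can be sketched in plain Python. The kernel weights below are made-up values for illustration, since in practice they would be learned; the sketch also omits a bias term and assumes that N is a multiple of L.

```python
def encode_segments(signal, kernels):
    """Non-overlapping 1-D convolution: kernel size L, stride L,
    1 input channel, D output channels.

    signal:  list of N samples, with N a multiple of L
    kernels: D kernels, each a list of L weights (hypothetical values)
    Returns a T x D list of encoding features, where T = N / L.
    """
    segment_length = len(kernels[0])
    if len(signal) % segment_length != 0:
        raise ValueError("signal length must be a multiple of the kernel size")
    features = []
    for start in range(0, len(signal), segment_length):
        segment = signal[start:start + segment_length]
        # One dot product per output channel gives a 1 x D feature.
        features.append(
            [sum(w * x for w, x in zip(kernel, segment)) for kernel in kernels]
        )
    return features


# N = 6, L = 2, D = 3, so the output has T = 3 rows of 3 features each.
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernels = [[1.0, 1.0], [1.0, -1.0], [0.5, 0.0]]
features = encode_segments(signal, kernels)
```

Because the stride equals the kernel size, each row of `features` depends on exactly one original voice signal segment.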

Then, the encoded features of the at least two original voice signal segments are input into a self-attention network. A positional encoding operation is added before the first self-attention layer to add position information, and a Linear layer with an output dimension of 3 followed by a Softmax layer is added after the last self-attention layer, so that a T × 3-dimensional feature is output, where the three dimensions represent the weighting coefficients of the effective voice signal, the original voice signal and an all-zero signal, respectively. The self-attention network part uses the information of the whole original voice signal to learn the weighting coefficients corresponding to each of the T original voice signal segments (namely the weighting coefficient of the target effective voice signal segment, the weighting coefficient of the target original voice signal segment and the weighting coefficient of the all-zero signal), and thereby acts as a gating mechanism; the weighting coefficient corresponding to the target effective voice signal segment is the enhancement coefficient. Finally, the at least two original voice signal segments and effective voice signal segments obtained by the segmentation part are weighted and summed according to the weighting coefficients obtained from the self-attention network, giving the enhanced target voice signal.

Therefore, when the input original voice signal is a mixed voice signal, the original voice signal can be separated to obtain the effective voice signal in the original voice signal, and the enhancement coefficient of the effective voice signal is then determined according to the data volume ratio of the effective voice signal in the original voice signal. An original voice signal of a target data volume is extracted from the original voice signal according to the enhancement coefficient and fused with the effective voice signal to obtain the enhanced target voice signal. This repairs the separated effective voice signal to a certain extent, so that the signal-to-noise ratio of the separated effective voice signal can be effectively improved and the accuracy of subsequent voice recognition can be improved. When the input original voice signal is a pure clean voice signal, namely a pure effective voice signal, separating the original voice signal would still damage the performance of the resulting effective voice signal, so the original voice signal can be output directly as the enhanced target voice signal, which avoids the performance damage caused by separation. When the input original voice signal is a pure interference signal, separation would leave part of the interference signal, so the all-zero signal can be used directly as the target voice signal; because the all-zero signal carries no signal information, the interference is removed completely and the signal-to-noise ratio of the target voice signal is improved. The scheme can be used in a scene of enhancing the separated effective voice signal after voice separation, and can also be directly applied to a scene of voice enhancement.
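The gating by weighted summation can be illustrated with a short sketch. It assumes that the network has already produced three logits per segment (the values below are made up for illustration) and that the Softmax is taken over those three values; the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gated_sum(x_segments, s_segments, logits):
    """Per segment: w_s * S + w_x * X + w_0 * 0, then concatenate the segments."""
    out = []
    for x_seg, s_seg, seg_logits in zip(x_segments, s_segments, logits):
        w_s, w_x, w_0 = softmax(seg_logits)
        # The all-zero signal adds nothing; its weight w_0 only reduces
        # the share given to the other two signals.
        out.extend(w_s * s + w_x * x for s, x in zip(s_seg, x_seg))
    return out

x_segments = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # original segments (made up)
s_segments = [[0.5, 1.0], [3.0, 4.0], [0.1, 0.2]]   # separated segments (made up)
logits = [[50.0, 0.0, 0.0],    # segment 0: keep the separated signal
          [0.0, 50.0, 0.0],    # segment 1: keep the original signal
          [0.0, 0.0, 50.0]]    # segment 2: output (nearly) zeros
y = gated_sum(x_segments, s_segments, logits)
print(y)
```

The three extreme logit rows correspond to the three input cases discussed above: mixed speech where the separated signal is trusted, clean speech where the original is passed through, and pure interference where the output is suppressed.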

Optionally, a candidate speech enhancement model and a sample original voice signal may be obtained, together with the labeled target voice signal of the sample original voice signal. The candidate speech enhancement model is used to separate the sample original voice signal to obtain the effective voice signal in the sample original voice signal, and feature extraction is performed on the sample original voice signal to obtain its feature information. The enhancement coefficient of the effective voice signal is determined according to the feature information of the sample original voice signal, that is, the weighting coefficients corresponding to the original voice signal are obtained (the weighting coefficient of the target effective voice signal segment, the weighting coefficient of the target original voice signal segment, and the weighting coefficient of the all-zero signal). The effective voice signal is enhanced according to the enhancement coefficient of the effective voice signal and the original voice signal, and a predicted target voice signal is output. The prediction loss value of the candidate speech enhancement model is determined according to the labeled target voice signal and the predicted target voice signal. The candidate speech enhancement model is adjusted according to the prediction loss value until it meets the convergence condition, and the candidate speech enhancement model meeting the convergence condition is taken as the target speech enhancement model. An input original voice signal may then be processed by the target speech enhancement model to obtain an enhanced target voice signal.
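The loop of predicting, computing a loss against the labeled target and adjusting until convergence can be illustrated with a deliberately tiny toy. Everything here is an assumption made for illustration: the "model" is a single scalar gate `a`, the loss is mean squared error, the optimizer is plain gradient descent, and convergence means the loss changes by less than `tol`; the passage does not specify any of these.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def train_gate(x, s, target, lr=0.1, tol=1e-10, max_steps=10000):
    """Fit a scalar gate a so that a * s + (1 - a) * x approximates target."""
    a = 0.5
    prev_loss = float("inf")
    loss = prev_loss
    for _ in range(max_steps):
        pred = [a * si + (1 - a) * xi for si, xi in zip(s, x)]
        loss = mse(pred, target)
        if abs(prev_loss - loss) < tol:      # assumed convergence condition
            break
        # Gradient of the MSE loss with respect to a.
        grad = sum(2 * (p - t) * (si - xi)
                   for p, t, si, xi in zip(pred, target, s, x)) / len(x)
        a -= lr * grad                       # adjust the model parameter
        prev_loss = loss
    return a, loss

x = [1.0, -1.0, 2.0, 0.5]           # sample original signal (made up)
s = [0.6, -0.2, 1.0, 0.1]           # separated effective signal (made up)
target = [0.7, -0.4, 1.25, 0.2]     # labeled target, equal to 0.75 * s + 0.25 * x
a, final_loss = train_gate(x, s, target)
print(round(a, 3))
```

In an actual model the adjusted parameters would be the weights of the separation, encoding and self-attention parts rather than one scalar, but the control flow is the same.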

In the embodiment of the application, the original voice signal to be processed is obtained and separated to obtain the effective voice signal in the original voice signal; feature extraction is performed on the original voice signal to obtain its feature information, and the enhancement coefficient of the effective voice signal is generated according to that feature information. The effective voice signal is then enhanced according to its enhancement coefficient and the original voice signal to obtain an enhanced target voice signal. When the input original voice signal is a mixed voice signal, an original voice sub-signal of a target data volume is extracted from the original voice signal according to the enhancement coefficient of the effective voice signal and fused with the effective voice signal to obtain the enhanced target voice signal. When the input original voice signal is a pure clean voice signal, the original voice signal can be output directly as the enhanced target voice signal, so the target voice signal does not suffer the performance damage caused by separation. When the input original voice signal is a pure interference signal, the all-zero signal can be used directly as the target voice signal; because the all-zero signal carries no signal information, the interference is removed completely and the signal-to-noise ratio of the target voice signal is improved. Enhancing the effective voice signal in this way effectively avoids information loss in the effective voice signal, i.e. reduces its performance damage; it also reduces the background interference signal in the effective voice signal and can improve its signal-to-noise ratio.

Fig. 8 is a schematic diagram of another speech signal processing method provided in the embodiment of the present application. As shown in fig. 8, the method includes steps S201-S207.

S201, acquiring an original voice signal to be processed.

S202, the original voice signal is separated to obtain an effective voice signal in the original voice signal.

For details of steps S201-S202, refer to the embodiment described in fig. 2; they are not described herein again.

S203, the original voice signal is divided to obtain at least two original voice signal segments, and the effective voice signal is divided to obtain at least two effective voice signal segments.

S204, extracting the characteristics of each original voice signal segment in the at least two original voice signal segments to obtain the characteristic information of each original voice signal segment.

S205, generating an enhancement coefficient of a corresponding effective speech signal segment in at least two effective speech signal segments according to the feature information of each original speech signal segment.

S206, taking the enhancement coefficient corresponding to each effective voice signal segment in the at least two effective voice signal segments as the enhancement coefficient of the effective voice signal.

S207, enhancing the effective voice signal according to the enhancement coefficient of the effective voice signal and the original voice signal to obtain an enhanced target voice signal.

S21, the original speech signal may be divided to obtain at least two original speech signal segments, and the effective speech signal may be divided to obtain at least two effective speech signal segments.

In this embodiment of the present application, the original voice signal may be divided to obtain at least two original voice signal segments; for example, the original voice signal is divided into T original voice signal segments, where T is a positive integer greater than or equal to 2. The original voice signal segments may be of equal length, for example each of length L, where L is a positive integer; of course, the segments may also be of unequal length, which is not limited in this embodiment. After the original voice signal is separated to obtain the effective voice signal in the original voice signal, the effective voice signal can be divided in the same way as the original voice signal to obtain at least two effective voice signal segments, so that each original voice signal segment corresponds to one effective voice signal segment. For example, the third original voice signal segment of the at least two original voice signal segments corresponds to the third effective voice signal segment of the at least two effective voice signal segments, and the length and position of the third original voice signal segment are the same as those of the third effective voice signal segment.
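The one-to-one correspondence between segments follows from cutting both signals at the same positions. A minimal sketch, in which the helper `split_at` and the cut points are hypothetical and segments of unequal length are allowed:

```python
def split_at(signal, boundaries):
    """Split a signal at the given cut points; segments may differ in length."""
    edges = [0] + list(boundaries) + [len(signal)]
    return [signal[a:b] for a, b in zip(edges, edges[1:])]

original = [float(v) for v in range(10)]    # stand-in original signal
effective = [v * 0.5 for v in original]     # stand-in separated signal
boundaries = [3, 7]                         # hypothetical cut points

orig_segs = split_at(original, boundaries)
eff_segs = split_at(effective, boundaries)

# Segment i of the original signal matches segment i of the effective
# signal in both position and length.
for o, e in zip(orig_segs, eff_segs):
    print(len(o), len(e))
```

Reusing one list of boundaries for both signals is what guarantees that, for instance, the third original segment and the third effective segment cover the same samples.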

Feature extraction is then performed on each original voice signal segment to obtain its feature information. Similarly, the feature information of each original voice signal segment includes the data volume ratio of the effective voice signal in the original voice signal segment, the number of different voice signals in the original voice signal segment, the type of the original voice signal segment (such as a voice signal of a dubbing actor recorded in a recording studio, or a voice signal of a reporter recorded in severe weather), and the like.

Then, the enhancement coefficient of the corresponding effective speech signal segment in the at least two effective speech signal segments is generated according to the feature information of each original speech signal segment, that is, the enhancement coefficient of the corresponding effective speech signal segment in the at least two effective speech signal segments can be generated according to at least one of the data volume ratio of the effective speech signal in the original speech signal segment, the number of different speech signals in the original speech signal segment and the type of the original speech signal segment.

After the enhancement coefficient of the effective voice signal is obtained, the effective voice signal can be enhanced according to the enhancement coefficient of the effective voice signal and the original voice signal, so that an enhanced target voice signal is obtained. Therefore, the signal-to-noise ratio of the target speech signal after enhancement processing can be improved, and the performance damage of the effective speech signal after separation from the original speech signal can be reduced, namely the performance of the target speech signal after enhancement processing is improved.

The detailed contents of this embodiment can refer to the contents of the embodiment described in fig. 2, and the present embodiment will not be described herein again.

Fig. 9 is a schematic structural diagram of a speech signal processing apparatus according to an embodiment of the present application. The speech signal processing apparatus may be a computer program (including program code) running on a computer device, for example, application software; the apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application. As shown in fig. 9, the speech signal processing apparatus may include: an obtaining module 11, a separation processing module 12, a generating module 13 and an enhancement processing module 14.

The obtaining module 11 is configured to obtain an original voice signal to be processed.

The separation processing module 12 is configured to perform separation processing on the original voice signal to obtain an effective voice signal in the original voice signal.

The generating module 13 is configured to perform feature extraction on the original voice signal to obtain feature information of the original voice signal, and generate an enhancement coefficient of the effective voice signal according to the feature information of the original voice signal.

The enhancement processing module 14 is configured to perform enhancement processing on the effective voice signal according to the enhancement coefficient of the effective voice signal and the original voice signal, so as to obtain an enhanced target voice signal.

Wherein, the generating module 13 includes:

the dividing processing unit is used for dividing the original voice signal to obtain at least two original voice signal segments and dividing the effective voice signal to obtain at least two effective voice signal segments, wherein one original voice signal segment corresponds to one effective voice signal segment;

the feature extraction unit is used for extracting features of each original voice signal segment of the at least two original voice signal segments to obtain feature information of each original voice signal segment;

the generating unit is used for generating an enhancement coefficient of a corresponding effective speech signal segment in the at least two effective speech signal segments according to the characteristic information of each original speech signal segment;

a first determining unit, configured to use an enhancement coefficient corresponding to each of the at least two valid speech signal segments as an enhancement coefficient of the valid speech signal.

Wherein, the generating unit is specifically configured to:

determining the data volume ratio of the effective voice signals included in each original voice signal segment according to the characteristic information of each original voice signal segment;

and generating the enhancement coefficient of the corresponding effective voice signal segment in the at least two effective voice signal segments by adopting the data volume ratio.

Wherein, the generating unit is further specifically configured to:

determining the data volume of the effective voice signal included in each original voice signal segment according to the characteristic information of each original voice signal segment;

acquiring the total data volume of the original voice signal;

and acquiring the ratio of the data quantity of the effective voice signals included in each original voice signal segment to the total data quantity of the original voice signals to obtain the data quantity ratio of the effective voice signals included in each original voice signal segment.
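The ratio computed by the generating unit can be written out as a short sketch. It assumes, purely for illustration, that the feature information has already been reduced to a per-sample flag (1 for a sample belonging to the effective voice signal, 0 otherwise); that flag representation and the function name are hypothetical.

```python
def effective_ratios(flag_segments, total_samples):
    """Ratio of effective-signal samples in each segment to the total data volume."""
    return [sum(flags) / total_samples for flags in flag_segments]

# Three segments of four samples each, so the total data volume is 12 samples.
flag_segments = [
    [1, 1, 0, 0],   # 2 effective samples
    [1, 1, 1, 1],   # 4 effective samples
    [0, 0, 0, 0],   # no effective samples
]
total = sum(len(flags) for flags in flag_segments)
ratios = effective_ratios(flag_segments, total)
print(ratios)
```

Note that the denominator is the total data volume of the whole original voice signal, as stated above, not the length of the individual segment.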

The at least two original voice signal segments comprise a target original voice signal segment, and the at least two effective voice signal segments comprise a target effective voice signal segment corresponding to the target original voice signal segment;

the enhancement processing module 14 includes:

a fusion processing unit, configured to, if an enhancement coefficient of the target effective voice signal segment is greater than a first enhancement coefficient threshold and smaller than a second enhancement coefficient threshold, extract an original voice signal sub-segment of a target data volume from the target original voice signal segment, and perform fusion processing on the original voice signal sub-segment and the target effective voice signal segment to obtain an enhanced target effective voice signal segment; the target data volume is determined according to the data volume ratio of the effective voice signal included in the target original voice signal segment, and the first enhancement coefficient threshold is smaller than the second enhancement coefficient threshold;

a second determining unit, configured to, if an enhancement coefficient of the target valid speech signal segment is greater than or equal to the second enhancement coefficient threshold, take the target original speech signal segment as an enhanced target valid speech signal segment;

and the splicing unit is used for splicing the enhanced target effective voice signal segments to obtain an enhanced target voice signal.
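The threshold logic of the enhancement processing module can be sketched as below. Two points are assumptions and not taken from the text: fusion is modelled as a linear blend controlled by a hypothetical `mix` fraction, and a segment whose coefficient does not exceed the first threshold is left as the separated segment, a case this passage does not specify.

```python
def enhance_segment(orig_seg, eff_seg, coeff, low, high, mix):
    """Choose the enhanced segment from the coefficient and the two thresholds."""
    if coeff >= high:
        # At or above the second threshold: use the original segment directly.
        return list(orig_seg)
    if low < coeff < high:
        # Between the thresholds: fuse part of the original into the
        # separated segment (a linear blend is an assumption).
        return [(1 - mix) * e + mix * o for e, o in zip(eff_seg, orig_seg)]
    # Not covered by the passage; keeping the separated segment is an assumption.
    return list(eff_seg)

def splice(segments):
    """Concatenate the enhanced segments into one target signal."""
    return [v for seg in segments for v in seg]

orig_segs = [[2.0, 4.0], [2.0, 4.0], [2.0, 4.0]]
eff_segs = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
coeffs = [0.9, 0.5, 0.1]        # made-up enhancement coefficients

target_signal = splice(
    enhance_segment(o, e, c, low=0.3, high=0.8, mix=0.5)
    for o, e, c in zip(orig_segs, eff_segs, coeffs)
)
print(target_signal)
```

The threshold values 0.3 and 0.8 are placeholders; the only constraint stated above is that the first threshold is smaller than the second.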

Wherein, the generating unit is further specifically configured to:

if the data volume ratio corresponding to the target original voice signal segment is larger than a first data volume ratio threshold and smaller than a second data volume ratio threshold, determining a first enhancement coefficient as the enhancement coefficient of the target effective voice signal segment; the first enhancement coefficient is greater than the first enhancement coefficient threshold and less than the second enhancement coefficient threshold;

if the data volume ratio corresponding to the target original voice signal segment is larger than the second data volume ratio threshold, determining a second enhancement coefficient as the enhancement coefficient of the target effective voice signal segment; the second enhancement coefficient is greater than or equal to the second enhancement coefficient threshold.
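The mapping from data volume ratio to enhancement coefficient can be illustrated as a small function. The threshold and coefficient values are placeholders, and the `default` returned for ratios outside the two stated ranges is an assumption, since the passage does not say what happens there.

```python
def coefficient_from_ratio(ratio, r1, r2, first_coeff, second_coeff, default=0.0):
    """Map a data volume ratio to an enhancement coefficient.

    r1 and r2 are the first and second data volume ratio thresholds (r1 < r2).
    """
    if r1 < ratio < r2:
        return first_coeff
    if ratio > r2:
        return second_coeff
    # ratio <= r1 or ratio == r2 is not covered by the passage;
    # returning `default` here is an assumption.
    return default

# Placeholder values: coefficient thresholds 0.3 and 0.8, so the first
# coefficient lies strictly between them and the second is at least 0.8.
r1, r2 = 0.2, 0.9
first_coeff, second_coeff = 0.5, 1.0

for ratio in (0.1, 0.5, 0.95):
    print(ratio, coefficient_from_ratio(ratio, r1, r2, first_coeff, second_coeff))
```

The returned coefficient can then be compared against the enhancement coefficient thresholds to select between fusion and direct use of the original segment.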

Wherein, the separation processing module 12 includes:

the mask processing unit is used for performing mask processing on the original voice signal according to the characteristic information of the original voice signal to obtain a mask matrix corresponding to the original voice signal;

and the separation unit is used for separating the effective voice signal from the original voice signal according to the mask matrix corresponding to the original voice signal.
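Mask-based separation can be illustrated as an element-wise product between a mask matrix and a feature representation of the original voice signal. This is a sketch under assumptions: the mask values lie in [0, 1], the features and mask below are made up, and the step that decodes the masked features back into a waveform is omitted.

```python
def apply_mask(features, mask):
    """Element-wise product of a mask matrix and a feature matrix of equal shape."""
    if len(features) != len(mask) or any(
        len(frow) != len(mrow) for frow, mrow in zip(features, mask)
    ):
        raise ValueError("mask and features must have the same shape")
    return [[m * f for m, f in zip(mrow, frow)]
            for mrow, frow in zip(mask, features)]

# Made-up 2 x 3 feature matrix of the original voice signal.
features = [[2.0, 4.0, 6.0],
            [1.0, 3.0, 5.0]]
# Made-up mask: 1 keeps a component, 0 removes it, 0.5 attenuates it.
mask = [[1.0, 0.0, 0.5],
        [0.0, 1.0, 0.5]]

effective_features = apply_mask(features, mask)
print(effective_features)
```

Components attributed to the effective voice signal are kept and the rest are suppressed, which is the role the mask matrix plays in the separation unit.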

According to an embodiment of the present application, the steps involved in the speech signal processing method shown in fig. 2 may be performed by various modules in the speech signal processing apparatus shown in fig. 9. For example, step S101 shown in fig. 2 may be performed by the acquisition module 11 in fig. 9, and step S102 shown in fig. 2 may be performed by the separation processing module 12 in fig. 9; step S103 shown in fig. 2 may be performed by the generation module 13 in fig. 9; step S104 shown in fig. 2 may be performed by the enhancement processing module 14 in fig. 9.

According to an embodiment of the present application, the modules in the speech signal processing apparatus shown in fig. 9 may be separately or wholly combined into one or several units, or one or more of the modules may be further split into multiple functionally smaller sub-units, which can implement the same operations without affecting the technical effects of the embodiment of the present application. The modules are divided based on logical functions; in practical applications, the function of one module may be implemented by a plurality of units, or the functions of a plurality of modules may be implemented by one unit. In other embodiments of the present application, the speech signal processing apparatus may also include other units; in practical applications, these functions may also be implemented with the assistance of other units, and may be implemented by a plurality of units in cooperation.

According to an embodiment of the present application, the speech signal processing apparatus shown in fig. 9 may be constructed by running a computer program (including program code) capable of executing the steps of the methods shown in fig. 2 or fig. 8 on a general-purpose computer device, such as a computer including processing elements and storage elements such as a central processing unit (CPU), a random access memory (RAM) and a read-only memory (ROM), thereby implementing the speech signal processing method of the embodiment of the present application. The computer program may be recorded on, for example, a computer-readable recording medium, and loaded into and executed by the above computer device via the computer-readable recording medium.

Fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 10, the computer device 1000 may include a processor 1001, a network interface 1004 and a memory 1005, and may further include a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display and a keyboard; optionally, the user interface 1003 may also include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM or a non-volatile memory (e.g., at least one disk memory). Optionally, the memory 1005 may be at least one storage device located remotely from the processor 1001. As shown in fig. 10, the memory 1005, which is a kind of computer-readable storage medium, may include an operating system, a network communication module, a user interface module and a device control application program.

In the computer device 1000 shown in fig. 10, the network interface 1004 may provide a network communication function; the user interface 1003 is an interface for providing input for a user; and the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement:

acquiring an original voice signal to be processed;

separating the original voice signal to obtain an effective voice signal in the original voice signal;

extracting the characteristics of the original voice signal to obtain the characteristic information of the original voice signal, and generating an enhancement coefficient of the effective voice signal according to the characteristic information of the original voice signal;

and according to the enhancement coefficient of the effective voice signal and the original voice signal, carrying out enhancement processing on the effective voice signal to obtain an enhanced target voice signal.

Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement:

dividing the original voice signal to obtain at least two original voice signal segments, and dividing the effective voice signal to obtain at least two effective voice signal segments, wherein one original voice signal segment corresponds to one effective voice signal segment;

performing feature extraction on each original voice signal segment in the at least two original voice signal segments to obtain feature information of each original voice signal segment;

generating an enhancement coefficient of a corresponding effective voice signal segment in the at least two effective voice signal segments according to the characteristic information of each original voice signal segment;

and taking the enhancement coefficient corresponding to each effective speech signal segment in the at least two effective speech signal segments as the enhancement coefficient of the effective speech signal.

Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement:

determining the data volume ratio of the effective voice signals included in each original voice signal segment according to the characteristic information of each original voice signal segment;

and generating the enhancement coefficient of the corresponding effective voice signal segment in the at least two effective voice signal segments by adopting the data volume ratio.

Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement:

determining the data volume of the effective voice signal included in each original voice signal segment according to the characteristic information of each original voice signal segment;

acquiring the total data volume of the original voice signal;

and acquiring the ratio of the data quantity of the effective voice signals included in each original voice signal segment to the total data quantity of the original voice signals to obtain the data quantity ratio of the effective voice signals included in each original voice signal segment.

Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement:

enhancing the effective voice signal according to the enhancement coefficient of the effective voice signal and the original voice signal to obtain an enhanced target voice signal, which includes:

if the enhancement coefficient of the target effective voice signal segment is larger than a first enhancement coefficient threshold and smaller than a second enhancement coefficient threshold, extracting an original voice signal sub-segment of a target data volume from the target original voice signal segment, and carrying out fusion processing on the original voice signal sub-segment and the target effective voice signal segment to obtain an enhanced target effective voice signal segment; the target data volume is determined according to the data volume ratio of the effective voice signal included in the target original voice signal segment, and the first enhancement coefficient threshold is smaller than the second enhancement coefficient threshold;

if the enhancement coefficient of the target effective voice signal segment is larger than or equal to the second enhancement coefficient threshold value, taking the target original voice signal segment as an enhanced target effective voice signal segment;

and splicing the enhanced target effective voice signal segments to obtain an enhanced target voice signal.

Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement:

if the data volume proportion corresponding to the target original voice signal segment is greater than a first data volume proportion threshold and less than a second data volume proportion threshold, determining a first enhancement coefficient as the enhancement coefficient of the target effective voice signal segment; the first enhancement coefficient is greater than the first enhancement coefficient threshold and less than the second enhancement coefficient threshold;

if the data volume proportion corresponding to the target original voice signal segment is greater than the second data volume proportion threshold, determining a second enhancement coefficient as the enhancement coefficient of the target effective voice signal segment; the second enhancement coefficient is greater than or equal to the second enhancement coefficient threshold.
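The mapping from data volume proportion to enhancement coefficient can be illustrated as follows. This is a hedged sketch only: every numeric default is a hypothetical placeholder, since the text gives no concrete values. The text also leaves two cases open, which the sketch resolves by assumption: a proportion exactly equal to the second data volume proportion threshold is grouped with the upper branch, and a proportion at or below the first data volume proportion threshold is mapped to a hypothetical `zero_coeff` that does not exceed the first enhancement coefficient threshold.

```python
def enhancement_coefficient(ratio: float,
                            ratio_low: float = 0.2,
                            ratio_high: float = 0.8,
                            first_coeff: float = 0.5,
                            second_coeff: float = 1.0,
                            zero_coeff: float = 0.0) -> float:
    """Map a segment's effective-voice proportion to an enhancement coefficient.

    All default values are hypothetical placeholders, not values from the text.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    if ratio_low < ratio < ratio_high:
        # Mixed segment: use the first enhancement coefficient.
        return first_coeff
    if ratio >= ratio_high:
        # Mostly effective voice: use the second enhancement coefficient.
        # ASSUMPTION: equality with ratio_high is grouped into this branch.
        return second_coeff
    # ASSUMPTION: mostly interference; return a coefficient at or below
    # the first enhancement coefficient threshold.
    return zero_coeff
```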

Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement:

according to the characteristic information of the original voice signal, performing mask processing on the original voice signal to obtain a mask matrix corresponding to the original voice signal;

and separating the effective voice signal from the original voice signal according to the mask matrix corresponding to the original voice signal.
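As a rough illustration of the separation step, the sketch below applies a mask matrix element-wise to a time-frequency representation of the original voice signal. The representation (a list of frames, each a list of magnitudes), the value range of the mask (assumed here to lie in [0, 1]), and the name `apply_mask` are assumptions made for this example. How the mask matrix is generated from the characteristic information is not specified in this passage and is outside the scope of the sketch.

```python
from typing import List

Matrix = List[List[float]]


def apply_mask(spectrum: Matrix, mask: Matrix) -> Matrix:
    """Keep the effective-voice part of each time-frequency bin.

    ASSUMPTION: `spectrum` and `mask` have the same shape, and each mask
    entry is the fraction of the bin attributed to effective voice.
    """
    if len(spectrum) != len(mask) or any(
            len(s_row) != len(m_row) for s_row, m_row in zip(spectrum, mask)):
        raise ValueError("spectrum and mask must have the same shape")
    return [[s * m for s, m in zip(s_row, m_row)]
            for s_row, m_row in zip(spectrum, mask)]
```

A mask entry of 1.0 keeps a bin unchanged, an entry of 0.0 removes it, and intermediate values attenuate it.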

In the embodiments of the present application, an original voice signal to be processed is acquired and separated to obtain the effective voice signal it contains. Feature extraction is performed on the original voice signal to obtain its feature information, and an enhancement coefficient of the effective voice signal is generated from that feature information. The enhancement coefficient determines whether a part of the original voice signal needs to be extracted and fused into the effective voice signal so as to enhance the performance of the effective voice signal. The effective voice signal is then enhanced according to its enhancement coefficient and the original voice signal to obtain an enhanced target voice signal.

When the input original voice signal is a mixed voice signal, an original voice sub-signal of the target data volume is extracted from the original voice signal according to the enhancement coefficient of the effective voice signal, and the effective voice signal is fused with this sub-signal to obtain the enhanced target voice signal. This repairs the separated effective voice signal to a certain extent, improves the signal-to-noise ratio of the target voice signal, and improves the accuracy of subsequent recognition of the target voice signal.

When the input original voice signal is a clean voice signal, that is, a pure effective voice signal, separating it would still damage the performance of the resulting effective voice signal. In this case the original voice signal can be output directly as the enhanced target voice signal, which avoids the performance damage that separation would introduce.

When the input original voice signal is a pure interference signal, separating it would leave part of the interference signal behind. An all-zero signal can therefore be used directly as the target voice signal; because an all-zero signal carries no signal information, the interference is removed completely and the signal-to-noise ratio of the target voice signal is improved. In this way, the present application can significantly improve the signal-to-noise ratio of the target voice signal, reduce its performance damage, and improve the accuracy of subsequent voice recognition processing.

It should be understood that the computer device 1000 described in this embodiment of the present application can carry out the voice signal processing method described in the embodiment corresponding to fig. 2 or fig. 8, and can also implement the voice signal processing apparatus described in the embodiment corresponding to fig. 9; the details are not repeated here. The beneficial effects, which are the same as those of the method, are likewise not repeated.

According to an aspect of the present application, a computer program product or computer program is provided that comprises computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device carries out the voice signal processing method described in the embodiment corresponding to fig. 2 or fig. 8; the details are not repeated here. The beneficial effects, which are the same as those of the method, are likewise not repeated.

By way of example, the program instructions described above may be executed on one computer device, or on multiple computer devices located at one site, or distributed across multiple sites and interconnected by a communication network, which may comprise a blockchain network.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above disclosure describes only preferred embodiments of the present application and is not to be construed as limiting the scope of the present application; equivalent variations and modifications made to the present application therefore remain within its scope.
