Artificial intelligence-based audio generation method, apparatus, device, and storage medium
1. A method for artificial intelligence based audio generation, the method comprising:
encoding a phoneme sequence corresponding to a text to obtain a context representation of the phoneme sequence;
determining, based on a first frame hidden state corresponding to each phoneme in the phoneme sequence, an alignment position of the first frame hidden state relative to the context representation;
when the alignment position corresponds to a non-end position in the context representation, performing decoding processing based on the context representation and the first frame hidden state to obtain a second frame hidden state;
and performing synthesis processing based on the first frame hidden state and the second frame hidden state to obtain audio data corresponding to the text.
2. The method of claim 1,
wherein the first frame hidden state represents the hidden state of a first frame, the second frame hidden state represents the hidden state of a second frame, and the first frame and the second frame are two adjacent frames in the spectrum data corresponding to the phoneme;
when the first frame hidden state is denoted as a t-th frame hidden state, the determining, based on the first frame hidden state corresponding to each phoneme in the phoneme sequence, an alignment position of the first frame hidden state relative to the context representation comprises:
performing the following for each phoneme in the phoneme sequence:
determining, based on the t-th frame hidden state corresponding to the phoneme, an alignment position of the t-th frame hidden state relative to the context representation;
the performing, when the alignment position corresponds to a non-end position in the context representation, decoding processing based on the context representation and the first frame hidden state to obtain a second frame hidden state comprises:
when the alignment position of the t-th frame hidden state relative to the context representation corresponds to a non-end position in the context representation, performing decoding processing based on the context representation and the t-th frame hidden state to obtain a (t+1)-th frame hidden state;
wherein t is a natural number incremented from 1, and 1 ≤ t ≤ T, where T is the total number of frames corresponding to the phoneme sequence when the alignment position corresponds to the end position in the context representation, the total number of frames representing the number of frames of spectrum data corresponding to the hidden states of the phonemes in the phoneme sequence.
3. The method of claim 2, wherein the performing synthesis processing based on the first frame hidden state and the second frame hidden state to obtain audio data corresponding to the text comprises:
when the alignment position corresponds to the end position in the context representation, performing splicing processing on the T frames of hidden states to obtain a hidden state corresponding to the text;
performing smoothing processing on the hidden state corresponding to the text to obtain spectrum data corresponding to the text;
and performing Fourier transform on the spectrum data corresponding to the text to obtain the audio data corresponding to the text.
4. The method of claim 2, wherein the determining, based on the t-th frame hidden state corresponding to the phoneme, an alignment position of the t-th frame hidden state relative to the context representation comprises:
performing Gaussian prediction processing based on the t-th frame hidden state corresponding to the phoneme to obtain a t-th Gaussian parameter corresponding to the t-th frame hidden state;
and determining, based on the t-th Gaussian parameter, the alignment position of the t-th frame hidden state relative to the context representation.
5. The method of claim 4, wherein the performing Gaussian prediction processing based on the t-th frame hidden state corresponding to the phoneme to obtain a t-th Gaussian parameter corresponding to the t-th frame hidden state comprises:
performing prediction processing based on a Gaussian function on the t-th frame hidden state corresponding to the phoneme to obtain a t-th Gaussian variance and a t-th Gaussian mean variation corresponding to the t-th frame hidden state;
acquiring a (t-1)-th Gaussian parameter corresponding to a (t-1)-th frame hidden state;
adding a (t-1)-th Gaussian mean included in the (t-1)-th Gaussian parameter to the t-th Gaussian mean variation, and taking the obtained sum as a t-th Gaussian mean corresponding to the t-th frame hidden state;
taking the t-th Gaussian variance and the t-th Gaussian mean together as the t-th Gaussian parameter corresponding to the t-th frame hidden state;
wherein the determining, based on the t-th Gaussian parameter, the alignment position of the t-th frame hidden state relative to the context representation comprises:
taking the t-th Gaussian mean as the alignment position of the t-th frame hidden state relative to the context representation.
6. The method of claim 5, further comprising:
acquiring a content text length of the context representation of the phoneme sequence;
when the t-th Gaussian mean is greater than the content text length, determining that the alignment position corresponds to the end position in the context representation;
and when the t-th Gaussian mean is less than or equal to the content text length, determining that the alignment position corresponds to a non-end position in the context representation.
7. The method of claim 2, wherein the performing decoding processing based on the context representation and the t-th frame hidden state to obtain a (t+1)-th frame hidden state comprises:
acquiring an attention weight corresponding to the t-th frame hidden state;
performing weighting processing on the context representation based on the attention weight to obtain a context vector corresponding to the context representation;
and performing state prediction processing based on the context vector and the t-th frame hidden state to obtain the (t+1)-th frame hidden state.
8. The method of claim 7, wherein the acquiring an attention weight corresponding to the t-th frame hidden state comprises:
acquiring the t-th Gaussian parameter corresponding to the t-th frame hidden state, wherein the t-th Gaussian parameter comprises a t-th Gaussian variance and a t-th Gaussian mean;
and performing Gaussian processing on the context representation based on the t-th Gaussian variance and the t-th Gaussian mean to obtain the attention weight corresponding to the t-th frame hidden state.
9. The method of claim 1,
the audio generation method is realized by calling a neural network model;
the training process of the neural network model comprises the following steps:
encoding a phoneme sequence sample corresponding to a text sample through the initialized neural network model to obtain a context representation of the phoneme sequence sample;
determining, based on a third frame hidden state corresponding to each phoneme in the phoneme sequence sample, a predicted alignment position of the third frame hidden state relative to the context representation;
when the predicted alignment position corresponds to a non-end position in the context representation, performing decoding processing based on the context representation and the third frame hidden state to obtain a fourth frame hidden state;
performing spectrum post-processing based on the third frame hidden state and the fourth frame hidden state to obtain predicted spectrum data corresponding to the text sample;
constructing a loss function of the neural network model based on the predicted spectrum data corresponding to the text sample and labeled spectrum data corresponding to the text sample;
updating parameters of the neural network model, and taking the updated parameters obtained when the loss function converges as the parameters of the trained neural network model;
wherein the third frame hidden state represents the hidden state of a third frame, the fourth frame hidden state represents the hidden state of a fourth frame, and the third frame and the fourth frame are any two adjacent frames in the spectrum data corresponding to each phoneme in the phoneme sequence sample.
10. The method of claim 9, wherein before updating the parameters of the neural network model, further comprising:
constructing a parameter matrix based on the parameters of the neural network model;
dividing the parameter matrix into blocks to obtain a plurality of matrix blocks included in the parameter matrix;
when a structured sparsification timing is reached, determining a mean value of the parameters in each matrix block;
sorting the matrix blocks in ascending order based on the mean value of the parameters in each matrix block, and resetting the parameters in a plurality of matrix blocks ranked first in the ascending sorting result to obtain a reset parameter matrix;
wherein the reset parameter matrix is used for updating the parameters of the neural network model.
11. The method of claim 9, wherein before constructing the loss function of the neural network model, further comprising:
acquiring the content text length of the context representation of the phoneme sequence sample;
when the predicted alignment position corresponds to an end position in the context representation, constructing a position loss function of the neural network model based on the predicted alignment position and the content text length;
wherein the constructing a loss function of the neural network model based on the predicted spectrum data corresponding to the text sample and the labeled spectrum data corresponding to the text sample comprises:
constructing a spectrum loss function of the neural network model based on the predicted spectrum data corresponding to the text sample and the labeled spectrum data corresponding to the text sample;
and carrying out weighted summation on the spectrum loss function and the position loss function, and taking the weighted summation result as the loss function of the neural network model.
12. The method of claim 1, wherein the encoding the phoneme sequence corresponding to the text to obtain the context representation of the phoneme sequence comprises:
performing forward encoding processing on the phoneme sequence corresponding to the text to obtain a forward hidden vector of the phoneme sequence;
performing backward encoding processing on the phoneme sequence corresponding to the text to obtain a backward hidden vector of the phoneme sequence;
and performing fusion processing on the forward hidden vector and the backward hidden vector to obtain the context representation of the phoneme sequence.
13. An apparatus for audio generation, the apparatus comprising:
an encoding module, configured to encode a phoneme sequence corresponding to a text to obtain a context representation of the phoneme sequence;
an attention module, configured to determine, based on a first frame hidden state corresponding to each phoneme in the phoneme sequence, an alignment position of the first frame hidden state relative to the context representation;
a decoding module, configured to, when the alignment position corresponds to a non-end position in the context representation, perform decoding processing based on the context representation and the first frame hidden state to obtain a second frame hidden state;
and a synthesis module, configured to perform synthesis processing based on the first frame hidden state and the second frame hidden state to obtain audio data corresponding to the text.
14. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor for implementing the artificial intelligence based audio generation method of any one of claims 1 to 12 when executing executable instructions stored in the memory.
15. A computer-readable storage medium storing executable instructions for implementing the artificial intelligence based audio generation method of any one of claims 1 to 12 when executed by a processor.
Background
Artificial Intelligence (AI) is a comprehensive discipline of computer science that studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making. Artificial intelligence technology covers a wide range of fields, such as natural language processing and machine learning/deep learning; as the technology develops, it will be applied in more fields and deliver increasingly important value.
In the related art, audio synthesis is coarse: the spectrum corresponding to text data is usually synthesized directly into the audio data corresponding to the text data. Such a synthesis approach cannot achieve accurate audio synthesis, which degrades the user experience.
Disclosure of Invention
The embodiments of the present application provide an artificial intelligence-based audio generation method and apparatus, an electronic device, and a computer-readable storage medium, which can improve the accuracy of audio synthesis.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides an audio generation method based on artificial intelligence, which comprises the following steps:
encoding a phoneme sequence corresponding to a text to obtain a context representation of the phoneme sequence;
determining, based on a first frame hidden state corresponding to each phoneme in the phoneme sequence, an alignment position of the first frame hidden state relative to the context representation;
when the alignment position corresponds to a non-end position in the context representation, performing decoding processing based on the context representation and the first frame hidden state to obtain a second frame hidden state;
and performing synthesis processing based on the first frame hidden state and the second frame hidden state to obtain audio data corresponding to the text.
In the above technical solution, the encoding the phoneme sequence corresponding to the text to obtain the context representation of the phoneme sequence includes:
performing forward encoding processing on the phoneme sequence corresponding to the text to obtain a forward hidden vector of the phoneme sequence;
performing backward encoding processing on the phoneme sequence corresponding to the text to obtain a backward hidden vector of the phoneme sequence;
and performing fusion processing on the forward hidden vector and the backward hidden vector to obtain the context representation of the phoneme sequence.
In the above technical solution, the forward encoding processing on the phoneme sequence corresponding to the text to obtain a forward hidden vector of the phoneme sequence includes:
sequentially encoding, by an encoder, each phoneme in the phoneme sequence corresponding to the text in a first direction to obtain a hidden vector of each phoneme in the first direction;
the performing backward encoding processing on the phoneme sequence corresponding to the text to obtain a backward hidden vector of the phoneme sequence includes:
sequentially encoding, by the encoder, each phoneme in a second direction to obtain a hidden vector of each phoneme in the second direction;
the performing fusion processing on the forward hidden vector and the backward hidden vector to obtain the context representation of the phoneme sequence includes:
splicing the forward hidden vector and the backward hidden vector to obtain the context representation of the phoneme sequence;
wherein the second direction is opposite to the first direction.
An embodiment of the present application provides an audio generating apparatus, including:
an encoding module, configured to encode a phoneme sequence corresponding to a text to obtain a context representation of the phoneme sequence;
an attention module, configured to determine, based on a first frame hidden state corresponding to each phoneme in the phoneme sequence, an alignment position of the first frame hidden state relative to the context representation;
a decoding module, configured to, when the alignment position corresponds to a non-end position in the context representation, perform decoding processing based on the context representation and the first frame hidden state to obtain a second frame hidden state;
and a synthesis module, configured to perform synthesis processing based on the first frame hidden state and the second frame hidden state to obtain audio data corresponding to the text.
In the above technical solution, the first frame hidden state represents the hidden state of a first frame, the second frame hidden state represents the hidden state of a second frame, and the first frame and the second frame are any two adjacent frames in the spectrum data corresponding to the phoneme;
when the first frame hidden state is denoted as a t-th frame hidden state, the attention module is further configured to perform the following for each phoneme in the phoneme sequence:
determining, based on the t-th frame hidden state corresponding to the phoneme, an alignment position of the t-th frame hidden state relative to the context representation;
the decoding module is further configured to, when the alignment position of the t-th frame hidden state relative to the context representation corresponds to a non-end position in the context representation, perform decoding processing based on the context representation and the t-th frame hidden state to obtain a (t+1)-th frame hidden state;
wherein t is a natural number incremented from 1, and 1 ≤ t ≤ T, where T is the total number of frames corresponding to the phoneme sequence when the alignment position corresponds to the end position in the context representation, the total number of frames representing the number of frames of spectrum data corresponding to the hidden states of the phonemes in the phoneme sequence.
In the above technical solution, the synthesis module is further configured to, when the alignment position corresponds to the end position in the context representation, perform splicing processing on the T frames of hidden states to obtain a hidden state corresponding to the text;
perform smoothing processing on the hidden state corresponding to the text to obtain spectrum data corresponding to the text;
and perform Fourier transform on the spectrum data corresponding to the text to obtain the audio data corresponding to the text.
In the above technical solution, the attention module is further configured to perform Gaussian prediction processing based on the t-th frame hidden state corresponding to the phoneme to obtain a t-th Gaussian parameter corresponding to the t-th frame hidden state;
and determine, based on the t-th Gaussian parameter, the alignment position of the t-th frame hidden state relative to the context representation.
In the above technical solution, the attention module is further configured to perform prediction processing based on a Gaussian function on the t-th frame hidden state corresponding to the phoneme to obtain a t-th Gaussian variance and a t-th Gaussian mean variation corresponding to the t-th frame hidden state;
acquire a (t-1)-th Gaussian parameter corresponding to a (t-1)-th frame hidden state;
add a (t-1)-th Gaussian mean included in the (t-1)-th Gaussian parameter to the t-th Gaussian mean variation, and take the obtained sum as a t-th Gaussian mean corresponding to the t-th frame hidden state;
take the t-th Gaussian variance and the t-th Gaussian mean together as the t-th Gaussian parameter corresponding to the t-th frame hidden state;
and take the t-th Gaussian mean as the alignment position of the t-th frame hidden state relative to the context representation.
In the above technical solution, the attention module is further configured to acquire a content text length of the context representation of the phoneme sequence;
when the t-th Gaussian mean is greater than the content text length, determine that the alignment position corresponds to the end position in the context representation;
and when the t-th Gaussian mean is less than or equal to the content text length, determine that the alignment position corresponds to a non-end position in the context representation.
In the above technical solution, the decoding module is further configured to acquire an attention weight corresponding to the t-th frame hidden state;
perform weighting processing on the context representation based on the attention weight to obtain a context vector corresponding to the context representation;
and perform state prediction processing based on the context vector and the t-th frame hidden state to obtain the (t+1)-th frame hidden state.
In the above technical solution, the attention module is further configured to acquire the t-th Gaussian parameter corresponding to the t-th frame hidden state, where the t-th Gaussian parameter includes a t-th Gaussian variance and a t-th Gaussian mean;
and perform Gaussian processing on the context representation based on the t-th Gaussian variance and the t-th Gaussian mean to obtain the attention weight corresponding to the t-th frame hidden state.
In the above technical solution, the audio generation method is implemented by calling a neural network model; the device further comprises:
a training module, configured to encode a phoneme sequence sample corresponding to a text sample through the initialized neural network model to obtain a context representation of the phoneme sequence sample;
determine, based on a third frame hidden state corresponding to each phoneme in the phoneme sequence sample, a predicted alignment position of the third frame hidden state relative to the context representation;
when the predicted alignment position corresponds to a non-end position in the context representation, perform decoding processing based on the context representation and the third frame hidden state to obtain a fourth frame hidden state;
perform spectrum post-processing based on the third frame hidden state and the fourth frame hidden state to obtain predicted spectrum data corresponding to the text sample;
construct a loss function of the neural network model based on the predicted spectrum data corresponding to the text sample and labeled spectrum data corresponding to the text sample;
and update parameters of the neural network model, and take the updated parameters obtained when the loss function converges as the parameters of the trained neural network model;
wherein the third frame hidden state represents the hidden state of a third frame, the fourth frame hidden state represents the hidden state of a fourth frame, and the third frame and the fourth frame are any two adjacent frames in the spectrum data corresponding to each phoneme in the phoneme sequence sample.
In the above technical solution, the training module is further configured to construct a parameter matrix based on the parameters of the neural network model;
divide the parameter matrix into blocks to obtain a plurality of matrix blocks included in the parameter matrix;
when a structured sparsification timing is reached, determine a mean value of the parameters in each matrix block;
sort the matrix blocks in ascending order based on the mean value of the parameters in each matrix block, and reset the parameters in a plurality of matrix blocks ranked first in the ascending sorting result to obtain a reset parameter matrix;
wherein the reset parameter matrix is used for updating the parameters of the neural network model.
In the above technical solution, the training module is further configured to acquire a content text length of the context representation of the phoneme sequence sample;
when the predicted alignment position corresponds to the end position in the context representation, construct a position loss function of the neural network model based on the predicted alignment position and the content text length;
construct a spectrum loss function of the neural network model based on the predicted spectrum data corresponding to the text sample and the labeled spectrum data corresponding to the text sample;
and perform weighted summation on the spectrum loss function and the position loss function, and take the weighted summation result as the loss function of the neural network model.
In the above technical solution, the encoding module is further configured to perform forward encoding processing on the phoneme sequence corresponding to the text to obtain a forward hidden vector of the phoneme sequence;
perform backward encoding processing on the phoneme sequence corresponding to the text to obtain a backward hidden vector of the phoneme sequence;
and perform fusion processing on the forward hidden vector and the backward hidden vector to obtain the context representation of the phoneme sequence.
In the above technical solution, the encoding module is further configured to sequentially encode, by an encoder, each phoneme in the phoneme sequence corresponding to the text in a first direction to obtain a hidden vector of each phoneme in the first direction;
sequentially encode, by the encoder, each phoneme in a second direction to obtain a hidden vector of each phoneme in the second direction;
and splice the forward hidden vector and the backward hidden vector to obtain the context representation of the phoneme sequence;
wherein the second direction is opposite to the first direction.
An embodiment of the present application provides an electronic device for audio generation, where the electronic device includes:
a memory for storing executable instructions;
and the processor is used for realizing the audio generation method based on artificial intelligence provided by the embodiment of the application when the executable instructions stored in the memory are executed.
An embodiment of the present application provides a computer-readable storage medium storing executable instructions that, when executed by a processor, implement the artificial intelligence-based audio generation method provided by the embodiments of the present application.
The embodiments of the present application have the following beneficial effects:
By determining the alignment position of the hidden state relative to the context representation, subsequent decoding operations are performed based on an accurate alignment position, so that accurate audio generation is achieved based on accurate hidden states.
Drawings
Fig. 1 is a schematic view of an application scenario of an audio generation system provided in an embodiment of the present application;
fig. 2 is a schematic structural diagram of an electronic device for audio generation provided by an embodiment of the present application;
Figs. 3-5 are schematic flow diagrams of the artificial intelligence-based audio generation method provided by embodiments of the present application;
FIG. 6 is a schematic encoding diagram of a content encoder provided in an embodiment of the present application;
FIG. 7 is a flowchart illustrating an artificial intelligence based audio generation method according to an embodiment of the present application;
FIG. 8 is a diagram illustrating an alignment position corresponding to the end position in a context representation provided by an embodiment of the present application;
FIG. 9 is a diagram illustrating an alignment position corresponding to a non-end position in a context representation provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of a parameter matrix provided by an embodiment of the present application;
FIG. 11 is a schematic diagram illustrating a training process of an end-to-end speech synthesis acoustic model according to an embodiment of the present application;
fig. 12 is a schematic diagram of an inference flow of an end-to-end speech synthesis acoustic model provided in an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the attached drawings, the described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.
In the following description, the terms "first", "second", and the like are merely used to distinguish between similar objects and do not denote a particular order or importance. It is to be understood that, where permitted, "first", "second", and the like may interchange a specific order or sequence, so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before the embodiments of the present application are described in further detail, the terms and expressions referred to in the embodiments of the present application are explained; the following explanations apply to these terms and expressions.
1) Convolutional Neural Network (CNN): a class of feedforward neural networks (FNN) that involve convolution computations and have a deep structure, and one of the representative algorithms of deep learning. Convolutional neural networks have a representation learning capability and can perform shift-invariant classification of an input image according to its hierarchical structure.
2) Recurrent Neural Network (RNN): a class of neural networks that take sequence data as input, perform recursion along the evolution direction of the sequence, and connect all nodes (recurrent units) in a chain. Recurrent neural networks have memory, parameter sharing, and Turing completeness, and therefore have certain advantages in learning the nonlinear characteristics of a sequence.
3) Phoneme: the smallest basic unit of speech; phonemes are the basis on which humans can distinguish one word from another. Phonemes form syllables, and syllables in turn form different words and phrases.
4) Hidden state: a sequence output by the decoder (e.g., of a hidden Markov model) that is used to represent spectrum data; smoothing the hidden states yields the corresponding spectrum data. Although an audio signal is non-stationary over a long period (e.g., more than one second), it can be approximated as stationary over a short period (e.g., 50 milliseconds); a stationary signal is characterized by a stable spectral distribution that is similar across different time periods. A hidden Markov model assigns a continuous segment of signal with a similar spectrum to one hidden state; this state is the true state of the Markov model, cannot be obtained by direct observation, and is used to represent a sequence of spectrum data. The hidden Markov model is trained by maximizing the likelihood: the data generated by each hidden state is described by a probability distribution, and the likelihood can only be made as large as possible when similar continuous signals are assigned to the same state.
In the embodiments of the present application, the first frame hidden state indicates the hidden state of a first frame, and the second frame hidden state indicates the hidden state of a second frame, where the first frame and the second frame are any two adjacent frames in the spectrum data corresponding to a phoneme.
5) Context representation: a sequence of vectors output by the encoder that characterizes the contextual content of the text.
6) End position: the position after the last element (e.g., phoneme, word, etc.) in the text. For example, if the phoneme sequence corresponding to a certain text has 5 phonemes, position 0 represents the start position of the phoneme sequence, position 1 represents the position of the first phoneme, ..., position 5 represents the position of the fifth phoneme, and position 6 represents the end position of the phoneme sequence; positions 0-5 are non-end positions in the phoneme sequence.
7) Mean Absolute Error (MAE): also known as L1 loss, the mean of the absolute differences between the model prediction f(x) and the true value y.
8) Block sparsification (block sparsity): during training, the weights are first partitioned into blocks; when the parameters are updated, the blocks are sorted by the average absolute value of the parameters in each block, and the weights in the blocks with smaller average absolute values are set to 0 (see the illustrative sketch after this list).
9) Synthesis real-time rate: the ratio of the duration of a piece of audio to the computer run time required to synthesize it; for example, if synthesizing 1 second of audio requires 100 milliseconds of computer run time, the synthesis real-time rate is 10 times.
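The following minimal numerical sketch (not part of the original disclosure) illustrates the block sparsification described in item 8); the 2-D weight matrix, the 4×4 block size, and the 30% reset ratio are assumed values for illustration only.

```python
import numpy as np

def block_sparsify(weights, block_shape=(4, 4), reset_ratio=0.3):
    """Zero out the blocks whose mean absolute parameter value is smallest.

    weights: 2-D parameter matrix.
    block_shape, reset_ratio: assumed hyperparameters for illustration.
    """
    rows, cols = weights.shape
    br, bc = block_shape
    assert rows % br == 0 and cols % bc == 0, "matrix must tile evenly into blocks"

    # Compute the mean absolute value of every block.
    blocks = []
    for i in range(0, rows, br):
        for j in range(0, cols, bc):
            block_mean = np.abs(weights[i:i + br, j:j + bc]).mean()
            blocks.append((block_mean, i, j))

    # Sort blocks in ascending order of mean absolute value and
    # reset (zero) the first `reset_ratio` fraction of them.
    blocks.sort(key=lambda b: b[0])
    num_reset = int(len(blocks) * reset_ratio)
    pruned = weights.copy()
    for _, i, j in blocks[:num_reset]:
        pruned[i:i + br, j:j + bc] = 0.0
    return pruned

# Example: sparsify a random 8x8 parameter matrix.
w = np.random.randn(8, 8)
w_sparse = block_sparsify(w)
```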
The embodiments of the present application provide an artificial intelligence-based audio generation method and apparatus, an electronic device, and a computer-readable storage medium, which can improve the accuracy of audio synthesis.
The artificial intelligence-based audio generation method provided by the embodiments of the present application can be implemented by a terminal or a server alone, or cooperatively by the terminal and the server. For example, the terminal alone performs the artificial intelligence-based audio generation method described below; or the terminal sends a generation request for audio (including the text of the audio to be generated) to the server, and the server executes the artificial intelligence-based audio generation method according to the received generation request. In response to the generation request, when the alignment position corresponds to a non-end position in the context representation, decoding processing is performed based on the context representation and the first frame hidden state to obtain a second frame hidden state, and synthesis processing is performed based on the first frame hidden state and the second frame hidden state to obtain audio data corresponding to the text, thereby achieving intelligent and accurate audio generation.
The electronic device for audio generation provided by the embodiment of the application may be various types of terminal devices or servers, wherein the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services; the terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
Taking a server as an example, a server cluster may be deployed in the cloud to provide an artificial intelligence cloud Service (AI as a Service, AIaaS) to users. The AIaaS platform splits several types of common AI services and provides independent or packaged services in the cloud. This service mode is similar to an AI-themed marketplace: any user can access, through an application programming interface, one or more of the artificial intelligence services provided by the AIaaS platform.
For example, one of the artificial intelligence cloud services may be an audio generation service, that is, a server in the cloud encapsulates the audio generation program provided by the embodiments of the present application. A user invokes the audio generation service in the cloud service through a terminal (running a client, such as an audio client or a vehicle-mounted client), so that the server deployed in the cloud calls the encapsulated audio generation program: when the alignment position corresponds to a non-end position in the context representation, decoding processing is performed based on the context representation and the first frame hidden state to obtain a second frame hidden state, and synthesis processing is performed based on the first frame hidden state and the second frame hidden state to obtain audio data corresponding to the text.
As an application example, for an audio client, the user may be a broadcaster of a certain broadcast platform who needs to regularly broadcast notices, everyday tips, and the like to the residents of a community. For example, the broadcaster inputs a text at the audio client, and the text needs to be converted into audio and broadcast to the residents. In the process of converting the text into audio, the alignment position of the hidden state relative to the context representation of the phoneme sequence corresponding to the text is continuously determined, so that subsequent decoding operations are performed based on an accurate alignment position; accurate audio generation is thus achieved based on accurate hidden states, and the generated audio is broadcast to the residents.
As another application example, for a vehicle-mounted client, it is inconvenient for a user who is driving to read information in text form, but the user can obtain the information by listening to audio, so that important information is not missed. For example, while the user is driving, a supervisor sends the user the text of an important meeting and asks the user to read and handle it in time. After receiving the text, the vehicle-mounted client needs to convert the text into audio and play the audio to the user. In the process of converting the text into audio, the alignment position of the hidden state relative to the context representation of the phoneme sequence corresponding to the text is continuously determined, so that subsequent decoding operations are performed based on an accurate alignment position; accurate audio generation is thus achieved based on accurate hidden states, and the generated audio is played to the user so that the user can take in the information in time.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of an audio generating system 10 provided in an embodiment of the present application, a terminal 200 is connected to a server 100 through a network 300, and the network 300 may be a wide area network or a local area network, or a combination of the two.
The terminal 200 (running with a client, such as an audio client, a vehicle-mounted client, etc.) may be used to obtain a generation request for audio, for example, when a user inputs a text of an audio to be generated through the terminal 200, the terminal 200 automatically obtains the text of the audio to be generated and automatically generates the generation request for the audio.
In some embodiments, an audio generation plug-in may be embedded in the client running in the terminal, so that the artificial intelligence-based audio generation method is implemented locally at the client. For example, after obtaining a generation request for audio (including the text of the audio to be generated), the terminal 200 calls the audio generation plug-in to implement the artificial intelligence-based audio generation method: when the alignment position corresponds to a non-end position in the context representation, decoding processing is performed based on the context representation and the first frame hidden state to obtain a second frame hidden state, and synthesis processing is performed based on the first frame hidden state and the second frame hidden state to obtain audio data corresponding to the text, thereby achieving intelligent and accurate audio generation. For example, for a recording application, a user who cannot perform high-quality personalized voice customization outside a recording studio inputs a piece of text to be recorded in the recording client, and the text needs to be converted into personalized audio. In the process of converting the text into audio, the alignment position of the hidden state relative to the context representation of the phoneme sequence corresponding to the text is continuously determined, and subsequent decoding operations are performed based on the accurate alignment position, so that accurate personalized audio is generated based on accurate hidden states and personalized voice customization is achieved outside a recording studio.
In some embodiments, after obtaining the generation request for audio, the terminal 200 calls an audio generation interface of the server 100 (which may be provided in the form of a cloud service, that is, an audio generation service). When the alignment position corresponds to a non-end position in the context representation, the server 100 performs decoding processing based on the context representation and the first frame hidden state to obtain a second frame hidden state, performs synthesis processing based on the first frame hidden state and the second frame hidden state to obtain audio data corresponding to the text, and sends the audio data to the terminal 200. For example, for a recording application, if a user cannot perform high-quality personalized voice customization outside a recording studio, the user inputs a piece of text to be recorded in the terminal 200, which automatically generates a generation request for audio and sends it to the server 100. In the process of converting the text into audio, the server 100 continuously determines the alignment position of the hidden state relative to the context representation of the phoneme sequence corresponding to the text and performs subsequent decoding operations based on the accurate alignment position, so that accurate personalized audio is generated based on accurate hidden states. The generated personalized audio is sent to the terminal 200 in response to the generation request, thereby achieving personalized voice customization outside a recording studio.
The following describes a structure of an electronic device for audio generation provided in an embodiment of the present application, referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device 500 for audio generation provided in an embodiment of the present application, and taking the electronic device 500 as an example of a server, the electronic device 500 for audio generation shown in fig. 2 includes: at least one processor 510, memory 550, at least one network interface 520, and a user interface 530. The various components in the electronic device 500 are coupled together by a bus system 540. It is understood that the bus system 540 is used to enable communications among the components. The bus system 540 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 540 in fig. 2.
The Processor 510 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The memory 550 may comprise volatile memory or nonvolatile memory, and may also comprise both volatile and nonvolatile memory. The non-volatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 550 described in embodiments herein is intended to comprise any suitable type of memory. Memory 550 optionally includes one or more storage devices physically located remote from processor 510.
In some embodiments, memory 550 can store data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 552 for communicating to other computing devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 including: bluetooth, wireless compatibility authentication (WiFi), and Universal Serial Bus (USB), etc.;
in some embodiments, the audio generating apparatus provided in this embodiment of the present application may be implemented in a software manner, for example, the audio generating apparatus may be an audio generating plug-in the terminal described above, and may be an audio generating service in the server described above. Of course, without limitation, the audio generation apparatus provided by the embodiments of the present application may be provided as various software embodiments, including various forms of applications, software modules, scripts or code.
Fig. 2 shows an audio generation means 555 stored in memory 550, which may be software in the form of programs and plug-ins, such as an audio generation plug-in, and comprises a series of modules including an encoding module 5551, an attention module 5552, a decoding module 5553, a synthesis module 5554, and a training module 5555; the encoding module 5551, the attention module 5552, the decoding module 5553, and the synthesizing module 5554 are configured to implement the audio generating function provided in the embodiment of the present application, and the training module 5555 is configured to train a neural network model, where the audio generating method is implemented by calling the neural network model.
As mentioned above, the artificial intelligence based audio generation method provided by the embodiments of the present application can be implemented by various types of electronic devices. Referring to fig. 3, fig. 3 is a schematic flowchart of an artificial intelligence based audio generation method provided in an embodiment of the present application, and is described in conjunction with the steps shown in fig. 3.
In the following steps, a piece of text corresponds to one phoneme sequence, and each phoneme corresponds to multiple frames of spectrum data (i.e., audio data). For example, if phoneme a corresponds to 50 milliseconds of spectrum data and one frame of spectrum data is 10 milliseconds, then phoneme a corresponds to 5 frames of spectrum data.
In step 101, a phoneme sequence corresponding to the text is encoded to obtain a context representation of the phoneme sequence.
As an example of obtaining the text: a user inputs the text of the audio to be generated through a terminal; the terminal automatically obtains the text, automatically generates a generation request for the audio, and sends the generation request to a server; the server parses the generation request to obtain the text of the audio to be generated, and preprocesses the text to obtain the phoneme sequence corresponding to the text, so that encoding processing can be performed on the phoneme sequence. For example, the phoneme sequence corresponding to the text "speech synthesis" is "v 3in1 h e2 ch eng 2". The phoneme sequence is encoded by a content encoder (a context-dependent model) to obtain the context representation of the phoneme sequence, so that the context representation output by the content encoder has the ability to model context.
In some embodiments, the encoding a phoneme sequence corresponding to the text to obtain a context representation of the phoneme sequence includes: performing forward encoding processing on the phoneme sequence corresponding to the text to obtain a forward hidden vector of the phoneme sequence; performing backward encoding processing on the phoneme sequence corresponding to the text to obtain a backward hidden vector of the phoneme sequence; and performing fusion processing on the forward hidden vector and the backward hidden vector to obtain the context representation of the phoneme sequence.
For example, the phoneme sequence may be input to a content encoder (e.g., an RNN or a Bidirectional Long Short-Term Memory network (BLSTM or BiLSTM)), and the content encoder performs forward encoding and backward encoding on the phoneme sequence respectively to obtain the forward hidden vector and the backward hidden vector corresponding to the phoneme sequence; the forward hidden vector and the backward hidden vector are then fused to obtain a context representation containing context information, where the forward hidden vector contains all of the forward information and the backward hidden vector contains all of the backward information. Therefore, the encoded information obtained by fusing the forward hidden vector and the backward hidden vector contains all of the information of the phoneme sequence.
In some embodiments, the forward encoding the phoneme sequence corresponding to the text to obtain a forward hidden vector of the phoneme sequence includes: sequentially encoding, by an encoder, each phoneme in the phoneme sequence corresponding to the text in a first direction to obtain a hidden vector of each phoneme in the first direction; correspondingly, the backward encoding the phoneme sequence corresponding to the text to obtain a backward hidden vector of the phoneme sequence includes: sequentially encoding, by the encoder, each phoneme in a second direction to obtain a hidden vector of each phoneme in the second direction; correspondingly, the fusing the forward hidden vector and the backward hidden vector to obtain the context representation of the phoneme sequence includes: splicing the forward hidden vector and the backward hidden vector to obtain the context representation of the phoneme sequence.
As shown in fig. 6, the second direction is opposite to the first direction: when the first direction runs from the first phoneme to the last phoneme in the phoneme sequence, the second direction runs from the last phoneme to the first phoneme, and vice versa. The content encoder sequentially encodes each phoneme in the phoneme sequence in the first direction and in the second direction to obtain the hidden vector of each phoneme in the first direction and the hidden vector of each phoneme in the second direction, and the forward hidden vector and the backward hidden vector are spliced to obtain a context representation containing context information, where the hidden vector in the first direction contains all of the information in the first direction and the hidden vector in the second direction contains all of the information in the second direction. Therefore, the encoded information obtained by splicing the hidden vector in the first direction and the hidden vector in the second direction contains all of the information of the phoneme sequence.
For example, 0 < j ≤ M, where j and M are positive integers and M is the number of phonemes in the phoneme sequence. When there are M phonemes in the phoneme sequence, the M phonemes are encoded in the first direction to sequentially obtain M hidden vectors in the first direction; for example, after the phoneme sequence is encoded in the first direction, the hidden vectors in the first direction are {h_1l, h_2l, ..., h_jl, ..., h_Ml}, where h_jl represents the hidden vector of the j-th phoneme in the first direction. The M phonemes are encoded in the second direction to sequentially obtain M hidden vectors in the second direction; for example, after encoding in the second direction, the hidden vectors in the second direction are {h_1r, h_2r, ..., h_jr, ..., h_Mr}, where h_jr represents the hidden vector of the j-th phoneme in the second direction. The hidden vectors in the first direction {h_1l, h_2l, ..., h_jl, ..., h_Ml} and the hidden vectors in the second direction {h_1r, h_2r, ..., h_jr, ..., h_Mr} are spliced to obtain the context representation {[h_1l, h_1r], [h_2l, h_2r], ..., [h_jl, h_jr], ..., [h_Ml, h_Mr]} containing context information; for example, the j-th hidden vector h_jl in the first direction and the j-th hidden vector h_jr in the second direction are spliced to obtain the j-th piece of encoded information {h_jl, h_jr} containing context information. To save computation, since the last hidden vector in the first direction contains most of the information in the first direction and the last hidden vector in the second direction contains most of the information in the second direction, the last hidden vector in the first direction and the last hidden vector in the second direction can also be directly fused to obtain a context representation containing context information.
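Purely as an illustrative sketch (not the patented implementation), a bidirectional content encoder of the kind described above could be built on an off-the-shelf bidirectional LSTM; the class name ContentEncoder, the use of PyTorch, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Minimal sketch of a bidirectional content encoder (assumed dimensions)."""

    def __init__(self, num_phonemes=100, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(num_phonemes, embed_dim)
        # bidirectional=True runs the first-direction and second-direction
        # passes and concatenates the two hidden vectors per phoneme.
        self.blstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                             bidirectional=True)

    def forward(self, phoneme_ids):
        # phoneme_ids: (batch, M) integer phoneme indices.
        embedded = self.embedding(phoneme_ids)
        context, _ = self.blstm(embedded)   # (batch, M, 2 * hidden_dim)
        return context                      # context representation [h_jl, h_jr] per phoneme

# Example: encode a phoneme sequence of length M = 6.
encoder = ContentEncoder()
context = encoder(torch.randint(0, 100, (1, 6)))
```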
In step 102, an alignment position of the first frame hidden state relative to the context representation is determined based on the first frame hidden state corresponding to each phoneme in the phoneme sequence.
In step 103, when the alignment position corresponds to a non-end position in the context representation, decoding processing is performed based on the context representation and the first frame hidden state to obtain the second frame hidden state.
Each phoneme corresponds to multiple frames of hidden states. The first frame hidden state represents the hidden state of the first frame, the second frame hidden state represents the hidden state of the second frame, and the first frame and the second frame are two adjacent frames in the spectrum data corresponding to the phoneme.
Referring to fig. 4, fig. 4 is an optional flowchart of the artificial intelligence-based audio generation method provided by an embodiment of the present application, and fig. 4 shows that step 102 in fig. 3 can be implemented by step 102A shown in fig. 4. In step 102A, when the first frame hidden state is denoted as the t-th frame hidden state, the following processing is performed for each phoneme in the phoneme sequence: determining, based on the t-th frame hidden state corresponding to the phoneme, the alignment position of the t-th frame hidden state relative to the context representation. Correspondingly, step 103 can be implemented by step 103A shown in fig. 4. In step 103A, when the alignment position of the t-th frame hidden state relative to the context representation corresponds to a non-end position in the context representation, decoding processing is performed based on the context representation and the t-th frame hidden state to obtain the (t+1)-th frame hidden state; where t is a natural number incremented from 1, 1 ≤ t ≤ T, and T is the total number of frames corresponding to the phoneme sequence when the alignment position corresponds to the end position in the context representation, the total number of frames representing the number of frames of spectrum data corresponding to the hidden states of the phonemes in the phoneme sequence.
As shown in fig. 7, the following iterative process is performed for each phoneme in the phoneme sequence: inputting the t-th frame implicit state output by the autoregressive decoder into a Gaussian attention mechanism, determining the alignment position of the t-th frame implicit state relative to the context representation based on the t-th frame implicit state by the Gaussian attention mechanism, when the alignment position of the t-th frame implicit state relative to the context representation corresponds to the non-end position in the context representation, continuing decoding processing by the autoregressive decoder, performing decoding processing based on the context representation and the t-th frame implicit state to obtain a t + 1-th frame implicit state, and stopping iterative processing until the alignment position of the implicit state relative to the context representation corresponds to the end position in the context representation.
Referring to fig. 5, fig. 5 is an alternative flowchart of an artificial intelligence based audio generation method provided in an embodiment of the present application, and fig. 5 shows that step 102A in fig. 4 can be implemented by steps 1021A to steps 1022A shown in fig. 5: in step 1021A, performing Gaussian prediction processing based on the t-th frame hidden state corresponding to the phoneme to obtain a t-th Gaussian parameter corresponding to the t-th frame hidden state; in step 1022A, an alignment position of the implicit state of the tth frame with respect to the context characterization is determined based on the tth gaussian parameter.
In connection with the above example, the Gaussian attention mechanism includes a fully connected layer; Gaussian prediction processing is performed by the fully connected layer based on the t-th frame hidden state corresponding to the phoneme to obtain the t-th Gaussian parameter corresponding to the t-th frame hidden state, so that the alignment position of the t-th frame hidden state relative to the context representation can be determined based on the t-th Gaussian parameter.
For example, a prediction processing based on a gaussian function is performed based on the t-th frame hidden state corresponding to the phoneme, so as to obtain a t-th gaussian variance and a t-th gaussian mean variation corresponding to the t-th frame hidden state; acquiring a t-1 Gaussian parameter corresponding to the hidden state of the t-1 frame; adding the t-1 Gaussian mean value included by the t-1 Gaussian parameter and the variation of the t Gaussian mean value, and taking the obtained addition result as the t Gaussian mean value corresponding to the t frame hidden state; taking the set of the tth Gaussian variance and the tth Gaussian mean value as a tth Gaussian parameter corresponding to the hidden state of the tth frame; and taking the t-th Gaussian average value as the alignment position of the implicit state of the t-th frame relative to the context representation.
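A minimal numeric sketch of this update, assuming the Gaussian prediction is a single linear layer followed by a softplus so that the mean variation is non-negative (the layer, sizes, and random values are illustrative, not taken from the embodiment):

```python
import numpy as np

rng = np.random.default_rng(0)
H = 16                                   # decoder hidden size (assumed)
W = rng.normal(size=(2, H)) * 0.1        # stand-in fully connected layer: predicts (delta, raw variance)

def softplus(x):
    return np.log1p(np.exp(x))

def gaussian_params(h_t, mu_prev):
    """Predict the t-th Gaussian parameters from the t-th frame hidden state."""
    delta_raw, var_raw = W @ h_t
    delta = softplus(delta_raw)          # non-negative mean variation keeps the alignment monotonic
    mu_t = mu_prev + delta               # t-th Gaussian mean = (t-1)-th mean + mean variation
    sigma2_t = softplus(var_raw) + 1e-4  # t-th Gaussian variance
    return mu_t, sigma2_t

mu, h = 0.0, rng.normal(size=H)
mu, sigma2 = gaussian_params(h, mu)      # mu is the alignment position of this frame
```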
In some embodiments, the content text length of the context representation of the phoneme sequence is obtained; when the tth Gaussian mean value is larger than the content text length, it is determined that the alignment position corresponds to the end position in the context representation; and when the tth Gaussian mean value is smaller than or equal to the content text length, it is determined that the alignment position corresponds to a non-end position in the context representation.
As shown in fig. 8, for example, the length of the content text of the context token is 6, and when the tth gaussian mean is greater than the length of the content text, the aligned position corresponds to the end position in the context token, that is, the aligned position points to the end position of the context token.
As shown in fig. 9, for example, the text length of the content of the context token is 6, and when the tth gaussian mean is less than or equal to the text length of the content, the aligned position corresponds to a non-end position in the context token, that is, the aligned position points to a position in the context token that includes the content, for example, the aligned position points to a position of a second content in the context token.
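A one-line check for this criterion (a sketch; the comparison with the content text length simply follows the description above):

```python
def alignment_at_end(mu_t: float, content_text_length: int) -> bool:
    """True when the t-th Gaussian mean points past the context representation, i.e. the end position."""
    return mu_t > content_text_length

print(alignment_at_end(6.4, 6))   # True: the alignment corresponds to the end position
print(alignment_at_end(2.1, 6))   # False: the alignment corresponds to a non-end position
```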
In some embodiments, decoding based on the context characterization and the t-th frame implicit state to obtain a t + 1-th frame implicit state includes: acquiring attention weight corresponding to the hidden state of the t-th frame; carrying out weighting processing on the context characterization based on the attention weight to obtain a context vector corresponding to the context characterization; and performing state prediction processing based on the context vector and the t frame implicit state to obtain a t +1 frame implicit state.
For example, when the alignment position corresponds to a non-end position in the context representation, this indicates that decoding processing needs to continue: the attention weight corresponding to the t-th frame implicit state is determined by the Gaussian attention mechanism, the context representation is weighted based on the attention weight to obtain a context vector corresponding to the context representation, the context vector is sent to the autoregressive decoder, and the autoregressive decoder performs state prediction processing based on the context vector and the t-th frame implicit state to obtain the t + 1-th frame implicit state, so that the implicit states are generated autoregressively and adjacent frames remain correlated, as shown in the sketch below.
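A minimal sketch of one such decoding step, assuming a random stand-in for the context representation and a single tanh recurrence as a stand-in for the autoregressive decoder (all sizes, weights, and the example mean/variance are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
M, D, H = 6, 8, 16                          # phonemes, encoder dim, decoder hidden dim (assumed)
context_repr = rng.normal(size=(M, D))      # context representation from the content encoder
W_dec = rng.normal(size=(H, D + H)) * 0.1   # stand-in decoder weights

def attention_weights(mu_t, sigma2_t, M):
    j = np.arange(1, M + 1)
    return np.exp(-(j - mu_t) ** 2 / (2 * sigma2_t))    # Gaussian attention weights alpha_{t,j}

def decode_step(h_t, mu_t, sigma2_t):
    alpha = attention_weights(mu_t, sigma2_t, M)
    ctx = alpha @ context_repr                            # weighted context vector
    h_next = np.tanh(W_dec @ np.concatenate([ctx, h_t]))  # (t+1)-th frame hidden state
    return h_next

h_t = np.zeros(H)
h_next = decode_step(h_t, mu_t=1.5, sigma2_t=0.5)
```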
In some embodiments, obtaining the attention weight corresponding to the implicit state of the tth frame includes: acquiring a tth Gaussian parameter corresponding to the hidden state of the tth frame, wherein the tth Gaussian parameter comprises a tth Gaussian variance and a tth Gaussian mean; and performing Gaussian processing on the context representation based on the tth Gaussian variance and the tth Gaussian mean value to obtain the attention weight corresponding to the t-th frame hidden state.
For example, the attention weight is calculated by the formula αt,j = exp(−(j − μt)² / (2σt²)), where αt,j represents the attention weight of the jth element of the phoneme sequence input to the content encoder at the t-th iterative computation (t-th frame implicit state), μt represents the mean of the Gaussian function at the t-th computation, and σt² represents the variance of the Gaussian function at step t. The embodiments of the present application are not limited to this formula; other modified weight calculation formulas are also applicable to the embodiments of the present application.
In step 104, a synthesis process is performed based on the first frame hidden state and the second frame hidden state to obtain audio data corresponding to the text.
For example, the hidden state of the first frame represents the hidden state of the first frame, the hidden state of the second frame represents the hidden state of the second frame, and the first frame and the second frame are any two adjacent frames in the frequency spectrum data corresponding to the phoneme. When the alignment position corresponds to the end position in the context representation, the T frame hidden states are spliced to obtain the hidden state corresponding to the text, the hidden state corresponding to the text is smoothed to obtain the frequency spectrum data corresponding to the text, and the frequency spectrum data corresponding to the text is subjected to Fourier transform to obtain the audio data corresponding to the text. Because when to stop decoding is judged based on the alignment position, the problem of stopping decoding prematurely is solved, and the naturalness and stability of speech synthesis are improved.
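A toy sketch of this synthesis step, assuming T predicted frames, a moving average standing in for the smoothing post-processing network, and a per-frame inverse FFT with overlap-add standing in for the Fourier transform to audio (the sizes, the projection to spectral bins, and the window are illustrative; phases are ignored in this toy):

```python
import numpy as np

T, H, N_FFT, HOP = 40, 16, 256, 64            # frames, hidden size, FFT size, hop length (assumed)
rng = np.random.default_rng(2)
hidden_states = [rng.normal(size=H) for _ in range(T)]   # the T frame hidden states

# 1) Splice the T frame hidden states into the hidden state corresponding to the text.
spliced = np.stack(hidden_states)                          # (T, H)

# 2) Smooth along time to obtain the spectrum data (a moving average stands in for the post-net).
kernel = np.ones(3) / 3.0
smoothed = np.stack([np.convolve(spliced[:, d], kernel, mode="same") for d in range(H)], axis=1)
spectrum = np.abs(smoothed @ rng.random((H, N_FFT // 2 + 1)))   # toy projection to magnitude bins

# 3) Per-frame inverse FFT with overlap-add to obtain the audio data.
audio = np.zeros(HOP * (T - 1) + N_FFT)
for t in range(T):
    frame = np.fft.irfft(spectrum[t], n=N_FFT)
    audio[t * HOP: t * HOP + N_FFT] += frame * np.hanning(N_FFT)
print(audio.shape)   # (2752,)
```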
In some embodiments, the neural network model needs to be trained, so that the trained neural network model can realize audio generation, and the audio generation method is realized by calling the neural network model; the training process of the neural network model comprises the following steps: coding the phoneme sequence sample corresponding to the text sample through the initialized neural network model to obtain the context representation of the phoneme sequence sample; determining a predicted alignment position of the third frame implicit state relative to the context representation based on the third frame implicit state corresponding to each phoneme in the phoneme sequence sample; when the predicted alignment position corresponds to a non-tail position in the context representation, decoding processing is carried out based on the context representation and the implicit state of the third frame to obtain the implicit state of the fourth frame; performing spectrum post-processing based on the third frame hidden state and the fourth frame hidden state to obtain predicted spectrum data corresponding to the text sample; constructing a loss function of the neural network model based on the predicted spectrum data corresponding to the text sample and the labeled spectrum data corresponding to the text sample; updating parameters of the neural network model, and taking the updated parameters of the neural network model when the loss function is converged as the parameters of the trained neural network model; the third frame hidden state represents the hidden state of the third frame, the fourth frame hidden state represents the hidden state of the fourth frame, and the third frame and the fourth frame are two frames which are arbitrarily adjacent in the spectrum data corresponding to each phoneme in the phoneme sequence sample.
For example, after the value of the loss function of the neural network model is determined based on the predicted spectrum data corresponding to the text sample and the labeled spectrum data corresponding to the text sample, it may be determined whether the value of the loss function exceeds a preset threshold; when it does, an error signal of the neural network model is determined based on the loss function, the error signal is propagated backward in the neural network model, and the model parameters of each layer are updated during the propagation.
To describe backward propagation: training sample data is input into the input layer of the neural network model, passes through the hidden layer, and finally reaches the output layer, which outputs the result; this is the forward propagation process of the neural network model. Because the output result of the neural network model differs from the actual result, the error between the output result and the actual value is calculated and propagated backward from the output layer toward the hidden layer until it reaches the input layer; during backward propagation, the values of the model parameters are adjusted according to the error, and this process is iterated continuously until convergence.
In some embodiments, before updating the parameters of the neural network model, the method further comprises: constructing a parameter matrix based on the parameters of the neural network model; dividing the parameter matrix into blocks to obtain a plurality of matrix blocks included in the parameter matrix; when the structured-sparsification timing is reached, determining the mean value of the parameters in each matrix block; sorting the matrix blocks in ascending order based on the mean value of the parameters in each matrix block, and resetting the parameters in a plurality of matrix blocks ranked first in the ascending sorting result to obtain a reset parameter matrix; wherein the reset parameter matrix is used for updating the parameters of the neural network model.
As shown in fig. 10, in order to increase the audio synthesis speed, the parameters of the neural network model may be trained in blocks during training. A parameter matrix is first constructed based on all the parameters of the neural network model, and the parameter matrix is then divided into blocks to obtain matrix block 1, matrix block 2, ..., and matrix block 16. When a preset training frequency or a preset training time is reached, the mean value of the parameters in each matrix block is determined, the matrix blocks are sorted in ascending order based on these mean values, and the parameters in the matrix blocks ranked first in the ascending order result are reset to 0. For example, the parameters in the first 8 matrix blocks are reset to 0: matrix block 3, matrix block 4, matrix block 7, matrix block 8, matrix block 9, matrix block 10, matrix block 13, and matrix block 14 are the first 8 matrix blocks in the ascending order result, so the parameters in the dashed-line box 1001 (including matrix block 3, matrix block 4, matrix block 7, and matrix block 8) and the dashed-line box 1002 (including matrix block 9, matrix block 10, matrix block 13, and matrix block 14) are reset to 0, and a reset parameter matrix is obtained. The multiplication operations on the parameter matrix can then be accelerated, thereby increasing the training speed, as shown in the sketch below.
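A minimal sketch of the block sparsification described above (the 4x4 blocking and the 50% ratio follow the example; the matrix itself and its size are illustrative):

```python
import numpy as np

def block_sparsify(W, block_shape=(4, 4), ratio=0.5):
    """Zero out the blocks whose mean parameter magnitude is smallest."""
    bh, bw = block_shape
    H, Wd = W.shape
    assert H % bh == 0 and Wd % bw == 0
    blocks = W.reshape(H // bh, bh, Wd // bw, bw)
    means = np.abs(blocks).mean(axis=(1, 3))             # mean magnitude per matrix block
    k = int(means.size * ratio)                          # number of blocks to reset
    flat = means.flatten()
    smallest = np.argsort(flat)[:k]                      # ascending order: smallest means first
    mask = np.ones_like(flat, dtype=bool)
    mask[smallest] = False
    mask = mask.reshape(means.shape)[:, None, :, None]   # broadcast back onto the block layout
    return (blocks * mask).reshape(H, Wd)

W = np.random.default_rng(3).normal(size=(16, 16))
W_sparse = block_sparsify(W)                             # half of the 4x4 blocks are reset to 0
```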
In some embodiments, before constructing the loss function of the neural network model, the method further includes: acquiring the content text length of the context representation of the phoneme sequence sample; when the predicted alignment position corresponds to the tail position in the context representation, constructing a position loss function of the neural network model based on the predicted alignment position and the content text length; constructing a spectrum loss function of the neural network model based on the predicted spectrum data corresponding to the text sample and the labeled spectrum data corresponding to the text sample; and carrying out weighted summation on the spectrum loss function and the position loss function, and taking the weighted summation result as the loss function of the neural network model.
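A minimal sketch of this loss construction, assuming L1 distances and equal weights (the |μ − length − 1| form of the position term anticipates the attention stop loss described later in this document; the weights and shapes are illustrative assumptions):

```python
import numpy as np

def spectrum_loss(pred_spec, target_spec):
    """L1 between the predicted and the labeled spectrum data."""
    return np.abs(pred_spec - target_spec).mean()

def position_loss(final_mu, content_text_length):
    """Penalize the final alignment position for not landing just past the text end."""
    return abs(final_mu - content_text_length - 1)

def total_loss(pred_spec, target_spec, final_mu, content_text_length,
               w_spec=1.0, w_pos=1.0):
    return w_spec * spectrum_loss(pred_spec, target_spec) + \
           w_pos * position_loss(final_mu, content_text_length)

rng = np.random.default_rng(4)
loss = total_loss(rng.normal(size=(40, 80)), rng.normal(size=(40, 80)),
                  final_mu=6.7, content_text_length=6)
```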
For example, in order to solve the problems of decoding stop in advance, word missing, repeated reading and the like, a position loss function of the neural network model is constructed, so that the trained neural network model learns the capability of accurately predicting the alignment position, and the naturalness and the stability of voice generation are improved.
In the following, an exemplary application of the embodiments of the present application in a practical speech synthesis application scenario will be described.
The embodiments of the present application can be applied to various speech synthesis application scenarios (for example, smart devices with speech synthesis capability such as smart speakers, speakers with screens, smart watches, smartphones, smart homes, smart maps, and smart cars, as well as other applications with speech synthesis capability such as online education, intelligent robots, artificial intelligence customer service, and speech synthesis cloud services). For example, for vehicle-mounted applications, when a user is driving, it is inconvenient for the user to take in information in text form, but the information can be taken in by having the speech read aloud, so that important information is not missed; after the vehicle-mounted client receives the text, the text needs to be converted into speech and the speech played to the user, so that the user can listen to the speech corresponding to the text in a timely manner.
In the related art, the acoustic model uses a content-based attention mechanism, a location-based attention mechanism, or a hybrid of the two, in conjunction with a stop token mechanism, to predict the position at which audio generation stops. The related technical solutions have the following problems: 1) alignment errors occur, which can lead to unacceptable problems such as missing words or repeated words, making the speech synthesis system difficult to put into practical application; 2) the synthesis of long sentences and complex sentences can stop prematurely, so that the audio synthesis is incomplete; 3) training and inference are so slow that deploying speech synthesis (Text To Speech, TTS) on edge devices such as handsets is difficult to achieve.
In order to solve the above problems, the embodiment of the present application uses a Single Gaussian Attention mechanism, which is a monotonic, normalized, stable, and more expressive attention mechanism, to solve the instability of the attention mechanisms used in the related art. It removes the Stop Token mechanism (in which a stop of the autoregressive decoding process is determined by a predicted value, for example stopping when the predicted probability exceeds a threshold such as 0.5) and instead proposes an attention stop prediction (Attention Stop Loss) to ensure the alignment result and to determine the stop directly based on alignment, thereby solving the problem of early stopping and improving the naturalness and stability of speech synthesis. On the other hand, the embodiment of the present application performs block sparsification on the Autoregressive Decoder by using a pruning technique, which further increases the training and synthesis speed and can achieve a synthesis real-time rate of 35 times on a single-core Central Processing Unit (CPU), so that deploying TTS on edge devices becomes possible.
The embodiments of the present application can be applied to all products with speech synthesis capability, including but not limited to smart devices such as smart speakers, speakers with screens, smart watches, smartphones, smart homes, and smart cars, as well as intelligent robots, AI customer service, TTS cloud services, and the like. Products using the scheme of the embodiments of the present application can, through the algorithm provided herein, enhance the stability of synthesis and improve the speed of synthesis.
As shown in fig. 11, the end-to-end speech synthesis acoustic model (neural network model) in the embodiment of the present application includes a content encoder, a gaussian attention mechanism, an autoregressive decoder, and a spectrum post-processing network, and the following describes each module of the end-to-end speech synthesis acoustic model specifically:
1) Content encoder: the input phoneme sequence is converted into implicit features (the context representation) related to the context. The content encoder is composed of a model with contextual relevance, so the features obtained by the content encoder have the capacity to model the context. The linguistic features represent the text content to be synthesized, and the basic units of the text are characters or phonemes; in Chinese speech synthesis, the text is composed of initials, finals, and silent syllables, wherein the finals carry tones. For example, the toned phoneme sequence of the text "speech synthesis" is "v3 in1 h e2 ch eng2".
2) Gaussian attention mechanism: corresponding content context information (context vector) is generated in conjunction with the current state of the decoder for the autoregressive decoder to better predict the next frame spectrum. Speech synthesis is a task of creating a monotonic mapping from text sequences to spectral sequences, so that only a small portion of the phoneme content, which is generated by an attention mechanism, needs to be focused when generating a mel-frequency spectrum for each frame. Wherein Speaker Identity information (Speaker Identity) represents a unique identification of a Speaker by a set of vectors.
3) An autoregressive decoder: the spectrum of the current frame is generated by using the context information of the content generated by the current gaussian attention mechanism and the predicted spectrum of the previous frame, and the spectrum of the current frame needs to depend on the output of the previous frame, so the method is called an autoregressive decoder. In which, the training speed can be further increased by replacing the autoregressive decoder with a parallel full-connected form.
4) Mel-spectrum post-processing network: the autoregressive decoder predicted spectrum is smoothed to obtain a higher quality spectrum.
The following describes the optimization of the embodiment of the present application in terms of stability and speed of speech synthesis with reference to fig. 11 and 12:
A) As shown in fig. 11, the embodiment of the present application employs a single Gaussian attention mechanism, which is a monotonic, normalized, stable, and more expressive attention mechanism. The single Gaussian attention mechanism calculates the attention weight in the manner of formula (1) and formula (2):
αi,j = exp(−(j − μi)² / (2σi²)) (1)
μi=μi-1+Δi (2)
Wherein αi,j represents the attention weight for the jth element of the phoneme sequence input to the content encoder at the i-th iteration, exp represents the exponential function, μi represents the mean of the Gaussian function at the i-th computation, σi² represents the variance of the Gaussian function at step i, and Δi represents the predicted mean variation in the iterative computation of the i-th step. The mean variation, the variance, and so on are obtained through a fully connected network based on the implicit state of the autoregressive decoder.
Each iteration predicts a mean variation and a variance of the Gaussian at the current time, where the cumulative sum of the mean variations characterizes the position of the attention window at the current time, i.e., the position of the input linguistic feature aligned with it, and the variance characterizes the width of the attention window. Using the phoneme sequence as the input of the content encoder, the context vector required by the autoregressive decoder is obtained through the Gaussian attention mechanism; the autoregressive decoder generates the Mel spectrum in an autoregressive manner, and the stop of autoregressive decoding is judged by whether the mean of the Gaussian attention distribution has reached the end of the phoneme sequence. The embodiment of the present application ensures that the mean variation is non-negative, thereby ensuring the monotonicity of the alignment process, and because the Gaussian function is normalized, the stability of the attention mechanism is ensured.
The context vector required by the autoregressive decoder at each moment is obtained by weighting the output of the content encoder with the weights generated by the Gaussian attention mechanism, and the distribution of these weights is determined by the mean of the Gaussian attention. The speech synthesis task is strictly monotonic, i.e., the output Mel spectrum is generated monotonically from left to right according to the input text, so if the mean of the Gaussian attention is located at the end of the input phoneme sequence, the Mel spectrum generation is close to the end. The width of the attention window represents the range of content-encoder output required for each decoding step and is influenced by the language structure; for example, for the prediction of silent pauses the width is smaller, while when a word or phrase is encountered the width is larger, because the pronunciation of a character within a word or phrase is affected by the preceding and following characters.
B) The embodiment of the present application removes the separate Stop Token architecture, directly judges the stop based on alignment by using Gaussian Attention, and proposes an Attention Stop Loss to ensure the alignment result, thereby solving the problem of premature stopping for complex or long sentences. Assuming that the mean at the last moment of training should iterate to the position one past the input text length, an L1 loss (i.e., Lstop) between the mean of the Gaussian distribution and the length of the input text sequence is constructed based on this assumption, as shown in formula (3). As shown in fig. 12, in the inference process, the scheme of the embodiment of the present application determines whether to stop according to whether the mean value of the Gaussian attention at the current time is greater than the length of the input text plus one:
Lstop=|μI-J-1| (3)
wherein μI is the mean of the Gaussian attention at the final iteration, I is the total number of iterations, and J is the length of the phoneme sequence.
If the Stop Token architecture is adopted, synthesis may stop prematurely because the Stop Token architecture does not take the integrity of the phonemes into account. One significant problem brought by the Stop Token architecture is that it must be ensured that the beginning and the end of the recorded audio contain silence and that the pauses are kept at similar lengths so that the Stop Token prediction is accurate; once a speaker's pauses are longer, the trained Stop Token prediction becomes inaccurate. Therefore, the Stop Token architecture places high requirements on data quality, which brings high auditing costs. The Attention Stop Loss provided by the embodiment of the present application can reduce the requirement on data quality, thereby reducing the cost.
C) The embodiment of the present application performs block sparsification on the autoregressive decoder to increase its computation speed. For example, the sparsification scheme employed in the present application is: starting from the 1000th training step, structured sparsification is carried out every 400 steps until training reaches 120K steps, at which a 50% sparsity cutoff is reached; the parameters of the whole model are optimized by a stochastic gradient descent algorithm with the L1 loss between the model's predicted Mel spectrum and the ground-truth Mel spectrum as the optimization target. In the embodiment of the present application, the weight matrix is divided into a plurality of blocks (matrix blocks), the blocks are then sorted from small to large according to the average value of the model parameters in each block, and the model parameters of the first 50% of the blocks (a ratio set according to the actual situation) are set to 0, so as to accelerate the decoding process.
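A small helper expressing this schedule (the step numbers are taken from the example above; treating the bounds as inclusive is an assumption):

```python
def should_sparsify(step: int, start: int = 1000, every: int = 400, end: int = 120_000) -> bool:
    """Return True on the training steps where structured sparsification is applied."""
    return start <= step <= end and (step - start) % every == 0

print([s for s in range(900, 3000, 100) if should_sparsify(s)])  # [1000, 1400, 1800, 2200, 2600]
```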
When a matrix is block sparse, that is, the matrix is divided into N blocks and the elements of some blocks are 0, the matrix multiplication can be accelerated. During training, which blocks have their elements set to 0 is determined according to the magnitudes of the elements: if the average magnitude of the elements in a block is small or close to 0 (i.e., less than a certain threshold), the elements in that block are approximated as 0, so as to achieve sparsification. In practice, the blocks of a matrix may be sorted by the average magnitude of their elements, and the first 50% of the blocks with smaller average magnitudes are thinned out, i.e., their elements are uniformly set to zero.
In practical application, firstly, a text is converted into a phoneme sequence, the phoneme sequence obtains an abstract representation (namely a context representation) containing context information through a content encoder, when a Mel spectrum is predicted, a section of all-zero vector is used as an initial context vector to be input into an autoregressive decoder, then an implicit state output by the autoregressive decoder every time is used as an input of a Gaussian attention mechanism, further, the weight of the output of the content encoder at each moment can be calculated, and the context vector required by the autoregressive decoder at each moment can be calculated by combining the weight and the abstract representation of the content encoder. By performing the autoregressive decoding in this manner, decoding can be stopped when the mean of gaussian attention is at the end of the content encoder abstract representation (phoneme sequence). The Mel spectrums (hidden states) predicted by the autoregressive decoder are spliced together and sent to a Mel post-processing network, so that the Mel spectrums are smoother, the generation process of the Mel spectrums depends on not only past information but also future information, and after the final Mel spectrums are obtained, final audio waveforms are obtained through a signal processing mode or a neural network synthesizer, so that the function of speech synthesis is realized.
In summary, the embodiments of the present application have the following beneficial effects: 1) through the combination of a monotonic, stable Gaussian attention mechanism and the Attention Stop Loss, the stability of speech synthesis is effectively improved, and unacceptable phenomena such as repeated reading and word missing are avoided; 2) the block sparsification of the autoregressive decoder greatly improves the synthesis speed of the acoustic model and reduces the requirement on hardware devices.
The embodiment of the present application provides a more robust attention-based acoustic model, which has the advantages of high speed and high stability. The acoustic model can be applied to embedded devices such as smart homes and smart cars; because the computing capability of embedded devices is low, end-to-end speech synthesis is easier to realize on the device side with this model. The scheme can also be applied to personalized voice customization scenarios with low data quality recorded outside a studio, such as voice customization for mobile phone map users and large-scale cloning of online teachers' voices in online education. Because the users recording in these scenarios are not professional voice talents and the recordings may contain long pauses, the scheme can effectively ensure the stability of the acoustic model for such data.
The artificial intelligence based audio generation method provided by the embodiment of the present application has been described above in conjunction with the exemplary application and implementation of the server provided by the embodiment of the present application. In practical applications, each functional module in the audio generating apparatus may be cooperatively implemented by hardware resources of an electronic device (such as a terminal device, a server, or a server cluster), such as computing resources like a processor, communication resources (such as those used to support communications in various modes like optical cable and cellular), and memory. Fig. 2 shows an audio generating apparatus 555 stored in the memory 550, which may be software in the form of programs and plug-ins, for example, software modules designed in programming languages such as C/C++ and Java, application software designed in programming languages such as C/C++ and Java, dedicated software modules in large software systems, application program interfaces, plug-ins, cloud services, etc.; different implementations are exemplified below.
Example one, the audio generation device is a mobile-side application and module
The audio generating device 555 in the embodiment of the present application may be provided as a software module designed in a programming language such as C/C++ or Java, and may be embedded in various mobile applications based on systems such as Android or iOS (stored in a storage medium of the mobile terminal as executable instructions and executed by a processor of the mobile terminal), so as to complete the related audio generation tasks directly using the computing resources of the mobile terminal itself, and to transmit the processing results to a remote server periodically or aperiodically through various network communication methods, or to store them locally on the mobile terminal.
Example two, the Audio Generation device is a Server application and platform
The audio generating apparatus 555 in this embodiment may be provided as application software designed in a programming language such as C/C++ or Java, or as a dedicated software module in a large-scale software system, and run on the server side (stored in a storage medium of the server side in the form of executable instructions and run by a processor of the server side); the server uses its own computing resources to complete the relevant audio generation tasks.
The embodiment of the present application may also provide an audio generation platform and the like for individuals, groups, or units to use, by carrying a customized, easily interactive web (Web) interface or other user interfaces (UI) on a distributed, parallel computing platform composed of multiple servers.
Example three, the audio generating device is a server-side Application Program Interface (API) and plug-in
The audio generation apparatus 555 in this embodiment of the present application may be provided as an API or a plug-in on a server side, so as to be called by a user to execute the audio generation method based on artificial intelligence in this embodiment of the present application, and be embedded into various application programs.
Example four, the Audio generating apparatus is a Mobile device client API and plug-in
The audio generation apparatus 555 in this embodiment of the present application may be provided as an API or a plug-in on the mobile device side, so as to be called by the user to execute the artificial intelligence based audio generation method in this embodiment of the present application.
Example five, the audio generation device is a cloud open service
The audio generation device 555 in the embodiment of the present application may be provided as an audio generation cloud service developed for users, so that an individual, a group, or an entity can obtain audio.
The audio generating apparatus 555 includes a series of modules, including an encoding module 5551, an attention module 5552, a decoding module 5553, a synthesizing module 5554, and a training module 5555. The following continues to describe how each module in the audio generating apparatus 555 provided in the embodiment of the present application cooperates to implement an audio generating scheme.
The encoding module 5551 is configured to perform encoding processing on a phoneme sequence corresponding to a text to obtain a context representation of the phoneme sequence; an attention module 5552 for determining an alignment position of a first frame implicit state relative to the context representation based on the first frame implicit state corresponding to each phoneme in the sequence of phonemes; a decoding module 5553, configured to, when the alignment position corresponds to a non-end position in the context token, perform decoding processing based on the context token and the first frame implicit state to obtain a second frame implicit state; a synthesizing module 5554, configured to perform synthesizing processing based on the first frame hidden state and the second frame hidden state, so as to obtain audio data corresponding to the text.
In some embodiments, the first frame implicit state represents an implicit state of a first frame, the second frame implicit state represents an implicit state of a second frame, and the first frame and the second frame are any two adjacent frames in the spectral data corresponding to the phoneme; when the first frame implicit state is denoted as the tth frame implicit state, the attention module 5552 is further configured to perform the following for each phoneme in the sequence of phonemes: determining an alignment position of the t frame implicit state relative to the context representation based on the t frame implicit state corresponding to the phoneme; correspondingly, the decoding module 5553 is further configured to, when the alignment position of the t-th frame implicit state with respect to the context token corresponds to a non-end position in the context token, perform decoding processing based on the context token and the t-th frame implicit state to obtain a t + 1-th frame implicit state; and T is a natural number which is increased from 1, and the value of T is more than or equal to 1 and less than or equal to T, wherein T is the total frame number corresponding to the phoneme sequence when the alignment position corresponds to the tail position in the context representation, and the total frame number represents the frame number of the spectrum data corresponding to the hidden state of each phoneme in the phoneme sequence.
In some embodiments, the synthesis module 5554 is further configured to, when the alignment position corresponds to an end position in the context representation, perform splicing processing on T frame hidden states to obtain a hidden state corresponding to the text; carrying out smoothing processing on the hidden state corresponding to the text to obtain frequency spectrum data corresponding to the text; and carrying out Fourier transform on the frequency spectrum data corresponding to the text to obtain audio data corresponding to the text.
In some embodiments, the attention module 5552 is further configured to perform gaussian prediction processing based on the t-th frame hidden state corresponding to the phoneme, so as to obtain a t-th gaussian parameter corresponding to the t-th frame hidden state; determining an alignment position of the t frame implicit state relative to the context representation based on the t Gaussian parameter.
In some embodiments, the attention module 5552 is further configured to perform prediction processing based on a gaussian function based on a t-th frame hidden state corresponding to the phoneme, so as to obtain a t-th gaussian variance and a t-th gaussian mean variation corresponding to the t-th frame hidden state; acquiring a t-1 Gaussian parameter corresponding to the hidden state of the t-1 frame; adding the t-1 Gaussian mean value included by the t-1 Gaussian parameter and the variation of the t Gaussian mean value, and taking the obtained addition result as the t Gaussian mean value corresponding to the t frame hidden state; taking the set of the tth Gaussian variance and the tth Gaussian mean value as a tth Gaussian parameter corresponding to the implicit state of the tth frame; and taking the tth Gaussian mean value as the alignment position of the implicit state of the tth frame relative to the context representation.
In some embodiments, the attention module 5552 is further configured to obtain a content text length of a contextual representation of the phoneme sequence; when the tth Gaussian mean value is larger than the length of the content text, determining that the alignment position corresponds to the tail position in the context representation; and when the tth Gaussian average value is smaller than or equal to the length of the content text, determining that the alignment position corresponds to a non-end position in the context representation.
In some embodiments, the decoding module 5553 is further configured to obtain an attention weight corresponding to the t-th frame implicit state; weighting the context characterization based on the attention weight to obtain a context vector corresponding to the context characterization; and performing state prediction processing based on the context vector and the t frame implicit state to obtain a t +1 frame implicit state.
In some embodiments, the attention module 5552 is further configured to obtain a tth gaussian parameter corresponding to the implicit state of the tth frame, wherein the tth gaussian parameter includes a tth gaussian variance and a tth gaussian mean; and carrying out Gaussian processing on the context characterization based on the tth Gaussian variance and the tth Gaussian average value to obtain the attention weight corresponding to the t-th frame hidden state.
In some embodiments, the audio generation method is implemented by invoking a neural network model; the audio generating apparatus 555 further includes: the training module 5555 is configured to perform encoding processing on a phoneme sequence sample corresponding to a text sample through the initialized neural network model, so as to obtain a context representation of the phoneme sequence sample; determining a predicted alignment position of a third frame implicit state relative to the context representation based on the third frame implicit state corresponding to each phoneme in the phoneme sequence sample; when the predicted alignment position corresponds to a non-end position in the context representation, decoding processing is carried out based on the context representation and the third frame implicit state to obtain a fourth frame implicit state; performing spectrum post-processing based on the third frame hidden state and the fourth frame hidden state to obtain predicted spectrum data corresponding to the text sample; constructing a loss function of the neural network model based on the predicted spectrum data corresponding to the text sample and the labeled spectrum data corresponding to the text sample; updating parameters of the neural network model, and taking the updated parameters of the neural network model when the loss function is converged as the parameters of the trained neural network model; the third frame hidden state represents the hidden state of a third frame, the fourth frame hidden state represents the hidden state of a fourth frame, and the third frame and the fourth frame are any two adjacent frames in the spectrum data corresponding to each phoneme in the phoneme sequence sample.
In some embodiments, the training module 5555 is further configured to construct a parameter matrix based on parameters of the neural network model; dividing the parameter matrix into blocks to obtain a plurality of matrix blocks included in the parameter matrix; when the structure sparse opportunity is reached, determining the mean value of the parameters in each matrix block; the matrix blocks are sorted in an ascending order based on the mean value of the parameters in each matrix block, and the parameters in a plurality of matrix blocks sorted in the front in the ascending order sorting result are reset to obtain a reset parameter matrix; wherein the reset parameter matrix is used for updating the parameters of the neural network model.
In some embodiments, the training module 5555 is further configured to obtain a content text length of a context representation of the phoneme sequence sample; when the predicted alignment position corresponds to an end position in the context representation, constructing a position loss function of the neural network model based on the predicted alignment position and the content text length; constructing a spectrum loss function of the neural network model based on the predicted spectrum data corresponding to the text sample and the labeled spectrum data corresponding to the text sample; and carrying out weighted summation on the spectrum loss function and the position loss function, and taking the weighted summation result as the loss function of the neural network model.
In some embodiments, the encoding module 5551 is further configured to perform forward encoding processing on a phoneme sequence corresponding to a text, so as to obtain a forward hidden vector of the phoneme sequence; carrying out backward encoding processing on the phoneme sequence corresponding to the text to obtain a backward implicit vector of the phoneme sequence; and performing fusion processing on the forward hidden vector and the backward hidden vector to obtain the context representation of the phoneme sequence.
In some embodiments, the encoding module 5551 is further configured to perform, by an encoder, encoding processing on each phoneme in a phoneme sequence corresponding to the text in a first direction in sequence, so as to obtain a hidden vector of each phoneme in the first direction; sequentially encode each phoneme in a second direction through the encoder to obtain a hidden vector of each phoneme in the second direction; and splice the hidden vectors in the first direction and the hidden vectors in the second direction to obtain the context representation of the phoneme sequence; wherein the second direction is opposite to the first direction.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the artificial intelligence based audio generation method according to the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform an artificial intelligence based audio generation method provided by embodiments of the present application, for example, the artificial intelligence based audio generation method as shown in fig. 3-5.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.