Speech synthesis method and apparatus, electronic device, and storage medium
1. A method of speech synthesis, comprising:
obtaining a logarithmic Mel energy spectrum of voice data to be processed;
and inputting the logarithmic Mel energy spectrum of the voice data to be processed into a preset speech synthesis model to obtain a first synthesized audio, wherein the preset speech synthesis model is obtained by training according to the logarithmic Mel energy spectrum of training data.
2. The speech synthesis method of claim 1, wherein the step of obtaining a log mel energy spectrum of the speech data to be processed comprises:
acquiring the voice data to be processed;
performing energy spectrum calculation on the voice data to be processed to obtain an energy spectrum of the voice data to be processed;
and carrying out logarithmic Mel energy spectrum calculation on the energy spectrum to obtain a logarithmic Mel energy spectrum of the voice data to be processed.
3. The speech synthesis method according to claim 2, wherein the step of performing energy spectrum calculation on the speech data to be processed to obtain an energy spectrum of the speech data to be processed comprises:
performing framing processing on the voice data to be processed to obtain an audio sequence of the voice data to be processed;
carrying out short-time Fourier transform processing on the audio sequence to obtain a frequency spectrum of the voice data to be processed;
and performing spectrum energy calculation on the frequency spectrum to obtain an energy spectrum of the voice data to be processed.
4. The speech synthesis method of claim 1, wherein the step of inputting the log mel energy spectrum of the speech data to be processed into a predetermined speech synthesis model to obtain a first synthesized audio comprises:
inputting the logarithmic Mel energy spectrum of the voice data to be processed into a preset voice synthesis model, and calculating according to a preset pseudo-inverse matrix to obtain a pseudo-inverse energy spectrum of the voice data to be processed;
carrying out short-time Fourier transform processing on the pseudo-inverse energy spectrum to obtain transformed audio of the voice data to be processed;
and synthesizing the logarithmic Mel energy spectrum and the transformed audio of the voice data to be processed to obtain a first synthesized audio of the voice data to be processed.
5. The speech synthesis method of claim 1, further comprising the step of training a speech synthesis model, the step comprising:
obtaining a logarithmic mel-energy spectrum of the training data;
performing voice synthesis processing on the logarithmic Mel energy spectrum of the training data to obtain a second synthesized audio of the training data;
and training a preset model according to preset parameters of the training data and the second synthesized audio to obtain a speech synthesis model.
6. The speech synthesis method of claim 5, wherein the preset parameters include correlation coefficients, and the step of training a preset model according to the preset parameters of the training data and the second synthesized audio to obtain a speech synthesis model comprises:
calculating a correlation coefficient of the training data to obtain a first correlation coefficient, and calculating a correlation coefficient of the second synthesized audio to obtain a second correlation coefficient;
and calculating a first mean square error according to the first correlation coefficient and the second correlation coefficient, and training a preset model according to the first mean square error to obtain a speech synthesis model.
7. The speech synthesis method of claim 5, wherein the predetermined parameters comprise linear predictive coding, and the step of training a predetermined model based on the predetermined parameters of the training data and the second synthesized audio to obtain the speech synthesis model comprises:
performing linear predictive coding calculation on the training data to obtain linear predictive coding, and performing reconstruction processing on the second synthesized audio according to the linear predictive coding to obtain reconstructed audio;
and calculating a second mean square error according to the second synthesized audio and the reconstructed audio, and training a preset model according to the second mean square error to obtain the speech synthesis model.
8. A speech synthesis apparatus, comprising:
the data acquisition module is used for acquiring a logarithmic Mel energy spectrum of the voice data to be processed;
and the speech synthesis module is used for inputting the logarithmic Mel energy spectrum of the voice data to be processed into a preset speech synthesis model to obtain a first synthesized audio, wherein the preset speech synthesis model is obtained by training according to the logarithmic Mel energy spectrum of the training data.
9. An electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the speech synthesis method of any one of claims 1 to 7.
10. A storage medium comprising a computer program which, when executed, causes an electronic device in which the storage medium is located to perform the speech synthesis method according to any one of claims 1 to 7.
Background
The artificial synthesis of human speech is called speech synthesis. Machine-learning-based speech synthesis is suitable for application scenarios such as text-to-speech (TTS), music generation, speech-assistive devices, navigation systems, and accessibility services for visually impaired people. However, the inventors have found that speech synthesis methods in the related art require a large number of parameters and therefore suffer from low synthesis efficiency.
Disclosure of Invention
In view of the above, an object of the present application is to provide a speech synthesis method and apparatus, an electronic device, and a storage medium, so as to solve the problems in the prior art.
In order to achieve the above purpose, the embodiment of the present application adopts the following technical solutions:
in a first aspect, the present invention provides a speech synthesis method, including:
obtaining a logarithmic Mel energy spectrum of voice data to be processed;
and inputting the logarithmic Mel energy spectrum of the voice data to be processed into a preset speech synthesis model to obtain a first synthesized audio, wherein the preset speech synthesis model is obtained by training according to the logarithmic Mel energy spectrum of training data.
In an optional embodiment, the step of obtaining a log mel-energy spectrum of the voice data to be processed includes:
acquiring the voice data to be processed;
performing energy spectrum calculation on the voice data to be processed to obtain an energy spectrum of the voice data to be processed;
and carrying out logarithmic Mel energy spectrum calculation on the energy spectrum to obtain a logarithmic Mel energy spectrum of the voice data to be processed.
In an optional embodiment, the step of performing energy spectrum calculation on the to-be-processed speech data to obtain an energy spectrum of the to-be-processed speech data includes:
performing framing processing on the voice data to be processed to obtain an audio sequence of the voice data to be processed;
carrying out short-time Fourier transform processing on the audio sequence to obtain a frequency spectrum of the voice data to be processed;
and performing spectrum energy calculation on the frequency spectrum to obtain an energy spectrum of the voice data to be processed.
In an optional embodiment, the step of inputting the log mel energy spectrum of the to-be-processed speech data into a preset speech synthesis model to obtain a first synthesized audio includes:
inputting the logarithmic Mel energy spectrum of the voice data to be processed into a preset voice synthesis model, and calculating according to a preset pseudo-inverse matrix to obtain a pseudo-inverse energy spectrum of the voice data to be processed;
carrying out short-time Fourier transform processing on the pseudo-inverse energy spectrum to obtain transformed audio of the voice data to be processed;
and synthesizing the logarithmic Mel energy spectrum and the transformed audio of the voice data to be processed to obtain a first synthesized audio of the voice data to be processed.
In an alternative embodiment, the speech synthesis method further comprises the step of training a speech synthesis model, the step comprising:
obtaining a logarithmic mel-energy spectrum of the training data;
performing voice synthesis processing on the logarithmic Mel energy spectrum of the training data to obtain a second synthesized audio of the training data;
and training a preset model according to preset parameters of the training data and the second synthesized audio to obtain a speech synthesis model.
In an optional embodiment, the preset parameters include correlation coefficients, and the step of training the preset model according to the preset parameters of the training data and the second synthesized audio to obtain the speech synthesis model includes:
calculating a correlation coefficient of the training data to obtain a first correlation coefficient, and calculating a correlation coefficient of the second synthesized audio to obtain a second correlation coefficient;
and calculating a first mean square error according to the first correlation coefficient and the second correlation coefficient, and training a preset model according to the first mean square error to obtain a speech synthesis model.
In an optional embodiment, the preset parameters include linear predictive coding, and the step of training a preset model according to the preset parameters of the training data and the second synthesized audio to obtain a speech synthesis model includes:
performing linear predictive coding calculation on the training data to obtain linear predictive coding, and performing reconstruction processing on the second synthesized audio according to the linear predictive coding to obtain reconstructed audio;
and calculating according to the second synthesized audio and the reconstructed audio to obtain a second mean square error, and training a preset model according to the second mean square error to obtain a voice synthesis model.
In a second aspect, the present invention provides a speech synthesis apparatus, comprising:
the data acquisition module is used for acquiring a logarithmic Mel energy spectrum of the voice data to be processed;
and the speech synthesis module is used for inputting the logarithmic Mel energy spectrum of the voice data to be processed into a preset speech synthesis model to obtain a first synthesized audio, wherein the preset speech synthesis model is obtained by training according to the logarithmic Mel energy spectrum of the training data.
In a third aspect, the present invention provides an electronic device comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the speech synthesis method according to any of the preceding embodiments when executing the program.
In a fourth aspect, the present invention provides a storage medium comprising a computer program which, when running, causes an electronic device in which the storage medium is located to execute the speech synthesis method according to any of the foregoing embodiments.
According to the speech synthesis method and apparatus, electronic device, and storage medium provided herein, synthesized audio is obtained by inputting the logarithmic Mel energy spectrum of the voice data to be processed into a preset speech synthesis model. Because the synthesized audio can be obtained from the logarithmic Mel energy spectrum alone, the method alleviates the low synthesis efficiency caused by the large number of parameters required by prior-art speech synthesis methods.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered as limiting its scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
Fig. 1 shows a block diagram of a speech synthesis system provided in an embodiment of the present application.
Fig. 2 shows a block diagram of an electronic device according to an embodiment of the present application.
Fig. 3 is a flowchart of a speech synthesis method according to an embodiment of the present application.
Fig. 4 is a block diagram of a speech synthesis apparatus according to an embodiment of the present application.
Reference numerals: 10-speech synthesis system; 100-electronic device; 110-first memory; 120-first processor; 130-communication module; 200-acquisition device; 400-speech synthesis apparatus; 410-data acquisition module; 420-speech synthesis module.
Detailed Description
In order to address at least one of the above technical problems, embodiments of the present application provide a speech synthesis method and apparatus, an electronic device, and a storage medium. The technical solutions of the present application are described below through possible implementations.
The defects in the above solutions are results obtained by the inventor through practice and careful study. Therefore, the discovery process of the above problems, and the solutions proposed below by the embodiments of the present application to those problems, should be regarded as contributions made by the inventor during the invention process.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It is to be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a/an ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
It should be noted that the features of the embodiments of the present application may be combined with each other without conflict.
Fig. 1 is a block diagram of a speech synthesis system 10 provided in an embodiment of the present application, which provides a possible implementation manner of the speech synthesis system 10, and referring to fig. 1, the speech synthesis system 10 may include one or more of an electronic device 100 and an acquisition device 200.
The acquisition device 200 sends the acquired to-be-processed voice data to the electronic device 100, the electronic device 100 obtains a logarithmic mel energy spectrum of the to-be-processed voice data, and the logarithmic mel energy spectrum of the to-be-processed voice data is input into a preset voice synthesis model to obtain a first synthesis audio.
Optionally, the specific type of the acquisition device 200 is not limited and may be set according to actual application requirements. For example, in an alternative example, the acquisition device 200 may be the same device as the electronic device 100, which directly performs speech synthesis on the acquired voice data.
Referring to fig. 2, a block diagram of an electronic device 100 according to an embodiment of the present disclosure is shown, where the electronic device 100 in this embodiment may be a server, a processing device, a processing platform, and the like, which are capable of performing data interaction and processing. The electronic device 100 includes a first memory 110, a first processor 120, and a communication module 130. The elements of the first memory 110, the first processor 120 and the communication module 130 are electrically connected to each other directly or indirectly to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines.
The first memory 110 is used for storing programs or data. The first memory 110 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The first processor 120 is used to read/write data or programs stored in the first memory 110 and perform corresponding functions. The communication module 130 is used for establishing a communication connection between the electronic device 100 and another communication terminal through a network, and for transceiving data through the network.
It should be understood that the configuration shown in fig. 2 is merely a schematic diagram of the configuration of the electronic device 100, and that the electronic device 100 may include more or fewer components than shown in fig. 2, or have a different configuration than shown in fig. 2. The components shown in fig. 2 may be implemented in hardware, software, or a combination thereof.
Fig. 3 shows a flowchart of a speech synthesis method provided in an embodiment of the present application; the method is applicable to the electronic device 100 shown in fig. 2 and is executed by it. It should be understood that, in other embodiments, the order of some steps in the speech synthesis method of this embodiment may be interchanged according to actual needs, or some steps may be omitted. The flow of the speech synthesis method shown in fig. 3 is described in detail below.
Step S310, obtaining a logarithmic Mel energy spectrum of the voice data to be processed.
Step S320, inputting the logarithmic mel energy spectrum of the voice data to be processed into a preset voice synthesis model to obtain a first synthesized audio.
The preset speech synthesis model is obtained by training according to the logarithm Mel energy spectrum of the training data.
In this method, synthesized audio is obtained by inputting the logarithmic Mel energy spectrum of the voice data to be processed into the preset speech synthesis model. Since synthesized audio can be obtained from the logarithmic Mel energy spectrum alone, the problem of low speech synthesis efficiency caused by the many parameters required by prior-art speech synthesis methods is alleviated.
Before step S310, the speech synthesis method provided in the embodiment of the present application may further include a step of training a speech synthesis model, where the step may include the following sub-steps:
obtaining a logarithmic Mel energy spectrum of the training data; performing speech synthesis processing on the logarithmic Mel energy spectrum of the training data to obtain a second synthesized audio of the training data; and training the preset model according to the preset parameters of the training data and the second synthesized audio to obtain the speech synthesis model.
Optionally, the specific structure of the preset model is not limited and may be set according to actual application requirements. For example, in an alternative example, the preset model may include a neural network G and a neural network D, and the structures of both may be ParallelWaveNet. That is, the embodiment of the present application uses the idea of a generative adversarial network (GAN): a generator G and a discriminator D are defined, and the model is trained through iterative optimization of the generator G and the discriminator D.
The specific manner of the speech synthesis processing is not limited and may be set according to actual application requirements. For example, in an alternative example, a pseudo-inverse energy spectrum of the training data wav1 may be obtained by calculation with a preset pseudo-inverse matrix, an inverse short-time Fourier transform may be applied to the pseudo-inverse energy spectrum to obtain transformed audio wav2 of the training data, and the logarithmic Mel energy spectrum of the training data and the transformed audio wav2 may be synthesized to obtain a second synthesized audio wav3 of the training data.
It should be noted that the specific manner of training the preset model according to the preset parameters of the training data and the second synthesized audio to obtain the speech synthesis model is not limited and may be set according to actual application requirements. For example, in an alternative example, the preset parameters may include a first score of the training data, and the step may include the following sub-steps:
inputting the training data wav1 and the second synthesized audio wav3 into the neural network D, wherein the neural network D scores wav1 to obtain a first score s1 and scores wav3 to obtain a second score s3; computing err1 = MSE(s1, a) and err2 = MSE(s3, b), where MSE denotes the mean square error; performing gradient descent on the neural network D according to the gradient of the sum err1 + err2 with respect to D, to update the parameters of D; and performing gradient ascent on the neural network G according to the gradient of err2 with respect to G, to update the parameters of G.
Optionally, the specific values of the preset first parameter a and the preset second parameter b are not limited and may be set according to actual application requirements. For example, in an alternative example, a and b are real numbers with a not equal to b, e.g., a = 1 and b = -1.
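As a concrete illustration of the loss terms described above, the following sketch computes err1, err2, and the objectives for D and G. The score arrays are made-up stand-ins for the outputs of the discriminator D; only the loss arithmetic follows the text.

```python
import numpy as np

def mse(scores, target):
    # Mean squared error between an array of scores and a scalar target.
    scores = np.asarray(scores, dtype=float)
    return float(np.mean((scores - target) ** 2))

# Stand-in scores: what D might assign to the real audio wav1 and to the
# second synthesized audio wav3 (hypothetical values for illustration).
s1 = np.array([0.9, 0.8, 1.1])
s3 = np.array([-0.6, -0.9, -0.7])
a, b = 1.0, -1.0           # preset targets, a != b

err1 = mse(s1, a)          # push scores for real audio toward a
err2 = mse(s3, b)          # push scores for synthesized audio toward b
d_loss = err1 + err2       # minimized via gradient descent on D
g_objective = err2         # maximized via gradient ascent on G
```

In an actual training loop the gradients of d_loss and g_objective would be taken with respect to the parameters of D and G respectively; the snippet only shows how the two error terms are formed.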
For another example, in another alternative example, the preset parameter may include a log mel-energy spectrum of the training data, and the above step may include the following sub-steps:
calculating a logarithmic Mel energy spectrum M1 of the training data wav1 and a logarithmic Mel energy spectrum M2 of the second synthesized audio wav3, computing the mean square error err3 of the two logarithmic Mel energy spectra, deriving the gradient of the neural network G from err3, and performing gradient descent to update the parameters of the neural network G.
The spectrum type can be chosen flexibly and is not limited to the logarithmic Mel energy spectrum; an energy spectrum, a power spectrum, or the like may also be used.
For another example, in another alternative example, the preset parameter may include a correlation coefficient of the training data, and the step may include the following sub-steps:
carrying out correlation coefficient calculation on the training data to obtain a first correlation coefficient, and carrying out correlation coefficient calculation on the second synthesized audio to obtain a second correlation coefficient; and calculating a first mean square error according to the first correlation coefficient and the second correlation coefficient, and training the preset model according to the first mean square error to obtain the speech synthesis model.
In detail, the correlation coefficient of the training data wav1 and the correlation coefficient of the second synthesized audio wav3 are calculated, the mean square error of the two correlation coefficients is computed, the gradient of the neural network G is derived from this mean square error, and gradient descent is performed on G to update its parameters.
For two equal-length sequences X = [x_0, x_1, x_2, ...] and Y = [y_0, y_1, y_2, ...], the correlation coefficient of X and Y is computed by the Pearson formula: r(X, Y) = sum_i (x_i - mean(X))(y_i - mean(Y)) / sqrt( sum_i (x_i - mean(X))^2 * sum_i (y_i - mean(Y))^2 ). A segment of audio is cut out of wav1 as X, wav1 is taken as Y, and a first correlation coefficient c1 is calculated. Likewise, a segment of audio is cut out of wav3 as X, wav3 is taken as Y, and a second correlation coefficient c3 is calculated. The cut start point in wav1 coincides with the cut start point in wav3, and the cut end point in wav1 coincides with the cut end point in wav3.
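The Pearson formula can be sketched directly. The segment positions below, and the use of a one-sample shifted copy of each signal as Y (an autocorrelation-style comparison), are illustrative assumptions, since the exact segmentation is only loosely specified in the text; wav1 and wav3 are stand-in arrays.

```python
import numpy as np

def correlation_coefficient(x, y):
    # Pearson correlation coefficient of two equal-length sequences.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm, ym = x - x.mean(), y - y.mean()
    return float(np.sum(xm * ym) / np.sqrt(np.sum(xm ** 2) * np.sum(ym ** 2)))

rng = np.random.default_rng(0)
wav1 = rng.standard_normal(2000)                 # stand-in training audio
wav3 = wav1 + 0.05 * rng.standard_normal(2000)   # stand-in synthesized audio

# Cut segments at identical start/end points in wav1 and wav3, and
# correlate each segment against a shifted copy of its own signal.
start, end = 100, 1100
c1 = correlation_coefficient(wav1[start:end], wav1[start + 1:end + 1])
c3 = correlation_coefficient(wav3[start:end], wav3[start + 1:end + 1])

first_mse = (c1 - c3) ** 2   # the "first mean square error" used to train G
```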
For another example, in another alternative example, the preset parameters may include linear predictive coding of the training data, and the above step may include the following sub-steps:
performing linear predictive coding calculation on the training data to obtain linear predictive coding, and performing reconstruction processing on the second synthesized audio according to the linear predictive coding to obtain reconstructed audio; and calculating according to the second synthesized audio and the reconstructed audio to obtain a second mean square error, and training the preset model according to the second mean square error to obtain the voice synthesis model.
In detail, the linear predictive coding of the training data wav1 is calculated, the second synthesized audio wav3 is reconstructed using this linear predictive coding to obtain reconstructed audio wav4, the mean square error of wav3 and wav4 is computed, the gradient of the neural network G is derived from it, and gradient descent is performed on G to update the parameters of the neural network G.
It should be noted that linear predictive coding (LPC) is a set of coefficients; if the LPC order is set to 31, the coding is a sequence of 32 real numbers. Solving for the linear predictive coding may yield no solution, or the system of equations may be ill-conditioned; in either case this step is skipped. Ill-conditioning is detected by checking the absolute values of all LPC coefficients: if the absolute value of any coefficient exceeds a threshold, such as 100, the solution is considered ill-conditioned.
That is, the linear predictive coding calculation may involve solving an ill-conditioned matrix; to avoid this problem, the absolute value of each linear predictive coding coefficient is checked, and if any absolute value exceeds the threshold, the corresponding linear predictive coding does not participate in gradient descent.
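The LPC step with its ill-conditioning check can be sketched as follows. The autocorrelation method with a direct Toeplitz solve is an assumption (the text does not specify the solver), and the returned vector holds the predictor coefficients without the conventional leading 1, so order 31 here yields 31 numbers rather than the 32 mentioned in the text.

```python
import numpy as np

def lpc_coefficients(signal, order=31):
    # Autocorrelation-method LPC sketch: solve the Toeplitz normal
    # equations R a = r for the predictor coefficients a.
    x = np.asarray(signal, dtype=float)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    try:
        a = np.linalg.solve(R, r[1:])
    except np.linalg.LinAlgError:
        return None          # no solution: skip this training step
    if np.max(np.abs(a)) > 100:
        return None          # ill-conditioned solution (threshold from the text)
    return a

def lpc_reconstruct(signal, a):
    # Reconstruct: predict each sample from the preceding len(a) samples.
    x = np.asarray(signal, dtype=float)
    p = len(a)
    y = x.copy()
    for n in range(p, len(x)):
        y[n] = np.dot(a, x[n - 1::-1][:p])  # [x[n-1], ..., x[n-p]]
    return y
```

In the training scheme above, the coefficients would be estimated from wav1 and used to reconstruct wav3 into wav4; the mean square error of wav3 and wav4 is then the second MSE loss.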
Each of the above training schemes may be executed individually or all together, and the training steps are repeated a number of times to obtain the parameters p1 of the neural network G.
It should be noted that in the above training schemes, the optimization algorithm for gradient ascent/descent can be chosen flexibly; the present application uses the AdamW algorithm.
For step S310, it should be noted that the specific manner of obtaining the log mel-energy spectrum of the voice data to be processed is not limited, and may be set according to the actual application requirement. For example, in one alternative example, the step may include the following sub-steps:
acquiring voice data to be processed; performing energy spectrum calculation on the voice data to be processed to obtain an energy spectrum of the voice data to be processed; and carrying out logarithmic Mel energy spectrum calculation on the energy spectrum to obtain a logarithmic Mel energy spectrum of the voice data to be processed.
The specific manner of performing energy spectrum calculation on the voice data to be processed to obtain its energy spectrum is not limited and may be set according to actual application requirements. For example, in one alternative example, this step may include the following sub-steps:
performing framing processing on the voice data to be processed to obtain an audio sequence of the voice data to be processed; carrying out short-time Fourier transform processing on the audio sequence to obtain a frequency spectrum of the voice data to be processed; and performing spectrum energy calculation on the frequency spectrum to obtain an energy spectrum of the voice data to be processed.
In detail, the voice data to be processed wav5 is framed, a short-time Fourier transform is applied to the framed audio sequence to obtain a spectrum, and the energy of the spectrum is calculated to obtain the energy spectrum. The spectrum is a complex matrix with real part A and imaginary part B, which can be written Z = A + Bi; the energy spectrum is then computed as S = A^2 + B^2, where A^2 denotes squaring each element of A separately (and likewise for B^2).
The energy spectrum is converted into a Mel energy spectrum using a preset Mel filter bank, and the logarithm is taken to obtain the logarithmic Mel energy spectrum. This transformation compresses the data volume and improves synthesis speed; for example, it can convert a 1025-dimensional energy spectrum into an 80-dimensional logarithmic Mel energy spectrum. The Mel filter bank may be a matrix.
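The framing, STFT, energy spectrum, and log-Mel steps above can be sketched end to end. The sample rate, FFT size, hop, and the simplified triangular filter bank are assumptions for illustration; with n_fft = 2048 the energy spectrum has 1025 bins, matching the 1025-to-80 example in the text.

```python
import numpy as np

def mel_filter_bank(n_mels, n_bins, sr=22050):
    # Simplified triangular Mel filter bank (illustrative, not a
    # production implementation): n_mels filters over n_bins FFT bins.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts_hz = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_bins - 1) * 2.0 * pts_hz / sr).astype(int)
    K = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        lo, c, hi = bins[m], bins[m + 1], bins[m + 2]
        for b in range(lo, c):
            K[m, b] = (b - lo) / (c - lo)   # rising slope
        for b in range(c, hi):
            K[m, b] = (hi - b) / (hi - c)   # falling slope
    return K

def log_mel_energy_spectrum(wav, n_fft=2048, hop=512, n_mels=80, sr=22050):
    # Frame the signal, window it, apply an STFT, form the energy
    # spectrum A^2 + B^2, project through the Mel filter bank K, and
    # take the logarithm (the small constant guards against log(0)).
    window = np.hanning(n_fft)
    frames = [wav[i:i + n_fft] * window
              for i in range(0, len(wav) - n_fft + 1, hop)]
    Z = np.fft.rfft(np.stack(frames), axis=1)        # complex spectrum, [T, 1025]
    energy = Z.real ** 2 + Z.imag ** 2               # energy spectrum
    K = mel_filter_bank(n_mels, n_fft // 2 + 1, sr)  # [80, 1025]
    return np.log(K @ energy.T + 1e-10)              # log Mel energy, [80, T]
```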
For step S320, it should be noted that the specific manner of obtaining the first synthesized audio is not limited, and may be set according to the actual application requirement. For example, in one alternative example, the step may include the following sub-steps:
inputting the logarithmic Mel energy spectrum of the voice data to be processed into a preset voice synthesis model, and calculating according to a preset pseudo-inverse matrix to obtain a pseudo-inverse energy spectrum of the voice data to be processed; carrying out short-time Fourier transform processing on the pseudo-inverse energy spectrum to obtain a transform audio of the voice data to be processed; and synthesizing the logarithmic Mel energy spectrum and the transformed audio of the voice data to be processed to obtain a first synthesized audio of the voice data to be processed.
In detail, the pseudo-inverse matrix may be obtained by computing the matrix pseudo-inverse of the Mel filter bank; the logarithmic Mel energy spectrum is exponentiated and then multiplied by the pseudo-inverse matrix to obtain the pseudo-inverse energy spectrum.
The Mel filter bank may be a real matrix K and the energy spectrum a real matrix S, so the logarithmic Mel energy spectrum is computed as M = ln(K·S), where M denotes the logarithmic Mel energy spectrum. The pseudo-inverse energy spectrum calculation can then be regarded as the inverse of this operation: S′ = K′·exp(M), where K′ denotes the preset pseudo-inverse matrix and S′ denotes the pseudo-inverse energy spectrum.
For example, K may be a matrix of size [80, 1025] and S a matrix of size [1025, T], where T is the number of frames; the logarithmic Mel energy spectrum M is then a matrix of size [80, T], the pseudo-inverse matrix K′ is a matrix of size [1025, 80], and the pseudo-inverse energy spectrum S′ is a matrix of size [1025, T], where KK′ = I and I is the 80×80 identity matrix.
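The pseudo-inverse computation can be sketched with numpy's Moore-Penrose pseudo-inverse. As before, a random matrix stands in for the real filter bank K; since a random [80, 1025] matrix has full row rank, K multiplied by its pseudo-inverse recovers the 80×80 identity, matching the KK′ = I relation above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_freq, n_mel, n_frames = 1025, 80, 10

# Placeholder filter bank K of size [80, 1025]; its Moore-Penrose
# pseudo-inverse K' has size [1025, 80].
K = rng.random((n_mel, n_freq))
K_pinv = np.linalg.pinv(K)

# With full row rank, K @ K' is the 80x80 identity matrix.
assert np.allclose(K @ K_pinv, np.eye(n_mel), atol=1e-6)

# Inverse of the log-Mel computation: S' = K' exp(M).
M = rng.standard_normal((n_mel, n_frames))   # stand-in log-Mel spectrum
S_pinv = K_pinv @ np.exp(M)                  # pseudo-inverse energy spectrum

print(S_pinv.shape)                          # [1025, T] as in the text
```

Note that S′ only approximates the original energy spectrum, since the 1025-to-80 compression discards information; the neural network compensates for this loss.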
Further, the pseudo-inverse energy spectrum may be initialized as a complex spectrum with an imaginary part of 0, and an inverse short-time Fourier transform is computed to obtain the transformed audio wav6 of the speech data to be processed. That is, although the pseudo-inverse energy spectrum S′ is real-valued, it can be treated as a complex number with zero imaginary part before the inverse short-time Fourier transform is applied. The logarithmic Mel energy spectrum M and wav6 are then input into the neural network G to obtain the first synthesized audio.
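The inverse short-time Fourier transform step can be sketched as a per-frame inverse FFT followed by windowed overlap-add. The FFT size (2048, matching 1025 frequency bins), the Hann window, and the half-window hop are assumptions for illustration; the source does not specify these parameters.

```python
import numpy as np

def istft_overlap_add(spec, n_fft=2048, hop=512):
    """Minimal inverse STFT sketch: spec has shape [n_fft//2 + 1, T] and is
    treated as a complex spectrum (here real-valued, imaginary part 0)."""
    window = np.hanning(n_fft)
    n_frames = spec.shape[1]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for t in range(n_frames):
        frame = np.fft.irfft(spec[:, t], n=n_fft) * window
        out[t * hop : t * hop + n_fft] += frame
        norm[t * hop : t * hop + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-10)   # normalize the overlap-add

# Stand-in pseudo-inverse energy spectrum S', cast to a complex array with
# zero imaginary part as described in the text.
S_pinv = np.abs(np.random.default_rng(0).standard_normal((1025, 8)))
wav6 = istft_overlap_add(S_pinv.astype(complex))
print(wav6.shape)
```

The resulting wav6 is only a rough waveform estimate (the phase is trivially zero), but it serves as the initial input that the neural network G refines.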
By the above method, a high-quality speech synthesis method based on the logarithmic Mel energy spectrum is provided. The result of the inverse short-time Fourier transform is used as one of the initial inputs of the neural network, and model training is assisted by correlation coefficients and by linear predictive coding. The method provided by the present application therefore requires fewer parameters and synthesizes rapidly, and, in cooperation with a pre-trained logarithmic Mel energy spectrum prediction model, can synthesize a target logarithmic Mel energy spectrum into a high-fidelity, stable, high-quality speech signal.
That is to say, the matrix pseudo-inverse and the inverse short-time Fourier transform provide a good initial input to the neural network, reducing the parameters required by the generator and enhancing its generative capacity; applying the autocorrelation coefficients in model training enhances the generator's ability to reconstruct the phase of the generated audio (better intelligibility); and training the model based on linear prediction coefficients effectively improves the quality and stability of the synthesized audio, making it sound more natural to the human ear.
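The linear prediction coefficients mentioned above can be computed from the autocorrelation sequence via the Levinson-Durbin recursion; this is one standard way to obtain LPC features, sketched below under the assumption of a second-order test signal (the source does not specify the LPC order or estimation method).

```python
import numpy as np

def lpc_autocorrelation(x, order):
    """Linear prediction coefficients via the autocorrelation method
    (Levinson-Durbin recursion). Returns a with a[0] = 1, so that
    x[n] + a[1] x[n-1] + ... + a[order] x[n-order] ~ residual."""
    # Biased autocorrelation r[0..order]
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1 : 0 : -1])) / err
        a[1 : i + 1] += k * a[:i][::-1]   # a_new[j] = a[j] + k * a[i - j]
        err *= 1.0 - k * k                # prediction error update
    return a

# Synthesize a stable AR(2) test signal with known coefficients [1, -0.9, 0.2].
rng = np.random.default_rng(0)
e = rng.standard_normal(4096)
x = np.zeros_like(e)
for n in range(len(e)):
    x[n] = e[n]
    if n >= 1:
        x[n] += 0.9 * x[n - 1]
    if n >= 2:
        x[n] -= 0.2 * x[n - 2]

a = lpc_autocorrelation(x, 2)
print(np.round(a, 2))   # close to [1, -0.9, 0.2]
```

In training, such coefficients would be computed per frame of the reference audio and compared against those of the generated audio, penalizing mismatches in the spectral envelope.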
With reference to fig. 4, an embodiment of the present application further provides a speech synthesis apparatus 400, the functions of which correspond to the steps executed by the foregoing method. The speech synthesis apparatus 400 may be understood as a processor of the electronic device 100, or as a component independent of the electronic device 100 or its processor that implements the functions of the present application under the control of the electronic device 100. The speech synthesis apparatus 400 may include a data obtaining module 410 and a speech synthesis module 420.
The data obtaining module 410 is configured to obtain the logarithmic Mel energy spectrum of the voice data to be processed. In the embodiment of the present application, the data obtaining module 410 may be configured to perform step S310 shown in fig. 3; for relevant contents of the data obtaining module 410, reference may be made to the foregoing description of step S310.
The speech synthesis module 420 is configured to input the logarithmic mel energy spectrum of the speech data to be processed into a preset speech synthesis model to obtain a first synthesized audio, where the preset speech synthesis model is obtained by training according to the logarithmic mel energy spectrum of the training data. In the embodiment of the present application, the speech synthesis module 420 may be configured to perform step S320 shown in fig. 3, and reference may be made to the foregoing description of step S320 for relevant contents of the speech synthesis module 420.
In addition, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program performs the steps of the speech synthesis method.
The computer program product of the speech synthesis method provided in the embodiment of the present application includes a computer-readable storage medium storing program code, and the instructions included in the program code may be used to execute the steps of the speech synthesis method in the above method embodiment; for details, reference may be made to the above method embodiment, which are not repeated here.
To sum up, the speech synthesis method and apparatus, electronic device, and storage medium provided in the embodiments of the present application obtain synthesized audio by inputting the logarithmic Mel energy spectrum of the speech data to be processed into the preset speech synthesis model. Synthesized audio can thus be obtained directly from the logarithmic Mel energy spectrum, which alleviates the low speech synthesis efficiency caused by the large number of parameters required by prior-art speech synthesis methods.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.