TTS system performance test method, device, equipment and medium
1. A TTS system performance test method, characterized by comprising the following steps:
acquiring a text prediction result and a voice prediction result of an input text by a TTS system;
determining a text processing performance test result of the TTS system based on the text prediction result;
determining a voice conversion performance test result of the TTS system based on the voice prediction result;
and determining the comprehensive performance test result of the TTS system based on the text processing performance test result and the voice conversion performance test result.
2. The method for testing the performance of a TTS system of claim 1, wherein obtaining the text prediction result and the speech prediction result of the TTS system for the input text comprises:
acquiring predicted phonemes, predicted numbers, predicted symbols, predicted prosody and predicted accent positions of the TTS system on the input text as the text prediction result;
and acquiring the predicted voice of the TTS system to the input text as the voice prediction result.
3. The method for testing the performance of a TTS system of claim 2, wherein determining a text processing performance test result of said TTS system based on said text prediction result comprises:
determining a text accuracy test result of the TTS system based on the text prediction result and the text labeling result of the input text;
and determining a text processing performance test result of the TTS system based on the text accuracy test result.
4. The method for testing the performance of a TTS system of claim 3, wherein determining the text accuracy test result of said TTS system based on said text prediction result and the text labeling result of said input text comprises:
determining phoneme prediction accuracy, digit conversion accuracy, symbol conversion accuracy, prosody prediction accuracy and accent position prediction accuracy based on the predicted phonemes, predicted digits, predicted symbols, predicted prosody and predicted accent positions contained in the text prediction result and the annotated phonemes, annotated digits, annotated symbols, annotated prosody and annotated accent positions of the input text contained in the text annotation result;
and weighting and summing the phoneme prediction accuracy, the digit conversion accuracy, the symbol conversion accuracy, the prosody prediction accuracy and the accent position prediction accuracy based on the respective weights corresponding to the phoneme prediction accuracy, the digit conversion accuracy, the symbol conversion accuracy, the prosody prediction accuracy and the accent position prediction accuracy to obtain the text accuracy test result.
5. The TTS system performance testing method of claim 3, further comprising:
and determining a text responsiveness test result of the TTS system based on the text processing duration of the TTS system on the input text.
6. The method for testing the performance of a TTS system of claim 5, wherein determining the result of the test for text responsiveness of the TTS system based on the text processing duration of the TTS system for the input text comprises:
determining the actual text processing real-time rate of the TTS system based on the ratio of the text processing duration to the total duration of the labeled voice corresponding to the input text;
and determining the text responsiveness test result based on the actual text processing real-time rate and the target text processing real-time rate of the input text.
7. The method for testing the performance of a TTS system of claim 6, wherein determining the text responsiveness test result of the TTS system based on the actual text processing real-time rate and the target text processing real-time rate of the input text comprises:
if the actual text processing real-time rate is determined to be smaller than or equal to the target text processing real-time rate, determining that the text responsiveness test result is 1;
and if the actual text processing real-time rate is determined to be greater than the target text processing real-time rate, determining the text responsiveness test result based on the reciprocal of the ratio of the actual text processing real-time rate to the target text processing real-time rate.
8. The TTS system performance testing method of any of claims 5-7, wherein determining a text processing performance test result for the TTS system based on the text accuracy test result comprises:
and carrying out weighted summation on the text accuracy test result and the text responsiveness test result based on the respective corresponding weights of the text accuracy test result and the text responsiveness test result to obtain the text processing performance test result.
9. The TTS system performance testing method of any of claims 2-7, wherein determining a speech conversion performance test result for the TTS system based on the speech prediction result comprises:
determining a voice accuracy test result of the TTS system based on the voice prediction result and the voice labeling result of the input text;
and determining a voice conversion performance test result of the TTS system based on the voice accuracy test result.
10. The method for testing the performance of a TTS system of claim 9, wherein determining the result of the speech accuracy test of the TTS system based on the result of the speech prediction and the result of the speech annotation of the input text comprises:
determining pronunciation generation similarity, Mel frequency spectrum similarity, duration generation similarity, fundamental frequency generation similarity and energy generation similarity of the TTS system based on the predicted speech contained in the speech prediction result and the labeled speech of the input text contained in the speech labeling result;
and weighting and summing the pronunciation generation similarity, the Mel frequency spectrum similarity, the duration generation similarity, the fundamental frequency generation similarity and the energy generation similarity based on the weights corresponding to the pronunciation generation similarity, the Mel frequency spectrum similarity, the duration generation similarity, the fundamental frequency generation similarity and the energy generation similarity to obtain the voice accuracy test result.
11. The TTS system performance testing method of claim 7, further comprising:
and determining a voice responsiveness test result of the TTS system based on the voice synthesis duration of the TTS system to the input text.
12. The method for testing the performance of a TTS system of claim 11, wherein determining the result of the test for the speech responsiveness of the TTS system based on the speech synthesis duration of the TTS system for the input text comprises:
determining the actual speech synthesis real-time rate of the TTS system based on the ratio of the speech synthesis duration to the total duration of the labeled speech corresponding to the input text;
and determining the voice responsiveness test result based on the actual voice synthesis real-time rate and the target voice synthesis real-time rate of the input text.
13. The TTS system performance testing method of claim 11, wherein determining the speech responsiveness test result based on the actual speech synthesis real-time rate and the target speech synthesis real-time rate of the input text comprises:
if the actual voice synthesis real-time rate is determined to be smaller than or equal to the target voice synthesis real-time rate, determining that the voice responsiveness test result is 1;
and if the actual speech synthesis real-time rate is determined to be greater than the target speech synthesis real-time rate, determining the speech responsiveness test result based on the reciprocal of the ratio of the actual speech synthesis real-time rate to the target speech synthesis real-time rate.
14. The TTS system performance testing method of any of claims 11-13, wherein determining a speech conversion performance test result for the TTS system based on the speech accuracy test result comprises:
and carrying out weighted summation on the voice accuracy test result and the voice responsiveness test result based on the respective corresponding weights of the voice accuracy test result and the voice responsiveness test result to obtain the voice conversion performance test result.
15. The TTS system performance testing method of any of claims 1-7, wherein determining a combined performance test result for the TTS system based on the text processing performance test result and the speech conversion performance test result comprises:
and carrying out weighted summation on the text processing performance test result and the voice conversion performance test result based on the respective corresponding weights of the text processing performance test result and the voice conversion performance test result to obtain a comprehensive performance test result of the TTS system.
16. A TTS system performance testing apparatus, comprising:
the prediction result acquisition unit is used for acquiring a text prediction result and a voice prediction result of the TTS system on the input text;
the text processing performance testing unit is used for determining a text processing performance testing result of the TTS system based on the text prediction result;
the voice conversion performance testing unit is used for determining a voice conversion performance testing result of the TTS system based on the voice prediction result;
and the comprehensive performance determining unit is used for determining the comprehensive performance test result of the TTS system based on the text processing performance test result and the voice conversion performance test result.
17. A TTS system performance testing device, comprising: memory, processor and computer program stored on the memory and executable on the processor, the processor implementing the TTS system performance testing method of any of claims 1-15 when executing the computer program.
18. A computer readable storage medium, characterized in that the computer readable storage medium stores computer instructions which, when executed by a processor, implement the TTS system performance testing method of any of claims 1-15.
Background
A Text-to-Speech (TTS) system converts text into synthesized speech that is as close as possible to real human speech, according to the pronunciation specifications of a specific language. TTS systems are widely applied in scenarios such as voice assistants, smart homes, and map navigation.
At present, voice synthesized by a TTS system is generally scored by Mean Opinion Score (MOS), and performance of the TTS system is determined according to a voice scoring result.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a medium for testing the performance of a TTS system, which are used for solving the problem of low accuracy of the method for testing the performance of the TTS system in the prior art.
The technical scheme provided by the embodiment of the application is as follows:
in one aspect, an embodiment of the present application provides a method for testing performance of a TTS system, including:
acquiring a text prediction result and a voice prediction result of an input text by a TTS system;
determining a text processing performance test result of the TTS system based on the text prediction result;
determining a voice conversion performance test result of the TTS system based on the voice prediction result;
and determining the comprehensive performance test result of the TTS system based on the text processing performance test result and the voice conversion performance test result.
On the other hand, the embodiment of the present application provides a device for testing the performance of a TTS system, including:
the prediction result acquisition unit is used for acquiring a text prediction result and a voice prediction result of the TTS system on the input text;
the text processing performance testing unit is used for determining a text processing performance testing result of the TTS system based on the text prediction result;
the voice conversion performance testing unit is used for determining a voice conversion performance testing result of the TTS system based on the voice prediction result;
and the comprehensive performance determining unit is used for determining the comprehensive performance test result of the TTS system based on the text processing performance test result and the voice conversion performance test result.
On the other hand, an embodiment of the present application provides a TTS system performance testing device, including: a memory, a processor, and a computer program that is stored on the memory and executable on the processor, wherein the processor implements the TTS system performance testing method provided by the embodiment of the application when executing the computer program.
On the other hand, an embodiment of the present application further provides a computer-readable storage medium, where computer instructions are stored in the computer-readable storage medium, and when the computer instructions are executed by a processor, the method for testing the performance of the TTS system provided in the embodiment of the present application is implemented.
The beneficial effects of the embodiment of the application are as follows:
In the embodiment of the application, objective indexes in the aspects of text processing and voice conversion are adopted in place of subjective indexes to test the performance of the TTS system. On the one hand, this saves the time and labor required for testing the performance of the TTS system; on the other hand, it unifies the performance evaluation standards of the TTS system, eliminates the influence of human factors on the performance evaluation of the TTS system, and improves the accuracy of the performance test of the TTS system.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1a is a schematic flow chart illustrating an overview of a TTS system performance testing method for performing a serial test on text processing performance and voice conversion performance according to an embodiment of the present application;
FIG. 1b is a schematic flowchart illustrating another overview of a TTS system performance testing method for performing a serial test on text processing performance and speech conversion performance according to an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating an overview of a TTS system performance testing method for performing parallel testing on text processing performance and speech conversion performance in the embodiment of the present application;
FIG. 3 is a schematic flowchart of a TTS system performance testing method for performing parallel testing on text processing performance and speech conversion performance in the embodiment of the present application;
FIG. 4 is a functional structure diagram of a TTS system performance testing apparatus in an embodiment of the present application;
FIG. 5 is a schematic hardware structure diagram of a TTS system performance testing device in the embodiment of the present application.
Detailed Description
In order to make the purpose, technical solution and advantages of the present application clearer, the technical solution in the embodiments of the present application is described below clearly and completely with reference to the accompanying drawings. The described embodiments are only a part of the embodiments of the present application, and not all of them. All other embodiments that can be derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
To facilitate a better understanding of the present application by those skilled in the art, a brief description of the technical terms involved in the present application will be given below.
1. The TTS system is a system for converting an input text into synthesized speech close to the real voice of a speaker. In the embodiment of the application, the TTS system can be a client such as application software, or a lightweight application such as a mini program.
2. The text prediction result is the result of the TTS system performing text processing on the input text. In the embodiment of the present application, the text prediction result includes but is not limited to: the predicted phonemes, predicted digits, predicted symbols, predicted prosody, and predicted accent positions of the TTS system for the input text.
3. The text labeling result is the result of labeling the input text based on the real phonemes, real digits, real symbols, real prosody and real accent positions of the input text. In the embodiment of the present application, the text labeling result includes but is not limited to: the annotated phonemes, annotated digits, annotated symbols, annotated prosody and annotated accent positions of the input text.
4. The speech prediction result is the result of the TTS system performing speech synthesis on the input text. In the embodiment of the present application, the speech prediction result includes but is not limited to: the predicted speech of the TTS system for the input text.
5. The voice labeling result is the result of labeling the input text based on the real voice corresponding to the input text. In the embodiment of the present application, the voice labeling result includes but is not limited to: the labeled voice of the input text.
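For illustration only, the prediction results and labeling results defined above can be pictured as simple data containers. The following Python sketch is a minimal example under stated assumptions: the class names, field names and field types (for example, phonemes as a list of strings, and speech as a list of samples plus a sample rate) are hypothetical and are not specified by the present application.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextResult:
    """Hypothetical container usable for either a text prediction result
    or a text labeling result (field types are assumptions)."""
    phonemes: List[str]
    digits: List[str]
    symbols: List[str]
    prosody: List[str]
    accent_positions: List[int]


@dataclass
class SpeechResult:
    """Hypothetical container usable for either predicted speech
    or labeled speech (the representation is an assumption)."""
    waveform: List[float]
    sample_rate: int

    @property
    def duration_seconds(self) -> float:
        # Total duration of the speech, derived from the sample count.
        return len(self.waveform) / self.sample_rate
```

In such a layout, the same container type holds both the predicted and the labeled values, so the two can be compared field by field.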
After introducing the technical terms related to the present application, the following briefly introduces the application scenarios and design ideas of the embodiments of the present application.
In order to solve the problems of strong subjectivity and low accuracy of the MOS-based TTS system performance test method, in the embodiment of the application, TTS system performance test equipment such as a mobile phone, a tablet computer or a computer can obtain a text prediction result and a voice prediction result of the TTS system on an input text in the process of the TTS system processing the input text, determine a text processing performance test result of the TTS system based on the text prediction result, determine a voice conversion performance test result of the TTS system based on the voice prediction result, and determine a comprehensive performance test result of the TTS system based on the text processing performance test result and the voice conversion performance test result. In this way, objective indexes in the aspects of text processing and voice conversion are adopted in place of subjective indexes to test the performance of the TTS system. On the one hand, this saves the time and labor required for testing the performance of the TTS system; on the other hand, it unifies the performance evaluation standards of the TTS system, eliminates the influence of human factors on the performance evaluation of the TTS system, and improves the accuracy of the performance test of the TTS system.
After introducing the application scenario and the design concept of the embodiment of the present application, the following describes in detail the technical solution provided by the embodiment of the present application.
In this embodiment of the present application, the TTS system performance testing device may perform a serial test on the text processing performance and the voice conversion performance of the TTS system, and determine a comprehensive performance test result of the TTS system according to the text processing performance test result and the voice conversion performance test result obtained in the serial test process. Specifically, in an embodiment, referring to FIG. 1a, the TTS system performance testing device may first perform step 101: acquiring a text prediction result of the TTS system on an input text, and determining a text processing performance test result of the TTS system based on the text prediction result; then perform step 102: acquiring a voice prediction result of the TTS system on the input text, and determining a voice conversion performance test result of the TTS system based on the voice prediction result; and finally perform step 103: determining the comprehensive performance test result of the TTS system based on the text processing performance test result and the voice conversion performance test result. In another embodiment, referring to FIG. 1b, the TTS system performance testing device may first perform step 111: acquiring a voice prediction result of the TTS system on an input text, and determining a voice conversion performance test result of the TTS system based on the voice prediction result; then perform step 112: acquiring a text prediction result of the TTS system on the input text, and determining a text processing performance test result of the TTS system based on the text prediction result; and finally perform step 113: determining the comprehensive performance test result of the TTS system based on the voice conversion performance test result and the text processing performance test result.
In order to improve the efficiency of the performance test of the TTS system, the TTS system performance test device may instead perform a parallel test on the text processing performance and the voice conversion performance of the TTS system, and determine the comprehensive performance test result of the TTS system according to the text processing performance test result and the voice conversion performance test result obtained in the parallel test process. Specifically, as shown in FIG. 2, the TTS system performance test device may perform step 201: acquiring a voice prediction result of the TTS system on the input text, and determining a voice conversion performance test result of the TTS system based on the voice prediction result; and, in parallel, perform step 202: acquiring a text prediction result of the TTS system on the input text, and determining a text processing performance test result of the TTS system based on the text prediction result; and then perform step 203: determining the comprehensive performance test result of the TTS system based on the text processing performance test result and the voice conversion performance test result.
The following describes in detail each step in the TTS system performance testing method provided in the embodiment of the present application, as shown in FIG. 1a, FIG. 1b, and FIG. 2.
In the embodiment of the application, when the TTS system performance test device acquires the text prediction result and the speech prediction result of the TTS system for the input text, different acquisition modes may be adopted depending on where the TTS system is installed. Specifically, the following two cases may exist, without limitation:
in the first case: the TTS system is installed on the TTS system performance testing equipment.
In this case, the TTS system performance testing device may call the TTS system to process the input text, and obtain a text prediction result and a speech prediction result of the TTS system on the input text in the process of calling the TTS system to process the input text. Specifically, the TTS system performance testing device may obtain the predicted phonemes, predicted digits, predicted symbols, predicted prosody, and predicted accent positions of the TTS system on the input text as the text prediction result, and obtain the predicted speech of the TTS system on the input text as the speech prediction result.
In the second case: the TTS system is installed on other mobile phones, tablet computers, computers and other terminal equipment.
In this case, the TTS system performance testing device can send a TTS system performance test instruction to the terminal device. When the terminal device receives the instruction, it calls the TTS system to process the input text indicated by the instruction, and the TTS system performance testing device obtains a text prediction result and a speech prediction result of the TTS system on the input text in the process of the terminal device calling the TTS system to process the input text. Optionally, in an embodiment, in the process of calling the TTS system to process the input text, the terminal device may send the text prediction result and the speech prediction result of the TTS system on the input text to the TTS system performance testing device. Specifically, the terminal device may send the predicted phonemes, predicted digits, predicted symbols, predicted prosody, and predicted accent positions of the TTS system on the input text to the TTS system performance testing device as the text prediction result, and send the predicted speech of the TTS system on the input text to the TTS system performance testing device as the speech prediction result, so that the TTS system performance testing device obtains both results. In another embodiment, the TTS system performance testing device may actively acquire the text prediction result and the speech prediction result of the TTS system on the input text from the terminal device in the process of the terminal device calling the TTS system to process the input text. Specifically, the TTS system performance testing device may actively acquire the predicted phonemes, predicted digits, predicted symbols, predicted prosody, and predicted accent positions of the TTS system on the input text from the terminal device as the text prediction result, and actively acquire the predicted speech of the TTS system on the input text as the speech prediction result.
In the embodiment of the application, when determining the text processing performance test result of the TTS system based on the text prediction result, the TTS system performance test device may adopt, but is not limited to, the following modes:
First, the TTS system performance test equipment determines a text accuracy test result of the TTS system based on a text prediction result and a text labeling result of an input text.
Specifically, the TTS system performance testing device may determine the phoneme prediction accuracy, the digit conversion accuracy, the symbol conversion accuracy, the prosody prediction accuracy and the accent position prediction accuracy based on the predicted phonemes, predicted digits, predicted symbols, predicted prosody and predicted accent positions included in the text prediction result and the annotated phonemes, annotated digits, annotated symbols, annotated prosody and annotated accent positions of the input text included in the text annotation result. The device then performs weighted summation on the phoneme prediction accuracy, the digit conversion accuracy, the symbol conversion accuracy, the prosody prediction accuracy and the accent position prediction accuracy based on their respective weights to obtain the text accuracy test result, wherein the text accuracy test result has a numerical range of [0, 1].
In practical application, the respective weights of the phoneme prediction accuracy, the digit conversion accuracy, the symbol conversion accuracy, the prosody prediction accuracy and the accent position prediction accuracy can be flexibly set according to different practical requirements. For example, for a TTS system with a speech learning function, the requirements on prosody and accent are not high, so the respective weights of the prosody prediction accuracy and the accent position prediction accuracy may be set to smaller values (e.g., both set to 0.05), and the respective weights of the phoneme prediction accuracy, the digit conversion accuracy, and the symbol conversion accuracy may be set to larger values (e.g., all set to 0.3). For another example, for a TTS system with a story reading function, the requirements on digit conversion and symbol conversion are not high, so the respective weights of the digit conversion accuracy and the symbol conversion accuracy may be set to smaller values (e.g., both set to 0.05), and the respective weights of the prosody prediction accuracy, the accent position prediction accuracy, and the phoneme prediction accuracy may be set to larger values (e.g., all set to 0.3).
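The weighted summation described above can be sketched in Python as follows. The two weight tables reproduce the example weights given in the text; the function name, the dictionary keys and the five sample accuracy values are hypothetical and serve only to make the arithmetic concrete.

```python
def text_accuracy_result(accuracies: dict, weights: dict) -> float:
    """Weighted sum of the five per-item accuracies (each assumed in [0, 1])."""
    if accuracies.keys() != weights.keys():
        raise ValueError("accuracies and weights must cover the same items")
    return sum(weights[name] * accuracies[name] for name in accuracies)


# Example weights from the text: speech learning function.
SPEECH_LEARNING_WEIGHTS = {
    "phoneme": 0.3, "digit": 0.3, "symbol": 0.3,
    "prosody": 0.05, "accent_position": 0.05,
}

# Example weights from the text: story reading function.
STORY_READING_WEIGHTS = {
    "phoneme": 0.3, "digit": 0.05, "symbol": 0.05,
    "prosody": 0.3, "accent_position": 0.3,
}

# Hypothetical measured accuracies, for illustration only.
sample_accuracies = {
    "phoneme": 0.98, "digit": 0.9, "symbol": 0.8,
    "prosody": 0.6, "accent_position": 0.5,
}

learning_score = text_accuracy_result(sample_accuracies, SPEECH_LEARNING_WEIGHTS)
story_score = text_accuracy_result(sample_accuracies, STORY_READING_WEIGHTS)
```

With these sample values, the same TTS system scores about 0.859 under the speech learning weights and about 0.709 under the story reading weights, because its weaker prosody and accent prediction counts for more in the second profile. Since each example weight set sums to 1, the result stays within [0, 1].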
Then, the TTS system performance test equipment determines a text processing performance test result of the TTS system based on the text accuracy test result.
Optionally, in an embodiment, the TTS system performance test device may directly determine the text accuracy test result as the text processing performance test result of the TTS system.
In another embodiment, in order to further improve the comprehensiveness and accuracy of the TTS system performance test, the TTS system performance test device may further determine a text responsiveness test result of the TTS system based on the text processing duration of the TTS system for the input text, and then determine the text processing performance test result based on the text accuracy test result and the text responsiveness test result.
In practical application, when determining the text responsiveness test result of the TTS system based on the text processing duration of the TTS system for the input text, and determining the text processing performance test result based on the text accuracy test result and the text responsiveness test result, the TTS system performance test device may adopt, but is not limited to, the following modes:
First, the TTS system performance test equipment determines the actual text processing real-time rate of the TTS system based on the ratio of the text processing duration to the total duration of the labeled voice corresponding to the input text.
In specific implementation, the TTS system performance test device may directly determine a ratio of the text processing duration to the total duration of the labeled speech corresponding to the input text as an actual text processing real-time rate of the TTS system.
Then, the TTS system performance test device determines a text responsiveness test result based on the actual text processing real-time rate and the target text processing real-time rate of the input text.
In specific implementation, if the TTS system performance test equipment determines that the actual text processing real-time rate is less than or equal to the target text processing real-time rate, the text responsiveness test result is determined to be 1; and if the TTS system performance test equipment determines that the actual text processing real-time rate is greater than the target text processing real-time rate, the text responsiveness test result is determined based on the reciprocal of the ratio of the actual text processing real-time rate to the target text processing real-time rate. Specifically, the TTS system performance test device may directly determine the reciprocal of the ratio of the actual text processing real-time rate to the target text processing real-time rate as the text responsiveness test result, where the numerical range of the text responsiveness test result is [0, 1].
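The real-time rate and the responsiveness rule above can be sketched in Python as follows. The function names and the sample durations are hypothetical; the sketch assumes durations are given in seconds and that the labeled voice has a positive total duration.

```python
def real_time_rate(processing_seconds, labeled_speech_seconds):
    """Ratio of the processing duration to the total duration of the labeled voice."""
    if labeled_speech_seconds <= 0:
        raise ValueError("labeled speech duration must be positive")
    return processing_seconds / labeled_speech_seconds


def responsiveness_score(actual_rate, target_rate):
    """Return 1 when the target is met, otherwise the reciprocal of actual/target."""
    if actual_rate <= target_rate:
        return 1.0
    # Reciprocal of (actual_rate / target_rate), which is below 1 in this branch.
    return target_rate / actual_rate


# Made-up example: 2 s of text processing for 10 s of labeled voice, target rate 0.1.
rate = real_time_rate(2.0, 10.0)
print(rate, responsiveness_score(rate, 0.1))
```

In this example the actual rate is twice the target, so the score is halved; any rate at or below the target is not rewarded beyond 1.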
And finally, the TTS system performance test equipment performs weighted summation on the text accuracy test result and the text responsiveness test result based on the respective weights corresponding to the text accuracy test result and the text responsiveness test result to obtain a text processing performance test result, wherein the numerical range of the text processing performance test result is [0, 2]. In practical application, the weights corresponding to the text accuracy test result and the text responsiveness test result can be set flexibly according to practical requirements. For example, for a TTS system with a speech learning function, which has a high requirement on accuracy, the weight of the text accuracy test result may be set to a large value (e.g., 0.7), and the weight of the text responsiveness test result may be set to a small value (e.g., 0.3); for another example, for a TTS system with a map navigation function, which has a high requirement on real-time performance, the weight of the text responsiveness test result may be set to a large value (e.g., 0.6), and the weight of the text accuracy test result may be set to a small value (e.g., 0.4).
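As a minimal sketch of this final combination, the following Python function applies the two example weight settings from the text to the same pair of results. The function name and the input values (accuracy 0.9, responsiveness 0.5) are hypothetical.

```python
def text_processing_result(accuracy, responsiveness, accuracy_weight, responsiveness_weight):
    """Weighted sum of the text accuracy and text responsiveness test results."""
    return accuracy_weight * accuracy + responsiveness_weight * responsiveness


# Made-up test results shared by both scenarios.
accuracy, responsiveness = 0.9, 0.5

# Speech learning example from the text: accuracy weighted 0.7, responsiveness 0.3.
print(round(text_processing_result(accuracy, responsiveness, 0.7, 0.3), 4))
# Map navigation example from the text: accuracy weighted 0.4, responsiveness 0.6.
print(round(text_processing_result(accuracy, responsiveness, 0.4, 0.6), 4))
```

The same system scores lower under the navigation weights because its weaker responsiveness result carries more weight there.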
In the embodiment of the application, when determining the result of the test of the voice conversion performance of the TTS system based on the result of the voice prediction, the TTS system performance test device may adopt, but is not limited to, the following modes:
First, the TTS system performance test equipment determines a voice accuracy test result of the TTS system based on the voice prediction result and the voice labeling result of the input text.
In specific implementation, the TTS system performance testing device may determine the pronunciation generation similarity, the mel-frequency spectrum similarity, the duration generation similarity, the fundamental frequency generation similarity and the energy generation similarity of the TTS system based on the predicted speech contained in the speech prediction result and the labeled speech of the input text contained in the speech labeling result. The device then performs weighted summation on these five similarities, based on the respective weights corresponding to the pronunciation generation similarity, the mel-frequency spectrum similarity, the duration generation similarity, the fundamental frequency generation similarity and the energy generation similarity, to obtain a speech accuracy test result, where the numerical range of the speech accuracy test result is [0, 1].
In practical application, the weights corresponding to the pronunciation generation similarity, the mel-frequency spectrum similarity, the duration generation similarity, the fundamental frequency generation similarity and the energy generation similarity can be set flexibly according to practical requirements. Specifically, the following two modes may be adopted, but the setting is not limited to them. Mode 1: the weights are set according to the test target in practical application; for example, if most of the predicted speech output by a TTS system has duration problems, the weight of the duration generation similarity may be set to a large value (e.g., 0.4), and the weights of the pronunciation generation similarity, the mel-frequency spectrum similarity, the fundamental frequency generation similarity and the energy generation similarity may be set to small values (e.g., each set to 0.15). Mode 2: the weights are set according to the consistency between the overall score of the TTS system and a subjective listening test; for example, the weight combination that is most consistent with the MOS subjective listening test is selected from a plurality of weight combinations. Of course, in the embodiment of the present application, the weights corresponding to the pronunciation generation similarity, the mel-frequency spectrum similarity, the duration generation similarity, the fundamental frequency generation similarity and the energy generation similarity may also each be set to 0.2.
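Mode 2 can be illustrated with the following Python sketch. The text does not specify how consistency with the MOS listening test is measured; Pearson correlation is used here purely as an assumption. All names, similarity values and MOS scores are hypothetical. Each similarity tuple is ordered as (pronunciation, mel spectrum, duration, fundamental frequency, energy).

```python
from math import sqrt


def speech_accuracy(similarities, weights):
    """Weighted sum of the five similarities for one utterance."""
    return sum(w * s for w, s in zip(weights, similarities))


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def pick_weights(candidates, similarity_rows, mos_scores):
    """Select the weight combination whose scores correlate best with MOS (assumed criterion)."""
    return max(
        candidates,
        key=lambda w: pearson([speech_accuracy(row, w) for row in similarity_rows], mos_scores),
    )


UNIFORM = (0.2, 0.2, 0.2, 0.2, 0.2)
DURATION_HEAVY = (0.15, 0.15, 0.4, 0.15, 0.15)

# Made-up data: three utterances whose MOS tracks the duration similarity.
rows = [
    (0.9, 0.9, 0.9, 0.9, 0.5),
    (0.9, 0.9, 0.7, 0.9, 0.9),
    (0.9, 0.9, 0.4, 0.9, 0.6),
]
mos = [4.5, 3.5, 2.0]
print(pick_weights([UNIFORM, DURATION_HEAVY], rows, mos))
```

On this sample data the duration-heavy combination is selected, since listeners' ratings follow the duration similarity more closely than the other components. A rank correlation could be substituted for Pearson correlation without changing the structure of the selection.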
And then, the TTS system performance test equipment determines a voice conversion performance test result of the TTS system based on the voice accuracy test result.
Optionally, in an embodiment, the TTS system performance test device may directly determine the speech accuracy test result as the speech conversion performance test result of the TTS system.
In another embodiment, in order to further improve the comprehensiveness and accuracy of the performance test of the TTS system, the TTS system performance test device may further determine the voice conversion performance test result based on the voice accuracy test result and the voice responsiveness test result after determining the voice responsiveness test result of the TTS system based on the voice synthesis duration of the TTS system for the input text.
In practical application, when determining the voice responsiveness test result of the TTS system based on the voice synthesis duration of the TTS system for the input text, and determining the voice conversion performance test result based on the voice accuracy test result and the voice responsiveness test result, the TTS system performance testing device may adopt, but is not limited to, the following manner:
First, the TTS system performance test equipment determines the actual speech synthesis real-time rate of the TTS system based on the ratio of the speech synthesis duration to the total duration of the labeled speech corresponding to the input text.
In specific implementation, the TTS system performance testing device may directly determine a ratio of a speech synthesis duration to a total duration of the labeled speech corresponding to the input text as an actual speech synthesis real-time rate of the TTS system.
Then, the TTS system performance test device determines a voice responsiveness test result based on the actual voice synthesis real-time rate and the target voice synthesis real-time rate of the input text.
In specific implementation, if the TTS system performance test equipment determines that the actual speech synthesis real-time rate is less than or equal to the target speech synthesis real-time rate, the speech responsiveness test result is determined to be 1; and if the TTS system performance test equipment determines that the actual speech synthesis real-time rate is greater than the target speech synthesis real-time rate, the speech responsiveness test result is determined based on the reciprocal of the ratio of the actual speech synthesis real-time rate to the target speech synthesis real-time rate. Specifically, the TTS system performance test device may directly determine the reciprocal of the ratio of the actual speech synthesis real-time rate to the target speech synthesis real-time rate as the speech responsiveness test result, where the numerical range of the speech responsiveness test result is [0, 1].
And finally, the TTS system performance test equipment performs weighted summation on the voice accuracy test result and the voice responsiveness test result based on the respective corresponding weights of the voice accuracy test result and the voice responsiveness test result to obtain a voice conversion performance test result, wherein the numerical range of the voice conversion performance test result is [0, 2]. In practical application, the respective weights of the voice accuracy test result and the voice responsiveness test result can be set flexibly according to practical requirements. For example, for a TTS system with a speech learning function, which has a high requirement on accuracy, the weight of the voice accuracy test result may be set to a large value (e.g., 0.7), and the weight of the voice responsiveness test result may be set to a small value (e.g., 0.3); for another example, for a TTS system with a map navigation function, which has a high requirement on real-time performance, the weight of the voice responsiveness test result may be set to a large value (e.g., 0.6), and the weight of the voice accuracy test result may be set to a small value (e.g., 0.4).
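The voice branch can be sketched end to end in Python as a single function that goes from the measured synthesis duration to the voice conversion performance test result. The function name, parameter names, default weights and sample values are hypothetical; durations are assumed to be in seconds.

```python
def voice_conversion_result(
    speech_accuracy,
    synthesis_seconds,
    labeled_speech_seconds,
    target_rate,
    accuracy_weight=0.7,
    responsiveness_weight=0.3,
):
    """Combine speech accuracy with speech responsiveness by weighted summation."""
    # Actual speech synthesis real-time rate: synthesis duration over labeled voice duration.
    actual_rate = synthesis_seconds / labeled_speech_seconds
    # Responsiveness is 1 when the target is met, else the reciprocal of actual/target.
    if actual_rate <= target_rate:
        responsiveness = 1.0
    else:
        responsiveness = target_rate / actual_rate
    return accuracy_weight * speech_accuracy + responsiveness_weight * responsiveness


# Made-up run: accuracy 0.8, 3 s of synthesis for 10 s of labeled voice, target rate 0.15.
print(round(voice_conversion_result(0.8, 3.0, 10.0, 0.15), 4))            # accuracy-oriented weights
print(round(voice_conversion_result(0.8, 3.0, 10.0, 0.15, 0.4, 0.6), 4))  # real-time-oriented weights
```

The default weights follow the speech learning example in the text, and the second call uses the map navigation example.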
In the embodiment of the application, when determining the comprehensive performance test result of the TTS system based on the text processing performance test result and the voice conversion performance test result, the TTS system performance test device may adopt, but is not limited to, the following manner: the TTS system performance test equipment performs weighted summation on the text processing performance test result and the voice conversion performance test result based on the respective corresponding weights of the text processing performance test result and the voice conversion performance test result to obtain a comprehensive performance test result of the TTS system. In practical application, the weights corresponding to the text processing performance test result and the voice conversion performance test result can be set flexibly according to actual requirements; preferably, the weights corresponding to the text processing performance test result and the voice conversion performance test result may each be set to 0.5.
The method for testing the performance of the TTS system provided by the embodiment of the present application is described below, taking as an example the case in which the TTS system performance testing device tests the text processing performance and the voice conversion performance of the TTS system in parallel. Referring to fig. 3, the specific process of the method for testing the performance of the TTS system provided by the embodiment of the present application is as follows:
Step 301: the TTS system performance testing device calls the TTS system to process the input text.
Step 302: in the process of processing the input text by the TTS system, the TTS system performance testing equipment acquires the predicted phonemes, predicted numbers, predicted symbols, predicted prosody and predicted accent positions of the input text by the TTS system as a text prediction result.
Step 303: the TTS system performance testing equipment calculates the phoneme prediction accuracy, the digit conversion accuracy, the symbol conversion accuracy, the prosody prediction accuracy and the accent position prediction accuracy based on the predicted phonemes, the predicted digits, the predicted symbols, the predicted prosody and the predicted accent positions contained in the text prediction result and the marked phonemes, marked digits, marked symbols, marked prosody and marked accent positions of the input text contained in the text marking result.
Step 304: the TTS system performance testing equipment weights and sums the phoneme prediction accuracy, the digital conversion accuracy, the symbol conversion accuracy, the rhythm prediction accuracy and the accent position prediction accuracy based on the respective corresponding weights of the phoneme prediction accuracy, the digital conversion accuracy, the symbol conversion accuracy, the rhythm prediction accuracy and the accent position prediction accuracy to obtain a text accuracy testing result.
Step 305: the TTS system performance testing equipment determines the ratio of the text processing time of the TTS system to the input text to the total time of the labeled voice corresponding to the input text as the actual text processing real-time rate of the TTS system.
Step 306: the TTS system performance test device determines whether the actual text processing real-time rate is less than or equal to the target text processing real-time rate, if so, performs step 307, and if not, performs step 308.
Step 307: the TTS system performance test device determines that the text responsiveness test result is 1.
Step 308: and the TTS system performance test equipment determines the reciprocal of the ratio of the actual text processing real-time rate to the target text processing real-time rate as a text responsiveness test result.
Step 309: and the TTS system performance test equipment performs weighted summation on the text accuracy test result and the text responsiveness test result based on the respective weights corresponding to the text accuracy test result and the text responsiveness test result to obtain a text processing performance test result.
Step 310: the TTS system performance testing equipment acquires the predicted speech of the TTS system to the input text as a speech prediction result in the process of processing the input text by the TTS system.
Step 311: the TTS system performance test equipment calculates the pronunciation generation similarity, the Mel frequency spectrum similarity, the duration generation similarity, the fundamental frequency generation similarity and the energy generation similarity of the TTS system based on the predicted speech contained in the speech prediction result and the labeled speech of the input text contained in the speech labeling result.
Step 312: the TTS system performance test equipment performs weighted summation on the pronunciation generation similarity, the Mel frequency spectrum similarity, the duration generation similarity, the fundamental frequency generation similarity and the energy generation similarity based on the weights corresponding to the pronunciation generation similarity, the Mel frequency spectrum similarity, the duration generation similarity, the fundamental frequency generation similarity and the energy generation similarity to obtain a voice accuracy test result.
Step 313: the TTS system performance test equipment determines the ratio of the speech synthesis duration of the TTS system to the input text to the total duration of the labeled speech corresponding to the input text as the actual speech synthesis real-time rate of the TTS system.
Step 314: the TTS system performance testing device determines whether the actual speech synthesis real-time rate is less than or equal to the target speech synthesis real-time rate, if so, performs step 315, and if not, performs step 316.
Step 315: the TTS system performance test device determines that the voice responsiveness test result is 1.
Step 316: the TTS system performance test equipment determines the reciprocal of the ratio of the actual speech synthesis real-time rate to the target speech synthesis real-time rate as a speech responsiveness test result.
Step 317: and the TTS system performance test equipment performs weighted summation on the voice accuracy test result and the voice responsiveness test result based on the respective corresponding weights of the voice accuracy test result and the voice responsiveness test result to obtain a voice conversion performance test result.
Step 318: and the TTS system performance test equipment performs weighted summation on the text processing performance test result and the voice conversion performance test result based on the respective corresponding weights of the text processing performance test result and the voice conversion performance test result to obtain a comprehensive performance test result of the TTS system.
In practical application, the TTS system performance testing device may execute step 302-.
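The flow of fig. 3, with the text branch and the voice branch evaluated in parallel, can be sketched in Python as follows. This is an illustrative sketch only: it assumes the per-item accuracies, similarities and durations have already been measured, and it uses a thread pool as one possible way to run the two branches concurrently. All function and parameter names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def weighted_sum(values, weights):
    """Weighted summation used at every level of the flow."""
    return sum(w * v for v, w in zip(values, weights))


def responsiveness(duration_s, labeled_speech_s, target_rate):
    """1 when the actual real-time rate meets the target, else target/actual."""
    actual_rate = duration_s / labeled_speech_s
    return 1.0 if actual_rate <= target_rate else target_rate / actual_rate


def branch_result(scores, score_weights, duration_s, labeled_speech_s,
                  target_rate, accuracy_weight, responsiveness_weight):
    """One branch: accuracy from item scores, then accuracy combined with responsiveness."""
    accuracy = weighted_sum(scores, score_weights)
    resp = responsiveness(duration_s, labeled_speech_s, target_rate)
    return accuracy_weight * accuracy + responsiveness_weight * resp


def comprehensive_result(text_args, voice_args, text_weight=0.5, voice_weight=0.5):
    """Run the text and voice branches concurrently, then combine them (step 318)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(branch_result, **text_args)
        voice_future = pool.submit(branch_result, **voice_args)
        return text_weight * text_future.result() + voice_weight * voice_future.result()


# Made-up measurements for one input text whose labeled voice lasts 10 s.
text_args = dict(scores=[1.0] * 5, score_weights=[0.2] * 5, duration_s=1.0,
                 labeled_speech_s=10.0, target_rate=0.1,
                 accuracy_weight=0.7, responsiveness_weight=0.3)
voice_args = dict(scores=[1.0] * 5, score_weights=[0.2] * 5, duration_s=4.0,
                  labeled_speech_s=10.0, target_rate=0.2,
                  accuracy_weight=0.7, responsiveness_weight=0.3)
print(round(comprehensive_result(text_args, voice_args), 4))
```

The default weights of 0.5 for each branch follow the preferred setting described earlier in the text.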
Based on the foregoing embodiments, an embodiment of the present application provides a TTS system performance testing apparatus. Referring to fig. 4, the TTS system performance testing apparatus 400 provided in the embodiment of the present application at least includes:
a prediction result obtaining unit 401, configured to obtain a text prediction result and a speech prediction result of the TTS system for the input text;
a text processing performance testing unit 402, configured to determine a text processing performance test result of the TTS system based on the text prediction result;
a voice conversion performance testing unit 403, configured to determine a voice conversion performance test result of the TTS system based on the voice prediction result;
and a comprehensive performance determining unit 404, configured to determine a comprehensive performance test result of the TTS system based on the text processing performance test result and the voice conversion performance test result.
In a possible implementation manner, when obtaining a text prediction result and a speech prediction result of an input text by a TTS system, the prediction result obtaining unit 401 is specifically configured to:
acquiring predicted phonemes, predicted numbers, predicted symbols, predicted prosody and predicted stress positions of the TTS system on the input text as text prediction results;
and acquiring the predicted voice of the TTS system to the input text as a voice prediction result.
In a possible implementation manner, when determining the text processing performance test result of the TTS system based on the text prediction result, the text processing performance test unit 402 is specifically configured to:
determining a text accuracy test result of the TTS system based on the text prediction result and the text labeling result of the input text;
and determining a text processing performance test result of the TTS system based on the text accuracy test result.
In a possible implementation manner, when determining a text accuracy test result of the TTS system based on the text prediction result and the text labeling result of the input text, the text processing performance test unit 402 is specifically configured to:
determining phoneme prediction accuracy, digit conversion accuracy, symbol conversion accuracy, prosody prediction accuracy and accent position prediction accuracy based on the predicted phonemes, predicted digits, predicted symbols, predicted prosody and predicted accent positions contained in the text prediction result and the annotated phonemes, annotated digits, annotated symbols, annotated prosody and annotated accent positions of the input text contained in the text annotation result;
and weighting and summing the phoneme prediction accuracy, the digit conversion accuracy, the symbol conversion accuracy, the prosody prediction accuracy and the accent position prediction accuracy based on the weights corresponding to the phoneme prediction accuracy, the digit conversion accuracy, the symbol conversion accuracy, the prosody prediction accuracy and the accent position prediction accuracy to obtain a text accuracy test result.
In one possible implementation, the text processing performance testing unit 402 is further configured to:
and determining a text responsiveness test result of the TTS system based on the text processing duration of the TTS system on the input text.
In a possible implementation manner, when determining a text responsiveness test result of the TTS system based on a text processing duration of the TTS system for an input text, the text processing performance testing unit 402 is specifically configured to:
determining the real-time rate of the actual text processing of the TTS system based on the ratio of the text processing duration to the total duration of the labeled voice corresponding to the input text;
and determining a text responsiveness test result based on the actual text processing real-time rate and the target text processing real-time rate of the input text.
In a possible implementation manner, when determining the text responsiveness test result of the TTS system based on the actual text processing real-time rate and the target text processing real-time rate of the input text, the text processing performance testing unit 402 is specifically configured to:
if the actual text processing real-time rate is determined to be smaller than or equal to the target text processing real-time rate, determining that the text responsiveness test result is 1;
and if the actual text processing real-time rate is determined to be greater than the target text processing real-time rate, determining a text responsiveness test result based on the reciprocal of the ratio of the actual text processing real-time rate to the target text processing real-time rate.
In a possible implementation manner, when determining the text processing performance test result of the TTS system based on the text accuracy test result, the text processing performance test unit 402 is specifically configured to:
and carrying out weighted summation on the text accuracy test result and the text responsiveness test result based on the weights corresponding to the text accuracy test result and the text responsiveness test result respectively to obtain a text processing performance test result.
In a possible implementation manner, when determining a result of testing the voice conversion performance of the TTS system based on the result of the voice prediction, the voice conversion performance testing unit 403 is specifically configured to:
determining a voice accuracy test result of the TTS system based on the voice prediction result and the voice labeling result of the input text;
and determining a voice conversion performance test result of the TTS system based on the voice accuracy test result.
In a possible implementation manner, when determining a speech accuracy test result of the TTS system based on the speech prediction result and the speech tagging result of the input text, the speech conversion performance test unit 403 is specifically configured to:
determining pronunciation generation similarity, Mel frequency spectrum similarity, duration generation similarity, fundamental frequency generation similarity and energy generation similarity of the TTS system based on the predicted speech contained in the speech prediction result and the labeled speech of the input text contained in the speech labeling result;
and weighting and summing the pronunciation generation similarity, the Mel frequency spectrum similarity, the duration generation similarity, the fundamental frequency generation similarity and the energy generation similarity based on the weights corresponding to the pronunciation generation similarity, the Mel frequency spectrum similarity, the duration generation similarity, the fundamental frequency generation similarity and the energy generation similarity to obtain a voice accuracy test result.
In a possible implementation, the voice conversion performance testing unit 403 is further configured to:
and determining a voice responsiveness test result of the TTS system based on the voice synthesis duration of the TTS system to the input text.
In a possible implementation manner, when determining a result of the speech responsiveness test of the TTS system based on a speech synthesis duration of the TTS system for the input text, the speech conversion performance test unit 403 is specifically configured to:
determining the actual speech synthesis real-time rate of the TTS system based on the ratio of the speech synthesis duration to the total duration of the labeled speech corresponding to the input text;
and determining a voice responsiveness test result based on the actual voice synthesis real-time rate and the target voice synthesis real-time rate of the input text.
In a possible implementation manner, when determining the speech responsiveness test result based on the actual speech synthesis real-time rate and the target speech synthesis real-time rate of the input text, the speech conversion performance test unit 403 is specifically configured to:
if the actual voice synthesis real-time rate is determined to be less than or equal to the target voice synthesis real-time rate, determining that the voice responsiveness test result is 1;
and if the actual speech synthesis real-time rate is larger than the target speech synthesis real-time rate, determining a speech responsiveness test result based on the reciprocal of the ratio of the actual speech synthesis real-time rate to the target speech synthesis real-time rate.
In a possible implementation manner, when determining the result of the speech conversion performance test of the TTS system based on the result of the speech accuracy test, the speech conversion performance test unit 403 is specifically configured to:
and carrying out weighted summation on the voice accuracy test result and the voice responsiveness test result based on the respective corresponding weights of the voice accuracy test result and the voice responsiveness test result to obtain a voice conversion performance test result.
In a possible implementation manner, when determining the comprehensive performance test result of the TTS system based on the text processing performance test result and the voice conversion performance test result, the comprehensive performance determining unit 404 is specifically configured to:
and carrying out weighted summation on the text processing performance test result and the voice conversion performance test result based on the weights corresponding to the text processing performance test result and the voice conversion performance test result respectively to obtain the comprehensive performance test result of the TTS system.
It should be noted that the principle of the TTS system performance testing apparatus 400 provided in the embodiment of the present application for solving the technical problem is similar to that of the TTS system performance testing method provided in the embodiment of the present application, and therefore, the implementation of the TTS system performance testing apparatus 400 provided in the embodiment of the present application may refer to the implementation of the TTS system performance testing method provided in the embodiment of the present application, and repeated details are not repeated.
Having introduced the TTS system performance testing method and apparatus provided by the embodiment of the present application, the TTS system performance testing device provided by the embodiment of the present application is briefly introduced below.
Referring to fig. 5, a TTS system performance testing device 500 provided by the embodiment of the present application at least includes: a processor 501, a memory 502, and a computer program stored in the memory 502 and executable on the processor 501, wherein the processor 501 implements the TTS system performance testing method provided by the embodiment of the present application when executing the computer program.
The TTS system performance testing device 500 provided by the embodiment of the present application may further include a bus 503 connecting different components (including the processor 501 and the memory 502). Bus 503 represents one or more of any of several types of bus structures, including a memory bus, a peripheral bus, a local bus, and the like.
The Memory 502 may include readable media in the form of volatile Memory, such as Random Access Memory (RAM) 5021 and/or cache Memory 5022, and may further include Read Only Memory (ROM) 5023.
The memory 502 may also include a program tool 5025 having a set (at least one) of program modules 5024, the program modules 5024 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The TTS system performance testing device 500 may also communicate with one or more external devices 504 (e.g., keyboard, remote control, etc.), with one or more devices that enable a user to interact with the TTS system performance testing device 500 (e.g., cell phone, computer, etc.), and/or with any device that enables the TTS system performance testing device 500 to communicate with one or more other TTS system performance testing devices 500 (e.g., router, modem, etc.). Such communication may be through an Input/Output (I/O) interface 505. Further, the TTS system performance testing device 500 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via a network adapter 506. As shown in fig. 5, the network adapter 506 communicates with the other modules of the TTS system performance testing device 500 via the bus 503. It should be understood that, although not shown in fig. 5, other hardware and/or software modules may be used in conjunction with the TTS system performance testing device 500, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, Redundant Array of Independent Disks (RAID) subsystems, tape drives, and data backup storage subsystems, to name a few.
It should be noted that the TTS system performance testing device 500 shown in fig. 5 is only an example, and should not bring any limitation to the functions and the application scope of the embodiments of the present application.
The following describes a computer-readable storage medium provided by embodiments of the present application. The computer-readable storage medium provided by the embodiment of the application stores computer instructions, and the computer instructions, when executed by a processor, implement the method for testing the performance of the TTS system provided by the embodiment of the application. Specifically, the executable program may be built in or installed in the TTS system performance testing device 500, so that the TTS system performance testing device 500 may implement the TTS system performance testing method provided by the embodiment of the present application by executing the built-in or installed executable program.
In addition, the method for testing the performance of the TTS system provided in the embodiment of the present application may also be implemented as a program product, where the program product includes program codes, and when the program product runs on the TTS system performance testing device 500, the program codes are used to enable the TTS system performance testing device 500 to execute the method for testing the performance of the TTS system provided in the embodiment of the present application.
The program product provided by the embodiments of the present application may be any combination of one or more readable media, where the readable media may be a readable signal medium or a readable storage medium, and the readable storage medium may be, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof, and in particular, more specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a RAM, a ROM, an Erasable Programmable Read-Only Memory (EPROM), an optical fiber, a portable Compact disk Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product provided by the embodiments of the present application may take the form of a CD-ROM that contains program code and may run on a computing device. However, the program product is not limited thereto. In the embodiments of the present application, the readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to embodiments of the present application, the features and functions of two or more units described above may be embodied in a single unit. Conversely, the features and functions of one unit described above may be further divided among, and embodied by, a plurality of units.
Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step for execution, and/or one step may be broken down into multiple steps for execution.
While the preferred embodiments of the present application have been described, those skilled in the art may make additional variations and modifications to those embodiments once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiments and all alterations and modifications that fall within the scope of the present application.
It will be apparent to those skilled in the art that various changes and modifications may be made to the embodiments of the present application without departing from their spirit and scope. Thus, if such modifications and variations fall within the scope of the claims of the present application and their equivalents, the present application is also intended to encompass them.