Voice conversion method, device, terminal and storage medium
1. A voice conversion method is applied to a sending terminal and comprises the following steps:
acquiring voice information, and acquiring text information, sound loudness and emotional characteristics corresponding to the voice information;
and sending the voice information, the text information, the sound loudness and the emotional characteristics to a receiving terminal, wherein the receiving terminal is used for displaying the voice information and displaying the text information, the sound loudness and the emotional characteristics when receiving a conversion instruction for the voice information.
2. The method of claim 1, wherein the obtaining text information, sound loudness and emotional characteristics corresponding to the voice information comprises:
sending the voice information to a server, wherein the server is used for recognizing text information, sound loudness and emotional characteristics corresponding to the voice information by adopting a voice recognition algorithm;
and receiving the text information, the sound loudness and the emotional characteristic sent by the server.
3. The method of claim 2, wherein the sending the voice information to a server, the server being configured to recognize text information, sound loudness and emotional characteristics corresponding to the voice information using a voice recognition algorithm, comprises:
and sending the voice information to the server, wherein the server is used for recognizing the text information, the sound loudness and the emotional characteristics corresponding to the voice information by adopting a voice recognition algorithm when a voice information identifier is determined to allow recognition.
4. The method of claim 1, wherein the obtaining text information, sound loudness and emotional characteristics corresponding to the voice information comprises:
and locally recognizing the text information, the sound loudness and the emotional characteristic corresponding to the voice information by adopting a voice recognition algorithm.
5. The method of claim 4, wherein the locally recognizing the text information, the sound loudness and the emotional characteristics corresponding to the voice information using a voice recognition algorithm comprises:
and when the voice information identifier is determined to allow recognition, locally recognizing the text information, the sound loudness and the emotional characteristics corresponding to the voice information by adopting a voice recognition algorithm.
6. The method of claim 1, further comprising, after obtaining the text information, the sound loudness and the emotional characteristics corresponding to the voice information:
and displaying the voice information, and displaying the text information, the sound loudness and the emotional characteristic when a conversion instruction for the voice information is received.
7. The method of claim 6, wherein the displaying the text information, the sound loudness and the emotional characteristics upon receiving a conversion instruction for the voice information comprises:
and when a conversion instruction aiming at the voice information is received, displaying the text information, displaying the sound loudness on the bottom layer of the text information, and displaying the emotional characteristics around the text information.
8. A voice conversion method is applied to a receiving terminal and comprises the following steps:
receiving voice information sent by a sending terminal and text information, sound loudness and emotional characteristics corresponding to the voice information;
and displaying the voice information, and displaying the text information, the sound loudness and the emotional characteristic when a conversion instruction for the voice information is received.
9. The method of claim 8, wherein the displaying the text information, the sound loudness and the emotional characteristics upon receiving a conversion instruction for the voice information comprises:
and when a conversion instruction aiming at the voice information is received, displaying the text information, displaying the sound loudness on the bottom layer of the text information, and displaying the emotional characteristics around the text information.
10. A terminal comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method of any one of claims 1 to 7 or 8 to 9.
11. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of any one of claims 1 to 7 or 8 to 9.
Background
With the rapid development of terminal technology, terminals support more and more functions that continuously enrich users' lives. For example, a user may use the terminal to listen to music, watch videos, receive voice information, and the like.
When the user receives voice information but it is inconvenient to play it, the voice-to-text function of the terminal can convert the received voice information into text information. The user can then read the corresponding text information, but cannot feel the emotion of the speaker who sent the voice information, so part of the information is lost.
Disclosure of Invention
The embodiment of the application provides a voice conversion method, a voice conversion device, a terminal and a storage medium, which can accurately reflect emotional characteristics in voice information.
In a first aspect, an embodiment of the present application provides a voice conversion method, applied to a sending terminal, including:
acquiring voice information, and acquiring text information, sound loudness and emotional characteristics corresponding to the voice information;
and sending the voice information, the text information, the sound loudness and the emotional characteristics to a receiving terminal, wherein the receiving terminal is used for displaying the voice information and displaying the text information, the sound loudness and the emotional characteristics when receiving a conversion instruction for the voice information.
In a second aspect, an embodiment of the present application provides a voice conversion method, applied to a receiving terminal, including:
receiving voice information sent by a sending terminal and text information, sound loudness and emotional characteristics corresponding to the voice information;
and displaying the voice information, and displaying the text information, the sound loudness and the emotional characteristic when a conversion instruction for the voice information is received.
In a third aspect, an embodiment of the present application provides a terminal, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the method of any one of the above first aspects when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements any one of the methods described above.
In a fifth aspect, embodiments of the present application provide a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, where the computer program is operable to cause a computer to perform some or all of the steps as described in the first aspect of embodiments of the present application. The computer program product may be a software installation package.
The embodiment of the application provides a voice conversion method. When a sending terminal obtains voice information from a user, the sending terminal obtains text information, sound loudness and emotional characteristics corresponding to the voice information and sends them to a receiving terminal, so that the receiving terminal can display the text information, the sound loudness and the emotional characteristics corresponding to the voice information when receiving a conversion instruction for the voice information. Because the sending terminal sends the corresponding text information, sound loudness and emotional characteristics in addition to the voice information, a receiver who cannot conveniently listen to the voice information can still read the text information while sensing the sender's sound loudness and emotional characteristics, which improves the user experience.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic diagram of an application scenario of a voice conversion method or a voice conversion apparatus according to an embodiment of the present application;
Fig. 2 is an exemplary diagram of a terminal interface according to an embodiment of the present application;
Fig. 3 is a flowchart of a voice conversion method according to an embodiment of the present application;
Fig. 4 is an exemplary diagram of a terminal interface according to an embodiment of the present application;
Fig. 5 is a schematic diagram of an application scenario of a voice conversion method according to an embodiment of the present application;
Fig. 6 is a flowchart of a voice conversion method according to an embodiment of the present application;
Fig. 7 is a schematic diagram of an application scenario of a voice conversion method according to an embodiment of the present application;
Fig. 8 is an exemplary diagram of a terminal interface according to an embodiment of the present application;
Fig. 9 is a schematic diagram of an application scenario of a voice conversion method according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a voice conversion apparatus according to an embodiment of the present application;
Fig. 11 is a flowchart of a voice conversion method according to an embodiment of the present application;
Fig. 12 is an interaction diagram of a voice conversion method according to an embodiment of the present application;
Fig. 13 is an exemplary diagram of a terminal interface according to an embodiment of the present application;
Fig. 14 is an exemplary diagram of a terminal interface according to an embodiment of the present application;
Fig. 15 is a schematic structural diagram of a voice conversion apparatus according to an embodiment of the present application;
Fig. 16 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
With the rapid development of terminal technology, terminals support more and more functions that continuously enrich users' lives. For example, when a user communicates with friends and relatives, the user can make a video call with them using the terminal and intuitively see the emotional characteristics of the other party while the other party is speaking.
Fig. 1 is a schematic diagram of an application scenario of a voice conversion method or a voice conversion apparatus according to an embodiment of the present application. As shown in fig. 1, the sending terminal may receive voice information input by a sender, and upon receiving a sending instruction for the voice information, the sending terminal may send the voice information to the receiving terminal. The receiving terminal may play the voice information when a playing instruction for the voice information is received. The receiver hears the voice message sent by the sender and, based on it, can use the receiving terminal to send a reply back to the sending terminal. When it is inconvenient for the receiver to play the voice message, the receiver may send a voice conversion instruction to the receiving terminal. When the receiving terminal receives the voice conversion instruction from the receiver, the receiving terminal can convert the received voice information into text information.
According to some embodiments, for example, user A may input voice information on the chat interface between terminal A and user B; the voice information may be, for example, "When do you get off work?". When terminal A receives the voice information input by user A, terminal A can send the voice information to terminal B, where user B is located. When terminal B receives the voice message, terminal B can display a User Interface (UI) icon corresponding to the voice message. When terminal B receives a click instruction from user B on the UI icon, terminal B can play the received voice information, and user B can thus obtain the voice information. After obtaining the voice message, user B may input a reply voice message on the chat interface between terminal B and user A, which may be, for example, "Six o'clock". When it is inconvenient for user B to play the voice information, user B can send a voice-to-text instruction to terminal B. When terminal B receives the voice-to-text instruction, terminal B may convert the received voice information into text information and display it on the display interface of terminal B, which may be as shown in fig. 2. User B sees the text information displayed on the display interface of terminal B and can reply in time according to that text information.
It is easy to understand that when the receiving terminal converts the received voice information into text information, it can only obtain the text information corresponding to the voice information and cannot obtain the corresponding emotional characteristics. As a result, the text information obtained by the receiving terminal cannot accurately reflect the emotional characteristics contained in the voice information sent by the sender, and the user experience is poor.
The voice conversion method provided by the embodiments of the present application will be described in detail below with reference to figs. 3 to 9. The execution body of the embodiments shown in figs. 3 to 9 is the sending terminal.
Referring to fig. 3, a flow chart of a voice conversion method is provided in the present embodiment. As shown in fig. 3, the method of the embodiment of the present application may include the following steps S101 to S102.
S101, acquiring voice information, and acquiring text information, sound loudness and emotional characteristics corresponding to the voice information.
According to some embodiments, the voice information may be voice information input by the sender and received by the sending terminal, or it may be, for example, voice information retrieved by the sending terminal from a memory. The sender can select a sending object, i.e. the receiver, on the display screen of the sending terminal. After finishing selecting the sending object, the sender can press the voice input button on the display screen of the sending terminal. While the sender keeps pressing the voice input button, the sender can input voice information, and the sending terminal starts acquiring it. When the sender finishes speaking, the sending terminal has acquired the voice information input by the sender. The voice information may be, for example, "I want to take a vacation starting from tomorrow".
It is easy to understand that the sound loudness may refer to the decibel value of the sound when the sender inputs the voice information. The display form of the sound loudness includes but is not limited to a numeric form, a curve form, and the like. The emotional characteristics refer to the emotion of the sender when inputting the voice information. The emotional characteristics include, but are not limited to, happiness, anger, sadness, joy, surprise, fear, thinking and the like, wherein each emotional characteristic may cover at least one emotion. For example, thinking includes but is not limited to emotions such as contemplation.
According to some embodiments, when the sending terminal acquires the voice information, the sending terminal may locally recognize the text information, sound loudness and emotional characteristics corresponding to the voice information by adopting a voice recognition algorithm. Before acquiring the emotional characteristics corresponding to the voice information, the sending terminal can train the voice recognition algorithm on a large amount of voice information. The voice recognition algorithm includes, but is not limited to, a BP neural network, a PAC-based neural network, and the like. When the sending terminal locally recognizes the text information, the sound loudness and the emotional characteristics corresponding to the voice information by adopting the voice recognition algorithm, the sending terminal can obtain the text information, the sound loudness and the emotional characteristics at the same time; it does not need to first obtain the text information and then derive the emotional characteristics from that text.
It is easy to understand that, when the sending terminal acquires the voice information, the sending terminal may acquire the text information, sound loudness and emotional characteristics corresponding to it. For example, the voice information acquired by the sending terminal may be "I want to take a vacation starting from tomorrow". The sending terminal can locally recognize the corresponding text information, sound loudness and emotional characteristics by adopting a BP neural network. The text information acquired by the sending terminal may be, for example, "I want to take a vacation starting from tomorrow". The sound loudness corresponding to the voice information may be the decibel value corresponding to each character in the voice information. For example, the sound loudness acquired by the sending terminal may be decibel values in numeric form, and the decibel values corresponding to the individual characters may be, for example, 20 decibels, 21 decibels, 22 decibels, 21 decibels and 20 decibels. When the sending terminal obtains the decibel value corresponding to each character, the sending terminal can calculate a decibel average, a decibel median or the like for the voice information. In the embodiment of the application, the sending terminal calculates the decibel median corresponding to the voice information, which may be, for example, 21 decibels. The emotional characteristic corresponding to the voice information acquired by the sending terminal may be, for example, happiness.
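To make the per-character loudness calculation above concrete, the following is a minimal sketch, assuming 16-bit PCM samples that have already been segmented per character; the segmentation and the example values are hypothetical, and the patent's actual recognition algorithm is not reproduced here.

```python
import math
from statistics import median

def segment_dbfs(samples):
    """RMS level of one character's audio segment, in dB relative to full scale (16-bit PCM)."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(max(rms, 1e-9) / 32768.0)

def loudness_summary(character_segments):
    """Per-character decibel values plus the median used as the overall sound loudness."""
    per_char = [round(segment_dbfs(seg), 1) for seg in character_segments]
    return {"per_character_db": per_char, "median_db": median(per_char)}

# Fabricated segments for five characters of a short utterance
segments = [[1200, -1100, 900], [2400, -2300, 2100], [800, -750, 700],
            [1500, -1400, 1300], [1000, -950, 900]]
print(loudness_summary(segments))
```

Absolute decibel values such as the 21 decibels in the example would require a calibrated microphone; a dBFS value like the one above only reflects relative loudness.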
And S102, sending the voice information, the text information, the sound loudness and the emotional characteristics to a receiving terminal, wherein the receiving terminal is used for displaying the voice information and displaying the text information, the sound loudness and the emotional characteristics when receiving a conversion instruction aiming at the voice information.
According to some embodiments, when the sending terminal acquires the text information, the sound loudness and the emotion characteristics corresponding to the voice information, the sending terminal may directly send the acquired voice information, the text information, the sound loudness and the emotion characteristics to the receiving terminal. When the receiving terminal receives the voice information, the text information, the sound loudness and the emotional characteristics sent by the sending terminal, the receiving terminal can display the voice information and display the text information, the sound loudness and the emotional characteristics when receiving a conversion instruction for the voice information.
It is easy to understand that, when the sending terminal determines that the voice information "I want to take a vacation starting from tomorrow" corresponds to the text information "I want to take a vacation starting from tomorrow", a sound loudness of 21 decibels and a happy emotional characteristic, the sending terminal may send the voice information, the text information, the sound loudness of 21 decibels and the happy emotional characteristic to the receiving terminal. When the receiving terminal receives them, the receiving terminal can display a UI icon corresponding to the voice information. The receiver can click the UI icon and click the voice conversion control. When the receiving terminal then receives the conversion instruction for the voice information from the receiver, the receiving terminal can display the text information "I want to take a vacation starting from tomorrow", the sound loudness of 21 decibels and the happy emotional characteristic; the display interface of the receiving terminal at this point may be as shown in fig. 4.
According to some embodiments, fig. 5 shows an application scenario diagram of a voice conversion method according to an embodiment of the present application. As shown in fig. 5, when the sending terminal acquires the text information, the sound loudness and the emotional characteristics corresponding to the voice information, the sending terminal may send the acquired voice information, text information, sound loudness and emotional characteristics to the receiving terminal through a server, where the server is only used for forwarding the voice information, text information, sound loudness and emotional characteristics. For example, the sending terminal may send to the server the voice information "I want to take a vacation starting from tomorrow", the text information "I want to take a vacation starting from tomorrow", a sound loudness of 21 decibels and a happy emotional characteristic. When the server receives them, the server can send them on to the receiving terminal. When the receiving terminal receives the voice information, the text information, the sound loudness and the emotional characteristics sent by the server, the receiving terminal can display the text information, the sound loudness and the emotional characteristics corresponding to the voice information when receiving the receiver's conversion instruction for the voice information.
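A hedged sketch of the kind of message the sending terminal could pass to the receiving terminal, directly or via the pass-through server, is shown below; the JSON envelope and field names such as loudness_db are illustrative assumptions, not part of the embodiment.

```python
import base64
import json

def build_conversion_payload(voice_bytes, text, loudness_db, emotion, recipient_id):
    """Bundle the voice information with its text, loudness and emotion for one recipient."""
    return json.dumps({
        "recipient": recipient_id,
        "voice": base64.b64encode(voice_bytes).decode("ascii"),  # original voice information
        "text": text,                # recognized text information
        "loudness_db": loudness_db,  # e.g. the median decibel value
        "emotion": emotion,          # e.g. "happy", "sad", "excited"
    })

payload = build_conversion_payload(b"\x00\x01", "I want to take a vacation starting from tomorrow",
                                   21, "happy", "user_b")
print(payload)
```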
The embodiment of the application provides a voice conversion method. When the sending terminal obtains voice information from a user, it obtains the text information, sound loudness and emotional characteristics corresponding to the voice information and sends them to the receiving terminal, so that the receiving terminal can display them when receiving a conversion instruction for the voice information. Because the sending terminal sends the corresponding text information, sound loudness and emotional characteristics in addition to the voice information, a receiver who cannot conveniently listen to the voice information can still read the text information while sensing the sender's sound loudness and emotional characteristics, which improves the user experience.
Referring to fig. 6, a flow chart of a voice conversion method is provided in the present embodiment. As shown in fig. 6, the method of the embodiment of the present application may include the following steps S201 to S204.
S201, acquiring voice information, and sending the voice information to a server, wherein the server is used for recognizing text information, sound loudness and emotional characteristics corresponding to the voice information by adopting a voice recognition algorithm.
According to some embodiments, the voice information acquired by the sending terminal may be, for example, voice information read by the sending terminal from its memory. When the sending terminal acquires the voice information, the sending terminal may send the voice information to the server, which here is a voice conversion server. The server is used for recognizing the text information, the sound loudness and the emotional characteristics corresponding to the voice information by adopting a voice recognition algorithm when a voice information identifier is determined to allow recognition. When the voice conversion server obtains the text information, the sound loudness and the emotional characteristics corresponding to the voice information, the voice conversion server can send them back to the sending terminal.
It is easy to understand that fig. 7 shows an application scenario diagram of the voice conversion method according to an embodiment of the present application. As shown in fig. 7, the voice information acquired by the sending terminal may be, for example, voice information read from the sending terminal's memory, such as "There is a meteor shower tonight". When the sending terminal acquires this voice information, the sending terminal can send it to the voice conversion server. When the voice conversion server receives the voice information, it may obtain the voice information identifier corresponding to it. When the voice conversion server determines that the identifier allows recognition, the voice conversion server may use a voice recognition algorithm to obtain the text information corresponding to the voice information, for example "There is a meteor shower tonight"; the sound loudness may be, for example, a sound loudness curve, and the emotional characteristic may be, for example, excitement. When the voice conversion server obtains the text information, sound loudness and emotional characteristics corresponding to the voice information "There is a meteor shower tonight", it can send them to the sending terminal, and the sending terminal can receive them.
According to some embodiments, when the sending terminal acquires the voice information, the sending terminal may also locally recognize the text information, the sound loudness and the emotional characteristics corresponding to the voice information by adopting a voice recognition algorithm. In this case, the sending terminal can directly obtain the voice information identifier corresponding to the voice information, and when that identifier indicates that recognition is allowed, the sending terminal can perform the local recognition.
It is easy to understand that when the sending terminal acquires the voice information "There is a meteor shower tonight", the sending terminal can also locally recognize the corresponding text information, sound loudness and emotional characteristics by adopting a voice recognition algorithm. The sending terminal can directly obtain the voice information identifier corresponding to the voice information, and when the identifier indicates that recognition is allowed, the sending terminal may locally recognize the text information, for example "There is a meteor shower tonight"; the sound loudness may be, for example, a sound loudness curve, and the emotional characteristic may be, for example, excitement.
Optionally, when the voice conversion server sends the text information, the sound loudness and the emotion feature corresponding to the voice information to the sending terminal, and the sending terminal receives the text information, the sound loudness and the emotion feature corresponding to the voice information, the sending terminal may also locally recognize the text information, the sound loudness and the emotion feature corresponding to the voice information by using a voice recognition algorithm, so as to improve the accuracy of obtaining the text information, the sound loudness and the emotion feature corresponding to the voice information by the sending terminal.
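The "recognition allowed" check on the voice information identifier could look roughly like the sketch below, whether it runs on the voice conversion server or locally on the sending terminal; the flag name allow_recognition and the stand-in recognizer are assumptions for illustration only.

```python
def recognize_if_allowed(voice_info, recognizer):
    """Run recognition only when the voice information identifier permits it."""
    if not voice_info.get("allow_recognition", False):
        return None  # identifier forbids recognition; nothing is extracted
    return recognizer(voice_info["audio"])

def dummy_recognizer(audio):
    # Stand-in for the voice recognition algorithm (e.g. a BP neural network).
    return {"text": "There is a meteor shower tonight",
            "loudness": "curve", "emotion": "excited"}

print(recognize_if_allowed({"allow_recognition": True, "audio": b""}, dummy_recognizer))
print(recognize_if_allowed({"allow_recognition": False, "audio": b""}, dummy_recognizer))
```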
S202, receiving the text information, the sound loudness and the emotional characteristics sent by the server.
According to some embodiments, when the sending terminal sends the voice information to the server, the server may obtain, based on the received voice information, text information, loudness of sound, and emotional characteristics corresponding to the voice information by using a voice recognition algorithm. When the server acquires the text information, the sound loudness and the emotion characteristics corresponding to the voice information, the server can send the text information, the sound loudness and the emotion characteristics corresponding to the voice information to the sending terminal.
It is easy to understand that, when the sending terminal sends the voice information "There is a meteor shower tonight" to the server, the server may use a voice recognition algorithm on the received voice information to obtain the corresponding text information, sound loudness and emotional characteristics. The text information obtained by the server may be "There is a meteor shower tonight", the sound loudness may be, for example, a sound loudness curve, and the emotional characteristic may be, for example, excitement. The server can send the text information "There is a meteor shower tonight", the sound loudness curve and the excited emotional characteristic to the sending terminal, and the sending terminal can receive them.
S203, displaying the voice information, and displaying the text information, the sound loudness and the emotional characteristic when receiving a conversion instruction aiming at the voice information.
According to some embodiments, when the transmitting terminal receives the voice information, the transmitting terminal may display the voice information. After the sending terminal acquires the text information, the sound loudness and the emotion characteristics corresponding to the voice information, the sending terminal can display the text information, the sound loudness and the emotion characteristics when the sending terminal receives a conversion instruction for the voice information. For example, the sending terminal may locally recognize text information, sound loudness, and emotion characteristics corresponding to the voice information by using a voice recognition algorithm, and the sending terminal may also receive text information, sound loudness, and emotion characteristics corresponding to the voice information sent by the server. When the transmitting terminal receives the conversion instruction aiming at the voice information, the transmitting terminal can display the text information, display the sound loudness on the bottom layer of the text information and display the emotional characteristics around the text information.
It is easy to understand that, for example, the voice information acquired by the sending terminal may be "There is a meteor shower tonight". The sending terminal can receive the text information, the sound loudness and the emotional characteristics corresponding to the voice information from the server. For example, based on the received voice information, the server may use the voice recognition algorithm to obtain the text information "There is a meteor shower tonight", a sound loudness curve and an excited emotional characteristic, and send them to the sending terminal. When the sending terminal receives a conversion instruction for the voice information, the sending terminal can display the text information, display the sound loudness curve on the bottom layer of the text information, and display the excited emotional characteristic around the text information. At this point, the display interface of the sending terminal may be as shown in fig. 8.
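A rough console-level sketch of the layered display described above is given below: the text information on top, the sound loudness rendered as a bottom layer, and the emotional characteristic shown around the text. The character-based rendering only illustrates the layering idea and is not an actual terminal UI implementation.

```python
def render_converted_message(text, loudness_curve, emotion):
    """Compose the converted message: emotion around the text, loudness as a bottom layer."""
    emotion_mark = {"happy": ":-)", "excited": "*!*", "sad": ":-("}.get(emotion, "")
    # Map each loudness sample (0-100) to one of four bar characters for the bottom layer.
    bottom_layer = "".join("_-=#"[min(3, int(v) // 25)] for v in loudness_curve)
    return f"{emotion_mark} {text} {emotion_mark}\n{bottom_layer}"

print(render_converted_message("There is a meteor shower tonight",
                               [20, 35, 60, 80, 55, 30], "excited"))
```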
And S204, sending the voice information, the text information, the sound loudness and the emotional characteristics to a receiving terminal, wherein the receiving terminal is used for displaying the voice information and displaying the text information, the sound loudness and the emotional characteristics when receiving a conversion instruction aiming at the voice information.
According to some embodiments, when the sending terminal acquires the text information, the sound loudness and the emotion feature corresponding to the voice information sent by the server, the sending terminal may directly send the acquired voice information, the text information, the sound loudness and the emotion feature to the receiving terminal. When the receiving terminal receives the voice information, the receiving terminal may display the voice information and display text information, sound loudness, and emotional characteristics when a conversion instruction for the voice information is received.
It is easy to understand that, as shown in fig. 7, when the sending terminal receives the text information "There is a meteor shower tonight", the sound loudness curve and the excited emotional characteristic from the voice conversion server, the sending terminal may send the voice information together with the text information, the sound loudness curve and the excited emotional characteristic to the receiving terminal. When the receiving terminal receives the voice information, the receiving terminal can display the voice information, and can display the text information "There is a meteor shower tonight", the sound loudness curve and the excited emotional characteristic when receiving a conversion instruction for the voice information.
According to some embodiments, fig. 9 shows an application scenario diagram of a voice conversion method according to an embodiment of the present application. As shown in fig. 9, when the sending terminal receives the text information, the sound loudness and the emotional characteristics corresponding to the voice information from the voice conversion server, the sending terminal may send the obtained voice information, text information, sound loudness and emotional characteristics to the receiving terminal through a forwarding server. The forwarding server is only used for sending the received voice information, text information, sound loudness and emotional characteristics on to the receiving terminal; it does not perform any voice conversion processing on the received voice information. When the receiving terminal receives the voice information, the receiving terminal may display the voice information and display the text information, sound loudness and emotional characteristics when a conversion instruction for the voice information is received.
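The forwarding server's role can be sketched as a pure relay, assuming a simple in-memory queue as the transport; nothing in it inspects or converts the voice information.

```python
import queue

class ForwardingServer:
    """Relays messages unchanged from the sending terminal to the receiving terminal."""

    def __init__(self):
        self.outboxes = {}  # recipient id -> queue of pending messages

    def forward(self, message):
        """Put the message, untouched, into the recipient's outbox."""
        self.outboxes.setdefault(message["recipient"], queue.Queue()).put(message)

    def fetch(self, recipient_id):
        """Return the next pending message for the recipient, or None."""
        box = self.outboxes.get(recipient_id)
        return box.get_nowait() if box and not box.empty() else None

server = ForwardingServer()
server.forward({"recipient": "user_b", "voice": b"...",
                "text": "There is a meteor shower tonight",
                "loudness_db": 21, "emotion": "excited"})
print(server.fetch("user_b")["text"])
```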
The embodiment of the application provides a voice conversion method in which the sending terminal sends the text information, sound loudness and emotional characteristics obtained from the server to the receiving terminal, so that the receiving terminal can display them when receiving a conversion instruction for the voice information. The server can send the obtained voice information, text information, sound loudness and emotional characteristics to the sending terminal, and the sending terminal can then send them to the receiving terminal, so that the receiver can read the text information while also sensing the sender's sound loudness and emotional characteristics.
The following describes a voice conversion apparatus provided in an embodiment of the present application in detail with reference to fig. 10. It should be noted that the voice conversion apparatus shown in fig. 10 is used for executing the methods of the embodiments shown in figs. 3 to 9 of the present application; for convenience of description, only the portions related to the embodiments of the present application are shown, and for technical details that are not disclosed here, please refer to the embodiments shown in figs. 3 to 9 of the present application.
Please refer to fig. 10, which shows a schematic structural diagram of a voice conversion apparatus according to an embodiment of the present application. The voice conversion apparatus 1000 may be implemented by software, hardware or a combination of both as all or a part of a user terminal. According to some embodiments, the speech conversion apparatus 1000 includes an information obtaining unit 1001 and an information sending unit 1002, and is specifically configured to:
an information obtaining unit 1001, configured to obtain voice information, and obtain text information, sound loudness and emotional characteristics corresponding to the voice information;
the information sending unit 1002 is configured to send the voice information, the text information, the sound loudness, and the emotional characteristic to the receiving terminal, where the receiving terminal is configured to display the voice information and display the text information, the sound loudness, and the emotional characteristic when a conversion instruction for the voice information is received.
According to some embodiments, the information obtaining unit 1001, when obtaining text information, sound loudness, and emotional characteristics corresponding to the voice information, is specifically configured to:
sending the voice information to a server, wherein the server is used for recognizing text information, sound loudness and emotional characteristics corresponding to the voice information by adopting a voice recognition algorithm;
and receiving the text information, the sound loudness and the emotional characteristics sent by the server.
According to some embodiments, the information obtaining unit 1001, when sending the voice information to a server that is configured to recognize the text information, the sound loudness and the emotional characteristics corresponding to the voice information by adopting a voice recognition algorithm, is specifically configured to:
send the voice information to a server, wherein the server is used for recognizing the text information, the sound loudness and the emotional characteristics corresponding to the voice information by adopting a voice recognition algorithm when the voice information identifier is determined to allow recognition.
According to some embodiments, the information obtaining unit 1001, when obtaining text information, sound loudness, and emotional characteristics corresponding to the voice information, is specifically configured to:
and locally recognizing the text information, the sound loudness and the emotional characteristics corresponding to the voice information by adopting a voice recognition algorithm.
According to some embodiments, the information obtaining unit 1001, when locally recognizing the text information, the sound loudness and the emotional characteristics corresponding to the voice information by adopting a voice recognition algorithm, is specifically configured to:
locally recognize, when the voice information identifier is determined to allow recognition, the text information, the sound loudness and the emotional characteristics corresponding to the voice information by adopting a voice recognition algorithm.
According to some embodiments, the voice conversion apparatus 1000 further includes an information display unit 1003, configured to display the voice information after acquiring the text information, the sound loudness, and the emotion characteristic corresponding to the voice information, and display the text information, the sound loudness, and the emotion characteristic when receiving a conversion instruction for the voice information.
According to some embodiments, the information display unit 1003 is configured to, when receiving a conversion instruction for the voice information, display text information, loudness of sound, and emotional characteristics, specifically:
when a conversion instruction aiming at the voice information is received, the text information is displayed, the sound loudness is displayed on the bottom layer of the text information, and the emotional characteristics are displayed around the text information.
The embodiment of the application provides a voice conversion device, which obtains voice information through the information obtaining unit 1001, and obtains text information, sound loudness and emotion characteristics corresponding to the voice information, so that the information sending unit 1002 can send the voice information, the text information, the sound loudness and the emotion characteristics to a receiving terminal, and the receiving terminal is used for displaying the voice information and displaying the text information, the sound loudness and the emotion characteristics when receiving a conversion instruction for the voice information. The voice conversion device can enable the receiver to timely and definitely know the emotional state of the sender, and can improve the communication efficiency between the receiver and the sender, so that the use experience of a user can be improved.
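A minimal software sketch of how the information obtaining unit 1001 and the information sending unit 1002 could be composed is given below; the recognizer and transport callables are placeholder assumptions rather than the apparatus's actual implementation.

```python
class InformationObtainingUnit:
    """Obtains text information, sound loudness and emotional characteristics for voice information."""

    def __init__(self, recognizer):
        self.recognizer = recognizer  # e.g. a local model or a client of the voice conversion server

    def obtain(self, voice_info):
        return self.recognizer(voice_info)

class InformationSendingUnit:
    """Sends the voice information and its recognition result to the receiving terminal."""

    def __init__(self, transport):
        self.transport = transport

    def send(self, voice_info, recognition, recipient):
        self.transport({"recipient": recipient, "voice": voice_info, **recognition})

obtain_unit = InformationObtainingUnit(
    lambda v: {"text": "I want to take a vacation starting from tomorrow",
               "loudness_db": 21, "emotion": "happy"})
send_unit = InformationSendingUnit(print)  # stand-in transport that just prints the payload
send_unit.send(b"<pcm>", obtain_unit.obtain(b"<pcm>"), "user_b")
```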
The voice conversion method provided by the embodiments of the present application will be described in detail below with reference to figs. 11 to 13. The execution body of the embodiments shown in figs. 11 to 13 is the receiving terminal.
Please refer to fig. 11, which is a flowchart illustrating a voice conversion method according to an embodiment of the present application. As shown in fig. 11, the method of the embodiment of the present application may include the following steps S301 to S302.
S301, receiving the voice information sent by the sending terminal, and the text information, the sound loudness and the emotional characteristics corresponding to the voice information.
According to some embodiments, when the sending terminal acquires the text information, the sound loudness and the emotion characteristics corresponding to the voice information, the sending terminal may directly send the voice information, the text information, the sound loudness and the emotion characteristics to the receiving terminal. When the receiving terminal detects the voice information, the text information, the sound loudness and the emotional characteristics sent by the sending terminal, the receiving terminal can receive the voice information, the text information, the sound loudness and the emotional characteristics.
It is easy to understand that the voice information acquired by the sending terminal may be, for example, "I was going to go shopping today, but it suddenly started raining". When the sending terminal locally recognizes the text information, the sound loudness and the emotional characteristics corresponding to the voice information by adopting a voice recognition algorithm, the sending terminal can directly send the voice information, the text information, the sound loudness and the emotional characteristics to the receiving terminal. The text information obtained by the sending terminal may be, for example, "I was going to go shopping today, but it suddenly started raining", the sound loudness may be, for example, a sound loudness curve, and the emotional characteristic may be, for example, sadness. When the receiving terminal detects the voice information, the text information, the sound loudness and the emotional characteristics sent by the sending terminal, the receiving terminal can receive them.
According to some embodiments, when the sending terminal acquires the text information, the sound loudness and the emotion characteristics corresponding to the voice information, the sending terminal may directly send the voice information, the text information, the sound loudness and the emotion characteristics to the server. The server can be used for converting the voice information into text information, sound loudness and emotion characteristics corresponding to the voice information based on the received voice information, and can also be used for directly sending the received voice information, the received text information, the received sound loudness and the received emotion characteristics to the receiving terminal. When the receiving terminal detects the voice information, the text information, the sound loudness and the emotional characteristics sent by the server, the receiving terminal can receive the voice information, the text information, the sound loudness and the emotional characteristics.
It is easy to understand that fig. 12 shows an interaction diagram of a voice conversion method according to an embodiment of the present application. As shown in fig. 12, the voice information acquired by the sending terminal may be, for example, "I was going to go shopping today, but it suddenly started raining". The sending terminal can send the voice information to the voice conversion server. When the voice conversion server obtains the text information, the sound loudness and the emotional characteristics corresponding to the voice information by adopting a voice recognition algorithm, the voice conversion server can send the voice information, the text information, the sound loudness and the emotional characteristics back to the sending terminal. The text information obtained may be, for example, "I was going to go shopping today, but it suddenly started raining", the sound loudness may be, for example, a sound loudness curve, and the emotional characteristic may be, for example, sadness. The sending terminal can then send the voice information, text information, sound loudness and emotional characteristics to the receiving terminal through the forwarding server. When the receiving terminal detects them, the receiving terminal can receive the voice information, the text information, the sound loudness and the emotional characteristics.
S302, displaying the voice information, and displaying the text information, the sound loudness and the emotional characteristics when receiving a conversion instruction aiming at the voice information.
According to some embodiments, when the sending terminal acquires text information, sound loudness and emotional characteristics corresponding to the voice information, the sending terminal may send the voice information, the text information, the sound loudness and the emotional characteristics to the receiving terminal. The sending terminal can directly send the voice information, the text information, the sound loudness and the emotion characteristics to the receiving terminal, and the sending terminal can also send the voice information, the text information, the sound loudness and the emotion characteristics to the receiving terminal through the server. When the receiving terminal receives the voice information, the text information, the sound loudness and the emotional characteristics, the receiving terminal can display the voice information and display the text information, the sound loudness and the emotional characteristics when receiving a conversion instruction for the voice information. The receiving terminal can display the text information, display the sound loudness on the bottom layer of the text information and display the emotional characteristics around the text information.
It is easy to understand that, for example, the voice information acquired by the sending terminal may be "I was going to go shopping today, but it suddenly started raining". When the sending terminal locally recognizes the corresponding text information "I was going to go shopping today, but it suddenly started raining", a sound loudness curve and a sad emotional characteristic by adopting a voice recognition algorithm, the sending terminal can directly send the voice information, the text information, the sound loudness and the emotional characteristic to the receiving terminal. When the receiving terminal receives the voice information, the receiving terminal may display the voice information; at this point, the display interface of the receiving terminal may be as shown in fig. 13.
Optionally, for example, when the receiving terminal receives the voice information, the receiving terminal may display a UI icon corresponding to the voice information on its display screen. The receiver may long-press the UI icon, and when the receiving terminal detects the long press, it may display a selection box. The receiver may then click the voice-to-text control; at this point, the display interface of the receiving terminal may be as shown in fig. 14. When the receiving terminal detects that the receiver has clicked the voice-to-text control, the receiving terminal receives the conversion instruction for the voice information. On receiving the conversion instruction, the receiving terminal can display the text information, display the sound loudness curve on the bottom layer of the text information, and display the sad emotional characteristic around the text information.
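The receiving terminal's behavior in this section can be sketched as follows: only the voice bubble is shown until a conversion instruction arrives, after which the text, loudness and emotion are revealed together. The event names and the console rendering are illustrative assumptions.

```python
class ReceivingTerminal:
    """Holds received payloads and reveals text, loudness and emotion only on conversion."""

    def __init__(self):
        self.messages = {}

    def on_message(self, msg_id, payload):
        self.messages[msg_id] = payload
        print(f"[voice message {msg_id}]")  # initially only the voice UI icon is shown

    def on_convert_instruction(self, msg_id):
        p = self.messages[msg_id]
        # Text on top, loudness on the bottom layer, emotion around the text.
        print(f"({p['emotion']}) {p['text']} ({p['emotion']})  | loudness: {p['loudness_db']} dB")

t = ReceivingTerminal()
t.on_message(1, {"text": "I was going to go shopping today, but it suddenly started raining",
                 "loudness_db": 18, "emotion": "sad"})
t.on_convert_instruction(1)
```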
It is easy to understand that when the receiving terminal receives the conversion instruction for the voice information, the receiving terminal can also display the text information with the sound loudness shown above the text information and the emotional characteristic shown after the text information.
The embodiment of the application provides a voice conversion method. By receiving the voice information sent by the sending terminal together with the corresponding text information, sound loudness and emotional characteristics, the receiving terminal can display the voice information and, when receiving a conversion instruction for the voice information, display the text information, the sound loudness and the emotional characteristics. The receiver therefore not only knows the text information corresponding to the voice information sent by the sender, but also learns the sound loudness with which the sender spoke and the sender's current emotional characteristics, which improves the receiver's experience.
The following describes a voice conversion apparatus provided in an embodiment of the present application in detail with reference to fig. 15. It should be noted that the voice conversion apparatus shown in fig. 15 is used for executing the methods of the embodiments shown in figs. 11 to 14 of the present application; for convenience of description, only the portions related to the embodiments of the present application are shown, and for technical details that are not disclosed here, please refer to the embodiments shown in figs. 11 to 14 of the present application.
Please refer to fig. 15, which shows a schematic structural diagram of a voice conversion apparatus according to an embodiment of the present application. The speech conversion apparatus 1500 may be implemented by software, hardware or a combination of both as all or a part of a user terminal. According to some embodiments, the speech conversion apparatus 1500 includes an information receiving unit 1501 and an information display unit 1502, and is specifically configured to:
the information receiving unit 1501 is configured to receive voice information sent by the sending terminal, and text information, sound loudness, and emotional characteristics corresponding to the voice information;
an information display unit 1502, configured to display the voice information, and to display the text information, the sound loudness and the emotional characteristics when a conversion instruction for the voice information is received.
According to some embodiments, when displaying the text information, the sound loudness and the emotional characteristic upon receiving a conversion instruction for the voice information, the information display unit 1502 is specifically configured to:
display the text information, display the sound loudness on the bottom layer of the text information, and display the emotional characteristic around the text information when the conversion instruction for the voice information is received.
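Read as software, the two units above can be pictured as follows. This is only an illustrative assumption of how apparatus 1500 might be decomposed; the Payload type and the method names are invented for this sketch rather than defined by the application.

```kotlin
// Hedged sketch of apparatus 1500 as two cooperating units (fig. 15).
// Payload, receive, showVoice and showConverted are illustrative names.

data class Payload(
    val audioId: String,
    val text: String,
    val loudness: List<Double>,
    val emotion: String
)

// Corresponds to the information receiving unit 1501.
class InformationReceivingUnit {
    fun receive(raw: Payload): Payload = raw   // a real unit would parse the incoming message
}

// Corresponds to the information display unit 1502.
class InformationDisplayUnit {
    fun showVoice(p: Payload) = println("voice bubble for ${p.audioId}")

    fun showConverted(p: Payload) {
        println("text: ${p.text}")
        println("loudness (bottom layer): ${p.loudness}")
        println("emotion (around the text): ${p.emotion}")
    }
}

class SpeechConversionApparatus(
    private val receiver: InformationReceivingUnit = InformationReceivingUnit(),
    private val display: InformationDisplayUnit = InformationDisplayUnit()
) {
    private var current: Payload? = null

    // Receiving only shows the voice information itself.
    fun onIncoming(raw: Payload) {
        current = receiver.receive(raw)
        current?.let(display::showVoice)
    }

    // Text, loudness and emotion are shown only after the conversion instruction arrives.
    fun onConversionInstruction() {
        current?.let(display::showConverted)
    }
}

fun main() {
    val apparatus = SpeechConversionApparatus()
    apparatus.onIncoming(Payload("msg-2", "Running late, sorry!", listOf(0.3, 0.6, 0.5), "anxious"))
    apparatus.onConversionInstruction()
}
```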
In the voice conversion apparatus according to this embodiment of the present application, the information receiving unit 1501 receives the voice information sent by the sending terminal, together with the text information, the sound loudness and the emotional characteristic corresponding to the voice information, and the information display unit 1502 can display the voice information and, when receiving a conversion instruction for the voice information, display the text information, the sound loudness and the emotional characteristic. Compared with a solution that displays only the text information corresponding to the voice information, the technical solution of this embodiment not only lets the receiver clearly know the text information corresponding to the voice information sent by the sender, but also lets the receiver obtain the sound loudness with which the sender sent the voice information and the sender's current emotional characteristic, thereby reducing the loss of the sender's information and further improving the receiver's use experience.
Please refer to fig. 16, which is a schematic structural diagram of a terminal according to an embodiment of the present application. As shown in fig. 16, the terminal 1600 may include: at least one processor 1601, at least one network interface 1604, a user interface 1603, memory 1605, at least one communication bus 1602.
The communication bus 1602 is used to enable connection and communication among these components.
The user interface 1603 may include a display screen (Display) and a GPS module; optionally, the user interface 1603 may also include a standard wired interface and a wireless interface.
The network interface 1604 may optionally include a standard wired interface, a wireless interface (e.g., a WI-FI interface), and the like.
The processor 1601 may include one or more processing cores. The processor 1601 connects various parts of the entire terminal 1600 using various interfaces and lines, and performs the various functions of the terminal 1600 and processes its data by running or executing the instructions, programs, code sets or instruction sets stored in the memory 1605 and by invoking the data stored in the memory 1605. Optionally, the processor 1601 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA) and Programmable Logic Array (PLA). The processor 1601 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is responsible for rendering and drawing the content to be displayed on the display screen; and the modem is used to handle wireless communication. It can be understood that the modem may also not be integrated into the processor 1601 and may instead be implemented by a separate chip.
The memory 1605 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 1605 includes a non-transitory computer-readable storage medium. The memory 1605 may be used to store instructions, programs, code, code sets or instruction sets. The memory 1605 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function or an image playing function), instructions for implementing the above method embodiments, and the like; the data storage area may store the data involved in the above method embodiments. Optionally, the memory 1605 may also be at least one storage device located remotely from the aforementioned processor 1601. As shown in fig. 16, the memory 1605, as a computer storage medium, may include an operating system, a network communication module, a user interface module and a voice conversion application program.
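Purely as an illustrative reading of this structure, the following Kotlin sketch models the memory's program and data areas and the processor invoking an application stored there; the map-based representation and the run() call are assumptions made for this example, not part of the terminal design.

```kotlin
// Rough structural sketch of the memory and processor of terminal 1600 (fig. 16).

class Memory {
    // Program storage area: operating system, network communication module,
    // user interface module, voice conversion application program, etc.
    val programArea = mutableMapOf<String, () -> Unit>()

    // Data storage area: data referred to in the method embodiments.
    val dataArea = mutableMapOf<String, Any>()
}

class Processor(private val memory: Memory) {
    // The processor invokes an application program stored in memory by name.
    fun run(appName: String) {
        memory.programArea[appName]?.invoke()
            ?: println("application '$appName' not found in the program area")
    }
}

fun main() {
    val memory = Memory()
    memory.programArea["voice-conversion"] = {
        println("acquire voice information and its text/loudness/emotion, then send them")
    }
    Processor(memory).run("voice-conversion")
}
```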
In the terminal 1600 shown in fig. 16, the user interface 1603 is mainly used to provide an input interface for the user and to obtain the data input by the user, and the processor 1601 may be used to invoke the voice conversion application program stored in the memory 1605 and specifically perform the following operations:
acquiring voice information, and acquiring text information, sound loudness and emotional characteristics corresponding to the voice information;
and sending the voice information, the text information, the sound loudness and the emotional characteristics to a receiving terminal, wherein the receiving terminal is used for displaying the voice information and displaying the text information, the sound loudness and the emotional characteristics when receiving a conversion instruction for the voice information.
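A compact sketch of these two sending-terminal operations is given below, assuming hypothetical Recognizer and Transport interfaces; the application does not define these names or signatures.

```kotlin
// Sketch of the sending-terminal operations above. Features, Recognizer, Transport
// and sendVoiceMessage are hypothetical names used only for illustration.

data class Features(val text: String, val loudness: Double, val emotion: String)

// Produces text information, sound loudness and an emotional characteristic for an utterance
// (either by asking a server or by local recognition; see the variants that follow).
fun interface Recognizer {
    fun recognize(audio: ByteArray): Features
}

// Delivers the voice information together with its features to the receiving terminal.
fun interface Transport {
    fun send(audio: ByteArray, features: Features)
}

fun sendVoiceMessage(audio: ByteArray, recognizer: Recognizer, transport: Transport) {
    val features = recognizer.recognize(audio)   // step 1: acquire text + loudness + emotion
    transport.send(audio, features)              // step 2: send everything to the receiving terminal
}

fun main() {
    val fakeAudio = ByteArray(16) { it.toByte() }
    val recognizer = Recognizer { Features("I'm on my way", loudness = 0.8, emotion = "excited") }
    val transport = Transport { _, f -> println("sent voice + $f") }
    sendVoiceMessage(fakeAudio, recognizer, transport)
}
```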
According to some embodiments, when obtaining the text information, the sound loudness and the emotional characteristic corresponding to the voice information, the processor 1601 is specifically configured to perform the following steps:
sending the voice information to a server, wherein the server is configured to recognize the text information, the sound loudness and the emotional characteristic corresponding to the voice information by using a voice recognition algorithm;
and receiving the text information, the sound loudness and the emotional characteristic sent by the server.
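As one possible shape for this exchange, the sketch below uploads the voice information over HTTP and parses the returned text, loudness and emotion. The endpoint URL and the pipe-separated response format are assumptions made for illustration only; the application does not specify a transport or wire format.

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Sketch of the server-side recognition variant. The endpoint and the
// "text|loudness|emotion" response format are illustrative assumptions.

data class ServerFeatures(val text: String, val loudness: Double, val emotion: String)

fun recognizeOnServer(
    audio: ByteArray,
    endpoint: String = "https://example.com/recognize"   // hypothetical server URL
): ServerFeatures {
    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder(URI.create(endpoint))
        .header("Content-Type", "application/octet-stream")
        .POST(HttpRequest.BodyPublishers.ofByteArray(audio))   // upload the voice information
        .build()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    // A real client would parse a structured (e.g. JSON) body; this sketch assumes
    // a single "text|loudness|emotion" line for brevity.
    val (text, loudness, emotion) = response.body().trim().split("|")
    return ServerFeatures(text, loudness.toDouble(), emotion)
}
```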
According to some embodiments, when sending the voice information to the server, where the server is configured to recognize the text information, the sound loudness and the emotional characteristic corresponding to the voice information by using a voice recognition algorithm, the processor 1601 is specifically configured to perform the following steps:
and sending the voice information to the server, wherein the server is configured to recognize the text information, the sound loudness and the emotional characteristic corresponding to the voice information by using the voice recognition algorithm when it is determined, according to the voice information identifier, that recognition is allowed.
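The permission gate can be pictured as follows; the recognitionAllowed flag is a hypothetical stand-in for the voice information identifier mentioned above, and the function names are invented for this sketch.

```kotlin
// Sketch of the recognition gate: recognition only runs when the identifier carried
// with the voice information allows it. Field and function names are illustrative.

data class VoiceInfo(val audio: ByteArray, val recognitionAllowed: Boolean)

fun maybeRecognize(info: VoiceInfo, recognize: (ByteArray) -> String): String? =
    if (info.recognitionAllowed) recognize(info.audio)   // identifier permits recognition
    else null                                            // otherwise skip recognition entirely

fun main() {
    val allowed = VoiceInfo(ByteArray(8), recognitionAllowed = true)
    val denied = VoiceInfo(ByteArray(8), recognitionAllowed = false)
    val dummyRecognizer: (ByteArray) -> String = { "hello" }
    println(maybeRecognize(allowed, dummyRecognizer))   // prints "hello"
    println(maybeRecognize(denied, dummyRecognizer))    // prints null
}
```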
According to some embodiments, when obtaining the text information, the sound loudness and the emotional characteristic corresponding to the voice information, the processor 1601 is specifically configured to perform the following steps:
and locally recognizing the text information, the sound loudness and the emotional characteristics corresponding to the voice information by adopting a voice recognition algorithm.
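A complete local recognizer for text and emotion is beyond a short sketch, but the loudness part can be illustrated: the snippet below computes a simple RMS loudness in dBFS over 16-bit PCM samples on the terminal itself. The formula is ordinary signal-processing practice, not something specified by this application.

```kotlin
import kotlin.math.log10
import kotlin.math.sin
import kotlin.math.sqrt

// Illustrative local loudness estimate: RMS of 16-bit PCM samples, expressed in dBFS.
fun loudnessDb(samples: ShortArray): Double {
    if (samples.isEmpty()) return Double.NEGATIVE_INFINITY
    var sumOfSquares = 0.0
    for (s in samples) {
        val normalized = s.toDouble() / Short.MAX_VALUE   // scale to [-1, 1]
        sumOfSquares += normalized * normalized
    }
    val rms = sqrt(sumOfSquares / samples.size)
    return 20 * log10(rms.coerceAtLeast(1e-9))            // avoid log10(0) for silence
}

fun main() {
    // Two synthetic sine-wave "utterances": one quiet, one loud.
    val quiet = ShortArray(1000) { (200 * sin(it / 10.0)).toInt().toShort() }
    val loud = ShortArray(1000) { (20000 * sin(it / 10.0)).toInt().toShort() }
    println("quiet: %.1f dBFS, loud: %.1f dBFS".format(loudnessDb(quiet), loudnessDb(loud)))
}
```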
According to some embodiments, when locally recognizing the text information, the sound loudness and the emotional characteristic corresponding to the voice information by using a voice recognition algorithm, the processor 1601 is specifically configured to perform the following steps:
and when it is determined, according to the voice information identifier, that recognition is allowed, locally recognizing the text information, the sound loudness and the emotional characteristic corresponding to the voice information by using the voice recognition algorithm.
According to some embodiments, the processor 1601 is configured to, after obtaining text information, sound loudness, and emotional characteristics corresponding to the voice information, further perform the following steps:
and displaying the voice information, and displaying the text information, the sound loudness and the emotional characteristic when a conversion instruction for the voice information is received.
According to some embodiments, when displaying the text information, the sound loudness and the emotional characteristic upon receiving a conversion instruction for the voice information, the processor 1601 is specifically configured to perform the following steps:
and when a conversion instruction aiming at the voice information is received, displaying the text information, displaying the sound loudness on the bottom layer of the text information, and displaying the emotional characteristics around the text information.
According to some embodiments, the processor 1601 is further configured to perform the following steps:
receiving the voice information sent by the sending terminal, and the text information, the sound loudness and the emotional characteristic corresponding to the voice information;
and displaying the voice information, and displaying the text information, the sound loudness and the emotional characteristic when a conversion instruction for the voice information is received.
According to some embodiments, when displaying the text information, the sound loudness and the emotional characteristic upon receiving the conversion instruction for the voice information, the processor 1601 is specifically configured to perform the following steps:
and when a conversion instruction aiming at the voice information is received, displaying the text information, displaying the sound loudness on the bottom layer of the text information, and displaying the emotional characteristics around the text information.
An embodiment of the present application provides a terminal. By sending the acquired voice information, together with the text information, the sound loudness and the emotional characteristic corresponding to the voice information, to the receiving terminal, the receiving terminal can display the voice information and display the text information, the sound loudness and the emotional characteristic when receiving a conversion instruction for the voice information. Compared with a solution that displays only the text information corresponding to the voice information, the terminal provided in this embodiment not only lets the receiver clearly know the text information corresponding to the voice information sent by the sender, but also lets the receiver obtain the sound loudness with which the sender sent the voice information and the sender's current emotional characteristic, thereby broadening the application scope of the terminal.
The present application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the above method. The computer-readable storage medium may include, but is not limited to, any type of disk, including floppy disks, optical disks, DVDs, CD-ROMs and micro-drives, as well as magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of medium or device suitable for storing instructions and/or data.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any one of the speech conversion methods as set forth in the above method embodiments.
It is clear to a person skilled in the art that the solutions of the present application may be implemented by means of software and/or hardware. The terms "unit" and "module" in this specification refer to software and/or hardware that can perform a specific function independently or in cooperation with other components, where the hardware may be, for example, a Field-Programmable Gate Array (FPGA), an Integrated Circuit (IC), or the like.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of the units is only a division of logical functions, and there may be other divisions in actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some service interfaces, devices or units, and may be in electrical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk or an optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program, which is stored in a computer-readable memory, and the memory may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The above description is only an exemplary embodiment of the present disclosure, and the scope of the present disclosure should not be limited thereby. That is, all equivalent changes and modifications made in accordance with the teachings of the present disclosure are intended to be included within the scope of the present disclosure. Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.