Speech recognition processing method and apparatus, electronic device, and storage medium
1. A speech recognition processing method, characterized in that the method comprises:
performing voice recognition on voice in a multimedia file to obtain a sentence comprising a plurality of words, wherein the words are used as an initial recognition result;
displaying the sentence including the plurality of words according to a display manner corresponding to the recognition certainty degree of each of the words;
displaying a candidate word of at least one of the words;
and in response to a selection operation, replacing the word in the sentence that is at the same position as the selected candidate word with the selected candidate word.
2. The method of claim 1, wherein said displaying a candidate word of at least one of the words comprises:
in response to a focusing operation on at least one of the words in the sentence, displaying the corresponding at least one candidate word at a position adjacent to the at least one word; or,
automatically displaying a candidate word of at least one of the words in the sentence.
3. The method according to claim 1, wherein said displaying the sentence including the plurality of words in accordance with a display manner corresponding to the recognition certainty degree of each of the words comprises:
displaying the words whose recognition certainty degree is lower than a degree threshold in a display manner different from that of the deterministic words in the sentence;
wherein a deterministic word is a word whose recognition certainty degree is not lower than the degree threshold.
4. The method according to claim 1, wherein said displaying the sentence including the plurality of words in accordance with a display manner corresponding to the recognition certainty degree of each of the words comprises:
determining the degree of significance of the display manner corresponding to each word according to the recognition certainty degree of each word, and displaying the sentence according to the degree of significance of the display manner corresponding to each word;
wherein there is a negative correlation between the degree of significance and the degree of recognition certainty.
5. The method of claim 1, wherein after replacing the word in the sentence that is at the same position as the selected candidate word with the selected candidate word, the method further comprises:
automatically replacing at least one word in the sentence that is the same as the replaced word with the selected candidate word, and
updating the display manner of the selected candidate word to be consistent with that of the deterministic words;
wherein the deterministic word is a word for which the recognition certainty degree is not lower than a degree threshold.
6. The method of claim 1, wherein after replacing the word in the sentence that is at the same position as the selected candidate word with the selected candidate word, the method further comprises:
presenting global replacement prompt information, wherein the global replacement prompt information is used for prompting that a global replacement be carried out on the basis of the selected candidate word;
and in response to a confirmation operation on the global replacement prompt information, replacing, in at least one sentence recognized from the multimedia file, all words that are the same as the word replaced by the selected candidate word with the selected candidate word.
7. The method of claim 1, wherein when a plurality of sentences in one-to-one correspondence with a plurality of voices in the multimedia file are recognized, prior to the displaying the sentence comprising the plurality of words, the method further comprises:
determining the sorting mode adopted when the plurality of sentences are displayed according to at least one of the following modes:
sequencing the sentences in an ascending order according to the recognition certainty degrees of the sentences, wherein the recognition certainty degree of the sentences is the sum of the recognition certainty degrees of all words in the sentences;
sequencing the sentences according to the sequence of the voice corresponding to the sentences appearing in the multimedia file;
sorting the plurality of sentences in descending order according to the frequency of occurrence of the included uncertain words in the plurality of sentences, wherein the uncertain words are words of which the recognition certainty degree is lower than a degree threshold.
8. The method of claim 7, wherein after replacing the word in the sentence that is at the same position as the selected candidate word with the selected candidate word, the method further comprises:
updating the displayed ordering of the plurality of sentences according to at least one of the following ways: updating the ascending ordering of the plurality of sentences by recognition certainty degree; updating the descending ordering of the plurality of sentences by frequency of occurrence of the uncertain words they include.
9. The method of claim 1, wherein after replacing the word in the sentence that is at the same position as the selected candidate word with the selected candidate word, the method further comprises:
updating the display manner of the selected candidate word to be consistent with that of the deterministic words;
wherein the deterministic word is a word for which the recognition certainty degree is not lower than a degree threshold.
10. The method of claim 1, wherein when the word has a plurality of candidate words, said displaying a candidate word of at least one of the words comprises:
determining a sorting manner of the plurality of candidate words according to at least one of: sorting the candidate words in descending order of the recognition certainty degree of the candidate words; sorting the candidate words in descending order of the number of times the candidate words have been selected;
and displaying the candidate words according to the sorting manner.
11. The method of claim 1, wherein prior to said displaying the sentence comprising the plurality of words, the method further comprises:
determining a decoding path with the lowest path score as an optimal decoding path from a plurality of decoding paths obtained by performing voice recognition on voice in the multimedia file;
determining a plurality of words included in the optimal decoding path as the initial recognition result.
12. The method of claim 11, wherein prior to said displaying a candidate word of at least one of the words, the method further comprises:
sorting the plurality of decoding paths in ascending order of path score, and selecting, from the sorted result, the top-ranked decoding paths other than the optimal decoding path;
determining a word in a selected decoding path that is at the same position as at least one of the words as the candidate word.
13. A speech recognition processing apparatus, comprising:
a recognition module configured to perform speech recognition on speech in a multimedia file to obtain a sentence comprising a plurality of words, wherein the words are used as an initial recognition result;
a display module configured to display the sentence including the plurality of words according to a display manner corresponding to the recognition certainty degree of each of the words;
the display module being further configured to display a candidate word of at least one of the words;
and a replacing module configured to, in response to a selection operation, replace the word in the sentence that is at the same position as the selected candidate word with the selected candidate word.
14. An electronic device, comprising:
a memory for storing computer executable instructions;
a processor for implementing the speech recognition processing method of any one of claims 1 to 12 when executing computer executable instructions stored in the memory.
15. A computer-readable storage medium having stored thereon computer-executable instructions for implementing the speech recognition processing method of any one of claims 1 to 12 when executed.
Background
Artificial Intelligence (AI) is the theory, methods, techniques, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. Speech technology is an important direction in artificial intelligence; it mainly studies theories and methods for achieving efficient communication between people and computers using natural language.
Automatic Speech Recognition (ASR) is a branch of speech technology that converts human speech into text. In the related art, however, ASR cannot fully recognize phenomena such as continuous pronunciation, so recognition errors are likely to occur. When the initial recognition result for the speech is wrong, the user usually has to correct it by manually typing the correct text. Obtaining a recognition result that meets the user's needs is therefore slow and inefficient, and the related art offers no effective solution.
Disclosure of Invention
The embodiment of the application provides a voice recognition processing method and device, electronic equipment and a computer readable storage medium, which can improve the efficiency of obtaining a recognition result meeting the requirements of a user in the voice recognition process.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a voice recognition processing method, which comprises the following steps:
performing voice recognition on voice in a multimedia file to obtain a sentence comprising a plurality of words, wherein the words are used as an initial recognition result;
displaying the sentence including the plurality of words according to a display manner corresponding to the recognition certainty degree of each of the words;
displaying a candidate word of at least one of the words;
and in response to a selection operation, replacing the word in the sentence that is at the same position as the selected candidate word with the selected candidate word.
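The replacement step can be illustrated with a minimal sketch. It assumes the sentence is held as a list of words and that the position of the selected candidate is known; the function name `replace_at_position` and the example sentence are hypothetical and do not come from the embodiments.

```python
def replace_at_position(sentence_words, position, candidate):
    """Return a copy of the sentence with the word at `position` replaced.

    Hypothetical helper: the sentence is assumed to be a list of words, and
    `position` is the index shared by the original word and the candidate.
    """
    if not 0 <= position < len(sentence_words):
        raise IndexError("position is outside the sentence")
    updated = list(sentence_words)  # leave the initial recognition result intact
    updated[position] = candidate
    return updated


initial_result = ["the", "whether", "is", "nice"]
print(replace_at_position(initial_result, 1, "weather"))
# prints ['the', 'weather', 'is', 'nice']
```

Returning a copy rather than editing in place is a design choice of this sketch; it keeps the initial recognition result available, for example for an undo operation.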
In the above scheme, the performing speech recognition on the speech in the multimedia file to obtain a sentence including a plurality of words includes:
performing voice recognition on voice in the multimedia file to obtain a plurality of decoding paths and a path score corresponding to each decoding path;
wherein each of the decoding paths includes a plurality of words.
In the above scheme, the performing speech recognition on the speech in the multimedia file to obtain a plurality of decoding paths and a path score corresponding to each decoding path includes:
framing the voice in the multimedia file to obtain a plurality of audio subdata;
performing acoustic feature extraction on each audio subdata to obtain a plurality of audio features;
performing word graph generation processing on the plurality of audio features to obtain a word graph network;
and performing path generation processing based on the word graph network to obtain a plurality of decoding paths and path scores corresponding to each decoding path.
In the foregoing solution, the performing path generation processing based on the word graph network to obtain a plurality of decoding paths and a path score corresponding to each decoding path includes:
determining a plurality of decoding paths from a starting node to a terminating node in the word graph network;
performing the following for each of the decoding paths:
and summing the scores of a plurality of nodes and the scores of a plurality of connecting lines included in the decoding path to obtain the path score of the decoding path.
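The path generation above can be sketched as follows, assuming the word graph network is stored as an adjacency mapping with one score per node and one score per connecting line. The data layout, the function name `enumerate_paths`, and the integer scores are illustrative assumptions, not values from the embodiments.

```python
def enumerate_paths(edges, node_scores, start, end):
    """Yield (path, path_score) for every path from `start` to `end`.

    edges:       dict mapping a node to a list of (next_node, line_score)
    node_scores: dict mapping a node to its score
    The graph is assumed to be a directed acyclic graph (a word lattice).
    """
    stack = [(start, [start], node_scores[start])]
    while stack:
        node, path, score = stack.pop()
        if node == end:
            yield path, score
            continue
        for next_node, line_score in edges.get(node, []):
            # Path score = sum of node scores plus sum of connecting-line scores.
            stack.append(
                (next_node, path + [next_node], score + line_score + node_scores[next_node])
            )


# Hypothetical two-word lattice with integer scores.
node_scores = {"<s>": 0, "recognize": 10, "wreck": 25, "speech": 5, "beach": 15, "</s>": 0}
edges = {
    "<s>": [("recognize", 2), ("wreck", 4)],
    "recognize": [("speech", 1), ("beach", 9)],
    "wreck": [("speech", 8), ("beach", 3)],
    "speech": [("</s>", 0)],
    "beach": [("</s>", 0)],
}

paths = sorted(enumerate_paths(edges, node_scores, "<s>", "</s>"), key=lambda item: item[1])
for words, score in paths:
    print(score, " ".join(words))
```

Exhaustive enumeration is only practical for small graphs; it is used here purely to make the summation explicit.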
In the foregoing solution, the performing a word graph generation process on the multiple audio features to obtain a word graph network includes:
converting each of the audio features into a corresponding phoneme to obtain a plurality of phonemes;
combining the plurality of phonemes to obtain a plurality of phoneme strings;
identifying each phoneme string through an acoustic model to obtain a plurality of words corresponding to each phoneme string and a score of each word;
taking the obtained words as nodes in the word graph network, and determining the scores of the words as the scores of the corresponding nodes;
connecting nodes corresponding to adjacent phoneme strings, and identifying words corresponding to two nodes of each connecting line through a language model to obtain a score of each connecting line;
and the score of the connecting line represents the probability that the words corresponding to the two nodes of the connecting line are connected together to form a complete sentence.
In the above solution, before the taking the obtained words as nodes in the word graph network, the method further includes:
performing the following processing for a plurality of words corresponding to each phoneme string:
and sorting the words in ascending order of score, and filtering out the words ranked last in the sorted result.
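The filtering step can be sketched as below. How many words are kept is not specified above, so the `keep` parameter, like the function name `prune_words`, is an assumption made for illustration.

```python
def prune_words(scored_words, keep):
    """Sort (word, score) pairs in ascending order of score and keep the first `keep`.

    The words ranked last in the ascending order are filtered out, so they
    never become nodes in the word graph network.
    """
    ranked = sorted(scored_words, key=lambda pair: pair[1])
    return ranked[:keep]


hypotheses = [("beach", 15), ("speech", 5), ("peach", 30)]
print(prune_words(hypotheses, keep=2))
# prints [('speech', 5), ('beach', 15)]
```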
In the above scheme, the converting each of the audio features into a corresponding phoneme includes:
performing the following for each of the audio features:
matching the audio features with each basic phoneme in a basic phoneme library to determine similarity between the audio features and each basic phoneme;
and determining the base phoneme with the highest similarity as the phoneme corresponding to the audio features.
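A minimal sketch of the matching step follows. The similarity measure is not named above; cosine similarity over fixed-length feature vectors is assumed here purely for illustration, and the two-entry phoneme library is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_phoneme(feature, phoneme_library):
    """Return the base phoneme whose template is most similar to `feature`.

    phoneme_library: dict mapping a phoneme label to a template vector
    (an assumed representation of the basic phoneme library).
    """
    return max(
        phoneme_library,
        key=lambda label: cosine_similarity(feature, phoneme_library[label]),
    )


library = {"AA": [1.0, 0.0], "IY": [0.0, 1.0]}
print(match_phoneme([0.9, 0.1], library))
# prints AA
```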
In the above scheme, before the displaying the sentence including the plurality of words, the method further includes:
performing the following processing for a plurality of words corresponding to each phoneme string:
sorting the words in ascending order of score, and selecting the two top-ranked words in the sorted result;
determining the score difference between the two words;
determining the recognition certainty degree of the word corresponding to the phoneme string in the initial recognition result according to the score difference;
wherein the score difference is positively correlated with the degree of recognition certainty.
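The relationship between score difference and recognition certainty degree can be sketched as follows. Only a positive correlation is required above; the specific mapping `diff / (diff + scale)`, the `scale` parameter, and the treatment of a single hypothesis are assumptions of this sketch.

```python
def recognition_certainty(scores, scale=1.0):
    """Map the gap between the two best scores to a certainty in [0, 1).

    scores: scores of all words hypothesized for one phoneme string
    scale:  assumed positive tuning constant; larger values demand a larger
            gap before the word counts as certain
    """
    ranked = sorted(scores)  # ascending: the best-ranked word comes first
    if len(ranked) < 2:
        return 1.0  # assumption: a lone hypothesis has no competitor
    diff = ranked[1] - ranked[0]
    return diff / (diff + scale)  # monotonically increasing in diff


print(recognition_certainty([5, 15, 30], scale=10))
# prints 0.5
```

A small gap means the runner-up was nearly as plausible as the chosen word, which is why such a word is a natural candidate for highlighted display.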
An embodiment of the present application provides a speech recognition processing apparatus, including:
a recognition module configured to perform speech recognition on speech in a multimedia file to obtain a sentence comprising a plurality of words, wherein the words are used as an initial recognition result;
a display module configured to display the sentence including the plurality of words according to a display manner corresponding to the recognition certainty degree of each of the words;
the display module being further configured to display a candidate word of at least one of the words;
and a replacing module configured to, in response to a selection operation, replace the word in the sentence that is at the same position as the selected candidate word with the selected candidate word.
In the above solution, the display module is further configured to display, in response to a focusing operation on at least one of the words in the sentence, a corresponding at least one candidate word in a neighboring position of the at least one word; or, automatically displaying a candidate word for at least one of the words in the sentence.
In the above scheme, the display module is further configured to display the words whose recognition certainty degree is lower than a degree threshold in a display manner different from that of the deterministic words in the sentence; wherein a deterministic word is a word whose recognition certainty degree is not lower than the degree threshold.
In the above scheme, the display module is further configured to determine the degree of significance of the display manner corresponding to each word according to the recognition certainty degree of each word, and display the sentence according to the degree of significance of the display manner corresponding to each word; wherein there is a negative correlation between the degree of significance and the recognition certainty degree.
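The two display strategies can be sketched in plain text as follows. The linear mapping from certainty to significance, the bracket notation for highlighted words, and the default threshold of 0.5 are illustrative assumptions; an actual interface might vary color, weight, or underlining instead.

```python
def display_significance(certainty):
    """Return a significance in [0, 1] that falls as certainty rises."""
    clamped = min(max(certainty, 0.0), 1.0)
    return 1.0 - clamped  # negative correlation with certainty


def render_sentence(words, certainties, threshold=0.5):
    """Mark words below the threshold with brackets; leave deterministic words plain."""
    rendered = []
    for word, certainty in zip(words, certainties):
        rendered.append(f"[{word}]" if certainty < threshold else word)
    return " ".join(rendered)


print(render_sentence(["the", "whether", "is", "nice"], [0.9, 0.2, 0.95, 0.8]))
# prints the [whether] is nice
```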
In the above scheme, the replacing module is further configured to automatically replace at least one word in the sentence that is the same as the replaced word with the selected candidate word, and update the display manner of the selected candidate word to be consistent with that of the deterministic words; wherein a deterministic word is a word whose recognition certainty degree is not lower than a degree threshold.
In the above scheme, the replacing module is further configured to present global replacement prompt information, where the global replacement prompt information is used to prompt that a global replacement be performed based on the selected candidate word; and, in response to a confirmation operation on the global replacement prompt information, replace, in at least one sentence recognized from the multimedia file, all words that are the same as the word replaced by the selected candidate word with the selected candidate word.
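Global replacement after confirmation can be sketched as below, assuming each recognized sentence is a list of words and the user's answer to the prompt arrives as a boolean. The function name `replace_globally` and the example sentences are hypothetical.

```python
def replace_globally(sentences, replaced_word, candidate, confirmed):
    """Replace every occurrence of `replaced_word` across all sentences.

    sentences: list of sentences, each a list of words
    confirmed: True if the user confirmed the global replacement prompt;
               when False, the sentences are returned unchanged (as copies)
    """
    if not confirmed:
        return [list(sentence) for sentence in sentences]
    return [
        [candidate if word == replaced_word else word for word in sentence]
        for sentence in sentences
    ]


recognized = [["whether", "report"], ["nice", "whether", "today"]]
print(replace_globally(recognized, "whether", "weather", confirmed=True))
# prints [['weather', 'report'], ['nice', 'weather', 'today']]
```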
In the foregoing solution, when a plurality of sentences in one-to-one correspondence with a plurality of voices in the multimedia file are recognized, the speech recognition processing apparatus further includes: a sorting module, configured to determine a sorting manner adopted when displaying the plurality of sentences according to at least one of the following manners: sorting the sentences in ascending order of the recognition certainty degree of the sentences, wherein the recognition certainty degree of a sentence is the sum of the recognition certainty degrees of all words in the sentence; sorting the sentences in the order in which the corresponding voices appear in the multimedia file; sorting the plurality of sentences in descending order of the frequency of occurrence of the uncertain words they include, wherein an uncertain word is a word whose recognition certainty degree is lower than a degree threshold.
In the foregoing solution, the sorting module is further configured to update the displayed ordering of the plurality of sentences according to at least one of the following manners: updating the ascending ordering of the plurality of sentences by recognition certainty degree; updating the descending ordering of the plurality of sentences by frequency of occurrence of the uncertain words they include.
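The three sorting manners can be sketched as follows. The sentence representation (a dictionary with a start time and per-word certainties), the mode names, and the reading of "frequency of occurrence" as the number of uncertain words in each sentence are assumptions of this sketch. Updating the ordering after a replacement amounts to calling the same function again on the updated certainties.

```python
def sort_sentences(sentences, mode, threshold=0.5):
    """Return the sentences in the order in which they should be displayed.

    sentences: list of dicts with keys 'start' (seconds into the file) and
               'certainties' (one recognition certainty per word)
    mode:      'certainty', 'time', or 'uncertain_count' (hypothetical names)
    """
    if mode == "certainty":
        # Ascending sum of word certainties: least certain sentences first.
        return sorted(sentences, key=lambda s: sum(s["certainties"]))
    if mode == "time":
        # Order in which the corresponding speech appears in the file.
        return sorted(sentences, key=lambda s: s["start"])
    if mode == "uncertain_count":
        # Descending number of words below the degree threshold.
        return sorted(
            sentences,
            key=lambda s: sum(c < threshold for c in s["certainties"]),
            reverse=True,
        )
    raise ValueError(f"unknown sorting mode: {mode}")


a = {"id": "a", "start": 0.0, "certainties": [0.9, 0.9, 0.9]}
b = {"id": "b", "start": 4.0, "certainties": [0.2, 0.3]}
c = {"id": "c", "start": 2.0, "certainties": [0.4, 0.9]}
print([s["id"] for s in sort_sentences([a, b, c], "certainty")])
# prints ['b', 'c', 'a']
```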
In the above scheme, the display module is further configured to update the display manner of the selected candidate word to be consistent with that of the deterministic words; wherein a deterministic word is a word whose recognition certainty degree is not lower than a degree threshold.
In the above solution, when the word has a plurality of candidate words, the display module is further configured to determine a sorting manner of the plurality of candidate words according to at least one of the following manners: sorting the candidate words in descending order of the recognition certainty degree of the candidate words; sorting the candidate words in descending order of the number of times the candidate words have been selected; and displaying the candidate words according to the sorting manner.
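Candidate ordering can be sketched as below, assuming each candidate carries a recognition certainty degree and a count of how often it has been selected. The field names `certainty` and `times_selected` are hypothetical.

```python
def sort_candidates(candidates, by="certainty"):
    """Return candidates in descending order of the chosen criterion.

    candidates: list of dicts with keys 'word', 'certainty', 'times_selected'
    by:         'certainty' or 'times_selected'
    """
    if by not in ("certainty", "times_selected"):
        raise ValueError(f"unknown criterion: {by}")
    return sorted(candidates, key=lambda candidate: candidate[by], reverse=True)


options = [
    {"word": "weather", "certainty": 0.6, "times_selected": 1},
    {"word": "wether", "certainty": 0.1, "times_selected": 0},
    {"word": "whither", "certainty": 0.3, "times_selected": 4},
]
print([c["word"] for c in sort_candidates(options, by="times_selected")])
# prints ['whither', 'weather', 'wether']
```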
In the above scheme, the recognition module is further configured to determine, as an optimal decoding path, a decoding path with a lowest path score among a plurality of decoding paths obtained by performing speech recognition on speech in the multimedia file; determining a plurality of words included in the optimal decoding path as the initial recognition result.
In the above scheme, the recognition module is further configured to sort the plurality of decoding paths in ascending order of path score, and select, from the sorted result, the top-ranked decoding paths other than the optimal decoding path; and determine a word in a selected decoding path that is at the same position as at least one of the words as the candidate word.
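Selecting the optimal decoding path and deriving candidate words from the next-best paths can be sketched as follows. The sketch assumes every decoding path is a list of words aligned position by position with the optimal path; skipping candidates identical to the optimal word and removing duplicates are additional assumptions, as is the parameter `n_alternatives`.

```python
def best_path_and_candidates(paths, n_alternatives=2):
    """Return (initial_result, candidates_by_position).

    paths: list of (words, path_score); the lowest path score is treated as
           optimal, matching the ascending ordering described above
    """
    ranked = sorted(paths, key=lambda item: item[1])
    best_words = ranked[0][0]
    alternatives = ranked[1:1 + n_alternatives]

    candidates = {position: [] for position in range(len(best_words))}
    for words, _score in alternatives:
        for position, word in enumerate(words[:len(best_words)]):
            # A word at the same position in another path becomes a candidate.
            if word != best_words[position] and word not in candidates[position]:
                candidates[position].append(word)
    return best_words, candidates


decoding_paths = [
    (["recognize", "speech"], 18),
    (["wreck", "beach"], 47),
    (["recognize", "beach"], 36),
]
print(best_path_and_candidates(decoding_paths))
# prints (['recognize', 'speech'], {0: ['wreck'], 1: ['beach']})
```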
In the above scheme, the recognition module is further configured to perform speech recognition on speech in the multimedia file to obtain a plurality of decoding paths and a path score corresponding to each decoding path; wherein each of the decoding paths includes a plurality of words.
In the above scheme, the recognition module is further configured to perform framing processing on the voice in the multimedia file to obtain a plurality of audio subdata; performing acoustic feature extraction on each audio subdata to obtain a plurality of audio features; performing word graph generation processing on the plurality of audio features to obtain a word graph network; and performing path generation processing based on the word graph network to obtain a plurality of decoding paths and path scores corresponding to each decoding path.
In the above solution, the recognition module is further configured to determine a plurality of decoding paths from the start node to the end node in the word graph network; and perform the following for each of the decoding paths: summing the scores of a plurality of nodes and the scores of a plurality of connecting lines included in the decoding path to obtain the path score of the decoding path.
In the foregoing solution, the identification module is further configured to convert each of the audio features into a corresponding phoneme to obtain a plurality of phonemes; combining the plurality of phonemes to obtain a plurality of phoneme strings; identifying each phoneme string through an acoustic model to obtain a plurality of words corresponding to each phoneme string and a score of each word; taking the obtained words as nodes in the word graph network, and determining the scores of the words as the scores of the corresponding nodes; connecting nodes corresponding to adjacent phoneme strings, and identifying words corresponding to two nodes of each connecting line through a language model to obtain a score of each connecting line; and the score of the connecting line represents the probability that the words corresponding to the two nodes of the connecting line are connected together to form a complete sentence.
In the foregoing solution, the recognition module is further configured to perform the following processing for a plurality of words corresponding to each phoneme string: and sorting the words in an ascending order according to the scores of the words, and filtering out the words at the back in the ascending order result.
In the foregoing solution, the recognition module is further configured to perform the following processing for each of the audio features: matching the audio feature with each basic phoneme in a basic phoneme library to determine the similarity between the audio feature and each basic phoneme; and determining the basic phoneme with the highest similarity as the phoneme corresponding to the audio feature.
In the foregoing solution, the recognition module is further configured to perform the following processing for a plurality of words corresponding to each phoneme string: sorting the words in ascending order of score, and selecting the two top-ranked words in the sorted result; determining the score difference between the two words; determining the recognition certainty degree of the word corresponding to the phoneme string in the initial recognition result according to the score difference; wherein the score difference is positively correlated with the recognition certainty degree.
An embodiment of the present application provides an electronic device, including:
a memory for storing computer executable instructions;
and the processor is used for realizing the voice recognition processing method provided by the embodiment of the application when executing the computer executable instructions stored in the memory.
The embodiment of the present application provides a computer-readable storage medium, which stores computer-executable instructions and is used for implementing the speech recognition processing method provided by the embodiment of the present application when being executed by a processor.
The embodiment of the application has the following beneficial effects:
By displaying the plurality of words in the initial recognition result in a differentiated manner according to the recognition certainty degree of each word, words with a lower recognition certainty degree are flagged, which makes them easy for the user to find and modify. Because such words are corrected by selecting a replacement candidate, the operation is simple and the initial recognition result can be modified quickly and accurately, which improves the efficiency of obtaining a recognition result that meets the user's needs.
Drawings
FIG. 1A and FIG. 1B are schematic diagrams of application scenarios of speech recognition provided by the related art;
FIG. 2 is a schematic structural diagram of a speech recognition processing system 100 provided in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a terminal 400 provided in an embodiment of the present application;
FIG. 4 is a schematic flowchart of a speech recognition processing method provided in an embodiment of the present application;
FIG. 5 is a schematic flowchart of a speech recognition processing method provided in an embodiment of the present application;
FIG. 6A and FIG. 6B are schematic flowcharts of a speech recognition processing method provided in an embodiment of the present application;
FIG. 7 is a schematic flowchart of a speech recognition processing method provided in an embodiment of the present application;
FIG. 8A, FIG. 8B, FIG. 8C, FIG. 8D, and FIG. 8E are schematic diagrams of application scenarios of a speech recognition processing method provided in an embodiment of the present application;
FIG. 9 is a schematic flowchart of a speech recognition processing method provided in an embodiment of the present application;
FIG. 10A, FIG. 10B, FIG. 10C, and FIG. 10D are schematic diagrams illustrating a speech recognition processing method provided in an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before the embodiments of the present application are described in further detail, the terms and expressions used in the embodiments are explained; the following explanations apply to these terms and expressions.
1) "In response to" indicates the condition or state on which a performed operation depends. When the condition or state is satisfied, the one or more operations may be performed in real time or with a set delay; unless otherwise specified, there is no restriction on the order in which the operations are performed.
2) Client: an application program running in a terminal and used for providing various services, such as a live-streaming client, a video client, or a short-video client.
3) Speech technology: its key technologies are automatic speech recognition, speech synthesis, and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and speech is regarded as one of the most promising human-computer interaction modes.
4) Word graph (lattice), also called a word lattice: a directed acyclic graph used in automatic speech recognition, in which each node represents a character or a word and each edge carries a score (or cost). A path from the start point to the end point of the word graph is one complete speech recognition result.
5) An acoustic model is a unit that decodes acoustic features of speech into phonemes or words. The language model then decodes the phonemes or words into a complete sentence.
6) Word Error Rate (WER), refers to the ratio between the number of erroneous words and the total number of words in speech recognition.
7) Phoneme: the pronunciation of a word is made up of phonemes. For English, a commonly used phoneme set is the set of 39 phonemes from Carnegie Mellon University. For Chinese, all initials and finals are generally used directly as the phoneme set; in addition, Chinese recognition distinguishes between phoneme sets with tones and without tones.
Referring to FIG. 1A and FIG. 1B, FIG. 1A and FIG. 1B are schematic diagrams of application scenarios of speech recognition provided by the related art. To modify a word that ASR has recognized incorrectly, the related art generally presents the ASR recognition result and leaves the user to find each incorrect word and type in a correction. For example, in FIG. 1A and FIG. 1B, the user can only find the wrong word on their own in the text edit box 101 containing the recognition result and type the corrected word.
The embodiments of the present application identify the following technical problem in the related art: if the automatically recognized text contains errors, the user must check it carefully word by word, and because of human reading habits, wrongly recognized characters and words are easy to overlook. As a result, obtaining a recognition result that meets the user's needs in the speech recognition process is slow and inefficient.
In view of the foregoing technical problems, embodiments of the present application provide a speech recognition processing method, which can improve the efficiency of obtaining a recognition result meeting the user requirements in a speech recognition process. An exemplary application of the speech recognition processing method provided in the embodiment of the present application is described below, and the speech recognition processing method provided in the embodiment of the present application may be implemented by various electronic devices, for example, may be implemented by a terminal alone, or may be implemented by a server and a terminal in cooperation.
Next, an embodiment of the present application will be described by taking a server and a terminal as an example, and referring to fig. 2, fig. 2 is a schematic structural diagram of a speech recognition processing system 100 provided in the embodiment of the present application. The speech recognition processing system 100 includes: the server 200, the network 300, and the terminal 400 will be separately described.
The server 200 is a background server of the client 410, and is configured to perform speech recognition on speech in the multimedia file to obtain an initial recognition result including a plurality of words; and also for sending the initial recognition result to the client 410.
The network 300, which is used as a medium for communication between the server 200 and the terminal 400, may be a wide area network or a local area network, or a combination of both.
The terminal 400 is used for running the client 410, and the client 410 is a client with a speech recognition function. The client 410 is configured to receive the initial recognition result sent by the server 200 and display the initial recognition result including a plurality of words in the human-computer interaction interface according to the display manner corresponding to the recognition certainty degree of each word; to display candidate words of at least one of the words; and, in response to a selection operation of the user, to replace the word in the initial recognition result that is at the same position as the selected candidate word with the selected candidate word.
In some embodiments, the terminal 400 implements the speech recognition processing method provided by the embodiments of the present application by running a computer program. For example, the computer program may be a native program or a software module in an operating system; a native application (APP), i.e. a program that needs to be installed in the operating system to run, such as a speech recognition APP, a subtitle editing APP or a social APP; an applet, i.e. a program that only needs to be downloaded into a browser environment to run; or a speech recognition applet or subtitle editing applet that can be embedded in any APP. In general, the computer program described above may be any form of application, module or plug-in.
Next, the embodiment of the present application will be described by taking the terminal 400 in fig. 2 as an independent embodiment.
The terminal 400 is used for running the client 410, and the client 410 is a client with a speech recognition function. The client 410 is configured to perform speech recognition on speech in the multimedia file to obtain an initial recognition result including a plurality of words; to display the initial recognition result including the plurality of words in the human-computer interaction interface according to the display manner corresponding to the recognition certainty degree of each word; to display candidate words of at least one of the words; and, in response to a selection operation of the user, to replace the word in the initial recognition result that is at the same position as the selected candidate word with the selected candidate word.
The embodiments of the present application can be applied to various speech recognition scenarios, such as subtitle editing, real-time speech-to-text or lyric editing, and the like, which will be described separately below.
Taking the subtitle editing scenario as an example, the client 410 may be a subtitle editing APP, and the multimedia file may be a video whose subtitles are to be edited. The user uploads the video through the client 410, and the client 410 performs speech recognition on a plurality of voices in the video to obtain a plurality of initial recognition results in one-to-one correspondence with the voices, and displays each initial recognition result in the human-computer interaction interface according to the display manner corresponding to the recognition certainty degree of each word in it. When the user finds a wrong word in an initial recognition result, the client 410 displays the candidate words in response to a focusing operation of the user and, in response to the user's selection of a candidate word, replaces the word at the same position in the initial recognition result with the selected candidate word. Therefore, the accuracy, efficiency and speed with which the user edits the subtitles can be improved.
Taking the real-time speech-to-text scenario as an example, the client 410 may be a social APP, and the multimedia file may be a separate voice message. The user invokes the real-time speech-to-text function in the client 410 to perform speech recognition on the uploaded voice and obtain a corresponding initial recognition result, which is displayed in the human-computer interaction interface according to the display manner corresponding to the recognition certainty degree of each word in it. When the user finds a wrong word in the initial recognition result, the client 410 displays the candidate words in response to a focusing operation of the user and, in response to the user's selection of a candidate word, replaces the word at the same position in the initial recognition result with the selected candidate word. Therefore, the efficiency and speed with which the user sends text messages during a chat can be improved.
Taking the lyric editing scenario as an example, the client 410 may be a lyric editing APP, and the multimedia file may be audio whose lyrics are to be edited. The user uploads the audio through the client 410, and the client 410 performs speech recognition on a plurality of voices in the audio to obtain a plurality of initial recognition results in one-to-one correspondence with the voices, and displays each initial recognition result in the human-computer interaction interface according to the display manner corresponding to the recognition certainty degree of each word in it. When the user finds a wrong word in an initial recognition result, the client 410 displays the candidate words in response to a focusing operation of the user and, in response to the user's selection of a candidate word, replaces the word at the same position in the initial recognition result with the selected candidate word. Therefore, the accuracy, efficiency and speed with which the user edits the lyrics can be improved.
The embodiments of the present application may be implemented by means of cloud technology, which refers to a hosting technology that unifies a series of resources such as hardware, software, and network in a wide area network or a local area network to implement the calculation, storage, processing, and sharing of data.
Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like that are applied on the basis of the cloud computing business model; it can form a resource pool that is used on demand, which is flexible and convenient. Because the background services of technical network systems require a large amount of computing and storage resources, cloud computing technology will become an important support.
As an example, the server 200 may be an independent physical server, may be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. The terminal 400 may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal 400 and the server 200 may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited thereto.
The structure of the terminal 400 in fig. 2 is explained next. Referring to fig. 3, fig. 3 is a schematic structural diagram of a terminal 400 according to an embodiment of the present application, where the terminal 400 shown in fig. 3 includes: at least one processor 410, memory 450, at least one network interface 420, and a user interface 430. The various components in the terminal 400 are coupled together by a bus system 440. It is understood that the bus system 440 is used to enable communications among the components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 440 in FIG. 3.
The Processor 410 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable the presentation of media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 450 optionally includes one or more storage devices physically located remote from processor 410.
The memory 450 includes either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 450 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
The operating system 451 includes system programs, such as a framework layer, a core library layer, and a driver layer, for handling various basic system services and performing hardware-related tasks.
A network communication module 452 for reaching other computing devices via one or more (wired or wireless) network interfaces 420. Exemplary network interfaces 420 include Bluetooth, Wireless Fidelity (WiFi), and Universal Serial Bus (USB), among others.
A presentation module 453 for enabling presentation of information (e.g., user interfaces for operating peripherals and displaying content and information) via one or more output devices 431 (e.g., display screens, speakers, etc.) associated with user interface 430.
An input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.
In some embodiments, the speech recognition processing device provided by the embodiments of the present application may be implemented in software. Fig. 3 illustrates a speech recognition processing device 455 stored in the memory 450, which may be software in the form of programs and plug-ins and includes the following software modules: an identification module 4551, a display module 4552 and a replacement module 4553. These modules are logical and may therefore be arbitrarily combined or further divided according to the functions implemented. The functions of the respective modules are explained below.
The speech recognition processing method provided by the embodiment of the present application may be executed by the terminal 400 in fig. 2 alone, or may be executed by the terminal 400 and the server 200 in fig. 2 in cooperation.
In the following, a voice recognition processing method provided by the embodiment of the present application is performed by the terminal 400 in fig. 2 alone as an example. Referring to fig. 4, fig. 4 is a schematic flowchart of a speech recognition processing method provided in an embodiment of the present application, and will be described with reference to the steps shown in fig. 4.
It should be noted that the method shown in fig. 4 can be executed by various forms of computer programs run by the terminal 400, such as the operating system 451, software modules, and scripts, and is not limited to the client 410; the client should therefore not be considered as limiting the embodiments of the present application.
In step S101, speech recognition is performed on speech in a multimedia file to obtain a sentence including a plurality of words.
Here, a plurality of words are used as the initial recognition result.
In some embodiments, the multimedia file may be audio, video, or separate voice information. Therefore, the multimedia file may include one voice or a plurality of voices.
As an example, when the multimedia file is audio, the embodiments of the present application may be applied to a lyric editing application scenario; when the multimedia file is a video, the embodiment of the application can be applied to a subtitle editing application scene; when the multimedia file is independent voice information, the embodiment of the application can be applied to a real-time voice-to-text application scene.
Taking the subtitle editing scenario as an example, the client may be a subtitle editing APP, and the multimedia file may be a video whose subtitles are to be edited, the video including a plurality of voices. The user uploads the video through the client, and the client performs speech recognition on the plurality of voices in the video to obtain a plurality of sentences in one-to-one correspondence with the voices.
Taking the real-time speech-to-text scenario as an example, the client may be a social APP, and the multimedia file may be a separate voice message. The user invokes the real-time speech-to-text function in the client to perform speech recognition on the uploaded voice and obtain a sentence corresponding to the voice.
Taking the lyric editing scenario as an example, the client may be a lyric editing APP, and the multimedia file may be audio whose lyrics are to be edited, the audio including a plurality of voices. The user uploads the audio through the client, and the client performs speech recognition on the plurality of voices in the audio to obtain a plurality of sentences in one-to-one correspondence with the voices.
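As a purely illustrative sketch, the result of step S101 could be held in a structure like the one below. The class names, field names, and the sample words are assumptions made for illustration and are not taken from the embodiments; the only properties carried over from the description are that a sentence consists of words, that each word has a recognition certainty degree, and that a word may have candidate words.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Candidate:
    text: str          # alternative word for the same position
    certainty: float   # recognition certainty degree of the candidate


@dataclass
class Word:
    text: str
    certainty: float   # recognition certainty degree, e.g. a confidence in [0, 1]
    candidates: List[Candidate] = field(default_factory=list)


@dataclass
class Sentence:
    words: List[Word]
    start_time: float  # assumed: start of the corresponding voice, in seconds
    end_time: float    # assumed: end of the corresponding voice, in seconds

    def text(self, sep: str = " ") -> str:
        """Join the words into the sentence text that is displayed."""
        return sep.join(w.text for w in self.words)


# Hypothetical recognition output for a single voice.
sentence = Sentence(
    words=[
        Word("word_a", 0.5, [Candidate("alt_a1", 0.4)]),
        Word("word_b", 0.3, [Candidate("alt_b1", 0.6), Candidate("alt_b2", 0.2)]),
        Word("word_c", 0.8),
    ],
    start_time=0.0,
    end_time=2.5,
)
print(sentence.text())
```

Keeping the candidates attached to the word they belong to makes "the word at the same position as the selected candidate word" a simple index lookup in the later steps.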
In step S102, a sentence including a plurality of words is displayed in accordance with a display mode corresponding to the recognition certainty degree of each word.
Here, the degree of recognition certainty of a word is used to characterize the probability, e.g., confidence, that the word recognition is correct in the speech recognition process.
In some embodiments, the degree of significance of the display manner corresponding to each word is determined according to the recognition certainty degree of the word, and the sentence is displayed according to the degree of significance of the display manner corresponding to each word; the degree of significance is negatively correlated with the recognition certainty degree.
By way of example, the display manner of a word includes at least one of: color; font type; font size; special effect; identifier (e.g., underline or quotation marks); stroke weight.
For example, when a sentence includes word A, word B, and word C, with recognition certainty degrees of 0.5, 0.3, and 0.8 respectively, the display manner of word C has the lowest degree of significance, that of word A is intermediate, and that of word B is the highest; for example, the font size of word C is the smallest, that of word A is intermediate, and that of word B is the largest.
For example, in fig. 8C, the recognition certainty degree of the word "jintian" is 0.3, that of the word "wangwei" is 0.5, and that of each of the words "from military", "we", "learning", and "of" is 1. In this case, the display manners of the words "from military", "we", "learning", and "of" have the same, lowest degree of significance, i.e. a font size of 8; the display manner of the word "wangwei" has an intermediate degree of significance, i.e. a font size of 10; and the display manner of the word "jintian" has the highest degree of significance, i.e. a font size of 12.
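The negative correlation between significance and recognition certainty can be sketched as follows. The description does not fix a particular mapping, so the rule used here (rank the distinct certainty values from highest to lowest and enlarge the font by a fixed step per rank) is an assumption chosen only because it reproduces the font sizes 8, 10 and 12 of the fig. 8C example; the function name and parameters are hypothetical.

```python
from typing import Dict, List


def font_sizes_by_certainty(certainties: List[float],
                            base_size: int = 8,
                            step: int = 2) -> List[int]:
    """Assign a larger font size to words with a lower certainty degree."""
    # Distinct certainty values, highest first: the most certain words get rank 0.
    ranked = sorted(set(certainties), reverse=True)
    rank_of: Dict[float, int] = {c: i for i, c in enumerate(ranked)}
    # Each step down in certainty adds `step` to the font size.
    return [base_size + step * rank_of[c] for c in certainties]


# Certainty degrees as in the fig. 8C example: one word at 0.3, one at 0.5,
# and the remaining words at 1 (the word order here is illustrative).
sizes = font_sizes_by_certainty([0.3, 1.0, 1.0, 0.5, 1.0, 1.0])
print(sizes)  # [12, 8, 8, 10, 8, 8]

# Words A, B, C with certainty degrees 0.5, 0.3, 0.8.
print(font_sizes_by_certainty([0.5, 0.3, 0.8]))  # [10, 12, 8]
```

Any other monotonically decreasing mapping, for example one that interpolates color or stroke weight instead of font size, would satisfy the same negative correlation.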
In other embodiments, words whose recognition certainty degree is lower than a degree threshold are displayed in a display manner different from that of the deterministic words in the sentence.
Here, a deterministic word is a word whose recognition certainty degree is not lower than the degree threshold, or a word without candidate words. An uncertain word is a word whose recognition certainty degree is lower than the degree threshold, or a word having candidate words. The degree threshold may be a default value, a value set by the user, or a value determined from the recognition certainty degrees of all the words in the sentence; for example, the average of the recognition certainty degrees of all the words is used as the degree threshold.
As an example, the uncertain words can be displayed in a font size different from that of the deterministic words, e.g., the font size of the uncertain words can be larger than that of the deterministic words; the uncertain words may be displayed in a font color different from that of the deterministic words, e.g., the font color of the uncertain words may be red and that of the deterministic words black; or a distinguishing identifier may be displayed in the vicinity of an uncertain word to distinguish it from the deterministic words, where the distinguishing identifier may be a symbol such as an underline or quotation marks, e.g., the uncertain words may be underlined while the deterministic words are not.
For example, when a sentence includes word A, word B, and word C, the degree threshold is 0.4, and the recognition certainty degrees of word A, word B, and word C are 0.5, 0.3, and 0.8 respectively, word C and word A are deterministic words and word B is an uncertain word. Word B is therefore displayed in a display manner different from that of word C and word A; for example, the font color of word B is red, and the font colors of word C and word A are black.
For example, in fig. 8A, the font size of the uncertain word "wangwei" in the displayed sentence is larger than that of the deterministic words "today", "we", "learning", "of", and "subordinate march".
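The threshold rule can be sketched in a few lines. The function and variable names are hypothetical; the sketch covers only the certainty-based half of the definition (it ignores the alternative criterion of whether a word has candidate words) and falls back to the average certainty when no threshold is supplied, which is one of the options the description mentions.

```python
from statistics import mean
from typing import Dict, Optional


def mark_uncertain(certainties: Dict[str, float],
                   threshold: Optional[float] = None) -> Dict[str, bool]:
    """Return {word: True if the word is uncertain, else False}."""
    if threshold is None:
        # Option from the description: use the average certainty as the threshold.
        threshold = mean(certainties.values())
    # A word is uncertain when its certainty is strictly below the threshold.
    return {word: c < threshold for word, c in certainties.items()}


words = {"A": 0.5, "B": 0.3, "C": 0.8}

fixed = mark_uncertain(words, threshold=0.4)
print(fixed)     # {'A': False, 'B': True, 'C': False}

averaged = mark_uncertain(words)  # threshold is (0.5 + 0.3 + 0.8) / 3
print(averaged)  # {'A': True, 'B': True, 'C': False}

# Illustrative style choice: red for uncertain words, black otherwise.
colors = {w: ("red" if uncertain else "black") for w, uncertain in fixed.items()}
print(colors)    # {'A': 'black', 'B': 'red', 'C': 'black'}
```

With the fixed threshold of 0.4 only word B is highlighted, matching the example above; with the average as threshold (about 0.53), word A is highlighted as well.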
In the embodiments of the present application, the words in the initial recognition result are displayed differently, in any of several ways, according to the recognition certainty degree of each word. This draws attention to the words with a lower recognition certainty degree and makes it convenient for the user to modify them. Compared with the related art, in which the user has to proofread the text unaided to find wrong words, the embodiments of the present application allow the initial recognition result to be modified quickly and accurately, thereby improving the efficiency of obtaining a recognition result that meets the user's requirements.
In step S103, candidate words of at least one word are displayed.
In some embodiments, in response to a focus operation on at least one word in the sentence, the corresponding at least one candidate word is displayed in a neighboring position of the at least one word.
As an example, in response to a focus operation for at least one word in a sentence, a corresponding at least one candidate word is displayed in a position adjacent to the word at which the focus of the focus operation is located.
As an example, the focusing operation may be any of various forms of operation that are preset by the operating system and do not conflict with registered operations, or any of various forms of user-defined operation that do not conflict with registered operations. The focusing operation includes at least one of: a click operation (e.g., a touch click or a mouse click, where the focus is the click position); a hovering operation (e.g., moving the mouse over a word and keeping it there for longer than a time threshold, where the time threshold may be a default value or a value set by the user, and the focus is the hovering position); a voice operation (where the word at the focus is the word contained in the voice operation); and a gaze focusing operation (where the focus is the gaze focus, i.e. the word at the gaze focus is the word selected by the operation, and the candidate words of that word are displayed). Thus, the operation experience of the user can be improved.
As an example, the adjacent position of a word may be immediately above, below, to the left of, or to the right of the word. For example, when the sentence is arranged horizontally, the corresponding candidate words may be displayed immediately above or below the word; when the sentence is arranged vertically, the corresponding candidate words may be displayed immediately to the left or right of the word.
For example, in fig. 8A, when the user hovers the mouse over any word, a floating frame 801 is shown immediately below the word, and several candidate words are shown in the floating frame 801.
In other embodiments, candidate words for at least one word in the sentence are automatically displayed.
As an example, candidate words corresponding to a part of words in a sentence may be automatically displayed, or candidate words corresponding to all words in the sentence may be displayed.
As an example, when a plurality of sentences in one-to-one correspondence with a plurality of voices in the multimedia file are recognized, candidate words of at least one word may be automatically displayed for all of the sentences, or only for some of the sentences. The latter may be the sentences that contain uncertain words, or the sentences currently displayed on the terminal screen.
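The choice of sentences for which candidate words are displayed automatically can be sketched as a simple filter. The function name, the `mode` strings, and the representation of a sentence as a list of per-word "uncertain" flags are assumptions made for illustration only.

```python
from typing import List, Set


def sentences_to_annotate(uncertain_flags: List[List[bool]],
                          on_screen: Set[int],
                          mode: str = "uncertain") -> List[int]:
    """Return the indices of sentences whose candidate words are shown automatically."""
    indices = range(len(uncertain_flags))
    if mode == "all":
        return list(indices)
    if mode == "uncertain":
        # Only sentences that contain at least one uncertain word.
        return [i for i in indices if any(uncertain_flags[i])]
    if mode == "on_screen":
        # Only sentences currently displayed on the terminal screen.
        return [i for i in indices if i in on_screen]
    raise ValueError(f"unknown mode: {mode}")


# Four hypothetical sentences; True marks an uncertain word.
flags = [[False, False], [False, True], [True, False, False], [False]]

with_uncertain = sentences_to_annotate(flags, on_screen={0, 1})
visible = sentences_to_annotate(flags, on_screen={0, 1}, mode="on_screen")
print(with_uncertain)  # [1, 2]
print(visible)         # [0, 1]
```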
For example, in fig. 8D, among the sentences recognized from the multimedia file, only the sentence "please turn to thirteenth book" and the sentence "from junior of great here we learn great" contain uncertain words, so the candidate words corresponding to the uncertain words in these two sentences can be automatically displayed in the floating frame 803.
For example, in fig. 8E, a display frame 804 is shown on the right side of the subtitle editing area, and candidate words of at least one word are displayed in the display frame 804. The display frame 804 may be located above, below, to the left of, or to the right of the subtitle editing area, which is not limited in this embodiment.
In the embodiments of the present application, because the correct word is likely to be among the candidate words, displaying the candidate words lets the user conveniently select the correct word for correction, which improves the efficiency of obtaining a recognition result that meets the user's requirements.
In some embodiments, the candidate words may be displayed in a manner consistent with the words in the same position, or in a manner different from the words in the same position. For example, the candidate words "wangwei" and "wangwei" for the word "wangwei" in fig. 8A may be displayed in the same display mode as that used when displaying the "wangwei", or may be displayed in different display modes.
In some embodiments, when a word has a plurality of candidate words, the ordering of the candidate words is determined in at least one of the following ways, and the candidate words are displayed in that order: sorting the candidate words in descending order of their recognition certainty degree; sorting the candidate words in descending order of the number of times they have been selected.
As an example, the higher the recognition certainty degree of a candidate word, the higher the probability that it is the correct word. Sorting the candidate words in descending order of recognition certainty degree therefore shows the user the candidates most likely to be correct first, which speeds up the correction of wrongly recognized words.
As an example, the more often a candidate word has been selected, the higher the probability that it meets the user's requirements. Sorting the candidate words in descending order of the number of selections therefore shows the user the candidates most likely to meet the requirements first, which speeds up the correction of wrongly recognized words.
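Both orderings reduce to a descending sort on a different key, as in this minimal sketch. The `Candidate` class, its field names, and the sample values are hypothetical; in particular, how the selection count is collected and stored is not specified in the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str
    certainty: float     # recognition certainty degree of the candidate
    times_selected: int  # assumed counter of earlier selections


def order_candidates(candidates: List[Candidate],
                     by: str = "certainty") -> List[Candidate]:
    """Return the candidates in descending order of the chosen criterion."""
    if by == "certainty":
        key = lambda c: c.certainty
    elif by == "times_selected":
        key = lambda c: c.times_selected
    else:
        raise ValueError(f"unknown criterion: {by}")
    # sorted() is stable, so candidates with equal keys keep their input order.
    return sorted(candidates, key=key, reverse=True)


candidates = [
    Candidate("cand_x", certainty=0.6, times_selected=1),
    Candidate("cand_y", certainty=0.2, times_selected=5),
    Candidate("cand_z", certainty=0.4, times_selected=3),
]

by_certainty = [c.text for c in order_candidates(candidates)]
by_selection = [c.text for c in order_candidates(candidates, by="times_selected")]
print(by_certainty)  # ['cand_x', 'cand_z', 'cand_y']
print(by_selection)  # ['cand_y', 'cand_z', 'cand_x']
```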
In step S104, in response to the selection operation, a word in the sentence at the same position as the selected candidate word is replaced according to the selected candidate word.
As an example, the selection operation may be any of various forms of operation that are preset by the operating system and do not conflict with registered operations, or any of various forms of user-defined operation that do not conflict with registered operations. The selection operation includes at least one of: a click operation (e.g., clicking a candidate word with the mouse or tapping it on a touch screen); a hovering operation (e.g., moving the mouse over a candidate word and keeping it there for longer than a time threshold, where the time threshold may be a default value or a value set by the user); a voice operation; a somatosensory operation; and a gaze selection operation (e.g., the candidate word at the gaze focus is the candidate word selected by the operation). Thus, the operation experience of the user can be improved.
For example, in fig. 8B, the user may replace "wangwei" in the sentence by selecting the candidate word 802 "wangwei", and when the user selects the candidate word "wangwei", the corrected sentence is "the subordinate march we learn wangwei today".
In the embodiments of the present application, wrongly recognized words are corrected by selecting a replacement. Compared with the related art, in which wrongly recognized words are corrected by typing the corrected word, the operation is simpler and the initial recognition result can be modified quickly and accurately, thereby improving the efficiency of obtaining a recognition result that meets the user's requirements.
In some embodiments, after step S104, the method may further include: changing the display manner of the selected candidate word in the sentence to the display manner of the deterministic words. In this way, the corrected word is no longer highlighted to the user, which prevents the user from confusing corrected and uncorrected words during correction and improves the user's correction efficiency.
For example, in fig. 8A, the font size of the uncertain word "wangwei" in the displayed sentence is larger than that of the deterministic words "today", "we", "learning", "of", and "subordinate march". In fig. 8B, after "wangwei" is replaced with the selected candidate word "wangwei", the sentence "the subordinate march of which we learn wangwei today" is displayed, in which the font size of "wangwei" is the same as that of the deterministic words "today", "we", "learning", and "subordinate march".
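Step S104 together with the display reset can be sketched as a single in-place update. The `Word` class, the `uncertain` flag standing in for the distinct display manner, and the placeholder words are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    text: str
    uncertain: bool  # True: shown in the distinct (highlighted) display manner
    candidates: List[str] = field(default_factory=list)


def apply_selection(sentence: List[Word], position: int, candidate_index: int) -> None:
    """Replace the word at `position` with one of its own candidate words."""
    word = sentence[position]
    word.text = word.candidates[candidate_index]
    # The corrected word is displayed like a deterministic word from now on.
    word.uncertain = False


# Hypothetical sentence in which only the second word is uncertain.
sentence = [
    Word("first", uncertain=False),
    Word("misheard", uncertain=True, candidates=["intended", "other"]),
    Word("last", uncertain=False),
]

apply_selection(sentence, position=1, candidate_index=0)
print([w.text for w in sentence])       # ['first', 'intended', 'last']
print([w.uncertain for w in sentence])  # [False, False, False]
```

Because the candidates are stored on the word they belong to, the selected candidate word and the replaced word are at the same position by construction.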
In some embodiments, after step S104, the method may further include: automatically replacing at least one word in the sentence that is the same as the replaced word with the selected candidate word, and changing the display manner of the selected candidate word in the sentence to the display manner of the deterministic words. In this way, identical wrong words in the sentence can be replaced with the selected candidate word in a batch, which improves the user's correction efficiency.
As an example, all words in the sentence that are the same as the replaced word are automatically replaced with the selected candidate word; or all words in the sentence that are the same as the replaced word and whose recognition certainty degree is lower than the degree threshold are automatically replaced with the selected candidate word. In either case, the display manner of the selected candidate word in the sentence is changed to the display manner of the deterministic words.
In some embodiments, after step S104, the method may further include: presenting global replacement prompt information, wherein the global replacement prompt information is used for prompting global replacement to be carried out on the basis of the selected candidate words; and in response to the confirmation operation aiming at the global replacement prompt information, all the words which are in at least one sentence identified from the multimedia file and are the same as the words replaced by the selected candidate words are replaced by the selected candidate words. In this way, after the confirmation operation of the user is received, the same wrong words in the sentences can be replaced by the selected candidate words in batch, and therefore the correction efficiency of the user is improved.
Here, the confirmation operation may be an operation that triggers a confirmation button in the global replacement prompt information, or an operation in which the focus of the focusing operation leaves the word.
As an example, after all words in the at least one sentence recognized from the multimedia file that are the same as the word replaced by the selected candidate word have been replaced with the selected candidate word, the method may further include: changing the display manner of the words at the replaced positions to the display manner of the deterministic words. In this way, the corrected words are no longer highlighted to the user, which prevents the user from confusing corrected and uncorrected words during correction and improves the user's correction efficiency.
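Global replacement after a confirmation operation can be sketched as follows. The names are hypothetical, and the `confirmed` argument merely stands in for whatever confirmation operation the user performs on the global replacement prompt information.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    uncertain: bool  # True: shown in the distinct (highlighted) display manner


def replace_globally(sentences: List[List[Word]], old_text: str,
                     new_text: str, confirmed: bool) -> int:
    """Replace every occurrence of `old_text`; return the number of replacements."""
    if not confirmed:
        return 0  # the user did not confirm the global replacement prompt
    count = 0
    for sentence in sentences:
        for word in sentence:
            if word.text == old_text:
                word.text = new_text
                # Replaced positions are displayed like deterministic words.
                word.uncertain = False
                count += 1
    return count


# The user has already corrected "wrong" to "right" in the first sentence.
sentences = [
    [Word("right", False), Word("x", False)],
    [Word("wrong", True), Word("y", False)],
    [Word("z", False), Word("wrong", True)],
]

replaced = replace_globally(sentences, "wrong", "right", confirmed=True)
print(replaced)                                       # 2
print([[w.text for w in s] for s in sentences])
```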
In some embodiments, a plurality of sentences in one-to-one correspondence with a plurality of voices in the multimedia file are recognized. Referring to fig. 5, fig. 5 is a flowchart of a voice recognition processing method provided in an embodiment of the present application. Based on fig. 4, step S105 may be included before step S102, and step S106 may be included after step S104. It should be noted that step S106 is optional.
In step S105, the sorting manner used when displaying the plurality of sentences is determined.
In some embodiments, the plurality of sentences are sorted in ascending order by the recognition certainty degree of the sentences, wherein the recognition certainty degree of the sentences is the sum of the recognition certainty degrees of all the words included in the sentences.
For example, when a sentence A, a sentence B, and a sentence C are recognized from the multimedia file, and the recognition certainty degree of sentence A is 2.3, that of sentence B is 2.8, and that of sentence C is 2.5, sentences A, B, and C are ordered as: sentence A; sentence C; sentence B. In this way, sentences with a lower recognition certainty degree are displayed to the user first, so that the user can correct them first.
In other embodiments, the plurality of sentences are ordered according to the sequence of the occurrence of the speech corresponding to the sentences in the multimedia file.
Taking a subtitle editing scene as an example, in fig. 8A, the displayed multiple sentences are sorted according to the sequence of occurrence of voices corresponding to the sentences in the video, and play times of the voices corresponding to the multiple sentences one by one are displayed at adjacent positions of the multiple sentences, where the play times include a start play time and an end play time. Therefore, corresponding sentences can be displayed to the user according to the time axis, and the user can correct the sentences with wrong words according to the sequence of the voice corresponding to the sentences appearing in the multimedia file.
In still other embodiments, the plurality of sentences are sorted in descending order by the frequency of occurrence of the uncertain words they include.
For example, when a sentence A, a sentence B, and a sentence C are recognized from the multimedia file, and sentence A includes 2 uncertain words, sentence B includes 5 uncertain words, and sentence C includes 3 uncertain words, sentences A, B, and C are ordered as: sentence B; sentence C; sentence A. In this way, sentences with more wrong words are displayed to the user first, so that the user can correct them first.
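The three sorting manners of step S105 can be sketched as follows. The `Sentence` class, the per-word certainty values, the start times, and the threshold of 0.5 are hypothetical; the word values were chosen only so that the sums (2.3, 2.8, 2.5) and the uncertain-word counts (2, 5, 3) match the examples above.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    name: str
    start_time: float        # start play time of the corresponding voice, in seconds
    word_certainties: list   # recognition certainty degree of each word

    @property
    def certainty(self):
        # Sentence certainty is the sum of the certainties of all its words.
        return sum(self.word_certainties)

    def uncertain_count(self, threshold):
        # Number of words whose certainty is below the degree threshold.
        return sum(1 for c in self.word_certainties if c < threshold)


def by_certainty(sentences):
    """Ascending order of sentence certainty: least certain first."""
    return sorted(sentences, key=lambda s: s.certainty)


def by_time(sentences):
    """Order in which the corresponding voices occur in the file."""
    return sorted(sentences, key=lambda s: s.start_time)


def by_uncertain_words(sentences, threshold=0.5):
    """Descending order of the number of uncertain words."""
    return sorted(sentences, key=lambda s: s.uncertain_count(threshold), reverse=True)


sentences = [
    Sentence("A", 12.0, [0.9, 0.4, 0.3, 0.7]),            # sum 2.3, 2 uncertain words
    Sentence("B", 3.0, [0.4, 0.4, 0.4, 0.4, 0.4, 0.8]),   # sum 2.8, 5 uncertain words
    Sentence("C", 7.5, [0.3, 0.4, 0.4, 0.7, 0.7]),        # sum 2.5, 3 uncertain words
]
certainty_order = [s.name for s in by_certainty(sentences)]
time_order = [s.name for s in by_time(sentences)]
uncertain_order = [s.name for s in by_uncertain_words(sentences)]
```

Re-running the same function after a correction changes the word certainties is one simple way to obtain the updated ordering of step S106.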
In step S106, the order in which the plurality of sentences are displayed is updated.
In some embodiments, the ascending ordering of the plurality of sentences by recognition certainty degree is updated.
For example, when at least one of sentences A, B, and C has been corrected, and the recognition certainty degrees of sentences A, B, and C after correction are 2.3, 2.8, and 2.5 respectively, the updated display order of the sentences is: sentence A; sentence C; sentence B. Because the ordering is updated after sentences are corrected, sentences with a lower recognition certainty degree are always displayed to the user first, so that the user can correct them first.
In other embodiments, the descending ordering of the plurality of sentences by the frequency of occurrence of the uncertain words they include is updated.
For example, when at least one of sentences A, B, and C has been corrected, and sentences A, B, and C after correction include 2, 5, and 3 uncertain words respectively, the updated display order of the sentences is: sentence B; sentence C; sentence A. Because the ordering is updated after sentences are corrected, sentences with more wrong words are always displayed to the user first, so that the user can correct them first.
In some embodiments, referring to fig. 6A, fig. 6A is a schematic flowchart of a speech recognition processing method provided in an embodiment of the present application, and based on fig. 4, step S101 may be replaced with step S107.
In step S107, speech recognition is performed on the speech in the multimedia file, and a plurality of decoding paths and a path score corresponding to each decoding path are obtained.
Here, each decoding path includes a plurality of words.
For example, there are four decoding paths in fig. 10C, each with a path score: "learning king dimension" has a path score of 6; "blood wash king dimension" has a path score of 11; "learning king defend" has a path score of 7; and "blood wash king defend" has a path score of 12.
In some embodiments, step S102 may be preceded by: determining, from the plurality of decoding paths obtained by performing voice recognition on the voice in the multimedia file, the decoding path with the lowest path score as the optimal decoding path; and determining the plurality of words included in the optimal decoding path as the initial recognition result.
For example, in fig. 10C, since the path score of "learning king dimension" is the lowest, "learning king dimension" is the optimal decoding path, and the words "learning" and "king dimension" included therein are used as the initial recognition result.
In some embodiments, step S103 may be preceded by: sorting the plurality of decoding paths in ascending order of path score, and selecting from the ascending sorting result some of the top-ranked decoding paths other than the optimal decoding path; and determining the words in the selected decoding paths that are at the same position as at least one word of the initial recognition result as candidate words.
For example, in fig. 10C, the four decoding paths sorted in ascending order of path score are: "learning king dimension"; "learning king defend"; "blood wash king dimension"; "blood wash king defend". If the two top-ranked decoding paths other than the optimal decoding path are selected, namely "learning king defend" and "blood wash king dimension", then the candidate word corresponding to "learning" in the initial recognition result is "blood wash", and the candidate word corresponding to "king dimension" is "king defend".
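The selection of the initial recognition result and of the candidate words can be sketched as follows, using the four path scores quoted for fig. 10C. The function name, the parameter `n_alternatives`, and the word labels are illustrative; the sketch also assumes that every decoding path contains the same number of words, so that words can be compared position by position.

```python
def initial_result_and_candidates(paths, n_alternatives=2):
    """paths: list of (words, path_score); a lower score is better.

    Returns the words of the optimal path and, for each position, the
    differing words found in the next-best n_alternatives paths.
    """
    ranked = sorted(paths, key=lambda p: p[1])      # ascending path score
    best_words = ranked[0][0]                       # optimal decoding path
    candidates = [[] for _ in best_words]
    for words, _score in ranked[1:1 + n_alternatives]:
        for position, word in enumerate(words):
            differs = word != best_words[position]
            if differs and word not in candidates[position]:
                candidates[position].append(word)
    return best_words, candidates


paths = [
    (["learning", "king dimension"], 6),
    (["blood wash", "king dimension"], 11),
    (["learning", "king defend"], 7),
    (["blood wash", "king defend"], 12),
]
best_words, candidates = initial_result_and_candidates(paths)
```

Raising `n_alternatives` widens the candidate set at the price of showing more unlikely words; how many alternative paths to keep is left open by the text.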
In some embodiments, referring to fig. 6B, fig. 6B is a schematic flowchart of a speech recognition processing method provided in an embodiment of the present application, and based on fig. 6A, step S107 may include steps S1071 to S1074.
In step S1071, the speech in the multimedia file is framed to obtain a plurality of audio sub data.
In some embodiments, the framing process divides the speech into a plurality of segments, each segment being one piece of audio sub-data.
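A minimal framing sketch is given below. It assumes the speech is available as a flat list of samples with a known sample rate; the 25 ms default frame length is within the 25 ms to 40 ms range mentioned for fig. 10A, while the optional overlapping hop (`hop_ms`) is an assumption of this example and is not stated in the embodiment.

```python
def frame_signal(samples, sample_rate, frame_ms=25, hop_ms=None):
    """Split samples into fixed-length frames (the audio sub-data).

    hop_ms is the step between frame starts; when omitted, the frames do
    not overlap. Trailing samples that do not fill a frame are dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = frame_len if hop_ms is None else int(sample_rate * hop_ms / 1000)
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame and hop lengths must be positive")
    last_start = len(samples) - frame_len
    return [samples[start:start + frame_len]
            for start in range(0, last_start + 1, hop_len)]


# 0.1 s of placeholder samples at 16 kHz.
samples = list(range(1600))
frames = frame_signal(samples, 16000)                         # non-overlapping 25 ms frames
overlapping = frame_signal(samples, 16000, hop_ms=10)         # 25 ms frames every 10 ms
```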
In step S1072, acoustic feature extraction is performed on each audio sub data to obtain a plurality of audio features.
In some embodiments, after framing, the speech becomes a plurality of audio sub-data. However, the waveform of each piece of audio sub-data has little descriptive power in the time domain, so the waveform must be transformed: each piece of audio sub-data is converted, according to the physiological characteristics of the human ear, into an audio feature in the form of a multi-dimensional vector that contains the content information of the corresponding audio sub-data.
In step S1073, a word graph generation process is performed on the plurality of audio features to obtain a word graph network.
In some embodiments, each audio feature is converted to a respective phoneme to obtain a plurality of phonemes; combining the plurality of phonemes to obtain a plurality of phoneme strings; identifying each phoneme string through an acoustic model to obtain a plurality of words corresponding to each phoneme string and a score of each word; taking the obtained words as nodes in a word graph network, and determining the scores of the words as the scores of the corresponding nodes; connecting the nodes corresponding to the adjacent phoneme strings, and identifying words corresponding to the two nodes of each connection through a language model to obtain the score of each connection.
As an example, the score of a connection characterizes the probability that the words corresponding to the two nodes of the connection, when joined together, form a complete sentence. The lower the score of the connection, the higher this probability; the higher the score of the connection, the lower this probability.
As an example, before the obtained words are taken as nodes in the word graph network, the method may further include performing the following processing for the plurality of words corresponding to each phoneme string: sorting the plurality of words in ascending order of score, and filtering out the words ranked last in the ascending sorting result.
As an example, converting each audio feature into a respective phoneme may include: the following processing is performed for each audio feature: matching the audio features with each basic phoneme in the basic phoneme library to determine the similarity between the audio features and each basic phoneme; and determining the base phoneme with the highest similarity as the phoneme corresponding to the audio features.
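The matching of an audio feature against a basic phoneme library can be sketched as follows. The text does not say which similarity measure is used; cosine similarity is assumed here purely for illustration, and the three-dimensional reference vectors are made-up placeholders rather than real acoustic features.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def closest_phoneme(feature, phoneme_library):
    """Return the basic phoneme whose reference vector is most similar to feature."""
    return max(phoneme_library,
               key=lambda name: cosine_similarity(feature, phoneme_library[name]))


# Hypothetical basic phoneme library: phoneme name -> reference feature vector.
phoneme_library = {
    "d": [1.0, 0.0, 0.0],
    "ae": [0.0, 1.0, 0.0],
    "t": [0.0, 0.0, 1.0],
}
first = closest_phoneme([0.9, 0.1, 0.0], phoneme_library)
second = closest_phoneme([0.1, 0.2, 0.9], phoneme_library)
```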
As an example, before the sentence including the plurality of words is displayed, the method may further include performing the following processing for the plurality of words corresponding to each phoneme string: sorting the words in ascending order of score, and selecting the top two words from the ascending sorting result; determining the score difference between the two words; and determining, according to the score difference, the recognition certainty degree of the word corresponding to the phoneme string in the initial recognition result, wherein the score difference is positively correlated with the recognition certainty degree.
For example, in fig. 10C, the candidate word "blood wash" that reaches this state is stored in the intermediate node, and the font size or color of the word "learning" is set according to the score difference 7 - 2 = 5 between the word "learning" on the optimal decoding path and the word "blood wash" on the suboptimal decoding path. The larger the score difference, the higher the confidence (that is, the more likely to be correct) of the recognition result (that is, the word "learning" on the optimal decoding path), and the smaller the font size (or the closer the font color is to black); conversely, the smaller the score difference, the larger the font size (or the closer the font color is to red).
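The relationship between the score difference and the display manner can be sketched as follows. The scores 2 and 7 are those of the fig. 10C example; the linear mapping, the font-size range of 12 to 24, and the saturation value `max_gap` are assumptions made for the example, since the text only requires that a larger difference yields a smaller font.

```python
def best_word_and_gap(word_scores):
    """word_scores: dict of word -> score, where a lower score is better.

    Returns the best word and the score difference to the runner-up,
    which serves as the recognition certainty degree.
    """
    ranked = sorted(word_scores.items(), key=lambda item: item[1])
    if len(ranked) < 2:
        return ranked[0][0], float("inf")   # no competitor at all
    (best, best_score), (_, second_score) = ranked[0], ranked[1]
    return best, second_score - best_score


def font_size(gap, min_size=12, max_size=24, max_gap=10):
    """Map a score gap to a font size: the larger the gap, the smaller the font."""
    clipped = max(0.0, min(float(gap), float(max_gap)))
    return max_size - (max_size - min_size) * clipped / max_gap


word, gap = best_word_and_gap({"learning": 2, "blood wash": 7})
size = font_size(gap)
```

A color scale from red to black could be derived from the same clipped gap in the same way.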
In step S1074, a path generation process is performed based on the word graph network to obtain a plurality of decoding paths and a path score corresponding to each decoding path.
In some embodiments, a plurality of decoding paths from a starting node to a terminating node are determined in a word graph network; the following processing is performed for each decoding path: and summing the scores of the plurality of nodes and the scores of the plurality of connecting lines included in the decoding path to obtain the path score of the decoding path.
For example, in fig. 10D, the start node is "0" and the termination node is "4"; following the decoding path 0 -> 1 -> 2 -> 3 -> 4 outputs the complete word "data", and the path score of this decoding path is 1 + 0.5 + 0.3 + 1 = 2.8.
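The path generation of step S1074 can be sketched as an enumeration of all paths through a small word graph, summing node scores and connection scores along each path. The node names `<s>` and `</s>` and the individual node and connection scores below are hypothetical; they were chosen only so that the four totals equal the path scores 6, 7, 11, and 12 quoted for fig. 10C.

```python
def enumerate_paths(edges, node_scores, start, end):
    """Yield (path, score) for every path from start to end in an acyclic word graph.

    edges maps a node to a list of (next_node, connection_score) pairs; the
    path score is the sum of all node scores and connection scores on the path.
    """
    def walk(node, path, score):
        if node == end:
            yield path, score
            return
        for next_node, edge_score in edges.get(node, []):
            yield from walk(next_node, path + [next_node],
                            score + edge_score + node_scores[next_node])

    yield from walk(start, [start], node_scores[start])


node_scores = {"<s>": 0, "learning": 1, "blood wash": 5,
               "king dimension": 2, "king defend": 3, "</s>": 0}
edges = {
    "<s>": [("learning", 1), ("blood wash", 2)],
    "learning": [("king dimension", 1), ("king defend", 1)],
    "blood wash": [("king dimension", 1), ("king defend", 1)],
    "king dimension": [("</s>", 1)],
    "king defend": [("</s>", 1)],
}
path_scores = {" ".join(path[1:-1]): score
               for path, score in enumerate_paths(edges, node_scores, "<s>", "</s>")}
optimal = min(path_scores, key=path_scores.get)
```

Exhaustive enumeration is only practical for a toy graph; a real decoder would prune while searching, as described for fig. 10C.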
Compared with the related art, in which only the optimal decoding path is searched for during voice recognition, the embodiment of the present application also searches for other decoding paths with lower path scores and uses the words in those decoding paths as candidate words. This increases the probability that the correct word is among the candidate words, so that the initial recognition result can be modified quickly and accurately, improving the efficiency of obtaining a recognition result that meets the user's requirements.
Next, a speech recognition processing method provided in the embodiment of the present application will be described by taking as an example that the server 200 and the terminal 400 in fig. 2 cooperate with each other. Referring to fig. 7, fig. 7 is a flowchart illustrating a speech recognition processing method according to an embodiment of the present application, and will be described with reference to the steps shown in fig. 7.
In step S201, the server performs speech recognition on the speech in the multimedia file to obtain a sentence including a plurality of words.
Wherein a plurality of words are used as initial recognition results.
In step S202, the server transmits a sentence including a plurality of words to the client.
In step S203, the client displays a sentence including a plurality of words in accordance with a display mode corresponding to the recognition certainty degree of each word.
In step S204, the client displays candidate words of at least one word.
In step S205, in response to the selection operation, the client replaces a word in the sentence at the same position as the selected candidate word according to the selected candidate word.
It should be noted that the specific implementation manner in step S201 and steps S203 to S205 is similar to the embodiment included in steps S101 to S104, and will not be described again here.
In the embodiment of the present application, the server has stronger computing capability and higher computing speed than the terminal. Completing the voice recognition process on the server can therefore increase the speed at which the terminal displays the recognition result and reduce the consumption of the terminal's computing resources.
The speech recognition processing method provided by the embodiment of the present application is described below by taking a subtitle editing application scenario as an example.
Referring to fig. 8A, fig. 8A is a schematic view of an application scenario of the speech recognition processing method provided in the embodiment of the present application. The subtitle text editing box in fig. 8A differs from the text editing box provided in the related art (see fig. 1A and fig. 1B). Specifically, wrong words in the ASR result can be marked by rendering different recognition certainty degrees in different font sizes (or in different colors), and the user can correct a wrong word by selecting a candidate word instead of typing.
Referring to fig. 9, fig. 9 is a schematic flowchart of a speech recognition processing method according to an embodiment of the present application, which will be described with reference to fig. 8A and fig. 9.
In step S901, a video file without subtitles uploaded by a user is acquired.
In step S902, a subtitle corresponding to the video file is generated.
In some embodiments, words in the caption are marked with different font sizes or different colors, wherein a smaller font (or darker color) represents a higher recognition confidence for the word, and a larger font (or redder color) represents a lower recognition confidence for the word.
In step S903, candidate words are displayed.
In some embodiments, in fig. 8A, when a user hovers a mouse over any word (including words with higher confidence), a hover box 801 is presented and several candidate words are presented in the hover box 801.
In step S904, in response to the selection operation for the candidate word, correction of the erroneous word is completed.
In some embodiments, when the user clicks on the candidate word shown in the floating box 801, the correction of the incorrect word can be completed.
Next, a specific process of speech recognition is described, referring to fig. 10A, 10B, 10C, and 10D, where fig. 10A, 10B, 10C, and 10D are schematic diagrams illustrating a speech recognition processing method according to an embodiment of the present application.
In fig. 10A, first, the speech waveform is sliced into time slices of 25 ms to 40 ms; then, the phoneme corresponding to each slice is identified, where a phoneme is the smallest unit of speech division, and each vowel and initial consonant in Mandarin Chinese corresponds to a phoneme, for example, "each" consists of the two phonemes "G" and "E"; finally, a Lattice network is generated, in which each node is a character or a word, and the nodes are connected by edges representing costs to form a directed acyclic graph.
In some embodiments, the cost between two nodes (that is, the edge described above) is derived from the scores of the acoustic model and the language model: the acoustic model describes how likely the speech waveform is to be the pronunciation of the node, and the language model describes how likely the connected nodes are to form a fluent sentence. After the Lattice network is obtained, a plurality of decoding paths can be obtained by going from the starting node to the terminating node along any path, and the path with the minimum total cost (that is, the path score) is the optimal decoding path.
In some embodiments, fig. 10B is a functional architecture diagram of speech recognition, wherein the functional architecture of speech recognition in fig. 10B consists of the following input and output:
voice waveform -> feature extractor -> audio features;
audio features -> speech decoder -> Lattice network;
Lattice network -> result generator -> recognition result with candidate words.
Here, the Lattice network is a directed acyclic graph, and the result generator is used for converting the directed acyclic graph into: word 1, word 2, word 3, word 4 (candidate word 4.1, candidate word 4.2) word 5 (candidate word 5.1) such that when the user triggers word 5, the candidate word 5.1 is presented.
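The output of the result generator can be pictured as a list of words, each carrying its own candidate list. The sketch below is one possible in-memory shape for that result; the class `RecognizedWord` and the functions `candidates_on_focus` and `select_candidate` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RecognizedWord:
    text: str
    candidates: list = field(default_factory=list)   # alternative words for this position


def candidates_on_focus(result, index):
    """Return the candidate words to present when the word at index is triggered."""
    return list(result[index].candidates)


def select_candidate(result, index, candidate):
    """Replace the word at index with one of its own candidate words."""
    word = result[index]
    if candidate not in word.candidates:
        raise ValueError("not a candidate for this position")
    word.text = candidate


result = [
    RecognizedWord("word 1"),
    RecognizedWord("word 2"),
    RecognizedWord("word 3"),
    RecognizedWord("word 4", ["candidate word 4.1", "candidate word 4.2"]),
    RecognizedWord("word 5", ["candidate word 5.1"]),
]
shown = candidates_on_focus(result, 4)          # the user triggers word 5
select_candidate(result, 4, "candidate word 5.1")
```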
The core of the embodiment of the present application is that the recognition result with candidate words is generated by the result generator. In the speech decoding process of ASR, the presented initial recognition result is the optimal decoding path on the Lattice network. If only the optimal decoding path (the one with the minimum total cost) is taken, the word error rate (WER) is about 0.1; if the three decoding paths with the lowest total cost are taken, the WER is about 0.05. Therefore, the words on the suboptimal decoding paths are presented to the user when the mouse hovers over a word, and the presented candidate words have a high probability of including the correct word. Fig. 10C shows some of the words selected in the Lattice network, and the following description refers to fig. 10C.
In fig. 10C, an edge represents the cost of following one of the decoding paths, and a higher cost indicates that the path is less likely to be taken. The black dots represent recognition states; the recognition process jumps between the dots from the starting node to the terminating node. Each square represents a token, which contains all the words recognized up to the current state and their corresponding costs, where a cost on a token is the cost of the whole sentence up to the recognition of that word. When the state jumps between black dots, the costs on the edges are accumulated onto the token.
The speech decoding determines the shortest decoding path in fig. 10C and then outputs the decoding path recorded in the token with the smallest cost. In the related art, only the optimal decoding path is found. For example, when the "learning" node, which is a converging node, is reached in fig. 10C, the token with the minimum cost is known; the other results are then discarded (that is, pruned) and decoding continues onward. In the embodiment of the present application, the candidate word "blood wash" that reaches this state is instead stored in this node and added to the candidate word set of the recognition result, and the font size or color of the word "learning" is set according to the cost difference 7 - 2 = 5 between the word "learning" on the optimal decoding path and the word "blood wash" on the suboptimal decoding path. The larger the cost difference, the higher the confidence (that is, the more likely to be correct) of the recognition result (that is, the word "learning" on the optimal decoding path), and the smaller the font size (or the closer the font color is to black); conversely, the smaller the cost difference, the larger the font size (or the closer the font color is to red).
For example, when the decoding path reaches the first state in the upper left corner, the cost of "learning" is 1 and the cost of "blood wash" is 5, which indicates that the current position is most likely "learning" (its cost is the smallest). The numbers "1" and "2" on the edges along the decoding path represent the cost added by advancing along that path. At the middle node, the cost of "learning" is 1 + 1 = 2 and the cost of "blood wash" is 5 + 2 = 7. At this point, "blood wash" is retained as a candidate word, while the more costly "learning wash" (8 + 1 = 9) and "snow wash" (6 + 2 = 8) are discarded. In this way, several candidate words are retained.
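The pruning at a converging node can be sketched with the costs of this example. The function `merge_tokens` and the parameter `keep_candidates` are hypothetical; the sketch keeps the cheapest arriving token as the recognized word, retains the next-cheapest ones as candidates, and discards the rest.

```python
def merge_tokens(arrivals, keep_candidates=1):
    """arrivals: list of (word, token_cost, edge_cost) reaching the same state.

    Returns (best, kept, discarded), where best is the cheapest (word, cost),
    kept holds the retained candidate words, and discarded holds pruned words.
    """
    totals = sorted(((token_cost + edge_cost, word)
                     for word, token_cost, edge_cost in arrivals))
    ranked = [(word, cost) for cost, word in totals]   # cheapest first
    best = ranked[0]
    kept = ranked[1:1 + keep_candidates]
    discarded = ranked[1 + keep_candidates:]
    return best, kept, discarded


arrivals = [
    ("learning", 1, 1),        # 1 + 1 = 2
    ("blood wash", 5, 2),      # 5 + 2 = 7
    ("learning wash", 8, 1),   # 8 + 1 = 9
    ("snow wash", 6, 2),       # 6 + 2 = 8
]
best, kept, discarded = merge_tokens(arrivals)
cost_difference = kept[0][1] - best[1]
```

Setting `keep_candidates=0` reproduces the behaviour attributed to the related art, in which everything but the cheapest token is pruned.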
Fig. 10D shows the decoding process of the speech decoder, which takes the audio features as input and feeds them into a decoding graph (fig. 10D shows part of the decoding graph) to generate the Lattice network. The decoding graph in fig. 10D outputs the recognized word "data" into the Lattice network, where "epsilon" denotes a null output, and the word outputs together with the null outputs form the Lattice network. The "0" in a circle is the start state and "4" is the termination state; "d: data/1" means d (input phoneme): data (word output to the Lattice network)/1 (cost). If the input is the sequence of four phonemes d/ae/t/ax, the decoder can jump through 0 -> 1 -> 2 -> 3 -> 4 to the termination state and output the complete word "data". Each input phoneme is determined by calculating the similarity between the audio feature and each basic phoneme and taking the basic phoneme with the highest similarity. In summary, the cost of the word "data" is 1 + 0.5 + 0.3 + 1 = 2.8. In the actual decoding process, the Lattice network is a graph with a structure similar to that of fig. 10D but on a much larger scale; all the words output by the search form this graph, namely the Lattice network.
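The traversal of the decoding graph fragment in fig. 10D can be sketched as a walk over labelled arcs. Only the first arc, "d: data/1", is spelled out in the text; the assignment of the costs 0.5, 0.3, and 1 to the arcs for "ae", "t", and "ax" follows the order of the sum 1 + 0.5 + 0.3 + 1 and is an assumption, as are the names `ARCS` and `decode`.

```python
EPSILON = None   # null output: the arc consumes a phoneme but emits no word

# state -> {input phoneme: (next state, output word or EPSILON, cost)}
ARCS = {
    0: {"d": (1, "data", 1.0)},
    1: {"ae": (2, EPSILON, 0.5)},
    2: {"t": (3, EPSILON, 0.3)},
    3: {"ax": (4, EPSILON, 1.0)},
}


def decode(phonemes, arcs=ARCS, start=0, final=4):
    """Follow the arcs for a phoneme sequence.

    Returns (output words, total cost) if the termination state is reached,
    otherwise None.
    """
    state, outputs, cost = start, [], 0.0
    for phoneme in phonemes:
        transitions = arcs.get(state, {})
        if phoneme not in transitions:
            return None                      # no arc accepts this phoneme
        state, output, arc_cost = transitions[phoneme]
        if output is not EPSILON:
            outputs.append(output)
        cost += arc_cost
    return (outputs, cost) if state == final else None


words, total_cost = decode(["d", "ae", "t", "ax"])
incomplete = decode(["d", "ae"])
```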
According to the embodiment of the application, the selection mode is used for replacing key input aiming at the wrong words recognized in the voice recognition process, so that the efficiency of subtitle editing personnel in subtitle production can be greatly improved.
An exemplary structure of a speech recognition processing apparatus provided in an embodiment of the present application, which is implemented as a software module, is described below with reference to fig. 3.
In some embodiments, as shown in fig. 3, the software modules of the speech recognition processing device 455 stored in the memory 450 may include:
a recognition module 4551, configured to perform speech recognition on speech in a multimedia file to obtain a sentence including a plurality of words, where the plurality of words are used as an initial recognition result;
a display module 4552 configured to display a sentence including a plurality of words in a display manner corresponding to the recognition determination degree of each word;
the display module 4552 being further configured to display a candidate word of at least one of the words;
and a replacing module 4553, configured to replace, in response to a selection operation, the word in the sentence that is at the same position as the selected candidate word with the selected candidate word.
In the above solution, the display module 4552 is further configured to display, in response to a focusing operation on at least one word in the sentence, the corresponding at least one candidate word at a position adjacent to the at least one word; alternatively, candidate words for at least one word in the sentence are automatically displayed.
In the above scheme, the display module 4552 is further configured to display the words with the recognition certainty degrees lower than the degree threshold in a manner different from the display manner of the certainty words in the sentence; wherein the certainty word is a word whose recognition certainty degree is not lower than the degree threshold.
In the above scheme, the display module 4552 is further configured to determine the degree of significance of the display manner corresponding to each word according to the recognition certainty degree of each word, and display the sentence according to the degree of significance of the display manner corresponding to each word; wherein there is a negative correlation between the degree of significance and the recognition certainty degree.
In the above scheme, the replacing module 4553 is further configured to automatically replace at least one word in the sentence that is the same as the replaced word with the selected candidate word, and apply a display manner of the selected candidate word to a display manner consistent with the deterministic word; wherein the certainty word is a word whose recognition certainty degree is not lower than the degree threshold.
In the above solution, the replacement module 4553 is further configured to present global replacement prompt information, where the global replacement prompt information is used to prompt global replacement to be performed based on the selected candidate word; and in response to the confirmation operation aiming at the global replacement prompt information, all the words which are in at least one sentence identified from the multimedia file and are the same as the words replaced by the selected candidate words are replaced by the selected candidate words.
In the above solution, when a plurality of sentences corresponding to one another are recognized from a plurality of voices in a multimedia file, the voice recognition processing device 455 further includes: the sorting module is used for determining a sorting mode adopted when the plurality of sentences are displayed according to at least one of the following modes: sequencing a plurality of sentences in an ascending order according to the recognition certainty degrees of the sentences, wherein the recognition certainty degree of the sentences is the sum of the recognition certainty degrees of all words in the sentences; sequencing the sentences according to the sequence of the voice corresponding to the sentences appearing in the multimedia file; and sorting the plurality of sentences according to the occurrence frequency of the included uncertain words in the plurality of sentences in a descending order, wherein the uncertain words are words with the recognition determination degree lower than the degree threshold.
In the above solution, the sorting module is further configured to update and display the sorting of the plurality of sentences according to at least one of the following manners: updating the ascending order of the recognition certainty degrees of the plurality of sentences; updating a descending order of frequency of occurrence of the uncertainty terms included in the plurality of statements.
In the above scheme, the display module 4552 is further configured to apply the display mode of the selected candidate word to a display mode consistent with the deterministic word; wherein the certainty word is a word whose recognition certainty degree is not lower than the degree threshold.
In the above solution, when the word has a plurality of candidate words, the display module 4552 is further configured to determine a sorting manner of the plurality of candidate words according to at least one of the following: sorting the candidate words in descending order of their recognition certainty degree; sorting the candidate words in descending order of the number of times they have been selected; and to display the plurality of candidate words according to the sorting manner.
In the foregoing solution, the identifying module 4551 is further configured to determine, as an optimal decoding path, a decoding path with a lowest path score among a plurality of decoding paths obtained by performing speech recognition on speech in a multimedia file; determining a plurality of words included in the optimal decoding path as an initial recognition result.
In the foregoing scheme, the identifying module 4551 is further configured to sort the multiple decoding paths in an ascending order according to the path scores, and select a previous partial decoding path except the optimal decoding path from the ascending sorting result; and determining the words which are at the same position as at least one word in the selected decoding path as candidate words.
In the above scheme, the identifying module 4551 is further configured to perform speech recognition on speech in a multimedia file to obtain a plurality of decoding paths and a path score corresponding to each decoding path; wherein each decoding path includes a plurality of words.
In the above scheme, the identifying module 4551 is further configured to perform framing processing on the voice in the multimedia file to obtain a plurality of audio subdata; extracting acoustic features of each audio subdata to obtain a plurality of audio features; performing word graph generation processing on the plurality of audio features to obtain a word graph network; based on the word graph network, path generation processing is performed to obtain a plurality of decoding paths and a path score corresponding to each decoding path.
In the above solution, the identifying module 4551 is further configured to determine a plurality of decoding paths from the start node to the end node in the word graph network; the following processing is performed for each decoding path: and summing the scores of the plurality of nodes and the scores of the plurality of connecting lines included in the decoding path to obtain the path score of the decoding path.
In the foregoing solution, the identifying module 4551 is further configured to convert each audio feature into a corresponding phoneme to obtain a plurality of phonemes; combining the plurality of phonemes to obtain a plurality of phoneme strings; identifying each phoneme string through an acoustic model to obtain a plurality of words corresponding to each phoneme string and a score of each word; taking the obtained words as nodes in a word graph network, and determining the scores of the words as the scores of the corresponding nodes; connecting nodes corresponding to adjacent phoneme strings, and identifying words corresponding to two nodes of each connection line through a language model to obtain a score of each connection line; and the score of the connecting line represents the probability that the words corresponding to the two nodes of the connecting line are connected together to form a complete sentence.
In the above scheme, the identifying module 4551 is further configured to perform the following processing on a plurality of words corresponding to each phoneme string: and sorting the plurality of words in an ascending order according to the scores of the words, and filtering out the words at the back in the ascending order result.
In the foregoing solution, the identifying module 4551 is further configured to perform the following processing for each audio feature: matching the audio feature against each basic phoneme in the basic phoneme library to determine the similarity between the audio feature and each basic phoneme; and determining the basic phoneme with the highest similarity as the phoneme corresponding to the audio feature.
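The matching against the basic phoneme library can be sketched as a nearest-template lookup. The application does not name a similarity measure; cosine similarity is used here purely as an assumption, and the library layout (phoneme name mapped to a reference feature vector) and function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_phoneme(feature: Sequence[float], phoneme_library: Dict[str, Sequence[float]]) -> str:
    """Return the basic phoneme whose reference vector is most similar to `feature`.

    Assumption (not from the source): similarity is measured by cosine similarity.
    """
    if not phoneme_library:
        raise ValueError("phoneme_library must not be empty")
    return max(
        phoneme_library,
        key=lambda name: cosine_similarity(feature, phoneme_library[name]),
    )
```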
In the above solution, the identifying module 4551 is further configured to perform the following processing on the plurality of words corresponding to each phoneme string: sorting the words in ascending order according to their scores, and selecting the first two words from the sorted result; determining a score difference between the two words; and determining, according to the score difference, the recognition certainty degree of the word corresponding to the phoneme string in the initial recognition result, wherein the score difference is positively correlated with the recognition certainty degree.
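The relation between the score difference and the recognition certainty degree can be sketched as follows. The application only requires a positive correlation; the mapping `1 - exp(-gap / scale)` is one possible monotonic choice made here as an assumption, as are the name `recognition_certainty`, the `scale` parameter, and the convention that a higher score means a better match.

```python
import math
from typing import Dict


def recognition_certainty(word_scores: Dict[str, float], scale: float = 1.0) -> float:
    """Map the gap between the two best-scoring words to a certainty in [0, 1].

    Assumptions (not from the source): higher score = better match, and the
    gap is squashed with 1 - exp(-gap / scale), which grows with the gap.
    """
    ranked = sorted(word_scores.values(), reverse=True)
    if len(ranked) < 2:
        return 1.0  # a single candidate has no competitor
    gap = ranked[0] - ranked[1]
    return 1.0 - math.exp(-gap / scale)
```

A word whose best candidate barely beats the runner-up receives a certainty near zero and would be displayed more conspicuously, while a clear winner receives a higher certainty.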
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the speech recognition processing method described in the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, cause the processor to execute the speech recognition processing method provided by the embodiments of the present application, for example, the speech recognition processing methods shown in FIG. 4, FIG. 5, FIG. 6A, FIG. 6B, FIG. 7 and FIG. 9. Here, the computer includes various computing devices, including intelligent terminals and servers.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, the computer-executable instructions may be in the form of programs, software modules, scripts or code written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and they may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, computer-executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, e.g., in one or more scripts in a hypertext markup language document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, computer-executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, the embodiments of the present application have the following beneficial effects:
(1) Based on the recognition certainty degree of each word in the initial recognition result, the words in the initial recognition result are displayed in differentiated manners, so that words with a lower recognition certainty degree are flagged and the user can conveniently modify them. Compared with the related art, in which the user has to check the text and correct wrong words unaided, the embodiments of the present application allow the initial recognition result to be modified quickly and accurately, which improves the efficiency of obtaining a recognition result that meets the user's requirements.
(2) The correct word is present among the candidate words with a relatively high probability. Displaying the candidate words to the user lets the user conveniently select the correct word for correction, which improves the efficiency of obtaining a recognition result that meets the user's requirements.
(3) Compared with the related art, in which wrong words are corrected by typing in the corrected words, the correction operation is simpler, and the initial recognition result can be corrected quickly and accurately, which improves the efficiency of obtaining a recognition result that meets the user's requirements.
(4) Words that have already been corrected may no longer be flagged to the user, which prevents the user from confusing corrected words with uncorrected words during correction and thereby improves the user's correction efficiency.
(5) Sentences with a lower recognition certainty degree can be displayed to the user first, so that the user can correct them first; sentences can be displayed to the user along the time axis, so that the user can correct sentences containing wrong words in the order in which the corresponding voice occurs in the multimedia file; and sentences containing more wrong words can be displayed to the user first, so that the user can correct them first.
(6) Compared with the related art, in which only the optimal decoding path is searched during speech recognition, other decoding paths with lower path scores are also searched, and the words in those decoding paths are used as candidate words. This raises the probability that the correct word is among the candidate words, so that the initial recognition result can be modified quickly and accurately, which improves the efficiency of obtaining a recognition result that meets the user's requirements.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.