Speech recognition method and related devices thereof
1. A method of speech recognition, the method comprising:
acquiring a current speech segment and a reference speech corresponding to the current speech segment; wherein the acquisition time of the reference speech is later than that of the current speech segment;
coding the current speech segment according to state data to be used and the reference speech corresponding to the current speech segment to obtain a speech code of the current speech segment and coding state data of the current speech segment;
and decoding the speech code of the current speech segment to obtain a speech text corresponding to the current speech segment, and updating the state data to be used by using the coding state data of the current speech segment.
2. The method of claim 1, wherein the determining of the speech coding comprises:
respectively extracting features from the current speech segment and the reference speech corresponding to the current speech segment to obtain a speech feature of the current speech segment and a reference feature corresponding to the current speech segment;
forward encoding the speech feature of the current speech segment according to the state data to be used to obtain a forward encoding result of the current speech segment;
reversely encoding the speech feature of the current speech segment according to the reference feature corresponding to the current speech segment to obtain a reverse encoding result of the current speech segment;
and splicing the forward encoding result of the current speech segment and the reverse encoding result of the current speech segment to obtain the speech code of the current speech segment.
3. The method of claim 2, wherein the determining of the reverse encoding result comprises:
reversely encoding the reference feature corresponding to the current speech segment to obtain reverse initial state data corresponding to the current speech segment;
and reversely encoding the speech feature of the current speech segment according to the reverse initial state data corresponding to the current speech segment to obtain the reverse encoding result of the current speech segment.
4. The method according to claim 2, wherein said reversely encoding the speech feature of the current speech segment according to the reference feature corresponding to the current speech segment to obtain the reverse encoding result of the current speech segment comprises:
inputting the speech feature of the current speech segment and the reference feature corresponding to the current speech segment into a pre-constructed Simple Recurrent Unit (SRU) network, and obtaining the reverse encoding result of the current speech segment output by the SRU network.
5. The method of claim 1, wherein the determining the encoding status data comprises:
extracting features from the current speech segment to obtain the speech feature of the current speech segment;
and forward encoding the speech feature of the current speech segment according to the state data to be used to obtain the coding state data of the current speech segment.
6. The method according to claim 1, wherein if the current speech segment and the reference speech corresponding to the current speech segment are collected according to a preset window size, and the preset window size includes a recognition window size and a reference window size, the collection process of the current speech segment and the reference speech corresponding to the current speech segment includes:
collecting the current speech segment according to the recognition window size;
determining a reference data acquisition time period according to the reference window size and the acquisition ending time point of the current speech segment;
and acquiring the reference speech corresponding to the current speech segment according to the reference data acquisition time period.
7. A speech recognition apparatus, comprising:
a speech acquisition unit, configured to acquire a current speech segment and a reference speech corresponding to the current speech segment; wherein the acquisition time of the reference speech is later than that of the current speech segment;
a speech encoding unit, configured to encode the current speech segment according to state data to be used and the reference speech corresponding to the current speech segment to obtain a speech code of the current speech segment and coding state data of the current speech segment;
a speech decoding unit, configured to decode the speech code of the current speech segment to obtain a speech text corresponding to the current speech segment;
and a data updating unit, configured to update the state data to be used by using the coding state data of the current speech segment.
8. An apparatus, characterized in that the apparatus comprises: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is configured to store one or more programs, the one or more programs comprising instructions which, when executed by the processor, cause the processor to perform the method of any one of claims 1 to 6.
9. A computer-readable storage medium having stored therein instructions which, when run on a terminal device, cause the terminal device to perform the method of any one of claims 1 to 6.
10. A computer program product, characterized in that it, when run on a terminal device, causes the terminal device to perform the method of any one of claims 1 to 6.
Background
With the development of speech recognition technology, its application scenarios have become increasingly extensive. For example, speech recognition techniques may be applied to voice input methods, voice assistants, listening conference systems, and so forth.
However, related speech recognition technologies have shortcomings that result in poor real-time performance of speech recognition processes based on them.
Disclosure of Invention
The embodiment of the present application mainly aims to provide a speech recognition method and related devices thereof, which can effectively improve the real-time performance of speech recognition.
The embodiment of the application provides a voice recognition method, which comprises the following steps:
acquiring a current voice section and a reference voice corresponding to the current voice section; wherein the acquisition time of the reference voice is later than that of the current voice section;
coding the current voice segment according to the state data to be used and the reference voice corresponding to the current voice segment to obtain the voice code of the current voice segment and the coding state data of the current voice segment;
and decoding the voice code of the current voice section to obtain a voice text corresponding to the current voice section, and updating the data of the state to be used by using the data of the coding state of the current voice section.
In one possible embodiment, the determining process of the speech coding includes:
respectively extracting the characteristics of the current voice section and the reference voice corresponding to the current voice section to obtain the voice characteristics of the current voice section and the reference characteristics corresponding to the current voice section;
forward coding the voice characteristics of the current voice section according to the to-be-used state data to obtain a forward coding result of the current voice section;
reversely encoding the voice characteristics of the current voice section according to the reference characteristics corresponding to the current voice section to obtain a reverse encoding result of the current voice section;
and splicing the forward coding result of the current voice section and the reverse coding result of the current voice section to obtain the voice coding of the current voice section.
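The splicing step above can be sketched as follows. This is a minimal illustration in Python (frame encodings are plain lists, and the forward and reverse encoders that produce them are out of scope here), not the application's own implementation:

```python
def splice(forward_results, reverse_results):
    """Concatenate each frame's forward and reverse encoding vectors into
    that frame's final speech code (both lists must cover the same frames)."""
    assert len(forward_results) == len(reverse_results)
    return [f + r for f, r in zip(forward_results, reverse_results)]  # list concat
```

For example, splicing a 2-dimensional forward encoding with a 1-dimensional reverse encoding yields a 3-dimensional speech code per frame.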
In a possible implementation, the determining of the reverse encoding result includes:
reversely encoding the reference characteristics corresponding to the current voice section to obtain reverse initial state data corresponding to the current voice section;
and reversely encoding the voice characteristics of the current voice section according to the reverse initial state data corresponding to the current voice section to obtain a reverse encoding result of the current voice section.
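The two-stage reverse encoding described above can be sketched as follows, assuming a generic recurrent update `step(x, state) -> state` (a placeholder, since the text does not fix the recurrent cell at this point):

```python
def reverse_encode(segment_feats, reference_feats, step, zero_state):
    """Two-stage reverse encoding: run a recurrent step right-to-left over the
    reference features to obtain the reverse initial state data, then continue
    right-to-left over the current segment's features from that state."""
    state = zero_state
    for x in reversed(reference_feats):      # stage 1: reverse initial state data
        state = step(x, state)
    outputs = []
    for x in reversed(segment_feats):        # stage 2: reverse-encode the segment
        state = step(x, state)
        outputs.append(state)
    outputs.reverse()                        # restore chronological frame order
    return outputs
```

The key point the sketch shows is that the reference features are consumed first, so every frame of the current segment is encoded with the future context already folded into the state.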
In a possible implementation manner, the inversely encoding the speech feature of the current speech segment according to the reference feature corresponding to the current speech segment to obtain an inversely encoded result of the current speech segment includes:
and inputting the speech feature of the current speech segment and the reference feature corresponding to the current speech segment into a pre-constructed Simple Recurrent Unit (SRU) network, and obtaining the reverse encoding result of the current speech segment output by the SRU network.
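A minimal single-unit SRU run in the reverse direction might look like the sketch below. The gate equations follow the published Simple Recurrent Unit formulation (with a highway output), which the text names but does not spell out, and scalar weights are used purely for illustration; the real network would be multi-dimensional and multi-layer:

```python
import math

def sru_reverse(xs, w, wf, bf, wr, br, c_init=0.0):
    """Run a minimal single-unit SRU right-to-left over scalar features.
    c_init plays the role of the reverse initial state data obtained
    from the reference features."""
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    c, hs = c_init, []
    for x in reversed(xs):                    # right-to-left pass
        x_tilde = w * x                       # candidate state
        f = sigmoid(wf * x + bf)              # forget gate
        r = sigmoid(wr * x + br)              # reset/highway gate
        c = f * c + (1.0 - f) * x_tilde       # internal state update
        h = r * math.tanh(c) + (1.0 - r) * x  # output with highway connection
        hs.append(h)
    hs.reverse()                              # restore chronological order
    return hs, c
```

A design note: unlike an LSTM, the SRU's gates depend only on the current input, so the per-frame matrix work can be batched, which suits the low-latency setting this application targets.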
In a possible implementation, the determining of the encoding status data includes:
extracting the characteristics of the current voice section to obtain the voice characteristics of the current voice section;
and forward coding the voice characteristics of the current voice section according to the to-be-used state data to obtain the coding state data of the current voice section.
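The forward pass above, including how the segment's coding state data falls out of it, can be sketched as follows (again with a placeholder recurrent update):

```python
def forward_encode(segment_feats, state_to_use, step):
    """Forward encoding: start from the 'state data to be used' (carried over
    from the previous round) and step through the segment's frames left to
    right. The final state is the segment's coding state data, which the
    next round will adopt as its state data to be used."""
    outputs = []
    state = state_to_use
    for x in segment_feats:
        state = step(x, state)
        outputs.append(state)
    return outputs, state   # (forward encoding result, coding state data)
```

Note that the coding state data is a by-product of the same pass that yields the forward encoding result, so no extra computation is needed to prepare the next round.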
In a possible implementation manner, if the current speech segment and the reference speech corresponding to the current speech segment are collected according to a preset window size, and the preset window size includes a recognition window size and a reference window size, the collection process of the current speech segment and the reference speech corresponding to the current speech segment includes:
collecting the current speech segment according to the recognition window size;
determining a reference data acquisition time period according to the reference window size and the acquisition ending time point of the current speech segment;
and acquiring the reference speech corresponding to the current speech segment according to the reference data acquisition time period.
An embodiment of the present application further provides a speech recognition apparatus, including:
the voice acquisition unit is used for acquiring a current voice section and a reference voice corresponding to the current voice section; wherein the acquisition time of the reference voice is later than that of the current voice section;
the voice coding unit is used for coding the current voice section according to the state data to be used and the reference voice corresponding to the current voice section to obtain the voice code of the current voice section and the coding state data of the current voice section;
the voice decoding unit is used for decoding the voice codes of the current voice section to obtain a voice text corresponding to the current voice section;
and the data updating unit is used for updating the to-be-used state data by utilizing the coding state data of the current voice section.
An embodiment of the present application further provides an apparatus, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is used for storing one or more programs, and the one or more programs comprise instructions which, when executed by the processor, cause the processor to execute any implementation of the speech recognition method provided by the embodiment of the application.
The embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device is caused to execute any implementation of the voice recognition method provided in the embodiment of the present application.
The embodiment of the present application further provides a computer program product, and when the computer program product runs on a terminal device, the terminal device is enabled to execute any implementation manner of the voice recognition method provided by the embodiment of the present application.
Based on the technical scheme, the method has the following beneficial effects:
according to the voice recognition method provided by the application, after a current voice section and a reference voice corresponding to the current voice section are obtained, coding processing is carried out on the current voice section according to state data to be used and the reference voice corresponding to the current voice section to obtain a voice code of the current voice section and coding state data of the current voice section; and then decoding the speech coding of the current speech section to obtain a speech text corresponding to the current speech section, and updating the data of the state to be used by using the coding state data of the current speech section so as to perform coding processing by using the updated data of the state to be used in the next round of speech recognition process.
In this method, the current speech segment represents speech data collected in real time by the sound pickup equipment from the user's voice stream, so speech recognition can be performed on the data as it is collected rather than after a complete utterance has been gathered.
The historical voice information of the current voice section can be accurately represented by the to-be-used state data, and the reference voice corresponding to the current voice section can accurately represent the future voice information of the current voice section, so that the voice coding determined by referring to the to-be-used state data and the reference voice (namely, referring to the context information of the current voice section) can more accurately represent the voice information carried by the current voice section, and the voice recognition accuracy is improved.
And because the to-be-used state data is already calculated in the historical speech recognition process (namely, the process of performing speech recognition on the historical speech corresponding to the current speech segment), the to-be-used state data can be directly used in the current round of speech recognition process without recalculating the to-be-used state data, so that the time consumption of speech recognition on the current speech can be effectively reduced, the speech recognition efficiency on the current speech can be effectively improved, and the real-time performance of the speech recognition can be further improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present application;
fig. 2 is a schematic diagram illustrating a segmented acquisition of a user voice stream according to an embodiment of the present application;
fig. 3 is a schematic diagram of an encoding process according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application.
Detailed Description
The inventor of the present invention has found in research on speech recognition that some speech recognition technologies (e.g., speech recognition methods based on Bidirectional Long Short-Term Memory (BLSTM) networks) can only perform speech recognition on speech data carrying a whole sentence of speech text (e.g., "How is the weather today"). As a result, these technologies can begin speech recognition only after the sound pickup equipment has collected the speech data carrying the whole sentence of speech text. The speech recognition time of these technologies therefore includes not only the processing time consumed in performing speech recognition on the speech data, but also the waiting time consumed in collecting speech data with a complete sentence of speech text. Speech recognition with these technologies thus takes a long time, resulting in poor real-time performance of speech recognition processes based on them.
Based on the above findings, in order to solve the technical problems in the background art section, an embodiment of the present application provides a speech recognition method, including: acquiring a current voice section and a reference voice corresponding to the current voice section; coding the current voice section according to the state data to be used and the reference voice corresponding to the current voice section to obtain the voice coding of the current voice section and the coding state data of the current voice section; decoding the speech coding of the current speech section to obtain a speech text corresponding to the current speech section, and updating the data of the state to be used by using the coding state data of the current speech section.
Because the current speech segment represents the speech data collected by the sound pickup equipment in real time from the user's voice stream, the speech recognition method provided by the application can perform real-time speech recognition on the speech data as it is collected. This achieves the goal of recognizing speech while it is still being collected, effectively avoids the waiting time caused by collecting speech data that carries a whole sentence of speech text, and effectively improves the real-time performance of speech recognition.
The historical voice information of the current voice section can be accurately represented by the to-be-used state data, and the reference voice corresponding to the current voice section can accurately represent the future voice information of the current voice section, so that the voice coding determined by referring to the to-be-used state data and the reference voice (namely, referring to the context information of the current voice section) can more accurately represent the voice information carried by the current voice section, and the voice recognition accuracy is improved. In addition, since the to-be-used state data is already calculated in the historical speech recognition process (that is, the process of performing speech recognition on the historical speech corresponding to the current speech segment), the to-be-used state data can be directly used in the current round of speech recognition process without recalculating the to-be-used state data, so that the time consumption of speech recognition on the current speech can be effectively reduced, the speech recognition efficiency on the current speech can be effectively improved, and the real-time performance of the speech recognition can be improved.
In addition, the embodiment of the present application does not limit the execution subject of the voice recognition method, and for example, the voice recognition method provided by the embodiment of the present application may be applied to a data processing device such as a terminal device or a server. The terminal device may be a smart phone, a computer, a Personal Digital Assistant (PDA), a tablet computer, or the like. The server may be a stand-alone server, a cluster server, or a cloud server.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Method embodiment
Referring to fig. 1, a flowchart of a speech recognition method according to an embodiment of the present application is shown.
The voice recognition method provided by the embodiment of the application comprises the following steps of S1-S4:
S1: Acquire a current speech segment and a reference speech corresponding to the current speech segment.
The current voice segment is used to represent voice data collected by a sound pickup device (e.g., a microphone) in real time from a user voice stream (e.g., a user voice stream carrying "How is the weather today"), such as the first voice segment, the second voice segment, ..., or the eighth voice segment shown in fig. 2. Additionally, the current voice segment may include N_c frames of voice data, where N_c is a positive integer.
The reference voice corresponding to the current voice section is used for representing voice data which needs to be referenced when voice recognition is carried out on the current voice section; and the collection time of the reference voice is later than that of the current voice segment, so that the reference voice is used for representing the future voice information of the current voice segment. That is, the reference speech corresponding to the current speech segment may include a future speech segment corresponding to the current speech segment. For example, if the current speech segment is "the first speech segment" in fig. 2, the reference speech corresponding to the current speech segment may be "the second speech segment" in fig. 2.
In addition, the reference speech corresponding to the current speech segment may include N_r frames of voice data, where N_r is a positive integer. The embodiments of the present application do not limit the magnitude relationship between N_c and N_r.
In addition, the embodiment of the present application does not limit the collecting process of the sound pickup apparatus for the current voice segment and the reference voice corresponding to the current voice segment, for example, the sound pickup apparatus may collect the current voice segment and the reference voice corresponding to the current voice segment according to a preset window size.
The preset window size is used to indicate a size of a collection window that is required to be used when the sound pickup apparatus collects various voice segments (e.g., a current voice segment and a reference voice corresponding to the current voice segment) from a voice stream of a user.
In addition, the preset window size is not limited in the embodiments of the present application; for example, the preset window size may include a recognition window size and a reference window size. The recognition window size is used to indicate the size of the collection window that the sound pickup equipment needs to use when collecting, in real time from the user's voice stream, a speech segment that requires speech recognition processing. The reference window size is used to indicate the size of the collection window that the sound pickup equipment needs to use when collecting, from the user's voice stream, a speech segment that serves as reference information. It should be noted that the embodiments of the present application do not limit the size relationship between the recognition window size and the reference window size; for example, the recognition window size may be equal to the reference window size.
In fact, the size of the recognition window can control the waiting time of the sound pickup device for collecting the voice segments which need to be subjected to the voice recognition processing, and the size of the reference window can control the information amount of future voice information, so that the size of the recognition window and the size of the reference window can influence the processing time consumption of the future voice information in the voice recognition process. Based on this, in order to better meet the real-time requirements of different application scenes, the size of the identification window and the size of the reference window can be set according to the real-time requirements of the application scenes to be used. The application scenario to be used is used to represent an application scenario of the speech recognition method provided by the embodiment of the application.
In addition, the embodiment of the present application does not limit the acquisition process of the current speech segment and the reference speech corresponding to the current speech segment, for example, in a possible implementation manner, if the preset window size includes the recognition window size and the reference window size, the acquisition process may specifically include steps 11 to 13:
Step 11: Collect the current speech segment according to the recognition window size.
In the embodiment of the present application, for the sound pickup apparatus, the sound pickup apparatus may determine the voice stream division time length (e.g., "d" in fig. 2) according to the size of the recognition window; and then, segmenting and collecting the user voice stream in real time according to the voice stream division duration (for example, "first voice segment" in fig. 2) and sending the segmented and collected voice stream to the executing device of the "voice recognition method" in real time, so that the executing device of the "voice recognition method" can perform voice recognition processing on the received voice segment in real time.
For example, as shown in fig. 2, if the recognition window size is d and the user starts speaking at the T-th time, the sound pickup apparatus immediately sends the first voice segment to the execution apparatus of the "voice recognition method" after the first voice segment is collected at the (T + d)-th time, so that the execution apparatus can perform voice recognition on the first voice segment; after collecting the second voice segment at the (T + 2d)-th time, the sound pickup equipment immediately sends the second voice segment to the execution apparatus, so that the execution apparatus can perform voice recognition on the second voice segment; and so on, until the sound pickup apparatus immediately sends the eighth voice segment to the execution apparatus after the eighth voice segment is collected at the (T + 8d)-th time, so that the execution apparatus can perform voice recognition on the eighth voice segment. Here d is a positive integer.
Step 12: and determining a reference data acquisition time period according to the size of the reference window and the acquisition ending time point of the current voice segment.
The reference data acquisition time period refers to the time period required by the sound pickup equipment to acquire the future voice information of the current voice section. In addition, the embodiment of the present application does not limit the determination manner of the reference data acquisition time period. For example, if the acquisition time period of the current voice segment is [T_start, T_end] and the reference window size is D, the reference data acquisition time period may be [T_end, T_end + D]. Here, T_start represents the starting time point at which the sound pickup equipment collects the current voice section; T_end represents the ending time point at which the sound pickup equipment collects the current voice section (that is, the acquisition ending time point of the current voice section); and D is a positive integer.
Step 13: and acquiring reference voice corresponding to the current voice section according to the reference data acquisition time section.
In this embodiment, for the sound pickup apparatus, after the reference data acquisition time period is acquired, the sound pickup apparatus may acquire the voice data from the voice stream of the user according to the reference data acquisition time period, and use the acquired voice data as the reference voice corresponding to the current voice segment.
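Steps 11 to 13 can be sketched as follows, where `stream` is a hypothetical callable mapping a (start, end) time period to the speech data collected in that period (an assumption introduced purely for illustration):

```python
def collect_round(stream, t_start, recog_window, ref_window):
    """Sketch of steps 11-13: cut the current segment with the recognition
    window, derive the reference data acquisition time period from the
    segment's acquisition ending time point and the reference window size,
    then read the reference speech from that period."""
    t_end = t_start + recog_window                  # step 11: current segment
    current_segment = stream((t_start, t_end))
    ref_period = (t_end, t_end + ref_window)        # step 12: [T_end, T_end + D]
    reference_speech = stream(ref_period)           # step 13: reference speech
    return current_segment, reference_speech
```

With `stream = lambda period: period`, the returned periods make the window arithmetic visible: a segment collected over (0, 4) with a reference window of 2 yields a reference period of (4, 6).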
Based on the above-mentioned related contents from step 11 to step 13, the sound pickup device may collect the current speech segment and the reference speech corresponding to the current speech segment in real time from the user speech stream according to the preset window size, and send the current speech segment and the reference speech corresponding to the current speech segment to the executing device of the "speech recognition method" in real time, so that the executing device of the "speech recognition method" can perform speech recognition processing on the current speech segment in real time.
It should be noted that the embodiment of the present application does not limit the sending time of the current speech segment and its corresponding reference speech. For example, to further improve speech recognition efficiency, the sound pickup equipment may send the current speech segment to the executing device of the "speech recognition method" immediately after collecting it, so that the executing device can immediately perform the corresponding processing (e.g., feature extraction, forward encoding, and the like) on the current speech segment. In this way, the reference data acquisition time period corresponding to the current speech segment (i.e., the period during which the sound pickup equipment collects the reference speech corresponding to the current speech segment from the user's voice stream) can be fully utilized, thereby effectively improving speech recognition efficiency.
Based on the related content of S1, for some application scenarios with high real-time requirements (e.g., voice input methods, voice assistants, listening conference systems, etc.), when the user starts speaking, the sound pickup equipment may collect the user's voice stream in segments in real time according to the preset window size and send the collected segments to the executing device of the "speech recognition method" in real time, so that the executing device can perform real-time speech recognition on each received speech segment. In this way, speech recognition proceeds while collection is still in progress, which effectively avoids the waiting time caused by collecting speech data carrying a whole sentence of speech text and effectively improves the real-time performance of speech recognition.
S2: and coding the current voice section according to the state data to be used and the reference voice corresponding to the current voice section to obtain the voice coding of the current voice section and the coding state data of the current voice section.
The to-be-used state data is used for representing historical voice information of the current voice section; furthermore, in order to improve the speech recognition efficiency, the status data to be used may be determined based on the encoding status data generated in the previous speech recognition process.
The previous round of speech recognition refers to the process of performing speech recognition on the most recent historical speech segment of the current speech segment. The acquisition time of this most recent historical speech segment is earlier than that of the current speech segment, and its acquisition ending time point is adjacent to the acquisition starting time point of the current speech segment. For example, if the current speech segment is the "second speech segment" in fig. 2, the most recent historical speech segment of the current speech segment may be the "first speech segment" in fig. 2.
As can be seen from the above two paragraphs, for the j-th speech recognition procedure, if j is equal to 1, the to-be-used state data used when performing the encoding process in the j-th speech recognition procedure may be preset; if j is more than or equal to 2, the to-be-used state data used for coding in the j-th speech recognition process can be determined according to the coding state data generated in the coding process in the j-1-th speech recognition process. The j-th round voice recognition process refers to a process of performing voice recognition on the j-th voice segment in the voice stream of the user; j is a positive integer.
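The round-dependent choice of to-be-used state data described above can be sketched as follows. This is a minimal illustration only; the state size HIDDEN and the zero-valued preset state are assumptions, not part of the embodiment:

```python
import numpy as np

HIDDEN = 4  # hypothetical encoder state size

def to_be_used_state(round_index, previous_encoding_state):
    """Return the to-be-used state data for the j-th recognition round.

    round_index: j (1-based). For j == 1 the state is preset (zeros here,
    as an assumed choice); for j >= 2 it is taken from the encoding state
    data generated in round j-1.
    """
    if round_index == 1:
        return np.zeros(HIDDEN)          # preset initial state
    return previous_encoding_state       # carry over round j-1's state

# round 1 uses the preset state; round 2 reuses round 1's encoding state
s1 = to_be_used_state(1, None)
s2 = to_be_used_state(2, np.ones(HIDDEN))
```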
Speech coding of a current speech segment is used to characterize the speech information carried by the current speech segment.
The encoding state data of the current speech segment refers to encoding state data (e.g., cell state data and/or hidden layer state data) generated when the current speech segment is encoded.
The embodiment of the present application is not limited to the implementation of the "encoding process" in S2 (i.e., the implementation of S2). For example, in one possible implementation, to improve the encoding efficiency, S2 may specifically include S21-S24:
s21: and respectively extracting the characteristics of the current voice section and the reference voice corresponding to the current voice section to obtain the voice characteristics of the current voice section and the reference characteristics corresponding to the current voice section.
The voice feature of the current voice segment is obtained by performing feature extraction on the current voice segment, so that the voice feature represents the voice information carried by the current voice segment. For example, if the current voice segment includes the T_start-th frame of voice data to the (T_start + N_c - 1)-th frame of voice data, the voice feature of the current voice segment may be the feature sequence {x_0, x_1, …, x_{N_c-1}}, where x_l denotes the feature extracted from the (l+1)-th frame of voice data in the current voice segment.
The reference feature corresponding to the current voice segment is obtained by performing feature extraction on the reference voice corresponding to the current voice segment, so that the reference feature represents the voice information carried by the reference voice (i.e., the future voice information of the current voice segment). For example, if the reference voice corresponding to the current voice segment includes the (T_start + N_c)-th frame of voice data to the (T_start + N_c + N_r - 1)-th frame of voice data, the reference feature corresponding to the current voice segment may be the feature sequence {x_{N_c}, x_{N_c+1}, …, x_{N_c+N_r-1}}, where N_r is the number of frames of the reference voice.
In addition, the embodiment of the present application is not limited to the implementation of "feature extraction" in S21. For example, any existing or future method capable of performing feature extraction on speech data may be used (e.g., a Perceptual Linear Prediction (PLP) feature extraction method, a Mel Frequency Cepstral Coefficient (MFCC) feature extraction method, or a FilterBank feature extraction method).
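As an illustration of the FilterBank option mentioned above, the following is a minimal log-Mel FilterBank sketch. All concrete sizes (16 kHz sample rate, 25 ms frames with a 10 ms hop, 23 Mel bands) are assumed example values, not requirements of the embodiment:

```python
import numpy as np

def filterbank_features(signal, sr=16000, frame_len=400, hop=160,
                        n_fft=512, n_mels=23):
    """Minimal log-Mel FilterBank feature sketch: one feature vector per frame."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # frame the waveform and apply a Hamming window per frame
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i*hop : i*hop + frame_len] * np.hamming(frame_len)
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2          # power spectrum
    # triangular Mel filters spread evenly on the Mel scale
    mels = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)        # rising slope
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)        # falling slope
    return np.log(power @ fbank.T + 1e-10)                   # (n_frames, n_mels)

feats = filterbank_features(np.random.randn(16000))          # 1 s of audio
```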
In addition, the embodiment of the present application does not limit the relationship between the obtaining time of the voice feature of the current voice segment and the obtaining time of the reference feature corresponding to the current voice segment. For example, because the current voice segment is acquired earlier than its corresponding reference voice, the execution device of the "voice recognition method" may, in order to improve speech recognition efficiency, start extracting features from the already-collected current voice segment while the sound pickup device is still collecting the reference voice. In that case the voice feature of the current voice segment is obtained earlier than the reference feature corresponding to the current voice segment, so the time spent collecting the reference voice is used effectively, which is beneficial to improving speech recognition efficiency.
S22: and forward coding the voice characteristics of the current voice section according to the state data to be used to obtain a forward coding result of the current voice section and the coding state data of the current voice section.
The forward encoding result of the current speech segment is an encoding result obtained by forward encoding the current speech segment. For example, if the current speech segment is forward encoded using a forward coding network, the forward coding result of the current speech segment may refer to the hidden layer state data output by the forward coding network.
In S22, "forward encoding" refers to a process of sequentially encoding each voice feature in a voice feature sequence in the forward arrangement order of the sequence. For example, for the voice feature sequence {x_0, x_1, …, x_{N_c-1}}, the voice features may be encoded sequentially from front to back (that is, x_0 is encoded first, then x_1, and so on, until x_{N_c-1} is encoded).
In addition, the embodiment of the present application is not limited to the implementation of "forward coding" in S22, and may be implemented by any existing or future implementation that can perform forward coding processing on speech data (for example, may be implemented by using an LSTM network or a forward coding network in BLSTM).
The encoding state data for the current speech segment may comprise encoding state data generated when the current speech segment is forward encoded. For ease of understanding, the following description is made with reference to examples.
As an example, if the "forward coding" in S22 is implemented with an LSTM network, and the voice feature of the current voice segment is {x_0, x_1, …, x_{N_c-1}}, the encoding state data of the current voice segment may comprise the cell state data {c_0, c_1, …, c_{N_c-1}} generated when forward encoding the current voice segment and/or the hidden layer state data {h_0, h_1, …, h_{N_c-1}} generated when forward encoding the current voice segment. Here, c_l represents the cell state obtained by the LSTM network when forward encoding the (l+1)-th frame of voice data in the current voice segment; h_l represents the hidden layer state (i.e., the output data of the LSTM network) obtained when forward encoding the (l+1)-th frame of voice data in the current voice segment; l is a non-negative integer, 0 ≤ l ≤ N_c-1, and N_c is a positive integer.
The present embodiment does not limit the function of the "to-be-used state data" in S22. For example, when the current voice segment is forward encoded by an LSTM network, the to-be-used state data may be used as the initial state data of the LSTM network (e.g., if the to-be-used state data includes the cell state obtained by forward encoding the N_c-th frame of voice data in the most recent historical voice segment of the current voice segment, that cell state may be used as the initialization value of the LSTM network's cell state), so that the LSTM network can subsequently forward encode the current voice segment based on the to-be-used state data.
Based on the related content of S22, after the voice feature of the current voice segment is obtained, it may be forward encoded with reference to the to-be-used state data, to obtain the forward encoding result and the encoding state data of the current voice segment. The forward encoding result is then used to determine the speech coding of the current voice segment, and the encoding state data is stored for use in the next round of the speech recognition process. Because the next round can reuse this stored state, it does not need to encode the historical voice data again, which effectively reduces the time consumed by speech recognition and improves speech recognition efficiency.
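The forward encoding step can be illustrated with a single numpy LSTM cell that starts from the to-be-used state and returns the encoding state data to store for the next round. The stacked weight layout, the sizes, and the random inputs are all hypothetical; a real system would use a trained network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(frames, W, h0, c0):
    """Forward-encode a segment's frames with one LSTM cell, starting from
    the to-be-used state (h0, c0). Returns all hidden states plus the final
    (h, c) pair, i.e. the encoding state data stored for the next round.
    W stacks the four gate weight blocks row-wise (an assumed layout)."""
    h, c, hs = h0, c0, []
    d = h0.shape[0]
    for x in frames:                       # frames in forward (front-to-back) order
        z = W @ np.concatenate([x, h])     # all four gate pre-activations at once
        i, f, o = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d])
        g = np.tanh(z[3*d:])
        c = f * c + i * g                  # new cell state
        h = o * np.tanh(c)                 # new hidden state (the encoding output)
        hs.append(h)
    return np.stack(hs), h, c              # forward result + encoding state data

rng = np.random.default_rng(0)
d_in, d_h, n_frames = 5, 4, 6
W = rng.normal(size=(4 * d_h, d_in + d_h)) * 0.1
seg = rng.normal(size=(n_frames, d_in))
out, h_last, c_last = lstm_forward(seg, W, np.zeros(d_h), np.zeros(d_h))
# h_last / c_last would update the to-be-used state for the next segment
```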
S23: and reversely encoding the voice characteristics of the current voice section according to the reference characteristics corresponding to the current voice section to obtain a reverse encoding result of the current voice section.
The reverse encoding result of the current speech segment is an encoding result obtained by performing reverse encoding on the current speech segment. For example, if the current speech segment is backward encoded by using a backward encoding network, the backward encoding result of the current speech segment may refer to the hidden layer state data output by the backward encoding network.
In S23, "reverse encoding" refers to a process of sequentially encoding each voice feature in a voice feature sequence in the reverse arrangement order of the sequence. For example, for the voice feature sequence {x_0, x_1, …, x_{N_c-1}}, the voice features may be encoded sequentially from back to front (that is, x_{N_c-1} is encoded first, then x_{N_c-2}, and so on, until x_0 is encoded).
In addition, the embodiment of the present application is not limited to the implementation of "reverse coding" in S23, and any existing or future implementation capable of performing reverse coding on speech data may be used (for example, the reverse coding network in a BLSTM).
In addition, the embodiment of the present application does not limit the implementation of S23. For example, in one possible implementation, S23 may specifically include S231-S232:
s231: and reversely encoding the reference characteristics corresponding to the current voice section to obtain reverse initial state data corresponding to the current voice section.
The reverse initial state data corresponding to the current speech segment refers to an initialization parameter value of encoding state data (e.g., cell state data and/or hidden layer state data) used when performing reverse encoding on the current speech segment.
In addition, the present application embodiment does not limit the determination manner of the "reverse initial state data" (that is, the implementation manner of S231), for example, in a possible implementation manner, S231 may specifically include: firstly, reversely encoding reference characteristics corresponding to a current voice section to obtain encoding state data corresponding to the reference characteristics; and determining reverse initial state data corresponding to the current voice section according to the coding state data corresponding to the reference feature.
For ease of understanding, the following description is made with reference to examples.
As an example, if the "backward coding" in S231 is implemented with a reverse coding network (e.g., the reverse coding network in a BLSTM, or the SRU network below), and the reference feature corresponding to the current voice segment is {x_{N_c}, x_{N_c+1}, …, x_{N_c+N_r-1}}, S231 may specifically include: first, reversely encoding {x_{N_c}, …, x_{N_c+N_r-1}} with the reverse coding network to obtain the hidden layer state data {h_{N_c}, …, h_{N_c+N_r-1}} and the cell state data {c_{N_c}, …, c_{N_c+N_r-1}}; then, determining the reverse initial state data corresponding to the current voice segment from these two kinds of state data (for example, the encoding state data corresponding to the 1st frame of the reference voice, i.e., h_{N_c} and/or c_{N_c}, may be determined as the reverse initial state data corresponding to the current voice segment), so that the reverse initial state data corresponding to the current voice segment can accurately represent the future voice information of the current voice segment.
S232: and reversely encoding the voice characteristics of the current voice section according to the reverse initial state data corresponding to the current voice section to obtain a reverse encoding result of the current voice section.
As an example, if a reverse coding network (e.g., the reverse coding network in a BLSTM, or the SRU network below) is used to implement the "reverse coding" in S232, S232 may specifically include: first, initializing the state data of the reverse coding network (e.g., its hidden layer state data and/or cell state data) with the reverse initial state data corresponding to the current voice segment (e.g., the hidden layer state and/or cell state obtained in S231 for the 1st frame of the reference voice), to obtain a state-initialized reverse coding network; then, reversely encoding the voice feature {x_0, x_1, …, x_{N_c-1}} of the current voice segment with the state-initialized reverse coding network, to obtain the reverse encoding result of the current voice segment.
Based on the related contents in S231 to S232, after the speech feature and the reference feature corresponding to the current speech segment are obtained, the reference feature corresponding to the current speech segment may be reversely encoded to obtain reverse encoded data (e.g., reverse hidden layer state data and reverse cell state data) corresponding to the reference feature; determining reverse initial state data corresponding to the current voice segment according to the reverse encoded data corresponding to the reference feature, so that the reverse initial state data can accurately represent future voice information of the current voice segment; finally, according to the reverse initial state data, the voice characteristics of the current voice section are reversely encoded to obtain a reverse encoding result of the current voice section, so that the reverse encoding result can more accurately represent the voice information carried by the current voice section.
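Steps S231-S232 can be sketched with a plain tanh RNN standing in for the reverse coding network (an assumption made for brevity; the embodiment would use, e.g., the BLSTM reverse network or the SRU network):

```python
import numpy as np

def rnn_step(x, h, Wx, Wh):
    return np.tanh(Wx @ x + Wh @ h)        # simplified recurrent cell

def reverse_encode(current_feats, reference_feats, Wx, Wh, d_h):
    """Two-stage reverse encoding sketch. S231: the reference features are
    encoded back-to-front to produce the reverse initial state. S232: that
    state initializes the back-to-front pass over the current segment."""
    # S231: derive the reverse initial state data from the reference speech
    h = np.zeros(d_h)
    for x in reference_feats[::-1]:        # reverse order: last frame first
        h = rnn_step(x, h, Wx, Wh)
    init = h                               # reverse initial state data
    # S232: reverse-encode the current segment starting from that state
    hs = []
    for x in current_feats[::-1]:
        init = rnn_step(x, init, Wx, Wh)
        hs.append(init)
    return np.stack(hs[::-1])              # re-align to forward frame order

rng = np.random.default_rng(1)
d_in, d_h = 5, 4
Wx = rng.normal(size=(d_h, d_in)) * 0.1    # hypothetical trained weights
Wh = rng.normal(size=(d_h, d_h)) * 0.1
res = reverse_encode(rng.normal(size=(6, d_in)),   # current segment features
                     rng.normal(size=(3, d_in)),   # reference features
                     Wx, Wh, d_h)
```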
In fact, in order to further improve the coding efficiency of the "reverse coding" in S23, a pre-constructed Simple Recurrent Unit (SRU) network may be used. That is, S23 may specifically be: inputting the voice feature of the current voice segment and the reference feature corresponding to the current voice segment into the pre-constructed SRU network, to obtain the reverse encoding result of the current voice segment output by the SRU network.
The SRU network may perform encoding using equations (1)-(5):
x̃_t = W x_t (1)
f_t = σ(W_f x_t + b_f) (2)
r_t = σ(W_r x_t + b_r) (3)
c_t = f_t ⊙ c_{t-1} + (1 - f_t) ⊙ x̃_t (4)
h_t = r_t ⊙ g(c_t) + (1 - r_t) ⊙ x_t (5)
In the formulas, h_t represents the output data of the SRU network at time t; x_t represents the input data of the SRU network at time t; c_{t-1} represents the cell state of the SRU network at time t-1; W, W_f and W_r represent preset weight parameters, and b_f and b_r represent preset bias parameters; σ(·) represents the sigmoid activation function; g(·) represents an activation function; ⊙ represents element-wise multiplication.
As can be seen from equations (1)-(3), the calculation of x̃_t, f_t and r_t depends only on the input data x_t of the SRU network at time t, and does not depend on the output data h_{t-1} of the SRU network at the previous time. Therefore, x̃_t, f_t and r_t can be calculated in parallel across all time steps, which is beneficial to improving reverse encoding efficiency and thus speech recognition efficiency.
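A numpy sketch of this SRU computation is given below, assuming g = tanh and equal input and hidden sizes (so the skip term in equation (5) is well-defined); the batched precomputation of x̃_t, f_t and r_t illustrates the parallelism noted above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_layer(X, W, Wf, Wr, bf, br):
    """SRU sketch following equations (1)-(5). x_tilde, f and r depend only
    on the input X, so they are computed for every time step in one batched
    matrix product before the (cheap) sequential cell-state recurrence."""
    x_tilde = X @ W.T                      # eq (1), all time steps at once
    f = sigmoid(X @ Wf.T + bf)             # eq (2), parallel over t
    r = sigmoid(X @ Wr.T + br)             # eq (3), parallel over t
    c = np.zeros(W.shape[0])
    H = np.empty_like(x_tilde)
    for t in range(X.shape[0]):
        c = f[t] * c + (1 - f[t]) * x_tilde[t]        # eq (4)
        H[t] = r[t] * np.tanh(c) + (1 - r[t]) * X[t]  # eq (5), with g = tanh
    return H

rng = np.random.default_rng(2)
T, d = 8, 4                                # hypothetical sizes
W, Wf, Wr = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
H = sru_layer(rng.normal(size=(T, d)), W, Wf, Wr, np.zeros(d), np.zeros(d))
```

For reverse encoding, the input sequence X would simply be fed in back-to-front order.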
Based on the related content of the SRU network, after the voice feature {x_0, …, x_{N_c-1}} of the current voice segment and the reference feature {x_{N_c}, …, x_{N_c+N_r-1}} corresponding to the current voice segment are obtained, the two may be spliced to obtain the splicing feature {x_0, …, x_{N_c+N_r-1}}. The splicing feature is input into the SRU network so that the SRU network reversely encodes it, obtaining and outputting the reverse encoding result {h_0, h_1, …, h_{N_c+N_r-1}} of the splicing feature. Finally, the reverse encoding result {h_0, h_1, …, h_{N_c-1}} of the current voice segment is extracted from the reverse encoding result of the splicing feature.
Based on the related content of S23, after the voice feature and the reference feature of the current voice segment are obtained, the reference feature corresponding to the current voice segment may be referred to when reversely encoding the voice feature of the current voice segment, so that the reverse encoding result is determined by combining the voice information carried by the current voice segment with the voice information carried by the reference voice corresponding to the current voice segment. This makes the reverse encoding result more accurate, which is beneficial to improving the accuracy of voice recognition.
S24: and splicing the forward coding result of the current voice section and the reverse coding result of the current voice section to obtain the voice coding of the current voice section.
In the embodiment of the application, after the forward encoding result {h_0^f, …, h_{N_c-1}^f} of the current voice segment and the reverse encoding result {h_0^r, …, h_{N_c-1}^r} of the current voice segment are obtained, the two can be spliced frame by frame to obtain the speech coding {[h_0^f; h_0^r], …, [h_{N_c-1}^f; h_{N_c-1}^r]} of the current voice segment.
Based on the above-mentioned related content of S2, after the current speech segment is obtained, the historical speech information (e.g., the to-be-used state data) of the current speech segment and the future speech information (e.g., the reference speech corresponding to the current speech segment) of the current speech segment may be referred to, and the current speech segment is encoded (e.g., the encoding process shown in fig. 3), so as to obtain the speech encoding of the current speech segment and the encoding state data of the current speech segment.
S3: and decoding the voice code of the current voice section to obtain a voice text corresponding to the current voice section.
And the voice text corresponding to the current voice section is used for representing the voice information carried by the current voice section.
In addition, the embodiment of the present application is not limited to the implementation of the "decoding process" in S3. For example, any existing or future implementation capable of decoding speech coding may be used (e.g., a weighted finite-state transducer (WFST) decoder, or the decoder layer in an end-to-end speech recognition model).
Based on the relevant content of S3, after the speech code of the current speech segment is obtained, the speech code may be directly decoded to obtain the speech text corresponding to the current speech segment, so that the speech text can accurately represent the speech information carried by the current speech segment.
S4: and updating the to-be-used state data by using the coding state data of the current voice segment.
In this embodiment of the present application, after the encoding state data of the current voice segment is obtained, it may be used to update the to-be-used state data (for example, if the encoding state data of the current voice segment is {c_0, c_1, …, c_{N_c-1}}, the to-be-used state data can be updated to the last state data c_{N_c-1}). The updated to-be-used state data can then accurately represent the voice information carried by the current voice segment and its corresponding historical voice segments. As a result, in the next round of the speech recognition process, the encoding state data of the current voice segment can be used when encoding (especially forward encoding) the most recent future voice segment of the current voice segment, and the corresponding historical voice data does not need to be encoded again, which is beneficial to improving speech recognition efficiency and the real-time performance of the speech recognition process.
The acquisition time of the latest future voice section of the current voice section is later than that of the current voice section, and the acquisition starting time point of the latest future voice section is adjacent to the acquisition ending time point of the current voice section. For example, if the current speech segment is "the second speech segment" in fig. 2, the latest future speech segment (i.e., the processing object to be speech-recognized in the next speech recognition process) of the current speech segment may be "the third speech segment" in fig. 2.
It should be noted that the present embodiment does not limit the execution order of S3 and S4, for example, S3 and S4 may be executed sequentially, S4 and S3 may be executed sequentially, and S3 and S4 may be executed simultaneously.
Based on the related contents of the foregoing S1 to S4, for the speech recognition method provided in this embodiment of the present application, after the current voice segment and the reference voice corresponding to the current voice segment are obtained, the current voice segment is encoded according to the to-be-used state data and the reference voice corresponding to the current voice segment, so as to obtain the speech coding of the current voice segment and the encoding state data of the current voice segment; then the speech coding of the current voice segment is decoded to obtain the voice text corresponding to the current voice segment, and the to-be-used state data is updated with the encoding state data of the current voice segment.
In this way, voice recognition can be performed segment by segment while the sound pickup device collects voice data from the user voice stream in real time.
The historical voice information of the current voice section can be accurately represented by the to-be-used state data, and the reference voice corresponding to the current voice section can accurately represent the future voice information of the current voice section, so that the voice coding determined by referring to the to-be-used state data and the reference voice (namely, referring to the context information of the current voice section) can more accurately represent the voice information carried by the current voice section, and the voice recognition accuracy is improved.
And because the to-be-used state data has already been calculated in the historical speech recognition process (i.e., the process of performing speech recognition on the historical speech corresponding to the current voice segment), it can be used directly in the current round of speech recognition without being recalculated. This effectively reduces the time consumed by speech recognition on the current speech, improves speech recognition efficiency, and further improves the real-time performance of speech recognition.
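The overall S1-S4 loop can be sketched as follows; `encode` and `decode` are toy stand-ins (assumptions, not the embodiment's networks) used only to show how state flows between rounds:

```python
import numpy as np

def run_streaming_recognition(segments, references, encode, decode, preset_state):
    """End-to-end sketch of the S1-S4 loop: each round encodes the current
    segment with the to-be-used state and its reference speech, decodes the
    speech code into text, then updates the state for the next round."""
    state, texts = preset_state, []
    for seg, ref in zip(segments, references):
        code, state = encode(seg, ref, state)   # S2: speech code + encoding state
        texts.append(decode(code))              # S3: speech text
        # S4: `state` now carries this round's encoding state forward
    return texts

# toy stand-ins: "encode" mixes segment, reference and carried state into a
# scalar code; "decode" maps the code's sign to a dummy "text"
enc = lambda seg, ref, st: (seg.mean() + 0.1 * ref.mean() + st, seg.mean())
dec = lambda code: "pos" if code >= 0 else "neg"
segs = [np.ones(4), -np.ones(4)]
refs = [np.ones(2), np.ones(2)]
texts = run_streaming_recognition(segs, refs, enc, dec, 0.0)
```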
Based on the speech recognition method provided by the above method embodiment, the embodiment of the present application further provides a speech recognition apparatus, which is explained and explained with reference to the drawings.
Device embodiment
The embodiment of the device introduces a speech recognition device, and please refer to the above method embodiment for related contents.
Referring to fig. 4, the figure is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application.
The speech recognition apparatus 400 provided in the embodiment of the present application includes:
a voice obtaining unit 401, configured to obtain a current voice segment and a reference voice corresponding to the current voice segment; wherein the acquisition time of the reference voice is later than that of the current voice section;
a speech encoding unit 402, configured to perform encoding processing on the current speech segment according to state data to be used and a reference speech corresponding to the current speech segment, so as to obtain a speech code of the current speech segment and encoding state data of the current speech segment;
a speech decoding unit 403, configured to decode a speech code of the current speech segment to obtain a speech text corresponding to the current speech segment;
a data updating unit 404, configured to update the to-be-used state data with the coding state data of the current speech segment.
In a possible implementation, the speech encoding unit 402 includes:
a first extraction subunit, configured to perform feature extraction on the current speech segment and a reference speech corresponding to the current speech segment, respectively, to obtain a speech feature of the current speech segment and a reference feature corresponding to the current speech segment;
a forward coding subunit, configured to perform forward coding on the speech feature of the current speech segment according to the to-be-used state data, so as to obtain a forward coding result of the current speech segment;
a reverse coding subunit, configured to perform reverse coding on the voice feature of the current voice segment according to a reference feature corresponding to the current voice segment, so as to obtain a reverse coding result of the current voice segment;
and the coding and splicing subunit is used for splicing the forward coding result of the current voice section and the reverse coding result of the current voice section to obtain the voice code of the current voice section.
In a possible implementation, the inverse coding subunit is specifically configured to:
reversely encoding the reference characteristics corresponding to the current voice section to obtain reverse initial state data corresponding to the current voice section;
and reversely encoding the voice characteristics of the current voice section according to the reverse initial state data corresponding to the current voice section to obtain a reverse encoding result of the current voice section.
In a possible implementation, the inverse coding subunit is specifically configured to:
and inputting the voice characteristics of the current voice section and the reference characteristics corresponding to the current voice section into a pre-constructed Simple Recurrent Unit (SRU) network, and obtaining a reverse coding result of the current voice section output by the SRU network.
In a possible implementation, the speech encoding unit 402 includes:
the second extraction subunit is used for extracting the characteristics of the current voice segment to obtain the voice characteristics of the current voice segment;
and the state determining subunit is configured to perform forward coding on the speech feature of the current speech segment according to the to-be-used state data, so as to obtain coding state data of the current speech segment.
In a possible implementation manner, if the current speech segment and the reference speech corresponding to the current speech segment are collected according to a preset window size, and the preset window parameters include an identification window size and a reference window size, the collecting process of the current speech segment and the reference speech corresponding to the current speech segment includes:
collecting the current voice section according to the size of the identification window;
determining the reference data acquisition time period according to the size of the reference window and the acquisition ending time point of the current voice segment;
and acquiring the reference voice corresponding to the current voice segment according to the reference data acquisition time segment.
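The window-based collection arithmetic above can be sketched as follows; the frame-index bookkeeping and the adjacency of consecutive segments follow the description, while the concrete window sizes are hypothetical:

```python
def collection_windows(stream_len, recognition_window, reference_window):
    """Sketch of window-based collection: each current segment spans
    `recognition_window` frames, and its reference speech is the
    `reference_window` frames immediately after the segment ends.
    Returns (segment_range, reference_range) pairs of frame indices."""
    windows = []
    start = 0
    while start + recognition_window <= stream_len:
        seg_end = start + recognition_window          # acquisition ending point
        ref_end = min(seg_end + reference_window, stream_len)
        windows.append(((start, seg_end), (seg_end, ref_end)))
        start = seg_end                               # segments are adjacent
    return windows

w = collection_windows(stream_len=10, recognition_window=4, reference_window=2)
# segment [0,4) pairs with reference [4,6); segment [4,8) with reference [8,10)
```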
Further, an embodiment of the present application further provides a speech recognition device, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is used for storing one or more programs, and the one or more programs comprise instructions which, when executed by the processor, cause the processor to execute any one of the implementation methods of the voice recognition method.
Further, an embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the instructions cause the terminal device to execute any implementation method of the foregoing speech recognition method.
Further, an embodiment of the present application further provides a computer program product, which when running on a terminal device, causes the terminal device to execute any implementation method of the above-mentioned speech recognition method.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the above embodiment methods can be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.