Voice processing method and system, and voice interaction device and method
1. A method of speech processing comprising:
receiving voice data from a user;
determining a user identity of the user; and
generating a processing result of the voice data based on the user identity.
2. The method of claim 1, wherein determining the user identity of the user comprises at least one of:
identifying biometric information of the user and determining the user identity based on the biometric information;
identifying account information for the user and determining the user identity based on the account information.
3. The method of claim 2, wherein identifying biometric information of the user and determining the user identity based on the biometric information comprises:
performing voiceprint recognition on the voice data, and determining the identity of the user according to the recognized voiceprint;
identifying the fingerprint of the user, and determining the identity of the user according to the identified fingerprint;
performing image recognition on the user, and determining the identity of the user according to the recognized image features.
4. The method of claim 2, further comprising:
collecting respective biometric information of a plurality of users and generating an identity information base, and
identifying biometric information of the user and determining the user identity based on the biometric information includes:
comparing the acquired biometric information of the user with the biometric information stored in the identity information base; and
determining the user identity according to the comparison result.
5. The method of claim 1, wherein generating the processing result of the voice data based on the user identity comprises:
acquiring portrait information and historical information of the user based on the user identity; and
generating a processing result of the voice data based on the portrait information and the historical information.
6. The method of claim 5, wherein generating the processing result of the voice data further comprises:
acquiring scene and/or context information of the voice data;
generating a processing result of the voice data based on the scene and/or the context information.
7. The method of claim 6, wherein generating the processing result of the voice data further comprises:
based on the scene and/or the context information, filtering required user information from the portrait information and the historical information; and
generating a processing result of the voice data based on the filtered user information.
8. The method of claim 5, wherein the portrait information and historical information include at least one of:
portrait information and historical information obtained when the user uses the current voice interaction device; and
portrait information and historical information obtained when the user uses an associated account and/or an associated device.
9. The method of claim 8, further comprising:
creating and/or updating the portrait information and historical information based on the user's operations with respect to at least one of:
the current voice interaction device;
other associated accounts; and
other associated devices.
10. The method of claim 1, wherein generating the processing result of the voice data based on the user identity comprises:
determining a domain intent of the voice data based on the user identity; and
generating a processing result of the voice data based on the domain intent.
11. The method of claim 1, wherein generating the processing result of the voice data based on the user identity comprises:
determining, based on the user identity, an additional domain intent beyond the domain intent of the voice data itself; and
generating an additional processing result of the voice data based on the additional domain intent.
12. The method of claim 1, further comprising:
providing the user with a service based on the processing result of the voice data.
13. The method of claim 12, wherein the service comprises a plurality of services, each service involving a corresponding association operation, the plurality of association operations comprising at least one of:
operations of the same kind performed sequentially; and
operations of different kinds performed simultaneously.
14. The method of claim 13, wherein the association operation comprises at least one of:
playing sound;
a visual presentation; and
controlling other devices.
15. The method of claim 14, wherein playing sound includes voice feedback, a form of the voice feedback being determined based on the user identity.
16. The method of claim 12, wherein providing the user with a service based on the processing result of the voice data comprises:
providing information flows corresponding to the same or different services to the user.
17. A speech processing system, comprising a voice interaction device and a server, wherein
the voice interaction device is configured to:
receiving voice data from a user;
determining a user identity of the user;
uploading the voice data and the user identity to the server; and
the server is configured to:
generating a processing result of the voice data based on the user identity.
18. The system of claim 17, wherein the voice interaction device is to:
performing voiceprint recognition on the voice data, and determining the identity of the user according to the recognized voiceprint.
19. The system of claim 18, wherein the voice interaction device is to:
acquiring a determined user identity from a biometric component or device.
20. The system of claim 19, wherein the biometric component or device comprises at least one of:
a fingerprint identification component or device; and
a face recognition component or device.
21. The system of claim 17, wherein the voice interaction device is to:
acquiring respective identity information of a plurality of users, wherein the identity information is used for determining the identity of the users; and
generating, locally or on the server, an identity information base comprising the identity information.
22. The system of claim 21, wherein the voice interaction device is to:
acquiring identity information of the user;
comparing the acquired identity information of the user with the identity information stored in the identity information base; and
determining the user identity according to the comparison result.
23. The system of claim 17, wherein the server is to:
based on the user identity, querying portrait information and historical information of the user; and
generating a processing result of the voice data based on the portrait information and the historical information.
24. The system of claim 23, wherein the server is to:
acquiring scene and/or context information of the voice data;
based on the scene and/or the context information, filtering required user information from the portrait information and the historical information; and
generating a processing result of the voice data based on the filtered user information.
25. The system of claim 17, wherein the server is to:
determining a domain intent of the voice data and/or an additional domain intent beyond the domain intent of the voice data itself based on the user identity; and
generating a processing result of the voice data based on the domain intent and/or the additional domain intent.
26. The system of claim 17, wherein the server is to:
returning service information based on the processing result of the voice data.
27. The system of claim 26, wherein the voice interaction device is to:
acquiring service information returned by the server;
based on the service information, executing a corresponding association operation, wherein the association operation comprises at least one of:
playing sound;
a visual presentation; and
controlling other devices.
28. The system of claim 27, wherein the voice interaction device is a smart speaker and the other device is an internet of things device networked with the smart speaker.
29. A voice interaction device, comprising:
voice data receiving means for receiving voice data of a user;
user identity determination means for determining a user identity of the user;
networking means for uploading the acquired voice data and the user identity to a server, and for acquiring a processing result of the voice data generated and issued by the server based on the user identity; and
interaction means for performing interaction based on the issued processing result.
30. The apparatus of claim 29, wherein the voice data receiving means comprises:
microphone means for collecting voice data of the user.
31. The apparatus of claim 29, further comprising:
short-range communication means for at least one of:
acquiring voice data acquired by other voice acquisition equipment; and
acquiring identity data collected by other devices for determining the user identity, or acquiring the determined user identity.
32. The apparatus of claim 31, wherein the interaction means comprises at least one of:
loudspeaker means for broadcasting the processing result to the user;
display screen means for displaying the processing result to the user; and
short-range communication means for sending the acquired processing result to other devices.
33. The apparatus of claim 32, wherein the processing result of the voice data comprises a plurality of operations, and the plurality of operations involve either successive operations of one interaction means or simultaneous operations of at least two different interaction means.
34. The apparatus of claim 29, comprising:
scene information acquisition means for acquiring scene information, and
wherein the networking means is configured to upload the acquired scene information to the server, and to acquire the processing result generated by the server based on user information filtered according to the scene information.
35. The device of claim 29, wherein the voice interaction device is to:
acquiring respective identity information of a plurality of users, wherein the identity information is used for determining the identity of the users; and
generating, locally or on a server, an identity information base comprising the identity information.
36. A voice interaction method, comprising:
receiving voice data of a user;
determining a user identity of the user;
uploading the acquired voice data and the user identity to a server;
acquiring a processing result of the voice data generated and issued by the server based on the user identity; and
performing an operation based on the issued processing result.
37. The method of claim 36, wherein determining the user identity of the user comprises at least one of:
determining a user identity of the user based on voiceprint information extracted from the voice data; and
acquiring, from another device, the user identity or identity information used for determining the user identity.
38. The method of claim 36, wherein performing an operation based on the issued processing result comprises at least one of:
broadcasting the processing result to a user;
displaying the processing result to a user; and
sending the acquired processing result to other devices.
39. The method of claim 38, further comprising:
acquiring respective identity information of a plurality of users, wherein the identity information is used for determining the identity of the users; and
generating, locally or on a server, an identity information base comprising the identity information.
40. A method of speech processing comprising:
receiving voice data from a user;
acquiring image data of the user;
determining a user identity of the user based on the voice data and/or the image data; and
generating a processing result of the voice data based on the user identity.
41. An in-vehicle speech processing system comprising:
a microphone for receiving voice data of a user;
a processor for determining a user identity of the user based on the voice data; and
an interaction device for performing interaction according to a voice processing result generated based on the user identity.
42. The system of claim 41, wherein,
the microphones comprise a plurality of groups of microphones arranged at different positions of the vehicle, and the processor determines the user identity of the user according to voice data acquired by the plurality of groups of microphones; and/or
the system further comprises an image capture device, and the processor further determines the user identity of the user based on image information captured by the image capture device.
43. A speech processing system comprising:
a plurality of voice interaction devices for receiving voice data from users,
wherein one of the plurality of voice interaction devices is awakened to interact with a user, and the interaction comprises:
receiving voice data from a user;
determining a user identity of the user based on the voice data, and determining a current interaction scene based on the location of the woken-up voice interaction device; and
generating a processing result of the voice data based on the user identity and the current interaction scene.
44. A computing device, comprising:
a processor; and
memory having stored thereon executable code which, when executed by the processor, causes the processor to perform the method of any of claims 36-39.
45. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 36-39.
Background
Intelligent voice assistants have become increasingly popular and are now an integral part of many users' lives. In addition to appearing in the user's home, most commonly in the form of a smart speaker, voice assistants are beginning to appear in devices such as in-vehicle entertainment systems, smart phones, and wearable smart devices. Currently, most intelligent voice interaction adopts a single-instruction, single-service form: a single user instruction must carry a clear intention and yields only a single service, and this relatively fixed feedback pattern feels mechanical and rigid to the user, which runs counter to the original purpose of intelligent voice products.
For this reason, a more flexible voice interaction feedback scheme is needed.
Disclosure of Invention
In order to solve at least one of the above problems, the present invention proposes a scheme capable of providing a personalized voice interaction service according to a user identity. Based on the different preferences of different users, different combinations of information service flows are recommended in scenarios where the intention is not precisely specified.
According to a first aspect of the present invention, a speech processing method is provided, including: receiving voice data from a user; determining a user identity of the user; and generating a processing result of the voice data based on the user identity.
According to a second aspect of the present invention, there is provided a speech processing system comprising a server and a plurality of voice interaction devices, wherein the voice interaction device is configured to: receiving voice data from a user; determining a user identity of the user; and uploading the voice data and the user identity to the server; and the server is configured to: processing the voice data; and generating and issuing a processing result of the voice data based on the user identity.
According to a third aspect of the present invention, there is provided a voice interaction device, comprising: voice data receiving means for receiving voice data of a user; user identity determination means for determining a user identity of the user; networking means for uploading the acquired voice data and the user identity to a server and acquiring a processing result of the voice data generated and issued by the server based on the user identity; and interaction means for performing interaction based on the issued processing result.
According to a fourth aspect of the present invention, a voice interaction method is provided, including: receiving voice data of a user; determining a user identity of the user; uploading the acquired voice data and the user identity to a server; acquiring a processing result of the voice data generated and issued by the server based on the user identity; and performing operation based on the issued processing result.
According to a fifth aspect of the present invention, a speech processing method is provided, including: receiving voice data from a user; acquiring image data of the user; determining a user identity of the user based on the voice data and/or the image data; and generating a processing result of the voice data based on the user identity.
According to a sixth aspect of the present invention, a vehicle-mounted speech processing system is provided, comprising: a microphone for receiving voice data of a user; a processor for determining a user identity of the user based on the voice data; and the interaction device is used for carrying out interaction according to a voice processing result generated based on the user identity.
According to a seventh aspect of the present invention, there is provided a speech processing system comprising: a plurality of voice interaction devices for receiving voice data from users, wherein one of the plurality of voice interaction devices is woken up to interact with the user, and the interaction comprises: receiving voice data from a user; determining a user identity of the user based on the voice data, and determining a current interaction scene based on the location of the woken-up voice interaction device; and generating a processing result of the voice data based on the user identity and the current interaction scene.
According to an eighth aspect of the present invention, there is provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of the fourth aspect.
According to a ninth aspect of the present invention, there is provided a non-transitory machine-readable storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform the method of the fourth aspect above.
The voice processing scheme of the invention can acquire portrait information and historical information of the user by determining the identity of the user, determine the intent domain of the user's voice input based on this information, and recommend information flows comprising a plurality of services.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
Fig. 1 shows a process flow diagram of a voice interaction link.
Fig. 2 shows a flow diagram of a speech processing method according to an embodiment of the invention.
Fig. 3 shows an example of establishing an identity information base for subsequent identification according to the present invention.
Fig. 4 shows an example of recommending information service flows according to the present invention.
FIG. 5 shows a block diagram of a speech processing system according to one embodiment of the invention.
FIG. 6 is a block diagram of a voice interaction device, according to one embodiment of the present invention.
FIG. 7 shows a flow diagram of a voice interaction method according to an embodiment of the invention.
FIG. 8 is a schematic structural diagram of a computing device that can be used to implement the speech processing method described above according to an embodiment of the invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Intelligent voice assistants have become increasingly popular and are now an integral part of many users' lives. An intelligent voice assistant can conduct a spoken dialog with the user and give voice feedback and other actions based on the content of the dialog. Fig. 1 shows a process flow diagram of a voice interaction link. The voice interaction link refers to the unit modules involved in realizing voice interaction, which cooperate to complete the voice interaction function. In different application scenarios, some or all of the modules in the interaction link may be involved. Fig. 1 shows the most central unit modules; in other implementations, the interaction link may also include functional modules such as wake-up response and voiceprint recognition.
As shown in fig. 1, the collected user speech is processed by a speech recognition module (ASR) to generate a speech recognition result, i.e. a text instruction corresponding to the user speech. Subsequently, a natural language understanding module (NLU) semantically parses the user utterance. Here, natural language understanding refers to an algorithm/system that recognizes the meaning of the text. In a voice assistant, the NLU can recognize a user voice instruction as a particular domain intent. A domain refers to a broad topic area in natural language understanding, such as weather or time, while an intent refers to a specific task within a domain, such as querying the weather, querying the time, or setting an alarm. When the received user voice data is unambiguous, for example the input "Beijing weather", the weather query can be triggered to return an accurate result. But when the user's voice input is not clear, for example when user A says "I'm back" to the smart speaker after returning home, the existing voice reply may be "welcome back", and further interaction such as "what can I help you with" is needed to clarify the domain of user A's intent. After the domain and intent in the user's voice command are understood, they can be fed into a domain service module (DS), which selects the system behavior to be performed based on the NLU semantic parsing result (i.e., the specific domain and intent). A natural language generation module (NLG) then generates a natural-language system utterance according to the system behavior result; finally, a speech synthesis module (Text-to-Speech, TTS) reads the generated utterance aloud to the user.
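To make the link concrete, the following minimal Python sketch mirrors the module chain described above; all function names and the stub behaviors are illustrative placeholders and not part of the disclosed implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Intent:
    domain: str                                   # e.g. "weather"
    name: str                                     # e.g. "query_weather"
    slots: dict = field(default_factory=dict)     # e.g. {"city": "Beijing"}

# Stub modules standing in for the real ASR / NLU / DS / NLG / TTS components.
def asr(audio: bytes) -> str:
    return "Beijing weather"                      # pretend recognition result

def nlu(text: str) -> Intent:
    return Intent("weather", "query_weather", {"city": "Beijing"})

def domain_service(intent: Intent) -> str:
    return "Sunny, 25 degrees in Beijing today"   # system behavior result

def nlg(behavior: str) -> str:
    return behavior                               # wrap the result as a system utterance

def tts(utterance: str) -> bytes:
    return utterance.encode("utf-8")              # stand-in for synthesized audio

def handle_utterance(audio: bytes) -> bytes:
    text = asr(audio)                  # speech recognition: audio -> text instruction
    intent = nlu(text)                 # semantic parsing: text -> domain and intent
    behavior = domain_service(intent)  # select the system behavior to execute
    utterance = nlg(behavior)          # generate the system utterance
    return tts(utterance)              # synthesize speech for playback
```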
Existing dialog system logic relies on the NLU to identify a specific skill, from which the DS can derive system actions such as dialog processing, voice broadcast, and other services. This processing mode is adequate when the data boundary of each skill of the dialog system is clear; however, as skills are built faster and their number grows, the boundaries between skills become blurred, so the NLU no longer has enough information to judge whether a query belongs to a particular skill. This is especially true in cases involving question answering, knowledge graph (KG), encyclopedia skills, and chat conversations. In such cases it is difficult to achieve accurate domain classification of the query with the NLU classification model alone. If the system still relies on the classification result given by the NLU for subsequent processing, the recall error rate may increase, degrading the user experience. The user experience also degrades if the system needs to keep talking to the user to clarify the intent. In addition, the above process gives a single feedback for a given voice input, i.e., a single-instruction, single-service form; it makes no distinction between users, and therefore also fails to give feedback that satisfies the user.
To this end, the present invention provides a scheme for providing a personalized information service flow for each user of a smart voice device or system (e.g., a smart speaker or a home smart voice interaction system). Without requiring instructions with a specific intent, the scheme can recommend information or services that match the preferences of different users of the same device/system (for example, different users in a family), thereby tailoring the experience to each individual user and improving user stickiness and experience.
Since the user identity is obtained, in the voice processing method of the present invention the domain intent of the voice data can be determined based on the user identity. The method and the device are therefore particularly suitable for delimiting the intent domain of voice input whose semantic intent is ambiguous by utilizing the identity information of the user, and for giving a corresponding processing result. In other words, the invention can use data characterizing a particular target user to infer the user's implicit intent beyond the literal semantics of the current speech input, making the intelligent interaction more attuned to the user's needs and avoiding subsequent rounds of clarification.
For example, if dad says "good morning" among the family members, it is appropriate to recommend information services with high user preference according to the family members' habits, such as weather information, technology news (a preference in the user portrait), and discount information for favored technology products (shopping). When mom says "good morning", weather information may also be recommended (e.g., at a slower speech rate and with a gentler tone), along with information such as breakfast tips (a user portrait preference) and daily discount information (shopping).
Fig. 2 shows a flow diagram of a speech processing method according to an embodiment of the invention. Depending on the application scenario, the voice processing method can be a stand-alone scheme implemented entirely on the voice interaction device, a networked single-device scheme in which a single voice interaction device relies on cloud processing capability, a distributed system scheme in which a cloud server supports a large number of voice interaction terminals, or a cloud-only scheme executed by the server alone.
In step S210, voice data from a user is received. In some embodiments, for example embodiments involving the voice interaction side, receiving voice data may refer to obtaining captured audio data directly from an audio collection module (e.g., a microphone array) of the local voice interaction device. For example, in a home scenario, the smart speaker collects voice data of a user via its built-in microphone array, or a smart voice tag connected to the smart speaker (e.g., via Bluetooth) collects the voice data of the user. In a driving scenario, the in-vehicle infotainment system (IVI) collects audio data through a microphone array arranged on the steering wheel. In other embodiments, receiving voice data may refer to obtaining voice data from the outside, e.g., a cloud server obtaining it from a client. Here, the voice data may be original voice data, for example audio data, or voice data that has undergone some preprocessing. For example, the cloud server may obtain data uploaded after local noise reduction and compression, or may even directly obtain text data produced by ASR (automatic speech recognition) processing on the client.
In step S220, the user identity of the user is determined. In step S230, a processing result of the voice data is generated based on the determined user identity. For example, different members of the same local system (e.g., a home smart voice interaction system) may give different voice data processing results (even though different members give the same voice instruction, e.g., dad and mom each say "good morning" to the smart speaker). Therefore, different users are distinguished, and the user satisfaction is improved.
Here, determining the "user identity" means recognizing which natural person the voice data comes from. The user identity may be determined based on identity information of the user. The identity information of the user may be, for example, the account information of the account logged in on the device; this is particularly applicable to scenarios where the device and the user are strongly associated. In other embodiments, the identity information may be collected biometric information of the user, whereby the user identity is determined with higher confidence. Accordingly, step S220 may include identifying biometric information of the user and determining the user identity based on the biometric information. Alternatively or additionally, determining the user identity may also include identifying account information of the user and determining the user identity based on the account information. The biometric recognition may include voiceprint recognition of the voice data, for example voiceprint recognition of the voice data received in step S210, after which the user identity is determined according to the recognized voiceprint. Biometric recognition may also include identifying the user's fingerprint and determining the user identity based on the identified fingerprint. As another example, it may include performing image recognition on the user and determining the user identity based on the recognized image. The recognition result may be obtained by the voice device itself, by processing the voice data with a built-in voiceprint recognition module; it may be obtained from a fingerprint lock that recognizes the fingerprint of a specific user and transmits the recognition result; or it may be obtained through image recognition by a camera installed on the voice device, or from an image recognition result transmitted by another device. The image recognition may be, for example, face recognition, or recognition of other human body characteristics such as height and body shape.
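A minimal sketch of such identity determination is given below, assuming hypothetical recognizer helpers (`match_voiceprint`, `match_face`, `lookup_account`) that stand in for the voiceprint module, the camera component, and the account service; the fallback order is illustrative only.

```python
from typing import Optional

# Hypothetical recognizers; in a real system these would wrap the voiceprint model,
# the face-recognition component, and the device's account service.
def match_voiceprint(voice_data: bytes) -> Optional[str]:
    return None

def match_face(face_image: bytes) -> Optional[str]:
    return None

def lookup_account(account_id: str) -> Optional[str]:
    return "dad" if account_id == "dad_account" else None

def determine_user_identity(voice_data: bytes,
                            face_image: Optional[bytes] = None,
                            account_id: Optional[str] = None) -> Optional[str]:
    # 1. Voiceprint recognition on the received voice data.
    user = match_voiceprint(voice_data)
    if user:
        return user
    # 2. Face/image recognition if a capture is available (local camera or another device).
    if face_image is not None:
        user = match_face(face_image)
        if user:
            return user
    # 3. Fall back to account information, e.g. the account logged in on the device.
    if account_id is not None:
        return lookup_account(account_id)
    return None
```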
In specific practice, the user identity may be determined by combining multiple kinds of information, for example account information together with voiceprint information, or face information together with voiceprint information. The user identity can be further confirmed through a voice prompt and a secondary confirmation by the user.
Furthermore, an identity information base can be established for the different users of the same voice device or voice system, so that the identity of a user can be recognized by comparing the identified information with the information stored in the base. To this end, the speech processing method of the present invention may include: collecting respective biometric information of a plurality of users and generating an identity information base. Accordingly, step S220 may include: comparing the acquired biometric information of the user with the biometric information stored in the identity information base; and determining the user identity according to the comparison result.
Fig. 3 shows an example of establishing an identity information base for subsequent identification according to the present invention. As shown in fig. 3, assume a family smart voice scene with three family members: dad, mom, and a baby (e.g., a three-year-old child). To implement the user-identity-based voice processing of the present invention, user identity registration may be performed first. In this example, the user identity registration includes face and voiceprint registration. To this end, the smart speaker acting as the central node of the home smart voice system may ask dad, mom, and the baby to each speak a fixed utterance (e.g., a wake-up word) for feature extraction by the smart speaker (based on local or cloud processing capability). In addition, face image information of the users can be acquired through a smart speaker with a camera or a smart phone with a voice management app, and features can be extracted in the same way. The extracted features may be input into a model training module, typically located on a server, to model the faces and voiceprints and obtain a face/voiceprint library for the family as the identity information base. Depending on the implementation, the identity information base can be stored in the cloud (e.g., in a database connected to the voice server) indexed by home or device ID, locally (e.g., in the local smart speaker or a home database), in a database readable by an edge computing device, etc.
After the identity information base is established, the identity can be determined by comparing the currently acquired data with the data in the base. For example, when a family member comes home and speaks the wake-up word of the smart speaker, the wake-up word wakes up the voice interaction function of the smart speaker; the voiceprint features of the wake-up utterance are extracted and compared with the voiceprint information in the identity information base to determine that the user is dad, and subsequent information service flows are then provided accordingly, including, for example, subsequent voice interaction and various forms of feedback.
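The enrollment-and-comparison flow can be sketched as follows; the toy feature extractor, the cosine-similarity matching, and the threshold value are illustrative assumptions rather than the registered method itself.

```python
import math
from typing import Dict, List, Optional

# Hypothetical identity information base: user name -> enrolled voiceprint features.
identity_base: Dict[str, List[float]] = {}

def extract_voiceprint(audio: bytes) -> List[float]:
    # Stand-in for a real voiceprint feature extractor / speaker-embedding model.
    return [float(b) for b in audio[:8]] or [0.0]

def register_user(name: str, enrollment_audio: bytes) -> None:
    """Enrollment: extract features from the fixed utterance (e.g. the wake-up word)."""
    identity_base[name] = extract_voiceprint(enrollment_audio)

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def identify(audio: bytes, threshold: float = 0.8) -> Optional[str]:
    """Compare the acquired voiceprint with the stored ones and return the best match."""
    probe = extract_voiceprint(audio)
    best_name, best_score = None, 0.0
    for name, enrolled in identity_base.items():
        score = cosine(probe, enrolled)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None

# Usage: enroll the three family members, then identify the speaker of the wake-up word.
for member, sample in [("dad", b"wake-dad"), ("mom", b"wake-mom"), ("baby", b"wake-baby")]:
    register_user(member, sample)
print(identify(b"wake-dad"))
```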
After the user identity is determined, a processing result of the voice data may be generated based on the determined user identity in step S230. Specifically, user portrait information and historical information of the user may be obtained based on the user identity, and a processing result of the voice data may then be generated based on the acquired portrait information and historical information. The portrait information and historical information may include portrait information and historical information acquired when the user uses the current voice interaction device, and may further include portrait information and historical information acquired when the user uses an associated account and/or device. Accordingly, the portrait information and historical information may be created and subsequently updated based on the user's operations with respect to the current voice interaction device, the associated account, and/or the associated device.
Specifically, similarly to creating the identity information base for the family members as shown in fig. 3, user portraits and behavior analysis results may also be created for the family members. For example, tagged preference information can be established for each family member. Table 1 below shows an example of tagged preference information for family members. This information may be stored as, or as part of, the user portrait of each user in the home. In some embodiments, the user portrait may also include natural-person attribute information of the user, such as age, height, gender, and place of residence. Such information can be acquired through user input and similar means, further improving the accuracy with which the user portrait describes the user's preferences. Similarly, this information table may be stored in the cloud (e.g., in a database connected to the voice server) indexed by home or device ID, locally (e.g., in the local smart speaker or a home database), in a database readable by an edge computing device, etc.
Table 1: Tagged preference information for family members
When a portrait is created for a user of the smart device, if historical data is lacking, it can be pulled from, for example, associated accounts (e.g., shopping, payment, and navigation accounts) and associated devices (e.g., associated apps on a smartphone and home IoT devices) to derive the tag information. The historical operation information of the associated accounts and devices, together with the operation information of the current device, may be stored in association as the historical information of the user. This historical information is continuously updated with the user's subsequent operations and may also be used to update the tags. The historical information may be stored locally, and is preferably stored in structured form in the cloud to facilitate retrieval by, for example, a memory engine service.
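A possible data structure for such a portrait, with tags updated from operations on the current device and on associated accounts/devices, is sketched below; the field names and the simple counting update are assumptions for illustration, not the disclosed storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class UserPortrait:
    user_id: str
    attributes: Dict[str, str] = field(default_factory=dict)   # e.g. age, gender, residence
    tags: Dict[str, int] = field(default_factory=dict)         # preference tag -> weight
    history: List[dict] = field(default_factory=list)          # operations across devices/accounts

    def record_operation(self, source: str, action: str, tag: str) -> None:
        """Append an operation to the history and bump the weight of its preference tag.
        The source may be the current device, an associated account, or an associated device."""
        self.history.append({"source": source, "action": action, "tag": tag})
        self.tags[tag] = self.tags.get(tag, 0) + 1

# Example: bootstrap dad's portrait from an associated shopping account and the speaker itself.
dad = UserPortrait("dad", {"age": "40", "frequent_residence": "Beijing"})
dad.record_operation("shopping_account", "purchase", "tech_products")
dad.record_operation("smart_speaker", "play", "tech_news")
print(dad.tags)
```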
The user portrait and historical information obtained based on the user identity can help determine the domain intent of the voice data, so as to generate the processing result of the voice data. In some embodiments, the invention is particularly applicable to voice data that is itself semantically ambiguous: the underlying intent is determined via the portrait and historical information corresponding to the user identity. For example, when dad comes home and says "I'm back" to the smart speaker, the implicit intent may be to have the system perform the operations that meet his usual needs after coming home, for example having the home IoT turn on the living-room lights, introducing a newly favored pop singer, or, for a music enthusiast, playing music accompanied by commentary.
In some embodiments, even if the intent of the voice data itself is clear, an additional domain intent beyond the domain intent of the voice data itself may be determined according to the user portrait and historical information corresponding to the user identity, and an additional processing result of the voice data may be generated based on the additional domain intent. For example, when the baby says "I want to watch TV", it can be inferred from the user portrait that the specific intent is to turn on the TV and watch cartoon A. At this point, the home IoT may control the smart TV to turn on and play cartoon A.
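The following sketch illustrates this idea: when the NLU yields no clear domain, a primary domain is inferred from portrait tags, and otherwise additional domains are appended to the explicit one. The tag-to-domain mapping is purely illustrative and not part of the disclosure.

```python
from typing import List, Optional, Tuple

def determine_domains(query_domain: Optional[str],
                      portrait_tags: List[str]) -> Tuple[str, List[str]]:
    """Return (primary domain, additional domains) for a voice input.

    If the NLU cannot resolve a clear domain (query_domain is None), infer the primary
    domain from the user portrait; otherwise keep the explicit domain and append
    additional domains implied by the user's preferences."""
    tag_to_domain = {"tech_news": "news", "pop_music": "music",
                     "cartoon_A": "cartoon", "breakfast": "recipes"}
    implied = [tag_to_domain[t] for t in portrait_tags if t in tag_to_domain]
    if query_domain is None:                       # ambiguous input such as "I'm back"
        primary = implied[0] if implied else "chat"
        return primary, implied[1:]
    return query_domain, implied                   # clear input: extra services appended

# Ambiguous "I'm back" from dad: primary domain comes from the portrait.
print(determine_domains(None, ["tech_news", "pop_music"]))   # ('news', ['music'])
# Clear "I want to watch TV" from the baby: video domain plus the inferred cartoon preference.
print(determine_domains("video", ["cartoon_A"]))             # ('video', ['cartoon'])
```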
In addition, customized gesture recognition for different users of, for example, the smart speaker can be achieved based on biometric features (for example, recognized faces or voiceprints).
To further clarify the intent, scene and/or context information of the voice data may also be obtained. The scene and/or context information can then be used to clarify the intent domain of the voice data so as to generate the processing result of the voice data.
Here, scene information may refer to information that describes the specific scene the user is in when the voice data is generated, or more specifically when the raw audio data is collected (examples are given later). Scene information thus characterizes the specific situation in which the user produces the speech, whereas context information may refer to, for example, the context of the ongoing dialog. A processing result corresponding to the scene and the context can therefore be generated accordingly; that is, for the same voice input, different processing results of the voice data may be generated based on different scene information.
As described above, the scene information is information describing the scene in which the user speech is produced. In one embodiment, the scene information may be determined based on at least one kind of related information. In one embodiment, the related information may include the current time. Whether the current moment is a leisurely or a hurried one also partly reflects the user's current level of attention. For example, the before-work period on a workday is more hurried than a holiday, so the interaction system can provide shorter, time-saving feedback. Further, the division of the current time into scenes may follow the specific schedule of a specific user; for example, for a retiree, the typical commute period need not be treated as a hurried time. Such divisions can be obtained by the device analyzing the usage habits of the current user, or based on user settings.
In one embodiment, the related information may include current calendar information. In other embodiments, the current calendar information may also be part of the current time information. Here, calendar information may specifically refer to information on certain festivals, holidays, or nationwide events. Such information is reflected on the calendar (or in calendar software) and can therefore be referred to as calendar information. It may then be determined that the scene information includes special calendar scene information based on the current calendar information, for example calendar information corresponding to the Spring Festival or the Double Eleven shopping festival. For a period of time before Double Eleven arrives, the interaction system may provide related promotional feedback, thereby increasing the user's willingness to browse further and participate in Double Eleven shopping.
In one embodiment, the related information may also include environment information. The environment information may be local environment information such as ambient volume and brightness, or broader environment information such as weather and temperature. For example, when the scene information indicates that the current background is noisy (e.g., the smart speaker knows that the smart TV is on), the smart speaker may perform voice interaction at a higher volume. As another example, when the scene information indicates that the current brightness is low (e.g., the smart speaker learns that the smart lights are turned off at night), the smart speaker may display feedback to the user with low screen brightness or eye-friendly warm light. Here, the volume of the played sound and the brightness of the screen can be regarded as embodiments of different interaction forms.
In one embodiment, the related information may also include user information. The user information may include user preference information set by the user or obtained from the user's usage behavior or user portrait, or it may be attribute information of the user. For example, in a certain scene the user may set a preferred interaction level higher or lower than the preset interaction level; if the user's working hours differ from regular working hours, the user may set time-related preferences accordingly; or the user may turn off the feedback corresponding to a given interaction level. As another example, whether the current interacting person is an elderly person, a child, or an adult can be identified through the voiceprint, and a corresponding tone and richness of interaction can be selected.
In one embodiment, the scene, or the related information used to determine the scene, may also be determined according to the user's location. For example, whether the user is currently in the bedroom, the living room, or the kitchen may be determined based on whether the user is talking to the voice device in the bedroom, the living room, or the kitchen, and a bedroom, living-room, or kitchen scene determined accordingly; this information can also help determine a sleeping, entertaining, or cooking scene.
In general, the scene information may be determined based on two or more kinds of related information as above, so that the user's state can be deduced more accurately from descriptions obtained through different channels, and a processing result corresponding to the user's current state can be given.
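A sketch of combining several kinds of related information into scene information is given below; the thresholds, the workday/weekend rule, and the field names are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Scene:
    pace: str                  # "hurried" or "leisurely"
    calendar: Optional[str]    # e.g. "double_eleven", "spring_festival"
    noisy: bool
    dark: bool
    room: Optional[str]        # derived from which device was woken up

def infer_scene(now: datetime,
                calendar_event: Optional[str],
                ambient_volume_db: float,
                ambient_brightness: float,
                device_location: Optional[str]) -> Scene:
    """Combine time, calendar, environment, and location information into scene information."""
    workday_morning = now.weekday() < 5 and 6 <= now.hour <= 9
    return Scene(
        pace="hurried" if workday_morning else "leisurely",
        calendar=calendar_event,
        noisy=ambient_volume_db > 60.0,        # e.g. the TV is on nearby
        dark=ambient_brightness < 10.0,        # e.g. lights off at night
        room=device_location,                  # bedroom / living room / kitchen
    )

scene = infer_scene(datetime(2023, 11, 8, 7, 30), "double_eleven", 65.0, 5.0, "kitchen")
print(scene)
```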
In particular, in some embodiments the scene and/or context information itself may be used to help clarify the intent domain. In other embodiments, the scene and/or context information may assist in filtering the user information: the required user information is filtered out of the portrait information and the historical information based on the scene and/or context information, and the processing result of the voice data is generated based on the filtered user information.
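Filtering the portrait and historical information by scene might then look like the following sketch, in which the tag-to-scene relevance rules are assumptions for illustration only.

```python
from types import SimpleNamespace

def filter_user_info(tags: dict, scene) -> dict:
    """Select the portrait/history tags relevant to the current scene."""
    relevant = {}
    for tag, weight in tags.items():
        if getattr(scene, "pace", None) == "hurried" and tag in ("weather", "traffic"):
            relevant[tag] = weight            # short, time-saving information first
        elif getattr(scene, "calendar", None) == "double_eleven" and tag.startswith("shopping"):
            relevant[tag] = weight            # surface promotion-related preferences
        elif getattr(scene, "room", None) == "kitchen" and tag in ("recipes", "breakfast"):
            relevant[tag] = weight
    return relevant

scene = SimpleNamespace(pace="hurried", calendar=None, room=None)
print(filter_user_info({"weather": 3, "tech_news": 5}, scene))   # {'weather': 3}
```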
Here, the processing result may be an interaction result for interacting with the user. The interaction result may be a voice interaction, or may take other forms, for example tactile or visual interaction. It may be, for example, data to be broadcast by voice after TTS (speech synthesis); or, in the case where the local voice device also has a display screen, the interaction result may further include data to be displayed on the screen, and so on. After the processing result of the voice data is generated, a service based on the processing result can be provided to the user.
The service may be a single service; for example, the home IoT may control the smart TV to turn on and play cartoon A. More preferably, however, the service may comprise a plurality of services, each involving a corresponding association operation, the plurality of association operations comprising at least one of: operations of the same kind performed sequentially (e.g., pop songs played one after another); and operations of different kinds performed simultaneously. Playing sound may include voice feedback; for example, while the television is being turned on, the voice feedback "the television is on" is played to the baby, and the specific form of the voice feedback may be determined based on the user identity.
Accordingly, providing the user with a service based on the processing result of the voice data may include providing the user with information flows corresponding to the same or different services, for example controlling IoT devices, providing a recommended audio stream, and giving voice feedback.
Fig. 4 shows an example of recommending information service flows according to the present invention. The middle of fig. 4 shows the conventional operation of the voice interaction system: when the user requirement is obtained, natural language understanding (NLU) is performed on the user's voice input (query), the domain to be executed is determined, and the corresponding service is presented. The present invention further adds branches for user information and scene information. In addition to conventional semantic recognition, when the user's voice input is obtained, voiceprint recognition is used to identify the user (right side of the figure), and user information is obtained based on the user identity, e.g., information extracted from the user portrait and from a memory engine (which can, for example, extract useful historical information). Further, as shown on the left side of the figure, the scene and context information of the voice input may also be determined, e.g., based on processing by a context understanding module. The information from the three branches is then aggregated and used to determine the domain to be executed. Based on the aggregated information, a recommendation engine and a knowledge graph can be queried, and corresponding voice feedback is given by the dialog management module. A recommended information service flow, which typically involves a number of different services and operations, is thereby produced.
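The aggregation of the three branches into a recommended information service flow can be sketched as follows; the simple rules stand in for the recommendation engine and knowledge graph queries and are not the actual recommendation logic.

```python
def recommend_service_flow(nlu_domain, user_info: dict, scene: dict) -> list:
    """Aggregate semantic parsing, user information, and scene/context information into an
    ordered list of (service, payload) pairs forming the information service flow."""
    flow = []
    if nlu_domain:                                   # explicit domain from the NLU
        flow.append((nlu_domain, "execute requested service"))
    for tag in user_info.get("preferred_tags", []):  # portrait / memory-engine preferences
        flow.append(("recommendation", f"content related to {tag}"))
    if scene.get("devices_on"):                      # context, e.g. TV and lights currently on
        flow.append(("iot_control", f"adjust {', '.join(scene['devices_on'])}"))
    return flow

# "The meal is ready" from mom: no explicit skill, so the flow is driven by context and portraits.
print(recommend_service_flow(
    None,
    {"preferred_tags": ["family_recipes"]},
    {"devices_on": ["living room TV", "living room light"]},
))
```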
For example, the smart speaker receives mom's voice input "the meal is ready". Based on the context and scene information (the television and the light are on in the living room), it may be determined that the processing result for "the meal is ready" includes turning off the television and the light in the living room. Subsequently, the relevant equipment in the kitchen can be turned on according to the NLU's semantic understanding of "the meal is ready" and mom's previous operation habits, and at the same time recipes favored by the family members (dad and the baby) can be recommended.
In this way, different combinations of information service flows can be recommended in scenarios where the intention is not precisely specified, according to the preferences and habits of different users.
As described above, in different application scenarios the voice processing method described with reference to fig. 2 may be a stand-alone scheme implemented entirely on the voice interaction device, a networked single-device scheme in which a single voice interaction device relies on cloud processing capability, a distributed system scheme in which a cloud server supports a large number of voice interaction terminals, or a cloud-only scheme executed by the server alone.
The specific application of the speech processing scheme of the present invention in different contexts will be further described below in conjunction with figs. 5-7.
FIG. 5 shows a block diagram of a speech processing system according to one embodiment of the invention. Here the system refers to a distributed system of larger scope (rather than, for example, a small-scope home voice interaction system), comprising a server (cloud) and a plurality of voice interaction devices. In some implementations, the multiple voice interaction devices may be of the same type, but in the broader implementation shown in fig. 5, the cloud supports voice processing for a large number of voice interaction devices of different types.
As shown in FIG. 5, distributed speech processing system 500 includes a server 510 and a plurality of voice interaction devices 520.
The voice interaction device 520 may include, for example, various types of interaction terminals, such as the illustrated smart speaker (e.g., a smart speaker with a screen), a smart television, an in-vehicle entertainment system, and so on. Here, the smart speaker may serve as the central node of home smart interaction and handle scenarios with multiple household users. The smart television may serve as the central node of a smart video conference and handle scenarios with multiple participants. The in-vehicle entertainment system can handle multi-user scenarios involving the driver and other passengers in the vehicle. The present invention does not limit the implementation form of the voice interaction device 520.
The voice interaction device 520 may be configured to: receive voice data from a user; determine a user identity of the user; and upload the voice data and the user identity to the server 510.
The server 510 may be a cloud server that provides networked voice services for all voice interaction terminals under the same brand or manufacturer. The server 510 may be configured to process the voice data, and to generate and issue a processing result of the voice data based on the user identity.
Specifically, the voice interaction device 520 may identify the user identity by collecting and recognizing the data itself, by relying on the cloud server to recognize the data it collects, or by obtaining collected data or recognition results from other devices. In one embodiment, the voice interaction device 520 may perform voiceprint recognition on the voice data and determine the user identity based on the recognized voiceprint. Alternatively or additionally, the voice interaction device 520 may obtain a determined user identity from a biometric component or device. The biometric component or device may include at least one of: a fingerprint recognition component or device; and a face recognition component or device. The biometric components or devices may be components included in the voice interaction device 520, such as a camera and an image processing device, or may be separate devices, such as a fingerprint lock.
Further, the voice interaction device 520 may participate in establishing the aforementioned identity information base. In particular, the voice interaction device 520 may be configured to: acquire respective identity information of a plurality of users, the identity information being used for determining the users' identities; and generate, locally or on a server (or an edge computing device), an identity information base including the identity information. Subsequently, the voice interaction device may acquire identity information of a user, compare the acquired biometric information of the user with the identity information stored in the identity information base, and determine the user identity according to the comparison result.
After obtaining the user identity, the server 510 may be configured to query portrait information and historical information of the user based on the user identity, and to generate a processing result of the voice data based on the portrait information and historical information. As before, the user's portrait information and historical information may preferably be stored in the cloud, or partly locally, for example as the tag information table shown in Table 1 above.
In one embodiment, the server 510 may be used to: acquire scene and/or context information of the voice data; filter the required user information from the portrait information and the historical information based on the scene and/or context information; and generate a processing result of the voice data based on the filtered user information.
In one embodiment, the server 510 may be used to: determine a domain intent of the voice data and/or an additional domain intent beyond the domain intent of the voice data itself, based on the user identity; and generate a processing result of the voice data based on the domain intent and/or the additional domain intent.
Subsequently, the server 510 may return service information based on the processing result of the voice data. The voice interaction device 520 obtains the service information returned by the server and, based on the service information, performs a corresponding association operation, where the association operation includes at least one of the following: playing sound; a visual presentation; and controlling other devices.
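On the device side, executing the returned service information might look like the following sketch; the operation type names and payload fields are assumptions, since the actual service-information format is not specified here.

```python
def execute_service_info(service_info: list) -> None:
    """Execute the association operations described by the service information from the server."""
    for op in service_info:
        kind = op.get("type")
        if kind == "play_sound":
            print(f"[speaker] playing: {op['content']}")        # voice feedback / audio stream
        elif kind == "visual":
            print(f"[screen] showing: {op['content']}")         # visual presentation
        elif kind == "control_device":
            print(f"[iot] {op['device']} -> {op['command']}")   # control a networked IoT device
        else:
            print(f"[warn] unknown operation: {op}")

execute_service_info([
    {"type": "control_device", "device": "smart_tv", "command": "play cartoon A"},
    {"type": "play_sound", "content": "The television is on."},
])
```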
Preferably, the voice interaction device 520 is a smart speaker, and the other devices are internet of things devices networked with the smart speaker.
FIG. 6 is a block diagram of a voice interaction device, according to one embodiment of the present invention. The voice interaction device 600 may be the voice interaction device 520 shown in the previous figures.
As shown in fig. 6, the voice interaction apparatus 600 includes: voice data receiving means 610 for receiving voice data of a user; user identity determination means 620 for determining a user identity of the user; networking means 630 for uploading the acquired voice data and the user identity to a server and acquiring a processing result of the voice data generated and issued by the server based on the user identity; and interaction means 640 for performing interaction based on the issued processing result.
In one embodiment, the voice data receiving means 610 may be microphone means of the apparatus 600 for collecting the user's voice data. In other embodiments, the apparatus 600 may include short-range communication means for obtaining user voice data collected by other voice terminals, for example voice data collected and transmitted by smart voice tags placed in other rooms of the home. In addition, the short-range communication means can also acquire identity data collected by other devices for determining the user identity, or the determined user identity itself, for example the user identity determined by a coded lock.
In different implementations, the interaction means 640 may comprise at least one of: loudspeaker means for broadcasting the processing result to the user; display screen means for displaying the processing result to the user; and the short-range communication means, which may be used to send the acquired processing result to other devices, for example controlled IoT devices.
In one embodiment, the device 600 may further include a scene information acquiring means for acquiring scene information. The scene information may include at least one of: a voice interaction device type; the current time; a current geographic location; a current speed; current calendar information; and current environmental information, etc. The scene information acquiring means may include at least one of: the networking means, for acquiring scene information by querying a network; a positioning means, for acquiring scene position information through a positioning system (for example, acquiring GPS information as geographic position information through a GPS device); one or more sensors, for sensing one or more items of scene sensing information (for example, sensing a vehicle speed); and a system access means, for reading local system configuration information (for example, device type information). The networking means may upload the acquired scene information to the server, and acquire the processing result generated by the server based on the user information screened according to the scene information.
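The following sketch gathers scene information from the local system configuration, the system clock, and optional positioning or sensor callables; the helper parameters are hypothetical stand-ins for real drivers and are not defined by the specification.

```python
# Illustrative sketch (assumption): assemble scene information before upload.
import datetime


def collect_scene_info(device_type, get_location=None, get_speed=None):
    info = {
        "device_type": device_type,                     # read from local system configuration
        "time": datetime.datetime.now().isoformat(),    # current time
    }
    if get_location:
        info["location"] = get_location()               # e.g., GPS coordinates
    if get_speed:
        info["speed"] = get_speed()                     # e.g., vehicle speed sensor reading
    return info
```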
As mentioned above, the device may collect respective identity information of a plurality of users, where the identity information is used to determine the identities of the users, and may generate, locally or on a server, an identity information base comprising the identity information.
In addition, the device may locally perform part of the functionality of the voice interaction chain, and thus the device 600 may further comprise at least one of: a voiceprint recognition means for performing voiceprint recognition on at least part of the acquired voice data; a speech recognition means for performing speech recognition on at least part of the acquired voice data; and a natural language understanding means for performing intent and domain recognition on at least part of the voice data that has undergone speech recognition.
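A minimal sketch of such a local/remote split is given below, assuming the device performs voiceprint recognition and speech recognition locally and uploads only the identity and transcript, while the server handles the remaining understanding and result generation; all interfaces named here are assumptions.

```python
# Illustrative sketch (assumption): run part of the voice interaction chain
# locally, then hand off to the server.
def process_locally_then_upload(audio, voiceprint_recognizer, asr, uploader):
    user_id = voiceprint_recognizer.identify(audio)    # local voiceprint recognition
    text = asr.transcribe(audio)                       # local speech recognition
    uploader.send({"user": user_id, "text": text})     # server performs NLU and result generation
```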
FIG. 7 shows a flowchart of a voice interaction method according to an embodiment of the invention. The method may be implemented by the voice interaction device described above.
In step S710, voice data of a user is received. In step S720, the user identity of the user is determined. In step S730, the acquired voice data and the user identity are uploaded to a server. In step S740, a processing result of the voice data, generated and issued by the server based on the user identity, is acquired. In step S750, an operation is performed based on the issued processing result.
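For orientation only, the sequence S710 through S750 could be sketched as follows, with every object standing in for a hypothetical interface rather than anything defined by the specification.

```python
# Illustrative sketch (assumption): one round of the S710-S750 flow.
def voice_interaction_round(mic, identifier, server, actuator):
    audio = mic.record()                       # S710: receive voice data
    user_id = identifier.identify(audio)       # S720: determine the user identity
    server.upload(audio, user_id)              # S730: upload voice data and identity
    result = server.fetch_result()             # S740: obtain the issued processing result
    actuator.perform(result)                   # S750: operate based on the result
```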
In one embodiment, determining the user identity of the user comprises at least one of: determining the user identity of the user based on voiceprint information extracted from the voice data; and acquiring, from other devices, the user identity or identity information used for determining the user identity.
In one embodiment, acquiring the voice data may include at least one of: collecting voice data of a user by using a microphone device; and acquiring user voice data acquired by other voice terminals by using the short-distance communication device.
In one embodiment, interacting based on the delivered processing result may include at least one of: broadcasting the processing result to a user by using a loudspeaker device; displaying the processing result to a user by using a display screen device; and transmitting the acquired processing result to other voice terminals using a short-range communication device.
In one embodiment, the method further comprises: acquiring respective identity information of a plurality of users, wherein the identity information is used for determining the identity of the users; and generating an identity information base comprising the identity information locally or on a server.
Further, the device may locally perform part of the functionality of the voice interaction chain, and therefore the method may further comprise: performing speech recognition on at least part of the acquired voice data; and performing intent and domain recognition on at least part of the voice data that has undergone speech recognition.
The speech processing scheme of the present invention is also applicable to a variety of specific application scenarios.
In one embodiment, the present invention may also be implemented as a speech processing method, including: receiving voice data from a user; acquiring image data of the user; determining a user identity of the user based on the voice data and/or the image data; and generating a processing result of the voice data based on the user identity. The method is particularly suitable for implementation by a home intelligence system or an intelligent conference system comprising a voice device and a camera. The voice device can collect voice data of the user and perform voiceprint extraction, and the camera can collect image data of the user for user feature extraction and recognition (such as face recognition), so that the user identity is determined jointly or by either modality alone, and a corresponding voice data processing result is given. In this example, if the system is equipped with a plurality of voice devices, or the camera has a depth measurement function, the voice interaction scene may be further refined according to the position of the voice device being interacted with, or according to the position of the user determined from the depth image data acquired by the camera, so as to clarify the user's intent and give corresponding feedback.
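A sketch of joint or single-modality identity determination is shown below; the score scale, the 0.6/0.4 weighting, and the threshold are assumptions made only for illustration.

```python
# Illustrative sketch (assumption): fuse voiceprint and face-recognition
# confidences to decide the user identity jointly, falling back to whichever
# modality is available.
def fuse_identity(voice_scores, face_scores, threshold=0.7):
    """Each argument maps user_id -> confidence in [0, 1]; either may be empty."""
    candidates = set(voice_scores) | set(face_scores)
    best_id, best_score = None, 0.0
    for uid in candidates:
        v, f = voice_scores.get(uid), face_scores.get(uid)
        if v is not None and f is not None:
            score = 0.6 * v + 0.4 * f                 # joint determination
        else:
            score = v if v is not None else f         # single-modality fallback
        if score > best_score:
            best_id, best_score = uid, score
    return best_id if best_score >= threshold else None
```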
In one embodiment, the present invention may also be applied to an in-vehicle scenario. To this end, the present invention may also be embodied as a vehicle-mounted voice processing system including: a microphone for receiving voice data of a user; a processor for determining a user identity of the user based on the voice data; and an interaction device for interacting according to a voice processing result generated based on the user identity. Here, the user identity may refer to a specific identity division in the vehicle-mounted scene, e.g., driver and passenger, and the passenger may be further divided into a front passenger and a rear passenger, and so on. The on-board system may then specify that certain commands can be issued only by users identified as the "driver".
The determination of the identity may be based on different mechanisms. For example, the microphone may comprise a plurality of groups of microphones disposed at different locations in the vehicle, and the processor determines the user identity of the user from the voice data acquired by the plurality of groups of microphones. Alternatively or additionally, the vehicle-mounted system may further comprise an image acquisition device, and the processor may further determine the user identity of the user according to image information acquired by the image acquisition device.
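By way of example only, one could map the microphone group with the strongest signal to a seat role and gate driver-only commands accordingly; the seat mapping and command list below are invented and serve only to illustrate the idea.

```python
# Illustrative sketch (assumption): seat-role resolution from microphone
# energies, plus a driver-only command check.
MIC_TO_ROLE = {
    "mic_front_left": "driver",
    "mic_front_right": "front_passenger",
    "mic_rear": "rear_passenger",
}
DRIVER_ONLY_COMMANDS = {"open_sunroof", "set_cruise_speed"}


def resolve_role(mic_energies):
    loudest = max(mic_energies, key=mic_energies.get)   # microphone group with strongest pickup
    return MIC_TO_ROLE.get(loudest, "passenger")


def is_command_allowed(command, role):
    return command not in DRIVER_ONLY_COMMANDS or role == "driver"
```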
Further, the present invention can also be implemented as a speech processing system comprising a plurality of voice interaction devices. The plurality of voice interaction devices may be, for example, different voice devices arranged in a living room, a kitchen, and a bedroom in a smart home, and may be used to receive voice data from a user.
Typically, one of the plurality of voice interaction devices is woken up to interact with the user. The interaction may implement the speech processing method of the present invention, and may include: receiving voice data from a user; determining a user identity of the user based on the voice data, and determining a current interaction scene based on the location of the voice interaction device that has been woken up; and generating a processing result of the voice data based on the user identity and the current interaction scene. For example, based on the identity "dad" and the living room location, an entertainment scene may be determined and the television turned on.
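A minimal sketch of this scene selection follows, echoing the "dad in the living room" example above; the room/identity-to-action table is a made-up illustration, not a prescribed mapping.

```python
# Illustrative sketch (assumption): combine the woken device's room with the
# recognized user identity to pick an interaction scene and action.
SCENE_ACTIONS = {
    ("living_room", "dad"): ("entertainment", "turn_on_tv"),
    ("kitchen", "mom"): ("cooking", "read_recipe"),
}


def decide_action(woken_device_room, user_id):
    return SCENE_ACTIONS.get((woken_device_room, user_id), ("default", "ask_clarification"))
```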
FIG. 8 is a schematic structural diagram of a computing device that can be used to implement the speech processing method described above according to an embodiment of the invention.
Referring to fig. 8, computing device 800 includes memory 810 and processor 820.
The processor 820 may be a multi-core processor or may include multiple processors. In some embodiments, the processor 820 may include a general-purpose host processor and one or more special-purpose coprocessors, such as a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), or the like. In some embodiments, the processor 820 may be implemented using custom circuitry, such as an Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA).
The memory 810 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions required by the processor 820 or other modules of the computer. The permanent storage may be a readable and writable storage device, and may be a non-volatile storage device that does not lose the stored instructions and data even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is employed as the permanent storage. In other embodiments, the permanent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable memory device or a volatile readable and writable memory device, such as dynamic random access memory. The system memory may store instructions and data that some or all of the processors require at runtime. In addition, the memory 810 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) as well as magnetic and/or optical disks. In some embodiments, the memory 810 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM or dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., an SD card, a mini SD card, a Micro-SD card, etc.), or a magnetic floppy disk. The computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 810 has stored thereon executable code which, when executed by the processor 820, causes the processor 820 to perform the speech processing methods described above.
The voice processing method and system, and the voice interaction device and method according to the present invention have been described in detail above with reference to the accompanying drawings. The voice processing scheme of the present invention can provide personalized voice interaction services according to the identity of the user, and can recommend different combinations of information service flows for scenes in which the user's intent is not precisely expressed, according to the different preferences of different users.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the above-mentioned steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.