Sound source localization method, apparatus, device, and storage medium
1. A sound source localization method, comprising:
acquiring a sound signal, and extracting acoustic features of a sound-emitting object from the sound signal;
determining visual features of the sound-emitting object that match the acoustic features of the sound-emitting object, wherein the visual features of the sound-emitting object comprise features of the sound-emitting object extracted from an image of the sound-emitting object;
and determining, according to the visual features of the sound-emitting object that match the acoustic features of the sound-emitting object and an acquired scene image, the sound-emitting object that emits the sound signal, and determining a position of the sound-emitting object.
2. The method according to claim 1, wherein the determining visual features of the sound-emitting object that match the acoustic features of the sound-emitting object comprises:
determining, from a pre-constructed feature data set, visual features of the sound-emitting object that match the acoustic features of the sound-emitting object;
wherein the feature data set stores acoustic features of sound-emitting objects and visual features of sound-emitting objects corresponding to those acoustic features.
3. The method according to claim 2, further comprising: acquiring scene images in real time, identifying visual features of sound-emitting objects from the acquired scene images, and storing the identified visual features of the sound-emitting objects;
wherein, after the acquiring a sound signal and extracting acoustic features of the sound-emitting object from the sound signal, the method further comprises:
acquiring visual features of sound-emitting objects obtained within a target time period, wherein the target time period is a period of set duration before the sound signal is acquired;
and storing, in correspondence, the visual features of the sound-emitting objects acquired within the target time period and the acoustic features of the sound-emitting object into the feature data set.
4. The method according to claim 3, wherein the storing, in correspondence, the visual features of the sound-emitting object acquired within the target time period and the acoustic features of the sound-emitting object into the feature data set comprises:
detecting whether the acoustic features of the sound-emitting object are already stored in the feature data set, and whether the number of visual features of the sound-emitting object corresponding to those acoustic features has reached a set number;
if the acoustic features of the sound-emitting object are stored in the feature data set and the number of corresponding visual features has reached the set number, updating the visual features of the sound-emitting object stored in the feature data set with the visual features of the sound-emitting object acquired within the target time period;
if the acoustic features of the sound-emitting object are stored in the feature data set and the number of corresponding visual features has not reached the set number, storing the visual features of the sound-emitting object acquired within the target time period into the feature data set as visual features corresponding to the acoustic features of the sound-emitting object;
and if the acoustic features of the sound-emitting object are not stored in the feature data set, storing, in correspondence, the visual features of the sound-emitting object acquired within the target time period and the acoustic features of the sound-emitting object into the feature data set.
5. The method according to claim 2, wherein, when a plurality of visual features of the sound-emitting object corresponding to the acoustic features of the sound-emitting object are stored in the feature data set, the determining, from a pre-constructed feature data set, visual features of the sound-emitting object that match the acoustic features of the sound-emitting object comprises:
selecting, from the visual features of the sound-emitting object corresponding to the acoustic features of the sound-emitting object stored in the pre-constructed feature data set, the visual features that occur most frequently or that were stored most recently, as the visual features of the sound-emitting object that match the acoustic features of the sound-emitting object.
6. The method according to claim 5, wherein the selecting, from the visual features of the sound-emitting object corresponding to the acoustic features of the sound-emitting object stored in the pre-constructed feature data set, the visual features that occur most frequently or that were stored most recently comprises:
if at least three visual features of the sound-emitting object corresponding to the acoustic features of the sound-emitting object are stored in the pre-constructed feature data set, selecting, from those visual features, the visual features that occur most frequently;
and if fewer than three visual features of the sound-emitting object corresponding to the acoustic features of the sound-emitting object are stored in the pre-constructed feature data set, selecting, from those visual features, the visual features that were stored most recently.
7. The method according to claim 1, wherein the determining, according to the visual features of the sound-emitting object that match the acoustic features of the sound-emitting object and the acquired scene image, the sound-emitting object that emits the sound signal, and determining the position of the sound-emitting object comprises:
detecting, from a scene image acquired by a camera, the sound-emitting object that emits the sound signal, according to the visual features of the sound-emitting object that match the acoustic features of the sound-emitting object;
and determining the position of the sound-emitting object according to the position, in the scene image acquired by the camera, of the detected sound-emitting object that emits the sound signal.
8. The method according to claim 7, wherein the detecting, from the scene image acquired by the camera, the sound-emitting object that emits the sound signal, according to the visual features of the sound-emitting object that match the acoustic features of the sound-emitting object comprises:
detecting a target sound-emitting object from the scene image acquired by the camera, wherein the target sound-emitting object satisfies the following condition: the similarity between the visual features of the sound-emitting object that match the acoustic features of the sound-emitting object and visual features of the target sound-emitting object is greater than a set similarity threshold;
if the target sound-emitting object is detected, determining the target sound-emitting object as the sound-emitting object that emits the sound signal;
if the target sound-emitting object is not detected, controlling the camera to rotate toward a sound source position, wherein the sound source position is determined from the sound signal;
detecting the target sound-emitting object from scene images acquired while the camera rotates toward the sound source position;
and if the target sound-emitting object is detected, determining the target sound-emitting object as the sound-emitting object that emits the sound signal.
9. The method according to claim 8, wherein the target sound-emitting object further satisfies the following condition: the deviation between the position of the target sound-emitting object and the sound source position determined from the sound signal is within a preset deviation range.
10. The method of claim 1, further comprising:
determining the sound source position from the sound signal if visual features of the sound-emitting object that match the acoustic features of the sound-emitting object cannot be determined.
11. The method according to claim 1, wherein the visual features of the sound-emitting object comprise facial features of the sound-emitting object extracted from an image of the sound-emitting object.
12. A sound source localization apparatus, comprising:
a signal acquisition unit, configured to acquire a sound signal and extract acoustic features of a sound-emitting object from the sound signal;
a feature determination unit, configured to determine visual features of the sound-emitting object that match the acoustic features of the sound-emitting object, wherein the visual features of the sound-emitting object comprise features of the sound-emitting object extracted from an image of the sound-emitting object;
and a sound source localization unit, configured to determine, according to the visual features of the sound-emitting object that match the acoustic features of the sound-emitting object and an acquired scene image, the sound-emitting object that emits the sound signal, and determine a position of the sound-emitting object.
13. A sound source localization device, comprising:
a memory and a processor;
wherein the memory is connected to the processor and is configured to store a program;
and the processor is configured to implement the sound source localization method according to any one of claims 1 to 11 by running the program stored in the memory.
14. A storage medium having stored thereon a computer program which, when executed by a processor, implements the sound source localization method according to any one of claims 1 to 11.
Background
As smart devices become more intelligent, sound source localization has gradually become an essential function for most of them. For example, an intelligent robot needs to determine, by sound source localization, the position of a user who calls it, so that it can move or turn toward the user and accurately understand the user's instructions.
Conventional sound source localization collects sound signals with a microphone array. In a noisy environment, however, the sound emitted by the source may be corrupted by noise, and microphone precision is limited, so a localization result that relies on microphones alone is often unreliable.
Disclosure of Invention
In view of the above, the present application provides a sound source localization method, apparatus, device, and storage medium that can improve sound source localization accuracy.
In order to achieve the above purpose, the present application proposes the following technical solutions:
a sound source localization method, comprising:
acquiring a sound signal, and extracting acoustic features of a sound-emitting object from the sound signal;
determining visual features of the sound-emitting object that match the acoustic features of the sound-emitting object, wherein the visual features of the sound-emitting object comprise features of the sound-emitting object extracted from an image of the sound-emitting object;
and determining, according to the visual features of the sound-emitting object that match the acoustic features of the sound-emitting object and an acquired scene image, the sound-emitting object that emits the sound signal, and determining a position of the sound-emitting object.
Optionally, the determining visual features of the sound-emitting object that match the acoustic features of the sound-emitting object includes:
determining, from a pre-constructed feature data set, visual features of the sound-emitting object that match the acoustic features of the sound-emitting object;
wherein the feature data set stores acoustic features of sound-emitting objects and visual features of sound-emitting objects corresponding to those acoustic features.
Optionally, the method further includes acquiring scene images in real time, identifying visual features of sound-emitting objects from the acquired scene images, and storing the identified visual features of the sound-emitting objects;
after acquiring the sound signal and extracting acoustic features of the sound-emitting object from the sound signal, the method further includes:
acquiring visual features of sound-emitting objects obtained within a target time period, where the target time period is a period of set duration before the sound signal is acquired;
and storing, in correspondence, the visual features of the sound-emitting objects acquired within the target time period and the acoustic features of the sound-emitting object into the feature data set.
Optionally, storing, in correspondence, the visual features of the sound-emitting object acquired within the target time period and the acoustic features of the sound-emitting object into the feature data set includes:
detecting whether the acoustic features of the sound-emitting object are already stored in the feature data set, and whether the number of visual features of the sound-emitting object corresponding to those acoustic features has reached a set number;
if the acoustic features of the sound-emitting object are stored in the feature data set and the number of corresponding visual features has reached the set number, updating the visual features of the sound-emitting object stored in the feature data set with the visual features of the sound-emitting object acquired within the target time period;
if the acoustic features of the sound-emitting object are stored in the feature data set and the number of corresponding visual features has not reached the set number, storing the visual features of the sound-emitting object acquired within the target time period into the feature data set as visual features corresponding to the acoustic features of the sound-emitting object;
and if the acoustic features of the sound-emitting object are not stored in the feature data set, storing, in correspondence, the visual features of the sound-emitting object acquired within the target time period and the acoustic features of the sound-emitting object into the feature data set.
Optionally, when a plurality of visual features of the sound-emitting object corresponding to the acoustic features of the sound-emitting object are stored in the feature data set, determining, from the pre-constructed feature data set, visual features of the sound-emitting object that match the acoustic features of the sound-emitting object includes:
selecting, from the visual features of the sound-emitting object corresponding to the acoustic features of the sound-emitting object stored in the pre-constructed feature data set, the visual features that occur most frequently or that were stored most recently, as the visual features of the sound-emitting object that match the acoustic features of the sound-emitting object.
Optionally, selecting, from the visual features of the sound-emitting object corresponding to the acoustic features of the sound-emitting object stored in the pre-constructed feature data set, the visual features that occur most frequently or that were stored most recently includes:
if at least three visual features of the sound-emitting object corresponding to the acoustic features of the sound-emitting object are stored in the pre-constructed feature data set, selecting, from those visual features, the visual features that occur most frequently;
and if fewer than three visual features of the sound-emitting object corresponding to the acoustic features of the sound-emitting object are stored in the pre-constructed feature data set, selecting, from those visual features, the visual features that were stored most recently.
Optionally, determining, according to the visual features of the sound-emitting object that match the acoustic features of the sound-emitting object and the acquired scene image, the sound-emitting object that emits the sound signal, and determining the position of the sound-emitting object include:
detecting, from a scene image acquired by a camera, the sound-emitting object that emits the sound signal, according to the visual features of the sound-emitting object that match the acoustic features of the sound-emitting object;
and determining the position of the sound-emitting object according to the position, in the scene image acquired by the camera, of the detected sound-emitting object that emits the sound signal.
Optionally, detecting, from the scene image acquired by the camera, the sound-emitting object that emits the sound signal, according to the visual features of the sound-emitting object that match the acoustic features of the sound-emitting object includes:
detecting a target sound-emitting object from the scene image acquired by the camera, where the target sound-emitting object satisfies the following condition: the similarity between the visual features of the sound-emitting object that match the acoustic features of the sound-emitting object and visual features of the target sound-emitting object is greater than a set similarity threshold;
if the target sound-emitting object is detected, determining the target sound-emitting object as the sound-emitting object that emits the sound signal;
if the target sound-emitting object is not detected, controlling the camera to rotate toward a sound source position, where the sound source position is determined from the sound signal;
detecting the target sound-emitting object from scene images acquired while the camera rotates toward the sound source position;
and if the target sound-emitting object is detected, determining the target sound-emitting object as the sound-emitting object that emits the sound signal.
Optionally, the target sound-emitting object further satisfies the following condition: the deviation between the position of the target sound-emitting object and the sound source position determined from the sound signal is within a preset deviation range.
Optionally, the method further includes:
determining the sound source position from the sound signal if visual features of the sound-emitting object that match the acoustic features of the sound-emitting object cannot be determined.
Optionally, the visual features of the sound-emitting object include facial features of the sound-emitting object extracted from an image of the sound-emitting object.
A sound source localization apparatus comprising:
a signal acquisition unit, configured to acquire a sound signal and extract acoustic features of a sound-emitting object from the sound signal;
a feature determination unit, configured to determine visual features of the sound-emitting object that match the acoustic features of the sound-emitting object, where the visual features of the sound-emitting object include features of the sound-emitting object extracted from an image of the sound-emitting object;
and a sound source localization unit, configured to determine, according to the visual features of the sound-emitting object that match the acoustic features of the sound-emitting object and an acquired scene image, the sound-emitting object that emits the sound signal, and determine a position of the sound-emitting object.
A sound source localization device, comprising:
a memory and a processor;
wherein the memory is connected to the processor and is configured to store a program;
and the processor is configured to implement the above sound source localization method by running the program stored in the memory.
A storage medium having stored thereon a computer program which, when executed by a processor, implements the sound source localization method described above.
According to the above sound source localization method, the sound-emitting object is located by means of both sound and images. During localization, the acoustic features of the sound-emitting object extracted from the sound signal, together with empirical data on the correspondence between acoustic features and visual features, are used to detect the matching sound-emitting object in the scene image, so that the sound-emitting object is identified and located. This localization process enriches the data basis for sound source localization and therefore achieves higher accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic flow chart of a sound source localization method according to an embodiment of the present application;
Fig. 2 is a schematic flow chart of another sound source localization method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of the data storage form in a feature data set according to an embodiment of the present application;
Fig. 4 is a schematic flow chart of yet another sound source localization method according to an embodiment of the present application;
Fig. 5 is a schematic diagram illustrating the principle of calculating the position of an object from an image according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a sound source localization apparatus according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a sound source localization device according to an embodiment of the present application.
Detailed Description
The technical solution of the embodiments of the present application is suitable for sound source localization scenarios; by means of this solution, sound and images can be used together to accurately locate sound-emitting objects.
A conventional sound source localization scheme locates the source using only the sound signals collected by microphones. For example, two or more microphones are installed on the same device or on different devices, the microphones collect the sound signal simultaneously, and the position of the sound source is calculated from the path difference between the microphones and their known positions.
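As a simple illustration (not part of the embodiments described below), under a far-field assumption the direction of arrival for a two-microphone pair follows directly from the measured time difference:

$$\theta = \arcsin\!\left(\frac{c\,\Delta t}{d}\right), \qquad c \approx 343\ \mathrm{m/s},$$

where $\Delta t$ is the inter-microphone time difference and $d$ the microphone spacing. For example, $\Delta t = 0.5\ \mathrm{ms}$ and $d = 0.3\ \mathrm{m}$ give $\theta \approx \arcsin(0.57) \approx 35^\circ$.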
Such a scheme works well in a quiet or ideal environment such as a laboratory, but in a noisy environment the sound source may be masked by noise; moreover, microphone precision is limited, so localization that relies on microphones alone is often unreliable.
To improve accuracy, the embodiments of the present application provide a sound source localization method that applies both the sound and the images of a scene to localization. This overcomes the susceptibility to noise of sound-only localization and improves stability; at the same time, using sound and images together enriches the data basis for localization and improves its precision.
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art from these embodiments without creative effort shall fall within the protection scope of the present application.
The embodiments of the present application provide a sound source localization method that can be applied to smart devices with data processing capability, in particular electronic devices able to process both sound data and image data. For example, it can be applied to devices such as intelligent robots and smart terminals, or to an intelligent system that includes data processing devices such as a sound processing device and an image processing device. As a preferred implementation, the method is applied to an electronic device with audio acquisition and processing functions and image acquisition and processing functions; by executing the technical solution of the embodiments, such a device can acquire sound signals and images and locate the source of an acquired sound signal based on both.
In the following embodiments, a scene in which a user calls an intelligent robot is taken as an example, and the specific processing procedure of the sound source localization method is introduced by describing how the intelligent robot implements the technical solution of the embodiments to identify and locate the user.
Referring to Fig. 1, a sound source localization method provided in an embodiment of the present application includes:
s101, acquiring a sound signal, and extracting acoustic features of a sound-emitting object from the sound signal.
The sound signal may be a signal emitted by any sound-emitting object. For example, it may be the voice of someone speaking, or the sound of something vibrating, colliding, or rubbing. It should be understood that when any object makes a sound for any reason, that object may be regarded as a sound-emitting object.
The sound signal may be collected by a microphone. As a preferred implementation, it is collected by a microphone array, so that the collected signal can itself be used to locate the position of the sound source, that is, the position of the sound-emitting object that emits the sound signal.
For example, a microphone array is arranged on the intelligent robot and collects, in real time, sound signals from the scene in which the robot is located. Since all kinds of sounds may occur in that scene and the robot cannot respond to all of them, the sound signal may be restricted to a specific type, for example a wake-up sound for waking the robot. In that case, the sound signal acquired by the robot is specifically a wake-up sound signal; that is, when the robot acquires a wake-up sound signal from the scene, it extracts the acoustic features of the sound-emitting object from that wake-up sound signal.
It should be understood that the specific type or content of the sound signal acquired in the above step can be set flexibly according to the actual scene or business requirement; that is, only sound signals of a certain type or content need be collected and processed. For example, the intelligent robot may be configured to acquire wake-up sound signals and process them to locate the person waking it up, or to acquire footstep sounds and process them to locate or follow a pedestrian.
The acoustic features of the sound-emitting object are specifically the acoustic features of the object that emits the sound signal, that is, the acoustic characteristics of the sound it produces. For example, if the sound signal is the voice of a user speaking, the acoustic features extracted from it may be voiceprint features, which can represent characteristic information such as the user's timbre.
It will be appreciated that the sounds produced by different sound-emitting objects have their own distinctive acoustic features. For example, the voiceprint features of different users uttering the same words differ, and each user's voiceprint represents that user's unique timbre. Likewise, different animals, such as birds and dogs, produce calls with different acoustic features. Sounds emitted by the same object for different reasons can also differ acoustically, such as the sound of paper being rubbed versus the sound of paper being torn.
It is therefore useful to extract the acoustic features of the sound-emitting object from the sound signal and to infer the object from those features. For example, if the extracted acoustic features match those of a bird's call, the sound-emitting object is most likely a bird. Accordingly, in this embodiment the acoustic features of the sound-emitting object are extracted from the acquired sound signal and used to determine which object emitted the signal.
As an optional implementation, the acoustic features of the sound-emitting object may be extracted from the sound signal using any existing or future acoustic feature extraction method; this embodiment neither details nor limits the specific extraction manner. For illustration, one possible extraction is sketched below.
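The following is only an illustrative choice, not prescribed by the embodiments: it pools MFCCs into a fixed-length voiceprint-like vector and assumes the librosa library is available.

```python
import numpy as np
import librosa

def extract_acoustic_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Return a fixed-length acoustic (voiceprint-like) feature vector.

    Mean/std pooling of MFCCs is only one illustrative choice; the embodiments
    leave the concrete extraction method open."""
    signal, _ = librosa.load(wav_path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20)   # shape: (20, frames)
    # Pool over time so utterances of different lengths yield comparable vectors.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```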
S102, determining visual features of the sound-emitting object that match the acoustic features of the sound-emitting object.
The visual features of the sound-emitting object are the features by which the object can be recognized visually, that is, its appearance features. For example, if the sound-emitting object is a person, visually accessible information such as height, weight, skin color, sex, and face may serve as that person's visual features.
The technical solution of the embodiments is executed by an electronic processing device so that the device can perform sound source localization automatically. For such a device, vision is provided by its camera, and visual information is obtained from the images the camera captures. Acquiring the visual features of a sound-emitting object therefore means extracting features related to that object from an image of it; in the embodiments of the present application, the visual features of the sound-emitting object are specifically features of the object extracted from its image.
The visual features of the sound-emitting object may be one or more visually observable features, that is, one or more features extracted from an image of the object. For example, if the sound-emitting object is a person, one or more of the person's facial features, skin color features, body features, and the like may serve as the visual features.
As a preferred implementation, this embodiment uses the facial features of the sound-emitting object, obtained from an image of the object, as its visual features.
Typically, each sound-emitting object has unique acoustic features as well as unique visual features. For example, for a person named Zhang San, the voiceprint of his speaking voice is unique, and so are his facial features. A correspondence between Zhang San's voiceprint features and his facial features can therefore be established in advance. When Zhang San's voiceprint features are later extracted from some voice signal, the pre-established correspondence can be used to find the facial features that match them, and it can thereby be determined that the voice signal was emitted by Zhang San.
Based on this idea, as an optional implementation, the acoustic features and the visual features of each sound-emitting object may be determined in advance and a feature data set may be established that stores the correspondence between them; that is, the feature data set stores acoustic features of sound-emitting objects and the visual features corresponding to those acoustic features. Then, when acoustic features of a sound-emitting object are extracted from an acquired sound signal, the visual features that match them can be determined by querying the feature data set, as sketched below.
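As an illustrative, non-limiting sketch of such a query, the feature data set can be pictured as a small in-memory store in which each entry pairs an acoustic feature vector with its stored visual feature vectors; the class, the cosine comparison, and the similarity threshold below are assumptions rather than the embodiments' prescribed design.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class FeatureDataSet:
    """Each entry pairs one acoustic feature vector with the visual feature
    vectors stored against it (cf. the key/value layout of Fig. 3)."""
    entries: dict = field(default_factory=dict)  # entry id -> {"acoustic": vec, "visual": [vecs]}

    @staticmethod
    def _cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def match_visual(self, acoustic: np.ndarray, threshold: float = 0.8):
        """Return the visual features stored under the most similar acoustic key,
        or None if no stored key is similar enough (the threshold is illustrative)."""
        best_key, best_sim = None, threshold
        for key, entry in self.entries.items():
            sim = self._cosine(acoustic, entry["acoustic"])
            if sim >= best_sim:
                best_key, best_sim = key, sim
        return self.entries[best_key]["visual"] if best_key is not None else None
```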
For example, so that several users can wake an intelligent robot, the wake-up voices of those users may be recorded in the robot in advance, from which the robot extracts each user's voiceprint features; the users' faces are likewise recorded so that the robot extracts their facial features. The robot then stores each user's voiceprint features and facial features in memory as corresponding pairs. When the robot later acquires the wake-up voice of some user, it extracts that user's voiceprint features from it, and if those voiceprint features are stored in its memory, it can further query the memory for the facial features that correspond to them.
The feature data set may be constructed before the technical solution of the embodiments is executed, or it may be updated in real time during or after execution; that is, each correspondence between acoustic features and visual features of a sound-emitting object, once determined, is recorded and stored in the feature data set, continuously expanding the amount of correspondence data and providing data support for subsequent sound source localization.
It should be noted that, as the amount of stored acoustic and visual feature data grows, and given the recognition errors of the processing device on voiceprint and image features, sounds produced by different objects may yield the same acoustic features, or the acoustic features of the same sound-emitting object may become associated with different visual features. In such cases, a single set of acoustic features may correspond to a plurality of visual features. When step S102 is performed, multiple visual features may therefore be found to match the extracted acoustic features; these may all be used in subsequent processing, or one or several of them may be selected for subsequent processing.
S103, determining, according to the visual features of the sound-emitting object that match the acoustic features of the sound-emitting object and the acquired scene image, the sound-emitting object that emits the sound signal, and determining the position of the sound-emitting object.
Specifically, once the visual features matching the acoustic features of the sound-emitting object are determined, the visual features of the object that emitted the sound signal are known. A scene image is then captured by the camera, and the object that conforms to these visual features is detected and identified in the captured image, which both identifies the sound-emitting object and makes it possible to locate its position.
For example, suppose the intelligent robot acquires a wake-up sound signal and, through steps S101 and S102, determines that Zhang San's facial features match the voiceprint features extracted from that signal. The robot then captures a scene image with its camera and searches it for Zhang San's facial features. If they are found, the person in the scene image matching those features is the person who emitted the wake-up sound; the robot has thus identified the waker in the captured image, and it can further determine Zhang San's position from the angle and direction at which the scene image was captured. A simplified sketch of this matching step follows.
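A minimal sketch of the detection step, assuming hypothetical `detect_faces` and `embed_face` helpers (the embodiments do not prescribe particular face detection or recognition models) and a simple mapping from horizontal image position to bearing via the camera's field of view:

```python
import numpy as np

def locate_speaker(scene_image, target_face_vec, detect_faces, embed_face,
                   horizontal_fov_deg: float = 60.0, sim_threshold: float = 0.8):
    """Find the face most similar to the matched visual features and estimate its bearing.

    `detect_faces(image)` is assumed to yield (bounding_box, face_crop) pairs and
    `embed_face(crop)` a feature vector; both stand in for whatever face detection
    and recognition models the device actually uses."""
    img_width = scene_image.shape[1]
    best = None
    for box, crop in detect_faces(scene_image):
        vec = embed_face(crop)
        sim = float(np.dot(vec, target_face_vec) /
                    (np.linalg.norm(vec) * np.linalg.norm(target_face_vec) + 1e-9))
        if sim > sim_threshold and (best is None or sim > best[0]):
            best = (sim, box)
    if best is None:
        return None  # no match: fall back, e.g. rotate toward the microphone-array estimate
    x_left, _, x_right, _ = best[1]
    face_center_x = (x_left + x_right) / 2.0
    # Map the horizontal offset from the image center to an angle via the camera FOV.
    return (face_center_x / img_width - 0.5) * horizontal_fov_deg
```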
This image-assisted localization scheme also mirrors how people locate sound sources in everyday life. For example, when person A is called by person B, if B is within A's field of view, A can easily find B's position from the consistency between B's call and B's mouth movements, and A forms an associative memory linking B's voiceprint with B's visual appearance. The next time B calls, A can tell from the voice alone that it is B calling, because A has memorized B's voiceprint and visual features; if B is not within A's field of view, A will usually look around, and upon seeing B, A has found the sound-emitting object and can walk toward B or turn to converse with B.
The sound source localization method provided by the embodiments of the present application therefore has a bionic character: it simulates the biological process of locating a sound source. Sound and images are combined in identifying and locating the sound-emitting object, and because the basis for localization is more diverse, the localization is more accurate.
As can be seen from the above description, the sound source localization method provided by the embodiments locates the sound-emitting object by means of both sound and images. During localization, the acoustic features extracted from the sound signal, together with empirical data on the correspondence between acoustic and visual features, are used to detect the matching sound-emitting object in the scene image, thereby identifying and locating it. This process simulates biological sound source localization, has a bionic character, and enriches the data basis for localization, and therefore achieves higher accuracy.
As a possible case, after acquiring a sound signal and extracting acoustic features of a sound-emitting object from the sound signal, if it is not possible to determine visual features of the sound-emitting object that match the acoustic features of the sound-emitting object, for example, if visual features of the sound-emitting object that correspond to the acoustic features of the sound-emitting object are not found in the feature data set, the sound source position is determined from the acquired sound signal.
Specifically, because the sound signals are acquired by a microphone array, the sound source position can be determined from them using a steered-response-power (controllable beam response) algorithm or a time-difference-of-arrival algorithm, for instance as sketched below.
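For illustration only, a common way to realize the time-difference approach for one microphone pair is GCC-PHAT cross-correlation; the function names and parameters below are assumptions rather than the embodiments' prescribed algorithm.

```python
import numpy as np

def gcc_phat_delay(sig: np.ndarray, ref: np.ndarray, fs: int) -> float:
    """Estimate the time difference of arrival between two microphone channels
    with GCC-PHAT cross-correlation (one common time-difference method)."""
    n = 2 * max(len(sig), len(ref))
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12            # PHAT weighting: keep only the phase
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (int(np.argmax(np.abs(cc))) - max_shift) / fs

def doa_from_delay(delay_s: float, mic_distance_m: float, c: float = 343.0) -> float:
    """Convert the delay into a far-field direction of arrival, in degrees."""
    val = np.clip(c * delay_s / mic_distance_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(val)))
```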
It will be understood that the key to the sound source localization method provided by the embodiments is determining the visual features that match the acoustic features of the sound-emitting object, so the data in the feature data set plays an important role in implementing the method. If the feature data set contains little data, it may not be possible to find matching visual features by querying it, and it is essentially impossible to store in advance the acoustic and visual features of every sound-emitting object that might appear. Therefore, the embodiments add to the localization scheme a process of acquiring visual features of sound-emitting objects in real time and updating the feature data set, as follows:
referring to fig. 2, in the implementation of the sound source localization method, step S211 of acquiring a scene image in real time, identifying visual features of a sound object from the acquired scene image, and storing the identified visual features of the sound object is synchronously performed.
As described above, when the technical solution is applied to an electronic device with audio and image acquisition and processing functions, the device's camera can be controlled to capture scene images in real time, visual features of sound-emitting objects can be identified from those images, and whenever such features are identified they can be stored.
Specifically, the visual features of the sound-emitting object are obtained through the camera: an object performing a sound-producing action is detected in the scene image captured by the camera, and that object's visual features are extracted.
For example, the intelligent robot's camera captures scene images in real time, identifies in them a person whose mouth is moving, that is, who is speaking, and then extracts that person's visual features, such as one or more of facial features, body features, and skin color features, as the visual features of the speaker.
Since the device's storage space is not unlimited, visual features of sound-emitting objects that are older than a set time relative to the current moment can be deleted. For example, the storage space holds the visual features acquired within the last 2 seconds, and once a stored visual feature is older than 2 seconds it is deleted. The device thus always holds the visual features of sound-emitting objects from the 2 seconds preceding the current moment; a sketch of such a sliding window follows.
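A minimal sketch of such a sliding-window store, assuming a 2-second window as in the example (class and method names are illustrative, not the embodiments' design):

```python
import time
from collections import deque
from typing import List, Optional

class RecentVisualFeatures:
    """Keeps only the visual features acquired within the last `window_s` seconds."""

    def __init__(self, window_s: float = 2.0):
        self.window_s = window_s
        self._items = deque()  # (timestamp, visual feature)

    def add(self, feature, timestamp: Optional[float] = None) -> None:
        self._items.append((time.time() if timestamp is None else timestamp, feature))
        self._evict()

    def snapshot(self) -> List:
        """Visual features acquired within the window before now (the 'target time period')."""
        self._evict()
        return [f for _, f in self._items]

    def _evict(self) -> None:
        cutoff = time.time() - self.window_s
        while self._items and self._items[0][0] < cutoff:
            self._items.popleft()
```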
Building on this real-time acquisition of visual features, after step S201 is executed to acquire the sound signal and extract the acoustic features of the sound-emitting object, step S212 is executed to acquire the visual features of sound-emitting objects obtained within a target time period, where the target time period is a period of set duration before the sound signal is acquired; in this embodiment, it is the 2 seconds before the sound signal is acquired.
That is, when a sound signal is acquired and acoustic features are extracted from it, the visual features of sound-emitting objects captured within the 2 seconds before the sound signal was acquired are retrieved.
Then, step S213 is executed to store, in correspondence, the visual features of the sound-emitting object acquired within the target time period and the acoustic features of the sound-emitting object into the feature data set.
In general, by the time the device has acquired a sound signal, the object that emitted it has already produced the sound; the image of that object therefore appears in the scene images captured before the sound signal was acquired, so its visual features can only be extracted from those earlier scene images.
Therefore, to obtain visual features that correspond to the acoustic features of the collected sound signal, in this embodiment, after the sound signal is collected and the acoustic features are extracted from it, the visual features obtained within the target time period before the signal was collected and the extracted acoustic features are stored in correspondence in the feature data set.
For example, the intelligent robot's camera captures scene images in real time and extracts and stores the visual features of sound-emitting objects from them. When the robot collects a wake-up sound, the visual features stored during the 2 seconds before the wake-up sound was collected and the acoustic features extracted from the wake-up sound are stored in correspondence in the feature data set.
With this processing, even if the feature data set initially contains little or no data, its data volume is continuously expanded as the device keeps operating and the above processing keeps being executed, so that the device effectively becomes "smarter the more it is used".
Steps S201, S202, and S203 in the embodiment shown in Fig. 2 correspond to steps S101, S102, and S103 in the method embodiment shown in Fig. 1, respectively; for details, reference is made to the description of the embodiment of Fig. 1, which is not repeated here.
It will be understood that several visual features of sound-emitting objects may be extracted from the same scene image at once, so one set of acoustic features may need to be stored against multiple visual features. To make this correspondence convenient, the embodiment of the present application stores the acoustic features of the sound-emitting object as a key and the corresponding visual features as values; one acoustic-feature key may thus correspond to up to a set number of visual-feature values.
For example, as shown in Fig. 3, suppose that while the intelligent robot operates according to the above scheme, several facial features are acquired that correspond to the voiceprint features of Zhang San's voice, for instance facial features representing Zhang San's face and facial features representing Li Si's face. The voiceprint features representing Zhang San's voice are then used as the key, and the facial features representing Zhang San's face and Li Si's face are stored as the corresponding values.
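In code-like form, the storage layout of Fig. 3 could be pictured as follows (all identifiers are placeholders used only for illustration):

```python
# Each acoustic-feature key maps to a bounded list of visual-feature values.
feature_data_set = {
    "voiceprint:zhang_san": [      # key: acoustic (voiceprint) features of one voice
        "face:zhang_san#1",        # values: visual (facial) features seen with that voice
        "face:zhang_san#2",
        "face:li_si#1",            # mis-associations can occur and are tolerated here
    ],
}
```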
Based on this arrangement, when the visual features of the sound-emitting object acquired within the target time period and the acoustic features of the sound-emitting object are stored into the feature data set, the processing may follow the procedure shown in A1-A4 below:
A1, detecting whether the acoustic features of the sound-emitting object are stored in the feature data set, and whether the number of visual features corresponding to those acoustic features has reached a set number.
Specifically, this embodiment sets an upper limit on the number of visual features that may correspond to one set of acoustic features; that is, at most a set number of visual features can be stored against the acoustic features of one sound-emitting object. The set number can be chosen flexibly according to the actual scenario.
After the acoustic features and visual features of the sound-emitting object are obtained, the feature data set is examined to determine whether the obtained acoustic features are already stored in it, and whether the number of visual features stored against those acoustic features has reached the set number.
If the acoustic features of the sound-emitting object are stored in the feature data set and the number of corresponding visual features has reached the set number, the stored visual features have reached their maximum count. In that case, in order to store the visual features acquired within the target time period, step A2 is executed: the visual features stored in the feature data set against those acoustic features are updated with the visual features acquired within the target time period.
Specifically, the earliest-stored visual features corresponding to those acoustic features, equal in number to the visual features acquired within the target time period, are deleted from the feature data set to free storage space. The visual features acquired within the target time period are then stored in the feature data set as visual features corresponding to those acoustic features.
For example, suppose the intelligent robot extracts voiceprint features from the collected sound signal and extracts 3 facial features from the scene images captured within the 2 seconds before the signal was collected; the voiceprint features and the 3 facial features need to be stored in correspondence in the feature data set. If the robot finds that the voiceprint features are already stored and that the number of facial features corresponding to them has reached the maximum, it deletes the 3 earliest-stored facial features corresponding to those voiceprint features, then stores the 3 newly acquired facial features as facial features corresponding to the voiceprint features. The number of facial features corresponding to the voiceprint features is then still the maximum, but they have been updated.
If the acoustic features of the sound-emitting object are stored in the feature data set but the number of corresponding visual features has not reached the set number, step A3 is executed: the visual features acquired within the target time period are stored in the feature data set as visual features corresponding to those acoustic features.
That is, if the acoustic features are already stored and the number of corresponding visual features has not reached the set number, the visual features acquired within the target time period can simply be added to the feature data set as visual features corresponding to those acoustic features.
If the acoustic features of the sound-emitting object are not stored in the feature data set, step A4 is executed: the visual features acquired within the target time period and the acoustic features of the sound-emitting object are stored in correspondence in the feature data set.
Specifically, if the acoustic features are not yet stored in the feature data set, the visual features acquired within the target time period are taken as the visual features corresponding to those acoustic features and are stored together with them in the feature data set. When several visual features were acquired within the target time period, the visual features and the acoustic features are stored in the manner shown in Fig. 3. A sketch of the whole A1-A4 update procedure follows.
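A minimal sketch of the A1-A4 update, using the simplified dictionary layout pictured above; the key form and the cap of 10 visual features per key are assumptions, and in practice locating the matching key would use an acoustic-similarity comparison such as the one sketched earlier.

```python
def store_features(feature_data_set: dict, acoustic_key, new_visual_feats: list,
                   max_visual_per_key: int = 10) -> None:
    """Store the visual features acquired in the target time period under their
    acoustic key, following steps A1-A4. `feature_data_set` maps an acoustic key
    to a list of visual features kept oldest-first; the cap of 10 is illustrative."""
    stored = feature_data_set.get(acoustic_key)
    if stored is None:
        # A4: the acoustic features are not stored yet -> create a new entry.
        feature_data_set[acoustic_key] = list(new_visual_feats)
    elif len(stored) >= max_visual_per_key:
        # A2: the entry is full -> delete the earliest-stored features (as many as
        # were newly acquired) and append the new ones, keeping the count at the cap.
        del stored[:len(new_visual_feats)]
        stored.extend(new_visual_feats)
    else:
        # A3: the entry exists but is not full -> simply append the new features.
        stored.extend(new_visual_feats)
```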
Based on the above feature data set construction principle, a plurality of visual features of the sound-producing object may correspond to one acoustic feature of the sound-producing object in the feature data set. When the feature data set is searched to determine the visual feature of the sound-producing object matching the acoustic feature extracted from the acquired sound signal, if a plurality of visual features of the sound-producing object are found to be stored for that acoustic feature, the visual feature that appears most frequently, or that was stored most recently, is selected from them as the visual feature of the sound-producing object matching the acoustic feature of the sound-producing object.
Specifically, when a plurality of different visual features of sound-producing objects correspond to the same acoustic feature of a sound-producing object, all of these visual features may be used simultaneously as the visual features of the sound-producing object matching that acoustic feature, or one or more of them may be selected as the matching visual features.
Alternatively, when there are a plurality of visual features of the same sound-producing object acquired at different times corresponding to the acoustic feature of the same sound-producing object, one of the visual features of the sound-producing object should be selected as the visual feature of the sound-producing object matching the acoustic feature of the sound-producing object.
As a preferred selection method, the visual feature of the sound-producing object that appears most frequently, or that was stored most recently, is selected from the visual features corresponding to the acoustic feature of the same sound-producing object as the visual feature of the sound-producing object matching that acoustic feature.
Specifically, if at least three visual features of the sound-producing object corresponding to the acoustic feature of the sound-producing object are stored in the feature data set, the visual feature that occurs the largest number of times is selected from the visual features corresponding to that acoustic feature.
That is, if a plurality of visual features of the sound-producing object correspond to the same acoustic feature of the sound-producing object in the feature data set, the visual feature that occurs most frequently among them is taken, following the majority rule, as the visual feature of the sound-producing object matching that acoustic feature.
For example, if there are a plurality of face features corresponding to the same voiceprint feature in the feature data set, when the intelligent robot determines a face feature matching the voiceprint feature from the feature data set, the face feature with the largest number of occurrences is selected from the plurality of face features as the face feature matching the voiceprint feature.
For example, a k-means clustering algorithm may be used to cluster the plurality of face features corresponding to the same voiceprint feature, and the face feature with the largest number of occurrences is then selected from the clustering result. Specifically, the face features corresponding to the same voiceprint feature are divided into 2 or more classes by the k-means algorithm according to the cosine distance or the Euclidean distance between the face features. Assume that the face features corresponding to the voiceprint feature "Zhang San's voice" shown in fig. 3 are divided into 2 classes, namely "Zhang San's face" and "Li Si's face"; the class with the largest number of samples is then selected from the resulting classes, which yields the face feature with the largest number of occurrences. For example, if of the two classes "Zhang San's face" and "Li Si's face" the class "Zhang San's face" has the larger number of samples, the face feature of "Zhang San's face" is taken as the face feature matching the voiceprint feature "Zhang San's voice".
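As a sketch of the clustering-based selection described above, the face features corresponding to one voiceprint could be grouped with k-means and a representative of the largest cluster returned. The snippet assumes the face features are fixed-length embedding vectors and uses scikit-learn; the embeddings are L2-normalized first so that the Euclidean distance used by k-means behaves similarly to the cosine distance mentioned above.

```python
import numpy as np
from sklearn.cluster import KMeans

def most_frequent_face_feature(face_features, n_clusters=2):
    """Cluster the face features stored for one voiceprint and return a
    representative of the largest cluster, i.e. the face that occurs most."""
    X = np.asarray(face_features, dtype=float)
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)  # unit length
    n_clusters = min(n_clusters, len(X))
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    labels, counts = np.unique(km.labels_, return_counts=True)
    largest = labels[np.argmax(counts)]
    # The cluster centroid serves as the representative feature; the stored
    # feature closest to the centroid could be returned instead.
    return km.cluster_centers_[largest]
```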
If fewer than three visual features of the sound-producing object corresponding to the acoustic feature of the sound-producing object are stored in the feature data set, for example 1 or 2 visual features, the most recently stored visual feature is selected from the visual features corresponding to that acoustic feature as the visual feature of the sound-producing object matching the acoustic feature of the sound-producing object.
Specifically, if only one visual feature of the sound-producing object corresponds to the acoustic feature of the sound-producing object, the visual feature of the sound-producing object may be directly used as the visual feature of the sound-producing object matched with the acoustic feature of the sound-producing object.
If 2 visual features of the sounding objects correspond to the same acoustic features of the sounding objects, the latest stored visual features of the sounding objects are selected from the 2 visual features of the sounding objects to serve as the visual features of the sounding objects matched with the acoustic features of the sounding objects.
In contrast, if the feature data set does not store the visual features of the sound generating object corresponding to the acoustic features of the sound generating object, as described in the foregoing embodiment, the visual features of the sound generating object matching the acoustic features of the sound generating object cannot be determined by querying the feature data set, and therefore the sound source position can be determined only by the collected sound signal.
According to the above processing, after determining the visual feature of the sound generating object matching the acoustic feature of the sound generating object extracted from the acquired sound signal, as shown in fig. 4, the determining the sound generating object which emits the sound signal and determining the position of the sound generating object according to the visual feature of the sound generating object matching the acoustic feature of the sound generating object and the collected scene image includes:
S403, detecting, according to the visual features of the sound-producing object matched with the acoustic features of the sound-producing object, the sound-producing object that emits the sound signal from the scene image collected by the camera.
Specifically, a sound-producing object whose visual features match the acoustic features of the sound-producing object is searched for in the scene image captured by the camera; when such an object is detected, it is taken as the sound-producing object that emits the sound signal.
After the visual features of the sound-producing object matching the acoustic features of the sound-producing object have been determined, a target sound-producing object is detected from the scene image currently collected by the camera. The target sound-producing object satisfies the following condition: the similarity between the visual features of the sound-producing object matched with the acoustic features and the visual features of the target sound-producing object is greater than a set similarity threshold. In other words, the target sound-producing object is a sound-producing object whose visual features are similar, beyond the set threshold, to the visual features matched with the acoustic features of the sound-producing object.
And if the target sound-emitting object is detected, determining the target sound-emitting object as the sound-emitting object emitting the sound signal.
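A minimal sketch of this detection step is given below, assuming a face detector has already produced (face feature, image position) pairs for the current scene image and that cosine similarity is used; the threshold value of 0.8 is purely illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def detect_target_object(matched_face_feature, scene_faces, threshold=0.8):
    """Return the (feature, position) pair of the best-matching face whose
    similarity to the matched face feature exceeds the threshold, or None."""
    best, best_sim = None, threshold
    for feature, position in scene_faces:
        sim = cosine_similarity(matched_face_feature, feature)
        if sim > best_sim:
            best, best_sim = (feature, position), sim
    return best
```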
For example, after the intelligent robot extracts a voiceprint feature from the acquired wake sound and determines a face feature matching the voiceprint feature, the face feature is detected from the scene image acquired by the camera of the intelligent robot, and when the face feature is detected, the user with the face feature is determined as the user who uttered the wake sound.
And if the target sound-emitting object is not detected in the scene image collected by the camera, controlling the camera to rotate to the sound source position determined according to the acquired sound signal, and detecting the target sound-emitting object from the scene image collected by the camera in the process of rotating the camera.
And in the process of rotating the camera, if the target sound-emitting object is detected from the acquired scene image, controlling the camera to stop rotating, and determining the detected target sound-emitting object as the sound-emitting object emitting the sound signal.
If the target sound-emitting object is never detected in the scene images collected while the camera rotates to face the sound source position, the sound-emitting object is not present in the scene captured by the camera; in this case, the sound source position determined from the collected sound signal is directly taken as the position of the sound-emitting object that emits the sound signal.
For example, after the intelligent robot acquires the wake sound, extracts the voiceprint feature from it and determines the face feature matching the voiceprint feature, the robot attempts to detect a user with that face feature from the scene image acquired by its camera; if such a user is detected, that user is determined as the user who uttered the wake sound. If not, the sound source position is located from the collected wake sound and the camera is controlled to rotate toward that position. While the camera rotates, users with the matching face feature are detected in real time from the scene images acquired by the camera; if such a user is detected, the camera is controlled to stop rotating and the detected user is determined as the user who uttered the wake sound. If no user with the matching face feature is detected from the scene images collected while the camera rotates to face the sound source position, the sound source position determined from the wake sound is taken as the position of the user who uttered the wake sound.
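The overall fallback flow of this example can be sketched as follows; the robot interface (capture_image, detect_faces, localize_by_sound, rotate_towards, is_facing, stop_rotation) is hypothetical, and detect_target_object is the similarity-based detection sketched earlier.

```python
def locate_speaker(robot, matched_face_feature):
    """Find the speaker visually; fall back to acoustic localization if needed."""
    # 1. Try the current field of view first.
    target = detect_target_object(matched_face_feature,
                                  robot.detect_faces(robot.capture_image()))
    if target is not None:
        return target

    # 2. Localize the sound source acoustically (e.g. microphone-array DOA)
    #    and rotate the camera toward it, detecting along the way.
    source_position = robot.localize_by_sound()
    robot.rotate_towards(source_position)
    while not robot.is_facing(source_position):
        target = detect_target_object(matched_face_feature,
                                      robot.detect_faces(robot.capture_image()))
        if target is not None:
            robot.stop_rotation()
            return target

    # 3. No matching face found: use the acoustic localization result directly.
    return source_position
```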
As a preferred implementation, during detection of the target sound-emitting object, when a sound-emitting object is detected whose visual features are similar, beyond the set similarity threshold, to the visual features matched with the acoustic features of the sound-emitting object, the position of that object in the actual scene is further estimated from its position in the scene image. It is then determined whether the deviation between this estimated position and the sound source position determined from the sound signal falls within a preset deviation range; if so, the object is determined as the target sound-emitting object, otherwise it may not be determined as the target sound-emitting object.
That is, it can be further defined that the above-described target sound-emitting object also satisfies the following characteristics: and the deviation of the position of the target sound production object from the position of the sound source determined according to the sound signal is within a preset deviation range.
In this way, the position of the candidate object is also taken into account when detecting the target sound-emitting object. The detected target sound-emitting object not only has visual features matching the acoustic features of the sound-emitting object, but is also close to the sound source localization result, so the detection is more reliable.
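A sketch of this deviation check, with an illustrative deviation range expressed as a Euclidean distance in metres, might look as follows.

```python
import numpy as np

def within_deviation(object_position, sound_source_position, max_deviation=0.5):
    """Accept a candidate only if its position estimated from the scene image
    deviates from the acoustic localization result by at most max_deviation."""
    diff = np.asarray(object_position, float) - np.asarray(sound_source_position, float)
    return float(np.linalg.norm(diff)) <= max_deviation
```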
S404, determining the position of the sound-producing object according to the detected position of the sound-producing object which emits the sound signal in the scene image collected by the camera.
Specifically, when a sound-emitting object that emits the sound signal is detected from a scene image, the position of the sound-emitting object in the actual scene is determined according to the position of the sound-emitting object in the scene image, the angle of view of a camera, and the like.
As shown in fig. 5, the camera angle of view θ and the screen size L are fixed by the characteristics of the camera and the apparatus itself, and every object within the camera's angle of view is projected onto the screen. If the center point of the face is at point Q and the center of the screen is at point P, then from the size of PQ in the picture and basic trigonometry the size of angle POQ is easily obtained, which gives the deflection angle of the object from the center line of the camera. Further, according to the orientation of the camera, the real direction of the object can be determined; by additionally combining the focal length of the camera with a coordinate system, the real position of the object can be calculated.
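Under a pinhole-camera assumption, the deflection angle can be computed as in the sketch below: the distance from the optical center O to the screen center P follows from OP = (L/2)/tan(θ/2), and the angle POQ is then arctan(PQ/OP). The function name and the example numbers are illustrative only.

```python
import math

def deflection_angle(pq_pixels, screen_width_pixels, fov_degrees):
    """Angle between the camera's optical axis and the face center Q.

    pq_pixels: horizontal offset of the face center Q from the image center P.
    screen_width_pixels: screen size L in pixels.
    fov_degrees: horizontal angle of view theta."""
    half_fov = math.radians(fov_degrees) / 2.0
    op = (screen_width_pixels / 2.0) / math.tan(half_fov)  # distance O-P in pixels
    return math.degrees(math.atan2(pq_pixels, op))         # angle POQ

# A face 320 px to the right of center in a 1280 px wide image, with a
# 60-degree angle of view, lies roughly 16 degrees right of the optical axis.
print(deflection_angle(320, 1280, 60))  # ~16.1
```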
Steps S401 and S402 in the embodiment shown in fig. 4 correspond to steps S101 and S102 in the method embodiment shown in fig. 1, respectively, and for specific contents, please refer to the contents of the method embodiment shown in fig. 1, which is not described herein again.
In correspondence with the above method embodiment, the present application also proposes a sound source localization apparatus, as shown in fig. 6, the apparatus comprising:
a signal acquisition unit 100 configured to acquire a sound signal and extract acoustic features of a sound-emitting object from the sound signal;
a feature determination unit 110 for determining a sound-emitting object visual feature matching the sound-emitting object acoustic feature, the sound-emitting object visual feature including a feature of the sound-emitting object extracted through an image of the sound-emitting object;
and the sound source positioning unit 120 is configured to determine a sound generating object which generates the sound signal according to the visual characteristics of the sound generating object matched with the acoustic characteristics of the sound generating object and the collected scene image, and determine the position of the sound generating object.
The sound source positioning device provided by the embodiment of the present application locates the sound-producing object by means of both sound and images. During sound source localization, using the acoustic features of the sound-producing object extracted from the sound signal and empirical data on the correspondence between acoustic features and visual features of sound-producing objects, the sound-producing object matching the acoustic features is detected from the scene image, so that the sound-producing object is identified and located. This process imitates the way a living being locates a sound source, has bionic characteristics, and enriches the data basis for sound source localization, and therefore achieves higher accuracy.
Optionally, the determining visual characteristics of the sound-producing object that match the acoustic characteristics of the sound-producing object includes:
determining visual features of the sound-producing object matched with the acoustic features of the sound-producing object from a pre-constructed feature data set;
and the feature data set stores acoustic features of the sound-producing object and visual features of the sound-producing object corresponding to the acoustic features of the sound-producing object.
Optionally, the signal acquiring unit 100 is further configured to acquire a scene image in real time, identify a visual feature of the sound generating object from the acquired scene image, and store the identified visual feature of the sound generating object;
after acquiring the sound signal and extracting the acoustic feature of the sound-emitting object from the sound signal, the signal acquiring unit 100 is further configured to:
acquiring visual characteristics of a sound generating object acquired in a target time period, wherein the target time period is a time period with set duration before the sound signal is acquired;
and correspondingly storing the visual characteristics of the sound-producing object acquired in the target time period and the acoustic characteristics of the sound-producing object into the characteristic data set.
Optionally, the correspondingly storing the visual characteristics of the sounding object acquired in the target time period and the acoustic characteristics of the sounding object into the characteristic data set includes:
detecting whether acoustic features of the sound-producing object are stored in the feature data set or not, and detecting whether the number of visual features of the sound-producing object corresponding to the acoustic features of the sound-producing object reaches a set number or not;
if the acoustic features of the sound-producing object are stored in the feature data set and the number of the visual features of the sound-producing object corresponding to the acoustic features of the sound-producing object reaches a set number, updating the visual features of the sound-producing object corresponding to the acoustic features of the sound-producing object stored in the feature data set by using the visual features of the sound-producing object acquired in the target time period;
if the acoustic features of the sound-producing object are stored in the feature data set and the number of the visual features of the sound-producing object corresponding to the acoustic features of the sound-producing object does not reach a set number, the visual features of the sound-producing object acquired in the target time period are used as the visual features of the sound-producing object corresponding to the acoustic features of the sound-producing object and are stored in the feature data set;
and if the acoustic features of the sound-producing object are not stored in the feature data set, correspondingly storing the visual features of the sound-producing object acquired in the target time period and the acoustic features of the sound-producing object into the feature data set.
Optionally, when a plurality of visual features of the sound-producing object corresponding to the acoustic features of the sound-producing object are stored in the feature data set, determining the visual features of the sound-producing object matching the acoustic features of the sound-producing object from a pre-constructed feature data set includes:
and selecting, from the visual features of the sound-producing object corresponding to the acoustic features of the sound-producing object stored in a pre-constructed feature data set, the visual feature that appears most frequently or that was stored most recently as the visual feature of the sound-producing object matching the acoustic features of the sound-producing object.
Optionally, the selecting, from the visual features of the sound-emitting object corresponding to the acoustic features of the sound-emitting object stored in the pre-constructed feature data set, the visual feature that appears most frequently or that was stored most recently includes:
if at least three visual characteristics of the sounding objects corresponding to the acoustic characteristics of the sounding objects are stored in a pre-constructed characteristic data set, selecting the visual characteristics of the sounding objects with the largest occurrence frequency from the visual characteristics of the sounding objects corresponding to the acoustic characteristics of the sounding objects;
and if less than three visual characteristics of the sound-producing object corresponding to the acoustic characteristics of the sound-producing object are stored in a pre-constructed characteristic data set, selecting the latest stored visual characteristics of the sound-producing object from the visual characteristics of the sound-producing object corresponding to the acoustic characteristics of the sound-producing object.
Optionally, the determining, according to the visual feature of the sound generating object matched with the acoustic feature of the sound generating object and the collected scene image, the sound generating object which emits the sound signal, and determining the position of the sound generating object include:
detecting a sound-producing object which emits the sound signal from a scene image collected by a camera according to the visual characteristics of the sound-producing object matched with the acoustic characteristics of the sound-producing object;
and determining the position of the sounding object according to the detected position of the sounding object which emits the sound signal in the scene image acquired by the camera.
Optionally, the detecting the sound generating object emitting the sound signal from the scene image collected by the camera according to the visual feature of the sound generating object matched with the acoustic feature of the sound generating object includes:
detecting a target sounding object from a scene image acquired by a camera; wherein the target sound production object satisfies the following characteristics: the similarity between the visual features of the sound-producing object matched with the acoustic features of the sound-producing object and the visual features of the target sound-producing object is greater than a set similarity threshold;
if the target sound-emitting object is detected, determining the target sound-emitting object as a sound-emitting object emitting the sound signal;
if the target sounding object is not detected, controlling the camera to rotate to a sound source position, wherein the sound source position is determined according to the sound signal;
detecting a target sounding object from a scene image collected in the process of rotating a camera towards a sound source direction;
and if the target sound-emitting object is detected, determining the target sound-emitting object as the sound-emitting object emitting the sound signal.
Optionally, the target sound object further satisfies the following characteristics: and the deviation of the position of the target sound production object and the position of the sound source determined according to the sound signal is within a preset deviation range.
Optionally, the sound source positioning unit 120 is further configured to:
determining a sound source location from the sound signal if a visual feature of the sound-generating object matching the acoustic feature of the sound-generating object cannot be determined.
Optionally, the visual feature of the sound-producing object includes a facial feature of the sound-producing object obtained from an image of the sound-producing object.
The detailed working contents of each unit of the sound source positioning device are referred to the corresponding contents in the above method embodiments, and are not repeated here.
Another embodiment of the present application further provides a sound source localization apparatus, as shown in fig. 7, including:
a memory 200 and a processor 210;
wherein, the memory 200 is connected to the processor 210 for storing programs;
the processor 210 is configured to implement the sound source localization method disclosed in any of the above embodiments by running the program stored in the memory 200.
Specifically, the sound source localization apparatus may further include: a bus, a communication interface 220, an input device 230, and an output device 240.
The processor 210, the memory 200, the communication interface 220, the input device 230, and the output device 240 are connected to each other through a bus. Wherein:
a bus may include a path that transfers information between components of a computer system.
The processor 210 may be a general-purpose processor, such as a general-purpose central processing unit (CPU) or a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs according to the present invention. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components.
The processor 210 may include a main processor and may also include a baseband chip, modem, and the like.
The memory 200 stores the program for executing the technical solution of the present invention, and may also store an operating system and other key services. In particular, the program may include program code, and the program code includes computer operating instructions. More specifically, the memory 200 may include a read-only memory (ROM), other types of static storage devices that can store static information and instructions, a random access memory (RAM), other types of dynamic storage devices that can store information and instructions, disk storage, flash memory, and so on.
The input device 230 may include a means for receiving data and information input by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor, among others.
Output device 240 may include equipment that allows output of information to a user, such as a display screen, a printer, speakers, and the like.
Communication interface 220 may include any device that uses any transceiver or the like to communicate with other devices or communication networks, such as an ethernet network, a Radio Access Network (RAN), a Wireless Local Area Network (WLAN), etc.
The processor 210 executes the programs stored in the memory 200 and invokes other devices, which can be used to implement the steps of the sound source localization method provided by the embodiments of the present application.
Another embodiment of the present application further provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is executed by a processor, the computer program implements the steps of the sound source localization method provided in any of the above embodiments.
Specifically, the specific working contents of each part of the sound source positioning device and the specific processing contents of the computer program on the storage medium when being executed by the processor can refer to the contents of each embodiment of the sound source positioning method, and are not described herein again.
While, for purposes of simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present application is not limited by the order of acts or acts described, as some steps may occur in other orders or concurrently with other steps in accordance with the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The steps in the method of each embodiment of the present application may be sequentially adjusted, combined, and deleted according to actual needs, and technical features described in each embodiment may be replaced or combined.
The modules and sub-modules in the device and the terminal in the embodiments of the application can be combined, divided and deleted according to actual needs.
In the several embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of a module or a sub-module is only one logical division, and there may be other divisions when the terminal is actually implemented, for example, a plurality of sub-modules or modules may be combined or integrated into another module, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules or sub-modules described as separate parts may or may not be physically separate, and parts that are modules or sub-modules may or may not be physical modules or sub-modules, may be located in one place, or may be distributed over a plurality of network modules or sub-modules. Some or all of the modules or sub-modules can be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, each functional module or sub-module in the embodiments of the present application may be integrated into one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated into one module. The integrated modules or sub-modules may be implemented in the form of hardware, or may be implemented in the form of software functional modules or sub-modules.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may reside in random access memory (RAM), flash memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.