Multi-user motion gesture control method and device, smart speaker, and medium
1. A multi-user motion gesture control method, applied to a smart speaker comprising a loudspeaker array for emitting ultrasonic signals and a microphone array for collecting reflected signals, the method comprising the following steps:
acquiring a reflected signal, collected by the microphone array, that results from the ultrasonic signal emitted by the loudspeaker array being reflected by one or more user motion gestures;
determining, according to the ultrasonic signal and the preprocessed reflected signal, the position and reflection intensity of each reflection source by using a reflected signal positioning model and a sparse recovery algorithm, so as to obtain a positioning result corresponding to the one or more user motion gestures; wherein a sparse recovery algorithm based on a wideband dictionary is applicable to microphone arrays of any shape and spacing, and/or a sparse recovery algorithm based on a velocity-aware dictionary is applicable to user motion gestures at any speed;
and extracting gesture tracks corresponding to one or more users according to the positioning result so as to respond to preset instructions corresponding to different gesture tracks.
2. The method of claim 1, wherein the reflection signal is preprocessed to eliminate self-interference and static reflections and to obtain a reflection signal that contains a limited number of motion gestures.
3. The method of claim 1, wherein the wideband dictionary-based sparse recovery algorithm is adapted for any shape and spacing microphone array, comprising:
configuring the ultrasonic signal, by means of orthogonal frequency division multiplexing, as a wideband signal with K subcarriers, so as to construct a wideband dictionary;
since the sparsity of the reflected signal is the same across the dictionaries corresponding to different subcarriers, when the transmitted ultrasonic signal is a wideband signal with K frequency components, performing sparse recovery positioning on each frequency component by using the wideband dictionary, which comprises K sub-dictionaries corresponding respectively to the K frequency components;
superimposing the positioning results of the frequency components and taking the common intersection of the aliased positioning results of the K frequency components as the positioning result, thereby resolving, on a microphone array of any shape, the spatial aliasing of the positioning result caused by an insufficient spatial sampling rate, so that the method is applicable to microphone arrays of any shape and spacing.
4. The method of claim 3, wherein the sparse recovery algorithm based on a velocity-aware dictionary is adapted to user motion gestures at any velocity, and comprises:
constructing, on the basis of the wideband dictionary, velocity-aware dictionaries matched to different speeds according to the frequency offset of each subcarrier caused by the Doppler effect at those speeds;
performing sparse recovery positioning with the dictionary corresponding to each speed, taking the speed of the dictionary whose positioning result has the maximum reflection intensity as the estimate of the motion speed, and taking that positioning result as the final positioning result, thereby resolving the positioning error caused by the Doppler effect, so that the method is applicable to user motion gestures at any speed.
5. The method of claim 1, wherein the reflected signal localization model comprises:
$X = A \cdot S + \mathcal{N}$, with $A = [a(d_1, \theta_1), \ldots, a(d_D, \theta_D)]$;
wherein $X$ is the reflected signal vector collected by the microphone array; $S$ is the ultrasonic signal vector emitted by the loudspeaker array; $\mathcal{N}$ is a noise signal; $A$ is a steering matrix representing the emission conditions; $a(d_D, \theta_D)$ is the steering vector of the D-th reflection source position; and $d_D$ and $\theta_D$ denote, respectively, the distance and angle of the D-th reflection source position relative to the smart speaker.
6. The method of claim 5, wherein determining the reflection source location and the reflection intensity using a reflection signal localization model and a sparse recovery algorithm comprises:
expanding the steering matrix $A$ into an overcomplete matrix $A'$ of dimension N:
$A' = [a(d_1, \theta_1), a(d_2, \theta_2), \ldots, a(d_N, \theta_N)]$;
wherein $N \gg D$, and $a(d_N, \theta_N)$ denotes a possible reflection source position at distance $d_N$ and angle $\theta_N$ relative to the smart speaker; accordingly, the ultrasonic signal vector $S$ is expanded into an N-dimensional sparse vector $S'$:
$S' = [0, 0, \ldots, s_1, 0, \ldots, s_2, \ldots, s_i, \ldots, s_D, \ldots, 0]^T$;
wherein, if a reflected signal is actually present at $a(d_N, \theta_N)$, the corresponding coefficient in $S'$ is $s_i$, and otherwise it is 0;
whereby the reflected signal positioning model becomes:
$X = A' \cdot S' + \mathcal{N}$ (with $\mathcal{N}$ the noise signal);
since the ultrasonic signal $s_0(t)$ is known in advance, $s_0(t)$ is further merged into $A'$, giving:
$X = Dic \cdot C + \mathcal{N}$;
where $Dic$ is a pre-computed overcomplete matrix called a dictionary, each element $vec_i$ of which stores the delayed signal reflected from each position $(d_N, \theta_N)$: $Dic = [vec(d_1, \theta_1, t), \ldots, vec(d_N, \theta_N, t)]$; $C$ is a sparse vector representing the intensity of the reflected signal from the corresponding position, $C = [0, 0, \ldots, c_1, 0, \ldots, c_2, \ldots, c_i, \ldots, c_D, \ldots, 0]^T$;
when the number D of valid reflections is much smaller than the dimension N of the dictionary $Dic$, a sparsest vector $C$ can be found that represents how the reflected signals making up the reflected signal vector $X$ are distributed over the dictionary $Dic$:
$\min \|C\|_0 \quad \text{s.t.} \quad \|X - Dic \cdot C\|_2 \le \varepsilon$;
wherein $\|C\|_0$ is the 0-norm of the vector $C$, i.e. the number of non-zero elements in $C$; s.t. indicates that the constraint that follows must be satisfied; $\|X - Dic \cdot C\|_2$ is the 2-norm of the difference between the reflected signal vector $X$ and $Dic \cdot C$; and $\varepsilon$ is a very small number close to 0;
after the vector $C$ has been solved, the position of each reflection source is obtained from the index of the corresponding non-zero element $c_i$ in $C$, and the reflection intensity is obtained from the value of that non-zero element $c_i$.
7. The method of claim 6, wherein the wideband dictionary-based sparse recovery algorithm comprises:
configuring the ultrasonic signal, by means of orthogonal frequency division multiplexing, as a wideband signal with K subcarriers, each subcarrier $f_k$ having its own dictionary $Dic_k$, and constructing a wideband dictionary $wDic$ from the $Dic_k$;
since the subcarriers are transmitted and reflected together, the sparsity of the reflected signal is the same across the dictionaries of different subcarriers:
$C_k = C_l$ for $k \ne l$;
solving the multi-dictionary joint optimization problem is then equivalent to:
$\min \|C_k\|_0$ for $k = 1, 2, \ldots, K$;
$\text{s.t.} \quad \|X_k - Dic_k \cdot C_k\|_2 \le \varepsilon$ and $C_k = C_l$ for $k \ne l$;
using the key constraint $C_k = C_l$ for $k \ne l$, the multi-dictionary joint optimization is converted into a single-dictionary optimization;
wherein the $X_k$ are stacked vertically: $X = [X_1, X_2, \ldots, X_K]^T$;
the wideband dictionary $wDic$ is likewise formed by stacking the $Dic_k$ vertically: $wDic = [Dic_1, Dic_2, \ldots, Dic_K]^T$;
the corresponding sparse vector $wC$ is then: $wC = C_1 = C_2 = \cdots = C_K$;
thus:
$\min \|wC\|_0 \quad \text{s.t.} \quad \|X - wDic \cdot wC\|_2 \le \varepsilon$;
solving for the sparse vector $wC$ on the basis of the wideband dictionary $wDic$ is approximately equivalent to performing sparse recovery positioning on each subcarrier separately and taking the common intersection of the per-subcarrier positioning results as the positioning result;
after the vector $wC$ has been solved, the position of each reflection source is obtained from the index of the corresponding non-zero element in $wC$, and the reflection intensity is obtained from the value of that non-zero element.
8. The method of claim 7, wherein the sparse recovery algorithm based on a velocity-aware dictionary comprises:
for each subcarrier $f_k$, $Sd(k, v_i)$ denotes the subcarrier signal at speed $v_i$, and the dictionary of subcarrier $f_k$ matched to speed $v_i$ is constructed from $Sd(k, v_i)$;
the dictionaries corresponding to the different speeds are combined to construct the velocity-aware dictionary $vDic_k$;
the velocity-aware dictionary is equivalent to extending the two-dimensional dictionary $(d, \theta)$ into a three-dimensional dictionary $(d, \theta, v)$; the corresponding sparse vector is $vC_k = [0, 0, \ldots, c_1, 0, \ldots, c_i, \ldots, 0]^T$, and its dimension changes from N to N × M;
in this case, for each subcarrier:
$\min \|vC_k\|_0 \quad \text{s.t.} \quad \|X_k - vDic_k \cdot vC_k\|_2 \le \varepsilon$;
solving for the sparse vector $vC_k$ on the basis of the velocity-aware dictionary $vDic_k$ is equivalent to performing sparse recovery positioning with the dictionary corresponding to each speed; the speed of the dictionary whose positioning result has the maximum reflection intensity is taken as the estimate of the user's gesture motion speed, and that positioning result is taken as the final positioning result;
after $vC_k$ has been solved, the position and speed of each reflection source are obtained from the index of the corresponding non-zero element, and the reflection intensity is obtained from the value of that non-zero element.
9. A multi-user motion gesture control device, applied to a smart speaker comprising a loudspeaker array for emitting ultrasonic signals and a microphone array for collecting reflected signals, the device comprising:
an acquisition module, configured to acquire a reflected signal, collected by the microphone array, that results from the ultrasonic signal emitted by the loudspeaker array being reflected by one or more user motion gestures;
a processing module, configured to determine, according to the ultrasonic signal and the preprocessed reflected signal, the position and reflection intensity of each reflection source by using a reflected signal positioning model and a sparse recovery algorithm, so as to obtain a positioning result corresponding to the one or more user motion gestures, wherein a sparse recovery algorithm based on a wideband dictionary is applicable to microphone arrays of any shape and spacing, and/or a sparse recovery algorithm based on a velocity-aware dictionary is applicable to user motion gestures at any speed; and to extract the gesture trajectories corresponding to the one or more users according to the positioning result, so as to respond to the preset instructions corresponding to different gesture trajectories.
10. A smart speaker, comprising:
a microprocessor storing computer instructions which when executed implement the method of any one of claims 1 to 8;
a speaker array that can emit ultrasonic signals;
a microphone array that can collect the reflected signal.
11. A computer-readable storage medium having stored thereon computer instructions which, when executed, perform the method of any one of claims 1 to 8.
Background
Interaction between a smart speaker and its users mostly relies on voice control, an interaction mode well suited to precise control or intelligent voice chat applications. In many scenarios, however, users may wish to control the smart speaker without speaking: for example, when there is a language barrier, when silence is required, or when the user needs to convey common, concise control instructions quickly.
Some research has proposed controlling a mobile phone with motion gestures through sonar-like ultrasonic positioning technology, but there is currently no mature solution for controlling a smart speaker with the motion gestures of multiple users.
Disclosure of Invention
In view of the above-mentioned shortcomings in the prior art, it is an object of the present application to provide a multi-user motion gesture control method, apparatus, smart speaker, and medium to solve at least one problem in the prior art.
To achieve the above and other related objects, the present application provides a multi-user motion gesture control method applied to a smart speaker, the smart speaker including a loudspeaker array for emitting ultrasonic signals and a microphone array for collecting reflected signals. The method comprises: acquiring a reflected signal, collected by the microphone array, that results from the ultrasonic signal emitted by the loudspeaker array being reflected by one or more user motion gestures; determining, according to the ultrasonic signal and the preprocessed reflected signal, the position and reflection intensity of each reflection source by using a reflected signal positioning model and a sparse recovery algorithm, so as to obtain a positioning result corresponding to the one or more user motion gestures, wherein a sparse recovery algorithm based on a wideband dictionary is applicable to microphone arrays of any shape and spacing, and/or a sparse recovery algorithm based on a velocity-aware dictionary is applicable to user motion gestures at any speed; and extracting the gesture trajectories corresponding to the one or more users according to the positioning result, so as to respond to the preset instructions corresponding to different gesture trajectories.
In an embodiment of the present application, the reflection signal is preprocessed to eliminate self-interference and static reflection, and obtain a reflection signal including a limited number of motion gestures.
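The text does not say how the preprocessing is carried out. As a minimal sketch of one common approach (an assumption, not the method of the application): if the received signal is cut into frames aligned with the transmit period, the direct loudspeaker-to-microphone path (self-interference) and reflections from static objects are identical in every frame, so differencing consecutive frames cancels them and leaves only the contribution of moving reflectors. The function name `remove_static` and all numeric values are hypothetical.

```python
import numpy as np

def remove_static(frames: np.ndarray) -> np.ndarray:
    """Difference consecutive frames to cancel components that do not change.

    frames: array of shape (num_frames, num_samples), one row per transmit period.
    Returns an array of shape (num_frames - 1, num_samples).
    """
    return frames[1:] - frames[:-1]

rng = np.random.default_rng(0)
# Hypothetical static part: direct path plus echoes from furniture, same in every frame.
static = rng.standard_normal(64)
frames = np.tile(static, (4, 1))
# A moving hand perturbs one delay bin in frame 2 only.
frames[2, 10] += 0.5

diff = remove_static(frames)
```

Here `diff[0]` is all zeros because nothing moved between frames 0 and 1, while `diff[1]` is non-zero only at the delay bin touched by the moving reflector.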
In an embodiment of the present application, the wideband-dictionary-based sparse recovery algorithm, which is applicable to microphone arrays of any shape and spacing, includes: configuring the ultrasonic signal, by means of orthogonal frequency division multiplexing, as a wideband signal with K subcarriers, so as to construct a wideband dictionary; since the sparsity of the reflected signal is the same across the dictionaries corresponding to different subcarriers, when the transmitted ultrasonic signal is a wideband signal with K frequency components, performing sparse recovery positioning on each frequency component by using the wideband dictionary, which comprises K sub-dictionaries corresponding respectively to the K frequency components; and superimposing the positioning results of the frequency components and taking the common intersection of the aliased positioning results of the K frequency components as the positioning result, thereby resolving, on a microphone array of any shape, the spatial aliasing of the positioning result caused by an insufficient spatial sampling rate, so that the method is applicable to microphone arrays of any shape and spacing.
In an embodiment of the present application, the sparse recovery algorithm based on a velocity-aware dictionary, which is applicable to user motion gestures at any speed, includes: constructing, on the basis of the wideband dictionary, velocity-aware dictionaries matched to different speeds according to the frequency offset of each subcarrier caused by the Doppler effect at those speeds; and performing sparse recovery positioning with the dictionary corresponding to each speed, taking the speed of the dictionary whose positioning result has the maximum reflection intensity as the estimate of the motion speed, and taking that positioning result as the final positioning result, thereby resolving the positioning error caused by the Doppler effect, so that the method is applicable to user motion gestures at any speed.
In an embodiment of the present application, the reflected signal positioning model includes: $X = A \cdot S + \mathcal{N}$, with $A = [a(d_1, \theta_1), \ldots, a(d_D, \theta_D)]$; wherein $X$ is the reflected signal collected by the microphone array; $S$ is the ultrasonic signal emitted by the loudspeaker array; $\mathcal{N}$ is a noise signal; $A$ is a steering matrix representing the emission conditions; $a(d_D, \theta_D)$ is the steering vector of the D-th reflection source position; and $d_D$ and $\theta_D$ denote, respectively, the distance and angle of the D-th reflection source position relative to the smart speaker.
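The positioning model can be illustrated with a small numerical sketch. It assumes a single narrowband snapshot at one frequency, a two-dimensional geometry, a speed of sound of 343 m/s, and steering vectors that contain only the propagation phase from each reflection source to each microphone; the microphone layout, source positions and intensities are hypothetical and serve only to show how a steering matrix and a received vector of the stated form are assembled.

```python
import numpy as np

C_SOUND = 343.0  # speed of sound in air, m/s (assumed)

def steering_vector(d, theta, mic_xy, freq):
    """Phase of a wave travelling from a source at range d and angle theta to each microphone."""
    src = d * np.array([np.sin(theta), np.cos(theta)])   # source position in the array plane
    r = np.linalg.norm(mic_xy - src, axis=1)             # source-to-microphone distances
    return np.exp(-2j * np.pi * freq * r / C_SOUND)

# Hypothetical 3-microphone linear array with 5 cm spacing.
mic_xy = np.array([[-0.05, 0.0], [0.0, 0.0], [0.05, 0.0]])
# Two hypothetical reflection sources (d, theta).
sources = [(0.5, np.deg2rad(20.0)), (0.8, np.deg2rad(-35.0))]

A = np.column_stack([steering_vector(d, th, mic_xy, 20_000.0) for d, th in sources])
S = np.array([1.0, 0.6])                                  # reflected intensities
rng = np.random.default_rng(0)
noise = 0.01 * (rng.standard_normal(3) + 1j * rng.standard_normal(3))
X = A @ S + noise                                         # received snapshot, one entry per microphone
```

Because the steering vector is computed from actual source-to-microphone distances, the same function works for any microphone layout passed in `mic_xy`.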
In an embodiment of the present application, determining the position and intensity of the reflection source by using the reflected signal positioning model and the sparse recovery algorithm includes: expanding the steering matrix $A$ into an overcomplete matrix $A'$ of dimension N: $A' = [a(d_1, \theta_1), a(d_2, \theta_2), \ldots, a(d_N, \theta_N)]$; wherein $N \gg D$, and $a(d_N, \theta_N)$ denotes a possible reflection source position at distance $d_N$ and angle $\theta_N$ relative to the smart speaker; accordingly, the ultrasonic signal vector $S$ is expanded into an N-dimensional sparse vector $S'$: $S' = [0, 0, \ldots, s_1, 0, \ldots, s_2, \ldots, s_i, \ldots, s_D, \ldots, 0]^T$; wherein, if a reflected signal is actually present at $a(d_N, \theta_N)$, the corresponding coefficient in $S'$ is $s_i$, and otherwise it is 0; whereby the reflected signal positioning model becomes: $X = A' \cdot S' + \mathcal{N}$ (with $\mathcal{N}$ the noise signal); since the ultrasonic signal $s_0(t)$ is known in advance, $s_0(t)$ is further merged into $A'$, giving: $X = Dic \cdot C + \mathcal{N}$; where $Dic$ is a pre-computed overcomplete matrix called a dictionary, each element $vec_i$ of which stores the delayed signal reflected from each position $(d_N, \theta_N)$: $Dic = [vec(d_1, \theta_1, t), \ldots, vec(d_N, \theta_N, t)]$; $C$ is a sparse vector representing the intensity of the reflected signal from the corresponding position, $C = [0, 0, \ldots, c_1, 0, \ldots, c_2, \ldots, c_i, \ldots, c_D, \ldots, 0]^T$; when the number D of valid reflections is much smaller than the dimension N of the dictionary $Dic$, a sparsest vector $C$ can be found that represents how the reflected signals making up the reflected signal vector $X$ are distributed over the dictionary $Dic$: $\min \|C\|_0 \ \text{s.t.} \ \|X - Dic \cdot C\|_2 \le \varepsilon$; wherein $\|C\|_0$ is the 0-norm of the vector $C$, i.e. the number of non-zero elements in $C$; s.t. indicates that the constraint that follows must be satisfied; $\|X - Dic \cdot C\|_2$ is the 2-norm of the difference between the reflected signal vector $X$ and $Dic \cdot C$; and $\varepsilon$ is a very small number close to 0; after the vector $C$ has been solved, the position of each reflection source is obtained from the index of the corresponding non-zero element $c_i$ in $C$, and the reflection intensity is obtained from the value of that non-zero element $c_i$.
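Minimizing the 0-norm exactly is combinatorial in general, so greedy solvers are commonly used; the drawings mention a Matching Pursuit (MP) solver. The following is a minimal MP sketch, not the solver of the application. It assumes unit-norm dictionary columns. For the toy dictionary, each atom is a circularly delayed copy of a known probe signal; a Zadoff-Chu sequence is used as the probe purely so that the atoms are mutually orthogonal and the result is easy to check. The probe, the dictionary size and the reflector positions are all hypothetical; a real overcomplete dictionary has correlated atoms, for which MP only approximates the sparsest solution.

```python
import numpy as np

def matching_pursuit(Dic, X, eps=1e-6, max_iter=50):
    """Greedy approximation of  min ||C||_0  s.t.  ||X - Dic @ C||_2 <= eps.

    Assumes every column (atom) of Dic has unit 2-norm.
    """
    C = np.zeros(Dic.shape[1], dtype=complex)
    residual = X.astype(complex)              # astype returns a copy
    for _ in range(max_iter):
        if np.linalg.norm(residual) <= eps:
            break
        corr = Dic.conj().T @ residual        # correlation of each atom with the residual
        i = int(np.argmax(np.abs(corr)))      # best-matching atom
        C[i] += corr[i]
        residual = residual - corr[i] * Dic[:, i]
    return C

# Toy delay dictionary: atom i is the probe signal circularly delayed by i samples.
N = 31
n = np.arange(N)
s0 = np.exp(-1j * np.pi * n * (n + 1) / N)    # Zadoff-Chu probe (assumption, root 1)
Dic = np.column_stack([np.roll(s0, i) for i in range(N)]) / np.sqrt(N)

# Two reflectors: intensity 1.0 at delay index 5 and intensity 0.6 at delay index 12.
X = 1.0 * Dic[:, 5] + 0.6 * Dic[:, 12]
C = matching_pursuit(Dic, X)
support = np.flatnonzero(np.abs(C) > 1e-3)    # indices of the recovered reflectors
```

The indices in `support` play the role of the reflection source positions, and the corresponding entries of `C` play the role of the reflection intensities.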
In an embodiment of the present application, the wideband-dictionary-based sparse recovery algorithm includes: configuring the ultrasonic signal, by means of orthogonal frequency division multiplexing, as a wideband signal with K subcarriers, each subcarrier $f_k$ having its own dictionary $Dic_k$, and constructing a wideband dictionary $wDic$ from the $Dic_k$; since the subcarriers are transmitted and reflected together, the sparsity of the reflected signal is the same across the dictionaries of different subcarriers: $C_k = C_l$ for $k \ne l$; solving the multi-dictionary joint optimization problem is then equivalent to: $\min \|C_k\|_0$ for $k = 1, 2, \ldots, K$; $\text{s.t.} \ \|X_k - Dic_k \cdot C_k\|_2 \le \varepsilon$ and $C_k = C_l$ for $k \ne l$; using the key constraint $C_k = C_l$ for $k \ne l$, the multi-dictionary joint optimization is converted into a single-dictionary optimization; wherein the $X_k$ are stacked vertically: $X = [X_1, X_2, \ldots, X_K]^T$; the wideband dictionary $wDic$ is likewise formed by stacking the $Dic_k$ vertically: $wDic = [Dic_1, Dic_2, \ldots, Dic_K]^T$; the corresponding sparse vector $wC$ is then: $wC = C_1 = C_2 = \cdots = C_K$; thus: $\min \|wC\|_0 \ \text{s.t.} \ \|X - wDic \cdot wC\|_2 \le \varepsilon$; solving for the sparse vector $wC$ on the basis of the wideband dictionary $wDic$ is approximately equivalent to performing sparse recovery positioning on each subcarrier separately and taking the common intersection of the per-subcarrier positioning results as the positioning result; after the vector $wC$ has been solved, the position of each reflection source is obtained from the index of the corresponding non-zero element in $wC$, and the reflection intensity is obtained from the value of that non-zero element.
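The effect of stacking the sub-dictionaries can be seen in a small sketch. It simplifies the model to far-field, angle-only positioning with a single noiseless reflector, so that one correlation step stands in for the full sparse recovery. The two-microphone array with 5 cm spacing, the four subcarrier frequencies and the 1-degree angle grid are hypothetical values. With one subcarrier, several candidate angles match the measurement almost equally well (spatial aliasing); with the stacked dictionary, only the angle that matches on every subcarrier keeps the full correlation.

```python
import numpy as np

C_SOUND = 343.0                                   # speed of sound, m/s (assumed)
mic_x = np.array([0.0, 0.05])                     # two microphones, 5 cm apart
angles = np.deg2rad(np.arange(-90, 91))           # candidate angles, 1-degree grid (N = 181)
freqs = np.array([18_000.0, 19_000.0, 20_000.0, 21_000.0])   # K = 4 subcarriers

def sub_dictionary(freq):
    """Dic_k: one far-field steering vector per candidate angle, shape (mics, N)."""
    delays = np.outer(mic_x, np.sin(angles)) / C_SOUND
    return np.exp(-2j * np.pi * freq * delays)

sub_dics = [sub_dictionary(f) for f in freqs]
wDic = np.vstack(sub_dics)                        # sub-dictionaries stacked vertically

true_idx = 100                                    # reflector at +10 degrees
X = wDic[:, true_idx]                             # same sparse vector on every subcarrier

# Single subcarrier (20 kHz): several angles are nearly indistinguishable.
X_20k = sub_dics[2][:, true_idx]
single = np.abs(sub_dics[2].conj().T @ X_20k)
ambiguous = np.flatnonzero(single > 0.995 * single.max())

# Stacked wideband dictionary: the aliases differ per subcarrier and no longer add up.
joint = np.abs(wDic.conj().T @ X)
est_idx = int(np.argmax(joint))
```

In this sketch `ambiguous` contains more than one angle index, whereas `est_idx` recovers the true index, which mirrors the statement that the common intersection of the aliased per-subcarrier results is taken as the positioning result.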
In an embodiment of the present application, the sparse recovery algorithm based on a velocity-aware dictionary includes: for each subcarrier $f_k$, $Sd(k, v_i)$ denotes the subcarrier signal at speed $v_i$, and the dictionary of subcarrier $f_k$ matched to speed $v_i$ is constructed from $Sd(k, v_i)$; the dictionaries corresponding to the different speeds are combined to construct the velocity-aware dictionary $vDic_k$; the velocity-aware dictionary is equivalent to extending the two-dimensional dictionary $(d, \theta)$ into a three-dimensional dictionary $(d, \theta, v)$; the corresponding sparse vector is $vC_k = [0, 0, \ldots, c_1, 0, \ldots, c_i, \ldots, 0]^T$, and its dimension changes from N to N × M; in this case, for each subcarrier: $\min \|vC_k\|_0 \ \text{s.t.} \ \|X_k - vDic_k \cdot vC_k\|_2 \le \varepsilon$; solving for the sparse vector $vC_k$ on the basis of the velocity-aware dictionary $vDic_k$ is equivalent to performing sparse recovery positioning with the dictionary corresponding to each speed; the speed of the dictionary whose positioning result has the maximum reflection intensity is taken as the estimate of the user's gesture motion speed, and that positioning result is taken as the final positioning result; after $vC_k$ has been solved, the position and speed of each reflection source are obtained from the index of the corresponding non-zero element, and the reflection intensity is obtained from the value of that non-zero element.
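The speed dimension of the dictionary can be sketched for a single subcarrier. The sketch omits the position dimension and keeps one atom per candidate speed; a full velocity-aware dictionary would hold N position atoms for each of the M speeds. It assumes the standard two-way Doppler approximation for a reflector with radial speed v much smaller than the speed of sound, f' ≈ f (1 + 2v/c). The sample rate, window length, candidate speeds and reflection intensity are hypothetical, and the function name `sd` merely echoes the symbol Sd(k, v_i).

```python
import numpy as np

C_SOUND = 343.0                            # speed of sound, m/s (assumed)
FS = 48_000.0                              # sample rate, Hz (assumed)
t = np.arange(4800) / FS                   # 0.1 s observation window
f_k = 20_000.0                             # one subcarrier
speeds = np.linspace(-1.0, 1.0, 9)         # M = 9 candidate radial speeds, m/s

def sd(f, v):
    """Unit-norm subcarrier f as received after reflection from a target moving at speed v."""
    return np.exp(2j * np.pi * f * (1.0 + 2.0 * v / C_SOUND) * t) / np.sqrt(t.size)

# One atom per candidate speed.
v_dic = np.column_stack([sd(f_k, v) for v in speeds])

# Hypothetical measurement: reflection of intensity 0.8 from a hand moving at 0.5 m/s.
X_k = 0.8 * sd(f_k, 0.5)

strength = np.abs(v_dic.conj().T @ X_k)    # recovered intensity under each speed hypothesis
best = int(np.argmax(strength))
v_est = speeds[best]                       # speed of the dictionary with maximum intensity
static_strength = strength[4]              # what the static (v = 0) atom would report
```

The atom matched to the true speed recovers the full reflection intensity, while the static atom reports only a small fraction of it, which is the mismatch that the velocity-aware dictionary is meant to remove.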
To achieve the above and other related objects, the present application provides a multi-user motion gesture control device applied to a smart speaker, the smart speaker including a loudspeaker array for emitting ultrasonic signals and a microphone array for collecting reflected signals. The device comprises: an acquisition module, configured to acquire a reflected signal, collected by the microphone array, that results from the ultrasonic signal emitted by the loudspeaker array being reflected by one or more user motion gestures; and a processing module, configured to determine, according to the ultrasonic signal and the preprocessed reflected signal, the position and reflection intensity of each reflection source by using a reflected signal positioning model and a sparse recovery algorithm, so as to obtain a positioning result corresponding to the one or more user motion gestures, wherein a sparse recovery algorithm based on a wideband dictionary is applicable to microphone arrays of any shape and spacing, and/or a sparse recovery algorithm based on a velocity-aware dictionary is applicable to user motion gestures at any speed; and to extract the gesture trajectories corresponding to the one or more users according to the positioning result, so as to respond to the preset instructions corresponding to different gesture trajectories.
To achieve the above and other related objects, the present application provides a smart speaker, comprising: a microprocessor storing computer instructions that, when executed, implement the method described above; a loudspeaker array capable of emitting ultrasonic signals; and a microphone array capable of collecting the reflected signals.
To achieve the above and other related objects, the present application provides a computer readable storage medium storing computer instructions which, when executed, perform the method as described above.
In summary, the multi-user motion gesture control method, device, smart speaker and medium of the present application acquire a reflected signal, collected by the microphone array, that results from the ultrasonic signal emitted by the loudspeaker array being reflected by one or more user motion gestures; determine, according to the ultrasonic signal and the preprocessed reflected signal, the position and reflection intensity of each reflection source by using a reflected signal positioning model and a sparse recovery algorithm, so as to obtain a positioning result corresponding to the one or more user motion gestures, wherein a sparse recovery algorithm based on a wideband dictionary is applicable to microphone arrays of any shape and spacing, and/or a sparse recovery algorithm based on a velocity-aware dictionary is applicable to user motion gestures at any speed; and extract the gesture trajectories corresponding to the one or more users according to the positioning result, so as to respond to the preset instructions corresponding to different gesture trajectories.
The present application has the following beneficial effects:
The method and device enable multi-user motion gesture control without additional hardware and without affecting the voice control function of the smart speaker, adding a new interaction mode on top of voice control that can be widely applied in scenarios such as silent control and multi-user control. Moreover, the sparse recovery positioning algorithm based on wideband signals and the velocity-aware dictionary can locate multiple users without being affected by the coherence of the reflected signals, and is applicable to microphone arrays of any shape and spacing and to user motion gestures at any speed.
Drawings
Fig. 1 is a schematic view illustrating a scene of a smart speaker according to an embodiment of the present application.
Fig. 2 is a flowchart illustrating a multi-user motion gesture control method according to an embodiment of the present disclosure.
Fig. 3 is a model diagram of a coordinate system of a microphone according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating the operation of the Matching Pursuit (MP) solver algorithm of an embodiment of the present application.
Fig. 5A-5C are schematic projection diagrams illustrating the correspondence between the elements vec containing angles and distances in the dictionary Dic according to an embodiment of the present application.
Fig. 5D is a schematic diagram illustrating a model of a motion gesture trajectory obtained by accumulating a plurality of reflection positioning points according to an embodiment of the present disclosure.
FIG. 6A is a schematic diagram illustrating a measurement of a distance result of a reflection location according to an embodiment of the present invention.
FIG. 6B is a schematic diagram illustrating measurement of angle results of reflection positioning according to an embodiment of the present invention.
FIG. 6C is a waveform diagram illustrating the positioning and tracking effect of the present application in an embodiment using a static dictionary.
FIG. 6D is a waveform diagram illustrating the localization and tracking effects of the present application in an embodiment using a velocity-aware dictionary.
Fig. 7A-7D are schematic views respectively showing the tracking effect of the present application in the case that the users are 1, 2, 3 and 4 respectively according to an embodiment.
FIG. 8 is a block diagram illustrating a multi-user motion gesture control apparatus according to an embodiment of the present disclosure.
Detailed Description
The following description of the embodiments of the present application is provided by way of specific examples, and other advantages and effects of the present application will be readily apparent to those skilled in the art from the disclosure herein. The present application is capable of other and different embodiments and its several details are capable of modifications and/or changes in various respects, all without departing from the spirit of the present application. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only schematic and illustrate the basic idea of the present application, and although the drawings only show the components related to the present application and are not drawn according to the number, shape and size of the components in actual implementation, the type, quantity and proportion of the components in actual implementation may be changed at will, and the layout of the components may be more complex.
Throughout the specification, when a part is referred to as being "connected" to another part, this includes not only a case of being "directly connected" but also a case of being "indirectly connected" with another element interposed therebetween. In addition, when a certain part is referred to as "including" a certain component, unless otherwise stated, other components are not excluded, but it means that other components may be included.
The terms first, second, third, etc. are used herein to describe various elements, components, regions, layers and/or sections, but are not limited thereto. These terms are only used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Thus, a first element, component, region, layer or section discussed below could be termed a second element, component, region, layer or section without departing from the scope of the present application.
Also, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," and/or "comprising," when used in this specification, specify the presence of stated features, operations, elements, components, items, species, and/or groups, but do not preclude the presence, or addition of one or more other features, operations, elements, components, items, species, and/or groups thereof. The terms "or" and/or "as used herein are to be construed as inclusive or meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means "any of the following: a; b; c; a and B; a and C; b and C; A. b and C ". An exception to this definition will occur only when a combination of elements, functions or operations are inherently mutually exclusive in some way.
Fig. 1 is a schematic structural diagram of a smart speaker according to an embodiment of the present application. As shown, the smart speaker 100 is composed of three components: a speaker array 110 consisting of a plurality of speakers, a microphone array 120 consisting of a plurality of microphones, and a microprocessor 130; the speaker array 110 can emit ultrasonic signals and the microphone array 120 can collect reflected signals.
In principle, extending a loudspeaker's response down to low frequencies is difficult, whereas extending it up to high (ultrasonic) frequencies is comparatively easy. The human hearing range is roughly 20 Hz to 20,000 Hz; sound below 20 Hz is called infrasound and sound above 20,000 Hz is called ultrasound, and many headphones specify frequency response ranges beyond these limits. Therefore, the loudspeakers of existing smart speakers can emit ultrasonic signals and, correspondingly, the microphone array can be used to collect the reflected ultrasonic signals.
It should be noted that the microprocessor 130 may include a processor and a memory, and the processor loads one or more instructions corresponding to the processes of the application program into the memory according to the steps described in fig. 2, and the processor executes the application program stored in the memory 801, thereby implementing the method described in fig. 2.
In various embodiments, the memory may include a random access memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. The memory stores an operating system and operating instructions, executable modules or data structures, or subsets thereof, or expanded sets thereof, wherein the operating instructions may include various instructions for performing various operations. The operating system may include various system programs for implementing various basic services and for handling hardware-based tasks.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The basic principle of the invention is similar to that of sonar or radar: ultrasonic signals are transmitted through the speaker array 110 and reflected by the user's gestures; the microphone array 120 collects the reflected signals; and finally the microprocessor 130 performs signal positioning and trajectory extraction.
The multi-user motion gesture control method for the commercial smart sound box 100 provided by the invention adds a gesture control function without affecting the voice control of the smart sound box 100. The smart sound box 100 emits ultrasonic signals in the manner of sonar or radar and analyzes the reflected ultrasonic signals, so as to recognize gesture control instructions from multiple users.
It should be noted that, in order to implement the multi-user gesture control function on the microphone array 120 of the commercial smart sound box 100, the present invention needs to solve the following three technical problems:
1) The first technical problem is how to design a suitable positioning algorithm for the microphone array 120 such that multiple persons can still be positioned even when the reflected signals are all coherent signals.
2) The second problem is how to accomplish multi-person localization on arbitrarily shaped and spaced microphone arrays 120. The microphone array 120 of the smart speaker 100 is originally designed to handle the human voice band, so the spacing is typically around 5cm to 10 cm. This array spacing is far from sufficient to provide a spatial sampling rate for the ultrasound signal, which can lead to spatial aliasing problems. And the microphone array 120 of the smart sound box 100 may have any shape, such as a uniform linear array and a uniform circular array, etc. commonly available in the market. Therefore, the present invention needs to solve the problem of spatial aliasing of the positioning result caused by insufficient spatial sampling rate on a microphone array with an arbitrary shape, so that the present invention is suitable for a microphone array with an arbitrary shape and an arbitrary distance.
3) Because sound propagates slowly, the Doppler effect caused by the speed of the user's gesture produces an obvious frequency offset in the ultrasonic signal. When the gesture moves rapidly, this frequency offset causes serious positioning errors. Therefore, the third technical problem is how to eliminate the positioning errors caused by the Doppler effect, so that the method is suitable for user motion gestures at any speed.
In recent years, motion gesture interaction has been implemented using ultrasonic signals. Most existing work implements ultrasonic positioning on mobile phones [1] [2], mostly by ranging and triangulation, so that the range is limited and simultaneous interaction by multiple persons cannot be supported. [3] implements multi-user gesture control, but it requires machine learning training, is only applicable to a custom-built non-uniform linear array, and cannot be applied to arrays of arbitrary shape and spacing or to arbitrary gesture movement speeds. There are also some localization and tracking systems based on radio frequency signals [4] [5], which likewise cannot be applied to arrays of arbitrary shape and spacing or to arbitrary gesture movement speeds. As for signal positioning algorithms, the traditional MUSIC algorithm [6] [7] is the most widely applied, and in recent years positioning algorithms based on the idea of sparse recovery have gradually been developed [8] [9].
In order to solve the problems, the application provides a multi-user motion gesture control method based on a sparse recovery positioning algorithm of a broadband dictionary and a velocity perception dictionary. Meanwhile, a dictionary based on broadband signals and having a speed perception function is constructed and provided for a sparse recovery positioning algorithm to overcome the problems of space aliasing and Doppler effect, so that the method is suitable for microphone arrays with any shapes and intervals and any gesture movement speed, and the positioning of gesture reflection of multiple users is completed. And finally, extracting the gesture track of the user on the positioning result.
The sparse recovery positioning algorithm based on the broadband dictionary and the velocity perception dictionary can position a plurality of users without being influenced by the coherence of a reflected signal, and can solve the problems of space aliasing caused by insufficient space sampling rate of a microphone array and positioning errors caused by Doppler effect; meanwhile, by positioning the gestures of multiple users to form a motion trajectory, the smart sound box 100 can recognize the motion gestures of the multiple users; in addition, the present invention can be directly applied to the smart sound box 100 having the uniform circular microphone array 120 on the market. On the premise of not influencing the original voice interaction function, a multi-user motion gesture control function is added to the voice interaction device. The method can be widely applied to scenes such as silent control, multi-user control and the like.
Fig. 2 is a schematic flow chart of a multi-user motion gesture control method according to an embodiment of the present application. The method is applied to a smart sound box as shown in fig. 1, which includes a speaker array for emitting ultrasonic signals and a microphone array for collecting reflected signals. As shown in fig. 2, the method includes:
Step S210: acquire the reflected signals that result from ultrasonic signals emitted by the speaker array being reflected by one or more user motion gestures and that are collected by the microphone array.
Briefly, in the application scenario shown in fig. 1, the speaker array of the smart sound box emits an ultrasonic signal; if one or more users make motion gestures at this time, part of the ultrasonic signal is reflected and collected by the microphone array.
Step S220: according to the ultrasonic signals and the preprocessed reflected signals, determine the position and the reflection intensity of each reflection source by using a reflected signal positioning model and a sparse recovery algorithm, so as to obtain a positioning result corresponding to one or more user motion gestures; wherein a sparse recovery algorithm based on a broadband dictionary is suitable for microphone arrays of any shape and spacing; and/or a sparse recovery algorithm based on a speed-aware dictionary is suitable for user motion gestures at any speed.
In an embodiment of the present application, the reflection signal is preprocessed to eliminate self-interference and static reflection, and obtain a reflection signal including a limited number of motion gestures.
It should be noted that, since the environment includes an interference signal and a static reflection signal, the acquired reflection signal is first preprocessed to eliminate self-interference and static reflection, where the preprocessing may adopt a common signal preprocessing method, and the application is not limited thereto. It is critical to get a signal that contains a limited number of reflections of the gesture motion. Based on the method, the signal positioning problem can be converted into the sparse recovery problem of recovering the most sparse reflected signal component from the overall signal of the microphone array through a subsequent sparse recovery algorithm, and the reflected signal positioning problem is further solved.
First, the first problem to be solved by the present invention is the reflected signal localization. The following embodiments may facilitate understanding of a mathematical model representation of the reflected signal localization problem.
The signals received by the different microphones of the microphone array are reflected signals with different delays, which can be parameterized by the distance and angle of the position of the reflection source, i.e. the hand moved by the user, relative to the microphone array. Suppose that the microphone array of the smart sound box is equipped with L microphones, the generated ultrasonic signal is $s_0(t)$, and there are D signals reflected back. The i-th reflection source is located at distance $d_i$ and angle $\theta_i$ from the smart sound box. The reflected signal $x_k(t)$ reflected back to microphone k is:
where $f_i$ denotes the delay and attenuation effects caused by the i-th reflection source position.
Then, the whole microphone array collects the reflection signal vector as:
$X=[x_0(t),x_1(t),\dots,x_{L-1}(t)]^T$;
the signal positioning algorithm aims at estimating the position parameters $[(d_i,\theta_i),\ i=1,2,\dots,D]$ of the reflection sources from the reflection signal vector X collected by the microphone array. It should be noted that, in the reflected signal localization model of the present application, both the ultrasonic signal and the reflected signal can be regarded as existing in vector form.
For example, as shown in the model diagram of the microphone coordinate system shown in fig. 3, a uniform annular microphone array with a radius R and L microphones and a distance L is given. Where microphone No. 0 is on the x-axis of the coordinate system and the origin of coordinates is taken as reference point ref. Therefore, the angle between the x-axis and the kth microphone is:
the coordinates of microphone No. k are then:
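The angle and coordinates of each microphone follow from the stated geometry (microphone No. 0 on the x-axis, L microphones evenly spaced on a circle of radius R around the reference point). The following is a minimal Python sketch under that reading; the function name `mic_positions` and the counter-clockwise numbering are illustrative assumptions, not taken from the original formulas.

```python
import math

def mic_positions(num_mics, radius):
    """Return [(angle_k, x_k, y_k)] for a uniform circular array.

    Assumes microphone 0 lies on the positive x-axis and microphones are
    numbered counter-clockwise (an assumption made for this sketch).
    """
    positions = []
    for k in range(num_mics):
        angle = 2.0 * math.pi * k / num_mics  # angle between x-axis and mic k
        positions.append((angle,
                          radius * math.cos(angle),
                          radius * math.sin(angle)))
    return positions

mics = mic_positions(6, 0.05)  # 6 microphones on a circle of radius 5 cm
```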
for the sake of clarity, the present application will first describe the positioning model in the presence of only a single reflection signal, and then describe the positioning model in the presence of multiple reflection signals.
When only a single reflection is present, the reflected signal is assumed to come from distance $d_i$ and angle $\theta_i$ relative to the reference point ref; the phase reflected back to ref is then:
where $v_s$ represents the speed of sound and f represents the frequency of the transmitted ultrasonic signal $s_0(t)$. Depending on the geometry of the uniform annular microphone array, the phase offset between the k-th microphone and ref can be expressed in terms of the angle $\theta_i$ as:
where $\Delta d_k$ denotes the path difference between the k-th microphone and ref, indicated by the thick dashed line in fig. 3. Therefore, the phase of the reflected signal arriving at the k-th microphone is:
$\phi_k(d_i,\theta_i)=\phi_{ref}+\Delta\phi_k$;
therefore, for a reflection source position i at $(d_i,\theta_i)$, the reflected signal X collected by the microphone array can be expressed as:
where $a(d_i,\theta_i)$ represents the guidance vector of reflection source position i, and the constant $c_i$ represents the attenuation factor of reflection source position i.
When there are multiple reflections, formula 1 is generalized to a scene with D reflection source positions; the microphone array signal X is then:
further, the present application defines the guidance matrix used to express the launch situation as:
$A=[a(d_1,\theta_1),\dots,a(d_D,\theta_D)]$;
where $a(d_D,\theta_D)$ represents the guidance vector of the D-th reflection source position, and $d_D$ and $\theta_D$ respectively denote the distance and angle of the D-th reflection source position relative to the smart sound box.
At the same time, the signal attenuation factors and the ultrasonic signal $s_0(t)$ are combined:
$S=[c_1,\dots,c_D]^T s_0(t)=[s_1(t),\dots,s_D(t)]^T$;
taking the noise into account, the relationship between the final reflected signal collected by the microphone array and the ultrasonic signal emitted by the loudspeaker array is as follows:
the multi-reflection signal positioning problem is to solve for the guidance matrix A in formula 2 from the known reflected signal X collected by the microphone array and the ultrasonic signal S emitted by the speaker array at that time. The positions of the reflection sources are then known from the $(d_D,\theta_D)$ in A.
In brief, the reflected signal X collected by the microphone array equals the ultrasonic signal S emitted by the speaker array after the reflection delay transformation A, with a noise signal added.
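To make the model concrete, the sketch below builds a guidance vector for a uniform circular array and superimposes several reflections into a noise-free array snapshot. It is only an illustration under stated assumptions: a far-field plane wave, a round-trip reference path of 2d, a path difference of R·cos(θ − α_k), a speed of sound of 343 m/s, and $s_0(t)$ normalized to 1 at the snapshot. The function names `steering_vector` and `array_signal` are hypothetical, and the sign conventions may differ from the original formulas.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def steering_vector(d, theta, freq, num_mics, radius):
    """Guidance vector a(d, theta) for a uniform circular array.

    Assumptions of this sketch (not taken from the original formulas):
    far-field plane wave, round-trip reference path 2*d, and path
    difference delta_d_k = radius * cos(theta - alpha_k).
    """
    wavelength = SPEED_OF_SOUND / freq
    phi_ref = 2.0 * math.pi * (2.0 * d) / wavelength
    vec = []
    for k in range(num_mics):
        alpha_k = 2.0 * math.pi * k / num_mics       # angle of microphone k
        delta_d = radius * math.cos(theta - alpha_k)  # path difference to ref
        delta_phi = 2.0 * math.pi * delta_d / wavelength
        vec.append(cmath.exp(-1j * (phi_ref - delta_phi)))
    return vec

def array_signal(sources, freq, num_mics, radius):
    """Noise-free X = A * S for sources given as (d, theta, c) triples."""
    x = [0j] * num_mics
    for d, theta, c in sources:
        a = steering_vector(d, theta, freq, num_mics, radius)
        for k in range(num_mics):
            x[k] += c * a[k]
    return x

# Two reflections: (150 cm, 30 deg) and (80 cm, 150 deg), with assumed attenuations.
X = array_signal([(1.5, math.radians(30), 0.8), (0.8, math.radians(150), 0.5)],
                 17000.0, 6, 0.05)
```

Because all reflections are delayed copies of the same transmitted signal, the columns of A are scaled by constants only, which is why the reflections are coherent.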
To solve the reflected signal positioning problem, the present application uses a sparse recovery positioning algorithm, i.e., it solves formula 2 with a sparse recovery algorithm.
In general, the starting points of the sparse recovery positioning algorithm are as follows: since the static reflected signals in the environment can be removed by the signal interference elimination technology, the reflected signals received by the microphone array mainly include signals reflected by a limited number of gesture motions after the interference elimination. Thus, the signal localization problem can be translated into a sparse recovery problem that recovers the sparsest reflected signal component from the overall signal of the microphone array. In the specific technical details, the sparse recovery positioning algorithm comprises the following steps:
first, the present application expands the guidance matrix A in formula 2 into an overcomplete matrix $A'$ of dimension N ($N \gg D$):
$A'=[a(d_1,\theta_1),a(d_2,\theta_2),\dots,a(d_N,\theta_N)]$;
where $a(d_N,\theta_N)$ represents a possible reflection source position at distance $d_N$ and angle $\theta_N$ relative to ref.
Accordingly, the ultrasound signal S emitted by the loudspeaker array is also correspondingly expanded into a sparse vector $S'$ of dimension N:
$S'=[0,0,\dots,s_1,0,\dots,s_2,\dots,s_i,\dots,s_D,\dots,0]^T$
where, if a reflected signal exists at the reflection source position $a(d_N,\theta_N)$, the corresponding coefficient is $s_i$; otherwise it is 0.
Thereby, the reflected signal localization model becomes:
where, because the application scenario of this application is active ultrasonic positioning, the ultrasonic signal $s_0(t)$ emitted by the speaker array is known in advance, so $s_0(t)$ is further incorporated into $A'$, giving:
where Dic is a pre-computed overcomplete matrix called a dictionary, each element $vec_i$ of which stores the delayed signal reflected from the corresponding position $(d_N,\theta_N)$:
$Dic=[vec(d_1,\theta_1,t),\dots,vec(d_N,\theta_N,t)]$;
similar to the sparse vector $S'$, C is also a sparse vector representing the reflected signal strength from the corresponding location:
$C=[0,0,\dots,c_1,0,\dots,c_2,\dots,c_i,\dots,c_D,\dots,0]^T$;
when the number D of valid reflections is much smaller than the dimension N of the dictionary, the signal localization problem can be transformed into a sparse recovery problem, i.e. find a sparsest vector C to represent the distribution of the reflection signals in the dictionary that make up the microphone array signal vector X:
$\min \|C\|_0 \quad \text{s.t.} \quad \|X - Dic\cdot C\|_2 \le \varepsilon$; (formula 3)
where $\|C\|_0$ represents the 0-norm of the vector C, i.e. the number of non-zero elements in C; s.t. indicates that the constraint that follows must be satisfied; $\|X - Dic\cdot C\|_2$ represents the 2-norm of the difference between the reflected signal vector X and Dic·C; and ε is a very small number close to 0. Specifically, formula 3 expresses the following: find the C with the minimum 0-norm subject to the 2-norm of the difference between X and Dic·C being as small as possible (i.e., X approximately equal to Dic·C). The variable is C, and the objective is to find the vector C with the fewest non-zero elements that still satisfies the condition that Dic·C is approximately equal to X.
After solving for the vector C, the present application can obtain the reflection source positions $Pos_n$ from the positions of the non-zero elements $c_i$ in C, and at the same time obtain the reflection intensities from the values of the non-zero elements $c_i$. Here $Pos_n$ denotes the n positions: the indices of the n non-zero elements $c_i$ in the vector C can be converted into reflection positioning positions $Pos_n$ in the real world.
In the embodiment of the present application, the vector C may be solved by a solver. Preferably, a Matching Pursuit (MP) solver is employed to solve optimization formula 3 of the sparse recovery positioning algorithm. The matching pursuit solver is a greedy iterative solver: in each iteration it finds the vector in the dictionary Dic onto which the signal X has the largest projection, and then subtracts that component from X, until the number of iterations reaches a threshold value. Fig. 4 is an operation diagram of the Matching Pursuit (MP) solver.
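The greedy iteration described above can be sketched in a few lines of Python. This is a generic matching pursuit over complex-valued atoms, not the applicant's implementation; the names `inner` and `matching_pursuit`, the early stop on a negligible residual, and the toy four-atom dictionary are assumptions made for illustration.

```python
import cmath
import math

def inner(u, v):
    """Complex inner product <u, v> = sum(conj(u_i) * v_i)."""
    return sum(a.conjugate() * b for a, b in zip(u, v))

def matching_pursuit(dictionary, x, max_iter):
    """Greedy sparse recovery of C with x ~= sum_j C[j] * dictionary[j].

    dictionary: list of atoms, each a list of complex samples.
    Returns {atom_index: coefficient}; the index gives the position,
    the coefficient gives the reflection intensity.
    """
    residual = [complex(v) for v in x]
    norms = [inner(atom, atom).real for atom in dictionary]
    coeffs = {}
    for _ in range(max_iter):
        # Select the atom with the largest normalized projection of the residual.
        best_j, best_proj, best_score = None, 0j, 0.0
        for j, atom in enumerate(dictionary):
            proj = inner(atom, residual) / norms[j]
            score = abs(proj) * math.sqrt(norms[j])
            if score > best_score:
                best_j, best_proj, best_score = j, proj, score
        if best_j is None or best_score < 1e-9:
            break  # residual is negligible (early stop added for this sketch)
        coeffs[best_j] = coeffs.get(best_j, 0j) + best_proj
        residual = [r - best_proj * a
                    for r, a in zip(residual, dictionary[best_j])]
    return coeffs

# Toy dictionary of four mutually orthogonal atoms; two are active in x.
toy_dic = [[cmath.exp(2j * math.pi * j * n / 4) for n in range(4)]
           for j in range(4)]
x = [2.0 * a + 0.5 * b for a, b in zip(toy_dic[1], toy_dic[3])]
C = matching_pursuit(toy_dic, x, max_iter=5)
```

With orthogonal atoms the recovery is exact after two iterations. A real positioning dictionary is overcomplete and its atoms are correlated, so the result is only approximate and depends on the iteration threshold.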
The second problem that the invention is important to solve is how to accomplish multi-person localization on a microphone array of arbitrary shape and spacing. Most commercial smart speakers are designed to operate in the low frequency human voice band, while ultrasonic signal based gesture tracking systems operate in the ultrasonic band (17kHz-23kHz) to reduce interference to the user. Therefore, for a narrow band ultrasonic signal with frequency f, the microphone array spacing (5cm-10cm) on a commercial smartspeaker is typically much greater than half the ultrasonic wavelength (1.5cm-2 cm). The spatial sampling rate of the microphone array under the ultrasonic frequency band is far insufficient, and the problem of spatial aliasing of a sparse recovery positioning algorithm is caused. And the microphone array of the intelligent sound box can have any shape, such as a uniform linear array, a uniform annular array and the like which are common in the market. Therefore, the present invention needs to solve the problem of spatial aliasing of positioning results caused by insufficient spatial sampling rate on a microphone array with an arbitrary shape, so that the present invention is suitable for microphone arrays with arbitrary shapes and distances.
When the microphone array distance R is greater than half the wavelength of the narrowband ultrasonic signal with the frequency f:
for a reflection source position at distance d and angle θ, the relative signal phase at the k-th microphone is $\phi_k(d,\theta)$. However, in this case there are a plurality of angle parameters $\theta_1,\theta_2,\dots,\theta_n$ different from θ that generate similar phase values:
$\phi_k(d,\theta)\approx\phi_k(d,\theta_1)+2k_1\pi\approx\phi_k(d,\theta_2)+2k_2\pi\approx\dots\approx\phi_k(d,\theta_n)+2k_n\pi$;
this means that there are a number of similar vectors in the dictionary Dic that satisfy:
$vec(d,\theta)\approx vec(d,\theta_1)\approx vec(d,\theta_2)\approx\dots\approx vec(d,\theta_n)$;
this aliasing problem also exists in the distance domain:
$vec(d,\theta)\approx vec(d_1,\theta)\approx vec(d_2,\theta)\approx\dots\approx vec(d_m,\theta)$;
The present application illustrates the spatial aliasing problem intuitively with a specific example. Assume a uniform annular microphone array with microphone spacing R = 5 cm and L = 6 microphones, a transmitted narrowband ultrasonic signal of frequency f = 17 kHz, and a reflection source in space at (d, θ) = (150 cm, 30°). A matching pursuit solver is used to compute, for the sparse recovery positioning algorithm, the projections onto all vec(d, θ) in the dictionary Dic. In each iteration, the matching pursuit solver chooses the $vec(d_i,\theta_i)$ with the largest projection as the estimated position.
Finally, the projections calculated by the matching pursuit solver for all elements vec(d, θ) of the dictionary Dic, indexed by distance and angle, are shown in fig. 5A. The projection result exhibits severe spatial aliasing in both the distance direction and the angular direction, and the (150 cm, 30°) position cannot be identified as the position of maximum projection.
Therefore, when the microphone array spacing of a commercial smart speaker is much larger than half the wavelength of the transmitted narrowband signal, the insufficient spatial sampling rate may cause a spatial aliasing problem to the sparse recovery positioning algorithm. In the application, the sparse recovery positioning algorithm is improved, and the problem of space aliasing is solved by using the sparse recovery positioning algorithm based on the broadband dictionary, so that the method is suitable for microphone arrays with any shapes and intervals.
In an embodiment of the present application, the wideband dictionary-based sparse recovery algorithm is applicable to a microphone array with any shape and distance, including:
A. setting the ultrasonic signal into a broadband signal with K subcarriers by utilizing orthogonal frequency division multiplexing to construct a broadband dictionary;
B. according to the characteristic that the sparsity of the reflected signals is the same among dictionaries corresponding to different subcarriers, when the transmitted ultrasonic signals are set to be broadband signals with K frequency components, sparse recovery positioning is carried out on each frequency component by utilizing the broadband dictionary comprising K sub dictionaries corresponding to the K frequency components respectively;
C. the positioning results of all the frequency components are overlapped, the common intersection of the aliasing positioning results of the K frequency components is used as the positioning result, the problem of the spatial aliasing of the positioning results caused by insufficient spatial sampling rate can be solved on the microphone array in any shape, and therefore the method is suitable for the microphone array in any shape and any distance.
Specifically, the present application proposes to eliminate spatial aliasing using the frequency characteristics of a broadband signal. Namely, the method of the present application is based on an important finding: in the case of ultrasonic signals of different frequencies, the same microphone array has different positions of spatial aliasing, but simultaneously contains the correct position of the reflected source position signal.
Then, when the transmitted signal is not a single narrowband ultrasonic signal but has K frequency components $f_1,f_2,\dots,f_K$ with corresponding dictionaries $Dic_1,Dic_2,\dots,Dic_K$, and the positioning results of the frequency components are superimposed, the correct result, which is the only common intersection of the aliased results of the K frequency components, is highlighted and enhanced, and is thereby distinguished from the other aliased results.
The sparse recovery positioning algorithm based on the broadband dictionary comprises the following steps:
First, the present application uses Orthogonal Frequency Division Multiplexing (OFDM) to generate a wideband signal having K subcarriers. Each subcarrier $f_k$ has its own dictionary $Dic_k$, and the broadband dictionary wDic is constructed from the $Dic_k$:
since in the application scenario of the present application the subcarriers are transmitted and reflected together, the sparsity of the reflected signal is the same between the dictionaries of different subcarriers, i.e.:
$C_k=C_l \ \text{for}\ k\neq l$;
that is, $C_k=C_l$ when $k\neq l$, where k and l denote different subcarrier indices. Although the subcarriers differ, their reflected-signal sparse vectors $C_k$ and $C_l$ are the same.
The positioning problem is then equivalent to the following multi-dictionary joint optimization problem:
$\min \|C_k\|_0 \ \text{for}\ k=1,2,\dots,K$
$\text{s.t.}\ \|X_k-Dic_k\cdot C_k\|_2\le\varepsilon \ \text{and}\ C_k=C_l\ \text{for}\ k\neq l$;
this expresses the following: subject to the 2-norm of the difference between $X_k$ and $Dic_k\cdot C_k$ being as small as possible for each subcarrier, and the sparse vectors $C_k$ and $C_l$ of different subcarriers being the same, find for every subcarrier simultaneously the $C_k$ with the minimum 0-norm.
Next, the present application uses the important constraint $C_k=C_l$ for $k\neq l$ to transform the multi-dictionary joint optimization problem into a single-dictionary optimization problem. Specifically, the $X_k$ are arranged vertically:
$X=[X_1,X_2,\dots,X_K]^T$;
where the wideband dictionary wDic is formed by arranging the $Dic_k$ vertically:
$wDic=[Dic_1,Dic_2,\dots,Dic_K]^T$;
the corresponding sparse vector wC is then:
$wC=C_1=C_2=\dots=C_K$;
this gives:
solving for the sparse vector wC with a matching pursuit solver based on the wideband dictionary wDic is approximately equivalent to performing sparse recovery positioning on each subcarrier separately and taking the common intersection of the per-subcarrier positioning results as the positioning result.
After the vector wC is solved, the corresponding position of the reflection source can be obtained from the position of the nonzero element in the vector wC, and the reflection intensity can be obtained from the numerical value of the nonzero element.
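The effect of stacking the per-subcarrier dictionaries can be illustrated with a simplified, angle-only sketch. It assumes a far-field model, a 6-microphone circular array of radius 5 cm, a speed of sound of 343 m/s, and 120 subcarriers spaced 50 Hz apart from 17 kHz; the distance dimension is omitted, and the helper names `atom`, `wideband_atom` and `projection` are hypothetical. Each candidate angle gets one stacked atom (one column of wDic), and the noise-free stacked signal is projected onto every column.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed

def atom(theta_deg, freq, num_mics=6, radius=0.05):
    """Far-field, angle-only atom for one subcarrier (assumed model)."""
    theta = math.radians(theta_deg)
    out = []
    for k in range(num_mics):
        alpha_k = 2.0 * math.pi * k / num_mics
        delay = radius * math.cos(theta - alpha_k) / SPEED_OF_SOUND
        out.append(cmath.exp(2j * math.pi * freq * delay))
    return out

def wideband_atom(theta_deg, freqs):
    """Stack the K per-subcarrier atoms vertically (one column of wDic)."""
    stacked = []
    for f in freqs:
        stacked.extend(atom(theta_deg, f))
    return stacked

def projection(u, v):
    """Magnitude of the complex inner product <u, v>."""
    return abs(sum(a.conjugate() * b for a, b in zip(u, v)))

freqs = [17000.0 + 50.0 * k for k in range(120)]  # 17 kHz up to 22.95 kHz
angles = list(range(0, 360))                      # 1-degree candidate grid
x = wideband_atom(30, freqs)                      # noise-free reflection from 30 degrees
scores = [projection(wideband_atom(a, freqs), x) for a in angles]
estimate = angles[scores.index(max(scores))]
```

Because the aliased angles differ from subcarrier to subcarrier, only the true angle adds up coherently across all K blocks of the stacked atom, so the projection peaks there.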
As shown in fig. 5B, the same scenario as that in fig. 5A is adopted to visually demonstrate the effect of the sparse recovery positioning algorithm based on the wideband dictionary. The present application first tests the effect of using 10 subcarriers in a wideband signal generated by Orthogonal Frequency Division Multiplexing (OFDM) with a bandwidth of 17kHz-17.5 kHz. Aliasing in the range dimension of the projection results in fig. 5B is significantly reduced compared to the single frequency signal effect of fig. 5A. The application then tests the effect of 120 sub-carriers with a bandwidth of 17kHz to 23 kHz. As shown in fig. 5C, aliasing in the projection results is substantially eliminated and the correct true (150cm,30 °) position can be effectively identified by the matching pursuit solver.
And finally, running the positioning algorithm for multiple times, and accumulating the multiple reflection positioning points to obtain the motion gesture track shown in the figure 5D. The gesture trajectories of the two users can be seen as a triangle and a circle.
It should be further noted that, due to the low propagation speed of sound, the doppler effect caused by the gesture speed of the user motion has a significant frequency shift problem for the ultrasonic signal. When the gesture moves rapidly, such as waving a palm in a voice interactive game, the frequency of the reflected signal will shift. This will reduce the performance of gesture tracking because the signal received by the smart speaker is no longer a delayed version of its transmitted signal. Therefore, the third problem that the present application addresses is the doppler effect problem to be suitable for the user motion gesture at any speed.
Specifically, assume that the original frequency domain sequence of the meta-signal generated by Orthogonal Frequency Division Multiplexing (OFDM) is:
$[s(k),\ k=0,1,\dots,N-1]$;
the time domain sequence of the transmitted OFDM meta-signal is:
due to the Doppler effect, the true time domain sequence becomes:
where ε is the normalized frequency offset, i.e., the frequency offset divided by the subcarrier spacing. The corresponding frequency domain sequence is:
in the ideal case without the Doppler effect, for each subcarrier, the relationship between the signal received by the smart speaker and the transmitted signal is:
in practice, however, the signal received by the smart speaker is:
given a speed of motion v, the frequency offset of an ultrasonic signal of frequency $f_i$ is:
so, when the user's hand is in motion, ε ≠ 0, and thus $s_d(k) \neq s(k)$. Furthermore, the difference between $s_d(k)$ and $s(k)$, i.e. the difference between the actual reflected signal model and the ideal model, grows as the motion speed increases.
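The following sketch illustrates why a non-zero ε distorts the received subcarriers. It rests on two assumptions that are not confirmed by the original formulas: the common reflection approximation 2·v·f/v_s for the frequency offset, and the standard carrier-frequency-offset model in which each subcarrier index k is replaced by k + ε in the inverse DFT. The function names are hypothetical.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed

def doppler_shift(freq, speed):
    """Frequency offset of a tone reflected by a target moving at `speed` m/s.

    Uses the small-speed reflection approximation 2 * v * f / v_s
    (an assumption of this sketch; positive speed = approaching).
    """
    return 2.0 * speed * freq / SPEED_OF_SOUND

def ofdm_time(s, eps=0.0):
    """Inverse DFT of frequency sequence s with normalized offset eps."""
    n_sc = len(s)
    return [sum(s[k] * cmath.exp(2j * math.pi * (k + eps) * n / n_sc)
                for k in range(n_sc)) / n_sc
            for n in range(n_sc)]

def dft(x):
    """Plain DFT, used to return to the frequency domain."""
    n_sc = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_sc)
                for n in range(n_sc))
            for k in range(n_sc)]

s = [0, 1, 0, 0, 0, 0, 0, 0]           # one active subcarrier
s_static = dft(ofdm_time(s, eps=0.0))  # recovers s
s_moving = dft(ofdm_time(s, eps=0.3))  # energy leaks into other subcarriers
```

Under these assumptions, a gesture at 40 cm/s shifts a 20 kHz subcarrier by about 46.6 Hz. With 120 subcarriers spread over 17 kHz to 23 kHz (about 50 Hz spacing), that is already close to one full subcarrier spacing.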
Here, the present application illustrates the influence of the Doppler effect with an example. Assume that one reflection source position is at (80 cm, 150°) and that the instantaneous velocity varies from -200 cm/s to 200 cm/s, where a positive velocity indicates approaching and a negative velocity indicates receding. For all instantaneous speeds, the dictionary $Dic_k$ generated for the ideal stationary case is still supplied to the matching pursuit solver. This is equivalent to not dealing with the Doppler effect at all.
Finally, the result of the reflection localization will be shown in fig. 6A and 6B. Fig. 6A shows the distance measurement result, and fig. 6B shows the angle measurement result. As the speed increases, the distance measurements become more and more scattered, deviating significantly from the true reference value. Performance degradation is also observed in the angle measurements: as the speed increases, the intensity of the localization on the true angular reference value becomes weaker and weaker.
In addition, the influence of the Doppler effect on the positioning result can be shown more intuitively in an actual use scenario with two users. Specifically, one user makes a gesture movement at an average speed of about 10 cm/s, and the other user makes a gesture movement at an average speed of about 40 cm/s. The dictionary $Dic_k$ generated for the ideal stationary case is supplied to the matching pursuit solver. This represents positioning and tracking of the two users' gesture movements without any handling of the Doppler effect.
Finally, the localization and tracking effect using the static dictionary is shown in FIG. 6C. It can be seen that for gestures with faster motion speed, the localization and tracking effect is poor: if the doppler effect is not dealt with and the dictionary is used only at rest speed as usual, the distance measurements for gestures with faster motion speed will often deviate significantly from the correct result. And the measurement of the distance and angle of the gesture is also often interrupted and lost due to the reduced signal strength.
Therefore, in order to correctly position and track the user gesture when the movement speed of the user gesture is high, the doppler effect problem needs to be solved in the invention so as to be suitable for the user movement gesture at any speed. In the application, the dictionary with the speed perception function is further constructed on the basis of the sparse recovery positioning algorithm based on the broadband dictionary, and the sparse recovery positioning algorithm based on the speed perception dictionary is provided.
Based on the above analysis, in an actual use scenario, when the user gesture moves faster, the influence of the Doppler effect is not negligible, and the reflection positioning and the trajectory extraction are seriously affected. In a scene with an obvious Doppler effect, the dictionary $Dic_k$ provided to the matching pursuit solver should no longer be a static dictionary, but should adapt to the dynamics of the gesture accordingly to reduce positioning errors.
Therefore, the application provides a sparse recovery positioning algorithm based on a speed perception dictionary to solve the Doppler effect problem. The sparse recovery positioning algorithm based on the speed perception dictionary is based on the sparse recovery positioning algorithm based on the broadband signal, and has the difference that the dictionary is expanded into three dimensions, and the rest parts have no difference.
In an embodiment of the present application, the sparse recovery algorithm based on a speed-aware dictionary is applicable to a user motion gesture at any speed, and includes:
A. According to the frequency offset of each subcarrier caused by the Doppler effect at different speeds, a speed perception dictionary matched to each speed is constructed on the basis of the broadband dictionary;
B. Sparse recovery positioning is carried out separately with the dictionary corresponding to each speed. The positioning result of the dictionary that matches the user's gesture movement speed is free of Doppler-induced positioning error and has the maximum reflection intensity. The positioning result of the dictionary with the maximum reflection intensity is therefore taken as the final positioning result of the whole speed perception dictionary, which solves the problem of positioning errors caused by the Doppler effect and makes the method suitable for user motion gestures at any speed.
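As a rough illustration of step A, the frequency offset of each subcarrier can be estimated with the standard two-way Doppler approximation, delta_f = 2 * v * f_k / c, for a reflector moving with radial speed v. The sketch below is not taken from the application: the speed of sound (343 m/s), the candidate speed grid, and the function name doppler_offsets are assumptions chosen only for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def doppler_offsets(subcarrier_hz, speeds_mps, c=SPEED_OF_SOUND):
    """Return a (num_speeds, num_subcarriers) array of frequency offsets in Hz.

    Uses the two-way (reflection) approximation delta_f = 2 * v * f / c,
    where v > 0 means the hand moves toward the sound box.
    """
    f = np.asarray(subcarrier_hz, dtype=float)
    v = np.asarray(speeds_mps, dtype=float)
    return 2.0 * v[:, None] * f[None, :] / c


subcarriers = np.array([17_000.0, 20_000.0, 23_000.0])
speeds = np.array([-1.0, 0.0, 1.0])  # hypothetical candidate speeds in m/s
offsets = doppler_offsets(subcarriers, speeds)
```

Under these assumptions, a hand moving at 1 m/s shifts a 20 kHz subcarrier by roughly 117 Hz, which illustrates why a dictionary built for zero speed no longer matches the received signal.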
In particular, for each subcarrier f_k, the application uses sd(k, v_i) to denote the subcarrier signal at speed v_i; the dictionary of subcarrier f_k matched to speed v_i should then be:
The dictionaries corresponding to the different speeds are combined to construct the speed perception dictionary vDic_k:
The speed perception dictionary is equivalent to extending the two-dimensional dictionary (d, θ) into a three-dimensional dictionary (d, θ, v). The corresponding sparse vector is vC_k = [0, 0, …, c_1, 0, …, c_i, …, 0]^T, and its dimension changes from N to N × M;
At this time, for each subcarrier:
Finally, as before, the application uses the matching pursuit solver to solve for the sparse vector vC_k on the basis of the speed perception dictionary vDic_k. This process is equivalent to performing sparse recovery positioning with the dictionaries corresponding to all speeds, taking the speed of the dictionary whose positioning result has the maximum reflection intensity as the estimate of the user's gesture movement speed, and taking that positioning result as the final positioning result. After vC_k is solved, the position and speed of each reflection source are obtained from the positions of its non-zero elements, and the reflection intensity is obtained from the values of those non-zero elements.
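The solving step can be sketched as a greedy matching pursuit over a dictionary whose columns are indexed jointly by position and speed. This is a minimal sketch under stated assumptions, not the application's implementation: the atoms are random unit-norm placeholders rather than the subcarrier dictionaries defined above, and the sizes N, M, L, the column ordering, and all names are hypothetical.

```python
import numpy as np


def matching_pursuit(dictionary, signal, num_sources):
    """Greedy matching pursuit over unit-norm atoms (columns of `dictionary`).

    Returns a list of (column_index, complex_coefficient) pairs.
    """
    residual = signal.astype(complex).copy()
    picks = []
    for _ in range(num_sources):
        # Correlate the residual with every atom and keep the strongest match.
        corr = dictionary.conj().T @ residual
        best = int(np.argmax(np.abs(corr)))
        picks.append((best, corr[best]))
        # Remove the contribution of the selected atom from the residual.
        residual = residual - corr[best] * dictionary[:, best]
    return picks


# Toy sizes (assumptions): N position cells, M candidate speeds, L samples.
N, M, L = 10, 4, 128
rng = np.random.default_rng(0)

# Placeholder atoms: one unit-norm column per (speed, position) pair.
# Column j corresponds to speed index j // N and position index j % N.
atoms = rng.standard_normal((L, N * M)) + 1j * rng.standard_normal((L, N * M))
v_dic = atoms / np.linalg.norm(atoms, axis=0)

# Two synthetic reflectors: (speed index, position index, reflection intensity).
truth = [(2, 7, 1.0), (0, 3, 0.6)]
received = sum(c * v_dic[:, m * N + n] for m, n, c in truth)

estimates = []
for col, coeff in matching_pursuit(v_dic, received, num_sources=len(truth)):
    speed_idx, pos_idx = divmod(col, N)  # position of the non-zero element
    estimates.append((speed_idx, pos_idx, abs(coeff)))  # value gives intensity
```

Each recovered column index is split into a speed index and a position index, mirroring how the non-zero entries of the N × M sparse vector encode both quantities, while the magnitude of the coefficient plays the role of the reflection intensity.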
As shown in FIG. 6D, compared with FIG. 6C, the speed perception dictionary gives distance and angle results, and a gesture trajectory extracted from them, with better accuracy and continuity.
Step S230: and extracting gesture tracks corresponding to one or more users according to the positioning result so as to respond to preset instructions corresponding to different gesture tracks.
In various embodiments, the gesture may include: sliding in different directions, such as up, down, left, right, and diagonal directions (for example upper right or lower right); a closed trajectory such as a triangle or a circle; or a combination of directions, such as a V-shape. Each gesture trajectory can be bound to a preset instruction. For example, an upper-right gesture trajectory corresponds to increasing the volume, a lower-right gesture trajectory corresponds to decreasing the volume, a rightward gesture trajectory corresponds to switching channels or songs, and a circular gesture corresponds to opening the settings or powering on and off.
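The binding between trajectories and preset instructions can be sketched as a simple lookup. The example below only covers straight swipes, classified by the angle of their net displacement; the eight-sector rule, the direction labels, and the command strings are assumptions for illustration, and closed shapes such as circles or triangles would need a separate shape-matching step that is not shown.

```python
import math

# Preset instructions following the examples in the text; the key names and
# command strings are hypothetical.
PRESET_COMMANDS = {
    "upper_right": "volume_up",
    "lower_right": "volume_down",
    "right": "next_track",
}

# Eight direction labels, counter-clockwise, each covering a 45-degree sector.
DIRECTIONS = ["right", "upper_right", "up", "upper_left",
              "left", "lower_left", "down", "lower_right"]


def swipe_direction(trajectory):
    """Label a swipe by the angle of its net displacement."""
    (x0, y0), (x1, y1) = trajectory[0], trajectory[-1]
    angle = math.degrees(math.atan2(y1 - y0, x1 - x0)) % 360.0
    return DIRECTIONS[int((angle + 22.5) // 45) % 8]


def command_for(trajectory):
    """Return the preset command for a swipe, or None if none is bound."""
    return PRESET_COMMANDS.get(swipe_direction(trajectory))
```

For instance, command_for([(0, 0), (1, 1)]) resolves to the upper-right binding, while a swipe with no bound instruction returns None and can simply be ignored.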
After the positioning result is obtained, the application extracts the gesture trajectories corresponding to one or more users, looks up the instruction bound to each gesture trajectory, and responds to that instruction.
The embodiments of the present application are illustrated below by a specific example. Because the microprocessor development of commercial intelligent sound boxes is not open source, the invention first builds a hardware prototype with the same structure as a commercial intelligent sound box, then implements the positioning algorithm on the prototype, and tests the embodiment by identifying the gesture trajectories of one to four users.
The prototype is built from unmodified commercial components: a microprocessor, a loudspeaker array, and a microphone array. Its structure is very similar to that of most commercial intelligent sound boxes on the market, with 360-degree playback provided by the loudspeaker array and 360-degree sound pickup provided by a uniform annular microphone array.
Specifically, a Raspberry Pi 3 development board serves as the microprocessor of the intelligent sound box prototype, four Edifier M1250 loudspeakers emit the OFDM broadband ultrasonic signals, and the microphone array is a ReSpeaker 6-Mic uniform annular microphone array with a microphone spacing of 4.7 cm. The microprocessor, loudspeaker array, and microphone array of the prototype are connected by audio cables with AUX connectors.
The invention repeatedly transmits and receives ultrasonic broadband OFDM element signals on the prototype and performs positioning within each OFDM element signal. The OFDM element signal has a bandwidth of 6 kHz, from 17 kHz to 23 kHz. Each element signal has 960 time-domain samples at a sampling rate of 48 kHz, i.e. each element signal lasts 20 ms.
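The arithmetic behind these parameters can be checked with a short sketch that also synthesizes one band-limited frame. The sampling rate, frame length, and band edges come from the text. Placing one unit-amplitude subcarrier with random phase on every FFT bin inside the band is an assumption made only for illustration, since the number of subcarriers is not stated here.

```python
import numpy as np

FS = 48_000                      # sampling rate in Hz (from the text)
N_SAMPLES = 960                  # samples per OFDM element signal (from the text)
F_LOW, F_HIGH = 17_000, 23_000   # occupied band in Hz (from the text)

duration_s = N_SAMPLES / FS          # 960 / 48000 = 0.02 s
bin_spacing_hz = FS // N_SAMPLES     # 50 Hz between adjacent FFT bins

# Assumption: every FFT bin inside the band carries one subcarrier.
first_bin = F_LOW // bin_spacing_hz
last_bin = F_HIGH // bin_spacing_hz
bins = np.arange(first_bin, last_bin + 1)

# Assumption: unit-amplitude subcarriers with random phases.
rng = np.random.default_rng(0)
spectrum = np.zeros(N_SAMPLES // 2 + 1, dtype=complex)
spectrum[bins] = np.exp(2j * np.pi * rng.random(bins.size))

# Real-valued time-domain frame, normalized to a peak amplitude of 1.
frame = np.fft.irfft(spectrum, n=N_SAMPLES)
frame /= np.max(np.abs(frame))
```

With a 20 ms frame the FFT bins are 50 Hz apart, so the 17 kHz to 23 kHz band spans bins 340 through 460 under this assumed layout.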
In this embodiment, one to four users sit naturally facing the intelligent sound box prototype that implements the positioning algorithm of the present invention, at a distance of 1 meter, with an angular separation of 30 degrees between adjacent users. Each user then draws shapes with the palm by following predefined templates on the table, including simple figures such as triangles and circles as well as more complex shapes such as capital English letters and Arabic numerals. The user's gesture must follow the size and shape of the template, and the trajectory formed by these known positions is used as the positioning accuracy reference. The shortest Euclidean distance from a position output by the positioning algorithm to the predefined template trajectory is taken as the positioning error.
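The error metric described above can be sketched as a point-to-polyline distance. The triangle template and the sample positions below are hypothetical values chosen only for illustration, and a curved template such as a circle is assumed to be approximated by a dense polyline.

```python
import numpy as np


def point_to_segment(p, a, b):
    """Shortest Euclidean distance from point p to the segment a-b."""
    p, a, b = (np.asarray(x, dtype=float) for x in (p, a, b))
    ab = b - a
    denom = float(ab @ ab)
    if denom == 0.0:
        # Degenerate segment: both end points coincide.
        return float(np.linalg.norm(p - a))
    # Project p onto the segment and clamp the projection to its end points.
    t = np.clip((p - a) @ ab / denom, 0.0, 1.0)
    return float(np.linalg.norm(p - (a + t * ab)))


def positioning_errors(estimates, template):
    """Error of each estimated position against a polyline template."""
    segments = list(zip(template[:-1], template[1:]))
    return [min(point_to_segment(p, a, b) for a, b in segments)
            for p in estimates]


# Hypothetical triangle template in cm, closed by repeating the first vertex.
triangle = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0), (0.0, 0.0)]
errors = positioning_errors([(5.0, -1.0), (5.0, 0.0), (12.0, 0.0)], triangle)
mean_error = sum(errors) / len(errors)
```

Averaging the per-position errors over a whole recording gives a single figure per experiment, in the same spirit as the average positioning errors reported for one to four users.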
FIGS. 7A to 7D show the tracking results of the present invention with 1, 2, 3 and 4 users, respectively. In each figure, one line represents the reference template trajectory, which is known in advance, and the other represents the gesture trajectory output by the sparse recovery tracking of the present invention.
It should be noted that, because the figures cannot be shown in color, the two lines are hard to tell apart. The difference between them is still visible in FIGS. 7C and 7D, whereas in FIGS. 7A and 7B the two lines overlap and no difference can be seen. Even with four users, the deviation between the two remains within an acceptable range, and the computed gesture trajectory basically conforms to the template trajectory. The error between the gesture trajectory obtained by the method and the reference template trajectory increases with the number of users participating simultaneously: the average positioning errors for 1, 2, 3 and 4 users are 0.82 cm, 1.09 cm, 1.90 cm and 2.66 cm, respectively.
In summary, the multi-user motion gesture control method applied to the intelligent sound box provided by the application can realize multi-user motion gesture control without additional hardware requirements and without influencing the voice control function of the intelligent sound box, adds a new interaction mode to the intelligent sound box on the basis of the voice control interaction mode, and can be widely applied to scenes such as silent control, multi-user control and the like; the sparse recovery positioning algorithm based on the broadband dictionary and the velocity perception dictionary can position multiple users without being influenced by coherence of reflected signals, can solve the problem of space aliasing caused by insufficient space sampling rate of a microphone array, is applicable to microphone arrays with any shapes and intervals, solves the problem of positioning errors caused by Doppler effect, and is applicable to user motion gestures at any velocity.
FIG. 8 is a block diagram of a multi-user motion gesture control apparatus according to an embodiment of the present invention. The apparatus is applied to the intelligent sound box shown in FIG. 1, which comprises a loudspeaker array for emitting ultrasonic signals and a microphone array for collecting reflected signals. As shown in FIG. 8, the apparatus 800 includes:
an obtaining module 801, configured to obtain a reflected signal that results from an ultrasonic signal emitted by the loudspeaker array being reflected by one or more user motion gestures and collected by the microphone array;
a processing module 802, configured to preprocess the reflected signal to eliminate self-interference and static reflections, so as to obtain a reflected signal containing a limited number of motion gestures; to determine, according to the ultrasonic signal and the preprocessed reflected signal, the position and the reflection intensity of each reflection source by using a reflected signal positioning model and a sparse recovery algorithm, so as to obtain a positioning result corresponding to the one or more user motion gestures, wherein a sparse recovery algorithm based on a broadband dictionary is suitable for a microphone array of any shape and spacing, and/or a sparse recovery algorithm based on a speed perception dictionary is suitable for user motion gestures at any speed; and to extract the gesture trajectories corresponding to one or more users according to the positioning result, so as to respond to the preset instructions corresponding to different gesture trajectories.
It should be noted that, because the contents of information interaction, execution process, and the like between the modules/units of the apparatus are based on the same concept as the method embodiment described in the present application, the technical effect brought by the contents is the same as the method embodiment of the present application, and specific contents may refer to the description in the foregoing method embodiment of the present application, and are not described herein again.
It should be further noted that the above division of the modules of the apparatus 800 is only a logical division, and the actual implementation may be wholly or partially integrated into one physical entity or may be physically separated. And these units can be implemented entirely in software, invoked by a processing element; or may be implemented entirely in hardware; and part of the modules can be realized in the form of calling software by the processing element, and part of the modules can be realized in the form of hardware. For example, the processing module 802 may be a separate processing element, or may be integrated into a chip of the apparatus, or may be stored in a memory of the apparatus in the form of program code, and a processing element of the apparatus calls and executes the functions of the processing module 802. Other modules are implemented similarly. In addition, all or part of the modules can be integrated together or can be independently realized. The processing element described herein may be an integrated circuit having signal processing capabilities. In implementation, each step of the above method or each module above may be implemented by an integrated logic circuit of hardware in a processor element or an instruction in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above method, such as one or more Application Specific Integrated Circuits (ASICs), one or more digital signal processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). As another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling program code. As a further example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
In an embodiment of the present application, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the method described in fig. 2.
The present application may be embodied as systems, methods, and/or computer program products, in any combination of technical details. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present application.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable programs described herein may be downloaded from a computer-readable storage medium to a variety of computing/processing devices, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device. The computer program instructions for carrying out operations of the present application may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine related instructions, microcode, firmware instructions, state setting data, integrated circuit configuration data, or source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
In some embodiments, the electronic circuitry can execute computer-readable program instructions to implement aspects of the present application by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA).
In summary, the multi-user motion gesture control method, the multi-user motion gesture control device, the intelligent sound box and the medium provided by the application can realize multi-user motion gesture control under the conditions of no extra hardware requirement and no influence on the voice control function of the intelligent sound box, add a new interaction mode to the intelligent sound box on the basis of the voice control interaction mode, and can be widely applied to scenes such as silent control, multi-user control and the like; the sparse recovery positioning algorithm based on the broadband signals and the speed perception dictionary can position a plurality of users without being influenced by coherence of reflected signals, can solve the problem of space aliasing caused by insufficient space sampling rate of a microphone array, is applicable to microphone arrays with any shapes and intervals, solves the problem of positioning errors caused by Doppler effect, and is applicable to user motion gestures at any speed.
The application effectively overcomes various defects in the prior art and has high industrial utilization value.
The above embodiments are merely illustrative of the principles and utilities of the present application and are not intended to limit the invention. Any person skilled in the art can modify or change the above-described embodiments without departing from the spirit and scope of the present application. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present application.
References:
[1] Wei Wang, Alex X. Liu, and Sun Ke. Device-Free Gesture Tracking Using Acoustic Signals. MobiCom, 2016.
[2] Rajalakshmi Nandakumar, Vikram Iyer, Desney Tan, and Shyamnath Gollakota. FingerIO: Using Active Sonar for Fine-Grained Finger Tracking. CHI, 2016.
[3] Wenguang Mao, Mei Wang, Wei Sun, Lili Qiu, Swadhin Pradhan, and Yi-Chao Chen. RNN-Based Room Scale Hand Motion Tracking. MobiCom, 2019.
[4] Fadel Adib, Zachary Kabelac, Dina Katabi, and Robert C. Miller. 3D Tracking via Body Radio Reflections. NSDI, 2014.
[5] Manikanta Kotaru, Kiran Joshi, Dinesh Bharadia, and Sachin Katti. SpotFi: Decimeter Level Localization Using WiFi. SIGCOMM, 2015.
[6] Ralph O. Schmidt. Multiple Emitter Location and Signal Parameter Estimation. IEEE, 1986.
[7] F. Belfiori, W. van Rossum, and P. Hoogeboom. Application of 2D MUSIC Algorithm to Range-Azimuth FMCW Radar Data. IEEE European Radar Conference, 2012.
[8] S. Mallat and Z. Zhang. Matching Pursuit With Time-Frequency Dictionaries. IEEE Transactions on Signal Processing 41 (1993), 3397–3415.
[9] D. Malioutov, M. Çetin, and A. S. Willsky. A Sparse Signal Reconstruction Perspective for Source Localization With Sensor Arrays. IEEE Transactions on Signal Processing, vol. 53, no. 8, pp. 3010–3022, 2005.