Gesture segmentation and recognition algorithm based on millimeter wave radar


1. A gesture segmentation and recognition algorithm based on millimeter wave radar comprises a model building and training stage and a model application stage, and is characterized in that the model building and training stage comprises the following steps:

step 1: collecting radar echo data of gesture actions, wherein in each collection period of a single gesture action the action is repeated multiple times; meanwhile, continuous gesture actions containing at least two gestures are collected, and the moments at which the gesture changes are recorded;

step 2: inputting the radar transmitted signal S_T(t) and the received signal S_R(t) into a mixer to obtain the mixed signal S_M(t), then filtering out the high-frequency component with a low-pass filter to obtain the intermediate-frequency signal S_IF(t), wherein the transmitted signal S_T(t) of the 77 GHz millimeter wave radar is given by:

S_T(t) = A_T·cos(2π·f_c·t + 2π·∫₀ᵗ f_T(τ)dτ), 0 ≤ t ≤ T

wherein A_T represents the amplitude of the transmitted signal, f_c is the center frequency of the carrier, T is the sawtooth signal pulse width, and f_T(τ) represents the frequency of the transmitted signal within time T;

the radar received signal S_R(t) is given by:

S_R(t) = A_R·cos(2π·f_c·(t − Δt) + 2π·∫₀^(t−Δt) f_R(τ)dτ)

wherein A_R is the received signal amplitude, Δt is the time delay, and f_R(t) is the received signal frequency within time T;

the mixer yields the mixed signal S_M(t), given by: S_M(t) = S_T(t)·S_R(t)

passing the mixed signal S_M(t) through a low-pass filter gives the intermediate-frequency signal S_IF(t); for a linear sawtooth chirp with slope K = B/T (B being the sweep bandwidth):

S_IF(t) = (A_T·A_R/2)·cos(2π·f_c·Δt + 2π·K·Δt·t − π·K·Δt²)

step 3: performing a two-dimensional Fourier transform on the intermediate-frequency signal frame by frame, then high-pass filtering to obtain a time-range-velocity three-dimensional map;

step 4: segmenting the single-gesture action set into video data samples along the time axis with a sliding window, where adjacent segmented samples share partially overlapping sequence frames; the number of overlapping frames is determined by the window length and step length of the sliding-window algorithm, and different step lengths and window lengths directly affect the gesture segmentation and recognition performance; after a series of samples is obtained, the samples are randomly divided into a training set S_train and a validation set S_val;

step 5: establishing a three-dimensional convolutional neural network model, training it with the training set S_train as input data, and testing its performance with the validation set S_val, specifically comprising the following steps:

step 5-1: building a three-dimensional convolutional neural network model comprising 4 3D convolutional layers with 4, 8, 32 and 64 convolution kernels respectively, using the ReLU activation function, 4 BN layers and 3 3D max-pooling layers; the feature maps are flattened by a flatten layer and then passed through 3 fully connected layers with 256, 32 and 3 neurons respectively, the first two using tanh activation and the output layer using softmax to produce the output;

step 5-2: selecting the total number of training rounds, and randomly shuffling the training set S_train before each round of training.

Step 5-3: input the training set S_train and train the model for 30 epochs, with 10 samples per batch; the loss function is the cross entropy; the Adam algorithm is adopted as the optimization algorithm for the model gradient during training, since it adaptively and dynamically adjusts the learning rate, selects different learning rates for different parameters, and imposes a dynamic constraint on the learning rate that avoids large gradient fluctuations; the loss function value and accuracy on the training set are recorded, and after each epoch the validation set S_val is used for validation, recording the validation loss function value and accuracy.

The training set S_train is used as input data to train the three-dimensional convolutional neural network, and the validation set S_val is used to test its performance.

The model application phase comprises:

step 6: for the continuous gesture set, slide the window frame by frame to extract samples, feed them into the three-dimensional convolutional neural network model trained in step 5 for recognition, and refine the preliminary recognition result with a segmentation algorithm for accurate localization and gesture segmentation, finally obtaining the complete gesture information.

2. The millimeter wave radar-based gesture segmentation and recognition algorithm according to claim 1, wherein the step 4 specifically comprises the following steps:

step 4-1: estimating the action period, and determining the optimal window length L and step length l_sp through repeated experiments; the window length should be less than the number of frames in one period of the fastest action;

step 4-2: sliding the window in steps of l_sp, intercepting samples, and adding labels;

step 4-3: dividing the samples into a training set S_train and a validation set S_val in the proportion of 80% to 20%.

3. The millimeter wave radar-based gesture segmentation and recognition algorithm according to claim 1, wherein step 6 specifically includes the following steps:

step 6-1: dividing the continuous-gesture three-dimensional atlas along the time axis according to the step length l_sp and window length L, where l_sp < L, feeding the segments into the model obtained in step 5 for recognition, and recording the recognition results as an array for visualization;

step 6-2: marking the window with the maximum recognition probability lower than 0.8 as a transition window;

step 6-3: recording a run of segments continuously recognized as the same action as one action;

step 6-4: taking the pairwise intersections of the output-label probability curves: as time increases, the output probability of one gesture label falls while that of another rises; the intersection point of the two labels' output probability curves is taken as the segmentation boundary and determined as an action start point or division point, completing the gesture segmentation;

step 6-5: comparing the segmentation boundaries found in this way with the manually recorded segmentation boundaries and analyzing the performance.

4. The millimeter wave radar-based gesture segmentation and recognition algorithm according to claim 1, specifically comprising the steps of:

step 1: designing three gesture actions, namely waving up and down, waving left and right, and pushing forward and pulling back, and recording each kind of action as a different category; in every collection cycle of a single gesture action, the action is repeated multiple times; meanwhile, continuous gesture actions containing at least two kinds of gestures are collected, and the moments at which the gesture changes are recorded, comprising the following steps:

step 1-1: three gesture actions of waving hands up and down, waving hands left and right and pushing and pulling forwards and backwards are designed as actions to be collected, and a label is added to each action;

step 1-2: carrying out parameter configuration on a 77GHz millimeter wave radar used for data acquisition, and setting appropriate radar waveform parameters according to the actual application scene of gesture recognition; an IWR1642 radar of TI company can be adopted, the waveform is a linear frequency modulation continuous wave, the sampling frequency is 2000kHz, the frame period is 45ms, 150 frames of data are collected each time, 128 chirp signals are arranged in each frame, each chirp signal is provided with 64 sampling points, an antenna adopts a single-transmitting single-receiving mode, and the collection environment is a relatively open corridor;

step 1-3: collecting single gesture actions, wherein the actions are continuously and repeatedly carried out in each collection period;

step 1-4: collecting mixed gesture actions, wherein each collection period at least comprises two gestures, and recording the gesture change time;

step 2: inputting the radar transmitted signal S_T(t) and the received signal S_R(t) into a mixer to obtain the mixed signal S_M(t), and filtering out the high-frequency component with a low-pass filter to obtain the intermediate-frequency signal S_IF(t);

step 3: performing a two-dimensional Fourier transform on the intermediate-frequency signal frame by frame, then high-pass filtering to obtain a time-range-Doppler three-dimensional map, comprising the following steps:

step 3-1: according to the radar parameters, forming one frame from 128 frequency-sweep signals;

step 3-2: performing two-dimensional Fourier transform on each frame of signal to obtain a range-Doppler image;

step 3-3: selecting a high-pass filter, and carrying out high-pass filtering on each frame of signal to remove clutter interference of the static target;

step 3-4: arranging the distance-Doppler images according to the frame sequence to obtain a time-distance-Doppler three-dimensional map;

step 4: segmenting the radar videos of the single-gesture action set with a sliding window along the time axis to obtain a series of samples, and dividing the samples into a training set S_train and a validation set S_val, comprising the following steps:

step 4-1: estimating the action period and determining the window length L; the window length should be less than the number of frames in one period of the fastest action while ensuring that the video-segment information of an action is captured in a sample; the window length must not be too large, which would blur the segmentation boundaries of the continuous gestures in subsequent testing, nor too small, in which case the video segment carries too little information to represent the characteristics of the gesture motion state;

step 4-2: assuming that the total number of frames of the collected continuous gestures is N, sliding the window along the time dimension with a certain step length l_sp, intercepting samples from the single-gesture signal set and adding labels, one sample being obtained for each intercepted three-dimensional map;

step 4-3: dividing all samples into a training set S_train and a validation set S_val in the proportion of 80% to 20%;

step 5: using the training set S_train as input data to train the three-dimensional convolutional neural network, and testing its performance with the validation set S_val, comprising the following steps:

step 5-1: establishing a three-dimensional convolutional neural network model comprising 4 3D convolutional layers with 4, 8, 32 and 64 convolution kernels respectively, using the ReLU activation function, 4 BN layers and 3 3D max-pooling layers; the feature maps are flattened by a flatten layer and then passed through 3 fully connected layers with 256, 32 and 3 neurons respectively, the first two using tanh activation and the output layer using softmax to produce the output;

step 5-2: selecting the total number of training rounds, and randomly shuffling the training set S_train before each round of training;

Step 5-3: input the training set S_train and train the model for 30 epochs, with 10 samples per batch; the loss function is the cross entropy; during training the Adam algorithm is used to optimize the model gradient, since it adaptively and dynamically adjusts the learning rate, selects different learning rates for different parameters, and imposes a dynamic constraint on the learning rate that avoids large gradient fluctuations; the loss function value and accuracy on the training set are recorded, and after each epoch the validation set S_val is used for validation, recording the validation loss function value and accuracy;

step 6: for a continuous gesture set, sliding the window frame by frame to extract samples and feeding them into the trained network for recognition, then refining the preliminary recognition result through a segmentation algorithm for accurate localization and gesture segmentation, finally obtaining the complete gesture information, comprising the following steps:

step 6-1: dividing the continuous-gesture three-dimensional map along the time axis with window length L and step length l_sp, feeding it into the network for recognition testing, and recording and visualizing the recognition probabilities of the three labels;

step 6-2: marking the window with the maximum recognition probability lower than a set threshold value as a transition window, and marking the other windows with corresponding labels according to the maximum recognition probability;

step 6-3: recording a run of windows whose recognition results are continuously the same action as one action;

step 6-4: and determining the starting point and the ending point of the action to finish gesture segmentation.

Background art:

Human-computer interaction based on gesture recognition has great advantages in naturalness and convenience, so gesture recognition has more and more application scenarios, such as smart home systems, real-time sign language teaching systems and gesture-controlled game systems. With the rapid development of human-computer interaction technology, gesture recognition has become a research hotspot for scholars at home and abroad. In terms of signal type, existing gesture recognition methods fall mainly into four categories: visual images based on visible light, depth and the like; mechanical sensor signals from wearable devices based on motion sensors, pressure sensors and the like; non-broadband wireless communication signals such as Wi-Fi; and radar signals. The visual image method is strongly affected by illumination conditions, background environment and partial occlusion, and risks revealing the user's privacy; wearable devices are inconvenient to use and limited in applicable scenarios; the non-broadband wireless communication signal method has low resolution and large background interference. The FMCW millimeter wave radar sensor has the advantages of small size, low cost, high range and velocity resolution, immunity to factors such as illumination change, and strong anti-interference performance, so gesture recognition based on millimeter wave radar has become a research hotspot in recent years.

From the perspective of the recognition algorithm, features are either extracted manually and classified with conventional machine learning methods, or learned with deep learning methods. Manually selected features are strongly subjective and cannot fully meet the requirements of a practical gesture recognition system. In recent years the deep learning approach has been applied to gesture recognition more and more widely; compared with the traditional mode of manual feature extraction plus a classifier, deep learning combines automatic feature extraction and classification into an integrated end-to-end learning framework, avoiding the subjectivity of features hand-crafted from experience and thereby achieving a qualitative improvement in recognition rate. Jun Seuk Suh used a long short-term memory recurrent neural network as a supervised machine learning technique to recognize seven gestures within 0.4 m and ±30° of the center of a 24 GHz millimeter wave radar transmitting antenna, with accuracy above 91%. Dekker et al. used the micro-Doppler spectrogram of a 24 GHz FMCW radar with a deep convolutional neural network (CNN) for gesture recognition and achieved a good classification effect. Wangyong et al. performed gesture recognition based on an FMCW radar system and a convolutional neural network (CNN), splicing images from range, Doppler and angle parameters to construct a multi-dimensional parameter data set of gesture actions, which solves the problem of the low gesture information content described by single-dimensional parameters and improves gesture recognition accuracy compared with single-dimensional parameter data sets.

However, research on gesture recognition algorithms has focused mainly on offline classification of isolated motion data, a problem studied in computer vision for many years: a segment of gesture data containing exactly one motion is given and must be classified. By avoiding the temporal localization of gesture actions, this setting reduces the difficulty of recognition research and lets subsequent algorithm work advance. In real applications, one more often encounters motion stream data that is not divided in the time domain; there may be periods of continuously performed gestures as well as idle periods with no gesture at all, and it is impossible to predict when a user starts a gesture and when it ends. Real-time gesture recognition must therefore solve gesture detection in gesture motion stream data, that is, perform temporal localization and type determination of the action at the same time. This is more complex: continuous motion stream data usually contains several action categories, and the system must automatically recognize the start, the end and the category of each action. In addition, traditional radar-based gesture recognition mainly constructs gesture data in a two-dimensional image format for classification, and data in this format contains less of the key information of gestures.

The invention content is as follows:

To address these problems, the invention provides a real-time gesture segmentation method: real-time action stream data is first batch-processed with a sliding window, and the time-state intervals of the real-time data are then demarcated according to the gesture-label output probability distribution of the batched data. In addition, gesture data in a three-dimensional video format, richer than gesture data in two-dimensional image format, is constructed, and a neural network model based on three-dimensional convolution is built to classify and recognize it.

The invention is achieved by the following measures:

a gesture segmentation and recognition algorithm based on millimeter wave radar comprises a model building and training stage and a model application stage, and is characterized in that the model building and training stage comprises the following steps:

step 1: collecting radar echo data of gesture actions, wherein in each collection period of a single gesture action the action is repeated multiple times; meanwhile, continuous gesture actions containing at least two gestures are collected, and the moments at which the gesture changes are recorded;

step 2: inputting the radar transmitted signal S_T(t) and the received signal S_R(t) into a mixer to obtain the mixed signal S_M(t), then filtering out the high-frequency component with a low-pass filter to obtain the intermediate-frequency signal S_IF(t), wherein the transmitted signal S_T(t) of the 77 GHz millimeter wave radar is given by:

S_T(t) = A_T·cos(2π·f_c·t + 2π·∫₀ᵗ f_T(τ)dτ), 0 ≤ t ≤ T

wherein A_T represents the amplitude of the transmitted signal, f_c is the center frequency of the carrier, T is the sawtooth signal pulse width, and f_T(τ) represents the frequency of the transmitted signal within time T;

the radar received signal S_R(t) is given by:

S_R(t) = A_R·cos(2π·f_c·(t − Δt) + 2π·∫₀^(t−Δt) f_R(τ)dτ)

wherein A_R is the received signal amplitude, Δt is the time delay, and f_R(t) is the received signal frequency within time T;

the mixer yields the mixed signal S_M(t) = S_T(t)·S_R(t), and passing S_M(t) through a low-pass filter gives the intermediate-frequency signal S_IF(t); for a linear sawtooth chirp with slope K = B/T (B being the sweep bandwidth):

S_IF(t) = (A_T·A_R/2)·cos(2π·f_c·Δt + 2π·K·Δt·t − π·K·Δt²)
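As an illustration of the dechirp model in step 2, the following minimal Python sketch simulates S_IF(t) for a single point target. All numeric parameters (sweep bandwidth B, sampling rate, target range) are illustrative assumptions rather than the patent's configuration, and the mixer-plus-low-pass operation is realized directly as the phase difference of S_T(t) and S_R(t), which is exactly the term the low-pass filter keeps.

```python
import numpy as np

# Illustrative chirp parameters (assumed, not the patent's configuration).
fc = 77e9        # carrier center frequency f_c [Hz]
B = 4e9          # sweep bandwidth [Hz]
T = 60e-6        # sawtooth pulse width T [s]
fs = 2e6         # IF sampling rate [Hz]
c = 3e8          # speed of light [m/s]
R = 0.4          # single point target at 0.4 m

K = B / T                      # chirp slope K = B/T
dt = 2 * R / c                 # round-trip delay Δt
t = np.arange(0, T, 1 / fs)

# Instantaneous phases of S_T(t) and the delayed echo S_R(t); their
# difference is the beat term that survives the low-pass filter.
phi_t = 2 * np.pi * (fc * t + 0.5 * K * t**2)
phi_r = 2 * np.pi * (fc * (t - dt) + 0.5 * K * (t - dt)**2)
s_if = 0.5 * np.cos(phi_t - phi_r)     # S_IF(t) up to the amplitude factor

# The beat frequency K*Δt encodes the target range: R = c*f_b / (2K).
spec = np.abs(np.fft.rfft(s_if))
f_b = np.fft.rfftfreq(t.size, 1 / fs)[spec.argmax()]
print(f"beat {f_b/1e3:.1f} kHz -> range {c * f_b / (2 * K):.3f} m")
```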

step 3: performing a two-dimensional Fourier transform on the intermediate-frequency signal frame by frame, then high-pass filtering to obtain a time-range-velocity three-dimensional map;

step 4: segmenting the single-gesture action set into video data samples along the time axis with a sliding window, where adjacent segmented samples share partially overlapping sequence frames; the number of overlapping frames is determined by the window length and step length of the sliding-window algorithm, and different step lengths and window lengths directly affect the gesture segmentation and recognition performance; after a series of samples is obtained, the samples are randomly divided into a training set S_train and a validation set S_val;

step 5: establishing a three-dimensional convolutional neural network model, training it with the training set S_train as input data, and testing its performance with the validation set S_val, specifically comprising the following steps:

step 5-1: building a three-dimensional convolutional neural network model comprising 4 3D convolutional layers with 4, 8, 32 and 64 convolution kernels respectively, using the ReLU activation function, 4 BN layers and 3 3D max-pooling layers; the feature maps are flattened by a flatten layer and then passed through 3 fully connected layers with 256, 32 and 3 neurons respectively, the first two using tanh activation and the output layer using softmax to produce the output;

step 5-2: selecting the total number of training rounds, and randomly shuffling the training set S_train before each round of training.

Step 5-3: input the training set S_train and train the model for 30 epochs, with 10 samples per batch; the loss function is the cross entropy; the Adam algorithm is adopted as the optimization algorithm for the model gradient during training, since it adaptively and dynamically adjusts the learning rate, selects different learning rates for different parameters, and imposes a dynamic constraint on the learning rate that avoids large gradient fluctuations; the loss function value and accuracy on the training set are recorded, and after each epoch the validation set S_val is used for validation, recording the validation loss function value and accuracy.

The training set S_train is used as input data to train the three-dimensional convolutional neural network, and the validation set S_val is used to test its performance.

The model application phase comprises:

step 6: for the continuous gesture set, slide the window frame by frame to extract samples, feed them into the three-dimensional convolutional neural network model trained in step 5 for recognition, and refine the preliminary recognition result with a segmentation algorithm for accurate localization and gesture segmentation, finally obtaining the complete gesture information.

Step 4 of the invention specifically comprises the following steps:

step 4-1: estimating the action period, and determining the optimal window length L and step length l_sp through repeated experiments; the window length should be less than the number of frames in one period of the fastest action;

step 4-2: sliding the window in steps of l_sp, intercepting samples, and adding labels;

step 4-3: dividing the samples into a training set S_train and a validation set S_val in the proportion of 80% to 20%.

The step 6 of the invention specifically comprises the following steps:

step 6-1: dividing the continuous-gesture three-dimensional map along the time axis according to the step length l_sp and window length L, where l_sp < L, feeding it into the model obtained in step 5 for recognition, and recording the recognition results as an array for visualization;

step 6-2: marking the window with the maximum recognition probability lower than 0.8 as a transition window;

step 6-3: recording a run of segments continuously recognized as the same action as one action;

step 6-4: taking the pairwise intersections of the output-label probability curves: as time increases, the output probability of one gesture label falls while that of another rises; the intersection point of the two labels' output probability curves is taken as the segmentation boundary and determined as an action start point or division point, completing the gesture segmentation; step 6-5: comparing the segmentation boundaries found in this way with the manually recorded segmentation boundaries and analyzing the performance.

The invention specifically adopts the following steps:

step 1: designing three gesture actions, namely waving up and down, waving left and right, and pushing forward and pulling back, and recording each kind of action as a different category; in every collection cycle of a single gesture action, the action is repeated multiple times; meanwhile, continuous gesture actions containing at least two kinds of gestures are collected, and the moments at which the gesture changes are recorded, comprising the following steps:

step 1-1: three gesture actions of waving hands up and down, waving hands left and right and pushing and pulling forwards and backwards are designed as actions to be collected, and a label is added to each action;

step 1-2: carrying out parameter configuration on a 77GHz millimeter wave radar used for data acquisition, and setting appropriate radar waveform parameters according to the actual application scene of gesture recognition; an IWR1642 radar of TI company can be adopted, the waveform is a linear frequency modulation continuous wave, the sampling frequency is 2000kHz, the frame period is 45ms, 150 frames of data are collected each time, 128 chirp signals are arranged in each frame, each chirp signal is provided with 64 sampling points, an antenna adopts a single-transmitting single-receiving mode, and the collection environment is a relatively open corridor;

step 1-3: collecting single gesture actions, wherein the actions are continuously and repeatedly carried out in each collection period;

step 1-4: collecting mixed gesture actions, wherein each collection period at least comprises two gestures, and recording the gesture change time;

step 2: inputting the radar transmitted signal S_T(t) and the received signal S_R(t) into a mixer to obtain the mixed signal S_M(t), and filtering out the high-frequency component with a low-pass filter to obtain the intermediate-frequency signal S_IF(t);

step 3: performing a two-dimensional Fourier transform on the intermediate-frequency signal frame by frame, then high-pass filtering to obtain a time-range-Doppler three-dimensional map, comprising the following steps:

step 3-1: according to the radar parameters, forming one frame from 128 frequency-sweep signals;

step 3-2: performing two-dimensional Fourier transform on each frame of signal to obtain a range-Doppler image;

step 3-3: selecting a high-pass filter, and carrying out high-pass filtering on each frame of signal to remove clutter interference of the static target;

step 3-4: arranging the distance-Doppler images according to the frame sequence to obtain a time-distance-Doppler three-dimensional map;

step 4: segmenting the radar videos of the single-gesture action set with a sliding window along the time axis to obtain a series of samples, and dividing the samples into a training set S_train and a validation set S_val, comprising the following steps:

step 4-1: estimating the action period and determining the window length L; the window length should be less than the number of frames in one period of the fastest action while ensuring that the video-segment information of an action is captured in a sample; the window length must not be too large, which would blur the segmentation boundaries of the continuous gestures in subsequent testing, nor too small, in which case the video segment carries too little information to represent the characteristics of the gesture motion state;

step 4-2: assuming that the total number of frames of the collected continuous gestures is N, sliding the window along the time dimension with a certain step length l_sp, intercepting samples from the single-gesture signal set and adding labels, one sample being obtained for each intercepted three-dimensional map;

step 4-3: dividing all samples into a training set S_train and a validation set S_val in the proportion of 80% to 20%;

step 5: using the training set S_train as input data to train the three-dimensional convolutional neural network, and testing its performance with the validation set S_val, comprising the following steps:

step 5-1: establishing a three-dimensional convolutional neural network model comprising 4 3D convolutional layers with 4, 8, 32 and 64 convolution kernels respectively, using the ReLU activation function, 4 BN layers and 3 3D max-pooling layers; the feature maps are flattened by a flatten layer and then passed through 3 fully connected layers with 256, 32 and 3 neurons respectively, the first two using tanh activation and the output layer using softmax to produce the output;

step 5-2: selecting the total number of training rounds, and randomly shuffling the training set S_train before each round of training;

Step 5-3: input the training set S_train and train the model for 30 epochs, with 10 samples per batch; the loss function is the cross entropy; during training the Adam algorithm is used to optimize the model gradient, since it adaptively and dynamically adjusts the learning rate, selects different learning rates for different parameters, and imposes a dynamic constraint on the learning rate that avoids large gradient fluctuations; the loss function value and accuracy on the training set are recorded, and after each epoch the validation set S_val is used for validation, recording the validation loss function value and accuracy;

step 6: for a continuous gesture set, sliding the window frame by frame to extract samples and feeding them into the trained network for recognition, then refining the preliminary recognition result through a segmentation algorithm for accurate localization and gesture segmentation, finally obtaining the complete gesture information, comprising the following steps:

step 6-1: dividing the continuous-gesture three-dimensional map along the time axis with window length L and step length l_sp, feeding it into the network for recognition testing, and recording and visualizing the recognition probabilities of the three labels;

step 6-2: marking the window with the maximum recognition probability lower than a set threshold value as a transition window, and marking the other windows with corresponding labels according to the maximum recognition probability;

step 6-3: recording a run of windows whose recognition results are continuously the same action as one action;

step 6-4: and determining the starting point and the ending point of the action to finish gesture segmentation.

Compared with traditional gesture recognition technology, the invention adopts a sliding-window algorithm and a three-dimensional convolutional neural network, can process continuous gesture signals, recognizes and segments different gestures, makes full use of temporal information, and overcomes the limitation that traditional gesture recognition can only recognize a single gesture.

Description of the drawings:

FIG. 1 is a schematic flow diagram of the present invention.

FIG. 2 is a diagram of five gestures corresponding to a gesture data set in an embodiment of the present invention.

Fig. 3 is a diagram of RTM and DTM in an embodiment of the present invention.

FIG. 4 is an RDTM map in an embodiment of the invention.

FIG. 5 is a schematic diagram of a sliding window batch of actual motion stream data in an embodiment of the present invention.

FIG. 6 is a schematic diagram of a three-dimensional convolutional neural network model in an embodiment of the present invention.

FIG. 7 is a graph illustrating 5 gesture tag output probability distributions of real-time gesture data according to an embodiment of the present invention.

FIG. 8 is a system for real-time gesture recognition according to an embodiment of the present invention.

The specific implementation mode is as follows:

the invention is further described below with reference to the figures and examples.

The invention takes as its requirement the actual control of a Russian dice game in a non-contact gaming application, and classifies and recognizes 5 gesture actions: waving up and down, pushing forward and pulling back, rotating the palm, drawing a circle on a horizontal plane, and waving left and right. Fig. 2 shows the 5 gestures corresponding to the gesture data set.

An AWR1642 development board from TI is used for data acquisition in the experiment, with the radar transmitting linear frequency-modulated continuous waves, also called chirp signals. The intermediate-frequency sample data output by the millimeter wave radar chip is captured by a DCA1000 high-speed data acquisition card and transferred to a desktop computer through an Ethernet port. The radar performance index requirements determined by the game application are shown in Table 2-1.

TABLE 2-1 Radar Performance index

Assume the radar continuously transmits M chirp signals and the intermediate-frequency signal of each chirp has N sampling points; storing the samples row by row gives an M × N matrix. A Fourier transform along the direction of the chirp sampling points yields the range spectrum of the target, the range FFT; a second Fourier transform along the slow-time domain, i.e. the direction of the chirp index, yields the velocity information, the velocity FFT. After the two FFTs a Range-Doppler-Map (RDM) is obtained; the whole operation is called the 2D-FFT. Because the person is close to the radar, the palm, arms, head and other body parts all produce strong radar echoes, and to the radar the human body is a distributed target. Echoes from the head, abdomen and so on are of no interest and constitute clutter interference; the signals of real significance for classification are the echoes of the palm, elbow and arm making the gesture. The two kinds of signal differ most in their spectra because they move at different speeds, so before the velocity FFT a high-pass filter is applied in the slow-time domain to suppress the echo interference of objects that are static or whose velocity is close to zero.

Dynamic gesture information is concentrated in the motion of the hands and arms, where time information plays a key role. The RDM obtained after preprocessing the radar data only reflects what the radar observes within a short interval; it lacks time information and cannot directly serve as data representing a gesture.

Based on the RDM sequence of the hand movement process, the invention constructs gesture data in two-dimensional image formats that carry time information: the amplitudes of all Doppler units of each frame's RDM are accumulated into the range gate units where they are located, and the frames are arranged in time order to obtain a Range-Time-Map (RTM); similarly, the amplitudes of all range gate units of each frame's RDM are accumulated into the Doppler units where they are located, and the frames are arranged in time order to obtain a Doppler-Time-Map (DTM). Fig. 3 shows the RTM and DTM for a duration of 2 seconds.

RTM and DTM are two-dimensional radar images that record how the range and the velocity of the hand change over time, respectively. They have a certain power to characterize gestures, but they break the inherent feature relationship between the range and the velocity of the gesture target. From the viewpoint of feature combination, the invention constructs three-dimensional video image data that couples the range and velocity features of the radar echo closely: the RDM of each frame is arranged in time order to form a Range-Doppler-Time-Map (RDTM), as shown in Fig. 4. The RDTM is gesture data in a radar video format; it combines the range and velocity features of the target at every moment and fully retains the feature information of the time dimension.
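Under the constructions just described, RTM, DTM and RDTM can all be derived from the frame-ordered stack of magnitude RDMs; a short sketch, assuming the stack is shaped (frames, Doppler bins, range bins):

```python
import numpy as np

# Stand-in stack of magnitude RDMs in frame order: 150 frames of 127 x 64.
rdm_stack = np.random.rand(150, 127, 64)    # (frames, doppler, range)

# RTM: per frame, accumulate amplitudes over the Doppler axis into their
# range bins, then arrange the frames along the time axis.
rtm = rdm_stack.sum(axis=1).T               # (range bins, frames)

# DTM: per frame, accumulate amplitudes over the range axis into Doppler bins.
dtm = rdm_stack.sum(axis=2).T               # (doppler bins, frames)

# RDTM: the frame-ordered RDM stack itself -- the radar-video format that
# keeps range and velocity features jointly at every instant.
rdtm = rdm_stack
print(rtm.shape, dtm.shape, rdtm.shape)
```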

The input for training a convolutional network model generally must be grid data of fixed size; the traditional approach is to cut and pad it manually before input, but for real-time gesture recognition the continuously generated gesture action stream data cannot be fed directly into the constructed network for training. The invention therefore uses a sliding-window algorithm for batch processing: first the gesture action period is estimated, and the appropriate window length L and sliding step length l_sp are determined through repeated experiments; real-time performance is best when the differences between gestures are largest, and the optimal window length L differs between feature spectra. Fig. 5 is a schematic diagram of sliding-window batch processing of actual motion stream data. After batch processing, the action stream yields time data segments that fully contain an action, partially contain an action, partially contain the idle state or fully contain the idle state; the segments are numbered from small to large in time order, and the time data segment of each number must be analyzed, judged and labeled manually.

For a gesture recognition method based on deep learning, obtaining a large number of data samples with specific labels is essential. The invention's way of acquiring the training sample set differs from the traditional one: motion stream data of a single gesture, repeated cyclically and containing no idle state, is collected; after sliding-window batch processing such data can be stamped with a specific gesture label without involving any segmentation of the gesture motion, which reduces the workload of data collection as well as the workload of labeling masses of data. Assuming the total number of frames of a single acquisition of idle-free, cyclically repeated gesture motion stream data is N, and the time dimension is batch-processed with window length L and step length l_sp, the number n of available samples is:

n = ⌊(N − L) / l_sp⌋ + 1 (2-1)

and the number m of overlapping time data frames between adjacent samples intercepted by the sliding window is:

m = L − l_sp (2-2)
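A sketch of the sliding-window batching consistent with equations (2-1) and (2-2); L = 12 follows the window-length study reported later, while l_sp = 3 is an assumed step length used only for illustration.

```python
import numpy as np

def window_batch(stream, L, l_sp):
    """Cut an N-frame stream into sliding-window samples, implementing
    n = floor((N - L) / l_sp) + 1  (2-1), with neighbouring samples
    overlapping by m = L - l_sp    (2-2) frames."""
    N = stream.shape[0]
    n = (N - L) // l_sp + 1
    return np.stack([stream[k * l_sp : k * l_sp + L] for k in range(n)])

stream = np.random.rand(150, 127, 64)        # stand-in 150-frame RDTM stream
samples = window_batch(stream, L=12, l_sp=3)
print(samples.shape, "overlap m =", 12 - 3)  # (47, 12, 127, 64) overlap m = 9
```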

the invention collects gesture data of 15 volunteers, wherein 10 experimental objects are used as training objects, and the other 5 experimental objects are used as testing objects. The invention classifies and identifies 5 gesture actions, and each type of data set has 4000 training samples, and the total number of the training samples is 20000. And recording actual motion flow data of 5 test objects, and synchronously recording the video of the experiment camera, so that manual label labeling can be conveniently carried out subsequently.

For gesture segmentation, most existing methods are manual: only the data of one gesture action is collected at a time, and the key time region of the action is then judged manually, which suits only offline processing. A small number of researchers pre-segment according to a velocity threshold, but this segmentation is relatively rough and only moderately effective; it suits only gesture data with obvious idle states and segments poorly on data without idle states in which different gestures are performed continuously.

The invention adopts a recognize-first approach to determining the start and end data frames of gestures. In practical application, the real-time motion stream data of the sensor is batched in real time by the sliding-window algorithm, and the batched time data segments are numbered from small to large in time order; they are then fed into the pre-trained model for classification, giving the probability value of each gesture label for every time data segment, and the output probability values of the gesture labels are plotted in numbering order to obtain the gesture-label probability distribution. Fig. 7 shows the probability distribution obtained by batching and recognizing the real-time data of one test subject, where a is a time period in the action-1 state, b is an idle-state time period, and c is a time period in which different actions are made continuously. Time data segments whose recognition probability exceeds a specific threshold are marked as action-state segments and those below it as idle-state segments; the intersection of the threshold line with a curve is regarded as a demarcation point of gesture action transition. From the sample number of the demarcation point and the two parameters of the sliding-window framing algorithm, window length and step length, the start and end data frames of the gesture action can be deduced in reverse, giving the different gesture action state intervals and idle state intervals; this completes the algorithmic gesture segmentation, and the predicted label of the gesture data is output according to the segmentation result.
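A simplified sketch of this recognize-then-segment logic. Windows whose peak label probability falls below the threshold are marked as idle or transition; a change of the dominant label between neighbouring windows stands in for the exact curve-intersection point, and window number k is mapped back to its starting data frame k·l_sp. The threshold 0.8 and step length are assumptions for illustration.

```python
import numpy as np

def segment_stream(probs, threshold=0.8, l_sp=3):
    """probs: (n_windows, n_labels) softmax outputs in window-number order.
    Returns per-window labels (-1 = idle/transition) and the data-frame
    indices where the dominant label changes, used as action boundaries."""
    labels = np.where(probs.max(axis=1) >= threshold,
                      probs.argmax(axis=1), -1)
    boundaries = [k * l_sp                        # start frame of window k
                  for k in range(1, len(labels))
                  if labels[k] != labels[k - 1]]
    return labels, boundaries

probs = np.random.dirichlet(0.3 * np.ones(3), size=40)  # stand-in curves
labels, bounds = segment_stream(probs)
print(labels[:10], bounds[:5])
```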

At present there is no unified standard for evaluating gesture segmentation. The invention uses the degree of agreement between the labels of the manual segmentation markers and the output labels of the model on the algorithm-segmented data as a relative segmentation criterion. Assuming the number of manually marked labels is S and the number of algorithm-segmented labels for which the model output is consistent with the manual label is N_b, the segmentation accuracy ξ is:

ξ = N_b / S × 100% (2-3)

Equation (2-3) shows that the relative segmentation accuracy of the invention is in fact the classification accuracy on the test data set.

Feature detection and recognition with convolutional neural networks (CNN) has been successful in face recognition, picture classification and the like, because a face picture or any other picture already contains all the important information. In the classification task of dynamic gesture recognition, however, timing information matters, and the CNN is not a model of time information; RTM and DTM represent time information in imaged form, so the invention applies transfer learning with the VGG-16 CNN model to RTM and DTM. A traditional CNN cannot process the invention's RDTM data set, so the invention designs a model based on three-dimensional convolution (Convolutional 3-Dimensional Neural Networks, C3DN) dedicated to the RDTM data set. The structure of the C3DN model is shown in Fig. 6; it comprises 5 3D convolutional layers, 4 BN layers and 4 3D max-pooling layers, uses the ReLU activation function, and is finally connected to a softmax layer that outputs the classification result and the similarity distribution array.
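For concreteness, the following Keras sketch builds the fully specified network of step 5-1 (four Conv3D layers with 4/8/32/64 kernels and ReLU, BN layers, three MaxPool3D layers, then dense layers of 256/32/3 with tanh/tanh/softmax); the exact interleaving of BN and pooling is not stated in the text, so the ordering here is an assumption. The C3DN variant described in this paragraph differs in having 5 convolutional and 4 pooling layers.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_3dcnn(input_shape=(12, 127, 64, 1), n_classes=3):
    """3D-CNN per step 5-1; input shaped (frames, doppler, range, channel).
    BN after each conv and pooling after the first three blocks is an
    assumed ordering -- the text only gives the layer counts."""
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for i, kernels in enumerate([4, 8, 32, 64]):
        x = layers.Conv3D(kernels, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        if i < 3:                              # 3 max-pooling layers in total
            x = layers.MaxPooling3D(pool_size=2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="tanh")(x)
    x = layers.Dense(32, activation="tanh")(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_3dcnn()
model.summary()
```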

The experiments and results were analyzed as follows:

After batch processing, the RDTM is input into the constructed network for feature learning and model training, using the Adam optimizer with its adaptive learning rate, 30 epochs, a batch size of 10 per iteration, GPU-accelerated training, and validation on the test set during training. To analyze the influence of the sliding-window length on gesture recognition performance in motion stream batching, taking gesture data in RDTM format as an example, data samples with window lengths of 6, 9, 12, 15, 18, 21, 24 and 30 frames were input for model training and testing, and the overall average classification accuracy was computed, giving Table 3-1. The table shows that data sets of fewer than 12 frames classify poorly, because the sample data cannot fully contain the main key information for gesture classification. Above 18 frames the classification accuracy improves no further, while bringing larger memory overhead and hurting real-time performance. When the framing window length exceeds 21 frames, the accuracy on the test subject data sets drops rapidly: the gesture duration is about 1 second, some 20 frames, and too large a window makes the gesture segmentation of the continuous gesture data stream inaccurate and disturbs the accurate output of the gesture label. The invention therefore selects an RDTM time length of 12 frames to characterize its gestures. The optimal window length and step length for the RTM and DTM spectra are obtained in the same way, as shown in Table 3-3.
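The reported training configuration (Adam with adaptive learning rate, cross-entropy loss, 30 epochs, batches of 10, per-epoch validation) maps onto the model sketch above as follows; the data tensors are random stand-ins shaped like 12-frame RDTM samples, purely for illustration.

```python
import numpy as np
import tensorflow as tf

model = build_3dcnn()                  # the network from the sketch above
model.compile(optimizer=tf.keras.optimizers.Adam(),   # adaptive learning rate
              loss="categorical_crossentropy",        # cross-entropy loss
              metrics=["accuracy"])

# Random stand-ins shaped like 12-frame RDTM samples (illustration only).
x_train = np.random.rand(100, 12, 127, 64, 1).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 3, 100), 3)
x_val = np.random.rand(20, 12, 127, 64, 1).astype("float32")
y_val = tf.keras.utils.to_categorical(np.random.randint(0, 3, 20), 3)

# 30 epochs, 10 samples per batch; Keras shuffles the training data each
# epoch and evaluates the validation set after every epoch.
history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                    epochs=30, batch_size=10, shuffle=True)
```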

TABLE 3-1 comparison of Classification accuracy of test data sets at different time window lengths

TABLE 3-2 comparison of Classification accuracy for different sliding time step test datasets

TABLE 3-3 optimal window length and step size comparison for different formats of gesture data

The batch-processed RDTM is input into the C3DN network model for classification and recognition, and the model's test results are recorded. The probability confusion matrices of the test samples of test subjects A and B are given in Tables 3-4 and 3-5. The tables show that different gesture actions are confused to different degrees; palm rotation is almost never confused with the other actions, because its radar images differ greatly from those of the other four. There are also fairly significant differences between subjects. Training samples from different subjects should therefore be increased as far as possible to reduce subject differences; at the same time, gesture recognition accuracy cannot rest on the accuracy of one subject or the classification accuracy of one gesture, so when evaluating the performance of the algorithm the average classification accuracy over several specific gesture actions and several experimental subjects is used as the relative performance index. For reasons of space the other three test subjects are not listed individually and only the average classification accuracy is given. The average classification accuracies of the test sets of the 5 test subjects are 88.275%, 91.800%, 94.625%, 89.375% and 91.125%, respectively.

Table 3-4 probability confusion matrix for test object a gesture data

TABLE 3-5 probability confusion matrix for test object B gesture data

To compare the performance of different classification and recognition methods, RDTM, RTM, DTM and the splice of RTM and DTM are used in turn as the data representing gestures, and the corresponding network models are trained and tested, giving the results in Table 3-6. The table shows that the classification accuracy of the proposed gesture data RDTM with the C3DN network is higher than that of the other methods.

TABLE 3-6 comparison of accuracy rates of test data sets for different gesture recognition methods

The invention provides an end-to-end real-time gesture recognition method based on FMCW millimeter wave radar that innovatively performs gesture segmentation. A C3DN network model based on 3D-CNN units is designed for radar video, using range-Doppler-time three-dimensional data to represent gestures; corresponding data sets are constructed for training and testing, and the data sets of 5 test subjects are classified and recognized separately. The classification results show that, compared with classification methods using two-dimensional image gesture data and ordinary CNN models, the proposed RDTM three-dimensional gesture representation together with the constructed C3DN network achieves the best classification performance. The proposed method not only characterizes gestures better but also generalizes better across test subjects.
