Gesture recognition method and device, storage medium and electronic device
1. A gesture recognition method, comprising:
acquiring a gesture picture to be recognized;
sequentially inputting the gesture picture into a multilayer feature extraction network, and generating a plurality of second-class feature maps with different sizes by using a plurality of first-class feature maps output by the multilayer feature extraction network;
and outputting gesture category information and gesture key point information by using a multi-scale detection module based on the plurality of second-class feature maps.
2. The method of claim 1, wherein sequentially inputting the gesture picture into a multilayer feature extraction network comprises:
inputting the gesture picture into a first feature extraction network to obtain a first feature map with a first size;
inputting the first feature map into a second feature extraction network to obtain a second feature map with a second size;
inputting the second feature map into a third feature extraction network to obtain a third feature map with a third size;
wherein the first size is greater than the second size, which is greater than the third size.
3. The method of claim 2, wherein generating a plurality of second-class feature maps of different sizes using the plurality of first-class feature maps output by the multilayer feature extraction network comprises:
generating a fourth feature map of a second size by using the third feature map and the second feature map, and generating a fifth feature map of a first size by using the second feature map and the first feature map;
and outputting the third feature map, the fourth feature map and the fifth feature map as the plurality of second-class feature maps with different sizes.
4. The method of claim 3,
generating a fourth feature map of a second size using the third feature map and the second feature map comprises:
upsampling the third feature map to the same size as the second feature map;
performing matrix addition on the upsampled third feature map and the second feature map to obtain the fourth feature map; and/or,
generating a fifth feature map of the first size using the second feature map and the first feature map includes:
upsampling the second feature map to the same size as the first feature map;
and performing matrix addition on the upsampled second feature map and the first feature map to obtain the fifth feature map.
5. The method of claim 1, wherein outputting gesture category information and gesture keypoint information based on the plurality of second class feature maps by using a multi-scale detection module comprises:
for each second-class feature map of the N second-class feature maps, inputting the second-class feature map into a multi-scale detection module and outputting M third-class feature maps with different sizes, wherein each second-class feature map corresponds to one multi-scale detection module and M is the number of scales of the multi-scale detection module;
for each second-class feature map of the N second-class feature maps, performing size alignment on its M third-class feature maps to generate a recognition picture of the maximum size;
and generating gesture category information and gesture key point information using the recognition picture, wherein the gesture category information is used to represent the semantic text expressed by the gesture picture.
6. The method of claim 5, wherein the N multi-scale detection modules corresponding to the N second-class feature maps are all the same multi-scale detection module, and each multi-scale detection module comprises one 3x3 convolution kernel, two 3x3 convolution kernels, and three 3x3 convolution kernels.
7. The method of claim 5, wherein performing size alignment on the M third-class feature maps to generate the recognition picture of the maximum size comprises:
upsampling the sixth feature map and the seventh feature map to a sixth size respectively, wherein M = 3 and the M third-class feature maps include: a sixth feature map of a fourth size, a seventh feature map of a fifth size, and an eighth feature map of the sixth size, the sixth size being greater than the fifth size and the fifth size being greater than the fourth size;
and outputting the upsampled sixth feature map, the upsampled seventh feature map and the eighth feature map as the recognition picture.
8. A gesture recognition apparatus, comprising:
the acquisition module is used for acquiring a gesture picture to be recognized;
the extraction module is used for sequentially inputting the gesture picture into a multilayer feature extraction network and generating a plurality of second-class feature maps with different sizes by using a plurality of first-class feature maps output by the multilayer feature extraction network;
and the detection module is used for outputting the gesture category information and the gesture key point information by using a multi-scale detection module based on the plurality of second-class feature maps.
9. A storage medium, characterized in that the storage medium comprises a stored program, wherein the program, when executed, performs the method steps of any one of claims 1 to 7.
10. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus; wherein:
a memory for storing a computer program;
a processor for performing the method steps of any one of claims 1 to 7 by executing the program stored in the memory.
Background
In the related art, gestures are a form of non-verbal communication and can be used in a variety of fields, such as communication between deaf-mute people, robot control, Human-Computer Interaction (HCI), home automation, and medical applications.
In the related art, gesture recognition has adopted many different techniques, which fall into three main categories. The first is template matching: the feature parameters of the gesture to be recognized are matched against pre-stored template feature parameters, and recognition is completed by measuring the similarity between them. For example, the edge images of the gesture to be recognized and of the template gesture are transformed into Euclidean distance space, and the Hausdorff distance (a measure of the distance between proper subsets of a space) or the modified Hausdorff distance is used to represent the similarity between the gesture to be recognized and the template gesture; the recognition result is the template gesture corresponding to the minimum distance value. The second is statistical analysis: a classifier is determined, based on probability and statistics theory, from the statistics of sample feature vectors. Fingertip and centroid features are extracted from each image, the distance and included angle between them are calculated, and their distributions are computed separately for different gestures to obtain the numerical characteristics of the distributions; the distance and angle values that separate the different gestures are then obtained by Bayesian decision based on the minimum error rate. Once the classifier is obtained, acquired gesture images are classified and recognized. The third is neural networks: this technique has self-organizing and self-learning capabilities, is distributed in character, can effectively resist noise, can process incomplete patterns, and has pattern generalization capability.
With the neural network technique, a training (learning) phase of the neural network is required before recognition. The template matching technique requires a large amount of manual feature design; under different environments and backgrounds the features to be considered vary widely, so the engineering workload is large and the system is complex to implement. The statistical analysis technique allows feature sets characterizing different gesture categories to be defined and estimates a locally optimal linear discriminator, recognizing the corresponding gesture category from a large number of features extracted from the gesture image; however, its learning efficiency is low, and the recognition rate does not improve noticeably as the sample size keeps growing. Deep-learning-based methods have been shown to extract features accurately and to achieve higher recognition accuracy. However, in some scenes the hand is occluded and its orientation is changeable, so hand-related information in the picture is lost and the accuracy of gesture recognition is low. For these problems in the related art, no effective solution has yet been found.
Disclosure of Invention
The embodiments of the invention provide a gesture recognition method and device, a storage medium and an electronic device.
According to an aspect of the embodiments of the present application, there is provided a gesture recognition method, including: acquiring a gesture picture to be recognized; sequentially inputting the gesture picture into a multilayer feature extraction network, and generating a plurality of second-class feature maps with different sizes by using a plurality of first-class feature maps output by the multilayer feature extraction network; and outputting gesture category information and gesture key point information by using a multi-scale detection module based on the plurality of second-class feature maps.
Further, sequentially inputting the gesture picture into a multilayer feature extraction network comprises: inputting the gesture picture into a first feature extraction network to obtain a first feature map with a first size; inputting the first feature map into a second feature extraction network to obtain a second feature map with a second size; inputting the second feature map into a third feature extraction network to obtain a third feature map with a third size; wherein the first size is greater than the second size, which is greater than the third size.
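As an illustration of the three sequential feature extraction networks described above, the following sketch models each network as a 2x2 average-pooling stage. This is a hypothetical stand-in chosen for brevity, since the application does not specify the layers; all function names and sizes here are assumptions.

```python
import numpy as np

def pool2x2(x: np.ndarray) -> np.ndarray:
    """Halve the spatial resolution by 2x2 average pooling
    (stand-in for one feature extraction network)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def extract_features(picture: np.ndarray):
    """Pass the picture sequentially through three stages, yielding three
    first-class feature maps of strictly decreasing size."""
    first = pool2x2(picture)   # first size
    second = pool2x2(first)    # second size, smaller than the first
    third = pool2x2(second)    # third size, smaller than the second
    return first, second, third

f1, f2, f3 = extract_features(np.zeros((64, 64)))
# shapes: 32x32, 16x16, 8x8 -- first size > second size > third size
```

The point of the sketch is only the size relation required by the claim: each successive network output is strictly smaller than the previous one.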
Further, generating a plurality of second-class feature maps with different sizes by using the plurality of first-class feature maps output by the multilayer feature extraction network comprises: generating a fourth feature map of the second size by using the third feature map and the second feature map, and generating a fifth feature map of the first size by using the second feature map and the first feature map; and outputting the third feature map, the fourth feature map and the fifth feature map as the plurality of second-class feature maps with different sizes.
Further, generating a fourth feature map of the second size using the third feature map and the second feature map comprises: upsampling the third feature map to the same size as the second feature map; and performing matrix addition on the upsampled third feature map and the second feature map to obtain the fourth feature map. And/or, generating a fifth feature map of the first size using the second feature map and the first feature map comprises: upsampling the second feature map to the same size as the first feature map; and performing matrix addition on the upsampled second feature map and the first feature map to obtain the fifth feature map.
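The upsample-and-add fusion just described can be sketched as follows. Nearest-neighbor upsampling and the concrete map sizes are assumptions chosen for illustration; the application only requires that the two maps have the same size before matrix addition.

```python
import numpy as np

def upsample_to(x: np.ndarray, shape: tuple) -> np.ndarray:
    """Nearest-neighbor upsample x to the target (height, width)."""
    h, w = x.shape
    rows = np.arange(shape[0]) * h // shape[0]
    cols = np.arange(shape[1]) * w // shape[1]
    return x[np.ix_(rows, cols)]

def fuse(smaller: np.ndarray, larger: np.ndarray) -> np.ndarray:
    """Upsample the smaller map to the larger map's size, then matrix-add."""
    return upsample_to(smaller, larger.shape) + larger

third_map = np.ones((8, 8))               # third size (assumed 8x8)
second_map = np.ones((16, 16))            # second size (assumed 16x16)
fourth_map = fuse(third_map, second_map)  # fourth feature map, second size
```

The fifth feature map would be obtained the same way by calling `fuse(second_map, first_map)`.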
Further, outputting the gesture category information and the gesture key point information by using a multi-scale detection module based on the plurality of second-class feature maps comprises: for each second-class feature map of the N second-class feature maps, inputting the second-class feature map into a multi-scale detection module and outputting M third-class feature maps with different sizes, wherein each second-class feature map corresponds to one multi-scale detection module and M is the number of scales of the multi-scale detection module; for each second-class feature map of the N second-class feature maps, performing size alignment on its M third-class feature maps to generate a recognition picture of the maximum size; and generating gesture category information and gesture key point information using the recognition picture, wherein the gesture category information is used to represent the semantic text expressed by the gesture picture.
Further, the N multi-scale detection modules respectively corresponding to the N second-class feature maps are all the same multi-scale detection module, and each multi-scale detection module includes one 3x3 convolution kernel, two 3x3 convolution kernels, and three 3x3 convolution kernels.
Further, performing size alignment on the M third-class feature maps to generate a recognition picture of the maximum size includes: upsampling the sixth feature map and the seventh feature map to a sixth size respectively, wherein M = 3 and the M third-class feature maps include: a sixth feature map of a fourth size, a seventh feature map of a fifth size, and an eighth feature map of the sixth size, the sixth size being greater than the fifth size and the fifth size being greater than the fourth size; and outputting the upsampled sixth feature map, the upsampled seventh feature map and the eighth feature map as the recognition picture.
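The M = 3 alignment step can be sketched as below. Nearest-neighbor upsampling, the stacking of the three aligned maps, and all concrete shapes are assumptions for illustration; the application fixes only the ordering of the fourth, fifth and sixth sizes.

```python
import numpy as np

def upsample_nn(x: np.ndarray, shape: tuple) -> np.ndarray:
    """Nearest-neighbor upsample x to the target (height, width)."""
    h, w = x.shape
    rows = np.arange(shape[0]) * h // shape[0]
    cols = np.arange(shape[1]) * w // shape[1]
    return x[np.ix_(rows, cols)]

def align_maps(sixth, seventh, eighth):
    """Upsample the sixth and seventh maps to the size of the eighth
    (largest, sixth-size) map and stack all three as the recognition picture."""
    target = eighth.shape
    return np.stack([upsample_nn(sixth, target),
                     upsample_nn(seventh, target),
                     eighth])

recognition = align_maps(np.zeros((8, 8)),    # fourth size (assumed)
                         np.zeros((16, 16)),  # fifth size (assumed)
                         np.zeros((32, 32)))  # sixth size, the largest
```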
According to another aspect of the embodiments of the present application, there is also provided a gesture recognition apparatus, including: an acquisition module, used for acquiring a gesture picture to be recognized; an extraction module, used for sequentially inputting the gesture picture into a multilayer feature extraction network and generating a plurality of second-class feature maps with different sizes by using a plurality of first-class feature maps output by the multilayer feature extraction network; and a detection module, used for outputting the gesture category information and the gesture key point information by using a multi-scale detection module based on the plurality of second-class feature maps.
Further, the extraction module comprises: the first extraction unit is used for inputting the gesture picture into a first feature extraction network to obtain a first feature map with a first size; the second extraction unit is used for inputting the first feature map into a second feature extraction network to obtain a second feature map with a second size; a third extraction unit, configured to input the second feature map into a third feature extraction network, so as to obtain a third feature map of a third size; wherein the first size is greater than the second size, which is greater than the third size.
Further, the extraction module comprises: a generating unit, configured to generate a fourth feature map of the second size using the third feature map and the second feature map, and generate a fifth feature map of the first size using the second feature map and the first feature map; and an output unit, configured to output the third feature map, the fourth feature map and the fifth feature map as the plurality of second-class feature maps with different sizes.
Further, the generating unit includes: a first sampling unit, used for upsampling the third feature map to the same size as the second feature map; a first addition unit, used for performing matrix addition on the upsampled third feature map and the second feature map to obtain the fourth feature map; and/or a second sampling unit, used for upsampling the second feature map to the same size as the first feature map; and a second addition unit, used for performing matrix addition on the upsampled second feature map and the first feature map to obtain the fifth feature map.
Further, the detection module includes: a first processing unit, used for, for each second-class feature map of the N second-class feature maps, inputting the second-class feature map into a multi-scale detection module and outputting M third-class feature maps with different sizes, wherein each second-class feature map corresponds to one multi-scale detection module and M is the number of scales of the multi-scale detection module; a second processing unit, used for performing size alignment on the M third-class feature maps of each second-class feature map to generate a recognition picture of the maximum size; and a generating unit, used for generating gesture category information and gesture key point information using the recognition picture, wherein the gesture category information is used to represent the semantic text expressed by the gesture picture.
Further, the N multi-scale detection modules respectively corresponding to the N second-class feature maps are all the same multi-scale detection module, and each multi-scale detection module includes one 3x3 convolution kernel, two 3x3 convolution kernels, and three 3x3 convolution kernels.
Further, the second processing unit includes: a sampling subunit, configured to upsample the sixth feature map and the seventh feature map to a sixth size respectively, wherein M = 3 and the M third-class feature maps include: a sixth feature map of a fourth size, a seventh feature map of a fifth size, and an eighth feature map of the sixth size, the sixth size being greater than the fifth size and the fifth size being greater than the fourth size; and an output subunit, configured to output the upsampled sixth feature map, the upsampled seventh feature map and the eighth feature map as the recognition picture.
According to another aspect of the embodiments of the present application, there is also provided a storage medium including a stored program which, when executed, performs the steps of the above method.
According to another aspect of the embodiments of the present application, there is also provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; wherein: the memory is used for storing a computer program; and the processor is used for executing the steps of the method by running the program stored in the memory.
Embodiments of the present application also provide a computer program product containing instructions, which when run on a computer, cause the computer to perform the steps of the above method.
According to the invention, a gesture picture to be recognized is acquired; the gesture picture is sequentially input into a multilayer feature extraction network; a plurality of second-class feature maps with different sizes are generated using a plurality of first-class feature maps output by the multilayer feature extraction network; and a multi-scale detection module outputs gesture category information and gesture key point information based on the plurality of second-class feature maps. Deep-learning-based methods require a large amount of training data; by converting between the multilayer feature extraction network and the feature maps, the overfitting to which a neural network is prone when data are scarce is avoided. In recognition scenes as complicated as real life, outputting the gesture category information and gesture key point information with the multi-scale detection module avoids the drop in gesture recognition accuracy caused by the hand being occluded or its orientation changing during detection, improves robustness in complicated scenes, and thereby solves the technical problem of the low gesture recognition rate in the related art.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a block diagram of a hardware configuration of a computer according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of gesture recognition according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of feature extraction performed by an embodiment of the present invention;
FIG. 4 is a schematic diagram of feature detection performed by an embodiment of the present invention;
FIG. 5 is a diagram of an identification picture output by an embodiment of the present invention;
FIG. 6 is a block diagram of a gesture recognition apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application. It should be noted that the embodiments in the present application and the features in the embodiments may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
The method provided in Embodiment 1 of the present application may be executed on a server, a computer, or a similar computing device. Taking running on a computer as an example, fig. 1 is a block diagram of the hardware structure of a computer according to an embodiment of the present invention. As shown in fig. 1, the computer may include one or more processors 102 (only one is shown in fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data, and optionally a transmission device 106 for communication functions and an input/output device 108. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 1 is illustrative only and does not limit the configuration of the computer described above. For example, the computer may also include more or fewer components than shown in fig. 1, or have a different configuration from that shown in fig. 1.
The memory 104 may be used to store a computer program, for example, a software program and a module of application software, such as a computer program corresponding to a gesture recognition method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to a computer through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
In this embodiment, a gesture recognition method is provided, and fig. 2 is a flowchart of a gesture recognition method according to an embodiment of the present invention, as shown in fig. 2, the flowchart includes the following steps:
step S202, acquiring a gesture picture to be recognized;
the gesture picture of the embodiment comprises a hand area and a background area, and can be collected through a camera.
Step S204, sequentially inputting the gesture picture into a multilayer feature extraction network, and generating a plurality of second-class feature maps with different sizes by using a plurality of first-class feature maps output by the multilayer feature extraction network;
in this embodiment, each layer of the feature extraction network of the multi-layer feature extraction network outputs one first-type feature map, and the extraction modes of each layer of the feature extraction network are different, and the types of the plurality of first-type feature maps are different.
Step S206, outputting the gesture category information and the gesture key point information by using a multi-scale detection module based on the plurality of second-class feature maps.
In some examples, after the gesture category information and the gesture key point information are output by the multi-scale detection module based on the second-class feature maps, the matching degree between the gesture category information and the gesture key point information may further be determined; if the matching degree is greater than a preset value, the repeated gesture key point information is deleted. Determining the matching degree between the gesture category information and the gesture key point information includes: locating a plurality of key pixel points in the gesture key point information, determining the adjacent pixel points of each key pixel point, connecting any two groups of key pixel points and adjacent pixel points to obtain a gesture trajectory, searching a preset mapping table for the sign language information matching the gesture trajectory, and judging the matching degree between the gesture key point information and the sign language information.
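A minimal sketch of the trajectory-lookup idea above follows. The keypoint format, the direction encoding, and the mapping table `SIGN_TABLE` are all hypothetical illustrations, not taken from the application.

```python
def trajectory_key(keypoints):
    """Encode an ordered list of (x, y) key pixel points as a coarse
    direction string by connecting consecutive points."""
    steps = []
    for (x0, y0), (x1, y1) in zip(keypoints, keypoints[1:]):
        if x1 > x0:
            steps.append("R")
        elif x1 < x0:
            steps.append("L")
        elif y1 > y0:
            steps.append("U")
        else:
            steps.append("D")
    return "".join(steps)

# Hypothetical preset mapping table from trajectory to sign language information.
SIGN_TABLE = {"RR": "hello", "RL": "thanks"}

def lookup_sign(keypoints):
    """Search the mapping table for sign language info matching the trajectory."""
    return SIGN_TABLE.get(trajectory_key(keypoints))
```

For example, the trajectory through the points (0, 0), (1, 0), (2, 0) encodes as "RR", which the hypothetical table maps to "hello"; an unknown trajectory yields no match.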
Through the above steps, a gesture picture to be recognized is acquired; the gesture picture is sequentially input into a multilayer feature extraction network; a plurality of second-class feature maps with different sizes are generated using a plurality of first-class feature maps output by the multilayer feature extraction network; and a multi-scale detection module outputs gesture category information and gesture key point information based on the plurality of second-class feature maps. Deep-learning-based methods require a large amount of training data; by converting between the multilayer feature extraction network and the feature maps, the overfitting to which a neural network is prone when data are scarce is avoided. In recognition scenes as complicated as real life, outputting the gesture category information and gesture key point information with the multi-scale detection module avoids the drop in gesture recognition accuracy caused by the hand being occluded or its orientation changing during detection, improves robustness in complicated scenes, and thereby solves the technical problem of the low gesture recognition rate in the related art.
In an implementation manner of this embodiment, sequentially inputting the gesture picture into the multilayer feature extraction network includes: inputting the gesture picture into a first feature extraction network to obtain a first feature map of a first size; inputting the first feature map into a second feature extraction network to obtain a second feature map of a second size; inputting the second feature map into a third feature extraction network to obtain a third feature map of a third size; wherein the first size is larger than the second size, and the second size is larger than the third size.
In another embodiment, sequentially inputting the gesture picture into the multilayer feature extraction network comprises: inputting the gesture picture into a first feature extraction network to obtain a first feature map of a first size; inputting the gesture picture into a second feature extraction network to obtain a second feature map of a second size; inputting the gesture picture into a third feature extraction network to obtain a third feature map of a third size; wherein the first size is larger than the second size, and the second size is larger than the third size.
In some examples, generating a plurality of second-class feature maps of different sizes using the plurality of first-class feature maps output by the multilayer feature extraction network comprises: generating a fourth feature map of the second size by using the third feature map and the second feature map, and generating a fifth feature map of the first size by using the second feature map and the first feature map; and outputting the third feature map, the fourth feature map and the fifth feature map as the plurality of second-class feature maps with different sizes.
In one example, generating the fourth feature map of the second size using the third feature map and the second feature map includes: upsampling the third feature map to the same size as the second feature map; and performing matrix addition on the upsampled third feature map and the second feature map to obtain the fourth feature map. Similarly, generating the fifth feature map of the first size using the second feature map and the first feature map comprises: upsampling the second feature map to the same size as the first feature map; and performing matrix addition on the upsampled second feature map and the first feature map to obtain the fifth feature map.
Upsampling the second feature map to the same size as the first feature map uses interpolation: on the basis of the second feature map, a suitable interpolation algorithm inserts new elements between the pixel points until the size equals that of the first feature map.
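One common interpolation algorithm for this step is bilinear interpolation, sketched below. This is a generic sketch of inserting new elements between pixel points, not the edge-preserving algorithm of this embodiment; the function name and shapes are assumptions.

```python
import numpy as np

def bilinear_upsample(x: np.ndarray, th: int, tw: int) -> np.ndarray:
    """Insert new elements between pixel points by bilinear interpolation,
    producing a (th, tw) map from a smaller one."""
    h, w = x.shape
    ys = np.linspace(0, h - 1, th)          # fractional source rows
    xs = np.linspace(0, w - 1, tw)          # fractional source columns
    y0 = np.floor(ys).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]                 # vertical interpolation weights
    wx = (xs - x0)[None, :]                 # horizontal interpolation weights
    top = x[np.ix_(y0, x0)] * (1 - wx) + x[np.ix_(y0, x1)] * wx
    bot = x[np.ix_(y1, x0)] * (1 - wx) + x[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

small = np.array([[0.0, 1.0], [2.0, 3.0]])
big = bilinear_upsample(small, 4, 4)  # corner values 0.0 and 3.0 are preserved
```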
Classification of interpolation algorithms: to overcome the drawbacks of conventional methods, this embodiment provides an edge-based image interpolation algorithm, which enhances the edges of the interpolated image to a certain extent so that the image has a better visual effect. The edge-preserving interpolation methods of this embodiment can be divided into two types: methods based on the edges of the original low-resolution image, and methods based on the edges of the interpolated high-resolution image. (1) In the first type, the edges of the low-resolution image (the second feature map) are detected first, the pixels are then classified according to the detected edges, pixel-block interpolation is applied to pixels in flat areas, and pixel-point interpolation is applied to pixels in edge areas so as to preserve edge details. (2) In the second type, the low-resolution image (the second feature map) is first interpolated by a conventional method, the edges of the resulting high-resolution image are then detected, and finally the edges and nearby pixels are specially processed to remove blur and enhance the image edges. This embodiment also provides a region-based image interpolation algorithm: the original low-resolution image is first segmented into different regions, each interpolation point is then mapped onto the low-resolution image and the region to which it belongs is determined, and finally different interpolation formulas are designed according to the neighborhood pixels of the interpolation point and the value of the interpolation point is calculated.
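For reference, the conventional interpolation that these edge-preserving variants build upon, inserting new elements between existing pixel points, can be sketched as classical bilinear interpolation. The 4x4-to-8x8 sizes below are illustrative, and this sketch omits the edge detection and pixel classification steps of the embodiment.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Classical bilinear interpolation: each output pixel is a weighted
    average of the four nearest input pixels."""
    in_h, in_w = img.shape
    ys = np.linspace(0, in_h - 1, out_h)   # output rows in input coordinates
    xs = np.linspace(0, in_w - 1, out_w)   # output cols in input coordinates
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]                # fractional row weights
    wx = (xs - x0)[None, :]                # fractional col weights
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

small = np.arange(16, dtype=float).reshape(4, 4)  # stand-in "second feature map"
big = bilinear_resize(small, 8, 8)
assert big.shape == (8, 8)
# Corner pixels of the input are reproduced exactly at the output corners.
assert big[0, 0] == small[0, 0] and big[-1, -1] == small[-1, -1]
```

Because bilinear weighting averages across edges, it blurs them, which is exactly the defect the edge-based methods above are designed to correct.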
Fig. 3 is a schematic diagram of the feature extraction performed in the embodiment of the present invention: a gesture picture containing a gesture area is fed into the feature extraction network, a corresponding feature map is obtained after each pass through a convolutional layer (feature layer), and the extracted feature maps are then fused. As shown in fig. 3, the gesture picture passes through the feature extraction network to obtain 3 feature maps: feature map 1, feature map 2 and feature map 3. Conventional target detection methods mainly extract the last layer of feature maps (feature map 3) for decoding and post-processing. This has an obvious disadvantage: a small object carries little pixel information of its own, which is easily lost during down-sampling, eventually making the small object difficult to detect. To address the problem of large size differences between objects, this embodiment fuses multiple layers of feature maps to obtain the feature maps used for decoding. As shown in fig. 3, feature map 3 is upsampled to the same size as feature map 2 and combined with it by matrix addition to obtain feature map 4; feature map 2 is upsampled to the same resolution as feature map 1 and combined with it by matrix addition to obtain feature map 5. Three feature maps are thus available for subsequent processing: feature map 3, feature map 4 and feature map 5. Prediction is then carried out on these features, which improves the detection accuracy for small hands in the gesture recognition algorithm.
In an implementation of this embodiment, outputting the gesture category information and the gesture key point information by using the multi-scale detection module based on the plurality of second-class feature maps includes: for each of the N second-class feature maps, inputting the second-class feature map into a multi-scale detection module and outputting M third-class feature maps of different sizes, where each second-class feature map corresponds to one multi-scale detection module and M is the number of scales of the multi-scale detection module; for the M third-class feature maps of each second-class feature map, size-aligning the M third-class feature maps to generate a recognition picture of the largest size; and generating the gesture category information and the gesture key point information from the recognition picture, where the gesture category information represents the semantic text expressed by the gesture picture.
Optionally, the multi-scale detection modules corresponding to the plurality of second class feature maps are the same multi-scale detection module.
In some examples, a target part is identified in the gesture picture according to the recognition picture, and gesture key points are searched for in the recognition picture according to the joint distribution trajectory of the target part.
Alternatively, the semantic text may be "good", "OK", "bye", etc.
Optionally, the N multi-scale detection modules respectively corresponding to the N second-class feature maps are all the same multi-scale detection module, and each multi-scale detection module includes a branch of one 3x3 convolution, a branch of two stacked 3x3 convolutions, and a branch of three stacked 3x3 convolutions.
In some examples, M = 3 and the M third-class feature maps include a sixth feature map of a fourth size, a seventh feature map of a fifth size, and an eighth feature map of a sixth size, where the sixth size is greater than the fifth size and the fifth size is greater than the fourth size. Size-aligning the M third-class feature maps to generate the largest-sized recognition picture then includes: upsampling the sixth feature map and the seventh feature map to the sixth size respectively; and outputting the upsampled sixth feature map, the upsampled seventh feature map and the eighth feature map as the recognition picture. In one example, the recognition picture is output after the pixel points of the upsampled sixth feature map, the upsampled seventh feature map and the eighth feature map are superimposed.
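The size alignment and pixel superposition described above can be sketched as follows; nearest-neighbour upsampling and the concrete sizes 8/16/32 are illustrative assumptions, not values given by the embodiment.

```python
import numpy as np

def upsample_to(x, size):
    """Nearest-neighbour upsampling to a square target size
    (the scale factor is assumed to be an integer)."""
    f = size // x.shape[0]
    return x.repeat(f, axis=0).repeat(f, axis=1)

# Hypothetical M = 3 detection outputs at three sizes
sixth   = np.ones((8, 8))    # fourth size (smallest)
seventh = np.ones((16, 16))  # fifth size
eighth  = np.ones((32, 32))  # sixth (largest) size

# Align the two smaller maps to the sixth size, then superimpose
# the pixel points of all three maps to form the recognition picture.
recognition = upsample_to(sixth, 32) + upsample_to(seventh, 32) + eighth
assert recognition.shape == (32, 32)
```

Aligning everything to the largest size before superposition ensures no spatial detail from the highest-resolution branch is discarded.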
Fig. 4 is a schematic diagram of the feature detection performed in the embodiment of the present invention. After the above 3 feature maps (the third, fourth and fifth feature maps) are obtained, they are fed into the designed multi-scale detection module. The multi-scale detection module is composed of three branches: one 3x3 convolution module, two stacked 3x3 convolution modules, and three stacked 3x3 convolution modules, whose receptive fields are 3x3, 5x5 and 7x7 respectively. The receptive field represents the region of the input image to which a pixel on the output feature map is mapped; the larger the receptive field, the more semantic information the feature map contains, and the more accurate the prediction obtained through the neural network. The feature map produced by the single 3x3 convolution and the feature map produced by the two stacked 3x3 convolutions are each upsampled to the size of the feature map produced by the three stacked 3x3 convolutions, and the final gesture category and hand key point detection is then performed. Fig. 5 is a schematic diagram of the recognition picture output by the embodiment of the invention; the output recognition picture contains "OK" and the coordinates of 21 key points of the hand, so the key point detection task improves the accuracy of the gesture recognition task.
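The receptive fields quoted above follow from the standard rule that each additional stride-1 3x3 convolution grows the receptive field by 2 pixels per side; a small sketch:

```python
def receptive_field(num_3x3_convs):
    """Receptive field (side length) of a stack of stride-1 3x3
    convolutions: each layer adds (kernel - 1) = 2 to the field."""
    rf = 1  # a single output pixel before any convolution
    for _ in range(num_3x3_convs):
        rf += 2
    return rf

# The three branches of the multi-scale detection module:
assert [receptive_field(n) for n in (1, 2, 3)] == [3, 5, 7]
```

This is also why two stacked 3x3 convolutions are often preferred over one 5x5: the receptive field is identical while the parameter count is lower and an extra non-linearity is gained.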
This embodiment provides a multi-task learning method that combines the gesture detection task and the hand key point detection task, supplementing gesture recognition information with hand key point information so that the neural network can discover interrelations among the feature information of the different tasks and thereby improve on the performance of single-task learning. In addition, multi-task learning can to a certain extent alleviate the problems of insufficient data and network overfitting, ultimately improving the accuracy of gesture recognition.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
In this embodiment, a gesture recognition apparatus is further provided for implementing the above embodiments and preferred embodiments, which have already been described and are not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 6 is a block diagram of a gesture recognition apparatus according to an embodiment of the present invention, as shown in fig. 6, the apparatus includes: an acquisition module 60, an extraction module 62, a detection module 64, wherein,
an obtaining module 60, configured to obtain a gesture picture to be recognized;
the extraction module 62 is configured to sequentially input the gesture picture into a multilayer feature extraction network, and generate a plurality of second-class feature maps of different sizes by using a plurality of first feature maps output by the multilayer feature extraction network;
and the detection module 64 is configured to output the gesture category information and the gesture key point information by using a multi-scale detection module based on the plurality of second class feature maps.
Optionally, the extracting module includes: the first extraction unit is used for inputting the gesture picture into a first feature extraction network to obtain a first feature map with a first size; the second extraction unit is used for inputting the first feature map into a second feature extraction network to obtain a second feature map with a second size; a third extraction unit, configured to input the second feature map into a third feature extraction network, so as to obtain a third feature map of a third size; wherein the first size is greater than the second size, which is greater than the third size.
Optionally, the extracting module includes: a generating unit, configured to generate a fourth feature map of a second size using the third feature map and the second feature map, and generate a fifth feature map of a first size using the second feature map and the first feature map; and the output unit is used for outputting the third feature map, the fourth feature map and the fifth feature map as the plurality of second-type feature maps with different sizes.
Optionally, the generating unit includes: the first sampling unit is used for up-sampling the third feature map to the same size as the second feature map; the first summing unit is used for performing matrix summing on the third feature map subjected to the upsampling and the second feature map to obtain a fourth feature map; and/or the second sampling unit is used for up-sampling the second characteristic diagram to the same size as the first characteristic diagram; and the second summation unit is used for carrying out matrix summation on the second characteristic diagram after the up-sampling and the first characteristic diagram to obtain a fifth characteristic diagram.
Optionally, the detection module includes: the first processing unit is used for inputting the second class feature maps into a multi-scale detection module and outputting M third class feature maps with different sizes aiming at each second class feature map in N second class feature maps, wherein each second class feature map corresponds to one multi-scale detection module, and M is the scale number of the multi-scale detection module; the second processing unit is used for aligning the sizes of the M third type feature maps of each second type feature map in the plurality of second type feature maps to generate a recognition picture with the maximum size; and the generating unit is used for generating gesture category information and gesture key point information by adopting the identification picture, wherein the gesture category information is used for representing semantic texts represented by the gesture picture.
Optionally, the N multi-scale detection modules respectively corresponding to the N second-class feature maps are all the same multi-scale detection module, and each multi-scale detection module includes a branch of one 3x3 convolution, a branch of two stacked 3x3 convolutions, and a branch of three stacked 3x3 convolutions.
Optionally, the second processing unit includes: a sampling subunit, configured to upsample the sixth feature map and the seventh feature map to a sixth size respectively, where M = 3 and the M third-class feature maps include a sixth feature map of a fourth size, a seventh feature map of a fifth size and an eighth feature map of a sixth size, the sixth size being greater than the fifth size and the fifth size being greater than the fourth size; and an output subunit, configured to output the upsampled sixth feature map, the upsampled seventh feature map and the eighth feature map as the recognition picture.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Example 3
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
s1, acquiring a gesture picture to be recognized;
s2, sequentially inputting the gesture pictures into a multilayer feature extraction network, and generating a plurality of second feature maps with different sizes by adopting a plurality of first feature maps output by the multilayer feature extraction network;
and S3, outputting gesture category information and gesture key point information by adopting a multi-scale detection module based on the second class feature maps.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic device may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
s1, acquiring a gesture picture to be recognized;
s2, sequentially inputting the gesture pictures into a multilayer feature extraction network, and generating a plurality of second feature maps with different sizes by adopting a plurality of first feature maps output by the multilayer feature extraction network;
and S3, outputting gesture category information and gesture key point information by adopting a multi-scale detection module based on the second class feature maps.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.