Training method for an action evaluation model, action evaluation method, and related device
1. A method for training an action evaluation model, comprising:
acquiring a training set, wherein the training set comprises standard action segments corresponding to a plurality of classes of actions, and each standard action segment comprises a plurality of standard action frames;
for each standard action segment, extracting keypoint features of each standard action frame in the standard action segment to form a keypoint feature sequence of the standard action segment;
processing the keypoint feature sequence of the standard action segment using an action evaluation model based on a spatiotemporal attention mechanism to obtain action features of the standard action segment; and
adjusting parameters of the action evaluation model based on the action features of the standard action segment.
2. The method of claim 1, wherein the spatiotemporal attention mechanism comprises a spatial attention mechanism, and wherein processing the keypoint feature sequence of the standard action segment using the action evaluation model based on the spatiotemporal attention mechanism comprises:
processing different keypoint features within the same standard action frame using the spatial attention mechanism.
3. The method of claim 1, wherein the spatiotemporal attention mechanism comprises a temporal attention mechanism, and wherein processing the keypoint feature sequence of the standard action segment using the action evaluation model based on the spatiotemporal attention mechanism comprises:
processing the keypoint features of different standard action frames using the temporal attention mechanism.
4. The method of claim 3, further comprising, prior to said processing the keypoint features of different standard action frames using the temporal attention mechanism:
processing the keypoint features of the different standard action frames using a long short-term memory network.
5. The method of claim 1, wherein before the adjusting the parameters of the action evaluation model based on the action features of the standard action segment, the method further comprises:
pooling the action features of the standard action segment using a pooling layer; and
processing the pooled action features of the standard action segment using a fully connected layer.
6. The method of claim 1, wherein adjusting the parameters of the action evaluation model based on the action features of the standard action segment comprises:
obtaining a loss function over the action features of the standard action segments, wherein the loss function is used to measure the inter-class distance and the intra-class distance of the action features of the standard action segments; and
adjusting the parameters of the action evaluation model based on the loss function.
7. The method of claim 6, wherein the plurality of classes of actions includes a target action, and wherein obtaining the loss function over the action features of the standard action segments comprises:
grouping the features of a plurality of the standard action segments, wherein each group comprises the features of three standard action segments, two of which belong to the target action; and
obtaining a triplet loss among the features of the three standard action segments in each group.
8. An action evaluation method, comprising:
obtaining an action segment to be evaluated and a target action segment, wherein the action segment to be evaluated and the target action segment both belong to a target action, the action segment to be evaluated comprises a plurality of action frames to be evaluated, and the target action segment comprises a plurality of target action frames;
extracting keypoint features of each action frame to be evaluated to form a keypoint feature sequence of the action segment to be evaluated, and extracting keypoint features of each target action frame to form a keypoint feature sequence of the target action segment;
processing the keypoint feature sequence of the action segment to be evaluated and the keypoint feature sequence of the target action segment, respectively, using an action evaluation network; and
obtaining an action evaluation result for the action segment to be evaluated based on the similarity between the processed keypoint feature sequence of the action segment to be evaluated and the processed keypoint feature sequence of the target action segment;
wherein the action evaluation network is obtained by the training method of any one of claims 1 to 7.
9. An electronic device, comprising a processor and a memory coupled to the processor, wherein:
the memory stores program instructions; and
the processor is configured to execute the program instructions stored in the memory to implement the method of any one of claims 1 to 8.
10. A computer-readable storage medium, wherein the storage medium stores program instructions that, when executed, implement the method of any one of claims 1 to 8.
Background
The purpose of action evaluation is to compare an action segment to be evaluated (also called a student action) with the corresponding standard action segment (also called a teacher action) and to give a similarity score according to the difference between the two.
Action evaluation has numerous application scenarios. For example, it can be applied to sports action teaching, model catwalk training, stretching rehabilitation training, and other scenes. Taking sports teaching as an example, non-standard exercise may injure a student's bones, and action evaluation can effectively judge whether exercises performed by the student, such as push-ups, sit-ups, and pull-ups, are standard.
Existing action evaluation methods extract keypoint features from single frames of the action segments, compute the similarity between corresponding single frames of different action segments, and derive the action evaluation result from that similarity. However, such methods evaluate actions poorly, because comparing isolated frames ignores the relations between different keypoint features in the sequence.
Disclosure of Invention
The present application provides a training method for an action evaluation model, an action evaluation method, an electronic device, and a computer-readable storage medium, which can solve the problem that existing action evaluation methods evaluate actions poorly.
To solve the above technical problem, one technical solution adopted by the present application is to provide a method for training an action evaluation model. The training method comprises: acquiring a training set, wherein the training set comprises standard action segments corresponding to a plurality of classes of actions, and each standard action segment comprises a plurality of standard action frames; for each standard action segment, extracting keypoint features of each standard action frame in the standard action segment to form a keypoint feature sequence of the standard action segment; processing the keypoint feature sequence of the standard action segment using an action evaluation model based on a spatiotemporal attention mechanism to obtain action features of the standard action segment; and adjusting parameters of the action evaluation model based on the action features of the standard action segment.
To solve the above technical problem, another technical solution adopted by the present application is to provide an action evaluation method. The method comprises: obtaining an action segment to be evaluated and a target action segment, wherein both segments belong to a target action, the action segment to be evaluated comprises a plurality of action frames to be evaluated, and the target action segment comprises a plurality of target action frames; extracting keypoint features of each action frame to be evaluated to form a keypoint feature sequence of the action segment to be evaluated, and extracting keypoint features of each target action frame to form a keypoint feature sequence of the target action segment; processing the two keypoint feature sequences, respectively, using an action evaluation network; and obtaining an action evaluation result for the action segment to be evaluated based on the similarity between the two processed keypoint feature sequences; wherein the action evaluation network is obtained by the above training method.
To solve the above technical problem, another technical solution adopted by the present application is to provide an electronic device comprising a processor and a memory connected to the processor, wherein the memory stores program instructions, and the processor is configured to execute the program instructions stored in the memory to implement the above method.
To solve the above technical problem, yet another technical solution adopted by the present application is to provide a computer-readable storage medium storing program instructions that, when executed, implement the above method.
In this way, during training of the action evaluation model, the spatiotemporal attention mechanism processes the keypoint feature sequence, so that the relations between different keypoint features in the sequence are taken into account, and the processing effect of the action evaluation model is evaluated based on the intra-class and inter-class distances of the action features it produces. The action features extracted by the trained action evaluation model are therefore easy to distinguish, which benefits subsequent action evaluation. The method provided by the present application can thus improve the effect of subsequent action evaluation.
Drawings
FIG. 1 is a schematic flow chart of a first embodiment of a training method for an action evaluation model according to the present application;
FIG. 2 is a detailed flow diagram of S14 in FIG. 1;
FIG. 3 is a schematic structural diagram of the action evaluation model of the present application;
FIG. 4 is a schematic flow chart diagram illustrating an embodiment of an action evaluation method according to the present application;
FIG. 5 is a schematic diagram of the architecture of the training process and the action evaluation process of the present application;
FIG. 6 is a schematic structural diagram of an embodiment of an electronic device of the present application;
FIG. 7 is a schematic structural diagram of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first", "second" and "third" in this application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any indication of the number of technical features indicated. Thus, a feature defined as "first," "second," or "third" may explicitly or implicitly include at least one of the feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless explicitly specifically limited otherwise.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Fig. 1 is a schematic flow chart of a first embodiment of the training method for an action evaluation model of the present application. It should be noted that, provided substantially the same result is obtained, the method is not limited to the flow sequence shown in fig. 1. As shown in fig. 1, this embodiment may include:
s11: a training set is obtained.
The training set comprises standard action segments corresponding to a plurality of types of actions, and each standard action segment comprises a plurality of standard action frames.
An action segment in the present application is an action sequence formed by a plurality of action frames. The plurality of classes of actions may belong to the same scene or to different scenes. For example, the classes may all belong to a sports teaching scene and include push-ups, sit-ups, pull-ups, and the like.
S12: and for each standard action segment, extracting the key point features of each standard action frame in the standard action segment to form a key point feature sequence of the standard action segment.
The keypoint features in a standard action frame may be extracted by a pose estimation algorithm. The keypoint feature sequence of a standard action segment has dimensions (T, V, C), where T is the number of standard action frames in the segment (the time dimension), i.e., the sequence length; V is the number of keypoints in a standard action frame (the spatial dimension); and C is the number of channels of the keypoint coordinates.
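As an illustrative sketch only (not part of the claimed method), the keypoint feature sequence can be represented as a (T, V, C) array; `estimate_pose` below is a hypothetical stand-in for any off-the-shelf pose estimation algorithm, and the frame count and keypoint count are assumed values:

```python
import numpy as np

# Hypothetical pose estimator: returns V keypoints with C coordinates per frame.
# A real implementation would run a 2D pose estimation model here.
def estimate_pose(frame, num_keypoints=17):
    return np.zeros((num_keypoints, 2), dtype=np.float32)  # (V, C), C=2 for (x, y)

def build_keypoint_sequence(frames):
    """Stack per-frame keypoint features into a (T, V, C) sequence."""
    return np.stack([estimate_pose(f) for f in frames], axis=0)

frames = [None] * 30                 # 30 standard action frames (placeholder data)
seq = build_keypoint_sequence(frames)
print(seq.shape)                     # (30, 17, 2), i.e. (T, V, C)
```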
S13: and processing the key point feature sequence of the standard action segment by utilizing an action evaluation model based on a space-time attention mechanism to obtain the action features of the standard action segment.
Through the spatiotemporal attention mechanism, the action evaluation model can assign higher attention to the keypoint features in the sequence that are important for action evaluation (i.e., that help distinguish actions).
The spatiotemporal attention mechanism may include a spatial attention mechanism (e.g., a graph attention network, GAT) and/or a temporal attention mechanism (e.g., a self-attention module). Through the spatial attention mechanism, the action evaluation model can learn the importance (in the spatial dimension) of different keypoint features within the same standard action frame, and thus assign higher attention to the keypoint features in that frame that are important for action evaluation (i.e., that help distinguish actions). Through the temporal attention mechanism, the model can learn the importance (in the time dimension) of keypoint features across different standard action frames, and thus assign higher attention to the keypoint features of the frames that are important for action evaluation.
If the spatiotemporal attention mechanism comprises a spatial attention mechanism, this step may comprise: processing different keypoint features within the same standard action frame using the spatial attention mechanism. Specifically, weights for the different keypoint features may be obtained based on the spatial relations between them, and each keypoint feature may be multiplied by its corresponding weight to obtain the processing result.
If the spatiotemporal attention mechanism comprises a temporal attention mechanism, this step may comprise: processing the keypoint features of different standard action frames using the temporal attention mechanism. Specifically, weights for the different standard action frames may be obtained based on the relations between them, and the keypoint features of each frame may be multiplied by the corresponding weight to obtain the processing result.
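A minimal sketch of this weighting scheme follows, using plain dot-product attention for both dimensions; the embodiment itself uses a graph attention network for the spatial dimension and a self-attention module for the temporal dimension, so the sketch is illustrative only, and the tensor sizes are assumptions:

```python
import torch
import torch.nn.functional as F

def attention_weighting(x):
    """Weight the rows of x by pairwise dot-product relation scores.

    x: (N, C) -- the N keypoints of one frame (spatial case)
                 or the N frames of one keypoint (temporal case).
    """
    scores = x @ x.transpose(0, 1) / x.shape[-1] ** 0.5  # (N, N) relation scores
    weights = F.softmax(scores, dim=-1)                  # attention weights
    return weights @ x                                   # weighted features, (N, C)

T, V, C = 30, 17, 64
seq = torch.randn(T, V, C)

# Spatial attention: weight the V keypoint features within each frame.
spatial_out = torch.stack([attention_weighting(frame) for frame in seq])

# Temporal attention: weight the keypoint features across the T frames.
temporal_out = torch.stack([attention_weighting(seq[:, v]) for v in range(V)], dim=1)
```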
In addition, the action evaluation model may include a long short-term memory network (LSTM or BiLSTM). To improve the effect of the temporal attention mechanism, before the keypoint features of different standard action frames are processed with the temporal attention mechanism, they are processed with the long short-term memory network so as to learn the position information of the standard action segment in the time dimension (i.e., the order of the standard action frames).
If the action evaluation model includes both the temporal attention mechanism and the spatial attention mechanism, or additionally the long short-term memory network, the order in which they process the keypoint feature sequence is not limited. For example, the keypoint feature sequence may be processed by the temporal attention mechanism, the spatial attention mechanism, and the long short-term memory network in turn.
In addition, the action evaluation model may further include a pooling layer and a fully connected layer. The method may then further comprise: pooling the action features of the standard action segment using the pooling layer; and processing the pooled action features of the standard action segment using the fully connected layer.
S14: and adjusting the parameters of the action evaluation model based on the action characteristics of the standard action segment.
Referring also to fig. 2, this step can be divided into the following sub-steps:
S141: Obtaining a loss function over the action features of the standard action segments.
The loss function is used to measure the inter-class distance and the intra-class distance of the action features of the standard action segments.
The inter-class distance is the distance between action features belonging to different classes of actions, and the intra-class distance is the distance between action features belonging to the same class of action.
S142: parameters of the motion estimation model are adjusted based on the loss function.
The processing effect of the action evaluation model is evaluated based on the loss function, and the parameters of the model are adjusted accordingly, so that the distance between action features of the same class (the intra-class distance) gradually decreases and the distance between action features of different classes (the inter-class distance) gradually increases, until a preset condition is met. The preset condition may be that the training effect, the number of training iterations, the training time, or the like reaches an expected value.
A triplet loss function is a loss function that can measure both the inter-class distance and the intra-class distance. Each standard action segment carries a label indicating the action to which it belongs.
When the loss function is a triplet loss function, one of the plurality of classes of actions may be selected as the target action. The features of a plurality of standard action segments are grouped based on the labels, with each group including the features of three standard action segments, two of which belong to the target action; the triplet loss among the features of the three standard action segments in each group is then obtained. The triplet loss may be computed as follows:
$$L_{\mathrm{triplet}} = \max\left(0,\ \|f(X) - f(X^{+})\|^{2} - \|f(X) - f(X^{-})\|^{2} + \alpha_{\mathrm{margin}}\right)$$

where $L_{\mathrm{triplet}}$ denotes the triplet loss; $f(X)$, $f(X^{+})$, and $f(X^{-})$ denote the action features of the standard action segments Anchor, Positive, and Negative, respectively; Anchor and Positive belong to the target action; and $\alpha_{\mathrm{margin}}$ is a hyperparameter used to control the distance between action features of different classes.
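A minimal PyTorch sketch of this loss, assuming batched feature vectors (PyTorch's built-in `nn.TripletMarginLoss` is similar but uses the unsquared distance by default); the margin value and feature dimension are assumptions:

```python
import torch

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||f(X) - f(X+)||^2 - ||f(X) - f(X-)||^2 + margin), averaged."""
    d_pos = (anchor - positive).pow(2).sum(dim=-1)  # intra-class distance (squared)
    d_neg = (anchor - negative).pow(2).sum(dim=-1)  # inter-class distance (squared)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

a, p, n = (torch.randn(8, 100) for _ in range(3))   # a batch of 100-d action features
loss = triplet_loss(a, p, n)
```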
The action evaluation model trained with the triplet loss function is suitable for evaluating action segments of the target action. Moreover, to meet the requirements of practical action evaluation, the action class that the trained model is applied to can be changed by changing the class of the target action during training.
By implementing this embodiment, during training of the action evaluation model, the spatiotemporal attention mechanism processes the keypoint feature sequence, so that the relations between different keypoint features in the sequence are taken into account, and the processing effect of the action evaluation model is evaluated based on the intra-class and inter-class distances of the action features it produces. The action features extracted by the trained model are therefore easy to distinguish, which benefits subsequent action evaluation. The method provided by the present application can thus improve the effect of subsequent action evaluation.
The training method for the action evaluation model provided by the present application is illustrated below with an example:
1) Obtain the keypoint feature sequence (T, V, C) of each standard action segment using a pose estimation algorithm.
2) Send (T, V, C) into the action evaluation model (ST-Attention). Referring to fig. 3, fig. 3 is a schematic structural diagram of the action evaluation model of the present application. As shown in fig. 3, the model includes, connected in sequence, a first spatiotemporal attention layer (ST-BiLSTM-Attention Layer), six second spatiotemporal attention layers (ST-Attention Layer), a global pooling layer (Global Pooling), and a fully connected layer (Fully Connected). The ST-BiLSTM-Attention Layer includes a graph attention network (GAT), a BiLSTM, and a self-attention module (Self-Attention) connected in sequence; each ST-Attention Layer includes a GAT and a Self-Attention module connected in sequence.
Thus, in the ST-BiLSTM-Attention Layer, the GAT processes (T, V, C) in the spatial dimension to update (T, V, C); (T, V, C) is then transposed to the time dimension to obtain (V, T, C); the BiLSTM and the Self-Attention module process (V, T, C) in the time dimension in turn to update (V, T, C); and (V, T, C) is sent to the first ST-Attention Layer.
Since the ST-BiLSTM-Attention Layer has already learned the position information of the time dimension, the ST-Attention Layer no longer includes a BiLSTM. In each ST-Attention Layer, (V, T, C) is transposed back to the spatial dimension to obtain (T, V, C); the GAT processes (T, V, C) in the spatial dimension to update it; (T, V, C) is transposed to the time dimension to obtain (V, T, C); the Self-Attention module processes (V, T, C) in the time dimension to update it; and (V, T, C) is finally sent to Global Pooling.
Global Pooling pools (V, T, C), and Fully Connected processes the pooled result to obtain the final output, i.e., the action features of the standard action segment.
Throughout this process, only the C dimension changes, and it is controlled by the GATs. The ST-BiLSTM-Attention Layer increases C from 2 to 64; C is then doubled at the 3rd and 5th ST-Attention Layers (to 128 and 256, respectively), at which point the feature dimensions are (T, V, 256); global pooling reduces this to a 256-dimensional vector; and the fully connected layer finally produces a 100-dimensional deep feature vector.
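A simplified structural sketch of this forward pass follows. It mirrors the layer layout and dimension handling described above, but the GAT is replaced by a plain linear layer (which only updates the channel dimension, not the graph structure) and standard PyTorch multi-head attention stands in for the Self-Attention module, so this is an illustration of the data flow rather than the claimed model:

```python
import torch
import torch.nn as nn

class STAttentionLayer(nn.Module):
    """Spatial processing, then temporal processing, on a (T, V, C) tensor."""
    def __init__(self, c_in, c_out, use_bilstm=False):
        super().__init__()
        self.spatial = nn.Linear(c_in, c_out)   # stand-in for the GAT; updates C
        self.bilstm = (nn.LSTM(c_out, c_out // 2, bidirectional=True,
                               batch_first=True) if use_bilstm else None)
        self.temporal = nn.MultiheadAttention(c_out, num_heads=4, batch_first=True)

    def forward(self, x):                        # x: (T, V, C)
        x = self.spatial(x)                      # spatial dimension, C -> c_out
        x = x.transpose(0, 1)                    # transpose to time dim: (V, T, C)
        if self.bilstm is not None:              # learn time-dimension position info
            x, _ = self.bilstm(x)
        x, _ = self.temporal(x, x, x)            # temporal self-attention
        return x.transpose(0, 1)                 # back to (T, V, C)

class STAttention(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [64, 64, 64, 128, 128, 256, 256]  # C doubles at layers 3 and 5
        self.layers = nn.ModuleList(
            [STAttentionLayer(2, 64, use_bilstm=True)] +   # ST-BiLSTM-Attention
            [STAttentionLayer(chans[i], chans[i + 1]) for i in range(6)])
        self.fc = nn.Linear(256, 100)            # 100-d deep feature vector

    def forward(self, x):                        # x: (T, V, 2)
        for layer in self.layers:
            x = layer(x)
        x = x.mean(dim=(0, 1))                   # global pooling over T and V
        return self.fc(x)

feat = STAttention()(torch.randn(30, 17, 2))     # -> (100,)
```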
The action evaluation model obtained by the above training method can be used in the action evaluation process for the target action. Referring to fig. 4, fig. 4 is a schematic flow chart of an embodiment of the action evaluation method of the present application. As shown in fig. 4, this embodiment may include:
s21: and acquiring the action segment to be evaluated and the target action segment.
The action segment to be evaluated and the target action segment both belong to a target action, the action segment to be evaluated comprises a plurality of action frames to be evaluated, and the target action segment comprises a plurality of target action frames.
S22: and extracting the key point characteristics of each action frame to be evaluated to form a key point characteristic sequence of the action segment to be evaluated, and extracting the key point characteristics of each target action frame to form a key point characteristic sequence of the target action segment.
S23: and respectively processing the key point feature sequence of the action fragment to be evaluated and the key point feature sequence of the target action fragment by utilizing an action evaluation network.
S24: and obtaining an action evaluation result of the action fragment to be evaluated based on the similarity between the processed key point feature sequence of the action fragment to be evaluated and the processed key point feature sequence of the target action fragment.
For example, the cosine similarity between the processed keypoint feature sequence of the action segment to be evaluated and that of the target action segment may be obtained. The similarity is positively correlated with how standard the action segment to be evaluated is; since both segments belong to the target action, the similarity takes a value between 0 and 1 according to that degree.
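A minimal sketch of this scoring step, assuming the network outputs one feature vector per segment; clamping negative cosine values to 0 is an assumption made here so the score lands in [0, 1] as described:

```python
import torch
import torch.nn.functional as F

def action_score(student_feat, teacher_feat):
    """Cosine similarity between two action feature vectors, clamped to [0, 1]."""
    sim = F.cosine_similarity(student_feat, teacher_feat, dim=-1)
    return sim.clamp(0.0, 1.0).item()

# Feature vectors as produced by the action evaluation network (e.g., 100-d).
score = action_score(torch.randn(100), torch.randn(100))
print(f"action evaluation score: {score:.3f}")
```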
For further details of this embodiment, please refer to the description of the previous embodiment, which is not repeated herein.
By implementing this embodiment, the action evaluation model obtained by the above training method is used, and therefore the accuracy of the action evaluation result can be improved.
The training process and the action evaluation process of the present application are described below with an example.
Referring to fig. 5, during training, the standard action segments in the training set include target actions (Anchor, Positive) and non-target actions (Negative). Pose estimation is performed on Anchor, Positive, and Negative to obtain the corresponding keypoint feature sequences; the sequences are then processed by the action evaluation network ST-Attention to obtain the action features of Anchor, Positive, and Negative; the triplet loss (Triplet-Loss) is computed from these action features; and the parameters of ST-Attention are adjusted based on the Triplet-Loss. The ST-Attention parameters obtained from the last training iteration are saved and applied in the action evaluation process.
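A compact sketch of this training loop, reusing the `STAttention` and `triplet_loss` sketches above; `triplet_batches` is a hypothetical data source yielding (Anchor, Positive, Negative) keypoint sequences, and the optimizer and learning rate are assumptions:

```python
import torch

model = STAttention()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# triplet_batches: hypothetical iterable of (T, V, 2) keypoint sequences,
# where anchor_seq and pos_seq belong to the target action.
for anchor_seq, pos_seq, neg_seq in triplet_batches:
    f_a, f_p, f_n = model(anchor_seq), model(pos_seq), model(neg_seq)
    loss = triplet_loss(f_a, f_p, f_n)      # Triplet-Loss over action features
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

torch.save(model.state_dict(), "st_attention.pt")  # reuse for action evaluation
```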
In the action evaluation process, pose estimation is performed on the segment to be evaluated (Student) and the target action segment (Teacher) to obtain the corresponding keypoint feature sequences; the sequences are processed by the action evaluation network ST-Attention to obtain the action features of Student and Teacher; the cosine similarity between the action features of Student and Teacher is computed; and the action evaluation result for Student is obtained from the cosine similarity.
Fig. 6 is a schematic structural diagram of an embodiment of an electronic device of the present application. As shown in fig. 6, the electronic device may include a processor 31 and a memory 32 coupled to the processor 31.
Wherein the memory 32 stores program instructions for implementing the method of any of the above embodiments; the processor 31 is operative to execute program instructions stored by the memory 32 to implement the steps of the above-described method embodiments. The processor 31 may also be referred to as a CPU (Central Processing Unit). The processor 31 may be an integrated circuit chip having signal processing capabilities. The processor 31 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor 31 may be any conventional processor or the like.
FIG. 7 is a schematic structural diagram of an embodiment of a computer-readable storage medium of the present application. As shown in fig. 7, the computer-readable storage medium 40 of the embodiment of the present application stores program instructions 41, and when executed, the program instructions 41 implement the method provided by the above-mentioned embodiment of the present application. The program instructions 41 may form a program file stored in the computer-readable storage medium 40 in the form of a software product, so as to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute all or part of the steps of the methods according to the embodiments of the present application. And the aforementioned computer-readable storage medium 40 includes: various media capable of storing program codes, such as a usb disk, a mobile hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, or terminal devices, such as a computer, a server, a mobile phone, and a tablet.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. The above embodiments are merely examples and are not intended to limit the scope of the present disclosure, and all modifications, equivalents, and flow charts using the contents of the specification and drawings of the present disclosure or those directly or indirectly applied to other related technical fields are intended to be included in the scope of the present disclosure.