Classroom action recognition method based on double-scale space-time block mutual attention
1. A classroom action recognition method based on dual-scale space-time block mutual attention, characterized in that the method first obtains high-definition classroom student video data and then carries out the following operations:
step (1) preprocessing the high-definition classroom student video data to obtain student action video frame sequences;
step (2) constructing a dual-scale feature embedding module, whose input is a student action video frame sequence and whose output is a dual-scale space-time feature representation;
step (3) constructing a space-time block mutual attention encoder, whose input is the dual-scale space-time feature representation and whose output is dual-scale classification vectors;
step (4) constructing a classroom action classification module, whose input is the dual-scale classification vectors and whose output is an action class probability vector;
step (5) iteratively training the action recognition model consisting of the dual-scale feature embedding module, the space-time block mutual attention encoder and the classroom action classification module until the model converges;
step (6) preprocessing a new classroom student video, inputting its first frame image into a pre-trained target detection model to obtain student bounding boxes, obtaining the corresponding video frame sequences according to the student bounding boxes, inputting these video frame sequences into the trained action recognition model, and finally outputting the class of each student action.
2. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 1, wherein the step (1) is specifically as follows:
(1-1) processing each high-definition classroom student video into a corresponding video frame sequence at a sampling rate of k frames per second, and labeling student position bounding boxes in the high-definition classroom student video frames at a time interval of 60k frames to obtain a high-definition classroom student image data set, wherein k is 15-30;
(1-2) for each student position bounding box, intercepting the 60k frames of images within the bounding box region using the matrix indexing method of OpenCV (open source computer vision library), and scaling the height and width to the same resolution to obtain a student action video frame sequence V = {f_1, …, f_T} in the real number domain, where the action category label is b, b = 1, …, B, B is the total number of action categories, f_i is the i-th image in the frame sequence, an RGB three-channel image of height H and width W, and T is the total number of frames, i.e., T = 60k.
3. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 2, wherein the step (2) is specifically as follows:
(2-1) the dual-scale feature embedding module consists of a three-dimensional convolution layer, a three-dimensional average pooling layer, a feature blocking operation and a linear embedding layer;
(2-2) inputting the student action video frame sequence V into the three-dimensional convolution layer to obtain space-time features, and then feeding the space-time features into the three-dimensional average pooling layer to obtain the pooled space-time feature, where h, w, c and t are respectively its height, width, channel and temporal dimensions;
(2-3) applying feature blocking operations at the L×L and S×S scales to the height and width dimensions of the pooled space-time feature, and mapping the features of each block through the linear embedding layer to obtain the large-scale block feature vector and the small-scale block feature vector of the p-th block at time t, where D denotes the dimension of the feature vectors, L and S are the block scale sizes, L = γS, and γ > 0 is the scale multiple;
splicing the two sets of block feature vectors respectively to obtain the large-scale space-time feature matrix X^l and the small-scale space-time feature matrix X^s, where [·, …, ·] denotes the splicing operation, the total number of large-scale spatial feature blocks is hw/L^2, and the total number of small-scale spatial feature blocks is hw/S^2; the dual-scale space-time feature representation {X^l, X^s} is output.
4. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 3, wherein the step (3) is specifically as follows:
(3-1) the space-time block mutual attention encoder is composed of R space-time block mutual attention modules connected in series, and each space-time block mutual attention module consists of a space-time block generation submodule, a space-time attention submodule and a scale mutual attention submodule; its input is the dual-scale space-time feature representation {X^l, X^s};
(3-2) the input of the r-th space-time block mutual attention module is the dual-scale space-time feature tensor, consisting of the input large-scale space-time feature matrix Z^{r,l}, the input small-scale space-time feature matrix Z^{r,s}, and the input large-scale classification vector and small-scale classification vector;
the output of the r-th space-time block mutual attention module is the dual-scale mutual attention feature tensor, consisting of the output large-scale mutual attention feature matrix and the output small-scale mutual attention feature matrix, which are in turn composed of the output large-scale and small-scale classification vectors together with the output large-scale and small-scale space-time feature matrices;
when r = 1, the input large-scale space-time feature matrix is X^l, the input small-scale space-time feature matrix is X^s, and the large-scale classification vector and small-scale classification vector are obtained by random initialization;
when R ≥ r > 1, the input dual-scale space-time feature tensor is the dual-scale mutual attention feature tensor output by the previous space-time block mutual attention module, i.e., the input of the r-th module equals the output of the (r-1)-th module;
the output of the space-time block mutual attention encoder is the dual-scale classification vectors output by the R-th space-time block mutual attention module;
(3-3) the space-time block generation submodule of the r-th dual-scale space-time block mutual attention module regroups the input Z^{r,l} and Z^{r,s} into a large-scale feature map and a small-scale feature map of uniform size, with height dimension h_r and width dimension w_r;
according to the height dimension h_r, width dimension w_r and time dimension t_r, the large-scale feature map is partitioned in space and time to obtain the r-th group of large-scale space-time block feature tensors, where j is the index of a large-scale space-time block and Q_r is the total number of space-time blocks in the r-th group; the size of the r-th group of space-time blocks is λ times that of the (r-1)-th group, λ > 0, for r ≥ 2;
the dimensions are then transformed to obtain the space-time feature matrix of each large-scale space-time block, where the total number of spatial feature blocks of a large-scale space-time block is n_l = h_r·w_r;
the large-scale space-time block classification vector and the space-time feature matrix are spliced to obtain the updated j-th large-scale space-time block feature tensor element of the r-th group;
the same operations yield the updated small-scale space-time block feature tensor elements, where the total number of spatial feature blocks of a small-scale space-time block is n_s = h_r·w_r·γ^2;
this gives the r-th group dual-scale space-time block feature tensors for the large scale and the small scale;
(3-4) the input of the space-time attention submodule of the r-th dual-scale space-time block mutual attention module is the output of the space-time block generation submodule; the j-th large-scale space-time block feature tensor element of the r-th group is linearly mapped to obtain the query matrix, key matrix and value matrix at each attention head, where the attention head index is a = 1, …, A, A is the total number of attention heads, and the vectors in the mapping matrices have the per-head dimension; the corresponding multi-head space-time self-attention weight features are computed by applying Softmax(·), the normalized exponential function, to the scaled dot products of the queries and keys and weighting the values accordingly;
using learnable parameters and a residual structure, the large-scale space-time block space-time attention feature matrix is computed;
this matrix is decomposed to obtain the updated large-scale space-time block classification vector and the large-scale space-time block space-time feature matrix, where MLP(·) denotes a multilayer perceptron and LN(·) denotes layer normalization;
the same operations yield the small-scale space-time block space-time attention feature matrix;
thereby the r-th group of space-time block space-time attention feature tensors for both scales are obtained;
(3-5) the input of the scale mutual attention submodule of the r-th dual-scale space-time block mutual attention module is the output of the space-time attention submodule, namely, for the j-th block of the r-th group, the dual-scale space-time block classification vectors and the dual-scale space-time block space-time feature matrices;
the large-scale space-time block classification vector is linearly mapped to obtain the query vector; the large-scale space-time block classification vector together with the small-scale space-time block space-time feature matrix is linearly mapped to obtain the key matrix and the value matrix; and the multi-head space-time mutual attention weight features are computed;
using learnable parameters and a residual structure, the updated large-scale space-time block classification vector is computed;
thereby all large-scale space-time block classification vectors of the r-th group are obtained, and they are linearly mapped to obtain the updated large-scale classification vector;
all large-scale space-time block space-time feature matrices of the r-th group are spliced to obtain the large-scale space-time feature matrix, which is spliced with the large-scale classification vector to obtain the large-scale mutual attention feature matrix;
the same operations yield the small-scale classification vector and the small-scale mutual attention feature matrix;
the output of the r-th space-time block mutual attention module is the dual-scale mutual attention feature tensor.
5. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 4, wherein the step (4) is specifically as follows:
(4-1) the input of the classroom action classification module is the dual-scale classification vectors output by the dual-scale space-time block mutual attention encoder; multilayer perceptrons are used to compute, respectively, the large-scale score vector and the small-scale score vector of the action categories to which the student action belongs;
(4-2) the action class probability vector obtained by fusing the two score vectors is output.
6. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 5, wherein the step (5) is specifically as follows:
(5-1) the action recognition model is composed of the dual-scale feature embedding module of step (2), the dual-scale space-time block mutual attention encoder of step (3) and the classroom action classification module of step (4);
(5-2) the input of the action recognition model is a student action video frame sequence V; the dual-scale feature embedding module computes and outputs the dual-scale space-time feature matrices X^l and X^s; the dual-scale space-time feature matrices are input into the dual-scale space-time block mutual attention encoder, which outputs the dual-scale classification vectors; the dual-scale classification vectors are input into the action classification module, which outputs the probability vector of the action class to which the student action belongs;
(5-3) iteratively training the action recognition model until it converges: the loss function of the action recognition model is set to the cross-entropy loss -Σ_{b=1}^{B} ŷ_b·log(y_b); the model is optimized with a stochastic gradient descent algorithm and the model parameters are updated by back-propagating gradients until the loss converges; here y_b is the predicted probability that the student action belongs to action class b, and ŷ_b is the ground-truth label, with ŷ_b = 1 if the action category of the classroom student video is b and ŷ_b = 0 otherwise.
7. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 6, wherein step (6) is specifically:
(6-1) the high-definition classroom student image data set annotated with student position bounding boxes is input into the target detection model YOLOv5 pre-trained on the COCO2017 data set, and the model is trained iteratively until it converges, yielding the target detection model;
(6-2) for a new classroom student video, the video frame sequence is obtained as in (1-1); the first frame image is input into the target detection model to obtain the position bounding box of each student, and the action video frame sequence of each student is obtained as in (1-2), where φ is the index of a student, χ is the total number of students, and the i-th element of the φ-th student's frame sequence is an RGB three-channel image of height H and width W;
(6-3) the action video frame sequence of each student is input into the action recognition model trained in step (5) to obtain the action class probability vector y_φ of the φ-th student; the action category b' corresponding to the maximum probability value is taken as the category of the student's action, i.e., b' = argmax(y_φ), where argmax(·) returns the index of the largest element in a vector.
Background
The traditional offline classroom is the main place where students study and teachers teach; in recent years, online classes, especially during the epidemic, have become a popular mode among teachers and students, generally delivered by live streaming or pre-recorded lessons over the network. Whether in an offline classroom or an online class on a network platform, the quality of teaching directly affects students' learning outcomes. A common difficulty in practice is that, to ensure the quality of classroom teaching, teachers must spend considerable energy on classroom discipline management and cannot devote their full attention to teaching; this is especially evident in primary school classrooms. Therefore, video action recognition technology is introduced to recognize the actions of students in class, perceive their learning state in real time, and provide intelligent analysis reports reflecting classroom quality. The classroom action recognition task takes student action video frame sequences as input and outputs student action categories, and has wide application in scenarios such as classroom teaching, self-service class management and unmanned invigilation. For example, in an unmanned invigilation setting, the classroom action recognition method can recognize the actions of examinees in real time, and an examinee can be investigated if a suspected cheating action occurs, thereby maintaining examination discipline. The main challenges are: it is difficult to unify offline and online classroom action recognition methods, students at different distances appear in the same video frame, and performing action recognition for many students incurs a large computational overhead.
Currently, there are few practical applications of action recognition in classroom scenes, and existing methods are mainly based on wearable devices or skeleton information. However, wearable devices may cause discomfort and thus affect students' learning efficiency, while methods based on skeleton information can recognize only a small number of action types and their recognition performance is easily degraded by occlusion from objects such as desks, chairs and books. In addition, traditional action recognition methods need to encode video frames into hand-crafted features (such as HOG3D and 3D SURF), but hand-crafted features have strong limitations and slow extraction speed, and therefore cannot meet real-time requirements. In recent years, action recognition methods with a Convolutional Neural Network (CNN) at their core can learn, end to end, feature representations that reflect the latent semantic information of videos, greatly improving recognition accuracy. To extract more effective visual features, residual networks (ResNet) use residual connections between different layers of the network, alleviating problems such as overfitting and vanishing or exploding gradients when training deeper neural network models; Non-Local Networks capture long-range dependencies with non-local operations, relating pixel blocks at different distances in a video frame through an attention mechanism and mining the semantic information between them. Furthermore, Transformer models, which originate from natural language processing, have recently been favored in computer vision: their attention mechanism extracts diverse key temporal information from video frame sequences, allowing models to learn more discriminative feature representations.
Existing classroom action recognition technology still has many shortcomings. First, models are designed separately for offline or online classrooms, lacking a unified interface that fuses the two types of classroom action recognition. Second, during feature extraction, space-time attention is computed over the blocks of all video frames, which ignores the local nature of space-time features and lowers the recognition rate, and the computational cost becomes excessive when the video resolution is large. In addition, many methods extract only single-scale block space-time features, which makes it hard to adapt to students appearing at different scales in the picture. To address the lack of a local space-time feature information exchange mechanism and the need to adapt to student pictures of different scales, an efficient classroom action recognition method that unifies offline and online classrooms and improves student action recognition accuracy is urgently needed.
Disclosure of Invention
The object of the invention is to provide, in view of the shortcomings of the prior art, a classroom action recognition method based on dual-scale space-time block mutual attention, in which multiple groups of space-time blocks are modeled with space-time attention to capture multi-scale space-time information from videos of students in offline and online classrooms, and scale mutual attention is used to characterize student pictures of different scales, thereby improving the classroom action recognition rate.
The method firstly acquires high-definition classroom student video data, and then sequentially performs the following operations:
step (1) preprocessing the high-definition classroom student video data to obtain student action video frame sequences;
step (2) constructing a dual-scale feature embedding module, whose input is a student action video frame sequence and whose output is a dual-scale space-time feature representation;
step (3) constructing a space-time block mutual attention encoder, whose input is the dual-scale space-time feature representation and whose output is dual-scale classification vectors;
step (4) constructing a classroom action classification module, whose input is the dual-scale classification vectors and whose output is an action class probability vector;
step (5) iteratively training the action recognition model consisting of the dual-scale feature embedding module, the space-time block mutual attention encoder and the classroom action classification module until the model converges;
step (6) preprocessing a new classroom student video, inputting its first frame image into a pre-trained target detection model to obtain student bounding boxes, obtaining the corresponding video frame sequences according to the student bounding boxes, inputting these video frame sequences into the trained action recognition model, and finally outputting the class of each student action.
Further, the step (1) is specifically:
(1-1) processing each high-definition classroom student video into a corresponding video frame sequence at a sampling rate of k frames per second, and labeling student position bounding boxes in the high-definition classroom student video frames at a time interval of 60k frames to obtain a high-definition classroom student image data set, wherein k is 15-30;
(1-2) for each student position bounding box, intercepting the 60k frames of images within the bounding box region using the matrix indexing method of OpenCV (open source computer vision library), and scaling the height and width to the same resolution to obtain a student action video frame sequence V = {f_1, …, f_T} in the real number domain, where the action category label is b, b = 1, …, B, B is the total number of action categories, f_i is the i-th image in the frame sequence, an RGB three-channel image of height H and width W, and T is the total number of frames, i.e., T = 60k.
Still further, the step (2) is specifically:
(2-1) the dual-scale feature embedding module consists of a three-dimensional convolution layer, a three-dimensional average pooling layer, a feature blocking operation and a linear embedding layer;
(2-2) inputting the student action video frame sequence V into the three-dimensional convolution layer to obtain space-time features, and then feeding the space-time features into the three-dimensional average pooling layer to obtain the pooled space-time feature, where h, w, c and t are respectively its height, width, channel and temporal dimensions;
(2-3) applying feature blocking operations at the L×L and S×S scales to the height and width dimensions of the pooled space-time feature, and mapping the features of each block through the linear embedding layer to obtain the large-scale block feature vector and the small-scale block feature vector of the p-th block at time t, where D denotes the dimension of the feature vectors, L and S are the block scale sizes, L = γS, and γ > 0 is the scale multiple;
splicing the two sets of block feature vectors respectively to obtain the large-scale space-time feature matrix X^l and the small-scale space-time feature matrix X^s, where [·, …, ·] denotes the splicing operation, the total number of large-scale spatial feature blocks is hw/L^2, and the total number of small-scale spatial feature blocks is hw/S^2; the dual-scale space-time feature representation {X^l, X^s} is output.
Further, the step (3) is specifically:
(3-1) the space-time block mutual attention encoder is composed of R space-time block mutual attention modules connected in series, and each space-time block mutual attention module consists of a space-time block generation submodule, a space-time attention submodule and a scale mutual attention submodule; its input is the dual-scale space-time feature representation {X^l, X^s};
(3-2) the input of the r-th space-time block mutual attention module is the dual-scale space-time feature tensor, consisting of the input large-scale space-time feature matrix Z^{r,l}, the input small-scale space-time feature matrix Z^{r,s}, and the input large-scale classification vector and small-scale classification vector;
the output of the r-th space-time block mutual attention module is the dual-scale mutual attention feature tensor, consisting of the output large-scale mutual attention feature matrix and the output small-scale mutual attention feature matrix, which are in turn composed of the output large-scale and small-scale classification vectors together with the output large-scale and small-scale space-time feature matrices;
when r = 1, the input large-scale space-time feature matrix is X^l, the input small-scale space-time feature matrix is X^s, and the large-scale classification vector and small-scale classification vector are obtained by random initialization;
when R ≥ r > 1, the input dual-scale space-time feature tensor is the dual-scale mutual attention feature tensor output by the previous space-time block mutual attention module, i.e., the input of the r-th module equals the output of the (r-1)-th module;
the output of the space-time block mutual attention encoder is the dual-scale classification vectors output by the R-th space-time block mutual attention module;
(3-3) the space-time block generation submodule of the r-th dual-scale space-time block mutual attention module regroups the input Z^{r,l} and Z^{r,s} into a large-scale feature map and a small-scale feature map of uniform size, with height dimension h_r and width dimension w_r;
according to the height dimension h_r, width dimension w_r and time dimension t_r, the large-scale feature map is partitioned in space and time to obtain the r-th group of large-scale space-time block feature tensors, where j is the index of a large-scale space-time block and Q_r is the total number of space-time blocks in the r-th group; the size of the r-th group of space-time blocks is λ times that of the (r-1)-th group, λ > 0, for r ≥ 2;
the dimensions are then transformed to obtain the space-time feature matrix of each large-scale space-time block, where the total number of spatial feature blocks of a large-scale space-time block is n_l = h_r·w_r;
the large-scale space-time block classification vector and the space-time feature matrix are spliced to obtain the updated j-th large-scale space-time block feature tensor element of the r-th group;
the same operations yield the updated small-scale space-time block feature tensor elements, where the total number of spatial feature blocks of a small-scale space-time block is n_s = h_r·w_r·γ^2;
this gives the r-th group dual-scale space-time block feature tensors for the large scale and the small scale;
(3-4) the input of the space-time attention submodule of the r-th dual-scale space-time block mutual attention module is the output of the space-time block generation submodule; the j-th large-scale space-time block feature tensor element of the r-th group is linearly mapped to obtain the query matrix, key matrix and value matrix at each attention head, where the attention head index is a = 1, …, A, A is the total number of attention heads, and the vectors in the mapping matrices have the per-head dimension; the corresponding multi-head space-time self-attention weight features are computed by applying Softmax(·), the normalized exponential function, to the scaled dot products of the queries and keys and weighting the values accordingly;
using learnable parameters and a residual structure, the large-scale space-time block space-time attention feature matrix is computed;
this matrix is decomposed to obtain the updated large-scale space-time block classification vector and the large-scale space-time block space-time feature matrix, where MLP(·) denotes a multilayer perceptron and LN(·) denotes layer normalization;
the same operations yield the small-scale space-time block space-time attention feature matrix;
thereby the r-th group of space-time block space-time attention feature tensors for both scales are obtained;
(3-5) the input of the scale mutual attention submodule of the r-th dual-scale space-time block mutual attention module is the output of the space-time attention submodule, namely, for the j-th block of the r-th group, the dual-scale space-time block classification vectors and the dual-scale space-time block space-time feature matrices;
the large-scale space-time block classification vector is linearly mapped to obtain the query vector; the large-scale space-time block classification vector together with the small-scale space-time block space-time feature matrix is linearly mapped to obtain the key matrix and the value matrix; and the multi-head space-time mutual attention weight features are computed;
using learnable parameters and a residual structure, the updated large-scale space-time block classification vector is computed;
thereby all large-scale space-time block classification vectors of the r-th group are obtained, and they are linearly mapped to obtain the updated large-scale classification vector;
all large-scale space-time block space-time feature matrices of the r-th group are spliced to obtain the large-scale space-time feature matrix, which is spliced with the large-scale classification vector to obtain the large-scale mutual attention feature matrix;
the same operations yield the small-scale classification vector and the small-scale mutual attention feature matrix;
the output of the r-th space-time block mutual attention module is the dual-scale mutual attention feature tensor.
Still further, the step (4) is specifically:
(4-1) the input of the classroom action classification module is the dual-scale classification vectors output by the dual-scale space-time block mutual attention encoder; multilayer perceptrons are used to compute, respectively, the large-scale score vector and the small-scale score vector of the action categories to which the student action belongs;
(4-2) the action class probability vector obtained by fusing the two score vectors is output.
Still further, the step (5) is specifically:
(5-1) the action recognition model is composed of the dual-scale feature embedding module of step (2), the dual-scale space-time block mutual attention encoder of step (3) and the classroom action classification module of step (4);
(5-2) the input of the action recognition model is a student action video frame sequence V; the dual-scale feature embedding module computes and outputs the dual-scale space-time feature matrices X^l and X^s; the dual-scale space-time feature matrices are input into the dual-scale space-time block mutual attention encoder, which outputs the dual-scale classification vectors; the dual-scale classification vectors are input into the action classification module, which outputs the probability vector of the action class to which the student action belongs;
(5-3) iteratively training the action recognition model until it converges: the loss function of the action recognition model is set to the cross-entropy loss -Σ_{b=1}^{B} ŷ_b·log(y_b); the model is optimized with a stochastic gradient descent algorithm and the model parameters are updated by back-propagating gradients until the loss converges; here y_b is the predicted probability that the student action belongs to action class b, and ŷ_b is the ground-truth label, with ŷ_b = 1 if the action category of the classroom student video is b and ŷ_b = 0 otherwise.
Still further, the step (6) is specifically:
(6-1) the high-definition classroom student image data set annotated with student position bounding boxes is input into the target detection model YOLOv5 pre-trained on the COCO2017 data set, and the model is trained iteratively until it converges, yielding the target detection model;
(6-2) for a new classroom student video, the video frame sequence is obtained as in (1-1); the first frame image is input into the target detection model to obtain the position bounding box of each student, and the action video frame sequence of each student is obtained as in (1-2), where φ is the index of a student, χ is the total number of students, and the i-th element of the φ-th student's frame sequence is an RGB three-channel image of height H and width W;
(6-3) the action video frame sequence of each student is input into the action recognition model trained in step (5) to obtain the action class probability vector y_φ of the φ-th student; the action category b' corresponding to the maximum probability value is taken as the category of the student's action, i.e., b' = argmax(y_φ), where argmax(·) returns the index of the largest element in a vector.
The method of the invention uses a dual-scale space-time block mutual attention encoder to recognize student actions in student videos, and has the following characteristics: 1) unlike existing methods designed only for offline or only for online classes, the method first uses a target detection model to obtain the action frame sequence of each student and then recognizes each student's action category, so it can be applied in both offline and online classroom scenarios; 2) unlike existing methods that compute space-time attention over all video frame blocks at every feature extraction step, the method uses the space-time block generation submodule and the space-time attention submodule to extract space-time features within multiple groups of space-time blocks, realizing local space-time feature information exchange and greatly reducing the computational overhead; 3) the method blocks video frames at two different sizes and combines them with the scale mutual attention submodule, so it can better extract action information from student pictures of different scales in the video.
The method is suitable for action recognition in complex classroom scenes where many students participate and individual students appear at different picture scales, and has the following advantages: 1) it unifies the action recognition methods for offline and online classrooms, reducing the technical cost of applying action recognition to both types of class; 2) features of multiple distinct space-time regions are extracted through the space-time block generation submodule and the space-time attention submodule, fully exploiting the local nature of space-time features to obtain more accurate recognition categories and improve computational efficiency; 3) the scale mutual attention submodule learns from student pictures of different scales and fully fuses the space-time features of the two block scales to obtain better recognition performance. The invention has the ability to learn local space-time features and to capture the spatial characteristics of student pictures at different scales, and can improve the student action recognition rate in practical application scenarios such as classroom teaching supervision, self-service class management and unmanned invigilation.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
In a classroom action recognition method based on dual-scale space-time block mutual attention, classroom student videos are first sampled to obtain video frame sequences of the classroom students; a target detection model is used to obtain the bounding box of each student's position, and the frame images within each bounding box are cropped to obtain the student action video frame sequences; an action recognition model consisting of a dual-scale feature embedding module, a space-time block mutual attention encoder and a classroom action classification module is then constructed; finally, the action recognition model is used to decide the category of each student's action. The method uses the target detection model to obtain student action frame sequences for further recognition, so it can be used in both offline and online classes; it uses the space-time block generation submodule and the space-time attention submodule to extract space-time features within multiple groups of space-time blocks, realizing local space-time feature information exchange; and it uses two block scales together with the scale mutual attention submodule to capture action information at different scales, adapting to students appearing at different picture scales. A classroom action recognition system constructed in this way can be deployed uniformly for both types of class, while effectively extracting the spatio-temporal information of student action video frames and efficiently recognizing student action categories.
As shown in fig. 1, the method first obtains high definition classroom student video data, and then performs the following operations:
Step (1) preprocessing high-definition classroom student video data to obtain student action video frame sequences; the method comprises the following steps:
(1-1) processing each online or offline high-definition classroom student video into a corresponding video frame sequence at a sampling rate of 25 frames per second, and labeling student position bounding boxes in the high-definition classroom student video frames at a time interval of 1500 frames (one minute) to obtain a high-definition classroom student image data set;
(1-2) for each student position bounding box, intercepting the 1500 frames of images within the bounding box region using the matrix indexing method of OpenCV (open source computer vision library), and scaling the height and width to the same resolution to obtain a student action video frame sequence V = {f_1, …, f_T} in the real number domain, where the action category label is b, b = 1, …, B, B is the total number of action categories, f_i is the i-th image in the frame sequence, an RGB three-channel image of height H and width W, and T is the total number of frames, i.e., T = 1500.
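As an illustration of (1-1) and (1-2), the following Python sketch crops a student's bounding-box region from a video and resizes it to a fixed resolution with OpenCV; the function name crop_student_clip and the default values (fps, clip length, target size) are assumptions for illustration only, not part of the claimed method.

```python
import cv2
import numpy as np

def crop_student_clip(video_path, bbox, num_frames=1500, fps=25, target_size=(224, 224)):
    """Sample frames at roughly `fps` frames per second, crop the student bounding
    box (x1, y1, x2, y2) by matrix indexing, and resize to a common resolution.
    Names and default values are illustrative assumptions."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or fps
    step = max(int(round(native_fps / fps)), 1)
    x1, y1, x2, y2 = bbox
    frames = []
    idx = 0
    while len(frames) < num_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            patch = frame[y1:y2, x1:x2]             # matrix indexing: crop the bbox region
            patch = cv2.resize(patch, target_size)  # scale height and width to the same resolution
            frames.append(cv2.cvtColor(patch, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return np.stack(frames)  # shape (T, H, W, 3): the student action video frame sequence V
```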
Step (2) constructing a dual-scale feature embedding module, inputting a student action video frame sequence, and outputting a dual-scale space-time feature representation; the method comprises the following steps:
(2-1) the dual-scale feature embedding module consists of a three-dimensional convolution layer, a three-dimensional average pooling layer, a feature blocking operation and a linear embedding layer;
(2-2) inputting the student action video frame sequence V into the three-dimensional convolution layer to obtain space-time features, and then feeding the space-time features into the three-dimensional average pooling layer to obtain the pooled space-time feature, where h, w, c and t are respectively its height, width, channel and temporal dimensions;
(2-3) applying feature blocking operations at the L×L and S×S scales to the height and width dimensions of the pooled space-time feature, and mapping the features of each block through the linear embedding layer to obtain the large-scale block feature vector and the small-scale block feature vector of the p-th block at time t, where D denotes the dimension of the feature vectors, L and S are the block scale sizes, L = γS, and γ > 0 is the scale multiple;
splicing the two sets of block feature vectors respectively to obtain the large-scale space-time feature matrix X^l and the small-scale space-time feature matrix X^s, where [·, …, ·] denotes the splicing operation, the total number of large-scale spatial feature blocks is hw/L^2, and the total number of small-scale spatial feature blocks is hw/S^2; the dual-scale space-time feature representation {X^l, X^s} is output.
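The dual-scale feature embedding of step (2) could be sketched in PyTorch as follows; the channel count, embedding dimension D and block scales L and S (here `dim_d`, `large`, `small`) are illustrative assumptions, and the exact convolution and pooling configuration of the embodiment is not specified here.

```python
import torch
import torch.nn as nn

class DualScaleEmbedding(nn.Module):
    """3D conv -> 3D average pooling -> LxL and SxS blocking -> linear embedding."""
    def __init__(self, dim_d=192, large=16, small=8, channels=64):
        super().__init__()
        assert large % small == 0                      # L = gamma * S, gamma > 0
        self.large, self.small = large, small
        self.conv3d = nn.Conv3d(3, channels, kernel_size=3, stride=(1, 2, 2), padding=1)
        self.pool3d = nn.AvgPool3d(kernel_size=(2, 2, 2))
        self.embed_l = nn.Linear(channels * large * large, dim_d)
        self.embed_s = nn.Linear(channels * small * small, dim_d)

    def _blockify(self, feat, size, proj):
        # feat: (B, c, t, h, w) -> per-frame size x size blocks -> linear embedding
        b, c, t, h, w = feat.shape
        blocks = feat.unfold(3, size, size).unfold(4, size, size)
        blocks = blocks.permute(0, 2, 3, 4, 1, 5, 6).reshape(b, t, -1, c * size * size)
        return proj(blocks)                            # (B, t, number of blocks, D)

    def forward(self, video):                          # video: (B, 3, T, H, W)
        feat = self.pool3d(self.conv3d(video))         # pooled space-time feature
        x_l = self._blockify(feat, self.large, self.embed_l)   # large-scale matrix X^l
        x_s = self._blockify(feat, self.small, self.embed_s)   # small-scale matrix X^s
        return x_l, x_s
```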
Step (3) constructing a space-time block mutual attention encoder, inputting the dual-scale space-time feature representation, and outputting dual-scale classification vectors; the method comprises the following steps:
(3-1) the space-time block mutual attention encoder is composed of R space-time block mutual attention modules connected in series, and each space-time block mutual attention module consists of a space-time block generation submodule, a space-time attention submodule and a scale mutual attention submodule; its input is the dual-scale space-time feature representation {X^l, X^s};
(3-2) the input of the r-th space-time block mutual attention module is the dual-scale space-time feature tensor, consisting of the input large-scale space-time feature matrix Z^{r,l}, the input small-scale space-time feature matrix Z^{r,s}, and the input large-scale classification vector and small-scale classification vector;
the output of the r-th space-time block mutual attention module is the dual-scale mutual attention feature tensor, consisting of the output large-scale mutual attention feature matrix and the output small-scale mutual attention feature matrix, which are in turn composed of the output large-scale and small-scale classification vectors together with the output large-scale and small-scale space-time feature matrices;
when r = 1, the input large-scale space-time feature matrix is X^l, the input small-scale space-time feature matrix is X^s, and the large-scale classification vector and small-scale classification vector are obtained by random initialization;
when R ≥ r > 1, the input dual-scale space-time feature tensor is the dual-scale mutual attention feature tensor output by the previous space-time block mutual attention module, i.e., the input of the r-th module equals the output of the (r-1)-th module;
the output of the space-time block mutual attention encoder is the dual-scale classification vectors output by the R-th space-time block mutual attention module;
(3-3) the space-time block generation submodule of the r-th dual-scale space-time block mutual attention module regroups the input Z^{r,l} and Z^{r,s} into a large-scale feature map and a small-scale feature map of uniform size, with height dimension h_r and width dimension w_r;
according to the height dimension h_r, width dimension w_r and time dimension t_r, the large-scale feature map is partitioned in space and time to obtain the r-th group of large-scale space-time block feature tensors, where j is the index of a large-scale space-time block and Q_r is the total number of space-time blocks in the r-th group; the size of the r-th group of space-time blocks is λ times that of the (r-1)-th group, λ > 0, for r ≥ 2;
the dimensions are then transformed to obtain the space-time feature matrix of each large-scale space-time block, where the total number of spatial feature blocks of a large-scale space-time block is n_l = h_r·w_r;
the large-scale space-time block classification vector and the space-time feature matrix are spliced to obtain the updated j-th large-scale space-time block feature tensor element of the r-th group;
the same operations yield the updated small-scale space-time block feature tensor elements, where the total number of spatial feature blocks of a small-scale space-time block is n_s = h_r·w_r·γ^2;
this gives the r-th group dual-scale space-time block feature tensors for the large scale and the small scale.
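One possible reading of the space-time block generation in (3-3) is sketched below: the token sequence of one scale is regrouped into a (t, h, w) feature map and partitioned into non-overlapping space-time blocks; the function name and the block-size parameters are assumptions, and the per-block classification vectors described above would be spliced onto each block before attention.

```python
import torch

def generate_spacetime_blocks(tokens, t, h, w, block_t, block_h, block_w):
    """Regroup a (B, t*h*w, D) token sequence into a (t, h, w) feature map and
    partition it into non-overlapping space-time blocks of size
    (block_t, block_h, block_w).  All names are illustrative assumptions.
    Returns a tensor of shape (B, Q, n, D) with Q blocks of n tokens each."""
    b, _, d = tokens.shape
    feat = tokens.reshape(b, t, h, w, d)
    feat = feat.reshape(b, t // block_t, block_t,
                        h // block_h, block_h,
                        w // block_w, block_w, d)
    # gather the three block indices together and the intra-block positions together
    feat = feat.permute(0, 1, 3, 5, 2, 4, 6, 7)
    q = (t // block_t) * (h // block_h) * (w // block_w)   # total number of blocks Q_r
    n = block_t * block_h * block_w                        # tokens inside one block
    return feat.reshape(b, q, n, d)
```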
(3-4) the input of the space-time attention submodule of the r-th dual-scale space-time block mutual attention module is the output of the space-time block generation submodule; the j-th large-scale space-time block feature tensor element of the r-th group is linearly mapped to obtain the query matrix, key matrix and value matrix at each attention head, where the attention head index is a = 1, …, A, A is the total number of attention heads, and the vectors in the mapping matrices have the per-head dimension; the corresponding multi-head space-time self-attention weight features are computed by applying Softmax(·), the normalized exponential function, to the scaled dot products of the queries and keys and weighting the values accordingly;
using learnable parameters and a residual structure, the large-scale space-time block space-time attention feature matrix is computed;
this matrix is decomposed to obtain the updated large-scale space-time block classification vector and the large-scale space-time block space-time feature matrix, where MLP(·) denotes a multilayer perceptron and LN(·) denotes layer normalization;
the same operations yield the small-scale space-time block space-time attention feature matrix;
thereby the r-th group of space-time block space-time attention feature tensors for both scales are obtained.
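The space-time attention submodule of (3-4) amounts to multi-head self-attention applied independently inside every space-time block, followed by residual LN/MLP updates. The PyTorch sketch below assumes nn.MultiheadAttention as the attention primitive and treats token 0 of each block as its classification vector; both choices are assumptions of this sketch, not a statement of the exact embodiment.

```python
import torch
import torch.nn as nn

class BlockSpatioTemporalAttention(nn.Module):
    """Multi-head self-attention computed independently inside every space-time
    block, followed by LN/MLP with residual connections, as in (3-4)."""
    def __init__(self, dim_d=192, heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim_d)
        self.attn = nn.MultiheadAttention(dim_d, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim_d)
        self.mlp = nn.Sequential(nn.Linear(dim_d, mlp_ratio * dim_d), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim_d, dim_d))

    def forward(self, blocks):                    # blocks: (B, Q, n+1, D); token 0 is the block cls vector
        b, q, tokens_per_block, d = blocks.shape
        x = blocks.reshape(b * q, tokens_per_block, d)   # attention is computed per space-time block
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]  # residual self-attention
        x = x + self.mlp(self.norm2(x))                    # residual MLP
        x = x.reshape(b, q, tokens_per_block, d)
        cls, feats = x[:, :, :1], x[:, :, 1:]     # decompose into block cls vector and feature matrix
        return cls, feats
```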
(3-5) the input of the scale mutual attention submodule of the r-th dual-scale space-time block mutual attention module is the output of the space-time attention submodule, namely, for the j-th block of the r-th group, the dual-scale space-time block classification vectors and the dual-scale space-time block space-time feature matrices;
the large-scale space-time block classification vector is linearly mapped to obtain the query vector; the large-scale space-time block classification vector together with the small-scale space-time block space-time feature matrix is linearly mapped to obtain the key matrix and the value matrix; and the multi-head space-time mutual attention weight features are computed;
using learnable parameters and a residual structure, the updated large-scale space-time block classification vector is computed;
thereby all large-scale space-time block classification vectors of the r-th group are obtained, and they are linearly mapped to obtain the updated large-scale classification vector;
all large-scale space-time block space-time feature matrices of the r-th group are spliced to obtain the large-scale space-time feature matrix, which is spliced with the large-scale classification vector to obtain the large-scale mutual attention feature matrix;
the same operations yield the small-scale classification vector and the small-scale mutual attention feature matrix;
the output of the r-th space-time block mutual attention module is the dual-scale mutual attention feature tensor.
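The scale mutual attention of (3-5) can be illustrated as a cross-attention in which the classification vector of one scale queries the tokens of the other scale and is updated through a residual connection; the precise projection layout in the sketch below is an assumption.

```python
import torch
import torch.nn as nn

class ScaleMutualAttention(nn.Module):
    """Cross-scale attention as in (3-5): the classification vector of one scale
    is the query; the tokens of the other scale provide the keys and values."""
    def __init__(self, dim_d=192, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim_d)
        self.attn = nn.MultiheadAttention(dim_d, heads, batch_first=True)

    def forward(self, cls_a, tokens_b):
        # cls_a: (B, 1, D) classification vector of one scale (the query)
        # tokens_b: (B, N, D) feature tokens of the other scale (keys / values)
        kv = self.norm(torch.cat([cls_a, tokens_b], dim=1))
        out, _ = self.attn(self.norm(cls_a), kv, kv, need_weights=False)
        return cls_a + out        # residual update of the classification vector
```

In the module itself, the updated per-block large-scale classification vectors are then linearly mapped into the updated large-scale classification vector and spliced with the large-scale space-time feature matrix, and the symmetric computation updates the small-scale side, as described above.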
Step (4) constructing a classroom action classification module, whose input is the dual-scale classification vectors and whose output is an action class probability vector; the method comprises the following steps:
(4-1) the input of the classroom action classification module is the dual-scale classification vectors output by the dual-scale space-time block mutual attention encoder; multilayer perceptrons are used to compute, respectively, the large-scale score vector and the small-scale score vector of the action categories to which the student action belongs;
(4-2) the action class probability vector obtained by fusing the two score vectors is output.
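A hedged sketch of the classroom action classification module of step (4) follows: two multilayer-perceptron heads score the dual-scale classification vectors, and the two score vectors are fused into the action class probability vector. Averaging before the softmax is an assumed fusion rule; the embodiment does not spell out the fusion formula.

```python
import torch
import torch.nn as nn

class ClassroomActionHead(nn.Module):
    """Step (4): two MLP heads score the dual-scale classification vectors;
    averaging the two score vectors before softmax is an assumed fusion rule."""
    def __init__(self, dim_d=192, num_classes=10):
        super().__init__()
        self.head_l = nn.Sequential(nn.LayerNorm(dim_d), nn.Linear(dim_d, num_classes))
        self.head_s = nn.Sequential(nn.LayerNorm(dim_d), nn.Linear(dim_d, num_classes))

    def forward(self, cls_l, cls_s):         # (B, D) classification vectors of both scales
        score_l = self.head_l(cls_l)         # large-scale score vector
        score_s = self.head_s(cls_s)         # small-scale score vector
        return torch.softmax((score_l + score_s) / 2, dim=-1)   # action class probability vector y
```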
Step (5) performing iterative training on an action recognition model consisting of a double-scale feature embedding module, a space-time block mutual attention encoder and a classroom action classification module until the model is converged; the method comprises the following steps:
(5-1) the action recognition model is composed of the dual-scale feature embedding module of step (2), the dual-scale space-time block mutual attention encoder of step (3) and the classroom action classification module of step (4);
(5-2) the input of the action recognition model is a student action video frame sequence V; the dual-scale feature embedding module computes and outputs the dual-scale space-time feature matrices X^l and X^s; the dual-scale space-time feature matrices are input into the dual-scale space-time block mutual attention encoder, which outputs the dual-scale classification vectors; the dual-scale classification vectors are input into the action classification module, which outputs the probability vector of the action class to which the student action belongs;
(5-3) iteratively training the action recognition model until it converges: the loss function of the action recognition model is set to the cross-entropy loss -Σ_{b=1}^{B} ŷ_b·log(y_b); the model is optimized with a stochastic gradient descent algorithm and the model parameters are updated by back-propagating gradients until the loss converges; here y_b is the predicted probability that the student action belongs to action class b, and ŷ_b is the ground-truth label, with ŷ_b = 1 if the action category of the classroom student video is b and ŷ_b = 0 otherwise.
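Step (5-3) is a standard cross-entropy / SGD training loop. The sketch below assumes the model returns unnormalised class scores; the epoch count, learning rate and momentum are illustrative hyper-parameter assumptions, not values stated in the embodiment.

```python
import torch
import torch.nn as nn

def train_action_model(model, loader, epochs=50, lr=0.01, device="cuda"):
    """Step (5-3): cross-entropy loss optimised with stochastic gradient descent."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()                       # -sum_b y_hat_b * log y_b
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(epochs):
        total = 0.0
        for clips, labels in loader:                        # clips: (B, 3, T, H, W), labels: (B,)
            clips, labels = clips.to(device), labels.to(device)
            logits = model(clips)                           # unnormalised class scores
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()                                 # back-propagate gradients
            optimizer.step()
            total += loss.item()
        print(f"epoch {epoch}: mean loss {total / max(len(loader), 1):.4f}")
```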
Step (6) preprocessing a new classroom student video, inputting its first frame image into a pre-trained target detection model to obtain student bounding boxes, acquiring the corresponding video frame sequences according to the student bounding boxes, inputting the video frame sequences into the trained action recognition model, and finally outputting the class of each student action; the method comprises the following steps:
(6-1) the high-definition classroom student image data set annotated with student position bounding boxes is input into the open source target detection model YOLOv5 pre-trained on the existing COCO2017 data set, and the model is trained iteratively until it converges, yielding the target detection model;
(6-2) for a new classroom student video, the video frame sequence is obtained as in (1-1); the first frame image is input into the target detection model to obtain the position bounding box of each student, and the action video frame sequence of each student is obtained as in (1-2), where φ is the index of a student, χ is the total number of students, and the i-th element of the φ-th student's frame sequence is an RGB three-channel image of height H and width W;
(6-3) the action video frame sequence of each student is input into the action recognition model trained in step (5) to obtain the action class probability vector y_φ of the φ-th student; the action category b' corresponding to the maximum probability value is taken as the category of the student's action, i.e., b' = argmax(y_φ), where argmax(·) returns the index of the largest element in a vector.
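Step (6) can be sketched as follows: the first frame is passed to a YOLOv5 detector to obtain student bounding boxes, each student clip is cut out with the crop_student_clip helper sketched under step (1), and the argmax of the probability vector gives the action category. Loading YOLOv5 through torch.hub and all parameter names are assumptions of this sketch; in practice the detector fine-tuned in (6-1) would be loaded instead of the generic pre-trained weights.

```python
import cv2
import torch

def recognise_students(video_path, action_model, device="cuda"):
    """Step (6): detect student bounding boxes on the first frame, crop each
    student's clip, and take the argmax of the predicted probability vector."""
    cap = cv2.VideoCapture(video_path)
    ok, first_frame = cap.read()
    cap.release()
    if not ok:
        return []
    detector = torch.hub.load("ultralytics/yolov5", "yolov5s")       # in practice: the model from (6-1)
    rgb = cv2.cvtColor(first_frame, cv2.COLOR_BGR2RGB)
    boxes = detector(rgb).xyxy[0]                                    # rows of (x1, y1, x2, y2, conf, cls)
    predictions = []
    for x1, y1, x2, y2, conf, cls in boxes.tolist():
        clip = crop_student_clip(video_path, (int(x1), int(y1), int(x2), int(y2)))
        clip = torch.from_numpy(clip).permute(3, 0, 1, 2).float().unsqueeze(0) / 255.0  # (1, 3, T, H, W)
        probs = action_model(clip.to(device))                        # action class probability vector y_phi
        predictions.append(int(probs.argmax(dim=-1)))                # b' = argmax(y_phi)
    return predictions
```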
The embodiment described above is only an example of one implementation of the inventive concept; the protection scope of the present invention should not be considered as limited to the specific form set forth in the embodiment, and also covers equivalent technical means that those skilled in the art can conceive according to the inventive concept.