Transformer-based non-contact heart rate measurement method

Document No.: 8510. Publication date: 2021-09-17.

1. A method for contactless heart rate measurement based on Transformer, comprising the steps of:

step S10, acquiring a video frame sequence to be detected containing face information in a set time period;

step S20, acquiring a human face region-of-interest image sequence through a human face key point model based on the video frame sequence to be detected;

step S30, preprocessing the image sequence of the region of interest of the face, and acquiring a heart rate sequence in a set time period through a trained end-to-end Transformer model based on the preprocessed image sequence of the region of interest of the face;

the end-to-end Transformer model is constructed on the basis of a linear layer, a spatial Transformer module, a time Transformer module and a fully connected layer which are connected in sequence;

the spatial Transformer module comprises N first processing modules; the first processing module is constructed on the basis of a multi-head attention submodule and a multi-layer perceptron submodule;

the multi-head attention submodule is constructed on the basis of a normalization layer and a multi-head attention layer which are connected in sequence;

the multilayer perceptron submodule is constructed on the basis of a normalization layer and a multilayer perceptron structure which are sequentially connected; the multilayer perceptron structure is constructed on the basis of a fully connected layer, an activation function layer, a Dropout layer, a fully connected layer and a Dropout layer which are connected in sequence;

adding the input of the multi-head attention submodule and the output of a multi-head attention layer in the multi-head attention submodule to form the output of the multi-head attention submodule; adding the output of the multi-head attention submodule and the output of a multi-layer perceptron structure in the multi-layer perceptron submodule to form the output of the multi-layer perceptron submodule;

the time Transformer module comprises M second processing modules, and the second processing modules have the same structure as the first processing modules.

2. The Transformer-based non-contact heart rate measurement method according to claim 1, wherein the attention mechanism adopted by each head of the multi-head attention layer is as follows:

multiplying the output of the normalization layer in the multi-head attention submodule by weight matrices to obtain q, k and v:

$$q^{(l,a)} = W_Q^{(l,a)}\,\mathrm{LN}\big(z^{(l-1)}\big),\qquad k^{(l,a)} = W_K^{(l,a)}\,\mathrm{LN}\big(z^{(l-1)}\big),\qquad v^{(l,a)} = W_V^{(l,a)}\,\mathrm{LN}\big(z^{(l-1)}\big)$$

wherein $z^{(l-1)}$ denotes the input of the $l$-th multi-head attention submodule, $\mathrm{LN}(\cdot)$ denotes the layer normalization performed by the normalization layer within the multi-head attention submodule, the superscript $(l,a)$ denotes the $a$-th head of the $l$-th multi-head attention submodule, and $W_Q$, $W_K$ and $W_V$ denote weight matrices;

calculating the dot product of q and k, passing the result through an activation function layer and a Dropout layer in sequence, and multiplying v by the resulting coefficients;

passing the result of the multiplication through a linear layer and a normalization layer, and outputting it.

3. The method for contactless heart rate measurement based on Transformer according to claim 1, wherein the method for "preprocessing the image sequence of the region of interest of the human face" is: uniformly sampling F images, in time order, from the face region-of-interest image sequence as the sampling frames to be processed.

4. The method for contactless heart rate measurement based on Transformer according to claim 3, wherein the method for obtaining the heart rate sequence in the set time period through the trained end-to-end Transformer model comprises:

step S31, preprocessing the F sampling frames to be processed to obtain F embedded vectors, including: dividing the sampling frames to be processed into F × N sampling blocks of size P × P, wherein each sampling frame to be processed corresponds to N sampling blocks;

flattening each sampling block into a vector to obtain a vector to be processed, and obtaining an embedded vector to be processed from the vector to be processed through linear mapping;

stacking the embedded vectors to be processed that correspond to the same sampling frame to be processed, to obtain the F embedded vectors;

step S32, acquiring F first output vectors to be processed through the spatial Transformer module based on the embedded vectors;

step S33, obtaining a first output vector through position encoding and stacking based on the first output vectors to be processed; the first output vector is an F × D matrix, wherein D is the output dimension of the spatial Transformer module;

step S34, acquiring a second output vector through the time Transformer module based on the first output vector;

step S35, obtaining the heart rate sequence within the set time period through the fully connected layer based on the second output vector.

5. The method for contactless heart rate measurement based on Transformer according to claim 1, wherein the trained end-to-end Transformer model is trained by:

step A10, acquiring a training video frame sequence, and acquiring a face region-of-interest image sequence through a face key point model based on the training video frame sequence; taking the face region-of-interest image sequence corresponding to the training video frame sequence and its standard heart rate sequence as a training sample, to construct a training sample set;

step A20, preprocessing the image sequence of the region of interest of the face in the training sample set, inputting the preprocessed image sequence into an end-to-end Transformer model, and obtaining a predicted heart rate sequence within a set time period;

step A30, calculating a loss value based on a heart rate sequence and a standard heart rate sequence within a set time period output by an end-to-end Transformer model, and adjusting parameters of the end-to-end Transformer model;

step A40, repeating steps A20 to A30 until a trained end-to-end Transformer model is obtained.

6. The method for contactless heart rate measurement based on Transformer according to claim 5, wherein the loss function adopted in training the end-to-end Transformer model is a total loss that combines a time-domain loss and a frequency-domain loss through a weight coefficient,

wherein $\gamma$ is the weight coefficient, $\mathcal{L}$ is the total loss, $\mathcal{L}_{time}$ is the time-domain loss, $\mathcal{L}_{fre}$ is the frequency-domain loss, $X$ is the heart rate sequence within the set time period output by the end-to-end Transformer model, $Y$ is the standard heart rate sequence within the set time period, $T$ is the length of the video signal corresponding to the video frame sequence to be detected, $\mathrm{PSD}(X)$ is the power spectral density calculated from the heart rate sequence output by the end-to-end Transformer model, $\mathrm{PSD}(Y)$ is the power spectral density calculated from the standard heart rate sequence within the set time period, and CE is the cross-entropy loss.

7. The Transformer-based contactless heart rate measurement method according to claim 5, wherein in step A10 the constructed training sample set includes the face region-of-interest image sequence and an augmented face region-of-interest image sequence obtained by sample augmentation of the face region-of-interest image sequence, the sample augmentation method being as follows:

obtaining face picture sets of different scales from the face region-of-interest image sequence by cropping and affine transformation;

performing sample augmentation on the face picture sets of different scales by partial-region erasing and horizontal flipping to obtain an augmented face picture set, and ordering the augmented face picture set by time to generate the augmented face region-of-interest image sequence.

8. A Transformer-based contactless heart rate measurement system, the system comprising: an image acquisition unit, a face extraction unit and a heart rate extraction unit;

the image acquisition unit is configured to acquire a video frame sequence to be detected containing face information within a set time period;

the face extraction unit is configured to obtain a face region-of-interest image sequence through a face key point model based on the video frame sequence to be detected;

the heart rate extraction unit is configured to preprocess the face region-of-interest image sequence, and based on the preprocessed face region-of-interest image sequence, obtain a heart rate sequence within the set time period through a trained end-to-end Transformer model;

the end-to-end Transformer model is constructed on the basis of a linear layer, a spatial Transformer module, a time Transformer module and a fully connected layer which are connected in sequence;

the spatial Transformer module comprises N first processing modules; the first processing module is constructed on the basis of a multi-head attention submodule and a multi-layer perceptron submodule;

the multi-head attention submodule is constructed on the basis of a normalization layer and a multi-head attention layer which are connected in sequence;

the multilayer perceptron submodule is constructed on the basis of a normalization layer and a multilayer perceptron structure which are sequentially connected; the multilayer perceptron structure is constructed on the basis of a fully connected layer, an activation function layer, a Dropout layer, a fully connected layer and a Dropout layer which are connected in sequence;

adding the input of the multi-head attention submodule and the output of a multi-head attention layer in the multi-head attention submodule to form the output of the multi-head attention submodule; adding the output of the multi-head attention submodule and the output of a multi-layer perceptron structure in the multi-layer perceptron submodule to form the output of the multi-layer perceptron submodule;

the time Transformer module comprises M second processing modules, and the second processing modules have the same structure as the first processing modules.

9. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the processor, the instructions being executed by the processor to implement the Transformer-based contactless heart rate measurement method of any of claims 1-7.

10. A computer-readable storage medium storing computer instructions for execution by a computer to implement the Transformer-based contactless heart rate measurement method according to any one of claims 1-7.

Background

The Transformer network architecture already dominates the field of natural language processing, outperforming other methods on many tasks such as machine translation and text generation. More and more researchers are now trying to apply the powerful modeling capability of the Transformer model to computer vision.

Heart rate is an important indicator that must be measured in many situations, especially in healthcare. Conventional devices monitor heart rate and cardiac activity by measuring electrophysiological signals, through electrocardiography or photoplethysmography. These require attaching electrodes to the body, and contact-type devices make the person being tested uncomfortable. Driven in particular by the surge in telemedicine during the epidemic, techniques for measuring heart rate from the human face have been studied extensively in industry and academia in recent years. First, why can a person's heart rate be measured with a camera? The light absorption of the skin changes periodically with the blood volume pulse: chromophores such as hemoglobin in the dermal and subcutaneous microvasculature absorb a disproportionate amount of light, so the skin color changes slightly as blood is pumped through the underlying veins and arteries. Although invisible to the human eye, these changes are easily captured by the RGB sensors embedded in wearable devices, which is the theoretical basis for obtaining heart rate from the face.

In early remote heart rate measurement studies, many conventional methods accomplished the task in two stages: first acquiring rPPG signals from key regions of the detected or tracked face, and then deriving the heart rate value in the frequency domain. On the one hand, some conventional methods analyze subtle color changes of the facial region: Verkruysse first showed that an rPPG signal, and from it the heart rate, can be obtained from the green channel under natural light; Poh removed noise using independent component analysis; Li proposed tracking a well-defined facial key region and recovering a coarse rPPG signal through illumination correction and non-rigid motion elimination; Tulyakov proposed an adaptive matrix decomposition method for heart rate estimation. On the other hand, CHROM and POS measure heart rate from skin pixels using color subspace transformations.

Building on the prior knowledge of these conventional methods, face-based heart rate measurement has been designed as a non-end-to-end task. For example, an rPPG signal is extracted with the conventional CHROM method, and the heart rate value is then obtained from it through time-domain filtering, principal component analysis, signal selection and heart rate estimation.
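By way of non-limiting illustration, the second stage of the conventional two-stage approach (reading the heart rate off the frequency domain of an already extracted pulse signal) might be sketched as follows. The function name `estimate_heart_rate_bpm`, the 0.7 to 4.0 Hz search band and the synthetic trace are assumptions made for this example and do not come from any of the cited methods.

```python
import numpy as np

def estimate_heart_rate_bpm(signal, fps, low_hz=0.7, high_hz=4.0):
    """Return the dominant frequency of `signal` inside [low_hz, high_hz], in beats per minute.

    The band limits are illustrative assumptions (roughly 42 to 240 bpm).
    """
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                               # remove the DC component
    power = np.abs(np.fft.rfft(x)) ** 2            # one-sided power spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)   # frequency of each bin in Hz
    band = (freqs >= low_hz) & (freqs <= high_hz)
    peak_hz = freqs[band][np.argmax(power[band])]
    return 60.0 * peak_hz

# Synthetic stand-in for a mean green-channel trace: a 1.2 Hz pulse, 10 s at 30 fps.
fps = 30
t = np.arange(300) / fps
green_trace = 0.5 + 0.02 * np.sin(2 * np.pi * 1.2 * t)
bpm = estimate_heart_rate_bpm(green_trace, fps)
print(round(bpm, 1))
```

With a 10 s window the frequency resolution is 0.1 Hz (6 bpm), which is one reason such spectral estimates are coarse for short clips.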

In recent years, some deep-learning-based non-end-to-end methods for measuring heart rate from the human face have been developed. Motik proposed a two-stage method in which rPPG signals are first acquired by a two-dimensional convolutional neural network, and the heart rate value is then regressed by a separate one-dimensional convolutional neural network. More recently, end-to-end methods have also been proposed, such as the one Niu proposed in RhythmNet: a face video frame sequence is input, and the real-time heart rate, or the average heart rate over a period of time (for example, 10 s), is obtained directly.

Deep learning has been a popular research direction in machine learning in recent years and has achieved great success in computer vision, natural language processing and other fields. However, existing methods for measuring heart rate from the human face have the following shortcomings: first, existing datasets are not large enough, so only shallow neural networks can be used and the learned model overfits easily; second, applying attention over all 3D feature maps of a spatio-temporal sequence is computationally expensive; finally, the choice of loss function has a relatively large impact on the result.

Disclosure of Invention

To solve the above problems in the prior art, namely the low accuracy and high computational cost of existing face-based heart rate measurement, the invention provides a Transformer-based non-contact heart rate measurement method, comprising the following steps:

step S10, acquiring a video frame sequence to be detected containing face information in a set time period;

step S20, acquiring a human face region-of-interest image sequence through a human face key point model based on the video frame sequence to be detected;

step S30, preprocessing the image sequence of the region of interest of the face, and acquiring a heart rate sequence in a set time period through a trained end-to-end Transformer model based on the preprocessed image sequence of the region of interest of the face;

the end-to-end Transformer model is constructed on the basis of a linear layer, a spatial Transformer module, a time Transformer module and a fully connected layer which are connected in sequence;

the spatial Transformer module comprises N first processing modules; the first processing module is constructed on the basis of a multi-head attention submodule and a multi-layer perceptron submodule;

the multi-head attention submodule is constructed on the basis of a normalization layer and a multi-head attention layer which are connected in sequence;

the multilayer perceptron submodule is constructed on the basis of a normalization layer and a multilayer perceptron structure which are sequentially connected; the multilayer perceptron structure is constructed on the basis of a fully connected layer, an activation function layer, a Dropout layer, a fully connected layer and a Dropout layer which are connected in sequence;

adding the input of the multi-head attention submodule and the output of a multi-head attention layer in the multi-head attention submodule to form the output of the multi-head attention submodule; adding the output of the multi-head attention submodule and the output of a multi-layer perceptron structure in the multi-layer perceptron submodule to form the output of the multi-layer perceptron submodule;

the time Transformer module comprises M second processing modules, and the second processing modules have the same structure as the first processing modules.
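By way of non-limiting illustration, one processing module as described above (normalization, attention and residual addition, followed by normalization, multilayer perceptron and residual addition) might be sketched as follows. The GELU activation, the omission of learnable normalization parameters, the unparameterized `toy_attention` stand-in and all sizes are assumptions made for this example; the Dropout layers are left out because they act as the identity at inference time.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token (row) to zero mean and unit variance; no learnable scale/shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    """Tanh approximation of GELU (the source does not name the activation function)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def mlp(x, w1, b1, w2, b2):
    """Fully connected -> activation -> fully connected (Dropout omitted)."""
    return gelu(x @ w1 + b1) @ w2 + b2

def processing_module(x, attention_fn, mlp_params):
    """Attention submodule and MLP submodule, each with a residual connection."""
    y = x + attention_fn(layer_norm(x))          # input + attention-layer output
    return y + mlp(layer_norm(y), *mlp_params)   # attention output + MLP output

def toy_attention(h):
    """Unparameterized single-head self-attention, used only as a stand-in."""
    scores = h @ h.T / np.sqrt(h.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ h

rng = np.random.default_rng(0)
tokens, dim, hidden = 4, 8, 16
x = rng.normal(size=(tokens, dim))
params = (rng.normal(size=(dim, hidden)) * 0.1, np.zeros(hidden),
          rng.normal(size=(hidden, dim)) * 0.1, np.zeros(dim))
out = processing_module(x, toy_attention, params)
print(out.shape)
```

Because both submodules only add to their input, the module preserves the token shape, so N (or M) such modules can be chained directly.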

In some preferred embodiments, the attention mechanism employed by each head of the multi-head attention layer is as follows:

multiplying the output of the normalization layer in the multi-head attention submodule by weight matrices to obtain q, k and v:

$$q^{(l,a)} = W_Q^{(l,a)}\,\mathrm{LN}\big(z^{(l-1)}\big),\qquad k^{(l,a)} = W_K^{(l,a)}\,\mathrm{LN}\big(z^{(l-1)}\big),\qquad v^{(l,a)} = W_V^{(l,a)}\,\mathrm{LN}\big(z^{(l-1)}\big)$$

wherein $z^{(l-1)}$ denotes the input of the $l$-th multi-head attention submodule, $\mathrm{LN}(\cdot)$ denotes the layer normalization performed by the normalization layer within the multi-head attention submodule, the superscript $(l,a)$ denotes the $a$-th head of the $l$-th multi-head attention submodule, and $W_Q$, $W_K$ and $W_V$ denote weight matrices;

calculating the dot product of q and k, passing the result through an activation function layer and a Dropout layer in sequence, and multiplying v by the resulting coefficients;

passing the result of the multiplication through a linear layer and a normalization layer, and outputting it.
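By way of non-limiting illustration, the computation of a single attention head described above might be sketched as follows. The softmax activation, the scaling of the dot product by the square root of the head dimension, the row-vector convention and all sizes are assumptions made for this example; the Dropout layer and the final linear and normalization layers are omitted for brevity.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(z_norm, w_q, w_k, w_v):
    """One head: project the normalized input to q, k, v, then weight v by softmax(q k^T)."""
    q, k, v = z_norm @ w_q, z_norm @ w_k, z_norm @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaling is an assumption, not stated in the source
    coeff = softmax(scores)                   # one row of coefficients per query token
    return coeff @ v, coeff

rng = np.random.default_rng(0)
tokens, dim, head_dim = 5, 8, 4
z_norm = rng.normal(size=(tokens, dim))       # stands in for the normalization-layer output
w_q, w_k, w_v = (rng.normal(size=(dim, head_dim)) for _ in range(3))
head_out, coeff = attention_head(z_norm, w_q, w_k, w_v)
print(head_out.shape, coeff.shape)
```

Each row of `coeff` sums to one, so every output token is a convex combination of the value vectors.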

In some preferred embodiments, the method of "preprocessing the image sequence of the region of interest of the human face" is: uniformly sampling F images, in time order, from the face region-of-interest image sequence as the sampling frames to be processed.
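By way of non-limiting illustration, uniform sampling of F frames in time order might be sketched as follows. The helper name `uniform_sample_indices` and the choice to include both the first and the last frame are assumptions made for this example.

```python
def uniform_sample_indices(num_frames, f):
    """Return f frame indices spread evenly over [0, num_frames - 1], in time order."""
    if f <= 0 or f > num_frames:
        raise ValueError("f must satisfy 1 <= f <= num_frames")
    if f == 1:
        return [0]
    step = (num_frames - 1) / (f - 1)          # spacing between consecutive samples
    return [round(i * step) for i in range(f)]

indices = uniform_sample_indices(10, 4)
print(indices)  # [0, 3, 6, 9]
```

The sampling frames to be processed are then simply the ROI images at these indices.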

In some preferred embodiments, the "obtaining the heart rate sequence within the set time period by the trained end-to-end Transformer model" is performed by:

step S31, preprocessing the F sampling frames to be processed to obtain F embedded vectors, including: dividing the sampling frames to be processed into F × N sampling blocks of size P × P, wherein each sampling frame to be processed corresponds to N sampling blocks;

flattening each sampling block into a vector to obtain a vector to be processed, and obtaining an embedded vector to be processed from the vector to be processed through linear mapping;

stacking the embedded vectors to be processed that correspond to the same sampling frame to be processed, to obtain the F embedded vectors;

step S32, acquiring F first output vectors to be processed through the spatial Transformer module based on the embedded vectors;

step S33, obtaining a first output vector through position encoding and stacking based on the first output vectors to be processed; the first output vector is an F × D matrix, wherein D is the output dimension of the spatial Transformer module;

step S34, acquiring a second output vector through the time Transformer module based on the first output vector;

step S35, obtaining the heart rate sequence within the set time period through the fully connected layer based on the second output vector.
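By way of non-limiting illustration, the data flow of steps S31 to S35 might be sketched as follows, with attention to the tensor shapes only. The two Transformer modules are replaced by trivial stand-ins (mean pooling over sampling blocks and the identity), the weights are random, and the frame size, P, D and the one-value-per-frame output are all assumptions made for this example.

```python
import numpy as np

def patchify(frames, p):
    """Split (F, H, W, C) frames into flattened P x P blocks: returns (F, N, P*P*C)."""
    f, h, w, c = frames.shape
    if h % p or w % p:
        raise ValueError("frame height and width must be multiples of p")
    x = frames.reshape(f, h // p, p, w // p, p, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)              # (F, H/P, W/P, P, P, C)
    return x.reshape(f, (h // p) * (w // p), p * p * c)

rng = np.random.default_rng(0)
F, H, W, C, P, D = 8, 32, 32, 3, 8, 16             # illustrative sizes only
frames = rng.random((F, H, W, C))

# S31: F x N sampling blocks, flattened and linearly mapped to D dimensions.
patches = patchify(frames, P)                      # (F, N, P*P*C)
w_embed = rng.normal(size=(P * P * C, D)) * 0.02
embedded = patches @ w_embed                       # (F, N, D), one stack per frame

# S32: stand-in for the spatial Transformer module (mean over the N blocks).
per_frame = embedded.mean(axis=1)                  # (F, D)

# S33: add a position encoding and stack into an F x D matrix.
position = rng.normal(size=(F, D)) * 0.02
first_output = per_frame + position                # (F, D)

# S34: stand-in for the time Transformer module (identity).
second_output = first_output                       # (F, D)

# S35: fully connected layer mapping each row to one value of the sequence.
w_head = rng.normal(size=(D, 1))
heart_rate_seq = (second_output @ w_head)[:, 0]    # (F,)
print(patches.shape, first_output.shape, heart_rate_seq.shape)
```

The point of the sketch is that spatial processing sees N tokens per frame while temporal processing sees only F tokens in total.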

In some preferred embodiments, the trained end-to-end Transformer model is trained by:

step A10, acquiring a training video frame sequence, and acquiring a face region-of-interest image sequence through a face key point model based on the training video frame sequence; taking the face region-of-interest image sequence corresponding to the training video frame sequence and its standard heart rate sequence as a training sample, to construct a training sample set;

step A20, preprocessing the image sequence of the region of interest of the face in the training sample set, inputting the preprocessed image sequence into an end-to-end Transformer model, and obtaining a predicted heart rate sequence within a set time period;

step A30, calculating a loss value based on a heart rate sequence and a standard heart rate sequence within a set time period output by an end-to-end Transformer model, and adjusting parameters of the end-to-end Transformer model;

step A40, repeating steps A20 to A30 until a trained end-to-end Transformer model is obtained.

In some preferred embodiments, the loss function used in training the end-to-end Transformer model is a total loss that combines a time-domain loss and a frequency-domain loss through a weight coefficient,

wherein $\gamma$ is the weight coefficient, $\mathcal{L}$ is the total loss, $\mathcal{L}_{time}$ is the time-domain loss, $\mathcal{L}_{fre}$ is the frequency-domain loss, $X$ is the heart rate sequence within the set time period output by the end-to-end Transformer model, $Y$ is the standard heart rate sequence within the set time period, $T$ is the length of the video signal corresponding to the video frame sequence to be detected, $\mathrm{PSD}(X)$ is the power spectral density calculated from the heart rate sequence output by the end-to-end Transformer model, $\mathrm{PSD}(Y)$ is the power spectral density calculated from the standard heart rate sequence within the set time period, and CE is the cross-entropy loss.
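By way of non-limiting illustration, a loss of this general kind might be sketched as follows. The exact formula is not reproduced in this text, so every concrete choice here is an assumption made for this example: the time-domain term is taken as one minus the Pearson correlation, the frequency-domain term as the cross-entropy between normalized power spectra, and the total as the time-domain term plus gamma times the frequency-domain term.

```python
import numpy as np

def negative_pearson(x, y):
    """Assumed time-domain loss: 1 - Pearson correlation of the two sequences."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    return 1.0 - float((x @ y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

def normalized_psd(x):
    """Power spectrum of the mean-removed sequence, normalized to sum to one."""
    x = np.asarray(x, dtype=float)
    power = np.abs(np.fft.rfft(x - x.mean())) ** 2
    return power / (power.sum() + 1e-12)

def cross_entropy(p_pred, p_true, eps=1e-12):
    """CE between two discrete distributions over frequency bins."""
    return float(-(p_true * np.log(p_pred + eps)).sum())

def total_loss(x, y, gamma=0.5):
    """Assumed combination: time-domain loss + gamma * frequency-domain loss."""
    l_time = negative_pearson(x, y)
    l_fre = cross_entropy(normalized_psd(x), normalized_psd(y))
    return l_time + gamma * l_fre

t = np.arange(300) / 30.0                       # 10 s at 30 samples per second
y_ref = np.sin(2 * np.pi * 1.2 * t)             # reference at 1.2 Hz
x_good = y_ref.copy()                           # perfect prediction
x_bad = np.sin(2 * np.pi * 2.0 * t)             # wrong dominant frequency
loss_good = total_loss(x_good, y_ref)
loss_bad = total_loss(x_bad, y_ref)
print(loss_good < loss_bad)
```

The time-domain term rewards matching the waveform sample by sample, while the frequency-domain term penalizes a prediction whose spectral peak sits at the wrong frequency even if its shape looks plausible.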

In some preferred embodiments, in step A10, the constructed training sample set includes the face region-of-interest image sequence and an augmented face region-of-interest image sequence obtained by sample augmentation of the face region-of-interest image sequence, where the sample augmentation method is as follows:

obtaining face picture sets of different scales from the face region-of-interest image sequence by cropping and affine transformation;

performing sample augmentation on the face picture sets of different scales by partial-region erasing and horizontal flipping to obtain an augmented face picture set, and ordering the augmented face picture set by time to generate the augmented face region-of-interest image sequence.
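By way of non-limiting illustration, the erasing and horizontal flipping part of the sample augmentation might be sketched as follows. Applying the same flip decision and the same erased rectangle to every frame of a sequence, the 50% flip probability, the erased fraction and the zero fill value are assumptions made for this example; the cropping and affine transformation step is not shown.

```python
import numpy as np

def augment_sequence(frames, rng, erase_frac=0.25, flip_prob=0.5):
    """Return an augmented copy of a (F, H, W, C) sequence, keeping the frame order."""
    out = frames.copy()
    if rng.random() < flip_prob:
        out = out[:, :, ::-1, :].copy()            # horizontal (left-right) flip
    _, h, w, _ = out.shape
    eh, ew = max(1, int(h * erase_frac)), max(1, int(w * erase_frac))
    top = int(rng.integers(0, h - eh + 1))         # top-left corner of the erased region
    left = int(rng.integers(0, w - ew + 1))
    out[:, top:top + eh, left:left + ew, :] = 0.0  # erase the same region in every frame
    return out

rng = np.random.default_rng(0)
frames = np.ones((5, 16, 16, 3))
augmented = augment_sequence(frames, rng)
print(augmented.shape, int((augmented == 0).sum()))
```

Keeping the transform identical across frames avoids introducing frame-to-frame intensity changes that are unrelated to the pulse.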

In a second aspect, the invention provides a Transformer-based non-contact heart rate measurement system, which comprises an image acquisition unit, a face extraction unit and a heart rate extraction unit;

the image acquisition unit is configured to acquire a video frame sequence to be detected containing face information within a set time period;

the face extraction unit is configured to obtain a face region-of-interest image sequence through a face key point model based on the video frame sequence to be detected;

the heart rate extraction unit is configured to preprocess the face region-of-interest image sequence, and based on the preprocessed face region-of-interest image sequence, obtain a heart rate sequence within the set time period through a trained end-to-end Transformer model;

the end-to-end Transformer model is constructed on the basis of a linear layer, a spatial Transformer module, a time Transformer module and a fully connected layer which are connected in sequence;

the spatial Transformer module comprises N first processing modules; the first processing module is constructed on the basis of a multi-head attention submodule and a multi-layer perceptron submodule;

the multi-head attention submodule is constructed on the basis of a normalization layer and a multi-head attention layer which are connected in sequence;

the multilayer perceptron submodule is constructed on the basis of a normalization layer and a multilayer perceptron structure which are sequentially connected; the multilayer perceptron structure is constructed on the basis of a fully connected layer, an activation function layer, a Dropout layer, a fully connected layer and a Dropout layer which are connected in sequence;

adding the input of the multi-head attention submodule and the output of a multi-head attention layer in the multi-head attention submodule to form the output of the multi-head attention submodule; adding the output of the multi-head attention submodule and the output of a multi-layer perceptron structure in the multi-layer perceptron submodule to form the output of the multi-layer perceptron submodule;

the time Transformer module comprises M second processing modules, and the second processing modules have the same structure as the first processing modules.

In a third aspect of the present invention, an electronic device is provided, including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the processor, the instructions being executed by the processor to implement the Transformer-based contactless heart rate measurement method described above.

In a fourth aspect of the present invention, a computer-readable storage medium is provided, which stores computer instructions for being executed by the computer to implement the above-mentioned method for contactless heart rate measurement based on Transformer.

The invention has the beneficial effects that:

The invention addresses the low accuracy and high computational cost of existing face-based heart rate measurement.

(1) The invention uses an end-to-end Transformer model to automatically learn rich, discriminative features from the face image sequence and to predict the heart rate. In use, the end-to-end Transformer model uses only two-dimensional convolution kernels and no three-dimensional convolution kernels, which effectively improves the accuracy of the algorithm and makes end-to-end deployment feasible on low-end hardware platforms.

(2) By constructing the spatial Transformer module and the time Transformer module, the invention factorizes attention over the spatial and temporal dimensions, which greatly reduces the amount of computation and makes the cost of spatio-temporal attention affordable. In the spatial Transformer module, a spatial attention mechanism is applied among the different sampling blocks of the same sampling frame to be processed, so that the spatial position information of the image is captured better; in the time Transformer module, a temporal attention mechanism is applied to the output of the spatial Transformer module, so that information such as the displacement between micro-expressions is captured better. Since the time Transformer module operates on higher-level features, its additional cost relative to the spatial Transformer module is negligible.
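By way of non-limiting illustration, the saving from factorizing attention can be estimated by counting query-key pairs per attention layer. Here N denotes the number of sampling blocks per frame, the values F = 32 and N = 64 are hypothetical, and the assumption that the time Transformer module sees one token per frame is made for this example only.

```python
def joint_attention_pairs(f, n):
    """Every one of the f*n tokens attends to every other token."""
    return (f * n) ** 2

def factorized_attention_pairs(f, n):
    """Spatial attention inside each frame (f times n*n) plus temporal attention over f tokens."""
    return f * n * n + f * f

F, N = 32, 64                                  # hypothetical sizes
joint = joint_attention_pairs(F, N)
factorized = factorized_attention_pairs(F, N)
temporal_share = (F * F) / factorized

print(joint)                     # 4194304
print(factorized)                # 132096
print(round(joint / factorized, 1))
print(round(100 * temporal_share, 2))
```

Under these assumptions the factorized scheme needs roughly one thirtieth of the pairs, and the temporal stage accounts for under one percent of them, which is consistent with its cost being described as negligible.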

(3) By supervising the frequency-domain loss and the time-domain loss simultaneously, the invention gives the network better generalization capability and accuracy.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is a schematic flow chart of a method for contactless heart rate measurement based on a Transformer according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of an end-to-end Transformer model according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the first and second processing modules of one embodiment of the present invention;

FIG. 4 is a schematic illustration of an attention mechanism employed by each head of a multi-headed attention layer of one embodiment of the present invention;

FIG. 5 is a flowchart illustrating a training process of an end-to-end Transformer model according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

The invention discloses a Transformer-based non-contact heart rate measurement method which, as shown in FIG. 1, comprises the following steps:

step S10, acquiring a video frame sequence to be detected containing face information in a set time period;

step S20, acquiring a human face region-of-interest image sequence through a human face key point model based on the video frame sequence to be detected;

step S30, preprocessing the image sequence of the region of interest of the face, and acquiring a heart rate sequence in a set time period through a trained end-to-end Transformer model based on the preprocessed image sequence of the region of interest of the face;

the end-to-end Transformer model is constructed on the basis of a linear layer, a spatial Transformer module, a time Transformer module and a fully connected layer which are connected in sequence;

the spatial Transformer module comprises N first processing modules; the first processing module is constructed on the basis of a multi-head attention submodule and a multi-layer perceptron submodule;

the multi-head attention submodule is constructed on the basis of a normalization layer and a multi-head attention layer which are connected in sequence;

the multilayer perceptron submodule is constructed on the basis of a normalization layer and a multilayer perceptron structure which are sequentially connected; the multilayer perceptron structure is constructed on the basis of a fully connected layer, an activation function layer, a Dropout layer, a fully connected layer and a Dropout layer which are connected in sequence;

adding the input of the multi-head attention submodule and the output of a multi-head attention layer in the multi-head attention submodule to form the output of the multi-head attention submodule; adding the output of the multi-head attention submodule and the output of a multi-layer perceptron structure in the multi-layer perceptron submodule to form the output of the multi-layer perceptron submodule;

the time Transformer module comprises M second processing modules, and the second processing modules have the same structure as the first processing modules.

In order to describe the Transformer-based non-contact heart rate measurement method of the present invention more clearly, the steps in the embodiment of the present invention are described in detail below with reference to the drawings.

In the following embodiment, the process of constructing and training an end-to-end Transformer model is detailed first, and then the process of acquiring a heart rate sequence within a set time period of a video frame sequence to be measured by a Transformer-based non-contact heart rate measurement method is detailed.

1. Construction and training of end-to-end Transformer model, as shown in FIG. 5

Step A10, acquiring a training video frame sequence, and acquiring a face region-of-interest image sequence through a face key point model based on the training video frame sequence; taking the face region-of-interest image sequence corresponding to the training video frame sequence and its standard heart rate sequence as a training sample to construct a training sample set; wherein the values in the heart rate sequence represent the heart rate values at different time points. It should be noted that the face region-of-interest image sequence may also be acquired by using a face detection model together with the face key point model.

In this embodiment, the constructed training sample set includes the face region-of-interest image sequence corresponding to a training video frame sequence and an augmented face region-of-interest image sequence obtained by performing sample augmentation on that sequence, wherein the sample augmentation method includes:

based on the face region-of-interest image sequence, obtaining face image sets of different scales by cropping and affine transformation;

based on the face image sets of different scales, performing sample augmentation by partial-region erasing and left-right (horizontal) flipping to obtain an augmented face image set, and ordering the augmented face image set by time to generate the augmented face region-of-interest image sequence.
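
By way of illustration only, the erasing and flipping operations described above may be sketched as follows. The function names, the single-channel list-of-rows image representation, the erased-region size (one quarter of each side) and the choice to apply the same random decision to every frame of a sequence are assumptions made for this sketch and are not specified in the embodiment; cropping and affine transformation are omitted.

```python
import random

def horizontal_flip(img):
    """Mirror a 2-D image (list of rows) left to right."""
    return [row[::-1] for row in img]

def erase_region(img, top, left, height, width, fill=0):
    """Return a copy of img with one rectangular region overwritten by `fill`."""
    out = [row[:] for row in img]
    for r in range(top, min(top + height, len(out))):
        for c in range(left, min(left + width, len(out[r]))):
            out[r][c] = fill
    return out

def augment_sequence(frames, rng):
    """Augment a time-ordered list of frames.

    Assumption: one flip decision and one erased rectangle are drawn per
    sequence and reused for every frame, so the frame order is preserved and
    the frames stay mutually consistent.
    """
    h, w = len(frames[0]), len(frames[0][0])
    do_flip = rng.random() < 0.5
    eh, ew = max(1, h // 4), max(1, w // 4)   # assumed erase size
    top = rng.randrange(h - eh + 1)
    left = rng.randrange(w - ew + 1)
    out = []
    for frame in frames:
        g = horizontal_flip(frame) if do_flip else frame
        out.append(erase_region(g, top, left, eh, ew))
    return out
```

For example, `augment_sequence(frames, random.Random(0))` returns a new list with the same number of frames and leaves the input frames unmodified.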

Step A20, preprocessing the image sequence of the region of interest of the face in the training sample set, inputting the preprocessed image sequence into an end-to-end Transformer model, and obtaining a predicted heart rate sequence within a set time period;

in this embodiment, the preprocessing uniformly samples F images, in time order, from the face region-of-interest image sequence as to-be-processed sampling frames, where the time interval between each image and its adjacent image is the same. For example, for a face region-of-interest image sequence corresponding to a 30 s video, 16, 32 or more images are uniformly sampled in time order as to-be-processed sampling frames. The preprocessed face region-of-interest image sequence is then input into the end-to-end Transformer model to obtain a predicted heart rate sequence within the set time period.
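
As a non-limiting sketch, the uniform sampling of F frames may be expressed as selecting indices with a constant stride. The function name and the 30 frames-per-second rate used in the usage note are assumptions for illustration; the embodiment only requires that adjacent sampled frames be equally spaced in time.

```python
def uniform_sample_indices(num_frames, f):
    """Return f frame indices with a constant stride, starting at frame 0."""
    if not 1 <= f <= num_frames:
        raise ValueError("f must satisfy 1 <= f <= num_frames")
    stride = num_frames // f          # same interval between all neighbours
    return [i * stride for i in range(f)]
```

For a 30 s video at an assumed 30 frames per second (900 frames) and F = 16, the stride is 56 frames, so the sampled indices are 0, 56, 112 and so on up to 840.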

The structure and the working process of the end-to-end Transformer model are detailed as follows:

(1) The structure of the end-to-end Transformer model, as shown in FIG. 2

The end-to-end Transformer model is constructed on the basis of a linear layer, a spatial Transformer module, a time Transformer module and a full connection layer which are connected in sequence;

the spatial Transformer module comprises N first processing modules connected in sequence; in this embodiment, N is preferably 12. The input of the 1st first processing module in the spatial Transformer module is the processed output of the linear layer in the end-to-end Transformer model, and the input of each of the 2nd to 12th first processing modules is the output of the preceding first processing module;

the time Transformer module comprises M second processing modules connected in sequence; in this embodiment, M is preferably 6. The input of the 1st second processing module in the time Transformer module is the processed output of the spatial Transformer module, and the input of each of the 2nd to 6th second processing modules is the output of the preceding second processing module. It is emphasized that the second processing module is identical in structure to the first processing module.

(2) Working process related to end-to-end Transformer model

The end-to-end Transformer model receives the F to-be-processed sampling frames obtained from the face region-of-interest image sequence and preprocesses them to obtain F embedded vectors. Specifically, the preprocessing is as follows:

dividing the F to-be-processed sampling frames into F × N sampling blocks of size P × P, wherein each to-be-processed sampling frame has size H × W and corresponds to N sampling blocks (N here denotes the number of sampling blocks per frame, not the number of first processing modules);

flattening each sampling block into a vector to obtain a vector to be processed, and performing linear mapping on the vector to be processed through a linear layer in the end-to-end Transformer model to obtain F × N embedded vectors to be processed. It should be noted that the number of linear layers in the end-to-end Transformer model corresponds to the number of embedded vectors to be processed, that is, different vectors to be processed are input into different linear layers for linear mapping; in this embodiment, the number of linear layers in the end-to-end Transformer model is F × N;

stacking, among the F × N embedded vectors to be processed, those corresponding to the same to-be-processed sampling frame to obtain F embedded vectors, thereby converting the F to-be-processed sampling frames into F embedded vectors.
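
For illustration, the block division and linear mapping may be sketched as below. The sketch assumes single-channel frames, non-overlapping blocks (so that each frame yields (H/P) × (W/P) blocks) and an explicit weight matrix and bias; the function names and these choices are assumptions and are not taken from the embodiment.

```python
def split_into_blocks(frame, p):
    """Split an H x W frame (list of rows) into non-overlapping p x p blocks.

    Each block is flattened row by row into a vector of length p * p.
    """
    h, w = len(frame), len(frame[0])
    if h % p or w % p:
        raise ValueError("p must divide both H and W")
    blocks = []
    for top in range(0, h, p):
        for left in range(0, w, p):
            blocks.append([frame[r][c]
                           for r in range(top, top + p)
                           for c in range(left, left + p)])
    return blocks

def linear_map(vec, weight, bias):
    """Compute y = W x + b for one flattened block; W has shape D x len(vec)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, vec)) + b
            for row, b in zip(weight, bias)]
```

Under these assumptions a 4 × 4 frame with P = 2 gives 4 blocks of length 4, and mapping every block with `linear_map` and stacking the results per frame gives one embedded vector (an N × D array) for that frame.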

Then, the F embedded vectors are respectively input into the first of the first processing modules of different spatial Transformer modules, and the spatial Transformer modules calculate the spatial attention among the sampling blocks of each to-be-processed sampling frame, so as to better capture the spatial position information of the image, and output F first to-be-processed output vectors. It should be noted that the number of spatial Transformer modules corresponds to the number of embedded vectors (i.e., to-be-processed sampling frames); in this embodiment, the number of spatial Transformer modules is F. It should be added that the linear layers corresponding to the same to-be-processed sampling frame are connected to the same spatial Transformer module, and the linear layers corresponding to different to-be-processed sampling frames are connected to different spatial Transformer modules. Position coding is performed on the first to-be-processed output vectors output by the spatial Transformer modules, and the position-coded vectors are stacked to obtain a first output vector; the first output vector is an F × D matrix, where D is the output dimension of the spatial Transformer module;

then, the first output vector is input into the first of the second processing modules in the time Transformer module, and the time Transformer module calculates the temporal attention among the to-be-processed sampling frames, so as to better capture information such as the displacement between micro-expressions, and outputs a second output vector;

finally, the second output vector output by the time Transformer module is input into the full connection layer, and the predicted heart rate sequence within the set time period is obtained through the full connection layer.

Furthermore, the first processing module and the second processing module are both constructed based on a multi-head attention submodule and a multi-layer perceptron submodule which are connected in sequence. The following describes the structure of the first processing module and the second processing module in detail, taking the first processing module as an example, as shown in fig. 3.

Specifically, the multi-head attention submodule is constructed on the basis of a normalization layer and a multi-head attention layer which are connected in sequence; the input of the first processing module is the input of the multi-head attention submodule, and the input of the multi-head attention submodule and the output of the multi-head attention layer in the multi-head attention submodule are added to form the output of the multi-head attention submodule. Note that "input" in fig. 3 refers to the input of the first processing module.

A plurality of heads are arranged in the multi-head attention layer. As shown in fig. 4, the attention mechanism adopted by each head is as follows:

multiplying the output of the normalization layer in the multi-head attention submodule by weight matrices to obtain q, k and v:

$$q^{(l,a)} = W_Q^{(l,a)}\,\mathrm{LN}\left(z^{(l-1)}\right) \tag{1}$$

$$k^{(l,a)} = W_K^{(l,a)}\,\mathrm{LN}\left(z^{(l-1)}\right) \tag{2}$$

$$v^{(l,a)} = W_V^{(l,a)}\,\mathrm{LN}\left(z^{(l-1)}\right) \tag{3}$$

wherein $z^{(l-1)}$ denotes the input of the $l$-th multi-head attention submodule, $\mathrm{LN}(\cdot)$ denotes the layer normalization operation performed by the normalization layer within the multi-head attention submodule, the superscript $(l,a)$ denotes the $a$-th head of the $l$-th multi-head attention submodule, and $W_Q^{(l,a)}$, $W_K^{(l,a)}$ and $W_V^{(l,a)}$ denote weight matrices; for the spatial Transformer module, $l$ ranges from 1 to 12, and for the time Transformer module, $l$ ranges from 1 to 8. In this embodiment, if the input dimension of the multi-head attention submodule is 768 and 12 heads are provided in the multi-head attention submodule, the per-head dimension is set to $D_h = 768/12 = 64$, so that each head in the multi-head attention submodule obtains q, k and v of dimension 64;

calculating the dot product of q and k, passing the result through an activation function layer and a Dropout layer in sequence, and multiplying the result, as a coefficient, by v;

passing the result of the multiplication through the linear layer and the normalization layer to obtain the output of a single head in the multi-head attention layer. It should also be noted that "input" in fig. 4 refers to the input of the multi-head attention submodule, and "output" refers to the output of a single head in the multi-head attention layer.

The outputs of all heads in the multi-head attention layer are integrated to form the output of the multi-head attention layer, wherein the dimension of the output of the multi-head attention layer is consistent with that of the input of the multi-head attention submodule.
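
A minimal single-head sketch of the attention computation described above is given below for illustration. It assumes that the activation function is a softmax, that the dot products are scaled by the inverse square root of the head dimension (a common convention not stated in the embodiment), and that Dropout is inactive, as at inference time; the per-head linear layer and normalization layer are omitted. All function names are hypothetical.

```python
import math

def project(x, w):
    """Multiply each row of x (n x d_in) by the weight matrix w (d_in x d_out)."""
    return [[sum(xi * w[i][j] for i, xi in enumerate(row))
             for j in range(len(w[0]))] for row in x]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(x_norm, w_q, w_k, w_v):
    """Single attention head applied to already layer-normalized tokens."""
    q = project(x_norm, w_q)
    k = project(x_norm, w_k)
    v = project(x_norm, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))        # assumed scaling factor
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        coeff = softmax(scores)               # assumed activation; dropout omitted
        out.append([sum(c * vj[d] for c, vj in zip(coeff, v))
                    for d in range(len(v[0]))])
    return out
```

With an input dimension of 768 and 12 heads, each of `w_q`, `w_k` and `w_v` in this sketch would have shape 768 × 64, and concatenating the 12 head outputs restores the 768-dimensional width of the submodule input.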

The multilayer perceptron submodule is constructed on the basis of a normalization layer and a multilayer perceptron structure which are sequentially connected; the output of the multi-head attention submodule is the input of the multilayer perceptron submodule, and the output of the multi-head attention submodule and the output of the multilayer perceptron structure in the multilayer perceptron submodule are added to form the output of the multilayer perceptron submodule.

The output of the multi-head attention submodule is subjected to layer normalization by the normalization layer and then input into the multilayer perceptron structure. Because the full connection layers in the multilayer perceptron structure have a large number of parameters and are prone to overfitting, a Dropout layer is connected after each full connection layer to reduce overfitting and improve generalization.
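
The two residual connections of a processing module may be sketched as follows, by way of example only. The attention layer and the multilayer perceptron structure are passed in as callables so that the sketch shows only the normalization and addition order; the function names are hypothetical, and the learnable scale and shift of the normalization layer are omitted as a simplification.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def processing_module(tokens, attention, mlp):
    """One processing module: attention submodule, then perceptron submodule.

    tokens:    list of token vectors (input of the processing module)
    attention: callable mapping a list of normalized tokens to a list of tokens
    mlp:       callable mapping one normalized token to one token
    """
    normed = [layer_norm(t) for t in tokens]
    attn_out = attention(normed)
    # First residual: submodule input + multi-head attention layer output.
    y = [[a + b for a, b in zip(t, o)] for t, o in zip(tokens, attn_out)]
    # Second residual: attention submodule output + perceptron structure output.
    return [[a + b for a, b in zip(t, mlp(layer_norm(t)))] for t in y]
```

Chaining N such calls gives the spatial Transformer module of the embodiment, and chaining M such calls gives the time Transformer module, since both kinds of processing module share this structure.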

Step A30, calculating a loss value based on a heart rate sequence and a standard heart rate sequence within a set time period output by an end-to-end Transformer model, and adjusting parameters of the end-to-end Transformer model;

in this embodiment, when the end-to-end Transformer model is trained, a loss value is calculated from the heart rate sequence within the set time period output by the end-to-end Transformer model and the standard heart rate sequence within the set time period, and the end-to-end Transformer model is updated according to the loss value, so as to obtain an optimal end-to-end Transformer model. Specifically, the loss function adopted by the end-to-end Transformer model in the training process is $L_{overall}$, wherein

$$L_{overall} = L_{time} + \gamma L_{freq} \tag{4}$$

$$L_{time} = 1 - \frac{T\sum_{i=1}^{T} x_i y_i - \sum_{i=1}^{T} x_i \sum_{i=1}^{T} y_i}{\sqrt{\left(T\sum_{i=1}^{T} x_i^2 - \left(\sum_{i=1}^{T} x_i\right)^2\right)\left(T\sum_{i=1}^{T} y_i^2 - \left(\sum_{i=1}^{T} y_i\right)^2\right)}} \tag{5}$$

$$L_{freq} = \mathrm{CE}\left(\mathrm{PSD}(X), \mathrm{PSD}(Y)\right) \tag{6}$$

wherein $\gamma$ is a weight coefficient, $L_{overall}$ is the total loss, $L_{time}$ is the time domain loss, $L_{freq}$ is the frequency domain loss, $X$ is the heart rate sequence within the set time period output by the end-to-end Transformer model, $Y$ is the standard heart rate sequence within the set time period, $x_i$ and $y_i$ are the $i$-th elements of $X$ and $Y$, and $T$ is the length of the video signal corresponding to the video frame sequence to be detected; $\mathrm{PSD}(X)$ is the power spectral density calculated from the heart rate sequence within the set time period output by the end-to-end Transformer model, $\mathrm{PSD}(Y)$ is the power spectral density calculated from the standard heart rate sequence within the set time period, and $\mathrm{CE}$ is the cross-entropy loss.
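
For illustration only, a loss of this kind may be sketched as below. The sketch assumes that the time domain loss is one minus the Pearson correlation between the predicted and standard sequences, that the frequency domain loss is the cross-entropy between power spectra normalized to sum to one, and that the two are combined as the time domain loss plus gamma times the frequency domain loss. The placement of gamma, the normalization of the spectra, the direct DFT used to compute them and all function names are assumptions of this sketch. A constant input sequence would cause a division by zero and is not handled.

```python
import cmath
import math

def neg_pearson_loss(x, y):
    """Time-domain term: 1 - Pearson correlation of two equal-length sequences."""
    t = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = t * sxy - sx * sy
    den = math.sqrt((t * sxx - sx * sx) * (t * syy - sy * sy))
    return 1.0 - num / den

def normalized_psd(x):
    """Power of the mean-removed signal at DFT bins 1..T/2, scaled to sum to 1."""
    t = len(x)
    mean = sum(x) / t
    xc = [v - mean for v in x]
    power = []
    for f in range(1, t // 2 + 1):
        coef = sum(v * cmath.exp(-2j * math.pi * f * n / t)
                   for n, v in enumerate(xc))
        power.append(abs(coef) ** 2)
    total = sum(power)
    return [p / total for p in power]

def cross_entropy(pred, target, eps=1e-12):
    """Cross-entropy of a predicted distribution against a target distribution."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred, target))

def total_loss(x, y, gamma):
    """Assumed combination: time-domain loss + gamma * frequency-domain loss."""
    l_time = neg_pearson_loss(x, y)
    l_freq = cross_entropy(normalized_psd(x), normalized_psd(y))
    return l_time + gamma * l_freq
```

In this sketch the time domain term is invariant to positive scaling and offset of the prediction, so a predicted sequence equal to `2 * Y + 1` gives a time domain loss of approximately zero.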

In the prior art, a face region-of-interest image sequence corresponding to a training video frame sequence and its standard average heart rate value are usually used as a training sample, and the loss is calculated with a cross-entropy loss function. For some tasks, however, for example measuring the heart rate after a person is in a healthy state, the average heart rate value cannot adequately reflect the person's heart rate over the set time. Therefore, based on the Pearson correlation loss, the standard heart rate sequence replaces the standard average heart rate value in the training sample and is compared with the heart rate sequence output by the end-to-end Transformer model, so that the correlation between the two vectors is measured more accurately and better back-propagation can be achieved. Meanwhile, based on the cross-entropy loss function, the power spectral density is used to calculate the frequency domain loss, so that the heart rate can be measured better and the calculation precision is improved.

And step A40, circularly executing the steps A20-A30 until a trained end-to-end Transformer model is obtained.

In this embodiment, the parameters of the end-to-end Transformer model are adjusted by gradient back-propagation until the sum of the time domain loss and the frequency domain loss is smaller than a preset first threshold or the number of iterations reaches a preset number, so as to obtain a trained end-to-end Transformer model.
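
The stopping rule described above may be sketched, under stated assumptions, as a simple loop. Here `step_fn` is a hypothetical callable that performs steps A20 to A30 once (forward pass, loss calculation and one parameter update) and returns the time domain loss and the frequency domain loss; it stands in for the actual training step, which is not reproduced.

```python
def train_until_converged(step_fn, threshold, max_iters):
    """Repeat training steps until the loss sum drops below threshold.

    Returns (iterations_run, converged), where converged is False when the
    preset number of iterations was reached first.
    """
    for it in range(1, max_iters + 1):
        l_time, l_freq = step_fn()
        if l_time + l_freq < threshold:
            return it, True
    return max_iters, False
```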

2. A method for non-contact heart rate measurement based on Transformer is shown in figure 1

Step S10, acquiring a video frame sequence to be detected containing face information in a set time period;

step S20, acquiring a human face region-of-interest image sequence through a human face key point model based on the video frame sequence to be detected;

in this embodiment, the video frame sequence to be detected is processed with the face key point model adopted in step A10, so as to obtain a face region-of-interest image sequence;

step S30, preprocessing the image sequence of the region of interest of the face, and acquiring a heart rate sequence in the set time period through a trained end-to-end Transformer model based on the preprocessed image sequence of the region of interest of the face;

in this embodiment, the face region-of-interest image sequence obtained in step S20 is preprocessed by the preprocessing method of step A20 and input into the trained end-to-end Transformer model obtained in step A40, so as to obtain the heart rate sequence within the corresponding set time period.

Although the foregoing embodiments describe the steps in the above sequential order, those skilled in the art will understand that, in order to achieve the effect of the present embodiments, the steps may not be executed in such an order, and may be executed simultaneously (in parallel) or in an inverse order, and these simple variations are within the scope of the present invention.

A second embodiment of the invention is a Transformer-based contactless heart rate measurement system, as shown in fig. 2, the system includes: the device comprises an image acquisition unit, a human face extraction unit and a heart rate extraction unit;

the image acquisition unit is configured to acquire a video frame sequence to be detected containing face information within a set time period;

the face extraction unit is configured to obtain a face region-of-interest image sequence through a face key point model based on the video frame sequence to be detected;

the heart rate extraction unit is configured to preprocess the face region-of-interest image sequence, and based on the preprocessed face region-of-interest image sequence, obtain a heart rate sequence within the set time period through a trained end-to-end Transformer model;

the end-to-end Transformer model is constructed on the basis of a linear layer, a spatial Transformer module, a time Transformer module and a full connection layer which are connected in sequence;

the spatial Transformer module comprises N first processing modules; the first processing module is constructed on the basis of a multi-head attention submodule and a multi-layer perceptron submodule;

the multi-head attention submodule is constructed on the basis of a normalization layer and a multi-head attention layer which are connected in sequence;

the multilayer perceptron submodule is constructed on the basis of a normalization layer and a multilayer perceptron structure which are sequentially connected; the multilayer perceptron structure is constructed on the basis of a full connection layer, an activation function layer, a Dropout layer, a full connection layer and a Dropout layer which are connected in sequence;

adding the input of the multi-head attention submodule and the output of a multi-head attention layer in the multi-head attention submodule to form the output of the multi-head attention submodule; adding the output of the multi-head attention submodule and the output of a multi-layer perceptron structure in the multi-layer perceptron submodule to form the output of the multi-layer perceptron submodule;

the time Transformer module comprises M second processing modules, and the second processing modules have the same structure as the first processing modules.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.

It should be noted that the Transformer-based non-contact heart rate measurement system provided in the foregoing embodiment is illustrated only by the above division of functional modules. In practical applications, the functions may be allocated to different functional modules as needed, that is, the modules or steps in the embodiment of the present invention may be further decomposed or combined; for example, the modules in the foregoing embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.

An electronic apparatus according to a third embodiment of the present invention includes:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the processor for execution by the processor to implement the Transformer-based contactless heart rate measurement method described above.

A computer-readable storage medium according to a fourth embodiment of the present invention stores computer instructions for execution by a computer to implement the Transformer-based contactless heart rate measurement method described above.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

Those of skill in the art would appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules and method steps may be located in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.

The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
