Pedestrian re-identification method based on a pose-guided generative adversarial network
1. A pedestrian re-identification method based on a pose-guided generative adversarial network, characterized by comprising the following steps:
S1: constructing a backbone network of a pedestrian re-identification model by using ResNet-50, and extracting pedestrian features;
S2: constructing a multi-scale pedestrian re-identification network (MPN) for fusing pedestrian features of different scales;
S3: constructing a pose transfer module (PCM) for fusing the pose and appearance features of pedestrians;
S4: constructing a pose-guided generative adversarial network (PGGAN) for generating pedestrian sample images under different poses;
S5: designing a joint training strategy to mine pedestrian appearance information, overcome the influence of pose changes, and improve the quality of the generated sample images.
2. The pedestrian re-identification method based on the pose-guided generative adversarial network as claimed in claim 1, wherein in S1 the specific process of constructing the backbone network of the pedestrian re-identification model with ResNet-50 is as follows: ResNet-50 is pre-trained on ImageNet, the last spatial down-sampling operation in conv5_x is removed, and an N_i-dim fully connected layer is added for classification, where N_i denotes the number of pedestrian identities in the data set.
3. The pedestrian re-identification method based on the pose-guided generative adversarial network as claimed in claim 1, wherein the specific flow of S2 is as follows:
on the basis of the ResNet-50 network model, a multi-scale information fusion structure comprising three branches is created to extract feature information of different scales from the network; the first prediction branch is the normal backbone of the ResNet-50 network model, and the second and third prediction branches are added after stage-3 and stage-4 of the ResNet-50 network model, respectively; the first, second and third prediction branches are collectively called the multi-scale classification module.
4. The pedestrian re-identification method based on the pose-guided generative adversarial network as claimed in claim 3, wherein the three branches have the same structure, each consisting of a global average pooling layer, a feature-vector layer and a classification layer; each of the three branches uses a classification loss function, and after the features of the branches are concatenated a verification loss function is used; the total loss function L_reid is as follows:
L_reid = L_id^1 + μ_1·L_id^2 + μ_2·L_id^3 + ν·L_id^t;
where L_id^1, L_id^2 and L_id^3 are cross-entropy classification loss functions, L_id^t is a triplet loss function, μ_1 and μ_2 are coefficients balancing the multi-scale prediction branches, and ν is the coefficient balancing the verification loss and the classification loss.
5. The pedestrian re-identification method based on the pose-guided generative adversarial network as claimed in claim 1, wherein in S3 the pose transfer module PCM is formed by cascading a plurality of pose conversion blocks (PCBs), and the construction process of the PCM is as follows:
S3-1: fusing the appearance code of the original pedestrian with a pose code that combines the original pose and the target pose, and outputting a pedestrian appearance code of the target pose;
S3-2: inputting this appearance code of the target pose into the deconvolution network of the generative adversarial network PGGAN to generate pedestrian samples of the target pose.
6. The pedestrian re-identification method based on the pose-guided generative adversarial network as claimed in claim 1, wherein in S4 the generative adversarial network PGGAN comprises a generator network G, a discriminator network D and the adversarial loss function used, the generator network G comprising a pedestrian appearance encoder E_f, a pedestrian pose encoder E_p, a PCM module and a deconvolution network.
7. The pedestrian re-identification method based on the pose-guided generative adversarial network as claimed in claim 6, wherein in S5 the design flow of the joint training strategy is as follows:
S5-1: each time a pair of pedestrian samples is input into the generative adversarial network PGGAN of S4, the pedestrian appearance encoder E_f and the pedestrian pose encoder E_p extract visual appearance features and pose features, respectively;
S5-2: fusing the visual appearance features and the pose features, and feeding them into the deconvolution network to generate a new pedestrian sample;
S5-3: on the basis of S5-2, using the pedestrian appearance encoder E_f to extract pedestrian appearance features from the generated pedestrian sample;
S5-4: feeding the visual appearance features of the original picture and of the generated picture into the MPN simultaneously to calculate the classification loss.
Background
Nowadays, continuously developing video surveillance provides powerful basic and technical support for the construction of "safe cities" and "smart cities". Public safety is a hot topic among many social challenges, a basic requirement for stable national economic development and steady social progress, and an important foundation for preventing and handling all kinds of safety accidents and protecting people's property. Over the past decade and more, China has become the country with the fastest growth in installed surveillance cameras; surveillance video systems are widely deployed in all kinds of settings, and China has now built the largest video surveillance network in the world. A huge number of cameras form a vast "sky-eye" network covering the streets and alleys of cities. According to incomplete statistics, more than 30 million video surveillance devices are currently used for public security services in China, and provincial public security departments have largely completed provincial video-image data platforms. These surveillance systems generate massive video data every day, from which the relevant security departments need to actively analyze and mine useful information and give timely warnings of abnormal situations. Although large-scale networked surveillance improves the reliability of a video surveillance system, it brings great difficulty to management and analysis work.
At present, the screening and processing of surveillance data and the understanding and analysis of its content are mostly done manually. However, because scene changes in surveillance footage are unpredictable, long hours of monitoring make it difficult for workers to stay focused; especially when the scenes in a surveillance video are complex, crowds are dense and pedestrian flow is heavy, manual processing is prone to oversights and can hardly monitor the events in the video effectively. This manual approach is not only inaccurate but also extremely time-consuming, posing a great challenge to the personnel responsible for data processing. Exploring how to use computer vision methods to achieve human-like intelligent video surveillance capabilities, such as detection, identification, tracking and behavior analysis, has therefore become one of the key topics for researchers at home and abroad and for intelligent-technology enterprises.
Pedestrians are the main subjects in video surveillance, and using computer vision technology to study and analyze their behavior is an important component of intelligent surveillance technology. Current pedestrian-analysis technologies mainly include pedestrian detection, pedestrian counting, pedestrian tracking, pedestrian identification and pedestrian re-identification. Pedestrian detection locates the position and region of a pedestrian in the video; pedestrian counting counts pedestrians on the basis of detection and is mainly applied in places with heavy pedestrian flow, such as important traffic intersections, stations and airports; pedestrian tracking follows a pedestrian within a single surveillance device from the moment they enter its field of view until they leave it; pedestrian identification determines the identity of a detected pedestrian; and pedestrian re-identification solves the problem of identifying and retrieving pedestrians across cameras and across scenes. Existing pedestrian identification technology mainly relies on pedestrian-specific biometric information such as the face, fingerprint and iris. In real scenes, however, collecting fingerprints and irises requires contact-based cooperation from each pedestrian, most surveillance devices cannot capture clear pictures of pedestrians' faces, a single camera usually cannot cover every area, and the fields of view of different cameras hardly overlap.
Pedestrian re-identification, also called person re-identification, is widely considered to be a form of image retrieval; it is one of the core technologies of intelligent surveillance and multimedia applications, and is often used for tasks such as locating criminal suspects, analyzing human activity and tracking multiple targets.
However, compared with traditional image recognition and retrieval tasks, pedestrian re-identification still faces many challenges. In real surveillance environments, differences in the distance and angle between camera and pedestrian produce different pedestrian poses and viewpoints in the captured pictures. Meanwhile, cameras deployed at different positions experience very different environments: captured pictures are affected by illumination intensity and severe weather, the illumination difference between day and night is large, scenes are cluttered and occluded, camera styles differ greatly, and resolution is low, all of which greatly increase the difficulty of pedestrian re-identification.
Disclosure of Invention
The invention aims to solve the problems in the background art and provides a pedestrian re-identification method based on a pose-guided generative adversarial network, which fuses features extracted at different stages of the network without changing the backbone network; the fused features can effectively cope with pedestrian samples of different scales.
The object of the invention is achieved as follows:
A pedestrian re-identification method based on a pose-guided generative adversarial network comprises the following steps:
S1: constructing a backbone network of a pedestrian re-identification model by using ResNet-50, and extracting pedestrian features;
S2: constructing a multi-scale pedestrian re-identification network (MPN) for fusing pedestrian features of different scales;
S3: constructing a pose transfer module (PCM) for fusing the pose and appearance features of pedestrians;
S4: constructing a pose-guided generative adversarial network (PGGAN) for generating pedestrian sample images under different poses;
S5: designing a joint training strategy to mine pedestrian appearance information, overcome the influence of pose changes, and improve the quality of the generated sample images.
Preferably, in S1, the specific process of constructing the backbone network of the pedestrian re-identification model with ResNet-50 is as follows: ResNet-50 is pre-trained on ImageNet, the last spatial down-sampling operation in conv5_x is removed, and an N_i-dim fully connected layer is added for classification, where N_i denotes the number of pedestrian identities in the data set.
Preferably, the specific process of S2 is:
on the basis of the ResNet-50 network model, a multi-scale information fusion structure comprising three branches is created to extract feature information of different scales from the network; the first prediction branch is the normal backbone of the ResNet-50 network model, and the second and third prediction branches are added after stage-3 and stage-4 of the ResNet-50 network model, respectively; the first, second and third prediction branches are collectively called the multi-scale classification module.
Preferably, the three branches have the same structure, each consisting of a global average pooling layer, a feature-vector layer and a classification layer; each of the three branches uses a classification loss function, and after the features of the branches are concatenated a verification loss function is used; the total loss function L_reid is as follows:
L_reid = L_id^1 + μ_1·L_id^2 + μ_2·L_id^3 + ν·L_id^t;
where L_id^1, L_id^2 and L_id^3 are cross-entropy classification loss functions, L_id^t is a triplet loss function, μ_1 and μ_2 are coefficients balancing the multi-scale prediction branches, and ν is the coefficient balancing the verification loss and the classification loss.
Preferably, in S3, the pose transfer module PCM is formed by cascading a plurality of pose conversion blocks (PCBs), and the construction process of the PCM is as follows:
S3-1: fusing the appearance code of the original pedestrian with a pose code that combines the original pose and the target pose, and outputting a pedestrian appearance code of the target pose; S3-2: inputting this appearance code of the target pose into the deconvolution network of the generative adversarial network PGGAN to generate pedestrian samples of the target pose.
Preferably, in S4, the generative adversarial network PGGAN comprises a generator network G, a discriminator network D and the adversarial loss function used, the generator network G comprising the pedestrian appearance encoder E_f, the pedestrian pose encoder E_p, the PCM module and the deconvolution network.
Preferably, in S5, the design flow of the joint training strategy is as follows:
S5-1: each time a pair of pedestrian samples is input into the generative adversarial network PGGAN of S4, the pedestrian appearance encoder E_f and the pedestrian pose encoder E_p extract visual appearance features and pose features, respectively; S5-2: the visual appearance features and the pose features are fused and fed into the deconvolution network to generate a new pedestrian sample; S5-3: on the basis of S5-2, the pedestrian appearance encoder E_f extracts pedestrian appearance features from the generated pedestrian sample;
S5-4: the visual appearance features of the original picture and of the generated picture are fed into the MPN simultaneously to calculate the classification loss.
Preferably, the pedestrian appearance encoder E_f is formed using the convolutional layers of a ResNet-50 network, and the pedestrian pose encoder E_p is formed using the convolutional layers of a VGG-16 network; before pose encoding, the original pose and the target pose are first stacked along the channel direction, and the two different poses are mixed together and fed into the pedestrian pose encoder E_p for encoding, so that the information of both the original pose and the target pose can be retained as much as possible.
Preferably, the PCM module and the deconvolution network constitute the decoder of the generative adversarial network PGGAN.
Preferably, the discriminator network D comprises two discriminators, an appearance discriminator D_a and a pose discriminator D_p: the appearance discriminator D_a judges the appearance similarity between the generated picture and the original picture, and the pose discriminator D_p judges the similarity between the pose of the generated picture and the target pose.
Preferably, both discriminators adopt a ResNet-50 network structure to extract features, and their inputs are the data obtained by stacking the generated picture with the original picture, or the generated picture with the target pose heat map.
Preferably, the output scores of the two discriminators are denoted R_a and R_p, respectively; R_a and R_p are outputs of softmax layers, and the final score of the whole discriminator is the product of the two discriminator scores, i.e., R = R_a·R_p.
Preferably, each PCB is an independent module with a dual-stream structure comprising an appearance coding stream and a pose coding stream, with an interaction structure between the two streams; the input of the appearance coding stream is denoted f_i and the input of the pose coding stream is denoted p_i, and under the action of the two streams the PCB progressively outputs f_{i+1} and p_{i+1}.
Preferably, the PCB processes the appearance data distribution of the original pedestrian and gradually converts it into the appearance data distribution of the target pedestrian pose; the pose code is used as a pose mask that is applied to the appearance code, and the pose-code mask, corresponding to the weight of the new appearance data distribution, is denoted M_p; the expression of M_p is: M_p = α·conv_p1(p_i), where conv_p1 is the first convolution operation of the pose coding stream, comprising three convolution layers and one BN layer.
Preferably, after the pose mask is obtained, it is fused with the appearance coding data, and a residual structure is introduced into the appearance coding stream to alleviate the vanishing-gradient problem caused by the depth of the encoding layers; the output of the appearance coding stream is expressed as:
f_{i+1} = M_p ⊙ [f_i + conv_f(f_i)], where ⊙ denotes the element-wise matrix product, and conv_f, the convolution operation of the appearance coding stream, likewise consists of three convolution layers and one BN layer.
Compared with the prior art, the invention has the following beneficial effects:
1. The pedestrian re-identification method based on a pose-guided generative adversarial network provided by the invention fuses features extracted at different stages of the network without changing the backbone network, and the fused features can effectively cope with pedestrian samples of different scales.
2. In the method provided by the invention, a pedestrian sample image and a target pedestrian pose are input into the generative adversarial network PGGAN, the appearance and pose of the pedestrian are encoded, and the encoded features are fused with the target pose and decoded; the original pedestrian sample can thus be converted into a clear pedestrian sample image in the target pose, and the PGGAN can effectively expand the scale and pose diversity of a pedestrian re-identification data set.
3. The method provided by the invention designs a strategy of jointly training the PGGAN and the pedestrian re-identification network, so that the re-identification network makes full use of the pedestrian appearance information in the PGGAN and the expanded data set, overcomes the influence of pedestrian pose changes, and improves the quality of the sample images generated by the PGGAN.
Drawings
Fig. 1 is an overall network block diagram of the present invention.
Fig. 2 is a multi-scale pedestrian re-identification network framework diagram of the invention.
FIG. 3 is a block diagram of the pose transfer module of the present invention.
FIG. 4 is a diagram of the discriminator network framework of the present invention.
FIG. 5 is a diagram of the pose-guided generative adversarial network framework of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention rather than all of them; all other embodiments obtained by those skilled in the art without creative work on the basis of these embodiments fall within the protection scope of the present invention.
Example 1
With reference to fig. 1, a pedestrian re-identification method based on a pose-guided generative adversarial network includes the following steps:
S1: constructing a backbone network of a pedestrian re-identification model by using ResNet-50, and extracting pedestrian features;
S2: constructing a multi-scale pedestrian re-identification network (MPN) for fusing pedestrian features of different scales;
S3: constructing a pose transfer module (PCM) for fusing the pose and appearance features of pedestrians;
S4: constructing a pose-guided generative adversarial network (PGGAN) for generating pedestrian sample images under different poses;
S5: designing a joint training strategy to mine pedestrian appearance information, overcome the influence of pose changes, and improve the quality of the generated sample images.
Example 2
With reference to figs. 1-5, a pedestrian re-identification method based on a pose-guided generative adversarial network includes the following steps: S1: constructing the backbone network of the pedestrian re-identification model with ResNet-50, and extracting pedestrian features.
First, ResNet-50 is pre-trained on ImageNet; the last spatial down-sampling operation in conv5_x is then removed, and an N_i-dim fully connected layer is added for classification, where N_i denotes the number of pedestrian identities in the data set.
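The effect of removing the last down-sampling step can be checked with simple stride arithmetic. The sketch below (illustrative Python with hypothetical names, not from the original text, assuming the 256 × 128 input size used later in training) counts the five stride-2 stages of a standard ResNet-50 and makes the last one configurable:

```python
def resnet50_feature_size(h, w, last_stride=1):
    """Spatial size of the conv5_x output for an h x w input.

    ResNet-50 halves the resolution at the stem convolution, the max-pool,
    and at the start of conv3_x, conv4_x and conv5_x; last_stride controls
    the conv5_x stage (2 in the stock network, 1 after the modification).
    """
    for stride in (2, 2, 2, 2, last_stride):
        h = (h + stride - 1) // stride  # ceil division models 'same' padding
        w = (w + stride - 1) // stride
    return h, w
```

With the stock stride of 2, a 256 × 128 input yields an 8 × 4 conv5_x map; setting the last stride to 1 doubles it to 16 × 8, preserving finer spatial detail for re-identification.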
S2: constructing the multi-scale pedestrian re-identification network (MPN).
On the basis of the ResNet-50 network model, a multi-scale information fusion structure is designed: besides the normal backbone prediction branch I, two new prediction branches, branch II and branch III, are added after stage-3 and stage-4 of the network, respectively. The three branches have the same structure, each consisting of a global average pooling layer, a feature-vector layer and a classification layer, and are used to extract feature information of different scales from the network.
The feature-vector layer in each branch has 512 neurons, and the classification layer has as many neurons as there are pedestrian IDs in the data set; for example, the Market-1501 training set contains 751 pedestrian IDs, so when training on Market-1501 the classification layer has 751 neurons. The three prediction branches are collectively called the multi-scale classification module.
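As an illustrative sketch of one prediction branch (hypothetical NumPy code, not from the original, with the two fully connected layers reduced to plain weight matrices and biases omitted), the head maps a stage's feature map to a 512-dim embedding and per-identity logits:

```python
import numpy as np

def branch_head(feature_map, w_feat, w_cls):
    """One MPN prediction branch: GAP -> feature-vector layer -> classifier.

    feature_map: (C, H, W) activation from one network stage.
    w_feat: (C, 512) weights of the feature-vector layer (hypothetical).
    w_cls: (512, num_ids) weights of the classification layer (hypothetical).
    """
    pooled = feature_map.mean(axis=(1, 2))  # global average pooling -> (C,)
    embedding = pooled @ w_feat             # 512-dim feature vector
    logits = embedding @ w_cls              # one logit per pedestrian ID
    return embedding, logits
```

For Market-1501, `w_cls` would have 751 output columns, matching the 751 training identities.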
Each branch uses a classification loss function, and after the features of the branches are concatenated a verification loss function is used. The total loss function is: L_reid = L_id^1 + μ_1·L_id^2 + μ_2·L_id^3 + ν·L_id^t, where L_id^1, L_id^2 and L_id^3 are cross-entropy classification loss functions and L_id^t is a triplet loss function; μ_1 and μ_2 are coefficients balancing the multi-scale prediction branches, and ν is the coefficient balancing the verification loss and the classification loss.
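The weighted combination of the four loss terms can be written directly; the helper below is an illustrative fragment (the individual loss values would come from the network, not from this sketch):

```python
def reid_total_loss(l_id1, l_id2, l_id3, l_triplet, mu1, mu2, nu):
    """Total loss: first-branch loss plus weighted losses of the other
    branches and the triplet (verification) loss."""
    return l_id1 + mu1 * l_id2 + mu2 * l_id3 + nu * l_triplet
```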
S3: constructing the pose transfer module PCM for fusing the pose and appearance features of pedestrians.
The PCM is the core of the whole PGGAN and is formed by cascading several pose conversion blocks (PCBs).
The PCM fuses the appearance code of the original pedestrian with a pose code combining the original and target poses, and outputs a pedestrian appearance code of the target pose, which is finally fed into the deconvolution network to generate pedestrian samples of the target pose.
Each PCB is an independent module with a dual-stream structure: one stream is an appearance coding stream and the other a pose coding stream, with an interaction structure between the two; from the inputs f_i and p_i, the two streams progressively produce the outputs f_{i+1} and p_{i+1}.
The specific principle of the PCB is as follows: it processes the appearance data distribution of the original pedestrian and converts it step by step into the appearance data distribution of the target pedestrian pose. The pose code is treated as a pose mask that the PCB applies to the appearance code; this mask corresponds to the weight of the new appearance data distribution. The input p_i of the pose coding stream is therefore encoded by a convolutional layer, multiplied by a weight coefficient, and then point-multiplied with the features of the appearance coding stream. Denoting the pose-code mask as M_p, its expression is: M_p = α·conv_p1(p_i), where conv_p1 is the first convolution operation of the pose coding stream, comprising three convolution layers and one BN layer; its detailed configuration is shown in fig. 4.
After the pose mask is obtained, it is fused with the appearance coding data. Considering that the depth of the encoding layers makes vanishing gradients likely, a residual structure is introduced into the appearance coding stream, whose output is written as: f_{i+1} = M_p ⊙ [f_i + conv_f(f_i)], where ⊙ denotes the element-wise matrix product, and conv_f, the convolution operation of the appearance coding stream, likewise consists of three convolution layers and one BN layer.
A single PCB module is not sufficient to convert the original appearance code distribution directly into the appearance code distribution of the target pedestrian pose, so several PCB modules are connected in series. For every PCB module except the last, the processed appearance code is blended back into the pose code and input to the next module, so that: p_{i+1} = conv_p2[β·f_{i+1} + conv_p1(p_i)], where conv_p2 denotes the second convolution operation of the pose coding stream.
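The three PCB update rules can be sketched as one forward step. In the illustrative NumPy fragment below (hypothetical names, not the original implementation), the convolution sub-networks are passed in as callables, so only the data flow of the block is fixed, not the layer details:

```python
import numpy as np

def pcb_forward(f_i, p_i, conv_p1, conv_p2, conv_f, alpha=1.0, beta=1.0):
    """One pose conversion block: the pose code gates the appearance code.

    conv_p1, conv_p2, conv_f are stand-ins for the conv+BN sub-networks.
    """
    m_p = alpha * conv_p1(p_i)                       # pose mask M_p
    f_next = m_p * (f_i + conv_f(f_i))               # masked residual appearance update
    p_next = conv_p2(beta * f_next + conv_p1(p_i))   # appearance fed back into pose stream
    return f_next, p_next
```

Cascading the PCM then amounts to calling `pcb_forward` repeatedly, feeding each block's outputs into the next.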
S4: constructing the pose-guided generative adversarial network PGGAN for generating pedestrian sample images under different poses.
The PGGAN mainly comprises a generator network G, a discriminator network D and the adversarial loss function used; the generator network G comprises a pedestrian appearance encoder, a pedestrian pose encoder, a PCM module and a deconvolution network.
Encoders: the convolutional layers of a ResNet-50 network serve as the pedestrian appearance encoder E_f, and the convolutional layers of a VGG-16 network serve as the pedestrian pose encoder E_p. Before pose encoding, the original pose and the target pose are first stacked along the channel direction, so that the two different poses are mixed together before being fed into the encoder E_p; this retains as much information about both poses as possible. Fusing the pose-keypoint heat maps before encoding greatly reduces the computation, effectively extracts the dependency between the two poses, and maintains good performance.
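Stacking the two pose heat maps along the channel axis before encoding is a one-line operation; the sketch below assumes, hypothetically, 18 keypoint channels per pose (the original does not state the channel count):

```python
import numpy as np

def fuse_pose_heatmaps(original_pose, target_pose):
    """Stack the two pose heat maps along the channel axis so the pose
    encoder E_p sees both poses in a single input tensor."""
    return np.concatenate([original_pose, target_pose], axis=0)
```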
Decoder: this module is divided into two parts, the pose transfer module (PCM) and the deconvolution network module.
Discriminators: the discriminator network mainly comprises an appearance discriminator D_a and a pose discriminator D_p, which judge, respectively, the appearance similarity between the generated picture and the original picture, and the similarity between the pose of the generated picture and the target pose. Both discriminators adopt a ResNet-50 network structure to extract features, and their inputs are the data obtained by stacking the generated picture with the original picture, or the generated picture with the target pose heat map. Their output scores are denoted R_a and R_p, which are outputs of softmax layers; the final score of the whole discriminator is the product of the two discriminator scores, i.e., R = R_a·R_p.
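The joint score can be sketched as the product of the two softmax outputs. The fragment below is illustrative pure Python; a two-class softmax with the "real" class at index 1 is an assumption for the sketch, not stated in the original:

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def joint_score(appearance_logits, pose_logits, real_index=1):
    """R = R_a * R_p: a generated sample must fool both the appearance
    discriminator and the pose discriminator to score highly."""
    r_a = softmax(appearance_logits)[real_index]
    r_p = softmax(pose_logits)[real_index]
    return r_a * r_p
```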
S5: designing a joint training strategy to fully mine pedestrian appearance information, overcome the influence of pose changes, and improve the quality of the generated sample images.
Specifically, each time a pair of pedestrian samples is input, the pedestrian appearance encoder E_f and the pedestrian pose encoder E_p extract the visual appearance features and pose features, respectively; these are fused and fed into the deconvolution network to generate a new pedestrian sample. On this basis, E_f extracts the appearance features of the generated pedestrian sample, and finally the visual appearance features of the original picture and the generated picture are fed into the MPN to calculate the classification loss.
For data-set preprocessing and training settings, all pedestrian images are uniformly resized to 256 × 128 and augmented with random cropping, random horizontal flipping and random erasing. The methods of S1-S5 are optimized with the Adam optimizer for 800 epochs in total, with the learning rate attenuated every 100 epochs after the first 400 epochs of training. The pedestrian appearance encoder E_f and the PCM module use a Dropout strategy, and the fully connected layers of the multi-scale prediction module use a Leaky ReLU activation with a negative-slope coefficient. During training, the PGGAN is trained alternately, iterating the discriminator once after every two generator iterations; in the test stage, the similarity between two pedestrian samples is evaluated with the Euclidean distance after their feature vectors are normalized.
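The alternating schedule of two generator iterations followed by one discriminator iteration can be sketched as a simple step sequence (illustrative helper with a hypothetical name):

```python
def pggan_schedule(num_iters, g_per_d=2):
    """Return the update order: g_per_d generator steps, then one
    discriminator step, repeated for num_iters total steps."""
    steps = []
    while len(steps) < num_iters:
        steps.extend(["G"] * g_per_d + ["D"])
    return steps[:num_iters]
```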
The invention provides a pose-guided generative adversarial network (PGGAN) that mainly comprises a pedestrian appearance coding module, a pedestrian pose coding module and a decoding module. Given a pedestrian sample image and a target pedestrian pose, the network encodes the pedestrian's appearance and pose, fuses the codes with the target pose and decodes the result, so that the original pedestrian sample can be converted into a clear pedestrian sample image in the target pose. PGGAN can effectively expand the scale and pose diversity of a pedestrian re-identification data set. In addition, to enable the re-identification network to make full use of the pedestrian appearance information in the PGGAN and the expanded data set, overcome the influence of pedestrian pose changes, and improve the quality of the sample images generated by the PGGAN, the invention proposes a strategy of jointly training the PGGAN and the pedestrian re-identification network.
The above description is only a preferred embodiment of the present invention and should not be taken as limiting the invention; any modifications, equivalents and substitutions made within the scope of the present invention shall be included therein.