Training method for a target re-identification model, target re-identification method, and apparatus
1. A training method for a target re-identification model, comprising the following steps:
acquiring a plurality of images, wherein the plurality of images respectively have a plurality of corresponding modalities and a plurality of corresponding labeled target categories;
acquiring a plurality of convolution feature maps respectively corresponding to the plurality of modalities, and acquiring a plurality of edge feature maps respectively corresponding to the plurality of modalities;
acquiring multiple kinds of feature distance information respectively corresponding to the plurality of modalities; and
training an initial re-identification model according to the plurality of images, the plurality of convolution feature maps, the plurality of edge feature maps, the multiple kinds of feature distance information and the plurality of labeled target categories to obtain a target re-identification model.
2. The method of claim 1, wherein the training an initial re-identification model according to the plurality of images, the plurality of convolution feature maps, the plurality of edge feature maps, the multiple kinds of feature distance information, and the plurality of labeled target categories to obtain a target re-identification model comprises:
processing the plurality of images using the initial re-identification model to obtain an initial loss value;
processing the plurality of convolution feature maps and the plurality of edge feature maps using the initial re-identification model to obtain a perceptual edge loss value;
processing the multiple kinds of feature distance information using the initial re-identification model to obtain a cross-modal center contrast loss value; and
training the initial re-identification model according to the initial loss value, the perceptual edge loss value and the cross-modal center contrast loss value to obtain the target re-identification model.
3. The method of claim 2, wherein the initial re-identification model comprises: a first network structure configured to identify perceptual loss values between the convolution feature maps and the edge feature maps.
4. The method of claim 3, wherein the processing the plurality of convolution feature maps and the plurality of edge feature maps using the initial re-identification model to obtain a perceptual edge loss value comprises:
inputting the plurality of convolution feature maps and the plurality of edge feature maps into the first network structure to obtain a plurality of convolution loss feature maps respectively corresponding to the plurality of convolution feature maps and a plurality of edge loss feature maps respectively corresponding to the plurality of edge feature maps;
determining a plurality of convolution feature map parameters corresponding to the plurality of convolution loss feature maps respectively, and determining a plurality of edge feature map parameters corresponding to the plurality of edge loss feature maps respectively;
processing the corresponding convolution loss feature maps according to the plurality of convolution feature map parameters to obtain a plurality of first perceptual edge loss values;
processing the corresponding edge loss feature maps according to the plurality of edge feature map parameters to obtain a plurality of second perceptual edge loss values; and
generating the perceptual edge loss value according to the plurality of first perceptual edge loss values and the plurality of second perceptual edge loss values.
5. The method of claim 2, wherein the initial re-identification model comprises: a batch normalization layer, and the acquiring multiple kinds of feature distance information respectively corresponding to the plurality of modalities comprises:
inputting the plurality of images into the batch normalization layer respectively to obtain a plurality of feature vectors output by the batch normalization layer and respectively corresponding to the plurality of images;
determining, according to the plurality of feature vectors, feature center points of a plurality of targets respectively corresponding to the plurality of images; and
determining a first distance between feature center points of different targets, and determining a second distance between feature center points of the same target corresponding to different modalities, where the first distance and the second distance together constitute the multiple kinds of feature distance information.
6. The method of claim 5, wherein the processing the multiple kinds of feature distance information using the initial re-identification model to obtain a cross-modal center contrast loss value comprises:
determining a first target distance from a plurality of first distances using the initial re-identification model, wherein the first target distance is the first distance having the smallest value among the plurality of first distances; and
calculating the cross-modal center contrast loss value according to the first target distance, a plurality of second distances and the number of the targets.
7. The method of claim 2, wherein the initial re-identification model comprises: a fully connected layer and an output layer connected in sequence, and the processing the plurality of images using the initial re-identification model to obtain an initial loss value comprises:
sequentially inputting the plurality of images into the fully connected layer and the output layer to obtain a plurality of class feature vectors output by the output layer and respectively corresponding to the plurality of images;
determining a plurality of encoding vectors respectively corresponding to the plurality of labeled target categories; and
generating an identity loss value according to the plurality of class feature vectors and the plurality of corresponding encoding vectors, and taking the identity loss value as the initial loss value.
8. The method of claim 5, wherein the processing the plurality of images using the initial re-identification model to obtain an initial loss value comprises:
dividing the plurality of images with reference to the plurality of labeled target categories to obtain a triplet sample set, wherein the triplet sample set comprises: the plurality of images, a plurality of first images and a plurality of second images, the plurality of first images corresponding to the same labeled target categories, and the plurality of second images corresponding to different labeled target categories;
determining a first Euclidean distance between a feature vector of the image and a feature vector of the first image, the feature vectors being output by the batch normalization layer;
determining a second Euclidean distance between the feature vector of the image and the feature vector of the second image; and
determining a triplet loss value according to the plurality of first Euclidean distances and the plurality of second Euclidean distances, and taking the triplet loss value as the initial loss value.
9. The method of claim 2, wherein the training the initial re-identification model according to the initial loss value, the perceptual edge loss value and the cross-modal center contrast loss value to obtain the target re-identification model comprises:
generating a target loss value according to the initial loss value, the perceptual edge loss value and the cross-modal center contrast loss value; and
if the target loss value meets a set condition, taking the re-identification model obtained by training as the target re-identification model.
10. The method of any one of claims 1-9, wherein the plurality of modalities includes: a color image modality and an infrared image modality.
11. A target re-identification method, comprising the following steps:
acquiring a reference image and an image to be identified, wherein the reference image and the image to be identified have different modalities, and the reference image comprises: a reference category; and
respectively inputting the reference image and the image to be identified into the target re-identification model obtained by the training method of a target re-identification model according to any one of claims 1 to 10, to obtain a target output by the target re-identification model and corresponding to the image to be identified, wherein the target has a corresponding target category, and the target category is matched with the reference category.
12. A training apparatus for a target re-identification model, comprising:
a first acquisition module, configured to acquire a plurality of images, the plurality of images respectively having a plurality of corresponding modalities and a plurality of corresponding labeled target categories;
a second acquisition module, configured to acquire a plurality of convolution feature maps respectively corresponding to the plurality of modalities, and acquire a plurality of edge feature maps respectively corresponding to the plurality of modalities;
a third acquisition module, configured to acquire multiple kinds of feature distance information respectively corresponding to the plurality of modalities; and
a training module, configured to train an initial re-identification model according to the plurality of images, the plurality of convolution feature maps, the plurality of edge feature maps, the multiple kinds of feature distance information and the plurality of labeled target categories to obtain a target re-identification model.
13. The apparatus of claim 12, wherein the training module comprises:
a first processing submodule, configured to process the plurality of images using the initial re-identification model to obtain an initial loss value;
a second processing submodule, configured to process the plurality of convolution feature maps and the plurality of edge feature maps using the initial re-identification model to obtain a perceptual edge loss value;
a third processing submodule, configured to process the multiple kinds of feature distance information using the initial re-identification model to obtain a cross-modal center contrast loss value; and
a training submodule, configured to train the initial re-identification model according to the initial loss value, the perceptual edge loss value and the cross-modal center contrast loss value to obtain the target re-identification model.
14. The apparatus of claim 13, wherein the initial re-identification model comprises: a first network structure configured to identify perceptual loss values between the convolution feature maps and the edge feature maps.
15. The apparatus of claim 14, wherein the second processing submodule is specifically configured to:
inputting the plurality of convolution feature maps and the plurality of edge feature maps into the first network structure to obtain a plurality of convolution loss feature maps respectively corresponding to the plurality of convolution feature maps and a plurality of edge loss feature maps respectively corresponding to the plurality of edge feature maps;
determining a plurality of convolution feature map parameters corresponding to the plurality of convolution loss feature maps respectively, and determining a plurality of edge feature map parameters corresponding to the plurality of edge loss feature maps respectively;
processing the corresponding convolution loss feature maps according to the plurality of convolution feature map parameters to obtain a plurality of first perceptual edge loss values;
processing the corresponding edge loss feature maps according to the plurality of edge feature map parameters to obtain a plurality of second perceptual edge loss values; and
generating the perceptual edge loss value according to the plurality of first perceptual edge loss values and the plurality of second perceptual edge loss values.
16. The apparatus of claim 13, wherein the initial re-identification model comprises: a batch normalization layer, and the third acquisition module comprises:
a normalization processing submodule, configured to input the plurality of images into the batch normalization layer respectively to obtain a plurality of feature vectors output by the batch normalization layer and respectively corresponding to the plurality of images;
a center point determining submodule, configured to determine, according to the plurality of feature vectors, feature center points of a plurality of targets respectively corresponding to the plurality of images; and
a distance determining submodule, configured to determine a first distance between the feature center points of different targets and determine a second distance between the feature center points of the same target corresponding to different modalities, the first distance and the second distance jointly constituting the multiple kinds of feature distance information.
17. The apparatus of claim 16, wherein the third processing sub-module is specifically configured to:
determining a first target distance from a plurality of first distances using the initial re-identification model, wherein the first target distance is the first distance having the smallest value among the plurality of first distances; and
calculating the cross-modal center contrast loss value according to the first target distance, a plurality of second distances and the number of the targets.
18. The apparatus of claim 13, wherein the initial re-identification model comprises: a fully connected layer and an output layer connected in sequence, and the first processing submodule is specifically configured to:
sequentially inputting the plurality of images into the fully connected layer and the output layer to obtain a plurality of class feature vectors output by the output layer and respectively corresponding to the plurality of images;
determining a plurality of encoding vectors respectively corresponding to the plurality of labeled target categories; and
generating an identity loss value according to the plurality of class feature vectors and the plurality of corresponding encoding vectors, and taking the identity loss value as the initial loss value.
19. The apparatus of claim 16, wherein the first processing submodule is specifically configured to:
dividing the plurality of images with reference to the plurality of labeled target categories to obtain a triplet sample set, wherein the triplet sample set comprises: the plurality of images, a plurality of first images and a plurality of second images, the plurality of first images corresponding to the same labeled target categories, and the plurality of second images corresponding to different labeled target categories;
determining a first Euclidean distance between a feature vector of the image and a feature vector of the first image, the feature vectors being output by the batch normalization layer;
determining a second Euclidean distance between the feature vector of the image and the feature vector of the second image; and
determining a triplet loss value according to the plurality of first Euclidean distances and the plurality of second Euclidean distances, and taking the triplet loss value as the initial loss value.
20. The apparatus of claim 13, wherein the training submodule is specifically configured to:
generating a target loss value according to the initial loss value, the perceptual edge loss value and the cross-modal center contrast loss value; and
if the target loss value meets a set condition, taking the re-identification model obtained by training as the target re-identification model.
21. The apparatus of any of claims 12-20, wherein the plurality of modalities comprises: a color image modality and an infrared image modality.
22. A target re-identification apparatus, comprising:
a fourth acquisition module, configured to acquire a reference image and an image to be identified, wherein the reference image and the image to be identified have different modalities, and the reference image comprises: a reference category; and
a recognition module, configured to respectively input the reference image and the image to be identified into the target re-identification model obtained by the training apparatus of a target re-identification model according to any one of claims 12 to 21, to obtain a target output by the target re-identification model and corresponding to the image to be identified, wherein the target has a corresponding target category, and the target category is matched with the reference category.
23. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10 or to perform the method of claim 11.
24. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-10 or to perform the method of claim 11.
Background
As people attach increasing importance to security, video surveillance cameras have been deployed in all kinds of living and working environments. A common camera records around-the-clock information by capturing color video during the day and infrared video at night.
Cross-modal target re-identification aims to match targets between three-primary-color (Red Green Blue, RGB) images collected by visible-light cameras and infrared (Infrared Radiation, IR) images collected by infrared cameras. Since images of different modalities (RGB and IR) are heterogeneous, the modality difference degrades matching performance.
When network models in the related art perform cross-modal target re-identification, their mining of features in the RGB and IR images is insufficient, and their stability during model training is weak, which impairs the effect of cross-modal target re-identification.
Disclosure of Invention
The present disclosure provides a training method for a target re-identification model, a target re-identification method, corresponding apparatuses, an electronic device, and a storage medium, which are intended to solve, at least to a certain extent, at least one of the technical problems in the related art.
An embodiment of a first aspect of the present disclosure provides a training method for a target re-identification model, including: acquiring a plurality of images, wherein the plurality of images respectively have a plurality of corresponding modalities and a plurality of corresponding labeled target categories; acquiring a plurality of convolution feature maps respectively corresponding to the plurality of modalities, and acquiring a plurality of edge feature maps respectively corresponding to the plurality of modalities; acquiring multiple kinds of feature distance information respectively corresponding to the plurality of modalities; and training an initial re-identification model according to the plurality of images, the plurality of convolution feature maps, the plurality of edge feature maps, the multiple kinds of feature distance information and the plurality of labeled target categories to obtain a target re-identification model.
An embodiment of a second aspect of the present disclosure provides a target re-identification method, including: acquiring a reference image and an image to be identified, wherein the reference image and the image to be identified have different modalities, and the reference image comprises a reference category; and respectively inputting the reference image and the image to be identified into the target re-identification model obtained by the above training method to obtain a target output by the target re-identification model and corresponding to the image to be identified, wherein the target has a corresponding target category, and the target category is matched with the reference category.
An embodiment of a third aspect of the present disclosure provides a training apparatus for a target re-identification model, including: a first acquisition module, configured to acquire a plurality of images, the plurality of images respectively having a plurality of corresponding modalities and a plurality of corresponding labeled target categories; a second acquisition module, configured to acquire a plurality of convolution feature maps respectively corresponding to the plurality of modalities and acquire a plurality of edge feature maps respectively corresponding to the plurality of modalities; a third acquisition module, configured to acquire multiple kinds of feature distance information respectively corresponding to the plurality of modalities; and a training module, configured to train an initial re-identification model according to the plurality of images, the plurality of convolution feature maps, the plurality of edge feature maps, the multiple kinds of feature distance information and the plurality of labeled target categories to obtain a target re-identification model.
An embodiment of a fourth aspect of the present disclosure provides a target re-identification apparatus, including: a fourth acquisition module, configured to acquire a reference image and an image to be identified, wherein the reference image and the image to be identified have different modalities, and the reference image comprises a reference category; and a recognition module, configured to respectively input the reference image and the image to be identified into the target re-identification model obtained by the above training to obtain a target output by the target re-identification model and corresponding to the image to be identified, wherein the target has a corresponding target category, and the target category is matched with the reference category.
An embodiment of a fifth aspect of the present disclosure provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the training method of a target re-identification model of the embodiments of the disclosure or to perform the target re-identification method.
An embodiment of a sixth aspect of the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the training method of a target re-identification model disclosed in the embodiments of the present disclosure or to execute the target re-identification method.
In this embodiment, a plurality of images are acquired, the plurality of images respectively having a plurality of corresponding modalities and a plurality of corresponding labeled target categories; a plurality of convolution feature maps and a plurality of edge feature maps respectively corresponding to the plurality of modalities are acquired; multiple kinds of feature distance information respectively corresponding to the plurality of modalities are acquired; and an initial re-identification model is trained according to the plurality of images, the plurality of convolution feature maps, the plurality of edge feature maps, the multiple kinds of feature distance information and the plurality of labeled target categories to obtain a target re-identification model. The trained re-identification model can thus fully mine the features in images of multiple modalities and enhance the accuracy of matching images of different modalities, thereby improving the effect of cross-modal target re-identification. This resolves the technical problem in the related art that a network model's insufficient mining of features in multi-modal images impairs the cross-modal target re-identification effect.
Additional aspects and advantages of the disclosure will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
The foregoing and/or additional aspects and advantages of the present disclosure will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flowchart of a training method of a target re-identification model according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a network structure of a re-identification model provided according to an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of a training method of a target re-identification model according to another embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a first network structure provided according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a feature space structure of a target provided according to an embodiment of the present disclosure;
FIG. 6 is a schematic flowchart of a training method of a target re-identification model according to another embodiment of the present disclosure;
FIG. 7 is a flowchart of a training process of a target re-identification model provided according to an embodiment of the present disclosure;
FIG. 8 is a schematic flowchart of a target re-identification method according to another embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a training apparatus for a target re-identification model according to an embodiment of the present disclosure;
FIG. 10 is a schematic diagram of a training apparatus for a target re-identification model according to another embodiment of the present disclosure;
FIG. 11 is a schematic diagram of a target re-identification apparatus provided according to another embodiment of the present disclosure; and
FIG. 12 is a block diagram of an exemplary computer device suitable for implementing embodiments of the present disclosure.
Detailed Description
Reference will now be made in detail to the embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present disclosure, and should not be construed as limiting it. On the contrary, the embodiments of the disclosure cover all changes, modifications and equivalents falling within the spirit and scope of the appended claims.
In view of the technical problem mentioned in the Background that network models in the related art mine features in multi-modal images insufficiently, which impairs the cross-modal target re-identification effect, the technical solution of this embodiment provides a training method for a target re-identification model, described below with reference to specific embodiments.
It should be noted that the execution subject of the training method for the target re-identification model in this embodiment may be a training apparatus for the target re-identification model. The apparatus may be implemented in software and/or hardware and may be configured in an electronic device, and the electronic device may include, but is not limited to, a terminal, a server, and the like.
Fig. 1 is a schematic flowchart of a training method of a target re-identification model according to an embodiment of the present disclosure. Referring to fig. 1, the method includes:
S101: acquiring a plurality of images, wherein the plurality of images respectively have a plurality of corresponding modalities and a plurality of corresponding labeled target categories.
The plurality of images may be captured by an image acquisition device in any possible scene, or may be obtained from the Internet, which is not limited herein.
The plurality of images respectively have a plurality of modalities, for example: a color image modality, an infrared image modality, and any other possible image modality. The color image modality may be the RGB modality, and the infrared image modality may be the IR modality; the plurality of modalities are not limited herein.
That is, the plurality of images in this embodiment may have the RGB modality and the IR modality. In practical applications, an image capture device (e.g., a camera) may capture color images or video frames during the day (RGB modality) and infrared images or video frames at night (IR modality), so that a plurality of images of multiple modalities can be obtained.
Multiple target objects may be present in the plurality of images; for example, a plurality of images of multiple modalities may be acquired for different target objects, where the target objects may be pedestrian 1, pedestrian 2, vehicle 1, vehicle 2, and the like.
Information labeling the category of a target object may be referred to as a labeled target category. The labeled target category may take the form of, for example, numeric labels, where different values represent different categories of target objects; the labeled target categories may be used to distinguish the target objects in the plurality of images.
In addition, the plurality of images can be divided into a training set and a test set, each containing images and the labeled target categories corresponding to those images.
S102: a plurality of convolution feature maps corresponding to the plurality of modes are acquired, and a plurality of edge feature maps corresponding to the plurality of modes are acquired.
After the plurality of images are acquired, a plurality of convolution feature maps and a plurality of edge feature maps corresponding to the plurality of modalities are further acquired.
A feature map obtained by performing a convolution operation on the images of the plurality of modalities may be referred to as a convolution feature map. Embodiments of the present disclosure may employ any one or more convolutional layers of a neural network to perform the convolution operation, for example extracting the plurality of convolution feature maps using Layer0 of the residual network ResNet; any other possible method may also be used, which is not limited herein.
The edge feature maps may represent edge contour information of the target objects in the images of the multiple modalities. In this embodiment, for example, the Sobel operator may be used to perform a convolution operation on the plurality of images to extract the edge information of the target objects and obtain the plurality of edge feature maps; any other possible manner may also be used, which is not limited herein.
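For illustration only, the following is a minimal PyTorch sketch of extracting an edge feature map with the Sobel operator; the function name, tensor shapes and gradient-magnitude combination are assumptions made for this sketch rather than details taken from the disclosure.

```python
import torch
import torch.nn.functional as F

def sobel_edge_map(images: torch.Tensor) -> torch.Tensor:
    """Return edge feature maps for a batch of images (B, C, H, W) by
    convolving each channel with the horizontal/vertical Sobel kernels
    and taking the gradient magnitude."""
    gx = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]])
    gy = gx.t()
    c = images.shape[1]
    # One depthwise kernel per channel (groups=c keeps channels separate).
    kx = gx.expand(c, 1, 3, 3).to(images)
    ky = gy.expand(c, 1, 3, 3).to(images)
    ex = F.conv2d(images, kx, padding=1, groups=c)
    ey = F.conv2d(images, ky, padding=1, groups=c)
    return torch.sqrt(ex ** 2 + ey ** 2 + 1e-12)

# e.g. edge maps for 3-channel RGB images (or channel-repeated IR images)
edges = sobel_edge_map(torch.rand(8, 3, 256, 128))
```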
That is to say, in order to overcome the feature difference between the RGB modality and the IR modality, embodiments of the present disclosure may use the edge contour information of the target object as guidance during model training and optimize the modality-specific feature space, thereby mining features common to the modalities.
S103: and acquiring various characteristic distance information respectively corresponding to various modes.
After the plurality of convolution feature maps and the plurality of edge feature maps are obtained, multiple kinds of feature distance information respectively corresponding to the plurality of modalities are further acquired.
The multiple kinds of feature distance information may be distances between feature center points of targets of different labeled target categories, and/or distances between feature center points of the same target corresponding to different modalities, or any other possible feature distance information, which is not limited herein.
For example, in determining the multiple kinds of feature distance information, a plurality of feature vectors corresponding to the plurality of images may first be determined; the feature center points may then be determined from the plurality of feature vectors, and the multiple kinds of feature distance information may be determined from the feature center points. The specific manner of calculating the multiple kinds of feature distance information is described in the following embodiments.
S104: and training an initial re-recognition model according to the plurality of images, the plurality of convolution characteristic graphs, the plurality of edge characteristic graphs, the plurality of characteristic distance information and the plurality of labeled target classes to obtain a target re-recognition model.
The re-identification model of the embodiments of the present disclosure may be based on a convolutional neural network; specifically, the residual network ResNet50 may be used as the backbone of the re-identification model. Fig. 2 is a schematic diagram of the network structure of a re-identification model provided according to an embodiment of the present disclosure. As shown in fig. 2, ResNet50 may be divided into two parts: the convolutional layer of the beginning stage (ResNet Layer0) adopts a dual-stream design, while the convolutional layers of the last four stages (ResNet Layer1-4) use a weight-sharing strategy across the two streams to uniformly extract the information of the two modalities.
During training, the parameters of the initial re-identification model (ResNet50) can be optimized and adjusted according to the relationships among the plurality of images, the plurality of convolution feature maps, the plurality of edge feature maps, the multiple kinds of feature distance information and the plurality of labeled target categories until the model converges, yielding the target re-identification model.
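As a concrete reading of the dual-stream design in fig. 2, the following is a minimal sketch assuming torchvision's ResNet50, where "Layer0" is interpreted as the stem (conv1, bn1, relu, maxpool) duplicated per modality and Layer1-4 are shared; this is an illustrative interpretation, not a definitive implementation of the disclosure.

```python
import copy
import torch
import torch.nn as nn
from torchvision.models import resnet50

class DualStreamResNet50(nn.Module):
    """ResNet50 split as in fig. 2: a modality-specific Layer0 per stream,
    with Layer1-4 shared between the two modalities."""
    def __init__(self):
        super().__init__()
        base = resnet50(weights=None)
        stem = nn.Sequential(base.conv1, base.bn1, base.relu, base.maxpool)
        self.layer0_rgb = stem                 # stream for the RGB modality
        self.layer0_ir = copy.deepcopy(stem)   # separate weights for IR
        # The last four stages share weights across the two streams.
        self.shared = nn.Sequential(base.layer1, base.layer2,
                                    base.layer3, base.layer4)

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        f0 = self.layer0_rgb(x) if modality == "rgb" else self.layer0_ir(x)
        return self.shared(f0)  # (B, 2048, H/32, W/32) feature map

model = DualStreamResNet50()
feat_rgb = model(torch.rand(4, 3, 256, 128), "rgb")
feat_ir = model(torch.rand(4, 3, 256, 128), "ir")
```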
In this embodiment, a plurality of images are acquired, the plurality of images respectively having a plurality of corresponding modalities and a plurality of corresponding labeled target categories; a plurality of convolution feature maps and a plurality of edge feature maps respectively corresponding to the plurality of modalities are acquired; multiple kinds of feature distance information respectively corresponding to the plurality of modalities are acquired; and an initial re-identification model is trained according to the plurality of images, the plurality of convolution feature maps, the plurality of edge feature maps, the multiple kinds of feature distance information and the plurality of labeled target categories to obtain a target re-identification model. The trained re-identification model can thus fully mine the features in images of multiple modalities and enhance the accuracy of matching images of different modalities, thereby improving the effect of cross-modal target re-identification. This resolves the technical problem in the related art that a network model's insufficient mining of features in multi-modal images impairs the cross-modal target re-identification effect.
Fig. 3 is a schematic flowchart of a training method of a target re-identification model according to another embodiment of the present disclosure. Referring to fig. 3, the method includes:
S301: acquiring a plurality of images, wherein the plurality of images respectively have a plurality of corresponding modalities and a plurality of corresponding labeled target categories.
S302: acquiring a plurality of convolution feature maps respectively corresponding to the plurality of modalities, and acquiring a plurality of edge feature maps respectively corresponding to the plurality of modalities.
S303: acquiring multiple kinds of feature distance information respectively corresponding to the plurality of modalities.
For specific descriptions of S301 to S303, reference may be made to the above embodiments, which are not described herein again.
S304: the plurality of images are processed using an initial re-recognition model to obtain an initial loss value.
In the operation of training the initial re-identification model, the plurality of images are first processed using the initial re-identification model to obtain an initial loss value. For example, the initial loss value of the initial re-identification model may be calculated using the identity loss function (Id Loss), or may be determined using other loss functions, which is not limited herein.
In some embodiments, as shown in fig. 2, the initial re-identification model may include a fully connected (FC) layer and an output layer (e.g., a Softmax classifier) connected in sequence. In processing the plurality of images using the initial re-identification model to obtain the initial loss value, the plurality of images may first be sequentially input into the fully connected layer and the output layer to obtain a plurality of class feature vectors output by the output layer and respectively corresponding to the plurality of images.
For example, let rgb and ir denote the two modalities, and let $X_m = \{x_m \mid x_m \in \mathbb{R}^{H \times W \times 3}\}$ denote the input image sets (training sets or test sets), where $m \in \{rgb, ir\}$, H and W respectively denote the height and width of an image, and 3 denotes the number of channels (an RGB image contains the three channels R, G and B; an IR image is converted to 3 channels by repeating its single channel 3 times). During training, one batch contains B images, so $x_i^m$ denotes an RGB or IR image, with $i \in \{1, 2, \ldots, B\}$.
As shown in fig. 2, an input image $x_i^m$ is passed through the network model to the final fully connected (FC) layer and output layer (Softmax) operations; the resulting vector may be referred to as a class feature vector, denoted for example $p_i$. The class feature vectors corresponding to the plurality of images are denoted $\{p_i^j\}$, where $j \in \{1, 2, \ldots, N\}$ and N is the number of target classes in the plurality of images.
Further, a plurality of encoding vectors respectively corresponding to the plurality of labeled target categories are determined. For example, the plurality of labeled target categories may be encoded by one-hot coding to obtain the encoding vectors, denoted $y_i$; the plurality of encoding vectors can then be represented as $\{y_i^j\}$.
Further, an identity loss value is generated from the plurality of class feature vectors and the plurality of corresponding encoding vectors; that is, this embodiment may compute the identity loss function (Id Loss) over the plurality of class feature vectors and the plurality of corresponding encoding vectors to obtain the identity loss value, and use the identity loss value as the initial loss value.
The identity loss function Id Loss may be expressed as:

$$\mathcal{L}_{id} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{j=1}^{N} y_i^j \log\left(p_i^j\right)$$
it is to be understood that the above examples are merely illustrative of the identity loss value as the initial loss value, and other loss functions may be used to determine the initial loss value in practical applications, which is not limited thereto.
Taking the identity loss value as the initial loss value gives the model a good pedestrian re-identification effect.
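For illustration, a minimal sketch of the identity loss as reconstructed above (cross-entropy between the Softmax class feature vectors $p_i$ and the one-hot encoding vectors $y_i$); the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def id_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Identity loss: cross-entropy between the Softmax class feature
    vectors p_i and the one-hot encoding vectors y_i of the labels,
    averaged over the batch of size B."""
    log_p = F.log_softmax(logits, dim=1)                  # log p_i^j
    y = F.one_hot(labels, num_classes=logits.shape[1])    # y_i^j
    return -(y * log_p).sum(dim=1).mean()

# Equivalent to F.cross_entropy(logits, labels).
loss = id_loss(torch.randn(8, 100), torch.randint(0, 100, (8,)))
```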
S305: and processing the plurality of convolution feature maps and the plurality of edge feature maps by adopting an initial re-recognition model to obtain a perception edge loss value.
In some embodiments, the initial re-identification model may include a first network structure. Fig. 4 is a schematic structural diagram of the first network structure provided according to an embodiment of the present disclosure. As shown in fig. 4, the first network structure may be, for example, the deep convolutional neural network VGGNet-16, and may identify perceptual loss values between the convolution feature maps and the edge feature maps. By adopting VGGNet-16 as the first network structure, the loss between the convolution feature maps and the edge feature maps can be identified at depth, improving the accuracy of the perceptual loss values.
Specifically, as shown in fig. 4, the plurality of convolution feature maps extracted by ResNet Layer0 and the plurality of edge feature maps extracted by the Sobel operator may be input into VGGNet-16, where the VGGNet-16 network comprises four stages $\Phi = \{\phi_1, \phi_2, \phi_3, \phi_4\}$. After passing through the four stages, the plurality of convolution feature maps yield a plurality of corresponding convolution loss feature maps, and the plurality of edge feature maps yield a plurality of corresponding edge loss feature maps.
Further, a plurality of convolution feature map parameters corresponding to the plurality of convolution loss feature maps, respectively, are determined, and a plurality of edge feature map parameters corresponding to the plurality of edge loss feature maps, respectively, are determined.
Let $\phi_t(z)$ denote a convolution loss feature map or edge loss feature map extracted by the first network structure at stage t, and assume its shape is $C_t \times H_t \times W_t$; then $C_t \times H_t \times W_t$ can serve as the feature map parameter of that convolution loss feature map or edge loss feature map.
The perceptual edge loss value is calculated as:

$$\ell_{pef}(z, \hat{z}) = \sum_{t=1}^{4} \frac{1}{C_t H_t W_t} \left\| \phi_t(z) - \phi_t(\hat{z}) \right\|_2^2$$

where $z$ and $\hat{z}$ respectively denote the input convolution feature map and edge feature map.
Further, the corresponding convolution loss feature maps are processed according to the convolution feature map parameters to obtain a plurality of first perceptual edge loss values, and the corresponding edge loss feature maps are processed according to the edge feature map parameters to obtain a plurality of second perceptual edge loss values.
The first perceptual edge loss value may be expressed as $\ell_{pef}(z_{rgb}, \hat{z}_{rgb})$, and the second perceptual edge loss value as $\ell_{pef}(z_{ir}, \hat{z}_{ir})$, where $z_{rgb}$ and $z_{ir}$ respectively denote the convolution feature maps extracted by the respective ResNet Layer0 of the two modalities, and $\hat{z}_{rgb}$ and $\hat{z}_{ir}$ respectively denote the edge feature maps of the corresponding modalities.
Further, the perceptual edge loss value is generated from the plurality of first perceptual edge loss values and the plurality of second perceptual edge loss values, for example by taking their sum, so that the perceptual edge loss value is expressed as $\mathcal{L}_{pef} = \ell_{pef}(z_{rgb}, \hat{z}_{rgb}) + \ell_{pef}(z_{ir}, \hat{z}_{ir})$.
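For illustration, a minimal sketch of $\ell_{pef}$ assuming torchvision's VGG16 as the loss network; the stage boundaries $\phi_1$-$\phi_4$ are taken at the first four pooling layers, and the inputs are assumed to be 3-channel (the disclosure feeds Layer0 feature maps and Sobel edge maps into VGGNet-16; any channel adaptation required is not specified here).

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class PerceptualEdgeLoss(nn.Module):
    """l_pef(z, z_hat) = sum_t ||phi_t(z) - phi_t(z_hat)||_2^2 / (C_t H_t W_t),
    with phi_1..phi_4 taken as the first four pooling stages of VGG16
    (assumed stage boundaries). Inputs are assumed to be 3-channel."""
    def __init__(self):
        super().__init__()
        feats = vgg16(weights=None).features
        self.stages = nn.ModuleList([feats[:5], feats[5:10],
                                     feats[10:17], feats[17:24]])
        for p in self.parameters():
            p.requires_grad_(False)  # fixed loss network

    def forward(self, z: torch.Tensor, z_hat: torch.Tensor) -> torch.Tensor:
        loss = torch.zeros((), device=z.device)
        for stage in self.stages:
            z, z_hat = stage(z), stage(z_hat)
            c, h, w = z.shape[1:]
            loss = loss + (z - z_hat).pow(2).sum(dim=(1, 2, 3)).mean() / (c * h * w)
        return loss

# Total PEF loss is the sum over the two modalities:
# pef = loss_net(z_rgb, edge_rgb) + loss_net(z_ir, edge_ir)
loss_net = PerceptualEdgeLoss()
pef = loss_net(torch.rand(2, 3, 256, 128), torch.rand(2, 3, 256, 128))
```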
By incorporating the perceptual edge loss (PEF Loss), this embodiment can use the edge information of the images as guidance to mine common information in the modality feature spaces and reduce the difference between modalities, thereby improving the effect of cross-modal target re-identification.
S306: and processing various characteristic distance information by adopting an initial re-recognition model to obtain a cross-mode center contrast loss value.
Embodiments of the present disclosure may also process the multiple kinds of feature distance information using the initial re-identification model to obtain a cross-modal center contrast loss value.
Fig. 5 is a schematic diagram of the feature space structure of a target provided according to an embodiment of the present disclosure. As shown in fig. 5, the cross-modal center contrast loss acts on the feature space common to the modalities. In this embodiment, the initial re-identification model may be used to process the multiple kinds of feature distance information, for example the distances between the feature center points of targets of different categories, or the distances between the feature center points of the same category of target corresponding to different modalities, to obtain the cross-modal center contrast loss value.
S307: and training an initial re-recognition model according to the initial loss value, the perception edge loss value and the cross-mode center contrast loss value to obtain a target re-recognition model.
Some embodiments may first generate a target loss value from the initial loss value, the perceptual edge loss value and the cross-modal center contrast loss value; the target loss value may be, for example, their sum, in which case it may be expressed as $\mathcal{L} = \mathcal{L}_{init} + \mathcal{L}_{pef} + \mathcal{L}_{cmcc}$, where $\mathcal{L}_{pef}$ denotes the perceptual edge loss value, $\mathcal{L}_{init}$ denotes the initial loss value, and $\mathcal{L}_{cmcc}$ denotes the cross-modal center contrast loss value.
Further, the initial re-identification model is trained according to the target loss value; that is, the parameters of the re-identification model are adjusted according to the target loss value until the target loss value meets a set condition, for example the condition of model convergence, at which point the re-identification model obtained by training is taken as the target re-identification model. Thus, during model training, the multi-task loss (i.e., the various loss values) optimizes and adjusts the modality-specific feature space and the common feature space in a targeted manner, enhancing the model's cross-modal feature extraction capability so that the model can extract more discriminative features, meet the feature requirements of cross-modal target re-identification, and improve the target re-identification effect.
In this embodiment, a plurality of images are acquired, the plurality of images respectively having a plurality of corresponding modalities and a plurality of corresponding labeled target categories; a plurality of convolution feature maps and a plurality of edge feature maps respectively corresponding to the plurality of modalities are acquired; multiple kinds of feature distance information respectively corresponding to the plurality of modalities are acquired; and an initial re-identification model is trained according to these to obtain a target re-identification model. The trained re-identification model can thus fully mine the features in images of multiple modalities and enhance the accuracy of matching images of different modalities, thereby improving the effect of cross-modal target re-identification and resolving the technical problem in the related art that insufficient feature mining in multi-modal images impairs that effect. In addition, taking the identity loss value as the initial loss value gives the model a better pedestrian re-identification effect. By adopting VGGNet-16 as the first network structure, the loss between the convolution feature maps and the edge feature maps can be identified at depth, improving the accuracy of the perceptual loss values. During model training, the multi-task loss (i.e., the various loss values) optimizes and adjusts the modality-specific feature space and the common feature space in a targeted manner, enhancing the model's cross-modal feature extraction capability so that it can extract more discriminative features, meet the feature requirements of cross-modal target re-identification, and improve the target re-identification effect.
Fig. 6 is a schematic flowchart of a training method of a target re-identification model according to another embodiment of the present disclosure. Referring to fig. 6, the method includes:
S601: acquiring a plurality of images, wherein the plurality of images respectively have a plurality of corresponding modalities and a plurality of corresponding labeled target categories.
S602: acquiring a plurality of convolution feature maps respectively corresponding to the plurality of modalities, and acquiring a plurality of edge feature maps respectively corresponding to the plurality of modalities.
For specific descriptions of S601-S602, reference may be made to the above embodiments, which are not described herein again.
S603: and respectively inputting the images into the batch standardization layer to obtain a plurality of feature vectors which are output by the batch standardization layer and respectively correspond to the images.
In some embodiments, as shown in fig. 2, the initial re-identification model further includes a Batch Normalization (BN) layer. In the operation of acquiring the multiple kinds of feature distance information corresponding to the modalities, the plurality of images are first input into the batch normalization layer respectively to obtain the feature vectors corresponding to the plurality of images output by the BN layer (denoted, for example, $f_i^m$).
S604: and determining feature central points of a plurality of targets respectively corresponding to the plurality of images according to the plurality of feature vectors.
For example, suppose a batch contains P classes of targets, each class containing K RGB images and K IR images, i.e., $B = 2 \times P \times K$. Let $c_k^m$ denote the feature center point of the k-th class target in modality m, where $m \in \{rgb, ir\}$; it may be computed as $c_k^m = \frac{1}{K} \sum_{i=1}^{K} f_i^m$, giving the feature center points $c_k^{rgb}$ and $c_k^{ir}$ of the k-th class target.
S605: determining a first distance between feature center points of different targets, and determining a second distance between feature center points of the same target corresponding to different modalities, wherein the first distance and the second distance jointly form multiple kinds of feature distance information.
Further, a first distance between the feature center points of different targets is determined, namely the distance between the feature centers of different classes of targets, which may be denoted $d_{inter}$. Likewise, a second distance between the feature center points of the same target corresponding to different modalities may be determined, namely the distance between the feature centers of the two modalities of the same class of target, which may be denoted $d_{intra}$; the first distances and the second distances together constitute the multiple kinds of feature distance information. Determining the multiple kinds of feature distance information from the relationships between the feature center points of the targets in this way constrains the relationship between the modality centers and the category centers and helps tune the feature extraction capability of the model.
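For illustration, a minimal sketch of computing the feature center points and the first/second distances, assuming the BN-layer features arrive grouped by class and modality; pooling the centers of both modalities when forming the inter-class distances is an assumption of this sketch.

```python
import torch

def center_distances(f_rgb: torch.Tensor, f_ir: torch.Tensor):
    """f_rgb, f_ir: (P, K, D) BN-layer feature vectors, K images per class.
    Returns d_intra (P,): second distances between the two modality centers
    of each class, and d_inter: first distances between centers of
    different classes (pooled over both modality centers)."""
    c_rgb = f_rgb.mean(dim=1)             # (P, D) per-class centers c_k^rgb
    c_ir = f_ir.mean(dim=1)               # (P, D) per-class centers c_k^ir
    d_intra = (c_rgb - c_ir).norm(dim=1)  # second distances

    centers = torch.cat([c_rgb, c_ir])                 # (2P, D)
    labels = torch.arange(c_rgb.shape[0]).repeat(2)    # class of each center
    pair_d = torch.cdist(centers, centers)             # all pairwise distances
    mask = labels[:, None] != labels[None, :]          # different classes only
    return d_intra, pair_d[mask]

d_intra, d_inter = center_distances(torch.randn(6, 4, 512), torch.randn(6, 4, 512))
```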
It is understood that the above examples are only illustrative of obtaining the multiple kinds of feature distance information; in practical applications, any other possible manner may be used, which is not limited herein.
S606: the plurality of images are processed using an initial re-recognition model to obtain an initial loss value.
In some embodiments, in determining the initial loss value, the plurality of images may further be divided with reference to the plurality of labeled target categories to obtain a triplet sample set. The triplet sample set may include: the plurality of images (denoted $x_i$), a plurality of first images (denoted $x_i^+$) and a plurality of second images (denoted $x_i^-$), where the first images correspond to the same labeled target category as $x_i$ and the second images correspond to different labeled target categories; $x_i$ and $x_i^+$ can constitute positive sample pairs, and $x_i$ and $x_i^-$ can constitute negative sample pairs.
Further, a first Euclidean distance between the feature vector of an image and the feature vector of a first image is determined, the feature vectors being output by the batch normalization (BN) layer; that is, the distance between the feature vector of the image and the feature vector of the first image is calculated on the BN-layer features to obtain the first Euclidean distance.
Likewise, a second Euclidean distance between the feature vector of the image and the feature vector of a second image may be determined; the first and second Euclidean distances are denoted, for example, $d_{ii^+}$ and $d_{ii^-}$.
Further, a triplet loss value is determined from the plurality of first Euclidean distances and the plurality of second Euclidean distances and taken as the initial loss value, calculated as:

$$\mathcal{L}_{wrt} = \sum_{i} \log\left(1 + \exp\left(\sum_{i^+} w_{ii^+} d_{ii^+} - \sum_{i^-} w_{ii^-} d_{ii^-}\right)\right), \quad w_{ii^+} = \frac{e^{d_{ii^+}}}{\sum_{d \in \mathcal{P}_i} e^{d}}, \quad w_{ii^-} = \frac{e^{-d_{ii^-}}}{\sum_{d \in \mathcal{N}_i} e^{-d}}$$

where $d_{ii^+}$ denotes a first Euclidean distance, $d_{ii^-}$ denotes a second Euclidean distance, and $\mathcal{P}_i$ and $\mathcal{N}_i$ respectively denote the sets of positive and negative sample pairs. Thus, a positive/negative-sample concept can be introduced during model training by means of the weighted triplet loss function (WRT Loss), making the class prediction results more compact within classes and better separated between classes.
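For illustration, a minimal sketch of the weighted triplet loss matching the reconstructed formula above, where the weights are softmax-normalized over each anchor's positive and negative distances; this weighting scheme is an assumption, since the original formula image is not reproduced in this text.

```python
import torch
import torch.nn.functional as F

def wrt_loss(features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Weighted triplet loss over BN-layer features (B, D): softmax-weighted
    positive distances vs. softmin-weighted negative distances per anchor,
    wrapped in softplus. Assumes every class appears at least twice."""
    d = torch.cdist(features, features)  # Euclidean distances d_ii+/d_ii-
    same = labels[:, None] == labels[None, :]
    pos = same & ~torch.eye(len(labels), dtype=torch.bool, device=d.device)
    neg = ~same
    neg_inf = torch.full_like(d, float('-inf'))
    w_pos = torch.where(pos, d, neg_inf).softmax(dim=1)   # w_ii+
    w_neg = torch.where(neg, -d, neg_inf).softmax(dim=1)  # w_ii-
    pos_term = (w_pos * d.masked_fill(~pos, 0)).sum(dim=1)
    neg_term = (w_neg * d.masked_fill(~neg, 0)).sum(dim=1)
    return F.softplus(pos_term - neg_term).mean()

labels = torch.arange(4).repeat_interleave(4)  # PK sampling: 4 ids x 4 images
loss = wrt_loss(F.normalize(torch.randn(16, 512)), labels)
```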
S607: and processing the plurality of convolution feature maps and the plurality of edge feature maps by adopting an initial re-recognition model to obtain a perception edge loss value.
For specific description of S607, refer to the above embodiments, which are not described herein.
S608: and determining a first target distance from the plurality of first distances by using the initial re-recognition model, wherein the first target distance is the first distance with the smallest median of the plurality of first distances.
The first distance having the smallest value among the plurality of first distances may be referred to as the first target distance; for example, if $d_{inter}^{min}$ denotes the minimum value among all $d_{inter}$, then $d_{inter}^{min}$ may be the first target distance.
S609: and calculating the cross-modal center contrast loss value according to the first target distance, the plurality of second distances and the number of targets.
Further, the cross-modal center contrast loss value (which may be referred to as the CMCC loss) is calculated according to the first target distance $d_{inter}^{min}$, the plurality of second distances $d_{intra}$ and the number P of target classes.
in this embodiment, the distance between different modalities of the same category can be reduced through CMCC loss, and the distance between features of different categories can be reduced, so that the feature f extracted by the model is optimizedi mIs convenient for later use of the layer of characteristics to carry out the targetAnd matching work of re-identification.
S610: and training an initial re-recognition model according to the initial loss value, the perception edge loss value and the cross-mode center contrast loss value to obtain a target re-recognition model.
For example, a target loss value is generated from the initial loss value, the perceptual edge loss value and the cross-modal center contrast loss value; the target loss value may be, for example, their sum, in which case it may be expressed as $\mathcal{L} = \mathcal{L}_{id} + \mathcal{L}_{wrt} + \mathcal{L}_{pef} + \mathcal{L}_{cmcc}$, where $\mathcal{L}_{pef}$ denotes the perceptual edge loss value, $\mathcal{L}_{id}$ and $\mathcal{L}_{wrt}$ together constitute the initial loss value, and $\mathcal{L}_{cmcc}$ denotes the cross-modal center contrast loss value. Further, the initial re-identification model is trained according to the target loss value.
In this embodiment, a plurality of images with corresponding modalities and labeled target categories are acquired; a plurality of convolution feature maps, a plurality of edge feature maps, and multiple kinds of feature distance information respectively corresponding to the plurality of modalities are acquired; and an initial re-identification model is trained according to these to obtain a target re-identification model. The trained re-identification model can thus fully mine the features in images of multiple modalities and enhance the accuracy of matching images of different modalities, thereby improving the effect of cross-modal target re-identification and resolving the technical problem in the related art that insufficient feature mining in multi-modal images impairs that effect. In addition, determining the multiple kinds of feature distance information from the relationships between the feature center points of the targets constrains the relationship between the modality centers and the category centers and helps tune the feature extraction capability of the model. Moreover, the CMCC loss can shorten the distance between different modalities of the same category while enlarging the distance between features of different categories, thereby optimizing the distribution of the features $f_i^m$ extracted by the model and facilitating the later use of this layer of features for target re-identification matching.
In practical applications, as shown in fig. 2, the backbone network of the target re-recognition model is a convolutional neural network (ResNet50 is used here). Specifically, for the inputs of the two modalities, namely color images and infrared images, the present disclosure divides ResNet50 into two parts: the convolutional layer of the beginning stage (ResNet Layer0) adopts a dual-stream design, while the convolutional layers of the following four stages (ResNet Layer1-4) use a dual-stream shared-weight strategy and extract the information of the two modalities uniformly. A pooling operation is then performed on the feature map produced by the convolutional layers (Generalized-mean (GeM) Pooling in this embodiment), and the feature vector corresponding to each image is obtained through Batch Normalization (BN); this vector is used for image re-recognition matching during testing. During training, the feature vector further passes through a Fully Connected (FC) layer and a Softmax operation to obtain the classification scores of the target object.
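The following is a minimal PyTorch sketch of the dual-stream design just described, assuming torchvision's ResNet50; the module names (DualStreamReID, GeM) and the class count are illustrative, not from the disclosure.

```python
import torch
import torch.nn as nn
import torchvision

class GeM(nn.Module):
    """Generalized-mean (GeM) pooling: (mean(x^p))^(1/p) with learnable p."""
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.ones(1) * p)
        self.eps = eps

    def forward(self, x):
        x = x.clamp(min=self.eps).pow(self.p)
        return torch.nn.functional.adaptive_avg_pool2d(x, 1).pow(1.0 / self.p)

class DualStreamReID(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        def stem(net):  # the beginning stage (ResNet Layer0)
            return nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        # three pretrained copies, used only to split out stems and shared stages
        rgb_net = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        ir_net = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        shared = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.layer0_rgb = stem(rgb_net)   # modality-specific stream (unshared)
        self.layer0_ir = stem(ir_net)     # modality-specific stream (unshared)
        # shared-weight stages (ResNet Layer1-4)
        self.shared = nn.Sequential(shared.layer1, shared.layer2,
                                    shared.layer3, shared.layer4)
        self.pool = GeM()
        self.bn = nn.BatchNorm1d(2048)    # BN layer producing f_i^m
        self.fc = nn.Linear(2048, num_classes, bias=False)

    def forward(self, x, modality):
        x = self.layer0_rgb(x) if modality == "rgb" else self.layer0_ir(x)
        x = self.shared(x)
        feat = self.pool(x).flatten(1)    # pooled feature (used by WRT loss)
        f = self.bn(feat)                 # f_i^m, used for matching and CMCC
        logits = self.fc(f)               # classification scores for Id loss
        return feat, f, logits
```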
In the model training process, a multi-task loss function is used, as shown in formula 1, which fuses four loss functions: the identity loss (Id Loss), the weighted triplet loss (WRT Loss), the perceptual edge loss (PEF Loss) and the cross-modal center contrast loss (CMCC Loss). The first two are loss functions commonly used in existing methods, while the last two (PEF Loss and CMCC Loss) are newly proposed in the present disclosure. The first two losses are briefly introduced below, and the last two loss functions are explained in detail afterwards.
Let rgb and ir denote the RGB image modality and the IR image modality, respectively, and let $X^m \in \mathbb{R}^{H \times W \times 3}$, $m \in \{rgb, ir\}$, denote the input RGB and IR image data sets, where $H$ and $W$ are the height and width of an image and 3 is the number of channels (an RGB image contains the three channels R, G and B; an IR image is converted to 3 channels by repeating its single channel 3 times). Suppose a Batch contains $B$ images during training, and let $x_i^m$ denote an RGB or IR image, $i \in \{1, 2, \ldots, B\}$.
(1) Identity loss (Id Loss) and weighted triplet loss (WRT Loss)
(1.1) Identity loss (Id Loss):
As shown in FIG. 1(a), an input image $x_i^m$ is passed through the network model to obtain the vector after the final Fully Connected (FC) layer and Softmax operation, denoted here by $p_i$; the one-hot encoding of its corresponding label is denoted by $y_i$:

$$y_{i,j}=\begin{cases}1, & j \text{ equals the labeled class of } x_i^m\\ 0, & \text{otherwise}\end{cases} \tag{2}$$

where $j \in \{1, 2, \ldots, N\}$ and $N$ is the number of target-object classes in the training data set. Id Loss can then be expressed as the cross-entropy between $p_i$ and $y_i$:

$$\mathcal{L}_{id}=-\frac{1}{B}\sum_{i=1}^{B}\sum_{j=1}^{N}y_{i,j}\,\log p_{i,j} \tag{3}$$
(1.2) Weighted triplet loss (WRT Loss):
As shown in FIG. 1(a), the WRT loss $\mathcal{L}_{wrt}$ is calculated from the feature vectors obtained after the Batch Normalization (BN) layer and the L2-Norm operation of the model. The loss function is computed as:

$$\mathcal{L}_{wrt}=\frac{1}{B}\sum_{i=1}^{B}\log\Big(1+\exp\Big(\sum_{j\in\mathcal{P}_i}w_{ij}^{p}\,d(x_i,x_j)-\sum_{k\in\mathcal{N}_i}w_{ik}^{n}\,d(x_i,x_k)\Big)\Big) \tag{4}$$

$$w_{ij}^{p}=\frac{\exp\big(d(x_i,x_j)\big)}{\sum_{j'\in\mathcal{P}_i}\exp\big(d(x_i,x_{j'})\big)},\qquad w_{ik}^{n}=\frac{\exp\big(-d(x_i,x_k)\big)}{\sum_{k'\in\mathcal{N}_i}\exp\big(-d(x_i,x_{k'})\big)} \tag{5}$$

where $(i, j, k)$ denotes a triplet sample set containing an anchor sample $x_i$, a sample $x_j$ of the same class and a sample $x_k$ of a different class; $(x_i, x_j)$ forms a positive sample pair and $(x_i, x_k)$ forms a negative sample pair; $d$ denotes the Euclidean distance between feature vectors; and $\mathcal{P}_i$ and $\mathcal{N}_i$ denote the sets of positive and negative sample pairs for $x_i$, respectively.
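As a hedged illustration of formulas 2-5, the following PyTorch sketch computes the Id loss as cross-entropy over the classification scores and the WRT loss with softmax-weighted positive and negative pair distances; the helper names are ours.

```python
import torch
import torch.nn.functional as F

def id_loss(logits, labels):
    # formula 3: cross-entropy between softmax scores p_i and one-hot y_i
    return F.cross_entropy(logits, labels)

def wrt_loss(feats, labels):
    # formulas 4-5: softmax-weighted positive/negative pair distances per anchor
    dist = torch.cdist(feats, feats)               # Euclidean d between features
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    diag = torch.eye(len(labels), dtype=torch.bool, device=feats.device)
    pos, neg = same & ~diag, ~same                 # pair sets P_i and N_i
    big = torch.finfo(dist.dtype).max
    w_p = torch.softmax(dist.masked_fill(~pos, -big), dim=1)     # w_ij^p
    w_n = torch.softmax((-dist).masked_fill(~neg, -big), dim=1)  # w_ik^n
    d_p = (w_p * dist).sum(dim=1)                  # weighted positive distance
    d_n = (w_n * dist).sum(dim=1)                  # weighted negative distance
    return F.softplus(d_p - d_n).mean()            # log(1 + exp(d_p - d_n))
```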
(2) Perceptual edge loss (PEF Loss)
As shown in fig. 1(a) and (b), the perceptual edge loss acts on the modality-specific feature space; these features are generated by the unshared ResNet Layer0. To address the feature discrepancy between the RGB modality and the IR modality, the PEF loss directly optimizes the modality-specific feature space using the edge contour information of the target as a guide, thereby mining the features common to the modalities.
Specifically, as shown in fig. 1(b), taking the loss of one modality as an example, the calculation of the PEF loss has two inputs: one is the convolution feature map extracted by ResNet Layer0; the other branch applies a Sobel-operator convolution to the image input of the original modality and extracts the edge information of the image to obtain an edge feature map. The PEF then calculates the perceptual loss between the edge feature map and the convolution feature map, using a VGGNet-16 model trained on ImageNet as the perceptual network, whose four stages are denoted $\Phi = \{\phi_1, \phi_2, \phi_3, \phi_4\}$. Let $\phi_t(z)$ denote the feature map extracted by the perceptual network up to stage $t$, with shape $C_t \times H_t \times W_t$; the PEF loss is then calculated as:

$$PEF(z_c, z_e)=\sum_{t=1}^{4}\frac{1}{C_t H_t W_t}\,\big\|\phi_t(z_c)-\phi_t(z_e)\big\|_2^2 \tag{6}$$

where $z_c$ and $z_e$ represent the input convolution feature map and edge feature map, respectively. The PEF losses for the RGB and IR modalities are then combined as:

$$\mathcal{L}_{pef}=PEF\big(F^{rgb}, E^{rgb}\big)+PEF\big(F^{ir}, E^{ir}\big) \tag{7}$$

where $F^{rgb}$ and $F^{ir}$ respectively represent the convolution feature maps extracted by the respective ResNet Layer0 of the two modalities, $E^{rgb}$ and $E^{ir}$ represent the edge feature maps of the corresponding modalities, and the final loss is the sum of the losses of the two modalities.
In the perceptual edge loss (PEF Loss), prior edge-contour knowledge serves as a guide toward modality-common features, so that the modality-specific features extracted by the unshared Layer0 become more consistent. This helps reduce the differences between the modalities and better accomplishes the cross-modal target re-identification task.
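The sketch below illustrates one way to realize the PEF computation in PyTorch: fixed Sobel kernels extract the edge map, and frozen VGG16 stages serve as the perceptual network $\Phi$. How the multi-channel Layer0 feature map is reduced to the 3 channels VGG expects is not specified in the text, so the channel-mean used here is an assumption, as is the omission of ImageNet input normalization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class PEFLoss(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        # four perceptual stages phi_1..phi_4, cut at VGG16's pooling layers
        cuts = [4, 9, 16, 23]
        self.stages = nn.ModuleList(
            vgg[a:b] for a, b in zip([0] + cuts[:-1], cuts))
        kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.register_buffer("sobel", torch.stack([kx, kx.t()]).unsqueeze(1))

    def edges(self, img):
        # Sobel edge magnitude of the grayscale-averaged input image
        gray = img.mean(dim=1, keepdim=True)
        g = F.conv2d(gray, self.sobel, padding=1)
        return g.pow(2).sum(dim=1, keepdim=True).sqrt().repeat(1, 3, 1, 1)

    def perceptual(self, z):
        feats = []
        for stage in self.stages:
            z = stage(z)
            feats.append(z)        # phi_t(z) for t = 1..4
        return feats

    def forward(self, conv_feat, image):
        # conv_feat: Layer0 map; channel-mean to 3 channels is our assumption
        z_c = conv_feat.mean(dim=1, keepdim=True).repeat(1, 3, 1, 1)
        z_e = F.interpolate(self.edges(image), size=z_c.shape[-2:],
                            mode="bilinear", align_corners=False)
        loss = 0.0
        for fc, fe in zip(self.perceptual(z_c), self.perceptual(z_e)):
            # mse_loss mean-reduction matches the 1/(C_t H_t W_t) scaling
            loss = loss + F.mse_loss(fc, fe)
        return loss
```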
(3) Cross-modal center contrast Loss (CMCC Loss)
The present disclosure proposes a new cross-modal center contrast loss that acts on the modality-common feature space, i.e. the space of the feature vectors after the BN layer of fig. 1(a) (denoted $f_i^m$). Suppose a Batch contains $P$ classes of objects and each class contains $K$ RGB images and $K$ IR images, i.e. $B = 2 \times P \times K$. Let $d_{inter}$ denote the distance between the feature centers of different classes of objects and $d_{intra}$ the distance between the feature centers of the two modalities of objects of the same class, and let $c_k^{rgb}$ and $c_k^{ir}$ denote the feature centers of the different modalities of the k-th class of objects, calculated as:

$$c_k^{m}=\frac{1}{K}\sum_{i=1}^{K}f_i^{m,k},\qquad m\in\{rgb, ir\} \tag{8}$$

where $f_i^{m,k}$ is the BN-layer feature of the i-th image of modality $m$ in class $k$. With $c_k^{rgb}$ and $c_k^{ir}$ calculated by formula 8, the center of the k-th class target object features is $c_k = \frac{1}{2}(c_k^{rgb} + c_k^{ir})$, the intra-class distance is $d_{intra}^{k} = \|c_k^{rgb} - c_k^{ir}\|_2$, and $d_{inter}$ is measured between the class centers $c_k$. The CMCC loss $\mathcal{L}_{cmcc}$ can then be obtained as:

$$\mathcal{L}_{cmcc}=\frac{1}{P}\sum_{k=1}^{P}\frac{d_{intra}^{k}}{\min\mathcal{D}_{inter}} \tag{9}$$

where $\mathcal{D}_{inter}$ denotes the set of all $d_{inter}$. Optimizing this loss function shortens the distance between different modalities of the same category while enlarging the distance between features of different categories, thereby optimizing the distribution of the features $f_i^m$ extracted by the model and facilitating the later matching stage of target re-identification with this layer of features.
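A short sketch of formulas 8-9 follows; it assumes the batch features arrive grouped K per class, and the ratio form of formula 9 reflects the reconstruction above rather than a formula confirmed verbatim by the source.

```python
import torch

def cmcc_loss(f_rgb, f_ir, P, K):
    # f_rgb, f_ir: (P*K, D) BN-layer features, assumed grouped K per class
    c_rgb = f_rgb.view(P, K, -1).mean(dim=1)       # RGB centers c_k^rgb
    c_ir = f_ir.view(P, K, -1).mean(dim=1)         # IR centers c_k^ir
    d_intra = (c_rgb - c_ir).norm(dim=1)           # intra-class modality gaps
    centers = 0.5 * (c_rgb + c_ir)                 # class centers c_k
    d_inter = torch.cdist(centers, centers)        # pairwise class distances
    off_diag = ~torch.eye(P, dtype=torch.bool, device=centers.device)
    d_min = d_inter[off_diag].min()                # the first target distance
    return (d_intra / d_min.clamp(min=1e-12)).mean()
```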
Fig. 7 is a training flowchart of a target re-recognition model provided according to an embodiment of the present disclosure, as shown in fig. 7, including the following steps:
(1) input image preprocessing stage
Step 1-1: reading a cross-modal target re-identification image data set, and acquiring an original image and the category information of a corresponding target object;
wherein the data set comprises a training set (train set) and a test set (test set). The training set contains the original images and the object-class labels corresponding to the images; during training, the images are input to the model and the loss function is then calculated in combination with the class labels. During testing, the test set is divided into a query set (query) and a gallery set (gallery), used for testing the re-recognition performance of the model;
the hyper-parameters of the algorithm model comprise: the input image size, the Batch size, the number of target objects of different modalities in a Batch, the image data enhancement mode, the number of training iteration rounds (Epoch), the learning rate (learning rate) adjustment strategy and the type of optimizer (optimizer) used in the model training process, specifically as follows.
The input image size in the model training process: 288 × 144;
the Batch size is: 64 (comprising 8 target objects, with 4 images per modality for each target object);
the image data enhancement mode: random cropping and horizontal flipping;
the number of training iteration rounds is: 200;
the optimizer: the Adam optimizer is used, with a weight decay of 0.0005;
learning rate adjustment strategy:
the learning rate increases linearly from 0.0005 to 0.005 during the first 10 epochs, is held at 0.005 from epoch 10 to epoch 20, and then decays to one tenth of its value every 5 epochs; from the 35th epoch to the end of training it is held at 0.000005 (a sketch of this schedule follows).
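A sketch of this schedule, assuming PyTorch's LambdaLR and the Adam settings above, might look as follows; the stand-in model and the exact epoch at which the floor is reached are our reading of the description.

```python
import torch
from torch import nn, optim

model = nn.Linear(8, 8)   # stand-in; the backbone sketch above would go here

def lr_lambda(epoch):     # multiplier applied to the base learning rate 0.005
    if epoch < 10:        # linear warm-up: 0.0005 -> 0.005
        return 0.1 + 0.9 * epoch / 10
    if epoch < 20:        # hold 0.005 for epochs 10-20
        return 1.0
    # decay by one tenth every 5 epochs, floored at 0.000005 thereafter
    return max(0.1 ** ((epoch - 15) // 5), 0.000005 / 0.005)

optimizer = optim.Adam(model.parameters(), lr=0.005, weight_decay=0.0005)
scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```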
Step 1-2: arranging the data of the RGB and IR modalities into a Batch according to the set Batch size, the number of categories in the Batch and the number of images per category;
Step 1-3: the images are normalized, resized to the set width and height, and subjected to the specified data enhancement transforms; the batch data are then loaded into GPU memory for later input to the model being trained, with the corresponding labels used in the subsequent loss calculation.
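Steps 1-2 and 1-3 can be illustrated with a small sampling helper; the dictionary-based dataset indexing below is an assumption for illustration only.

```python
import random

def sample_batch(rgb_by_class, ir_by_class, P=8, K=4):
    """Assemble one Batch: P classes, each with K RGB and K IR images."""
    # rgb_by_class / ir_by_class: dict mapping class id -> list of image paths
    classes = random.sample(sorted(rgb_by_class), P)
    batch = []
    for c in classes:
        batch += [(path, c, "rgb") for path in random.sample(rgb_by_class[c], K)]
        batch += [(path, c, "ir") for path in random.sample(ir_by_class[c], K)]
    return batch  # 2 * P * K = 64 entries with the settings above
```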
(2) Feature extraction stage
Step 2-1: inputting the image data of the two modalities into the dual-stream feature extraction network (the structure shown in FIG. 2), with the data of each modality fed into its own entry branch;
Step 2-2: the input data are propagated layer by layer through the corresponding hierarchical computations, passing sequentially through the modality-specific part and the modality-common part;
Step 2-3: through the forward propagation of step 2-2, the intermediate features and the final classification prediction scores can be obtained, to be used in the multi-task loss calculation of the next stage.
(3) Multitask penalty computation phase
Step 3-1: for the input data of one Batch, $\mathcal{L}_{id}$, $\mathcal{L}_{wrt}$, $\mathcal{L}_{pef}$ and $\mathcal{L}_{cmcc}$ can be obtained according to the calculations of equations 1-9 above;
Step 3-2: the four losses are added to obtain the final multi-task loss value $\mathcal{L}$.
(4) Model iterative optimization phase
Step 4-1: the implementation code of the present disclosure uses the automatically-differentiating PyTorch deep learning framework, which supports back-propagating through the entire algorithm model directly from the calculated multi-task loss value, computing the gradient values of the learnable parameters;
Step 4-2: the learnable parameters of the model algorithm are updated and optimized by the set optimizer using the gradients calculated in step 4-1;
Step 4-3: all the above steps are repeated, with the model parameters continuously updated in the process, until the set number of training rounds is reached, after which the training process of the algorithm model stops.
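A condensed sketch of steps 3-1 through 4-3 follows, reusing the model, loss, optimizer and scheduler names from the earlier sketches; the data loader and the recomputation of the Layer0 maps for the PEF term are illustrative assumptions, not details stated in the disclosure.

```python
# model, optimizer, scheduler, id_loss, wrt_loss, cmcc_loss: see sketches above
pef = PEFLoss()                                # from the PEF sketch
for epoch in range(200):
    for rgb, ir, labels in loader:             # one assembled Batch per step
        feat_r, f_r, logit_r = model(rgb, "rgb")
        feat_i, f_i, logit_i = model(ir, "ir")
        y = torch.cat([labels, labels])
        loss = (id_loss(torch.cat([logit_r, logit_i]), y)
                + wrt_loss(torch.cat([feat_r, feat_i]), y)
                + pef(model.layer0_rgb(rgb), rgb)   # Layer0 maps, recomputed
                + pef(model.layer0_ir(ir), ir)
                + cmcc_loss(f_r, f_i, P=8, K=4))
        optimizer.zero_grad()
        loss.backward()        # step 4-1: autograd computes the gradients
        optimizer.step()       # step 4-2: Adam updates the learnable parameters
    scheduler.step()           # advance the learning-rate schedule
```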
(5) Model test evaluation phase
Step 5-1: dividing the test set, taking the IR images as the query set (query) and the RGB images as the gallery set (gallery); the IR image of an object is used as a query to match that object among the RGB image set, so as to test the cross-modal target re-identification performance of the model;
Step 5-2: in the testing process, the images of the test set (including the query and gallery images) are read, the data of the two modalities are input into the test model, and the feature vector of each image (the feature vector after the BN layer in fig. 2) is obtained through the forward, layer-by-layer computation of the model;
Step 5-3: the cosine distance is used to measure the similarity between each query image and all gallery images, which are then sorted by distance to obtain the list of gallery images (RGB images) matched to each query image (IR image); a sketch of this ranking appears after these steps;
Step 5-4: calculating the common evaluation indexes Rank-n and mAP of the target re-identification task, and evaluating the model performance by observing these index values;
Step 5-5: if the evaluation result does not meet the set requirement, the hyper-parameters of the model can be adjusted and the algorithm model trained again from the first step of the above flow; if the evaluated indexes meet the requirement, the model weights are saved, and these weights together with the model code constitute the final cross-modal target re-identification solution.
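The cosine-distance ranking of step 5-3 (referenced above) can be sketched as follows; the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def rank_gallery(query_feats, gallery_feats):
    """Rank gallery images for each query by cosine similarity."""
    q = F.normalize(query_feats, dim=1)    # (Q, D) IR query features
    g = F.normalize(gallery_feats, dim=1)  # (G, D) RGB gallery features
    sim = q @ g.t()                        # cosine similarity matrix (Q, G)
    return sim.argsort(dim=1, descending=True)  # ranked gallery indices

# usage: ranks = rank_gallery(f_query, f_gallery); ranks[i, 0] is the best match
```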
In the technical scheme of the embodiment:
1. The modality-specific feature space and the common feature space are optimized and adjusted in a targeted manner using the multi-task loss, and the cross-modal target re-identification task is completed end to end.
2. The perceptual edge loss is proposed, which uses the edge information of the image as guidance to mine the common information in the modality-specific feature space and reduce the differences between modalities.
3. The cross-modal center contrast loss is proposed, which acts on the common feature space; by constraining the relationship between the modality centers and the category centers, it adjusts the feature extraction capability of the model well, so that the model achieves excellent performance.
With this scheme, the feature space can be optimized: it is divided into the modality-specific feature space and the common feature space, which are adjusted and optimized in a targeted manner, realizing an efficient end-to-end cross-modal target re-identification method. In this embodiment, the proposed perceptual edge loss can directly constrain the features of different modalities, introduce prior knowledge into the feature extraction process of the model and enhance its cross-modal feature extraction capability; the proposed cross-modal center contrast loss enables the model to extract more discriminative features, effectively reducing the inter-modality differences of objects of the same class while increasing the feature differences of objects of different classes, which helps the model correctly re-identify cross-modal data.
Fig. 8 is a flowchart illustrating a target re-identification method according to another embodiment of the disclosure. Referring to fig. 8, the method includes:
S801: acquiring a reference image and an image to be recognized, wherein the reference image and the image to be recognized have different modalities, and the reference image includes: a reference category.
The reference image and the image to be identified can be images acquired under any scene, and the modalities of the reference image and the image to be identified are different.
In some embodiments, the reference image may be an image of an RGB modality, and the image to be recognized may be an image of an IR modality; or the reference image may be an image of an IR modality, and the image to be recognized may be an image of an RGB modality, which is not limited thereto.
Moreover, the reference image also corresponds to a reference category, where the reference category describes the category of the target object in the reference image, for example: the category of the target object may be, but is not limited to, a vehicle, a pedestrian, or any other possible category.
S802: respectively inputting the reference image and the image to be recognized into the target re-recognition model obtained by the training method of the target re-recognition model described above, so as to obtain a target output by the target re-recognition model corresponding to the image to be recognized, wherein the target has a corresponding target category and the target category matches the reference category.
After the reference image and the image to be recognized are obtained, they are further input into the target re-recognition model obtained by the training described in the above embodiments, and the target corresponding to the image to be recognized, together with its corresponding target category, may be output by the target re-recognition model, where the target category matches the reference category, for example: the target category and the reference category are the same vehicle.
That is, through the target re-recognition model, the same object as the target object in the reference image is recognized from the image to be recognized, so as to achieve the purpose of cross-modal target re-recognition.
According to the embodiments of the present disclosure, a reference image and an image to be recognized are acquired, the reference image and the image to be recognized having different modalities, the reference image including a reference category; the reference image and the image to be recognized are then respectively input into the target re-recognition model obtained by the above training method, so as to obtain a target output by the model that corresponds to the image to be recognized, the target having a corresponding target category that matches the reference category. Because the image to be recognized is processed by a model trained with the above training method, the features of the image to be recognized can be fully mined and the accuracy of image matching across different modalities enhanced, thereby improving the cross-modal target re-identification effect.
FIG. 9 is a schematic diagram of a training apparatus for a target re-recognition model according to another embodiment of the present disclosure. Referring to fig. 9, the training device 90 for the target re-recognition model includes:
a first obtaining module 901, configured to obtain a plurality of images, where the plurality of images respectively have a plurality of corresponding modalities and a plurality of corresponding labeling target categories;
a second obtaining module 902, configured to obtain a plurality of convolution feature maps corresponding to the multiple modalities, and obtain a plurality of edge feature maps corresponding to the multiple modalities;
a third obtaining module 903, configured to obtain multiple kinds of feature distance information corresponding to multiple modalities respectively; and
and a training module 904, configured to train an initial re-recognition model according to the multiple images, the multiple convolution feature maps, the multiple edge feature maps, the multiple feature distance information, and the multiple labeled target categories, so as to obtain a target re-recognition model.
Optionally, in some embodiments, fig. 10 is a schematic diagram of a training apparatus for a target re-recognition model provided according to another embodiment of the present disclosure, and as shown in fig. 10, the training module 904 includes:
the first processing sub-module 9041 is configured to process the multiple images by using an initial re-recognition model to obtain an initial loss value;
the second processing sub-module 9042 is configured to process the multiple convolution feature maps and the multiple edge feature maps by using an initial re-recognition model to obtain a perceptual edge loss value;
the third processing submodule 9043 is configured to process multiple kinds of feature distance information by using the initial re-recognition model to obtain a cross-modal center contrast loss value;
and the training submodule 9044 is configured to train an initial re-recognition model according to the initial loss value, the perceptual edge loss value, and the cross-modal center contrast loss value, so as to obtain a target re-recognition model.
Optionally, in some embodiments, the initial re-recognition model comprises: a first network structure for identifying perceptual loss values between the convolved feature map and the edge feature map.
Optionally, in some embodiments, the second processing sub-module 9042 is specifically configured to:
inputting the plurality of convolution feature maps and the plurality of edge feature maps into a first network structure to obtain a plurality of convolution loss feature maps respectively corresponding to the plurality of convolution feature maps and a plurality of edge loss feature maps respectively corresponding to the plurality of edge feature maps;
determining a plurality of convolution characteristic map parameters respectively corresponding to the plurality of convolution loss characteristic maps, and determining a plurality of edge characteristic map parameters respectively corresponding to the plurality of edge loss characteristic maps;
processing the corresponding convolution loss feature maps according to the convolution feature map parameters to obtain a plurality of first perception edge loss values;
processing the corresponding edge loss feature maps according to the edge feature map parameters to obtain a plurality of second perception edge loss values; and
and generating a perceptual edge loss value according to the plurality of first perceptual edge loss values and the plurality of second perceptual edge loss values.
Optionally, in some embodiments, as shown in fig. 10, the initial re-recognition model comprises a batch normalization layer, and the third obtaining module 903 comprises:
the standardization processing sub-module 9031 is configured to input the multiple images into the batch standardization layer, respectively, so as to obtain multiple feature vectors, which are output by the batch standardization layer and correspond to the multiple images, respectively;
a center point determining submodule 9032, configured to determine, according to the plurality of feature vectors, feature center points of the plurality of targets respectively corresponding to the plurality of images;
and the distance determining submodule 9033 is configured to determine a first distance between feature center points of different targets, and determine a second distance between feature center points of the same target corresponding to different modalities, where the first distance and the second distance together form multiple kinds of feature distance information.
Optionally, in some embodiments, the third processing sub-module 9043 is specifically configured to:
determining a first target distance from the plurality of first distances by using the initial re-recognition model, wherein the first target distance is the first distance with the minimum value among the plurality of first distances;
and calculating the cross-modal center contrast loss value according to the first target distance, the plurality of second distances and the number of targets.
Optionally, in some embodiments, the initial re-recognition model comprises a fully connected layer and an output layer connected in sequence, and the first processing sub-module 9041 is specifically configured to:
sequentially inputting the plurality of images into a full connection layer and an output layer to obtain a plurality of category feature vectors which are output by the output layer and respectively correspond to the plurality of images;
determining a plurality of coding vectors respectively corresponding to a plurality of labeling target categories;
and generating an identity loss value according to the plurality of category feature vectors and the corresponding plurality of encoding vectors, and taking the identity loss value as an initial loss value.
Optionally, in some embodiments, the first processing sub-module 9041 is specifically configured to:
performing image division on the plurality of images with reference to the plurality of labeled target categories to obtain a triplet sample set, wherein the triplet sample set comprises: the plurality of images, a plurality of first images corresponding to the same labeled target category as a given image, and a plurality of second images corresponding to different labeled target categories;
determining first Euclidean distances between the feature vectors of the images and the feature vectors of the first images, wherein the feature vectors are output by the batch normalization layer;
determining second Euclidean distances between the feature vectors of the images and the feature vectors of the second images; and
determining a triplet loss value according to the plurality of first Euclidean distances and the plurality of second Euclidean distances, and taking the triplet loss value as the initial loss value.
Optionally, in some embodiments, the training sub-module 9044 is specifically configured to:
generating a target loss value according to the initial loss value, the perception edge loss value and the cross-modal center contrast loss value;
and if the target loss value meets the set condition, taking the re-recognition model obtained by training as a target re-recognition model.
Optionally, in some embodiments, the plurality of modalities includes: a color image modality and an infrared image modality.
It should be noted that the foregoing explanation of the training method for the target re-recognition model is also applicable to the apparatus of this embodiment, and is not repeated herein.
In this embodiment, a plurality of images are obtained, the plurality of images respectively having a plurality of corresponding modalities and a plurality of corresponding labeled target categories; a plurality of convolution feature maps and a plurality of edge feature maps corresponding to the plurality of modalities are obtained, together with a plurality of kinds of feature distance information corresponding to the plurality of modalities; and an initial re-recognition model is trained according to the plurality of images, the plurality of convolution feature maps, the plurality of edge feature maps, the plurality of feature distance information and the plurality of labeled target categories to obtain a target re-recognition model. The trained re-recognition model can thus fully mine the features in images of multiple modalities and enhance the accuracy of image matching across different modalities, thereby improving the effect of cross-modal target re-identification. This solves the technical problem in the related art that a network model mines features in multi-modal images insufficiently, which harms the cross-modal target re-identification effect.
Fig. 11 is a schematic diagram of an object re-identification apparatus provided in accordance with another embodiment of the present disclosure. Referring to fig. 11, the object re-recognition apparatus 100 includes:
a fourth obtaining module 1001, configured to obtain a reference image and an image to be identified, where modalities of the reference image and the image to be identified are different, and the reference image includes: a reference category;
the recognition module 1002 is configured to input the reference image and the image to be recognized into the target re-recognition model obtained by the training method of the target re-recognition model, respectively, so as to obtain a target output by the target re-recognition model and corresponding to the image to be recognized, where the target has a corresponding target category, and the target category is matched with the reference category.
It should be noted that the foregoing explanation of the training method for the target re-recognition model is also applicable to the apparatus of this embodiment, and is not repeated herein.
According to the embodiment of the disclosure, the target re-recognition model trained by the training method of the target re-recognition model can be used for recognizing the image to be recognized, and determining the target corresponding to the image to be recognized. Therefore, the characteristics of the image to be identified can be fully mined, the accuracy of image matching under different modalities can be enhanced, and the cross-modality target re-identification effect is improved.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
In order to implement the above embodiments, the present disclosure further provides a computer program product; when the instructions in the computer program product are executed by a processor, the training method of the target re-recognition model proposed in the foregoing embodiments of the present disclosure is performed.
FIG. 12 illustrates a block diagram of an exemplary computer device suitable for use in implementing embodiments of the present disclosure. The computer device 12 shown in fig. 12 is only an example and should not bring any limitations to the functionality or scope of use of the embodiments of the present disclosure.
As shown in FIG. 12, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile Memory, such as Random Access Memory (RAM) 30 and/or cache Memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 12, and commonly referred to as a "hard drive").
Although not shown in FIG. 12, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact disk Read Only Memory (CD-ROM), a Digital versatile disk Read Only Memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the disclosure.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally perform the functions and/or methodologies of the embodiments described in this disclosure.
Computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with computer device 12, and/or with any devices (e.g., network card, modem, etc.) that enable computer device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Moreover, computer device 12 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public Network such as the Internet) via Network adapter 20. As shown, network adapter 20 communicates with the other modules of computer device 12 via bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and training of the object re-recognition model by executing programs stored in the system memory 28, for example, implementing the training method of the object re-recognition model mentioned in the foregoing embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
It should be noted that, in the description of the present disclosure, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present disclosure, "a plurality" means two or more unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present disclosure includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present disclosure.
It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present disclosure have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present disclosure, and that changes, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present disclosure.