Few-sample weak and small target detection method based on template matching and attention mechanism

Document No.: 8551    Published: 2021-09-17

1. A few-sample weak and small target detection method based on template matching and an attention mechanism, characterized in that a matching convolution kernel is obtained using a matching-convolution-kernel generation network, and a channel attention module and a noise reduction module are used to improve the network's representation capability and suppress noise; the target detection method comprises the following steps:

(1) generating a training set and template images:

(1a) selecting at least 10 images, each containing at least 1 target, and cropping each image into 1024 × 1024-pixel tiles with a cropping stride of 256 pixels; any target truncated during cropping is marked as a special target by setting its difficult flag to 1; the cropped images and their flags form a data set;

(1b) labeling the 4 corner points of each target in the data set in the form of a rotated rectangular box to obtain the labeled coordinates of each target;

(1c) combining the labeled coordinates of each target with the data set to form a training set;

(1d) randomly selecting 1 image with a clear target outline as the template image;

(2) constructing a matching convolution kernel generation network:

building a matching-convolution-kernel generation network consisting of the backbone network ResNet101 followed by a pooling layer, where the pooling kernel size is set to 3 × 3, the pooling stride to 2, and the pooling layer uses average pooling;

(3) constructing a feature pyramid network with 4 stages and 256 output channels per stage;

(4) building a channel attention module and a noise reduction module:

(4a) building a channel attention module whose structure is, in order: a global average pooling layer, a first fully connected layer, a ReLU activation function, a second fully connected layer and a Sigmoid activation function, where the kernel size of the global average pooling layer is set to 3 × 3 with a pooling stride of 2 using average pooling, the numbers of input channels of the first and second fully connected layers are set to 256 and 128 respectively, and the numbers of output channels of the first and second fully connected layers are set to 128 and 256 respectively;

(4b) building a noise reduction module whose structure is, in order: a first convolutional layer, a ReLU activation function, a second convolutional layer, a ReLU activation function, a third convolutional layer and a Softmax function, where the convolution kernel sizes of the first, second and third convolutional layers are all set to 3 × 3 and the strides are all set to 1;

(5) building a classification output network and a coordinate output network:

(5a) building a classification output network formed by a first and a second fully connected layer in series, where the first layer's input size is set to 49 and its output size to 1024, and the second layer's input size is set to 1024 and its output size to 2;

(5b) building a coordinate output network formed by a first and a second fully connected layer in series, where the first layer's input size is set to 49 and its output size to 1024, and the second layer's input size is set to 1024 and its output size to 5;

(6) building a detection network:

(6a) cascading, in order, the matching-convolution-kernel generation network, the feature pyramid network, the channel attention module, the noise reduction module, a region proposal network, a RoIAlign layer, the classification output network and the coordinate output network into a detection network;

(7) training a detection network:

(7a) pairing each image in the training set with the template image to form a group, and feeding each group into the detection network in turn to obtain the classes and coordinates of the prediction boxes the detection network outputs for that group;

(7b) computing a class loss between each group's predicted classes and label classes using binary cross-entropy, computing a coordinate loss between each group's predicted coordinates and label coordinates using the smooth L1 norm, and adding each group's class loss and coordinate loss to obtain that group's loss value;

(7c) iteratively updating the detection network's weights with each group's loss value by back-propagation until the loss no longer decreases, yielding a trained detection network;

(8) detecting an image to be detected:

(8a) cropping an image to be detected into several 1024 × 1024-pixel sub-images with a cropping stride of 256 pixels;

(8b) feeding each sub-image into the trained detection network to obtain the prediction-box coordinates and class for each sub-image;

(8c) mapping each sub-image's prediction-box coordinates and class back onto the original image to be detected according to the sub-image's position in the original image;

(8d) filtering overlapping prediction boxes in the image to be detected with non-maximum suppression (NMS) to obtain the final detection result.

Background

Object detection perceives image content by comparing current information with stored information (information in memory). Deep learning has shown unmatched superiority in target recognition: layer-by-layer nonlinear transformations extract abstract features from low to high level, enabling accurate target identification. However, for target detection in remote sensing images, the ultra-high resolution, wide field of view, scarcity of sample data, strong background interference and low signal-to-noise ratio of such images make existing detection methods difficult to apply directly to these scenes: existing target detection algorithms are driven by big data and massive clear samples, and cannot directly handle the detection of weak and small targets under few-sample conditions.

The patent application "A remote sensing target detection method based on a multilevel feature-selection convolutional neural network" (application No. CN202110090408.2, publication No. CN112766184A), filed by a university in southeast China, discloses a remote sensing target detection method based on a multilevel feature-selection convolutional neural network. The method proceeds as follows: first, a convolutional neural network model is built, its structural parameters are set and its training parameters initialized; the training images are preprocessed and their label format converted, after which data augmentation is applied; the network is then trained to obtain its weights and biases; finally, test images are fed into the trained model to obtain localization and classification results. The method also detects objects with rotated boxes, which greatly improves localization accuracy. However, it still has two drawbacks. First, it uses ResNet50 as the backbone for feature extraction; this network is relatively shallow and its ability to extract features of weak and small targets is limited, and the max-pooling layers in the backbone considerably degrade the quality of the feature maps extracted for such targets. Second, the model's detection performance relies on a large amount of training data, which greatly reduces its applicability to small-volume data sets such as remote sensing imagery.

The patent application "A method and system for few-sample remote sensing target detection based on transfer learning" (application No. CN202010643231.X, publication No. CN111860236A), filed by the Aerospace Information Research Institute of the Chinese Academy of Sciences, discloses a few-sample remote sensing target detection method based on transfer learning. The method proceeds as follows: first, the remote sensing image to be detected is fed into a pre-trained two-stage target detection model to obtain its categories and horizontal-box regressions, the two-stage model's parameters having been obtained by training on a source data set; then, with those parameters fixed, the model's transfer parameters are fine-tuned on the target data set. By fine-tuning a trained model on a few-sample remote sensing data set, transfer learning is used to counter overfitting when samples are insufficient. The method has two drawbacks. First, transfer learning requires that the features of the source and target data sets not differ too much, so the original model must be pre-trained on a large number of similar data sets, which greatly increases the difficulty of training, and transfer learning applied improperly introduces many false detections. Second, the method detects remote sensing targets with horizontal boxes; for targets with variable orientations, a horizontal box cannot accurately reflect the target's orientation and encloses redundant background.

The paper "You Only Look Twice: Rapid Multi-Scale Object Detection In Satellite Imagery" by Adam Van Etten (arXiv:1805.09512, 24 May 2018) discloses a remote sensing target detection method based on an extended structure of the YOLOv2 detection network. First, the original high-resolution remote sensing image is cut with a sliding-window method into sub-images of width and height 416, with a certain overlap guaranteed; the sub-images are then fed into the detection network for training to finally obtain a detection model, and targets are detected with horizontal boxes. To avoid missing small targets, the method reduces the backbone stride from 32 to 16, enlarging the feature map output by the network's last layer and mitigating the loss of small targets. However, this method still has two drawbacks. First, directly reducing the backbone stride from 32 to 16 alleviates small-target loss to some extent but greatly weakens the backbone's feature extraction capability. Second, the method still uses horizontal boxes for remote sensing targets, which cannot accurately reflect a target's pose and introduce redundant background.

Disclosure of Invention

The aim of the invention is to provide, against the deficiencies of the prior art, a few-sample weak and small target detection method for high-resolution images. It addresses the problems that, when transfer learning is used for target detection in images, the features of the source and target data sets must not differ too much and a large number of training samples is required, as well as the problems that detection networks extract small-target features poorly and that horizontal boxes cannot accurately reflect the pose of a remote sensing target.

The technical idea of the invention is as follows. The backbone network ResNet101 is used to extract image features; its deeper structure compared with ResNet50 improves the detection network's feature extraction capability and solves the weak-extraction problem caused by the shallower ResNet50 backbone used in conventional methods. A matching-convolution-kernel generation network produces a matching convolution kernel for template matching; this process needs only one template image and a small amount of training data, which removes the conventional deep-learning detector's demand for large training sets and avoids the many false detections introduced by transfer learning. The feature map is processed by the channel attention module and the noise reduction module in series: the attention module boosts the response of important channels in the feature map, and the noise reduction module further raises the feature map's signal-to-noise ratio, addressing the conventional detector's weak performance on small targets. Targets are labeled and detected with rotated rectangular boxes, which are more accurate than horizontal boxes and thus avoid the horizontal box's inability to reflect target pose and its inclusion of redundant background.

The method comprises the following specific steps:

(1) generating a training set and template images:

(1a) selecting at least 10 images, each containing at least 1 target, and cropping each image into 1024 × 1024-pixel tiles with a cropping stride of 256 pixels; any target truncated during cropping is marked as a special target by setting its difficult flag to 1; the cropped images and their flags form a data set;

(1b) labeling the 4 corner points of each target in the data set in the form of a rotated rectangular box to obtain the labeled coordinates of each target;

(1c) combining the labeled coordinates of each target with the data set to form a training set;

(1d) randomly selecting 1 image with a clear target outline as the template image;

(2) constructing a matching convolution kernel generation network:

building a matching-convolution-kernel generation network consisting of the backbone network ResNet101 followed by a pooling layer, where the pooling kernel size is set to 3 × 3, the pooling stride to 2, and the pooling layer uses average pooling;

(3) constructing a feature pyramid network with 4 stages and 256 output channels per stage;

(4) building a channel attention module and a noise reduction module:

(4a) building a channel attention module whose structure is, in order: a global average pooling layer, a first fully connected layer, a ReLU activation function, a second fully connected layer and a Sigmoid activation function, where the kernel size of the global average pooling layer is set to 3 × 3 with a pooling stride of 2 using average pooling, the numbers of input channels of the first and second fully connected layers are set to 256 and 128 respectively, and the numbers of output channels of the first and second fully connected layers are set to 128 and 256 respectively;

(4b) building a noise reduction module whose structure is, in order: a first convolutional layer, a ReLU activation function, a second convolutional layer, a ReLU activation function, a third convolutional layer and a Softmax function, where the convolution kernel sizes of the first, second and third convolutional layers are all set to 3 × 3 and the strides are all set to 1;

(5) building a classification output network and a coordinate output network:

(5a) building a classification output network formed by a first and a second fully connected layer in series, where the first layer's input size is set to 49 and its output size to 1024, and the second layer's input size is set to 1024 and its output size to 2;

(5b) building a coordinate output network formed by a first and a second fully connected layer in series, where the first layer's input size is set to 49 and its output size to 1024, and the second layer's input size is set to 1024 and its output size to 5;

(6) building a detection network:

(6a) cascading, in order, the matching-convolution-kernel generation network, the feature pyramid network, the channel attention module, the noise reduction module, a region proposal network, a RoIAlign layer, the classification output network and the coordinate output network into a detection network;

(7) training a detection network:

(7a) pairing each image in the training set with the template image to form a group, and feeding each group into the detection network in turn to obtain the classes and coordinates of the prediction boxes the detection network outputs for that group;

(7b) computing a class loss between each group's predicted classes and label classes using binary cross-entropy, computing a coordinate loss between each group's predicted coordinates and label coordinates using the smooth L1 norm, and adding each group's class loss and coordinate loss to obtain that group's loss value;

(7c) iteratively updating the detection network's weights with each group's loss value by back-propagation until the loss no longer decreases, yielding a trained detection network;

(8) detecting an image to be detected:

(8a) cropping an image to be detected into several 1024 × 1024-pixel sub-images with a cropping stride of 256 pixels;

(8b) feeding each sub-image into the trained detection network to obtain the prediction-box coordinates and class for each sub-image;

(8c) mapping each sub-image's prediction-box coordinates and class back onto the original image to be detected according to the sub-image's position in the original image;

(8d) filtering overlapping prediction boxes in the image to be detected with non-maximum suppression (NMS) to obtain the final detection result.

Compared with the prior art, the invention has the following advantages:

First, because the backbone network ResNet101 is used to extract image features, richer semantic information can be extracted, effectively strengthening the detection network's feature extraction and overcoming the weak extraction caused by the shallower backbones of the prior art; the method therefore extracts features well against complex backgrounds.

Second, because a matching-convolution-kernel generation network produces the matching convolution kernel used to template-match targets in the image, the detection network's demand for training samples is effectively reduced. This overcomes the prior art's need for large training sets and avoids the many false detections introduced by transfer learning, so the method performs well in few-sample scenarios.

Third, because the channel attention module and the noise reduction module process the feature map in series, the response of informative channels is increased and the feature map's signal-to-noise ratio is further improved. This overcomes the prior art's failure to detect targets that are small or have inconspicuous features, so the method performs well when targets are small and their features weak.

Fourth, because targets are labeled and predicted with rotated rectangular boxes, the precision of the predicted boxes is effectively improved and the proportion of background they contain is greatly reduced. This overcomes the prior art's horizontal boxes, which cannot accurately reflect target pose and introduce redundant background, so target pose can be represented accurately even when it varies.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a simulation diagram of the present invention.

Detailed Description

The invention is further described below with reference to the figures and examples.

The specific steps of the invention are further described with reference to Fig. 1.

Step 1, generating a training set and a template image.

Select at least 10 images, each containing at least 1 target, and crop each image into 1024 × 1024-pixel tiles with a cropping stride of 256 pixels; any target truncated during cropping is marked as a special target by setting its difficult flag to 1, and the cropped images and their flags form a data set.
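The sliding-window cropping above can be sketched as follows. This is a minimal illustration: the 1024-pixel tile size and 256-pixel stride come from the text, while the border handling (shifting the last window back so it stays inside the image) is an assumption the text does not specify.

```python
import numpy as np

def _starts(total, tile, stride):
    """Window start positions along one axis; the last window is shifted
    back so it ends exactly at the image border (assumed behavior)."""
    if total <= tile:
        return [0]
    starts = list(range(0, total - tile + 1, stride))
    if starts[-1] != total - tile:
        starts.append(total - tile)
    return starts

def crop_tiles(image, tile=1024, stride=256):
    """Tile an H x W image into tile x tile patches with the given stride.
    Returns (y, x, patch) tuples, (y, x) being the patch's top-left corner."""
    h, w = image.shape[:2]
    return [(y, x, image[y:y + tile, x:x + tile])
            for y in _starts(h, tile, stride)
            for x in _starts(w, tile, stride)]
```

For a 1536 × 1536 image this yields a 3 × 3 grid of nine heavily overlapping tiles, which is how a truncated target in one tile can still appear whole in a neighboring tile.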

Label the 4 corner points of each target in the data set in the form of a rotated rectangular box to obtain each target's labeled coordinates, and combine the labeled coordinates of each target with the data set to form a training set.

Randomly select 1 image with a clear target outline as the template image.

And 2, building a detection network.

Build a matching-convolution-kernel generation network consisting of the backbone network ResNet101 followed by a pooling layer; set the pooling kernel size to 3 × 3 and the pooling stride to 2, with the pooling layer using average pooling.
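A minimal sketch of the matching-convolution-kernel generation network described above: backbone features of the template image are condensed by a 3 × 3, stride-2 average pooling into a compact kernel for matching against the search-image features. A tiny convolutional stack stands in for ResNet101 here so the sketch stays self-contained; in practice torchvision's `resnet101` (or any deep backbone) would be used, and the exact way the kernel is consumed downstream is not specified in the text.

```python
import torch
import torch.nn as nn

class MatchKernelGenerator(nn.Module):
    """Template image -> matching convolution kernel (sketch)."""
    def __init__(self, out_channels=256):
        super().__init__()
        self.backbone = nn.Sequential(          # placeholder for ResNet101
            nn.Conv2d(3, out_channels, 3, stride=2, padding=1),
            nn.ReLU(),
        )
        # pooling settings from the text: 3 x 3 kernel, stride 2, average pooling
        self.pool = nn.AvgPool2d(kernel_size=3, stride=2)

    def forward(self, template):
        return self.pool(self.backbone(template))

gen = MatchKernelGenerator()
kernel = gen(torch.zeros(1, 3, 31, 31))  # template image -> matching kernel
```

With a 31 × 31 template this toy backbone produces a 256-channel 7 × 7 kernel.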

Construct a feature pyramid network with 4 stages and 256 output channels per stage.

Build a channel attention module whose structure is, in order: a global average pooling layer, a first fully connected layer, a ReLU activation function, a second fully connected layer and a Sigmoid activation function. The kernel size of the global average pooling layer is set to 3 × 3 with a pooling stride of 2 using average pooling; the numbers of input channels of the first and second fully connected layers are set to 256 and 128 respectively, and their numbers of output channels to 128 and 256 respectively.
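The channel attention module above is essentially a squeeze-and-excitation block; a sketch follows. One caveat: the text gives the "global" average pooling a 3 × 3 kernel and stride 2, whereas the sketch uses true global pooling (one value per channel), which is what the subsequent 256-input fully connected layer requires — treat that reading as an assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Global average pooling, FC 256 -> 128, ReLU, FC 128 -> 256, Sigmoid;
    the resulting per-channel weights rescale the feature map."""
    def __init__(self, channels=256, reduced=128):
        super().__init__()
        self.fc1 = nn.Linear(channels, reduced)
        self.fc2 = nn.Linear(reduced, channels)

    def forward(self, x):                      # x: (N, C, H, W)
        w = x.mean(dim=(2, 3))                 # global average pooling -> (N, C)
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(w))))
        return x * w[:, :, None, None]         # reweight channels

att = ChannelAttention()
y = att(torch.randn(2, 256, 16, 16))           # same shape out, channels reweighted
```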

Build a noise reduction module whose structure is, in order: a first convolutional layer, a ReLU activation function, a second convolutional layer, a ReLU activation function, a third convolutional layer and a Softmax function, where the convolution kernel sizes of the first, second and third convolutional layers are all set to 3 × 3 and the strides are all set to 1.
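A sketch of the noise reduction module: three 3 × 3, stride-1 convolutions with ReLUs between them, ending in a Softmax. The text only lists the layer sequence, so the interpretation below (the Softmax forms a spatial weighting map that rescales the input, suppressing low-response background) is an assumption, as is the choice to reduce to a single channel before the Softmax.

```python
import torch
import torch.nn as nn

class NoiseReduction(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # three 3 x 3, stride-1 convolutions with ReLUs in between (from the text)
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, x):
        n, _, h, w = x.shape
        # spatial softmax -> weighting map emphasizing high-response locations
        mask = torch.softmax(self.body(x).view(n, 1, h * w), dim=-1).view(n, 1, h, w)
        return x * mask

nr = NoiseReduction()
z = nr(torch.randn(1, 256, 8, 8))
```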

Build a classification output network formed by a first and a second fully connected layer in series, where the first layer's input size is set to 49 and its output size to 1024, and the second layer's input size is set to 1024 and its output size to 2.

Build a coordinate output network formed by a first and a second fully connected layer in series, where the first layer's input size is set to 49 and its output size to 1024, and the second layer's input size is set to 1024 and its output size to 5.
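The two output heads can be sketched directly from the sizes given above. The ReLU between the two fully connected layers is an assumption (the text lists only the layers), the input size 49 suggests a flattened 7 × 7 RoIAlign map, and the 5 coordinate outputs presumably encode a rotated box (e.g. center, size and angle), though the text does not say.

```python
import torch
import torch.nn as nn

class OutputHeads(nn.Module):
    """Classification head: 49 -> 1024 -> 2; coordinate head: 49 -> 1024 -> 5."""
    def __init__(self):
        super().__init__()
        self.cls = nn.Sequential(nn.Linear(49, 1024), nn.ReLU(), nn.Linear(1024, 2))
        self.box = nn.Sequential(nn.Linear(49, 1024), nn.ReLU(), nn.Linear(1024, 5))

    def forward(self, roi_feat):        # roi_feat: (N, 49) flattened RoI features
        return self.cls(roi_feat), self.box(roi_feat)

heads = OutputHeads()
scores, coords = heads(torch.randn(3, 49))
```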

Cascade, in order, the matching-convolution-kernel generation network, the feature pyramid network, the channel attention module, the noise reduction module, a region proposal network, a RoIAlign layer, the classification output network and the coordinate output network into a detection network.

And 3, training the detection network by using the training set and the template image.

Pair each image in the training set with the template image to form a group, and feed each group into the detection network in turn to obtain the classes and coordinates of the prediction boxes the network outputs for that group.

Compute a class loss between each group's predicted classes and label classes using binary cross-entropy; compute a coordinate loss between each group's predicted coordinates and label coordinates using the smooth L1 norm; and add each group's class loss and coordinate loss to obtain that group's loss value.
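The per-group loss above can be sketched as follows. With a 2-way classification head, the binary cross-entropy is implemented here as a softmax cross-entropy over the two logits; the equal 1:1 weighting of the two terms is an assumption, since the text only says the losses are added.

```python
import torch
import torch.nn.functional as F

def group_loss(cls_logits, cls_labels, box_pred, box_target):
    """Class loss (binary cross-entropy) plus coordinate loss (smooth L1)."""
    cls_loss = F.cross_entropy(cls_logits, cls_labels)   # 2-way CE == BCE here
    box_loss = F.smooth_l1_loss(box_pred, box_target)
    return cls_loss + box_loss

loss = group_loss(torch.zeros(4, 2), torch.zeros(4, dtype=torch.long),
                  torch.zeros(4, 5), torch.ones(4, 5))
```

With zero logits the class term is log 2 ≈ 0.693 and the smooth L1 term on a unit error is 0.5, so the example loss is ≈ 1.193.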

Iteratively update the detection network's weights with each group's loss value by back-propagation until the loss no longer decreases, yielding the trained detection network.
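A minimal training-loop sketch for this step. The optimizer and learning rate are assumptions (the text names neither), and the fixed epoch count stands in for the "until the loss no longer decreases" stopping rule; a real run would track the loss and stop early. `net` is expected to take an (image, template) pair, as in the previous step.

```python
import torch
import torch.nn as nn

def train(net, loss_fn, train_set, template, epochs=36, lr=1e-3):
    """Iterate over (image, labels) groups, pairing each image with the
    template, and update the weights by back-propagation."""
    opt = torch.optim.SGD(net.parameters(), lr=lr)  # optimizer choice assumed
    for _ in range(epochs):
        for image, labels in train_set:
            loss = loss_fn(net(image, template), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()

# toy demonstration with a stand-in network (hypothetical, not the real detector)
class _Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(2, 1)
    def forward(self, image, template):
        return self.fc(image)

net = _Toy()
before = net.fc.weight.detach().clone()
train(net, lambda p, y: ((p - y) ** 2).mean(),
      [(torch.randn(8, 2), torch.randn(8, 1))], template=None, epochs=3)
```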

And 4, detecting the picture to be detected by using the trained detection network.

Crop an image to be detected into several 1024 × 1024-pixel sub-images with a cropping stride of 256 pixels.

Feed each sub-image into the trained detection network to obtain the prediction-box coordinates and class for each sub-image.

Map each sub-image's prediction-box coordinates and class back onto the original image to be detected according to the sub-image's position in the original image.

Filter overlapping prediction boxes in the image to be detected with non-maximum suppression (NMS) to obtain the final detection result.

The technical effects of the invention are further explained below through simulation experiments.

1. Simulation conditions.

The simulation was completed with the deep learning framework PyTorch on a computer with an Intel(R) Core(TM) i7-10700K CPU @ 3.80 GHz and an Nvidia RTX 3090 graphics card.

2. Simulation content and result analysis.

Simulation scene setup: to verify that the proposed template-matching and attention-based method can detect weak and small targets under few-sample conditions, the simulation scene is set to detecting ship targets in SAR images, with 10 images available in the data set. Each image, containing at least 1 target, is cropped into 1024 × 1024-pixel sub-images with a cropping stride of 256 pixels; 1 sub-image with a clear target outline is randomly selected as the template image, and the remaining images are labeled with rotated rectangular boxes to serve as the training set.

After the detection network is built, the maximum number of training epochs is set to 36; each image in the training set is paired with the template image into a group and fed into the network in turn to iteratively update the network weights, finally yielding the trained detection network.

To demonstrate the simulation effect, the images to be detected are set to scenes with small ship targets and complex backgrounds. The trained detection network is applied to the images to be detected to obtain the detection results. Figs. 2(a) and 2(b) show the detection results of the method in complex-background scenes, and Fig. 2(c) shows its detection result in a small-target scene; all results are labeled with rotated rectangular boxes. A complex scene is one in which the background area in the image exceeds the target area, and a small-target scene is one in which a target occupies fewer than 32 × 32 pixels.

Fig. 2(a) shows the detection result for a single target in a complex scene in the simulation experiment; it can be seen that the method achieves a good detection effect in a complex scene even when targets are few. Fig. 2(b) shows the detection result for multiple targets in a complex scene; the method achieves a good detection effect in a complex scene even when targets are numerous. Fig. 2(c) shows the detection result in a scene with multiple small targets; the method achieves a good detection effect even when many small targets are present.

In summary, analysis of the simulation results shows that the proposed few-sample weak and small target detection method based on template matching and an attention mechanism can accurately detect weak and small targets under few-sample conditions, with the correct number of detected targets and accurate box positions. Because the matching-convolution-kernel generation network produces the matching convolution kernel for template matching, the detection network's demand for training samples is greatly reduced, and the channel attention module and noise reduction module give it good detection performance in complex-background scenes, making the method advantageous in practical engineering applications.
