Remote sensing image saliency detection method based on innovative edge feature extraction
1. A method for detecting the saliency of a remote sensing image based on innovative edge feature extraction, characterized by comprising the following steps:
step (1), extracting depth features, the specific method being as follows:
firstly, constructing the encoder: the encoder is built from ResNet34, with a BasicBlock appended after the backbone to better extract features;
step (2), obtaining edge features, the specific method being as follows:
the decoder part has two branches, namely a saliency information extraction branch and a salient edge information extraction branch; the salient edge information extraction branch extracts edge features using a U-Net structure and runs in parallel with the saliency information extraction branch;
inputting the features output by the last layer of the encoder into the salient edge information extraction branch of the decoder, which consists of five convolution blocks; each convolution block consists of three convolution layers: one dilated convolution with a dilation rate of 2 and two plain convolution layers (kernel size 3 × 3, stride 1); the information obtained at each stage of the salient edge information extraction branch is supervised, so that the final edge features are of the best quality; the process can be expressed as:
F_e1 = Conv(F_5)  (1)
F_ei = Conv(C(F_{6-i}, UP(C(F_e(i-1), F_{7-i})))), i = 2, 3, 4, 5  (2)
wherein: conv denotes a convolutional block consisting of three convolutional layers, F5Features representing the output of the last layer of the encoder part, Fe1Representing the edge characteristics obtained by the first layer of the salient edge information extraction branch; feiRepresenting edge features extracted by the ith layer of the salient edge information extraction branch, F6-iRepresenting current saliency edge information extraction scoresFeatures extracted by corresponding encoders are supported, UP represents UP-sampling of bilinear interpolation, C represents combination operation, and Fe(i-1)Representing the edge feature extracted from the previous layer, F7-iRepresenting the characteristics extracted by the encoder corresponding to the last layer of salient edge information extraction points;
step (3), extracting the salient features, which also adopts a U-Net structure, the specific method being as follows:
the saliency information is obtained through the U-shaped network and skip connections; the saliency information extraction branch consists of five convolution blocks, each of which consists of three convolution layers: the first is a dilated convolution with a dilation rate of 2, and the other two are plain convolution layers (kernel size 3 × 3, stride 1); a bilinear-interpolation up-sampling operation is performed between convolution blocks; after each convolution block extracts the information of its layer, the information is passed on to the next layer via up-sampling; a supervision mechanism is applied to the information extracted at each layer to improve the quality of the final saliency map; the process can be expressed as:
F_l1 = Conv(F_5)  (3)
F_li = Conv(UP(C(F_l(i-1), F_{6-i}))), i = 2, 3, 4, 5  (4)
wherein: conv denotes a convolutional block consisting of three convolutional layers, F5Features representing the output of the last layer of the encoder part, Fl1Representing the salient features obtained by the salient information extraction branch in the decoder; fliRepresenting the significant characteristics obtained by the extraction of the ith layer of the significant information extraction branch, UP representing the UP-sampling operation of bilinear interpolation, C representing the combination operation, Fl(i-1)Showing the significant feature obtained in the previous layer, F6-iRepresenting information extracted by a layer of an encoder corresponding to the current saliency information extraction branch;
step (4), finally outputting a saliency prediction map fusing the saliency information and the salient edge information, the fusion method being as follows:
finally, the saliency information and the salient edge information extracted by the two parallel branches are fused by one convolution layer (kernel size 3 × 3, stride 1), and the fused information is then converted into a single-channel output by another convolution layer (kernel size 1 × 1, stride 1); the process can be described as:
Out = Conv_o(Conv_f(C(F_l, F_e)))  (5)
wherein: out represents the significance prediction graph of the final output, ConvoRepresenting the convolutional layers for the number of conversion channels, ConvfRepresenting convolutional layers for fusing information, C representing a join operation, FlAnd FeRespectively representing the significance information and the significant edge information obtained by the two branches.
2. The method for detecting the saliency of a remote sensing image based on innovative edge feature extraction as claimed in claim 1, characterized in that a new edge extraction method is used and fused with the saliency information to improve the quality of remote sensing image saliency detection; the decoder part has two branches, namely a saliency information extraction branch and a salient edge information extraction branch, each branch containing five convolution blocks, i.e. 15 convolution layers per branch; the first convolution layer of each convolution block is a dilated convolution with a dilation rate of 2 and the rest are plain convolution layers; all convolution kernels are 3 × 3 with a stride of 1.
3. The method for detecting the saliency of a remote sensing image based on innovative edge feature extraction as claimed in claim 1, characterized in that during training the images are uniformly resized to 224 × 224 and the batch size is 8; the cross-entropy loss is used as the loss function, the parameters of the network are updated by the Adam optimizer, and the base learning rate is 1e-4.
Background
Salient object detection is a popular research direction in computer vision, and the field has advanced greatly with the vigorous development of deep learning. Related achievements are widely applied to pedestrian detection, video compression, video segmentation, image localization and other fields, and have great research and market value. Remarkable results have been achieved on conventional images; however, for remote sensing images, the complexity of the background and the variability of target scales make directly applying traditional methods unsatisfactory.
In conventional images, salient objects are mostly distributed near the center of the image and do not differ greatly in size. In remote sensing images, by contrast, salient targets may appear in the center or at the edge of the image, and their scales vary widely. Moreover, the background of a remote sensing image is more complex than that of a conventional image and contains more information, making it difficult to effectively extract the information of the salient objects.
Salient object detection achieved considerable results with traditional methods, and with the rise of neural networks the field has made great progress and the detection quality has improved markedly. Recently, salient object detection in remote sensing images has gradually attracted attention; new networks are continually proposed and the detection performance keeps improving.
Most existing remote sensing image salient object detection methods ignore the importance of edges to salient targets, so the detected targets suffer from unclear edges and the detection quality cannot meet practical requirements.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a method for detecting the saliency of a remote sensing image based on innovative edge feature extraction.
The method comprises the following steps:
1. A method for detecting the saliency of a remote sensing image based on innovative edge feature extraction, comprising the following steps:
step (1), extracting depth features, the specific method being as follows:
firstly, constructing the encoder: the encoder is built from ResNet34, with a BasicBlock appended after the backbone to better extract features;
step (2), obtaining edge features, the specific method being as follows:
the decoder part has two branches, namely a saliency information extraction branch and a salient edge information extraction branch; the salient edge information extraction branch extracts edge features using a U-Net structure and runs in parallel with the saliency information extraction branch;
inputting the features output by the last layer of the encoder into the salient edge information extraction branch of the decoder, which consists of five convolution blocks; each convolution block consists of three convolution layers: one dilated convolution with a dilation rate of 2 and two plain convolution layers (kernel size 3 × 3, stride 1); the information obtained at each stage of the salient edge information extraction branch is supervised, so that the final edge features are of the best quality; the process can be expressed as:
F_e1 = Conv(F_5)  (1)
F_ei = Conv(C(F_{6-i}, UP(C(F_e(i-1), F_{7-i})))), i = 2, 3, 4, 5  (2)
wherein: conv denotes a convolutional block consisting of three convolutional layers, F5Features representing the output of the last layer of the encoder part, Fe1Representing the edge characteristics obtained by the first layer of the salient edge information extraction branch; feiRepresenting edge features extracted by the ith layer of the salient edge information extraction branch, F6-iRepresenting the characteristics extracted by the encoder corresponding to the current significant edge information extraction branch, UP representing the UP-sampling of the bilinear interpolation, C representing the combination operation, Fe(i-1)Representing the edge feature extracted from the previous layer, F7-iRepresenting the upper layer of the saliency edgeExtracting the information according to the characteristics extracted by the corresponding encoder;
step (3), extracting the salient features, which also adopts a U-Net structure, the specific method being as follows:
the saliency information is obtained through the U-shaped network and skip connections; the saliency information extraction branch consists of five convolution blocks, each of which consists of three convolution layers: the first is a dilated convolution with a dilation rate of 2, and the other two are plain convolution layers (kernel size 3 × 3, stride 1); a bilinear-interpolation up-sampling operation is performed between convolution blocks; after each convolution block extracts the information of its layer, the information is passed on to the next layer via up-sampling; a supervision mechanism is applied to the information extracted at each layer to improve the quality of the final saliency map; the process can be expressed as:
F_l1 = Conv(F_5)  (3)
F_li = Conv(UP(C(F_l(i-1), F_{6-i}))), i = 2, 3, 4, 5  (4)
wherein: conv denotes a convolutional block consisting of three convolutional layers, F5Features representing the output of the last layer of the encoder part, Fl1Representing the salient features obtained by the salient information extraction branch in the decoder; fliRepresenting the significant characteristics obtained by the extraction of the ith layer of the significant information extraction branch, UP representing the UP-sampling operation of bilinear interpolation, C representing the combination operation, Fl(i-1)Showing the significant feature obtained in the previous layer, F6-iRepresenting information extracted by a layer of an encoder corresponding to the current saliency information extraction branch;
step (4), finally outputting a saliency prediction map fusing the saliency information and the salient edge information, the fusion method being as follows:
finally, the saliency information and the salient edge information extracted by the two parallel branches are fused by one convolution layer (kernel size 3 × 3, stride 1), and the fused information is then converted into a single-channel output by another convolution layer (kernel size 1 × 1, stride 1); the process can be described as:
Out = Conv_o(Conv_f(C(F_l, F_e)))  (5)
wherein: out represents the significance prediction graph of the final output, ConvoRepresenting the convolutional layers for the number of conversion channels, ConvfRepresenting convolutional layers for fusing information, C representing a join operation, FlAnd FeRespectively representing the significance information and the significant edge information obtained by the two branches.
Preferably, a new edge extraction method is used and fused with the saliency information to improve the quality of remote sensing image saliency detection; the decoder part has two branches, namely a saliency information extraction branch and a salient edge information extraction branch, each branch containing five convolution blocks, i.e. 15 convolution layers per branch; the first convolution layer of each convolution block is a dilated convolution with a dilation rate of 2 and the rest are plain convolution layers; all convolution kernels are 3 × 3 with a stride of 1.
Preferably, during training the images are uniformly resized to 224 × 224 and the batch size is 8; the cross-entropy loss is used as the loss function, the parameters of the network are updated by the Adam optimizer, and the base learning rate is 1e-4.
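As an illustrative sketch only (not the invention's actual training code), the per-pixel binary cross-entropy loss and one Adam update with the stated base learning rate of 1e-4 can be written as follows; all shapes, variable names, and the scalar toy usage are hypothetical assumptions.

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Per-pixel binary cross-entropy, averaged over the map."""
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(target * np.log(pred) + (1 - target) * np.log(1 - pred))))

def adam_step(param, grad, state, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; `state` holds (t, m, v) moment estimates."""
    t, m, v = state
    t += 1
    m = b1 * m + (1 - b1) * grad            # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, (t, m, v)

# Toy usage: a perfect prediction gives a near-zero loss, and one
# Adam step moves the parameter against the gradient by about lr.
target = np.array([[0.0, 1.0], [1.0, 0.0]])
loss_perfect = bce_loss(target.copy(), target)
w, state = np.array([0.5]), (0, np.zeros(1), np.zeros(1))
w_new, state = adam_step(w, np.array([2.0]), state)  # positive gradient -> w decreases
```

Note that for a scalar parameter the very first Adam step has magnitude approximately equal to the learning rate, regardless of the gradient scale.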
The invention has the following beneficial effects:
the salient features and the salient edge features are extracted by a U-shaped network, multi-scale information is fused, and the extracted features are higher in quality. The edge features with better quality are fused with the salient features, which is beneficial to optimizing the edge of the salient object and improving the detection quality. The method adopts a new mode to extract the edge characteristics, refines the edges and improves the quality of the obvious target. And the context characteristics are extracted by a U-shaped network, a complex background is suppressed, and a remarkable target is highlighted, so that a remarkable target detection result of the remote sensing image with a better edge effect is obtained.
Drawings
FIG. 1 is a block diagram of the method of the present invention;
FIG. 2 is a block diagram of a convolution block for edge portion feature extraction according to the present invention;
FIG. 3 illustrates the manner in which the method of the present invention combines a salient feature with an edge feature;
FIG. 4 shows example results of the method of the present invention (the first column is the RGB image, the second column is the ground-truth label, and the third column is our prediction); fig. 4 mainly shows the network's ability to handle complex backgrounds and multi-scale targets in remote sensing images and to refine the edges of salient targets in the prediction maps.
Detailed Description
The invention is further illustrated by the following figures and examples.
As shown in figs. 1-4, a remote sensing image saliency detection method based on innovative edge feature extraction comprises an encoder based on ResNet34, an innovative edge extraction method, and a method for fusing saliency information with salient edge information. An RGB three-channel color optical remote sensing image is input to the model; features are first extracted by the encoder with ResNet34 as the backbone; the extracted features are then fed into the two decoder branches for edge features and salient features respectively, where they are refined by successive convolutions, and a saliency prediction map is finally output. Both the salient features and the salient edge features are extracted through a U-shaped network that fuses multi-scale information, so the extracted features are of high quality. Fusing the higher-quality edge features with the salient features helps optimize the edges of salient objects and improves detection quality.
As shown in fig. 1, the method of the invention comprises the following steps:
step (1), extracting depth features, the specific method being as follows:
the encoder structure is built first, our encoder is built from ResNet34, and Basicblock is added last for better feature extraction.
Step (2), obtaining edge information, the specific method being as follows:
we innovatively use the structure of U-Net to extract edge features, in parallel with the salient feature extraction module. Inputting the information output by the sixth layer into a decoder of the edge information, wherein the decoder consists of five convolution blocks; as shown in fig. 2, each convolution block consists of three convolution layers, which are respectively a dilation convolution with a dilation rate of 2 and two simple convolution layers (the convolution kernel size is 3 × 3, step size is 1); and monitoring the obtained information at each part of the decoder to finally obtain the edge characteristics with the best effect. The process can be expressed as:
F_e1 = Conv(F_5)
F_ei = Conv(C(F_{6-i}, UP(C(F_e(i-1), F_{7-i})))), i = 2, 3, 4, 5
wherein: conv denotes a convolutional block consisting of three convolutional layers, F5Features representing the output of the last layer of the encoder part, Fe1Representing the edge features obtained by the first layer of the edge decoder. FeiRepresenting 2-5 layers of extracted edge features, F6-iRepresenting the features extracted by the encoder corresponding to the current edge decoder module, UP representing the UP-sampling of the bilinear interpolation, C representing the combining operation, Fei-1Representing the edge feature extracted from the previous layer, F7-iAnd representing the characteristics extracted by the encoder corresponding to the edge decoder module of the previous layer.
Step (3), extracting the salient features, which also adopts a U-Net structure, the specific method being as follows:
We obtain the saliency information through a U-shaped network and skip connections; the decoder branch consists of five convolution blocks, each of which consists of three convolution layers. The first convolution layer is a dilated convolution with a dilation rate of 2, and the remaining two are plain convolution layers (kernel size 3 × 3, stride 1). A bilinear-interpolation up-sampling operation is performed between convolution blocks. After each convolution block extracts the information of its layer, the information is passed on to the next layer via up-sampling. A supervision mechanism is applied to the information extracted at each layer to improve the quality of the final saliency map; the process can be expressed as:
F_l1 = Conv(F_5)
F_li = Conv(UP(C(F_l(i-1), F_{6-i}))), i = 2, 3, 4, 5
wherein:conv denotes a convolutional block consisting of three convolutional layers, F5Features representing the output of the last layer of the encoder part, Fl1Representing the salient features obtained at the first layer of the decoder. FliShowing the significant features obtained by the extraction of the 2-5 th layer, UP showing the UP-sampling operation of bilinear interpolation, C showing the combination operation, Fli-1Showing the significant feature obtained in the previous layer, F6-iRepresenting the information extracted by the layer of the encoder corresponding to the current decoder.
Step (4), finally outputting a saliency prediction map fusing the saliency information and the salient edge information; as shown in FIG. 3, the fusion method is as follows:
We take the saliency information and the salient edge information extracted by the two final parallel branches, fuse the two kinds of information with one convolution layer (kernel size 3 × 3, stride 1), and then convert the fused information into a single-channel output with another convolution layer (kernel size 1 × 1, stride 1). The process can be described as:
Out = Conv_o(Conv_f(C(F_l, F_e)))
wherein: out represents the significance prediction graph of our final output, COnvoRepresenting the convolutional layers for the number of conversion channels, ConvfRepresenting convolutional layers for fusing information, C representing a join operation, FlAnd FeRespectively representing the significant information and the significant edge information obtained by the two branches.
Fig. 4 shows example results of the method of the present invention, wherein the first column is the RGB image, the second column is the ground-truth label, and the third column is our prediction.
Furthermore, the invention uses a new edge extraction method and fuses it with the saliency information to improve the quality of remote sensing image saliency detection. The decoder part of the invention has two branches, namely a saliency information extraction branch and a salient edge information extraction branch; each branch contains five convolution blocks, i.e. 15 convolution layers per branch. The first convolution layer of each convolution block is a dilated convolution with a dilation rate of 2, and the rest are plain convolution layers; all convolution kernels are 3 × 3 with a stride of 1.
Further, the base learning rate of the invention is 1e-4.