Instance segmentation improvement method based on ranking and semantic consistency constraints
1. An instance segmentation improvement method based on ranking and semantic consistency constraints, characterized by comprising the following steps:
1) Instance segmentation improvement scheme based on a two-stage network
The first stage extracts regions of interest; for the ranking-based selection of sub-regions, a ranking loss is added on top of the original classification and regression losses. For the segmentation operation of the second stage, a semantic consistency loss is added on top of the original classification, regression, and segmentation losses;
2) Instance segmentation improvement scheme based on a single-stage network
In the single-stage algorithm, the ranking loss and the semantic consistency loss are added to the regression head and the segmentation head, respectively, on top of the original classification, regression, and segmentation losses;
the ranking loss and the semantic consistency loss are added into the original instance segmentation framework;
the two classes of algorithms are represented by the original basic modules of Mask-RCNN and Yolact, respectively;
Mask-RCNN basic modules:
feature extraction network (Backbone): ResNeXt101 + FPN serves as the Backbone, and the pre-training weights used are a ResNet101 file pre-trained on the ImageNet dataset;
RPN network, used to generate region proposals; this layer classifies anchors as positive or negative through softmax, then corrects the anchors using bounding-box regression to obtain accurate ROIs (i.e., sub-regions);
fully connected layer (FC): the classification operation yields an instance label, and the regression operation yields an instance box;
fully convolutional network (FCN): the mask segmentation operation yields an instance mask;
Yolact basic modules:
feature extraction network (Backbone): ResNet101 + FPN serves as the Backbone, and the pre-training weight file is again a ResNet101 file;
Protonet network, which generates k prototype masks for each image, with k = 32; a fully convolutional network attached after the FPN output predicts the set of prototype masks;
Prediction Head: performs the regression and classification operations, i.e., outputs the sub-regions and their classifications in the same way as the sub-region extraction process in the two-stage method; it simultaneously predicts k linear combination coefficients used to linearly combine the prototype masks, and a Crop operation is applied to the linear combination result to obtain the instance mask;
Sub-region ranking loss
The ranking loss is defined as in formula (1):
where P is the positive sample set, and positive and negative samples are determined by thresholding the Intersection over Union (IoU) computed between the prior (anchor) boxes and the GT Bbox: samples above 0.7 are classified as positive, samples below 0.3 as negative, and the remaining samples are excluded from training. Sorting the positive and negative samples by IoU gives the rank number r; sorting by the sample's classification confidence score gives the value sort(r); the higher a positive sample ranks, the smaller SRLoss becomes;
Sub-region semantic consistency loss
Let M_i be the number of pixels labeled as class i in the mask region; the final loss function takes the form:
where c is the total number of classes, c = 80 on the MS COCO dataset, and α is a hyperparameter;
Model training and testing
The pre-training model used has been trained on the ImageNet dataset;
the standard label file annotation comprises [id, image_id, category_id, segmentation, area, bbox, iscrowd], of which the category label category_id, the mask label segmentation, and the instance box label bbox are used; segmentation is polygon-format data in which each adjacent pair of values gives the coordinates of an instance contour edge point, and the bbox data format is [x1, y1, x2, y2];
in the first stage, the loss function is obtained mainly by training the RPN network and comprises: the RPN foreground/background classification loss l_rpn_cls, with true labels t ∈ {1, 0, -1}, where anchors with true label 0 do not participate in constructing the loss function and labels of -1 are converted to 0 for the cross-entropy calculation; the RPN target-box regression loss l_rpn_reg; and the ranking loss SRLoss proposed by the present invention, denoted here as l_rpn_rank;
in the second stage, the loss function mainly comprises the classification loss l_cls, the regression loss l_reg, the segmentation loss l_seg, and the semantic consistency loss SSCLoss proposed by the present invention, denoted as l_sc;
the classification loss and the segmentation loss are usually cross-entropy losses, and the regression loss is the Smooth_L1 loss; their general forms are:
cross-entropy loss:
L_ce = -(1/n) Σ_{i=1}^{n} [y log y_i + (1 - y) log(1 - y_i)]   (3)
where y is the true value, y_i is the predicted probability value, and n is the number of samples;
Smooth_L1 loss:
smooth_L1(x) = 0.5x², if |x| < 1; |x| - 0.5, otherwise   (4)
where y* is the predicted value, y is the label, and x = y* - y is the difference between the two;
the overall training is divided into two parts, firstly, the RPN network part is trained, and the loss function L needing to be optimized1The following were used:
L1=lrpn_cls+lrpn_reg+lrpn_rank (5)
after the ROI region screening enters the second stage, a loss function L needing to be optimized2The following were used:
L2=lcls+lreg+lseg+lsc (6)
in the testing part, the top 100 detection boxes with the highest scores are processed using the Mask prediction branch; to accelerate inference, the network predicts K mask images for each ROI but uses only the mask image of the class with the highest probability; that mask image is resized back to the ROI size and binarized with a set threshold of 0.5, retaining pixels above the threshold, and finally an image-level add operation is applied to the segmented mask and the original image to obtain the final instance segmentation visualization result;
the total loss of the network is:
L = l_cls + l_reg + l_seg + l_segm + l_rank + l_sc
where l_cls and l_reg are the classification and regression losses, l_seg is the segmentation loss, l_segm is the semantic segmentation loss, the ranking loss is denoted l_rank, and the semantic consistency loss is denoted l_sc; these all take the cross-entropy form L_ce or the Smooth_L1 form L_smooth_l1 given above, and besides the regression and classification losses, the coarse semantic segmentation loss l_segm is also a cross-entropy-form loss.
Background
Instance segmentation is a very important task in the field of computer vision. The major tasks of modern computer vision include image classification, object detection, semantic segmentation, and instance segmentation, with complexity increasing from one to the next. Instance segmentation not only requires segmenting the mask of an object and identifying its class, but also distinguishing between different instances of the same class; instance segmentation can therefore be defined as a technique that solves both the object detection problem and the semantic segmentation problem.
For a given image, an instance segmentation algorithm predicts a set of data labels {C_j, B_j, I_j, S_j}, where C_j is the predicted class of the j-th instance, B_j is the predicted location information of the j-th instance, I_j is the segmentation mask information of the j-th instance, and S_j is the class confidence of the predicted j-th instance.
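A minimal sketch of this prediction tuple as a simple data structure; the class and field names are illustrative, not taken from any particular framework.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class InstancePrediction:
    category: int        # C_j: predicted class of the j-th instance
    box: np.ndarray      # B_j: predicted location information (bounding box)
    mask: np.ndarray     # I_j: segmentation mask of the j-th instance
    score: float         # S_j: class confidence of the prediction
```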
In recent years, with the development of the field of artificial intelligence, many instance segmentation methods and technologies have emerged; current mainstream instance segmentation methods can be divided into two-stage and single-stage instance segmentation methods.
Two-stage instance segmentation derives from the top-down family of two-stage object detection algorithms. As shown in FIG. 1, the first stage is a simple foreground/background binary classification and regression process whose aim is to extract sub-regions; a sub-region is a rectangular box (prediction box) represented by two pairs of coordinates, corresponding to two corner points of the box, and the box may contain an object instance. In the second stage, for each sub-region extracted in the first stage, the corner coordinates of the rectangular box are refined by regression, the box is classified as an instance and labeled at the pixel level, and a segmentation sub-network simultaneously produces the instance mask. The classic algorithm of this type is Mask-RCNN, which achieves 37.1% mAP (mean average precision) on the COCO dataset.
According to the existing literature, two-stage instance segmentation algorithms achieve the highest accuracy on public datasets. This type of method has two disadvantages. First, the two-stage processing scheme limits the speed of the algorithm; the inability to guarantee real-time performance is the greatest obstacle to deploying such algorithms in practice. Second, the quality of the masks obtained by these algorithms remains uneven: taking Mask-RCNN as an example, the final mask is upsampled from a small 28 × 28 region, so the restored mask is of poor quality and covers too little or too much of the sample area.
Single-stage instance segmentation algorithms are shown in FIG. 2; this type of algorithm generally derives from single-stage object detection algorithms and, compared with two-stage algorithms, has no sub-region extraction process. The classic algorithms of this type are the Yolact series, which achieve 29.8% mAP on the COCO dataset at a speed of 33.5 FPS.
The real-time performance of single-stage algorithms is sufficient for practical deployment, but the problems of lower accuracy relative to two-stage algorithms and low mask quality remain serious.
In summary, the best mAP achieved by existing two-stage and single-stage instance segmentation algorithms on public datasets shows that there is room to further improve the mask quality of instance segmentation.
The idea for improving mask quality is to add new loss functions into the framework that define semantic constraints on the sub-regions that may contain instances and on the pixel-level labels of those instances, concretely forming a ranking loss and a semantic consistency constraint loss. The loss functions defined by the present invention can be applied directly to existing two-stage and single-stage instance segmentation algorithms and improve the mAP of the original algorithms. Verification experiments were carried out on Mask-RCNN and Yolact, the representative two-stage and single-stage instance segmentation algorithms respectively, and the experimental results demonstrate the effectiveness of the proposed loss functions.
Disclosure of Invention
Aiming at the problem that the masks obtained by existing instance segmentation algorithms are of poor quality, the present invention adds new loss functions (a ranking loss and a semantic consistency constraint loss) into existing algorithm frameworks, thereby forming a new instance segmentation method and improving the quality of the masks obtained by instance segmentation.
The invention provides an instance segmentation improvement method based on ranking and semantic consistency constraints; the implementation schemes on the single-stage and two-stage instance segmentation frameworks are introduced separately.
1. Instance segmentation improvement scheme based on a two-stage network
As shown in FIG. 3, in the two-stage instance segmentation algorithm framework, the first stage mainly completes the extraction of regions of interest; here the invention adds a ranking loss on top of the original classification and regression losses for the ranking-based selection of sub-regions. For the segmentation operation of the second stage, a semantic consistency loss is added on top of the original classification, regression, and segmentation losses.
2. Instance segmentation improvement scheme based on a single-stage network
As shown in FIG. 4, in the single-stage algorithm, the ranking loss and the semantic consistency loss are added to the regression head and the segmentation head, respectively, on top of the original classification, regression, and segmentation losses.
1. Introduction to the basic models
The invention adds the ranking loss and the semantic consistency loss into the original instance segmentation frameworks.
The two classes of algorithms are presented below through the original basic modules of Mask-RCNN and Yolact, respectively.
Mask-RCNN basic modules:
Feature extraction network (Backbone): ResNeXt101 + FPN serves as the Backbone, and the pre-training weights it uses are a ResNet101 file pre-trained on the ImageNet dataset.
RPN network, used to generate region proposals. This layer classifies anchors as positive or negative through softmax, then corrects the anchors using bounding-box regression to obtain accurate ROIs (i.e., sub-regions).
Fully connected layer (FC): the classification operation yields an instance label, and the regression operation yields an instance box.
Fully convolutional network (FCN): the mask segmentation operation yields an instance mask.
These modules are shown in FIG. 5.
Yolact basic modules:
Feature extraction network (Backbone): ResNet101 + FPN serves as the Backbone, and the pre-training weight file is again a ResNet101 file.
Protonet network, which generates k prototype masks for each image. A fully convolutional network attached after the FPN output predicts the set of prototype masks.
Prediction Head: performs the regression and classification operations, i.e., outputs the sub-regions and their classifications in the same way as the sub-region extraction process in the two-stage method; it simultaneously predicts k linear combination coefficients used to linearly combine the prototype masks, and a Crop operation is applied to the linear combination result to obtain the instance mask, as sketched below.
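A minimal sketch of this Yolact-style mask assembly step, assuming prototype masks of shape (H, W, k) and per-instance coefficient vectors of length k; the sigmoid activation and box-crop follow the published Yolact design, but the tensor names and shapes here are illustrative.

```python
import torch

def assemble_masks(prototypes: torch.Tensor,  # (H, W, k) prototype masks from Protonet
                   coeffs: torch.Tensor,      # (N, k) linear combination coefficients
                   boxes: torch.Tensor        # (N, 4) boxes [x1, y1, x2, y2] in pixels
                   ) -> torch.Tensor:
    """Linearly combine prototypes with per-instance coefficients, then crop."""
    h, w, _ = prototypes.shape
    # Linear combination: each instance mask is a weighted sum of the k prototypes.
    masks = torch.sigmoid(torch.einsum('hwk,nk->nhw', prototypes, coeffs))  # (N, H, W)
    # Crop operation: zero out everything outside each predicted box.
    ys = torch.arange(h, device=masks.device).view(1, h, 1)
    xs = torch.arange(w, device=masks.device).view(1, 1, w)
    x1, y1, x2, y2 = (boxes[:, i].view(-1, 1, 1) for i in range(4))
    inside = (xs >= x1) & (xs < x2) & (ys >= y1) & (ys < y2)
    return masks * inside
```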
Sub-region Rank Loss (SRLoss)
The core idea of this constraint is: the prediction boxes are sorted in non-increasing order of score, which encourages positive-sample prediction boxes to precede negative-sample prediction boxes as far as possible, so that more accurate sub-regions can be obtained.
The ranking loss is defined as in formula (1):
where P is the positive sample set, and positive and negative samples are determined by thresholding the Intersection over Union (IoU) computed between the prior (anchor) boxes and the GT Bbox (e.g., 0.7 and 0.3: samples above 0.7 are classified as positive, samples below 0.3 as negative, and the remaining samples are excluded from training). Sorting the positive and negative samples by IoU gives the rank number r; sorting by the sample's classification confidence score gives the value sort(r). The higher a positive sample ranks, the smaller SRLoss becomes.
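Since formula (1) itself is not reproduced in the text, the sketch below illustrates one plausible realization consistent with the description: positives whose confidence rank sort(r) falls behind their IoU rank r are penalized, so the loss shrinks as positive samples move toward the front of the confidence ranking. It is an assumption-laden illustration, not a verbatim reproduction of formula (1).

```python
import torch

def sr_loss(ious: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    """Illustrative Sub-region Rank Loss.

    ious:   (M,) IoU of each sampled anchor with its matched GT box.
    scores: (M,) classification confidence of each anchor.
    """
    keep = (ious > 0.7) | (ious < 0.3)   # samples in between are not trained on
    ious, scores = ious[keep], scores[keep]
    pos = ious > 0.7                     # P: the positive sample set
    # r: rank when sorted by IoU; sort(r): rank when sorted by confidence
    # (double argsort turns descending sort order into per-element ranks).
    r = torch.argsort(torch.argsort(ious, descending=True))
    sort_r = torch.argsort(torch.argsort(scores, descending=True))
    # Penalize positives whose confidence rank is worse than their IoU rank.
    n = float(len(ious))
    return torch.relu((sort_r - r)[pos].float() / n).mean()
```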
Sub-region Semantic Consistency Loss (SSCLoss)
The purpose of this loss is to constrain the semantic consistency of the pixels in the mask region: the pixels within a sub-region should carry as few distinct labels as possible, and the pixels within a sub-region should belong to the same class as far as possible. The former counts the segmentation classes, and this part reaches its minimum when the mask quality is best; the latter computes the ratio of the number of pixels with the correct semantic class in the mask region to the total number of pixels in the mask region.
Let M_i denote the number of pixels labeled as class i in the mask region; the final loss function takes the form:
where c is the total number of classes, c = 80 on the MS COCO dataset, and α is a hyperparameter.
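The exact formula is likewise not reproduced in the text, so the sketch below combines the two stated terms under explicit assumptions: (a) the number of distinct classes appearing in the region and (b) the fraction of pixels whose class matches the instance class; both a lower class count and a higher correct ratio lower the loss. The combination below is illustrative only.

```python
import torch

def ssc_loss(pred_labels: torch.Tensor, gt_class: int,
             num_classes: int = 80, alpha: float = 1.0) -> torch.Tensor:
    """Illustrative Sub-region Semantic Consistency Loss.

    pred_labels: (H, W) integer class label for each pixel inside one mask region.
    gt_class:    ground-truth class of the instance.
    """
    m = torch.bincount(pred_labels.flatten(), minlength=num_classes).float()  # M_i
    num_distinct = (m > 0).float().sum()                 # term (a): distinct labels
    correct_ratio = m[gt_class] / m.sum().clamp(min=1)   # term (b): consistent pixels
    return alpha * (num_distinct - 1) / (num_classes - 1) + (1.0 - correct_ratio)
```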
Model training and testing
During training, the actual flow of Mask-RCNN is shown in FIG. 7: the names of the functional blocks are in rectangular boxes, dashed arrows point to the network losses, and the newly added losses are shown in bold. The pre-training model used by the present invention has been trained on the ImageNet dataset.
The standard label file annotation comprises [id, image_id, category_id, segmentation, area, bbox, iscrowd], of which the category label category_id, the mask label segmentation, and the instance box label bbox are used. segmentation is polygon-format data (each adjacent pair of values gives the coordinates of an instance contour edge point), and the bbox data format is [x1, y1, x2, y2].
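A minimal sketch of reading one such annotation entry, assuming a COCO-style JSON file; the field names follow the list above, and the bbox is interpreted as [x1, y1, x2, y2] as stated in the text (note that this differs from stock COCO, which uses [x, y, width, height]).

```python
import json

def load_instance_annotations(path: str):
    """Yield (category_id, contour points, box) triples from a COCO-style label file."""
    with open(path) as f:
        coco = json.load(f)
    for ann in coco["annotations"]:
        category_id = ann["category_id"]       # class label
        # Polygon: a flat list [x1, y1, x2, y2, ...] of contour edge points,
        # read as adjacent (x, y) coordinate pairs.
        poly = ann["segmentation"][0]
        points = list(zip(poly[0::2], poly[1::2]))
        box = ann["bbox"]                      # [x1, y1, x2, y2] per the text
        yield category_id, points, box
```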
In the first stage, the loss function is obtained mainly by training the RPN network and comprises: the RPN foreground/background classification loss l_rpn_cls, with true labels t ∈ {1, 0, -1}, where anchors with true label 0 do not participate in constructing the loss function and labels of -1 are converted to 0 for the cross-entropy calculation; the RPN target-box regression loss l_rpn_reg; and the ranking loss SRLoss proposed by the present invention, denoted here as l_rpn_rank.
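A minimal sketch of the label handling just described, assuming labels 1 (positive), -1 (negative), and 0 (ignored); the tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def rpn_cls_loss(logits: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """RPN foreground/background loss with labels t in {1, 0, -1}.

    logits: (M,) foreground scores; t: (M,) anchor labels. Anchors labeled 0
    do not join the loss; -1 (negative) maps to 0 as the cross-entropy target.
    """
    keep = t != 0                           # drop the ignored anchors
    target = t[keep].clamp(min=0).float()   # -1 -> 0, 1 stays 1
    return F.binary_cross_entropy_with_logits(logits[keep], target)
```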
In the second stage, the loss function mainly comprises the classification loss l_cls, the regression loss l_reg, the segmentation loss l_seg, and the semantic consistency loss SSCLoss proposed by the present invention, denoted as l_sc.
The classification loss and the segmentation loss are usually cross-entropy losses, and the regression loss is the Smooth_L1 loss. Their general forms are given below.
Cross-entropy loss:
L_ce = -(1/n) Σ_{i=1}^{n} [y log y_i + (1 - y) log(1 - y_i)]   (3)
where y is the true value, y_i is the predicted probability value, and n is the number of samples.
Smooth_L1 loss:
smooth_L1(x) = 0.5x², if |x| < 1; |x| - 0.5, otherwise   (4)
where y* is the predicted value, y is the label, and x = y* - y is the difference between the two.
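A minimal sketch of the two general forms, written directly from the definitions above; these mirror the standard formulas rather than any framework-specific variant.

```python
import torch

def cross_entropy(y: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy, formula (3): y is the true value, y_pred the
    predicted probability; the mean is taken over the n samples."""
    eps = 1e-7                                # guard against log(0)
    y_pred = y_pred.clamp(eps, 1 - eps)
    return -(y * y_pred.log() + (1 - y) * (1 - y_pred).log()).mean()

def smooth_l1(y_star: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Smooth_L1, formula (4), applied to the difference x = y* - y."""
    x = (y_star - y).abs()
    return torch.where(x < 1, 0.5 * x ** 2, x - 0.5).mean()
```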
The overall training is divided into two parts. First, the RPN network part is trained; the loss function L1 to be optimized is as follows:
L1 = l_rpn_cls + l_rpn_reg + l_rpn_rank   (5)
After ROI screening enters the second stage, the loss function L2 to be optimized is as follows:
L2 = l_cls + l_reg + l_seg + l_sc   (6)
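A sketch of how the two stage losses might be assembled from per-head terms; the dict keys are hypothetical stand-ins, not mmdetection's names.

```python
def two_stage_losses(rpn_losses: dict, head_losses: dict):
    """Assemble L1 (eq. 5) and L2 (eq. 6) from already-computed scalar losses.

    L1 is optimized while training the RPN part; L2 is optimized in the
    second stage after ROI screening."""
    l1 = rpn_losses["cls"] + rpn_losses["reg"] + rpn_losses["rank"]  # eq. (5)
    l2 = (head_losses["cls"] + head_losses["reg"]
          + head_losses["seg"] + head_losses["sc"])                  # eq. (6)
    return l1, l2
```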
In the testing part, the top 100 detection boxes with the highest scores are processed using the Mask prediction branch. To accelerate inference, the network predicts K mask images for each ROI but uses only the mask image of the class with the highest probability; this mask image is resized back to the ROI size and binarized with a set threshold of 0.5, retaining pixels above the threshold. Finally, an image-level add operation is applied to the segmented mask and the original image to obtain the final instance segmentation visualization result.
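A minimal sketch of this test-time post-processing for one ROI, assuming a (K, h, w) per-class mask prediction; the interpolation mode and tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def postprocess_mask(mask_pred: torch.Tensor,  # (K, h, w) per-class mask maps
                     cls_id: int,              # class with the highest probability
                     roi_h: int, roi_w: int) -> torch.Tensor:
    """Select the top-class mask, resize it to the ROI, and binarize at 0.5."""
    m = mask_pred[cls_id][None, None]          # (1, 1, h, w) for interpolate
    m = F.interpolate(m, size=(roi_h, roi_w), mode="bilinear", align_corners=False)
    return (m[0, 0] > 0.5).float()             # keep only pixels above threshold
```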
The network structure of Yolact, the representative single-stage instance segmentation algorithm, after adding the losses is shown in FIG. 8; the names of the functional blocks are in rectangular boxes, dashed arrows point to the network losses, and the new losses are shown in bold. The pre-trained model used was pre-trained on the ImageNet dataset. The network training and testing principles are basically the same as before, except that all losses are optimized directly rather than in stages; the total loss of the network is:
L = l_cls + l_reg + l_seg + l_segm + l_rank + l_sc
where l_cls and l_reg are the classification and regression losses, l_seg is the segmentation loss, l_segm is the semantic segmentation loss, the ranking loss is denoted l_rank, and the semantic consistency loss is denoted l_sc; these all take the cross-entropy form L_ce or the Smooth_L1 form L_smooth_l1 given above, and besides the regression and classification losses, the coarse semantic segmentation loss l_segm is also a cross-entropy-form loss.
Drawings
FIG. 1 Two-stage instance segmentation method
FIG. 2 Single-stage instance segmentation method
FIG. 3 Two-stage network architecture with the added ranking and semantic consistency losses
FIG. 4 Single-stage network architecture with the added ranking and semantic consistency losses
FIG. 5 Mask-RCNN schematic
FIG. 6 Yolact schematic
FIG. 7 Structure of the modified Mask-RCNN
FIG. 8 Structure of the modified Yolact
Detailed Description
The present invention uses the MS COCO series datasets (COCO2015 and COCO2016) for experiments. The COCO dataset is a large, rich object detection, segmentation, and captioning dataset. Aimed at scene understanding, its data are mainly captured from complex everyday scenes, and the targets in the images are localized through accurate segmentation. The images cover 91 target classes, with 328,000 images and 2,500,000 labels. It is so far the largest dataset with semantic segmentation: it provides 80 categories and more than 330,000 images, of which 200,000 are labeled, and the number of individual instances in the whole dataset exceeds 1.5 million. The COCO dataset now has 3 annotation types: object instances, object keypoints, and image captions, with labels stored in JSON files. Training and testing were performed on the training and validation sets, respectively. Since the task is instance segmentation, the object instance annotations are used.
Evaluation metric. The present invention follows the standard evaluation protocol for instance segmentation, evaluating with mAP at different Intersection over Union (IoU) thresholds. The official evaluation code is used for the experiments.
Experimental setup. In the experiments, the losses defined by the present invention were added in turn to each framework to test their effectiveness.
The Mask-RCNN and Yolact experiments were performed on version 2.3 of mmdetection, the framework released by SenseTime.
The hyperparameters in the Mask-RCNN experiments were set as follows: α = 1; multi-scale input sizes; a non-maximum suppression (NMS) threshold of 0.7; an initial learning rate of 0.333 with dynamic decay during training; a batch size of 4, trained on four GPUs; weights saved once per epoch.
The hyperparameters on Yolact were: α = 1; an initial input image size of 550 × 550; a non-maximum suppression (NMS) threshold of 0.7; an initial learning rate of 0.333 with dynamic decay during training; a batch size of 4, trained on four GPUs, with weights saved once per epoch.
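The two hyperparameter sets, gathered as plain configuration dicts for reference; this is illustrative bookkeeping only, not mmdetection's config schema, and "multi_scale" stands in for the multi-size input setting described above.

```python
# Shared settings from the text.
COMMON = dict(alpha=1.0, nms_threshold=0.7, initial_lr=0.333,
              lr_decay="dynamic", batch_size=4, num_gpus=4,
              checkpoint_interval="every_epoch")

MASK_RCNN_CFG = dict(COMMON, input_size="multi_scale")  # multi-size inputs
YOLACT_CFG = dict(COMMON, input_size=(550, 550))        # fixed initial size
```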
In all of the above settings, training proceeded in runs of 3 epochs at a time, with one validation pass performed before training continued.
The present invention compares the model's performance scores with the baseline performance published by the original authors. Table 1 shows the experimental results on the COCO dataset, from which the improvement can be seen.
TABLE 1 Test results on the COCO dataset