Semi-supervised learning image classification method based on group representation features
1. A semi-supervised learning image classification method based on group representation features, characterized by comprising the following steps:
step one: preprocessing the image dataset;
a portion of the images are used with labels and the remaining images are used without labels; two different data enhancement modes are applied to each image to form two observation views of the same image:
(1) random horizontal flipping, cropping the image to a size of 32 × 32, and finally normalization; the resulting image is called the weakly enhanced image;
(2) random horizontal flipping, cropping the image to a size of 32 × 32, random image enhancement, and finally normalization; the resulting image is called the strongly enhanced image;
only the weakly enhanced image is used for labeled images, while both the weakly enhanced and strongly enhanced images are used for unlabeled images;
step two: constructing two identical WiderResNet classification network models;
the width and depth parameters of the classification network model are 10 and 28 respectively, wherein one classification network model is used as the base model P_base and the other classification network model is used as the empirical model P_exp; parameter optimization is performed using the SGD (stochastic gradient descent) optimization method without Nesterov momentum, with an initial learning rate of 1e-2 and a weight decay parameter of 1e-3, and the learning rate is updated using a cosine learning rate decay strategy;
step three: calculating the classification error of the labeled image on the basic model;
the labeled weakly enhanced image I_L_w is input into the base model P_base to obtain the class prediction distribution q_L_w = P_base(I_L_w) for the input image; according to the label p_b, the classification loss function L_sup of the labeled data is calculated using the cross-entropy loss H:
Wherein, B represents the size of each batch;
step four: optimizing the parameters of the base model using the SGD optimizer;
for the empirical model, based on the model parameters θ_t of the base model P_base, the parameters θ′_t of the empirical model P_exp are updated using a momentum-based weighted average method, where the subscript t denotes the t-th iteration and α is a hyper-parameter:
θ′_t = α·θ′_{t-1} + (1-α)·θ_t (2)
step five: calculating the consistency constraint loss L_consistency and the pseudo-label loss L_pseudo of the unlabeled data part using the empirical model updated in step four;
the unlabeled weakly enhanced image I_uL_w is input into the empirical model P_exp to obtain the empirical feature F_uL_w and the class prediction distribution q_uL_w = P_exp(I_uL_w) for the input image; at the same time, the unlabeled strongly enhanced image I_uL_s is input into the base model P_base to obtain the basic feature F_uL_s and the class prediction distribution q_uL_s = P_base(I_uL_s) for the input image; the empirical feature F_uL_w and the basic feature F_uL_s are the input vectors of the last fully connected layer in the classification network model;
using the class with the highest confidence in q_uL_w as the pseudo-label, the loss function L_usp of the unlabeled data part is obtained using the mean square error loss; wherein W represents a learnable feature mapping matrix, I is an identity matrix, and ε and β are hyper-parameters:
wherein the mask vector, whose size is consistent with the output of H, takes the value 1 at positions satisfying the condition max(q_uL_w) > η and 0 elsewhere; η represents the confidence threshold and q_uL_w represents the class prediction distribution; when the prediction confidence of q_uL_w for a certain class is greater than η, the prediction is adopted;
step six: combining the loss function L_sup of the labeled data obtained in step three and the loss function L_usp of the unlabeled data obtained in step five to obtain the final loss function of the semi-supervised-learning-based classification method; wherein λ is a hyper-parameter representing the weight of the unlabeled data loss:
L = L_sup + λ·L_usp (4)
step seven: training for N complete cycles (epochs), and using the trained base model as the final classifier;
step eight: the new image is classified using the final classifier.
2. The method according to claim 1, wherein in step one, the percentage of labeled images among all images is less than 5%.
3. The semi-supervised learning image classification method based on group representation features as claimed in claim 1, wherein the random image enhancement strategies include contrast enhancement, brightness enhancement, chroma enhancement, sharpness enhancement, maximizing image contrast, image histogram equalization, setting a variable number of bits on the color channels to 0, random rotation, random shearing, and pixel inversion; when the random image enhancement is performed, a strategy is randomly selected for image transformation and its operation parameters are set randomly.
4. The semi-supervised learning image classification method based on group representation features as claimed in claim 1, wherein in step two, the WiderResNet classification network model is learned through a semi-supervised learning training method; WiderResNet is a variant of the residual network ResNet; the semi-supervised learning classification method based on group representation features uses the ResNet50 network model; ResNet50 is divided into 5 stages, the structure of stage 0 consists of a 7 × 7 convolutional layer and a max pooling layer, and the following 4 stages consist of bottleneck layers (BottleNeck); stage 1 contains 3 BottleNecks, and the remaining three stages contain 4, 6 and 3 BottleNecks respectively; each BottleNeck is formed by connecting 1 × 1, 3 × 3 and 1 × 1 convolutional networks in series; by reducing the depth and increasing the width on the basis of ResNet, a new network, WiderResNet, is obtained; WiderResNet increases the number of convolution kernels in each BottleNeck, the increase in width is represented by a width factor parameter, and the larger the width factor the wider the network; in addition, WiderResNet adds a dropout layer (Dropout) between convolutional layers; the convolution kernel of the convolutional layer in stage 1 is changed to 3 × 3; the depth factor of WiderResNet is 28 and the width factor is 10; a global average pooling layer and two fully connected layers are added after stage 4; the first fully connected layer is an intermediate-feature output layer that outputs the intermediate features, which correspond to the empirical model and the base model; the second fully connected layer is a class prediction layer that takes the intermediate features as input, and its output features are converted into class prediction probabilities through a Softmax function; the intermediate features comprise the basic features and the empirical features.
Background
Deep learning models have become standard models for computer vision applications. Their success depends in large part on the existence of large annotated datasets, such as ImageNet and COCO, which provide rich samples of natural scene pictures. Empirically, training on a larger dataset generally results in a better-performing deep model. Such models typically achieve robust performance through supervised learning, which requires labeled data. However, for some tasks it is difficult to collect labeled data: manual labeling may introduce errors due to the subjective judgment of the annotator, or may require expert knowledge, as with medical datasets, resulting in significant cost. In contrast, for most tasks, obtaining unlabeled data is relatively easy.
Semi-supervised learning is an efficient way to train on large-scale datasets without requiring large amounts of labeled data, and it greatly reduces the need for labeled data by allowing models to learn from unlabeled data. Many semi-supervised learning methods add a loss term based on unlabeled data to the objective function, encouraging the model to better generalize to the feature distribution of the unlabeled data. At present, consistency constraints and pseudo labels are the two most common approaches among the many semi-supervised learning methods, and methods that combine the two also exist. The pseudo-label method uses the model's prediction on unlabeled data as the label for that data during training, while the consistency constraint method uses the model's prediction distribution on unlabeled data as the training target. The two methods have different implementation strategies, but both essentially rely on generating artificial labels for unlabeled data.
In this work, we follow the trend of existing SOTA methods and construct a more effective semi-supervised learning classification method by combining a consistency constraint scheme based on group representation features.
Disclosure of Invention
In recent semi-supervised classification approaches, it is common to require model predictions to be invariant to noise in the input samples by training on a large amount of unlabeled data with a consistency constraint. We use covariance matrices to represent the sample space in manifold space to enhance consistency training performance. We find that such a method, combined with a pseudo-label method, yields a more effective semi-supervised learning classification model.
The semi-supervised learning image classification method based on the group representation features comprises the following steps:
step one: preprocessing the image dataset;
a portion of the images are used with labels and the remaining images are used without labels; two different data enhancement modes are applied to each image to form two observation views of the same image:
(1) random horizontal flipping, cropping the image to a size of 32 × 32, and finally normalization; the resulting image is called the weakly enhanced image;
(2) random horizontal flipping, cropping the image to a size of 32 × 32, random image enhancement, and finally normalization; the resulting image is called the strongly enhanced image;
the labeled images use only the weakly enhanced images, while the unlabeled data uses both the weakly enhanced and strongly enhanced images.
Further, the percentage of labeled images in the total number of images is less than 5%.
Further, the random image enhancement strategies include contrast enhancement, brightness enhancement, chroma enhancement, sharpness enhancement, maximizing image contrast, image histogram equalization, setting a variable number of bits on the color channels to 0, random rotation, random shearing, and pixel inversion. When the random image enhancement is performed, a strategy is randomly selected for image transformation and its operation parameters are set randomly.
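As an illustrative sketch of the two enhancement pipelines above (the use of torchvision's RandAugment as the random enhancement strategy and the CIFAR-style normalization statistics are assumptions, since the text only lists candidate operations):

```python
# Sketch of the weak / strong enhancement pipelines described above.
# RandAugment and the normalization statistics are illustrative assumptions.
import torchvision.transforms as T

normalize = T.Normalize(mean=(0.4914, 0.4822, 0.4465),
                        std=(0.2470, 0.2435, 0.2616))

weak_enhance = T.Compose([
    T.RandomHorizontalFlip(),          # random horizontal flip
    T.RandomCrop(32, padding=4),       # crop to 32 x 32
    T.ToTensor(),
    normalize,                         # normalization
])

strong_enhance = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomCrop(32, padding=4),
    T.RandAugment(),                   # random image enhancement (assumed choice)
    T.ToTensor(),
    normalize,
])
```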
Step two: constructing two identical WiderResNet classification network models;
the width and depth parameters of the classification network model are 10 and 28 respectively, wherein one classification network model is used as the base model P_base and the other is used as the empirical model P_exp. Parameter optimization is performed using the SGD (stochastic gradient descent) optimization method without Nesterov momentum, with an initial learning rate of 1e-2 and a weight decay parameter of 1e-3, and the learning rate is updated using a cosine learning rate decay strategy.
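A minimal construction sketch for this step is given below; the WiderResNet constructor signature and the SGD momentum value are assumptions not fixed by the text:

```python
# Sketch of step two: two identical WiderResNet-28-10 models, one optimized
# by SGD (the base model) and one updated only by weighted averaging (the
# empirical model).  The WiderResNet class and its arguments are assumed.
import copy
import torch

def build_models_and_optimizer(num_classes, total_steps):
    base_model = WiderResNet(depth=28, widen_factor=10, num_classes=num_classes)
    exp_model = copy.deepcopy(base_model)          # empirical model P_exp
    for p in exp_model.parameters():
        p.requires_grad_(False)                    # never trained by SGD

    optimizer = torch.optim.SGD(base_model.parameters(),
                                lr=1e-2,           # initial learning rate
                                momentum=0.9,      # assumed momentum value
                                nesterov=False,    # no Nesterov momentum
                                weight_decay=1e-3) # weight decay parameter
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                           T_max=total_steps)
    return base_model, exp_model, optimizer, scheduler
```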
Step three: calculating the classification error of the labeled image on the basic model;
the labeled weakly enhanced image I_L_w is input into the base model P_base to obtain the class prediction distribution q_L_w = P_base(I_L_w) for the input image; according to the label p_b, the classification loss function L_sup of the labeled data is calculated using the cross-entropy loss H:
Where B represents the size of each batch.
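A minimal sketch of this supervised loss (assuming the model returns class logits, with the softmax applied inside the cross-entropy):

```python
# Sketch of step three: cross-entropy loss of the labeled weakly enhanced
# batch on the base model (the classification loss L_sup).
import torch.nn.functional as F

def supervised_loss(base_model, images_labeled_weak, labels):
    logits = base_model(images_labeled_weak)    # q_L_w = P_base(I_L_w)
    return F.cross_entropy(logits, labels)      # mean over the batch of size B
```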
Step four: parameter optimization of base model using SGD optimizer
For the empirical model, based on the model parameters θ_t of the base model P_base, the parameters θ′_t of the empirical model P_exp are updated using a momentum-based weighted average method, where the subscript t denotes the t-th iteration and α is a hyper-parameter:
θ′_t = α·θ′_{t-1} + (1-α)·θ_t (2)
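A sketch of this update (α = 0.97 as in the experiments; copying the batch-norm buffers from the base model is an implementation choice):

```python
# Sketch of step four: momentum-weighted (exponential moving average) update
# of the empirical model, theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t.
import torch

@torch.no_grad()
def update_empirical_model(exp_model, base_model, alpha=0.97):
    for p_exp, p_base in zip(exp_model.parameters(), base_model.parameters()):
        p_exp.mul_(alpha).add_(p_base, alpha=1.0 - alpha)
    for b_exp, b_base in zip(exp_model.buffers(), base_model.buffers()):
        b_exp.copy_(b_base)    # copy running statistics (implementation choice)
```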
and step four, an empirical model construction method. We can represent a data feature generation algorithm as a slaveToIsomorphic mapping off may be linear or non-linear. All f can also constitute a topological manifold. Further we can consider that f is continuous and differentiable, so that all f forms a differential manifold and is a lie group.
For theUpper covariance matrix group sigma, since f is in the unlabeled dataTo generate data samples in a new feature spaceIs shown onThus in a mapping isomorphic with fUnder the action of (3), generating a covariance matrix group sigma'. Since f is generally a non-linear mapping, neither the groups Σ nor Σ' are linearly isomorphic, or are two different linear representations of the same group, according to eigen-theorem, Λ∑≠Λ∑′。
It is a difficult matter to solve the mapping f directly, but we can use a neural network to fit f. So now the problem becomes: and constructing a mapping f, so that a characteristic diagram with distinguishing property and universality can be obtained when label-free data is input into the mapping f. In semi-supervised learning, learning labeled data and unlabeled data at the same time, wherein the purpose of learning the labeled data is to learn a more accurate feature extraction method which is used as a basis for extracting the unlabeled data features. Therefore, the empirical model is obtained on the basis of the basic model, and the parameters of the basic model are subjected to momentum weighted average to update the parameters of the empirical model. The final parameters of the basic model are not directly used, because less labeled data are generally used, the basic model can be converged quickly on the labeled data and reach an overfitting state, and if the final parameters of the basic model are directly used, the generalization capability of the empirical model is influenced, so that the more accurate empirical model can be obtained by averaging the weights of the basic model in the training step.
Step five: calculating the consistency constraint loss L_consistency and the pseudo-label loss L_pseudo of the unlabeled data part using the empirical model updated in step four;
The unlabeled weakly enhanced image I_uL_w is input into the empirical model P_exp to obtain the empirical feature F_uL_w and the class prediction distribution q_uL_w = P_exp(I_uL_w) for the input image; at the same time, the unlabeled strongly enhanced image I_uL_s is input into the base model P_base to obtain the basic feature F_uL_s and the class prediction distribution q_uL_s = P_base(I_uL_s) for the input image. The empirical feature F_uL_w and the basic feature F_uL_s are the input vectors of the last fully connected layer in the classification network model;
Using the class with the highest confidence in q_uL_w as the pseudo-label, the loss function L_usp of the unlabeled data part is obtained using the mean square error loss. Here W denotes a learnable feature mapping matrix, I is an identity matrix, and ε and β are hyper-parameters:
wherein the mask vector, whose size is consistent with the output of H, takes the value 1 at positions satisfying the condition max(q_uL_w) > η and 0 elsewhere. η represents the confidence threshold and q_uL_w represents the class prediction distribution; when the prediction confidence of q_uL_w for a certain class is greater than η, the prediction is adopted;
The main innovation of the invention lies in step five. The weakly enhanced unlabeled data yields the empirical features through the empirical model, and the strongly enhanced unlabeled data yields the basic features through the base model. Under the constraint of consistency-constraint theory and group-representation feature theory, we require the traces of the covariance matrices of the empirical features and the basic features to be as similar as possible; this is our consistency constraint. For the pseudo-label part, we apply a simple linear transformation to the empirical features to predict the class of the sample; this predicted class serves as the pseudo label, and the cross-entropy loss is computed against the prediction obtained from the basic features. The combination of the two parts constitutes our unlabeled data loss.
Step five relies on the group representation method used in the consistency constraint part, described in detail here. We can view the data space as a separable topological space, which becomes a topological manifold after a metric is defined. The general feature of the data space is, statistically, the distribution characteristic of the data over the data space; this feature has certain symmetries, including translation invariance, rotation invariance, and so on. We therefore represent a batch of unlabeled data as a matrix in R^{B×D} (B denotes the batch size and D denotes the feature dimension); what characterizes these symmetries, i.e. the distribution of the data space, is the covariance matrix. The covariance matrices of different sample batches in this space form a group Σ under addition; according to the character theory of group representations, Σ is a linear group. For a group element σ ∈ Σ, the trace of the matrix is tr(σ), and for different groups Σ we obtain a function Λ_Σ of the matrix traces, called the character of the group Σ. Thus, we can represent the general features of the unlabeled data space as the character function of the covariance matrix group Σ.
The consistency constraint loss described in step five is computed as follows. In the learning process on the unlabeled data, a batch of weakly enhanced data obtains its empirical feature representation under the mapping F, i.e. the empirical model, and correspondingly a covariance matrix ω of the empirical features; the strongly enhanced data of the batch obtains its basic feature representation under the mapping F′, i.e. the base model, and correspondingly a covariance matrix ω′ of the basic features. The consistency constraint requires the two covariance matrices to be as similar as possible, i.e. we can require the Log-Euclidean distance d between the two covariance matrices to be as small as possible:
d(ω, ω′) = ||log(tr(ω)) − log(tr(ω′))||_F (6)
If the consistency constraint is computed directly with formula (6), the covariance matrices must be computed from the two feature representations and their traces then taken. This adds unnecessary computation, so we use another, simpler equivalent calculation for the consistency constraint. First, the empirical features F_uL_w and the basic features F_uL_s still need to be computed. Based on the group representation method of features, we can consider F_uL_w and F_uL_s to belong to different data spaces, and consider that there exists a learnable feature mapping matrix W that can map the basic features into the empirical feature space so that the two are as close as possible in the same feature space; thus we obtain equation (7):
Equation (7) requires F_uL_w and F_uL_s to approach each other during optimization; in the ideal case of perfect optimization the two would eventually become equal. However, the feature vectors of the same batch are obtained through different data enhancement methods, so requiring exact equality is an unrealistically strong constraint. We therefore relax this condition by adding a small bias ε, obtaining equation (8):
Therefore, our final optimization objective for the consistency constraint is shown in equation (9):
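Since the exact arrangement of W, I and ε in equations (7)–(9) admits more than one reading, the sketch below shows one plausible instantiation under stated assumptions: the basic features are mapped by (W + ε·I) into the empirical feature space and compared by mean square error; the direct trace-based distance of equation (6) is included only for comparison. The class, function and buffer names are illustrative.

```python
# Sketch of the consistency constraint of step five.  The loss form
# || (W + eps*I) F_uL_s - F_uL_w ||^2 is an assumed reading of eqs. (7)-(9).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConsistencyLoss(nn.Module):
    def __init__(self, feat_dim, eps=0.25):
        super().__init__()
        # learnable feature mapping matrix W (D x D); zero init so that the
        # initial mapping is eps * I (an illustration choice)
        self.W = nn.Parameter(torch.zeros(feat_dim, feat_dim))
        self.register_buffer("eye", torch.eye(feat_dim))
        self.eps = eps

    def forward(self, feat_basic, feat_empirical):
        # feat_basic     : F_uL_s, base model on strongly enhanced images (B x D)
        # feat_empirical : F_uL_w, empirical model on weakly enhanced images (B x D)
        mapped = feat_basic @ (self.W + self.eps * self.eye).t()
        # the empirical model is not trained by gradient descent, so its
        # features act as a fixed target here
        return F.mse_loss(mapped, feat_empirical.detach())

def trace_distance(feat_a, feat_b):
    # Direct computation in the spirit of equation (6): Log-Euclidean distance
    # between the traces of the two covariance matrices (for comparison only).
    cov_a = torch.cov(feat_a.t())
    cov_b = torch.cov(feat_b.t())
    return (torch.log(torch.trace(cov_a)) - torch.log(torch.trace(cov_b))).abs()
```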
The pseudo-label loss described in step five is computed as follows. For the unlabeled data, we compute an artificial label for each sample, which is used to compute the standard cross-entropy on the unlabeled data. To derive the artificial label, we resort to the previously obtained empirical model. First, the class prediction distribution of the empirical model on the weakly enhanced data, q_uL_w = P_exp(I_uL_w), and that of the base model on the strongly enhanced data, q_uL_s = P_base(I_uL_s), are computed; then the most confident class of q_uL_w can be used as the pseudo label. At this point we can already derive the required pseudo-label objective function, as shown in equation (10):
where η is a scalar hyper-parameter representing a threshold; we retain as pseudo labels only those predictions whose probability is above the threshold.
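A minimal sketch of this pseudo-label loss (η = 0.95 as in the experiments; the function names are illustrative):

```python
# Sketch of the pseudo-label loss of step five: the empirical model's
# prediction on the weakly enhanced image gives the pseudo label, the base
# model's prediction on the strongly enhanced image is trained against it,
# and a confidence mask keeps only predictions above the threshold eta.
import torch
import torch.nn.functional as F

def pseudo_label_loss(logits_weak_exp, logits_strong_base, eta=0.95):
    probs_weak = torch.softmax(logits_weak_exp.detach(), dim=-1)   # q_uL_w
    max_probs, pseudo_labels = probs_weak.max(dim=-1)
    mask = (max_probs > eta).float()                               # mask vector
    loss = F.cross_entropy(logits_strong_base, pseudo_labels, reduction="none")
    return (mask * loss).mean()
```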
Step six: combining the loss function L_sup of the labeled data obtained in step three and the loss function L_usp of the unlabeled data obtained in step five, the final loss function of the classification method based on semi-supervised learning is obtained. Here λ is a hyper-parameter representing the weight of the unlabeled data loss:
L = L_sup + λ·L_usp (4)
where the first part is the classification loss of the labeled data and the second part is the loss of the unlabeled data. This corresponds to steps three to six of claim 1: step three computes the loss function of the labeled data, step five computes the loss function of the unlabeled data, and step four describes the optimization process of the model parameters.
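Pulling the pieces together, one training iteration might look like the sketch below, reusing the helper sketches above; λ = 20.0 and β = 1.0 follow the reported experiments, while the model's `return_features` interface and the exact ordering of the optimizer step and the empirical-model update are implementation assumptions.

```python
# Sketch of one training iteration combining steps three to six.
import torch

def train_step(base_model, exp_model, consistency_loss, optimizer, scheduler,
               labeled_batch, unlabeled_batch, lambda_u=20.0, beta=1.0):
    x_l_weak, y_l = labeled_batch
    x_u_weak, x_u_strong = unlabeled_batch

    # step three: supervised loss on the base model
    l_sup = supervised_loss(base_model, x_l_weak, y_l)

    # step five: unlabeled losses (empirical model is evaluated without grad)
    with torch.no_grad():
        feat_w, logits_w = exp_model(x_u_weak, return_features=True)
    feat_s, logits_s = base_model(x_u_strong, return_features=True)
    l_usp = consistency_loss(feat_s, feat_w) \
            + beta * pseudo_label_loss(logits_w, logits_s)

    # step six: total loss, L = L_sup + lambda * L_usp
    loss = l_sup + lambda_u * l_usp

    optimizer.zero_grad()     # the learnable mapping W inside consistency_loss
    loss.backward()           # is assumed to be in the optimizer's param groups
    optimizer.step()
    scheduler.step()

    # step four: momentum-weighted update of the empirical model
    update_empirical_model(exp_model, base_model)
    return loss.item()
```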
Step seven: training for N complete cycles (epochs), and using the trained base model as the final classifier;
step eight: the new image is classified using the final classifier.
In step two, the classification network model is learned using a semi-supervised learning training method. Semi-supervised learning is an effective way to train on large-scale datasets without requiring a large amount of labeled data; it greatly reduces the need for labeled data by allowing the model to learn from unlabeled data. Semi-supervised learning methods typically add a loss term based on unlabeled data to the objective function, encouraging the model to better generalize to the feature distribution of the unlabeled data. Consistency constraints and pseudo labels are the two most common semi-supervised learning methods. The pseudo-label method uses the model's prediction on unlabeled data as the label for that data during training, while the consistency constraint method uses the model's prediction distribution on unlabeled data as the training target. The two methods have different implementation strategies, but both require learning from different views of the same image; therefore, corresponding to step one, we perform weak enhancement and strong enhancement on all images to obtain different views of the same image.
In step two, WiderResNet is a variant of the residual network ResNet. Conventional convolutional or fully connected networks suffer, to varying degrees, from information loss during transmission, as well as from vanishing or exploding gradients, which prevent very deep networks from being trained. The residual network ResNet solves this problem to some extent: by directly bypassing the input to the output, the integrity of the information is protected, and the network only needs to learn the difference between input and output, which simplifies the learning objective and difficulty. The most distinctive feature of ResNet is the many bypasses connecting the input directly to later layers, a structure also known as a shortcut or skip connection. The invention uses the ResNet50 network model. ResNet50 is divided into 5 stages; stage 0 has a simple structure, consisting of a 7 × 7 convolutional layer and a max pooling layer, which is equivalent to preprocessing the input image. The remaining 4 stages are all composed of bottleneck layers (BottleNeck) with similar structures. Stage 1 contains 3 BottleNecks, and the remaining three stages contain 4, 6 and 3 BottleNecks respectively. Each BottleNeck is formed by connecting 1 × 1, 3 × 3 and 1 × 1 convolutional networks in series. The skip connections of ResNet mean that only a small number of BottleNecks learn useful information, so a new network, WiderResNet, is obtained by reducing the depth and increasing the width on the basis of ResNet. WiderResNet increases the number of convolution kernels in each BottleNeck; the increase in width is represented by a width factor parameter, and the larger the width factor, the wider the network. In addition, WiderResNet adds a dropout layer between the convolutional layers. In the present invention, we change the convolution kernel of the convolutional layer in stage 1 to 3 × 3. The depth factor of WiderResNet is 28 and the width factor is 10. A global average pooling layer and two fully connected layers are added after stage 4. The first fully connected layer is an intermediate-feature output layer that outputs the intermediate features, which correspond to the empirical model and the base model; the second fully connected layer is a class prediction layer that takes the intermediate features as input, and its output features are converted into class prediction probabilities through a Softmax function.
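A sketch of the classification head described above (the backbone builder, the intermediate feature dimension and the `return_features` flag are assumptions; 640 is the output channel count of a depth-28, width-factor-10 wide residual network):

```python
# Sketch of the head: global average pooling, a first fully connected layer
# producing the intermediate (basic / empirical) feature, and a second fully
# connected layer producing the class prediction.  Dropout layers sit between
# convolutions inside the backbone, as described in the text.
import torch.nn as nn

class WiderResNetClassifier(nn.Module):
    def __init__(self, backbone, channels=640, feat_dim=256, num_classes=10):
        super().__init__()
        self.backbone = backbone                        # stages 0-4
        self.pool = nn.AdaptiveAvgPool2d(1)             # global average layer
        self.fc_feat = nn.Linear(channels, feat_dim)    # intermediate-feature layer
        self.fc_cls = nn.Linear(feat_dim, num_classes)  # class-prediction layer

    def forward(self, x, return_features=False):
        h = self.pool(self.backbone(x)).flatten(1)
        feat = self.fc_feat(h)                 # basic / empirical feature
        logits = self.fc_cls(feat)             # softmax is applied in the loss
        return (feat, logits) if return_features else logits
```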
Compared with the prior art, the beneficial effects of the present disclosure are:
1. the method requires fewer labeled samples;
2. the method improves the accuracy of the semi-supervised classification method through the feature representation method based on group representation.
Drawings
FIG. 1 is a diagram of a model learning process;
FIG. 2 shows the results of the model hyper-parameter selection experiments: (a) error rate versus the bias in the consistency constraint on CIFAR-10; (b) error rate versus the bias in the consistency constraint on CIFAR-100; (c) error rate versus the bias in the consistency constraint on SVHN; (d) error rate versus the confidence threshold on CIFAR-10; (e) error rate versus the confidence threshold on CIFAR-100; (f) error rate versus the confidence threshold on SVHN.
Detailed Description
1. Group representation method of features
We can view the data space as a separable topological space, which becomes a topological manifold after a metric is defined. The general feature of the data space is, statistically, the distribution characteristic of the data over the data space; this feature has certain symmetries, including translation invariance, rotation invariance, and so on. We therefore represent a batch of unlabeled data as a matrix in R^{N×D} (N denotes the batch size and D denotes the feature dimension); what characterizes these symmetries, i.e. the distribution of the data space, is the covariance matrix. The covariance matrices of different sample batches in this space form a group Σ under addition; according to the character theory of group representations, Σ is a linear group. For a group element σ ∈ Σ, the trace of the matrix is tr(σ), and for different groups Σ we obtain a function Λ_Σ of the matrix traces, called the character of the group Σ. Thus, we can represent the general features of the unlabeled data space as the character function of the covariance matrix group Σ.
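As a purely numerical illustration of the quantities named above (not an additional training component), the covariance matrix of a feature batch and its trace can be computed as follows; the function name is illustrative:

```python
# Illustrative sketch: covariance matrix of a batch of features X in R^(N x D)
# and its trace, the statistic used to build the character function Lambda_Sigma.
import torch

def covariance_trace(features):
    # features: (N, D); torch.cov expects variables in rows, so transpose
    cov = torch.cov(features.t())     # (D, D) covariance matrix sigma
    return torch.trace(cov)           # tr(sigma)

# usage: compare the statistics of two enhanced views of the same batch
# d = (covariance_trace(feats_weak).log() - covariance_trace(feats_strong).log()).abs()
```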
2. Empirical model
We can represent a data feature generation algorithm as an isomorphic mapping f from the data space to the feature space; f may be linear or non-linear. The set of all such f also constitutes a topological manifold. Further, we can consider f to be continuous and differentiable, so that the set of all f forms a differentiable manifold and is a Lie group.
For the covariance matrix group Σ on the data space, since f acts on the unlabeled data to generate data samples in a new feature space, a covariance matrix group Σ′ is generated under the action of a mapping isomorphic to f. Since f is generally a non-linear mapping, the groups Σ and Σ′ are not linearly isomorphic; in other words, they are two different linear representations of the same group, and according to character theory, Λ_Σ ≠ Λ_Σ′.
Solving for the mapping f directly is difficult, but we can use a neural network to fit f. The problem therefore becomes: construct a mapping f such that, when unlabeled data is input, a discriminative and general feature map is obtained. In semi-supervised learning, labeled and unlabeled data are learned at the same time; the purpose of learning the labeled data is to learn a more accurate feature extraction method, which serves as the basis for extracting features from the unlabeled data. Therefore, the empirical model is obtained on the basis of the base model, and is constructed by averaging the weights of the base model. The final parameters of the base model are not used directly because, with relatively little labeled data, the base model converges quickly on the labeled data and reaches an over-fitted state; using its final parameters directly would hurt the generalization ability of the empirical model. A more accurate empirical model is therefore obtained by averaging the weights of the base model over the training steps.
3. Mixed loss function
We divide the loss function of the classification model into two parts: the loss function L_sup for the labeled data and the loss function L_usp for the unlabeled data. L_sup is simply the standard cross-entropy loss on the labeled data, while L_usp consists of two parts, the consistency constraint loss and the pseudo-label loss.
4. Consistency constraint loss
In the learning process on the unlabeled data, a batch of weakly enhanced data obtains its empirical feature representation under the mapping F, i.e. the empirical model, and correspondingly a covariance matrix ω of the empirical features; the strongly enhanced data of the batch obtains its basic feature representation under the mapping F′, i.e. the base model, and correspondingly a covariance matrix ω′ of the basic features. The consistency constraint requires the two covariance matrices to be as similar as possible, i.e. we can require the Log-Euclidean distance between the two covariance matrices to be as small as possible:
if the consistency constraint is calculated by directly using the formula (1), the covariance matrix is respectively calculated on the basis of obtaining two feature representations, and then the trace of the covariance matrix is calculated. This approach adds unnecessary computation, so we use another simpler equivalent calculation to compute the consistency constraint. First, there is still a need to compute empirical characterizationAnd a basic feature representationAccording to the feature group representation method, phi and psi can be considered to belong to different data spaces, and a feature mapping matrix is considered to existThe basis features can be mapped into the empirical feature space such that they are as close as possible in the same feature space, so we get equation (2):
equation (1) requires that phi and psi get closer and closer in the optimization process, and if there is a perfect optimization case, both should become equal finally, but the data of the same batch as both get the feature vector by adding different data enhancement methods, making both equal is a difficult to realize and too strong constraint condition, so we neutralize this case by adding a smaller bias epsilon, which gets the equation (3):
Therefore, our final optimization objective for the consistency constraint is shown in equation (4):
5. pseudo label
For unlabeled data, we compute an artificial label for each sample, which is used to compute the standard cross-entropy on the unlabeled data. To derive the artificial label, we resort to the previously obtained empirical model. First, the class prediction distribution of the empirical model on the weakly enhanced data, q_b = P_exp(y | α(u_b)), is computed; then the most confident class of q_b can be used as a pseudo label. At this point we can already derive the required pseudo-label objective function, as shown in equation (5):
where η is a scalar hyper-parameter representing a threshold; we retain as pseudo labels only those predictions whose probability is above the threshold.
Practical case
1. Standard data set
First, we compare the performance of the present method with existing methods on the semi-supervised learning benchmark datasets (CIFAR-10, CIFAR-100, SVHN). CIFAR-10 contains 50,000 training pictures and 10,000 test pictures in 10 categories. CIFAR-100 contains 50,000 training pictures and 10,000 test pictures in 100 categories. SVHN contains 73,257 training pictures and 26,032 test pictures in 10 categories.
2. Experimental Environment and parameter settings
The experiments were run on a machine with an Intel i7-5930K CPU, 32 GB of memory, and a GeForce GTX 1080 Ti graphics card with 11 GB of video memory. The model was written in PyTorch; SGD without Nesterov momentum was used as the optimization method, with an initial learning rate of 1e-2, a weight decay of 1e-3, and a cosine learning rate decay strategy. For the hyper-parameters, on CIFAR-10 and SVHN, λ = 20.0, η = 0.95, α = 0.97, ε = 0.25, β = 1.0; on CIFAR-100, λ = 20.0, η = 0.95, α = 0.97, ε = 0.15, β = 1.0.
3. Results of the experiment
On CIFAR-10, the method of the invention achieves the lowest error rate of 3.55% when 400 labeled pictures are used per category. ReMixMatch, UDA and FixMatch obtain strong results with 250 and 500 labeled pictures, but the method of the invention obtains more stable and excellent results: with only 250 labeled pictures its error rate is just 0.01% higher than that of FixMatch while being better than the other methods, and it gives the best results in all other cases. On the SVHN dataset, the method of the invention achieves the lowest error rate of 2.27% when 400 labeled pictures are used per category. Our method achieves the best performance in three of the settings; in the other two experiments it achieves the second-best performance, with an error rate only slightly higher than the best case.
On CIFAR-100, we find that the method of the invention also achieves very satisfactory performance, although ReMixMatch yields the best performance on CIFAR-100 in all cases.
Table (1): error rate for 5 different unlabeled data usage numbers on CIFAR-10.
All reference models are tested using the same code base
Table (2): error rate at different numbers of label-free data usage in 4 on CIFAR-100.
All reference models were tested using the same code base.
Table (3): error rate at different numbers of unlabeled data usage on SVHN 5.
All reference models were tested using the same code base.