Image Chinese description system and method based on multistage strategy and deep reinforcement learning framework

Document No.: 8432 | Publication date: 2021-09-17

1. The image Chinese description system based on the multilevel strategy and the deep reinforcement learning framework is characterized in that: the system comprises an image feature extraction module, a multi-stage strategy network module, a multi-stage reward network module, a reinforcement learning training module and a sentence generation module;

the image feature extraction module is responsible for pre-training image information and then transmitting the pre-training image information to the multi-stage strategy network module;

the multi-stage strategy network module converts the characteristic vectors into a matrix and then sends the matrix to the multi-stage reward network module;

the multi-stage reward network module is responsible for outputting the image information to the reinforcement learning training module after deep learning;

the information processed by the multi-stage strategy network module and the multi-stage reward network module is trained together through the reinforcement learning training module and output to the sentence generating module to complete the Chinese sentence description of the image.

2. The image Chinese description system based on the multilevel strategy and the deep reinforcement learning framework as claimed in claim 1, wherein: the image information uses the training set pictures of the standard image Chinese description data set used in the AI Challenger competition.

3. The image Chinese description system based on the multi-level strategy and the deep reinforcement learning framework as claimed in claim 2, wherein: the multi-level policy network module includes word-level policy and sentence-level policy functions.

4. The image Chinese description method based on the multilevel strategy and the deep reinforcement learning framework is realized on the basis of the system of any one of claims 1 to 3, and is characterized in that: the method comprises the following specific steps:

step one, extracting image features by adopting a ResNet152 convolutional neural network;

step two, obtaining an image feature mapping vector of a multi-level joint strategy part by adopting a multi-level joint strategy;

step three, calculating the weights through a multi-stage reward network and generating a mapping layer, so as to obtain an image feature mapping vector of the multi-stage reward part;

step four, performing joint learning on the image feature mapping vectors obtained in step two and step three through reinforcement learning training to generate a global feature vector of the image;

step five, converting the global feature vector of the image into a Chinese sentence through the sentence generation module to complete the Chinese description of the image.

5. The image Chinese description method based on the multi-level strategy and the deep reinforcement learning framework as claimed in claim 4, wherein: in step one, the process of extracting image features is refined as follows:

step 1.1, pre-training a ResNet152 network on the ImageNet image classification data set;

step 1.2, storing the weight coefficients after pre-training;

step 1.3, transferring the trained weight and bias parameters to the ResNet152 network;

step 1.4, inputting the training set pictures into the ResNet152 network for feature extraction, wherein the training set pictures are normalized to 256 × 3;

step 1.5, performing convolution and pooling calculations on each picture by the ResNet152 network according to the pre-trained weight coefficients to obtain the output of the adaptive average pooling layer;

step 1.6, outputting a 2048-dimensional high-level feature vector for each picture.

6. The image Chinese description method based on the multi-level strategy and the deep reinforcement learning framework as claimed in claim 4, wherein: the parameters of the multi-level policy network comprise parameters of a word-level policy and parameters of a sentence-level policy;

the word-level policy refers to an image Chinese description network, and specifically comprises the following steps:

step 2.1, extracting features from an input image by using a CNN;

step 2.2, embedding the features using a linear mapping, wherein words are represented by one-hot vectors that are embedded into the same dimension as the mapped image features, the beginning of each sentence is marked with a special BOS token, and the end of each sentence is marked with an EOS token; under this policy, words are generated;

step 2.3, inputting each generated word back into the RNN-based module, wherein the image feature I is regarded as the first word; finally, the hidden states and cells of the network are updated, and the RNN-based module outputs a distribution over all words;

the sentence-level policy is a visual semantic embedding network, which maps the image features and sentences into a common embedding space and measures the similarity between them;

and finally, the dimensionality of the image feature vector and the word feature vector is unified by mapping the 2048-dimensional feature vector of the image to a 512-dimensional word vector feature space to obtain the final image feature mapping vector.

7. The image Chinese description method based on the multi-level strategy and the deep reinforcement learning framework as claimed in claim 4, wherein: the third step is detailed as follows:

step 3.1, establishing the combination of the vision-language reward and the language-language reward, and fusing the word-level and sentence-level policies;

step 3.2, using the image-sentence pairs in the image Chinese description data set;

step 3.3, learning the RNN weights and the mapping layers by using a bidirectional ranking loss;

step 3.4, outputting the image feature mapping vector of the multi-stage reward network part.

8. The image Chinese description method based on the multilevel strategy and the deep reinforcement learning framework according to claim 6, characterized in that: in step four, the reinforcement learning training comprises the following specific steps:

step 4.1, pre-training the word-level policy and the vision-language reward in their equations by minimizing the negative expected combined reward;

step 4.2, jointly training the parameters of the multi-level policy network using the equation, so as to obtain a baseline.

9. The image Chinese description method based on the multi-level strategy and the deep reinforcement learning framework as claimed in claim 4, wherein: the sentence generation module decodes and generates Chinese sentences by constructing a double-layer GRU network model, wherein the GRU network model is an improvement on the LSTM network that merges the forget gate and the input gate;

the GRU network model comprises a reset gate and an update gate, and is used for recording the hidden layer state at the previous moment and the hidden layer state at the current moment and updating the hidden state.

10. The image Chinese description method based on the multilevel strategy and the deep reinforcement learning framework of claim 9 is characterized in that: in the fifth step, the process of generating the Chinese sentence by the global feature vector of the image is detailed as follows:

step 5.1, controlling, through the reset gate, the degree to which the hidden layer information at the previous moment is forgotten, and capturing short-term dependencies in the sequence data;

step 5.2, controlling, through the update gate, the degree to which the hidden layer state information at the previous moment is brought into the hidden layer at the current moment, and capturing long-term dependencies in the sequence data;

step 5.3, filtering the hidden layer state information at the previous moment by using the reset gate;

step 5.4, outputting the global feature vector of the image along the information flow of the GRU network model, and finally converting it into a Chinese sentence.

Background

Scholars at home and abroad have studied image Chinese description methods and have achieved certain results. Traditional image description methods fall into two kinds: template-matching-based and retrieval-based. Although both kinds can generate a description for an input picture, the Chinese description sentences they produce are monotonous, lack diversity, and depend on large-scale training corpora. In view of these problems, many researchers have approached the image description task with deep learning. Mao et al. proposed a multimodal Recurrent Neural Network (m-RNN) method for generating image descriptions. The network comprises two sub-networks: a deep recurrent neural network for encoding text and a deep convolutional neural network for extracting image features. The two sub-networks interact through a multimodal layer to form the whole m-RNN network. Vinyals et al. proposed the Neural Image Caption (NIC) model, which consists of a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). The model uses a Google Inception network as the image feature extractor and a Long Short-Term Memory (LSTM) network as the text codec. However, most scholars have focused their experimental improvements on optimizing the RNN, and few have worked on image feature extraction and text preprocessing, neglecting the important influence that the quality of the extracted image features and of the text vectorization has on the finally generated description sentences.
In terms of text preprocessing, because conventional vector representations cannot characterize polysemy, some scholars have proposed applying pre-trained language models to word representation. For example, Rei proposed using a word-level language model structure to enhance NER training, and Devlin et al. proposed pre-training models using a bidirectional Transformer language structure. Existing methods mostly segment the Chinese description set with a Chinese word segmentation tool to obtain semantic information among words, and cannot model the polysemy of characters.

Image Chinese description is a sequential word prediction task. State-of-the-art methods generally follow an encoder-decoder framework: they use a Convolutional Neural Network (CNN) to encode the image into a visual embedding vector, and then use a Recurrent Neural Network (RNN) to decode the vector into a sentence. During training and inference, they try to maximize the probability of the next word given the current prediction context. Recent studies have shown that Reinforcement Learning (RL) suits this task, because RL aims to learn a policy that chooses sequential actions by maximizing the cumulative future reward. RL can therefore help explore richer language during sentence generation and avoid severe bias toward the training samples. However, existing RL-based image Chinese description approaches rely primarily on a single policy network and a single reward function, which do not match well with the multiple levels (words and sentences) and multiple modalities (vision and language) of the task.

There is therefore a need for a novel multi-level policy and reward reinforcement learning framework for image Chinese description that can integrate RNN-based models, language metrics, or visual semantic functions for optimization. In particular, the multi-level policy network jointly updates the word-level and sentence-level policies to generate words, while the multi-level reward function guides the policies with vision-language and language-language rewards together.

Disclosure of Invention

In order to solve the current problems that the generated Chinese words and sentences do not match the image well and that polysemy cannot be modeled, the invention provides an image Chinese description method based on a multi-level strategy and a deep reinforcement learning framework. The technical scheme of the invention is as follows:

the first scheme is as follows: the image Chinese description system based on the multi-stage strategy and the deep reinforcement learning framework comprises an image characteristic extraction module, a multi-stage strategy network module, a multi-stage reward network module, a reinforcement learning training module and a sentence generation module;

the image feature extraction module is responsible for pre-training image information and then transmitting the pre-training image information to the multi-stage strategy network module;

the multi-stage strategy network module converts the characteristic vectors into a matrix and then sends the matrix to the multi-stage reward network module;

the multi-stage reward network module is responsible for outputting the image information to the reinforcement learning training module after deep learning;

the information processed by the multi-stage strategy network module and the multi-stage reward network module is trained together through the reinforcement learning training module and output to the sentence generating module to complete the Chinese sentence description of the image.

Further, the image information uses the training set pictures of the standard image Chinese description data set used in the AI Challenger competition.

Further, the multi-level policy network module includes word-level policy and sentence-level policy functionality.

Scheme II: the image Chinese description method based on the multilevel strategy and the deep reinforcement learning framework is realized on the basis of the system, and the method comprises the following specific steps:

step one, extracting image features by adopting a ResNet152 convolutional neural network;

step two, obtaining an image feature mapping vector of a multi-level joint strategy part by adopting a multi-level joint strategy;

step three, calculating the weights through a multi-stage reward network and generating a mapping layer, so as to obtain an image feature mapping vector of the multi-stage reward part;

step four, performing joint learning on the image feature mapping vectors obtained in step two and step three through reinforcement learning training to generate a global feature vector of the image;

step five, converting the global feature vector of the image into a Chinese sentence through the sentence generation module to complete the Chinese description of the image.

Further, in step one, the process of extracting image features is refined as follows:

step 1.1, pre-training a ResNet152 network on the ImageNet image classification data set;

step 1.2, storing the weight coefficients after pre-training;

step 1.3, transferring the trained weight and bias parameters to the ResNet152 network;

step 1.4, inputting the training set pictures into the ResNet152 network for feature extraction, wherein the training set pictures are normalized to 256 × 3;

step 1.5, performing convolution and pooling calculations on each picture by the ResNet152 network according to the pre-trained weight coefficients to obtain the output of the adaptive average pooling layer;

step 1.6, outputting a 2048-dimensional high-level feature vector for each picture.

Further, the parameters of the multi-level policy network include parameters of a word-level policy and parameters of a sentence-level policy;

the word-level policy refers to an image Chinese description network, and specifically comprises the following steps:

step 2.1, extracting features from an input image by using a CNN;

step 2.2, embedding the features using a linear mapping, wherein words are represented by one-hot vectors that are embedded into the same dimension as the mapped image features, the beginning of each sentence is marked with a special BOS token, and the end of each sentence is marked with an EOS token; under this policy, words are generated;

step 2.3, inputting each generated word back into the RNN-based module, wherein the image feature I is regarded as the first word; finally, the hidden states and cells of the network are updated, and the RNN-based module outputs a distribution over all words;

the sentence-level policy is a visual semantic embedding network, which maps the image features and sentences into a common embedding space and measures the similarity between them;

and finally, the dimensionality of the image feature vector and the word feature vector is unified by mapping the 2048-dimensional feature vector of the image to a 512-dimensional word vector feature space to obtain the final image feature mapping vector.

Further, the third step is subdivided into:

step 3.1, establishing the combination of the vision-language reward and the language-language reward, and fusing the word-level and sentence-level policies;

step 3.2, using the image-sentence pairs in the image Chinese description data set;

step 3.3, learning the RNN weights and the mapping layers by using a bidirectional ranking loss;

step 3.4, outputting the image feature mapping vector of the multi-stage reward network part.

Further, in step four, the reinforcement learning training specifically comprises the following steps:

step 4.1, pre-training the word-level policy and the vision-language reward in their equations by minimizing the negative expected combined reward;

step 4.2, jointly training the parameters of the multi-level policy network using the equation, so as to obtain a baseline.

Further, the sentence generation module decodes and generates Chinese sentences by constructing a double-layer GRU network model, wherein the GRU network model is an improvement on the LSTM network that merges the forget gate and the input gate;

the GRU network model comprises a reset gate and an update gate, and is used for recording the hidden layer state at the previous moment and the hidden layer state at the current moment and updating the hidden state.

Further, in step five, the process of generating the Chinese sentence from the global feature vector of the image is detailed as follows:

step 5.1, controlling, through the reset gate, the degree to which the hidden layer information at the previous moment is forgotten, and capturing short-term dependencies in the sequence data;

step 5.2, controlling, through the update gate, the degree to which the hidden layer state information at the previous moment is brought into the hidden layer at the current moment, and capturing long-term dependencies in the sequence data;

step 5.3, filtering the hidden layer state information at the previous moment by using the reset gate;

step 5.4, outputting the global feature vector of the image along the information flow of the GRU network model, and finally converting it into a Chinese sentence.

The invention has the beneficial effects that:

the invention provides a multi-level policy and reward deep reinforcement learning framework for image Chinese text description, which exploits the multi-level and multi-modal properties of the task: the multi-level policy network jointly updates the word-level and sentence-level policies to generate words, and the multi-level reward function uses the vision-language and language-language rewards together to guide the policies;

compared with the traditional single-stage strategy framework, the accuracy of the provided multi-stage strategy framework is improved by about 11%, the objective evaluation index BLEU-4 is improved by 0.05, and the model training time is also shortened by half on the basis of ensuring the integrity of the generated text sentences;

the method provided by the invention can generate the Chinese text description sentences which are more matched with the images, realizes the function of automatically generating the Chinese text description of the images, obviously improves the problems of insufficient diversity of semantic effect generation and insufficient sentence description, and has good improvement on the accuracy of the sentence description content.

Compared with the traditional single-stage strategy model, the method has better stability in training, simple model structure and stronger generalization capability of the framework, can be integrated with more new algorithms, and lays a foundation for image Chinese text description and computer vision development.

Drawings

FIG. 1 is a model framework diagram of a multi-modal fusion emotion recognition method based on a multi-task learning and attention mechanism;

FIG. 2 is a diagram illustrating a parameter sharing mechanism for multi-task learning;

FIG. 3 is a diagram illustrating the semantic representation of text extracted by BERT;

FIG. 4 is a schematic view of an attention mechanism configuration;

FIG. 5 is a schematic view of modality fusion;

in order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

Detailed Description

The first embodiment is as follows: the image Chinese description system based on the multi-stage strategy and deep reinforcement learning framework comprises an image feature extraction module, a multi-stage strategy network module, a multi-stage reward network module, a reinforcement learning training module and a sentence generation module; the image feature extraction module is responsible for pre-training image information and then transmitting the pre-training image information to the multi-stage strategy network module; the multi-stage strategy network module converts the characteristic vectors into a matrix and then sends the matrix to the multi-stage reward network module; the multi-stage reward network module is responsible for outputting the image information to the reinforcement learning training module after deep learning;

the information processed by the multi-stage strategy network module and the multi-stage reward network module is trained together through the reinforcement learning training module, and the sentence generation module outputs sentences to complete image Chinese description.

Preferably, the image information uses the training set pictures of the standard image Chinese description data set used in the AI Challenger competition;

preferably, the training set pictures are normalized to 256 × 3 to obtain a 2048-dimensional high-level feature vector for each picture, and the multi-level policy network module comprises word-level policy and sentence-level policy functions.

The second embodiment is as follows: in addition to the system provided by the first embodiment, this embodiment provides an image Chinese description method based on a multi-level strategy and a deep reinforcement learning framework, together with the subsequent experimental demonstration. The specific steps and processes are as follows:

1.1, image feature extraction:

ResNet is a deep convolutional network that can reach hundreds of layers. Residual learning lets the network grow deeper without degrading its performance, uses fewer parameters, and speeds up model training. For image feature extraction, this embodiment adopts the ResNet152 convolutional neural network, which has fewer parameters than the VGGNet model and trains faster than an ordinary deep neural network, with a clearly observable effect. The network has 152 layers, of which 150 are organized as 50 residual blocks of 3 layers each; the network structure is shown in figure 1.

The process of extracting global image features with the ResNet152 network is as follows. First, the ResNet152 network is pre-trained on the ImageNet image classification data set, and the pre-trained weight coefficients are stored. Next, the trained weight and bias parameters are transferred to the ResNet152 network, and the training set pictures of the standard image Chinese description data set used in the AI Challenger competition are sent to the network for feature extraction; all pictures sent to the network are normalized to 256 × 3. The convolutional neural network then applies a series of operations such as convolution and pooling to the pictures according to the pre-trained weight coefficients. The output of the final adaptive average pooling layer of the ResNet152 network gives a 2048-dimensional high-level feature vector for each picture, and these feature vectors are stored.
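The final pooling step can be illustrated with a minimal sketch. It assumes that the last convolutional stage yields a C × H × W feature map and that adaptive average pooling to a 1 × 1 output simply averages each channel. The function name `global_average_pool` and the toy sizes are illustrative assumptions, not the patent's implementation; for ResNet152, C would be 2048.

```python
def global_average_pool(feature_map):
    """Average each channel of a C x H x W feature map down to one value.

    feature_map: list of C channels, each a list of H rows of W floats.
    Returns a list of C floats (the pooled feature vector).
    """
    pooled = []
    for channel in feature_map:
        total = sum(sum(row) for row in channel)
        count = sum(len(row) for row in channel)
        pooled.append(total / count)
    return pooled


# Toy example: 3 channels of size 2 x 2 (ResNet152 would give 2048 channels).
toy_map = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
    [[5.0, 5.0], [5.0, 5.0]],
]
features = global_average_pool(toy_map)
```

Each channel contributes exactly one number, which is why the length of the output vector equals the channel count of the last convolutional stage.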

1.2 multistage policy network:

the image is firstly subjected to feature extraction through a ResNet152 network to obtain a high-level semantic feature V. The multi-level policy network consists of word-level policies and sentence-level policies.

The word-level policy refers to the image Chinese description network. The feature I is first extracted from the input image using a CNN and then embedded using a linear mapping. Words are represented by one-hot vectors that are embedded into the same dimension as the mapped image features. The beginning of each sentence is marked with a special BOS token and the end of each sentence is marked with an EOS token. Under this policy, a word is generated and then input back into the RNN-based module, where the image feature I is treated as the first word. By updating the hidden states and cells of the network, the RNN-based module outputs a distribution over all words. Let $\theta_\pi$ denote the parameters of the word-level policy. The goal is to minimize the sum of the negative log-likelihoods of the correct word at each step:
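The negative log-likelihood objective for the word-level policy can be sketched as follows. This is a minimal illustration that assumes the RNN has already produced one probability distribution over the vocabulary per step; the function name `word_level_nll` and the four-word toy vocabulary are hypothetical.

```python
import math


def word_level_nll(step_distributions, target_ids):
    """Sum of negative log-likelihoods of the correct word at each step.

    step_distributions: one probability list over the vocabulary per step.
    target_ids: index of the correct word at each step.
    """
    if len(step_distributions) != len(target_ids):
        raise ValueError("need exactly one target per step")
    return -sum(math.log(dist[t]) for dist, t in zip(step_distributions, target_ids))


# Hypothetical vocabulary: 0 = BOS, 1 = first word, 2 = second word, 3 = EOS.
dists = [
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.10, 0.60, 0.20],
    [0.05, 0.05, 0.10, 0.80],
]
loss = word_level_nll(dists, [1, 2, 3])
```

The loss shrinks toward zero as the policy places more probability on the correct word at every step.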

The sentence-level policy is a visual semantic embedding network. This kind of network has been applied successfully to image classification; it maps the image feature I and the sentence S into a common embedding space in order to measure the similarity between them. As shown in fig. 2, given a sentence S, its embedded feature is represented by the last hidden state of the RNN. Let $h_p(\mathrm{RNN}(S))$ denote the sentence mapping layer and $f_p(I)$ denote the image mapping layer. As shown in FIG. 1, the sentence-level policy takes the image feature I and the caption partially generated by the word-level policy, and the confidence between them is calculated by the following formula:

sentence-level policies provide sentence confidence by evaluating the current state from a large-scale context;
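A minimal sketch of such a confidence score is given here. The formula itself is not reproduced in this text, so the sketch assumes the confidence is the cosine similarity between the mapped image feature and the mapped embedding of the partial sentence; that choice, and the names `cosine_similarity` and `sentence_confidence`, are assumptions for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_confidence(image_embedding, partial_sentence_embedding):
    """Assumed confidence: similarity of the two vectors in the common embedding space."""
    return cosine_similarity(image_embedding, partial_sentence_embedding)


aligned = sentence_confidence([1.0, 0.0], [2.0, 0.0])
unrelated = sentence_confidence([1.0, 0.0], [0.0, 3.0])
opposed = sentence_confidence([1.0, 0.0], [-1.0, 0.0])
```

Under this assumption the score is close to 1 when the partial sentence agrees with the image and lower when it does not.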

Finally, to ensure that the sentence generation model trains normally, the dimensions of the image feature vectors and the word feature vectors need to be unified. A fully connected layer maps the 2048-dimensional feature vector of the image to the 512-dimensional word vector feature space to obtain the final image feature mapping vector. The mapping formula is:

$$ y = W^{T}\,\mathrm{DCNN}(I) + b \qquad (3) $$

In the formula: y is the 512-dimensional feature vector obtained by the fully connected calculation, W is a 2048 × 512 matrix, I is the image input into the network, DCNN(I) is the 2048-dimensional feature vector extracted by the network, and b is the bias term.
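Formula (3) can be checked with a small sketch. The toy sizes (3 inputs, 2 outputs) stand in for the 2048 and 512 dimensions of the text, and the name `linear_map` is illustrative.

```python
def linear_map(x, W, b):
    """Compute y = W^T x + b, with W stored as len(x) rows of len(b) columns."""
    if len(W) != len(x) or any(len(row) != len(b) for row in W):
        raise ValueError("shape mismatch between x, W and b")
    return [
        sum(W[i][j] * x[i] for i in range(len(x))) + b[j]
        for j in range(len(b))
    ]


x = [1.0, 2.0, 3.0]                        # stands in for the 2048-d DCNN(I)
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # stands in for the 2048 x 512 matrix
b = [0.5, -0.5]                            # one bias entry per output dimension
y = linear_map(x, W, b)
```

Because W has one row per input dimension and one column per output dimension, applying its transpose turns a 2048-dimensional feature into a 512-dimensional one.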

1.3, multi-stage reward network:

The multi-level reward function is a combination of the vision-language reward and the language-language reward. The vision-language reward is a visual semantic embedding network whose structure is the same as that of the sentence-level policy, but the two differ in two respects. The first difference is that the vision-language reward is computed from the image feature I and the fully generated caption rather than a partially generated one, where the caption is produced by the multi-level policy network that fuses the word-level and sentence-level policies. The reward evaluates the vision-language correlation of the fully generated caption and defines the specific goal of RL optimization.

As shown in FIG. 2, let $f_r(I)$ denote the image mapping layer, with a corresponding sentence mapping layer applied to the RNN encoding of the fully generated caption. The vision-language reward is defined as follows:

The second difference is that the embedding space of the reward is pre-trained, whereas the sentence-level policy is trained directly in the RL framework. Let $\theta_r$ denote the parameters of the vision-language reward. The RNN weights and the mapping layers are learned from the image-sentence pairs in the image Chinese description data set using a bidirectional ranking loss:

where γ is a margin chosen by cross-validation, each (I, S) is a ground-truth image-sentence pair, $S^-$ denotes a negative description for the image I, and $I^-$ denotes a negative image for the description S.
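A bidirectional ranking loss of this kind can be sketched as follows. The exact formula is not reproduced in this text, so the sketch assumes the common hinge form with cosine similarity, summed over negative sentences for the image and negative images for the sentence. The function names and the toy embeddings are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def bidirectional_ranking_loss(img, sent, neg_sents, neg_imgs, margin):
    """Hinge loss pushing the true pair above every negative pair by `margin`.

    img, sent: embeddings of a ground-truth image-sentence pair (I, S).
    neg_sents: embeddings of negative descriptions S- for the image I.
    neg_imgs:  embeddings of negative images I- for the description S.
    """
    positive = cosine(img, sent)
    loss = 0.0
    for s_neg in neg_sents:   # image-to-sentence direction
        loss += max(0.0, margin - positive + cosine(img, s_neg))
    for i_neg in neg_imgs:    # sentence-to-image direction
        loss += max(0.0, margin - positive + cosine(i_neg, sent))
    return loss


well_separated = bidirectional_ranking_loss(
    [1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]], [[0.0, 1.0]], 0.2)
confusable = bidirectional_ranking_loss(
    [1.0, 0.0], [1.0, 0.0], [[1.0, 0.0]], [], 0.2)
```

The loss is zero once every negative pair scores at least γ below the true pair, and it grows as negatives become harder to tell apart from the true pair.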

The language-language reward refers to the automatic evaluation metrics that have been applied successfully to the image Chinese description task. Because such a metric is calculated with predefined rules, it evaluates sequential actions stably. The language-language reward is used as a complement to the vision-language reward; it is calculated by comparing the fully generated caption with the corresponding ground truth.

1.4 reinforcement learning training:

The key problem in reinforcement learning is to combine the policy part and the reward part for joint learning. Because the vision-language reward is pre-trained with ground-truth data, it can serve as a standard for measuring the correlation between an image and a sentence. The sentence-level policy is trained inside the reinforcement learning framework using all the information in the image environment, so it can be regarded as an auxiliary criterion for measuring the similarity between images and sentences. By minimizing G, the expert reward guides the non-expert policy to optimize, further maximizing the benefit of their joint learning. Let $\theta_a$ denote the parameters of the sentence-level policy, so that the parameters of the multi-level policy network comprise both the word-level and the sentence-level parameters; the distribution over generated words is learned by minimizing the negative expected combined reward $r_{total}$.

The objective function can be expressed as:

The reinforcement learning training process includes two steps.

1. Pre-train the word-level policy $\theta_\pi$ and the vision-language reward $\theta_r$, each in its own equation, using standard supervised learning.

2. Jointly train $\theta_\pi$ and $\theta_a$ using the objective equation. The resulting baseline for RL comprises not only the sentence-level policy but also the language-language reward obtained by the current model under the inference algorithm used at test time. The gradient is approximated from samples as follows:

where the two baseline terms are used as a combined baseline and a moving baseline with η. Subtracting the baseline gives a much lower-variance estimate of the policy gradient, and the remaining term can be regarded as an estimate of the advantage of the action taken in state $s_t$.
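The effect of subtracting a baseline can be illustrated with a single-sample policy-gradient sketch. The gradient formula is not reproduced in this text, so this is a generic REINFORCE-with-baseline illustration over softmax logits, not the patent's exact estimator. Treating η as the rate of an exponential moving average is an assumption, as are all function names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_logit_gradient(logits, action, reward, baseline):
    """Gradient of -(reward - baseline) * log pi(action) with respect to the logits."""
    probs = softmax(logits)
    advantage = reward - baseline
    return [
        -advantage * ((1.0 if k == action else 0.0) - p)
        for k, p in enumerate(probs)
    ]


def update_moving_baseline(baseline, reward, eta):
    """Exponential moving average of rewards; eta is an assumed smoothing rate."""
    return (1.0 - eta) * baseline + eta * reward


grad = reinforce_logit_gradient([0.0, 0.0], action=0, reward=1.0, baseline=0.5)
no_signal = reinforce_logit_gradient([0.0, 0.0], action=0, reward=0.5, baseline=0.5)
new_baseline = update_moving_baseline(0.0, 1.0, eta=0.1)
```

A descent step along `grad` raises the logit of the sampled word when its reward beats the baseline, and the gradient vanishes when the reward equals the baseline, which is what keeps the variance low.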

1.5 statement generation model:

This embodiment builds a double-layer GRU model to decode and generate Chinese sentences. The GRU network is a refinement of the LSTM network that merges the forget gate and the input gate, so it has only two gates: the reset gate r_t and the update gate z_t. A schematic structural diagram of the GRU model is shown in fig. 4.

h_{t-1} and h_t are the hidden-layer states of the GRU network at the previous moment and at the current moment, respectively. The GRU network updates the hidden state through the two gates, specifically as follows:

(1) The reset gate r_t controls how much of the hidden-layer information from the previous moment is forgotten, which allows the network to capture short-term dependencies in sequence data effectively. It is calculated as:

r_t = σ(W_r · [h_{t-1}, x_t]) (10)

In the formula, σ(·) is the sigmoid function, W_r is the weight coefficient of the reset gate layer, h_{t-1} is the hidden-layer state at the previous moment, and x_t is the input at the current moment.

(2) The update gate z_t controls how much of the hidden-layer information from the previous moment is carried into the current moment, which allows the network to capture long-term dependencies in sequence data effectively. It is calculated as:

z_t = σ(W_z · [h_{t-1}, x_t]) (11)

In the formula, σ(·) is the sigmoid function, W_z is the weight coefficient of the update gate layer, h_{t-1} is the hidden-layer state at the previous moment, and x_t is the input at the current moment.

(3) The candidate hidden-layer state at the current moment is the hidden-state information to be retained. The reset gate filters the hidden-layer state of the previous moment: the reset gate value is multiplied element-wise with the previous hidden-layer state, and the closer the reset gate value is to 0, the more of the previous hidden-layer state is discarded. The filtered previous state r_t ⊙ h_{t-1} and the current input are multiplied by the weight coefficient and summed, and the tanh function then compresses the result into the interval (-1, 1). The candidate hidden-layer state is expressed as:

h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t])

In the formula, W is the candidate weight coefficient, r_t is the reset gate, h_{t-1} is the hidden-layer state at the previous moment, and x_t is the input at the current moment.

(4) The hidden-layer state h_t at the current moment is the actual output of the GRU network at the current moment. The update gate combines the hidden-layer state of the previous moment with the candidate hidden-layer state of the current moment: the closer the update gate value is to 1, the more of the previous hidden-layer state is retained. If the update gate value is 1, the previous hidden state is carried along the time axis without attenuation and passed directly to the current hidden state. The hidden state is expressed as:

h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t

where h̃_t is the candidate hidden-layer state from step (3).
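Steps (1) to (4) can be sketched as a single GRU update in plain Python. This is a minimal illustration under stated assumptions: bias terms are omitted to match formulas (10) and (11), the candidate state and final update use the standard GRU form that matches the verbal description in steps (3) and (4), and the names `gru_step`, `W_r`, `W_z`, and `W_h` are chosen here for illustration.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def gru_step(h_prev, x, W_r, W_z, W_h):
    """One GRU update; each weight matrix has shape hidden x (hidden + input)."""
    concat = h_prev + x  # [h_{t-1}, x_t]
    r = [sigmoid(a) for a in matvec(W_r, concat)]  # reset gate, formula (10)
    z = [sigmoid(a) for a in matvec(W_z, concat)]  # update gate, formula (11)
    # Step (3): filter the previous state with the reset gate, then apply tanh.
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x
    h_cand = [math.tanh(a) for a in matvec(W_h, gated)]
    # Step (4): z close to 1 keeps more of the previous hidden state.
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


if __name__ == "__main__":
    zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    print(gru_step([1.0, -2.0], [0.3], zeros, zeros, zeros))
```

With all weights set to zero, both gates evaluate to 0.5 and the candidate state is 0, so the new hidden state is half of the previous one, which is a convenient sanity check.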

Layer1 of the double-layer GRU network fuses the image features with the word embedding features and passes the result to layer2; layer2 performs feature inference on the output of layer1 and decodes it to predict the next word. The information flow of the model is:

The inputs of layer1 at time t = 0 are:

(1) image features after feature mapping;

(2) word embedding features obtained by re-encoding the sparse word codes.

The outputs of layer1 at time t = 0 are:

(1) the hidden-layer input of layer1 at time t = 1;

(2) the actual input of layer2 at time t = 0.

The inputs of layer2 at time t = 0 are:

(1) the hidden-layer output of layer1;

(2) the initial hidden-layer value of layer2.

The outputs of layer2 at time t = 0 are:

(1) the actual output of layer2 at time t = 0;

(2) the hidden-layer input of layer2 at time t = 1.

Increasing the number of network layers lets the model learn deeper text features and gives the sequence model better fitting capability, so that it generates more accurate sentences. The overall flow chart of the model is shown in fig. 4: a ResNet152 network generates global feature vectors for the images of the AI Challenger image data set, and these vectors are finally converted into Chinese sentences.
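The information flow between the two layers can be sketched as follows. This is an illustrative sketch only: the dimensions, the random initialisation, the output projection `w_out`, and the greedy word choice are assumptions made for the example, and the GRU update uses the standard bias-free form.

```python
import math
import random


def _affine(weights, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def _sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def init_gru(input_dim, hidden_dim, rng):
    """Random weights for one GRU layer (hypothetical initialisation)."""
    def matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim + input_dim)]
                for _ in range(hidden_dim)]
    return {"W_r": matrix(), "W_z": matrix(), "W_h": matrix()}


def gru_step(h_prev, x, params):
    concat = h_prev + x
    r = [_sigmoid(a) for a in _affine(params["W_r"], concat)]
    z = [_sigmoid(a) for a in _affine(params["W_z"], concat)]
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x
    cand = [math.tanh(a) for a in _affine(params["W_h"], gated)]
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, cand)]


def decode_step(image_feat, word_emb, h1, h2, layer1, layer2, w_out):
    # layer1 fuses the mapped image features with the word embedding.
    h1 = gru_step(h1, image_feat + word_emb, layer1)
    # layer2 takes the hidden-layer output of layer1 as its input.
    h2 = gru_step(h2, h1, layer2)
    # Assumed output head: linear scores over the vocabulary, greedy choice.
    scores = _affine(w_out, h2)
    word_id = max(range(len(scores)), key=scores.__getitem__)
    return h1, h2, word_id


if __name__ == "__main__":
    rng = random.Random(0)
    feat_dim, emb_dim, hidden, vocab = 4, 3, 5, 6
    layer1 = init_gru(feat_dim + emb_dim, hidden, rng)
    layer2 = init_gru(hidden, hidden, rng)
    w_out = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(vocab)]
    h1, h2 = [0.0] * hidden, [0.0] * hidden
    for t in range(3):
        h1, h2, word_id = decode_step([0.5] * feat_dim, [0.1] * emb_dim,
                                      h1, h2, layer1, layer2, w_out)
        print(t, word_id)
```

Each call returns the two hidden states that become the hidden-layer inputs at the next time step, mirroring the t = 0 to t = 1 hand-off listed above.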

2. Experiment and analysis:

2.1 Data set:

To verify the effectiveness of the model and the fluency and coherence of the output sentences, the experiment uses the challenging AI Challenger global AI challenge image Chinese description data set. Each image is paired with 5 Chinese descriptions. The training set contains 210,000 images and 1,050,000 Chinese descriptions in total; the validation set and the test set each contain 30,000 images and 150,000 Chinese descriptions.

2.2 Experimental details:

In the experiment, features are extracted with a ResNet152 network. Images fed into the network are uniformly resized and normalized to 256 × 256 pixels; after a series of convolutions and a final adaptive average pooling layer, the output global feature vector has size [2048, 1, 1]. The sentence-level policy and the visual-language reward are visual-semantic embedding networks that share the same architecture but are trained independently. The RNN is built from one LSTM layer with 2048-d hidden units, and both linear mapping layers have size 2048 × 512.
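The final pooling step can be illustrated without a deep learning framework. The sketch below shows how adaptive average pooling to a 1 × 1 output reduces a C × H × W feature map to C × 1 × 1 by averaging each channel; the function name and the toy input are assumptions for illustration. For a 256 × 256 input, a standard ResNet has a total stride of 32, so the last feature map would be 2048 × 8 × 8 before pooling (a property of the standard architecture, not stated in the source).

```python
def global_average_pool(feature_map):
    """Average every channel of a C x H x W nested list down to C x 1 x 1."""
    pooled = []
    for channel in feature_map:
        total = sum(sum(row) for row in channel)
        count = sum(len(row) for row in channel)
        pooled.append([[total / count]])
    return pooled


if __name__ == "__main__":
    # Toy feature map with 2 channels of size 2 x 2.
    toy = [[[1.0, 2.0], [3.0, 4.0]],
           [[0.0, 0.0], [0.0, 8.0]]]
    print(global_average_pool(toy))
```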

During training, the LSTM hidden dimension, image dimension, word dimension, and attention embedding dimension of the word-level policy are all fixed at 512. The Adam optimizer is used with an initial learning rate of 5 × 10^-5 and a mini-batch size of 64. The maximum number of epochs is 30. λ in equation 3, β in equation 9, γ in equation 4, and η in equation 6 are set to 0.4, 0.6, 0.2, and 0.4, respectively. During testing, the beam size is set to 1. All experiments were carried out in PyTorch.
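For reproducibility, the settings listed in this paragraph can be collected in a single configuration object. The sketch below is one possible layout; the class name and field names are hypothetical, and only the values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters from section 2.2 (field names are illustrative)."""
    hidden_dim: int = 512       # LSTM hidden dimension of the word-level policy
    image_dim: int = 512
    word_dim: int = 512
    attention_dim: int = 512
    optimizer: str = "Adam"
    learning_rate: float = 5e-5
    batch_size: int = 64
    max_epochs: int = 30
    lam: float = 0.4            # lambda in equation 3
    beta: float = 0.6           # beta in equation 9
    gamma: float = 0.2          # gamma in equation 4
    eta: float = 0.4            # eta in equation 6
    beam_size: int = 1          # a beam of 1 is equivalent to greedy decoding


if __name__ == "__main__":
    print(TrainConfig())
```

Freezing the dataclass keeps the values from being changed accidentally between training and testing.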

The model is developed on the Linux operating system, the programs are written in Python, and the image Chinese description model is built with the open-source deep learning framework PyTorch 0.4.0. For the word-level policy, three types of image Chinese description models are explored: CNN-RNN, Attention, and Stacked.

2.3 Comparison of experimental results:

In this embodiment, experimental comparisons are made from both objective and subjective perspectives. Training accuracy is compared on the Chinese description data set, and the objective evaluation uses the BLEU-4 image description metric.
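For readers unfamiliar with the metric, the following is a minimal sketch of sentence-level BLEU-4 with uniform n-gram weights, clipped n-gram precision, and a brevity penalty, without smoothing. Splitting Chinese sentences into single characters is an assumption made here for simplicity; the evaluation tooling actually used in the experiments is not described in the source and may tokenize differently.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, references):
    """Sentence-level BLEU-4 (uniform weights, no smoothing)."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0
        # Clip each candidate n-gram by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0
        log_precision_sum += 0.25 * math.log(clipped / total)
    # Brevity penalty against the reference whose length is closest.
    c_len = len(candidate)
    r_len = min((abs(len(r) - c_len), len(r)) for r in references)[1]
    penalty = 1.0 if c_len > r_len else math.exp(1.0 - r_len / c_len)
    return penalty * math.exp(log_precision_sum)


if __name__ == "__main__":
    hypothesis = list("湖面上停着一艘船")
    reference = list("湖面上停着一艘小船")
    print(bleu4(hypothesis, [reference]))
```

Corpus-level BLEU, which is what is usually reported, aggregates the clipped counts over all sentences before taking the geometric mean rather than averaging per-sentence scores.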

For the objective metrics, a larger BLEU value indicates a better result. The proposed model is compared with previously proposed models in terms of accuracy, BLEU-4, and training time, and it shows a clear improvement. The comparison of model training accuracy is shown in Table 1:

TABLE 1 model training accuracy comparison

Model           Training accuracy
NIC             89.591%
ATT-CNN+LSTM    89.598%
ATT-FCN         89.593%
Adaptive        90.698%
SCST            90.723%
P-CNN-RNN       90.697%
P-Attention     91.657%
P-Stacked       93.121%

Three image description models are used here to build the word-level policy, denoted P-CNN-RNN, P-Attention, and P-Stacked. The results of the test comparison for the three models are shown in Table 2:

TABLE 2 comparison of model test experiments

The results of the model training time comparison experiment are shown in table 3:

TABLE 3 comparison of model training time results

Model           Training time (h)
NIC             9
ATT-CNN+LSTM    9
ATT-FCN         8
Adaptive        8
SCST            6
P-CNN-RNN       8
P-Attention     4
P-Stacked       4

For the subjective evaluation, the image Chinese description model is tested on the AI Challenger global AI challenge image Chinese description test data set. The experiment compares the quality of the Chinese sentences generated by the new multi-level policy framework proposed in this embodiment with those generated by a single-level policy. The semantic comparison covers (a) Chinese sentences generated with the multi-level policy deep reinforcement learning framework and (b) Chinese sentences generated with a single-level policy and reward function, as shown in fig. 5:

Part (1): (a) A girl wearing a hat is playing with another girl in the room; (b) Two girls are playing in the room.

Part (2): (a) On the playground, a woman with a racket is jumping up; (b) On the path, a woman is jumping up.

Part (3): (a) A group of people are sitting around a table in a restaurant; (b) A group of people are sitting on the table.

Part (4): (a) A boat is moored on the lake; (b) A boat is on the water.

As fig. 5 shows, compared with the sentences generated by the single-level policy and reward function, the sentences generated by the proposed model correct erroneous content and describe the image more accurately. The descriptions generated by the multi-level policy are close to the actual scene, whereas the single-level policy often loses key information. In addition, both descriptions (a) and (b) in part (4) of fig. 5 are failure cases: the sentences do not match what the picture actually shows. This indicates that, in some cases, the proposed method cannot fully extract a specific object from a noisy background.

Taken together, the objective and subjective evaluation experiments show that the new image Chinese description model based on the multi-level strategy and deep reinforcement learning framework proposed here achieves better accuracy and test scores in the objective evaluation while reducing training time, and improves the coherence and readability of the generated Chinese description sentences in the subjective evaluation.

The third embodiment: Following the method example above, the functional modules may be divided according to the block diagrams shown in the drawings of the specification. For example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in hardware or as a software functional module. It should be noted that the division of modules in the embodiment of the present invention is schematic and is only a division by logical function; other divisions are possible in actual implementation.

Specifically, the system includes a processor, a memory, a bus, and a communication device. The memory stores computer-executable instructions; the processor is connected to the memory through the bus and executes the computer-executable instructions stored in the memory; the communication device connects to an external network and handles sending and receiving data. The memory contains database software.

Specifically, the database software is SQL Server 2005 or a later version and is stored in a computer-readable storage medium. The processor and the memory contain instructions that cause a personal computer, a server, or a network device to perform all or part of the steps of the method. The processor may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disc.

Specifically, the software system runs on a Central Processing Unit (CPU), a general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The communication device used for communication between the relevant personnel and the user may be a transceiver, a transceiver circuit, a communication interface, or the like.

Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on, or transmitted as one or more instructions or code on, a computer-readable medium. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention, and any modification, equivalent replacement, improvement, etc. made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.
