Gibbs-constrained text summary generation method using a pre-trained model
1. A Gibbs-constrained text summary generation method using a pre-trained model, characterized by comprising the following steps: a Trans-BLSTM model is trained and used to generate the text summary, wherein the training process of the Trans-BLSTM model is as follows:
(1) first, the pre-trained language model Bert is used to perform word vectorization on the source sequence of the text, x = {x_1, x_2, ..., x_n}, and relative position coding is added at the same time to obtain the Word Embedding of the text;
(2) at the encoder stage, features are extracted using a multi-head attention mechanism and a Bi-LSTM, the model is trained and fine-tuned, and the output of the encoder is obtained;
(3) the Word Embedding of the target sequence y = {y_1, y_2, ..., y_m} is obtained with relative position coding added, in the same way as the word embedding of the source sequence at the encoder end;
(4) the decoder end adopts the Transformer decoder structure, with parameters consistent with those of the Transformer;
(5) an Attention matrix is obtained through training, fed into a fully connected layer, and passed through Softmax to obtain a probability representation over the vocabulary;
(6) finally, an output sequence is obtained through the decoding algorithm: an LDA model is fused into the decoder end to extract keywords, and the summary is extracted and generated in combination with a Gibbs sampling algorithm.
2. The Gibbs-constrained text summary generation method using a pre-trained model according to claim 1, wherein: the Trans-BLSTM model changes the FFN(·) layer of the Transformer encoder section to a Bi-LSTM and connects a Linear layer, while the decoder section remains unchanged.
3. The Gibbs-constrained text summary generation method using a pre-trained model according to claim 2, wherein: the relative position code is calculated as follows:
wherein i and j denote the corresponding position indexes in the word list, 2k denotes an even dimension, 2k+1 denotes an odd dimension, and d_z denotes the hidden-layer dimension of each single Attention head.
4. The Gibbs-constrained text summary generation method using a pre-trained model according to claim 3, wherein: in step (6), keywords are first extracted from the text using the LDA model, and the highest-scoring keyword is selected as the start token; meanwhile, sentences that have already been predicted are added to a negative candidate set, and if the next prediction falls in the negative candidate set, the Gibbs sampling algorithm is repeated once.
5. The Gibbs-constrained text summary generation method using a pre-trained model according to claim 4, wherein: the flow at the decoder end combined with the Gibbs sampling algorithm is as follows:
the initial state is x_0 = [x_{0,1}, x_{0,2}, ..., x_{0,n}]; at time t the state is x_t = [x_{t,1}, x_{t,2}, ..., x_{t,n}], and x_{t+1} is sampled by the following procedure:
1) sample x_{t,i} at the i-th position from [x_{t,1}, x_{t,2}, ..., x_{t,n}] and replace it in the sequence with [MASK], obtaining the sequence x_{t,-i} = [x_{t,1}, x_{t,2}, ..., x_{t,i-1}, [MASK], x_{t,i+1}, ..., x_{t,n}];
2) compute the probability distribution p_{t+1} generated for x_{t,i} by the MLM model;
3) sample a y from p_{t+1};
4) replace x_{t,i} with y, obtaining x_{t+1} = x_t[x_{t,i} = y] = [x_{t,1}, x_{t,2}, ..., x_{t,i-1}, y, x_{t,i+1}, ..., x_{t,n}].
Background
Against the background of highly developed networks, hundreds of millions of items of data traffic are generated on the Internet every day, and an overwhelming flow of information fills our lives; how to extract the information people need from this flow is therefore very important. Since the mobile Internet entered a stage of rapid development in 2012, the amount of text information has grown explosively and exponentially, and the huge volume of text makes people spend a great deal of time browsing online, greatly increasing the reading cost for users and the cost of acquiring important information. How to quickly extract the key information in text data from excessive information has become an urgent need across industries. A text summary is a brief expression of the core content of an article, which can improve the efficiency with which users search and read within massive data; however, traditional summaries are produced by manual extraction, which is too costly and inefficient, and thus automatic text summarization technology came into being.
The emergence of automatic text summarization technology can effectively alleviate this problem. As one of the important research topics in natural language processing and artificial intelligence, automatic text summarization uses a computer to automatically extract, from a long text or a text collection, a brief and coherent short text that accurately reflects the central content of the original. Automatic text summarization is an important technical means for making machines understand human language, is one of the important tasks of natural language processing, and has great research value and far-reaching significance. A good summary often has three features:
(1) Conciseness. For short texts, the length of the summary is generally no more than half of the original text; for long texts, the summary often does not exceed one third of the text.
(2) Coverage of the main idea. The summary generally covers the important information of the text and expresses the central idea of the original.
(3) Elimination of redundancy. A good summary should not be rambling; it is a brief condensation of the important information of the original text, and a summary produced by a good algorithm should eliminate repeated, redundant text.
Automatic text summarization is a technology that uses a computer to automatically analyze text, condense its content, and generate a summary. It is an important means of coping with today's information overload, can help people obtain key information from text quickly, accurately, and comprehensively, and is now widely applied to document summary generation, public opinion monitoring, news headline generation, complex question answering, and so on, with important practical significance for business and government services. It is mainly divided into two forms: the extractive method and the abstractive method.
An extractive text summarization method mainly counts and analyzes the features of a text according to principles of probability and statistics and mines the latent semantic information of the text. It mainly trains a model on the input text using a related algorithm or language model, then uses probabilistic knowledge to select and extract relevant phrases and sentences from the source text, recombines them into new sentences or paragraphs, and thereby produces the summary. The extractive text summarization method mainly comprises the following steps (a minimal illustrative sketch follows the list):
(1) Content selection, mainly based on statistical features or language models.
(2) Information ranking; generally, the word frequencies of words or the mutual-information importance of sentences are computed for ranking.
(3) Sentence construction according to importance, and output of the summary.
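By way of illustration only (this is background description, not the claimed method), a minimal Python sketch of such an extractive pipeline, assuming simple word-frequency scoring over pre-split sentences:

```python
from collections import Counter

def extractive_summary(sentences, ratio=0.33):
    """Minimal extractive pipeline: score sentences by word frequency,
    rank them, and rebuild a summary in the original sentence order."""
    # (1) content selection: word-frequency statistics over the whole text
    words = [w.lower() for s in sentences for w in s.split()]
    freq = Counter(words)
    # (2) information ranking: score each sentence by its average word frequency
    scores = [sum(freq[w.lower()] for w in s.split()) / max(len(s.split()), 1)
              for s in sentences]
    k = max(1, int(len(sentences) * ratio))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    # (3) sentence construction: output the selected sentences in original order
    return " ".join(sentences[i] for i in sorted(top))

print(extractive_summary([
    "Automatic summarization extracts key information from long text.",
    "It is widely used in news and search.",
    "Word frequency is a simple ranking signal."]))
```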
The extractive summary generation method is a simple and direct text summarization method. Its core lies in content selection; although selecting the key content is difficult, the method is relatively easy to implement. On the other hand, sentence coherence is generally poor and the coherence of the summary is hard to guarantee; however, because the sentences are extracted directly from the original text, excessive repetition does not appear, the meaning of the text can basically be understood, and the readability of the sentences is good.
An abstractive text summarization method mainly models the source text using deep-learning techniques; the trained model analyzes and understands the text, can choose words or phrases from the training vocabulary that differ from the original text, and thereby restates the key information of the text, expresses its core content and main ideas, and generates the summary. Unlike extractive summarization, abstractive summarization mainly relies on a trained language model to understand the content of an article, compressing, refining, and interpreting the text at the level of words and sentences, and finally generating the summary. The abstractive method is closer to the process by which humans read and understand text than the extractive method. At the same time, the abstractive method depends more on understanding and paraphrasing the text, and a machine by itself lacks a human's comprehension of textual information and reserve of prior knowledge; therefore, how to design a model or method that enables a machine to generate a text summary is a more complex, difficult, and challenging task.
In recent years, data-driven deep Sequence-to-Sequence (Seq2Seq) learning methods have made significant breakthroughs in many research fields and attracted wide attention, and natural language processing has made notable achievements and progress. Although deep learning has driven the development of text summarization, generated summaries still often suffer from missing semantics, repeated generation, unknown words, word ambiguity, poor readability, and difficulty of evaluation, which require further research and urgent solutions. Developing the text summary generation task is an arduous and challenging undertaking that requires joint effort.
Disclosure of Invention
It is an object of the present invention to provide a Gibbs-constrained text summary generation method using a pre-trained model that overcomes at least some of the deficiencies of the prior art.
According to the Gibbs-constrained text summary generation method using a pre-trained model, a Trans-BLSTM model is trained and used to generate the text summary, and the training process of the Trans-BLSTM model is as follows:
(1) first, the pre-trained language model Bert is used to perform word vectorization on the source sequence of the text, x = {x_1, x_2, ..., x_n}, and relative position coding is added at the same time to obtain the Word Embedding of the text;
(2) at the encoder stage, features are extracted using a multi-head attention mechanism and a Bi-LSTM, the model is trained and fine-tuned, and the output of the encoder is obtained;
(3) the Word Embedding of the target sequence y = {y_1, y_2, ..., y_m} is obtained with relative position coding added, in the same way as the word embedding of the source sequence at the encoder end;
(4) the decoder end adopts the Transformer decoder structure, with parameters consistent with those of the Transformer;
(5) an Attention matrix is obtained through training, fed into a fully connected layer, and passed through Softmax to obtain a probability representation over the vocabulary;
(6) finally, an output sequence is obtained through the decoding algorithm: an LDA model is fused into the decoder end to extract keywords, and the summary is extracted and generated in combination with a Gibbs sampling algorithm.
Preferably, the Trans-BLSTM model changes the FFN(·) layer of the Transformer encoder section to a Bi-LSTM and connects a Linear layer, while the decoder section remains unchanged.
Preferably, the relative position code is calculated as follows:
wherein i and j denote the corresponding position indexes in the word list, 2k denotes an even dimension, 2k+1 denotes an odd dimension, and d_z denotes the hidden-layer dimension of each single Attention head.
Preferably, in step (6), keywords are first extracted from the text using the LDA model, and the highest-scoring keyword is selected as the start token; meanwhile, sentences that have already been predicted are added to a negative candidate set, and if the next prediction falls in the negative candidate set, the Gibbs sampling algorithm is repeated once.
Preferably, the flow at the decoder end combined with the Gibbs sampling algorithm is as follows:
the initial state is x_0 = [x_{0,1}, x_{0,2}, ..., x_{0,n}]; at time t the state is x_t = [x_{t,1}, x_{t,2}, ..., x_{t,n}], and x_{t+1} is sampled by the following procedure:
1) sample x_{t,i} at the i-th position from [x_{t,1}, x_{t,2}, ..., x_{t,n}] and replace it in the sequence with [MASK], obtaining the sequence x_{t,-i} = [x_{t,1}, x_{t,2}, ..., x_{t,i-1}, [MASK], x_{t,i+1}, ..., x_{t,n}];
2) compute the probability distribution p_{t+1} generated for x_{t,i} by the MLM model;
3) sample a y from p_{t+1};
4) replace x_{t,i} with y, obtaining x_{t+1} = x_t[x_{t,i} = y] = [x_{t,1}, x_{t,2}, ..., x_{t,i-1}, y, x_{t,i+1}, ..., x_{t,n}].
The invention has the following advantages:
(1) A text generation model, the Trans-BLSTM model, is provided and is fine-tuned in combination with a pre-trained language model.
(2) The structure of the encoder end is improved: the feed-forward neural network layer is changed to a Bi-LSTM to improve the feature extraction capability.
(3) The position coding is improved: relative position coding is used, adding the relative positional relations of the text and increasing the model's ability to extract positional features.
(4) A Gibbs sampling algorithm is introduced; by combining Bert with Gibbs sampling, training and prediction are unified and training bias is reduced.
(5) An LDA topic model is added; key information is extracted with the LDA topic model, improving the generation quality of the model.
Drawings
FIG. 1 is a flowchart of the training of the Trans-BLSTM model in Example 1;
FIG. 2 is a reference diagram of relative position coding in Example 1;
FIG. 3 is a graph of the learning rate curve in Example 1;
FIG. 4 is a graph of the training loss of the Trans-BLSTM model in Example 1.
Detailed Description
For a further understanding of the invention, reference should be made to the following detailed description taken in conjunction with the accompanying drawings and examples. It is to be understood that the examples are illustrative of the invention and not limiting.
Example 1
As shown in FIG. 1, the present embodiment provides a Gibbs-constrained text summary generation method using a pre-trained model, which trains a Trans-BLSTM model and uses it to generate the text summary; the model changes the FFN(·) layer of the Transformer encoder section to a Bi-LSTM and connects a Linear layer, while the decoder section remains unchanged. The training procedure for the Trans-BLSTM model is as follows:
(1) first, the pre-trained language model Bert is used to perform word vectorization on the source sequence of the text, x = {x_1, x_2, ..., x_n}, and relative position coding is added at the same time to obtain the Word Embedding of the text;
(2) at the encoder stage, features are extracted using a multi-head attention mechanism and a Bi-LSTM, the model is trained and fine-tuned, and the output of the encoder is obtained (a minimal sketch of such an encoder block is given after this list);
(3) the Word Embedding of the target sequence y = {y_1, y_2, ..., y_m} is obtained with relative position coding added, in the same way as the word embedding of the source sequence at the encoder end;
(4) the decoder end adopts the Transformer decoder structure, with parameters consistent with those of the Transformer;
(5) an Attention matrix is obtained through training, fed into a fully connected layer, and passed through Softmax to obtain a probability representation over the vocabulary;
(6) finally, an output sequence is obtained through the decoding algorithm: an LDA model is fused into the decoder end to extract keywords, and the summary is extracted and generated in combination with a Gibbs sampling algorithm.
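The text above fixes only the overall wiring (multi-head self-attention, then a Bi-LSTM plus a Linear layer in place of FFN(·)). The following is a minimal PyTorch-style sketch of one such encoder block under that reading; the layer sizes, normalization placement, and the use of a recent PyTorch API (batch-first attention) are assumptions, not the exact implementation used in the experiments.

```python
import torch
import torch.nn as nn

class TransBLSTMEncoderLayer(nn.Module):
    """One encoder block: multi-head self-attention, then a Bi-LSTM + Linear
    sub-layer in place of the usual position-wise feed-forward network."""
    def __init__(self, d_model=768, n_heads=8, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.bilstm = nn.LSTM(d_model, d_model // 2, batch_first=True,
                              bidirectional=True)   # output dim stays d_model
        self.linear = nn.Linear(d_model, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # self-attention sub-layer with residual connection
        a, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(a))
        # Bi-LSTM + Linear sub-layer replacing FFN(·), residual connection kept
        h, _ = self.bilstm(x)
        x = self.norm2(x + self.dropout(self.linear(h)))
        return x

# In the full model, Bert word embeddings with relative position coding
# (batch, seq_len, 768) would be fed through a stack of such blocks.
layer = TransBLSTMEncoderLayer()
out = layer(torch.randn(2, 16, 768))
print(out.shape)  # torch.Size([2, 16, 768])
```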
Absolute Position Encoding (Absolute Position Representations) refers to using a unified calculation formula to compute the value of the position code during position encoding. Relative Position Coding (Relative Position Representations) refers to position-code values that can be adjusted dynamically according to certain rules. With absolute position coding, the value of the position code is computed according to the parity of the position index; it is fixed throughout training, cannot change with the context, and is related only to the position index of the word in the word list. Relative position coding generally changes according to a relative position formula; with different reference standards, the relative position formula differs and the final coding also differs, as shown in FIG. 2.
The relative position code is calculated as follows:
wherein i and j denote the corresponding position indexes in the word list, 2k denotes an even dimension, 2k+1 denotes an odd dimension, and d_z denotes the hidden-layer dimension of each single Attention head.
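The formula itself is not reproduced in this text. Based on the variable definitions above, a plausible reconstruction — assuming the standard sinusoidal form applied to the relative offset i − j, which may differ from the exact formula in the original drawings — is:

```latex
R_{i,j}[2k]   = \sin\!\left(\frac{i-j}{10000^{2k/d_z}}\right), \qquad
R_{i,j}[2k+1] = \cos\!\left(\frac{i-j}{10000^{2k/d_z}}\right)
```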
In many cases we need to generate a target text according to some specific information; mathematically this is a conditional language model. However, we often cannot obtain enough corpus pairs to train a conditional language model directly under supervision and can only train an unconditional language model, although we can artificially design an index that quantitatively describes the link between the condition and the target text. In this situation, how to perform conditional text generation based on an unconditional language model and the link between them becomes the subject of our research; we may call this restricted (constrained) text generation.
The Gibbs algorithm, when computing p(y | x_{t,-i}), removes x_{t,i} at the i-th position at time t and then predicts the probability at the i-th position from the remaining sequence, which is particularly similar to the MLM in Bert. We therefore propose to combine Bert with the Gibbs sampling algorithm: the decoder side samples the text using the Gibbs sampling algorithm, exactly as in the training of the Bert MLM model. Thus, the flow at the decoder side combined with the Gibbs sampling algorithm can be described as follows:
the initial state is x_0 = [x_{0,1}, x_{0,2}, ..., x_{0,n}]; at time t the state is x_t = [x_{t,1}, x_{t,2}, ..., x_{t,n}], and x_{t+1} is sampled by the following procedure:
1) sample x_{t,i} at the i-th position from [x_{t,1}, x_{t,2}, ..., x_{t,n}] and replace it in the sequence with [MASK], obtaining the sequence x_{t,-i} = [x_{t,1}, x_{t,2}, ..., x_{t,i-1}, [MASK], x_{t,i+1}, ..., x_{t,n}];
2) compute the probability distribution p_{t+1} generated for x_{t,i} by the MLM model;
3) sample a y from p_{t+1};
4) replace x_{t,i} with y, obtaining x_{t+1} = x_t[x_{t,i} = y] = [x_{t,1}, x_{t,2}, ..., x_{t,i-1}, y, x_{t,i+1}, ..., x_{t,n}].
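A minimal Python sketch of this sampling loop; `mlm_predict` stands in for a function returning the MLM probability distribution over the vocabulary for the masked position (the function name, and passing the position index explicitly, are assumptions):

```python
import random

MASK = "[MASK]"

def gibbs_step(x_t, mlm_predict, vocab):
    """One Gibbs update: mask a sampled position i, ask the MLM for a
    distribution over the vocabulary at that position, and resample it."""
    i = random.randrange(len(x_t))                 # 1) sample a position i
    x_masked = x_t[:i] + [MASK] + x_t[i + 1:]      #    replace x_{t,i} with [MASK]
    p_next = mlm_predict(x_masked, i)              # 2) p_{t+1} over the vocabulary
    y = random.choices(vocab, weights=p_next)[0]   # 3) sample y from p_{t+1}
    return x_t[:i] + [y] + x_t[i + 1:]             # 4) x_{t+1} with x_{t,i} := y

def gibbs_sample(x0, mlm_predict, vocab, steps=50):
    x = list(x0)
    for _ in range(steps):
        x = gibbs_step(x, mlm_predict, vocab)
    return x
```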
Because a model that incorporates the Gibbs sampling algorithm must know the length of the predicted sequence in advance during training, while sequence lengths differ, the LDA model is merged into the decoder side: keywords are extracted from the text with the LDA model, and the highest-scoring keyword is selected as the start token. At the same time, adopting the negative-candidate-set sampling idea of the previous chapter, sentences that have already been predicted are added to the negative candidate set, and if the next prediction falls in the negative candidate set, the Gibbs sampling algorithm is repeated once. This avoids some of the problems of repeated generation.
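The text does not name a particular LDA implementation; the sketch below uses gensim as an assumed stand-in for the keyword/start-token step and keeps the negative-candidate-set check as a simple set-membership test.

```python
from gensim import corpora, models

def lda_start_token(tokenized_docs, num_topics=5):
    """Pick the highest-weighted LDA topic word as the start token
    for the prediction sequence."""
    dictionary = corpora.Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    # collect (word, weight) pairs from every topic and keep the best one
    candidates = [pair for t in range(num_topics)
                  for pair in lda.show_topic(t, topn=3)]
    return max(candidates, key=lambda wp: wp[1])[0]

negative_candidates = set()

def accept_or_resample(sentence, resample_fn):
    """Negative-candidate-set check: if the predicted sentence was already
    produced, run one more round of Gibbs sampling before accepting it."""
    if sentence in negative_candidates:
        sentence = resample_fn(sentence)
    negative_candidates.add(sentence)
    return sentence
```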
Results and analysis of the experiments
Introduction to data set
The English CNN/Daily Mail data set is used, which consists of roughly one million news items collected by Hermann et al. from the CNN and Daily Mail websites as a machine reading comprehension corpus; each article has a manually written multi-sentence abstract. Nallapati et al. subsequently built the CNN/Daily Mail data set for training text summary generation models on the basis of Hermann et al.'s work. The split of the CNN/Daily Mail text summarization data set is shown in Table 1:
TABLE 1 CNN/Daily Mail data set

Data             Summary Pairs
Training Set     286,817
Validation Set   13,368
Test Set         11,487
The Training Set is the main component of the corpus and serves as the training data for the models; it contains 286,817 <text, summary> pairs in total.
The Validation Set serves as the validation data for the models and contains 13,368 <text, summary> pairs.
The Test Set serves as the test data for the models, used to verify their effect, and contains 11,487 <text, summary> pairs.
CNN/Daily Mail was originally a reading-comprehension data set; its raw data consist of the original articles together with manually written summaries and answers, which were later revised into multi-sentence summaries. CNN/Daily Mail is a long-text summarization data set; the data volume is huge, and each long document comes with a multi-sentence summary. On average, each text contains 766 words and 29.74 sentences, and each summary contains 53 words and 3.72 sentences.
Data pre-processing
The source data are stored as text and summary pairs in nested lists and dictionaries; the <text, summary> pairs are extracted before tokenization and then segmented with a tokenization tool. Because the data volume is huge, the data set is split into many small files for storage to facilitate subsequent processing. In this experiment the Stanford CoreNLP toolkit is used to tokenize the English text. The text is processed sentence by sentence, and each sentence is marked with <p>.
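A minimal sketch of this preprocessing, assuming the <text, summary> pairs have already been loaded into Python; the `tokenize` callback stands in for the Stanford CoreNLP tokenizer, and the sentence splitting and output file names are illustrative assumptions.

```python
import json

def preprocess(pairs, shard_size=2000, tokenize=lambda s: s.split()):
    """Extract <text, summary> pairs, tokenize them, mark each sentence
    with <p>, and cut the data set into small shards for later processing."""
    shards, shard = [], []
    for text, summary in pairs:
        src = " ".join("<p> " + " ".join(tokenize(sent))
                       for sent in text.split(". ") if sent)
        tgt = " ".join(tokenize(summary))
        shard.append({"src": src, "tgt": tgt})
        if len(shard) == shard_size:
            shards.append(shard)
            shard = []
    if shard:
        shards.append(shard)
    for k, s in enumerate(shards):          # one small JSON file per shard
        with open(f"cnndm.shard.{k}.json", "w") as f:
            json.dump(s, f)

preprocess([("First sentence. Second sentence.", "A short summary.")])
```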
Experimental environment configuration and parameter settings: the model involves a neural network and requires a large amount of computation, so a GPU is used to run the model.
The hardware configuration used in the experiments is shown in Table 2.
TABLE 2 Experimental hardware configuration

Hardware device   Configuration
CPU               Intel(R) CPU i9-9900K, 8 cores / 16 threads
GPU               NVIDIA GeForce 2080 (8GB) x2
Memory            64GB
SSD               256GB
HDD               3TB
The code for the experimental model is developed and trained on the PyTorch 1.4 deep learning platform; the model parameter configuration is shown in Table 3.
TABLE 3 Trans-BLSTM model parameter settings

Parameter name     Explanation                Setting
voc_size           Vocabulary size            30522
word_vec           Word vector dimension      768
learning_rate      Learning rate              2e-3
train_steps        Number of training steps   100000
warmup_steps       Warm-up steps              10000
dropout            Dropout rate               0.1
batch_size         Batch size                 128
optimizer          Optimizer                  Adam
BeamSearch_size    Beam search width          3
The vocabulary size uses the Transformer default of 30,000. The word vector dimension is 768, and the LSTM hidden-layer dimension is also 768, chosen according to the length distribution of the words and sentences in the text. The optimizer is Adam with parameters β1 = 0.9 and β2 = 0.999. After repeated validation, a beam width of 3 was selected for beam search, which gives the best effect.
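For context, a minimal sketch of beam search decoding with the beam width of 3 mentioned above; `step_log_probs` is a hypothetical callback that returns next-token log-probabilities for a given prefix, and the stopping criterion is simplified.

```python
def beam_search(step_log_probs, vocab, eos, beam_size=3, max_len=20):
    """Keep the beam_size best partial sequences by accumulated log-probability."""
    beams = [([], 0.0)]                        # (token sequence, log-prob score)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:         # finished hypotheses are kept as-is
                candidates.append((seq, score))
                continue
            log_p = step_log_probs(seq)        # log-probs for the next token
            for tok, lp in zip(vocab, log_p):
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]
```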
Comparative analysis of results
Because of pre-training, the probability distribution of the encoder part generally tends to be smooth, while the decoder part is not yet well trained; this mismatch must be taken into account, otherwise fine-tuning does not proceed smoothly. To keep the encoder part from over-fitting and the decoder part from under-fitting, we set separate learning rates for the encoder and the decoder, as shown in the following formulas:
lr_en = original_en · min(step^(-0.5), step · warmup_en^(-1.5))
lr_de = original_de · min(step^(-0.5), step · warmup_de^(-1.5))
where for the encoder part we set original_en = 2e-3 and warmup_en = 20,000, and for the decoder part we set original_de = 0.1 and warmup_de = 10,000. As the number of training iterations increases, the learning rate begins to decrease and gradually levels off. The learning rate curves are shown in FIG. 3.
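A short sketch of this two-rate warm-up schedule with the stated settings (encoder: 2e-3 / 20,000; decoder: 0.1 / 10,000):

```python
def noam_lr(step, original, warmup):
    """lr = original * min(step^-0.5, step * warmup^-1.5)"""
    step = max(step, 1)
    return original * min(step ** -0.5, step * warmup ** -1.5)

def encoder_decoder_lrs(step):
    # encoder: original_en = 2e-3, warmup_en = 20,000
    # decoder: original_de = 0.1,  warmup_de = 10,000
    return noam_lr(step, 2e-3, 20000), noam_lr(step, 0.1, 10000)

for s in (1000, 10000, 20000, 100000):
    print(s, encoder_decoder_lrs(s))   # rates rise during warm-up, then decay
```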
The Trans-BLSTM model constructed here was run for 200,000 steps in total on two RTX 2080 GPUs, taking about 46.6 hours; the training loss curve of the model is shown in FIG. 4.
Because of pre-training, the text distribution remains stable to some extent, so the model does not change greatly after Gibbs sampling is added. In addition, during training we found that the Trans-BLSTM + Gibbs model over-fitted, so we selected the model parameters at the intermediate 100,000-step checkpoint to generate the summaries and compute the ROUGE scores.
The summaries are evaluated with the ROUGE method: the generated summaries are scored on the ROUGE-1, ROUGE-2, and ROUGE-L metrics by calling the pyrouge toolkit. The results are shown in Table 4:
TABLE 4 ROUGE-N and ROUGE-L evaluation

Metric     Average score        95% confidence interval
ROUGE-1    Average_R: 0.51699   (0.51429 - 0.51989)
ROUGE-1    Average_P: 0.37607   (0.37354 - 0.37847)
ROUGE-1    Average_F: 0.42090   (0.41874 - 0.42323)
ROUGE-2    Average_R: 0.23734   (0.23459 - 0.24022)
ROUGE-2    Average_P: 0.17352   (0.17129 - 0.17568)
ROUGE-2    Average_F: 0.19346   (0.19128 - 0.19579)
ROUGE-L    Average_R: 0.47311   (0.47034 - 0.47602)
ROUGE-L    Average_P: 0.34482   (0.34233 - 0.34724)
ROUGE-L    Average_F: 0.38561   (0.38343 - 0.38791)
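For reference, a minimal sketch of producing such scores with the pyrouge toolkit; the directory layout and file-name patterns are assumptions, and the underlying ROUGE-1.5.5 package must be installed separately.

```python
from pyrouge import Rouge155

r = Rouge155()
r.system_dir = "decoded/"                      # generated summaries, one file each
r.model_dir = "reference/"                     # reference summaries
r.system_filename_pattern = r"(\d+)_decoded.txt"
r.model_filename_pattern = "#ID#_reference.txt"

output = r.convert_and_evaluate()              # runs ROUGE-1.5.5 under the hood
scores = r.output_to_dict(output)              # rouge_1_f_score, rouge_2_f_score, ...
print(scores["rouge_1_f_score"],
      scores["rouge_2_f_score"],
      scores["rouge_l_f_score"])
```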
We first ran the designed Trans-BLSTM model once, then added the Gibbs sampling method at the decoder side and ran it again. Because the sampling algorithm is based on the pre-trained language model, its sampling distribution is not entirely accurate, so the keywords extracted by the LDA topic model were later added for comparative analysis. To fully evaluate the effect of the model, it is compared with several currently popular models. We assess our model more thoroughly by comparing it with other published experimental results on the CNN/Daily Mail data set; the results are shown in Table 5.
TABLE 5 Comparison of the experimental model with some published models

Model           ROUGE-1   ROUGE-2   ROUGE-L
PGNet+Cov       39.53     17.28     37.98
Transformer     40.21     17.76     37.0
Trans-BLSTM     40.15     18.07     38.66
+Gibbs          41.32     18.29     39.17
+Gibbs+LDA      42.09     19.34     38.56
Compared with the PGNet+Cov model, our model is greatly improved and achieves good results. Meanwhile, the scores of the baseline Trans-BLSTM model and the basic Transformer model are almost the same, which shows that the improvement based on the Transformer is effective. In addition, after Gibbs sampling is added, the model effect is further improved.
As can be seen from the table, the scores of the baseline Trans-BLSTM model and the Transformer model are almost the same, which shows that the Transformer-based improvement is effective and also that a text generation task based on the Bert pre-trained language model is feasible. Compared with the beam search algorithm of the Trans-BLSTM model, the Gibbs sampling algorithm is more likely to obtain words from the original text, so the model effect is further improved; this also reflects the consistency between the Gibbs sampling algorithm and the training and prediction of the MLM model. After the prediction sequence is initialized with the LDA topic model, the model shows a certain performance improvement (+0.77 ROUGE-1, +1.05 ROUGE-2, +0.39 ROUGE-L) compared with initializing simply with [MASK], which shows that the sentence initialization is effective. However, the model still falls short of the BertSumEXT model; its shortcomings will be studied further to improve its generation quality.
To gain an intuitive understanding of the generated summaries, we show an example and analyze summary quality from a human perspective. A summary generated by the model is shown in Table 6.
TABLE 6 Example of a summary generated by the Trans-BLSTM model
The comparison shows that the model can basically capture the same main information as the reference summary, indicating that the model performs well under certain conditions.
The present invention and its embodiments have been described above schematically, and the description is not limiting; what is shown in the drawings is only one embodiment of the present invention, and the actual structure is not limited thereto. Therefore, if a person skilled in the art, enlightened by this disclosure and without departing from the spirit of the invention, devises structural modes and embodiments similar to this technical solution without inventive effort, they shall fall within the protection scope of the invention.