Word alignment performance improving method based on pre-training model
1. A word alignment performance improving method based on a pre-training model is characterized by comprising the following steps:
1) obtaining word vectors of words in the sentence by using a pre-training model so as to form word vector matrixes X and Y of a translation sentence pair;
2) extracting phrases and terms from the inter-translated sentences corresponding to the word vector matrices X and Y by using a phrase and term extraction tool, then summing and averaging the word vectors of the words in each phrase or term so as to strengthen the relevance among those word vectors, and obtaining updated word vector matrices X and Y of the inter-translated sentence pair;
3) taking a calculated value of the cosine of the word vector between words as the similarity between two words, and obtaining a similarity matrix Sim of the inter-translation sentence pair, wherein the calculation formula is as follows:
Sim=cos(X,Y)
4) taking a convolution kernel κ of size n × n, where n is 2 to 8, and performing a convolution operation on the similarity matrix Sim so that information about the context words is incorporated into the word alignment;
5) extracting the corresponding word alignment information from the updated similarity matrix of the inter-translated sentence pair by using different word alignment extraction methods.
2. The word alignment performance improving method based on the pre-training model as claimed in claim 1, wherein: in step 2), the specific step of updating the corresponding word vector matrix is as follows:
201) extracting phrases and terms from the sentences by using a tool, and then constructing a phrase and term list of the data set;
202) matching the phrases and terms against the word vector matrix of the sentence; if the n source words e_i, ......, e_{i+n-1} form a phrase, the word vectors of these n words are each replaced by their average:
e_k = \frac{1}{n} \sum_{t=i}^{i+n-1} e_t, k = i, ......, i+n-1
after the word vectors of all phrases and terms in the sentence are updated, the corresponding updated sentence word vector matrices X and Y are obtained.
3. The word alignment performance improving method based on the pre-training model as claimed in claim 1, wherein: in step 4), performing convolution operation on the similarity matrix by using a convolution kernel κ, wherein the specific operation steps of performing convolution are as follows:
401) setting a convolution kernel κ of size n × n, and setting the values of all elements in the convolution kernel to be between 0 and 1;
402) performing the convolution operation on the similarity matrix of the bilingual sentence pair to update the similarity matrix of the bilingual sentence pair.
4. The word alignment performance improving method based on the pre-training model as claimed in claim 1, wherein: in step 5), word alignment data are extracted from the similarity matrix of step 4) by using an Argmax method, an Itermax method, and a Match method, respectively, specifically including the following steps:
501) taking the maximum over the rows and columns of the similarity matrix by using the Argmax method; if an element is the maximum of both its row and its column, the source word and the target word of that element are aligned;
502) extracting word alignment information by iterating over the similarity matrix by using the Itermax method; the word alignments obtained in each iteration are combined to form the final word alignment result;
503) mapping the word alignment task to a bipartite graph, and solving the graph by using a Match algorithm to obtain the final word alignment result.
Background
Word alignment is a sub-topic of natural language processing and is of paramount importance, because word alignment is widely applied in natural language tasks such as instance extraction, paraphrase generation and part-of-speech tagging, and especially in statistical machine translation. A good automatic word alignment system is key to advancing most tasks in the field of natural language processing.
In recent years, with the advance of artificial intelligence, deep learning has become popular among researchers in many fields, and machine translation is no exception. Researchers have moved from statistical machine translation (SMT) to neural machine translation (NMT). A neural machine translation model mostly adopts an encoder-decoder framework, and an attention mechanism is added to introduce context information and thereby improve translation quality. The best-known NMT model, the Transformer proposed by Ashish Vaswani, Noam Shazeer et al., achieved breakthrough scores on the WMT 2014 tasks.
With the breakthrough success and growing popularity of NMT, some researchers began to perform word alignment between the target sentence and the source sentence using the information captured by the attention mechanism inside the NMT model. Initial attempts were made on NMT models that use recurrent neural networks as the encoder and decoder; later, word alignment information was captured on the Transformer machine translation model. Unlike earlier work that used only the attention matrix, this later method proposed an Explicit Alignment Mode (EAM), which adds training parameters, and introduced an alignment loss difference (PD) to improve word alignment on NMT. Both improved on the results obtained with the attention matrix alone.
In practical experiments, however, NMT-based alignment was found to perform worse than alignment based on statistical methods, and on some data sets it was at a great disadvantage. Nevertheless, alignment on NMT gave researchers a new perspective on word alignment, namely word alignment by deep learning.
In the last two years, the development of pre-trained models has brought new hope to researchers working on word alignment. Because a pre-trained model is trained with an unsupervised method, a large amount of corpus data can be used, so the word vectors in the pre-trained model carry semantic and contextual information. Researchers can therefore directly use the cosine of the word vectors of two words as the similarity between them and then extract the word alignment. A major problem remains, however: word alignment obtained in this way is easy to use, but the alignment quality for phrases and terms is poor.
In summary, the word alignment methods that are currently popular and effective can be divided into three categories: rule-based, statistical, and deep learning-based methods. The rule-based method uses a built-in bilingual dictionary as the criterion for judging the alignment of language units; it is intuitive and easy to implement. The statistical method does not need any rules to be written or knowledge to be learned in advance. It takes the form of a set of probability matrices whose values are to be determined: the probability data are obtained by training a statistical alignment model on real corpora, the probability of each alignment scheme is then calculated from these data, and the scheme with the highest probability is output as the alignment result. Such word alignment models can also be classed as generative models. The deep learning method can be regarded as a process of learning knowledge and then applying it. Learning knowledge means training the huge number of parameters in a network model on a large bilingual alignment data set; applying knowledge means that the network uses its trained parameters to judge the alignment of bilingual sentence pairs it has not seen.
Although the deep learning method can obtain better word alignment data, its number of parameters is huge, so a huge data set is required for training. This consumes manual annotation resources and hinders the use of the word alignment data in many other natural language tasks.
Disclosure of Invention
Aiming at the problems that deep learning requires training on huge data sets, which consumes manual annotation resources and hinders use in other tasks in the natural language field, the invention provides a word alignment performance improving method based on a pre-training model, which can effectively reduce the demand for word alignment data resources.
In order to solve the technical problems, the invention adopts the technical scheme that:
the invention provides a word alignment performance improving method based on a pre-training model, which comprises the following steps of:
1) obtaining word vectors of words in the sentence by using a pre-training model so as to form word vector matrixes X and Y of a translation sentence pair;
2) extracting phrases and terms from the inter-translated sentences corresponding to the word vector matrices X and Y by using a phrase and term extraction tool, then summing and averaging the word vectors of the words in each phrase or term so as to strengthen the relevance among those word vectors, and obtaining updated word vector matrices X and Y of the inter-translated sentence pair;
3) taking a calculated value of the cosine of the word vector between words as the similarity between two words, and obtaining a similarity matrix Sim of the inter-translation sentence pair, wherein the calculation formula is as follows:
Sim=cos(X,Y)
4) taking a convolution kernel κ of size n × n, where n is 2 to 8, and performing a convolution operation on the similarity matrix Sim so that information about the context words is incorporated into the word alignment;
5) extracting the corresponding word alignment information from the updated similarity matrix of the inter-translated sentence pair by using different word alignment extraction methods.
In step 2), the specific step of updating the corresponding word vector matrix is as follows:
201) extracting phrases and terms from the sentences by using a tool, and then constructing a phrase and term list of the data set;
202) matching the phrases and terms against the word vector matrix of the sentence; if the n source words e_i, ......, e_{i+n-1} form a phrase, the word vectors of these n words are each replaced by their average:
e_k = \frac{1}{n} \sum_{t=i}^{i+n-1} e_t, k = i, ......, i+n-1
after the word vectors of all phrases and terms in the sentence are updated, the corresponding updated sentence word vector matrices X and Y are obtained.
In step 4), performing convolution operation on the similarity matrix by using a convolution kernel κ, wherein the specific operation steps of performing convolution are as follows:
401) setting a convolution kernel κ of size n × n, and setting the values of all elements in the convolution kernel to be between 0 and 1;
402) performing the convolution operation on the similarity matrix of the bilingual sentence pair to update the similarity matrix of the bilingual sentence pair.
In step 5), word alignment data are extracted from the similarity matrix of step 4) by using an Argmax method, an Itermax method, and a Match method, respectively, specifically including the following steps:
501) taking the maximum over the rows and columns of the similarity matrix by using the Argmax method; if an element is the maximum of both its row and its column, the source word and the target word of that element are aligned;
502) extracting word alignment information by iterating over the similarity matrix by using the Itermax method; the word alignments obtained in each iteration are combined to form the final word alignment result;
503) mapping the word alignment task to a bipartite graph, and solving the graph by using a Match algorithm to obtain the final word alignment result.
The invention has the following beneficial effects and advantages:
1. The invention uses a pre-training method to solve the problem that deep learning needs a large amount of training data.
2. After the word vector matrices of a sentence pair are obtained, phrases and a term list are used for matching, which solves the problem that the words of a phrase are aligned inconsistently because of the low correlation between their pre-trained word vectors.
3. The invention solves the problem that word alignments are not associated with their context by applying a convolution operation to the similarity matrix of the sentence pair.
4. The method can obtain high-quality word alignment data, which helps to improve other tasks in the field of natural language processing.
5. The method is simple to operate and highly reproducible; its computational complexity is low and it takes little time in practical application, saving both time cost and computation cost.
Drawings
FIG. 1 is a flowchart of a pre-training model-based performance improvement method for word alignment according to the present invention;
FIG. 2 is a diagram of the pre-training model used.
Detailed Description
The invention is further elucidated with reference to the accompanying drawings.
The invention provides a word alignment performance improving method based on a pre-training model. As shown in FIG. 1, the method comprises the following steps:
1) obtaining word vectors of words in the sentence by using a pre-training model so as to form word vector matrixes X and Y of a translation sentence pair;
2) extracting phrases and terms from the inter-translated sentences of step 1) by using a phrase and term extraction tool such as TextBlob, and then summing and averaging the word vectors of the words in each phrase or term so that the correlation among these words becomes stronger, thereby obtaining updated word vector matrices X and Y of the inter-translated sentence pair;
3) then, the calculated value of the cosine of the word vector between the words is used as the similarity between the two words, and then a similarity matrix Sim of the inter-translation sentence pair is obtained, wherein the calculation formula is as follows:
Sim=cos(X,Y)
4) taking a convolution kernel κ of size n × n, where n is 2 to 8, and performing a convolution operation on the similarity matrix Sim so that information about the context words is incorporated into the word alignment;
in this embodiment, the kernel size is 3 × 3; the convolution operation is performed on the similarity matrix Sim obtained in step 3), and information about the context words is thereby merged into the word alignment;
5) extracting word alignment from the inter-translated sentence pair similarity matrix updated according to the step 4) by using three different word alignment extraction methods;
the method of the present invention may further comprise step 6) of using the test data to test the quality of the extracted word alignments of steps 1) to 5).
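Step 3) above, Sim = cos(X, Y), can be illustrated with a minimal NumPy sketch. It assumes that X and Y hold one word vector per row; the function name `cosine_similarity_matrix` and the small `eps` guard against zero-length vectors are illustrative choices, not part of the described method.

```python
import numpy as np

def cosine_similarity_matrix(X, Y, eps=1e-12):
    """Return Sim with Sim[i, j] = cos(X[i], Y[j]).

    X: (source_len, dim) word vectors of the source sentence.
    Y: (target_len, dim) word vectors of the target sentence.
    eps: illustrative guard against division by zero (an assumption).
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Normalise every row to unit length, then take all pairwise dot products.
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), eps)
    Yn = Y / np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), eps)
    return Xn @ Yn.T

# Toy vectors standing in for the output of a pre-trained model.
X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
Y = np.array([[3.0, 0.0], [0.0, 1.0]])
Sim = cosine_similarity_matrix(X, Y)
print(Sim)
```

Row i of the result corresponds to source word i and column j to target word j, which is the layout assumed by the extraction steps.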
In step 1), the word vectors of a sentence are obtained using a pre-trained model. The pre-trained model is trained on a large bilingual corpus with a language modeling task. Each word is thereby represented as a real-valued vector that contains, among other things, the part-of-speech and contextual information of the word.
As shown in FIG. 2, a language model is used when training the pre-trained model. The input w_{m-n+1}, w_{m-n+2}, ......, w_{m-1} in the structure diagram is a window of words, i.e., the words w_{m-n+1}, w_{m-n+2}, ......, w_{m-1} are used to predict the next word w_m. After training is complete, each decoder layer of the pre-trained model generates corresponding real-valued word vectors, which can be used for the word alignment task.
The pre-trained model is typically formed by stacking network layers, which means that each layer generates its own word vectors. Extensive experimental comparison shows that word vectors from the eighth layer generally work well for the word alignment extraction task. Word vectors from lower layers may not yet be well learned, so the information they contain is insufficient. Word vectors from higher layers contain a lot of information, but much of it is what the training objective of the pre-trained model requires, i.e., information specific to language modeling, while the lexical, syntactic and contextual information needed by the present invention is scarce. The feature vectors output by the higher decoder layers are therefore unfavorable for alignment between bilingual sentence pairs.
The word vectors of the words in phrases are processed in step 2) as follows:
201) Phrases and terms in the bilingual sentence pair are first extracted using a phrase extraction tool such as TextBlob, and marks are then stored. To save memory, each mark records the position of the phrase in the sentence.
202) The word vectors of the words in each phrase are processed according to the marks from step 201), which strengthens the relevance among the words. The phrases and terms are matched against the word vector matrix of the sentence; if the n source words e_i, ......, e_{i+n-1} form a phrase, the word vectors of these n words are each replaced by their average:
e_k = \frac{1}{n} \sum_{t=i}^{i+n-1} e_t, k = i, ......, i+n-1
After the word vectors of all phrases and terms in the sentence are updated, the corresponding updated sentence word vector matrices X and Y are obtained.
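The phrase-averaging update of step 2) can be sketched as follows. The sketch assumes that the stored marks are non-overlapping `(start, end)` positions with `end` exclusive; this format and the name `average_phrase_vectors` are hypothetical, chosen only for illustration.

```python
import numpy as np

def average_phrase_vectors(X, phrase_spans):
    """Replace every word vector inside a phrase by the mean of the phrase.

    X: (sentence_len, dim) word vector matrix of one sentence.
    phrase_spans: list of (start, end) positions, end exclusive; this
        mark format is an assumption, and spans must not overlap.
    Returns an updated copy; the input matrix is left unchanged.
    """
    X_new = np.array(X, dtype=float)  # np.array copies its input
    for start, end in phrase_spans:
        # The mean over the phrase rows is broadcast back onto each row.
        X_new[start:end] = X_new[start:end].mean(axis=0)
    return X_new

# Four words; words 0 and 1 are marked as one phrase.
X = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
X_new = average_phrase_vectors(X, [(0, 2)])
print(X_new)
```

After the update, all words of a phrase share one vector, so they receive identical similarity scores against every word of the other sentence.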
In step 4), the convolution operation is performed on the similarity matrix by using a convolution kernel κ, with the following specific steps:
401) setting a convolution kernel κ of size n × n, and setting the values of all elements in the convolution kernel to be between 0 and 1;
402) performing the convolution operation on the similarity matrix Sim obtained in step 3) to update the similarity matrix of the bilingual sentence pair, thereby integrating information about the context words into the word alignment.
When the similarity matrix Sim is calculated, each value depends solely on the word vectors of the two words concerned. After convolution is applied, each value of Sim is also tied to the values around it. Alignments are in fact related to one another in the same way, so this matches the practical requirement. For example, consider the bilingual sentence pair formed by a Chinese sentence meaning "I am very satisfied with me", in which the word for "I" occurs twice, and the English sentence "I am satisfied with me." The alignment of the first Chinese "I" with the English "I" is related to the alignment of the second Chinese "I" with "me": if the first Chinese "I" is aligned with the English "I", the probability that the second Chinese "I" is also aligned with the English "I" is very small. Conversely, aligning the first Chinese "I" with the English "I" makes it more likely that the Chinese word following the first "I" is aligned with the "am" that follows the English "I". Convolution models this kind of dependency.
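The convolution of step 4) can be sketched in plain NumPy. Several details here are assumptions, since the text does not fix them: zero padding is used so that the updated matrix keeps the size of Sim, the kernel is applied without flipping (the usual deep-learning convention), and the 3 × 3 kernel values are hypothetical values between 0 and 1.

```python
import numpy as np

def convolve_similarity(sim, kernel):
    """Convolve a similarity matrix with an n x n kernel, keeping its size.

    Zero padding and the unflipped kernel are assumptions of this sketch.
    """
    sim = np.asarray(sim, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kh, kw = kernel.shape
    top, left = (kh - 1) // 2, (kw - 1) // 2
    # np.pad pads with zeros by default.
    padded = np.pad(sim, ((top, kh - 1 - top), (left, kw - 1 - left)))
    out = np.zeros_like(sim)
    rows, cols = sim.shape
    for di in range(kh):
        for dj in range(kw):
            # Add the neighbour at offset (di - top, dj - left), weighted.
            out += kernel[di, dj] * padded[di:di + rows, dj:dj + cols]
    return out

# Hypothetical kernel: keep the cell itself and reward diagonal neighbours,
# i.e. alignments of the previous and of the next word pair.
kernel = np.array([[0.5, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 0.5]])
sim = np.eye(3)
sim_updated = convolve_similarity(sim, kernel)
print(sim_updated)
```

With this kernel, a cell's score rises when the cells diagonally before and after it are also high, which is one way of letting neighbouring alignments support each other.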
In step 5), word alignment data are extracted from the similarity matrix of step 4) by using the Argmax method, the Itermax method and the Match method, respectively, because the results obtained by different algorithms have different strengths in terms of precision, recall and F1 value. The user therefore first selects the extraction algorithm that suits the requirements at hand. The specific operation steps are as follows:
501) First, the Argmax method takes the maximum over the rows and columns of the similarity matrix; if an element is the maximum of both its row and its column, the source word and the target word of that element are aligned. The formula for Argmax is as follows:
A_{ij} = \begin{cases} 1, & Sim_{ij} = \max_{k} Sim_{ik} \text{ and } Sim_{ij} = \max_{k} Sim_{kj} \\ 0, & \text{otherwise} \end{cases}
where i represents the horizontal axis of Sim, i.e., the position of the word in the source sentence, and j represents the vertical axis of Sim, i.e., the position of the word in the target sentence. If the similarity of word i and word j is the maximum in both the horizontal and the vertical direction of Sim, then A_{ij} = 1 and word i and word j have an alignment relation.
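A minimal sketch of the Argmax extraction follows, assuming rows of Sim index source words and columns index target words. Ties are resolved by taking the first maximum, which is a property of `numpy.argmax` rather than something the text specifies.

```python
import numpy as np

def argmax_align(sim):
    """Align (i, j) when sim[i, j] is the maximum of both row i and column j."""
    sim = np.asarray(sim, dtype=float)
    row_best = sim.argmax(axis=1)  # best target word for each source word
    col_best = sim.argmax(axis=0)  # best source word for each target word
    A = np.zeros(sim.shape, dtype=int)
    for i, j in enumerate(row_best):
        if col_best[j] == i:       # the choice is mutual
            A[i, j] = 1
    return A

sim = np.array([[0.9, 0.1, 0.1],
                [0.8, 0.3, 0.2],
                [0.1, 0.2, 0.7]])
A = argmax_align(sim)
print(A)
```

In this toy matrix, source word 1 prefers target word 0, but target word 0 prefers source word 0, so source word 1 stays unaligned.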
502) The Itermax method extracts word alignment information by iterating over the similarity matrix; the word alignments obtained in each iteration are combined to form the final word alignment result. Itermax is an improvement on the Argmax method: when few elements satisfy the Argmax condition although further word alignments do exist, the iterations of the Itermax method are needed to find these remaining word alignments.
An Argmax operation is first performed on Sim to obtain the word alignment matrix A. A has the same size as Sim, and every value in A is either 0 or 1: if A_{ij} = 1, word i and word j are aligned; if A_{ij} = 0, they are not. After the alignment matrix A is obtained, Sim is processed further. For each Sim_{ij}: if A_{ij} = 1, then Sim_{ij} = 0; if A_{ij} = 0 but the i-th row or the j-th column of the alignment matrix A contains a 1, then Sim_{ij} = Sim_{ij} × α, where α is a hyper-parameter with a value range of (0, 1); if A_{ij} = 0 and neither the i-th row nor the j-th column of A contains a 1, the value of Sim_{ij} is not changed. After Sim has been transformed in this way, the Argmax operation is performed again to obtain a temporary alignment matrix A', and the alignment matrix A is then updated as follows:
A = A + A'
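The Itermax procedure can be sketched as follows. The values `alpha=0.9` and `n_iter=2` are hypothetical defaults. Two further choices are assumptions of this sketch: a link is only accepted when its score is positive, so that zeroed cells cannot be selected again, and the discount is applied to the original Sim in every iteration rather than compounded.

```python
import numpy as np

def mutual_argmax(sim):
    """Argmax step: 1 where a positive cell is the max of its row and column."""
    row_best = sim.argmax(axis=1)
    col_best = sim.argmax(axis=0)
    A = np.zeros(sim.shape, dtype=int)
    for i, j in enumerate(row_best):
        if col_best[j] == i and sim[i, j] > 0:
            A[i, j] = 1
    return A

def itermax_align(sim, alpha=0.9, n_iter=2):
    """Iteratively add alignments that a single Argmax pass misses."""
    sim = np.array(sim, dtype=float)
    A = mutual_argmax(sim)
    for _ in range(n_iter - 1):
        # Cells whose row or column already contains an alignment.
        covered = A.any(axis=1, keepdims=True) | A.any(axis=0, keepdims=True)
        work = np.where(covered, sim * alpha, sim)  # discount covered cells
        work[A == 1] = 0.0                          # zero aligned cells
        A_new = mutual_argmax(work)
        if not A_new.any():
            break
        A = A + A_new                               # A = A + A'
    return A

sim = np.array([[0.9, 0.1, 0.1],
                [0.8, 0.3, 0.2],
                [0.1, 0.2, 0.7]])
A = itermax_align(sim)
print(A)
```

On this toy matrix, the first pass links (0, 0) and (2, 2); the second pass adds (1, 0), so source word 1 is no longer left unaligned.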
503) The invention maps the word alignment task to a bipartite graph and then solves the graph by using a Match algorithm to obtain the final word alignment result. The Match method regards the extraction of word alignment from Sim as a maximum bipartite graph matching problem. Each word is regarded as a vertex, and the vertices form a graph G. All vertices of the source language form subset A and all vertices of the target language form subset B. Subsets A and B obviously do not intersect, so G is a bipartite graph, and the word alignment problem is abstracted into a maximum bipartite graph matching problem.
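The bipartite matching view can be illustrated with a deliberately naive sketch. It assumes that the edge weights are the values of Sim and that the matching with the largest total weight is sought; it finds that matching by exhaustive search, which is only practical for very short sentences. This is not the specific Match algorithm of the method; a practical implementation would use a polynomial-time solver such as the Hungarian algorithm (for example SciPy's `linear_sum_assignment`).

```python
from itertools import permutations

import numpy as np

def match_align(sim):
    """Maximum-weight bipartite matching by brute force (illustration only).

    Every word of the shorter sentence is matched to a distinct word of
    the longer sentence so that the summed similarity is maximal.
    """
    sim = np.asarray(sim, dtype=float)
    transposed = sim.shape[0] > sim.shape[1]
    if transposed:                 # make the rows the shorter side
        sim = sim.T
    m, n = sim.shape
    best_score, best_cols = -np.inf, ()
    for cols in permutations(range(n), m):
        score = sum(sim[i, j] for i, j in enumerate(cols))
        if score > best_score:
            best_score, best_cols = score, cols
    A = np.zeros((m, n), dtype=int)
    for i, j in enumerate(best_cols):
        A[i, j] = 1
    return A.T if transposed else A

sim = np.array([[0.9, 0.1, 0.1],
                [0.8, 0.3, 0.2],
                [0.1, 0.2, 0.7]])
A = match_align(sim)
print(A)
```

Unlike Argmax, the matching assigns source word 1 to target word 1 here, because each target word can be used only once and this assignment gives the best total score.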
Although all three methods can solve word alignment, their results differ significantly. Experiments and analysis show that the word alignment obtained by Argmax has higher precision than that of the other two methods, Itermax has a somewhat higher recall than the other two methods, and Match has a somewhat higher F1 value.
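For reference, the three measures used in this comparison can be computed from a predicted alignment and a gold-standard alignment as sketched below. Representing alignments as sets of `(source_index, target_index)` pairs is an assumption of this sketch, and the example links are made up for illustration, not experimental data.

```python
def alignment_scores(predicted, gold):
    """Return (precision, recall, F1) of predicted links against gold links."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up links, for illustration only.
predicted = {(0, 0), (1, 0), (2, 2)}
gold = {(0, 0), (1, 1), (2, 2)}
precision, recall, f1 = alignment_scores(predicted, gold)
print(precision, recall, f1)
```

A method that outputs fewer, safer links tends to score higher on precision, while one that adds links over several iterations tends to score higher on recall.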
With this method, a word alignment result better than that of a word alignment system based on a statistical method can be obtained. The invention was tested with the Europarl gold alignments as the English-to-German test set, the data set of Tavakoli and Faili as the English-to-Persian test set, and the data set of Bojar and Prokopová as the English-to-Danish test set. Good results were achieved on all of them, demonstrating in the laboratory the feasibility of the invention and its advantage in alignment quality.
With the present invention, users can obtain higher-quality word alignment data and apply them to downstream tasks or use them as external information. For example, the word alignment information can be used to help train a machine translation model: during machine translation, if the word alignment information of the word currently being predicted is available, the target sentence can be positioned more precisely, yielding a higher-quality translation result.