Training method, device and equipment of sequence labeling model and storage medium
1. A training method of a sequence labeling model is characterized by comprising the following steps:
acquiring text data required to be input by a sequence labeling model, and performing vector conversion on the text data to obtain an input_ids vector, a segment_ids vector and a mask vector; the sequence labeling model comprises a Bert model and a Span model, the input_ids vector is the number of each word in the text data in a Bert dictionary, the segment_ids vector is used for marking sentences to which each word in the text data belongs, and the mask vector is used for marking words and non-words in the text data;
inputting the input_ids vector, the segment_ids vector and the mask vector into the Bert model for training to obtain an output sequence;
acquiring boundary characteristic data in the text data, and carrying out word vector coding on the boundary characteristic data to obtain a boundary vector;
connecting the output sequence with the boundary vector to obtain a connection vector;
determining a starting position vector and an ending position vector of the boundary characteristic data;
connecting the connecting vector with the initial position vector of the boundary characteristic data by using the Span model, and obtaining an initial logits value after linear transformation;
connecting the connecting vector with the ending position vector of the boundary characteristic data by using the Span model, and obtaining an ending logits value after linear transformation;
calculating cross entropy loss according to the initial logits value and the initial position vector to obtain an initial loss value;
calculating cross entropy loss according to the ending logits value and the ending position vector to obtain an ending loss value;
calculating a total loss value of the sequence labeling model according to the initial loss value and the end loss value, and judging whether the total loss value meets a preset threshold value;
and when the total loss value meets a preset threshold value, finishing the training of the sequence labeling model.
2. The method of claim 1, wherein the step of determining whether the total loss value satisfies a predetermined threshold further comprises:
when the total loss value does not meet a preset threshold value, adjusting the starting logits value and the ending logits value according to the total loss value;
resetting the parameters of the Bert model according to the adjusted initial logits value and the adjusted ending logits value;
and returning to the step of inputting the input_ids vector, the segment_ids vector and the mask vector into the Bert model for training, so as to train the Bert model with the reset parameters, stopping training when the total loss value meets the preset threshold value, and storing the starting logits value, the ending logits value and the parameters corresponding to the total loss value that meets the preset threshold value.
3. The method of claim 1, wherein the step of determining the start position vector and the end position vector of the boundary feature data comprises:
acquiring text sample data and labeled data labeled to the text sample data; the text sample data is reference text data which needs to be subjected to reference starting position and reference ending position labeling of a target entity word, and the labeling data comprises the labeled target entity word in the text sample data and the reference starting position and the reference ending position of the target entity word;
and determining the starting position and the ending position of the boundary feature data according to the reference starting position and the reference ending position of the target entity word, and generating a starting position vector corresponding to the starting position and an ending position vector corresponding to the ending position.
4. The method of claim 3, wherein the step of generating a starting position vector corresponding to the starting position and an ending position vector corresponding to the ending position comprises:
setting the initial position of the boundary feature data to be 1, and setting the rest positions except the initial position in the boundary feature data to be 0 to obtain the initial position vector;
initializing the boundary feature data, setting the end position of the initialized boundary feature data to be 1, and setting the rest positions except the end position in the initialized boundary feature data to be 0 to obtain the end position vector.
5. The method of claim 1, wherein before the step of inputting the input_ids vector, the segment_ids vector, and the mask vector into the Bert model for training, the method further comprises:
judging whether the vector length of the input_ids vector reaches the maximum length of a sentence preset for the Bert model or not;
and if not, filling 0 at the tail of the input_ids vector until the vector length of the filled input_ids vector reaches the maximum length of a sentence preset for the Bert model, and executing the step of inputting the input_ids vector, the segment_ids vector and the mask vector into the Bert model for training.
6. The method of claim 1, wherein the step of calculating cross entropy loss based on the starting logits value and the starting position vector to obtain a starting loss value comprises the following formula:
start_loss = start_positions * log_a(start_logits) + (1 - start_positions) * log(1 - start_logits);
the step of calculating the cross entropy loss according to the ending logits value and the ending position vector to obtain an ending loss value comprises the following formula:
end_loss = end_positions * log_a(end_logits) + (1 - end_positions) * log(1 - end_logits);
the step of calculating the total loss value of the sequence labeling model according to the starting loss value and the ending loss value comprises the following formula:
total_loss = start_loss + end_loss;
wherein start_loss is the starting loss value, start_positions is the starting position vector, start_logits is the starting logits value, end_loss is the ending loss value, end_positions is the ending position vector, end_logits is the ending logits value, total_loss is the total loss value, and a is a constant.
7. The method according to claim 1, wherein the step of connecting the connection vector and the start position vector of the boundary feature data by using the Span model and obtaining a start logits value after linear transformation comprises the following formula:
start_logits = W^T · concat(concat_sequence, start_positions) + b;
the step of connecting the connecting vector and the ending position vector of the boundary characteristic data by using the Span model, and obtaining an ending logits value after linear transformation comprises the following formula:
end_logits = W^T · concat(concat_sequence, end_positions) + b;
wherein start_logits is the starting logits value, concat_sequence is the connection vector, start_positions is the starting position vector, end_logits is the ending logits value, end_positions is the ending position vector, W^T is the weight preset in the Span model, and b is a constant.
8. A training device for a sequence labeling model is characterized by comprising:
the conversion module is used for acquiring text data required to be input by the sequence labeling model and performing vector conversion on the text data to obtain an input_ids vector, a segment_ids vector and a mask vector; the sequence labeling model comprises a Bert model and a Span model, the input_ids vector is the number of each word in the text data in a Bert dictionary, the segment_ids vector is used for marking sentences to which each word in the text data belongs, and the mask vector is used for marking words and non-words in the text data;
the input module is used for inputting the input_ids vector, the segment_ids vector and the mask vector into the Bert model for training to obtain an output sequence;
the encoding module is used for acquiring boundary characteristic data in the text data and carrying out word vector encoding on the boundary characteristic data to obtain a boundary vector;
the connection module is used for connecting the output sequence with the boundary vector to obtain a connection vector;
the determining module is used for determining a starting position vector and an ending position vector of the boundary characteristic data;
the first linear transformation module is used for connecting the connecting vector with the initial position vector of the boundary characteristic data by using the Span model, and obtaining an initial logits value after linear transformation;
the second linear transformation module is used for connecting the connecting vector with the ending position vector of the boundary characteristic data by using the Span model, and obtaining an ending logits value after linear transformation;
the first calculation module is used for calculating cross entropy loss according to the initial logits value and the initial position vector to obtain an initial loss value;
the second calculation module is used for calculating cross entropy loss according to the ending logits value and the ending position vector to obtain an ending loss value;
the judging module is used for calculating the total loss value of the sequence labeling model according to the initial loss value and the finishing loss value and judging whether the total loss value meets a preset threshold value or not;
and the completion module is used for completing the training of the sequence labeling model when the total loss value meets a preset threshold value.
9. A computer device, comprising:
one or more processors;
a memory;
one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, the one or more computer programs configured to perform the method of training of a sequence annotation model according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, implements the method for training a sequence annotation model according to any one of claims 1 to 7.
Background
As text data grows on the internet, more and more services and applications rely on the assistance of technologies such as knowledge extraction to provide better services. In the specific task of knowledge extraction, the technology of analyzing text data without natural separation plays an important role.
In the prior art, the processing of text data without natural separation is still affected by word segmentation errors, and the accuracy of boundary prediction during word segmentation is low; that is, the starting position or the ending position of an entity extracted by a model is wrong. For example, for the text data "how to cancel xxx auto-renewal?", an existing model extracts "xxx automatic renewal" as the name of the insurance product, whereas the actual name of the insurance product is "xxx"; this word segmentation error is an entity ending position prediction error.
Disclosure of Invention
The present application mainly aims to provide a training method, an apparatus, a device, and a storage medium for a sequence tagging model, so as to improve accuracy of boundary prediction when performing word segmentation on text data.
In order to achieve the above object, the present application provides a training method for a sequence annotation model, which includes the following steps:
acquiring text data required to be input by a sequence labeling model, and performing vector conversion on the text data to obtain an input_ids vector, a segment_ids vector and a mask vector; the sequence labeling model comprises a Bert model and a Span model, the input_ids vector is the number of each word in the text data in a Bert dictionary, the segment_ids vector is used for marking sentences to which each word in the text data belongs, and the mask vector is used for marking words and non-words in the text data;
inputting the input_ids vector, the segment_ids vector and the mask vector into the Bert model for training to obtain an output sequence;
acquiring boundary characteristic data in the text data, and carrying out word vector coding on the boundary characteristic data to obtain a boundary vector;
connecting the output sequence with the boundary vector to obtain a connection vector;
determining a starting position vector and an ending position vector of the boundary characteristic data;
connecting the connecting vector with the initial position vector of the boundary characteristic data by using the Span model, and obtaining an initial logits value after linear transformation;
connecting the connecting vector with the ending position vector of the boundary characteristic data by using the Span model, and obtaining an ending logits value after linear transformation;
calculating cross entropy loss according to the initial logits value and the initial position vector to obtain an initial loss value;
calculating cross entropy loss according to the ending logits value and the ending position vector to obtain an ending loss value;
calculating a total loss value of the sequence labeling model according to the initial loss value and the end loss value, and judging whether the total loss value meets a preset threshold value;
and when the total loss value meets a preset threshold value, finishing the training of the sequence labeling model.
Further, after the step of determining whether the total loss value satisfies a preset threshold, the method further includes:
when the total loss value does not meet a preset threshold value, adjusting the starting logits value and the ending logits value according to the total loss value;
resetting the parameters of the Bert model according to the adjusted initial logits value and the adjusted ending logits value;
and returning to the step of inputting the input_ids vector, the segment_ids vector and the mask vector into the Bert model for training, so as to train the Bert model with the reset parameters, stopping training when the total loss value meets the preset threshold value, and storing the starting logits value, the ending logits value and the parameters corresponding to the total loss value that meets the preset threshold value.
Preferably, the step of determining a start position vector and an end position vector of the boundary feature data includes:
acquiring text sample data and labeled data labeled to the text sample data; the text sample data is reference text data which needs to be subjected to reference starting position and reference ending position labeling of a target entity word, and the labeling data comprises the labeled target entity word in the text sample data and the reference starting position and the reference ending position of the target entity word;
and determining the starting position and the ending position of the boundary feature data according to the reference starting position and the reference ending position of the target entity word, and generating a starting position vector corresponding to the starting position and an ending position vector corresponding to the ending position.
Preferably, the step of generating a starting position vector corresponding to the starting position and an ending position vector corresponding to the ending position includes:
setting the initial position of the boundary feature data to be 1, and setting the rest positions except the initial position in the boundary feature data to be 0 to obtain the initial position vector;
initializing the boundary feature data, setting the end position of the initialized boundary feature data to be 1, and setting the rest positions except the end position in the initialized boundary feature data to be 0 to obtain the end position vector.
Further, before the step of inputting the input_ids vector, the segment_ids vector, and the mask vector into the Bert model for training, the method further includes:
judging whether the vector length of the input_ids vector reaches the maximum length of a sentence preset for the Bert model or not;
and if not, filling 0 at the tail of the input_ids vector until the vector length of the filled input_ids vector reaches the maximum length of a sentence preset for the Bert model, and executing the step of inputting the input_ids vector, the segment_ids vector and the mask vector into the Bert model for training.
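The padding step above can be illustrated with a minimal Python sketch. The function name pad_to_max_length and the parameter names are hypothetical, and the text does not say what happens when the vector is already longer than the preset maximum, so the sketch simply returns such a vector unchanged.

```python
def pad_to_max_length(input_ids, max_length, pad_id=0):
    """Fill pad_id at the tail of input_ids until it reaches max_length.

    Assumption: vectors that already reach (or exceed) max_length are
    returned unchanged, since the text only covers the shorter case.
    """
    padded = list(input_ids)
    if len(padded) < max_length:
        padded.extend([pad_id] * (max_length - len(padded)))
    return padded


short_vector = pad_to_max_length([2, 3, 4], max_length=6)
full_vector = pad_to_max_length([2, 3, 4], max_length=3)
print(short_vector)
print(full_vector)
```

In practice the segment_ids vector and the mask vector would presumably be padded to the same length so that the three vectors stay aligned; that is an assumption, not something the text states.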
Preferably, the step of calculating cross entropy loss according to the starting logits value and the starting position vector to obtain a starting loss value includes the following formula:
start_loss = start_positions * log_a(start_logits) + (1 - start_positions) * log(1 - start_logits);
the step of calculating the cross entropy loss according to the ending logits value and the ending position vector to obtain an ending loss value comprises the following formula:
end_loss = end_positions * log_a(end_logits) + (1 - end_positions) * log(1 - end_logits);
the step of calculating the total loss value of the sequence labeling model according to the starting loss value and the ending loss value comprises the following formula:
total_loss = start_loss + end_loss;
wherein start_loss is the starting loss value, start_positions is the starting position vector, start_logits is the starting logits value, end_loss is the ending loss value, end_positions is the ending position vector, end_logits is the ending logits value, total_loss is the total loss value, and a is a constant.
Preferably, the step of connecting the connection vector and the initial position vector of the boundary feature data by using the Span model and obtaining an initial logits value after linear transformation includes the following formula:
start_logits = W^T · concat(concat_sequence, start_positions) + b;
the step of connecting the connecting vector and the ending position vector of the boundary characteristic data by using the Span model, and obtaining an ending logits value after linear transformation comprises the following formula:
end_logits = W^T · concat(concat_sequence, end_positions) + b;
wherein start_logits is the starting logits value, concat_sequence is the connection vector, start_positions is the starting position vector, end_logits is the ending logits value, end_positions is the ending position vector, W^T is the weight preset in the Span model, and b is a constant.
The present application further provides a training device for sequence labeling model, which includes:
the conversion module is used for acquiring text data required to be input by the sequence labeling model and performing vector conversion on the text data to obtain an input_ids vector, a segment_ids vector and a mask vector; the sequence labeling model comprises a Bert model and a Span model, the input_ids vector is the number of each word in the text data in a Bert dictionary, the segment_ids vector is used for marking sentences to which each word in the text data belongs, and the mask vector is used for marking words and non-words in the text data;
the input module is used for inputting the input_ids vector, the segment_ids vector and the mask vector into the Bert model for training to obtain an output sequence;
the encoding module is used for acquiring boundary characteristic data in the text data and carrying out word vector encoding on the boundary characteristic data to obtain a boundary vector;
the connection module is used for connecting the output sequence with the boundary vector to obtain a connection vector;
the determining module is used for determining a starting position vector and an ending position vector of the boundary characteristic data;
the first linear transformation module is used for connecting the connecting vector with the initial position vector of the boundary characteristic data by using the Span model, and obtaining an initial logits value after linear transformation;
the second linear transformation module is used for connecting the connecting vector with the ending position vector of the boundary characteristic data by using the Span model, and obtaining an ending logits value after linear transformation;
the first calculation module is used for calculating cross entropy loss according to the initial logits value and the initial position vector to obtain an initial loss value;
the second calculation module is used for calculating cross entropy loss according to the ending logits value and the ending position vector to obtain an ending loss value;
the judging module is used for calculating the total loss value of the sequence labeling model according to the initial loss value and the finishing loss value and judging whether the total loss value meets a preset threshold value or not;
and the completion module is used for completing the training of the sequence labeling model when the total loss value meets a preset threshold value.
The present application further provides a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of any of the above methods when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the methods described above.
According to the training method, device, equipment and storage medium of the sequence labeling model provided by the present application, text data required to be input by the sequence labeling model is first obtained, vector conversion is performed on the text data to obtain an input_ids vector, a segment_ids vector and a mask vector, and the input_ids vector, the segment_ids vector and the mask vector are input into the Bert model for training so as to adjust relevant parameters of the Bert model and obtain an output sequence; then boundary characteristic data in the text data is obtained, word vector coding is performed on the boundary characteristic data to obtain a boundary vector, the output sequence is connected with the boundary vector to obtain a connection vector, a starting position vector and an ending position vector of the boundary characteristic data are determined, the connection vector is connected with the starting position vector of the boundary characteristic data by using the Span model, and a starting logits value is obtained after linear transformation; the connection vector is connected with the ending position vector of the boundary characteristic data by using the Span model, and an ending logits value is obtained after linear transformation; cross entropy loss is calculated according to the starting logits value and the starting position vector to obtain a starting loss value; cross entropy loss is calculated according to the ending logits value and the ending position vector to obtain an ending loss value; and finally, the total loss value of the sequence labeling model is calculated according to the starting loss value and the ending loss value, and the training of the sequence labeling model is finished when the total loss value meets a preset threshold value.
According to the sequence labeling model formed by the Bert model and the Span model, the boundary characteristic data are added into the text data input into the sequence labeling model, the boundary characteristic data are mapped to the same vector space of the Span model, the text data of the boundary position to be extracted are strengthened, meanwhile, the total loss value is calculated based on the boundary characteristic data, so that the sequence labeling model after each training is accurately evaluated, and the trained sequence labeling model can accurately predict the boundary information.
Drawings
FIG. 1 is a flowchart illustrating a training method of a sequence annotation model according to an embodiment of the present application;
FIG. 2 is a block diagram illustrating the structure of a training apparatus for a sequence annotation model according to an embodiment of the present application;
FIG. 3 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to FIG. 1, the present application provides a training method for a sequence labeling model, which addresses the problem that, when current models perform word segmentation, the starting position or ending position of an extracted entity is wrong and the accuracy of boundary prediction is low. In one embodiment, the training method of the sequence labeling model includes the following steps:
S11, acquiring text data required to be input by the sequence labeling model, and performing vector conversion on the text data to obtain an input_ids vector, a segment_ids vector and a mask vector; the sequence labeling model comprises a Bert model and a Span model, the input_ids vector is the number of each word in the text data in a Bert dictionary, the segment_ids vector is used for marking sentences to which each word in the text data belongs, and the mask vector is used for marking words and non-words in the text data;
S12, inputting the input_ids vector, the segment_ids vector and the mask vector into the Bert model for training to obtain an output sequence;
S13, obtaining boundary characteristic data in the text data, and carrying out word vector coding on the boundary characteristic data to obtain a boundary vector;
S14, connecting the output sequence with the boundary vector to obtain a connection vector;
S15, determining a starting position vector and an ending position vector of the boundary characteristic data;
S16, connecting the connecting vector with the initial position vector of the boundary characteristic data by using the Span model, and obtaining an initial logits value after linear transformation;
S17, connecting the connecting vector and the ending position vector of the boundary characteristic data by using the Span model, and obtaining an ending logits value after linear transformation;
S18, calculating cross entropy loss according to the initial logits value and the initial position vector to obtain an initial loss value;
S19, calculating cross entropy loss according to the ending logits value and the ending position vector to obtain an ending loss value;
S20, calculating a total loss value of the sequence labeling model according to the initial loss value and the end loss value, and judging whether the total loss value meets a preset threshold value;
and S21, finishing the training of the sequence labeling model when the total loss value meets a preset threshold value.
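The stopping rule in steps S20 and S21 can be sketched as a small control loop. This is only an illustration under stated assumptions: "meets a preset threshold value" is read as "total loss is less than or equal to the threshold", and train_until_threshold, train_step and max_rounds are hypothetical names. The toy train_step halves a stored loss on each call and stands in for one real round of steps S12 to S20.

```python
def train_until_threshold(train_step, threshold, max_rounds=1000):
    """Repeat train_step() until its returned total loss meets the threshold.

    Assumption: the threshold is met when total_loss <= threshold.
    max_rounds is a hypothetical safety limit not mentioned in the text.
    """
    for round_index in range(1, max_rounds + 1):
        total_loss = train_step()
        if total_loss <= threshold:
            return round_index, total_loss
    raise RuntimeError("total loss did not meet the threshold")


# Toy stand-in for one training round: the loss halves on every call.
state = {"loss": 1.0}

def toy_train_step():
    state["loss"] *= 0.5
    return state["loss"]


rounds, final_loss = train_until_threshold(toy_train_step, threshold=0.1)
print(rounds, final_loss)
```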
As described in step S11 above, the sequence labeling model may be composed of a Bert model and a Span model, and each piece of text data to be input into the Bert model needs to be converted into three vectors: the input_ids vector, which holds the number of each word of the text data in the Bert dictionary; the segment_ids vector, which, when the text data includes a plurality of sentences, marks each word with the id of the sentence to which it belongs; and the mask vector, which is formed by setting word positions to 1 and non-word positions to 0 so as to distinguish words from non-words, where non-words are punctuation marks, mathematical symbols or special characters in the text data.
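The conversion in step S11 can be sketched in a few lines of Python. The sketch follows the definitions given in the text (in particular, the mask vector marks words versus non-words). TOY_VOCAB, NON_WORDS and convert_to_vectors are hypothetical names, and the toy dictionary merely stands in for the much larger Bert dictionary.

```python
# Hypothetical toy dictionary standing in for the Bert dictionary.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "how": 2, "to": 3,
             "cancel": 4, "xxx": 5, "?": 6, "thanks": 7}
# Assumed set of non-word symbols (punctuation and similar characters).
NON_WORDS = {"?", ".", ",", "!"}


def convert_to_vectors(sentences):
    """Convert tokenized sentences into input_ids, segment_ids and mask."""
    input_ids, segment_ids, mask = [], [], []
    for sentence_index, tokens in enumerate(sentences):
        for token in tokens:
            # Number of the word in the dictionary ([UNK] if missing).
            input_ids.append(TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]))
            # Id of the sentence to which the word belongs.
            segment_ids.append(sentence_index)
            # 1 for words, 0 for non-words.
            mask.append(0 if token in NON_WORDS else 1)
    return input_ids, segment_ids, mask


input_ids, segment_ids, mask = convert_to_vectors(
    [["how", "to", "cancel", "xxx", "?"], ["thanks"]]
)
print(input_ids)    # [2, 3, 4, 5, 6, 7]
print(segment_ids)  # [0, 0, 0, 0, 0, 1]
print(mask)         # [1, 1, 1, 1, 0, 1]
```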
BERT (Bidirectional Encoder Representations from Transformers) is a language model that effectively improves many natural language processing tasks, including sentence-level tasks such as natural language inference and paraphrasing, and token-level tasks such as named entity recognition and SQuAD question answering. In addition, the BERT model can effectively use context information, determining the embedding of a word according to the context in which the word is located, thereby obtaining contextualized word embeddings.
In a pre-training process based on public datasets, the BERT model typically uses both masked language modeling and next sentence prediction as loss functions. However, in real-world intelligent question answering, manual dialog logs differ from the document format of traditional machine reading: the sequential coherence between dialogs is usually weak, and there is no obvious context. Therefore, in the embodiments herein, when the BERT model is pre-trained using the training corpus, only the masked language model may be selected as the loss function, without using next sentence prediction. In this way, the pre-training of the BERT model can be completed in a more targeted manner.
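To make the masked language model objective concrete, the following simplified Python sketch builds masked inputs and the labels the model would be asked to recover. It is an illustration only: mask_tokens and its parameters are hypothetical names, the 15% masking rate is the one used in the original BERT pre-training, and the sketch omits BERT's additional rule of sometimes keeping or randomly replacing a selected token.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens by mask_token.

    Returns (masked_tokens, labels); labels holds the original token at
    masked positions and None elsewhere. Simplified: every selected token
    is replaced by mask_token.
    """
    rng = random.Random(seed)
    masked_tokens, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked_tokens.append(mask_token)
            labels.append(token)
        else:
            masked_tokens.append(token)
            labels.append(None)
    return masked_tokens, labels


tokens = ["how", "to", "cancel", "xxx", "auto-renewal", "?"]
masked_tokens, labels = mask_tokens(tokens)
print(masked_tokens)
print(labels)
```

With only this objective, each training example is a single dialog turn; no sentence pairs need to be constructed, which matches the motivation given above for dropping next sentence prediction.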
As described in step S12, the input_ids vector, the segment_ids vector and the mask vector are input into the Bert model to train the Bert model, and the resulting output sequence is denoted sequence_output. The output sequence is the initial result of preliminarily segmenting the text data into a plurality of words, and serves as one input of the Span model.
As described in step S13, the boundary feature data of the text data includes the semantic relationship or correlation between words: if two words are semantically similar or can together form a word, the semantic relationship between them is close and their correlation is strong. The boundary vector may be a one-hot coded vector indicating whether the position of each word in the text data is a component of an entity; a position that is a component of an entity is filled with 1 and any other position is filled with 0, thereby generating the boundary vector.
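The 0/1 filling described for the boundary vector can be sketched as follows. The function name build_boundary_vector and the representation of entities as inclusive (start, end) index pairs are assumptions made for illustration.

```python
def build_boundary_vector(num_tokens, entity_spans):
    """Fill 1 at every position that is a component of an entity, else 0.

    entity_spans: list of (start, end) index pairs, both inclusive
    (an assumed representation, not given in the text).
    """
    boundary_vector = [0] * num_tokens
    for start, end in entity_spans:
        for position in range(start, end + 1):
            boundary_vector[position] = 1
    return boundary_vector


# Tokens: how / to / cancel / xxx / auto-renewal / ?
# The entity "xxx" occupies position 3 only.
boundary_vector = build_boundary_vector(6, [(3, 3)])
print(boundary_vector)  # [0, 0, 0, 1, 0, 0]
```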
As described in step S14, this step concatenates the output sequence sequence_output with the boundary vector to obtain a connection vector, denoted concat_sequence, which contains the boundary information.
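One possible reading of this concatenation is sketched below. The text does not specify the axis along which the two are joined; the sketch assumes that sequence_output holds one hidden vector per token and that the boundary value of each token is appended to that token's hidden vector as an extra feature. The function name concat_with_boundary is hypothetical.

```python
def concat_with_boundary(sequence_output, boundary_vector):
    """Append each token's boundary value to its hidden vector.

    Assumption: concatenation is along the feature dimension, one
    boundary value per token.
    """
    if len(sequence_output) != len(boundary_vector):
        raise ValueError("one boundary value is required per token")
    return [
        list(hidden) + [float(flag)]
        for hidden, flag in zip(sequence_output, boundary_vector)
    ]


sequence_output = [[0.1, 0.2], [0.3, 0.4]]  # two tokens, hidden size 2
concat_sequence = concat_with_boundary(sequence_output, [0, 1])
print(concat_sequence)  # [[0.1, 0.2, 0.0], [0.3, 0.4, 1.0]]
```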
As described in step S15, each piece of boundary feature data required by the Span model needs to be converted into two vectors. The starting position vector start_positions represents the starting features of entity words in the text data; it is obtained by setting the starting position of each entity to the entity type id and filling the other positions with 0. The ending position vector end_positions represents the ending features of entity words in the text data; it is obtained by setting the ending position of each entity to the entity type id and filling the other positions with 0.
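A minimal sketch of building the two position vectors is given below. The function name build_position_vectors and the (start, end, type_id) tuple layout are assumptions; the filling rule (entity type id at the boundary position, 0 elsewhere) follows the paragraph above.

```python
def build_position_vectors(num_tokens, entities):
    """Build start_positions and end_positions for a token sequence.

    entities: list of (start, end, type_id) with inclusive indices
    (an assumed representation). The starting position of each entity
    is set to its type id in start_positions, the ending position in
    end_positions; all other positions stay 0.
    """
    start_positions = [0] * num_tokens
    end_positions = [0] * num_tokens
    for start, end, type_id in entities:
        start_positions[start] = type_id
        end_positions[end] = type_id
    return start_positions, end_positions


# One entity of type id 1 spanning positions 2 to 4 of a 6-token text.
start_positions, end_positions = build_position_vectors(6, [(2, 4, 1)])
print(start_positions)  # [0, 0, 1, 0, 0, 0]
print(end_positions)    # [0, 0, 0, 0, 1, 0]
```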
As described in the above steps S16 and S17, the step of connecting the connection vector with the start position vector of the boundary feature data by using the Span model, and obtaining the starting logits value after linear transformation, uses the following formula:
$$\mathrm{start\_logits} = W^{T} \cdot \mathrm{concat}(\mathrm{concat\_sequence}, \mathrm{start\_positions}) + b;$$
the step of connecting the connection vector with the end position vector of the boundary feature data by using the Span model, and obtaining the ending logits value after linear transformation, uses the following formula:
$$\mathrm{end\_logits} = W^{T} \cdot \mathrm{concat}(\mathrm{concat\_sequence}, \mathrm{end\_positions}) + b;$$
wherein start_logits is the starting logits value, end_logits is the ending logits value, concat_sequence is the connection vector, start_positions is the start position vector, end_positions is the end position vector, $W^{T}$ is the weight preset in the Span model, and $b$ is a constant.
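The linear transformation can be illustrated with the following minimal sketch. It assumes one scalar logit per position, a single weight list shared by all positions, and the per-position feature layout used for concat_sequence; the function name span_logits and the toy numbers are hypothetical.

```python
def span_logits(concat_sequence, positions, weights, bias):
    """Compute W^T . concat(features, position) + b for every position."""
    logits = []
    for features, position in zip(concat_sequence, positions):
        x = list(features) + [float(position)]  # concat(concat_sequence, positions)
        if len(x) != len(weights):
            raise ValueError("weight length must match the concatenated features")
        logits.append(sum(w * v for w, v in zip(weights, x)) + bias)
    return logits


# Toy values: two positions, two features each, the entity starts at position 0.
start_logits = span_logits([[1.0, 2.0], [0.0, 1.0]], [1, 0],
                           weights=[0.5, -1.0, 2.0], bias=0.1)
print(start_logits)
```

The ending logits value is obtained by calling the same function with end_positions in place of start_positions.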
As described in the above steps S18 to S20, the cross entropy loss is mainly used to measure the difference between two probability distributions. Preferably, the step of calculating the cross entropy loss according to the starting logits value and the start position vector to obtain the starting loss value may use the following formula:
$$\mathrm{start\_loss} = \mathrm{start\_positions} \cdot \log_{a}(\mathrm{start\_logits}) + (1 - \mathrm{start\_positions}) \cdot \log(1 - \mathrm{start\_logits});$$
the step of calculating the cross entropy loss according to the ending logits value and the end position vector to obtain the ending loss value uses the following formula:
$$\mathrm{end\_loss} = \mathrm{end\_positions} \cdot \log_{a}(\mathrm{end\_logits}) + (1 - \mathrm{end\_positions}) \cdot \log(1 - \mathrm{end\_logits});$$
the step of calculating the total loss value of the sequence labeling model according to the starting loss value and the ending loss value may specifically use:
$$\mathrm{total\_loss} = \mathrm{start\_loss} + \mathrm{end\_loss};$$
wherein start_loss is the starting loss value, used for evaluating how well the Span model predicts the start boundary of the text data; start_positions is the start position vector; start_logits is the starting logits value; end_loss is the ending loss value, used for evaluating how well the Span model predicts the end boundary of the text data; end_positions is the end position vector; end_logits is the ending logits value; total_loss is the total loss value; and $a$ is a constant.
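For illustration only, the loss calculation can be sketched with the conventional binary cross entropy. This sketch makes several assumptions that are not stated in the embodiment: the logits are first mapped to probabilities with a sigmoid, the natural logarithm is used, the sum carries a leading minus sign, the result is averaged over positions, and the position vectors contain only 0 and 1. The function names are hypothetical.

```python
import math


def sigmoid(x):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(logits, targets, eps=1e-12):
    """Mean binary cross entropy between sigmoid(logits) and 0/1 targets."""
    total = 0.0
    for logit, target in zip(logits, targets):
        p = min(max(sigmoid(logit), eps), 1.0 - eps)  # keep log() finite
        total -= target * math.log(p) + (1 - target) * math.log(1.0 - p)
    return total / len(logits)


def total_loss(start_logits, start_positions, end_logits, end_positions):
    """total_loss = start_loss + end_loss."""
    start_loss = binary_cross_entropy(start_logits, start_positions)
    end_loss = binary_cross_entropy(end_logits, end_positions)
    return start_loss + end_loss


print(total_loss([2.0, -2.0], [1, 0], [-2.0, 2.0], [0, 1]))
```

Under these assumptions a prediction that agrees with the position vector produces a smaller loss than one that contradicts it, which is the property the threshold check relies on.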
As described in step S21, after each round of training of the sequence labeling model, the total loss value for that round is calculated. When the total loss value meets the preset threshold or is smaller than the preset loss value, that is, meets the preset requirement, the sequence labeling model meets the training requirement and its training is completed, which improves the accuracy of the sequence labeling model in predicting information boundaries.
The method for training the sequence labeling model comprises the steps of firstly obtaining text data required to be input by the sequence labeling model, carrying out vector conversion on the text data to obtain an input _ ids vector, a segment _ ids vector and a mask vector, inputting the input _ ids vector, the segment _ ids vector and the mask vector into a Bert model for training, adjusting relevant parameters of the Bert model, and obtaining an output sequence; then obtaining boundary characteristic data in the text data, carrying out word vector coding on the boundary characteristic data to obtain a boundary vector, connecting an output sequence with the boundary vector to obtain a connection vector, determining an initial position vector and an end position vector of the boundary characteristic data, connecting the connection vector with the initial position vector of the boundary characteristic data by using a Span model, and obtaining an initial logits value after linear transformation; connecting the connecting vector with the ending position vector of the boundary characteristic data by using a Span model, and obtaining an ending logits value after linear transformation; calculating cross entropy loss according to the initial logits value and the initial position vector to obtain an initial loss value; calculating cross entropy loss according to the ending logits value and the ending position vector to obtain an ending loss value; and finally, calculating the total loss value of the sequence labeling model according to the initial loss value and the end loss value, and finishing the training of the sequence labeling model when the total loss value meets a preset threshold value. 
According to the sequence labeling model formed by the Bert model and the Span model, the boundary characteristic data are added into the text data input into the sequence labeling model, the boundary characteristic data are mapped to the same vector space of the Span model, the text data of the boundary position to be extracted are strengthened, meanwhile, the total loss value is calculated based on the boundary characteristic data, so that the sequence labeling model after each training is accurately evaluated, and the trained sequence labeling model can accurately predict the boundary information.
In an embodiment, in step S20, after the step of determining whether the total loss value satisfies the preset threshold, the method may further include:
when the total loss value does not meet a preset threshold value, adjusting the starting logits value and the ending logits value according to the total loss value;
resetting the parameters of the Bert model according to the adjusted initial logits value and the adjusted ending logits value;
and returning to the step of inputting the input_ids vector, the segment_ids vector and the mask vector into the Bert model for training, so as to train the Bert model with the reset parameters, stopping training when the total loss value meets the preset threshold value, and storing the starting logits value, the ending logits value and the parameters corresponding to the total loss value that meets the preset threshold value.
In this embodiment, forward propagation may be performed in the neural network structure of the sequence labeling model according to the total loss value to adjust the starting logits value and the ending logits value. The relevant parameters of the Bert model in the sequence labeling model are reset according to the adjusted starting logits value and ending logits value, the Bert model is retrained based on the reset parameters, and the total loss value is recalculated. This is repeated until the total loss value meets the preset requirement, at which point the starting logits value and the ending logits value corresponding to the total loss value meeting the preset threshold are obtained, the parameters of the sequence labeling model are saved, and the training of the sequence labeling model ends.
In this embodiment, the start logits value, the end logits value and the parameters of the Bert model are adjusted, and the adjusted Bert model is repeatedly trained, so as to ensure that the sequence labeling model formed by the Bert model satisfies the requirement for predicting the boundary of the entity word in the text data.
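The repeated train-and-check procedure can be sketched generically as follows. Here train_step is a hypothetical callable that performs one round of training and returns the total loss value; "meets the preset threshold" is read as "is smaller than the threshold", and the max_rounds guard is an added assumption so that the loop always terminates.

```python
def train_until_threshold(train_step, threshold, max_rounds=1000):
    """Call train_step() until its loss drops below threshold.

    Returns the list of total loss values observed, one per round.
    """
    history = []
    for _ in range(max_rounds):
        loss = train_step()  # one round: forward pass, loss, parameter update
        history.append(loss)
        if loss < threshold:
            break  # preset requirement met; stop training
    return history


# Stand-in for real training: a fixed sequence of decreasing loss values.
fake_losses = iter([0.9, 0.5, 0.2, 0.1])
print(train_until_threshold(lambda: next(fake_losses), threshold=0.3))
```

In a real system the parameters in effect at the final round would be saved after the loop exits.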
In step S15, the step of determining the start position vector and the end position vector of the boundary feature data may specifically include:
S151, acquiring text sample data and annotation data labeled for the text sample data; the text sample data is reference text data in which the reference start position and the reference end position of a target entity word need to be labeled, and the annotation data comprises the labeled target entity word in the text sample data and the reference start position and the reference end position of the target entity word;
S152, determining the start position and the end position of the boundary feature data according to the reference start position and the reference end position of the target entity word, and generating a start position vector corresponding to the start position and an end position vector corresponding to the end position.
In this embodiment, the obtained text sample data may be manually labeled to determine all target entity words of the text sample data together with their reference start positions and reference end positions, thereby obtaining the annotation data. The annotation data may take the following form:
what is the logo of [ @ red reservoir accident insurance # insurance product ];
what is the mark of [ @ Ping Anchuan branch science, Heng Cui German Yang trade school # insurance product ];
what is the target of [ @ self-driving travel insurance # insurance product within a short period of time ];
what is the logo of [ @ public place security # insurance product ];
the data in the square brackets is the name of the labeled entity (after "@") and its category (after "#").
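Annotation data in this bracketed form could be parsed along the following lines. The regular expression, the function name parse_annotation and the use of character offsets as positions are assumptions of this sketch; the embodiment does not prescribe a parser.

```python
import re

# Matches "[ @ entity name # category ]" with optional surrounding spaces.
ANNOTATION = re.compile(r"\[\s*@\s*(.+?)\s*#\s*(.+?)\s*\]")


def parse_annotation(annotated):
    """Strip the markup and return (plain_text, entities).

    Each entity records its name, category and inclusive start/end
    character offsets in the plain text.
    """
    parts, entities = [], []
    cursor = 0       # read position in the annotated string
    plain_len = 0    # length of the plain text built so far
    for match in ANNOTATION.finditer(annotated):
        before = annotated[cursor:match.start()]
        parts.append(before)
        plain_len += len(before)
        name, category = match.group(1), match.group(2)
        entities.append({"name": name, "category": category,
                         "start": plain_len,
                         "end": plain_len + len(name) - 1})
        parts.append(name)
        plain_len += len(name)
        cursor = match.end()
    parts.append(annotated[cursor:])
    return "".join(parts), entities


text, found = parse_annotation(
    "what is the logo of [ @ public place security # insurance product ]")
print(text)
print(found)
```

The recorded offsets serve as the reference start position and reference end position from which the position vectors are built.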
In addition, in the present embodiment, a start position of the boundary feature data is determined according to the reference start position of the target entity word, an end position of the boundary feature data is determined according to the reference end position of the target entity word, and a start position vector corresponding to the start position and an end position vector corresponding to the end position are generated.
In an embodiment, after the step of acquiring text sample data, the method further includes:
and carrying out data cleaning and preprocessing on the text sample data to remove nonsense words.
The punctuation marks or special characters of the text sample data can be subjected to data cleaning in the step, and the text data meeting the requirements is reserved, so that the subsequent processing of nonsense words is reduced, and the training efficiency is improved.
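A minimal cleaning step of this kind might look as follows. The rule used here (keep letters, digits, underscores and whitespace, drop everything else) is only one assumed definition of "nonsense" characters, and the function name clean_text is hypothetical. Because it would also strip the annotation markers, it is assumed to run on unannotated text.

```python
import re


def clean_text(text):
    """Remove punctuation and special characters, then normalize spaces."""
    no_symbols = re.sub(r"[^\w\s]", "", text)        # drop non-word symbols
    return re.sub(r"\s+", " ", no_symbols).strip()   # collapse whitespace


print(clean_text("What is  the logo?!  "))
```

In Python 3 the \w class is Unicode-aware for str patterns, so Chinese characters are kept while punctuation is removed.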
In an embodiment, the step of generating the starting position vector corresponding to the starting position and the ending position vector corresponding to the ending position may specifically include:
setting the initial position of the boundary feature data to be 1, and setting the rest positions except the initial position in the boundary feature data to be 0 to obtain the initial position vector;
initializing the boundary feature data, setting the end position of the initialized boundary feature data to be 1, and setting the rest positions except the end position in the initialized boundary feature data to be 0 to obtain the end position vector.
In this embodiment, the start position vector may be obtained by setting the start position of the entity word to be entity type id and filling other parts to be 0; the end position vector is obtained by setting the end position of the entity to be the entity type id and filling other parts to be 0 so as to determine the boundary position of the entity word, and subsequently, the sequence labeling model after each training is convenient to adjust.
For example, when the text data is "red flag reservoir accident insurance" (eight characters in the original text), the start position of the boundary feature data is the first character ("red") and the end position is the last character (the final character of "insurance"), so the start position vector is 10000000 and the end position vector is 00000001.
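The construction in this example can be sketched as follows. The function name position_vectors and its label_id parameter are hypothetical; label_id defaults to 1 as in the example and may instead carry an entity type id.

```python
def position_vectors(text_length, start, end, label_id=1):
    """Build the start and end position vectors for one entity."""
    start_positions = [0] * text_length
    end_positions = [0] * text_length
    start_positions[start] = label_id  # mark where the entity begins
    end_positions[end] = label_id      # mark where the entity ends
    return start_positions, end_positions


# Eight-character text whose entity spans the whole text.
start_positions, end_positions = position_vectors(8, start=0, end=7)
print(start_positions)
print(end_positions)
```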
In an embodiment, before the step of inputting the input_ids vector, the segment_ids vector and the mask vector into the Bert model for training in step S12, the method may further include:
judging whether the vector length of the input_ids vector reaches the maximum sentence length preset for the Bert model;
and if not, filling 0 at the tail of the input_ids vector until the vector length of the filled input_ids vector reaches the maximum sentence length preset for the Bert model, and then executing the step of inputting the input_ids vector, the segment_ids vector and the mask vector into the Bert model for training.
In this embodiment, the vector length is the maximum sentence length preset for the input model; if a sentence is not long enough to reach the maximum length, the vector is padded with 0. Specifically, a fixed sentence length seg_size can be set for the Bert model in advance. The vector model converts the text data into an input_ids vector and the vector length of the input_ids vector is calculated; if that length is smaller than the fixed length seg_size, 0 is filled at the end of the input_ids vector until its length reaches seg_size, so that the length of the vector input to the Bert model meets the requirement.
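The padding step can be sketched as follows. The function name pad_input_ids is hypothetical and the ids in the usage line are arbitrary. The embodiment does not say what happens when a sentence is longer than seg_size, so this sketch simply raises an error in that case.

```python
def pad_input_ids(input_ids, seg_size, pad_id=0):
    """Right-pad input_ids with pad_id up to the fixed length seg_size."""
    if len(input_ids) > seg_size:
        # Over-long input is not covered by the embodiment (assumption).
        raise ValueError("input_ids is longer than seg_size")
    return list(input_ids) + [pad_id] * (seg_size - len(input_ids))


print(pad_input_ids([11, 27, 35], seg_size=6))
```

An input that already has length seg_size is returned unchanged.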
Referring to fig. 2, an embodiment of the present application further provides a training apparatus for a sequence annotation model, including:
the conversion module 11 is configured to obtain text data required to be input by the sequence labeling model, and perform vector conversion on the text data to obtain an input_ids vector, a segment_ids vector and a mask vector; the sequence labeling model comprises a Bert model and a Span model, the input_ids vector is the number of each word in the text data in a Bert dictionary, the segment_ids vector is used for marking the sentence to which each word in the text data belongs, and the mask vector is used for marking words and non-words in the text data;
an input module 12, configured to input the input_ids vector, the segment_ids vector and the mask vector into the Bert model for training, so as to obtain an output sequence;
the encoding module 13 is configured to obtain boundary feature data in the text data, and perform word vector encoding on the boundary feature data to obtain a boundary vector;
a connection module 14, configured to connect the output sequence and the boundary vector to obtain a connection vector;
a determining module 15, configured to determine a start position vector and an end position vector of the boundary feature data;
the first linear transformation module 16 is configured to connect the connection vector and the initial position vector of the boundary feature data by using the Span model, and obtain an initial logits value after linear transformation;
a second linear transformation module 17, configured to connect the connection vector and the ending position vector of the boundary feature data by using the Span model, and obtain an ending logits value after linear transformation;
a first calculating module 18, configured to calculate cross entropy loss according to the initial logits value and the initial position vector, so as to obtain an initial loss value;
a second calculating module 19, configured to calculate cross entropy loss according to the ending logits value and the ending position vector, so as to obtain an ending loss value;
a judging module 20, configured to calculate a total loss value of the sequence labeling model according to the initial loss value and the end loss value, and judge whether the total loss value meets a preset threshold;
and a completion module 21, configured to complete training of the sequence labeling model when the total loss value meets a preset threshold.
The sequence labeling model can be composed of a Bert model and a Span model, and each piece of text data required to be input by the Bert model needs to be converted into three vectors: an input_ids vector, which is the number of each word in the text data in the Bert dictionary; a segment_ids vector, which, when the text data includes a plurality of sentences, marks with a sentence id the sentence to which each word belongs; and a mask vector, which is formed by setting word positions to 1 and non-word positions to 0 so as to distinguish words from non-words, where the non-words are punctuation marks, mathematical symbols or special characters in the text data.
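The three-vector conversion can be illustrated with a toy example. The vocabulary, the unk_id fallback, the function name to_model_inputs and the rule that a token counts as a word when it contains at least one alphanumeric character are all assumptions of this sketch; a real system would use the dictionary shipped with the Bert model.

```python
def to_model_inputs(sentences, vocab, unk_id=1):
    """Convert tokenized sentences to (input_ids, segment_ids, mask)."""
    input_ids, segment_ids, mask = [], [], []
    for sentence_index, tokens in enumerate(sentences):
        for token in tokens:
            input_ids.append(vocab.get(token, unk_id))  # dictionary number
            segment_ids.append(sentence_index)          # owning sentence
            is_word = any(ch.isalnum() for ch in token)
            mask.append(1 if is_word else 0)            # word vs non-word
    return input_ids, segment_ids, mask


toy_vocab = {"hello": 5, "world": 6, "!": 7, "bye": 8}
print(to_model_inputs([["hello", "world", "!"], ["bye"]], toy_vocab))
```

The mask here follows the definition given in this embodiment (word versus non-word) rather than marking padding positions.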
Here, BERT (Bidirectional Encoder Representations from Transformers) is a language model that effectively improves many natural language processing tasks, including sentence-level tasks such as natural language inference and paraphrasing, and token-level tasks such as named entity recognition and SQuAD question answering. In addition, the BERT model can effectively utilize context information and determine the embedding of a word according to the context in which the word is located, thereby obtaining contextualized word embeddings.
In a pre-training process based on public datasets, the BERT model typically uses both a masked language model and next sentence prediction as loss functions. However, in the field of real-world intelligent question answering, manual dialog logs differ from traditional machine-reading document formats: the sequential coherence between dialog turns is usually not strong, and there is no obvious contextual relationship. Therefore, in embodiments herein, when pre-training the BERT model using the training corpus, only the masked language model may be selected as the loss function, without using next sentence prediction. In this way, the pre-training of the BERT model can be completed in a more targeted manner.
In this embodiment, the input_ids vector, the segment_ids vector and the mask vector are input into the Bert model, the Bert model is trained, and an output sequence is obtained, denoted sequence_output. The output sequence is an initial result of preliminarily segmenting the text data into a plurality of words and is used as one input of the Span model.
The boundary feature data of the text data includes the semantic relationship or correlation between adjacent words; if two words are semantically similar or can together form a word, the semantic relationship between them is close and the correlation is strong. The boundary vector may be a one-hot coded vector that indicates whether the current position of each word in the text data is a component of an entity: the position is filled with 1 if it is a component of an entity and with 0 if it is not, thereby generating the boundary vector.
In addition, the output sequence sequence_output can be connected with the boundary vector to obtain a connection vector, denoted concat_sequence, so as to generate a vector containing the boundary information.
To obtain the boundary feature data required by the Span model, each piece of boundary feature data needs to be converted into two vectors. The start position vector start_positions represents the start features of entity words in the text data and is obtained by setting the start position of each entity to the entity type id and filling the other positions with 0. The end position vector end_positions represents the end features of entity words in the text data and is obtained by setting the end position of each entity to the entity type id and filling the other positions with 0.
The step of connecting the connection vector with the start position vector of the boundary feature data by using the Span model, and obtaining the starting logits value after linear transformation, uses the following formula:
$$\mathrm{start\_logits} = W^{T} \cdot \mathrm{concat}(\mathrm{concat\_sequence}, \mathrm{start\_positions}) + b;$$
the step of connecting the connection vector with the end position vector of the boundary feature data by using the Span model, and obtaining the ending logits value after linear transformation, uses the following formula:
$$\mathrm{end\_logits} = W^{T} \cdot \mathrm{concat}(\mathrm{concat\_sequence}, \mathrm{end\_positions}) + b;$$
wherein start_logits is the starting logits value, end_logits is the ending logits value, concat_sequence is the connection vector, start_positions is the start position vector, end_positions is the end position vector, $W^{T}$ is the weight preset in the Span model, and $b$ is a constant.
The cross entropy loss is mainly used to measure the difference between two probability distributions. Preferably, the step of calculating the cross entropy loss according to the starting logits value and the start position vector to obtain the starting loss value may use the following formula:
$$\mathrm{start\_loss} = \mathrm{start\_positions} \cdot \log_{a}(\mathrm{start\_logits}) + (1 - \mathrm{start\_positions}) \cdot \log(1 - \mathrm{start\_logits});$$
the step of calculating the cross entropy loss according to the ending logits value and the end position vector to obtain the ending loss value uses the following formula:
$$\mathrm{end\_loss} = \mathrm{end\_positions} \cdot \log_{a}(\mathrm{end\_logits}) + (1 - \mathrm{end\_positions}) \cdot \log(1 - \mathrm{end\_logits});$$
the step of calculating the total loss value of the sequence labeling model according to the starting loss value and the ending loss value may specifically use:
$$\mathrm{total\_loss} = \mathrm{start\_loss} + \mathrm{end\_loss};$$
wherein start_loss is the starting loss value, used for evaluating how well the Span model predicts the start boundary of the text data; start_positions is the start position vector; start_logits is the starting logits value; end_loss is the ending loss value, used for evaluating how well the Span model predicts the end boundary of the text data; end_positions is the end position vector; end_logits is the ending logits value; total_loss is the total loss value; and $a$ is a constant.
After each round of training of the sequence labeling model, the total loss value for that round is calculated. When the total loss value meets the preset threshold or is smaller than the preset loss value, that is, meets the preset requirement, the sequence labeling model meets the training requirement and its training is completed, which improves the accuracy of the sequence labeling model in predicting information boundaries.
As described above, it can be understood that each component of the training apparatus for sequence labeling models provided in the present application can implement the function of any one of the above-described training methods for sequence labeling models, and the detailed structure is not repeated.
Referring to fig. 3, an embodiment of the present application further provides a computer device, which may be a server and whose internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a training method of a sequence labeling model.
The processor executes the training method of the sequence labeling model, and the training method comprises the following steps:
acquiring text data required to be input by a sequence labeling model, and performing vector conversion on the text data to obtain an input_ids vector, a segment_ids vector and a mask vector; the sequence labeling model comprises a Bert model and a Span model, the input_ids vector is the number of each word in the text data in a Bert dictionary, the segment_ids vector is used for marking the sentence to which each word in the text data belongs, and the mask vector is used for marking words and non-words in the text data;
inputting the input_ids vector, the segment_ids vector and the mask vector into the Bert model for training to obtain an output sequence;
acquiring boundary characteristic data in the text data, and carrying out word vector coding on the boundary characteristic data to obtain a boundary vector;
connecting the output sequence with the boundary vector to obtain a connection vector;
determining a starting position vector and an ending position vector of the boundary characteristic data;
connecting the connecting vector with the initial position vector of the boundary characteristic data by using the Span model, and obtaining an initial logits value after linear transformation;
connecting the connecting vector with the ending position vector of the boundary characteristic data by using the Span model, and obtaining an ending logits value after linear transformation;
calculating cross entropy loss according to the initial logits value and the initial position vector to obtain an initial loss value;
calculating cross entropy loss according to the ending logits value and the ending position vector to obtain an ending loss value;
calculating a total loss value of the sequence labeling model according to the initial loss value and the end loss value, and judging whether the total loss value meets a preset threshold value;
and when the total loss value meets a preset threshold value, finishing the training of the sequence labeling model.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements a method for training a sequence annotation model, including the steps of:
acquiring text data required to be input by a sequence labeling model, and performing vector conversion on the text data to obtain an input_ids vector, a segment_ids vector and a mask vector; the sequence labeling model comprises a Bert model and a Span model, the input_ids vector is the number of each word in the text data in a Bert dictionary, the segment_ids vector is used for marking the sentence to which each word in the text data belongs, and the mask vector is used for marking words and non-words in the text data;
inputting the input_ids vector, the segment_ids vector and the mask vector into the Bert model for training to obtain an output sequence;
acquiring boundary characteristic data in the text data, and carrying out word vector coding on the boundary characteristic data to obtain a boundary vector;
connecting the output sequence with the boundary vector to obtain a connection vector;
determining a starting position vector and an ending position vector of the boundary characteristic data;
connecting the connecting vector with the initial position vector of the boundary characteristic data by using the Span model, and obtaining an initial logits value after linear transformation;
connecting the connecting vector with the ending position vector of the boundary characteristic data by using the Span model, and obtaining an ending logits value after linear transformation;
calculating cross entropy loss according to the initial logits value and the initial position vector to obtain an initial loss value;
calculating cross entropy loss according to the ending logits value and the ending position vector to obtain an ending loss value;
calculating a total loss value of the sequence labeling model according to the initial loss value and the end loss value, and judging whether the total loss value meets a preset threshold value;
and when the total loss value meets a preset threshold value, finishing the training of the sequence labeling model.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
To sum up, the most beneficial effect of this application lies in:
the method, the device, the equipment and the storage medium for training the sequence labeling model are characterized by firstly obtaining text data required to be input by the sequence labeling model, carrying out vector conversion on the text data to obtain an input _ ids vector, a segment _ ids vector and a mask vector, and inputting the input _ ids vector, the segment _ ids vector and the mask vector into a Bert model for training so as to adjust relevant parameters of the Bert model and obtain an output sequence; then obtaining boundary characteristic data in the text data, carrying out word vector coding on the boundary characteristic data to obtain a boundary vector, connecting an output sequence with the boundary vector to obtain a connection vector, determining an initial position vector and an end position vector of the boundary characteristic data, connecting the connection vector with the initial position vector of the boundary characteristic data by using a Span model, and obtaining an initial logits value after linear transformation; connecting the connecting vector with the ending position vector of the boundary characteristic data by using a Span model, and obtaining an ending logits value after linear transformation; calculating cross entropy loss according to the initial logits value and the initial position vector to obtain an initial loss value; calculating cross entropy loss according to the ending logits value and the ending position vector to obtain an ending loss value; and finally, calculating the total loss value of the sequence labeling model according to the initial loss value and the end loss value, and finishing the training of the sequence labeling model when the total loss value meets a preset threshold value. 
According to the sequence labeling model formed by the Bert model and the Span model, the boundary characteristic data are added into the text data input into the sequence labeling model, the boundary characteristic data are mapped to the same vector space of the Span model, the text data of the boundary position to be extracted are strengthened, meanwhile, the total loss value is calculated based on the boundary characteristic data, so that the sequence labeling model after each training is accurately evaluated, and the trained sequence labeling model can accurately predict the boundary information.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.