Entity recognition method and apparatus, electronic device, and storage medium
1. An entity recognition method, comprising:
segmenting an object topic text into a plurality of characters and at least one entity word according to an entity word list, and determining, for each character in the object topic text, the words that contain that character;
determining a character vector of each character as a first character vector, determining a word vector of the entity word as a first word vector, and determining a word vector of each of the containing words as a second word vector;
determining a second character vector of the same character according to the second word vectors of the words containing the same character and the first character vector of the same character;
and performing entity recognition on the object topic text according to the second character vector of each character and the first word vector of the entity word to obtain an entity recognition result.
2. The method of claim 1, wherein determining a second character vector of the same character according to the second word vectors of the words containing the same character and the first character vector of the same character comprises:
among the plurality of words containing the same character, respectively determining the words that begin with, are centered on, and end with the same character;
determining an average vector of the second word vectors of the words beginning with the same character as a first average vector, determining an average vector of the second word vectors of the words centered on the same character as a second average vector, and determining an average vector of the second word vectors of the words ending with the same character as a third average vector;
and splicing the first character vector of the same character with the first average vector, the second average vector and the third average vector to obtain the second character vector of the same character.
3. The method of claim 1, wherein determining a second character vector of the same character according to the second word vectors of the words containing the same character and the first character vector of the same character comprises:
determining an average vector of the second word vectors of at least one word containing the same character;
and splicing the first character vector of the same character with the average vector to obtain the second character vector of the same character.
4. The method of claim 1, wherein performing entity recognition on the object topic text according to the second character vector of each character and the first word vector of the entity word to obtain an entity recognition result comprises:
performing entity recognition on the object topic text according to the second character vector of each character, first position information of each character in the object topic text, the first word vector, and second position information of the entity word in the object topic text to obtain the entity recognition result.
5. The method of claim 4, wherein the first position information comprises first start position information and first end position information, and the second position information comprises second start position information and second end position information.
6. The method of claim 4, wherein performing entity recognition on the object topic text according to the second character vector of each character, the first position information of each character in the object topic text, the first word vector, and the second position information of the entity word in the object topic text to obtain the entity recognition result comprises:
determining, by an encoder, a relative position encoding of each character with respect to the other characters, among the plurality of characters, other than the current character, according to the first position information of each character in the object topic text and the second position information of the entity word in the object topic text;
determining an attention weight of each character with respect to the other characters through an attention mechanism according to the second character vector of each character, the first word vector, and the relative position encoding of each character with respect to the other characters;
and performing entity recognition on the object topic text through a decoder according to the attention weight of each character with respect to the other characters to obtain the entity recognition result of the object topic text.
7. An entity recognition apparatus, comprising:
a text segmentation module configured to segment an object topic text into a plurality of characters and at least one entity word according to an entity word list, and to determine, for each character in the object topic text, the words that contain that character;
a word vector determination module configured to determine a character vector of each character as a first character vector, determine a word vector of the entity word as a first word vector, and determine a word vector of each of the containing words as a second word vector;
a character vector determination module configured to determine a second character vector of the same character according to the first character vector of the same character and the second word vectors of the words containing the same character;
and an entity recognition module configured to perform entity recognition on the object topic text according to the second character vector of each character and the first word vector of the entity word to obtain an entity recognition result.
8. An electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the entity recognition method of any one of claims 1 to 6.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the entity recognition method of any one of claims 1 to 6.
10. A computer program product comprising a computer program or computer instructions, wherein the computer program or the computer instructions, when executed by a processor, implement the entity recognition method of any one of claims 1 to 6.
Background
In the related art, when entity recognition is performed on an object topic text, either a large-scale language model is used for recognition, or a matching approach is used to match entities in the object topic text directly against a word list.
The related art relies on a large amount of manually labeled data to construct a training data set, which is too costly. Because object topic texts have a complex and irregular composition, it is difficult for the labeled data to cover all categories, so the training data is prone to bias; moreover, the model relies only on the information of each individual character in the object topic text to perform entity recognition, so the recognition accuracy of the model is not high. The direct matching approach, which uses no language model at all, also has low accuracy.
Disclosure of Invention
The present disclosure provides an entity recognition method, apparatus, electronic device and storage medium, to at least solve the problem of low entity recognition accuracy in the related art. The technical solutions of the present disclosure are as follows:
According to a first aspect of the embodiments of the present disclosure, there is provided an entity recognition method, including:
segmenting an object topic text into a plurality of characters and at least one entity word according to an entity word list, and determining, for each character in the object topic text, the words that contain that character;
determining a character vector of each character as a first character vector, determining a word vector of the entity word as a first word vector, and determining a word vector of each of the containing words as a second word vector;
determining a second character vector of the same character according to the second word vectors of the words containing the same character and the first character vector of the same character;
and performing entity recognition on the object topic text according to the second character vector of each character and the first word vector of the entity word to obtain an entity recognition result.
Optionally, determining the second character vector of the same character according to the second word vectors of the words containing the same character and the first character vector of the same character includes:
among the plurality of words containing the same character, respectively determining the words that begin with, are centered on, and end with the same character;
determining an average vector of the second word vectors of the words beginning with the same character as a first average vector, determining an average vector of the second word vectors of the words centered on the same character as a second average vector, and determining an average vector of the second word vectors of the words ending with the same character as a third average vector;
and splicing the first character vector of the same character with the first average vector, the second average vector and the third average vector to obtain the second character vector of the same character.
Optionally, determining the second character vector of the same character according to the second word vectors of the words containing the same character and the first character vector of the same character includes:
determining an average vector of the second word vectors of at least one word containing the same character;
and splicing the first character vector of the same character with the average vector to obtain the second character vector of the same character.
Optionally, performing entity recognition on the object topic text according to the second character vector of each character and the first word vector of the entity word to obtain an entity recognition result includes:
performing entity recognition on the object topic text according to the second character vector of each character, first position information of each character in the object topic text, the first word vector, and second position information of the entity word in the object topic text to obtain the entity recognition result.
Optionally, the first position information includes first start position information and first end position information, and the second position information includes second start position information and second end position information.
Optionally, performing entity recognition on the object topic text according to the second character vector of each character, the first position information of each character in the object topic text, the first word vector, and the second position information of the entity word in the object topic text to obtain the entity recognition result includes:
determining, by an encoder, a relative position encoding of each character with respect to the other characters, among the plurality of characters, other than the current character, according to the first position information of each character in the object topic text and the second position information of the entity word in the object topic text;
determining an attention weight of each character with respect to the other characters through an attention mechanism according to the second character vector of each character, the first word vector, and the relative position encoding of each character with respect to the other characters;
and performing entity recognition on the object topic text through a decoder according to the attention weight of each character with respect to the other characters to obtain the entity recognition result of the object topic text.
Optionally, determining the character vector of each character as the first character vector, determining the word vector of the entity word as the first word vector, and determining the word vector of each of the containing words as the second word vector includes:
determining the character vector of each character as the first character vector through a word vector model, determining the word vector of the entity word as the first word vector through the word vector model, and determining the word vector of each of the containing words as the second word vector through the word vector model, wherein the word vector model is obtained by training based on the entity word list;
and performing entity recognition on the object topic text according to the second character vector of each character and the first word vector of the entity word to obtain an entity recognition result includes:
performing entity recognition on the object topic text through an entity recognition model according to the second character vector of each character and the first word vector of the entity word to obtain the entity recognition result, wherein the entity recognition model is obtained by training based on a labeled data set derived from the entity word list.
Optionally, the method further includes:
replacing the entity word list with a received new entity word list;
labeling an object topic text corpus according to the new entity word list to obtain a first labeled data set;
training a pre-trained word vector model according to the new entity word list and the object topic text corpus to obtain a new word vector model;
determining, according to the new entity word list and the new word vector model, the second character vector of each character and the first word vector of each entity word in the sample data of the first labeled data set;
and training an initial entity recognition model according to the first labeled data set and the second character vector of each character and the first word vector of each entity word in the sample data of the first labeled data set to obtain a new entity recognition model.
Optionally, the method further includes:
if an entity word in the entity recognition result does not exist in the entity word list, storing that entity word into a new word list;
de-duplicating the entity words in the new word list, and adding the de-duplicated entity words to the entity word list to obtain an expanded word list;
labeling the object topic text corpus according to the expanded word list to obtain a second labeled data set;
performing iterative training on the word vector model according to the expanded word list and the object topic text corpus to obtain an iterated word vector model;
determining, according to the expanded word list and the iterated word vector model, the second character vector of each character and the first word vector of each entity word in the sample data of the second labeled data set;
and performing iterative training on the entity recognition model according to the second labeled data set and the second character vector of each character and the first word vector of each entity word in the sample data of the second labeled data set to obtain an iterated entity recognition model.
According to a second aspect of the embodiments of the present disclosure, there is provided an entity recognition apparatus, including:
a text segmentation module configured to segment an object topic text into a plurality of characters and at least one entity word according to an entity word list, and to determine, for each character in the object topic text, the words that contain that character;
a word vector determination module configured to determine a character vector of each character as a first character vector, determine a word vector of the entity word as a first word vector, and determine a word vector of each of the containing words as a second word vector;
a character vector determination module configured to determine a second character vector of the same character according to the first character vector of the same character and the second word vectors of the words containing the same character;
and an entity recognition module configured to perform entity recognition on the object topic text according to the second character vector of each character and the first word vector of the entity word to obtain an entity recognition result.
Optionally, the character vector determination module includes:
a positional word determination unit configured to determine, among the plurality of words containing the same character, the words that begin with, are centered on, and end with the same character, respectively;
a first average vector determination unit configured to determine an average vector of the second word vectors of the words beginning with the same character as a first average vector, determine an average vector of the second word vectors of the words centered on the same character as a second average vector, and determine an average vector of the second word vectors of the words ending with the same character as a third average vector;
a first character vector determination unit configured to splice the first character vector of the same character with the first average vector, the second average vector and the third average vector to obtain the second character vector of the same character.
Optionally, the character vector determination module includes:
a second average vector determination unit configured to determine an average vector of the second word vectors of at least one word containing the same character;
a second character vector determination unit configured to splice the first character vector of the same character with the average vector to obtain the second character vector of the same character.
Optionally, the entity recognition module is configured to:
perform entity recognition on the object topic text according to the second character vector of each character, first position information of each character in the object topic text, the first word vector, and second position information of the entity word in the object topic text to obtain the entity recognition result.
Optionally, the first position information includes first start position information and first end position information, and the second position information includes second start position information and second end position information.
Optionally, the entity recognition module includes:
a relative position encoding unit configured to determine, by an encoder, a relative position encoding of each character with respect to the other characters, among the plurality of characters, other than the current character, according to the first position information of each character in the object topic text and the second position information of the entity word in the object topic text;
an attention weight determination unit configured to determine an attention weight of each character with respect to the other characters through an attention mechanism according to the second character vector of each character, the first word vector, and the relative position encoding of each character with respect to the other characters;
and an entity recognition unit configured to perform entity recognition on the object topic text through a decoder according to the attention weight of each character with respect to the other characters to obtain the entity recognition result of the object topic text.
Optionally, the word vector determination module is configured to:
determine the character vector of each character as the first character vector through a word vector model, determine the word vector of the entity word as the first word vector through the word vector model, and determine the word vector of each of the containing words as the second word vector through the word vector model, wherein the word vector model is obtained by training based on the entity word list;
and the entity recognition module is configured to:
perform entity recognition on the object topic text through an entity recognition model according to the second character vector of each character and the first word vector of the entity word to obtain the entity recognition result, wherein the entity recognition model is obtained by training based on a labeled data set derived from the entity word list.
Optionally, the apparatus further comprises:
a word list replacement module configured to replace the entity word list with a received new entity word list;
a first labeling module configured to label an object topic text corpus according to the new entity word list to obtain a first labeled data set;
a word vector retraining module configured to train a pre-trained word vector model according to the new entity word list and the object topic text corpus to obtain a new word vector model;
a first sample vector determination module configured to determine, according to the new entity word list and the new word vector model, the second character vector of each character and the first word vector of each entity word in the sample data of the first labeled data set;
and an entity recognition model retraining module configured to train an initial entity recognition model according to the first labeled data set and the second character vector of each character and the first word vector of each entity word in the sample data of the first labeled data set to obtain a new entity recognition model.
Optionally, the apparatus further comprises:
an entity word storage module configured to store an entity word in the entity recognition result into a new word list if that entity word does not exist in the entity word list;
a word list expansion module configured to de-duplicate the entity words in the new word list and add the de-duplicated entity words to the entity word list to obtain an expanded word list;
a second labeling module configured to label the object topic text corpus according to the expanded word list to obtain a second labeled data set;
a word vector iterative training module configured to perform iterative training on the word vector model according to the expanded word list and the object topic text corpus to obtain an iterated word vector model;
a second sample vector determination module configured to determine, according to the expanded word list and the iterated word vector model, the second character vector of each character and the first word vector of each entity word in the sample data of the second labeled data set;
and an entity recognition model iterative training module configured to perform iterative training on the entity recognition model according to the second labeled data set and the second character vector of each character and the first word vector of each entity word in the sample data of the second labeled data set to obtain an iterated entity recognition model.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic device, including:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the entity recognition method of the first aspect.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the entity recognition method of the first aspect.
According to a fifth aspect of the embodiments of the present disclosure, there is provided a computer program product comprising a computer program or computer instructions, wherein the computer program or the computer instructions, when executed by a processor, implement the entity recognition method of the first aspect.
The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effects:
the method comprises dividing a subject text into a plurality of characters and at least one entity word according to an entity word list, respectively determining a vocabulary including each character in the subject text, determining a first word vector of each character, determining a first word vector of an entity word, determining a second word vector of the vocabulary, determining a second word vector of the character according to the second word vector and the first word vector including the same character, performing entity recognition on the subject text according to the second word vector of each character and the first word vector of the entity word, and obtaining an entity recognition result, wherein the distribution of the vocabulary of the characters in the whole text is enhanced by using the second word vector including the second word vector and the first word vector as the word vectors of the characters, and the boundary information of the entity word is enhanced by performing the entity recognition by combining the second word vector of the character and the first word vector of the entity word, the accuracy of entity identification is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flowchart illustrating an entity recognition method according to an exemplary embodiment;
FIG. 2 is a schematic diagram of a Lattice structure in an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a FLAT-Lattice structure in an embodiment of the present disclosure;
FIG. 4 is a block diagram illustrating an entity recognition apparatus according to an exemplary embodiment;
FIG. 5 is a block diagram illustrating an electronic device according to an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
FIG. 1 is a flowchart illustrating an entity recognition method according to an exemplary embodiment. The method is used in an electronic device such as a server and, as shown in FIG. 1, includes the following steps.
In step S11, the object topic text is segmented into a plurality of characters and at least one entity word according to the entity word list, and the words containing each character in the object topic text are determined.
The entity word list stores the entity words that appear in object topic texts. The object may be a commodity; most commodity topic texts are stacks of modifying phrases with no obvious sentence structure and a short length, so it is difficult to make full use of context.
When entity recognition is performed on an object topic text, the object topic text is segmented according to the entity word list to obtain the plurality of characters and the at least one entity word that the object topic text contains in order. Meanwhile, for each character, all the words that contain that character need to be determined according to the entity word list.
In step S12, the character vector of each character is determined as a first character vector, the word vector of the entity word is determined as a first word vector, and the word vector of each containing word is determined as a second word vector.
After the word vector model is obtained, the character vector of each character in the object topic text is determined by the word vector model and taken as the first character vector of that character; the word vector of the entity word is determined by the word vector model and taken as the first word vector; and for each of the containing words, the word vector model is likewise used to determine the corresponding word vector, which is taken as the second word vector.
In step S13, the second character vector of the same character is determined according to the second word vectors of the words containing the same character and the first character vector of the same character.
Each character in the object topic text is in turn taken as the same character, the second word vectors of the words containing that character and the first character vector of that character are selected, and preset processing is performed on these vectors to obtain the second character vector. The second character vector of the character serves as the basic data for performing entity recognition on that character, i.e., the vector input into the entity recognition model. The above processing is performed on every character to obtain the second character vector of each character.
In one exemplary embodiment, determining the second character vector of the same character according to the second word vectors of the words containing the same character and the first character vector of the same character comprises: among the plurality of words containing the same character, respectively determining the words that begin with, are centered on, and end with the same character; determining an average vector of the second word vectors of the words beginning with the same character as a first average vector, determining an average vector of the second word vectors of the words centered on the same character as a second average vector, and determining an average vector of the second word vectors of the words ending with the same character as a third average vector; and splicing the first character vector of the same character with the first average vector, the second average vector and the third average vector to obtain the second character vector of the same character.
That is, among the plurality of words containing the same character, the words beginning with that character, the words with that character in the middle, and the words ending with that character are determined respectively. For each of these three groups, the average of the corresponding second word vectors is computed: the average of the second word vectors of the words beginning with the character is taken as the first average vector, the average of the second word vectors of the words with the character in the middle is taken as the second average vector, and the average of the second word vectors of the words ending with the character is taken as the third average vector. The first character vector of the character is then spliced with the first average vector, the second average vector and the third average vector, and the spliced vector is taken as the second character vector of the character.
For example, take the topic text "Nanjing Yangtze River Bridge" and the character "long" (长). The related art directly uses the character vector of "long"; in the present disclosure, the word vectors of the words beginning with, centered on, and ending with the character "long" are averaged separately to obtain a first average vector, a second average vector and a third average vector. Suppose the entity word list contains 100 words beginning with "long" (such as "Yangtze River", "grow taller", etc.); the second word vectors of these words are averaged to obtain a first average vector a. The entity word list contains several words with "long" in the middle (such as "equal-length triangle", etc.); the second word vectors of these words are averaged to obtain a second average vector b. The entity word list also contains several words ending with "long" (such as "mayor", etc.); the second word vectors of these words are averaged to obtain a third average vector c. The first character vector d of the character "long" is spliced with the first average vector a, the second average vector b and the third average vector c, i.e., the vectors are concatenated end to end, to obtain the second character vector dabc, which is taken as the new vector of the character "long".
By splicing the first average vector of the words beginning with the same character, the second average vector of the words centered on the same character, the third average vector of the words ending with the same character and the first character vector of the same character into the second character vector of the same character, the information about how the words containing the character are distributed in the whole text is enhanced, which improves the accuracy of the entity recognition result.
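As an illustration only, the splicing in this embodiment can be sketched in Python as follows; the data structures (a dict word_vecs mapping each word to its second word vector, a list word_list of candidate words) and the function name are assumptions made for the example, not part of the disclosed method.

```python
import numpy as np

def second_character_vector(char, char_vec, word_list, word_vecs):
    """Concatenate the character's own vector with the average second word
    vectors of the words that begin with, contain (in the middle), or end
    with the character (the first, second and third average vectors)."""
    begin, middle, end = [], [], []
    for word in word_list:
        if char not in word or len(word) < 2:
            continue
        if word[0] == char:
            begin.append(word_vecs[word])
        elif word[-1] == char:
            end.append(word_vecs[word])
        else:
            middle.append(word_vecs[word])
    # Assumes character and word vectors share the same dimension; a group
    # with no matching word falls back to a zero vector of that dimension.
    dim = len(char_vec)
    avg = lambda vs: np.mean(vs, axis=0) if vs else np.zeros(dim)
    a, b, c = avg(begin), avg(middle), avg(end)
    return np.concatenate([char_vec, a, b, c])   # spliced second character vector
```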
In another exemplary embodiment, determining the second character vector of the same character according to the second word vectors of the words containing the same character and the first character vector of the same character comprises: determining an average vector of the second word vectors of at least one word containing the same character; and splicing the first character vector of the same character with the average vector to obtain the second character vector of the same character.
The second word vectors of the at least one word containing the same character are averaged to obtain an average vector, and the first character vector of the same character is spliced with this average vector to obtain the second character vector, which is taken as the new vector of the character. In this way the information of the words in which the character appears is introduced, which improves the accuracy of the entity recognition result.
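A corresponding sketch of this simpler variant, under the same illustrative assumptions as the previous example:

```python
import numpy as np

def second_character_vector_simple(char, char_vec, word_list, word_vecs):
    """Average the second word vectors of all words containing the character,
    then concatenate that single average with the character's own vector."""
    vecs = [word_vecs[w] for w in word_list if char in w and len(w) >= 2]
    avg = np.mean(vecs, axis=0) if vecs else np.zeros(len(char_vec))
    return np.concatenate([char_vec, avg])
```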
In step S14, entity recognition is performed on the object topic text according to the second character vector of each character and the first word vector of the entity word to obtain an entity recognition result.
After the second character vector of each character in the object topic text is obtained, the second character vector of each character and the first word vector of the entity word are input into the entity recognition model, and entity recognition is performed on the object topic text through the entity recognition model to obtain the entity recognition result of the object topic text, i.e., the label information of each character in the object topic text.
In an exemplary embodiment, determining the character vector of each character as the first character vector, determining the word vector of the entity word as the first word vector, and determining the word vector of each containing word as the second word vector includes: determining the character vector of each character as the first character vector through the word vector model, determining the word vector of the entity word as the first word vector through the word vector model, and determining the word vector of each containing word as the second word vector through the word vector model, wherein the word vector model is obtained by training based on the entity word list.
Performing entity recognition on the object topic text according to the second character vector of each character and the first word vector of the entity word to obtain an entity recognition result includes: performing entity recognition on the object topic text through an entity recognition model according to the second character vector of each character and the first word vector of the entity word to obtain the entity recognition result, wherein the entity recognition model is obtained by training based on a labeled data set derived from the entity word list.
The word vector model may be a model obtained by fine-tuning a pre-trained BERT model. The entity recognition model is based on a vocabulary-enhancement model and is trained on the labeled data set.
After the word vector model is obtained, the character vector of each character in the object topic text is determined by the word vector model and taken as the first character vector of that character; the word vector of the entity word is determined by the word vector model and taken as the first word vector; and for each of the containing words, the word vector model is likewise used to determine the corresponding word vector, which is taken as the second word vector.
The pre-trained BERT model may be fine-tuned using the existing entity word list and the object topic text corpus. During fine-tuning, the entity word list is used as the segmentation reference to construct word vectors, and the pre-trained BERT model is fine-tuned a second time using the entity words in the entity word list and the object topic text corpus; the word vector model is obtained when this training is finished. The pre-trained BERT may be a RoBERTa Chinese BERT model. Once the word vector model is trained, it can directly output all segmentation results of an object topic text together with the corresponding word vectors or character vectors.
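A minimal sketch of how character vectors and word vectors could be read out of such a fine-tuned BERT checkpoint with the Hugging Face transformers library is shown below; the checkpoint name is a placeholder, and mean-pooling token vectors to obtain a word vector is an assumption of this example rather than the disclosed training procedure.

```python
import torch
from transformers import BertTokenizerFast, BertModel

# Placeholder path: the checkpoint produced by the second fine-tuning run described above.
tokenizer = BertTokenizerFast.from_pretrained("finetuned-word-vector-model")
bert = BertModel.from_pretrained("finetuned-word-vector-model").eval()

@torch.no_grad()
def character_vectors(text):
    """One vector per input token; for Chinese text this is usually one token per character."""
    enc = tokenizer(text, return_tensors="pt", add_special_tokens=False)
    return bert(**enc).last_hidden_state[0]          # shape: (num_tokens, hidden_size)

@torch.no_grad()
def word_vector(word):
    """Approximate the vector of a word or entity word by mean-pooling its token vectors."""
    return character_vectors(word).mean(dim=0)
```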
Before the entity recognition model is trained, a labeled data set for training needs to be constructed. When constructing the labeled data set, the object topic text corpus is reverse-labeled according to the existing entity word list: the corpus is matched against the entity words in the entity word list, and the matched entity words in the corpus are labeled to obtain the labeled data set. When the corpus is matched against the entity words in the entity word list, if at least two entity words with an inclusion relationship are matched, the longer entity word is taken as the entity word in the corpus. For example, if the topic text is "I want to buy a long-sleeved T-shirt" and the entity word list contains both "T-shirt" and "long-sleeved T-shirt", the entity word "long-sleeved T-shirt" contains the entity word "T-shirt", so "long-sleeved T-shirt" is determined to be the entity word in the topic text.
When the matched entity words in the object topic text corpus are labeled, the corpus can be labeled with a BIO or BIOES tagging scheme. In the BIO scheme, if there is only one entity category, the labels for the beginning character of an entity, a middle character of an entity, and a character outside any entity are B, I and O respectively. For example, for the corpus "I love eating apples", where "apple" is the entity word to be labeled, the characters of "apple" are tagged B and I and all other characters are tagged O. If multiple categories need to be labeled, B and I carry the corresponding category. For example, for the corpus "I love eating apples and do not love drinking cola", where "apple" and "cola" are entity words of different categories (say "apple" is of category a and "cola" is of category d), the characters of "apple" are tagged B-a and I-a, the characters of "cola" are tagged B-d and I-d, and all other characters are tagged O. In the BIOES scheme, B marks the beginning of an entity, I a middle character, O a character outside any entity, E the end of an entity, and S a single-character entity.
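For illustration, a reverse-labeling routine in the BIO scheme that applies the longest-match rule above might look like the following sketch; the function name and the optional category mapping are assumptions of the example, not the disclosed implementation.

```python
def bio_label(text, entity_words, category=None):
    """Return one BIO tag per character of `text`, matching the entity word list
    with preference for the longer entity word (longest match)."""
    tags = ["O"] * len(text)
    words = sorted(entity_words, key=len, reverse=True)   # longer entity words first
    i = 0
    while i < len(text):
        for w in words:
            if text.startswith(w, i) and all(t == "O" for t in tags[i:i + len(w)]):
                suffix = f"-{category[w]}" if category else ""
                tags[i] = "B" + suffix                     # beginning character of the entity
                for j in range(i + 1, i + len(w)):
                    tags[j] = "I" + suffix                 # middle and ending characters
                i += len(w) - 1
                break
        i += 1
    return tags
```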
After the labeled data set is obtained by matching the object topic text corpus against the entity words in the entity word list, a small amount of manual review can be performed to correct obvious errors in the labeled data set; if the data volume is large, for example when the word list is at the ten-thousand-word level, the manual review can be omitted.
After the labeled data set is obtained, the entity recognition model can be trained. The object topic text sample in each piece of sample data in the labeled data set is input into the word vector model, and the second character vector of each character and the first word vector of each entity word are determined through the word vector model in the manner described above. The second character vector of each character and the first word vector of each entity word are then input into the initial entity recognition model, and the entity recognition result corresponding to the object topic text sample is determined by the initial entity recognition model. The network parameters of the initial entity recognition model are adjusted according to the entity recognition result and the labeling result in the sample data, so that the initial entity recognition model is trained; the trained entity recognition model is obtained when a preset training target is reached. After the trained entity recognition model is obtained, it can be used to perform entity recognition on object topic texts.
The data labeling problem is thus solved at low cost by remote (distant) supervision: the text corpus is reverse-labeled with the commodity entity word list, and as long as the word list covers all required entity categories, the labeling accuracy is close to that of manual labeling, which greatly reduces the cost of data labeling.
In an exemplary embodiment, performing entity recognition on the object topic text according to the second character vector of each character and the first word vector of the entity word to obtain an entity recognition result includes: performing entity recognition on the object topic text according to the second character vector of each character, the first position information of each character in the object topic text, the first word vector, and the second position information of the entity word in the object topic text to obtain the entity recognition result.
The first position information includes first start position information and first end position information, and the second position information includes second start position information and second end position information.
The entity recognition model may have a FLAT-Lattice structure. For each character and each word, two pieces of position information are constructed, namely head position information (head position encoding) and tail position information (tail position encoding), so that the ordinary Lattice structure can be reconstructed. As shown in FIG. 2, the conventional Lattice structure is flattened from a directed acyclic graph into a planar structure composed of a plurality of spans: for each character, the start position information and the end position information are the same, i.e., the first start position information equals the first end position information; for each word, the start and end positions span from its first character to its last character, i.e., the second start position information and the second end position information generally differ.
The second character vector of each character, the first position information of each character in the object topic text, the first word vector of the entity word and the second position information of the entity word in the object topic text are input into the entity recognition model, and entity recognition is performed on the object topic text through the entity recognition model to obtain the entity recognition result. Through the first position information of the characters and the second position information of the entity words, the entity recognition model can capture long-distance dependencies well, which improves the accuracy of entity recognition.
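The flattening step can be illustrated with the following sketch, where matched_words is assumed to be a list of (word, start index, end index) tuples over the character sequence; the function is an illustration of the FLAT-style input construction, not the disclosed implementation.

```python
def flat_lattice(characters, matched_words):
    """Flatten characters plus matched words into one sequence with head/tail
    (start/end) position information, as in a FLAT-Lattice input."""
    tokens, heads, tails = [], [], []
    for i, ch in enumerate(characters):
        tokens.append(ch)
        heads.append(i)       # first start position information
        tails.append(i)       # first end position information: equal for a single character
    for word, start, end in matched_words:
        tokens.append(word)
        heads.append(start)   # second start position information
        tails.append(end)     # second end position information: spans the whole word
    return tokens, heads, tails
```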
In an exemplary embodiment, performing entity recognition on the object topic text through the entity recognition model according to the second character vector of each character, the first position information of each character in the object topic text, the first word vector, and the second position information of the entity word in the object topic text to obtain the entity recognition result includes:
determining, by an encoder, a relative position encoding of each character with respect to the other characters, among the plurality of characters, other than the current character, according to the first position information of each character in the object topic text and the second position information of the entity word in the object topic text;
determining an attention weight of each character with respect to the other characters through an attention mechanism according to the second character vector of each character, the first word vector, and the relative position encoding of each character with respect to the other characters;
and performing entity recognition on the object topic text through a decoder according to the attention weight of each character with respect to the other characters to obtain the entity recognition result of the object topic text.
The entity recognition model has a FLAT-Lattice structure. FIG. 3 is a schematic diagram of the FLAT-Lattice structure in an embodiment of the present disclosure. As shown in FIG. 3, the inputs of the structure are mainly each character in the object topic text, the first position information of each character in the object topic text, the entity word, and the second position information of the entity word in the object topic text. Encoding and entity recognition are then performed by an encoder with a Transformer structure: the encoder performs relative position encoding for each character based on the first position information of each character in the object topic text and the second position information of the entity word in the object topic text, i.e., it determines the relative position encoding of each character with respect to the other characters. The attention mechanism in the Transformer structure then determines the attention weight of each character with respect to the other characters based on the second character vector of each character, the first word vector of each entity word, and the relative position encoding of each character with respect to the other characters, so as to obtain the interaction information between the characters and the entity words. Finally, according to the attention weight of each character with respect to the other characters, entity recognition is performed on the object topic text through a decoder of the Transformer structure to obtain the entity recognition result of the object topic text.
The relative position encoding depends on four relative distances, which also encode the relationship between the characters and the entity words. The four relative distances are expressed as follows:

$$d^{(hh)}_{ij} = head[i] - head[j], \quad d^{(ht)}_{ij} = head[i] - tail[j], \quad d^{(th)}_{ij} = tail[i] - head[j], \quad d^{(tt)}_{ij} = tail[i] - tail[j]$$

where $d^{(hh)}_{ij}$ denotes the distance from the start position of the i-th character or entity word to the start position of the j-th character or entity word, $d^{(ht)}_{ij}$ denotes the distance from the start position of the i-th character or entity word to the end position of the j-th character or entity word, $d^{(th)}_{ij}$ denotes the distance from the end position of the i-th character or entity word to the start position of the j-th character or entity word, and $d^{(tt)}_{ij}$ denotes the distance from the end position of the i-th character or entity word to the end position of the j-th character or entity word; $head[i]$ denotes the first start position information of the i-th character or the second start position information of the i-th entity word, $head[j]$ denotes the first start position information of the j-th character or the second start position information of the j-th entity word, $tail[i]$ denotes the first end position information of the i-th character or the second end position information of the i-th entity word, and $tail[j]$ denotes the first end position information of the j-th character or the second end position information of the j-th entity word.
The relative position encoding is expressed as follows:

$$R_{ij} = \mathrm{ReLU}\!\left(W_r \left(p_{d^{(hh)}_{ij}} \oplus p_{d^{(ht)}_{ij}} \oplus p_{d^{(th)}_{ij}} \oplus p_{d^{(tt)}_{ij}}\right)\right), \qquad p_d^{(2k)} = \sin\!\left(\frac{d}{10000^{2k/d_{model}}}\right), \quad p_d^{(2k+1)} = \cos\!\left(\frac{d}{10000^{2k/d_{model}}}\right)$$

where ReLU is the activation function, $W_r$ is a learnable parameter, $\oplus$ denotes the concatenation operator, $p_d$ denotes the embedded position vector, $d$ is any one of $d^{(hh)}_{ij}$, $d^{(ht)}_{ij}$, $d^{(th)}_{ij}$ and $d^{(tt)}_{ij}$, $d_{model}$ denotes the vector dimension to which the encoding is mapped, and $k$ is the dimension index of the position encoding.
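A minimal PyTorch sketch of this relative position encoding is given below; it assumes an even d_model and concatenates the sine and cosine components instead of interleaving them by even and odd indices, which is a common simplification rather than the disclosed implementation.

```python
import torch
import torch.nn as nn

class RelativePositionEncoding(nn.Module):
    """Sinusoidal embeddings of the four relative distances, concatenated and
    passed through ReLU(W_r * .), following the formulas above."""
    def __init__(self, d_model):
        super().__init__()
        self.d_model = d_model
        self.w_r = nn.Linear(4 * d_model, d_model, bias=False)   # learnable W_r

    def sinusoid(self, dist):
        # dist: (n, n) tensor of signed distances -> (n, n, d_model)
        k = torch.arange(self.d_model // 2, dtype=torch.float, device=dist.device)
        angle = dist.unsqueeze(-1).float() / torch.pow(10000.0, 2 * k / self.d_model)
        return torch.cat([torch.sin(angle), torch.cos(angle)], dim=-1)

    def forward(self, head, tail):
        # head, tail: (n,) start/end positions of every character or matched word
        d_hh = head[:, None] - head[None, :]
        d_ht = head[:, None] - tail[None, :]
        d_th = tail[:, None] - head[None, :]
        d_tt = tail[:, None] - tail[None, :]
        p = torch.cat([self.sinusoid(d) for d in (d_hh, d_ht, d_th, d_tt)], dim=-1)
        return torch.relu(self.w_r(p))    # (n, n, d_model) relative position encodings
```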
Through the relative position encoding, the position awareness and direction awareness of the encoder in the Transformer structure are improved, the context information in the object topic text can be distinguished well, and the accuracy of entity recognition is improved. The vocabulary-enhanced entity recognition model makes full use of the existing data and improves the recognition accuracy of the model; building the model by combining vocabulary enhancement with a large-scale pre-trained model improves recognition accuracy, and compared with using BERT alone, precision and recall are each improved by about 2 percent. Moreover, the word vector model and the entity recognition model are lightweight models, so fast and highly accurate recognition can be achieved, with recognition accuracy about two percent higher than that of the original, more complex model. In addition, the entity word list is used for filtering so that only the required object entities are retained, which facilitates subsequent entity linking.
The entity recognition method provided by this exemplary embodiment segments an object topic text into a plurality of characters and at least one entity word according to an entity word list and determines the words containing each character in the object topic text; determines the first character vector of each character, the first word vector of the entity word and the second word vector of each containing word; determines the second character vector of the same character according to the second word vectors of the words containing the same character and the first character vector of the same character; and performs entity recognition on the object topic text according to the second character vector of each character and the first word vector of the entity word to obtain an entity recognition result. Because the second character vector, which fuses the information of the second word vectors and the first character vector, is used as the vector of each character, the information about how the character is distributed among the words of the whole text is enhanced; and because the second character vector of each character is combined with the first word vector of the entity word for entity recognition, the boundary information of the entity word is strengthened, which improves the accuracy of entity recognition.
On the basis of the above technical solution, the method further includes: replacing the entity word list with a received new entity word list; labeling the object topic text corpus according to the new entity word list to obtain a first labeled data set; training the pre-trained word vector model according to the new entity word list and the object topic text corpus to obtain a new word vector model; determining, according to the new entity word list and the new word vector model, the second character vector of each character and the first word vector of each entity word in the sample data of the first labeled data set; and training an initial entity recognition model according to the first labeled data set and the second character vector of each character and the first word vector of each entity word in the sample data of the first labeled data set to obtain a new entity recognition model.
The new entity word list may be obtained by performing entity recognition for a period of time using the word vector model and the entity recognition model, screening the recognition results, and storing the screened entity words as the new entity word list.
Entity recognition is usually performed as preparatory work for subsequent entity linking and classification. If the entity word list needs to be replaced according to requirements, the original entity word list is replaced with the received new entity word list, and the object topic text corpus is labeled according to the new entity word list to obtain the first labeled data set. The pre-trained word vector model is fine-tuned according to the new entity word list and the object topic text corpus, so that it segments the corpus according to the new entity word list and determines the vectors of the segmented characters and words; the new word vector model is obtained when the fine-tuning is finished. Then, the second character vector of each character and the first word vector of each entity word in the sample data of the first labeled data set can be determined with the new word vector model according to the new entity word list. The second character vector of each character and the first word vector of each entity word are then input into an initial entity recognition model, the entity recognition result corresponding to the sample data is determined through the initial entity recognition model, and the network parameters of the initial entity recognition model are adjusted according to the entity recognition result and the labeling result in the sample data, so that the initial entity recognition model is retrained; the new entity recognition model is obtained when a preset training target is reached. The initial entity recognition model is a model whose network parameters are randomly initialized. A user can thus specify a new entity word list according to requirements, so that the corpus is labeled according to the new entity word list to obtain the first labeled data set and the word vector model and the entity recognition model are retrained; the resulting new word vector model and new entity recognition model can recognize the entities in the new entity word list, so the word vector model and the entity recognition model can be updated on demand.
On the basis of the above technical solution, the method further includes: if an entity word in the entity recognition result does not exist in the entity word list, storing that entity word into a new word list; de-duplicating the entity words in the new word list, and adding the de-duplicated entity words to the entity word list to obtain an expanded word list; labeling the object topic text corpus according to the expanded word list to obtain a second labeled data set; performing iterative training on the word vector model according to the expanded word list and the object topic text corpus to obtain an iterated word vector model; determining, according to the expanded word list and the iterated word vector model, the second character vector of each character and the first word vector of each entity word in the sample data of the second labeled data set; and performing iterative training on the entity recognition model according to the second labeled data set and the second character vector of each character and the first word vector of each entity word in the sample data of the second labeled data set to obtain an iterated entity recognition model.
After entity recognition has been performed on object topic texts for a period of time with the trained word vector model and entity recognition model, entity words that do not exist in the entity word list may be recognized, i.e., the obtained entity recognition results contain entity words that are absent from the entity word list; these entity words are stored into a new word list. The entity words in the new word list are de-duplicated, and the de-duplicated entity words are added to the entity word list to obtain the expanded word list. The object topic text corpus is labeled according to the expanded word list in the manner described above to obtain the second labeled data set. Iterative training is performed on the previously trained word vector model based on the expanded word list and the object topic text corpus to obtain the iterated word vector model, so that the iterated word vector model can properly determine the word vectors of the entity words in the expanded word list. The second character vector of each character and the first word vector of each entity word in the sample data of the second labeled data set are then determined according to the expanded word list and the iterated word vector model. The second character vector of each character and the first word vector of each entity word are input into the original entity recognition model, the entity recognition result corresponding to the sample data is determined through the original entity recognition model, and the network parameters of the original entity recognition model are adjusted according to the entity recognition result and the labeling result in the sample data, so that iterative training is performed on the original entity recognition model; the iterated entity recognition model is obtained when a preset training target is reached. In this way the word list is expanded according to the recognized entity words and the data is re-labeled according to the expanded word list, which saves a large amount of data labeling cost; the model can be iteratively trained according to the expanded word list, so more and more entity words can be recognized while the cost of iteratively updating the model is reduced.
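A minimal sketch of the word-list expansion step (collecting newly recognized entity words, de-duplicating them and merging them into the entity word list) is as follows; the names are illustrative assumptions:

```python
def expand_word_list(entity_word_list, recognition_results):
    """Collect recognized entity words that are absent from the word list,
    de-duplicate them, and merge them into an expanded word list."""
    known = set(entity_word_list)
    new_words = {w for result in recognition_results for w in result if w not in known}
    return entity_word_list + sorted(new_words)    # expanded word list used for re-labeling
```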
Fig. 4 is a block diagram illustrating an entity recognition apparatus according to an example embodiment. Referring to Fig. 4, the apparatus includes a text segmentation module 21, a word vector determination module 22, a character vector determination module 23, and an entity recognition module 24.
A text segmentation module 21 configured to perform segmentation of the object subject text into a plurality of characters and at least one entity word according to an entity word list, and determine a vocabulary comprising each character in the object subject text;
a word vector determination module 22 configured to perform determining a word vector of each character as a first word vector, and determining a word vector of the entity word as a first word vector, and determining a word vector of the vocabulary as a second word vector;
a character vector determination module 23 configured to perform determining a second word vector of the same character according to a second word vector and a first word vector comprising the same character;
and the entity recognition module 24 is configured to perform entity recognition on the object subject text according to the second word vector of each character and the first word vector of the entity word, so as to obtain an entity recognition result.
Optionally, the character vector determination module includes:
a different-position vocabulary determination unit configured to perform determining, among a plurality of vocabularies comprising the same character, the vocabularies beginning with, centered on, and ending with the same character, respectively;
a first average vector determination unit configured to perform determining an average vector of the second word vectors of words beginning with the same character as a first average vector, determining an average vector of the second word vectors of words centered on the same character as a second average vector, and determining an average vector of the second word vectors of words ending with the same character as a third average vector;
a first word vector determination unit configured to perform splicing the first word vector, the first average vector, the second average vector, and the third average vector of the same character as a second word vector of the same character (an illustrative sketch of this splicing is given below).
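For illustration, a numpy sketch of this position-aware splicing follows; the vector dimensions, the zero-vector fallback for empty position groups, and the function name are assumptions of the sketch rather than details fixed by the disclosure.

    import numpy as np

    def fuse_character_vector(char_vec, begin_vecs, middle_vecs, end_vecs, word_dim):
        # Splice [character vector ; avg(begin) ; avg(middle) ; avg(end)] into one vector.
        def avg(vecs):
            return np.mean(vecs, axis=0) if vecs else np.zeros(word_dim)
        return np.concatenate([char_vec, avg(begin_vecs), avg(middle_vecs), avg(end_vecs)])

    char_vec = np.ones(4)                                    # first word vector of the character
    begin = [np.array([1.0, 2.0, 3.0])]                      # vocabularies starting with the character
    middle = []                                              # no vocabulary has it in the middle
    end = [np.array([0.0, 1.0, 0.0]), np.array([2.0, 1.0, 2.0])]
    print(fuse_character_vector(char_vec, begin, middle, end, word_dim=3).shape)  # (13,)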
Optionally, the character vector determination module includes:
a second average vector determination unit configured to perform determining an average vector of second word vectors of at least one word including the same character;
a second word vector determination unit configured to perform splicing the first word vector of the same character and the average vector as a second word vector of the same character (a sketch of this simpler fusion is given below).
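The simpler fusion can be sketched in the same way; again, the dimensions and the zero-vector fallback are assumptions of the sketch.

    import numpy as np

    def fuse_character_vector_simple(char_vec, vocab_vecs, word_dim):
        # Average the second word vectors of every vocabulary containing the character,
        # then splice the average onto the character's first word vector.
        avg = np.mean(vocab_vecs, axis=0) if vocab_vecs else np.zeros(word_dim)
        return np.concatenate([char_vec, avg])

    print(fuse_character_vector_simple(np.ones(4), [np.zeros(3), np.ones(3)], word_dim=3).shape)  # (7,)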
Optionally, the entity identification module is configured to perform:
and performing entity recognition on the object subject text according to the second word vector of each character, the first position information of each character in the object subject text, the first word vector and the second position information of the entity word in the object subject text to obtain an entity recognition result.
Optionally, the first position information includes first start position information and first end position information, and the second position information includes second start position information and second end position information.
Optionally, the entity identification module includes:
a relative position encoding unit configured to perform determining, through an encoder, a relative position code of each character with respect to the other characters than the current character among the plurality of characters, according to the first position information of each character in the object subject text and the second position information of the entity word in the object subject text;
an attention weight determination unit configured to perform determining, through an attention mechanism, an attention weight of each character relative to the other characters according to the second word vector of each character, the first word vector, and the relative position code of each character with respect to the other characters;
and the entity identification unit is configured to perform entity recognition on the object subject text through a decoder according to the attention weight of each character relative to the other characters, to obtain an entity recognition result of the object subject text (an illustrative sketch of this attention computation is given below).
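The following self-contained sketch illustrates, under simplifying assumptions, how relative positions derived from start/end information can bias content-based attention weights; the penalty form, the mixing coefficient, and the function name are choices of the sketch, not the encoder actually claimed.

    import numpy as np

    def attention_weights(vectors, starts, ends):
        # vectors: (n, d) fused vectors; starts/ends: position information of each span.
        n, d = vectors.shape
        scores = vectors @ vectors.T / np.sqrt(d)            # content-based similarity
        rel = -np.abs(np.subtract.outer(starts, starts))     # spans further apart get a larger penalty
        rel = rel - np.abs(np.subtract.outer(ends, ends))
        logits = scores + 0.1 * rel                          # illustrative mixing coefficient
        np.fill_diagonal(logits, -np.inf)                    # weight only the *other* characters
        exp = np.exp(logits - logits.max(axis=1, keepdims=True))
        return exp / exp.sum(axis=1, keepdims=True)

    vecs = np.random.rand(4, 8)
    print(attention_weights(vecs, starts=np.array([0, 1, 2, 0]), ends=np.array([0, 1, 2, 2])).round(2))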
Optionally, the word vector determination module is configured to perform:
determining a word vector of each character as a first word vector through a word vector model, determining a word vector of the entity word as a first word vector through the word vector model, and determining a word vector of the vocabulary as a second word vector through the word vector model, wherein the word vector model is obtained by training based on the entity word list;
the entity recognition module is configured to perform:
and carrying out entity recognition on the object subject text through an entity recognition model according to the second word vector of each character and the first word vector of the entity word to obtain an entity recognition result, wherein the entity recognition model is obtained by training based on a labeled data set obtained by labeling according to the entity word list (a hypothetical end-to-end sketch is given below).
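A hypothetical end-to-end sketch of how the two trained models cooperate at inference time follows; WordVectorModel and EntityRecognitionModel are stand-ins invented for this sketch and do not name any API of the disclosure.

    import numpy as np

    class WordVectorModel:
        # Stand-in for the word vector model trained on the entity word list.
        def __init__(self, dim=4):
            self.dim = dim
        def vector(self, token):
            rng = np.random.default_rng(abs(hash(token)) % (2**32))  # toy embedding seeded from the token
            return rng.random(self.dim)

    class EntityRecognitionModel:
        # Stand-in for the entity recognition model trained on the labeled data set.
        def __init__(self, entity_word_list):
            self.entity_word_list = set(entity_word_list)
        def recognize(self, tokens, vectors):
            return [t for t in tokens if t in self.entity_word_list]

    wv = WordVectorModel()
    ner = EntityRecognitionModel(["gradient descent"])
    tokens = ["we", "use", "gradient descent", "here"]
    vectors = [wv.vector(t) for t in tokens]
    print(ner.recognize(tokens, vectors))  # ['gradient descent']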
Optionally, the apparatus further comprises:
a word list replacement module configured to perform replacing the original entity word list with the received new entity word list;
the first labeling module is configured to label the object subject text corpus according to the new entity word list to obtain a first labeled data set;
a word vector retraining module configured to perform fine-tuning training on the pre-trained word vector model according to the new entity word list and the object subject text corpus to obtain a new word vector model;
a first sample vector determination module configured to perform determining a second word vector for each character and a first word vector for each entity word in sample data in the first labeled data set according to the new entity word list and the new word vector model;
and the entity recognition model retraining module is configured to train the initial entity recognition model according to the first labeled data set, the second word vector of each character and the first word vector of each entity word in the sample data in the first labeled data set, so as to obtain a new entity recognition model.
Optionally, the apparatus further comprises:
the entity word storage module is configured to store the entity words in the entity recognition result into a new word list if the entity words in the entity recognition result do not exist in the entity word list;
the word list expansion module is configured to perform deduplication of the entity words in the new word list and add the deduplicated entity words to the entity word list to obtain an expanded word list;
the second labeling module is configured to label the object subject text corpus according to the expanded word list to obtain a second labeled data set;
the word vector iterative training module is configured to perform iterative training on the word vector model according to the expanded word list and the object subject text corpus to obtain an iterated word vector model;
a second sample vector determination module configured to perform determining a second word vector of each character and a first word vector of each entity word in sample data in the second labeled data set according to the expanded word list and the iterated word vector model;
and the entity recognition model iterative training module is configured to perform iterative training on the entity recognition model according to the second labeled data set, the second word vector of each character and the first word vector of each entity word in the sample data in the second labeled data set, so as to obtain an iterated entity recognition model.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
FIG. 5 is a block diagram illustrating an electronic device in accordance with an example embodiment. For example, the electronic device 300 may be provided as a server. Referring to FIG. 5, electronic device 300 includes a processing component 322 that further includes one or more processors and memory resources, represented by memory 332, for storing instructions, such as applications, that are executable by processing component 322. The application programs stored in memory 332 may include one or more modules that each correspond to a set of instructions. Further, the processing component 322 is configured to execute instructions to perform the entity identification method described above.
The electronic device 300 may also include a power component 326 configured to perform power management of the electronic device 300, a wired or wireless network interface 350 configured to connect the electronic device 300 to a network, and an input/output (I/O) interface 358. The electronic device 300 may operate based on an operating system stored in the memory 332, such as Windows Server, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, there is also provided a computer-readable storage medium comprising instructions, such as the memory 332 comprising instructions, which are executable by the processing component 322 of the electronic device 300 to perform the entity identification method described above. Alternatively, the computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, there is also provided a computer program product comprising a computer program or instructions which, when executed by a processor, implements the above-described entity identification method.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.