Text information extraction method, apparatus and device
1. A text information extraction method, characterized by comprising:
extracting text features and part-of-speech features of a text to be processed with a preset length;
fusing the text features and the part-of-speech features of the text to be processed to obtain text fusion features of the text to be processed;
determining a sequence labeling model of a first level as a sequence labeling model of a current level;
inputting the text fusion features of the text to be processed into the sequence labeling model of the current level, and labeling the information item to be extracted corresponding to the sequence labeling model of the current level to obtain a labeling result of the text to be processed output by the sequence labeling model of the current level;
determining whether a sequence labeling model of a next level exists;
if the sequence labeling model of the next level exists, fusing the labeling result of the text to be processed output by the sequence labeling model of the current level with the text fusion feature of the text to be processed to obtain the text fusion feature of the text to be processed again;
determining the sequence labeling model of the next level as the sequence labeling model of the current level, and re-executing the step of inputting the text fusion features of the text to be processed into the sequence labeling model of the current level and the subsequent steps;
if the sequence labeling model of the next level does not exist, obtaining the labeling results of the text to be processed output by the sequence labeling models of all levels;
analyzing the labeling results of the text to be processed output by the sequence labeling models of all levels to obtain information extraction contents of information items to be extracted at different levels included in the text to be processed.
2. The method according to claim 1, wherein before extracting text features and part-of-speech features of a preset length of text to be processed, the method further comprises:
performing redundant information filtering and sensitive information desensitization on an original text to obtain a first target text;
if the length of the first target text is greater than the preset length, dividing the first target text into a plurality of second target texts whose lengths are less than or equal to the preset length, and padding the lengths of the second target texts to the preset length to generate texts to be processed;
if the length of the first target text is less than the preset length, padding the length of the first target text to the preset length to generate a text to be processed;
and if the length of the first target text is equal to the preset length, determining the first target text as a text to be processed.
3. The method according to claim 1, wherein after obtaining information extraction contents of information items to be extracted at different levels included in the text to be processed, the method further comprises:
acquiring text features of a target information extraction content and text features of a target term text, wherein the target information extraction content is any item of the information extraction contents, and the target term text is any item of predetermined term texts;
matching the text features of the target information extraction content with the text features of the target term text;
and if the text features of the target information extraction content are matched with the text features of the target term text, replacing the target information extraction content with the target term text.
4. The method of claim 1, further comprising:
initializing sequence labeling models of all levels;
determining the sequence labeling model of the first level as the sequence labeling model of the current level;
inputting text fusion features of a training text into the sequence labeling model of the current level, and labeling the information item to be extracted corresponding to the sequence labeling model of the current level to obtain a labeling result of the training text output by the sequence labeling model of the current level;
obtaining a loss value of the sequence labeling model of the current level according to a standard labeling result, in the training text, of the information item to be extracted corresponding to the sequence labeling model of the current level and the labeling result of the training text output by the sequence labeling model of the current level;
determining whether a sequence labeling model of a next level exists;
if the sequence labeling model of the next level exists, fusing the labeling result of the training text output by the sequence labeling model of the current level with the text fusion feature of the training text to obtain the text fusion feature of the training text again;
determining the sequence labeling model of the next level as the sequence labeling model of the current level, and re-executing the step of inputting the text fusion features of the training text into the sequence labeling model of the current level and the subsequent steps;
if the sequence labeling model of the next level does not exist, obtaining loss values of the sequence labeling models of all levels;
weighting and summing the loss values of the sequence labeling models of all levels to obtain a comprehensive loss value, and adjusting the sequence labeling models of all levels according to the comprehensive loss value;
and re-executing the step of determining the sequence labeling model of the first level as the sequence labeling model of the current level and the subsequent steps until a preset stop condition is reached, to obtain the sequence labeling models of all levels generated by training.
5. The method according to claim 1 or 4, wherein the number of levels of the sequence labeling models and the information item to be extracted corresponding to the sequence labeling model of each level are predetermined according to the levels of the information items to be extracted.
6. The method according to claim 1, wherein the extracting text features and part-of-speech features of the text to be processed with the preset length comprises:
inputting the text to be processed with the preset length into an ERNIE model to obtain text features of the text to be processed; the text features of the text to be processed represent the grammar and semantics of the text to be processed and the positions of all characters in the text to be processed; the text features of the text to be processed are an m×n-dimensional text feature vector, wherein m is the preset length and n is a positive integer;
and inputting the text to be processed into a part-of-speech recognition model to obtain part-of-speech features of the text to be processed, wherein the part-of-speech features of the text to be processed are an m×1-dimensional part-of-speech feature vector.
7. The method according to claim 6, wherein the fusing the text features and the part-of-speech features of the text to be processed to obtain the text fusion features of the text to be processed comprises:
mapping the m×1-dimensional part-of-speech feature vector into an m×n-dimensional part-of-speech feature vector;
and fusing the m×n-dimensional part-of-speech feature vector with the m×n-dimensional text feature vector to obtain a text fusion feature of the text to be processed, wherein the text fusion feature of the text to be processed is an m×n-dimensional text fusion feature vector.
8. The method according to claim 7, wherein the labeling result of the text to be processed output by the sequence labeling model of the current level is an m×1-dimensional labeling result vector;
the fusing the labeling result of the text to be processed output by the sequence labeling model of the current level with the text fusion feature of the text to be processed to obtain the text fusion feature of the text to be processed again comprises:
mapping the m×1-dimensional labeling result vector into an m×n-dimensional labeling result vector;
and fusing the m×n-dimensional labeling result vector with the m×n-dimensional text fusion feature vector to obtain the text fusion feature of the text to be processed again.
9. A text information extraction apparatus, characterized in that the apparatus comprises:
the extraction unit is used for extracting text features and part-of-speech features of a text to be processed with a preset length;
the first fusion unit is used for fusing the text features and the part-of-speech features of the text to be processed to obtain text fusion features of the text to be processed;
the first determining unit is used for determining the sequence labeling model of the first level as the sequence labeling model of the current level;
the first labeling unit is used for inputting the text fusion features of the text to be processed into the sequence labeling model of the current level, and labeling the information item to be extracted corresponding to the sequence labeling model of the current level to obtain a labeling result of the text to be processed output by the sequence labeling model of the current level;
the first judging unit is used for judging whether a sequence labeling model of a next level exists;
the second fusion unit is used for, if the sequence labeling model of the next level exists, fusing the labeling result of the text to be processed output by the sequence labeling model of the current level with the text fusion feature of the text to be processed to obtain the text fusion feature of the text to be processed again;
the second determining unit is used for determining the sequence labeling model of the next level as the sequence labeling model of the current level, and re-executing the step of inputting the text fusion features of the text to be processed into the sequence labeling model of the current level and the subsequent steps;
the first obtaining unit is used for obtaining the labeling results of the text to be processed output by the sequence labeling models of all levels if the sequence labeling model of the next level does not exist;
and the analysis unit is used for analyzing the labeling results of the text to be processed output by the sequence labeling models of all levels to obtain information extraction contents of information items to be extracted at different levels included in the text to be processed.
10. A text information extraction device characterized by comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the text information extraction method according to any one of claims 1 to 8 when executing the computer program.
11. A computer-readable storage medium having stored therein instructions that, when run on a terminal device, cause the terminal device to execute the text information extraction method according to any one of claims 1-8.
Background
Texts contain a large amount of text information. When extracting text information from a text, part of the text may have an irregular or incomplete structure and lack a predetermined structural model, so the text information in the text is difficult to extract directly. Such text is, for example, a medical record text written by a doctor in the medical field.
Currently, such texts usually require text processing to extract the text information. However, the extraction process is complicated and the accuracy of the obtained text information is low. Therefore, how to extract text information efficiently and accurately is an urgent problem to be solved.
Disclosure of Invention
In view of this, embodiments of the present application provide a method, an apparatus, and a device for extracting text information, which can label a text to be processed through a multi-level sequence labeling model, and obtain more accurate text information by using a labeling result, so as to realize efficient and accurate text information extraction.
In order to solve the above problem, the technical solution provided by the embodiment of the present application is as follows:
a method of textual information extraction, the method comprising:
extracting text features and part-of-speech features of a text to be processed with a preset length;
fusing the text features and the part-of-speech features of the text to be processed to obtain text fusion features of the text to be processed;
determining a sequence labeling model of a first level as a sequence labeling model of a current level;
inputting the text fusion features of the text to be processed into the sequence labeling model of the current level, and labeling the information item to be extracted corresponding to the sequence labeling model of the current level to obtain a labeling result of the text to be processed output by the sequence labeling model of the current level;
determining whether a sequence labeling model of a next level exists;
if the sequence labeling model of the next level exists, fusing the labeling result of the text to be processed output by the sequence labeling model of the current level with the text fusion feature of the text to be processed to obtain the text fusion feature of the text to be processed again;
determining the sequence labeling model of the next level as the sequence labeling model of the current level, and re-executing the step of inputting the text fusion features of the text to be processed into the sequence labeling model of the current level and the subsequent steps;
if the sequence labeling model of the next level does not exist, obtaining the labeling results of the text to be processed output by the sequence labeling models of all levels;
analyzing the labeling results of the text to be processed output by the sequence labeling models of all levels to obtain information extraction contents of information items to be extracted at different levels included in the text to be processed.
In a possible implementation manner, before extracting text features and part-of-speech features of a preset length of text to be processed, the method further includes:
performing redundant information filtering and sensitive information desensitization on an original text to obtain a first target text;
if the length of the first target text is greater than the preset length, dividing the first target text into a plurality of second target texts whose lengths are less than or equal to the preset length, and padding the lengths of the second target texts to the preset length to generate texts to be processed;
if the length of the first target text is less than the preset length, padding the length of the first target text to the preset length to generate a text to be processed;
and if the length of the first target text is equal to the preset length, determining the first target text as a text to be processed.
In one possible implementation manner, after obtaining information extraction contents of information items to be extracted at different levels included in the text to be processed, the method further includes:
acquiring text features of a target information extraction content and text features of a target term text, wherein the target information extraction content is any item of the information extraction contents, and the target term text is any item of predetermined term texts;
matching the text features of the target information extraction content with the text features of the target term text;
and if the text features of the target information extraction content are matched with the text features of the target term text, replacing the target information extraction content with the target term text.
In one possible implementation, the method further includes:
initializing sequence labeling models of all levels;
determining the sequence labeling model of the first level as the sequence labeling model of the current level;
inputting text fusion features of a training text into the sequence labeling model of the current level, and labeling the information item to be extracted corresponding to the sequence labeling model of the current level to obtain a labeling result of the training text output by the sequence labeling model of the current level;
obtaining a loss value of the sequence labeling model of the current level according to a standard labeling result, in the training text, of the information item to be extracted corresponding to the sequence labeling model of the current level and the labeling result of the training text output by the sequence labeling model of the current level;
determining whether a sequence labeling model of a next level exists;
if the sequence labeling model of the next level exists, fusing the labeling result of the training text output by the sequence labeling model of the current level with the text fusion feature of the training text to obtain the text fusion feature of the training text again;
determining the sequence labeling model of the next level as the sequence labeling model of the current level, and re-executing the step of inputting the text fusion features of the training text into the sequence labeling model of the current level and the subsequent steps;
if the sequence labeling model of the next level does not exist, obtaining loss values of the sequence labeling models of all levels;
weighting and summing the loss values of the sequence labeling models of all levels to obtain a comprehensive loss value, and adjusting the sequence labeling models of all levels according to the comprehensive loss value;
and re-executing the step of determining the sequence labeling model of the first level as the sequence labeling model of the current level and the subsequent steps until a preset stop condition is reached, to obtain the sequence labeling models of all levels generated by training.
In a possible implementation manner, the number of levels of the sequence labeling models and the information item to be extracted corresponding to the sequence labeling model of each level are predetermined according to the levels of the information items to be extracted.
In a possible implementation manner, the extracting text features and part-of-speech features of a text to be processed with a preset length includes:
inputting the text to be processed with the preset length into an ERNIE model to obtain text features of the text to be processed; the text features of the text to be processed represent the grammar and semantics of the text to be processed and the positions of all characters in the text to be processed; the text features of the text to be processed are an m×n-dimensional text feature vector, wherein m is the preset length and n is a positive integer;
and inputting the text to be processed into a part-of-speech recognition model to obtain part-of-speech features of the text to be processed, wherein the part-of-speech features of the text to be processed are an m×1-dimensional part-of-speech feature vector.
In a possible implementation manner, the fusing the text features and the part-of-speech features of the text to be processed to obtain the text fusion features of the text to be processed includes:
mapping the m×1-dimensional part-of-speech feature vector into an m×n-dimensional part-of-speech feature vector;
and fusing the m×n-dimensional part-of-speech feature vector with the m×n-dimensional text feature vector to obtain a text fusion feature of the text to be processed, wherein the text fusion feature of the text to be processed is an m×n-dimensional text fusion feature vector.
In a possible implementation manner, the labeling result of the text to be processed output by the sequence labeling model of the current level is an m×1-dimensional labeling result vector;
the fusing the labeling result of the text to be processed output by the sequence labeling model of the current level with the text fusion feature of the text to be processed to obtain the text fusion feature of the text to be processed again comprises:
mapping the m×1-dimensional labeling result vector into an m×n-dimensional labeling result vector;
and fusing the m×n-dimensional labeling result vector with the m×n-dimensional text fusion feature vector to obtain the text fusion feature of the text to be processed again.
A text information extraction apparatus, the apparatus comprising:
the extraction unit is used for extracting text features and part-of-speech features of a text to be processed with a preset length;
the first fusion unit is used for fusing the text features and the part-of-speech features of the text to be processed to obtain text fusion features of the text to be processed;
the first determining unit is used for determining the sequence labeling model of the first level as the sequence labeling model of the current level;
the first labeling unit is used for inputting the text fusion features of the text to be processed into the sequence labeling model of the current level, and labeling the information item to be extracted corresponding to the sequence labeling model of the current level to obtain a labeling result of the text to be processed output by the sequence labeling model of the current level;
the first judging unit is used for judging whether a sequence labeling model of a next level exists;
the second fusion unit is used for, if the sequence labeling model of the next level exists, fusing the labeling result of the text to be processed output by the sequence labeling model of the current level with the text fusion feature of the text to be processed to obtain the text fusion feature of the text to be processed again;
the second determining unit is used for determining the sequence labeling model of the next level as the sequence labeling model of the current level, and re-executing the step of inputting the text fusion features of the text to be processed into the sequence labeling model of the current level and the subsequent steps;
the first obtaining unit is used for obtaining the labeling results of the text to be processed output by the sequence labeling models of all levels if the sequence labeling model of the next level does not exist;
and the analysis unit is used for analyzing the labeling results of the text to be processed output by the sequence labeling models of all levels to obtain information extraction contents of information items to be extracted at different levels included in the text to be processed.
A text information extraction device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the above text information extraction method when executing the computer program.
A computer-readable storage medium, having stored therein instructions, which, when run on a terminal device, cause the terminal device to execute the above-mentioned text information extraction method.
Therefore, the embodiment of the application has the following beneficial effects:
according to the text information extraction method, the text information extraction device and the text information extraction equipment, the text features and the part-of-speech features of the text to be processed are extracted, so that the features of the text to be processed in two aspects are extracted, and more comprehensive feature information of the text to be processed can be obtained. The part-of-speech characteristics are helpful for determining the information extraction content more accurately, and the accuracy of the obtained information extraction content can be improved. And inputting the text fusion characteristics obtained by fusing the text characteristics and the part-of-speech characteristics into the sequence tagging model of the first level, so as to tag the information item to be extracted corresponding to the current level. And fusing the obtained labeling result and the text fusion feature to obtain the updated text fusion feature. By replacing the sequence marking model of the current level, the marking of the sequence marking model of each level can be carried out in sequence, and the marking result of the sequence marking model of each level is obtained. The labeling result of the sequence labeling model of each level is fused with the text fusion characteristic to be used as the input of the sequence labeling model of the next level, so that the sequence labeling model can be labeled based on the labeling result of the sequence labeling model of the previous level, and the accuracy of the labeling result of the sequence labeling model is improved. According to the labeling result of the sequence labeling model of each layer, the information extraction content of the information items to be extracted of different layers in the text to be processed can be obtained, the extraction of the information extraction content of multiple layers can be realized, and the dual extraction of the relationship between the information extraction content and the information extraction content is considered. In addition, by acquiring multi-level information extraction contents, extraction of information extraction contents with multiple meanings can be realized. Therefore, the more accurate text information of the text to be processed can be obtained on the basis of automatically extracting the text information.
Drawings
Fig. 1 is a schematic diagram of a framework of an exemplary application scenario provided in an embodiment of the present application;
fig. 2 is a flowchart of a text information extraction method according to an embodiment of the present application;
fig. 3 is a flowchart of another text information extraction method provided in the embodiment of the present application;
fig. 4 is a flowchart of another text information extraction method provided in the embodiment of the present application;
fig. 5 is a schematic diagram illustrating matching of text features of target term texts with text features of target information extraction contents according to an embodiment of the present application;
fig. 6 is a flowchart of another text information extraction method provided in the embodiment of the present application;
fig. 7 is a flowchart for extracting text features and part-of-speech features of a to-be-processed text with a preset length according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a text information extraction apparatus according to an embodiment of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, embodiments accompanying the drawings are described in detail below.
In order to facilitate understanding and explaining the technical solutions provided by the embodiments of the present application, the following description will first describe the background art of the present application.
Research on conventional text information extraction shows that daily generated texts contain a large amount of text information. By extracting the text information from a text, subsequent processing and utilization of the text information can be realized. For example, extracting text information from a medical record text written by a doctor can yield information related to diseases and medicines; analyzing the obtained disease and medicine information then enables the organization and utilization of medical information. However, for some texts, such as unstructured texts, the extraction of text information is complicated. For such texts, structured text could be generated directly by constraining the way the text is produced, but this causes inconvenience to the text generation process and is difficult to apply widely. Alternatively, the text can be processed to extract text information and realize structured processing of the text. However, current text information extraction methods have complex implementation processes and low accuracy, and it is difficult to meet the requirements of text information extraction.
Based on this, the embodiments of the present application provide a text information extraction method, apparatus, and device. Text features and part-of-speech features of the text to be processed are extracted, so that features of the text to be processed in two aspects are obtained and more comprehensive feature information of the text to be processed can be acquired. The part-of-speech features help determine the information extraction content more accurately and can improve the accuracy of the obtained information extraction content. The text fusion features obtained by fusing the text features and the part-of-speech features are input into the sequence labeling model of the first level, so that the information item to be extracted corresponding to the current level is labeled. The obtained labeling result is fused with the text fusion feature to obtain an updated text fusion feature. By replacing the sequence labeling model of the current level, labeling by the sequence labeling model of each level can be performed in sequence, and the labeling result of the sequence labeling model of each level is obtained. The labeling result of each level's sequence labeling model is fused with the text fusion feature and used as the input of the sequence labeling model of the next level, so that each sequence labeling model can label based on the labeling result of the sequence labeling model of the previous level, which improves the accuracy of its labeling result. According to the labeling results of the sequence labeling models of all levels, the information extraction contents of the information items to be extracted at different levels in the text to be processed can be obtained, so that extraction of information extraction contents at multiple levels is realized and both the information extraction contents and the relationships between them are extracted. In addition, by acquiring multi-level information extraction contents, extraction of information extraction contents with multiple meanings can be realized. Therefore, more accurate text information of the text to be processed can be obtained on the basis of automatic text information extraction.
In order to facilitate understanding of the text information extraction method provided in the embodiment of the present application, the following description is made with reference to a scene example shown in fig. 1. Referring to fig. 1, the drawing is a schematic diagram of a framework of an exemplary application scenario provided in an embodiment of the present application.
In practical applications, a text requiring text information extraction is first taken as the text to be processed. Text features and part-of-speech features of the text to be processed with a preset length are extracted and fused to obtain the text fusion features of the text to be processed. The text to be processed is then labeled by a multi-level sequence labeling model, for example, a sequence labeling model with three levels. The sequence labeling model of the first level is first determined as the sequence labeling model of the current level, the text fusion features are input into the sequence labeling model of the current level, and the information item to be extracted corresponding to the sequence labeling model of the current level is labeled, so as to obtain the labeling result of the text to be processed output by the sequence labeling model of the current level, namely the first level. The obtained labeling result is fused with the text fusion feature to obtain the re-fused text fusion feature of the text to be processed. The sequence labeling model of the second level is then determined as the sequence labeling model of the current level, and the re-fused text fusion features are input into it to obtain the corresponding labeling result. The labeling result output by the sequence labeling model of the second level is then fused with the text fusion feature. The sequence labeling model of the third level is determined as the sequence labeling model of the current level, and the fused text fusion features are input into the sequence labeling model of the current level, namely the third level, to obtain the labeling result of the text to be processed output by the sequence labeling model of the third level. After the labeling by the sequence labeling models of the three levels is finished, the labeling results of the text to be processed output by the sequence labeling models of the three levels are obtained and analyzed to obtain the information extraction contents of the information items to be extracted at different levels included in the text to be processed. In this way, the extraction of information items to be extracted at different levels in the text to be processed can be realized, and the accuracy of the text information can be improved on the basis of automatic text information extraction.
Those skilled in the art will appreciate that the block diagram shown in fig. 1 is only one example in which embodiments of the present application may be implemented. The scope of applicability of the embodiments of the present application is not limited in any way by this framework.
Based on the above description, the text information extraction method provided in the present application will be described in detail below with reference to the drawings.
Referring to fig. 2, which is a flowchart of a text information extraction method provided in an embodiment of the present application, as shown in fig. 2, the method may include S201 to S209:
s201: and extracting text features and part-of-speech features of the text to be processed with preset length.
The text to be processed is the text which needs to be subjected to text information extraction. The text to be processed may be unstructured text, for example, in the medical field, the text to be processed may be medical text to be processed, such as medical record text or diagnosis text written by a doctor.
In order to facilitate feature extraction of the text to be processed, the length of the text to be processed may be set to be a preset length. The preset length may specifically represent the number of characters included in the text to be processed. The preset length can be set according to the requirement of processing the text to be processed. In one possible implementation, the preset length may be specifically set to 512 characters.
And performing feature extraction of text features and part-of-speech features for the text to be processed with preset length. The text features refer to features of each character in the text to be processed in terms of text structures, such as text positions, syntax, semantics and the like. Part-of-speech features refer to the characteristics of individual characters in the text to be processed in terms of lexical properties.
By extracting the characteristics of the text structure and the vocabulary characteristics of the text to be processed, the more complete characteristics of the text to be processed can be obtained, and the text information of the text to be processed can be accurately extracted.
In a possible implementation manner, an embodiment of the present application provides a specific implementation manner for extracting text features and part-of-speech features of a to-be-processed text with a preset length, which is specifically referred to below.
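For illustration only (the application does not prescribe a particular toolkit), the following is a minimal sketch of S201 under stated assumptions: a HuggingFace-hosted ERNIE checkpoint is used as the text encoder and jieba is used as the part-of-speech recognition model; the checkpoint name and the character/sub-token alignment are illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel
import jieba.posseg as pseg

PRESET_LENGTH = 512                          # m: preset length in characters
MODEL_NAME = "nghuyong/ernie-1.0-base-zh"    # assumed ERNIE checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def extract_features(text: str):
    # Text features (alpha): an m x n matrix of contextual character representations.
    # Character/sub-token alignment is simplified here for brevity.
    inputs = tokenizer(text, max_length=PRESET_LENGTH, padding="max_length",
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        text_features = encoder(**inputs).last_hidden_state[0]      # (m, n)

    # Part-of-speech features (beta): an m x 1 vector of POS tag ids per character.
    # A fixed POS-tag vocabulary would be shared across texts in practice.
    pos_ids = torch.zeros(PRESET_LENGTH, dtype=torch.long)
    pos_vocab, offset = {}, 0
    for pair in pseg.cut(text):
        tag_id = pos_vocab.setdefault(pair.flag, len(pos_vocab) + 1)
        for _ in pair.word:
            if offset < PRESET_LENGTH:
                pos_ids[offset] = tag_id
                offset += 1
    return text_features, pos_ids.unsqueeze(-1)                      # (m, n), (m, 1)
```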
S202: and fusing the text features and the part-of-speech features of the text to be processed to obtain text fusion features of the text to be processed.
And fusing the text features and the part-of-speech features based on the extracted text features and part-of-speech features of the text to be processed to obtain text fusion features of the text to be processed including the two aspects of features.
Specifically, the text feature corresponding to the text to be processed may be represented as α, the corresponding part-of-speech feature may be represented as β, and the text fusion feature may be represented as α + β.
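As an illustration of this additive fusion (a sketch only, not the application's mandated implementation), the m×1 part-of-speech ids can first be mapped to m×n through a trainable embedding so that they can be added element-wise to the m×n text features:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, num_pos_tags: int, hidden_dim: int):
        super().__init__()
        # Maps each POS tag id to an n-dimensional vector, so (m, 1) -> (m, n).
        self.pos_embedding = nn.Embedding(num_pos_tags, hidden_dim)

    def forward(self, text_features, pos_ids):
        alpha = text_features                            # (m, n) text features
        beta = self.pos_embedding(pos_ids.squeeze(-1))   # (m, n) POS features
        return alpha + beta                              # text fusion feature alpha + beta
```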
In a possible implementation manner, an embodiment of the present application provides a specific implementation manner for fusing text features and part-of-speech features of a text to be processed to obtain text fusion features of the text to be processed, which is specifically referred to below.
S203: and determining the sequence annotation model of the first level as the sequence annotation model of the current level.
The sequence labeling model is used for labeling the information items to be extracted corresponding to the sequence labeling model based on the input text fusion characteristics, and generating a labeling result of the text to be processed corresponding to the sequence labeling model. The sequence labeling model may specifically comprise a CRF (conditional random field) layer. Based on the input text fusion characteristics, each character in the text to be processed is labeled by utilizing a CRF layer to obtain a corresponding label, and then label information corresponding to the text to be processed is analyzed by combining a label rule, so that a labeling result can be obtained. Specifically, the tag rule may be a BIO rule.
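For illustration, a minimal sketch of one level's labeling head under these assumptions: a linear emission layer over the fused features followed by a CRF layer from the third-party pytorch-crf package, decoding a hypothetical BIO tag set. The tag names and network structure are assumptions, not part of the application.

```python
import torch
import torch.nn as nn
from torchcrf import CRF

BIO_TAGS = ["O", "B-ITEM", "I-ITEM"]   # hypothetical tag set for one information item

class LevelLabeler(nn.Module):
    def __init__(self, hidden_dim: int, num_tags: int = len(BIO_TAGS)):
        super().__init__()
        self.emission = nn.Linear(hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, fusion_features, tags=None, mask=None):
        emissions = self.emission(fusion_features)            # (batch, m, num_tags)
        if tags is not None:                                  # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)          # inference: per-token tag ids
```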
It should be noted that, in the embodiments of the present application, the sequence labeling model has multiple levels, and the information items to be extracted corresponding to the sequence labeling models of different levels are different. The levels of the sequence labeling model and the corresponding information items to be extracted can be set according to the requirements of text information extraction.
In a possible implementation manner, the number of levels of the sequence labeling models and the information item to be extracted corresponding to the sequence labeling model of each level are predetermined according to the levels of the information items to be extracted. The levels of the information items to be extracted may refer to information items describing different levels and categories, where information items to be extracted at the same level generally appear in the text in a parallel relationship, and information items to be extracted at different levels are likely to appear in an inclusion relationship. For example, given three information items to be extracted, namely disease diagnosis, lesion site, and lesion size, the disease diagnosis is at a different level from the lesion site and the lesion size, while the lesion site and the lesion size are at the same level. Specifically, in a medical text, the information items to be extracted may be classified into different levels according to their scopes, for example, into four levels: a disease, a treatment method corresponding to the disease, a treatment apparatus or drug, and a specific apparatus model or drug type. For another example, the levels of the information items to be extracted may be different types divided according to different meanings of the information items; for example, for a text with multiple meanings, different meanings may be set as different levels of the information items to be extracted.
The levels of the information items to be extracted can be set according to the requirements of text information extraction. For example, if information items to be extracted at 5 levels need to be extracted from a text to be processed, the number of levels of the corresponding sequence labeling models is set to 5, and the information items to be extracted corresponding to the sequence labeling models of the 5 levels are the corresponding information items to be extracted at the 5 levels respectively.
Based on the sequence labeling models with multiple levels, information items to be extracted at multiple levels can be labeled in the text to be processed, realizing multi-level text information extraction.
The labeling process of the sequence labeling model of each level is a serial processing mode, and the characteristics are required to be sequentially input into the sequence labeling model of each level for labeling. And determining the sequence annotation model of the first level as the sequence annotation model of the current level.
S204: and inputting the text fusion characteristics of the text to be processed into the sequence marking model of the current level, marking the information item to be extracted corresponding to the sequence marking model of the current level, and obtaining the marking result of the text to be processed output by the sequence marking model of the current level.
The text fusion features of the text to be processed, obtained by fusing the text features and the part-of-speech features of the text to be processed, are input into the sequence labeling model of the current level, and the sequence labeling model of the current level labels the information item to be extracted corresponding to it. If the text to be processed contains the information item to be extracted corresponding to the sequence labeling model of the current level, the sequence labeling model of the current level labels the corresponding information item, thereby obtaining the labeling result of the text to be processed output by the sequence labeling model of the current level.
S205: and judging whether a sequence labeling model of the next level exists.
After the labeling by the sequence labeling model of the current level is finished and the corresponding labeling result is obtained, whether a sequence labeling model of a next level exists is judged. Since the embodiments of the present application provide sequence labeling models of multiple levels, if the sequence labeling model of the current level is the sequence labeling model of the first level, a sequence labeling model of the next level necessarily exists; if the sequence labeling model of the current level is at the second level or a later level, a sequence labeling model of the next level may not exist. If a sequence labeling model of the next level exists, S206 and the subsequent steps are executed; if not, S208 and the subsequent steps are executed.
S206: and if the sequence marking model of the next level exists, fusing the marking result of the text to be processed output by the sequence marking model of the current level with the text fusion feature of the text to be processed to obtain the text fusion feature of the text to be processed again.
When the sequence marking model of the next level exists, the sequence marking model of the next level needs to be used for marking the text to be processed.
In order to improve the accuracy of the sequence labeling model, considering that the information items to be extracted corresponding to the sequence labeling models of different levels have correlation, the labeling result of the text to be processed output by the sequence labeling model of the current level is fused with the text fusion feature of the text to be processed, so as to obtain the text fusion feature of the text to be processed after re-fusion.
Specifically, let x_n denote the text fusion feature of the re-fused text to be processed. In one implementation, x_n = α + β + γ_n, where n denotes the level number of the current sequence labeling model, γ_n denotes the labeling result of the sequence labeling model of the n-th level, and x_n denotes the text fusion feature of the text to be processed obtained again after fusing the labeling result of the n-th-level sequence labeling model with the text fusion feature of the text to be processed. In another implementation, x_n = concat(α + β, γ_n), where concat(α + β, γ_n) denotes splicing α + β with γ_n; for example, if α and β are m×n-dimensional arrays and γ_n is an m×1-dimensional array, then α + β is an m×n-dimensional array and x_n is an m×(n+1)-dimensional array. In a possible implementation manner, an embodiment of the present application provides a specific implementation manner of fusing the labeling result of the text to be processed output by the sequence labeling model of the current level with the text fusion feature of the text to be processed to obtain the text fusion feature of the text to be processed again; please refer to the description below.
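A minimal sketch of the two re-fusion variants just described, assuming the m×1 labeling result is either embedded to m×n and added (x_n = α + β + γ_n) or concatenated as an extra column (x_n = concat(α + β, γ_n)); the helper names are illustrative.

```python
import torch
import torch.nn as nn

def refuse_additive(fusion_features, label_ids, label_embedding: nn.Embedding):
    gamma = label_embedding(label_ids.squeeze(-1))    # (m, 1) -> (m, n)
    return fusion_features + gamma                    # x_n = alpha + beta + gamma_n

def refuse_concat(fusion_features, label_ids):
    # Keep the raw tag ids as one extra feature column: (m, n) ++ (m, 1) -> (m, n+1)
    return torch.cat([fusion_features, label_ids.float()], dim=-1)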
S207: and determining the sequence annotation model of the next level as the sequence annotation model of the current level, and re-executing the steps of inputting the text fusion characteristics of the text to be processed into the sequence annotation model of the current level and the subsequent steps.
And replacing the level of the sequence annotation model corresponding to the sequence annotation model of the current level, and determining the sequence annotation model of the next level as the sequence annotation model of the current level. And after the sequence labeling model of the current level is determined again, re-executing S204 and the subsequent steps to realize labeling of the text to be processed by using the sequence labeling model of the current level and corresponding updating of the text fusion characteristics of the text to be processed.
S208: and if the sequence labeling model of the next level does not exist, obtaining the labeling result of the text to be processed output by the sequence labeling model of each level.
And if the sequence annotation model of the current level is the sequence annotation model of the last level, the sequence annotation model of the next level does not exist, and the annotation of the sequence annotation model is finished. And acquiring the labeling result of the text to be processed output by the sequence labeling model of each layer, and extracting text information by using the labeling result of the text to be processed output by the sequence labeling model of each layer.
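Tying S203 to S208 together, the following sketch assumes `labelers` is an ordered list of per-level labeling heads (see the earlier LevelLabeler sketch), re-fusion uses the additive variant, and one label embedding exists per non-final level; all names are illustrative.

```python
import torch

def run_all_levels(fusion_features, labelers, label_embeddings):
    # fusion_features: (m, n) tensor; labelers: one LevelLabeler per level;
    # label_embeddings: one nn.Embedding per non-final level, used for re-fusion.
    results, features = [], fusion_features
    for level, labeler in enumerate(labelers):
        tag_ids = labeler(features.unsqueeze(0))[0]        # label the current level
        results.append(tag_ids)
        if level + 1 < len(labelers):                      # a next level exists
            ids = torch.tensor(tag_ids).unsqueeze(-1)      # (m, 1) labeling result vector
            features = refuse_additive(features, ids, label_embeddings[level])
    return results                                         # one labeling result per level
```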
S209: analyzing the labeling result of the text to be processed output by the sequence labeling model of each layer to obtain the information extraction content of the information items to be extracted of different layers included in the text to be processed.
And the labeling result of the text to be processed output by the sequence labeling model of each layer comprises the related content of the information to be extracted from the text to be processed. Analyzing the labeling result of the text to be processed output by the sequence labeling model of each layer, and further obtaining the information extraction content corresponding to the information items to be extracted of different layers in the text to be processed, wherein the obtained information extraction content is the text information in the text to be processed, thereby realizing the structuralization of the text to be processed.
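As an illustration of how a labeling result can be analyzed, the sketch below decodes BIO tags into character spans and reads the spans back out of the text as the information extraction content; the Chinese tag labels in the example are hypothetical.

```python
def decode_bio(text: str, tags: list[str]) -> list[str]:
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:
                spans.append(text[start:i])
            start = i
        elif tag.startswith("I-") and start is not None:
            continue
        else:                          # "O" or padding closes any open span
            if start is not None:
                spans.append(text[start:i])
                start = None
    if start is not None:
        spans.append(text[start:len(tags)])
    return spans

# e.g. decode_bio("左肺上叶见结节",
#                 ["B-部位", "I-部位", "I-部位", "I-部位", "O", "B-病灶", "I-病灶"])
# -> ["左肺上叶", "结节"]
```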
Based on the above contents of S201-S209, by extracting the text features and the part-of-speech features of the text to be processed, features of the text to be processed in two aspects are obtained, so that more comprehensive feature information of the text to be processed can be acquired. The text fusion features obtained by fusing the text features and the part-of-speech features are input into the sequence labeling model of the first level, so that the information item to be extracted corresponding to the current level is labeled. The obtained labeling result is fused with the text fusion feature to obtain an updated text fusion feature. By replacing the sequence labeling model of the current level, labeling by the sequence labeling model of each level can be performed in sequence, and the labeling result of the sequence labeling model of each level is obtained. The labeling result of each level's sequence labeling model is fused with the text fusion feature and used as the input of the sequence labeling model of the next level, so that each sequence labeling model can label based on the labeling result of the sequence labeling model of the previous level, which improves the accuracy of its labeling result. The information extraction contents in the text to be processed are determined according to the labeling results of the sequence labeling models of all levels, so that multi-level information extraction can be realized, both the information extraction contents and the relationships between them are extracted, and the extraction of information extraction contents with multiple meanings can be realized, improving the accuracy of text information extraction for the text to be processed. Therefore, more accurate text information of the text to be processed can be obtained on the basis of automatic text information extraction.
It can be understood that, in order to facilitate more accurate feature extraction, the original text for text information extraction needs to be preprocessed first to obtain a text meeting the requirements of subsequent processing.
Correspondingly, an embodiment of the present application provides a text information extraction method, which is shown in fig. 3, and is a flowchart of another text information extraction method provided in the embodiment of the present application. Before extracting the text features and the part-of-speech features of the text to be processed with the preset length, the method further comprises the following four steps.
S301: and carrying out redundant information filtering and sensitive information desensitization processing on the original text to obtain a first target text.
The original text is the text which is not preprocessed and needs to be subjected to text information extraction. Redundant information may be present in the original text. The redundant information may refer to text having repeated meanings, symbols and words not having specific meanings, and the like. Text with repeated meaning may refer to repeated content occurring in the original text due to writing errors. For example, when filling in against template text, the written text may repeat with the template content, resulting in redundant information in the final generated text. Symbols and words that do not have a specific meaning refer to useless symbols and words that do not have a semantic meaning, such as stop words.
The redundant information interferes with the extraction of the text information, and the redundant information in the original text needs to be filtered. The embodiment of the application does not limit the filtering mode of the redundant information, and in a possible implementation mode, the redundant information can be preset in the preset dictionary, and then the redundant information in the original text can be removed based on the preset dictionary.
The original text also has sensitive information, and the sensitive information refers to the information which is not convenient to disclose and exists in the original text. For example, if the original text is a medical record text, the private information such as the name of the patient and the address of the patient in the medical record text is sensitive information. The sensitive information in the original text is not relevant to the extraction of the text information.
Sensitive information in the original text is desensitized, and a first target text that has undergone redundant information filtering and sensitive information desensitization is obtained.
In addition, partial special symbols in the original text can be replaced. Specifically, common words and symbols can be replaced based on a preset dictionary, so that the replaced text better meets the requirement of feature extraction.
S302: if the length of the first target text is larger than the preset length, the first target text is divided into a plurality of second target texts with the lengths smaller than or equal to the preset length, the lengths of the second target texts are filled to the preset length, and the text to be processed is generated.
After the first target text is obtained, length processing needs to be performed on the first target text to obtain a text to be processed with a preset length.
When the length of the first target text is greater than the preset length, the length of the first target text needs to be reduced. And dividing the first target text into a plurality of second target texts with the lengths less than or equal to the preset length. Specifically, the first target text may be segmented using a specific symbol. For example, the first target text may be segmented using special symbols such as periods, line feed symbols, and the like. It should be noted that, the segmentation of the first target text is implemented on the premise that the content of the first target text is not affected. The specific segmentation mode can be determined according to the original separation mode of the text in the first target text.
And the second target text with the length being the preset length can be directly used as the text to be processed. And for the second target text with the length smaller than the preset length, the length of the second target text is filled to the preset length, and the text to be processed is generated. In one possible implementation manner, the placeholder may be used to perform a preset-length padding on the second target text with a length smaller than the preset length.
S303: and if the length of the first target text is smaller than the preset length, the length of the first target text is filled to the preset length, and the text to be processed is generated.
When the length of the first target text is smaller than the preset length, the length of the first target text needs to be filled. In a possible implementation manner, the place-occupying symbols may be used to complement a first target text with a length smaller than a preset length, so as to generate a text to be processed.
S304: and if the length of the first target text is equal to the preset length, determining the first target text as the text to be processed.
When the length of the first target text is equal to the preset length, it already meets the length requirement of the text to be processed, and the first target text is directly determined as the text to be processed.
In the embodiment of the application, the original text is preprocessed, and the first target text is adjusted to be the text to be processed with the preset length, so that the processed text to be processed meets the requirement of subsequent text information extraction, and the subsequent feature extraction and labeling of the text to be processed are facilitated.
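A minimal sketch of S302-S304 under stated assumptions: sentences are split on periods and line feeds, pieces are regrouped so that each second target text stays within the preset length, and a placeholder character is used for padding; the placeholder and splitting symbols are illustrative choices, not mandated by the application.

```python
import re

PRESET_LENGTH = 512
PLACEHOLDER = "□"          # assumed padding placeholder, one per missing character

def to_pending_texts(first_target_text: str) -> list[str]:
    if len(first_target_text) == PRESET_LENGTH:
        return [first_target_text]
    if len(first_target_text) < PRESET_LENGTH:
        return [first_target_text.ljust(PRESET_LENGTH, PLACEHOLDER)]
    # Length greater than preset: split after sentence-ending symbols, then regroup
    # pieces so each second target text stays within the preset length.
    pieces = [p for p in re.split(r"(?<=[。\n])", first_target_text) if p]
    second_targets, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > PRESET_LENGTH:
            second_targets.append(current)
            current = ""
        while len(piece) > PRESET_LENGTH:          # hard-split oversized pieces
            second_targets.append(piece[:PRESET_LENGTH])
            piece = piece[PRESET_LENGTH:]
        current += piece
    if current:
        second_targets.append(current)
    return [t.ljust(PRESET_LENGTH, PLACEHOLDER) for t in second_targets]
```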
In one possible scenario, the text to be processed may contain non-standard terms. As a result, the information extraction content obtained from the text to be processed may be non-standard text, which is inconvenient for further processing and use of the information extraction content.
In order to solve the above problem, in one possible implementation manner, an embodiment of the present application provides a text information extraction method. Referring to fig. 4, this figure is a flowchart of another text information extraction method provided in this embodiment of the present application. After obtaining the information extraction contents of the information items to be extracted in different levels included in the text to be processed, the method further comprises the following three steps:
S401: acquiring text features of target information extraction content and text features of a target term text, wherein the target information extraction content is any item in the information extraction content, and the target term text is any item in predetermined term texts.
Term texts may be predetermined for normalizing the information extraction content. A term text is a standard text used for making substitutions. The specific type of the term texts may be determined according to the type of the information extraction content. For example, if the information extraction content is medical text, the corresponding term texts may be standard medical texts, for example ICD-10 (International Classification of Diseases, 10th revision) entries and common disease terms.
One item is arbitrarily selected from the information extraction contents as the target information extraction content, and one term text is arbitrarily selected from the predetermined term texts as the target term text. The text features of the target information extraction content and the text features of the target term text are then extracted. These text features may characterize semantic and grammatical aspects of the respective texts.
This embodiment of the application does not limit the specific way in which the text features of the target information extraction content and of the target term text are extracted. In one possible implementation, after the target information extraction content and the target term text are determined, the ERNIE model may be used to extract the text features of both. In another possible implementation, the ERNIE model is used to extract the text features of the target information extraction content, while the text features of the term texts are extracted in advance with the ERNIE model and stored in a database in correspondence with each term text, so that the text features of the target term text can be acquired directly once it is determined.
S402: and matching the text features of the target information extraction content with the text features of the target term text.
The extracted text features of the target information extraction content and the text features of the target term text can reflect the difference between the two. The text features of the target information extraction content are therefore matched against the text features of the target term text.
In a possible implementation manner, refer to fig. 5, which is a schematic diagram illustrating matching of text features of target term texts with text features of target information extraction contents provided by an embodiment of the present application.
The text feature of the target information extraction content and the text feature of the target term text may be processed by using a PCA (Principal Component Analysis) technique.
The PCA technique is a statistical method: a group of possibly correlated variables is converted into a group of linearly uncorrelated variables, called principal components, through an orthogonal transformation. PCA can convert high-dimensional features into low-dimensional features and make the reduced features linearly uncorrelated. The PCA technique is used to reduce the dimensionality of the text features of the target information extraction content and of the target term text respectively, obtaining low-dimensional text features for both. Then, a softmax-based semantic matching algorithm is used to perform binary classification on the low-dimensional text features of the target information extraction content against the low-dimensional text features of each target term text, obtaining a classification result. The similarity between the target information extraction content and the target term text may be determined based on the classification result.
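A minimal sketch of this matching step is given below, assuming 768-dimensional ERNIE-style sentence features are already available; the dot-product-plus-softmax scoring is an illustrative stand-in for the softmax-based semantic matching algorithm described above, and the threshold value is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

def match_term(content_feat: np.ndarray,
               term_feats: np.ndarray,
               term_texts: list[str],
               n_components: int = 64,
               threshold: float = 0.5):
    """Match one extraction-content feature against the candidate term-text features."""
    # Reduce the term features and the content feature to the same low dimension.
    pca = PCA(n_components=min(n_components, len(term_feats), term_feats.shape[1]))
    low_terms = pca.fit_transform(term_feats)            # (k, d') low-dimensional term features
    low_content = pca.transform(content_feat[None, :])   # (1, d') low-dimensional content feature

    # Softmax over dot-product similarities stands in for the matching classifier.
    logits = low_terms @ low_content.ravel()              # (k,) one score per term text
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    best = int(np.argmax(probs))
    # Replace only when the best candidate is confident enough; otherwise keep the original.
    return term_texts[best] if probs[best] >= threshold else None
```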
By using the PCA technique and the softmax-based semantic matching algorithm to reduce and match the text features of the target information extraction content and of the target term text, their similarity can be evaluated at the semantic level, which effectively improves both the flexibility and the accuracy of feature matching.
Specifically, in order to narrow the matching range, the term texts may be layered in advance based on the hierarchy of the information items to be extracted, so that binary classification is only performed between the low-dimensional text features of the target information extraction content and the low-dimensional text features of the target term texts of the same hierarchy, further improving classification efficiency and accuracy.
S403: and if the text features of the target information extraction content are matched with the text features of the target term text, replacing the target information extraction content with the target term text.
If the text features of the target information extraction content match the text features of the target term text, the target information extraction content is similar to the target term text and needs to be replaced; it is therefore replaced with the target term text.
Based on the above description of S401 to S403, extracting and matching the text features of the target information extraction content and of the target term texts makes it possible to determine whether a target term text exists that can replace the target information extraction content. Replacing the target information extraction content with the matched target term text normalizes the information extraction content, so that the normalized content can be used directly in subsequent information processing.
In a possible implementation manner, an embodiment of the present application further provides a text information extraction method, as shown in fig. 6, which is a flowchart of another text information extraction method provided in the embodiment of the present application, and in addition to the foregoing S201 to S209, the method further includes S601 to S610:
S601: initializing the sequence labeling model of each level.
Based on the requirement of text information extraction, the sequence labeling model is initialized. Specifically, the sequence labeling model of each level may be initialized correspondingly according to the predetermined information items to be extracted of each level.
S602: and determining the sequence annotation model of the first level as the sequence annotation model of the current level.
When the sequence annotation model is trained, a serial processing mode is adopted, and the sequence annotation model of the first level is determined as the sequence annotation model of the current level.
S603: inputting the text fusion features of the training text into the sequence labeling model of the current level, labeling the information item to be extracted corresponding to the sequence labeling model of the current level, and obtaining the labeling result of the training text output by the sequence labeling model of the current level.
The training text is used for training the sequence labeling model, and the training text comprises standard labeling results of the information items to be extracted corresponding to the sequence labeling models of all levels. By using the text fusion characteristics of the training text, the sequence labeling models of all levels can be trained.
The text fusion features of the training text are input into the sequence labeling model of the current level, which labels the information item to be extracted corresponding to the current level on the training text, producing the labeling result of the training text output by the sequence labeling model of the current level.
S604: obtaining the loss value of the sequence labeling model of the current level according to the standard labeling result of the information item to be extracted corresponding to the sequence labeling model of the current level in the training text and the labeling result of the training text output by the sequence labeling model of the current level.
The output labeling result of the training text is compared with the standard labeling result, in the training text, of the information item to be extracted corresponding to the current-level sequence labeling model, so as to determine the accuracy of the model's labeling. A loss value of the current-level sequence labeling model for this training round is computed from the standard labeling result and the labeling result output by the model. The current-level sequence labeling model is then adjusted based on the obtained loss value, thereby realizing its training.
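As a rough illustration of one possible per-level loss, assuming PyTorch and per-character label indices, a token-level cross-entropy over the non-padding positions could be used; this is a sketch under those assumptions, not the loss prescribed by the method.

```python
import torch
import torch.nn.functional as F

def level_loss(logits: torch.Tensor,     # (m, num_tags) scores from the current-level model
               gold_tags: torch.Tensor,  # (m,) standard labeling result for this level
               pad_mask: torch.Tensor    # (m,) True at real characters, False at padding
               ) -> torch.Tensor:
    """Token-level cross-entropy between the model's labeling and the standard labeling."""
    losses = F.cross_entropy(logits, gold_tags, reduction="none")   # (m,) per-character loss
    return (losses * pad_mask.float()).sum() / pad_mask.float().sum()
```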
S605: and judging whether a sequence labeling model of the next level exists.
After the labeling of the current-level sequence labeling model is finished and the corresponding labeling result and loss value are obtained, it is judged whether a sequence labeling model of the next level exists. Since the sequence labeling model in this embodiment of the application has multiple levels, a next-level model necessarily exists when the current level is the first level; when the current level is the second level or a later level, a next-level model may or may not exist. If a sequence labeling model of the next level exists, S606 and the subsequent steps are executed; if not, S608 and the subsequent steps are executed.
S606: and if the sequence labeling model of the next level exists, fusing the labeling result of the training text output by the sequence labeling model of the current level with the text fusion characteristic of the training text to obtain the text fusion characteristic of the training text again.
And when the sequence labeling model of the next level exists, fusing the obtained labeling result of the training text output by the sequence labeling model of the current level with the training text fusion characteristic to obtain the updated text fusion characteristic of the training text.
Specifically, for example, let y_n denote the re-fused text fusion feature of the training text, then y_n = μ + ω_n, where n denotes the level number corresponding to the current-level sequence labeling model, μ denotes the initial text fusion feature of the training text, ω_n denotes the labeling result of the sequence labeling model of the n-th level, and y_n denotes the text fusion feature of the training text obtained by fusing the labeling result of the n-th level sequence labeling model with the text fusion feature of the training text.
S607: determining the sequence labeling model of the next level as the sequence labeling model of the current level, and re-executing the step of inputting the text fusion features of the training text into the sequence labeling model of the current level and the subsequent steps.
The current-level sequence labeling model is switched: the sequence labeling model of the next level is determined as the sequence labeling model of the current level. After the current-level model is re-determined, S603 and the subsequent steps are re-executed, so that the training text is labeled with the new current-level model and the text fusion features of the training text are updated accordingly.
S608: and if the sequence annotation model of the next level does not exist, obtaining the loss value of the sequence annotation model of each level.
If the current-level sequence labeling model is the last-level model, no sequence labeling model of the next level exists, the labeling is finished, and this labeling round of the training ends. The loss values of the sequence labeling models of all levels are then obtained.
S609: and weighting and adding the loss values of the sequence annotation models of all levels to obtain a comprehensive loss value, and adjusting the sequence annotation models of all levels according to the comprehensive loss value.
In this embodiment of the application, considering the correlation among the sequence labeling models of the different levels, the sequence labeling models of all levels can be trained jointly.
The obtained loss values of the sequence labeling models of all levels are weighted and summed to obtain a comprehensive loss value: the loss value of each level's sequence labeling model is multiplied by its corresponding weight parameter, and the products are added.
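For example, with per-level weight parameters fixed in advance (the values below are illustrative, not prescribed by the method), the comprehensive loss is simply a weighted sum:

```python
def combined_loss(loss_per_level, weights):
    """Comprehensive loss: weight each level's loss and add the products."""
    return sum(w * l for w, l in zip(weights, loss_per_level))

# Illustrative weights for a three-level model; real values would be tuned for the task.
total = combined_loss([0.82, 0.41, 0.37], [0.5, 0.3, 0.2])
```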
Based on the obtained comprehensive loss value, the sequence labeling model of each level can be adjusted, completing this round of training of the sequence labeling models.
S610: re-executing the step of determining the sequence labeling model of the first level as the sequence labeling model of the current level and the subsequent steps until a preset stop condition is reached, to obtain the sequence labeling model of each level generated by training.
In order to ensure the accuracy of the sequence labeling models, the sequence labeling model of each level needs to be trained over many rounds. S602 and the subsequent steps are re-executed until a preset stop condition is reached, at which point the training of the sequence labeling models of all levels stops and the trained model of each level is obtained. The preset stop condition may be, for example, that the loss value meets a preset condition or that the number of training rounds reaches a preset number, and can be set according to the training requirements of the sequence labeling models.
Based on the description of S601 to S610, jointly training the sequence labeling models of all levels with the training text yields sequence labeling models whose labeling results are more accurate. Moreover, because the text fusion features of the training text input into the sequence labeling model of each level include the labeling results of the sequence labeling models of other levels, the trained models of the different levels are more strongly correlated, which improves the accuracy of the labeling results of the sequence labeling model of every level.
In a possible implementation manner, the ERNIE model may be used to extract text features of the text to be processed, and the part-of-speech recognition model may be used to extract part-of-speech features of the text to be processed.
Correspondingly, an embodiment of the present application provides a specific implementation manner for extracting text features and part-of-speech features of a to-be-processed text with a preset length, which is shown in fig. 7, and the figure is a flowchart for extracting text features and part-of-speech features of a to-be-processed text with a preset length provided in the embodiment of the present application, and includes S701 to S702:
S701: inputting the text to be processed into the ERNIE model to obtain the text features of the text to be processed; the text features of the text to be processed represent the grammar and the semantics of the text to be processed and the positions of all characters in the text to be processed; the text features of the text to be processed are an m × n-dimensional text feature vector, where m is the preset length and n is a positive integer.
The ERNIE model is a deep feature extractor based on the self-attention mechanism. It is pre-trained on a large amount of unlabeled data, so it can capture the positions, grammar, semantics, and other characteristics of characters in the general domain. In this embodiment of the application, before the ERNIE model is used to extract text features of the text to be processed, it may be further trained on labeled text from the same domain as the text to be processed, so that the ERNIE model acquires the ability to extract features of text in that specific domain.
The text to be processed is input into the ERNIE model to obtain the corresponding text features. The text features of the text to be processed can represent the grammar and semantics of the text to be processed and the positions of the characters in the text to be processed. The text features of the text to be processed may be denoted as α_{m×n}, an m × n-dimensional text feature vector, where m is the preset length and n is a positive integer.
Specifically, the maximum text length the ERNIE model can process is 512, so m may be 512. The text features of the text to be processed may then be denoted as α_{512×768} = [α_1, α_2, …, α_i, …, α_512], where α_i denotes the text feature corresponding to the i-th character in the text to be processed, i is a positive integer less than or equal to 512, and 768 is the preset dimension of the extracted text features, representing 768 different feature dimensions, which is determined by the parameters of the ERNIE model.
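A minimal sketch of this feature extraction, assuming PyTorch and the Hugging Face transformers library; the checkpoint name below is an assumption standing in for whichever ERNIE weights are actually used (its hidden size happens to be 768).

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint name; any ERNIE-style encoder with hidden size n can be substituted.
MODEL_NAME = "nghuyong/ernie-1.0-base-zh"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def extract_text_features(text: str, preset_length: int = 512) -> torch.Tensor:
    """Return an (m, n) matrix: one n-dimensional feature vector per character position."""
    inputs = tokenizer(
        list(text),                 # feed the text character by character
        is_split_into_words=True,
        padding="max_length",
        truncation=True,
        max_length=preset_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, m, n); n is 768 for this checkpoint
    return hidden.squeeze(0)                          # (m, n) text feature matrix α
```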
S702: inputting the text to be processed into the part-of-speech recognition model to obtain the part-of-speech features of the text to be processed, wherein the part-of-speech features of the text to be processed are an m × 1-dimensional part-of-speech feature vector.
The part-of-speech recognition model may be an open-source tool with a part-of-speech recognition function. Specifically, it may be a model such as LTP (Language Technology Platform) or HanLP (Han Language Processing).
Based on the part-of-speech recognition model, part-of-speech features of the text to be processed may be determined. Specifically, the part-of-speech feature may be m × 1 dimensions. It should be noted that the part-of-speech characteristics of each character are related to the vocabulary in which the character is located, and the part-of-speech characteristics of the characters belonging to the same vocabulary are consistent.
In one possible implementation, the part-of-speech recognition result of each character in the text to be processed may be determined by the part-of-speech recognition model, and the part-of-speech features of the text to be processed are then determined based on these per-character results. For example, the part-of-speech recognition result of each character may be converted into the part-of-speech feature corresponding to that character through a predefined part-of-speech encoding dictionary, thereby obtaining the part-of-speech features of the text to be processed.
The part-of-speech dictionary may be represented by Table 1:

Part of speech       Encoding    Example
Noun                 1           "left arm"
Verb                 2           "fracture"
Preposition          3           "because"
Punctuation mark     4           "。"
……                   ……          ……

TABLE 1
The code corresponding to each part of speech may be an integer greater than 0.
The corresponding part-of-speech features may be denoted as β_{512×1} = [β_1, β_2, …, β_i, …, β_512], where β_i denotes the part-of-speech feature corresponding to the i-th character in the text to be processed, i is a positive integer less than or equal to 512, and 1 is the dimension of the extracted part-of-speech feature.
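The following sketch shows one way to build such a character-level part-of-speech vector from word-level tagger output; the tag names, the code values (taken from Table 1), and the use of 0 for padding and unknown tags are all illustrative assumptions.

```python
# Hypothetical part-of-speech codes following Table 1; the tag names are assumptions.
POS_CODES = {"noun": 1, "verb": 2, "preposition": 3, "punctuation": 4}

def pos_features(tagged_words, preset_length=512):
    """Expand word-level POS tags into a character-level m x 1 part-of-speech vector.

    tagged_words: (word, tag) pairs from a tagger such as LTP or HanLP;
    every character of a word receives the code of that word's tag.
    """
    feats = []
    for word, tag in tagged_words:
        code = POS_CODES.get(tag, 0)        # 0 for tags outside the dictionary
        feats.extend([code] * len(word))    # characters of one word share one code
    feats = feats[:preset_length]
    feats += [0] * (preset_length - len(feats))   # pad with placeholder code 0
    return feats

# Example: pos_features([("左臂", "noun"), ("骨折", "verb"), ("。", "punctuation")])
# -> [1, 1, 2, 2, 4, 0, 0, ...]  (length 512)
```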
In the embodiment of the application, the ERNIE model and the part-of-speech recognition model are used for extracting the text features and the part-of-speech features of the text to be processed respectively, so that more accurate text fusion features can be obtained, and more accurate text information can be obtained subsequently.
Further, an embodiment of the present application provides a specific implementation manner for fusing text features and part-of-speech features of a text to be processed to obtain text fusion features of the text to be processed, and the specific implementation manner specifically includes:
mapping the part-of-speech feature vector with m x 1 dimensions into a part-of-speech feature vector with m x n dimensions;
and fusing the part-of-speech feature vectors of the m-n dimensions with the text feature vectors of the m-n dimensions to obtain text fusion features of the text to be processed, wherein the text fusion features of the text to be processed are the text fusion feature vectors of the m-n dimensions.
Because the dimensionality of the part-of-speech feature vector is different from the dimensionality of the text feature vector, the dimensionality of the part-of-speech feature vector and the dimensionality of the text feature vector need to be unified before the part-of-speech feature vector and the text feature vector are fused.
The part-of-speech feature vector of dimension m × 1 is mapped into a part-of-speech feature vector of dimension m × n. For example, β_{512×1} = [β_1, β_2, …, β_i, …, β_512] is mapped to β_{512×768} = [β_1, β_2, …, β_i, …, β_512], in which each β_i has become a 768-dimensional vector.
In one possible implementation, the mapping of the feature vectors may be performed by a fully connected layer. The activation function of the fully connected layer may be a ReLU function.
After the dimensions are unified, the m × n-dimensional part-of-speech feature vector is fused with the m × n-dimensional text feature vector to obtain the m × n-dimensional text fusion feature vector of the text to be processed.
Taking the above part-of-speech feature vector and text feature vector as examples, the fused text fusion feature vector can be represented as α_{512×768} + β_{512×768} = [α_1 + β_1, …, α_i + β_i, …, α_512 + β_512].
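A minimal sketch of this projection-and-addition fusion, assuming PyTorch; the 512 × 768 shapes follow the example above, and the fully connected layer with ReLU is the mapping described two paragraphs earlier.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Lift the m x 1 part-of-speech features to m x n and add them to the text features."""

    def __init__(self, n: int = 768):
        super().__init__()
        # Fully connected layer with ReLU, as described above: 1 dimension -> n dimensions.
        self.project = nn.Sequential(nn.Linear(1, n), nn.ReLU())

    def forward(self, text_feats: torch.Tensor, pos_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (m, n) from the ERNIE model; pos_feats: (m,) integer POS codes.
        pos_lifted = self.project(pos_feats.float().unsqueeze(-1))  # (m, n)
        return text_feats + pos_lifted                              # element-wise fusion

# Shape check with m = 512, n = 768 (random placeholder values).
fusion = FeatureFusion()
fused = fusion(torch.randn(512, 768), torch.randint(0, 5, (512,)))
print(fused.shape)   # torch.Size([512, 768])
```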
Further, the labeling result of the text to be processed output by the sequence labeling model at the current level may be an m × 1-dimensional labeling result vector.
For such situations, the embodiment of the present application provides a specific implementation manner for fusing a labeling result of a to-be-processed text output by a sequence labeling model of a current hierarchy with a text fusion feature of the to-be-processed text to obtain a text fusion feature of the to-be-processed text again, and the specific implementation manner specifically includes:
mapping the m-1 dimensional labeling result vector into an m-n dimensional labeling result vector;
and fusing the m-n dimensional labeling result vector and the m-n dimensional text fusion feature vector to obtain the text fusion feature of the text to be processed again.
Similarly, the dimension of the labeling result vector differs from that of the text fusion feature vector, so the dimensions are unified first: the m × 1-dimensional labeling result vector is mapped into an m × n-dimensional labeling result vector.
In one possible implementation, the mapping of the feature vectors may be performed by a fully connected layer. The activation function of the fully connected layer may be a ReLU function.
Specifically, let γ_n denote the labeling result of the sequence labeling model of the n-th level, where n denotes the level number corresponding to the current-level sequence labeling model, and (γ_n)_{512×1} = [γ_n,1, γ_n,2, …, γ_n,i, …, γ_n,512], where γ_n,i denotes the labeling result corresponding to the i-th character in the text to be processed labeled by the current-level sequence labeling model, and i is a positive integer less than or equal to 512. (γ_n)_{512×1} is then mapped to (γ_n)_{512×768}.
After the dimensions are unified, the m × n-dimensional labeling result vector is fused with the m × n-dimensional text fusion feature vector, and the m × n-dimensional text fusion feature vector of the text to be processed is obtained again.
Taking the above labeling result vector and text fusion feature vector as examples, the re-fused text fusion feature vector (x_n)_{512×768} can be represented as:
(x_n)_{512×768} = [α_1 + β_1 + γ_n,1, …, α_i + β_i + γ_n,i, …, α_512 + β_512 + γ_n,512]
where γ_n denotes the labeling result vector of the sequence labeling model of the n-th level, n denotes the level number corresponding to the current-level sequence labeling model, i denotes the i-th character in the text to be processed, and i is a positive integer less than or equal to 512.
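A small sketch of this re-fusion step, assuming PyTorch; it mirrors the part-of-speech fusion sketched above, with the m × 1 labeling result vector lifted to m × n before the element-wise addition.

```python
import torch
import torch.nn as nn

def refuse_features(fused_feats: torch.Tensor,    # (m, n) current text fusion features
                    label_result: torch.Tensor,   # (m,) per-character labeling result codes
                    project: nn.Module) -> torch.Tensor:
    """Fuse the current level's labeling result back into the text fusion features."""
    lifted = project(label_result.float().unsqueeze(-1))  # (m, 1) -> (m, n)
    return fused_feats + lifted                           # re-fused features x_n

# project would typically be a fully connected layer with ReLU, for example:
project = nn.Sequential(nn.Linear(1, 768), nn.ReLU())
x_n = refuse_features(torch.randn(512, 768), torch.randint(0, 9, (512,)), project)
```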
In the embodiment of the application, the dimension unification and fusion are performed on the labeling result vector and the text fusion feature vector, so that the updating of the text fusion feature vector can be realized, the subsequent labeling of the sequence labeling model by using the updated text fusion feature vector is facilitated, and a more accurate labeling result is obtained.
Based on the text information extraction method provided by the above method embodiment, the embodiment of the present application further provides a text information extraction device, which will be described below with reference to the accompanying drawings.
Referring to fig. 8, the drawing is a schematic structural diagram of a text information extraction apparatus according to an embodiment of the present application. As shown in fig. 8, the text information extraction device includes:
an extracting unit 801, configured to extract text features and part-of-speech features of a to-be-processed text with a preset length;
a first fusion unit 802, configured to fuse text features and part-of-speech features of the text to be processed to obtain text fusion features of the text to be processed;
a first determining unit 803, configured to determine the sequence annotation model of the first level as the sequence annotation model of the current level;
a first labeling unit 804, configured to input the text fusion feature of the text to be processed into the sequence labeling model of the current level, label the information item to be extracted corresponding to the sequence labeling model of the current level, and obtain a labeling result of the text to be processed output by the sequence labeling model of the current level;
a first judging unit 805, configured to judge whether a sequence tagging model of a next hierarchy exists;
a second fusing unit 806, configured to fuse, if a sequence tagging model of a next level exists, a tagging result of the to-be-processed text output by the sequence tagging model of the current level with a text fusion feature of the to-be-processed text, and obtain a text fusion feature of the to-be-processed text again;
a second determining unit 807, configured to determine the sequence annotation model of the next hierarchy as the sequence annotation model of the current hierarchy, and re-execute the step of inputting the text fusion feature of the text to be processed into the sequence annotation model of the current hierarchy and the subsequent steps;
a first obtaining unit 808, configured to obtain, if a sequence annotation model of a next hierarchy does not exist, an annotation result of the to-be-processed text output by the sequence annotation model of each hierarchy;
the analyzing unit 809 is configured to analyze the labeling result of the to-be-processed text output by the sequence labeling model of each layer, so as to obtain information extraction contents of to-be-extracted information items of different layers included in the to-be-processed text.
In one possible implementation, the apparatus further includes:
the processing unit is used for filtering redundant information and desensitizing sensitive information of the original text to obtain a first target text;
the segmentation unit is used for segmenting the first target text into a plurality of second target texts with the lengths smaller than or equal to the preset length if the length of the first target text is larger than the preset length, and supplementing the lengths of the second target texts to the preset length to generate a text to be processed;
the completion unit is used for completing the length of the first target text to a preset length to generate a text to be processed if the length of the first target text is smaller than the preset length;
and the third determining unit is used for determining the first target text as the text to be processed if the length of the first target text is equal to the preset length.
In one possible implementation, the apparatus further includes:
a second obtaining unit, configured to obtain a text feature of a target information extraction content and a text feature of a target term text, where the target information extraction content is any one of the information extraction contents, and the target term text is any one of predetermined term texts;
the matching unit is used for matching the text features of the target information extraction content with the text features of the target term text;
and the replacing unit is used for replacing the target information extraction content with the target term text if the text characteristics of the target information extraction content are matched with the text characteristics of the target term text.
In one possible implementation, the apparatus further includes:
the initialization unit is used for initializing the sequence marking models of all layers;
a fourth determining unit, configured to determine the sequence annotation model of the first level as the sequence annotation model of the current level;
the second labeling unit is used for inputting the text fusion characteristics of the training text into the sequence labeling model of the current level, labeling the information item to be extracted corresponding to the sequence labeling model of the current level, and obtaining the labeling result of the training text output by the sequence labeling model of the current level;
the first execution unit is used for obtaining a loss value of the sequence labeling model of the current level according to a standard labeling result of an information item to be extracted corresponding to the sequence labeling model of the current level in the training text and a labeling result of the training text output by the sequence labeling model of the current level;
the second judging unit is used for judging whether a sequence marking model of the next level exists or not;
a third fusion unit, configured to fuse, if a sequence labeling model of a next level exists, a labeling result of the training text output by the sequence labeling model of the current level with a text fusion feature of the training text, and obtain a text fusion feature of the training text again;
a second execution unit, configured to determine the sequence annotation model of the next level as the sequence annotation model of the current level, and re-execute the step of inputting the text fusion feature of the training text into the sequence annotation model of the current level and the subsequent steps;
a third obtaining unit, configured to obtain a loss value of the sequence annotation model of each level if the sequence annotation model of the next level does not exist;
the adjusting unit is used for weighting and adding the loss values of the sequence labeling models of all the layers to obtain a comprehensive loss value, and adjusting the sequence labeling models of all the layers according to the comprehensive loss value;
and the third execution unit is used for re-executing the step of determining the sequence annotation model of the first level as the sequence annotation model of the current level and the subsequent steps until a preset stop condition is reached, so as to obtain the sequence annotation models of all levels generated by training.
In a possible implementation manner, the number of layers of the sequence annotation model and the information item to be extracted corresponding to the sequence annotation model of each layer are predetermined according to the layer of the information item to be extracted.
In a possible implementation manner, the extraction unit 801 includes:
the first input subunit is used for inputting a text to be processed with a preset length into the ERNIE model to obtain text characteristics of the text to be processed; the text features of the text to be processed represent the grammar and the semantics of the text to be processed and the positions of all characters in the text to be processed; the text features of the text to be processed are m-n-dimensional text feature vectors, wherein m is the preset length, and n is a positive integer;
and the second input subunit is used for inputting the text to be processed into the part-of-speech recognition model to obtain part-of-speech characteristics of the text to be processed, wherein the part-of-speech characteristics of the text to be processed are m-1-dimensional part-of-speech characteristic vectors.
In a possible implementation manner, the first fusing unit 802 includes:
a mapping subunit, configured to map the part-of-speech feature vector of m × 1 dimensions into a part-of-speech feature vector of m × n dimensions;
and the fusion subunit is used for fusing the part-of-speech feature vector of the m-n dimensions with the text feature vector of the m-n dimensions to obtain a text fusion feature of the text to be processed, wherein the text fusion feature of the text to be processed is the text fusion feature vector of the m-n dimensions.
In a possible implementation manner, the labeling result of the text to be processed output by the sequence labeling model of the current hierarchy is a m × 1-dimensional labeling result vector;
the second fusion unit 806 is specifically configured to map the m × 1-dimensional labeling result vector into an m × n-dimensional labeling result vector; and fusing the m-n-dimensional labeling result vector and the m-n-dimensional text fusion feature vector to obtain the text fusion feature of the text to be processed again.
In addition, an embodiment of the present application further provides a text information extraction device, including: the text information extraction method comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein when the processor executes the computer program, the text information extraction method is realized according to any one of the above embodiments.
In addition, an embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device is caused to execute the text information extraction method according to any of the above embodiments.
The text information extraction device and the text information extraction equipment provided by the embodiments of the application extract both text features and part-of-speech features of the text to be processed, thereby characterizing the text to be processed from two aspects and obtaining more comprehensive feature information. The part-of-speech features help determine the information extraction content more accurately, so the accuracy of the obtained information extraction content can be improved. The text fusion features obtained by fusing the text features and the part-of-speech features are input into the sequence labeling model of the first level, which labels the information item to be extracted corresponding to that level. The obtained labeling result is fused with the text fusion features to obtain updated text fusion features. By switching the current-level sequence labeling model, the labeling of the sequence labeling model of each level can be performed in turn, and the labeling result of each level is obtained. Because the labeling result of each level's sequence labeling model is fused with the text fusion features and used as the input of the next level's sequence labeling model, each model can label on the basis of the labeling result of the previous level, which improves the accuracy of the labeling results. From the labeling results of the sequence labeling models of all levels, the information extraction contents of the information items to be extracted at different levels in the text to be processed can be obtained, so that multi-level information extraction is realized, covering both the information extraction contents themselves and the relationships between them. In addition, by acquiring multi-level information extraction contents, extraction of information extraction contents with multiple meanings can also be realized. Therefore, more accurate text information of the text to be processed can be obtained on the basis of automatic text information extraction.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the system or the device disclosed by the embodiment, the description is simple because the system or the device corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
It is further noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such a process, method, article, or apparatus. An element preceded by "comprising a ..." does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.