Information extraction method, device, equipment and medium based on RPA and AI

Document No.: 8273; publication date: 2021-09-17

1. An information extraction method based on RPA and AI, characterized in that it comprises:

S1, identifying labeled input text, and determining labeled segments that contain labeling information and non-labeled segments that do not, wherein each labeled segment comprises a label category and label content;

S2, determining text information to be extracted according to the label content, and combining the label category and the text information to obtain an extraction node; wherein the form of the identifier corresponding to the text information in the extraction node is determined according to whether an entity exists in the label content;

S3, generating a text node according to a key field of the non-labeled segment, wherein the form of the identifier corresponding to the text node is determined according to the importance value of the key field in the non-labeled segment; and

S4, combining the text nodes and the extraction nodes according to the positions of the non-labeled segments and the labeled segments in the input text to obtain an information extraction template, and extracting information from other, unlabeled input texts based on the information extraction template.

2. The method according to claim 1, wherein S2 specifically includes:

S21, taking the label category as the extraction category in the extraction node;

S22, if an entity fragment corresponding to the extraction category exists in the label content, replacing the entity fragment with the entity information to be extracted corresponding to the extraction category; wherein the entity information is represented by a first preset identifier to indicate that entity extraction is performed during information extraction; and

S23, combining the extraction category and the entity information to be extracted using a preset connector to obtain the extraction node.

3. The method according to claim 1, wherein S2 specifically includes:

S21, taking the label category as the extraction category in the extraction node;

S22, if no entity fragment corresponding to the extraction category is identified in the label content, determining the length range of the characters to be extracted according to the character length of the label content; wherein the length range is represented by a second preset identifier to indicate that content within that length range is extracted during information extraction; and

S23, combining the extraction category and the length range using a preset connector to obtain the extraction node.

4. The method according to claim 1, wherein S3 specifically includes:

S31, dividing the non-labeled segment into a plurality of clauses according to punctuation marks;

S32, determining the semantic importance value of each clause in the non-labeled segment;

S33, for any clause, if its importance value is smaller than a preset threshold, determining the length range of the characters to be extracted according to the length of the text content of the clause, and generating a text node according to the length range; wherein the length range in the text node is represented by a second preset identifier.

5. The method according to claim 4, wherein S32 specifically includes:

S321, determining a score for each clause based on TextRank, a text ranking algorithm in natural language processing (NLP), and summing the K highest scores to obtain a sum value;

S322, for the score of any clause, computing the quotient of the score and the sum value; if the quotient is smaller than a preset threshold, determining that the importance value of the clause is smaller than the preset threshold; and if the quotient is larger than the preset threshold, determining that the importance value of the clause is larger than the preset threshold.

6. The method of claim 4, further comprising:

if the importance value is greater than or equal to the preset threshold, identifying whether an entity fragment exists in the clause;

if an entity fragment is identified, replacing the entity fragment with the corresponding entity information to serve as a text node, wherein the entity information in the text node is represented by a first preset identifier;

if no entity fragment is identified, performing resource processing on the clause, and taking the processed content as a text node; wherein the resource processing comprises normalization processing and/or skeleton extraction processing in natural language processing (NLP).

7. The method of claim 1, wherein the input text is OCR processed text.

8. An information extraction device based on RPA and AI, comprising:

an annotation text recognition module configured to: identifying the marked input text, and determining a marked segment containing marking information and a non-marked segment not containing marking information, wherein the marked segment comprises a marked category and marked content;

an extraction node generation module configured to: determining text information to be extracted according to the labeled content, and combining the labeled category and the text information to obtain an extraction node; wherein, the representation mode of the identifier corresponding to the text information in the extraction node is determined according to whether an entity exists in the marked content or not;

a text node generation module configured to: generating a text node according to the key field of the non-labeling segment, wherein the representation mode of the corresponding identifier of the text node is determined according to the importance value of the key field in the non-labeling segment;

an information extraction module configured to: and combining the text nodes and the extraction nodes according to the positions of the non-labeled segments and the labeled segments in the input text to obtain an information extraction template, and extracting information of other input texts which are not labeled based on the information extraction template.

9. A computing device, comprising:

a memory storing executable program code;

a processor coupled with the memory;

the processor calls the executable program code stored in the memory to execute the RPA and AI based information extraction method according to any one of claims 1 to 7.

10. A computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the RPA and AI-based information extraction method according to any one of claims 1 to 7.

Background

RPA (Robotic Process Automation) simulates human operations on a computer through specific "robot software" and automatically executes process tasks according to rules.

AI (Artificial Intelligence) is a new technical science that studies and develops theories, methods, techniques and application systems for simulating, extending and expanding human intelligence.

RPA has two distinctive advantages: it is low-code and non-intrusive. Low-code means that RPA can be operated without a high level of IT expertise, so business personnel who do not know programming can also develop process flows. Non-intrusive means that RPA simulates human operation and therefore does not require the underlying software system to expose an interface. Conventional RPA nevertheless has limitations: it can only work from fixed rules, and its application scenarios are limited. With the continuous development of AI technology, the deep integration of RPA and AI overcomes these limitations. RPA + AI combines "hand work" with "head work", which greatly changes the value of the labor force.

RPA encounters a large amount of textual information while processing tasks. With the development of information technology, how to extract required information from various text media has become a problem of increasing concern. In industrial information extraction, template-based extraction still occupies a very important place. At present, engineers write and maintain templates manually. This consumes considerable labor, and there is also a large gap between an engineer's understanding and the actual capabilities of extraction resources (such as entity extraction and semantic matching), so it is difficult to make good use of those resources. In addition, different people find it difficult to judge how far fuzzy text matching should be applied when writing a template, which leads to under-recall or over-recall and makes the information extraction effect unstable.

Disclosure of Invention

Embodiments of the present invention provide an information extraction method, apparatus, device, and medium based on RPA and AI, which improve accuracy of information extraction by using an information extraction template with strong generalization capability.

In a first aspect, the present invention provides an information extraction method based on RPA and AI, including:

S1, identifying labeled input text, and determining labeled segments that contain labeling information and non-labeled segments that do not, wherein each labeled segment comprises a label category and label content;

S2, determining text information to be extracted according to the label content, and combining the label category and the text information to obtain an extraction node; wherein the form of the identifier corresponding to the text information in the extraction node is determined according to whether an entity exists in the label content;

S3, generating a text node according to a key field of the non-labeled segment, wherein the form of the identifier corresponding to the text node is determined according to the importance value of the key field in the non-labeled segment; and

S4, combining the text nodes and the extraction nodes according to the positions of the non-labeled segments and the labeled segments in the input text to obtain an information extraction template, and extracting information from other, unlabeled input texts based on the information extraction template.
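
As a rough illustration of how S1 and S4 fit together, the sketch below interleaves labeled spans with the non-labeled text between them in document order. The function name `interleave_by_position`, the `(start, end, node)` span format and the placeholder node string are assumptions made for this example; the method itself does not prescribe a data structure.

```python
def interleave_by_position(text, labeled_spans):
    """Return template pieces in document order.

    labeled_spans: non-overlapping (start, end, node) tuples, where `node`
    is the already-built extraction node for text[start:end] (hypothetical format).
    """
    pieces = []
    cursor = 0
    for start, end, node in sorted(labeled_spans):
        if start > cursor:
            # Text between two labeled spans is a non-labeled segment.
            pieces.append(("text", text[cursor:start]))
        pieces.append(("extract", node))
        cursor = end
    if cursor < len(text):
        # Trailing non-labeled segment after the last labeled span.
        pieces.append(("text", text[cursor:]))
    return pieces


sample = "Name: Zhang, ID: 42"
spans = [(6, 11, "EXTRACT(name)")]  # placeholder node string, not real template syntax
pieces = interleave_by_position(sample, spans)
```

In a fuller pipeline, each "text" piece would still have to be converted into one or more text nodes as described in S3 before the pieces are joined into a template.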

Optionally, the S2 specifically includes:

S21, taking the label category as the extraction category in the extraction node;

S22, if an entity fragment corresponding to the extraction category exists in the label content, replacing the entity fragment with the entity information to be extracted corresponding to the extraction category; wherein the entity information is represented by a first preset identifier to indicate that entity extraction is performed during information extraction; and

S23, combining the extraction category and the entity information to be extracted using a preset connector to obtain the extraction node.

Optionally, the S2 specifically includes:

S21, taking the label category as the extraction category in the extraction node;

S22, if no entity fragment corresponding to the extraction category is identified in the label content, determining the length range of the characters to be extracted according to the character length of the label content; wherein the length range is represented by a second preset identifier to indicate that content within that length range is extracted during information extraction; and

S23, combining the extraction category and the length range using a preset connector to obtain the extraction node.
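
The two variants of S2 can be sketched together as follows. This is a simplified illustration, not the patented implementation: `toy_find_entity` stands in for a real entity recognizer, the whole label content is treated as the entity rather than a fragment of it, the `slack` margin used to derive the length range is an assumption, and the `<>{n,m}` rendering of the second preset identifier is a hypothetical notation. Only the `[@Entity_XX]` form and the ":" connector come from this document.

```python
def build_extraction_node(category, content, find_entity, slack=2, connector=":"):
    """Build an extraction node string for one labeled segment."""
    entity = find_entity(category, content)
    if entity is not None:
        # Entity found: use the first preset identifier (resource node form).
        value = f"[@Entity_{entity}]"
    else:
        # No entity: fall back to a character-length range around the content length.
        n = len(content)
        value = f"<>{{{max(1, n - slack)},{n + slack}}}"  # hypothetical notation
    return "{" + category + connector + value + "}"


def toy_find_entity(category, content):
    """Stand-in recognizer (assumption): only 'name' labels map to an entity type."""
    return "PERSON" if category == "name" else None


name_node = build_extraction_node("name", "Zhang", toy_find_entity)
duty_node = build_extraction_node("duty", "sign contracts", toy_find_entity)
```

Passing the recognizer in as a parameter keeps the sketch independent of any particular NLP library.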

Optionally, the S3 specifically includes:

S31, dividing the non-labeled segment into a plurality of clauses according to punctuation marks;

S32, determining the semantic importance value of each clause in the non-labeled segment;

S33, for any clause, if its importance value is smaller than a preset threshold, determining the length range of the characters to be extracted according to the length of the text content of the clause, and generating a text node according to the length range; wherein the length range in the text node is represented by a second preset identifier.
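
A minimal sketch of S31 and S33 is given below, under several assumptions: the delimiter set, the `slack` margin around the clause length and the `<>{n,m}` rendering of the length range are illustrative choices, and the importance decision is passed in as a hypothetical `is_low_importance` predicate because S32 is specified separately.

```python
import re

# Assumed delimiter set: common ASCII and full-width punctuation marks.
CLAUSE_DELIMITERS = re.compile(r"[,.;:!?，。；：！？]")


def split_clauses(segment):
    """S31: split a non-labeled segment into non-empty clauses."""
    return [part.strip() for part in CLAUSE_DELIMITERS.split(segment) if part.strip()]


def length_range_node(clause, slack=2):
    """S33: replace a low-importance clause by a character-length range."""
    low = max(1, len(clause) - slack)
    high = len(clause) + slack
    return f"<>{{{low},{high}}}"  # hypothetical notation for the second identifier


def text_nodes(segment, is_low_importance):
    nodes = []
    for clause in split_clauses(segment):
        if is_low_importance(clause):
            nodes.append(length_range_node(clause))
        else:
            nodes.append(clause)  # left for entity or normalization handling
    return nodes


segment = "Hereby certified, the authorized person signs the contract."
nodes = text_nodes(segment, is_low_importance=lambda c: c == "Hereby certified")
```

Splitting on every period is a simplification; real text with decimals or abbreviations would need a more careful clause splitter.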

Optionally, the S32 specifically includes:

S321, determining a score for each clause based on TextRank, a text ranking algorithm in natural language processing (NLP), and summing the K highest scores to obtain a sum value;

S322, for the score of any clause, computing the quotient of the score and the sum value; if the quotient is smaller than a preset threshold, determining that the importance value of the clause is smaller than the preset threshold; and if the quotient is larger than the preset threshold, determining that the importance value of the clause is larger than the preset threshold.
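
The normalization in S321 and S322 can be illustrated with a short sketch. The clause scores below are made-up numbers standing in for TextRank output, and the function name `low_importance_flags` is hypothetical. A quotient exactly equal to the threshold is treated here as not low, consistent with the "greater than or equal to" branch described for important clauses.

```python
def low_importance_flags(scores, k, threshold):
    """For each clause score, report whether score / (sum of top-K scores) < threshold."""
    if not scores:
        return []
    top_k_sum = sum(sorted(scores, reverse=True)[:k])
    if top_k_sum == 0:
        # Degenerate case: no clause carries any weight.
        return [True] * len(scores)
    return [score / top_k_sum < threshold for score in scores]


# Hypothetical TextRank scores for four clauses.
scores = [0.40, 0.30, 0.20, 0.10]
flags = low_importance_flags(scores, k=2, threshold=0.3)
```

With K = 2 the sum value is 0.70, so the quotients are roughly 0.57, 0.43, 0.29 and 0.14, and only the last two clauses fall below the 0.3 threshold.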

Optionally, the method further includes:

if the importance value is greater than or equal to the preset threshold, identifying whether an entity fragment exists in the clause;

if an entity fragment is identified, replacing the entity fragment with the corresponding entity information to serve as a text node, wherein the entity information in the text node is represented by a first preset identifier;

if no entity fragment is identified, performing resource processing on the clause, and taking the processed content as a text node; wherein the resource processing comprises normalization processing and/or skeleton extraction processing in natural language processing (NLP).

Optionally, the input text is text obtained through optical character recognition (OCR).

In a second aspect, an embodiment of the present invention further provides an information extraction device based on RPA and AI, where the information extraction device includes:

an annotation text recognition module configured to: identifying the marked input text, and determining a marked segment containing marking information and a non-marked segment not containing marking information, wherein the marked segment comprises a marked category and marked content;

an extraction node generation module configured to: determining text information to be extracted according to the labeled content, and combining the labeled category and the text information to obtain an extraction node; wherein, the representation mode of the identifier corresponding to the text information in the extraction node is determined according to whether an entity exists in the marked content or not;

a text node generation module configured to: generating a text node according to the key field of the non-labeling segment, wherein the representation mode of the text node corresponding to the identifier is determined according to the importance value of the key field in the non-labeling segment;

an information extraction module configured to: and combining the text nodes and the extraction nodes according to the positions of the non-labeled segments and the labeled segments in the input text to obtain an information extraction template, and extracting information of other input texts which are not labeled based on the information extraction template.

Optionally, the extraction node generating module is specifically configured to:

taking the label type as an extraction type in an extraction node;

if the entity fragment corresponding to the extraction type exists in the marked content, replacing the entity fragment with entity information to be extracted corresponding to the extraction type; the entity information is represented by a first preset identifier to represent that entity extraction is carried out in the information extraction process;

and combining the extraction type and the entity information to be extracted according to a preset connector to obtain an extraction node.

Optionally, the extraction node generating module is specifically configured to:

taking the label type as an extraction type in an extraction node;

if the entity fragment corresponding to the extraction type is not identified in the marked content, determining the length range of the character to be extracted according to the character length of the marked content; the length range of the character to be extracted is represented by a second preset identifier so as to represent that the character of the length range content is extracted in the information extraction process;

and combining the extraction type and the length range according to a preset connector to obtain an extraction node.

Optionally, the text node generating module specifically includes:

a clause segmentation unit configured to: dividing the non-labeling segment into a plurality of clauses according to punctuation marks;

a clause importance determining unit configured to: determining the importance value of the semantic meaning of each clause in the non-labeled segment;

a text node generating unit configured to: for any clause, if the importance value is smaller than a preset threshold value, determining the length range of the character to be extracted according to the length of the text content corresponding to the clause, and generating a text node according to the length range; wherein the length range in the text node is represented by a second preset identifier.

Optionally, the clause importance determining unit is specifically configured to:

determining a score for each clause based on TextRank, a text ranking algorithm in natural language processing (NLP), and summing the K highest scores to obtain a sum value;

for the score of any clause, computing the quotient of the score and the sum value; if the quotient is smaller than a preset threshold, determining that the importance value of the clause is smaller than the preset threshold; and if the quotient is larger than the preset threshold, determining that the importance value of the clause is larger than the preset threshold.

Optionally, the apparatus further comprises:

the entity judging module is configured to identify whether an entity fragment exists in the clause if the importance value is greater than or equal to the preset threshold;

an entity information replacement module configured to replace an entity fragment with corresponding entity information as a text node if the existence of the entity fragment is recognized; the entity information in the text node is represented by a first preset identifier;

the resource processing module is configured to perform resource processing on the clause and take the content after the resource processing as a text node if the entity segment is not identified; wherein the resource processing comprises normalization processing and/or skeleton extraction processing in Natural Language Processing (NLP).

Optionally, the input text is a text after being processed by optical character recognition OCR.

In a third aspect, an embodiment of the present invention further provides a computing device, including:

a memory storing executable program code;

a processor coupled with the memory;

the processor calls the executable program codes stored in the memory to execute the information extraction method based on the RPA and the AI provided by any embodiment of the invention.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the RPA and AI-based information extraction method provided in any embodiment of the present invention.

According to the technical solution provided by the embodiments of the present invention, after the labeled segments and non-labeled segments of the input text are identified, extraction nodes are generated from the labeled segments and text nodes are generated from the non-labeled segments. Each extraction node comprises an extraction category and the text information to be extracted, and each text node is generated according to the key fields in a non-labeled segment. The text nodes and extraction nodes are then combined according to the positions of the non-labeled and labeled segments in the input text to obtain the information extraction template. Compared with writing and maintaining information extraction templates manually, the embodiments of the present invention generate the template automatically from the labeling information and the extraction resources, which saves a large amount of labor and yields a template with stronger generalization capability. Using this template for information extraction can improve extraction precision.

The innovation points of the embodiment of the invention comprise:

1. Extraction nodes are generated from the labeled segments of the input text, text nodes are generated from the non-labeled segments, and the two are combined according to the positions of the non-labeled and labeled segments in the input text to obtain an information extraction template. This solves the low efficiency of writing templates manually and is one of the innovation points of the embodiments of the present invention.

2. Entity extraction, normalization and/or skeleton extraction are performed on the text nodes of the information extraction template, which improves the generalization capability of the template. This is one of the innovation points of the embodiments of the present invention.

3. Different identifiers are added to the text information to be extracted in the extraction nodes and to the text nodes. When the template is used for information extraction, whether to perform fuzzy matching can be selected adaptively according to these identifiers, which improves the precision and accuracy of extraction. This is one of the innovation points of the embodiments of the present invention.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1a is a flowchart illustrating an RPA and AI based information extraction method according to an embodiment of the present invention;

FIG. 1b is a screenshot of an information extraction template according to an embodiment of the present invention;

FIG. 1c is a screenshot of the extraction effect obtained on a test text by using an information extraction template according to an embodiment of the present invention;

FIG. 2 is a flowchart of another information extraction method based on RPA and AI according to a second embodiment of the present invention;

FIG. 3 is a block diagram of an information extraction apparatus based on RPA and AI according to a third embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a computing device according to a fourth embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without any inventive step, are within the scope of the present invention.

It is to be noted that the terms "comprises" and "comprising" and any variations thereof in the embodiments and drawings of the present invention are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.

In the description of the present invention, an "information extraction template" is a text expression that a developer provides for the "information extraction" function. The expression is matched against segments of a text in order to extract information from them. The information extraction template provided by the embodiments of the present invention uses the following syntax:

1. Square brackets "[ ]" denote strict matching; the matched object can be text, a resource, and so on. Strict matching requires the text being matched to be identical to the specified content.

2. Angle brackets "< >" denote fuzzy matching; the matched object can be text or an arbitrary number of characters. Fuzzy matching is the counterpart of strict matching: it only requires the two to be semantically close, that is, their similarity must be greater than a set threshold.

3. Braces "{ }" enclose a field name (key) and the content to be extracted (value), for example: { company name: <. n, m }.
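
The difference between strict and fuzzy matching can be sketched as follows. The character-level ratio from Python's standard `difflib` module is used here only as a stand-in for the semantic similarity that the extraction resources would provide, and the 0.8 threshold and function names are assumptions made for this example.

```python
from difflib import SequenceMatcher


def strict_match(expected, candidate):
    """Square-bracket semantics: the two texts must be identical."""
    return expected == candidate


def fuzzy_match(expected, candidate, threshold=0.8):
    """Angle-bracket semantics: similarity must exceed a set threshold.

    SequenceMatcher.ratio() is a surface-level similarity in [0, 1]; a real
    system would substitute a semantic similarity score here.
    """
    similarity = SequenceMatcher(None, expected, candidate).ratio()
    return similarity > threshold


same = strict_match("authorized person", "authorized person")
variant_strict = strict_match("authorized person", "authorised person")
variant_fuzzy = fuzzy_match("authorized person", "authorised person")
unrelated_fuzzy = fuzzy_match("authorized person", "term of validity")
```

A spelling variant fails the strict test but passes the fuzzy one, which is the kind of tolerance that angle-bracket nodes are meant to provide.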

In the description of the present invention, "extraction resources" refers to basic underlying NLP capabilities, including text semantic similarity calculation, entity extraction, syntactic analysis and the like.

In the description of the present invention, a "field" is the name given to a piece of key information extracted by the template; it is specific to the current information extraction task and is generally specified by the user.

To explain the implementation principle of the embodiments of the present invention clearly, the form of the information extraction template is briefly introduced below.

Divided by function, the nodes in the information extraction template comprise text nodes and extraction nodes. The two are distinguished by braces: content inside braces is an extraction node, and content outside braces is a text node.

Divided by form, a node in the information extraction template may be: 1. a resource node [@Entity_XX], where XX represents the entity information to be extracted; 2. an any-text node < > n, m, where n and m bound the number of characters to be extracted; 3. a plain text node <XXX> or [XXX]. These forms are described in detail below.

Embodiment One

Robotic Process Automation (RPA) simulates human operations on a computer through specific robot software and automatically executes process tasks according to rules.

AI is the abbreviation of Artificial Intelligence, a new technical science that studies and develops theories, methods, techniques and application systems for simulating, extending and expanding human intelligence. This embodiment employs OCR (Optical Character Recognition) and NLP (Natural Language Processing) among AI techniques. OCR is used to recognize text content in pictures and PDF documents; NLP is used to determine the importance of clauses in a text and to process clauses, for example by normalization or skeleton extraction.

In industrial information extraction, template-based extraction still occupies a very important place. At present, however, template-based extraction requires templates to be written and maintained manually; this involves a heavy workload and makes it difficult to use extraction resources such as entity extraction and semantic matching effectively. The embodiments of the present invention provide an information extraction method, device, equipment and medium based on RPA and AI that automatically generate the required template from the labeled content and the extraction resources, and use the template to extract text content accurately. Generating the template automatically reduces labor cost, makes better use of the extraction resources, and strengthens the generalization capability of the information extraction template.

FIG. 1a is a flowchart of an RPA and AI based information extraction method according to an embodiment of the present invention. The method can be executed by an RPA and AI based information extraction device, which can be implemented in software and/or hardware. As shown in FIG. 1a, the method includes:

s110, identifying the marked input text, and determining a marked segment containing marking information and a non-marked segment not containing marking information.

The input text may be text obtained by OCR. Each labeled segment in the input text comprises a label category and label content. The input text can be labeled in various ways, and this embodiment does not specifically limit the labeling method.

For example, different colors can be applied to the labeled segments and the non-labeled segments when the input text is labeled, and the label category and the label content within a labeled segment can likewise be distinguished by different colors. With this labeling method, the label category, the label content and the non-labeled segments are determined by recognizing the colors.

As another example, a preset identifier can be used to distinguish labeled segments from non-labeled segments when the input text is labeled. The label category and the label content in a labeled segment can be expressed in a preset format, for example "label category (key) = label content (value)". With this labeling method, the labeled and non-labeled segments in the text are determined by recognizing the preset identifier, and the category and corresponding content of each labeled segment are identified from the preset format.

Specifically, the following labeled texts are taken as examples:

The disease prevention and control center in health county, committee, and support was granted as the authorizer, with special authorization as follows. An authorizer [name = Zhang] (identification number: [identification number = 420626xxxxxxxx 12]) under the name of authorizer [duty = negotiation with Wuhan's biological products research institute limited liability company for purchasing new crown vaccine products, signing products for purchase and sale contracts]. The present invention relates to a method for providing authorization to an authorized person [name = Li] (identification number: [identification number = 420321xxxxxx 18] _ ______) in the name of the authorized person [duty = collection and/or extraction of goods from the limited liability company of the Wuhan biologics research institute]. An authorizer [name = Li] ([identification number = 420321xxxxxx 18]) under the name of an authorizer [duty = receipt of invoices issued by Wuhan's biologics institute limited liability company, collection and provision of financial statements]. The term of authorization starts from 10 and 13 of 2020 [expiration date = 2021, 10 and 13 of 2020] annex: authorized personnel identity card copy authorizer (signature): authorized person (signature): authorized person (signed): Li an authorized person (signature): Liu certain signature date: attachment of 10 month and 13 days of 2020: an authorized person identification card copy.

In the labeled text above, the content inside square brackets "[ ]" is determined to be a labeled segment, for example [name = Li] and [identification number = 420626XXXXXXXXXX12]. For any labeled segment, the content before "=" is the label category, for example "name", and the content after "=" is the label content, for example "Li".
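The segmentation described above can be sketched as follows. This is a minimal illustration only, assuming the label syntax is exactly "[category = content]" with no nested brackets; the function name `split_segments` and the tuple layout are hypothetical and do not come from the embodiment.

```python
import re

# Assumed label syntax: "[category = content]". The pattern also accepts the
# "═" sign that appears in some renderings of the labeled text.
LABEL = re.compile(r"\[\s*([^\[\]=═]+?)\s*[=═]\s*([^\[\]]*?)\s*\]")

def split_segments(text):
    """Return ("text", str) and ("label", category, content) tuples in document order."""
    segments, pos = [], 0
    for m in LABEL.finditer(text):
        if m.start() > pos:
            # Everything between two labels is a non-labeled segment.
            segments.append(("text", text[pos:m.start()]))
        segments.append(("label", m.group(1), m.group(2)))
        pos = m.end()
    if pos < len(text):
        segments.append(("text", text[pos:]))
    return segments

sample = "Authorized person [name = Li] (identification number: [identification number = 420321xxxxxx18])"
for segment in split_segments(sample):
    print(segment)
```

Keeping the segments in document order matters because the later steps combine text nodes and extraction nodes by position.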

S120, determining the text information to be extracted according to the label content, and combining the text information to be extracted with the label category to obtain an extraction node.

The extraction node is used to extract the information of a preset field during information extraction. The extraction node may be represented in the following format: {extraction category: value to be extracted}. The extraction category is given by the label category, and the value to be extracted is determined from the label content. The extraction category and the text information to be extracted are combined by a connector, for example a colon ":", to obtain the extraction node.

In this embodiment, a set syntax identifier is added to the text information to be extracted in the extraction node, and the representation of the syntax identifier is determined according to whether an entity exists in the labeled content.

Illustratively, if an entity fragment corresponding to the extraction category exists in the label content, the entity fragment is replaced with the entity information to be extracted for that category. The entity information is marked with a first preset identifier, [@Entity_], to indicate that entity extraction is performed during information extraction. The extraction category and the entity information to be extracted are combined by the preset connector ":" to obtain the extraction node; that is, the value of the extraction node takes the resource-node form [@Entity_XX]. When this node is used for information extraction, the text information to be extracted is treated as an entity.

Specifically, fig. 1b is a screenshot of an information extraction template according to an embodiment of the present invention. The template in fig. 1b is generated from the labeled text above. When the extraction node is generated, the entity fragment containing the person's name is replaced with the entity information to be extracted, "person name", and the first preset identifier is added to it. The extraction category "name" and the entity information "person name" are combined by the preset connector ":" to obtain the extraction node {name: [@Entity_person name]}.

In this embodiment, Entity may be replaced with V, R or S, where V denotes an entity extracted by a word list, R denotes an entity extracted by a regular expression, and S denotes an entity extracted by a model. In the person-name example above, Entity may be replaced with S to indicate that the entity is extracted by a model.
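The three extraction modes can be illustrated with a small dispatcher. This is a sketch under stated assumptions: the word list, the regular expression, the function name `extract_entity` and the idea of passing the model as a callable are all hypothetical and are not specified by the embodiment.

```python
import re

# Hypothetical resources; none of these names or values come from the source.
WORD_LISTS = {"company name": ["Alibaba", "Tencent"]}
PATTERNS = {"id number": re.compile(r"\d{6}[0-9xX]{10}\d{2}")}

def extract_entity(mode, entity_type, text, model=None):
    """Return the first entity of entity_type found in text, or None."""
    if mode == "V":  # word list: first listed word that occurs in the text
        return next((w for w in WORD_LISTS[entity_type] if w in text), None)
    if mode == "R":  # regular expression
        match = PATTERNS[entity_type].search(text)
        return match.group(0) if match else None
    if mode == "S":  # model: any callable mapping text to an entity or None
        return model(text)
    raise ValueError(f"unknown mode: {mode}")

print(extract_entity("V", "company name", "Tencent, a game company"))
print(extract_entity("R", "id number", "ID 420626xxxxxxxxxx12"))
```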

Illustratively, if no entity fragment corresponding to the extraction category is identified in the label content, the length range of the characters to be extracted is determined according to the character length of the label content. The length range is represented by a second preset identifier, <.*{n,m}>, which indicates that content whose length falls within the range is extracted for the extraction category during information extraction. The extraction category and the length range are combined by the preset connector to obtain the extraction node; that is, the value of the extraction node takes the any-text-node form <.*{n,m}>, where n and m are the lower and upper limits of the number of characters to be extracted. When this extraction node is used for information extraction, the text content at the position of the labeled segment is extracted as long as its character length is within the range formed by n and m.

Specifically, taking the labeled text above as an example again, no entity fragment is identified in the label content of "[duty = negotiating with Wuhan Institute of Biological Products Co., Ltd. to purchase COVID-19 vaccine products and signing product purchase and sale contracts]". According to the character length of the label content, which is 35 (counted on the original Chinese text), the lower limit of the length of the characters to be extracted is set to 0 and the upper limit to 100. The length range is represented with the second preset identifier as <.*{0,100}>. The extraction category "duty" and the length range <.*{0,100}> are combined by the preset connector ":" to obtain the extraction node {duty: <.*{0,100}>}.
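The two forms of extraction node can be sketched as below. The sketch assumes the length-range identifier is written <.*{n,m}> and takes the bounds as parameters, because the embodiment only states that they are derived from the length of the label content (0 and 100 in its example) without giving the rule. The function name `make_extraction_node` and its parameters are hypothetical.

```python
def make_extraction_node(category, entity_type=None, lower=0, upper=100):
    """Build an extraction node string.

    entity_type: type of the entity recognised in the label content,
                 or None when no entity fragment was found.
    lower/upper: assumed bounds of the length range (0 and 100 in the example).
    """
    if entity_type is not None:
        value = f"[@Entity_{entity_type}]"      # first preset identifier
    else:
        value = f"<.*{{{lower},{upper}}}>"      # second preset identifier
    # The category and the value are joined by the preset connector ":".
    return f"{{{category}: {value}}}"

print(make_extraction_node("name", entity_type="person name"))
print(make_extraction_node("duty"))
```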

S130, generating a text node according to the key field of the non-labeled segment.

The text node works together with the extraction node to extract the information of a preset field during information extraction. The text before and after an extraction node is matched against the adjacent text nodes, and the matching result determines whether content is extracted at that extraction node, which improves the accuracy of the extracted content.

In this embodiment, the key fields in the non-labeled segment can be determined by a semantic recognition algorithm in natural language processing (NLP). The representation of the identifier of a text node is determined according to the importance value of the key field in the non-labeled segment.

For example, the non-labeled segment may be divided into a plurality of clauses, and the form of the identifier of each text node is determined by the importance of the corresponding clause within the non-labeled segment. If the importance value of a clause is smaller than a preset threshold, the clause is of low importance in the non-labeled segment, and the text node may be represented in the any-text-node form <.*{n,m}>. That is, when text content is matched against this text node, the match succeeds as long as the character length of the text content is within the range formed by n and m. Specifically, m may be 2L rounded to an integer and n may be 0.3L rounded to an integer, where L is the number of characters in the clause.

Specifically, still taking the labeled text above as an example, as shown in fig. 1b, the text node corresponding to the "identification number" field is <.*{0,10}>.
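The optional length rule for low-importance clauses (n as 0.3L rounded, m as 2L) can be written out as a short sketch. It assumes round-half-up rounding, which the embodiment does not specify, and the <.*{n,m}> notation for the any-text node; the function name `any_text_node` is hypothetical.

```python
def any_text_node(clause):
    """Build an any-text node whose length range is derived from the clause length L."""
    L = len(clause)
    n = (3 * L + 5) // 10   # 0.3L rounded half up, done in integer arithmetic
    m = 2 * L               # 2L is already an integer
    return f"<.*{{{n},{m}}}>"

print(any_text_node("a" * 10))
```

Integer arithmetic is used here so the result does not depend on floating-point rounding behaviour.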

Illustratively, if the importance value of a clause is greater than the preset threshold, the clause is of high importance in the non-labeled segment. In this case, it is identified whether an entity fragment exists in the clause. If an entity fragment exists, it is replaced with the corresponding entity information to serve as the text node; that is, the text node is represented in the resource-node form [@Entity_XX]. If no entity fragment exists, resource processing is performed on the clause, for example normalization and/or backbone extraction using NLP techniques, and the processed content serves as the text node; that is, the text node is represented in the plain-text-node form <XXX> or [XXX]. Specifically, still taking the labeled text above as an example, "authorized person" and "in the name of authorized person" in the non-labeled segment are both matched exactly, which corresponds to [authorized person] and [in the name of authorized person] in the template shown in fig. 1b.

In the process of generating the text nodes and the extraction nodes according to the marked input text, the generation sequence of the text nodes and the extraction nodes does not have any precedence, and the generation can be carried out simultaneously.

S140, combining the text nodes and the extraction nodes according to the positions of the non-labeled segments and the labeled segments in the input text to obtain an information extraction template, and extracting information of other input texts which are not labeled based on the information extraction template.

In this embodiment, in order to distinguish the text node from the extraction node, in the process of combining the text node and the extraction node, a brace "{ }" may be added to the extraction node to indicate a field name (key) to be extracted and a content (value) to be extracted.
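Combining nodes by position can be sketched as follows. The representation of a node as an (offset, kind, body) tuple and the function name `build_template` are assumptions made for illustration only.

```python
def build_template(nodes):
    """Join nodes in the order of their start offsets in the input text.

    nodes: iterable of (start_offset, kind, body), where kind is "text" or
    "extract". Extraction nodes are wrapped in braces to set them apart.
    """
    parts = []
    for _, kind, body in sorted(nodes, key=lambda node: node[0]):
        parts.append("{" + body + "}" if kind == "extract" else body)
    return "".join(parts)

nodes = [
    (18, "extract", "name: [@Entity_person name]"),
    (0, "text", "[authorized person]"),
]
print(build_template(nodes))
```

Because ordering depends only on the offsets, text nodes and extraction nodes can be generated in any order, or in parallel, before being combined.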

After the information extraction template is obtained, the text nodes in the information extraction template and the contents in the extraction nodes can be sequentially matched with the text contents of other input texts which are not marked, so that the desired information can be extracted from the input texts which are not marked. And the matching mode is determined by the identifiers corresponding to the text node and the extraction node respectively.

Specifically, the mode of matching against the input text is determined by the identifier of the text node. If the identifier of the text node is a pair of angle brackets "< >", the similarity between the content of the text node and the input text is calculated, and the match succeeds if the similarity reaches a set threshold; if the identifier is a pair of square brackets "[ ]", the content of the text node is matched strictly against the input text, and the match succeeds only if the two are identical. For example, fig. 1c is a screenshot of the result of extracting information from a test text by using an information extraction template according to an embodiment of the present invention. When the template shown in fig. 1b is applied to the test text shown in fig. 1c, the test text contains content identical to the text nodes [authorized person] and [in the name of authorized person], so these text nodes are matched successfully.
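The two matching modes can be sketched as below. The embodiment does not name a similarity measure, so the sketch swaps in the ratio from Python's standard `difflib.SequenceMatcher` as a stand-in; the threshold of 0.8 and the function name `match_text_node` are likewise assumptions.

```python
from difflib import SequenceMatcher

def match_text_node(node, candidate, threshold=0.8):
    """Match a plain text node against a candidate span of the input text."""
    body = node[1:-1]
    if node.startswith("[") and node.endswith("]"):
        # Square brackets: strict match, the two strings must be identical.
        return candidate == body
    if node.startswith("<") and node.endswith(">"):
        # Angle brackets: fuzzy match on a similarity score (assumed measure).
        return SequenceMatcher(None, body, candidate).ratio() >= threshold
    raise ValueError(f"not a plain text node: {node}")

print(match_text_node("[authorized person]", "authorized person"))
print(match_text_node("<the largest e-commerce company in China>",
                      "the largest e-commerce company of China"))
```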

Similarly, when an extraction node is matched against the input text, the matching mode is determined by the identifier of the text information to be extracted in the extraction node. If the first preset identifier is present, the extraction node is a resource node, and the corresponding entity is extracted from the input text according to the extraction category. For example, when the information extraction template shown in fig. 1b is applied to the test text shown in fig. 1c, two entities can be extracted from the test text: the name "zhangtianley" and the identity card number "420626xxxxxxxxxx12". If the second preset identifier is present, the extraction node is an any-text node, and text content whose length is within the range is extracted from the input text according to the extraction category. For example, when the information extraction template shown in fig. 1b is applied to the test text shown in fig. 1c, text whose character length is in the range 0 to 100, namely "contract with Wuhan mechanical company for purchasing capital products", can be extracted from the test text.
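One way to apply a template is to compile its nodes into a single regular expression with named groups. This is a simplified sketch, not the method of the embodiment: it handles only strict text nodes, regex-backed entity nodes and any-text nodes, and the tuple layout, the `ENTITY_PATTERNS` table, the field names and the sample sentence are all hypothetical.

```python
import re

# Hypothetical regular-expression resource for one entity type.
ENTITY_PATTERNS = {"id number": r"\d{6}[0-9xX]{10}\d{2}"}

def node_to_regex(node):
    kind = node[0]
    if kind == "text":      # ("text", literal): strict plain text node
        return re.escape(node[1])
    if kind == "entity":    # ("entity", field, entity_type): resource node
        return f"(?P<{node[1]}>{ENTITY_PATTERNS[node[2]]})"
    if kind == "range":     # ("range", field, n, m): any-text node, n..m characters
        return f"(?P<{node[1]}>.{{{node[2]},{node[3]}}}?)"
    raise ValueError(f"unknown node kind: {kind}")

def extract(template, text):
    """Return {field: extracted value}, or an empty dict if the template does not match."""
    pattern = "".join(node_to_regex(node) for node in template)
    match = re.search(pattern, text)
    return match.groupdict() if match else {}

template = [
    ("text", "Authorized person "), ("range", "name", 1, 10),
    ("text", " (ID number: "), ("entity", "id_number", "id number"),
    ("text", ") shall "), ("range", "duty", 0, 100), ("text", "."),
]
text = "Authorized person Zhang (ID number: 420626xxxxxxxxxx12) shall sign the purchase contract."
print(extract(template, text))
```

The surrounding text nodes act as anchors, so an any-text node only captures content that sits between the expected neighbours.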

According to the technical scheme provided by the embodiment, after the labeled segment and the non-labeled segment of the input text are identified, the extraction node is generated based on the labeled segment of the labeled text, and the text node is generated based on the non-labeled segment. The extraction nodes comprise extraction categories and text information to be extracted, and the text nodes are generated according to key fields in the non-labeled fragments. And combining the text nodes and the extraction nodes according to the positions of the non-labeled segments and the labeled segments in the input text to obtain the information extraction template. Compared with a mode of manually writing and maintaining the information extraction template, the embodiment of the invention adopts a mode of automatically generating the template based on the marking information and the extraction resources, thereby saving a large amount of human resources, and the generated template has stronger generalization capability. When the template is used for information extraction, the effect of improving the information extraction precision can be achieved.

Example two

Fig. 2 is a flowchart of another information extraction method based on RPA and AI according to a second embodiment of the present invention, where in this embodiment, a generation manner of a text node is refined based on the foregoing embodiment, and as shown in fig. 2, the method provided in this embodiment includes:

S200, identifying the marked input text, and determining a marked segment containing marking information and a non-marked segment not containing marking information.

The labeling segment comprises a labeling category and labeling content.

S210, determining text information to be extracted according to the marked content, and combining the marked type and the text information to obtain an extraction node.

And the representation mode of the identifier corresponding to the text information in the extraction node is determined according to whether the entity exists in the marked content or not.

S220, segmenting the non-labeled segment into a plurality of clauses according to punctuation marks.

S230, determining the semantic importance value of each clause in the non-labeled segment.

The importance value of the semantic meaning of each clause in the non-labeled segment can be determined by the following method:

The score of each clause is determined by the TextRank text ranking algorithm in NLP. The scores are sorted, and the top K scores are summed to obtain a sum value. For any clause, the quotient of its score and the sum value is computed: if the quotient is smaller than a preset threshold, the importance value of the clause is determined to be smaller than the preset threshold; if the quotient is larger than the preset threshold, the importance value of the clause is determined to be larger than the preset threshold.
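The normalization step can be sketched as below, assuming the per-clause TextRank scores have already been computed by some TextRank implementation. The function names are hypothetical, and treating a quotient equal to the threshold as important is an assumption, since the method text only covers the strictly smaller and strictly larger cases.

```python
def clause_importance(scores, k):
    """Divide each clause score by the sum of the top-k scores."""
    top_k_sum = sum(sorted(scores, reverse=True)[:k])
    if top_k_sum == 0:
        return [0.0 for _ in scores]
    return [score / top_k_sum for score in scores]

def is_important(scores, k, threshold):
    """True for clauses whose quotient reaches the threshold (equality assumed important)."""
    return [q >= threshold for q in clause_importance(scores, k)]

scores = [2.0, 1.0, 1.0]   # assumed TextRank scores for three clauses
print(clause_importance(scores, k=2))
print(is_important(scores, k=2, threshold=0.5))
```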

S240, judging whether the importance value of any clause is smaller than a preset threshold value, if so, executing a step S250; otherwise, step S260 is performed.

S250, determining the length range of the characters to be extracted according to the length of the text content of the clause, generating a text node according to the length range, and continuing to execute step S290.

In this embodiment, if the importance value of a clause is smaller than the preset threshold, the clause is of low importance in the text, and the text node may be represented in the any-text-node form <.*{n,m}>.

S260, identifying whether entity fragments exist in the clause, and if so, executing a step S270; otherwise, go to step S280;

S270, replacing the entity fragment with the corresponding entity information to serve as a text node, and continuing to execute step S290.

In this embodiment, if the importance value of a clause is greater than the preset threshold, the clause is relatively important in the text, and the text node may be represented in the resource-node form [@Entity_XX].

S280, perform resource processing on the clause, and continue to execute step S290 with the content after resource processing as a text node.

The resource processing comprises normalization and/or backbone extraction in natural language processing (NLP). The principles of backbone extraction include: (1) extracting the head word through the dependency relations of the dependency syntax tree; (2) extracting the words coordinated with the head word and the corresponding subjects or objects.
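The branching in steps S220 to S280 can be summarised in a short sketch. It is a simplification under stated assumptions: entity recognition and resource processing are passed in as hypothetical callables (`find_entity`, `normalize`), a clause containing an entity is reduced to a single resource node, the any-text node is written <.*{n,m}> with n as 0.3L rounded half up and m as 2L, and processed clauses default to the fuzzy "< >" form.

```python
import re

def split_clauses(segment):
    """S220: split a non-labeled segment on common Chinese and English punctuation."""
    return [c for c in re.split(r"[,，。.;；:：!！?？]", segment) if c.strip()]

def make_text_node(clause, important, find_entity, normalize):
    """S240 to S280: choose the text node form for one clause."""
    if not important:                       # S250: low importance, any-text node
        L = len(clause)
        return f"<.*{{{(3 * L + 5) // 10},{2 * L}}}>"
    entity_type = find_entity(clause)       # S260: look for an entity fragment
    if entity_type is not None:             # S270: resource node
        return f"[@Entity_{entity_type}]"
    return f"<{normalize(clause)}>"         # S280: resource-processed plain text node

print(split_clauses("a,b。c"))
print(make_text_node(" in the name of ", True, lambda c: None, str.strip))
```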

In this embodiment, a text node obtained by resource processing is a plain text node, and its identifier takes one of two forms: a third preset identifier, namely square brackets "[ ]", or a fourth preset identifier, namely angle brackets "< >". During information extraction, square brackets indicate that the bracketed content is matched strictly against the matching object, that is, the match succeeds only when the two are identical. Angle brackets indicate that the bracketed content is matched fuzzily against the matching object, that is, the match succeeds when the similarity between the two is greater than a set threshold.

Specifically, square brackets are used as the identifier in the following three cases:

(1) The text contains only one word and can be normalized. For example, <unemployed> and <no occupation> can both be normalized to <unemployed>. Since normalization generalizes more accurately than similarity calculation, the identifier of the text node for the normalized content is the third preset identifier, namely square brackets "[ ]".

(2) The extraction node is an any-text node and the text node that follows it is a plain text node. In this case the third preset identifier, namely square brackets "[ ]", is used as the identifier of that plain text node. For example, from the labeled text "My hometown is [hometown = Chenchenchen station]." the generated template is: <my hometown is>{hometown: <.*{2,10}>}[.]

(3) In information extraction templates generated from several similar texts, some part of the text content is identical. The identifier of the text node for that part is the third preset identifier, namely square brackets "[ ]". For example, consider two similar texts: 1. Alibaba, founder [founder = Ma Yun], the largest e-commerce company in China. 2. Tencent, founder [founder = Ma Huateng], a well-known game agency company in China. The templates generated are, respectively: [@Entity_company name]<, founder>{founder: [@Entity_person name]}<, the largest e-commerce company in China> and [@Entity_company name]<, founder>{founder: [@Entity_person name]}<, a well-known game agency company in China>. Since the two templates are highly similar (similarity greater than a set similarity threshold) and both contain the part <, founder>, the identifier of that part is converted to the third preset identifier, namely square brackets "[ ]".

In addition to the three cases described above, the plain text nodes are represented in the form of a fourth preset identifier, namely "< >".
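Case (3) can be illustrated with a sketch that compares two templates node by node. It assumes each template is a list of node strings, that the caller has already established that the two templates are highly similar and aligned, and that any-text nodes start with "<.*{"; the function names are hypothetical.

```python
def wrap_plain_text(content, strict):
    """Square brackets for strict matching, angle brackets for fuzzy matching."""
    return f"[{content}]" if strict else f"<{content}>"

def promote_shared(template_a, template_b):
    """Switch fuzzy plain text nodes that are identical in both templates to strict."""
    result = []
    for a, b in zip(template_a, template_b):
        shared_fuzzy = a == b and a.startswith("<") and not a.startswith("<.*{")
        result.append(wrap_plain_text(a[1:-1], True) if shared_fuzzy else a)
    return result

template_a = ["[@Entity_company name]", "<, founder>",
              "{founder: [@Entity_person name]}",
              "<, the largest e-commerce company in China>"]
template_b = ["[@Entity_company name]", "<, founder>",
              "{founder: [@Entity_person name]}",
              "<, a well-known game agency company in China>"]
print(promote_shared(template_a, template_b))
```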

S290, combining the text nodes and the extraction nodes according to the positions of the non-labeled segments and the labeled segments in the input text to obtain an information extraction template, and extracting information of other input texts which are not labeled based on the information extraction template.

According to the technical scheme provided by this embodiment, resource processing such as normalization and backbone extraction is applied to plain text nodes, so that the generated information extraction template has stronger generalization capability.

Example three

Fig. 3 is a block diagram of an information extraction device based on RPA and AI according to a third embodiment of the present invention. As shown in fig. 3, the device includes an annotation text recognition module 310, an extraction node generation module 320, a text node generation module 330 and an information extraction module 340, wherein:

an annotation text recognition module 310 configured to: identifying the marked input text, and determining a marked segment containing marking information and a non-marked segment not containing marking information, wherein the marked segment comprises a marked category and marked content;

an extraction node generation module 320 configured to: determining text information to be extracted according to the marked content, and combining the marked type and the text information to obtain an extraction node; wherein, the representation mode of the identifier corresponding to the text information in the extraction node is determined according to whether an entity exists in the marked content;

a text node generation module 330 configured to: generating a text node according to the key field of the non-labeling segment, wherein the representation mode of the corresponding identifier of the text node is determined according to the importance value of the key field in the non-labeling segment;

an information extraction module 340 configured to: and combining the text nodes and the extraction nodes according to the positions of the non-labeled segments and the labeled segments in the input text to obtain an information extraction template, and extracting information of other input texts which are not labeled based on the information extraction template.

Optionally, the extraction node generating module 320 is specifically configured to:

taking the label type as an extraction type in an extraction node;

if the entity fragment corresponding to the extraction type exists in the marked content, replacing the entity fragment with entity information to be extracted corresponding to the extraction type; the entity information is represented by a first preset identifier to represent that entity extraction is carried out in the information extraction process;

and combining the extraction type and the entity information to be extracted according to a preset connector to obtain an extraction node.

Optionally, the extraction node generating module 320 is specifically configured to:

taking the label type as an extraction type in an extraction node;

if the entity fragment corresponding to the extraction type is not identified in the marked content, determining the length range of the character to be extracted according to the character length of the marked content; the length range of the character to be extracted is represented by a second preset identifier so as to represent that the character of the length range content is extracted in the information extraction process;

and combining the extraction type and the length range according to a preset connector to obtain an extraction node.

Optionally, the text node generating module 330 specifically includes:

a clause segmentation unit configured to: dividing the non-labeling segment into a plurality of clauses according to punctuation marks;

a clause importance determining unit configured to: determining the importance value of the semantic meaning of each clause in the non-labeled segment;

a text node generating unit configured to: for any clause, if the importance value is smaller than a preset threshold value, determining the length range of the character to be extracted according to the length of the text content corresponding to the clause, and generating a text node according to the length range; wherein the length range in the text node is represented by a second preset identifier.

Optionally, the clause importance determining unit is specifically configured to:

determining the score of each clause based on the TextRank text ranking algorithm in natural language processing (NLP), and summing the top K scores to obtain a sum value;

for the score value of any clause, making a quotient of the score value and the sum value, and if the quotient is smaller than a preset threshold, determining that the importance value of the clause is smaller than the preset threshold; and if the quotient value is larger than a preset threshold value, determining that the importance value of the clause is larger than the preset threshold value.

Optionally, the apparatus further comprises:

the entity judging module is configured to identify whether an entity fragment exists in the clause if the importance value is greater than or equal to the preset threshold;

an entity information replacement module configured to replace an entity fragment with corresponding entity information as a text node if the existence of the entity fragment is recognized; the entity information in the text node is represented by a first preset identifier;

the resource processing module is configured to perform resource processing on the clause and take the content after the resource processing as a text node if no entity fragment is identified; wherein the resource processing comprises normalization processing and/or backbone extraction processing in natural language processing (NLP).

Optionally, the input text is a text after being processed by optical character recognition OCR.

The information extraction device based on the RPA and the AI provided by the embodiment of the invention can execute the information extraction method based on the RPA and the AI provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in the above embodiments, reference may be made to the RPA and AI-based information extraction method provided in any embodiment of the present invention.

Example four

Referring to fig. 4, fig. 4 is a schematic structural diagram of a computing device according to a fourth embodiment of the present invention. As shown in fig. 4, the computing device may include:

a memory 701 in which executable program code is stored;

a processor 702 coupled to the memory 701;

the processor 702 calls the executable program code stored in the memory 701 to execute the RPA and AI-based information extraction method according to any embodiment of the present invention.

The embodiment of the invention discloses a computer-readable storage medium which stores a computer program, wherein the computer program enables a computer to execute an information extraction method based on RPA and AI provided by any embodiment of the invention.

In various embodiments of the present invention, it should be understood that the sequence numbers of the above-mentioned processes do not imply an inevitable order of execution, and the order of execution of the processes should be determined by their functions and inherent logic, and should not limit the implementation processes of the embodiments of the present invention.

In the embodiments provided herein, it should be understood that "B corresponding to A" means that B is associated with A and that B can be determined from A. It should also be understood, however, that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a hardware form, and can also be realized in a software functional unit form.

The integrated units, if implemented as software functional units and sold or used as a stand-alone product, may be stored in a computer accessible memory. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a memory and includes several instructions that enable a computer device (which may be a personal computer, a server, a network device, or the like, and may specifically be a processor in the computer device) to execute some or all of the steps of the methods according to the embodiments of the present invention.

It will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware, and the program may be stored in a computer readable storage medium, where the storage medium includes Read-Only Memory (ROM), Random Access Memory (RAM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), One-time Programmable Read-Only Memory (OTPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical disc storage, magnetic disk storage, tape storage, or any other computer readable medium that can be used to carry or store data.

Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.

Those of ordinary skill in the art will understand that: modules in the devices in the embodiments may be distributed in the devices in the embodiments according to the description of the embodiments, or may be located in one or more devices different from the embodiments with corresponding changes. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
