Information extraction method, information extraction device, electronic equipment and medium

Document No.: 7892  Publication date: 2021-09-17

1. An information extraction method, comprising:

determining a target document attribute to be extracted from a target site;

extracting information from the target site according to target position information of the target document attribute in the target site, wherein the target position information is obtained according to a page title of the target site and a document title of a candidate document.

2. The method of claim 1, wherein before extracting information from the target site according to the target position information of the target document attribute in the target site, the method further comprises:

obtaining candidate page titles included by the target site, matching document titles of candidate documents with the candidate page titles, and determining target document titles according to matching results;

and determining a target document corresponding to the target document title from the candidate documents, and determining target position information of the target document attribute in the target site according to a target attribute value of the target document attribute in the target document.

3. The method of claim 2, wherein the determining target position information of the target document attribute in the target site according to a target attribute value of the target document attribute in the target document comprises:

acquiring at least one target page title matched with the target document title, and determining page source codes respectively associated with the target page titles;

matching the target attribute value in each page source code, and determining at least one source code field matched with the target attribute value;

and determining at least one target page node of the target document attribute in the target site as the target position information according to the current page node corresponding to each source code field in the page to which the source code field belongs.

4. The method of claim 3, wherein the determining at least one target page node of the target document attribute in the target site according to the current page node corresponding to each source code field in the page to which the source code field belongs comprises:

if the number of the current page nodes corresponding to any source code field in the page to which the source code field belongs is at least two, determining a combined page node corresponding to the source code field according to the current page node and at least one level of father nodes of the current page node;

and if the number of the combined page node in the page to which the source code field belongs is one, taking the combined page node as the target page node.

5. The method of claim 3, wherein the determining at least one target page node of the target document attribute in the target site according to the current page node corresponding to each source code field in the page to which the source code field belongs comprises:

and if the number of the current page nodes corresponding to any source code field in the page to which the source code field belongs is one, taking the current page nodes corresponding to the source code field as the target page nodes.

6. The method of claim 3, wherein the extracting information from the target site according to the target position information of the target document attribute in the target site comprises:

determining the priority of each target page node according to the total number of each target page node in each page;

and extracting information of the target site according to each target page node and the priority of each target page node.

7. The method of claim 2, wherein matching document titles of candidate documents to the candidate page titles comprises:

under the condition that the document titles of the candidate documents are Chinese titles, matching the document titles with the candidate page titles on the basis of a fuzzy matching algorithm;

and matching the document title with the candidate page title based on a character matching algorithm under the condition that the document title of the candidate document is an English title.

8. The method of claim 1, wherein the target document attribute includes at least one of an abstract, an author, an organization, a year, a journal, a publication number, keywords, or a document identifier.

9. An information extraction apparatus comprising:

a target document attribute determining module, configured to determine a target document attribute to be extracted from a target site;

an information extraction module, configured to extract information from the target site according to target position information of the target document attribute in the target site, wherein the target position information is obtained according to a page title of the target site and a document title of a candidate document.

10. The apparatus according to claim 9, the apparatus further comprising a target document determination module, in particular for:

obtaining candidate page titles included by the target site, matching document titles of candidate documents with the candidate page titles, and determining target document titles according to matching results;

and determining a target document corresponding to the target document title from the candidate documents, and determining target position information of the target document attribute in the target site according to a target attribute value of the target document attribute in the target document.

11. The apparatus of claim 10, wherein the target document determination module is further specifically configured to:

acquiring at least one target page title matched with the target document title, and determining page source codes respectively associated with the target page titles;

matching the target attribute value in each page source code, and determining at least one source code field matched with the target attribute value;

and determining at least one target page node of the target document attribute in the target site as the target position information according to the current page node corresponding to each source code field in the page to which the source code field belongs.

12. The apparatus of claim 11, wherein the target document determination module is further specifically configured to:

if the number of the current page nodes corresponding to any source code field in the page to which the source code field belongs is at least two, determining a combined page node corresponding to the source code field according to the current page node and at least one level of father nodes of the current page node;

and if the number of the combined page node in the page to which the source code field belongs is one, taking the combined page node as the target page node.

13. The apparatus of claim 11, wherein the target document determination module is further specifically configured to:

and if the number of the current page nodes corresponding to any source code field in the page to which the source code field belongs is one, taking the current page nodes corresponding to the source code field as the target page nodes.

14. The apparatus according to claim 11, wherein the information extraction module is specifically configured to:

determining the priority of each target page node according to the total number of each target page node in each page;

and extracting information of the target site according to each target page node and the priority of each target page node.

15. The apparatus of claim 10, wherein the target document determination module is further specifically configured to:

under the condition that the document titles of the candidate documents are Chinese titles, matching the document titles with the candidate page titles on the basis of a fuzzy matching algorithm;

and matching the document title with the candidate page title based on a character matching algorithm under the condition that the document title of the candidate document is an English title.

16. The apparatus of claim 9, wherein the target document attribute comprises at least one of an abstract, an author, an organization, a year, a journal, a publication number, keywords, or a document identifier.

17. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.

18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-8.

19. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.

Background

With the development of network technology, a great deal of webpage information exists in the internet, and how to obtain useful information from the great deal of webpage information becomes a key problem for acquiring the webpage information.

Currently, the extraction of the web page information is usually implemented based on a way of manually setting an information extraction template.

Disclosure of Invention

The present disclosure provides a method, apparatus, electronic device, and medium for extracting document information included in a target site.

According to an aspect of the present disclosure, there is provided an information extraction method including:

determining a target document attribute to be extracted from a target site;

extracting information from the target site according to target position information of the target document attribute in the target site, wherein the target position information is obtained according to a page title of the target site and a document title of a candidate document.

According to another aspect of the present disclosure, there is provided an information extracting apparatus including:

a target document attribute determining module, configured to determine a target document attribute to be extracted from a target site;

an information extraction module, configured to extract information from the target site according to target position information of the target document attribute in the target site, wherein the target position information is obtained according to a page title of the target site and a document title of a candidate document.

According to another aspect of the present disclosure, there is provided an electronic device including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any embodiment of the present disclosure.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any embodiment of the present disclosure.

According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of any embodiment of the present disclosure.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a flow chart of a method of information extraction disclosed in accordance with an embodiment of the present disclosure;

FIG. 2 is a flow chart of a method of information extraction disclosed in accordance with an embodiment of the present disclosure;

FIG. 3 is a flow chart of a method of information extraction disclosed in accordance with an embodiment of the present disclosure;

FIG. 4 is a schematic structural diagram of an information extraction device according to an embodiment of the disclosure;

FIG. 5 is a block diagram of an electronic device for implementing the information extraction method disclosed in the embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

An academic search engine is a search engine that provides relevant academic information to users according to the keywords they input. It is generally constructed by extracting the attribute values of various document attributes from a massive number of academic sites, forming structured data in a uniform format, and then building the search engine on that structured data so as to support complex, advanced academic retrieval. An academic site here refers to an unstructured document site whose data sources are not fixed and which contains multiple languages, varied data content, and different document types; in such a site, the positions at which document attributes appear are not fixed and their values are unpredictable.

As academia continues to develop, the number of academic sites is growing rapidly, which makes it difficult to parse them into unified structured data. The main difficulties are as follows: 1. The number of sites is large; the total can exceed fifty thousand. 2. The structures of different academic sites are not fixed, and even the page structure within a single site is updated irregularly. 3. Different academic sites differ in language. 4. The unified structured data has many document attributes; there can be more than fifty.

At present, document information is commonly extracted from academic sites as follows: rules for parsing document attributes are manually configured for each academic site, a JSON configuration mapping document attributes to XPath paths is established, and the rules are changed promptly whenever the pages of a site change, so as to maintain the recall rate and accuracy of site parsing.
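As an illustration only, the manually configured approach described above might look like the following sketch. The rule set, the sample page, and the function name `parse_with_rules` are all hypothetical, and the page is assumed to be well-formed markup so that the standard-library `xml.etree.ElementTree` (which supports a limited XPath subset) can parse it.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical hand-written rule set for one site: document attribute -> XPath.
SITE_RULES_JSON = """
{
  "title": ".//h1[@class='title']",
  "author": ".//span[@class='author']",
  "year": ".//span[@class='year']"
}
"""

# Hypothetical, well-formed sample page from that site.
PAGE = """
<html><body>
  <h1 class="title">AABBCC</h1>
  <span class="author">Zhang San</span>
  <span class="year">2008</span>
</body></html>
"""


def parse_with_rules(page_source, rules):
    """Apply each attribute's XPath rule and collect the text that is found."""
    root = ET.fromstring(page_source.strip())
    record = {}
    for attribute, xpath in rules.items():
        node = root.find(xpath)
        has_text = node is not None and node.text
        record[attribute] = node.text.strip() if has_text else None
    return record


rules = json.loads(SITE_RULES_JSON)
record = parse_with_rules(PAGE, rules)
print(record)
```

Every site needs its own rule set of this kind, and each rule has to be revised by hand when the page layout changes.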

However, this method has a disadvantage: when the number of sites increases sharply, the labor cost increases sharply as well, and the information extraction efficiency drops. To address this, the industry also extracts information with general parsing rules, that is, a set of general parsing rules is configured in advance through manual labeling and data training so as to fit a massive number of academic sites. Although this approach saves some labor cost, the overall extraction accuracy is difficult to guarantee, and so it has not been widely adopted.

Therefore, how to efficiently extract the literature information in the academic site becomes a problem which needs to be solved urgently.

Fig. 1 is a flowchart of an information extraction method disclosed according to an embodiment of the present disclosure, which may be applied to a case of extracting document information included in a target site. The method of the present embodiment may be performed by the information extraction apparatus disclosed in the embodiments of the present disclosure, and the apparatus may be implemented by software and/or hardware, and may be integrated on any electronic device with computing capability.

As shown in fig. 1, the information extraction method disclosed in this embodiment may include:

S101, determining the target document attribute to be extracted from the target site.

The target site refers to a site containing document information. It may be a professional academic site, such as a patent search website, a thesis database, an electronic library, or an electronic journal website, or it may be a comprehensive site that includes document information, such as a general information search website; this embodiment does not limit the specific form of the target site. The target document attribute refers to the document attribute to which the document information to be extracted from the target site belongs; for example, the target document attribute may be an abstract, an author, an organization, a journal, and the like.

In one embodiment, a target site matching a site identifier is determined from candidate sites according to the site identifier included in an information extraction instruction sent by a user. The site identifier may be information unique to the site, such as the URL (Uniform Resource Locator) of the site, the IP (Internet Protocol) address of the site, or TCP (Transmission Control Protocol) information of the site; it may also be manually assigned identification information, such as a name or a number. This embodiment does not limit the specific form of the site identifier, and any identification information that can distinguish one site from another falls within the protection scope of this embodiment. After the target site is determined, the target document attribute matching a document attribute identifier is determined, according to the document attribute identifier contained in the information extraction instruction, from the candidate document attributes corresponding to the target site. The document attribute name may be used directly as the document attribute identifier, or a manually assigned number may be used. There may be one or more target document attributes; the upper limit is the number of candidate document attributes of the target site, in which case all candidate document attributes are used as target document attributes. The specific number of target document attributes is determined by the actual needs of the user.

In another embodiment, according to a site identifier included in an information extraction instruction sent by a user, a target site matched with the site identifier is determined from candidate sites, and all candidate document attributes corresponding to the target site are directly used as target document attributes, so that the diversity of information extraction is improved, intermediate steps are reduced, and the information extraction efficiency is improved.

By determining the target document attribute to be extracted from the target site, a data foundation is laid for the subsequent information extraction from the target site.
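The two embodiments of S101 can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the instruction is modeled as a plain dictionary, and the keys `site_id` and `attribute_ids`, the table `CANDIDATE_SITES`, and the function `resolve_targets` are hypothetical names that do not come from the disclosure.

```python
# Hypothetical registry: site identifier (here a URL) -> candidate document attributes.
CANDIDATE_SITES = {
    "https://journal.example.org": ["abstract", "author", "organization", "year"],
    "https://library.example.com": ["author", "journal", "keywords"],
}


def resolve_targets(instruction):
    """Return (target site, target document attributes) for an extraction instruction."""
    site_id = instruction["site_id"]
    if site_id not in CANDIDATE_SITES:
        raise KeyError("unknown site identifier: %s" % site_id)
    candidates = CANDIDATE_SITES[site_id]

    requested = instruction.get("attribute_ids")
    if not requested:
        # Second embodiment: every candidate attribute becomes a target attribute.
        return site_id, list(candidates)
    # First embodiment: keep only the requested attributes the site actually offers.
    return site_id, [name for name in requested if name in candidates]


print(resolve_targets({"site_id": "https://journal.example.org",
                       "attribute_ids": ["author", "year"]}))
print(resolve_targets({"site_id": "https://library.example.com"}))
```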

Optionally, the target document attribute includes at least one of an abstract, an author, an organization, a year, a journal, a publication number, a keyword, or a document identifier.

Here, the abstract refers to a summary of the content of the document; the author is the name of the person who wrote the document; the organization refers to the name of the institution to which the author belongs; the year refers to the year in which the document was published; the journal refers to the name of the publication in which the document was published; the publication number refers to the publication number at the time the document was published; the keywords refer to keywords extracted from the content of the document; and the document identifier, that is, the document DOI (Digital Object Identifier), is a unique identifier of a digital object and is used to distinguish different documents.

At least one of the abstract, the author, the organization, the year, the journal, the publication number, the keywords or the document identification is set as the target document attribute, so that the diversity of the target document attribute is expanded, the extraction requirements of users on document information of different target document attributes are met, and the user experience is improved.

S102, extracting information from the target site according to target position information of the target document attribute in the target site, wherein the target position information is obtained according to the page title of the target site and the document title of the candidate document.

In this embodiment, different document attributes in the target site correspond to different pieces of location information, that is, document information corresponding to any document attribute can be extracted from the target site according to the location information of the document attribute. For example, if the target location information of the target document attribute "abstract" in the target site is a, the abstract information corresponding to the "abstract" can be extracted from the location a; for another example, if the target location information of the target document attribute "author" in the target site is B, the author information corresponding to the "author" can be extracted from the location B.

In an embodiment, a preset number of candidate documents are obtained from a preset number of large authoritative academic sites. The preset numbers of authoritative academic sites and of candidate documents can each be set according to actual business requirements; it can be understood that the larger the two preset numbers are, the richer the types of candidate documents and the higher the final information extraction accuracy. After the candidate documents are obtained, they are parsed with a preset parsing method, for example a manually configured parsing method, to determine the attribute values of the document attributes included in each candidate document, and the parsed attribute values are format-processed to form uniform structured data for subsequent retrieval and use.

The pages of the target site are crawled with crawler technology to obtain the page titles included in the target site, where a page title may be the title field under a Meta tag or a Head tag in each page of the target site. The obtained page titles of the target site are matched against the document titles among the previously obtained attributes of the candidate documents, and the successfully matched target page title and target document title are determined according to the matching result. Further, the target attribute value corresponding to the target document attribute is determined from the corresponding candidate document based on the target document title. For example, if the target document title is "AABBCC" and the target document attribute is "year", the "year" information of the candidate document whose document title is "AABBCC" is taken as the target attribute value.
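A minimal sketch of this step is given below, assuming that the page title is the text of the `<title>` element and that titles are compared by simple equality. The candidate documents and the names `TitleParser`, `page_title`, and `target_attribute_value` are hypothetical; crawling itself is omitted and the page source is passed in as a string.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_title(page_source):
    parser = TitleParser()
    parser.feed(page_source)
    return parser.title.strip()


# Hypothetical candidate documents already parsed into structured data.
CANDIDATE_DOCUMENTS = [
    {"title": "AABBCC", "year": "2008", "author": "Zhang San"},
    {"title": "DDEEFF", "year": "2015", "author": "Li Si"},
]


def target_attribute_value(page_source, attribute):
    """Find the candidate document whose title equals the page title and
    return its value for the requested attribute (None if nothing matches)."""
    title = page_title(page_source)
    for document in CANDIDATE_DOCUMENTS:
        if document["title"] == title:
            return document[attribute]
    return None


SAMPLE_PAGE = "<html><head><title> AABBCC </title></head><body></body></html>"
print(target_attribute_value(SAMPLE_PAGE, "year"))
```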

Character matching is performed on the obtained target attribute value in the page source code associated with the target page title, a source code field matching the target attribute value is determined, and the position information corresponding to that source code field is taken as the target position information of the target document attribute in the target site. For example, if the target document attribute is "author" and the corresponding target attribute value is "Zhang San", character matching is performed on "Zhang San" in the page source code associated with the target page title; if the position information at which "Zhang San" is matched in the page source code is A, position A is taken as the target position information of the target document attribute "author" in the target site.
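One possible way to turn a matched source code field into position information is sketched below. The representation is an assumption rather than something the disclosure fixes: the position is expressed as a path of tag names annotated with `class` attributes, the page is assumed to be well-formed so that `xml.etree.ElementTree` can parse it, and the function name `locate_value` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page source associated with the target page title.
PAGE_SOURCE = (
    '<html><body><div class="meta">'
    '<span class="author">Zhang San</span>'
    '<span class="year">2008</span>'
    "</div></body></html>"
)


def locate_value(page_source, value):
    """Return the path of every element whose text equals the attribute value."""
    root = ET.fromstring(page_source)
    matches = []

    def walk(node, parent_path):
        step = node.tag
        css_class = node.get("class")
        if css_class:
            step += "[@class='%s']" % css_class
        current_path = parent_path + "/" + step
        if node.text and node.text.strip() == value:
            matches.append(current_path)
        for child in node:
            walk(child, current_path)

    walk(root, "")
    return matches


print(locate_value(PAGE_SOURCE, "Zhang San"))
```

In this sketch the returned path plays the role of position A in the example above.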

After the document data in the target site is updated, the document information corresponding to the target document attribute can be extracted directly according to the target position information of the target document attribute in the target site. For example, assuming that the target position information of the target document attribute "author" in the target site is a, the author information corresponding to the target document attribute "author" is directly extracted from the position a of the target site page source code; for another example, assuming that the target location information of the target document attribute "abstract" in the target site is B, the abstract information corresponding to the target document attribute "abstract" is directly extracted from the location B of the target site page source code.

In this embodiment, the target document attribute to be extracted from the target site is determined, and information is extracted from the target site according to the target position information of the target document attribute in the target site. Because the target position information is obtained according to the page title of the target site and the document title of the candidate document, parsing rules for the target site do not need to be configured manually, which reduces the labor cost of information extraction and improves the information extraction efficiency.

Fig. 2 is a flowchart of an information extraction method disclosed according to an embodiment of the present disclosure, which is further optimized and expanded based on the above technical solution, and can be combined with the above optional embodiments.

As shown in fig. 2, the information extraction method disclosed in this embodiment may include:

S201, determining the target document attribute to be extracted from the target site.

S202, obtaining candidate page titles included by the target site, matching document titles of candidate documents with the candidate page titles, and determining target document titles according to matching results.

In one embodiment, candidate page titles included in the target site are crawled and document titles of the candidate documents are matched with the candidate page titles, thereby determining target page titles and target document titles that match each other.

Optionally, "matching the document title of the candidate document with the candidate page title" in S202 includes the following two cases A and B:

A. In the case that the document title of the candidate document is a Chinese title, the document title is matched with the candidate page title based on a fuzzy matching algorithm.

The fuzzy matching algorithm is a matching algorithm for realizing fuzzy matching between texts, such as a K-medoids clustering method, a Levenshtein edit distance method or a character similarity method.

In one embodiment, in the case that the document title of the candidate document is a Chinese title, the Chinese title and the candidate page title are both subjected to word segmentation processing, and the document title subjected to word segmentation is matched with the page title based on a fuzzy matching algorithm.

B. In the case that the document title of the candidate document is an English title, the document title is matched with the candidate page title based on a character matching algorithm.

The character matching algorithm is a matching algorithm for completely matching characters between texts, namely, only two texts with the same characters can be successfully matched.

In one embodiment, in the case that the document title of the candidate document is an English title, the English title is matched with the candidate page title based on a character matching algorithm.

Because the Chinese title of the same document may be expressed differently on different sites, matching the document title with the candidate page titles based on a fuzzy matching algorithm when the document title of the candidate document is a Chinese title avoids the low matching success rate caused by non-standard Chinese titles. Because English titles are normally more standardized across sites, matching the document title with the candidate page titles based on a character matching algorithm when the document title of the candidate document is an English title improves matching speed and efficiency.
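Cases A and B can be sketched as below. Several choices here are assumptions made for illustration: a title is treated as Chinese if it contains any character in the CJK Unified Ideographs range, the character similarity is computed with the standard-library `difflib.SequenceMatcher`, the threshold of 0.8 is arbitrary, and the word segmentation step mentioned above is omitted. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def is_chinese(title):
    """Treat a title as Chinese if it contains any CJK unified ideograph."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in title)


def titles_match(document_title, page_title, threshold=0.8):
    """Fuzzy matching for Chinese titles, exact character matching otherwise."""
    if is_chinese(document_title):
        similarity = SequenceMatcher(None, document_title, page_title).ratio()
        return similarity >= threshold
    return document_title == page_title


# Chinese title with a small wording difference: matched by similarity.
print(titles_match("基于深度学习的信息抽取方法", "基于深度学习的信息抽取方法研究"))
# English titles must be identical character by character.
print(titles_match("Deep Learning", "Deep learning"))
```

An edit-distance measure such as Levenshtein distance could be substituted for the similarity ratio without changing the structure of the sketch.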

S203, determining a target document corresponding to the target document title from the candidate documents, and determining target position information of the target document attribute in the target site according to a target attribute value of the target document attribute in the target document.

In one embodiment, a target page title and a target document title which are matched with each other are determined according to a matching result, a target document corresponding to the target document title and a target attribute value of a target document attribute in the target document are determined from candidate documents, character matching is carried out on the target attribute value in a page source code associated with the target page title, a source code field matched with the target attribute value is determined, and position information corresponding to the source code field is used as target position information of the target document attribute in a target site.

S204, extracting information from the target site according to the target position information of the target document attribute in the target site.

In one embodiment, after information is extracted from the target site, the extracted information can be cleaned and converted into a unified format for output, generating structured data so that the extracted information can be reused later, for example to construct an academic search engine. In addition, the extracted information can be sampled and evaluated periodically to ensure the reliability and accuracy of the extraction, so that the extraction can be corrected in time when the accuracy or reliability of the extracted information is low.
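The cleaning and format conversion step might be sketched as follows. The concrete rules (collapsing whitespace, reducing the year to a four-digit integer, splitting authors on commas or semicolons) and the function name `clean_record` are assumptions chosen for illustration; the disclosure does not specify the unified format.

```python
import re


def clean_record(raw):
    """Normalize raw extracted strings into a uniform structured record."""
    record = {}
    for key, value in raw.items():
        # Collapse runs of whitespace, including line breaks, into single spaces.
        record[key] = re.sub(r"\s+", " ", value).strip()

    if "year" in record:
        # Keep only the first four-digit number as an integer year.
        found = re.search(r"\d{4}", record["year"])
        record["year"] = int(found.group()) if found else None

    if "author" in record:
        # Split a delimited author string into a list of names.
        parts = re.split(r"[;,]", record["author"])
        record["author"] = [name.strip() for name in parts if name.strip()]

    return record


raw = {
    "author": " Zhang San ;  Li Si",
    "year": "Published 2008 ",
    "abstract": "An\n  example.",
}
print(clean_record(raw))
```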

Because the document title of a site normally appears in the page title, the candidate page titles included in the target site are obtained, the document titles of the candidate documents are matched with the candidate page titles, and the target document title is determined according to the matching result; the target document corresponding to the target document title is then determined from the candidate documents, and the target position information of the target document attribute in the target site is determined according to the target attribute value of the target document attribute in the target document. The candidate page titles and the document titles of the candidate documents thus serve as anchor points for determining the target document title that matches the target page title, which lays the groundwork for subsequently determining the target position information from the target attribute value of the target document attribute and ensures that the method proceeds smoothly.

Fig. 3 is a flowchart of an information extraction method disclosed according to an embodiment of the present disclosure, which is further optimized and expanded based on the above technical solution, and can be combined with the above optional embodiments.

As shown in fig. 3, the information extraction method disclosed in this embodiment may include:

S301, determining the target document attribute to be extracted by the target site.

S302, obtaining candidate page titles included by the target site, matching document titles of candidate documents with the candidate page titles, and determining target document titles according to matching results.

S303, obtaining at least one target page title matched with the target document title, and determining page source codes respectively associated with the target page titles.

Each page title in the site corresponds to the text content of a page; that is, each page title is associated with a specific page source code, namely the HTML of the page.

In one embodiment, according to the matching result in S302, the target document title and at least one target page title matching the target document title can be determined, and further, according to the predetermined association relationship between the page title and the page source code, the page source code associated with each target page title is determined.

S304, determining a target document corresponding to the target document title from the candidate documents, matching the target attribute value of the target document attribute in the target document in each page source code, and determining at least one source code field matched with the target attribute value.

For example, assume that the target document title is "AABBCC" and the target document attribute is "year". The document titled "AABBCC" is taken as the target document. Assuming that the value of "year" in the target document is "2008", "2008" is used as the matching string, and character matching is performed in the page source codes respectively associated with the target page titles. If any page source code contains the source code field "2008", that source code field is regarded as successfully matched.
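The character matching in this example can be sketched as a plain substring search over each page source code. The function name `find_matching_pages`, the page titles, and the HTML snippets below are hypothetical and serve only to illustrate the idea.

```python
def find_matching_pages(pages, attribute_value):
    """Return, per page title, the character offsets at which the value occurs.

    `pages` maps a page title to its page source code (both illustrative).
    """
    hits = {}
    for title, source in pages.items():
        offsets = []
        start = source.find(attribute_value)
        while start != -1:
            offsets.append(start)
            start = source.find(attribute_value, start + 1)
        if offsets:
            hits[title] = offsets
    return hits


pages = {
    "AABBCC - Example Journal": "<div class='meta'><span class='year'>2008</span></div>",
    "AABBCC | Mirror": "<p>No date given</p>",
}
matches = find_matching_pages(pages, "2008")
print(list(matches))  # only the first page contains "2008"
```

Each recorded offset identifies one source code field that matched the target attribute value; pages without a match are simply left out of the result.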

S305, determining at least one target page node of the target document attribute in the target site as the target position information according to the current page node corresponding to each source code field in the page to which the source code field belongs.

Each source code field corresponds to a unique page node in a page to which the source code field belongs, namely the corresponding unique page node can be determined according to any source code field.

In one embodiment, the current page node corresponding to each source code field matched with the target attribute value, in the page to which that source code field belongs, is determined according to the association between source code fields and page nodes. For example, if the page node corresponding to the source code field "2008" in the page to which it belongs is page node A, page node A is taken as the current page node corresponding to the source code field "2008".

And further determining at least one target page node of the target document attribute in the target site from the determined current page nodes as target position information of the target document attribute in the target site.
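Mapping a matched source code field to its page node can be illustrated as follows. This is a sketch under two assumptions that are not part of the disclosure: the page source is well-formed enough to be parsed by Python's `xml.etree.ElementTree` (real HTML often is not and would need an HTML parser), and a page node is identified by its tag name and `class` attribute.

```python
import xml.etree.ElementTree as ET


def current_page_nodes(page_source, attribute_value):
    """Return (tag, class) for every element whose own text contains the value."""
    root = ET.fromstring(page_source)
    return [
        (el.tag, el.get("class"))
        for el in root.iter()
        if el.text is not None and attribute_value in el.text
    ]


source = "<html><body><h1>AABBCC</h1><span class='year'>2008</span></body></html>"
nodes = current_page_nodes(source, "2008")
print(nodes)  # [('span', 'year')]
```

Here the `span` element with class `year` plays the role of page node A in the example above.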

Optionally, S305 includes:

if the number of the current page nodes corresponding to any source code field in the page to which the source code field belongs is at least two, determining a combined page node corresponding to the source code field according to the current page node and at least one level of parent nodes of the current page node; and if the number of the combined page node in the page to which the source code field belongs is one, taking the combined page node as the target page node.

A page in the site may use the same page node repeatedly, so that a page can contain multiple identical nodes.

In one embodiment, the current page node corresponding to any source code field is determined, and the number of occurrences of that current page node in the page to which the source code field belongs is counted. If the number is at least two, that is, the current page node is not unique in the page, the current page node and its parent node are taken together as a combined page node, and the number of occurrences of the combined page node in the page is counted. If that number is still at least two, the current page node, its parent node, and the parent node of that parent node are taken together as the combined page node, and the count is repeated. This continues until the combined page node occurs exactly once in the page, at which point the combined page node is taken as the target page node and used as the target position information.

For example, assume that the current page node corresponding to a source code field is node 1, the parent node of node 1 is node 2, the parent node of node 2 is node 3, and the page to which the current page node belongs is page N. The number of occurrences of node 1 in page N is determined; if it is at least two, the number of occurrences of "node 1 + node 2" in page N is determined; if that is also at least two, the number of occurrences of "node 1 + node 2 + node 3" is determined; and if that number is one, the combined page node "node 1 + node 2 + node 3" is taken as the target page node and used as the target position information.
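The procedure of adding one parent level at a time until the combined page node is unique can be sketched as below. The sketch makes assumptions the disclosure does not state: a node is identified by a `tag.class` signature, the page is parsed with `xml.etree.ElementTree` (so it must be well-formed), and `None` is returned if uniqueness is never reached. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def signature(el):
    """Identify a node by tag and class (an assumed notion of 'same node')."""
    cls = el.get("class")
    return f"{el.tag}.{cls}" if cls else el.tag


def chain(el, parents, depth):
    """Signatures of el and up to `depth` ancestors, outermost first."""
    parts = [signature(el)]
    cur = el
    for _ in range(depth):
        cur = parents.get(cur)
        if cur is None:
            break
        parts.append(signature(cur))
    return tuple(reversed(parts))


def unique_combined_node(root, target):
    """Add parent levels until the combined node occurs exactly once in the page."""
    parents = {child: parent for parent in root.iter() for child in parent}
    depth = 0
    while True:
        wanted = chain(target, parents, depth)
        count = sum(1 for el in root.iter() if chain(el, parents, depth) == wanted)
        if count == 1:
            return wanted
        if len(wanted) <= depth:  # no ancestors left to add
            return None
        depth += 1


page = ET.fromstring(
    "<html><body>"
    "<div class='header'><span class='year'>2021</span></div>"
    "<div class='meta'><span class='year'>2008</span></div>"
    "</body></html>"
)
target = next(el for el in page.iter("span") if el.text == "2008")
print(unique_combined_node(page, target))  # ('div.meta', 'span.year')
```

In this page `span.year` occurs twice, so it is not unique on its own; combining it with its parent `div.meta` yields a combination that occurs once, which corresponds to "node 1 + node 2" in the example above.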

If the number of current page nodes corresponding to any source code field in the page to which the source code field belongs is at least two, a combined page node corresponding to the source code field is determined according to the current page node and at least one level of parent nodes of the current page node, and if the combined page node occurs once in that page, the combined page node is taken as the target page node. This ensures that each target page node is unique in the page to which the source code field belongs, that is, that the obtained target position information is unique in that page, and avoids the problem of several different pieces of information being extracted later according to the same target position information.

Optionally, S305 includes:

and if the number of the current page nodes corresponding to any source code field in the page to which the source code field belongs is one, taking the current page nodes corresponding to the source code field as the target page nodes.

In one embodiment, the current page node corresponding to any source code field is determined, the number of the current page nodes in the page to which the source code field belongs is determined, and if the number of the current page nodes is one, the current page nodes are directly used as target page nodes and used as target position information.

For example, assume that the current page node corresponding to a source code field is node 1 and the page to which it belongs is page N. The number of occurrences of node 1 in page N is determined, and if it is one, node 1 is taken as the target page node and used as the target position information.

If the number of current page nodes corresponding to any source code field in the page to which the source code field belongs is one, the current page node corresponding to the source code field is taken as the target page node. This ensures that each target page node is unique in the page to which the source code field belongs, that is, that the obtained target position information is unique in that page, and avoids the problem of several different pieces of information being extracted later according to the same target position information.

S306, extracting information from the target site according to the target position information of the target document attribute in the target site.

Optionally, S306 includes:

determining the priority of each target page node according to the total number of each target page node in each page; and extracting information of the target site according to each target page node and the priority of each target page node.

For example, assume that target page node 1 appears once in each of pages a, b, c, d, and e, for a total of 5, and that target page node 2 appears once in each of pages a, c, and d, for a total of 3. This indicates that the target document attribute is more likely to appear in target page node 1 than in target page node 2, so information extraction is first performed at target page node 1 of the target site, and if that extraction fails, information extraction is then performed at target page node 2.
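The priority ordering in this example can be sketched by counting, for each target page node, the pages on which it was found and then trying the nodes in descending order of that count. The names `rank_nodes`, `extract_with_priority`, and `fake_extract` are hypothetical, and the extraction itself is stubbed out.

```python
from collections import Counter


def rank_nodes(node_hits):
    """node_hits: (page, node) pairs; returns nodes ordered by descending total."""
    totals = Counter(node for _page, node in node_hits)
    return [node for node, _count in totals.most_common()]


def extract_with_priority(ranked_nodes, extract):
    """Try each node in priority order and return the first successful result."""
    for node in ranked_nodes:
        value = extract(node)
        if value is not None:
            return value
    return None


hits = [(p, "node 1") for p in "abcde"] + [(p, "node 2") for p in "acd"]
ranked = rank_nodes(hits)
print(ranked)  # ['node 1', 'node 2']


def fake_extract(node):
    # Stand-in extractor: node 1 fails on this page, node 2 succeeds.
    return {"node 2": "2008"}.get(node)


print(extract_with_priority(ranked, fake_extract))  # 2008
```

Trying the most frequent node first means that, on most pages, the first attempt succeeds and the fallback nodes are never visited.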

The priority of each target page node is determined according to the total number of occurrences of that target page node across the pages, and information is extracted from the target site according to the target page nodes and their priorities. Extraction therefore starts from the target page node in which the target document attribute is most likely to appear, which improves the efficiency of information extraction.

In this embodiment, at least one target page title matched with the target document title is obtained, and the page source codes respectively associated with the target page titles are determined; the target attribute value is matched in each page source code to determine at least one source code field matched with the target attribute value; and at least one target page node of the target document attribute in the target site is determined, according to the current page node corresponding to each source code field in the page to which it belongs, and used as the target position information.

Fig. 4 is a schematic structural diagram of an information extraction device disclosed according to an embodiment of the present disclosure, which can be applied to a case of extracting document information included in a target site. The device of the embodiment can be implemented by software and/or hardware, and can be integrated on any electronic equipment with computing capability.

As shown in fig. 4, the information extraction apparatus 40 disclosed in the present embodiment may include a target document attribute determination module 41 and an information extraction module 42, in which:

a target document attribute determining module 41, configured to determine a target document attribute to be extracted by a target site;

the information extraction module 42 is configured to extract information of the target site according to target location information of the target document attribute in the target site; and the target position information is obtained according to the page title of the target site and the document title of the candidate document.

Optionally, the apparatus further includes a target document determination module, specifically configured to:

obtaining candidate page titles included by the target site, matching document titles of candidate documents with the candidate page titles, and determining target document titles according to matching results;

and determining a target document corresponding to the target document title from the candidate documents, and determining target position information of the target document attribute in the target site according to a target attribute value of the target document attribute in the target document.

Optionally, the target document determining module is further specifically configured to:

acquiring at least one target page title matched with the target document title, and determining page source codes respectively associated with the target page titles;

matching the target attribute value in each page source code, and determining at least one source code field matched with the target attribute value;

and determining at least one target page node of the target document attribute in the target site as the target position information according to the current page node corresponding to each source code field in the page to which the source code field belongs.

Optionally, the target document determining module is further specifically configured to:

if the number of the current page nodes corresponding to any source code field in the page to which the source code field belongs is at least two, determining a combined page node corresponding to the source code field according to the current page node and at least one level of parent nodes of the current page node;

and if the number of the combined page node in the page to which the source code field belongs is one, taking the combined page node as the target page node.

Optionally, the target document determining module is further specifically configured to:

and if the number of the current page nodes corresponding to any source code field in the page to which the source code field belongs is one, taking the current page nodes corresponding to the source code field as the target page nodes.

Optionally, the information extracting module 42 is specifically configured to:

determining the priority of each target page node according to the total number of each target page node in each page;

and extracting information of the target site according to each target page node and the priority of each target page node.

Optionally, the target document determining module is further specifically configured to:

under the condition that the document titles of the candidate documents are Chinese titles, matching the document titles with the candidate page titles on the basis of a fuzzy matching algorithm;

and matching the document title with the candidate page title based on a character matching algorithm under the condition that the document title of the candidate document is an English title.
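The two title-matching branches above could be sketched as follows. The disclosure does not name a specific fuzzy matching algorithm, so this sketch assumes `difflib.SequenceMatcher` from the Python standard library with an illustrative threshold of 0.8, assumes that a title counts as Chinese if it contains any character in the CJK Unified Ideographs range, and assumes that character matching for English titles means a case-insensitive substring test.

```python
from difflib import SequenceMatcher


def is_chinese(title):
    """Treat a title as Chinese if it contains any CJK unified ideograph."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in title)


def titles_match(doc_title, page_title, threshold=0.8):
    if is_chinese(doc_title):
        # Fuzzy matching: tolerate small differences between the two titles.
        return SequenceMatcher(None, doc_title, page_title).ratio() >= threshold
    # Character matching: the document title must appear in the page title.
    return doc_title.casefold() in page_title.casefold()


print(titles_match("信息抽取方法研究", "信息抽取方法的研究"))   # True
print(titles_match("AABBCC", "aabbcc | Example Journal"))  # True
```

A fuzzy comparison suits Chinese titles because page titles may insert or drop individual characters, while English titles in this sketch are required to appear verbatim apart from letter case.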

Optionally, the target document attribute includes at least one of an abstract, an author, an organization, a year, a journal, a publication, a keyword, or a document identifier.

The information extraction device 40 disclosed in the embodiment of the present disclosure can execute the information extraction method disclosed in the embodiment of the present disclosure, and has functional modules and beneficial effects corresponding to the execution method. Reference may be made to the description of any method embodiment of the disclosure for a matter not explicitly described in this embodiment.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in FIG. 5, the device 500 comprises a computing unit 501, which may perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. The RAM 503 can also store various programs and data required for the operation of the device 500. The computing unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.

A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

The computing unit 501 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 501 performs the methods and processes described above, such as the information extraction method. For example, in some embodiments, the information extraction method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the information extraction method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the information extraction method by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SoCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the high management difficulty and weak service scalability of traditional physical hosts and virtual private server (VPS) services.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.
