Big-data-based policy acquisition, cleaning and automatic accurate pushing method

Document No.: 7807 Published: 2021-09-17

1. A big-data-based policy acquisition, cleaning and automatic accurate pushing method, characterized by comprising the following steps:

a. extracting keywords from the content of previously issued government policies and constructing a keyword set, wherein the keyword set comprises a topic keyword set and a constraint keyword set;

b. adding relevant government websites to an initial URL seed set;

c. forming a topic crawler using the keywords in the topic keyword set, crawling web pages starting from the initial URL seed set, analyzing the topic relevance of each crawled web page, and storing the topic-relevant web pages in a database;

d. analyzing the text content of the web pages in the database, and extracting the sentences that contain keywords from the constraint keyword set;

e. analyzing the extracted sentences according to the constraint keyword set to obtain a key constraint attribute set;

f. constructing, according to the constraint keyword set, an enterprise recommendation attribute set containing enterprise information;

g. comparing the obtained key constraint attribute set with the enterprise recommendation attribute set item by item, and pushing the web page containing the key constraint attribute set to the enterprise when all attributes in the enterprise recommendation attribute set match all constraint attributes in the key constraint attribute set.

2. The big-data-based policy acquisition, cleaning and automatic accurate pushing method according to claim 1, wherein in step c, when the web crawler crawls the web pages in the URL seed set, it first searches all links in a seed web page, then searches all links in the next layer, and only after that layer has been searched proceeds to the layer below, until the lowest layer is reached.

3. The big-data-based policy acquisition, cleaning and automatic accurate pushing method according to claim 1, wherein step d comprises the following steps:

d1. first selecting a suitable separator to split the text into the sentence set $P = \{S_1, \ldots, S_N\}$, where N is the total number of sentences;

d2. then segmenting each sentence into words to obtain the word set $S_i = \{w_{i1}, \ldots, w_{im}\}$ of the current sentence, where m is the total number of words in the word set of the current sentence;

d3. extracting the sentences containing keywords according to the formula:

$\mathrm{Ass}(k, S_i) = \left|\{\, w_k \mid w_k \in S_i \wedge w_k \in k \,\}\right|$

where $\mathrm{Ass}(k, S_i)$ is the degree of association between the keywords k and the current sentence $S_i$.

4. The big-data-based policy acquisition, cleaning and automatic accurate pushing method according to claim 1, wherein the key constraint attribute set and the enterprise recommendation attribute set each comprise an attribute name and an attribute value.

5. The big-data-based policy acquisition, cleaning and automatic accurate pushing method according to claim 1, wherein in step g, the field attributes of the enterprise in the key constraint attribute set and the enterprise recommendation attribute set are determined according to the Dewey Decimal Classification.

6. The big-data-based policy acquisition, cleaning and automatic accurate pushing method according to claim 5, characterized in that the classification numbers of the keywords in the constraint keyword set and in the enterprise recommendation attribute set are looked up according to the Dewey Decimal Classification; then, with the length of the Dewey Decimal classification number as the X axis and the classification number as the Y axis, the classification numbers corresponding to the keywords in the key constraint attribute set and the enterprise recommendation attribute set are plotted as points on a two-dimensional coordinate plane; if the points formed by the keywords in the enterprise recommendation attribute set are near or overlap the keyword points in the key constraint attribute set, the field attribute match is judged successful, and if the points are far apart, the field attribute match is judged unsuccessful.

Background

To develop, an enterprise must have a broad understanding of the policy requirements of the state, local governments and industry associations. Only then can it comply with national laws and regulations, understand the operating rules of its industry, make full use of policy dividends, grow bigger and stronger, improve its market competitiveness, and have its development safeguarded.

With the formal implementation of the "Regulations on Optimizing the Business Environment", government departments are clearly required to keep improving policy measures for policy services and to deliver policies that benefit enterprises and talents. Although preferential policies are numerous, they are scattered, their declaration conditions differ, and information is asymmetric. As a result, many enterprises and talents miss good policies and cannot obtain real support, the policies exist in name only, and enterprises and talents are left disappointed. How to complete the "last mile" of policy service and truly release policy dividends from massive data, so that more enterprises and talents benefit, enterprises are relieved of their worries, and talents can concentrate on innovation and entrepreneurship, has become an important research direction in the field of information technology.

Disclosure of Invention

To address the technical problem of pushing policies accurately, the invention provides a big-data-based policy acquisition, cleaning and automatic accurate pushing method that is reasonable in design, simple, convenient to operate, and able to push policies accurately to the corresponding enterprises.

To achieve the above purpose, the technical solution adopted by the invention is a big-data-based policy acquisition, cleaning and automatic accurate pushing method comprising the following steps:

a. extracting keywords from the content of previously issued government policies and constructing a keyword set, wherein the keyword set comprises a topic keyword set and a constraint keyword set;

b. adding relevant government websites to an initial URL seed set;

c. forming a topic crawler using the keywords in the topic keyword set, crawling web pages starting from the initial URL seed set, analyzing the topic relevance of each crawled web page, and storing the topic-relevant web pages in a database;

d. analyzing the text content of the web pages in the database, and extracting the sentences that contain keywords from the constraint keyword set;

e. analyzing the extracted sentences according to the constraint keyword set to obtain a key constraint attribute set;

f. constructing, according to the constraint keyword set, an enterprise recommendation attribute set containing enterprise information;

g. comparing the obtained key constraint attribute set with the enterprise recommendation attribute set item by item, and pushing the web page containing the key constraint attribute set to the enterprise when all attributes in the enterprise recommendation attribute set match all constraint attributes in the key constraint attribute set.

Preferably, in step c, when the web crawler crawls the web pages in the URL seed set, it first searches all links in a seed web page, then searches all links in the next layer, and only after that layer has been searched proceeds to the layer below, until the lowest layer is reached.

Preferably, step d comprises the following steps:

d1. first selecting a suitable separator to split the text into the sentence set $P = \{S_1, \ldots, S_N\}$, where N is the total number of sentences;

d2. then segmenting each sentence into words to obtain the word set $S_i = \{w_{i1}, \ldots, w_{im}\}$ of the current sentence, where m is the total number of words in the word set of the current sentence;

d3. extracting the sentences containing keywords according to the formula:

$\mathrm{Ass}(k, S_i) = \left|\{\, w_k \mid w_k \in S_i \wedge w_k \in k \,\}\right|$

where $\mathrm{Ass}(k, S_i)$ is the degree of association between the keywords k and the current sentence $S_i$.

Preferably, the key constraint attribute set and the enterprise recommendation attribute set both include attribute names and attribute values.

Preferably, in step g, the field attributes of the enterprise in the key constraint attribute set and the enterprise recommendation attribute set are determined according to the Dewey Decimal Classification.

Preferably, the classification numbers of the keywords in the constraint keyword set and in the enterprise recommendation attribute set are looked up according to the Dewey Decimal Classification; then, with the length of the Dewey Decimal classification number as the X axis and the classification number as the Y axis, the classification numbers corresponding to the keywords in the key constraint attribute set and the enterprise recommendation attribute set are plotted as points on a two-dimensional coordinate plane; if the points formed by the keywords in the enterprise recommendation attribute set are near or overlap the keyword points in the key constraint attribute set, the field attribute match is judged successful, and if the points are far apart, the field attribute match is judged unsuccessful.

Compared with the prior art, the invention has the following advantages and positive effects:

1. The invention provides a big-data-based policy acquisition, cleaning and automatic accurate pushing method in which keywords are set according to the characteristics of policy documents, relevant policy documents are crawled and analyzed by a web crawler, and the results are matched against the basic information of enterprises. This achieves accurate pushing of policies to enterprises, solves the "last mile" problem of policy service, and allows more enterprises and talents to obtain policy dividends. The method is also simple, convenient to operate, and suitable for large-scale adoption.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, the present invention will be further described with reference to the following examples. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and thus the present invention is not limited to the specific embodiments of the present disclosure.

Embodiment 1: this embodiment provides a big-data-based policy acquisition, cleaning and automatic accurate pushing method.

First, keywords are extracted from the content of previously issued government policies and a keyword set is constructed. Websites are captured mainly by web crawlers, of which there are currently two types: general-purpose web crawlers and topic web crawlers. Compared with a general-purpose crawler, a topic crawler is more targeted, but it needs a corresponding topic. The topic is therefore determined from the content of previously issued policies. Analysis shows that the text of a declaration-type policy contains the requirements of the declaration conditions, so "declaration conditions" can serve as one of the keywords. In addition, the names of declaration-type policies, such as "high-tech enterprise", "one enterprise, one technology", "national intellectual property demonstration center", "provincial laboratory" and "national laboratory", can all serve as topic words for the topic crawler.

Existing text summarization techniques mainly obtain keywords from words that occur frequently in a text. In the declaration conditions of a declaration-type policy, however, the relevant words generally appear only once, so such techniques cannot interpret the policy. Keywords can instead be formed from the declaration-condition requirements of previous policies, so that the required declaration conditions can be extracted with them. For this reason, the keyword set generated from previous government policy documents comprises a topic keyword set, used for the topic crawler's search, and a constraint keyword set, used for cleaning, interpreting and analyzing the documents.

After a government policy is released, similar documents may also appear on other external websites, so a topic crawler crawling by topic would fetch too many duplicate documents. A government website, by contrast, publishes a given policy requirement only once. The relevant government websites are therefore added to the initial URL seed set crawled by the topic crawler, which avoids a large volume of duplicate text, saves the overhead of repeated computation, conserves bandwidth, and supports fast screening. Because declaration policies are often issued at county, city, provincial and national level, the government websites of all four levels need to be added to the initial URL seed set, for example the declaration-policy websites of the Guan County Science and Technology Bureau, the Liaocheng Science and Technology Bureau, the Shandong Provincial Department of Science and Technology, the Ministry of Public Health and the Ministry of Science and Technology.

Then a topic crawler is formed using the keywords in the topic keyword set. Starting from the initial URL seed set, it analyzes the topic relevance of each crawled web page and stores the topic-relevant pages in a database. Because newly released policies are usually displayed on the home page of a government website, this embodiment adopts the following crawling strategy to avoid excessive crawling:

When the web crawler crawls the web pages in the URL seed set, it first searches all links in a seed web page, then searches all links in the next layer, and only after all links in that layer have been searched does it move on to the layer below, until the lowest layer is reached (that is, a breadth-first search). As a simple example, suppose the home page A of a government website links to B, C and D; B links to E and F; F links to G; C links to H and I; and D is the declaration-policy link to be found. The topic crawler then searches in the order A, B, C, D, E, F, H, I, G. Searching in this order ensures that shallow pages are processed promptly, so policy documents are found quickly. The traditional (depth-first) search order is A, B, E, F, G, C, H, I, D, which tends to delay shallow pages and reduce the probability of reaching them.
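The two visiting orders in this example can be reproduced with a short sketch. The link table below encodes only the A to I example from the text; the function names `bfs_order` and `dfs_order` are illustrative and not part of the described method, and real crawling (fetching pages, relevance checks) is left out.

```python
from collections import deque

# Link structure of the example: page -> pages it links to.
LINKS = {
    "A": ["B", "C", "D"],
    "B": ["E", "F"],
    "C": ["H", "I"],
    "D": [],
    "E": [],
    "F": ["G"],
    "G": [],
    "H": [],
    "I": [],
}


def bfs_order(seed, links):
    """Visit pages layer by layer, as in the crawling strategy above."""
    visited = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                visited.append(nxt)
                queue.append(nxt)
    return visited


def dfs_order(seed, links):
    """Visit pages depth-first, shown only for comparison."""
    visited, seen = [], set()

    def visit(page):
        seen.add(page)
        visited.append(page)
        for nxt in links.get(page, []):
            if nxt not in seen:
                visit(nxt)

    visit(seed)
    return visited


print(bfs_order("A", LINKS))  # layer-by-layer order
print(dfs_order("A", LINKS))  # traditional order
```

In the layer-by-layer order the target page D is the fourth page visited, whereas in the depth-first order it is the last.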

After the crawled pages are stored, the text of each web page needs to be interpreted. As mentioned above, in a typical document the words that state the declaration requirements appear only once rather than many times, while the frequently occurring words are mostly keywords of the topic keyword set. To better interpret the "declaration conditions" in a document, this embodiment first filters the original web page structure, removing HTML tags, version information and the like, and extracts the text content. A suitable separator is then selected to split the text content of the web page into the sentence set $P = \{S_1, \ldots, S_N\}$, where N is the total number of sentences. Because policy documents do not contain the punctuation marks "?" or "!", this embodiment uses the full stop "." as the separator for sentence splitting.

Each sentence is then segmented into words using an existing word segmenter. Common word segmenters currently available include a dictionary mechanism based on a hash table, a dictionary mechanism based on a trie index tree, and jieba, a Python-based segmenter built on a trie structure. All three meet the requirements of this step and any of them may be used, yielding the word set $S_i = \{w_{i1}, \ldots, w_{im}\}$ of the current sentence, where m is the total number of words in the word set of the current sentence.

Finally, sentences containing keywords are extracted according to the formula $\mathrm{Ass}(k, S_i) = \left|\{\, w_k \mid w_k \in S_i \wedge w_k \in k \,\}\right|$, where $\mathrm{Ass}(k, S_i)$ is the degree of association between the keywords and the current sentence and k is the constraint keyword set. In this way, every sentence associated with a keyword in the constraint keyword set is extracted, which yields the declaration-condition content of a policy text and makes the declaration conditions easy to confirm.
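Steps d1 to d3 can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the whitespace-based `segment` function stands in for a real word segmenter such as jieba, the sample text and keyword set are invented for the example, and splitting on every full stop would also split decimal numbers, which a production version would need to handle.

```python
import re


def split_sentences(text):
    """Step d1: split the text into the sentence set P on full stops."""
    parts = re.split(r"[。.]", text)
    return [p.strip() for p in parts if p.strip()]


def segment(sentence):
    """Step d2: stand-in segmenter (assumption: whitespace split).

    For Chinese text a real segmenter such as jieba would be used instead.
    """
    return sentence.split()


def ass(k, words):
    """Step d3: Ass(k, S_i), the number of distinct words of S_i that are in k."""
    return len(set(words) & set(k))


def extract_sentences(text, k):
    """Keep every sentence whose association with the keyword set is non-zero."""
    return [s for s in split_sentences(text) if ass(k, segment(s)) > 0]


# Invented sample text and constraint keyword set, for illustration only.
text = (
    "Applicants must be registered for one year. "
    "The notice takes effect today. "
    "Annual revenue must exceed the stated threshold"
)
constraint_keywords = {"registered", "revenue"}

for sentence in extract_sentences(text, constraint_keywords):
    print(sentence)
```

Because Ass is defined as the size of a set, repeated occurrences of the same keyword in one sentence count once.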

The extracted declaration conditions then need to be analyzed. One option is to analyze the extracted sentences by combining the TextRank keyword extraction technique with the keywords of the constraint keyword set, which yields the key constraint attribute set. TextRank slides a word-selection window, set according to the constraint keyword set, over the word segmentation result of the text; each word is a node of a candidate keyword graph, and words that fall within the same window are joined by edges, which constructs the candidate keyword graph. The graph is then iterated over in the manner of PageRank: the weight of each node is initialized to 1.0, and once the set number of iterations has been reached and the weights have stabilized, the nodes are sorted by weight in descending order to obtain the num most important words as candidate keywords.
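The TextRank procedure described above can be sketched in plain Python. The initial weight of 1.0 and the selection of the top `num` words come from the text; the damping factor of 0.85, the window size, the iteration count and the sample word list are assumptions made for this illustration, not values given by the method.

```python
from collections import defaultdict


def textrank(words, window=3, damping=0.85, iterations=30, num=3):
    """Rank words by a PageRank-style iteration over a co-occurrence graph.

    `window` counts the current word itself, so window=2 links each word
    only to the word that follows it.
    """
    # Build the undirected candidate keyword graph.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for u in words[i + 1 : i + window]:
            if u != w:
                neighbors[w].add(u)
                neighbors[u].add(w)

    # Every node starts with weight 1.0.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        new_scores = {}
        for w in neighbors:
            rank = sum(scores[u] / len(neighbors[u]) for u in neighbors[w])
            new_scores[w] = (1 - damping) + damping * rank
        scores = new_scores

    # Sort by weight, highest first (ties broken alphabetically).
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:num]


# Invented, already-segmented sample sentence.
words = ["revenue", "growth", "revenue", "assets", "revenue", "tax"]
print(textrank(words, window=2, num=1))
```

In this sample "revenue" co-occurs with every other word, so it ends up with the highest weight.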

Alternatively, models trained with LDA and D2V algorithms can be used to obtain the key constraint attribute set from the policy text. In this embodiment, the key constraint attribute set refers to the words in a policy document that carry attribute names and attribute values. For example, among the declaration conditions for high-tech enterprises, the financial requirement is sustained growth in sales revenue and total assets; the attribute name is then "sales revenue" and the attribute value is its growth range over the last three years.

Likewise, the main purpose of this embodiment is to push relevant policies to enterprises that meet the declaration conditions, so the basic information of each enterprise must be known. An enterprise recommendation attribute set containing enterprise information is therefore constructed according to the constraint keyword set; this is conventional practice, and the set is likewise organized by attribute name and attribute value. The enterprise recommendation attribute set comprises multiple information attributes. Each information attribute forms a set of at least two fields, the attribute name and the attribute value, which are the two most basic; further fields such as data type and matching threshold can be added as needed.

Because some projects restrict the field in which a declaration may be made, whether an enterprise belongs to the declared field is a decisive, one-vote-veto criterion. The field attributes of the enterprise in the constraint attribute set and the enterprise recommendation attribute set are therefore determined using the internationally used Dewey Decimal Classification. Following the knowledge classification of the 17th-century British philosopher Francis Bacon, who divided human knowledge into memory (history), imagination (literature) and reason (philosophy, that is, science), the Dewey Decimal Classification inverts the order of these three parts and expands them into 10 main classes. This classification is also commonly used for the fields named in current declaration policy documents, so it can be used to judge whether an enterprise's field meets the requirements of the declared field. The specific operation is as follows:

First, the classification numbers of the keywords in the constraint keyword set and in the enterprise recommendation attribute set are looked up in the Dewey Decimal Classification.

Then, with the length of the Dewey Decimal classification number as the X axis and the classification number itself as the Y axis, the classification numbers of the keywords in the key constraint attribute set and in the enterprise recommendation attribute set are plotted as points on a two-dimensional coordinate plane. If the points formed by the keywords of the enterprise recommendation attribute set lie near or overlap the keyword points of the key constraint attribute set, the field attribute match is judged successful; if they lie far apart, the match is judged unsuccessful. Plotting the points in this way avoids noise from points unrelated to the keywords, improves the accuracy of predicting how well the enterprise's field correlates with the declared field, and supports accurate recommendation.
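One possible reading of this field-matching step is sketched below. Several details are assumptions rather than part of the described method: the small keyword-to-class lookup table `DDC` is illustrative, the "length" of a class number is taken as its digit count, "near" is interpreted as a Euclidean distance no greater than a hypothetical threshold `max_dist`, and a match is declared when any enterprise point is near any policy point.

```python
import math

# Illustrative lookup table (assumption): keyword -> Dewey class number.
DDC = {
    "computer science": "004",
    "software": "005",
    "medicine": "610",
    "agriculture": "630",
}


def to_point(number):
    """Map a class number to (x, y): x = digit count, y = numeric value."""
    digits = number.replace(".", "")
    return (len(digits), float(number))


def domain_match(policy_keywords, enterprise_keywords, max_dist=5.0):
    """Return True if some enterprise point lies within max_dist of a policy point."""
    policy_pts = [to_point(DDC[k]) for k in policy_keywords]
    enterprise_pts = [to_point(DDC[k]) for k in enterprise_keywords]
    return any(
        math.dist(p, e) <= max_dist
        for p in policy_pts
        for e in enterprise_pts
    )


print(domain_match(["computer science"], ["software"]))  # neighbouring classes
print(domain_match(["agriculture"], ["software"]))       # distant classes
```

The threshold would need tuning in practice, since the text does not state how close two points must be to count as near.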

After the declaration conditions have been interpreted, accurate pushing must be carried out. In this embodiment, the following algorithm is used:

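First, the obtained key constraint attribute set is compared with the enterprise recommendation attribute set item by item. The key constraint attribute set consists of multiple constraints {C1, C2, ..., Cn}, in which each attribute name of a given type is uniquely determined, and the constraints of the set are related as a union. When all attributes in the enterprise recommendation attribute set match all constraint attributes in the key constraint attribute set, the web page containing the key constraint attribute set is pushed to the enterprise.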
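The item-by-item comparison can be illustrated with the sketch below, which pushes a policy only when every constraint is satisfied, as in step g. The `Constraint` record, its `op` field, the attribute names and the sample enterprises are all hypothetical; the text specifies only that each attribute has a name and a value, with optional extras such as data type and matching threshold.

```python
import operator
from dataclasses import dataclass

# Comparison operators a constraint may use (hypothetical choice).
OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq}


@dataclass(frozen=True)
class Constraint:
    name: str      # attribute name, unique within the constraint set
    op: str        # how the enterprise value is compared (hypothetical field)
    value: object  # attribute value required by the policy


def matches(constraints, enterprise):
    """Return True only if the enterprise satisfies every constraint."""
    for c in constraints:
        if c.name not in enterprise:
            return False
        if not OPS[c.op](enterprise[c.name], c.value):
            return False
    return True


def enterprises_to_notify(constraints, enterprises):
    """List the enterprises to which the policy page should be pushed."""
    return [name for name, attrs in enterprises.items() if matches(constraints, attrs)]


# Invented sample data.
constraints = [
    Constraint("years_registered", ">=", 1),
    Constraint("field", "==", "software"),
]
enterprises = {
    "Acme": {"years_registered": 3, "field": "software"},
    "Beta": {"years_registered": 0, "field": "software"},
    "Gamma": {"field": "software"},
}

print(enterprises_to_notify(constraints, enterprises))
```

A missing attribute is treated here as a failed match, which is a conservative design choice for the sketch rather than something the text prescribes.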

With the above arrangement, accurate pushing of declaration-type policies is effectively achieved and the "last mile" of policy service is completed, so that more enterprises and talents can obtain policy dividends.

The above is only a preferred embodiment of the invention and does not limit the invention to other forms. Any person skilled in the art may use the technical content disclosed above to modify or adapt it into equivalent embodiments with equivalent changes. Any simple modification, equivalent change or adaptation made to the above embodiment according to the technical essence of the invention, without departing from the technical solution of the invention, still falls within the protection scope of the technical solution of the invention.
