High-frequency product keyword and phrase display system for shipping and trade data
1. A high-frequency product keyword and phrase display system for shipping and trade data, characterized in that it comprises:
a data updating unit, configured to automatically acquire unstructured text data that a client provides through a mailbox, transmit the acquired data to the system, and organize the data by means of the cleaning unit;
a cleaning unit, configured to acquire raw shipping and trade data, clean the acquired data, and construct a base database from the cleaned data;
a high-frequency product keyword and phrase mining unit, configured to segment historical data into words, compute word frequencies and weights, and extract high-frequency product keywords and phrases as a lexicon; and to segment updated data into words, retain nouns and phrases, compute word frequencies, compare the results against the lexicon to obtain the words and phrases with larger weights, rank them by word frequency, and screen the results by comparing each weight with a threshold;
a storage unit, configured to store the mining results of the high-frequency product keyword and phrase mining unit;
and a visualization unit, configured to draw different visual graphic reports of the high-frequency product keywords and phrases by day, month and year.
2. The high-frequency product keyword and phrase display system for shipping and trade data of claim 1, wherein the acquired raw shipping and trade data are converted from unstructured data into structured data, and the structured data are cleaned using an SQL view method.
3. The high-frequency product keyword and phrase display system for shipping and trade data of claim 1, wherein the high-frequency product keyword and phrase mining unit obtains a maximum-probability part of speech using part-of-speech tagging based on a hidden Markov model (HMM), specifically,
applying an HMM to a part-of-speech-tagged English corpus to obtain an HMM chain:
and applying the obtained HMM chain to the segmented historical data to obtain the maximum-probability part of speech of each word, screening out the words by their maximum-probability part of speech, and obtaining the related phrases before and after each screened word.
4. A high-frequency product keyword and phrase display system for shipping and trade data, characterized in that it performs the following steps:
acquiring raw shipping and trade data;
converting the acquired unstructured data into structured data, cleaning the data using an SQL view method, and constructing a base database from the cleaned data;
segmenting the historical data into words, retaining nouns and phrases, and removing stop words and meaningless words; computing word frequencies and weights, and extracting high-frequency product keywords and phrases as a lexicon; segmenting the updated data into words, retaining nouns and phrases, computing word frequencies, comparing the results against the lexicon to obtain the words and phrases with larger weights, ranking them by word frequency, and screening the results against a preset threshold;
and visualizing the output data.
5. The high-frequency product keyword and phrase display system for shipping and trade data of claim 1, wherein, before the output data are visualized, the system further performs the following steps: reconstructing the data source into structured data and computed high-frequency product keyword and phrase data; storing both in an SQL Server database; performing standardized cleaning using SQL views; and storing the data of the reconstructed database in a Hadoop distributed system as the back-end database of the visualization website.
Background
High-frequency product keywords and phrases are determined by how often the words or phrases in product names occur, and they reflect the attention and demand a product receives within a time period. Daily-updated high-frequency product keywords and phrases are valuable for research on the international trade industry and help government agencies and related enterprises make well-founded decisions and targeted strategies. They also help trading companies study product competitiveness, and they give people entering international trade a sound basis for product promotion and market development.
Disclosure of Invention
In view of the above, a high-frequency product keyword and phrase display system for shipping and trade data is provided. The technical means adopted by the invention are as follows:
A high-frequency product keyword and phrase display system for shipping and trade data, comprising:
a data updating unit, which, after a client delivers unstructured text data, automatically transmits the data to the system through a mailbox and organizes the data by means of the cleaning unit;
a cleaning unit, configured to acquire raw shipping and trade data, clean the acquired data, and construct a base database from the cleaned data;
a high-frequency product keyword and phrase mining unit, configured to segment historical data into words, compute word frequencies and weights, and extract high-frequency product keywords and phrases as a lexicon; and to segment updated data into words, retain nouns and phrases, compute word frequencies, compare the results against the lexicon to obtain the words and phrases with larger weights, rank them by word frequency, and screen the results by comparing each weight with a threshold;
a storage unit, configured to store the keywords and phrases mined by the high-frequency product keyword and phrase mining unit;
and a visualization unit, configured to draw different visual graphic reports of the keywords or phrases by day, month and year.
Further, the acquired raw shipping and trade data are converted into structured data, and the data are cleaned using an SQL view method.
Further, the high-frequency product keyword and phrase mining unit obtains a maximum-probability part of speech using part-of-speech tagging based on a hidden Markov model (HMM), specifically,
applying an HMM to a part-of-speech-tagged English corpus to obtain an HMM chain:
and applying the obtained HMM chain to the segmented historical data to obtain the maximum-probability part of speech of each word, screening out the words by their maximum-probability part of speech, and obtaining the related phrases before and after each screened word.
A high-frequency product keyword and phrase display system for shipping and trade data performs the following steps:
acquiring raw shipping and trade data;
converting the acquired unstructured data into structured data, cleaning the data using an SQL view method, and constructing a base database from the cleaned data;
segmenting the historical data into words, retaining nouns and phrases, and removing stop words and meaningless words; computing word frequencies and weights, and extracting high-frequency product keywords and phrases as a lexicon; segmenting the updated data into words, retaining nouns and phrases, computing word frequencies, comparing the results against the lexicon to obtain the words and phrases with larger weights, ranking them by word frequency, and screening the results against a preset threshold;
and visualizing the output data.
Further, before the output data are visualized, the following steps are also performed: reconstructing the data source into structured data and computed high-frequency product keyword and phrase data; storing both in an SQL Server database; performing standardized cleaning using SQL views; and storing the data of the reconstructed database in a Hadoop distributed system as the back-end database of the visualization website.
The invention has the following advantages:
1. The raw English corpus consists only of words obtained by splitting the text on spaces, so it contains no phrases. The invention tags parts of speech with a hidden Markov model to obtain the maximum-probability part of speech of each word, screens out the words by that part of speech, and obtains the related phrases before and after each screened word, which solves the problems of part-of-speech identification and phrase extraction.
2. The whole process is fast and efficient. The system is extensible: it can be embedded in another system or used on its own.
3. The invention first mines valuable, popular word information from the shipping trade data and then presents it visually. The visual interface makes it easy to see product popularity trends by day, month and year.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a block diagram of the process of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, embodiments of the present invention. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present invention.
As shown in FIG. 1, the invention discloses a high-frequency product keyword and phrase display system for shipping and trade data, which performs the following steps:
Raw shipping and trade data are acquired. The acquired unstructured data are converted into structured data, cleaned using an SQL view method, and used to construct a base database. In this embodiment, data may be acquired as follows: customs provides the processing rules for the shipping bill data, the text data are manually sent to a designated mailbox, and an automated program receives the data into the local system.
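The conversion and view-based cleaning described above can be sketched as follows. This is a minimal illustration only: it uses Python's built-in sqlite3 module as a stand-in for the SQL database, and the record layout, the table name raw_shipments, the view name clean_shipments and the cleaning rules are all hypothetical, not taken from the embodiment.

```python
import sqlite3

# Hypothetical raw text records: "date|product description|quantity".
raw_lines = [
    "2020-01-02| LED Lamp |100",
    "2020-01-02|led lamp|",
    "2020-01-03|Steel Pipe|40",
    "",
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_shipments (ship_date TEXT, description TEXT, quantity TEXT)"
)

# Unstructured text -> structured rows: split each non-empty line into fields.
rows = [line.split("|") for line in raw_lines if line.strip()]
conn.executemany("INSERT INTO raw_shipments VALUES (?, ?, ?)", rows)

# The view does the cleaning: it trims and upper-cases descriptions,
# casts the quantity to an integer, and drops rows with a missing quantity.
conn.execute(
    """
    CREATE VIEW clean_shipments AS
    SELECT ship_date,
           UPPER(TRIM(description)) AS description,
           CAST(quantity AS INTEGER) AS quantity
    FROM raw_shipments
    WHERE TRIM(quantity) <> ''
    """
)

cleaned = conn.execute(
    "SELECT ship_date, description, quantity FROM clean_shipments ORDER BY ship_date"
).fetchall()
print(cleaned)
```

Keeping the rules in a view leaves the raw table untouched, so the cleaning logic can be revised without reloading the source data.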
The historical data are segmented into words; nouns and phrases are retained, and stop words and meaningless words are removed. Word frequencies and weights are computed, and keywords and phrases are extracted as a lexicon. The updated data are then segmented into words; nouns and phrases are retained, word frequencies are computed, the results are compared against the lexicon to obtain the words and phrases with larger weights, these are ranked by word frequency, and the results are screened against a preset threshold.
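The lexicon-building and screening steps above can be sketched as follows. This is an illustration under stated assumptions, not the embodiment's implementation: the stop-word list, the function names, and the choice of weight (a word's share of the retained words) are hypothetical, and noun filtering is omitted because it depends on the part-of-speech tagging step.

```python
from collections import Counter

STOP_WORDS = {"OF", "AND", "FOR", "WITH"}  # hypothetical stop-word list


def tokenize(description):
    """Split a product description on spaces and drop stop words."""
    return [w for w in description.upper().split() if w not in STOP_WORDS]


def build_lexicon(historical_descriptions, top_n):
    """Return {word: weight} for the top_n historical words.

    The weight is assumed to be the word's share of all historical words.
    """
    counts = Counter(w for d in historical_descriptions for w in tokenize(d))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.most_common(top_n)}


def screen_update(updated_descriptions, lexicon, threshold):
    """Rank updated-data words found in the lexicon; keep weights above threshold."""
    counts = Counter(w for d in updated_descriptions for w in tokenize(d))
    kept = {w: c for w, c in counts.items() if w in lexicon}
    total = sum(kept.values())
    # Sort by descending frequency, breaking ties alphabetically.
    ranked = sorted(kept.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(w, c / total) for w, c in ranked if c / total > threshold]


historical = ["LED LAMP", "LED BULB", "STEEL PIPE", "LAMP SHADE OF GLASS", "BULB"]
lexicon = build_lexicon(historical, top_n=3)
result = screen_update(["LED LAMP FOR CAR", "LED STRIP", "BULB"], lexicon, 0.3)
print(result)
```

In this toy run only LED survives the 0.3 threshold, since it accounts for two of the four retained word occurrences in the updated data.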
Specifically, word segmentation uses the space character as the delimiter to split the product description in the shipping data into individual words. Part-of-speech tagging based on a hidden Markov model is then used: an HMM is applied to a publicly available part-of-speech-tagged English corpus to obtain an HMM chain. An HMM is a probabilistic sequence model: given a sequence of units (words, letters, morphemes, sentences, etc.), it computes a probability distribution over the possible tag sequences and selects the best one. The hidden Markov chain used is:
and part-of-speech tagging is performed on the word list obtained by splitting each product description, taken as a whole.
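For orientation, a first-order (bigram) HMM tagger in its standard textbook form selects, for a word sequence w_1 ... w_n, the tag sequence that maximizes the product of emission and transition probabilities. This form is given here as an assumption; it is not necessarily the exact chain used in this embodiment. The symbols w_i, t_i and the start tag t_0 are introduced only for this illustration.

```latex
\hat{t}_{1:n} \;=\; \arg\max_{t_{1:n}} \; \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}),
\qquad t_0 = \langle s \rangle
```

Here P(w_i | t_i) is the probability that tag t_i emits word w_i, and P(t_i | t_{i-1}) is the probability of moving from the previous tag to the current one; both would be estimated from the tagged corpus.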
The obtained HMM chain is applied to the existing words (the segmented historical data) to obtain the maximum-probability part of speech of each word; the words are screened out by their maximum-probability part of speech, and the related phrases before and after each screened word are obtained. Because shipping data are specialized, some terminology is not covered by the corpus. High-frequency words whose part of speech cannot be tagged, and the high-frequency phrases selected, are therefore judged manually and added to a specialized noun-phrase lexicon and noun lexicon.
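The tagging step above can be sketched in pure Python with a small bigram HMM decoded by the Viterbi algorithm. This is an illustration under stated assumptions: the four-sentence tagged corpus, the tag names NN and JJ, add-one smoothing and all function names are hypothetical, and a real system would estimate the model from a public tagged English corpus.

```python
import math
from collections import defaultdict

# Tiny hypothetical tagged corpus (NN = noun, JJ = adjective).
TAGGED = [
    [("LED", "NN"), ("LAMP", "NN")],
    [("STAINLESS", "JJ"), ("STEEL", "NN"), ("PIPE", "NN")],
    [("PLASTIC", "JJ"), ("TOYS", "NN")],
    [("FROZEN", "JJ"), ("FISH", "NN")],
]


def train(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit


def log_prob(table, given, outcome, n_outcomes):
    """Add-one smoothed log P(outcome | given)."""
    row = table.get(given, {})
    return math.log((row.get(outcome, 0) + 1) / (sum(row.values()) + n_outcomes))


def viterbi(words, trans, emit):
    """Return [(word, tag)] for the most probable tag sequence."""
    if not words:
        return []
    tags = sorted(emit)
    n_tags = len(tags)
    vocab = len({w for row in emit.values() for w in row}) + 1  # +1 for unknown words
    # Each column maps tag -> (best log-probability, previous tag on that path).
    first = {}
    for t in tags:
        score = log_prob(trans, "<s>", t, n_tags) + log_prob(emit, t, words[0], vocab)
        first[t] = (score, None)
    best = [first]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + log_prob(trans, p, t, n_tags), p) for p in tags
            )
            column[t] = (score + log_prob(emit, t, word, vocab), prev)
        best.append(column)
    # Trace the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(zip(words, reversed(path)))


trans, emit = train(TAGGED)
tagged = viterbi("STAINLESS STEEL LAMP".split(), trans, emit)
nouns = [word for word, tag in tagged if tag == "NN"]
print(tagged, nouns)
```

Smoothing gives unseen words a small nonzero probability under every tag, so specialized shipping terms missing from the corpus are still tagged from context; as the text notes, such terms may nevertheless need manual review.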
The product description of each record is tokenized using NLTK's tokenization in Python. The word-frequency weight of a word is its share of the historical word set. Filtering proceeds as follows: nouns are screened out from the historical word set and the current word set, stop words and meaningless words are removed, the remaining words are ranked by frequency, and low-frequency nouns are removed. A word is judged meaningless according to whether it is contained in a product name; in this embodiment, the product-name word database is a tagged phrase database provided by a large information database. The weight of the updated word set is computed from the word frequencies of the remaining screened words and phrases; daily weights are aggregated into monthly weights, and monthly weights into annual weights. Year-over-year and period-over-period ratios are computed from the ranking of the current day's word set. The results are then screened by keeping the frequency-ranked words whose weight exceeds the set threshold.
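The roll-up from daily to monthly to annual weights can be sketched as follows. The text does not say how the weights are aggregated, so summation is assumed here; the function name roll_up, the ISO date keys and the sample values are hypothetical.

```python
from collections import defaultdict


def roll_up(weights, key_len):
    """Sum {(period, word): weight} into coarser periods.

    The coarser period is the first key_len characters of an ISO date string,
    e.g. 7 for "YYYY-MM" and 4 for "YYYY". Summation is an assumption.
    """
    totals = defaultdict(float)
    for (period, word), weight in weights.items():
        totals[(period[:key_len], word)] += weight
    return dict(totals)


# Hypothetical daily weights keyed by (date, word).
daily = {
    ("2020-01-02", "LED"): 0.5,
    ("2020-01-03", "LED"): 0.25,
    ("2020-02-01", "LED"): 0.25,
    ("2020-02-01", "PIPE"): 0.75,
}

monthly = roll_up(daily, 7)   # keys like ("2020-01", "LED")
annual = roll_up(monthly, 4)  # keys like ("2020", "LED")
print(monthly)
print(annual)
```

Because the annual figures are derived from the monthly ones, which are derived from the daily ones, each level stays consistent with the level below it.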
The data source is reconstructed into structured data, the data are cleaned to a standard form using SQL stored procedures, and the computed high-frequency product keyword and phrase data are stored in an SQL Server database that serves as the back-end database of the visualization website.
The data are then visualized. The visualization has five parts. First, the daily-updated high-frequency product keywords and phrases are shown as a bar chart. Second, the fluctuation of the high-frequency product keywords and phrases over the past ninety days is shown as a line chart, with thirty-day, sixty-day and ninety-day windows freely selectable. Third, the monthly increase/decrease rate of the high-frequency product keywords and phrases is shown as a bar chart. Fourth, the month-over-month increase/decrease rate of the high-frequency product keywords and phrases is shown as a bar chart. Fifth, the year-over-year increase/decrease rate of the hot-selling products is shown as a bar chart.
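The increase/decrease rates plotted in these charts can be computed as a relative change between two periods. The text does not give the formula, so the definition below is an assumption, and the function name change_rate and the monthly counts are hypothetical.

```python
def change_rate(current, previous):
    """Relative increase (+) or decrease (-) of current versus previous.

    Returns None when the previous value is 0, since the rate is undefined.
    """
    if previous == 0:
        return None
    return (current - previous) / previous


# Hypothetical monthly occurrence counts of one keyword.
monthly_counts = {"2020-01": 100, "2020-12": 120, "2021-01": 150}

# Month-over-month: January 2021 against December 2020.
mom = change_rate(monthly_counts["2021-01"], monthly_counts["2020-12"])
# Year-over-year: January 2021 against January 2020.
yoy = change_rate(monthly_counts["2021-01"], monthly_counts["2020-01"])
print(mom, yoy)
```

Returning None for a zero baseline lets the charting layer decide how to display a keyword that did not appear in the comparison period.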
Finally, it should be noted that the above embodiments only illustrate the technical solutions of the present invention and do not limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in those embodiments may still be modified, and that some or all of their technical features may be replaced by equivalents, without such modifications or replacements causing the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.