Method for marking intentions of mass live broadcast bullet screen data based on knowledge graph
1. A knowledge-graph-based method for labeling the intentions of massive live-broadcast bullet-screen data, characterized by comprising the following steps:
S1, extracting keywords from the bullet-screen information and summarizing the dimensions of the keywords;
S2, expanding the keywords with their synonyms and homophones;
S3, combining a plurality of dimensions into a template with a specific intention;
S4, de-duplicating the bullet-screen data and removing invalid data;
S5, extracting the viewpoint intentions of the bullet-screen data through the template;
and S6, manually checking and removing erroneous data.
2. The knowledge-graph-based method for labeling the intentions of massive live-broadcast bullet-screen data as claimed in claim 1, wherein in S1, a database is established for the bullet-screen information, and the words in the bullet-screen comments are expanded and annotated, wherein the sentiment of the text is analyzed through word vectors so that negative and positive words are readily identified in the text, keywords are then listed, a text is extracted when it hits a keyword, and colloquial and heavily elliptical bullet-screen data are annotated in combination with manual annotation;
and the words in the database are adjusted as the number of bullet-screen comments grows, so that the database expands properly, and after the database has been updated with the real-time data, the data are filtered and annotated.
3. The knowledge-graph-based method for labeling the intentions of massive live-broadcast bullet-screen data as claimed in claim 1, wherein in S2, the keywords of the bullet-screen comments are selected by means of homophones, word segmentation, near-homophones and synonyms;
the selected words are then described, defined and recorded in the database, and the matched words are linked to the keywords in the database.
4. The knowledge-graph-based method for labeling the intentions of massive live-broadcast bullet-screen data as claimed in claim 1, wherein in S3, the linked keywords form dimensions, which are classified by field, range and time period and assembled into modules, thereby forming a template with a specific intention;
the intention of the bullet-screen comments in the template is then defined manually, and the comments are classified as derogatory or commendatory.
5. The knowledge-graph-based method for labeling the intentions of massive live-broadcast bullet-screen data as claimed in claim 1, wherein in S4, the words of the bullet-screen comments are de-duplicated, the homophones, word segments, near-homophones and synonyms of the comments are looked up in the bullet-screen database, and invalid bullet-screen data are removed.
6. The knowledge-graph-based method for labeling the intentions of massive live-broadcast bullet-screen data as claimed in claim 1, wherein in S5, the viewpoints in the bullet-screen data are analyzed, and wrongly defined comments and words are removed.
7. The knowledge-graph-based method for labeling the intentions of massive live-broadcast bullet-screen data as claimed in claim 1, wherein in S6, erroneous bullet-screen data are checked, wrongly labeled comments are removed, and the definitions and intentions of the words are marked, so that the comments are manually analyzed again.
8. The knowledge-graph-based method for labeling the intentions of massive live-broadcast bullet-screen data as claimed in claim 1, wherein in S6, the data may first be aggregated for cleaning, then rearranged, and finally cleaned using a script component;
specifically, the bullet-screen comments from the data sources are integrated and stored in a data warehouse, converted into data, and compared with the data in the database, whereby the data are cleaned and their proper representation is ensured.
Background
The bullet screen gives viewers the impression of real-time interaction. Although different comments are sent at different times, each comment appears only at a specific time point in the video, so comments displayed at the same moment generally share a theme, and a viewer who joins in has the impression of commenting together with other viewers. A traditional comment system, by contrast, is separate from the player, so its comments mostly concern the video as a whole, are weakly topical, and give no sense of real-time interaction.
On e-commerce platforms, customers comment on goods or services. Extracting viewpoints from these comments supports word-of-mouth analysis of goods, consumption decisions, public-opinion analysis and the like; purchase reviews are first-hand feedback on the goods and are very useful to merchants and brands. At present, review viewpoints are commonly extracted and analyzed automatically on the basis of natural language processing (NLP) technology: viewpoint information is extracted through syntactic and semantic analysis of the text, together with information clustering and the like. These techniques are currently applied to extracting viewpoints from user reviews of many kinds of products, including food, hotels, automobiles and scenic spots.
However, these schemes are not suited to colloquial bullet-screen text. NLP analysis requires grammatically complete sentences, and its main application scenario is formal written text. The word-vector method is somewhat effective, but it can only separate positive from negative. The keyword method requires a keyword hit, so its miss rate in colloquial scenarios is high and the results obtained are scattered.
Disclosure of Invention
The invention provides a knowledge-graph-based method for labeling the intentions of massive live-broadcast bullet-screen data, which can effectively solve the problems raised in the background: existing schemes are not suited to colloquial bullet-screen text; NLP analysis requires grammatically complete sentences and is mainly applied to formal written text; the word-vector method is somewhat effective but can only separate positive from negative; and the keyword method requires a keyword hit, so its miss rate in colloquial scenarios is high and the results obtained are scattered.
In order to achieve the purpose, the invention provides the following technical scheme: a method for labeling intentions of a large amount of live-broadcast bullet screen data based on a knowledge graph comprises the following steps:
S1, extracting keywords from the bullet-screen information and summarizing the dimensions of the keywords;
S2, expanding the keywords with their synonyms and homophones;
S3, combining a plurality of dimensions into a template with a specific intention;
S4, de-duplicating the bullet-screen data and removing invalid data;
S5, extracting the viewpoint intentions of the bullet-screen data through the template;
and S6, manually checking and removing erroneous data.
According to the above technical scheme, in S1 a database is established for the bullet-screen information, and the words in the bullet-screen comments are expanded and labeled: the sentiment of the text is analyzed through word vectors so that negative and positive words are readily identified in the text, keywords are then listed, a text is extracted when it hits a keyword, and colloquial and heavily elliptical bullet-screen data are labeled in combination with manual annotation;
the words in the database are adjusted as the number of bullet-screen comments grows, so that the database expands properly, and after the database has been updated with the real-time data, the data are filtered and labeled.
According to the above technical scheme, in S2 the keywords of the bullet-screen comments are selected by means of homophones, word segmentation, near-homophones and synonyms;
the selected words are then described, defined and recorded in the database, and the matched words are linked to the keywords in the database.
According to the above technical scheme, in S3 the linked keywords form dimensions, which are classified by field, range and time period and assembled into modules, thereby forming a template with a specific intention;
the intention of the bullet-screen comments in the template is then defined manually, and the comments are classified as derogatory or commendatory.
According to the above technical scheme, in S4 the words of the bullet-screen comments are de-duplicated, the homophones, word segments, near-homophones and synonyms of the comments are looked up in the bullet-screen database, and invalid bullet-screen data are removed.
According to the above technical scheme, in S5 the viewpoints in the bullet-screen data are analyzed, and wrongly defined comments and words are removed.
According to the above technical scheme, in S6 erroneous bullet-screen data are checked, wrongly labeled comments are removed, and the definitions and intentions of the words are marked, so that the comments are manually analyzed again.
According to the above technical scheme, in S6 the data may first be aggregated for cleaning, then rearranged, and finally cleaned using a script component;
specifically, the bullet-screen comments from the data sources are integrated and stored in a data warehouse, converted into data, and compared with the data in the database, whereby the data are cleaned and their proper representation is ensured.
Compared with the prior art, the invention has the following beneficial effects: the method is scientifically and reasonably structured and safe and convenient to use. It aims to label the intentions of live bullet-screen data quickly and the intentions of text efficiently, and it addresses the huge data volume, the low efficiency of manual work, and the poor performance of traditional NLP in bullet-screen analysis. The sentiment of the text is analyzed through word vectors so that negative and positive words are readily identified in the text; keywords are listed and a text is extracted when it hits a keyword; and colloquial and elliptical bullet-screen data are labeled in combination with manual annotation, which facilitates later conversion of the bullet-screen data. Synonyms and near-synonyms are searched for in the data, the words for the keywords are extracted, and their semantics are defined, so that the bullet-screen data can be conveniently de-duplicated and the workload is reduced, making the method suitable for wider adoption.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
In the drawings:
FIG. 1 is a schematic flow chart of the method steps of the present invention;
FIG. 2 is a schematic diagram of the effect structure of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Example: as shown in FIGS. 1-2, the invention provides a knowledge-graph-based method for labeling the intentions of massive live-broadcast bullet-screen data, which comprises the following steps:
S1, extracting keywords from the bullet-screen information and summarizing the dimensions of the keywords;
S2, expanding the keywords with their synonyms and homophones;
S3, combining a plurality of dimensions into a template with a specific intention;
S4, de-duplicating the bullet-screen data and removing invalid data;
S5, extracting the viewpoint intentions of the bullet-screen data through the template;
and S6, manually checking and removing erroneous data.
According to the above technical scheme, in S1 a database is established for the bullet-screen information, and the words in the bullet-screen comments are expanded and labeled: the sentiment of the text is analyzed through word vectors so that negative and positive words are readily identified in the text, keywords are then listed, a text is extracted when it hits a keyword, and colloquial and heavily elliptical bullet-screen data are labeled in combination with manual annotation;
the words in the database are adjusted as the number of bullet-screen comments grows, so that the database expands properly, and after the database has been updated with the real-time data, the data are filtered and labeled.
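As an illustration of S1, the sketch below extracts keyword hits from comments, groups them by dimension, and sets aside comments with no hit for manual annotation. It is a minimal sketch only: the dictionary `keyword_db`, the function names, and the plain substring test are assumptions made for illustration and are not specified by the method.

```python
from collections import defaultdict

# Hypothetical keyword database: keyword -> dimension.
# The entries are illustrative and loosely follow the example table.
keyword_db = {
    "neutral skin": "skin texture",
    "oily skin": "skin texture",
    "repair cream": "product category",
    "how to buy": "order consultation",
}

def extract_hits(comment, db):
    """Return (keyword, dimension) pairs for every database keyword found in the comment."""
    text = comment.lower()
    return [(kw, dim) for kw, dim in db.items() if kw in text]

def summarize_dimensions(comments, db):
    """Group the keywords that were hit by their dimension."""
    by_dim = defaultdict(set)
    for comment in comments:
        for kw, dim in extract_hits(comment, db):
            by_dim[dim].add(kw)
    return dict(by_dim)

def unmatched(comments, db):
    """Comments with no keyword hit; these are candidates for manual annotation."""
    return [c for c in comments if not extract_hits(c, db)]

comments = ["Is the repair cream ok for oily skin", "how to buy", "hahaha"]
dims = summarize_dimensions(comments, keyword_db)
todo = unmatched(comments, keyword_db)

# Database expansion: an annotator adds a keyword as new comments arrive.
keyword_db["cleansing foam"] = "product category"
```

Keeping the unmatched comments in a separate list reflects the combination with manual annotation: whatever the keyword list misses is routed to a person, and the words that person adds enlarge the database for later comments.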
According to the above technical scheme, in S2 the keywords of the bullet-screen comments are selected by means of homophones, word segmentation, near-homophones and synonyms;
the selected words are then described, defined and recorded in the database, and the matched words are linked to the keywords in the database.
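The linking of variants to database keywords in S2 can be pictured as a lookup from each surface form to one canonical keyword. The following is a sketch under assumptions: the `variants` table, the function names, and the longest-match-first rule are illustrative choices, and real homophone or near-homophone detection (for example by comparing pronunciations) is not shown here.

```python
import re

# Hypothetical variant table: canonical keyword -> recorded synonyms/homophones.
# The pairs are illustrative and follow the example table.
variants = {
    "how to buy": ["za buy"],
    "skin-moistening cream": ["soothing cream"],
    "all skin types": ["any skin type"],
    "combination skin": ["mixed", "mixed skin"],
}

def build_index(variant_table):
    """Map every surface form, including the canonical one, to its canonical keyword."""
    index = {}
    for canonical, forms in variant_table.items():
        index[canonical] = canonical
        for form in forms:
            index[form] = canonical
    return index

def compile_matcher(index):
    """Build one regex that tries longer forms first, so 'mixed skin' wins over 'mixed'."""
    forms = sorted(index, key=len, reverse=True)
    return re.compile("|".join(re.escape(form) for form in forms))

def canonical_keywords(comment, index, matcher):
    """Return the canonical keywords for all forms found, in order of appearance."""
    return [index[m.group(0)] for m in matcher.finditer(comment.lower())]

index = build_index(variants)
matcher = compile_matcher(index)
found = canonical_keywords("Za buy the soothing cream for mixed skin", index, matcher)
```

Once every variant resolves to a single canonical keyword, later steps such as template matching and de-duplication only need to deal with the canonical forms.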
According to the above technical scheme, in S3 the linked keywords form dimensions, which are classified by field, range and time period and assembled into modules, thereby forming a template with a specific intention;
the intention of the bullet-screen comments in the template is then defined manually, and the comments are classified as derogatory or commendatory.
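One way to represent the templates of S3 is as records that name the dimensions they require together with the manually assigned intention and polarity. The sketch below assumes this representation; the `Template` class, its field names, the example intentions, and the polarity labels are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Template:
    name: str
    dimensions: tuple   # dimensions that must all be present in a comment
    intent: str         # intention defined manually by an annotator
    polarity: str       # "commendatory", "derogatory" or "neutral", assigned manually

# Illustrative templates; more specific templates are listed first.
templates = [
    Template("purchased_product", ("purchased", "product category"),
             "reports having bought a product", "commendatory"),
    Template("purchased_only", ("purchased",),
             "reports a purchase", "commendatory"),
    Template("order_question", ("order consultation",),
             "asks how to place an order", "neutral"),
]

def match_template(dimensions_hit, template_list):
    """Return the first template whose required dimensions are all present, else None."""
    hit = set(dimensions_hit)
    for template in template_list:
        if set(template.dimensions) <= hit:
            return template
    return None

best = match_template(["purchased", "product category"], templates)
```

Listing the more specific template first means a comment that mentions both a purchase and a product is not absorbed by the broader purchase-only template.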
According to the above technical scheme, in S4 the words of the bullet-screen comments are de-duplicated, the homophones, word segments, near-homophones and synonyms of the comments are looked up in the bullet-screen database, and invalid bullet-screen data are removed.
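The de-duplication and invalid-data removal of S4 might look like the sketch below. The normalization rule and the validity rule (a comment is kept only if it contains at least one known keyword or variant) are assumptions chosen for illustration; the method itself does not fix these rules.

```python
import re

def normalize(comment):
    """Lowercase, trim, and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", comment.strip().lower())

def is_valid(comment, known_forms):
    """Assumed validity rule: the comment contains at least one known keyword or variant."""
    return any(form in comment for form in known_forms)

def dedupe_and_filter(comments, known_forms):
    """Drop empty comments, repeated comments, and comments without any known form."""
    seen = set()
    kept = []
    for raw in comments:
        comment = normalize(raw)
        if not comment or comment in seen:
            continue
        seen.add(comment)
        if is_valid(comment, known_forms):
            kept.append(comment)
    return kept

known = {"how to buy", "repair cream"}
cleaned = dedupe_and_filter(
    ["How to buy", "how  to buy ", "666", "", "Repair cream?"], known
)
```

Normalizing before comparing lets copies that differ only in case or spacing count as duplicates, which matters when many viewers send the same short comment at once.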
According to the above technical scheme, in S5 the viewpoints in the bullet-screen data are analyzed, and wrongly defined comments and words are removed.
According to the above technical scheme, in S6 erroneous bullet-screen data are checked, wrongly labeled comments are removed, and the definitions and intentions of the words are marked, so that the comments are manually analyzed again.
According to the above technical scheme, in S6 the data may first be aggregated for cleaning, then rearranged, and finally cleaned using a script component;
specifically, the bullet-screen comments from the data sources are integrated and stored in a data warehouse, converted into data, and compared with the data in the database, whereby the data are cleaned and their proper representation is ensured.
As shown in the following table:

Keyword                  Synonym / homophone      Dimension
Neutral skin             Neutral skin             Skin texture
How to buy               Za buy                   Order consultation
Oily muscle              Oily skin                Skin texture
Repair cream             Repair cream             Product category
Skin-moistening cream    Soothing cream           Product category
Skin softening lotion    Soft care lotion         Product category
All skin types           Any skin type            Skin texture
Which money              Which one | which        Product category
Is bought on             Buy good                 Has purchased
Refined extract liquid   Refined liquid           Product category
Cleansing foam           Cleansing foam           Product category
Face cleaning mousse     Face cleaning mousse     Product category
Skin cleaning water      Skin cleaning liquid     Product category
Introduction to          Say once more            Commodity consultation
Combination skin         Mixed | mixed skin       Skin texture
Combining multiple dimensions yields a template with a specific intention, for example: { ([ # purchased # ]) [ # class # ] } | ([ # purchased # ])
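The template above can be read as "a purchased term followed by a class term, or a purchased term alone". The sketch below applies that reading to a comment by first tagging it with dimensions and then matching the tag sequence. The reading itself is an interpretation, and the `lexicon` entries, the tag names, and the function names are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical lexicon: surface form -> dimension tag used in the template.
lexicon = {
    "buy good": "purchased",
    "is bought on": "purchased",
    "repair cream": "class",
    "cleansing foam": "class",
    "oily skin": "skin",
}

def tag_dimensions(comment):
    """Return the dimension tags hit in the comment, in order of first appearance."""
    text = comment.lower()
    hits = []
    for form, dim in lexicon.items():
        pos = text.find(form)
        if pos >= 0:
            hits.append((pos, dim))
    return [dim for _, dim in sorted(hits)]

# Assumed reading of the template: "purchased class" or "purchased" alone.
PURCHASE_TEMPLATE = re.compile(r"^(?:purchased class|purchased)$")

def matches_purchase_template(comment):
    """True if the purchased/class tags of the comment fit the assumed template."""
    tags = [t for t in tag_dimensions(comment) if t in ("purchased", "class")]
    return bool(PURCHASE_TEMPLATE.match(" ".join(tags)))

with_product = matches_purchase_template("Buy good the repair cream")
purchase_only = matches_purchase_template("buy good")
product_only = matches_purchase_template("repair cream for oily skin")
```

Matching on the tag sequence rather than on the raw text is what lets one template cover every synonym and homophone recorded for a dimension. Under this reading a class term without a purchased term does not match, and neither does a class term that precedes the purchased term.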
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.