Text classification method and device based on text abstract, electronic equipment and medium
1. A text classification method based on a text abstract is characterized by comprising the following steps:
obtaining a paragraph text to be classified, and dividing the paragraph text to be classified into single sentences to obtain a single sentence set to be classified;
extracting a first text abstract from the single sentence set to be classified by using a keyword-based extractive summarization method;
extracting a second text abstract from the single sentence set to be classified by using a deep learning-based extractive summarization method;
respectively calculating the matching degree of the first text abstract and the second text abstract with the paragraph text to be classified, and determining the first text abstract or the second text abstract as a target text abstract according to the matching degree;
and performing text classification on the target text abstract by using a preset text classification model to obtain the text category of the paragraph text to be classified.
2. The text classification method based on the text abstract according to claim 1, wherein the extracting a first text abstract from the single sentence set to be classified by using a keyword-based extractive summarization method comprises:
vectorizing each single sentence to be classified in the single sentence set to be classified to obtain a text single sentence vector set;
calculating the similarity between each text single sentence vector in the text single sentence vector set, storing the similarity into a preset blank matrix, and constructing a transition probability matrix;
calculating a text ordering value of the single sentence to be classified based on the transition probability matrix;
and screening the single sentences to be classified in the single sentence set to be classified as the first text abstract through the text sorting value.
3. The text classification method based on the text abstract as claimed in claim 2, wherein the vectorizing each single sentence to be classified in the single sentence set to be classified to obtain a text single sentence vector set comprises:
splitting the single sentences to be classified in the single sentence set to be classified to obtain a plurality of text words to be classified;
vectorizing the text words to be classified by using a preset word vector model to obtain a plurality of vector text words, and combining the vector text words to obtain a text single sentence vector set.
4. The text classification method based on text abstract according to claim 2, wherein the calculating the text ranking value of the sentence to be classified based on the transition probability matrix comprises:
obtaining the similarity between each text single sentence vector in the transition probability matrix;
constructing a similarity graph structure by taking each text single sentence as a node and taking the similarity between each text single sentence vector as an edge of the node;
and calculating the text ranking value of the single sentence to be classified by using the similarity graph structure.
5. The text classification method based on the text abstract according to claim 2, wherein the screening the single sentences to be classified in the single sentence set to be classified as the first text abstract through the text ranking value comprises:
traversing the text sorting values of the single sentences to be classified in the paragraph texts to be classified, and selecting a preset number of text sorting values from large to small;
and combining the target to-be-classified single sentences corresponding to the preset number of text sorting values into the first text abstract.
6. The text classification method based on the text abstract according to any one of claims 1 to 5, wherein the extracting a second text abstract from the single sentence set to be classified by using a deep learning-based extractive summarization method comprises:
acquiring a training text set and a text abstract of the training text set, and training a preset two-classification model by using the training text set and the text abstract of the training text set to obtain a labeling model;
marking each single sentence to be classified in the single sentence set to be classified by using the marking model to obtain a marked single sentence;
and obtaining a second text abstract according to the sequence of the single sentences in the text of the paragraph to be classified and the labels of the labeled single sentences.
7. The text classification method based on the text abstract according to any one of claims 1 to 5, wherein the text classification of the target text abstract by using a preset text classification model to obtain the text category of the paragraph text to be classified comprises:
performing word segmentation operation on the target text abstract to obtain abstract word segmentation of the target text abstract;
establishing an information processor through a preset category dictionary in a preset text classification model;
inputting the abstract word segmentation into the information processor to obtain the text category of the abstract word segmentation;
and determining the text category of the abstract word segmentation as the text category of the paragraph text to be classified.
8. An apparatus for classifying a text based on a text abstract, the apparatus comprising:
the paragraph text dividing module is used for acquiring paragraph texts to be classified, dividing the paragraph texts to be classified into single sentences to obtain a single sentence set to be classified;
the first abstract acquisition module is used for extracting a first text abstract from the single sentence set to be classified by using a keyword-based extractive summarization method;
the second abstract acquisition module is used for extracting a second text abstract from the single sentence set to be classified by using a deep learning-based extractive summarization method;
the target abstract confirming module is used for respectively calculating the matching degree of the first text abstract and the second text abstract with the paragraph text to be classified, and determining the first text abstract or the second text abstract as a target text abstract according to the matching degree;
and the text abstract classification module is used for performing text classification on the target text abstract by using a preset text classification model to obtain the text category of the paragraph text to be classified.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform a text summarization-based text classification method according to any of claims 1 to 7.
10. A computer-readable storage medium comprising a storage data area storing created data and a storage program area storing a computer program; wherein the computer program when executed by a processor implements a text summarization based text classification method according to any of claims 1 to 7.
Background
With the rapid development of internet technology, multimedia information on a network rapidly grows, and how to effectively organize, classify, manage and mine the rapidly growing information becomes a problem which needs to be solved urgently.
The existing text classification technology generally directly inputs the paragraph text to be classified into a text classification model, and due to the existence of a large amount of redundant irrelevant information in the paragraph text to be classified, the key information obtained by text classification cannot be highlighted, so that the classification result is far from expectation, namely the classification accuracy is not high.
Disclosure of Invention
The invention provides a text classification method and device based on a text abstract, electronic equipment and a computer readable storage medium, and mainly aims to improve the accuracy of text classification.
In order to achieve the above object, the text classification method based on the text abstract provided by the present invention includes:
obtaining a paragraph text to be classified, and dividing the paragraph text to be classified into single sentences to obtain a single sentence set to be classified;
extracting a first text abstract from the single sentence set to be classified by using a keyword-based extractive summarization method;
extracting a second text abstract from the single sentence set to be classified by using a deep learning-based extractive summarization method;
respectively calculating the matching degree of the first text abstract and the second text abstract with the paragraph text to be classified, and determining the first text abstract or the second text abstract as a target text abstract according to the matching degree;
and performing text classification on the target text abstract by using a preset text classification model to obtain the text category of the paragraph text to be classified.
Optionally, the extracting a first text abstract from the single sentence set to be classified by using a keyword-based extractive summarization method includes:
vectorizing each single sentence to be classified in the single sentence set to be classified to obtain a text single sentence vector set;
calculating the similarity between each text single sentence vector in the text single sentence vector set, storing the similarity into a preset blank matrix, and constructing a transition probability matrix;
calculating a text ordering value of the single sentence to be classified based on the transition probability matrix;
and screening the single sentences to be classified in the single sentence set to be classified as the first text abstract through the text sorting value.
Optionally, the vectorizing each to-be-classified single sentence in the to-be-classified single sentence set to obtain a text single sentence vector set includes:
splitting the single sentences to be classified in the single sentence set to be classified to obtain a plurality of text words to be classified;
vectorizing the text words to be classified by using a preset word vector model to obtain a plurality of vector text words, and combining the vector text words to obtain a text single sentence vector set.
Optionally, the calculating a text ranking value of the to-be-classified single sentence based on the transition probability matrix includes:
obtaining the similarity between each text single sentence vector in the transition probability matrix;
constructing a similarity graph structure by taking each text single sentence as a node and taking the similarity between each text single sentence vector as an edge of the node;
and calculating the text ranking value of the single sentence to be classified by using the similarity graph structure.
Optionally, the screening, according to the text ranking value, the to-be-classified single sentences in the to-be-classified single sentence set as the first text summary includes:
traversing the text sorting values of the single sentences to be classified in the paragraph texts to be classified, and selecting a preset number of text sorting values from large to small;
and combining the target to-be-classified single sentences corresponding to the preset number of text sorting values into the first text abstract.
Optionally, the extracting a second text abstract from the single sentence set to be classified by using a deep learning-based extractive summarization method includes:
acquiring a training text set and a text abstract of the training text set, and training a preset two-classification model by using the training text set and the text abstract of the training text set to obtain a labeling model;
marking each single sentence to be classified in the single sentence set to be classified by using the marking model to obtain a marked single sentence;
and obtaining a second text abstract according to the sequence of the single sentences in the text of the paragraph to be classified and the labels of the labeled single sentences.
Optionally, the performing text classification on the target text abstract by using a preset text classification model to obtain a text category of the paragraph text to be classified includes:
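performing word segmentation operation on the target text abstract to obtain abstract word segmentation of the target text abstract;
establishing an information processor through a preset category dictionary in a preset text classification model;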
inputting the abstract word segmentation into the information processor to obtain the text category of the abstract word segmentation;
and determining the text category of the abstract word segmentation as the text category of the paragraph text to be classified.
In order to solve the above problem, the present invention further provides a text classification device based on a text abstract, including:
the paragraph text dividing module is used for acquiring paragraph texts to be classified, dividing the paragraph texts to be classified into single sentences to obtain a single sentence set to be classified;
the first abstract acquisition module is used for extracting a first text abstract from the single sentence set to be classified by using a keyword-based extractive summarization method;
the second abstract acquisition module is used for extracting a second text abstract from the single sentence set to be classified by using a deep learning-based extractive summarization method;
the target abstract confirming module is used for respectively calculating the matching degree of the first text abstract and the second text abstract with the paragraph text to be classified, and determining the first text abstract or the second text abstract as a target text abstract according to the matching degree;
and the text abstract classification module is used for performing text classification on the target text abstract by using a preset text classification model to obtain the text category of the paragraph text to be classified.
In order to solve the above problem, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform a text summarization-based text classification method as described above.
In order to solve the above problem, the present invention further provides a computer-readable storage medium including a storage data area and a storage program area, the storage data area storing created data, the storage program area storing a computer program; wherein the computer program when executed by a processor implements a text categorization method based on a text excerpt as described above.
In the embodiment of the invention, a first text abstract is extracted from the paragraph text to be classified by using a keyword-based extractive summarization method, and a second text abstract is extracted from the same paragraph text by using a deep learning-based extractive summarization method. The matching degree of each abstract with the paragraph text to be classified is calculated, and the abstract with the higher matching degree is selected as the text abstract of the paragraph text to be classified. That text abstract is then classified by a text classification model to obtain the text category of the paragraph text to be classified. Obtaining the abstract first and classifying it afterwards reduces information redundancy, so the classification result is more accurate. In addition, extracting the abstract in more than one way avoids the inaccuracy that can result from relying on a single extraction means, which improves the accuracy of the obtained abstract and, in turn, the accuracy of text classification. Therefore, the embodiment of the invention can achieve the aim of improving the accuracy of text classification.
Drawings
FIG. 1 is a schematic flowchart of a text classification method based on a text abstract according to an embodiment of the present invention;
FIG. 2 is a detailed flowchart of one step of the text classification method based on a text abstract shown in FIG. 1, according to a first embodiment of the present invention;
FIG. 3 is a block diagram of a text classification device based on a text abstract according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the internal structure of an electronic device implementing a text classification method based on a text abstract according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides a text classification method based on a text abstract. The execution subject of the text classification method based on the text abstract includes, but is not limited to, at least one of electronic devices, such as a server and a terminal, which can be configured to execute the method provided by the embodiments of the present application. In other words, the text classification method based on the text abstract may be performed by software or hardware installed in the terminal device or the server device, and the software may be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
Referring to fig. 1, a flowchart of a text classification method based on a text abstract according to an embodiment of the present invention is shown. In this embodiment, the text classification method based on the text abstract includes:
S1, obtaining the paragraph text to be classified, and dividing the paragraph text to be classified into single sentences to obtain a single sentence set to be classified.
In the embodiment of the present invention, the paragraph text to be classified is a text that needs to be subjected to text classification, and the paragraph text to be classified may be in any format, for example, the paragraph text to be classified is a chinese text, or the paragraph text to be classified is an english text.
In the embodiment of the present invention, the to-be-classified paragraph text may be a text input by a user, or a text extracted from a preset to-be-classified paragraph text database.
In the embodiment of the present invention, the set of single sentences to be classified is a set formed by the single sentences in the text of the paragraphs to be classified, and specifically, the division of the text of the paragraphs to be classified into the single sentences can be realized by identifying punctuation marks in the text of the paragraphs to be classified.
For example, punctuation marks in the text of the paragraph to be classified are identified, and the sentence between the preset punctuation marks is divided into a single sentence when the preset punctuation marks (such as periods or semicolons) exist.
For example, when a first period from the beginning of the text of the paragraph to be classified to the text of the paragraph to be classified is recognized, the contents before the period are determined as a single sentence, and when the period is recognized again, the contents from the previous period to the period recognized again are determined as a single sentence.
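The punctuation-based division described above can be sketched as follows. This is a minimal illustration rather than the embodiment's implementation: the delimiter set and the function name `split_into_sentences` are assumptions, chosen to cover the periods and semicolons mentioned as examples in both Chinese and English text.

```python
import re
from typing import List

# Hypothetical preset punctuation marks; the text names periods and
# semicolons as examples, so both full-width and ASCII forms are listed.
SENTENCE_DELIMITERS = "。；.;"


def split_into_sentences(paragraph: str) -> List[str]:
    """Divide a paragraph into single sentences at the preset punctuation marks."""
    pattern = "[" + re.escape(SENTENCE_DELIMITERS) + "]"
    pieces = re.split(pattern, paragraph)
    # Drop the empty fragment that remains after a trailing delimiter.
    return [piece.strip() for piece in pieces if piece.strip()]


sentences = split_into_sentences("Rates rose today. Markets fell; bonds rallied.")
print(sentences)
```

A plain split on "." also breaks decimal numbers and abbreviations, so a production splitter would likely need additional rules; the sketch only mirrors the period-to-period behaviour described in the text.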
S2, extracting a first text abstract from the single sentence set to be classified by using a keyword-based extractive summarization method.
In the embodiment of the invention, the keyword-based extractive summarization method can be a Lead3 algorithm. The Lead3 algorithm is an extractive text summarization method that selects 3 sentences from the text to be used as the abstract of the text.
Referring to fig. 2, fig. 2 is a detailed flowchart illustrating a step of the text classification method based on the text excerpt provided in fig. 1 according to the first embodiment.
Further, the extracting a first text abstract from the single sentence set to be classified by using the keyword-based extractive summarization method comprises the following steps:
S201, vectorizing each single sentence to be classified in the single sentence set to be classified to obtain a text single sentence vector set.
In the embodiment of the invention, vectorizing a single sentence to be classified expresses the information in that sentence numerically, so that text information readable by humans is converted into vector information that a machine can process.
Specifically, in the embodiment of the present invention, the vectorizing each single sentence to be classified in the single sentence set to be classified to obtain a text single sentence vector set includes:
splitting the single sentences to be classified in the single sentence set to be classified to obtain a plurality of text words to be classified;
vectorizing the text words to be classified by using a preset word vector model to obtain a plurality of vector text words, and combining the vector text words to obtain a text single sentence vector set.
In the embodiment of the invention, the step of splitting the multiple to-be-classified single sentences in the to-be-classified single sentence set is to perform word splitting processing on each to-be-classified single sentence in the to-be-classified single sentence set respectively to obtain words forming each single sentence, namely text words to be classified.
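The word splitting and vectorization steps above can be sketched as follows. The text does not say how the vector text words are combined, so averaging is an assumption here, and the toy dictionary `TOY_WORD_VECTORS` is a hypothetical stand-in for the preset word vector model.

```python
from typing import Dict, List

# Hypothetical 3-dimensional word vectors standing in for the preset
# word vector model; a real model would be trained on a corpus.
TOY_WORD_VECTORS: Dict[str, List[float]] = {
    "rates": [1.0, 0.0, 0.0],
    "rose": [0.0, 1.0, 0.0],
    "markets": [0.0, 0.0, 1.0],
    "fell": [0.0, 1.0, 1.0],
}
DIM = 3


def sentence_vector(sentence: str) -> List[float]:
    """Split a sentence into words, look up each word vector, and average them."""
    words = sentence.lower().split()  # whitespace split; Chinese text needs a word segmenter
    vectors = [TOY_WORD_VECTORS[w] for w in words if w in TOY_WORD_VECTORS]
    if not vectors:
        return [0.0] * DIM  # no known words: fall back to the zero vector
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]


vector_set = [sentence_vector(s) for s in ["Rates rose", "Markets fell"]]
print(vector_set)
```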
S202, calculating the similarity among the text single sentence vectors in the text single sentence vector set, storing the similarity into a preset blank matrix, and constructing a transition probability matrix.
In the embodiment of the present invention, each element in the transition probability matrix represents a probability: every element is non-negative, and the elements of each row sum to 1.
In detail, the calculating the similarity between the text single sentence vectors in the text single sentence vector set, and storing the similarity into a preset blank matrix to construct a transition probability matrix, includes:
calculating the similarity of each text single sentence vector in the text single sentence vector set by using a similarity calculation formula to obtain a vector similarity set;
and storing the vector similarity in the vector similarity set into a pre-constructed matrix to obtain a transition probability matrix.
In this embodiment, the vector similarity set is obtained by calculating the similarity between the text single sentence vectors in the text single sentence vector set by using a preset similarity calculation formula, where the preset similarity calculation formula may be a cosine similarity calculation formula, a Euclidean distance calculation formula, or the like.
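As a sketch of this step, the code below fills a blank matrix with pairwise cosine similarities and then divides each row by its sum so that the rows add up to 1. The row normalization, the clipping of negative similarities to zero, and the uniform fallback for an all-zero row are assumptions made to obtain a valid transition probability matrix; the text itself only states that the similarities are stored in the matrix.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def transition_matrix(vectors: List[List[float]]) -> List[List[float]]:
    """Store pairwise similarities in a blank matrix and normalize each row."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]  # the preset blank matrix
    for i in range(n):
        for j in range(n):
            if i != j:
                # Clip negatives so every entry can be read as a probability.
                matrix[i][j] = max(0.0, cosine_similarity(vectors[i], vectors[j]))
    for row in matrix:
        total = sum(row)
        for j in range(n):
            # Assumed fallback: a sentence similar to nothing moves uniformly.
            row[j] = row[j] / total if total > 0 else 1.0 / n
    return matrix


P = transition_matrix([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
print(P)
```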
S203, calculating the text ranking value of the single sentence to be classified based on the transition probability matrix.
In the embodiment of the present invention, the text ranking value (TextRank value) is a value representing the semantic relation between a single sentence to be classified and the paragraph text to be classified. The stronger the semantic relevance between the single sentence and the paragraph text, the higher the text ranking value, and therefore the more likely the single sentence is to be selected for the text abstract of the paragraph text to be classified.
In detail, the calculating a text ranking value of the single sentence to be classified based on the transition probability matrix includes:
obtaining the similarity between each text single sentence vector in the transition probability matrix;
constructing a similarity graph structure by taking each text single sentence as a node and taking the similarity between each text single sentence vector as an edge of the node;
and calculating the text ranking value of the single sentence to be classified by using the similarity graph structure.
In the embodiment of the invention, calculating the text ranking value $S(V_i)$ of the single sentence to be classified by using the similarity graph structure is realized by the following formula:
$$S(V_i) = (1 - d) + d \sum_{V_j \in E(V_i)} \frac{W_{ij}}{\sum_{V_k \in E(V_j)} W_{jk}} \, S(V_j)$$
wherein j is a text single sentence having a similarity relation with the target text single sentence i, $V_i$ is the node of text single sentence i, $V_j$ is the node of text single sentence j, E is the edge of the node, d is the damping coefficient, k is the co-occurrence of the target text single sentence i, $E(V_i)$ is the set of all nodes connected to node $V_i$, $E(V_j)$ is the set of all nodes connected to node $V_j$, $W_{ij}$ represents the weight of the edge between $V_i$ and $V_j$, $W_{jk}$ represents the weight of the edge between $V_k$ and $V_j$, and $S(V_j)$ is the text ranking value of $V_j$.
Further, the co-occurrence words are words in the same text paragraph that describe the same phenomenon or object with a certain frequency.
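The ranking step can be illustrated with the standard weighted TextRank update, which the variable definitions above (damping coefficient d, edge weights, neighbour sets) appear to describe. This is a sketch under that assumption: the function name `text_rank`, the default d = 0.85, and the fixed iteration count are illustrative choices, not values given in the text.

```python
from typing import List


def text_rank(weights: List[List[float]], d: float = 0.85, iterations: int = 50) -> List[float]:
    """Iterate S(V_i) = (1 - d) + d * sum_j (W_ji / sum_k W_jk) * S(V_j)."""
    n = len(weights)
    scores = [1.0] * n
    # Total edge weight leaving each node j (the denominator sum over k).
    out_sums = [sum(row) for row in weights]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = 0.0
            for j in range(n):
                if j != i and weights[j][i] > 0 and out_sums[j] > 0:
                    incoming += weights[j][i] / out_sums[j] * scores[j]
            new_scores.append((1 - d) + d * incoming)
        scores = new_scores
    return scores


# Three sentences; sentence 1 is similar to both of the others.
similarity = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
scores = text_rank(similarity)
print(scores)
```

In this toy graph the middle sentence receives the highest value because it is connected to both of the others, which matches the intuition that a sentence strongly related to the rest of the paragraph is a good abstract candidate.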
S204, screening the single sentences to be classified in the single sentence set to be classified into the first text abstract through the text sorting value.
In detail, the screening, by the text ranking value, the to-be-classified single sentences in the to-be-classified single sentence set as the first text abstract includes:
traversing the text sorting values of the single sentences to be classified in the paragraph texts to be classified, and selecting a preset number of text sorting values from large to small;
and combining the target to-be-classified single sentences corresponding to the preset number of text sorting values into the first text abstract.
Alternatively, after the preset number of target to-be-classified single sentences are selected, the position of each selected single sentence in the paragraph text to be classified is obtained, and the selected target to-be-classified single sentences are arranged in that original order.
In this embodiment, the preset number is the number of single sentences of the first text abstract preset by a user, so that the number of single sentences to be classified in the obtained first text abstract equals the number preset by the user.
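A minimal sketch of the screening step follows: pick the preset number of highest ranking values, then restore the original paragraph order of the chosen sentences. The function name `select_summary` and the sample values are hypothetical.

```python
from typing import List


def select_summary(sentences: List[str], scores: List[float], preset_number: int) -> List[str]:
    """Keep the top-ranked sentences and return them in paragraph order."""
    # Sentence indices sorted by ranking value, from large to small.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Re-sort the chosen indices so the abstract follows the original order.
    chosen = sorted(ranked[:preset_number])
    return [sentences[i] for i in chosen]


first_abstract = select_summary(["a", "b", "c", "d"], [0.1, 0.9, 0.3, 0.8], 2)
print(first_abstract)
```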
S3, extracting a second text abstract from the single sentence set to be classified by using a deep learning-based extractive summarization method.
Further, the extracting a second text abstract from the single sentence set to be classified by using a deep learning-based extractive summarization method includes:
acquiring a training text set and a text abstract of the training text set, and training a preset two-classification model by using the training text set and the text abstract of the training text set to obtain a labeling model;
marking each single sentence to be classified in the single sentence set to be classified by using the marking model to obtain a marked single sentence;
and obtaining a second text abstract according to the sequence of the single sentences in the text of the paragraph to be classified and the labels of the labeled single sentences.
In this embodiment, each single sentence to be classified in the single sentence set to be classified is labeled with one of two labels: the sentence belongs to the abstract, or the sentence does not belong to the abstract.
In this embodiment, the sequence of a single sentence is the position of that sentence in the paragraph text to be classified (for example, the first sentence); the labeled single sentences that belong to the abstract are combined according to this sequence to obtain the second text abstract.
In the embodiment of the invention, the training texts and the text summaries of the training text set can be obtained by crawling texts disclosed in the network from the network by utilizing a crawler technology.
Further, the two-classification model is a binary classification model constructed based on a sigmoid function, and it is trained with the training text set and the text abstract of the training text set to obtain the labeling model.
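The labeling and assembly steps can be sketched as follows. The scoring function `toy_score` is a hypothetical stand-in for a trained model's raw output, and the 0.5 threshold is an assumption; only the sigmoid-based two-way labeling and the order-preserving combination come from the text.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Map a raw score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def toy_score(sentence: str) -> float:
    # Hypothetical raw score; a trained labeling model would compute this.
    return 2.0 if "rate" in sentence.lower() else -2.0


def label_sentence(sentence: str, threshold: float = 0.5) -> int:
    """Return 1 if the sentence belongs to the abstract, otherwise 0."""
    return 1 if sigmoid(toy_score(sentence)) >= threshold else 0


def build_second_abstract(sentences: List[str]) -> List[str]:
    """Keep sentences labeled 1, in their original paragraph order."""
    return [s for s in sentences if label_sentence(s) == 1]


paragraph_sentences = ["Rates rose", "Weather was mild", "The rate cut surprised traders"]
second_abstract = build_second_abstract(paragraph_sentences)
print(second_abstract)
```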
Further, before the labeling of each single sentence to be classified in the single sentence set to be classified by using the labeling model, the method further includes: and adding identifiers to the single sentence to be classified.
For example, a front identifier [ CLS ] is added in front of the single sentence to be classified, and a rear identifier [ SEP ] is added at the end of the single sentence to be classified.
In this embodiment, adding identifiers to the single sentence to be classified delimits the range of the sentence, signals to the model where reading starts and ends, and avoids reading errors.
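Adding the identifiers can be sketched with plain string handling. The function name `add_identifiers` and the single-space separator are assumptions; in practice a tokenizer for the chosen model would normally insert these markers itself.

```python
from typing import List


def add_identifiers(sentence: str) -> str:
    """Wrap a sentence with the front identifier [CLS] and the rear identifier [SEP]."""
    return "[CLS] " + sentence + " [SEP]"


def mark_sentences(sentences: List[str]) -> List[str]:
    """Apply the identifiers to every single sentence to be classified."""
    return [add_identifiers(s) for s in sentences]


marked = mark_sentences(["Rates rose", "Markets fell"])
print(marked)
```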
S4, respectively calculating the matching degree of the first text abstract and the second text abstract with the paragraph text to be classified, and determining the first text abstract or the second text abstract as the target text abstract according to the matching degree.
In the embodiment of the present invention, there are various methods for determining the target text abstract, for example, determining the target text abstract according to a quick sorting method.
In detail, the matching degree of the first text abstract with the paragraph text to be classified and the matching degree of the second text abstract with the paragraph text to be classified are calculated. If the matching degree of the first text abstract with the paragraph text to be classified is higher than that of the second text abstract, the first text abstract is used as the target text abstract of the paragraph text to be classified; if the matching degree of the second text abstract with the paragraph text to be classified is higher than that of the first text abstract, the second text abstract is used as the target text abstract of the paragraph text to be classified.
Further, the calculating of the matching degree between the first text abstract and the text of the paragraph to be classified can be realized by calculating the number of single sentences in the first text abstract.
For example, when the paragraph text to be classified includes twenty to-be-classified single sentences, if the first text abstract includes six to-be-classified single sentences, the matching degree between the first text abstract and the paragraph text to be classified is 6, and if the second text abstract includes eight to-be-classified single sentences, the matching degree between the second text abstract and the paragraph text to be classified is 8, so that the matching degree between the second text abstract and the paragraph text to be classified is higher than that of the first text abstract, and the second text abstract is used as the target text abstract of the paragraph text to be classified.
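The selection in this example can be sketched as follows, with the matching degree taken to be the number of single sentences in each abstract, as described above. The text does not say what happens when the two matching degrees are equal; keeping the first abstract in that case is an assumption, as are the function names.

```python
from typing import List


def matching_degree(abstract_sentences: List[str]) -> int:
    """Matching degree computed as the number of single sentences in the abstract."""
    return len(abstract_sentences)


def choose_target_abstract(first: List[str], second: List[str]) -> List[str]:
    """Return the abstract with the higher matching degree."""
    # Assumption: on a tie the first abstract is kept.
    return first if matching_degree(first) >= matching_degree(second) else second


first = ["s%d" % i for i in range(6)]   # six single sentences
second = ["t%d" % i for i in range(8)]  # eight single sentences
target = choose_target_abstract(first, second)
print(matching_degree(first), matching_degree(second), target is second)
```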
S5, performing text classification on the target text abstract by using a preset text classification model to obtain the text category of the paragraph text to be classified.
In the embodiment of the invention, the preset text classification model is an Albert model, and the Albert model is a model obtained by simplification and optimization based on a Bert model. Compared with the Bert model, the Albert model has fewer required parameters and faster training, and can solve the problems of overlarge parameters and slow training of the Bert text classification model, so that the memory occupation can be reduced by classifying through the Albert model, and the text classification speed is increased.
Specifically, in this embodiment, the Albert text classification model may be obtained from the Bert text classification model by factorizing its embedding layer (Embedding), sharing the parameters of its encoder, and replacing its pre-training task with a sentence order prediction (SOP) task. Factorizing the embedding layer and sharing the encoder parameters reduce the number of parameters required by the Bert text classification model; together with replacing the pre-training task with the sentence order prediction task, this can improve the speed and the classification accuracy of the model.
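As a rough illustration of why factorizing the embedding layer reduces the parameter count, the arithmetic can be sketched as below. The vocabulary size, hidden size and embedding size are illustrative values chosen for this sketch and do not come from the embodiment; a direct embedding needs V x H parameters, while a factorized one needs V x E + E x H.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding parameters, with or without factorization."""
    if embedding_size is None:
        # Direct embedding: one hidden-size vector per vocabulary entry.
        return vocab_size * hidden_size
    # Factorized embedding: small embedding table plus a projection matrix.
    return vocab_size * embedding_size + embedding_size * hidden_size


# Illustrative sizes only (assumptions of this sketch).
V, H, E = 30000, 768, 128
direct = embedding_params(V, H)
factorized = embedding_params(V, H, E)
print(direct, factorized)
```

With these assumed sizes the factorized table is several times smaller than the direct one, which is the effect the paragraph above attributes to factorization.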
In detail, the performing text classification on the target text abstract by using a preset text classification model to obtain the text category of the to-be-classified paragraph text includes:
performing word segmentation operation on the target text abstract to obtain abstract word segmentation of the target text abstract;
establishing an information processor through a preset category dictionary in a preset text classification model;
inputting the abstract word segmentation into the information processor to obtain the text category of the abstract word segmentation;
and determining the text category of the abstract word segmentation as the text category of the paragraph text to be classified.
In this embodiment, the category dictionary is a dictionary preset in the Albert model, and the information processor constructed from the category dictionary is used to classify the abstract word segmentation.
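The data flow of these steps (word segmentation, an information processor built from a category dictionary, and a resulting category) can be sketched with a toy stand-in. The sketch below is not the Albert model: the dictionary contents, the whitespace segmentation and the keyword-counting processor are all assumptions made only to show how the pieces connect.

```python
# Hypothetical category dictionary (illustrative keywords only).
CATEGORY_DICTIONARY = {
    "education": {"school", "teacher", "exam"},
    "sports": {"match", "team", "score"},
}


def segment(text):
    """Toy word segmentation: lowercase and split on whitespace and punctuation."""
    return text.lower().replace(".", " ").replace(",", " ").split()


def build_information_processor(category_dictionary):
    """Build a processor that maps segmented words to text categories."""
    def processor(tokens):
        counts = {
            category: sum(token in keywords for token in tokens)
            for category, keywords in category_dictionary.items()
        }
        best = max(counts.values())
        if best == 0:
            return []
        # More than one category may be returned when counts tie.
        return [c for c, n in counts.items() if n == best]
    return processor


processor = build_information_processor(CATEGORY_DICTIONARY)
tokens = segment("The team won the match. The score was close.")
categories = processor(tokens)
print(categories)
```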
In the embodiment of the present invention, the finally obtained text category of the paragraph text to be classified may be one or at least two, for example, the text category of the paragraph text to be classified obtained through the above operation may be multiple categories such as education, sports, and testing.
Furthermore, the method can be applied to intelligent information recommendation after the text category of the paragraph text to be classified is obtained.
Specifically, after obtaining the text category of the paragraph text to be classified, the method further includes:
selecting a paragraph text to be classified from a plurality of paragraph texts to be classified according to the interesting text category of the target user, and pushing the paragraph text to be classified to the target user.
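A minimal sketch of this recommendation step is given below. The data layout (pairs of text and category list) and the function name are assumptions of the sketch; the text only states that paragraph texts are selected according to the text categories the target user is interested in.

```python
def recommend(classified_texts, interested_categories):
    """Select the texts whose categories overlap the user's interests.

    classified_texts: list of (text, categories) pairs (assumed layout).
    """
    interested = set(interested_categories)
    return [text for text, categories in classified_texts
            if interested & set(categories)]


classified = [
    ("Text A", ["education"]),
    ("Text B", ["sports", "education"]),
    ("Text C", ["testing"]),
]
pushed = recommend(classified, ["sports"])
print(pushed)
```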
For example, the paragraph text to be classified is a referee document text comprising 20 document single sentences, and abstracts of the referee document text are extracted by both the keyword-based abstract extraction method and the deep learning-based abstract extraction method. If the first text abstract extracted by the keyword-based abstract extraction method contains 6 document single sentences and the second text abstract extracted by the deep learning-based abstract extraction method contains 8 document single sentences, the second text abstract of 8 document single sentences is determined as the target text abstract of the referee document text. After the target text abstract is obtained, the referee document text is classified according to it and is judged to be a civil referee document, a criminal referee document, an administrative referee document, or another general litigation document.
In the embodiment of the invention, a first text abstract is extracted from the paragraph text to be classified by using a keyword-based abstract extraction method, and a second text abstract is extracted from the paragraph text to be classified by using a deep learning-based abstract extraction method. The matching degrees of the first text abstract and the second text abstract with the paragraph text to be classified are calculated, and the text abstract with the higher matching degree is selected as the text abstract of the paragraph text to be classified. The selected text abstract is then classified by a text classification model to obtain the text category of the paragraph text to be classified. Obtaining the abstract first and then classifying it reduces information redundancy and makes the classification result more accurate. Meanwhile, extracting abstracts in more than one way avoids the problem that an abstract extracted by a single means is not accurate enough, which improves the accuracy of the obtained abstract and further improves the accuracy of text classification. Therefore, the embodiment of the invention can achieve the aim of improving the accuracy of text classification.
Fig. 3 is a schematic block diagram of a text classification apparatus based on a text abstract according to the present invention.
The text classification device 100 based on the text abstract can be installed in an electronic device. According to the implemented functions, the text classification device based on the text abstract can comprise a paragraph text dividing module 101, a first abstract obtaining module 102, a second abstract obtaining module 103, a target abstract confirming module 104 and a text abstract classification module 105. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the paragraph text dividing module 101 is configured to obtain a paragraph text to be classified, divide the paragraph text to be classified into single sentences, and obtain a single sentence set to be classified.
In the embodiment of the present invention, the paragraph text to be classified is a text that needs to be subjected to text classification, and the paragraph text to be classified may be in any format, for example, the paragraph text to be classified is a chinese text, or the paragraph text to be classified is an english text.
In the embodiment of the present invention, the to-be-classified paragraph text may be a text input by a user, or a text extracted from a preset to-be-classified paragraph text database.
In the embodiment of the present invention, the set of single sentences to be classified is a set formed by the single sentences in the text of the paragraphs to be classified, and specifically, the division of the text of the paragraphs to be classified into the single sentences can be realized by identifying punctuation marks in the text of the paragraphs to be classified.
For example, punctuation marks in the paragraph text to be classified are identified, and when preset punctuation marks (such as periods or semicolons) exist, the content between two adjacent preset punctuation marks is taken as a single sentence.
For example, scanning from the beginning of the paragraph text to be classified, when the first period is recognized, the content before that period is determined as a single sentence; when a period is recognized again, the content between the previous period and the newly recognized period is determined as a single sentence.
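The punctuation-based division described above can be sketched as follows. The delimiter set (ASCII and full-width periods and semicolons) is an assumption of this sketch, and splitting on an ASCII period would also break decimal numbers or abbreviations, which a practical implementation would need to handle.

```python
import re

# Assumed preset punctuation marks: periods and semicolons,
# in both ASCII and full-width (Chinese) forms.
PRESET_PUNCTUATION = ".;。；"


def split_into_single_sentences(paragraph, delimiters=PRESET_PUNCTUATION):
    """Split a paragraph into single sentences at the preset punctuation marks."""
    pattern = "[" + re.escape(delimiters) + "]"
    parts = re.split(pattern, paragraph)
    # Drop empty fragments, e.g. the one after the final period.
    return [part.strip() for part in parts if part.strip()]


paragraph = ("The court accepted the case. The plaintiff filed evidence; "
             "the defendant objected.")
sentences = split_into_single_sentences(paragraph)
print(sentences)
```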
The first abstract obtaining module 102 is configured to extract a first text abstract from the single sentence set to be classified by using a keyword-based extraction method of an abstract.
In the embodiment of the invention, the keyword-based abstract extraction method may be a Lead3 algorithm. The Lead3 algorithm is a text abstract extraction method that selects 3 sentences from the text (the leading three sentences) as the abstract of the text.
Further, the first abstract obtaining module 102 includes a vector processing unit, a matrix constructing unit, a calculating unit, and a first text abstract determining unit.
And the vector processing unit is used for vectorizing each single sentence to be classified in the single sentence set to be classified to obtain a text single sentence vector set.
Specifically, in this embodiment of the present invention, the vector processing unit is specifically configured to:
splitting the single sentences to be classified in the single sentence set to be classified to obtain a plurality of text words to be classified;
vectorizing the text words to be classified by using a preset word vector model to obtain a plurality of vector text words, and combining the vector text words to obtain a text single sentence vector set.
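A minimal sketch of these two steps is shown below. The tiny word-vector table stands in for the preset word vector model, and combining word vectors by averaging is an assumption of the sketch; the text only says that the vector text words are combined.

```python
# Hypothetical stand-in for a preset word vector model (2-dimensional toy vectors).
WORD_VECTORS = {
    "court": [1.0, 0.0],
    "ruling": [0.8, 0.2],
    "appeal": [0.6, 0.4],
}


def sentence_vector(sentence, word_vectors, dim=2):
    """Split a single sentence into words and average the known word vectors."""
    words = sentence.lower().split()
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        # No known word: fall back to a zero vector.
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


single_sentences = ["court ruling", "appeal", "unrelated words"]
vector_set = [sentence_vector(s, WORD_VECTORS) for s in single_sentences]
print(vector_set)
```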
In the embodiment of the invention, vectorizing the single sentences to be classified represents them numerically, so that text information readable by humans is converted into vector information that can be processed by a machine.
And the matrix construction unit is used for calculating the similarity among the text single sentence vectors in the text single sentence vector set, storing the similarity into a preset blank matrix and constructing a transition probability matrix.
In the embodiment of the present invention, each element of the transition probability matrix is a probability, all elements are non-negative, and the elements of each row sum to 1.
In detail, the matrix building unit is specifically configured to:
calculating the similarity of each text single sentence vector in the text single sentence vector set by using a similarity calculation formula to obtain a vector similarity set;
and storing the vector similarity in the vector similarity set into a pre-constructed matrix to obtain a transition probability matrix.
In this embodiment, the vector similarity set is obtained by calculating the similarity of each text single sentence vector in the single sentence vector set by using a preset similarity calculation formula, where the preset similarity calculation formula may be a cosine similarity calculation formula, an euclidean distance calculation formula, or the like.
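One way to turn pairwise similarities into a matrix whose rows sum to 1 is sketched below, using cosine similarity. Zeroing the diagonal, clipping negative similarities to zero, normalizing each row by its sum, and using a uniform row when a sentence is similar to nothing are all assumptions of this sketch; the text does not specify how the row sums of 1 are obtained.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def transition_probability_matrix(vectors):
    """Fill a blank matrix with similarities, then normalize each row."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]  # the preset blank matrix
    for i in range(n):
        for j in range(n):
            if i != j:
                # Clip negatives so that every entry stays non-negative.
                matrix[i][j] = max(cosine_similarity(vectors[i], vectors[j]), 0.0)
    for row in matrix:
        total = sum(row)
        for j in range(n):
            # Uniform row for a sentence with no similar neighbor (assumption).
            row[j] = row[j] / total if total > 0 else 1.0 / n
    return matrix


vectors = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
P = transition_probability_matrix(vectors)
for row in P:
    print([round(p, 3) for p in row])
```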
And the calculating unit is used for calculating the text ranking value of the single sentence to be classified based on the transition probability matrix.
In the embodiment of the present invention, the text ordering value (TextRank value) indicates the semantic relation between a single sentence to be classified and the paragraph text to be classified. The stronger the semantic relevance between the single sentence to be classified and the paragraph text to be classified, the higher the text ordering value, and the more likely the single sentence to be classified is to be used as a text abstract of the paragraph text to be classified.
In detail, the computing unit is specifically configured to:
obtaining the similarity between each text single sentence vector in the transition probability matrix;
constructing a similarity graph structure by taking each text single sentence as a node and taking the similarity between each text single sentence vector as an edge of the node;
and calculating the text ranking value of the single sentence to be classified by using the similarity graph structure.
In the embodiment of the invention, the text ranking value $S(V_i)$ of the single sentence to be classified is calculated by using the similarity graph structure according to the following formula:
$$S(V_i) = (1 - d) + d \sum_{V_j \in E(V_i)} \frac{W_{ij}}{\sum_{V_k \in E(V_j)} W_{jk}} \, S(V_j)$$
wherein j is a single sentence having a similarity relation with the target text single sentence i, $V_i$ is the node of the text single sentence i, $V_j$ is the node of the text single sentence j, E is the edge of the node, d is the damping coefficient, k is the co-occurrence of the target text single sentence i, $E(V_i)$ is the set of all nodes connected to node $V_i$, $E(V_j)$ is the set of all nodes connected to node $V_j$, $W_{ij}$ represents the weight of the edge between $V_i$ and $V_j$, $W_{jk}$ represents the weight of the edge between $V_k$ and $V_j$, and $S(V_j)$ is the text ranking value of $V_j$.
Further, the co-occurrence words are words in the same text paragraph that describe the same phenomenon or object with a certain frequency.
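The text ranking values can be computed iteratively over the similarity graph. The sketch below uses the standard weighted TextRank update, which is consistent with the symbols defined above (the damping coefficient d, the edge weights, and the set of nodes connected to each node). The damping value 0.85, the iteration limit, the tolerance and the example weights are assumptions of this sketch, and the weight matrix is assumed to be symmetric with a zero diagonal.

```python
def textrank(weights, d=0.85, max_iterations=100, tolerance=1e-6):
    """Iterate S(V_i) = (1 - d) + d * sum_j W_ij / (sum_k W_jk) * S(V_j)."""
    n = len(weights)
    if n == 0:
        return []
    # Total edge weight leaving each node j (the inner sum over k).
    out_sums = [sum(weights[j][k] for k in range(n) if k != j) for j in range(n)]
    scores = [1.0] * n
    for _ in range(max_iterations):
        updated = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if j != i and weights[i][j] > 0 and out_sums[j] > 0:
                    rank += weights[i][j] / out_sums[j] * scores[j]
            updated.append((1 - d) + d * rank)
        converged = max(abs(a - b) for a, b in zip(updated, scores)) < tolerance
        scores = updated
        if converged:
            break
    return scores


# Illustrative similarity graph for three single sentences.
W = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.8],
    [0.1, 0.8, 0.0],
]
scores = textrank(W)
print([round(s, 3) for s in scores])
```

In this example the second sentence is the most strongly connected one, so it receives the highest text ranking value.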
And the first text abstract determining unit is used for screening the single sentences to be classified in the single sentence set to be classified into the first text abstract according to the text sorting value.
In detail, the first text summary determining unit is specifically configured to:
traversing the text sorting values of the single sentences to be classified in the paragraph texts to be classified, and selecting a preset number of text sorting values from large to small;
and combining the target to-be-classified single sentences corresponding to the preset number of text sorting values into the first text abstract.
Alternatively, after the preset number of target single sentences to be classified are selected, the position of each selected single sentence in the paragraph text to be classified is obtained, and the selected target single sentences are arranged in that order to form the first text abstract.
In this embodiment, the preset number of summary single sentences is a number of single sentences of a first text summary preset by a user, and the number of single sentences to be classified in the first text summary of the obtained paragraph text to be classified meets the number of summary single sentences preset by the user.
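The selection of the first text abstract, including the optional re-ordering by original position, can be sketched as below. The function and parameter names are hypothetical.

```python
def select_first_abstract(sentences, scores, preset_number, keep_original_order=True):
    """Pick the preset number of sentences with the largest text sorting values."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:preset_number]
    if keep_original_order:
        # Restore the order in which the sentences appear in the paragraph.
        chosen.sort()
    return [sentences[i] for i in chosen]


sentences = ["a", "b", "c", "d"]
scores = [0.5, 0.2, 0.1, 0.9]
by_position = select_first_abstract(sentences, scores, 2)
by_score = select_first_abstract(sentences, scores, 2, keep_original_order=False)
print(by_position, by_score)
```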
The second abstract obtaining module 103 is configured to extract a second text abstract from the set of single sentences to be classified by using a deep learning-based abstraction extraction method.
Further, the second abstract obtaining module 103 is specifically configured to:
acquiring a training text set and a text abstract of the training text set, and training a preset two-classification model by using the training text set and the text abstract of the training text set to obtain a labeling model;
marking each single sentence to be classified in the single sentence set to be classified by using the marking model to obtain a marked single sentence;
and obtaining a second text abstract according to the sequence of the single sentences in the text of the paragraph to be classified and the labels of the labeled single sentences.
In this embodiment, each single sentence to be classified in the single sentence set to be classified is given one of two labels: one indicates that the sentence belongs to the subsequent abstract, and the other indicates that the sentence does not belong to the subsequent abstract.
In this embodiment, the sequence of a single sentence refers to its ordinal position in the paragraph text to be classified (for example, the first sentence); the labeled single sentences that belong to the subsequent abstract are combined according to this sequence to obtain the second text abstract.
In the embodiment of the invention, the training text set and its text abstracts can be obtained by crawling publicly available texts from the network by using crawler technology.
Further, the two-classification (binary classification) model is a model constructed based on a sigmoid binary classification function, and is trained with the training text set and the text abstracts of the training text set to obtain the labeling model.
Further, before the labeling of each single sentence to be classified in the single sentence set to be classified by using the labeling model, the method further includes: adding identifiers to the single sentence to be classified.
For example, a front identifier [CLS] is added in front of the single sentence to be classified, and a rear identifier [SEP] is added at the end of the single sentence to be classified.
In this embodiment, adding identifiers to the single sentence to be classified delimits the range of the sentence, signals to the model where reading starts and ends, and avoids reading errors.
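The labeling flow (adding identifiers, scoring each single sentence with a sigmoid, and combining the sentences labeled as belonging to the abstract in their original order) can be sketched as follows. The scoring function is a hypothetical stand-in for the trained labeling model, and the 0.5 threshold is an assumption of this sketch.

```python
import math


def add_identifiers(sentence):
    """Add the front identifier [CLS] and the rear identifier [SEP]."""
    return f"[CLS] {sentence} [SEP]"


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def label_sentences(sentences, score_fn, threshold=0.5):
    """Label each sentence 1 (belongs to the abstract) or 0 (does not)."""
    labels = []
    for sentence in sentences:
        probability = sigmoid(score_fn(add_identifiers(sentence)))
        labels.append(1 if probability >= threshold else 0)
    return labels


def build_second_abstract(sentences, labels):
    """Combine the sentences labeled 1, keeping their order in the paragraph."""
    return [s for s, label in zip(sentences, labels) if label == 1]


def toy_score(marked_sentence):
    # Hypothetical stand-in for the raw score of a trained labeling model.
    return 2.0 if "court" in marked_sentence else -2.0


sentences = ["The court ruled", "The weather was fine", "The court adjourned"]
labels = label_sentences(sentences, toy_score)
second_abstract = build_second_abstract(sentences, labels)
print(labels, second_abstract)
```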
The target abstract confirming module 104 is configured to calculate matching degrees between the first text abstract and the second text abstract and the paragraph text to be classified, and determine that the first text abstract or the second text abstract is a target text abstract according to the matching degrees.
In the embodiment of the present invention, there are various methods for determining the target text abstract, for example, determining the target text abstract according to a quick sorting method.
In detail, the matching degree of the first text abstract with the paragraph text to be classified and the matching degree of the second text abstract with the paragraph text to be classified are calculated respectively. If the matching degree of the first text abstract is higher than that of the second text abstract, the first text abstract is used as the target text abstract of the paragraph text to be classified; if the matching degree of the second text abstract is higher than that of the first text abstract, the second text abstract is used as the target text abstract of the paragraph text to be classified.
Further, the calculating of the matching degree between the first text abstract and the paragraph text to be classified is realized by calculating the number of single sentences in the first text abstract.
For example, when the paragraph text to be classified includes twenty to-be-classified single sentences, if the first text abstract includes six to-be-classified single sentences, the matching degree between the first text abstract and the paragraph text to be classified is 6, and if the second text abstract includes eight to-be-classified single sentences, the matching degree between the second text abstract and the paragraph text to be classified is 8, so that the matching degree between the second text abstract and the paragraph text to be classified is higher than that of the first text abstract, and the second text abstract is used as the target text abstract of the paragraph text to be classified.
The text abstract classifying module 105 is configured to perform text classification on the target text abstract by using a preset text classification model to obtain a text category of the paragraph text to be classified.
In the embodiment of the invention, the preset text classification model is an Albert model, and the Albert model is a model obtained by simplification and optimization based on a Bert model. Compared with the Bert model, the Albert model has fewer required parameters and faster training, and can solve the problems of overlarge parameters and slow training of the Bert text classification model, so that the memory occupation can be reduced by classifying through the Albert model, and the text classification speed is increased.
Specifically, in this embodiment, the Albert text classification model may be obtained from the Bert text classification model by factorizing its embedding layer (Embedding), sharing the parameters of its encoder, and replacing its pre-training task with a sentence order prediction (SOP) task. Factorizing the embedding layer and sharing the encoder parameters reduce the number of parameters required by the Bert text classification model; together with replacing the pre-training task with the sentence order prediction task, this can improve the speed and the classification accuracy of the model.
In detail, the text abstract classification module 105 is specifically configured to:
performing word segmentation operation on the target text abstract to obtain abstract word segmentation of the target text abstract;
establishing an information processor through a preset category dictionary in a preset text classification model;
inputting the abstract word segmentation into the information processor to obtain the text category of the abstract word segmentation;
and determining the text category of the abstract word segmentation as the text category of the paragraph text to be classified.
In this embodiment, the category dictionary is a dictionary preset in the Albert model, and the information processor constructed from the category dictionary is used to classify the abstract word segmentation.
In the embodiment of the present invention, the finally obtained text category of the paragraph text to be classified may be one or at least two, for example, the text category of the paragraph text to be classified obtained through the above operation may be multiple categories such as education, sports, and testing.
Furthermore, the method can be applied to intelligent information recommendation after the text category of the paragraph text to be classified is obtained.
Specifically, the device further comprises a recommending module, wherein the recommending module is used for:
after the text category of the paragraph text to be classified is obtained, the paragraph text to be classified is selected from the plurality of paragraph texts to be classified according to the interested text category of the target user and pushed to the target user.
For example, the paragraph text to be classified is a referee document text comprising 20 document single sentences, and abstracts of the referee document text are extracted by both the keyword-based abstract extraction method and the deep learning-based abstract extraction method. If the first text abstract extracted by the keyword-based abstract extraction method contains 6 document single sentences and the second text abstract extracted by the deep learning-based abstract extraction method contains 8 document single sentences, the second text abstract of 8 document single sentences is determined as the target text abstract of the referee document text. After the target text abstract is obtained, the referee document text is classified according to it and is judged to be a civil referee document, a criminal referee document, an administrative referee document, or another general litigation document.
In the embodiment of the invention, a first text abstract is extracted from the paragraph text to be classified by using a keyword-based abstract extraction method, and a second text abstract is extracted from the paragraph text to be classified by using a deep learning-based abstract extraction method. The matching degrees of the first text abstract and the second text abstract with the paragraph text to be classified are calculated, and the text abstract with the higher matching degree is selected as the text abstract of the paragraph text to be classified. The selected text abstract is then classified by a text classification model to obtain the text category of the paragraph text to be classified. Obtaining the abstract first and then classifying it reduces information redundancy and makes the classification result more accurate. Meanwhile, extracting abstracts in more than one way avoids the problem that an abstract extracted by a single means is not accurate enough, which improves the accuracy of the obtained abstract and further improves the accuracy of text classification. Therefore, the embodiment of the invention can achieve the aim of improving the accuracy of text classification.
Fig. 4 is a schematic structural diagram of an electronic device implementing the text classification method based on the text abstract according to the present invention.
The electronic device may comprise a processor 10, a memory 11, a communication bus 12 and a communication interface 13, and may further comprise a computer program, such as a text classification program based on a text abstract, stored in the memory 11 and executable on the processor 10.
In some embodiments, the processor 10 may be composed of an integrated circuit, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same function or different functions, and includes one or more Central Processing Units (CPUs), a microprocessor, a digital Processing chip, a graphics processor, a combination of various control chips, and the like. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device by running or executing programs or modules stored in the memory 11 (for example, executing a text classification program based on a text abstract, etc.) and calling data stored in the memory 11.
The memory 11 includes at least one type of readable storage medium including flash memory, removable hard disks, multimedia cards, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disks, optical disks, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device, for example a removable hard disk of the electronic device. The memory 11 may also be an external storage device of the electronic device in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used not only to store application software installed in the electronic device and various types of data, such as codes of a text classification program based on a text abstract, but also to temporarily store data that has been output or is to be output.
The communication bus 12 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
The communication interface 13 is used for communication between the electronic device and other devices, and includes a network interface and a user interface. Optionally, the network interface may include a wired interface and/or a wireless interface (e.g., WI-FI interface, bluetooth interface, etc.), which are typically used to establish a communication connection between the electronic device and other electronic devices. The user interface may be a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable, among other things, for displaying information processed in the electronic device and for displaying a visualized user interface.
Fig. 4 shows only an electronic device having components, and those skilled in the art will appreciate that the structure shown in Fig. 4 does not constitute a limitation of the electronic device, which may include fewer or more components than those shown, may combine some components, or may have a different arrangement of components.
For example, although not shown, the electronic device may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management and the like are realized through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The text classification program based on the text abstract, which is stored in the memory 11 of the electronic device, is a combination of a plurality of computer programs and, when run by the processor 10, can implement:
obtaining a paragraph text to be classified, and dividing the paragraph text to be classified into single sentences to obtain a single sentence set to be classified;
extracting a first text abstract from the single sentence set to be classified by using a key word-based extraction method of the abstract;
extracting a second text abstract from the single sentence set to be classified by using a deep learning-based extraction method of the abstract;
respectively calculating the matching degree of the first text abstract and the second text abstract with the paragraph text to be classified, and determining the first text abstract or the second text abstract as a target text abstract according to the matching degree;
and performing text classification on the target text abstract by using a preset text classification model to obtain the text category of the paragraph text to be classified.
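How the five steps chain together can be sketched as a small pipeline. The four callables passed in are hypothetical placeholders for the sentence division, the two abstract extraction methods and the preset text classification model; the toy implementations below exist only to make the sketch runnable and are not the methods of the embodiment.

```python
def classify_paragraph(paragraph, split, extract_first, extract_second, classify):
    """Run the five steps: split, extract two abstracts, pick one, classify."""
    sentences = split(paragraph)
    first = extract_first(sentences)
    second = extract_second(sentences)
    # Matching degree taken as the number of single sentences;
    # the first abstract is kept on a tie (assumption).
    target = second if len(second) > len(first) else first
    return classify(target)


# Toy placeholders (assumptions of this sketch).
def toy_split(paragraph):
    return [s.strip() for s in paragraph.split(".") if s.strip()]


def toy_extract_first(sentences):
    return sentences[:1]


def toy_extract_second(sentences):
    return sentences[:2]


def toy_classify(abstract):
    return "civil" if any("contract" in s for s in abstract) else "other"


text = "The parties signed a contract. The contract was breached. The court heard the case."
category = classify_paragraph(text, toy_split, toy_extract_first,
                              toy_extract_second, toy_classify)
print(category)
```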
Specifically, for the specific implementation of the computer program by the processor 10, reference may be made to the description of the relevant steps in the embodiment corresponding to Fig. 1, which is not repeated here.
Further, if the modules/units integrated in the electronic device are implemented in the form of software functional units and sold or used as separate products, they may be stored in a computer-readable storage medium. The computer-readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive (U-disk), a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a Read-Only Memory (ROM).
The present invention also provides a computer-readable storage medium, storing a computer program which, when executed by a processor of an electronic device, may implement:
obtaining a paragraph text to be classified, and dividing the paragraph text to be classified into single sentences to obtain a single sentence set to be classified;
extracting a first text abstract from the single sentence set to be classified by using a keyword-based abstract extraction method;
extracting a second text abstract from the single sentence set to be classified by using a deep learning-based abstract extraction method;
respectively calculating the matching degree of the first text abstract and the second text abstract with the paragraph text to be classified, and determining the first text abstract or the second text abstract as a target text abstract according to the matching degree;
and performing text classification on the target text abstract by using a preset text classification model to obtain the text category of the paragraph text to be classified.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative; the division of the modules is only one logical functional division, and other divisions may be used in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit can be realized in the form of hardware, or in the form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. Essentially a decentralized database, a blockchain is a series of data blocks linked to one another by cryptographic methods; each data block contains information on a batch of network transactions, which is used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
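The idea of data blocks associated by a cryptographic method can be illustrated with a short Python sketch. It is only an assumption-laden illustration and not part of the embodiment: the block fields, the function names and the use of SHA-256 are hypothetical choices, and the consensus and point-to-point transmission mechanisms mentioned above are omitted entirely.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS_PREV_HASH = "0" * 64  # hypothetical marker for the first block


def block_hash(block: Dict[str, Any]) -> str:
    """SHA-256 over the block's index, data and predecessor hash."""
    payload = json.dumps(
        {k: block[k] for k in ("index", "data", "prev_hash")}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain: List[Dict[str, Any]], data: Any) -> Dict[str, Any]:
    """Create a block that stores the hash of the previous block."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV_HASH
    block = {"index": len(chain), "data": data, "prev_hash": prev_hash}
    block["hash"] = block_hash(block)
    chain.append(block)
    return block


def is_valid(chain: List[Dict[str, Any]]) -> bool:
    """Check that every stored hash and every link to a predecessor is intact."""
    for i, block in enumerate(chain):
        expected_prev = chain[i - 1]["hash"] if i else GENESIS_PREV_HASH
        if block["prev_hash"] != expected_prev or block["hash"] != block_hash(block):
            return False
    return True
```

Because each block stores the hash of its predecessor, changing the data of any earlier block makes `is_valid` return `False`, which is the validity check the paragraph above refers to.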
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.