Text clustering method and system
1. A text clustering method is characterized by comprising the following steps:
acquiring an input text and splitting the input text based on a preset model;
acquiring text characteristics of the split input text;
acquiring the generation probability of the text features for at least one preset cluster;
if the maximum value in the generation probability is larger than a preset threshold value, classifying the input text into a cluster corresponding to the maximum value in the generation probability;
and if the maximum value in the generation probability is smaller than or equal to the preset threshold value, creating a new cluster according to the input text.
2. The method of claim 1, wherein obtaining the input text and splitting the input text based on a preset model comprises:
acquiring the input text, and splitting the input text based on a 2/3-gram model to obtain the split input text.
3. The method according to claim 2, wherein the text features include the number of times and/or the probability that phrases in the split input text appear in the word frequency dictionary of the preset at least one cluster, and the obtaining the text features of the split input text includes:
matching the phrases in the split input text with the character strings in the word frequency dictionary of the preset at least one cluster, and acquiring the number of times and/or the probability that the phrases in the split input text appear in the word frequency dictionary of the preset at least one cluster.
4. The method according to claim 3, wherein the obtaining the generation probability of the text feature for a preset at least one cluster comprises:
substituting the number of times and/or the probability that the phrases in the split input text appear in the word frequency dictionary of the preset at least one cluster into a preset generation probability calculation formula, and acquiring the generation probability of the text features for the preset at least one cluster.
5. The method according to claim 4, wherein if the maximum value of the generated probabilities is greater than a preset threshold, after classifying the input text into the cluster corresponding to the maximum value of the generated probabilities, the method further comprises:
recording the split input text in the word frequency dictionary of the cluster corresponding to the maximum value in the generation probabilities.
6. A text clustering system, the system comprising:
the first processing module is used for acquiring an input text and splitting the input text based on a preset model;
the second processing module is used for acquiring the text characteristics of the split input text;
the third processing module is used for acquiring the generation probability of the text features for the preset at least one cluster;
the fourth processing module is used for classifying the input text into a cluster corresponding to the maximum value in the generation probabilities when the maximum value in the generation probabilities is larger than a preset threshold value;
and the fifth processing module is used for creating a new cluster according to the input text when the maximum value in the generation probability is less than or equal to the preset threshold value.
7. The system of claim 6,
the first processing module is specifically configured to obtain the input text and split the input text based on a 2/3-gram model to obtain the split input text.
8. The system according to claim 7, wherein the text features include the number and/or probability of occurrence of phrases in the split input text in the word frequency dictionary of the preset at least one cluster,
the second processing module is specifically configured to match the phrases in the split input text with the character strings in the word frequency dictionary of the preset at least one cluster, and acquire the number of times and/or the probability that the phrases in the split input text appear in the word frequency dictionary of the preset at least one cluster.
9. The system of claim 8,
the third processing module is specifically configured to substitute the number of times and/or the probability that the phrases in the split input text appear in the word frequency dictionary of the preset at least one cluster into a preset generation probability calculation formula, and obtain the generation probability of the text features for the preset at least one cluster.
10. The system according to claim 9, wherein if the maximum value of the generated probabilities is greater than a predetermined threshold, after classifying the input text into the cluster corresponding to the maximum value of the generated probabilities,
the fourth processing module is further configured to record the split input text in the word frequency dictionary of the cluster corresponding to the maximum value in the generation probabilities.
Background
In today's big data age, text is a primary source of data. Text retrieval technology helps us locate the required text content in massive data; therefore, text retrieval has important application value. Text retrieval relies on efficient and accurate text classification and clustering, and a large amount of text data related to the input text is found through the features produced by classification and clustering.
The prior art solves the text clustering problem by the following methods: 1. A supervised machine learning model is trained by manually setting text statistical variables and collecting data. 2. The similarity of texts is defined manually. For example, texts are compared with each other word by word or letter by letter, the comparison result is put into a Euclidean distance to calculate the text distance, and texts with similar distances are classified into one class by applying the K-means clustering algorithm to the text distances. 3. Regular patterns of texts are set manually, texts matching the same regular pattern are classified into one class, and several regular patterns are then manually combined into one cluster with business significance.
However, the first method requires model features to be set manually, and human experience coverage is limited; moreover, supervised models rely on label data, which varies from one application scenario to another. In the second method, the model depends on text similarity, whose universality is poor; the model is also unsuitable for processing data online, and applying it online requires a large amount of data for warm-up. With the third approach, the ability of regular expressions to cover features is very limited.
Disclosure of Invention
The application provides a text clustering method, which comprises the following steps: acquiring an input text and splitting the input text based on a preset model; acquiring text characteristics of the split input text; acquiring the generation probability of the text features for at least one preset cluster; if the maximum value in the generation probability is larger than a preset threshold value, classifying the input text into a cluster corresponding to the maximum value in the generation probability; and if the maximum value in the generation probability is smaller than or equal to the preset threshold value, creating a new cluster according to the input text. The method does not depend on label data or text similarity. When the input text is classified into an existing cluster, the features of that cluster are enriched, which improves the coverage of the cluster and the universality of the scheme. When the input text does not match any existing cluster, a new cluster is created, which improves the coverage of all clusters.
Optionally, with reference to the first aspect, in a first possible implementation manner of the first aspect, the obtaining an input text and splitting the input text based on a preset model includes: acquiring the input text, and splitting the input text based on a 2/3-gram model to obtain the split input text.
Optionally, with reference to the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the text features include the number of times and/or the probability that a phrase in the split input text appears in the word frequency dictionary of the preset at least one cluster, and the obtaining the text features of the split input text includes: matching the phrases in the split input text with the character strings in the word frequency dictionary of the preset at least one cluster, and acquiring the number of times and/or the probability that the phrases in the split input text appear in the word frequency dictionary of the preset at least one cluster.
Optionally, with reference to the second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, the obtaining the generation probability of the text features for the preset at least one cluster includes: substituting the number of times and/or the probability that the phrases in the split input text appear in the word frequency dictionary of the preset at least one cluster into a preset generation probability calculation formula, and acquiring the generation probability of the text features for the preset at least one cluster.
Optionally, with reference to the third possible implementation manner of the first aspect, in a fourth possible implementation manner of the first aspect, after classifying the input text into the cluster corresponding to the maximum value in the generation probabilities if that maximum value is greater than a preset threshold, the method further includes: recording the split input text in the word frequency dictionary of the cluster corresponding to the maximum value in the generation probabilities.
A second aspect of the present application provides a text clustering system, wherein the system includes: the first processing module is used for acquiring an input text and splitting the input text based on a preset model; the second processing module is used for acquiring the text characteristics of the split input text; the third processing module is used for acquiring the generation probability of the text features aiming at least one preset cluster; the fourth processing module is used for classifying the input text into a cluster corresponding to the maximum value in the generation probabilities when the maximum value in the generation probabilities is larger than a preset threshold value; and the fifth processing module is used for creating a new cluster according to the input text when the maximum value in the generation probability is less than or equal to the preset threshold value.
Optionally, with reference to the second aspect, in a first possible implementation manner of the second aspect, the first processing module is specifically configured to obtain the input text and split the input text based on a 2/3-gram model to obtain the split input text.
Optionally, with reference to the first possible implementation manner of the second aspect, in a second possible implementation manner of the second aspect, the text feature includes a number of times and/or a probability that a phrase in the split input text appears in the word frequency dictionary of the preset at least one cluster, and the second processing module is specifically configured to match the phrase in the split input text with a character string in the word frequency dictionary of the preset at least one cluster, and acquire the number of times and/or the probability that the phrase in the split input text appears in the word frequency dictionary of the preset at least one cluster.
Optionally, with reference to the second possible implementation manner of the second aspect, in a third possible implementation manner of the second aspect, the third processing module is specifically configured to substitute the number of times and/or the probability that the phrases in the split input text appear in the word frequency dictionary of the preset at least one cluster into a preset generation probability calculation formula, and obtain the generation probability of the text features for the preset at least one cluster.
Optionally, with reference to the third possible implementation manner of the second aspect, in a fourth possible implementation manner of the second aspect, after classifying the input text into the cluster corresponding to the maximum value in the generation probabilities if the maximum value in the generation probabilities is greater than a preset threshold, the fourth processing module is further configured to record the split input text in the word frequency dictionary of the cluster corresponding to the maximum value in the generation probabilities.
The application provides a text clustering method, which comprises the following steps: acquiring an input text and splitting the input text based on a preset model; acquiring text characteristics of the split input text; acquiring the generation probability of the text features for at least one preset cluster; if the maximum value in the generation probability is larger than a preset threshold value, classifying the input text into a cluster corresponding to the maximum value in the generation probability; and if the maximum value in the generation probability is smaller than or equal to the preset threshold value, creating a new cluster according to the input text. The method does not depend on label data or text similarity. When the input text is classified into an existing cluster, the features of that cluster are enriched, which improves the coverage of the cluster and the universality of the scheme. When the input text does not match any existing cluster, a new cluster is created, which improves the coverage of all clusters.
Drawings
Fig. 1 is a schematic diagram illustrating a method for solving a clustering problem by using a supervised model according to an embodiment of the present application;
fig. 2 is a schematic diagram illustrating a method for solving a clustering problem by an unsupervised model according to an embodiment of the present application;
fig. 3 is a schematic diagram of a method for solving the clustering problem through regular expressions according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an embodiment of a text clustering self-learning streaming processing method provided by the present application;
fig. 5 is a schematic diagram of an embodiment of a text clustering self-learning streaming system provided by the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Moreover, the terms "comprises," "comprising," and any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus.
The text retrieval technology can help us locate the text content to be searched in massive data; therefore, text retrieval has important application value. Text retrieval relies on efficient and accurate text classification and clustering, and a large amount of text data related to the input text is found through the features produced by classification and clustering. Take the clustering scenario of scalper shipping addresses as an example. During operation, an e-commerce platform is attacked by scalper gangs, which purchase scarce commodities in batches through a large number of accounts controlled by machine scripts or by the gangs themselves, so that normal users cannot purchase the scarce commodities and have to buy them from the scalpers at a marked-up price. In order to bypass simple protection measures, the scalpers transform the same shipping address text into different variant address texts and use a different variant as the shipping address of each transaction: literally the addresses do not look like the same address, but the shipping locations are close to each other, and the scalpers behind them can still pick up the goods through these shipping addresses. The following lists several examples of text clusters of scalper pickup addresses:
example one:
number 8 north-6 Guangzhou city/Zengcheng district/Dongjiang dao 2203;
10-span No. 8 Guangzhou city/Zengcheng district/Dongjiang dao north 1506;
guangzhou city/Zengcheng district/northeast river Dadao No. 8 12A 1808-pwl.
Example two:
jiaxing city/south lake area/Linggu pond 811 U.S. usud;
jiaxing city/south lake area/Ling Highway 811 No.;
jiaxing city/south lake area/Linggong pond 811 number kmxn.
Example three:
tin-free city/boon area/Changan Columbus 3-888;
wuxi city/Whistle area/Changan Columbus 4-222.
Example four:
qingdao city/yellow island/Shandong university of science and technology;
qingdao city/yellow island/Shandong university of Lang Bay;
dormitory building 502 of Shandong university of science and technology No. 3, Qingdao City/yellow island/anterior gulf harbor 579;
14 teaching building 25 No. 579 of Qingdao city, Huangdao district, Kuai Konglu and Qiangwan;
qingdao city/yellow island/forward gulf harbour road No. 6 building 3 unit 5 building.
As can be seen from the above examples, although the addresses do not literally appear to be the same address, the shipping locations are adjacent, and the scalpers behind them can pick up the goods through these addresses. In an e-commerce platform with a huge business volume and many kinds of goods on sale, both scalpers and normal users have great purchase demand for scarce commodities. Orders for scarce goods from a large number of scalpers are mixed into the data stream of orders from normal users, and scalpers cannot easily be distinguished from normal users by the same IP, the same device, or the same shipping address.
The scalpers change part of the text of the same shipping address, so that the shipping addresses do not look the same on the surface and simple risk control rules are bypassed. Therefore, a variant processing algorithm that can identify such variant texts, such as text clustering, is needed to classify variant texts of the same shipping address into the same category. How to group similar texts into the same cluster is thus a problem to be solved urgently.
The prior art can provide the following methods to solve the text clustering problem:
1. by supervised clustering techniques.
Text statistical variables are set manually, and data are collected to train a supervised model. For example, a model is trained with label data and features such as the text length, the number of consecutive characters or letters, and the number of special characters, and text classification is then performed using the trained model and the text features. Specifically, referring to fig. 1: the text is obtained first, and features are extracted from the text to obtain a training set. For example, the features may include the length of the text, the number of consecutive characters or letters, the number of characters included, and the like. A supervised machine learning model is then obtained from the preset sample labels and the training set, and the obtained model undergoes model testing or cross validation. The model can be deployed online after it is obtained. When a new text is obtained, the corresponding features are extracted from it as input, and the result is predicted by the model deployed online.
However, in this method the model features need to be designed manually, human experience coverage is limited, and some features may be missed. Second, supervised machine learning models rely on label data, which varies from one application scenario to another.
2. By unsupervised clustering techniques.
The similarity of texts is defined manually. For example, after the texts are acquired, a custom text similarity calculation is performed according to a custom similarity rule, and clustering is then performed with an unsupervised machine learning model (such as the K-means model), so that texts with similar distances are classified into one class. Referring to fig. 2: the texts may be obtained first and compared word by word or letter by letter; the comparison result is put into the Euclidean distance to calculate the text distance (i.e., the custom text similarity calculation); the text distances are then clustered with the K-means algorithm (i.e., the unsupervised machine learning model), so that texts with similar distances fall into one class, yielding clustering results and cluster center parameters. When the method is used, a text sample to be predicted is obtained first, and its similarity to each cluster center is calculated from the obtained cluster center parameters; the cluster at the minimum distance is then selected as the clustering result. The cluster center parameters may be updated after the clustering result is obtained.
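The character-by-character comparison described above can be sketched as follows. This is a minimal illustration only: the Levenshtein edit distance is used here as one hypothetical choice of custom text similarity, and the function name `edit_distance` is an assumption, not an element of the prior-art scheme.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    dp = list(range(len(b) + 1))  # distances from the empty a-prefix to every b-prefix
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete ca
                                     dp[j - 1] + 1,      # insert cb
                                     prev + (ca != cb))  # substitute ca -> cb
    return dp[len(b)]

# Two variant addresses differ by only a few edits, while unrelated texts differ by many.
print(edit_distance("kitten", "sitting"))  # 3
```

In the prior-art pipeline, such pairwise distances would then be fed to a clustering algorithm so that texts with small mutual distances fall into one class.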
This scheme depends on text similarity, which has poor universality. The model is not suitable for processing data online, and applying it online requires a large amount of data for warm-up.
3. By regular matching techniques.
A regular pattern of the text is set manually. Texts matching the same regular pattern are classified into one class, and several regular patterns are then manually combined into one cluster with business significance. Referring to fig. 3: since the ability of regular expressions to cover features is rather limited, there may be many texts that the regular expressions cannot cover.
Therefore, the present application provides a text clustering self-learning streaming processing method; please refer to fig. 4. The method can extract the features of streaming text data in an extremely simple and general manner without manually setting the features, can update the templates used online in time, and can update the features used by the templates. The method comprises the following steps:
101. Acquire an input text and split the input text based on a preset model.
An input text is acquired and split based on a preset model. The preset model may be a 2/3-gram model.
It should be noted that the N-gram model is a statistical language model commonly used in large-vocabulary continuous speech recognition. Using the collocation information between adjacent words in the context, it computes word frequencies to find the sentence with the maximum probability (text error correction) or to perform text spell checking. The N-gram model is based on the Markov assumption: the appearance of the N-th word is related only to the previous N-1 words and not to any other words, and the probability of the whole sentence is the product of the appearance probabilities of all words. In spell-checking applications, a smoothing algorithm needs to be added to the N-gram model to obtain good results because of the sparsity of the data.
The 2/3-gram model is one of the N-gram models; it considers that any word in the text is related only to the previous 2 or 3 words and not to any other words. Splitting the input text based on the 2/3-gram model means sliding over the input text and taking two or three consecutive characters at a time. For example, if the input text is "Guangzhou city … garden cell D-D cell L", the text can be split into the 2-grams "Guangzhou", "state city", …, "element L" and the 3-grams "Guangzhou city", …, "element L".
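The splitting in step 101 can be sketched as follows; this is a minimal sketch assuming character-level windows of width 2 and 3, and the function name `split_2_3_grams` is illustrative:

```python
def split_2_3_grams(text: str) -> list:
    """Split text into all consecutive 2-character and 3-character
    phrases, as the 2/3-gram model prescribes."""
    grams = []
    for n in (2, 3):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

print(split_2_3_grams("abcd"))  # ['ab', 'bc', 'cd', 'abc', 'bcd']
```

Special characters would be removed or replaced before this split, as noted below.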
It is understood that, before the input text is split, any special characters in the input text may be removed or replaced.
102. Acquire the text features of the split input text.
The text features of the split input text are acquired. It should be noted that at least one cluster is preset, and each cluster has a word frequency dictionary containing at least one character string (phrase). A text feature may be the number of times and/or the probability that a phrase in the split input text appears in the word frequency dictionary of the preset at least one cluster. Acquiring the text features of the split input text may specifically be: matching the phrases in the split input text with the character strings in the word frequency dictionary of the at least one cluster, and acquiring the number of times and/or the probability that the phrases in the split input text appear in the word frequency dictionary of the at least one cluster.
For example, suppose N clusters are preset, where N is an integer greater than or equal to 1, and each of the N clusters has a word frequency dictionary. The number of times and/or the probability that the phrases in the split input text appear in the word frequency dictionary of each of the N clusters can be acquired. Specifically, referring to the example in step 101, if the input text "Guangzhou city … garden district D a D unit L" is split according to the 2/3-gram model, the split input text "Guangzhou", "state city", …, "element L" and "Guangzhou city", …, "element L" is obtained. The phrases in the split input text are compared with the character strings in the word frequency dictionary of the first of the N clusters to obtain the number of times and/or the probability of the split input text with respect to the first word frequency dictionary. For example, one may obtain: "Guangzhou": 2 occurrences, probability of occurrence 0.0471; "state city": 2 occurrences, probability of occurrence 0.0471; …; "Guangzhou city": 2 occurrences, probability of occurrence 0.0471; "D a D": 1 occurrence, probability of occurrence 0.0236. The phrases in the split input text are then compared with the character strings in the word frequency dictionary of the second of the N clusters, and so on, until the number of times and/or the probability that the phrases appear in the word frequency dictionary of each of the N clusters is obtained.
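Step 102 can be sketched as follows, assuming each cluster's word frequency dictionary is a plain mapping from phrase to count; the names `phrase_features` and `word_freq` are hypothetical, and the toy counts below are not the 0.0471 figures of the example above:

```python
def phrase_features(grams, word_freq):
    """For each phrase of the split input text, look up how many times it
    appears in a cluster's word frequency dictionary and derive its
    probability (count / total count in the dictionary)."""
    total = sum(word_freq.values()) or 1  # guard against an empty dictionary
    counts = {g: word_freq.get(g, 0) for g in grams}
    probs = {g: c / total for g, c in counts.items()}
    return counts, probs

word_freq = {"ab": 2, "bc": 1, "cd": 1}  # toy cluster dictionary
counts, probs = phrase_features(["ab", "bc", "xy"], word_freq)
print(counts)  # {'ab': 2, 'bc': 1, 'xy': 0}
print(probs)   # {'ab': 0.5, 'bc': 0.25, 'xy': 0.0}
```

The same lookup would be repeated against the word frequency dictionary of every preset cluster.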
103. Acquire the generation probability of the text features for the preset at least one cluster.
The generation probability of the text features for the preset at least one cluster is acquired. In the field of statistics, the "generation probability" refers to the probability that a given probability distribution randomly generates a given sample. The method of calculating this probability is optional and configurable: for example, the generation probability may be obtained from a multinomial probability distribution or from a first-order Markov model, but is not limited thereto.
For example, the generation probability obtained from a multinomial probability distribution:
Pj("Guangzhou city Zengcheng district") = [(numj("Guangzhou") + numj("state city") + … + numj("urban area"))! / (numj("Guangzhou")! · numj("state city")! · … · numj("urban area")!)] × Pj("Guangzhou")^numj("Guangzhou") · Pj("state city")^numj("state city") · … · Pj("urban area")^numj("urban area"),
where numj(·) is the number of times a phrase appears in the split input text and Pj(·) is the probability of the phrase in the word frequency dictionary of cluster j.
For example, the generation probability obtained from a first-order Markov model:
Pj("Guangzhou city Zengcheng district") = Pj("Guangzhou") · Pj("state city") · Pj("market increase") · … · Pj("Guangzhou city") · Pj("state city increase") · …,
where each factor is the probability of the corresponding 2-gram or 3-gram in the word frequency dictionary of cluster j.
The generation probability may also be obtained in other manners, which are not described herein again.
In this way, the generation probability of the input text for each of the N clusters can be obtained by substituting the number of times and/or the probability, acquired in step 102, that the phrases in the split input text appear in the word frequency dictionary of each of the N clusters into the probability calculation formula.
It should be noted that the method for calculating the generation probability may be selected according to the application scenario. For example, in a business scenario of scalper shipping address variant identification, a generation probability calculation method based on the multinomial distribution may be selected. In other business scenarios, such as case text clustering or e-commerce username clustering, a first-order-Markov-based generation probability calculation method may be used.
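The multinomial option of step 103 can be sketched as follows. To avoid numerical underflow, the log of the generation probability is computed; the function and variable names are illustrative, and the unseen-phrase floor `eps` is an assumption standing in for a proper smoothing algorithm:

```python
import math

def multinomial_log_gen_prob(text_counts, cluster_probs, eps=1e-9):
    """Log of the multinomial generation probability
    (sum n_i)! / prod(n_i!) * prod(p_i ** n_i),
    for phrase counts n_i of the split text and dictionary probabilities p_i."""
    n = sum(text_counts.values())
    log_p = math.lgamma(n + 1)          # log (sum n_i)!
    for phrase, c in text_counts.items():
        log_p -= math.lgamma(c + 1)     # log n_i!
        log_p += c * math.log(cluster_probs.get(phrase, eps))
    return log_p

# Two equiprobable phrases seen once each: 2!/(1! * 1!) * 0.5 * 0.5 = 0.5
lp = multinomial_log_gen_prob({"ab": 1, "cd": 1}, {"ab": 0.5, "cd": 0.5})
print(round(math.exp(lp), 6))  # 0.5
```

Computing this log-probability for every cluster's dictionary yields the N generation probabilities compared against the threshold in step 104.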
104. If the maximum value among the generation probabilities is larger than a preset threshold, attribute the input text to the cluster corresponding to that maximum value.
If the maximum value among the generation probabilities is larger than a preset threshold, the input text is attributed to the cluster corresponding to that maximum value. For example, suppose there are 3 clusters, and the calculation formula in step 103 yields generation probabilities of the input text of 20% for the first cluster, 25% for the second cluster, and 23% for the third cluster. Assume the preset threshold is 23%. The second cluster, with the highest generation probability of 25%, is selected; since 25% is larger than the preset threshold of 23%, the input text is attributed to the second cluster.
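The decision in steps 104 and 106 can be sketched as follows, reproducing the numerical example above (the name `choose_cluster` is illustrative):

```python
def choose_cluster(gen_probs, threshold):
    """Return the index of the cluster with the maximal generation
    probability if that maximum exceeds the threshold; return None to
    signal that a new cluster should be created (step 106)."""
    best = max(range(len(gen_probs)), key=gen_probs.__getitem__)
    return best if gen_probs[best] > threshold else None

print(choose_cluster([0.20, 0.25, 0.23], 0.23))  # 1    -> second cluster
print(choose_cluster([0.20, 0.25, 0.23], 0.30))  # None -> create a new cluster
```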
105. Record the split input text in the word frequency dictionary of the cluster corresponding to the maximum value among the generation probabilities.
If it is determined in step 104 that the maximum value among the generation probabilities is greater than the preset threshold, the split input text is entirely recorded in the word frequency dictionary of the corresponding cluster. Specifically, the split input text "Guangzhou", "state city", …, "element L" and "Guangzhou city", …, "element L" is entirely recorded in the word frequency dictionary of the second cluster.
In this way, after the input text is attributed to a cluster, the number of character strings in that cluster's word frequency dictionary increases; when subsequent text samples are matched against these character strings and their generation probability for the cluster is calculated, the probability that similar subsequent samples are attributed to this cluster increases.
106. If the maximum value among the generation probabilities is less than or equal to the preset threshold, create a new cluster from the input text.
Suppose instead that in step 104 the preset threshold is 30%. The second cluster still has the highest generation probability, 25%, but 25% is less than the preset threshold of 30%. A new cluster, for example a fourth cluster, is therefore created for the input text, and the split input text, e.g. "guangzhou", "state city" … "element L", and "guangzhou city" … "element L", is recorded in full in the word frequency dictionary of the fourth cluster.
The present application provides a text clustering method. If an input text belongs to one of the preset clusters, the method identifies that cluster and also uses the input text to update the cluster's word frequency dictionary. Updating the word frequency dictionary changes the parameters used in the probability calculation, which means the model used to determine the home cluster is itself updated. The method provided by the present application therefore not only assigns the input text to a cluster but also updates the cluster model in real time. The number of clusters can grow as data flows in: as more texts are processed, the partition of the clusters becomes finer, new clusters are split off, and the granularity of the clustering space improves.
The text clustering method provided herein has self-learning capability. In the field of machine learning, self-learning means that no new training set needs to be prepared in advance to retrain and replace a degraded model deployed online; a self-learning model updates itself online from the samples it processes, adapting to the latest sample characteristics and avoiding model degradation. The method is also a streaming method: streaming means that once data enters the online model it is processed immediately and simultaneously serves as a sample for the model's online learning and optimization.
The method provided by the present application can be applied to clustering of address texts. For example, on an e-commerce platform, orders for scarce goods placed by ordinary buyers are often mixed with orders placed by scalpers. Because scalpers purchase scarce goods in large quantities, their orders could be caught by analyzing the order volume on a single shipping address if they always used the same address. Scalpers therefore bypass such simple risk-control rules in a low-cost but effective way: by adding and deleting a small number of characters, they derive variants of the same shipping address that simple rules cannot recognize as identical, thereby evading address-based order analysis. The method provided herein can effectively identify shipping-address texts derived from the same address through small character insertions and deletions, accurately and efficiently attribute such variants to the same cluster, and count the order volume per cluster. This effectively counters scalpers' bulk-buying behavior and raises the cost of their misconduct.
The following compares the performance of the text clustering method provided by the present application with three existing methods:
In supervised clustering, features are set manually; in the text clustering method provided by the present application, features are based on general statistical parameters of the text. With a supervised machine-learning model deployed online, the model degrades over time and must be taken offline, retrained, and redeployed, interrupting service. In the method provided herein, the model is updated automatically after every processed sample, so model degradation is avoided and the model stays current in real time.
In unsupervised clustering, the similarity measure is set manually; in the text clustering method provided by the present application, similarity is based on general statistical parameters of the text. In unsupervised clustering, the number of clusters is fixed and cannot expand as the samples change. In the method provided herein, the number of clusters grows as samples accumulate, the clustering granularity is updated promptly with the samples, and clustering accuracy can improve. In unsupervised clustering, each cluster is described by a single cluster-center parameter; in the method provided herein, the cluster parameters come from general text statistics, and every processed text finely adjusts them. In unsupervised clustering, similarity is determined by hand-chosen distances, such as the Euclidean distance to the cluster center; such a description is one-dimensional and can miss the real features of a text. In the method provided herein, similarity is the generation probability computed with each cluster's text-sample statistics as parameters, i.e. the probability of reconstructing the input text from the samples already in the cluster. This description of similarity is more general and more detailed, and it gains self-learning capability as text data flows in.
In regular-expression matching, the expressions are set manually; each expression acts as a single feature with extremely narrow coverage, so a large number of expressions is needed to recognize texts well. In the text clustering method provided by the present application, the features come from general text statistical parameters, the model can be updated in real time, and the features have broad coverage, giving the method high generality and flexibility.
It should be noted that the text clustering method provided by the present application can be used to distinguish the shipping addresses of normal users from those of scalpers: large numbers of address texts derived as variants of the same address (i.e. samples of addresses suspected of being used by scalpers) can be identified. The method is not limited to shipping-address recognition and can be applied to other texts as well. For example, it may cluster papers on similar topics by title and keywords; cluster similar medical cases by case keywords; cluster variants of a brand or brand name by similar keywords; or cluster suspected spam registrations from large numbers of registered user names. Clustering texts with this method reduces manual effort, and while the model processes text clustering online it simultaneously learns online to update and optimize itself.
The present application provides a system 20 for text clustering, please refer to fig. 5, where the system 20 is configured to execute corresponding steps in the above text clustering method, and the system 20 includes:
the first processing module 201 is configured to obtain an input text and split the input text based on a preset model.
The first processing module 201 is specifically configured to obtain the input text and split the input text based on a 2/3-gram model to obtain the split input text.
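One plausible reading of the 2/3-gram split is overlapping character n-grams of lengths 2 and 3; the patent does not fix the exact tokenization, so the following sketch is illustrative only:

```python
def split_2_3_gram(text):
    """Split a text into overlapping character 2-grams and 3-grams.
    This is one plausible reading of the 2/3-gram model; the exact
    tokenization is not fixed by the patent."""
    grams = []
    for n in (2, 3):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

# For example, "abcd" splits into ["ab", "bc", "cd", "abc", "bcd"].
phrases = split_2_3_gram("abcd")
```

Overlapping n-grams are what make the method robust to small insertions and deletions: a variant address still shares most of its 2-grams and 3-grams with the original.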
And the second processing module 202 is configured to obtain text features of the split input text.
The second processing module 202 is specifically configured to match the phrases in the split input text with the character strings in the word frequency dictionary of the preset at least one cluster, and obtain the times and/or probabilities of the phrases in the split input text appearing in the word frequency dictionary of the preset at least one cluster.
The third processing module 203 is configured to obtain a generation probability of the text feature for a preset at least one cluster.
The third processing module 203 is specifically configured to put the frequency and/or probability of the occurrence of the phrase in the split input text in the word frequency dictionary of the preset at least one cluster into a preset generation probability calculation formula, and obtain the generation probability of the text feature for the preset at least one cluster.
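The patent does not disclose the exact generation probability formula. One common choice consistent with the stated inputs (phrase counts from a cluster's word frequency dictionary) is a smoothed product of per-phrase probabilities, shown here in log space as an illustrative stand-in; `alpha` and `vocab_size` are assumed smoothing parameters, not taken from the patent:

```python
import math
from collections import Counter

def generation_log_prob(phrases, word_freq, alpha=1.0, vocab_size=10000):
    """Illustrative generation probability: a Laplace-smoothed product of
    per-phrase probabilities under the cluster's word frequency dictionary,
    returned in log space. alpha and vocab_size are assumed smoothing
    parameters and are not specified in the patent."""
    total = sum(word_freq.values())
    log_p = 0.0
    for phrase in phrases:
        p = (word_freq.get(phrase, 0) + alpha) / (total + alpha * vocab_size)
        log_p += math.log(p)
    return log_p

wf = Counter({"guangzhou": 5, "city": 3})
```

Under any formula of this shape, a text whose phrases already appear in a cluster's dictionary scores higher for that cluster than a text whose phrases do not, which is the behavior the method relies on in step 104.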
A fourth processing module 204, configured to classify the input text into a cluster corresponding to a maximum value in the generation probabilities when the maximum value in the generation probabilities is greater than a preset threshold.
The fourth processing module 204 is further configured to record the split input text in the word frequency dictionary of the cluster corresponding to the maximum value in the generation probabilities.
A fifth processing module 205, configured to create a new cluster according to the input text when the maximum value in the generation probabilities is smaller than or equal to the preset threshold.
It should be noted that the processing modules of the text clustering system 20 provided by the present application are divided only logically. In implementation, several processing modules may be integrated into one, a module may be divided into finer processing units, or the above modules may vary in other ways; for example, the first processing module 201 and the second processing module 202 may be combined into a single processing module. Such variations should not be construed as exceeding the scope of the present application.
The present application provides a text clustering method and system. The model features used in the method depend on the statistical features of the samples and therefore fit the samples well, and the features used are simple text substrings with good generality. While inbound text data is processed, the model is updated online with every processed text sample; since updates never require taking the model offline and redeploying it, time is saved. The features used by the model are influenced by each newly processed datum and therefore track the latest data promptly, and a flexibly configurable generation-probability calculation module captures the statistical characteristics of the latest data well.
The text clustering method and system provided by the embodiments of the invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments serves only to aid understanding of the method and its core idea. Meanwhile, a person skilled in the art may, following the idea of the invention, vary the specific embodiments and the scope of application; in summary, the content of this specification should not be construed as limiting the invention. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features equivalently replaced, and that such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present application.