Text processing method, device, equipment and storage medium
1. A method of text processing, the method comprising:
acquiring a text set to be evaluated and a plurality of text evaluation features for text evaluation, wherein the text set to be evaluated comprises a plurality of texts to be evaluated;
classifying the texts to be evaluated to obtain a plurality of text sets;
determining a target text set in the plurality of text sets based on the number of texts to be evaluated contained in each text set;
and for each text to be evaluated in the target text set, determining an evaluation result of the text to be evaluated based on the matching degree of the text to be evaluated and the text evaluation characteristics.
2. The method according to claim 1, wherein for each text to be evaluated in the target text set, the determining an evaluation result of the text to be evaluated based on the matching degree of the text to be evaluated and the text evaluation features comprises:
obtaining an evaluation weight corresponding to the matching degree of the text to be evaluated and each text evaluation feature;
and determining the evaluation result of the text to be evaluated based on the matching degree of the text to be evaluated and the plurality of text evaluation characteristics and the corresponding evaluation weight.
3. The method of claim 1 or 2, wherein the text evaluation feature comprises at least one of:
at least one category of information;
a text format;
a keyword library comprising a plurality of evaluation keywords.
4. The method of claim 3, wherein the text evaluation feature comprises a plurality of categories of information, and wherein for each text to be evaluated in the target set of texts, the method further comprises:
determining the information category hit by the text to be evaluated in the plurality of information categories;
and determining the matching degree of the text to be evaluated and the plurality of information categories based on the information categories hit by the text to be evaluated.
5. The method of claim 3, wherein the text evaluation feature comprises a keyword library comprising a plurality of evaluation keywords, and wherein for each text to be evaluated in the target text set, the method further comprises:
determining evaluation keywords hit by the text to be evaluated in the keyword library;
and determining the matching degree of the text to be evaluated and the plurality of evaluation keywords based on the evaluation keywords hit by the text to be evaluated.
6. The method of claim 3, wherein at least some of the evaluation keywords in the keyword library are determined by:
obtaining a sample text set, wherein the sample text set comprises a plurality of sample texts;
and determining an evaluation keyword from the candidate words based on the appearance of the candidate words contained in the sample texts.
7. The method of claim 1, wherein the classifying the texts to be evaluated to obtain a plurality of text sets comprises:
determining text characteristics of each text to be evaluated;
and clustering the texts to be evaluated based on the text features of the texts to be evaluated, and obtaining a plurality of text sets based on clustering results.
8. The method of claim 7, wherein the determining text characteristics of each of the texts to be evaluated comprises:
for each text to be evaluated, coding each word in the text to be evaluated to obtain the coding characteristics of each word in the text to be evaluated;
determining a word vector of each word in the text to be evaluated based on the coding features of each word in the text to be evaluated;
and determining the text characteristics of the text to be evaluated based on the word vector of each word in the text to be evaluated.
9. The method according to claim 7, wherein for each text to be evaluated, the determining of the word vectors of the words in the text to be evaluated based on the coding features of the words in the text to be evaluated is performed by a vector extraction model;
wherein the vector extraction model is obtained by training in the following way:
acquiring a training data set, wherein the training data set comprises a plurality of training texts;
coding each word in each training text to obtain the coding characteristics of each word in each training text;
inputting the coding features of each word in each training text into a neural network model, and determining the word vector of each adjacent word through the neural network model based on the coding features of the adjacent word;
determining word vector distribution corresponding to the word based on the word vectors of the adjacent words, and determining a predicted word corresponding to the word based on the word vector distribution;
determining a training loss value based on each word and the corresponding predicted word in each training text, performing iterative training on the neural network model according to the training loss value and the training data set until the training loss value meets a preset training end condition, and determining the model after training as the vector extraction model.
10. The method of claim 1, further comprising:
showing the evaluation result of each text to be evaluated in the target text set to a user through an evaluation result display interface;
and acquiring user operation information received through the evaluation result display interface, and correspondingly processing at least one text to be evaluated in the target text set based on the user operation information.
11. The method of claim 1, wherein the obtaining the set of texts to be evaluated comprises:
acquiring a text set to be processed, wherein the text set to be processed comprises a plurality of texts to be processed;
obtaining a plurality of text type prediction models, wherein each text type prediction model corresponds to one text type;
for each text to be processed, determining, based on each text type prediction model, the prediction probability that the text type of the text to be processed is the text type corresponding to that text type prediction model, and determining the text type corresponding to the text type prediction model with the highest prediction probability as the text type of the text to be processed;
determining the texts to be processed of the same text type as a text set to be evaluated, and acquiring any text set to be evaluated.
12. A text processing apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a text set to be evaluated and a plurality of text evaluation features for text evaluation, wherein the text set to be evaluated comprises a plurality of texts to be evaluated;
the classification module is used for classifying the texts to be evaluated to obtain a plurality of text sets;
the determining module is used for determining a target text set in the plurality of text sets based on the number of texts to be evaluated contained in each text set;
and the evaluation module is used for determining the evaluation result of each text to be evaluated in the target text set based on the matching degree of the text to be evaluated and the text evaluation characteristics.
13. The apparatus of claim 12, wherein for each text to be evaluated in the target set of texts, the evaluation module is configured to:
obtaining an evaluation weight corresponding to the matching degree of the text to be evaluated and each text evaluation feature;
and determining the evaluation result of the text to be evaluated based on the matching degree of the text to be evaluated and the plurality of text evaluation characteristics and the corresponding evaluation weight.
14. An electronic device comprising a processor and a memory, the processor and the memory being interconnected;
the memory is used for storing a computer program;
the processor is configured to perform the method of any of claims 1 to 11 when the computer program is invoked.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which is executed by a processor to implement the method of any one of claims 1 to 11.
Background
With the rapid development of information technology, large amounts of text information need to be evaluated in various fields so that the text information can be processed in a targeted manner. For example, enterprise information includes a large amount of user privacy information and business information. Guaranteeing the security of this information has become a top priority of enterprise information security work, and the first thing an enterprise needs to do is evaluate the user privacy information and business information so that the information can be managed in a targeted manner.
Most existing text evaluation approaches rely on empirical knowledge: keywords are specified from experience and matched against texts, so that the text information needing evaluation is screened out, and the corresponding evaluation results are determined based on the different keywords included in that text information. As a result, existing text evaluation approaches have poor flexibility and low evaluation accuracy.
Disclosure of Invention
The embodiment of the application provides a text processing method, a text processing apparatus, an electronic device and a storage medium, which can improve text processing efficiency and text evaluation accuracy and have high applicability.
The embodiment of the application provides a text processing method, which comprises the following steps:
acquiring a text set to be evaluated and a plurality of text evaluation features for text evaluation, wherein the text set to be evaluated comprises a plurality of texts to be evaluated;
classifying the texts to be evaluated to obtain a plurality of text sets;
determining a target text set in the plurality of text sets based on the number of texts to be evaluated contained in each text set;
and for each text to be evaluated in the target text set, determining an evaluation result of the text to be evaluated based on the matching degree of the text to be evaluated and the text evaluation characteristics.
An embodiment of the present application provides a text processing apparatus, and the apparatus includes:
the acquisition module is used for acquiring a text set to be evaluated and a plurality of text evaluation features for text evaluation, wherein the text set to be evaluated comprises a plurality of texts to be evaluated;
the classification module is used for classifying the texts to be evaluated to obtain a plurality of text sets;
the determining module is used for determining a target text set in the plurality of text sets based on the number of texts to be evaluated contained in each text set;
and the evaluation module is used for determining the evaluation result of each text to be evaluated in the target text set based on the matching degree of the text to be evaluated and the text evaluation characteristics.
The embodiment of the application provides an electronic device, which comprises a processor and a memory, wherein the processor and the memory are connected with each other;
the memory is used for storing computer programs;
the processor is configured to execute the text processing method provided by the embodiment of the application when the computer program is called.
The embodiment of the application provides a computer readable storage medium, and the computer readable storage medium stores a computer program, and the computer program is executed by a processor to realize the text processing method provided by the embodiment of the application.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the text processing method provided by the embodiment of the application.
In the embodiment of the application, a plurality of text sets are obtained by classifying the texts to be evaluated, and the target text set is determined according to the number of the texts to be evaluated contained in each text set, so that the texts to be evaluated can be filtered, the processing amount of the texts to be evaluated is reduced, and the text processing efficiency is improved. In addition, each text to be evaluated in the target text set can be accurately evaluated through the plurality of text evaluation features, and the applicability is high.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a scene schematic diagram of a text processing method provided in an embodiment of the present application;
FIG. 2 is a schematic flowchart of a text processing method provided in an embodiment of the present application;
FIG. 3a is a schematic diagram of a scenario for determining a text type according to an embodiment of the present application;
FIG. 3b is a schematic diagram of another scenario for determining a text type according to an embodiment of the present application;
FIG. 4 is a diagram illustrating a scenario for determining word vector distribution according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a scenario for determining a cluster category according to an embodiment of the present application;
FIG. 6 is a schematic flowchart of evaluating a text to be evaluated according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a scenario showing evaluation results provided in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The text processing method provided by the embodiment of the application relates to the fields of big data, Machine Learning (ML) in Artificial Intelligence (AI), Natural Language Processing (NLP) and the like. Machine learning specifically studies how a computer simulates or implements human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures, thereby continuously improving its own performance.
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Therefore, the research in this field will involve natural language, i.e. the language that people use everyday, so it is closely related to the research of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question answering, and the like. The text processing method provided by the embodiment of the application mainly relates to a text processing technology in natural language processing.
Artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In the embodiment of the application, machine learning based on a neural network gives the machine the ability to process text.
The text processing method provided by the embodiment of the application also relates to the fields of cloud computing in cloud technology, artificial intelligence cloud services and the like. In the embodiment of the application, the computing tasks involved in the text processing method are distributed over a resource pool formed by a large number of computers through cloud computing so as to improve the efficiency of text processing. The text processing method can also be offered as an artificial intelligence service, with the corresponding artificial intelligence cloud service for text processing provided through an artificial intelligence platform.
The text processing method provided by the embodiment of the application can be executed by any terminal device or server. When the text processing method provided by the embodiment of the application is executed by a server, the server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server or server cluster providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. When the text processing method provided by the embodiment of the application is executed by a terminal device, the terminal device may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like, but is not limited thereto.
Referring to fig. 1, fig. 1 is a scene schematic diagram of a text processing method provided in an embodiment of the present application. As shown in fig. 1, the text set 100 to be evaluated includes a plurality of texts to be evaluated, all of which belong to the same text type. For example, if the text type corresponding to the text set 100 to be evaluated is a log type, all the texts to be evaluated in the text set 100 to be evaluated are log texts. For another example, if the text type corresponding to the text set 100 to be evaluated is a trade type, all the texts to be evaluated in the text set 100 to be evaluated are texts related to trade. The specific division of text types may be determined based on the requirements of the actual application scenario, and is not limited herein. For example, for an enterprise or an organization, client data, technical data, critical decision information, main meeting summary, financial budget information, various financial statements, and the like may each be treated as a different text type.
Further, although all the texts to be evaluated in the text set 100 to be evaluated belong to the same text type, there may still be a large difference in text content between the texts to be evaluated, for example, the texts to be evaluated related to trade may include texts related to trade funds and may also include texts related to product information. Therefore, all the texts to be evaluated in the text set 100 to be evaluated can be classified, the similar texts to be evaluated are classified into one class, and each class of the texts to be evaluated is used as a text set. As shown in fig. 1, after all texts to be evaluated in the text set 100 to be evaluated are classified, a text set 201, a text set 202, and a text set 203 can be obtained, and further, according to the number of texts to be evaluated included in each text set, a target text set 204 can be determined from the text set 201, the text set 202, and the text set 203, so as to achieve the purpose of screening the texts to be evaluated. The text to be evaluated included in the target text set 204 is a text that needs to be evaluated finally, for example, importance, threat, risk, and the like of all the texts to be evaluated in the target text set 204 are evaluated, and a specific evaluation dimension may be determined based on an actual text type and an actual application scenario requirement, which is not limited herein.
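As an illustration only, the screening step described above could be sketched as follows. The passage states only that the target text set is chosen according to the number of texts in each text set; the concrete rule used here (keeping the text set that contains the most texts) and the name `select_target_text_set` are assumptions made for this sketch.

```python
from typing import Dict, List


def select_target_text_set(text_sets: Dict[str, List[str]]) -> str:
    """Return the name of the text set that contains the most texts.

    Assumption: "largest set wins" is only one possible size-based rule;
    the description does not fix the rule.
    """
    if not text_sets:
        raise ValueError("at least one text set is required")
    # max() keeps the first name encountered when sizes tie.
    return max(text_sets, key=lambda name: len(text_sets[name]))


# Hypothetical text sets standing in for text sets 201, 202 and 203.
text_sets = {
    "text_set_201": ["t1", "t2"],
    "text_set_202": ["t3", "t4", "t5"],
    "text_set_203": ["t6"],
}
target_name = select_target_text_set(text_sets)
target_text_set = text_sets[target_name]
```

A threshold-based rule (for example, keeping every text set above a minimum size) would fit the same description equally well.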
Specifically, when all the texts to be evaluated in the target text set 204 are evaluated, a plurality of text evaluation features 300 for text evaluation may be obtained, and the matching degree between each text to be evaluated in the target text set 204 and the plurality of text evaluation features is determined, so that the evaluation results 400 of all the texts to be evaluated are determined based on the matching degree corresponding to each text to be evaluated. That is to say, when all the texts to be evaluated in the target text set 204 are evaluated, the evaluation result of each text to be evaluated can be obtained based on the plurality of text evaluation features. As shown in fig. 1, the target text set 204 includes a text 1 to be evaluated, a text 2 to be evaluated, and a text 3 to be evaluated, after the matching degree between each text to be evaluated and the plurality of text evaluation features 300 is determined, the evaluation result of the text 1 to be evaluated may be determined based on the matching degree corresponding to the text 1 to be evaluated, the evaluation result of the text 2 to be evaluated may be determined based on the matching degree corresponding to the text 2 to be evaluated, and the evaluation result of the text 3 to be evaluated may be determined based on the matching degree corresponding to the text 3 to be evaluated.
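A minimal sketch of turning matching degrees into an evaluation result is given below. It assumes that each matching degree is a number between 0 and 1, that each text evaluation feature has an evaluation weight, and that the evaluation result is their weighted sum; the weighted sum, the feature names and the function name `evaluate_text` are illustrative assumptions rather than a form prescribed by the embodiment.

```python
from typing import Dict


def evaluate_text(matching_degrees: Dict[str, float],
                  weights: Dict[str, float]) -> float:
    """Combine per-feature matching degrees into one score (weighted sum)."""
    missing = set(matching_degrees) - set(weights)
    if missing:
        raise KeyError(f"no weight for features: {sorted(missing)}")
    return sum(weights[name] * degree
               for name, degree in matching_degrees.items())


# Hypothetical weights and matching degrees for one text to be evaluated.
weights = {"information_categories": 0.5, "text_format": 0.25, "keywords": 0.25}
degrees = {"information_categories": 1.0, "text_format": 0.0, "keywords": 0.5}
score = evaluate_text(degrees, weights)
```

The same function would be called once per text to be evaluated in the target text set, giving one evaluation result per text.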
Referring to fig. 2, fig. 2 is a schematic flowchart of a text processing method provided in an embodiment of the present application. As shown in fig. 2, the text processing method provided in the embodiment of the present application includes the following steps:
Step S21: acquiring a text set to be evaluated and a plurality of text evaluation features for text evaluation.
In some feasible embodiments, a plurality of texts to be evaluated in the acquired text set to be evaluated belong to the same text type, wherein the text type can be divided according to the field to which the text content relates, for example, a log text of the electronic device is classified into one text type, and enterprise information is classified into one text type. Optionally, the text type may be further divided according to different application scenarios, for example, classifying the enterprise information into one text type and classifying the academic documents into one text type. Optionally, the text type may be further divided according to different information attributes in a certain field and a certain application scenario, for example, for an enterprise or an organization, the client data, technical data, critical decision information, main meeting summary, financial budget information, and various financial statements may be divided into different text types. It should be particularly noted that the specific dividing manner and granularity of the text type corresponding to the text set to be evaluated may be determined based on the requirements of the actual application scenario, and are not limited herein.
Specifically, when a text set to be evaluated is obtained, a text set to be processed may be obtained first, where the text set to be processed includes a plurality of texts to be processed, and the texts to be processed may belong to various text types. For example, for an enterprise or an organization, the text types involved in the texts to be processed may include client data, technical data, critical decision information, main meeting summary, financial budget information, and various financial statements.
Further, after the text set to be processed is obtained, the text types of all the texts to be processed in the text set to be processed can be determined, the text to be processed of the same text type is determined as a text set to be evaluated, and at this time, the text to be processed in the text set to be evaluated is the text to be evaluated. And then, the text set to be evaluated corresponding to each text type can be sequentially acquired, and based on the text processing method provided by the embodiment of the application, the texts to be evaluated in the text sets to be evaluated belonging to the same text type are respectively processed, so that the text processing efficiency is improved.
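The grouping of texts to be processed into text sets to be evaluated can be sketched as follows. The function `toy_predict_type` is a hypothetical stand-in for whatever text type predictor is actually used, and the sample texts are invented for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def build_sets_to_evaluate(texts: Iterable[str],
                           predict_type: Callable[[str], str]) -> Dict[str, List[str]]:
    """Group texts so that each text set holds texts of a single text type."""
    sets_by_type = defaultdict(list)
    for text in texts:
        sets_by_type[predict_type(text)].append(text)
    return dict(sets_by_type)


def toy_predict_type(text: str) -> str:
    # Hypothetical rule: lines starting with "[" are treated as log texts.
    return "log" if text.startswith("[") else "trade"


pending = ["[INFO] service started", "invoice for order 17", "[WARN] disk almost full"]
sets_to_evaluate = build_sets_to_evaluate(pending, toy_predict_type)
```

Each value of `sets_to_evaluate` can then be processed as one text set to be evaluated.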
Optionally, if the text to be evaluated belonging to the specified text type needs to be evaluated in the actual requirement, the text set to be evaluated corresponding to the specified text type may also be obtained, and the text to be evaluated in the text set to be evaluated is processed based on the text processing method provided in the embodiment of the present application.
Optionally, the importance degree or the type priority corresponding to each text type may be determined, and the text set to be evaluated corresponding to each text type is sequentially obtained based on the order of the importance degrees or the order of the type priorities. It should be specially noted that the specific implementation manner for obtaining the text set to be evaluated is only an example, and may be determined based on the requirements of the actual application scenario, which is not limited herein.
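One way to realize the priority-based ordering is sketched below. It assumes that a smaller number means a higher type priority and that text types without a configured priority are handled last; both conventions, and the type names, are assumptions of this sketch.

```python
from typing import Dict, List


def order_text_types(text_types: List[str],
                     type_priority: Dict[str, int]) -> List[str]:
    """Sort text types so that higher-priority types come first."""
    # Types without a configured priority sort after all configured ones.
    return sorted(text_types, key=lambda t: type_priority.get(t, float("inf")))


# Hypothetical priorities: 0 is the most important text type.
type_priority = {"financial_statements": 0, "client_data": 1, "log": 2}
processing_order = order_text_types(
    ["log", "client_data", "meeting_summary", "financial_statements"],
    type_priority,
)
```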
In some possible embodiments, when determining the text types of all the texts to be processed in the text set to be processed, a plurality of text type prediction models may be obtained for determining the text type of each text to be processed. Each text type prediction model corresponds to one text type, that is, each text type prediction model can determine the probability that the text type of the text to be processed is that text type. The number of text type prediction models, and the text types they predict, can be determined based on the actual application scenario requirements, which is not limited herein.
Further, for each text to be processed, the prediction probability that the text type of the text to be processed is the text type corresponding to each text type prediction model can be determined based on that text type prediction model. That is, each text to be processed is input into each text type prediction model, and each text type prediction model outputs a prediction probability, which represents the probability that the text type of the text to be processed is the text type corresponding to that text type prediction model. After these prediction probabilities are determined, the text type corresponding to the text type prediction model with the highest prediction probability may be determined as the text type of the text to be processed. Based on this implementation, the text types of all the texts to be processed in the text set to be processed can be determined.
For example, referring to fig. 3a, fig. 3a is a schematic diagram of a scenario for determining a text type according to an embodiment of the present application. Three text type prediction models are shown in fig. 3a, each corresponding to one text type: a first text type prediction model corresponding to a first text type, a second text type prediction model corresponding to a second text type, and a third text type prediction model corresponding to a third text type. A text to be processed is input into the first text type prediction model, the second text type prediction model and the third text type prediction model respectively. The first text type prediction model yields a first prediction probability of 0.3, the second text type prediction model yields a second prediction probability of 0.8, and the third text type prediction model yields a third prediction probability of 0.4. Because the second prediction probability is the highest, the second text type corresponding to the second text type prediction model can be determined as the text type of the text to be processed.
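The selection in the fig. 3a example can be sketched as follows. Each text type prediction model is represented by a hypothetical callable that returns a fixed probability matching the example values; real models would compute the probability from the text.

```python
from typing import Callable, Dict


def predict_text_type(text: str,
                      models: Dict[str, Callable[[str], float]]) -> str:
    """Run one model per text type and keep the type with the highest probability."""
    probabilities = {text_type: model(text) for text_type, model in models.items()}
    return max(probabilities, key=probabilities.get)


# Hypothetical stand-ins that reproduce the probabilities of the example.
models = {
    "first_text_type": lambda text: 0.3,
    "second_text_type": lambda text: 0.8,
    "third_text_type": lambda text: 0.4,
}
predicted = predict_text_type("some text to be processed", models)
```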
Each text type prediction model may be a model obtained through neural network training, or a model obtained through training based on a classification algorithm, that is capable of determining the prediction probability that the text type of a text to be processed is one particular text type. The specific model may be determined based on the requirements of the actual application scenario, which is not limited herein.
Optionally, when determining the text types of all texts to be processed in the text set to be processed, a single text type prediction model may instead be obtained, where the text type prediction model may be used to predict, for each text to be processed, the prediction probability that its text type is each of a plurality of text types. The text types that the text type prediction model can predict, and their number, can be determined based on the actual application scenario requirements, which is not limited herein.
Further, for each text to be processed, the prediction probability that the text type of the text to be processed is each text type can be determined based on the text type prediction model. That is, each text to be processed is input into the text type prediction model, the text type prediction model outputs a plurality of prediction probabilities, and each prediction probability represents the probability that the text type of the text to be processed is one text type. After these prediction probabilities are determined, the text type corresponding to the highest prediction probability may be determined as the text type of the text to be processed. Based on this implementation, the text types of all texts to be processed in the text set to be processed can be determined.
For example, referring to fig. 3b, fig. 3b is a schematic diagram of another scenario for determining a text type provided in the embodiment of the present application. The text type prediction model shown in fig. 3b may determine the prediction probabilities that the text type of one text to be processed is each of three text types, that is, the prediction probability that the text type of the text to be processed is the first text type, the prediction probability that it is the second text type, and the prediction probability that it is the third text type. A text to be processed is input into the text type prediction model, yielding a first prediction probability of 0.3, a second prediction probability of 0.8 and a third prediction probability of 0.4. Because the second prediction probability is higher than the first prediction probability and the third prediction probability, the second text type corresponding to the second prediction probability is determined as the text type of the text to be processed.
The text type prediction model may be a model obtained through neural network training, or a model obtained through training based on a classification algorithm, that is capable of determining the prediction probabilities that the text type of a text to be processed is each of multiple text types. The specific model may be determined based on actual application scenario requirements, which is not limited herein.
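The single-model variant of fig. 3b differs only in that one model returns all prediction probabilities at once. In the sketch below, `toy_model` is a hypothetical stand-in that returns the probabilities of the example, in the same order as the list of text types.

```python
from typing import Callable, List, Sequence


def predict_text_type_single(text: str,
                             model: Callable[[str], Sequence[float]],
                             text_types: List[str]) -> str:
    """Pick the text type whose probability is highest in the model output."""
    probabilities = model(text)
    if len(probabilities) != len(text_types):
        raise ValueError("model must return one probability per text type")
    best_index = max(range(len(text_types)), key=lambda i: probabilities[i])
    return text_types[best_index]


def toy_model(text: str) -> Sequence[float]:
    # Hypothetical fixed output matching the example: 0.3, 0.8, 0.4.
    return [0.3, 0.8, 0.4]


types = ["first_text_type", "second_text_type", "third_text_type"]
predicted_single = predict_text_type_single("some text to be processed", toy_model, types)
```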
In some feasible embodiments, while the text set to be evaluated is obtained, a plurality of text evaluation features for text evaluation need to be obtained, where the obtained plurality of text evaluation features are used for evaluating each text to be evaluated in the text set to be evaluated.
Wherein the text evaluation feature comprises at least one of:
at least one category of information;
a text format;
a keyword library.
The classification basis and the specific classification granularity of the information categories may be determined based on the actual application scene requirements, and are not limited herein. For example, for a text to be evaluated whose text type is the technical material type, the text content may relate to information of multiple information categories, such as the financial category, the identity category, the password category, and the like.
The text format is used for representing a return type or a text expression form of a text to be evaluated, including but not limited to a text type, a Json type, a picture (jpg, png) type, and the like, and may be determined based on a requirement of an actual application scene, which is not limited herein.
The keyword library comprises a plurality of evaluation keywords, and each evaluation keyword can correspond to one information category. The keyword library may be constructed in various ways. For example, evaluation keywords belonging to each information category may be obtained based on big data or the like to construct the keyword library; a plurality of evaluation keywords may be extracted from sample texts based on a keyword extraction algorithm to construct the keyword library; or an existing evaluation keyword set may be expanded based on the above acquisition manners to construct the keyword library, and the like.
In some possible embodiments, at least part of the evaluation keywords in the keyword library may be determined by:
obtaining a sample text set, wherein the sample text set comprises a plurality of sample texts;
and determining evaluation keywords from the candidate words based on the occurrence condition of each candidate word contained in the plurality of sample texts.
For each sample text, the candidate words included in the sample text are words with higher importance in the sample text, or words that can represent different information categories in the sample text, and the determination method of the candidate words in each sample text and the acquisition method of each sample text are not limited herein.
The occurrence condition of each candidate word included in the multiple sample texts may be information related to the occurrence frequency of each candidate word in a certain dimension, including but not limited to the occurrence frequency of each candidate word in the corresponding sample text, the ratio of the occurrence frequency of each candidate word in the corresponding sample text to the total number of words in the sample text, the ratio of the occurrence frequency of each candidate word in the corresponding sample text to the occurrence frequency of the word with the maximum occurrence frequency in the sample text, and the like, which is not limited herein. For convenience of description, the occurrence condition corresponding to each candidate word is referred to as the word frequency of the candidate word in the following.
In addition, the degree of importance may differ between candidate words. For example, if the candidate word "sell" appears as many times as the candidate word "top secret" in one sample text, it can be determined that the importance of the candidate word "top secret" is greater than that of the candidate word "sell", because "sell" is a common word while "top secret" is comparatively rare. Therefore, if a candidate word is rare in ordinary texts but appears multiple times in a sample text, the candidate word is likely to reflect an information category related to the corresponding sample text; in this example, the candidate word "top secret" can be determined as an evaluation keyword.
Thus, for each candidate word, a first text number, namely the number of sample texts containing the candidate word, may be determined, and the total text number of the plurality of sample texts may be determined. The inverse text frequency corresponding to the candidate word is then determined based on the first text number and the total text number.
The inverse text frequency corresponding to the candidate word is inversely related to how common the candidate word is across the sample texts.
Further, also for each candidate word, a word weight of the candidate word may be determined based on the inverse text frequency and the word frequency corresponding to the candidate word (e.g., multiplying the inverse text frequency by the word frequency), and then, for all candidate words, a candidate word whose word weight is higher than a word weight threshold may be determined as the evaluation keyword. The word weight threshold may be determined based on actual application scenario requirements, and is not limited herein.
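The keyword selection described above can be sketched as follows. This is only an illustration under stated assumptions: every token is treated as a candidate word, the word frequency is the occurrence count divided by the number of words in the sample text (one of the options listed above), the inverse text frequency uses the common form log(total text number / first text number), which is an assumed form and not necessarily the one used in this application, and the word weight of a candidate word is taken as its largest weight over the sample texts. The function name and the sample data are hypothetical.

```python
import math
from collections import Counter


def select_keywords(sample_texts, weight_threshold):
    """sample_texts: list of token lists. Returns {word: weight} above the threshold."""
    total_texts = len(sample_texts)
    # First text number: how many sample texts contain each candidate word.
    text_count = Counter()
    for tokens in sample_texts:
        text_count.update(set(tokens))

    best_weight = {}
    for tokens in sample_texts:
        for word, count in Counter(tokens).items():
            word_frequency = count / len(tokens)
            # Assumed inverse text frequency form: log(N / n).
            inverse_text_frequency = math.log(total_texts / text_count[word])
            weight = word_frequency * inverse_text_frequency
            best_weight[word] = max(best_weight.get(word, 0.0), weight)
    return {w: v for w, v in best_weight.items() if v > weight_threshold}


# Hypothetical, already segmented sample texts.
samples = [
    ["sell", "top_secret", "top_secret", "report"],
    ["sell", "price", "list"],
    ["sell", "report", "meeting"],
]
keywords = select_keywords(samples, weight_threshold=0.4)
```

In this toy data "sell" occurs in every sample text, so its inverse text frequency is zero and it is never selected, whereas "top_secret" is rare across texts but frequent within one text and exceeds the threshold.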
And step S22, classifying the texts to be evaluated to obtain a plurality of text sets.
In some feasible embodiments, for all texts to be evaluated in a text set to be evaluated belonging to the same text type, there may still be a large difference between text contents of the texts to be evaluated, so all texts to be evaluated in the text set to be evaluated may be classified, the texts to be evaluated that are more similar are classified into one class, and each class of text to be evaluated is taken as a text set.
Specifically, the texts to be evaluated belonging to the same text type may be further classified, so as to screen out from the text set to be evaluated the texts that finally need to be evaluated. The text features of each text to be evaluated in the text set to be evaluated may be determined first; based on these text features, a clustering algorithm, a clustering model or the like is adopted to cluster the text features, and a plurality of text sets are obtained based on the clustering result. When the clustering is finished, the texts to be evaluated corresponding to the text features belonging to the same cluster category are taken as one text set, and the similarity between the text contents of the texts to be evaluated in each text set is high.
For example, if the risk of the text to be evaluated needs to be evaluated, usually, only a small part of the text to be evaluated contains risk information, and the similarity between the texts to be evaluated containing the risk information is closer. Therefore, the texts to be evaluated containing the risk information can be divided into the same class by clustering the texts to be evaluated, and the texts to be evaluated suspected to contain the risk information are screened.
The clustering algorithm includes, but is not limited to, a K-Means clustering algorithm, a mean shift clustering algorithm, a DBSCAN clustering algorithm, and the like, and may be specifically determined based on actual application scene requirements, which is not limited herein. The clustering model may be a clustering model obtained by training based on a clustering algorithm, may also be a clustering model obtained by training based on a neural network model, and may specifically be determined based on actual application scene requirements, which is not limited herein.
The text features of the texts to be evaluated are used as clustering objects, so that the information processing amount can be reduced, the clustering efficiency is improved, and because the text features carry the semantic information of the corresponding texts to be evaluated, the similarity between the texts to be evaluated in the text sets obtained based on the clustering results can be better, and the screening accuracy of the information to be evaluated is improved.
Optionally, the text set to be evaluated may also be divided into a plurality of text sets based on the forms of keyword matching, calculating text similarity between texts, and the like, and a specific implementation manner for classifying a plurality of texts to be evaluated may be determined based on the requirements of an actual application scenario, which is not limited herein.
In some feasible embodiments, when determining the text features of each text to be evaluated, the text to be evaluated may be directly encoded to obtain its text features. Optionally, each word in each text to be evaluated may be encoded to obtain the coding features of each word, and a word vector corresponding to each word in each text to be evaluated is then obtained based on the vector extraction model, so that the text features of the corresponding text to be evaluated are determined based on the word vectors corresponding to the words. For example, the word vectors corresponding to the words in each text to be evaluated may be processed by a fully connected layer, so as to obtain the text features of the text to be evaluated.
Optionally, text preprocessing may be performed on each text to be evaluated, such as word segmentation and the removal of stop words, punctuation and dirty data, to obtain the words of the preprocessed text to be evaluated. Each word in the preprocessed text to be evaluated is then encoded, and a word vector of each word is obtained based on the vector extraction model to determine the text features of the corresponding text to be evaluated.
The vector extraction model may be obtained based on neural network model training, for example, based on CBOW model training in the word2vec model, and the selection of the specific neural network model may be determined based on actual application scene requirements, which is not limited herein.
In some possible implementations, when training the vector extraction model, a training data set may be obtained, the training data set including a plurality of training texts. Each word in each training text may be encoded to obtain the coding features of the word. The coding features of the words in each training text are then input into a neural network model. For each word, the neural network model multiplies the coding features of the words adjacent to that word by a shared input weight matrix to obtain a word vector of each adjacent word. The word vectors of the adjacent words are then summed and averaged to obtain a hidden layer vector.
Further, the hidden layer vector is multiplied by the output weight matrix and is processed by an activation function to obtain word vector distribution corresponding to each word, namely, a word is used as a predicted word, and the word vector distribution corresponding to the predicted word is finally obtained through the coding features of words adjacent to the word. The vector dimension of the word vector distribution is the same as the dimension of the word vector of each adjacent word, and the word vector distribution of each predicted word can represent the probability that the predicted word is each word, so that the word indicated by the maximum probability can be used as the predicted word.
Optionally, when encoding each word in the training text, text preprocessing such as word segmentation and the removal of stop words, punctuation and dirty data may also be performed on each training text to obtain the words of the preprocessed training text, and each word in the preprocessed training text is then encoded.
Further, a training loss value may be determined during the training process based on each word in each training text and the corresponding predicted word, the training loss value characterizing a difference between the word in each training text and the corresponding predicted word. And performing iterative training on the neural network model based on the training loss value and the training data set until the training loss value meets a preset training ending condition, and determining the model after the training is ended as a vector extraction model.
The training end condition may be that the training loss value is smaller than the training loss threshold, or that the training loss value tends to be stable, that is, a difference between training loss values corresponding to a certain number of consecutive training times is smaller than a preset difference threshold. When the training loss value meets the preset training end condition, the difference between each word and the corresponding predicted word is small, and further the word vector of each word obtained based on the vector extraction model tends to be stable. And finally, a shared input weight matrix of the vector extraction model is obtained through training, and the word vector corresponding to any word is determined according to the coding features of the word.
With reference to fig. 4, fig. 4 is a schematic view of a scene for determining word vector distribution according to an embodiment of the present application. Fig. 4 shows a text to be evaluated, "go to Beijing today". For this text to be evaluated, each word may be one-hot encoded to obtain the coding features of each word in the text to be evaluated, and the coding features of each word are input into the vector extraction model. For the word "go" in the text to be evaluated, the coding features corresponding to the words "today", "day", "north" and "jing" adjacent to the word "go" can be input into the vector extraction model to obtain the word vectors corresponding to the words "today", "day", "north" and "jing". These word vectors are then summed and averaged to obtain a hidden layer vector, and the hidden layer vector is multiplied by the output weight matrix and processed by an activation function to obtain the word vector distribution corresponding to the word "go" (the predicted word).
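The forward computation for the fig. 4 example can be sketched in plain Python as follows. This is a minimal illustration rather than the model of the present application: the weight matrices are random instead of trained, the word vector dimension of 3 is arbitrary, softmax is assumed as the activation function, and the output is arranged with one probability per vocabulary word, which is the usual CBOW layout and an assumption here. All function and variable names are hypothetical.

```python
import math
import random


def one_hot(index, size):
    """Coding feature of a word: 1.0 at the word's index, 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_times_matrix(vector, matrix):
    """Multiply a length-n row vector by an n x m matrix given as a list of rows."""
    return [sum(vector[i] * matrix[i][j] for i in range(len(vector)))
            for j in range(len(matrix[0]))]


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_forward(context_ids, input_weights, output_weights):
    """Predict a distribution for the centre word from its adjacent words."""
    vocab_size = len(input_weights)
    # Word vector of each adjacent word: one-hot feature times shared input matrix.
    word_vectors = [vec_times_matrix(one_hot(i, vocab_size), input_weights)
                    for i in context_ids]
    dim = len(input_weights[0])
    # Hidden layer vector: average of the adjacent word vectors.
    hidden = [sum(v[d] for v in word_vectors) / len(word_vectors) for d in range(dim)]
    return softmax(vec_times_matrix(hidden, output_weights))


vocab = ["today", "day", "go", "north", "jing"]
rng = random.Random(0)
dim = 3  # arbitrary word vector dimension for the sketch
input_weights = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
output_weights = [[rng.uniform(-0.5, 0.5) for _ in vocab] for _ in range(dim)]

context_ids = [vocab.index(w) for w in ("today", "day", "north", "jing")]
distribution = cbow_forward(context_ids, input_weights, output_weights)
predicted_word = vocab[distribution.index(max(distribution))]
```

With untrained weights the predicted word is essentially arbitrary; training adjusts the two matrices so that the probability assigned to "go" increases, after which each row of the shared input weight matrix serves as the word vector of the corresponding word.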
In some possible embodiments, when clustering text features of a text to be evaluated, a K-Means clustering algorithm may be used for clustering. Specifically, when clustering is started, k clustering categories may be determined first, that is, each text feature needs to be divided into k clusters in this clustering. The determination of the k value can be determined based on a priori experience, or the k value can be a preset value, or the k value can be determined by performing cross validation on each text to be evaluated.
When cross validation is performed, the text features corresponding to each text to be evaluated can be divided into a training subset and a testing subset, and the training subset is used for clustering based on different k values to determine the k value possibly used for final clustering. The test subset is used for testing each k value, and the final k value of the current clustering is determined based on the test result.
Optionally, the text features corresponding to each text to be evaluated may be divided into m training subsets, and m k values may be determined based on the m training subsets. In the process of determining each k value, one training subset can be selected as the test subset, and a k value is determined in the above manner according to the test subset and the remaining m-1 training subsets. After each k value is determined, another training subset is selected as the test subset to determine another k value in the same manner, and the process is repeated to obtain m k values. Further, the mean of the m k values may be determined and rounded up or down to obtain the k value that is finally used for clustering all text features.
It should be particularly noted that the above specific determination manner of the k value for clustering is only an example, and may be specifically determined based on the requirements of the actual application scenario, and is not limited herein.
Further, after the k value for clustering is determined, k text features may be selected from all text features $D = \{x_1, x_2, \ldots, x_m\}$ as the features $\{\mu_1, \mu_2, \ldots, \mu_k\}$ of the k cluster centers. That is, the features of the initial cluster centers at the beginning of clustering are the k selected text features. Each cluster center may also be called the mean vector or the centroid of the cluster corresponding to the corresponding cluster category; for convenience of description, the embodiments of the present application refer to it uniformly as the cluster center.
In each clustering process, for one text feature $x_i$, the similarity $d_{ij} = \|x_i - \mu_j\|_2$ between the text feature $x_i$ and the feature $\mu_j$ $(j = 1, 2, \ldots, k)$ of each cluster center can be determined (the smaller $d_{ij}$ is, the higher the similarity). The cluster category $\lambda_i$ corresponding to the feature of the cluster center with the highest similarity is determined as the cluster category to which the text feature $x_i$ belongs in the current clustering process, and the text feature $x_i$ is assigned to the cluster corresponding to the cluster category $\lambda_i$: $C_{\lambda_i} = C_{\lambda_i} \cup \{x_i\}$. Based on the above implementation, the cluster categories corresponding to all text features $D = \{x_1, x_2, \ldots, x_m\}$ can be determined.
Referring to fig. 5, fig. 5 is a schematic view of a scenario for determining a cluster category according to an embodiment of the present application. The number of clusters in fig. 5 is 3, i.e., there are cluster class 1, cluster class 2, and cluster class 3. In a clustering process, for the text features in fig. 5, the distance d between the text feature and the feature at each cluster center can be determined respectively1、d2And d3. Wherein d is3Much smaller than d1And d2It is indicated that the similarity between the text feature and the feature of the cluster center corresponding to the cluster category 3 is the highest, so that the cluster category of the text feature can be determined as the cluster category 3, and the text feature is classified into the corresponding cluster.
Further, after each clustering process is completed, for the cluster $C_j$ corresponding to each cluster category, the feature of the cluster center corresponding to the cluster category needs to be updated. Specifically, the update can be implemented through $\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$, where $|C_j|$ is the number of text features in the cluster $C_j$. Namely, the mean vector of all the text features in the cluster corresponding to the cluster category is used as the new cluster center, and clustering is carried out based on the new cluster center in the next clustering process.
In the clustering process, a loss value can be determined based on the distance (similarity) between each text feature and the feature of the corresponding clustering center, iterative clustering is carried out on each text feature according to the clustering loss value until the clustering loss value meets the clustering ending condition, clustering is ended, and the text to be evaluated in the cluster corresponding to each clustering category is determined as a text set after clustering is ended.
The clustering ending condition may be that the clustering loss value is less than the loss threshold, or that a difference between consecutive clustering loss values is less than a difference threshold, that is, the clustering loss value tends to be stable, which is not limited herein.
Wherein, the clustering loss value can be expressed as: $E = \sum_{j=1}^{k} \sum_{x \in C_j} \|x - \mu_j\|_2$. That is, the clustering loss value is the sum of the distances between each text feature and the feature of the corresponding cluster center. The clustering loss value represents how tightly the text features in the cluster corresponding to each cluster category gather around the feature of the cluster center, and the smaller the clustering loss value is, the higher the similarity between the text features in the clusters. Based on this clustering mode, the distance between the text features within the cluster corresponding to each cluster category becomes smaller and smaller (the similarity becomes higher and higher), and the distance between clusters becomes larger and larger (the similarity between the text features of different clusters becomes smaller and smaller), so that the classification of the texts to be evaluated is realized.
Optionally, in the clustering process, the clustering end condition may also be that the iterative clustering frequency reaches a preset frequency, and at this time, clustering may also be stopped, and the specific selection of the clustering end condition may be determined based on the actual application scenario requirement, which is not limited herein.
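The K-Means procedure described above (assignment to the nearest cluster center, mean-vector update, and the ending conditions) can be sketched in plain Python as follows. This is an illustrative sketch under assumptions: Euclidean distance is used, the clustering loss is the sum of distances to the assigned centers, the initial centers are either chosen at random or given by the hypothetical `init_indices` parameter, and the default thresholds are arbitrary.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(features, k, init_indices=None, max_iterations=100,
            diff_threshold=1e-6, seed=0):
    """Cluster feature vectors; returns (labels, centers, loss)."""
    if init_indices is None:
        init_indices = random.Random(seed).sample(range(len(features)), k)
    # Initial cluster centers are k of the text features themselves.
    centers = [list(features[i]) for i in init_indices]
    previous_loss = None
    for _ in range(max_iterations):  # ending condition: iteration count
        # Assign each feature to the cluster center with the smallest distance.
        labels = [min(range(k), key=lambda j: distance(x, centers[j]))
                  for x in features]
        loss = sum(distance(x, centers[l]) for x, l in zip(features, labels))
        # Update each center to the mean vector of its cluster.
        for j in range(k):
            members = [x for x, l in zip(features, labels) if l == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        # Ending condition: the clustering loss value has stabilised.
        if previous_loss is not None and abs(previous_loss - loss) < diff_threshold:
            break
        previous_loss = loss
    return labels, centers, loss


# Hypothetical two-dimensional text features forming two obvious groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers, loss = k_means(points, k=2, init_indices=(0, 1))
```

Even though both initial centers here come from the same group, the mean-vector updates move one center to the far group within a few iterations, and the texts whose features share a label then form one text set.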
In the daily operation process of computer equipment, the number of alarm logs is often very large, and false alarms or log texts that have no influence on the operation of the computer are common among them. According to the present application, by determining the text features of the log texts and clustering them, most log texts that are false alarms or have low influence can be filtered out in a short time, which reduces the amount of log text to be processed.
Step S23, determining a target text set in the plurality of text sets based on the number of texts to be evaluated included in each text set.
In some possible embodiments, because the similarity between the text contents of the texts to be evaluated in each text set is high, the texts to be evaluated with similar importance and threat belong to the same text set. For example, if the risk of the text to be evaluated needs to be evaluated, usually only a small part of the texts to be evaluated contains risk information, and the texts to be evaluated containing risk information are more similar to one another. Therefore, by clustering the texts to be evaluated, the texts to be evaluated containing risk information can be divided into the same class, so that the number of texts to be evaluated included in each text set can be determined, and the target text set is determined from the plurality of text sets based on that number. If the threat of the text to be evaluated needs to be evaluated, the texts to be evaluated containing threat information and importance information usually account for only a very small proportion of all the texts to be evaluated, so one or more text sets containing a small number of texts to be evaluated can be used as target text sets. In this way, the text sets are screened to obtain text sets with high threat, high importance or high risk.
Optionally, when the target text set is determined from the plurality of text sets, a text set including texts to be evaluated, the number of which does not exceed the number threshold value, may also be used as the target text set, and a text set including texts to be evaluated, the number of which exceeds the number threshold value, may also be used as the target text set, and specifically may be determined based on actual evaluation requirements, which is not limited herein.
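A minimal sketch of selecting target text sets by size is given below, assuming the variant in which text sets whose number of texts does not exceed the number threshold are kept. The function name, the cluster names and the threshold value are hypothetical.

```python
def select_target_sets(text_sets, number_threshold):
    """Keep the text sets whose number of texts does not exceed the threshold."""
    return {name: texts for name, texts in text_sets.items()
            if len(texts) <= number_threshold}


# Hypothetical clustering result for alarm log texts.
text_sets = {
    "cluster_0": ["routine log"] * 500,
    "cluster_1": ["suspicious log a", "suspicious log b", "suspicious log c"],
    "cluster_2": ["noisy log"] * 40,
}
targets = select_target_sets(text_sets, number_threshold=5)
```

If the evaluation requirement instead calls for the large text sets, the comparison can simply be reversed.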
Step S24, for each text to be evaluated in the target text set, determining an evaluation result of the text to be evaluated based on a matching degree between the text to be evaluated and the plurality of text evaluation features.
In some feasible embodiments, when the text to be evaluated in the target text set is evaluated, the matching degree between each text to be evaluated and the obtained multiple text evaluation features can be respectively determined, so that the evaluation result of the text to be evaluated is determined according to the matching degree corresponding to each text to be evaluated.
Specifically, when the text evaluation features include a keyword library and the keyword library includes a plurality of evaluation keywords, the matching degree between the text to be evaluated and the text evaluation features may be determined based on the evaluation keywords hit by the text to be evaluated, the number of evaluation keywords hit by the text to be evaluated, and the ratio of the number of evaluation keywords hit by the text to be evaluated to the total number of evaluation keywords in the keyword library.
For example, if the text to be evaluated hits an evaluation keyword corresponding to a certain information category, a value may be assigned to the matching degree corresponding to that hit; likewise, a value may be assigned to the matching degree corresponding to the number of evaluation keywords hit by the text to be evaluated, and so on. In this way, the matching degree between the text to be evaluated and the plurality of text evaluation features is expressed numerically.
Specifically, when the text evaluation features include at least one information category, one evaluation keyword in the keyword library may correspond to one information category. The matching degree between the text to be evaluated and the text evaluation features can be determined based on the information categories hit by the text to be evaluated, the number of information categories hit by the text to be evaluated, and the ratio of the total text length of the evaluation keywords corresponding to the hit information categories to the total text length of all the evaluation keywords in the keyword library.
Similarly, the matching degrees respectively corresponding to the information categories hit by the text to be evaluated, the number of information categories hit, and the ratio of the total text length of the evaluation keywords corresponding to the hit information categories to the total text length of all the evaluation keywords in the keyword library can also be expressed by assigning values.
When the information category hit by the text to be evaluated is determined, the evaluation keyword hit by the text to be evaluated in the keyword library can be determined, and the information category corresponding to the hit evaluation keyword is used as the information category hit by the text to be evaluated.
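The keyword-based and category-based matching degrees described above can be sketched as follows. The sketch assumes that a keyword is "hit" when it occurs as a substring of the text, and that text length is measured in characters; both are illustrative assumptions, and the keyword library contents and function name are hypothetical.

```python
def match_keywords(text, keyword_library):
    """keyword_library maps each evaluation keyword to its information category."""
    # Assumed hit rule: the keyword occurs as a substring of the text.
    hit_keywords = [kw for kw in keyword_library if kw in text]
    hit_categories = sorted({keyword_library[kw] for kw in hit_keywords})
    total_length = sum(len(kw) for kw in keyword_library)
    hit_length = sum(len(kw) for kw in hit_keywords)
    return {
        "hit_keywords": hit_keywords,
        "hit_keyword_count": len(hit_keywords),
        "hit_keyword_ratio": len(hit_keywords) / len(keyword_library),
        "hit_categories": hit_categories,
        "hit_category_count": len(hit_categories),
        "hit_length_ratio": hit_length / total_length,
    }


# Hypothetical keyword library: evaluation keyword -> information category.
library = {
    "password": "password",
    "token": "password",
    "id_card": "identity",
    "invoice": "finance",
}
result = match_keywords("user password and id_card leaked", library)
```

Each entry of the returned dictionary corresponds to one matching degree that can subsequently be weighted.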
Specifically, when the text evaluation features include a text format, the matching degree between the text to be evaluated and the text evaluation features may be determined based on the specific text format of the text to be evaluated. For example, if the text format of the text to be evaluated is plain text, the corresponding matching degree may be determined to be 1.
Furthermore, an evaluation weight corresponding to each matching degree between the text to be evaluated and the plurality of text evaluation features can be obtained, and the evaluation result of the text to be evaluated is then determined based on the matching degrees of the text to be evaluated with the plurality of text evaluation features and the corresponding evaluation weights.
The evaluation weight represents the degree of influence of the corresponding matching degree on the evaluation result. For example, if the number of evaluation keywords hit by the text to be evaluated influences the evaluation result far more than the text format of the text to be evaluated does, the evaluation weight corresponding to the former may be set higher than that corresponding to the latter. The evaluation weights corresponding to different matching degrees can be determined based on the requirements of the actual application scene, and no limitation is made here.
For example, the text to be evaluated hits evaluation keyword 1 and evaluation keyword 2 (the matching degree corresponding to hitting evaluation keyword 1 is 1, and the matching degree corresponding to hitting evaluation keyword 2 is 1), so the total number of evaluation keywords hit by the text to be evaluated is 2 (the corresponding matching degree is 2). If there are 12 evaluation keywords in the keyword library, the ratio of the number of evaluation keywords hit by the text to be evaluated to the total number of evaluation keywords in the keyword library is 2/12, approximately 0.17 (the ratio is taken as the corresponding matching degree). In this case, the final evaluation result may be obtained by performing weighted summation based on each matching degree corresponding to the text to be evaluated and the corresponding evaluation weight.
Optionally, the different matching degrees between the text to be evaluated and the multiple text evaluation features may differ greatly in magnitude. For example, the matching degree corresponding to the number of hit evaluation keywords is often much larger than the matching degree corresponding to the ratio of hit evaluation keywords, which may reduce the objectivity of the finally obtained evaluation result. Based on this, normalization processing can be performed on each matching degree between the text to be evaluated and each text evaluation feature, so that all matching degrees are on the same scale, and the evaluation result of the text to be evaluated is then determined based on the normalized matching degrees and the corresponding evaluation weights.
The evaluation result may be used to determine the risk, importance, threat, and the like of the text to be evaluated, and may be determined based on the actual evaluation requirement and the application scene requirement, which is not limited herein. For example, a high-threat target text set is determined from the plurality of text sets based on the number of texts to be evaluated contained in each text set. For any text to be evaluated in the target text set, the threat level of the text to be evaluated can be graded based on the final value obtained by weighting the matching degrees with the corresponding evaluation weights, that is, the threat level corresponding to the text to be evaluated is determined based on the correspondence between the final value and the threat level, and the threat level is used as the evaluation result of the text to be evaluated.
For example, when the threat level of the text to be evaluated needs to be evaluated, the corresponding relationship between the matching degree and the evaluation weight corresponding to the text to be evaluated may be as shown in table 1:
Table 1: Evaluation weight table
Degree of matching | Evaluation weight
Number of information categories hit | 2
Hit sensitive information ratio | 3
Response return type | Text: 2; Json: 3; jpg/png: 2; other: 1
Number of hit evaluation keywords | 2
Proportion of hit evaluation keywords | 3
Whether a finance class evaluation keyword is hit | 3
Whether an identity class evaluation keyword is hit | 3
Whether a location class evaluation keyword is hit | 3
Whether a password class evaluation keyword is hit | 3
Whether a file class evaluation keyword is hit | 3
Whether a user profile class evaluation keyword is hit | 2
Whether a log class evaluation keyword is hit | 2
Whether a device class evaluation keyword is hit | 2
Whether a business class evaluation keyword is hit | 1
Whether a communication class evaluation keyword is hit | 1
The evaluation keywords in the keyword library can be financial evaluation keywords, identity evaluation keywords, location evaluation keywords, password evaluation keywords, file evaluation keywords, user data evaluation keywords, log evaluation keywords, equipment evaluation keywords, business evaluation keywords, and communication evaluation keywords. Correspondingly, for a text to be evaluated, the information category which can be hit can comprise a finance category, an identity category, a location category, a password category, a file category, a user profile category, a log category, an equipment category, a business category and a communication category.
In Table 1, the hit sensitive information ratio is the ratio of the total text length of the hit evaluation keywords to the total text length of all evaluation keywords in the keyword library, and the response return type is the text format of the text to be evaluated. Suppose the evaluation keywords hit by the text to be evaluated are finance-class, identity-class, and password-class evaluation keywords, with 2 finance-class, 4 identity-class, and 6 password-class evaluation keywords hit. Then the number of information categories hit by the text to be evaluated is 3, with a corresponding evaluation weight of 2, and the number of hit evaluation keywords is 12, with a corresponding evaluation weight of 2.
Further, assuming that the total number of evaluation keywords in the keyword library is 24, that their total text length is 80, and that the total text length of the evaluation keywords hit by the text to be evaluated is 20, the hit sensitive information ratio of the text to be evaluated is 20/80 = 0.25, with a corresponding evaluation weight of 3, and the proportion of hit evaluation keywords is 12/24 = 0.5, with a corresponding evaluation weight of 3. If the text format (response return type) of the text to be evaluated is Text, the corresponding evaluation weight is 2. The matching degrees between the text to be evaluated and each text evaluation feature (where the matching degree for an information category is assigned 1 if a keyword of that category is hit) can be normalized, the threat score of the text to be evaluated is determined based on the normalized matching degrees and the corresponding evaluation weights, and the threat level (such as high risk or low risk) of the text to be evaluated is determined based on the threat score.
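The arithmetic of the worked example above can be sketched as follows. This is only an illustration under stated assumptions: the source does not fix the normalization rule, the way matching degrees and weights are combined, or the level threshold, so the division of counts by their maximum, the plain weighted sum, and the names `threat_score` and `HIGH_RISK_THRESHOLD` are all hypothetical.

```python
# Evaluation weights taken from Table 1.
WEIGHTS = {
    "category_count": 2, "sensitive_ratio": 3,
    "keyword_count": 2, "keyword_ratio": 3,
    "finance": 3, "identity": 3, "location": 3, "password": 3, "file": 3,
    "user_profile": 2, "log": 2, "device": 2,
    "business": 1, "communication": 1,
}
FORMAT_WEIGHTS = {"text": 2, "json": 3, "jpg/png": 2}  # any other format: 1
HIGH_RISK_THRESHOLD = 10.0  # hypothetical cut-off, not given in the source


def threat_score(matching, text_format):
    """Weighted sum of normalized matching degrees plus the format weight.

    Assumption: the source leaves the combination rule open; a plain
    weighted sum is used here purely for illustration.
    """
    score = sum(WEIGHTS[name] * value for name, value in matching.items())
    return score + FORMAT_WEIGHTS.get(text_format.lower(), 1)


# Normalized matching degrees for the example text (counts divided by their maximum).
matching = {
    "category_count": 3 / 10,   # 3 of the 10 information categories hit
    "sensitive_ratio": 20 / 80,  # hit keyword length / library keyword length
    "keyword_count": 12 / 24,    # 12 of the 24 evaluation keywords hit
    "keyword_ratio": 12 / 24,
    "finance": 1, "identity": 1, "password": 1,  # per-category hit flags
}
score = threat_score(matching, "Text")
level = "high risk" if score >= HIGH_RISK_THRESHOLD else "low risk"
print(score, level)
```

Under these assumptions the example text scores 0.6 + 0.75 + 1.0 + 1.5 + 9 + 2 = 14.85 and falls in the high-risk level; a different normalization or threshold would of course change the outcome.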
It should be particularly noted that, when ranking each text to be evaluated, the specific levels may be determined based on the requirements of the actual application scenario, and are not limited herein. For example, for an enterprise or an organization, the texts to be evaluated may include high-value information such as customer information, technical information, important decision information, summaries of major meetings, financial budget information, and various financial statements. Different importance levels can therefore be assigned according to the value, content sensitivity, influence, and distribution range of the information. For example, general enterprise data may be classified into five levels:
Strictly confidential: if a text to be evaluated at this level is damaged or leaked, the organization may be exposed to serious financial or legal risks; examples include financial information and system or personal authentication information.
Confidential: if a text to be evaluated at this level is damaged or leaked, the organization may be exposed to financial or legal risks; examples include credit card information, personal health information, and trade secrets.
Secret: if a text to be evaluated at this level is damaged or leaked, operations may be negatively affected; examples include contract texts with partners and suppliers and employee review information texts.
Internal disclosure: a text to be evaluated at this level is information that is not disclosed publicly; examples include sales manuals, organizational charts, and employee information.
External disclosure: a text to be evaluated at this level can be freely disclosed; examples include marketing materials, contact information, and price lists.
The text processing method provided by the embodiment of the present application is further described below with reference to fig. 6, which is a schematic flowchart of evaluating a text to be evaluated according to an embodiment of the present application. For each text to be evaluated in the text set to be evaluated, text preprocessing such as word segmentation and the removal of stop words, punctuation, and dirty data can be performed to obtain the words of the preprocessed text to be evaluated. Each word in the preprocessed text to be evaluated is then encoded to obtain its coding feature, a word vector of each word is obtained from its coding feature through a vector extraction model, and the text feature of the text to be evaluated is determined based on the word vectors of its words. In this way the text feature of each text to be evaluated in the text set to be evaluated is obtained.
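A minimal sketch of the preprocessing step is given below. It assumes whitespace-delimited text and a small hypothetical stop-word list; for languages without word delimiters a dedicated word segmenter would take the place of `split()`. The function name `preprocess` is illustrative and does not come from the source.

```python
import re

# Hypothetical stop-word list used only for this illustration.
STOP_WORDS = {"the", "a", "an", "of", "is", "to", "and"}


def preprocess(text):
    """Lower-case the text, replace punctuation with spaces, split on
    whitespace, and drop stop words."""
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    return [token for token in cleaned.split() if token not in STOP_WORDS]


words = preprocess("The password of the admin account is: hunter2!")
print(words)
```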
Further, the text features are clustered based on a clustering algorithm to obtain a plurality of text sets, and a target text set is determined based on the number of texts to be evaluated contained in each text set. For each text to be evaluated in the target text set, the evaluation result of the text is determined according to the matching degree between the text and the text evaluation features, so that the evaluation results of all the texts to be evaluated in the target text set are determined.
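The clustering and target-set selection can be sketched as follows. The passage does not name a clustering algorithm or say whether the target text set is the one with the most or the fewest texts, so the tiny k-means below, its first-k seeding, and the `pick_largest` switch are assumptions made for illustration only.

```python
from collections import defaultdict


def kmeans(points, k, iterations=20):
    """Tiny k-means; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def select_target_set(text_ids, labels, pick_largest):
    """Group text ids by cluster label and pick one set by its size."""
    sets = defaultdict(list)
    for text_id, label in zip(text_ids, labels):
        sets[label].append(text_id)
    chooser = max if pick_largest else min
    return chooser(sets.values(), key=len)


# Hypothetical two-dimensional text features for five texts.
features = [(0.0, 0.1), (5.0, 5.1), (0.1, 0.0), (0.05, 0.05), (0.0, 0.0)]
ids = ["t1", "t2", "t3", "t4", "t5"]
labels = kmeans(features, k=2)
print(select_target_set(ids, labels, pick_largest=False))
```

Only the texts in the selected set are passed on to evaluation, which is where the reduction in processing volume comes from.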
In some feasible embodiments, after the evaluation result of each text to be evaluated in the target text set is determined, the evaluation result of each text to be evaluated in the target text set may be presented to the user through an evaluation result display interface, for example, the risk level of each text to be evaluated is determined based on the matching degree of each text to be evaluated and each text evaluation feature and the corresponding evaluation weight, so that the risk level of each text to be evaluated may be presented to the user as the evaluation result.
Optionally, the evaluation result of a part of the text to be evaluated may be displayed to the user according to the actual application scene requirements, for example, based on the matching degree of each text to be evaluated and each text evaluation feature and the corresponding evaluation weight, the importance level of each text to be evaluated (for example, the importance level of a scientific literature) is determined, and the text to be evaluated with a high importance level is displayed to the user as the evaluation result.
Optionally, the number of texts to be evaluated with the same evaluation result may be counted, and each evaluation result and the number of corresponding texts to be evaluated are displayed to the user together, or the storage location of each text to be evaluated is displayed to the user while the evaluation result is displayed to the user, and the specific display manner is not limited herein.
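As a small illustration of the counting described above, the sketch below tallies texts per evaluation result and keeps each storage location alongside its result. The record layout and the paths are hypothetical.

```python
from collections import Counter

# Hypothetical evaluation records: (storage location, evaluation result).
records = [
    ("/data/a.txt", "high"),
    ("/data/b.txt", "low"),
    ("/data/c.txt", "high"),
    ("/data/d.txt", "medium"),
]

# Number of texts to be evaluated for each evaluation result.
counts = Counter(result for _, result in records)

# Storage locations grouped by evaluation result, for display next to the counts.
locations = {}
for path, result in records:
    locations.setdefault(result, []).append(path)

for result, number in counts.most_common():
    print(result, number, locations[result])
```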
Referring to fig. 7, fig. 7 is a schematic diagram of a display scenario of an evaluation result provided in an embodiment of the present application. Fig. 7 shows an evaluation result display interface in which the number of texts to be evaluated at each risk level can be shown to the user; for example, the number of texts to be evaluated with a high risk level is 20, the number with a medium risk level is 10, and the number with a low risk level is 40. While the evaluation result is displayed, detailed viewing indication information can also be provided, so that the detailed text information of the texts to be evaluated corresponding to each risk level is displayed to the user in response to the user's viewing operation.
Optionally, after the evaluation result is displayed to the user based on the evaluation result display interface, the user operation information received through the evaluation result display interface may be acquired, and the corresponding text to be evaluated is correspondingly processed based on the processing mode corresponding to the user operation information.
The processing mode corresponding to the user operation information may be determined based on an actual application scenario, and is not limited herein. For example, a text to be evaluated with high importance is displayed to the user based on the evaluation result display interface, and the text to be evaluated with high importance can be encrypted based on the user operation information. For another example, the text to be evaluated is a log text, and after the number of the high-risk log texts is displayed to the user based on the evaluation result display interface, the high-risk log texts may be cleared based on the user operation.
In the embodiment of the application, a plurality of text sets are obtained by classifying the texts to be evaluated, and the target text set is determined according to the number of the texts to be evaluated contained in each text set, so that the texts to be evaluated can be filtered, the processing amount of the texts to be evaluated is reduced, and the text processing efficiency is improved. On the other hand, each text to be evaluated in the target text set can be accurately evaluated through the plurality of text evaluation features, and then corresponding processing measures are taken for the texts to be evaluated corresponding to different evaluation results, so that text management is facilitated, information safety is improved, and applicability is high.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application. The text processing apparatus 1 provided in the embodiment of the present application includes:
the obtaining module 11 is configured to acquire a text set to be evaluated and a plurality of text evaluation features for text evaluation, where the text set to be evaluated includes a plurality of texts to be evaluated;
the classification module 12 is configured to classify the multiple texts to be evaluated to obtain multiple text sets;
a determining module 13, configured to determine a target text set in the plurality of text sets based on the number of texts to be evaluated included in each text set;
and the evaluation module 14 is configured to determine, for each text to be evaluated in the target text set, an evaluation result of the text to be evaluated based on matching degrees of the text to be evaluated and the text evaluation features.
In some possible embodiments, for each text to be evaluated in the target text set, the evaluation module 14 is configured to:
obtaining an evaluation weight corresponding to the matching degree of the text to be evaluated and each text evaluation feature;
and determining the evaluation result of the text to be evaluated based on the matching degree of the text to be evaluated and the plurality of text evaluation characteristics and the corresponding evaluation weight.
In some possible embodiments, the text evaluation feature includes at least one of:
at least one category of information;
a text format;
the keyword library comprises a plurality of evaluation keywords.
In some possible embodiments, the text evaluation feature includes a plurality of information categories, and for each text to be evaluated in the target text set, the evaluation module 14 is further configured to:
determining the information category hit by the text to be evaluated in the plurality of information categories;
and determining the matching degree of the text to be evaluated and the plurality of information categories based on the information categories hit by the text to be evaluated.
In some possible embodiments, the text evaluation feature includes a keyword library, the keyword library includes a plurality of evaluation keywords, and for each text to be evaluated in the target text set, the evaluation module 14 is further configured to:
determining the evaluation keywords hit by the text to be evaluated in the keyword library;
and determining the matching degree of the text to be evaluated and the plurality of evaluation keywords based on the evaluation keywords hit by the text to be evaluated.
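The category and keyword matching described in the two embodiments above can be sketched as follows. The keyword library contents, the case-insensitive substring matching, and the function name `match_keywords` are assumptions for illustration; the source does not specify how a hit is detected.

```python
# Hypothetical keyword library: information category -> evaluation keywords.
KEYWORD_LIBRARY = {
    "finance": ["bank card", "balance"],
    "identity": ["id number", "passport"],
    "password": ["password", "token"],
}


def match_keywords(text, library):
    """Return hit categories, hit keyword count, and the two hit ratios."""
    lowered = text.lower()
    hits = {cat: [kw for kw in kws if kw in lowered] for cat, kws in library.items()}
    hit_keywords = [kw for kws in hits.values() for kw in kws]
    all_keywords = [kw for kws in library.values() for kw in kws]
    return {
        "categories_hit": [cat for cat, kws in hits.items() if kws],
        "keyword_count": len(hit_keywords),
        # Share of library keywords that were hit.
        "keyword_ratio": len(hit_keywords) / len(all_keywords),
        # Total length of hit keywords over total length of all library keywords.
        "sensitive_ratio": sum(map(len, hit_keywords)) / sum(map(len, all_keywords)),
    }


result = match_keywords("User password reset; passport and balance attached.", KEYWORD_LIBRARY)
print(result)
```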
In some possible embodiments, the determining module 13 is configured to:
acquiring a sample text set, wherein the sample text set comprises a plurality of sample texts;
and determining an evaluation keyword from the candidate words based on the appearance of the candidate words contained in the sample texts.
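One possible reading of selecting evaluation keywords "based on the appearance of the candidate words" is a document-frequency threshold, sketched below. The threshold value, the whitespace tokenization, and the name `select_keywords` are assumptions and are not stated in the source.

```python
from collections import Counter


def select_keywords(sample_texts, min_doc_ratio=0.5):
    """Keep candidate words that occur in at least min_doc_ratio of the sample texts."""
    doc_freq = Counter()
    for text in sample_texts:
        # Count each candidate word once per sample text.
        doc_freq.update(set(text.lower().split()))
    total = len(sample_texts)
    return sorted(word for word, count in doc_freq.items() if count / total >= min_doc_ratio)


samples = ["password leaked in log", "reset password token", "token expired"]
keywords = select_keywords(samples)
print(keywords)
```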
In some possible embodiments, the classification module 12 is configured to:
determining the text characteristics of each text to be evaluated;
and clustering the text features of the texts to be evaluated based on the text features of the texts to be evaluated, and obtaining a plurality of text sets based on clustering results.
In some possible embodiments, the classification module 12 is configured to:
for each text to be evaluated, coding each word in the text to be evaluated to obtain the coding characteristics of each word in the text to be evaluated;
determining word vectors of the words in the text to be evaluated based on the coding features of the words in the text to be evaluated;
and determining the text characteristics of the text to be evaluated based on the word vectors of the words in the text to be evaluated.
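The three steps above can be sketched as follows, assuming one-hot coding features, a fixed embedding matrix standing in for the trained vector extraction model, and mean pooling of word vectors as the text feature. None of these choices is fixed by the source; the vocabulary and matrix values are hypothetical.

```python
def one_hot(word, vocab):
    """Coding feature of a word: a one-hot vector over the vocabulary."""
    vec = [0.0] * len(vocab)
    vec[vocab[word]] = 1.0
    return vec


def word_vector(encoding, embedding):
    """Multiply the one-hot row vector by the embedding matrix."""
    dim = len(embedding[0])
    return [sum(encoding[i] * embedding[i][d] for i in range(len(encoding))) for d in range(dim)]


def text_feature(words, vocab, embedding):
    """Text feature as the mean of the word vectors of in-vocabulary words."""
    vectors = [word_vector(one_hot(w, vocab), embedding) for w in words if w in vocab]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


vocab = {"password": 0, "admin": 1, "log": 2}
embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical trained word vectors
feature = text_feature(["password", "admin", "unknown"], vocab, embedding)
print(feature)
```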
In some possible embodiments, for each text to be evaluated, the determining of the word vector of each word in the text to be evaluated based on the coding feature of each word in the text to be evaluated is implemented by a vector extraction model;
the vector extraction model is obtained by training in the following way:
acquiring a training data set, wherein the training data set comprises a plurality of training texts;
coding each word in each training text to obtain the coding characteristics of each word in each training text;
inputting the coding characteristics of each word in each training text into a neural network model, and, for each word, determining through the neural network model the word vector of each word adjacent to that word according to the coding characteristics of the adjacent word;
determining word vector distribution corresponding to the word based on the word vectors of the adjacent words, and determining a predicted word corresponding to the word based on the word vector distribution;
determining a training loss value based on each word and the corresponding predicted word in each training text, performing iterative training on the neural network model according to the training loss value and the training data set until the training loss value meets a preset training end condition, and determining the model after training as the vector extraction model.
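The training procedure above resembles a continuous-bag-of-words (CBOW) setup, in which a word is predicted from the vectors of its neighbours. The sketch below is a minimal pure-Python illustration under that assumption: the network shape, the mean over context vectors, the softmax cross-entropy loss, and the fixed number of epochs as the end condition are all choices made here and are not confirmed by the source.

```python
import math
import random


def train_cbow(texts, dim=8, window=1, epochs=200, lr=0.1, seed=0):
    """Predict each word from the mean vector of its adjacent words."""
    rng = random.Random(seed)
    vocab = {w: i for i, w in enumerate(sorted({w for t in texts for w in t}))}
    size = len(vocab)
    # Row i of w_in is the word vector of word i (index lookup == one-hot times matrix).
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(size)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(size)]
    losses = []
    for _ in range(epochs):
        total, count = 0.0, 0
        for text in texts:
            ids = [vocab[w] for w in text]
            for pos, target in enumerate(ids):
                context = ids[max(0, pos - window):pos] + ids[pos + 1:pos + 1 + window]
                if not context:
                    continue
                hidden = [sum(w_in[c][d] for c in context) / len(context) for d in range(dim)]
                scores = [sum(w_out[v][d] * hidden[d] for d in range(dim)) for v in range(size)]
                top = max(scores)
                exps = [math.exp(s - top) for s in scores]
                norm = sum(exps)
                probs = [e / norm for e in exps]  # distribution over predicted words
                total += -math.log(probs[target])
                count += 1
                # Gradient of the cross-entropy loss with respect to the scores.
                grad = probs[:]
                grad[target] -= 1.0
                grad_hidden = [sum(grad[v] * w_out[v][d] for v in range(size)) for d in range(dim)]
                for v in range(size):
                    for d in range(dim):
                        w_out[v][d] -= lr * grad[v] * hidden[d]
                for c in context:
                    for d in range(dim):
                        w_in[c][d] -= lr * grad_hidden[d] / len(context)
        losses.append(total / max(count, 1))
    return vocab, w_in, losses


corpus = [["user", "password", "leaked"], ["admin", "password", "reset"]]
vocab, vectors, losses = train_cbow(corpus)
print(losses[0], losses[-1])
```

After training, `vectors[vocab[word]]` plays the role of the word vector produced by the vector extraction model; a loss threshold could replace the fixed epoch count as the preset training end condition.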
In some possible embodiments, the above-mentioned evaluation module 14 is further configured to:
displaying the evaluation result of each text to be evaluated in the target text set to a user through an evaluation result display interface;
and acquiring user operation information received through the evaluation result display interface, and performing corresponding processing on at least one text to be evaluated in the target text set based on the user operation information.
In some possible embodiments, the obtaining module 11 is configured to:
acquiring a text set to be processed, wherein the text set to be processed comprises a plurality of texts to be processed;
obtaining a plurality of text type prediction models, wherein each text type prediction model corresponds to one text type;
for each text to be processed, determining, based on each text type prediction model, the prediction probability that the text type of the text to be processed is the text type corresponding to that text type prediction model, and determining the text type corresponding to the text type prediction model with the highest prediction probability as the text type of the text to be processed;
determining the texts to be processed of the same text type as a text set to be evaluated, and acquiring any text set to be evaluated.
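The steps above can be sketched as follows. Each text type prediction model is represented by a hypothetical callable that returns a probability; the two stand-in models and the function name `group_by_predicted_type` are illustrative only and do not come from the source.

```python
from collections import defaultdict


def group_by_predicted_type(texts, models):
    """models maps a text type to a callable returning P(text is of that type)."""
    groups = defaultdict(list)
    for text in texts:
        # Pick the text type whose model gives the highest prediction probability.
        best_type = max(models, key=lambda text_type: models[text_type](text))
        groups[best_type].append(text)
    return dict(groups)


# Hypothetical stand-ins for trained text type prediction models.
models = {
    "log": lambda text: 0.9 if "ERROR" in text else 0.1,
    "http_response": lambda text: 0.8 if text.startswith("{") else 0.2,
}
texts = ['{"user": "alice"}', "ERROR disk full", '{"card": "1234"}']
sets_to_evaluate = group_by_predicted_type(texts, models)
print(sets_to_evaluate)
```

Each value in `sets_to_evaluate` is one text set to be evaluated, and any of them can then be passed to the classification step.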
In a specific implementation, the apparatus 1 may execute the implementation manners provided in the steps in fig. 2 through the built-in functional modules, which may specifically refer to the implementation manners provided in the steps, and are not described herein again.
In some possible embodiments, the text processing apparatus may be a computer program (including program code) running on a computer device, for example, the text processing apparatus is an application software; the apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application.
In some possible implementations, the text processing apparatus provided in this embodiment may be implemented by a combination of hardware and software, and by way of example, the text processing apparatus provided in this embodiment may be a processor in the form of a hardware decoding processor, which is programmed to execute the text processing method provided in this embodiment, for example, the processor in the form of a hardware decoding processor may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
In some possible embodiments, the text processing apparatus provided in the embodiments of the present application may be implemented in software, and the text processing apparatus shown in fig. 8 may be software in the form of programs and plug-ins, and includes a series of modules, including an obtaining module 11, a classifying module 12, a determining module 13, and an evaluating module 14, for implementing the text processing method provided in the embodiments of the present application.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present application. As shown in fig. 9, the electronic device 1000 in the present embodiment may include a processor 1001, a network interface 1004, and a memory 1005, and may further include a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and a keyboard (Keyboard), and optionally may also include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., at least one disk memory). The memory 1005 may optionally be at least one storage device located remotely from the processor 1001. As shown in fig. 9, the memory 1005, which is a kind of computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the electronic device 1000 shown in fig. 9, the network interface 1004 may provide a network communication function; the user interface 1003 is an interface for providing a user with input; and the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement the text processing method provided by the embodiments of the present application.
It should be understood that in some possible embodiments, the processor 1001 may be a Central Processing Unit (CPU), or may be another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor or any conventional processor. The memory may include both read-only memory and random access memory, and provides instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In a specific implementation, the electronic device 1000 may execute the implementation manners provided in the steps in fig. 2 through the built-in functional modules, which may specifically refer to the implementation manners provided in the steps, and are not described herein again.
An embodiment of the present application further provides a computer-readable storage medium in which a computer program is stored. When executed by a processor, the computer program implements the method provided in each step in fig. 2; reference may be made to the implementation manner provided in each step, which is not described herein again.
The computer readable storage medium may be the text processing apparatus provided in any of the foregoing embodiments or an internal storage unit of an electronic device, such as a hard disk or a memory of the electronic device. The computer readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Memory Card (SMC), a Secure Digital (SD) card, a flash card (flash card), and the like, which are provided on the electronic device. The computer readable storage medium may further include a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), and the like. Further, the computer readable storage medium may also include both an internal storage unit and an external storage device of the electronic device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the electronic device. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium and executes them, causing the electronic device to perform the method provided by the steps of fig. 2.
The terms "first", "second", and the like in the claims and in the description and drawings of the present application are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or electronic device that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or electronic device. Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments. The term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that, to clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functions. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above disclosure is only for the purpose of illustrating the preferred embodiments of the present application and is not intended to limit the scope of the present application, which is defined by the appended claims.