Method and system for filtering and replacing sensitive words in text content in an online customer service scenario
1. A method for filtering and replacing text content sensitive words in an online customer service scene, applied to an online customer service robot, characterized by comprising the following steps:
step S1: creating a data bucket;
step S2: acquiring a plurality of preset sensitive word banks, and configuring the data bucket based on the sensitive word banks;
step S3: acquiring a text needing to be subjected to sensitive word filtering replacement;
step S4: performing sensitive word filtering replacement on the text based on the configured data bucket to obtain a target text, and outputting the target text.
2. The method as claimed in claim 1, wherein in step S2, configuring the data bucket based on the sensitive word banks comprises:
acquiring feature information of the sensitive word bank, wherein the feature information comprises: matching length and trigger probability;
inquiring a preset node comparison table, and determining nodes corresponding to the trigger probability in the data bucket;
and storing the sensitive word bank corresponding to the trigger probability on the node based on a red-black tree.
3. The method for filtering and replacing the text content sensitive word in the online customer service scenario according to claim 1, wherein in step S4, performing sensitive word filtering replacement on the text based on the configured data bucket comprises:
performing word segmentation processing on the text to obtain a plurality of first target words;
indexing the first target word in the data bucket;
taking the corresponding node which is currently indexed as a target node, and taking the sensitive word bank of which the matching length is less than or equal to the text length of the first target word on the target node as a target sensitive word bank;
matching the first target word with a second target word in the target sensitive word bank;
and after all the first target words have been indexed on all the nodes, replacing, in the text, each first target word whose number of matches is greater than or equal to a preset count threshold with a preset replacement text to obtain the target text.
4. The method as claimed in claim 3, wherein before outputting the target text in step S4, the method further comprises:
preprocessing the target text;
wherein preprocessing the target text comprises:
taking any first target word needing to be replaced by the replacement text in the text as a third target word;
extracting a first feature of the third target word;
establishing a trigger feature database, matching the first feature with a second feature in the trigger feature database, and if the first feature and the second feature are matched, acquiring the feature type of the second feature matched with the first feature;
inquiring a preset inquiry direction comparison table, and determining at least one inquiry direction corresponding to the characteristic type;
determining a first position of the third target word in the text;
acquiring a preset first number of fourth target words located in the query direction from the first position in the text;
extracting a third feature of the fourth target word;
acquiring a preset approximate sensitive feature database, matching the third feature with a fourth feature in the approximate sensitive feature database, and if they match, determining a second position of the fourth target word in the text;
acquiring a preset second number of fifth target words located before and/or after the second position in the text;
extracting a fifth feature of the fifth target word;
acquiring a preset negative feature database, matching the fifth feature with a sixth feature in the negative feature database, and if the fifth feature does not match any sixth feature in the negative feature database, replacing the fourth target word corresponding to the third feature that matches the fourth feature with the replacement text;
and finishing preprocessing after all the fourth target words needing to be replaced by the replacement text in the target text are replaced.
5. The method of claim 4, wherein establishing the trigger feature database comprises:
respectively acquiring a preset trigger word set and a preset approximate sensitive word database;
inquiring a preset associated trigger word comparison table, and determining at least one associated trigger word corresponding to each trigger word in the trigger word set;
creating a first event, the first event comprising: the sensitive sentence comprises the trigger word, and at least one approximate sensitive word in the approximate sensitive word database appears in a preset text length range before and/or after the trigger word in the sensitive sentence;
creating a second event, the second event comprising: the sensitive sentence comprises the associated trigger word, and at least one approximate sensitive word in the approximate sensitive word database appears in a preset text length range before and/or after the associated trigger word in the sensitive sentence;
respectively acquiring sensitive statement big data and a preset evaluation model;
respectively evaluating the occurrence conditions of the first event and the second event in the sensitive statement big data by using the evaluation model;
acquiring a plurality of first evaluation values output after the evaluation model evaluates the first event and a plurality of second evaluation values output after the evaluation model evaluates the second event;
calculating an evaluation index based on the first evaluation value and the second evaluation value, the calculation formula being as follows:
wherein $\sigma$ is the evaluation index, $\theta_{1,i}$ is the $i$-th first evaluation value, $\theta_{2,i}$ is the $i$-th second evaluation value, $\alpha$ is the total number of the first evaluation values, $\beta$ is the total number of the second evaluation values, $O_1$ and $O_2$ are preset weight values with $O_2 > 1 > O_1 > 0$, $\gamma$ is an intermediate variable, $\mu_1$ is a first number, namely the number of the first evaluation values that are less than or equal to a preset first evaluation value threshold, $\mu_2$ is a second number, namely the number of the second evaluation values that are less than or equal to a preset second evaluation value threshold, and $\mu_0$ is a preset number threshold [the formula itself and the piecewise definition of $\gamma$ are missing from the source text];
acquiring a preset blank database, if the evaluation index is greater than or equal to a preset evaluation index threshold value, extracting a seventh feature of the trigger word and eighth features of all the associated trigger words corresponding to the trigger word, and storing the seventh feature and the eighth features into the blank database;
and when the seventh features of all trigger words in the trigger word set that need to be stored, together with the eighth features of all their corresponding associated trigger words, have been stored in the blank database, taking the blank database as the trigger feature database, thereby completing the establishment.
6. A system for filtering and replacing text content sensitive words in an online customer service scene, applied to an online customer service robot, characterized by comprising:
a creation module for creating a data bucket;
the configuration module is used for acquiring a plurality of preset sensitive word banks and configuring the data bucket based on the sensitive word banks;
the acquisition module is used for acquiring a text which needs to be subjected to sensitive word filtering replacement;
and the filtering and replacing module is used for filtering and replacing the sensitive words of the text based on the configured data bucket to obtain a target text and outputting the target text.
7. The system of claim 6, wherein the configuration module performs the following operations:
acquiring feature information of the sensitive word bank, wherein the feature information comprises: matching length and triggering probability;
inquiring a preset node comparison table, and determining nodes corresponding to the trigger probability in a data bucket;
and storing the sensitive word bank corresponding to the trigger probability on the node based on a red-black tree.
8. The system for filtering and replacing the text content sensitive words in the online customer service scene as recited in claim 6, wherein the filtering and replacing module performs the following operations:
performing word segmentation processing on the text to obtain a plurality of first target words;
indexing the first target word in the data bucket;
taking the corresponding node which is currently indexed as a target node, and taking the sensitive word bank of which the matching length is less than or equal to the text length of the first target word on the target node as a target sensitive word bank;
matching the first target word with a second target word in the target sensitive word bank;
and after all the first target words have been indexed on all the nodes, replacing, in the text, each first target word whose number of matches is greater than or equal to a preset count threshold with a preset replacement text to obtain the target text.
9. The system of claim 8, wherein the filtering replacement module further performs the following operations:
preprocessing the target text;
wherein the filtering and replacing module preprocesses the target text by specifically executing the following operations:
taking any first target word needing to be replaced by the replacement text in the text as a third target word;
extracting a first feature of the third target word;
establishing a trigger feature database, matching the first feature with a second feature in the trigger feature database, and if the first feature and the second feature are matched, acquiring the feature type of the second feature matched with the first feature;
inquiring a preset inquiry direction comparison table, and determining at least one inquiry direction corresponding to the characteristic type;
determining a first position of the third target word in the text;
acquiring a preset first number of fourth target words located in the query direction from the first position in the text;
extracting a third feature of the fourth target word;
acquiring a preset approximate sensitive feature database, matching the third feature with a fourth feature in the approximate sensitive feature database, and if they match, determining a second position of the fourth target word in the text;
acquiring a preset second number of fifth target words located before and/or after the second position in the text;
extracting a fifth feature of the fifth target word;
acquiring a preset negative feature database, matching the fifth feature with a sixth feature in the negative feature database, and if the fifth feature does not match any sixth feature in the negative feature database, replacing the fourth target word corresponding to the third feature that matches the fourth feature with the replacement text;
and finishing preprocessing after all the fourth target words needing to be replaced by the replacement text in the target text are replaced.
10. The system for filtering and replacing the text content sensitive words in the online customer service scene according to claim 9, wherein the filtering and replacing module establishes the trigger feature database by specifically executing the following operations:
respectively acquiring a preset trigger word set and a preset approximate sensitive word database;
inquiring a preset associated trigger word comparison table, and determining at least one associated trigger word corresponding to each trigger word in the trigger word set;
creating a first event, the first event comprising: the sensitive sentence comprises the trigger word, and at least one approximate sensitive word in the approximate sensitive word database appears in a preset text length range before and/or after the trigger word in the sensitive sentence;
creating a second event, the second event comprising: the sensitive sentence comprises the associated trigger word, and at least one approximate sensitive word in the approximate sensitive word database appears in a preset text length range before and/or after the associated trigger word in the sensitive sentence;
respectively acquiring sensitive statement big data and a preset evaluation model;
respectively evaluating the occurrence conditions of the first event and the second event in the sensitive statement big data by using the evaluation model;
acquiring a plurality of first evaluation values output after the evaluation model evaluates the first event and a plurality of second evaluation values output after the evaluation model evaluates the second event;
calculating an evaluation index based on the first evaluation value and the second evaluation value, the calculation formula being as follows:
wherein $\sigma$ is the evaluation index, $\theta_{1,i}$ is the $i$-th first evaluation value, $\theta_{2,i}$ is the $i$-th second evaluation value, $\alpha$ is the total number of the first evaluation values, $\beta$ is the total number of the second evaluation values, $O_1$ and $O_2$ are preset weight values with $O_2 > 1 > O_1 > 0$, $\gamma$ is an intermediate variable, $\mu_1$ is a first number, namely the number of the first evaluation values that are less than or equal to a preset first evaluation value threshold, $\mu_2$ is a second number, namely the number of the second evaluation values that are less than or equal to a preset second evaluation value threshold, and $\mu_0$ is a preset number threshold [the formula itself and the piecewise definition of $\gamma$ are missing from the source text];
acquiring a preset blank database, if the evaluation index is greater than or equal to a preset evaluation index threshold value, extracting a seventh feature of the trigger word and eighth features of all the associated trigger words corresponding to the trigger word, and storing the seventh feature and the eighth features into the blank database;
and when the seventh features of all trigger words in the trigger word set that need to be stored, together with the eighth features of all their corresponding associated trigger words, have been stored in the blank database, taking the blank database as the trigger feature database, thereby completing the establishment.
Background
At present, when an online customer service robot serves a visitor through text-only communication, the text input by the visitor needs to be subjected to sensitive word filtering and replacement so as to maintain normal network order.
Disclosure of Invention
The invention aims to provide a method and a system for filtering and replacing sensitive words of text contents in an online customer service scene.
The method for filtering and replacing the text content sensitive words in the online customer service scene provided by the embodiment of the invention comprises the following steps:
step S1: creating a data bucket;
step S2: acquiring a plurality of preset sensitive word banks, and configuring a data bucket based on the sensitive word banks;
step S3: acquiring a text needing to be subjected to sensitive word filtering replacement;
step S4: performing sensitive word filtering replacement on the text based on the configured data bucket to obtain a target text, and outputting the target text.
Preferably, in step S2, configuring the data bucket based on the sensitive word banks comprises:
acquiring feature information of a sensitive word bank, wherein the feature information comprises: matching length and trigger probability;
inquiring a preset node comparison table, and determining nodes corresponding to the trigger probability in the data bucket;
and storing the sensitive word bank corresponding to the trigger probability on the node based on a red-black tree.
Preferably, in step S4, performing sensitive word filtering replacement on the text based on the configured data bucket comprises:
performing word segmentation processing on the text to obtain a plurality of first target words;
indexing the first target word in a data bucket;
taking a corresponding node which is currently indexed as a target node, and taking a sensitive word bank with the matching length smaller than or equal to the text length of a first target word on the target node as a target sensitive word bank;
matching the first target word with a second target word in a target sensitive word bank;
and after all the first target words have been indexed on all the nodes, replacing, in the text, each first target word whose number of matches is greater than or equal to a preset count threshold with a preset replacement text to obtain the target text.
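The matching and replacement steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the bucket layout (`BUCKET`, a mapping from node name to a list of `(matching_length, words)` pairs), the regex-based `segment` function standing in for a real word segmenter, and the names `filter_replace`, `threshold` and `replacement` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical configured data bucket: node name -> list of
# (matching_length, set of sensitive words) pairs.
BUCKET = {
    "node_high": [(4, {"scam"}), (7, {"swindle"})],
    "node_low": [(4, {"scam", "dupe"})],
}


def segment(text):
    """Stand-in for word segmentation: lowercase alphanumeric runs."""
    return re.findall(r"\w+", text.lower())


def filter_replace(text, bucket, threshold=1, replacement="***"):
    """Count matches per first target word over all nodes, then replace
    every word whose match count reaches the preset count threshold."""
    hits = Counter()
    for word in set(segment(text)):
        for banks in bucket.values():                 # index on every node
            for matching_length, bank in banks:
                # Only banks whose matching length does not exceed the
                # word length are treated as target sensitive word banks.
                if matching_length <= len(word) and word in bank:
                    hits[word] += 1
    flagged = {w for w, n in hits.items() if n >= threshold}

    def swap(match):
        token = match.group(0)
        return replacement if token.lower() in flagged else token

    return re.sub(r"\w+", swap, text)


strict = filter_replace("This is a scam, not a dupe", BUCKET, threshold=2)
loose = filter_replace("This is a scam, not a dupe", BUCKET, threshold=1)
print(strict)
print(loose)
```

With a threshold of 2 only "scam" is replaced, because it matches on both nodes, while "dupe" matches on one node only; with a threshold of 1 both words are replaced.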
Preferably, before outputting the target text, step S4 further includes:
preprocessing a target text;
wherein preprocessing the target text comprises:
taking any first target word needing to be replaced by a replacement text in the text as a third target word;
extracting a first feature of a third target word;
establishing a trigger characteristic database, matching the first characteristic with a second characteristic in the trigger characteristic database, and if the first characteristic is matched with the second characteristic in the trigger characteristic database, acquiring the characteristic type of the matched second characteristic;
inquiring a preset inquiry direction comparison table, and determining at least one inquiry direction corresponding to the characteristic type;
determining a first position of a third target word in the text;
acquiring a preset first number of fourth target words located in the query direction from the first position in the text;
extracting a third feature of the fourth target word;
acquiring a preset approximate sensitive feature database, matching the third feature with a fourth feature in the approximate sensitive feature database, and if they match, determining a second position of the fourth target word in the text;
acquiring a preset second number of fifth target words located before and/or after the second position in the text;
extracting a fifth feature of the fifth target word;
acquiring a preset negative feature database, matching the fifth feature with a sixth feature in the negative feature database, and if the fifth feature does not match any sixth feature in the negative feature database, replacing the fourth target word corresponding to the third feature that matches the fourth feature with the replacement text;
and finishing preprocessing after the fourth target words needing to be replaced by the replacement text in the target text are completely replaced.
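The context-based preprocessing above can be sketched as follows. This is a minimal illustration, assuming that the "feature" of a word is simply its lowercase form and that positions are counted in tokens; the source does not specify either. The tables `TRIGGER_FEATURES`, `QUERY_DIRECTIONS`, `APPROX_SENSITIVE` and `NEGATIVE`, and the function `preprocess`, are hypothetical.

```python
# Hypothetical lookup tables; real features would come from the databases
# described in the text.
TRIGGER_FEATURES = {"scam": "noun_trigger"}           # feature -> feature type
QUERY_DIRECTIONS = {"noun_trigger": ["before", "after"]}
APPROX_SENSITIVE = {"money", "transfer"}              # approximate sensitive features
NEGATIVE = {"not", "never", "no"}                     # negative features


def preprocess(tokens, flagged_positions, first_number=3, second_number=2,
               replacement="***"):
    """tokens: segmented original text; flagged_positions: indices of the
    words already marked for replacement (the third target words)."""
    flagged = set(flagged_positions)
    out = [replacement if i in flagged else t for i, t in enumerate(tokens)]
    for pos in flagged_positions:
        feature_type = TRIGGER_FEATURES.get(tokens[pos].lower())
        if feature_type is None:
            continue                                   # not a trigger feature
        for direction in QUERY_DIRECTIONS[feature_type]:
            if direction == "before":
                candidates = range(max(0, pos - first_number), pos)
            else:
                candidates = range(pos + 1,
                                   min(len(tokens), pos + 1 + first_number))
            for j in candidates:                       # fourth target words
                if tokens[j].lower() not in APPROX_SENSITIVE:
                    continue
                low = max(0, j - second_number)
                high = min(len(tokens), j + 1 + second_number)
                # Fifth target words: neighbours of the candidate word.
                context = [tokens[k].lower() for k in range(low, high) if k != j]
                if not any(word in NEGATIVE for word in context):
                    out[j] = replacement
    return out


plain = preprocess("send the transfer to this scam account".split(), [5])
negated = preprocess("do not transfer to this scam account".split(), [5])
print(" ".join(plain))
print(" ".join(negated))
```

In the first sentence "transfer" lies within three tokens before the trigger and has no negative word nearby, so it is also replaced; in the second sentence the neighbouring "not" prevents the replacement.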
Preferably, the establishing of the trigger characteristic database includes:
respectively acquiring a preset trigger word set and a preset approximate sensitive word database;
inquiring a preset associated trigger word comparison table, and determining at least one associated trigger word corresponding to each trigger word in the trigger word set;
creating a first event, the first event comprising: a sensitive sentence comprises a trigger word, and at least one approximate sensitive word in the approximate sensitive word database appears in a preset text length range before and/or after the trigger word in the sensitive sentence;
creating a second event, the second event comprising: the sensitive sentence comprises an associated trigger word, and at least one approximate sensitive word in the approximate sensitive word database appears in a preset text length range before and/or after the associated trigger word in the sensitive sentence;
respectively acquiring sensitive statement big data and a preset evaluation model;
respectively evaluating the occurrence conditions of the first event and the second event in the sensitive statement big data by using an evaluation model;
acquiring a plurality of first evaluation values output after the evaluation model evaluates the first event and a plurality of second evaluation values output after the evaluation model evaluates the second event;
calculating an evaluation index based on the first evaluation value and the second evaluation value, the calculation formula being as follows:
wherein $\sigma$ is the evaluation index, $\theta_{1,i}$ is the $i$-th first evaluation value, $\theta_{2,i}$ is the $i$-th second evaluation value, $\alpha$ is the total number of the first evaluation values, $\beta$ is the total number of the second evaluation values, $O_1$ and $O_2$ are preset weight values with $O_2 > 1 > O_1 > 0$, $\gamma$ is an intermediate variable, $\mu_1$ is a first number, namely the number of the first evaluation values that are less than or equal to a preset first evaluation value threshold, $\mu_2$ is a second number, namely the number of the second evaluation values that are less than or equal to a preset second evaluation value threshold, and $\mu_0$ is a preset number threshold [the formula itself and the piecewise definition of $\gamma$ are missing from the source text];
acquiring a preset blank database, if the evaluation index is greater than or equal to a preset evaluation index threshold, extracting a seventh feature of the trigger word and eighth features of all associated trigger words corresponding to the trigger word, and storing the seventh feature and the eighth features into the blank database;
and when the seventh features of all trigger words in the trigger word set that need to be stored, together with the eighth features of all their corresponding associated trigger words, have been stored in the blank database, taking the blank database as the trigger feature database, thereby completing the establishment.
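The first and second events defined above are the same kind of predicate applied to different words, and can be sketched as follows. This is only an illustration: the "preset text length range" is assumed here to be a window measured in tokens, and `occurrence_rate` is a simple stand-in for counting how often an event occurs in a corpus, not the preset evaluation model or the evaluation index, neither of which is specified in the text. The names `event_occurs`, `occurrence_rate` and `window` are hypothetical.

```python
def event_occurs(tokens, trigger, approx_words, window=3):
    """True if the sentence contains `trigger` and at least one approximate
    sensitive word lies within `window` tokens before or after it."""
    lowered = [t.lower() for t in tokens]
    for i, token in enumerate(lowered):
        if token != trigger:
            continue
        nearby = lowered[max(0, i - window):i] + lowered[i + 1:i + 1 + window]
        if any(word in approx_words for word in nearby):
            return True
    return False


def occurrence_rate(sentences, trigger, approx_words, window=3):
    """Fraction of sentences in which the event occurs (illustrative only)."""
    if not sentences:
        return 0.0
    hits = sum(event_occurs(s.split(), trigger, approx_words, window)
               for s in sentences)
    return hits / len(sentences)


corpus = [
    "wire the money to that agent today",
    "the agent was helpful",
    "money is tight",
]
rate = occurrence_rate(corpus, "agent", {"money"})
print(rate)
```

Passing an associated trigger word instead of the trigger word as `trigger` gives the corresponding check for the second event.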
The system for filtering and replacing the text content sensitive words in the online customer service scene provided by the embodiment of the invention comprises:
a creation module for creating a data bucket;
the configuration module is used for acquiring a plurality of preset sensitive word banks and configuring the data bucket based on the sensitive word banks;
the acquisition module is used for acquiring a text which needs to be subjected to sensitive word filtering replacement;
and the filtering and replacing module is used for filtering and replacing the sensitive words of the text based on the configured data bucket to obtain a target text and outputting the target text.
Preferably, the configuration module performs the following operations:
acquiring characteristic information of a sensitive word bank, wherein the characteristic information comprises: matching length and triggering probability;
inquiring a preset node comparison table, and determining nodes corresponding to the trigger probability in the data bucket;
and storing the sensitive word bank corresponding to the trigger probability on the node based on a red-black tree.
Preferably, the filtering replacement module performs the following operations:
performing word segmentation processing on the text to obtain a plurality of first target words;
indexing the first target word in a data bucket;
taking a corresponding node which is currently indexed as a target node, and taking a sensitive word bank with the matching length smaller than or equal to the text length of a first target word on the target node as a target sensitive word bank;
matching the first target word with a second target word in a target sensitive word bank;
and after all the first target words have been indexed on all the nodes, replacing, in the text, each first target word whose number of matches is greater than or equal to a preset count threshold with a preset replacement text to obtain the target text.
Preferably, the filtering replacement module further performs the following operations:
preprocessing a target text;
wherein the filtering and replacing module preprocesses the target text by specifically executing the following operations:
taking any first target word needing to be replaced by a replacement text in the text as a third target word;
extracting a first feature of a third target word;
establishing a trigger characteristic database, matching the first characteristic with a second characteristic in the trigger characteristic database, and if the first characteristic is matched with the second characteristic in the trigger characteristic database, acquiring the characteristic type of the matched second characteristic;
inquiring a preset inquiry direction comparison table, and determining at least one inquiry direction corresponding to the characteristic type;
determining a first position of a third target word in the text;
acquiring a preset first number of fourth target words located in the query direction from the first position in the text;
extracting a third feature of the fourth target word;
acquiring a preset approximate sensitive feature database, matching the third feature with a fourth feature in the approximate sensitive feature database, and if they match, determining a second position of the fourth target word in the text;
acquiring a preset second number of fifth target words located before and/or after the second position in the text;
extracting a fifth feature of the fifth target word;
acquiring a preset negative feature database, matching the fifth feature with a sixth feature in the negative feature database, and if the fifth feature does not match any sixth feature in the negative feature database, replacing the fourth target word corresponding to the third feature that matches the fourth feature with the replacement text;
and finishing preprocessing after the fourth target words needing to be replaced by the replacement text in the target text are completely replaced.
Preferably, the filtering and replacing module establishes the trigger feature database by specifically executing the following operations:
respectively acquiring a preset trigger word set and a preset approximate sensitive word database;
inquiring a preset associated trigger word comparison table, and determining at least one associated trigger word corresponding to each trigger word in the trigger word set;
creating a first event, the first event comprising: a sensitive sentence comprises a trigger word, and at least one approximate sensitive word in the approximate sensitive word database appears in a preset text length range before and/or after the trigger word in the sensitive sentence;
creating a second event, the second event comprising: the sensitive sentence comprises an associated trigger word, and at least one approximate sensitive word in the approximate sensitive word database appears in a preset text length range before and/or after the associated trigger word in the sensitive sentence;
respectively acquiring sensitive statement big data and a preset evaluation model;
respectively evaluating the occurrence conditions of the first event and the second event in the sensitive statement big data by using an evaluation model;
acquiring a plurality of first evaluation values output after the evaluation model evaluates the first event and a plurality of second evaluation values output after the evaluation model evaluates the second event;
calculating an evaluation index based on the first evaluation value and the second evaluation value, the calculation formula being as follows:
wherein $\sigma$ is the evaluation index, $\theta_{1,i}$ is the $i$-th first evaluation value, $\theta_{2,i}$ is the $i$-th second evaluation value, $\alpha$ is the total number of the first evaluation values, $\beta$ is the total number of the second evaluation values, $O_1$ and $O_2$ are preset weight values with $O_2 > 1 > O_1 > 0$, $\gamma$ is an intermediate variable, $\mu_1$ is a first number, namely the number of the first evaluation values that are less than or equal to a preset first evaluation value threshold, $\mu_2$ is a second number, namely the number of the second evaluation values that are less than or equal to a preset second evaluation value threshold, and $\mu_0$ is a preset number threshold [the formula itself and the piecewise definition of $\gamma$ are missing from the source text];
acquiring a preset blank database, if the evaluation index is greater than or equal to a preset evaluation index threshold, extracting a seventh feature of the trigger word and eighth features of all associated trigger words corresponding to the trigger word, and storing the seventh feature and the eighth features into the blank database;
and when the seventh features of all trigger words in the trigger word set that need to be stored, together with the eighth features of all their corresponding associated trigger words, have been stored in the blank database, taking the blank database as the trigger feature database, thereby completing the establishment.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flowchart of a method for filtering and replacing text content sensitive words in an online customer service scenario according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a system for filtering and replacing text content sensitive words in an online customer service scenario according to an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
The embodiment of the invention provides a method for filtering and replacing text content sensitive words in an online customer service scene, which comprises the following steps:
step S1: creating a data bucket;
step S2: acquiring a plurality of preset sensitive word banks, and configuring a data bucket based on the sensitive word banks;
step S3: acquiring a text needing to be subjected to sensitive word filtering replacement;
step S4: performing sensitive word filtering replacement on the text based on the configured data bucket to obtain a target text, and outputting the target text.
The working principle and the beneficial effects of the technical scheme are as follows:
The plurality of preset sensitive word banks are, specifically, a plurality of databases each containing a plurality of sensitive words. A data bucket is created and configured based on the sensitive word banks; a text requiring sensitive word filtering replacement is acquired (for example, a text input by a user, or an answer text acquired from the Internet to answer a question from the user); the text is filtered and replaced based on the configured data bucket to obtain a target text; and the target text is output (displayed).
In the embodiment of the invention, the data bucket is configured based on the sensitive word banks; once configured, it is used to perform sensitive word filtering replacement on the text requiring it, and the filtered and replaced target text is finally output.
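The overall flow of steps S1 to S4 can be sketched as follows. This is a deliberately simplified illustration: the internal layout of the data bucket (a plain list of word sets), whitespace tokenization, and the names `DataBucket`, `configure` and `filter_replace` are assumptions made for this example and are not taken from the source.

```python
from dataclasses import dataclass, field


@dataclass
class DataBucket:
    # S1: an initially empty container; its structure is an assumption.
    banks: list = field(default_factory=list)


def configure(bucket, word_banks):
    """S2: load every preset sensitive word bank into the bucket."""
    for bank in word_banks:
        bucket.banks.append(set(bank))


def filter_replace(bucket, text, replacement="***"):
    """S4: replace each whitespace-delimited token found in any bank."""
    cleaned = []
    for token in text.split():
        is_sensitive = any(token.lower() in bank for bank in bucket.banks)
        cleaned.append(replacement if is_sensitive else token)
    return " ".join(cleaned)


bucket = DataBucket()                                  # S1
configure(bucket, [{"scam"}, {"idiot", "stupid"}])     # S2
text = "you are an idiot"                              # S3
target_text = filter_replace(bucket, text)             # S4
print(target_text)
```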
The embodiment of the invention provides a method for filtering and replacing text content sensitive words in an online customer service scene, wherein in step S2, configuring the data bucket based on the sensitive word banks comprises:
acquiring characteristic information of a sensitive word bank, wherein the characteristic information comprises: matching length and triggering probability;
inquiring a preset node comparison table, and determining nodes corresponding to the trigger probability in the data bucket;
and storing the sensitive word bank corresponding to the trigger probability on the node based on a red-black tree.
The working principle and the beneficial effects of the technical scheme are as follows:
the preset node comparison table is specifically: a table prepared in advance by background personnel and comprising a plurality of comparison items, each comparison item comprising a trigger probability interval and a node of the data bucket; during a lookup, when the trigger probability falls within a trigger probability interval, the corresponding node is output; each sensitive word bank corresponds to one piece of characteristic information, which comprises a matching length (the uniform text length of the sensitive words in the corresponding sensitive word bank) and a trigger probability (the probability, determined from historical sensitive word filtering replacement data, that sensitive words in the corresponding sensitive word bank have appeared in acquired texts); the sensitive word banks are stored on the corresponding nodes based on a red-black tree (a self-balancing binary search tree), which improves indexing efficiency during later indexing.
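The bucket configuration described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the interval boundaries, the node names and the `WordBank` type are hypothetical, and because the Python standard library has no red-black tree, a list kept sorted with `bisect` stands in for the ordered structure held on each node.

```python
import bisect
from dataclasses import dataclass, field


@dataclass(order=True)
class WordBank:
    # Banks are ordered by matching length only; the other fields are ignored
    # when comparing, so a node keeps its banks sorted by matching length.
    match_length: int
    trigger_probability: float = field(compare=False)
    words: frozenset = field(compare=False)


# Hypothetical node comparison table: (low, high, node), interval is [low, high).
NODE_TABLE = [
    (0.0, 0.2, "node_low"),
    (0.2, 0.6, "node_mid"),
    (0.6, 1.01, "node_high"),
]


def node_for(probability):
    """Look up the node whose trigger probability interval contains the value."""
    for low, high, node in NODE_TABLE:
        if low <= probability < high:
            return node
    raise ValueError(f"no node covers trigger probability {probability}")


def configure_bucket(banks):
    """Place every word bank on its node, kept sorted by matching length.

    The sorted list is a stand-in for the red-black tree named in the text.
    """
    bucket = {node: [] for _, _, node in NODE_TABLE}
    for bank in banks:
        bisect.insort(bucket[node_for(bank.trigger_probability)], bank)
    return bucket
```

Keeping each node ordered by matching length means that the banks whose matching length does not exceed a given word length form a prefix of the node's contents, which is the property the later indexing step relies on.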
The embodiment of the invention provides a method for filtering and replacing text content sensitive words in an online customer service scene, wherein in step S4, performing sensitive word filtering replacement on the text based on the configured data bucket comprises:
performing word segmentation processing on the text to obtain a plurality of first target words;
indexing the first target word in a data bucket;
taking a corresponding node which is currently indexed as a target node, and taking a sensitive word bank with the matching length smaller than or equal to the text length of a first target word on the target node as a target sensitive word bank;
matching the first target word with a second target word in a target sensitive word bank;
and after all the first target words have been indexed on all the nodes, replacing each first target word in the text whose number of matches is greater than or equal to a preset count threshold with a preset replacement text, to obtain the target text.
The working principle and the beneficial effects of the technical scheme are as follows:
the preset count threshold is specifically: for example, 3; the preset replacement text is specifically: for example, a masking character such as "×"; the text is divided into a plurality of first target words, and the first target words are indexed in the data bucket (generally, nodes with a higher trigger probability are indexed first); on the node currently being indexed, each sensitive word bank whose matching length is less than or equal to the text length of the first target word is taken as a target sensitive word bank, and the first target word is matched with the second target words in the target sensitive word bank, one match being counted each time the matching succeeds; if the number of matches of a first target word is greater than or equal to the preset count threshold, the first target word is a sensitive word and is replaced with the replacement text.
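The matching and counting procedure above can be sketched as follows. This is a minimal sketch under stated assumptions: whitespace splitting stands in for real word segmentation, the bucket is modelled as a plain dictionary mapping a node name to a list of `(matching_length, word_set)` pairs, and the function name, node names and the "×" replacement are hypothetical.

```python
from collections import Counter


def filter_text(text, bucket, node_order, threshold=3, replacement="×"):
    """Replace every word whose match count reaches the threshold.

    bucket: dict mapping node name -> list of (matching_length, set_of_words).
    node_order: node names, highest trigger probability first.
    """
    words = text.split()  # stand-in for a real word segmenter
    hits = Counter()
    for word in set(words):
        for node in node_order:
            for match_length, bank in bucket.get(node, []):
                # Only banks whose matching length does not exceed the
                # word length are target sensitive word banks.
                if match_length <= len(word) and word in bank:
                    hits[word] += 1
    return " ".join(replacement if hits[w] >= threshold else w for w in words)
```

With the example threshold of 3, a word is replaced only after it has matched in three target sensitive word banks across the indexed nodes; a word found in a single bank is left unchanged.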
The embodiment of the present invention provides a method for filtering and replacing a text content sensitive word in an online customer service scene, where in step S4, before outputting a target text, the method further includes:
preprocessing a target text;
the method for preprocessing the target text comprises the following steps:
taking any first target word needing to be replaced by a replacement text in the text as a third target word;
extracting a first feature of a third target word;
establishing a trigger feature database, matching the first feature with second features in the trigger feature database, and if the first feature matches a second feature, acquiring the feature type of the matched second feature;
inquiring a preset query direction comparison table, and determining at least one query direction corresponding to the feature type;
determining a first position of the third target word in the text;
acquiring a preset first number of fourth target words located in the query direction from the first position in the text;
extracting a third feature of each fourth target word;
acquiring a preset approximate sensitive feature database, matching the third feature with fourth features in the approximate sensitive feature database, and if the matching succeeds, determining a second position of the fourth target word in the text;
acquiring a preset second number of fifth target words located before and/or after the second position in the text;
extracting a fifth feature of each fifth target word;
acquiring a preset negative feature database, matching the fifth feature with sixth features in the negative feature database, and if no fifth feature matches a sixth feature, replacing the fourth target word whose third feature matches the fourth feature with the replacement text;
and finishing preprocessing after the fourth target words needing to be replaced by the replacement text in the target text are completely replaced.
The working principle and the beneficial effects of the technical scheme are as follows:
the preset query direction comparison table is specifically: a table prepared in advance by background personnel on the basis of statistics and comprising a plurality of comparison items, each comparison item comprising a feature type and at least one query direction corresponding to it; the preset first number is specifically: for example, 12; the preset approximate sensitive feature database is specifically: a database storing a number of approximate sensitive features, for example: country names, region names and store names; the preset second number is specifically: for example, 2; the preset negative feature database is specifically: a database storing a number of negative features, for example: "not" and the like;
for example: the target text is "small A is one", wherein the first target word that needed to be replaced by the replacement text is "east-west" (already replaced); this first target word is taken as the third target word and its first feature is extracted; the first feature matches a second feature, and the feature type of the matched second feature is determined to be a certain abusive term; after the preset query direction comparison table is queried, it is determined that this abusive term is commonly used to describe the subject of a sentence, so the query direction is a forward query; the fourth target words before "east-west" are acquired, namely "small A", "is" and "one", and the third feature of each fourth target word is extracted, wherein the third feature of the fourth target word "small A" matches a fourth feature (a person name) in the approximate sensitive feature database; the fifth target words before and/or after "small A" are acquired, namely "is" and "one", and their fifth features are extracted; none of the fifth features matches a sixth feature in the negative feature database, which indicates that the user mentions "small A" in the target text with abusive intent, so "small A" should also be replaced with the replacement text before output, that is, "× is one" is output; conversely, when the intent of the user is positive, "small A" still appears in the target text that is finally output;
at present, a large number of sensitive word filtering and replacing technologies treat person names, country names and the like as sensitive words and filter and replace them indiscriminately; the embodiment of the invention instead determines whether to filter and replace them according to the actual context, which better meets the actual needs of users and improves user experience, the determination being finer-grained and more intelligent.
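The context check performed during preprocessing can be sketched as follows. The sketch makes several simplifying assumptions: a "feature" is reduced to the word itself, so feature matching becomes set membership; "forward" is taken to mean the words before the replaced term, as in the example above; and the function name, parameter names and the "×" replacement are hypothetical.

```python
def preprocess(words, replaced_index, direction, approx_sensitive,
               negative_words, first_number=12, second_number=2,
               replacement="×"):
    """Mask approximate sensitive words near an already replaced term.

    words: segmented target text, with the sensitive term already masked.
    replaced_index: position of the masked term (the third target word).
    direction: "forward" looks before that position, anything else after it.
    """
    out = list(words)
    if direction == "forward":
        candidates = range(max(0, replaced_index - first_number), replaced_index)
    else:
        end = min(len(words), replaced_index + 1 + first_number)
        candidates = range(replaced_index + 1, end)
    for i in candidates:
        if words[i] not in approx_sensitive:
            continue  # not a person name, country name, and so on
        # Fifth target words: up to second_number words on either side.
        neighbours = words[max(0, i - second_number):i] + words[i + 1:i + 1 + second_number]
        if not any(n in negative_words for n in neighbours):
            out[i] = replacement  # no negation nearby, so mask the word too
    return out
```

In this sketch a name is masked only when it lies in the query direction of an already masked term and no negative word appears among its neighbours; otherwise it is left in the output unchanged.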
The embodiment of the invention provides a method for filtering and replacing text content sensitive words in an online customer service scene, wherein establishing the trigger feature database comprises:
respectively acquiring a preset trigger word set and a preset approximate sensitive word database;
inquiring a preset associated trigger word comparison table, and determining at least one associated trigger word corresponding to each trigger word in the trigger word set;
creating a first event, the first event comprising: a sensitive sentence contains a trigger word, and at least one approximate sensitive word in the approximate sensitive word database appears within a preset text length range before and/or after the trigger word in the sensitive sentence;
creating a second event, the second event comprising: a sensitive sentence contains an associated trigger word, and at least one approximate sensitive word in the approximate sensitive word database appears within a preset text length range before and/or after the associated trigger word in the sensitive sentence;
respectively acquiring sensitive statement big data and a preset evaluation model;
respectively evaluating the occurrence conditions of the first event and the second event in the sensitive statement big data by using an evaluation model;
acquiring a plurality of first evaluation values output after the evaluation model evaluates the first event and a plurality of second evaluation values output after the evaluation model evaluates the second event;
calculating an evaluation index based on the first evaluation value and the second evaluation value, the calculation formula being as follows:
wherein $\sigma$ is the evaluation index; $\theta_{1,i}$ is the $i$-th first evaluation value; $\theta_{2,i}$ is the $i$-th second evaluation value; $\alpha$ is the total number of first evaluation values; $\beta$ is the total number of second evaluation values; $O_1$ and $O_2$ are preset weight values satisfying $O_2 > 1 > O_1 > 0$; $\gamma$ is an intermediate variable; $\mu_1$ is a first number, namely the number of first evaluation values that are less than or equal to a preset first evaluation value threshold; $\mu_2$ is a second number, namely the number of second evaluation values that are less than or equal to a preset second evaluation value threshold; $\mu_0$ is a preset number threshold; "and" denotes that both conditions hold, and "else" denotes all other cases;
Acquiring a preset blank database, if the evaluation index is greater than or equal to a preset evaluation index threshold, extracting a seventh feature of the trigger word and eighth features of all associated trigger words corresponding to the trigger word, and storing the seventh feature and the eighth features into the blank database;
and when the seventh characteristics of the trigger words needing to be stored in the blank database in the trigger word set and the eighth characteristics corresponding to all the associated trigger words are stored in the blank database, taking the blank database as a trigger characteristic database, and finishing the establishment.
The working principle and the beneficial effects of the technical scheme are as follows:
the preset trigger word set is specifically: a set containing a plurality of trigger words, for example: abusive terms and the like; the preset approximate sensitive word database is specifically: a database containing a large number of approximate sensitive words, for example: country names, region names and store names; the preset associated trigger word comparison table is specifically: a table prepared in advance by background personnel and comprising a plurality of comparison items, each comparison item comprising a trigger word and at least one associated trigger word; for example, an abusive term may be expressed differently in different dialects, and as many expressions of the abusive term as possible can be determined based on the comparison table; the preset text length range is specifically: for example, 15 words; the preset evaluation model is specifically: a model generated by using a machine learning algorithm to learn a large number of records in which the occurrence of the first event and the second event in the sensitive sentence big data was evaluated manually, wherein the higher the evaluation value output by the model, the more frequently the corresponding event has occurred historically and/or recently; the preset first evaluation value threshold is specifically: for example, 80; the preset second evaluation value threshold is specifically: for example, 75; the preset number threshold is specifically: for example, 7; the preset blank database is specifically: a database with no content; the preset evaluation index threshold is specifically: for example, 92; the sensitive sentence big data is specifically: a large number of sensitive sentences on the Internet;
a first event and a second event are established respectively; if the first event and the second event occur frequently (for example, sensitive sentences in which a person name appears as the subject before an abusive term), this indicates that users are more likely to use an approximate sensitive word (for example, a person name) together with the corresponding trigger word or associated trigger word in an actual conversation, so the features of the corresponding trigger word and associated trigger words are extracted and stored in the blank database; the evaluation index is calculated through the formula to comprehensively evaluate how often the events occur in the sensitive sentence big data, a larger evaluation index indicating that the corresponding events occur more frequently; when the events are counted, whether the approximate sensitive word appears before and/or after the trigger word can also be recorded, which makes it convenient for staff to prepare the query direction comparison table;
the embodiment of the invention reasonably selects, from the trigger word set, the trigger words and corresponding associated trigger words whose features are extracted to establish the trigger feature database, which helps to find, at a later stage, the third target words whose first features match a second feature in the trigger feature database, and improves the working efficiency of the system; meanwhile, calculating the evaluation index from the first evaluation values and the second evaluation values through the formula evaluates the two events comprehensively, which further improves the working efficiency of the system.
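Detecting the first and second events in a single sensitive sentence can be sketched as follows. The sketch covers only event detection and the recording of whether the approximate sensitive word appears before or after the trigger word; the evaluation model and the evaluation index formula are not reproduced. The `ASSOCIATED` table, the function names and the sample words are hypothetical.

```python
# Hypothetical associated trigger word comparison table.
ASSOCIATED = {"fool": {"dolt"}}


def event_sides(words, terms, approx_sensitive, window=15):
    """Return the sides on which an approximate sensitive word occurs.

    An empty set means the event does not occur in this sentence; otherwise
    the set contains "before", "after", or both.
    """
    sides = set()
    for i, word in enumerate(words):
        if word not in terms:
            continue
        if any(w in approx_sensitive for w in words[max(0, i - window):i]):
            sides.add("before")
        if any(w in approx_sensitive for w in words[i + 1:i + 1 + window]):
            sides.add("after")
    return sides


def first_and_second_event(words, trigger, approx_sensitive, window=15):
    """Check the first event (trigger word) and second event (associated words)."""
    first = event_sides(words, {trigger}, approx_sensitive, window)
    second = event_sides(words, ASSOCIATED.get(trigger, set()),
                         approx_sensitive, window)
    return first, second
```

Collecting the returned sides over many sentences gives the kind of before/after statistics from which staff could prepare the query direction comparison table.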
The embodiment of the invention provides a system for filtering and replacing text content sensitive words in an online customer service scene, as shown in fig. 2, comprising:
the creating module 1 is used for creating a data bucket;
the configuration module 2 is used for acquiring a plurality of preset sensitive word banks and configuring the data bucket based on the sensitive word banks;
the acquisition module 3 is used for acquiring a text which needs to be subjected to sensitive word filtering replacement;
and the filtering and replacing module 4 is used for filtering and replacing the sensitive words of the text based on the configured data bucket to obtain a target text and outputting the target text.
The working principle and the beneficial effects of the technical scheme are as follows:
the plurality of preset sensitive word banks are specifically: a plurality of databases, each containing a plurality of sensitive words; a data bucket is created and configured based on the sensitive word banks; a text requiring sensitive word filtering replacement is acquired (for example, a text input by a user, or an answer text acquired from the Internet for answering a question of the user); the text is filtered and replaced based on the configured data bucket to obtain a target text; and the target text is output (displayed);
in the embodiment of the invention, the data bucket is configured based on the sensitive word banks; once configured, it is used to perform sensitive word filtering replacement on the text requiring it, and the target text obtained after filtering replacement is finally output.
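The division of the system into four modules can be sketched as follows. This is a deliberately simplified sketch: the class and method names are hypothetical, whitespace splitting stands in for word segmentation, and a word is replaced on a single match rather than after reaching a count threshold.

```python
class FilterSystem:
    """Toy wiring of the four modules described above."""

    def __init__(self):
        self.bucket = None

    def create(self):
        # Creating module 1: create an empty data bucket.
        self.bucket = {}

    def configure(self, banks_by_node):
        # Configuration module 2: banks_by_node maps node -> list of word sets.
        for node, banks in banks_by_node.items():
            self.bucket.setdefault(node, []).extend(banks)

    def acquire(self, source):
        # Acquisition module 3: source is any callable returning the text.
        return source()

    def filter_and_replace(self, text, replacement="×"):
        # Filtering and replacing module 4 (single-match simplification).
        sensitive = set()
        for banks in self.bucket.values():
            for bank in banks:
                sensitive |= set(bank)
        return " ".join(replacement if w in sensitive else w
                        for w in text.split())
```

A possible call sequence is `create()`, then `configure(...)`, then `filter_and_replace(acquire(...))`, mirroring steps S1 to S4 of the method.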
The embodiment of the invention provides a system for filtering and replacing text content sensitive words in an online customer service scene, wherein a configuration module 2 executes the following operations:
acquiring characteristic information of a sensitive word bank, wherein the characteristic information comprises: matching length and triggering probability;
inquiring a preset node comparison table, and determining nodes corresponding to the trigger probability in the data bucket;
and storing the sensitive word bank corresponding to the trigger probability on the node based on a red-black tree.
The working principle and the beneficial effects of the technical scheme are as follows:
the preset node comparison table is specifically: a table prepared in advance by background personnel and comprising a plurality of comparison items, each comparison item comprising a trigger probability interval and a node of the data bucket; during a lookup, when the trigger probability falls within a trigger probability interval, the corresponding node is output; each sensitive word bank corresponds to one piece of characteristic information, which comprises a matching length (the uniform text length of the sensitive words in the corresponding sensitive word bank) and a trigger probability (the probability, determined from historical sensitive word filtering replacement data, that sensitive words in the corresponding sensitive word bank have appeared in acquired texts); the sensitive word banks are stored on the corresponding nodes based on a red-black tree (a self-balancing binary search tree), which improves indexing efficiency during later indexing.
The embodiment of the invention provides a system for filtering and replacing text content sensitive words in an online customer service scene, wherein a filtering and replacing module 4 executes the following operations:
performing word segmentation processing on the text to obtain a plurality of first target words;
indexing the first target word in a data bucket;
taking a corresponding node which is currently indexed as a target node, and taking a sensitive word bank with the matching length smaller than or equal to the text length of a first target word on the target node as a target sensitive word bank;
matching the first target word with a second target word in a target sensitive word bank;
and after all the first target words have been indexed on all the nodes, replacing each first target word in the text whose number of matches is greater than or equal to a preset count threshold with a preset replacement text, to obtain the target text.
The working principle and the beneficial effects of the technical scheme are as follows:
the preset count threshold is specifically: for example, 3; the preset replacement text is specifically: for example, a masking character such as "×"; the text is divided into a plurality of first target words, and the first target words are indexed in the data bucket (generally, nodes with a higher trigger probability are indexed first); on the node currently being indexed, each sensitive word bank whose matching length is less than or equal to the text length of the first target word is taken as a target sensitive word bank, and the first target word is matched with the second target words in the target sensitive word bank, one match being counted each time the matching succeeds; if the number of matches of a first target word is greater than or equal to the preset count threshold, the first target word is a sensitive word and is replaced with the replacement text.
The embodiment of the invention provides a system for filtering and replacing text content sensitive words in an online customer service scene, wherein a filtering and replacing module 4 further executes the following operations:
preprocessing a target text;
the filtering and replacing module 4 preprocesses the target text, and specifically executes the following operations:
taking any first target word needing to be replaced by a replacement text in the text as a third target word;
extracting a first feature of a third target word;
establishing a trigger feature database, matching the first feature with second features in the trigger feature database, and if the first feature matches a second feature, acquiring the feature type of the matched second feature;
inquiring a preset query direction comparison table, and determining at least one query direction corresponding to the feature type;
determining a first position of the third target word in the text;
acquiring a preset first number of fourth target words located in the query direction from the first position in the text;
extracting a third feature of each fourth target word;
acquiring a preset approximate sensitive feature database, matching the third feature with fourth features in the approximate sensitive feature database, and if the matching succeeds, determining a second position of the fourth target word in the text;
acquiring a preset second number of fifth target words located before and/or after the second position in the text;
extracting a fifth feature of each fifth target word;
acquiring a preset negative feature database, matching the fifth feature with sixth features in the negative feature database, and if no fifth feature matches a sixth feature, replacing the fourth target word whose third feature matches the fourth feature with the replacement text;
and finishing preprocessing after the fourth target words needing to be replaced by the replacement text in the target text are completely replaced.
The working principle and the beneficial effects of the technical scheme are as follows:
the preset query direction comparison table is specifically: a table prepared in advance by background personnel on the basis of statistics and comprising a plurality of comparison items, each comparison item comprising a feature type and at least one query direction corresponding to it; the preset first number is specifically: for example, 12; the preset approximate sensitive feature database is specifically: a database storing a number of approximate sensitive features, for example: country names, region names and store names; the preset second number is specifically: for example, 2; the preset negative feature database is specifically: a database storing a number of negative features, for example: "not" and the like;
for example: the target text is "small A is one", wherein the first target word that needed to be replaced by the replacement text is "east-west" (already replaced); this first target word is taken as the third target word and its first feature is extracted; the first feature matches a second feature, and the feature type of the matched second feature is determined to be a certain abusive term; after the preset query direction comparison table is queried, it is determined that this abusive term is commonly used to describe the subject of a sentence, so the query direction is a forward query; the fourth target words before "east-west" are acquired, namely "small A", "is" and "one", and the third feature of each fourth target word is extracted, wherein the third feature of the fourth target word "small A" matches a fourth feature (a person name) in the approximate sensitive feature database; the fifth target words before and/or after "small A" are acquired, namely "is" and "one", and their fifth features are extracted; none of the fifth features matches a sixth feature in the negative feature database, which indicates that the user mentions "small A" in the target text with abusive intent, so "small A" should also be replaced with the replacement text before output, that is, "× is one" is output; conversely, when the intent of the user is positive, "small A" still appears in the target text that is finally output;
at present, a large number of sensitive word filtering and replacing technologies treat person names, country names and the like as sensitive words and filter and replace them indiscriminately; the embodiment of the invention instead determines whether to filter and replace them according to the actual context, which better meets the actual needs of users and improves user experience, the determination being finer-grained and more intelligent.
The embodiment of the invention provides a system for filtering and replacing text content sensitive words in an online customer service scene, wherein the filtering and replacing module 4 establishes the trigger feature database by executing the following operations:
respectively acquiring a preset trigger word set and a preset approximate sensitive word database;
inquiring a preset associated trigger word comparison table, and determining at least one associated trigger word corresponding to each trigger word in the trigger word set;
creating a first event, the first event comprising: a sensitive sentence contains a trigger word, and at least one approximate sensitive word in the approximate sensitive word database appears within a preset text length range before and/or after the trigger word in the sensitive sentence;
creating a second event, the second event comprising: a sensitive sentence contains an associated trigger word, and at least one approximate sensitive word in the approximate sensitive word database appears within a preset text length range before and/or after the associated trigger word in the sensitive sentence;
respectively acquiring sensitive statement big data and a preset evaluation model;
respectively evaluating the occurrence conditions of the first event and the second event in the sensitive statement big data by using an evaluation model;
acquiring a plurality of first evaluation values output after the evaluation model evaluates the first event and a plurality of second evaluation values output after the evaluation model evaluates the second event;
calculating an evaluation index based on the first evaluation value and the second evaluation value, the calculation formula being as follows:
wherein $\sigma$ is the evaluation index; $\theta_{1,i}$ is the $i$-th first evaluation value; $\theta_{2,i}$ is the $i$-th second evaluation value; $\alpha$ is the total number of first evaluation values; $\beta$ is the total number of second evaluation values; $O_1$ and $O_2$ are preset weight values satisfying $O_2 > 1 > O_1 > 0$; $\gamma$ is an intermediate variable; $\mu_1$ is a first number, namely the number of first evaluation values that are less than or equal to a preset first evaluation value threshold; $\mu_2$ is a second number, namely the number of second evaluation values that are less than or equal to a preset second evaluation value threshold; $\mu_0$ is a preset number threshold; "and" denotes that both conditions hold, and "else" denotes all other cases;
acquiring a preset blank database, if the evaluation index is greater than or equal to a preset evaluation index threshold, extracting a seventh feature of the trigger word and eighth features of all associated trigger words corresponding to the trigger word, and storing the seventh feature and the eighth features into the blank database;
and when the seventh characteristics of the trigger words needing to be stored in the blank database in the trigger word set and the eighth characteristics corresponding to all the associated trigger words are stored in the blank database, taking the blank database as a trigger characteristic database, and finishing the establishment.
The working principle and the beneficial effects of the technical scheme are as follows:
the preset trigger word set is specifically: a set containing a plurality of trigger words, for example: abusive terms and the like; the preset approximate sensitive word database is specifically: a database containing a large number of approximate sensitive words, for example: country names, region names and store names; the preset associated trigger word comparison table is specifically: a table prepared in advance by background personnel and comprising a plurality of comparison items, each comparison item comprising a trigger word and at least one associated trigger word; for example, an abusive term may be expressed differently in different dialects, and as many expressions of the abusive term as possible can be determined based on the comparison table; the preset text length range is specifically: for example, 15 words; the preset evaluation model is specifically: a model generated by using a machine learning algorithm to learn a large number of records in which the occurrence of the first event and the second event in the sensitive sentence big data was evaluated manually, wherein the higher the evaluation value output by the model, the more frequently the corresponding event has occurred historically and/or recently; the preset first evaluation value threshold is specifically: for example, 80; the preset second evaluation value threshold is specifically: for example, 75; the preset number threshold is specifically: for example, 7; the preset blank database is specifically: a database with no content; the preset evaluation index threshold is specifically: for example, 92; the sensitive sentence big data is specifically: a large number of sensitive sentences on the Internet;
a first event and a second event are established respectively; if the first event and the second event occur frequently (for example, sensitive sentences in which a person name appears as the subject before an abusive term), this indicates that users are more likely to use an approximate sensitive word (for example, a person name) together with the corresponding trigger word or associated trigger word in an actual conversation, so the features of the corresponding trigger word and associated trigger words are extracted and stored in the blank database; the evaluation index is calculated through the formula to comprehensively evaluate how often the events occur in the sensitive sentence big data, a larger evaluation index indicating that the corresponding events occur more frequently; when the events are counted, whether the approximate sensitive word appears before and/or after the trigger word can also be recorded, which makes it convenient for staff to prepare the query direction comparison table;
the embodiment of the invention reasonably selects, from the trigger word set, the trigger words and corresponding associated trigger words whose features are extracted to establish the trigger feature database, which helps to find, at a later stage, the third target words whose first features match a second feature in the trigger feature database, and improves the working efficiency of the system; meanwhile, calculating the evaluation index from the first evaluation values and the second evaluation values through the formula evaluates the two events comprehensively, which further improves the working efficiency of the system.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.