Text search method and apparatus, readable medium, and electronic device
1. A text search method, the method comprising:
dividing a first text to be searched according to a plurality of preset division modes to obtain a plurality of groups of target division text sets;
inputting the multiple groups of target division text sets into a pre-trained text vector model to obtain a first text vector;
acquiring a target text vector from second text vectors of a pre-established text knowledge base according to the first text vector, wherein the text knowledge base comprises one or more second text vectors and second texts corresponding to the second text vectors;
and taking the second text corresponding to the target text vector as a target search result, and displaying the target search result.
2. The method according to claim 1, wherein the preset division modes comprise word division and character division, and the plurality of groups of target division text sets comprise a first target division text set and a second target division text set; and the dividing a first text to be searched according to a plurality of preset division modes to obtain a plurality of groups of target division text sets comprises:
performing the word division on the first text to obtain a first target division text set containing one or more target words;
performing the character division on the first text to obtain a second target division text set containing one or more target characters;
and the inputting the plurality of groups of target division text sets into a pre-trained text vector model to obtain a first text vector comprises:
inputting the first target division text set and the second target division text set into the pre-trained text vector model to obtain the first text vector.
3. The method of claim 2, wherein the text vector model comprises a word encoding network and a character encoding network; the text vector model is used for:
coding the first target division text set through the word coding network to obtain a word vector;
coding the second target division text set through the character coding network to obtain a character vector;
and calculating the first text vector according to the word vector and the character vector.
4. The method of claim 1, wherein the text vector model is pre-trained by:
acquiring a training sample set, wherein the training sample set comprises a plurality of training sample pairs and the similarity of each training sample pair;
determining a first loss function according to the training sample pairs and the similarity of the training sample pairs, wherein the first loss function is used for constraining the similarity of each training sample pair to meet the requirement of preset correlation similarity;
and training a preset model according to the training sample set and the first loss function to obtain the text vector model.
5. The method of claim 4, wherein before the training of the preset model according to the training sample set and the first loss function to obtain the text vector model, the method further comprises:
determining a second loss function according to a plurality of training sample pairs in the training sample set; the second loss function is used for constraining the similarity between the text in each training sample pair and the text in other training sample pairs in the training sample set to meet the requirement of preset non-relevant similarity;
the training a preset model according to the training sample set and the first loss function to obtain the text vector model comprises:
and training a preset model according to the training sample set, the first loss function and the second loss function to obtain the text vector model.
6. The method of claim 4, wherein the obtaining a set of training samples comprises:
respectively determining the similarity between each historical search result and historical search information according to historical operation behavior information of a user on a plurality of historical search results, wherein the historical search results are obtained by searching according to the historical search information input by the user;
according to the similarity, taking the historical search information and the historical search result as the training sample pair;
taking a set of a plurality of the training sample pairs as the training sample set.
7. The method of claim 1, wherein the second text comprises a text sentence; and the text knowledge base is pre-established in the following manner:
dividing the text sentence according to the plurality of preset division modes to obtain a plurality of groups of target sentence division text sets;
inputting the plurality of groups of target sentence division text sets into the text vector model to obtain a second text vector corresponding to the text sentence;
and establishing the text knowledge base according to the second text vector and the text sentence.
8. The method of claim 7, wherein the second text further comprises a document, and the text sentence is a sentence obtained by performing sentence segmentation on the document; and the establishing the text knowledge base according to the second text vector and the text sentence comprises:
and establishing the text knowledge base according to the second text vector, the text sentence and the document.
9. The method of claim 1, wherein the acquiring a target text vector from second text vectors of a pre-established text knowledge base according to the first text vector comprises:
acquiring a candidate text vector closest to the first text vector from the second text vectors of the text knowledge base;
and acquiring the target text vector from the candidate text vector according to the similarity between the candidate text vector and the first text vector.
10. A text search apparatus, the apparatus comprising:
the first text dividing module is used for dividing a first text to be searched according to a plurality of preset division modes to obtain a plurality of groups of target division text sets;
the first text vector acquisition module is used for inputting the multiple groups of target division text sets into a pre-trained text vector model to obtain a first text vector;
the target text vector acquisition module is used for acquiring a target text vector from second text vectors of a pre-established text knowledge base according to the first text vector, wherein the text knowledge base comprises one or more second text vectors and second texts corresponding to the second text vectors;
and the target text searching module is used for taking the second text corresponding to the target text vector as a target search result and displaying the target search result.
11. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processing apparatus, implements the steps of the method of any one of claims 1 to 9.
12. An electronic device, comprising:
a storage device having a computer program stored thereon;
a processing apparatus for executing the computer program in the storage device to implement the steps of the method according to any one of claims 1 to 9.
Background
With the explosive growth of internet content, how to find a required text, such as an article, lyrics, or a web page, in massive network information has become a focus of information processing technology. According to a text to be searched input by a user, a search engine can obtain search results matching that text. For text search, the related art generally relies on an inverted index; however, in some scenarios an inverted index has difficulty accurately matching the search results the user expects, so the search results may be wrong or incomplete, reducing the user experience.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a text search method, including:
dividing a first text to be searched according to a plurality of preset division modes to obtain a plurality of groups of target division text sets;
inputting the multiple groups of target division text sets into a pre-trained text vector model to obtain a first text vector;
acquiring a target text vector from second text vectors of a pre-established text knowledge base according to the first text vector, wherein the text knowledge base comprises one or more second text vectors and second texts corresponding to the second text vectors;
and taking the second text corresponding to the target text vector as a target search result, and displaying the target search result.
In a second aspect, the present disclosure provides a text search apparatus, the apparatus comprising:
the first text dividing module is used for dividing a first text to be searched according to a plurality of preset division modes to obtain a plurality of groups of target division text sets;
the first text vector acquisition module is used for inputting the multiple groups of target division text sets into a pre-trained text vector model to obtain a first text vector;
the target text vector acquisition module is used for acquiring a target text vector from second text vectors of a pre-established text knowledge base according to the first text vector, wherein the text knowledge base comprises one or more second text vectors and second texts corresponding to the second text vectors;
and the target text searching module is used for taking the second text corresponding to the target text vector as a target searching result and displaying the target searching result.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
a processing apparatus for executing the computer program in the storage device to implement the steps of the method of the first aspect of the present disclosure.
By adopting the above technical solution, a first text to be searched is divided according to a plurality of preset division modes to obtain a plurality of groups of target division text sets; the plurality of groups of target division text sets are input into a pre-trained text vector model to obtain a first text vector; a target text vector is acquired from second text vectors of a pre-established text knowledge base according to the first text vector, where the text knowledge base includes one or more second text vectors and second texts corresponding to the second text vectors; and the second text corresponding to the target text vector is taken as a target search result and displayed. In this way, the plurality of groups of target division text sets can effectively tolerate user misspellings and reflect what the user actually expects; retrieval through the pre-trained text vector model can then improve the accuracy of text search, avoid wrong or incomplete search results caused by user misspellings, and improve the user experience.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow diagram illustrating a text search method in accordance with an exemplary embodiment;
FIG. 2 is a diagram illustrating the structure of a text vector model in accordance with an exemplary embodiment;
FIG. 3 illustrates a method of training a text vector model in accordance with an exemplary embodiment;
FIG. 4 is a block diagram illustrating a text search apparatus according to an example embodiment;
FIG. 5 is a block diagram illustrating a training apparatus for a text vector model in accordance with an exemplary embodiment;
FIG. 6 is a block diagram illustrating another training apparatus for text vector models in accordance with an exemplary embodiment;
FIG. 7 is a block diagram illustrating another text search apparatus in accordance with an illustrative embodiment;
FIG. 8 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It should be noted that the modifiers "a", "an", and "the" in this disclosure are illustrative rather than limiting; those skilled in the art will understand them as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
First, an application scenario of the present disclosure will be explained. The present disclosure may be applied to text search scenarios, such as searches for articles, lyrics, web pages, and the like. In the related art, the inverted index approach used for text search exactly matches, in a database, the documents containing the text to be searched input by the user, ranks them according to the frequency of that text, and returns the matched documents as search results. However, when the text to be searched deviates from the words in the database, for example when the user inadvertently misspells the input or uses a non-standard form, the inverted index approach has difficulty accurately matching the search results the user expects, which may lead to wrong or incomplete results and reduce the user experience.
To solve the above problems, the present disclosure provides a text search method and apparatus, a readable medium, and an electronic device. A first text to be searched is divided according to a plurality of preset division modes to obtain a plurality of groups of target division text sets, which can effectively tolerate user misspellings and reflect what the user actually expects; retrieval is then carried out through a pre-trained text vector model, which can improve the accuracy of text search, avoid wrong or incomplete search results caused by user misspellings, and improve the user experience.
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings.
Fig. 1 is a flow diagram illustrating a text search method according to an exemplary embodiment. As shown in Fig. 1, the method includes:
Step 101, dividing a first text to be searched according to a plurality of preset division modes to obtain a plurality of groups of target division text sets.
For example, the first text to be searched may be a text input by a user. A preset division mode may be any of sentence division, word division, and character division; through these division modes, a text sentence set, a text word set, and a text character set corresponding to the first text can be obtained, and any of these sets can serve as a target division text set.
It should be noted that the multiple groups of target division text sets obtained in this step can effectively tolerate user misspellings and reflect what the user actually expects. For example, the text the user expects to search is "newyork in september", but due to a spelling error the first text to be searched is input as "newyork in septimber". If only word division is performed, the result is "newyork, in, septimber" or "newyork, in, sept, im, ber", so the corresponding text cannot be found by exact lookup in an inverted index. With the manner in this embodiment, however, sentence division, word division, and character division may all be performed on the first text "newyork in septimber": sentence division keeps the first text as the whole sentence "newyork in septimber", which serves as the text sentence set corresponding to the first text; word division splits the first text into "newyork, in, septimber" or "newyork, in, sept, im, ber", which serves as the text word set corresponding to the first text; and character division splits the first text into the single letters "n, e, w, y, o, r, k, i, n, s, e, p, t, i, m, b, e, r", which serve as the text character set corresponding to the first text. In this way, although the first text input by the user contains a spelling error, through the combined effect of the divided sentences, words, and characters, and especially of the characters, the multiple groups of target division text sets still have a certain similarity to the result the user expects, so the search can reflect what the user actually expects.
Step 102, inputting the multiple groups of target division text sets into a pre-trained text vector model to obtain a first text vector.
Continuing the same example, the text sentence set, the text word set, and the text character set corresponding to the first text may be input into the pre-trained text vector model, and the text vector model encodes the three sets to obtain the first text vector.
Step 103, acquiring a target text vector from second text vectors of a pre-established text knowledge base according to the first text vector.
The text knowledge base comprises one or more second text vectors and second texts corresponding to the second text vectors.
In this step, the target text vector may be acquired according to the similarity between the first text vector and the second text vectors. For example, a second text vector whose similarity to the first text vector is greater than or equal to a first preset similarity threshold may be used as the target text vector; or the second text vectors may be sorted by their similarity to the first text vector in descending order, and a first preset number of the top-ranked second text vectors may be used as target text vectors.
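As a minimal sketch of this selection step (assuming cosine similarity over numpy arrays; the function and parameter names are illustrative, not taken from the disclosure):

```python
import numpy as np

def select_target_vectors(first_vec, second_vecs, threshold=None, top_k=None):
    """Select target text vectors from the second text vectors.

    first_vec:   (d,) first text vector
    second_vecs: (n, d) matrix, one row per second text vector
    Keeps either every vector whose cosine similarity is >= threshold
    (first preset similarity threshold), or the top_k most similar ones
    (first preset number). Returns indices into second_vecs.
    """
    a = first_vec / np.linalg.norm(first_vec)
    b = second_vecs / np.linalg.norm(second_vecs, axis=1, keepdims=True)
    sims = b @ a  # cosine similarity of every second text vector to first_vec

    if threshold is not None:
        return np.where(sims >= threshold)[0]
    return np.argsort(-sims)[:top_k]  # descending similarity, keep top_k
```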
Step 104, taking the second text corresponding to the target text vector as a target search result, and displaying the target search result.
By adopting the above method, a first text to be searched is divided according to a plurality of preset division modes to obtain a plurality of groups of target division text sets; the plurality of groups of target division text sets are input into a pre-trained text vector model to obtain a first text vector; a target text vector is acquired from second text vectors of a pre-established text knowledge base according to the first text vector, where the text knowledge base includes one or more second text vectors and second texts corresponding to the second text vectors; and the second text corresponding to the target text vector is taken as a target search result and displayed. In this way, the plurality of groups of target division text sets can effectively tolerate user misspellings and reflect what the user actually expects; retrieval through the pre-trained text vector model can then improve the accuracy of text search, avoid wrong or incomplete search results caused by user misspellings, and improve the user experience.
In another embodiment of the present disclosure, considering that the first text a user inputs during a search is generally a few keywords or key phrases and rarely a long continuous passage, sentence division has little effect on the first text. The preset division modes may therefore include word division and character division, and the plurality of groups of target division text sets may include a first target division text set and a second target division text set. In this case, step 101 of dividing the first text to be searched according to a plurality of preset division modes to obtain a plurality of groups of target division text sets may include the following two steps:
step one, carrying out word division on a first text to obtain a first target division text set containing one or more target words.
The word division mode may be tokenization: a target word obtained through word division may be called a token, and a target word may be any one or more of an English word, a Chinese word, a number, and a punctuation mark.
In this step, the first text may be segmented by various word segmentation methods, for example, a dictionary-based segmentation method, a statistics-based machine learning segmentation method, an understanding-based segmentation method, and the like.
Further, the language of the first text may be determined first, and the first text may then be divided in the word division manner corresponding to that language. For example, if the language of the first text is Chinese, a dictionary-based word segmentation method, a statistics-based machine learning word segmentation method, or a combination of the two may be used to improve word segmentation accuracy. If the language of the first text is English, a simple word segmentation method based on spaces and punctuation may be adopted; of course, a dictionary-based or statistics-based machine learning segmentation method may also be adopted for English. Through word division, the first text can be divided into target words in units of English words and/or Chinese words, so that the content of the text and the meaning it is intended to express can be analyzed in a more focused manner.
For example, if the first text is "what do you mean", the following four target words can be obtained through word division: "what", "do", "you", "mean"; these four target words can be combined into the above first target division text set. If the first text is the Chinese sentence "我是中国人" ("I am a Chinese person"), the following three target words can be obtained through word division: "我", "是", "中国人"; likewise, these three target words can be combined into the first target division text set.
Step two, performing character division on the first text to obtain a second target division text set containing one or more target characters.
A target character is any one of a letter, a Chinese character, a number, and a punctuation mark.
Exemplarily, if the first text is "what do you mean?", the following fourteen target characters can be obtained through character division: "w", "h", "a", "t", "d", "o", "y", "o", "u", "m", "e", "a", "n", "?"; these fourteen target characters can be combined into the above second target division text set. If the first text is the Chinese sentence "我是1个中国人" ("I am 1 Chinese person"), the following seven target characters can be obtained through character division: "我", "是", "1", "个", "中", "国", "人"; likewise, these seven target characters can be combined into the second target division text set.
It should be noted that step one and step two may be executed serially in either order or in parallel; parallel execution can improve the efficiency of dividing the first text. A sketch of both steps follows.
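A minimal sketch of these two division steps for English input (the regex tokenizer is an assumed stand-in for the dictionary-based or statistics-based segmenters discussed above; for Chinese, a word segmentation library would take the regex's place):

```python
import re

def word_division(text):
    """Step one: split the text into target words (tokens); words,
    numbers, and punctuation marks each become one token."""
    return re.findall(r"[A-Za-z']+|\d+|[^\sA-Za-z\d]", text)

def character_division(text):
    """Step two: split the text into target characters, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]

first_text = "what do you mean?"
print(word_division(first_text))       # ['what', 'do', 'you', 'mean', '?']
print(character_division(first_text))  # ['w', 'h', 'a', ..., 'n', '?']
```

The two functions are independent of each other, so they can be called serially in either order or dispatched in parallel, as noted above.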
In this way, the first target division text set and the second target division text set are input into a pre-trained text vector model to obtain the first text vector. Since the first text is divided both in units of words and in units of characters, a more accurate first text vector can be obtained.
Further, the text vector model may include a word encoding network and a character encoding network; the text vector model may be used to:
coding the first target division text set through the word coding network to obtain a word vector; coding the second target division text set through the character coding network to obtain a character vector; and calculating the first text vector according to the word vector and the character vector.
The word coding network and the character coding network may be machine learning models, for example one or more of a pre-trained Convolutional Neural Network (CNN) model and a Long Short-Term Memory (LSTM) model.
Fig. 2 is a schematic structural diagram illustrating a text vector model according to an exemplary embodiment. As shown in Fig. 2, the text vector model includes a word encoding network 201 and a character encoding network 202. Taking obtaining a first text vector from the first text "I'm on the run it's a state of mind" as an example, the process is as follows:
First, word division is performed on the first text "I'm on the run it's a state of mind" to obtain a first target division text set containing nine target words (I'm, on, the, run, it's, a, state, of, mind); in parallel, character division is performed on the same first text to obtain a second target division text set containing 27 target characters (I, ', m, o, n, t, h, e, r, u, n, i, t, ', s, a, s, t, a, t, e, o, f, m, i, n, d).
Then, the first target division text set is input into the word encoding network 201 and encoded to obtain a 64-dimensional word vector; and the second target division text set is input into the character encoding network 202 and encoded to obtain a 64-dimensional character vector.
Finally, the word vector and the character vector are each normalized and then concatenated to obtain a 128-dimensional first text vector.
It should be noted that the word vector and the character vector may instead be 128-dimensional or 256-dimensional vectors, in which case the first text vector is correspondingly a 256-dimensional or 512-dimensional vector; this disclosure does not limit the dimensions.
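To make the structure of Fig. 2 concrete, the following is a hedged sketch in PyTorch; the choice of LSTMs, the embedding size, and the use of the final hidden state as the encoding are assumptions, since the disclosure fixes only the word/character split, the normalization, and the vector dimensions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextVectorModel(nn.Module):
    """Word encoding network + character encoding network, as in Fig. 2."""

    def __init__(self, word_vocab_size, char_vocab_size, emb_dim=32, enc_dim=64):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab_size, emb_dim)
        self.char_emb = nn.Embedding(char_vocab_size, emb_dim)
        # Word encoding network 201 / character encoding network 202;
        # LSTMs here, though the disclosure also allows CNNs.
        self.word_enc = nn.LSTM(emb_dim, enc_dim, batch_first=True)
        self.char_enc = nn.LSTM(emb_dim, enc_dim, batch_first=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, n_words); char_ids: (batch, n_chars),
        # both ordered as the tokens appear in the first text.
        _, (word_h, _) = self.word_enc(self.word_emb(word_ids))
        _, (char_h, _) = self.char_enc(self.char_emb(char_ids))
        word_vec = F.normalize(word_h[-1], dim=-1)  # 64-dim word vector
        char_vec = F.normalize(char_h[-1], dim=-1)  # 64-dim character vector
        # Normalize each, then concatenate into a 128-dim first text vector.
        return torch.cat([word_vec, char_vec], dim=-1)
```

Because LSTM encoders are order-sensitive, this sketch also realizes the order-capturing behavior described below.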
In this way, the first text is divided in units of words and in units of characters respectively and input into a text vector model comprising a word encoding network and a character encoding network, so that a more accurate first text vector can be obtained; in particular, when a word input by the user is misspelled, fusing words and characters yields a first text vector that better matches the user's expectation. Comparing the first text vector with the second text vectors then yields the target text vector and its corresponding second text, which improves the accuracy of text search and thus the user experience.
Further, the word encoding network may capture the order of words; that is, the same words arranged in different orders yield different word vectors when input into the word encoding network. The target words in the first target division text set may therefore be sorted in the order in which they appear in the first text, so that a more accurate word vector is obtained after they are input into the word encoding network. Similarly, the character encoding network may capture the order of characters; the target characters in the second target division text set may be sorted in the order in which they appear in the first text, so that a more accurate character vector is obtained after they are input into the character encoding network.
In this way, the text vector model can generate more accurate first text vectors according to the sequence of words or characters, thereby further improving the accuracy of text search.
Fig. 3 is a flow diagram illustrating a method for training a text vector model according to an exemplary embodiment. As shown in Fig. 3, the text vector model is pre-trained in the following manner:
step 301, obtaining a training sample set.
The training sample set comprises a plurality of training sample pairs and the similarity of each training sample pair.
In this step, the training sample set may be obtained through the following steps:
First, the similarity between each historical search result and the historical search information can be determined according to the historical operation behavior information of the user on a plurality of historical search results.
The historical search results are obtained by searching according to historical search information input by the user. The historical search information may include texts to be searched that the user input during a historical period (e.g., the past week or month); according to such a text to be searched, a plurality of historical search results can be found. For example, if the text to be searched input by the user is A, n historical search results A1 to An can be found.
The historical operation behavior information can be used to represent whether the user performed an operation on a historical search result.
Illustratively, the historical operation behavior information may include either user click behavior information or user browsing behavior information. For example, if the user clicks on a historical search result or browses the historical search result, the user is considered to have performed an operation on that historical search result.
Further, the historical operation behavior information may also include both user click behavior information and user browsing behavior information. For example, if the user clicks on a historical search result and browses the corresponding page for a first preset time period, the user is considered to have performed an operation on that historical search result. The first preset time period may be any preset duration between 5 seconds and 2 minutes, such as 20 seconds or 30 seconds.
In addition, a user tends to operate on a historical search result when it meets the user's search intention, so the historical operation behavior information the user produces on a historical search result can reflect, to a certain extent, the similarity between that historical search result and the historical search information: the more a historical search result is operated on, the better it matches the user's search intention, that is, the higher its correlation with the historical search information. In this embodiment, the similarity between each historical search result and the historical search information can therefore be determined from the user's historical operation behavior information, without requiring manual similarity labeling.
For example, the similarity between the historical search results and the historical search information may be determined according to the following formula:
S=C/H;
where S represents the similarity between the historical search result and the historical search information, C represents the number of times the user performed a historical operation behavior on the historical search result, and H represents the number of times the search engine, when searching according to the historical search information, displayed the historical search result to the user. The similarity obtained by this formula is a value between 0 and 1.
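A small sketch of this labeling step (the log format and all names are assumptions for illustration):

```python
from collections import Counter

def label_similarities(impressions, operations):
    """Compute S = C / H for each (historical search info, result) pair.

    impressions: list of (query, result) pairs, one entry per time the
                 search engine displayed `result` for `query` (counts H)
    operations:  list of (query, result) pairs, one entry per click or
                 browse operation the user performed (counts C)
    """
    h = Counter(impressions)
    c = Counter(operations)
    return {pair: c[pair] / h[pair] for pair in h}

sims = label_similarities(
    impressions=[("newyork in september", "doc_1")] * 10,
    operations=[("newyork in september", "doc_1")] * 7,
)
print(sims)  # {('newyork in september', 'doc_1'): 0.7}
```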
Then, according to the similarity, the historical search information and the historical search result are used as the training sample pair.
For example, historical search information and historical search results whose similarity is greater than or equal to a second preset similarity threshold may be selected as training sample pairs. The second preset similarity threshold may be any value between 0.3 and 1, for example 0.4 or 0.7.
Historical search information and historical search results whose similarity is smaller than the second preset similarity threshold are not used as training sample pairs.
Further, since users search frequently, the similarity between a historical search result and the historical search information obtained at different times varies. A similarity range may therefore be set according to the similarities obtained at multiple different times; the range represents the minimum and maximum of the similarity between the historical search result and the historical search information, and this similarity range can then be used as the similarity of the training sample pair.
For example, the following 5 similarity ranges may be set: (0.84, 1) indicating that the minimum value of the similarity between the historical search result and the historical search information is 0.84 and the maximum value is 1; (0.68, 0.9) indicating that the minimum value of the similarity between the historical search result and the historical search information is 0.68 and the maximum value is 0.9; (0.54, 0.8) indicating that the minimum value of the similarity between the historical search result and the historical search information is 0.54 and the maximum value is 0.8; (0.48, 0.7) indicating that the minimum value of the similarity between the historical search result and the historical search information is 0.48 and the maximum value is 0.7; (0.4, 0.6) indicating that the minimum value of the similarity between the historical search result and the historical search information is 0.4 and the maximum value is 0.6.
Finally, a set of a plurality of such training sample pairs is used as the training sample set.
In this way, sufficient training sample pairs can be obtained from the historical search information, the historical search results, and the user's historical operation behavior information for training the text vector model.
Step 302, determining a first loss function and/or a second loss function according to a training sample pair in a training sample set.
The first loss function is used to constrain the similarity of each training sample pair to meet a preset correlation similarity requirement, where the preset correlation similarity requirement may be that the similarity equals the similarity labeled for the training sample pair, or that it falls within the similarity range corresponding to the training sample pair, that is, it is greater than or equal to the minimum value and less than or equal to the maximum value of that range. In the case where the preset correlation similarity requirement is the similarity range corresponding to the training sample pair, the first loss function can be determined according to the training sample pairs and their similarities. Illustratively, the expression of the first loss function may include a penalty of the following form:

L_1 = (1/N) Σ_{i=1}^{N} [ max(0, LB_i − sim(Text_i, Text_i')) + max(0, sim(Text_i, Text_i') − UB_i) ]

where L_1 represents the value of the first loss function, N represents the total number of training sample pairs, (Text_i, Text_i') represents the i-th training sample pair, sim(·, ·) represents the similarity between the vectors the model produces for the two texts, LB_i represents the minimum of the similarity range for that training sample pair, and UB_i represents the maximum of the similarity range for that training sample pair.
The second loss function is used to constrain the similarity between the text in each training sample pair and the texts in other training sample pairs in the training sample set to meet a preset non-correlation similarity requirement, where the preset non-correlation similarity requirement may be that the similarity approaches 0, or that it is less than or equal to a third preset similarity threshold. In this way, the second loss function may be determined according to a plurality of training sample pairs in the training sample set. Illustratively, the expression of the second loss function may include a penalty of the following form:

L_2 = (1/(N(N−1))) Σ_{i≠j} sim(Text_i, Text_j')²

where L_2 represents the value of the second loss function, N represents the total number of training sample pairs, (Text_i, Text_i') represents the i-th training sample pair, and (Text_j, Text_j') represents the j-th training sample pair, with i not equal to j.
Step 303, training a preset model according to the training sample set, the first loss function and/or the second loss function to obtain the text vector model.
Likewise, the preset model may be one or more of a Convolutional Neural Network (CNN) model and a Long Short-Term Memory (LSTM) model.
It should be noted that the first loss function and the second loss function are optional; either one or both may be used in training.
In this way, during training, the first loss function treats the two texts of each training sample pair as a positive example, and the second loss function treats texts from different training sample pairs in the training sample set as negative examples, so that vectors generated for similar texts have high similarity and vectors generated for dissimilar texts have low similarity, further improving the accuracy of text search.
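Putting the two constraints together, a hedged sketch of the training losses in PyTorch; the hinge-to-range form of the first loss and the squared cross-pair penalty of the second follow the illustrative expressions above and are assumptions, not the disclosure's exact formulas:

```python
import torch

def first_loss(pair_sims, lb, ub):
    """Penalize each training pair whose similarity falls outside its
    [LB, UB] range; zero when the similarity already lies in the range."""
    return (torch.relu(lb - pair_sims) + torch.relu(pair_sims - ub)).mean()

def second_loss(vecs_a, vecs_b):
    """Push similarities between texts from *different* pairs toward 0.

    vecs_a, vecs_b: (N, d) normalized vectors for the two sides of the
    N training pairs; off-diagonal entries of vecs_a @ vecs_b.T are the
    cross-pair similarities."""
    sims = vecs_a @ vecs_b.T
    mask = ~torch.eye(sims.size(0), dtype=torch.bool)  # off-diagonal only
    return (sims[mask] ** 2).mean()

# During training, with vecs_a / vecs_b produced by the preset model:
#   pair_sims = torch.diagonal(vecs_a @ vecs_b.T)
#   loss = first_loss(pair_sims, lb, ub) + second_loss(vecs_a, vecs_b)
```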
In another embodiment of the present disclosure, the second text includes a text sentence and/or a document; the text knowledge base is pre-established in the following way:
First, the text sentences in the second text are divided according to the plurality of preset division modes to obtain a plurality of groups of target sentence division text sets.
Likewise, the preset division modes may include word division and character division, and the plurality of groups of target sentence division text sets may include a first target division text set and a second target division text set. This step may include: performing word division on the text sentence to obtain a first target division text set containing one or more target words; and performing character division on the text sentence to obtain a second target division text set containing one or more target characters, where a target character is any one of a letter, a Chinese character, a number, and a punctuation mark.
Then, the plurality of groups of target sentence division text sets are input into the text vector model to obtain the second text vector corresponding to each text sentence.
Likewise, the text vector model may also include a word encoding network and a character encoding network.
And finally, establishing the text knowledge base according to the second text vector and the text sentence.
For example, a correspondence between the second text vector and the text sentence (i.e., the second text) may be established, so that the corresponding second text can be conveniently obtained from the text knowledge base according to the second text vector.
In this way, a text knowledge base containing the second texts and the second text vectors can be constructed, so that the corresponding text can be found in the knowledge base according to the text to be searched input by the user and displayed to the user.
Further, to facilitate retrieval, a vector index can be added to the second text vectors through the Hierarchical Navigable Small World (HNSW) algorithm, thereby improving the efficiency of text search.
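For illustration, such an index could be built with hnswlib, one open-source implementation of HNSW (the library choice, parameters, and placeholder data are assumptions; the disclosure names only the algorithm):

```python
import hnswlib
import numpy as np

dim = 128  # dimension of the second text vectors
second_vecs = np.random.rand(10000, dim).astype(np.float32)  # placeholder

# Build an HNSW vector index over the second text vectors.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(second_vecs), ef_construction=200, M=16)
index.add_items(second_vecs, np.arange(len(second_vecs)))
index.set_ef(50)  # query-time accuracy/speed trade-off
```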
Further, in a case that the second text includes a text sentence and a document, the text sentence may be a sentence obtained by sentence splitting the document.
The document may be an article, lyrics, a web page, or the like. There are various sentence segmentation modes; for example, segmentation may be performed according to punctuation marks and preset special texts. The preset special texts may include a preset starting text and a preset ending text: the preset starting text is a text whose first probability of serving as the starting word of a sentence, obtained by statistics over sample sentences, is greater than a first preset probability threshold; the preset ending text is a text whose second probability of serving as the ending word of a sentence, obtained by statistics over sample sentences, is greater than a second preset probability threshold.
It should be noted that the ratio of the number of times a word serves as the starting word of a sample sentence to the number of times the word appears in the sample sentences may be used as the first probability of the word serving as the starting word of a sentence; similarly, the ratio of the number of times a word serves as the ending word of a sample sentence to the number of times the word appears in the sample sentences may be used as the second probability of the word serving as the ending word of a sentence.
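A sketch of gathering these statistics from sample sentences (whitespace word splitting and the threshold values are assumptions):

```python
from collections import Counter

def special_texts(sample_sentences, p1=0.5, p2=0.5):
    """Collect preset starting/ending texts from sample sentences.

    A word is a preset starting text if (times it starts a sentence) /
    (times it appears) > p1; ending texts are defined symmetrically
    with threshold p2."""
    starts, ends, total = Counter(), Counter(), Counter()
    for sentence in sample_sentences:
        words = sentence.split()
        if not words:
            continue
        starts[words[0]] += 1
        ends[words[-1]] += 1
        total.update(words)
    start_texts = {w for w in total if starts[w] / total[w] > p1}
    end_texts = {w for w in total if ends[w] / total[w] > p2}
    return start_texts, end_texts
```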
In this way, the document is segmented to obtain one or more text sentences, and each text sentence is divided according to the plurality of preset division modes to obtain a plurality of groups of target sentence division text sets; the plurality of groups of target sentence division text sets are input into the text vector model to obtain the second text vector corresponding to each text sentence; and finally, the text knowledge base is established according to the second text vectors, the text sentences, and the document. Illustratively, a correspondence between a second text vector and its text sentence and document may be established. In this way, the document or the text sentence can be conveniently obtained through the second text vector, improving the efficiency of text search.
In addition, since the first text a user inputs during a search is generally a word or a sentence and rarely a long passage from a document, segmenting the document into multiple text sentences better suits the first text to be retrieved. During a text search, only the similarity between the first text vector corresponding to the first text and the second text vectors corresponding to the text sentences needs to be computed; the first text does not need to be compared against the whole document, which can further improve the efficiency of text search.
Further, the above manner of acquiring the target text vector from the second text vectors of the pre-established text knowledge base according to the first text vector may include the following steps:
First, candidate text vectors closest to the first text vector are acquired from the second text vectors of the text knowledge base.
Illustratively, neighbor search can be implemented according to the Hierarchical Navigable Small World (HNSW) algorithm to quickly obtain the candidate text vectors closest to the first text vector.
Second, the target text vector is acquired from the candidate text vectors according to the similarity between the candidate text vectors and the first text vector.
In this step, candidate text vectors whose similarity to the first text vector is greater than or equal to a fourth preset similarity threshold may be used as target text vectors; or the candidate text vectors may be sorted by their similarity to the first text vector in descending order, and a second preset number of the top-ranked candidate text vectors may be used as target text vectors.
In this way, even when there are a large number of second text vectors in the text knowledge base, neighbor search based on the HNSW algorithm can quickly obtain the candidate text vectors closest to the first text vector, and the target text vector is then obtained according to similarity. The efficiency of text search can thus be further improved.
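Continuing the hnswlib sketch above, the two retrieval steps might look as follows (k, the threshold, and the query vector are placeholders):

```python
# Step one: neighbor search for the candidate text vectors.
first_vec = np.random.rand(dim).astype(np.float32)  # first text vector
labels, distances = index.knn_query(first_vec, k=20)

# Step two: filter candidates by similarity.
sims = 1.0 - distances[0]  # hnswlib's cosine "distance" is 1 - similarity
threshold = 0.7            # fourth preset similarity threshold (assumed)
target_ids = [int(l) for l, s in zip(labels[0], sims) if s >= threshold]
```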
Fig. 4 is a block diagram illustrating a text search apparatus according to an example embodiment. As shown in fig. 4, the text search apparatus includes:
the first text dividing module 401 is used for dividing a first text to be searched according to a plurality of preset division modes to obtain a plurality of groups of target division text sets;
a first text vector obtaining module 402, configured to input the multiple groups of target division text sets into a pre-trained text vector model to obtain a first text vector;
a target text vector obtaining module 403, configured to obtain a target text vector from second text vectors in a pre-established text knowledge base according to the first text vector, where the text knowledge base includes one or more second text vectors and second texts corresponding to the second text vectors;
and the target text searching module 404 is configured to take the second text corresponding to the target text vector as a target search result, and display the target search result.
Optionally, the preset division mode includes word division and character division, and the multiple groups of target division text sets include a first target division text set and a second target division text set; the first text partitioning module 401 is configured to:
performing the word division on the first text to obtain a first target division text set containing one or more target words;
performing the character division on the first text to obtain a second target division text set containing one or more target characters;
where the inputting the plurality of groups of target division text sets into a pre-trained text vector model to obtain a first text vector includes:
inputting the first target division text set and the second target division text set into the pre-trained text vector model to obtain the first text vector.
Optionally, the text vector model comprises a word encoding network and a character encoding network; the first text vector obtaining module 402 is configured to obtain a first text vector by:
coding the first target division text set through the word coding network to obtain a word vector;
coding the second target division text set through the character coding network to obtain a character vector;
and calculating the first text vector according to the word vector and the character vector.
Fig. 5 is a block diagram illustrating a training apparatus for a text vector model according to an exemplary embodiment. As shown in Fig. 5, the training apparatus includes:
a training sample obtaining module 501, configured to obtain a training sample set, where the training sample set includes a plurality of training sample pairs and a similarity of each training sample pair;
a first loss function determining module 502, configured to determine a first loss function according to the training sample pair and the similarity of the training sample pair, where the first loss function is used to constrain the similarity of each training sample pair to meet a preset correlation similarity requirement;
the model training module 503 is configured to train a preset model according to the training sample set and the first loss function to obtain the text vector model.
Fig. 6 is a block diagram illustrating another training apparatus for a text vector model according to an exemplary embodiment. As shown in Fig. 6, the training apparatus further includes:
a second loss function determining module 601, configured to determine a second loss function according to a plurality of training sample pairs in the training sample set; the second loss function is used for constraining the similarity between the text in each training sample pair and the text in other training sample pairs in the training sample set to meet the requirement of preset non-relevant similarity;
the model training module 503 is configured to train a preset model according to the training sample set, the first loss function, and the second loss function to obtain the text vector model.
Optionally, the training sample obtaining module 501 is configured to:
respectively determining the similarity between each historical search result and historical search information according to historical operation behavior information of a user on a plurality of historical search results, wherein the historical search results are obtained by searching according to the historical search information input by the user;
according to the similarity, taking the historical search information and the historical search result as the training sample pair;
and taking a plurality of sets of the training sample pairs as the training sample set.
Fig. 7 is a block diagram illustrating another text search apparatus according to an example embodiment. As shown in fig. 7, the text search apparatus further includes a text knowledge base establishing module 701, the second text includes a text sentence, and the text knowledge base establishing module 701 is configured to establish the text knowledge base in advance by:
dividing the text sentence according to the multiple preset division modes to obtain multiple groups of target sentence division text sets;
inputting the plurality of groups of target sentence division text sets into the text vector model to obtain second text vectors corresponding to the text sentences;
and establishing the text knowledge base according to the second text vector and the text sentence.
Optionally, the second text further includes a document, and the text sentence is a sentence obtained by performing sentence segmentation on the document; the text knowledge base establishing module 701 is further configured to establish the text knowledge base according to the second text vector, the text sentence, and the document.
Optionally, the target text vector obtaining module 403 is configured to acquire a candidate text vector closest to the first text vector from the second text vectors of the text knowledge base, and acquire the target text vector from the candidate text vector according to the similarity between the candidate text vector and the first text vector.
In summary, a first text to be searched is divided according to a plurality of preset division modes to obtain a plurality of groups of target division text sets, which can effectively tolerate user misspellings and reflect what the user actually expects; the plurality of groups of target division text sets are then input into a pre-trained text vector model to obtain a first text vector; and according to the first text vector, a target text vector and its corresponding second text are acquired from a pre-established text knowledge base by vector search, yielding the target search result. This improves the accuracy of text search, avoids wrong or incomplete search results caused by user misspellings, and improves the user experience.
Referring now to FIG. 8, shown is a schematic diagram of an electronic device 800 suitable for use in implementing embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in Fig. 8, an electronic device 800 may include a processing apparatus (e.g., a central processing unit, a graphics processor, etc.) 801 that may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage apparatus 808 into a random access memory (RAM) 803. The RAM 803 also stores various programs and data necessary for the operation of the electronic device 800. The processing apparatus 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Generally, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 807 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage 808 including, for example, magnetic tape, hard disk, etc.; and a communication device 809. The communication means 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While fig. 8 illustrates an electronic device 800 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 809, or installed from the storage means 808, or installed from the ROM 802. The computer program, when executed by the processing apparatus 801, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: dividing a first text to be searched according to a plurality of preset dividing modes to obtain a plurality of groups of target divided text sets; inputting the multiple groups of target division text sets into a pre-trained text vector model to obtain a first text vector; acquiring a target text vector from second text vectors of a pre-established text knowledge base according to the first text vector, wherein the text knowledge base comprises one or more second text vectors and second texts corresponding to the second text vectors; and taking the second text corresponding to the target text vector as a target search result, and displaying the target search result.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a module does not limit the module itself; for example, the first text division module may also be described as a "module for dividing the first text to be searched according to a plurality of preset division modes to obtain a plurality of groups of target division text sets".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides a text search method according to one or more embodiments of the present disclosure, including: dividing a first text to be searched according to a plurality of preset dividing modes to obtain a plurality of groups of target divided text sets; inputting the multiple groups of target division text sets into a pre-trained text vector model to obtain a first text vector; acquiring a target text vector from second text vectors of a pre-established text knowledge base according to the first text vector, wherein the text knowledge base comprises one or more second text vectors and second texts corresponding to the second text vectors; and taking the second text corresponding to the target text vector as a target search result, and displaying the target search result.
Example 2 provides the method of example 1 according to one or more embodiments of the present disclosure, wherein the preset division mode includes word division and character division, and the plurality of groups of target division text sets include a first target division text set and a second target division text set; the dividing the first text to be searched according to a plurality of preset division modes to obtain a plurality of groups of target division text sets includes: performing the word division on the first text to obtain the first target division text set containing one or more target words; and performing the character division on the first text to obtain the second target division text set containing one or more target characters; and the inputting the plurality of groups of target division text sets into a pre-trained text vector model to obtain a first text vector includes: inputting the first target division text set and the second target division text set into the pre-trained text vector model to obtain the first text vector.
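A minimal sketch of the two division modes of example 2, assuming whitespace-delimited words (a real system would use a proper word segmenter):

```python
def word_division(text: str) -> list:
    # first target division text set: one or more target words
    return text.split()

def character_division(text: str) -> list:
    # second target division text set: one or more target characters
    return [c for c in text if not c.isspace()]

# a misspelled query still shares most characters with the intended text,
# which is why the character-level set helps tolerate misspellings
print(word_division("txet search"))        # ['txet', 'search']
print(character_division("txet search"))   # ['t', 'x', 'e', 't', 's', ...]
```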
Example 3 provides the method of example 2 according to one or more embodiments of the present disclosure, wherein the text vector model comprises a word encoding network and a character encoding network; the text vector model is used for: encoding the first target division text set through the word encoding network to obtain a word vector; encoding the second target division text set through the character encoding network to obtain a character vector; and calculating the first text vector according to the word vector and the character vector.
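One plausible realization of example 3's two encoding networks, sketched under assumptions: embedding-plus-mean-pooling encoders and an averaging combination, none of which are fixed by the disclosure:

```python
import torch
import torch.nn as nn

class TextVectorModel(nn.Module):
    def __init__(self, word_vocab_size: int, char_vocab_size: int, dim: int = 128):
        super().__init__()
        self.word_encoder = nn.Embedding(word_vocab_size, dim)  # word encoding network (stand-in)
        self.char_encoder = nn.Embedding(char_vocab_size, dim)  # character encoding network (stand-in)

    def forward(self, word_ids: torch.Tensor, char_ids: torch.Tensor) -> torch.Tensor:
        word_vec = self.word_encoder(word_ids).mean(dim=1)      # word vector
        char_vec = self.char_encoder(char_ids).mean(dim=1)      # character vector
        # the disclosure only states the first text vector is calculated
        # "according to" both vectors; a simple average is assumed here
        return (word_vec + char_vec) / 2
```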
Example 4 provides the method of example 1 according to one or more embodiments of the present disclosure, wherein the text vector model is pre-trained by: acquiring a training sample set, wherein the training sample set comprises a plurality of training sample pairs and the similarity of each training sample pair; determining a first loss function according to the training sample pairs and the similarity of the training sample pairs, wherein the first loss function is used for constraining the similarity of each training sample pair to meet the requirement of preset correlation similarity; and training a preset model according to the training sample set and the first loss function to obtain the text vector model.
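One way the first loss function could be realized is as a regression of the predicted pair similarity toward the labeled similarity; the mean-squared-error form below is an assumption, as example 4 does not fix a concrete formula:

```python
import torch.nn.functional as F
from torch import Tensor

def first_loss(vec_a: Tensor, vec_b: Tensor, labeled_sim: Tensor) -> Tensor:
    # constrain each training sample pair's similarity toward its label
    predicted_sim = F.cosine_similarity(vec_a, vec_b, dim=-1)
    return F.mse_loss(predicted_sim, labeled_sim)
```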
Example 5 provides the method of example 4 according to one or more embodiments of the present disclosure, wherein before the training of the preset model according to the training sample set and the first loss function to obtain the text vector model, the method further includes: determining a second loss function according to a plurality of training sample pairs in the training sample set, wherein the second loss function is used for constraining the similarity between the text in each training sample pair and the text in other training sample pairs in the training sample set to meet the requirement of preset non-relevant similarity; and the training a preset model according to the training sample set and the first loss function to obtain the text vector model includes: training the preset model according to the training sample set, the first loss function and the second loss function to obtain the text vector model.
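For the second loss, one common choice, assumed here purely for illustration, is to treat texts from different training pairs in a batch as non-relevant and penalize any cross-pair similarity above a margin; the combined objective of example 5 would then be, e.g., loss = first_loss(...) + second_loss(...):

```python
import torch
import torch.nn.functional as F

def second_loss(vec_a: torch.Tensor, vec_b: torch.Tensor, margin: float = 0.3) -> torch.Tensor:
    a = F.normalize(vec_a, dim=-1)
    b = F.normalize(vec_b, dim=-1)
    sim = a @ b.t()                                   # (batch, batch) pairwise similarities
    mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # keep only cross-pair entries and push them below the margin
    return F.relu(sim[mask] - margin).mean()
```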
Example 6 provides the method of example 4 according to one or more embodiments of the present disclosure, wherein the acquiring a training sample set includes: respectively determining the similarity between each historical search result and historical search information according to historical operation behavior information of a user on a plurality of historical search results, wherein the historical search results are obtained by searching according to the historical search information input by the user; according to the similarity, taking the historical search information and each historical search result as a training sample pair; and taking a set of a plurality of the training sample pairs as the training sample set.
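A hedged sketch of deriving pair similarities from historical operation behavior, as in example 6; the click-and-dwell weighting below is invented purely for illustration and is not taken from the disclosure:

```python
def behavior_similarity(clicked: bool, dwell_seconds: float) -> float:
    # hypothetical labeling rule: a click signals relevance, and longer
    # reading time raises the similarity label toward 1.0
    if not clicked:
        return 0.0
    return min(1.0, 0.5 + dwell_seconds / 120.0)

def build_training_set(history: list) -> list:
    # each record: {"query": ..., "result": ..., "clicked": ..., "dwell": ...}
    return [
        (h["query"], h["result"], behavior_similarity(h["clicked"], h["dwell"]))
        for h in history
    ]
```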
Example 7 provides the method of example 1 according to one or more embodiments of the present disclosure, wherein the second text comprises a text sentence; the text knowledge base is pre-established in the following way: dividing the text sentence according to the plurality of preset division modes to obtain a plurality of groups of target sentence division text sets; inputting the plurality of groups of target sentence division text sets into the text vector model to obtain a second text vector corresponding to the text sentence; and establishing the text knowledge base according to the second text vector and the text sentence.
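Pre-building the knowledge base of example 7 might look like the following sketch, where encode_sentence is a placeholder for applying the same preset division modes and the trained text vector model to each sentence:

```python
import numpy as np

def build_knowledge_base(sentences: list, encode_sentence):
    # second text vectors, one per text sentence
    vectors = np.stack([encode_sentence(s) for s in sentences])
    # normalizing up front makes the later vector search a plain dot product
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors, list(sentences)   # vector-to-sentence mapping
```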
Example 8 provides the method of example 7 according to one or more embodiments of the present disclosure, wherein the second text further includes a document, and the text sentence is a sentence obtained by splitting the document into sentences; the establishing the text knowledge base according to the second text vector and the text sentence includes: establishing the text knowledge base according to the second text vector, the text sentence and the document.
Example 9 provides the method of example 8 according to one or more embodiments of the present disclosure, wherein the acquiring a target text vector from the second text vectors of the pre-established text knowledge base according to the first text vector includes: acquiring candidate text vectors closest to the first text vector from the second text vectors of the text knowledge base; and acquiring the target text vector from the candidate text vectors according to the similarity between the candidate text vectors and the first text vector.
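The two-stage retrieval of example 9 can be sketched as a candidate recall followed by a similarity cut-off; the candidate count and threshold below are assumptions, and a production system would likely use an approximate nearest-neighbor index for the first stage:

```python
import numpy as np

def retrieve(query_vec: np.ndarray, kb_vectors: np.ndarray,
             n_candidates: int = 50, min_sim: float = 0.6) -> list:
    q = query_vec / np.linalg.norm(query_vec)
    sims = kb_vectors @ q                            # rows assumed normalized
    candidates = np.argsort(-sims)[:n_candidates]    # candidate text vectors closest to q
    # keep candidates whose similarity clears the threshold: the target text vectors
    return [int(i) for i in candidates if sims[i] >= min_sim]
```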
Example 10 provides, in accordance with one or more embodiments of the present disclosure, a text search apparatus, comprising: a first text division module 401, configured to divide a first text to be searched according to a plurality of preset division modes to obtain a plurality of groups of target division text sets; a first text vector obtaining module 402, configured to input the plurality of groups of target division text sets into a pre-trained text vector model to obtain a first text vector; a target text vector obtaining module 403, configured to acquire a target text vector from second text vectors of a pre-established text knowledge base according to the first text vector, wherein the text knowledge base comprises one or more second text vectors and second texts corresponding to the second text vectors; and a target text searching module 404, configured to take the second text corresponding to the target text vector as a target search result and display the target search result.
Example 11 provides a computer-readable medium having stored thereon a computer program that, when executed by a processing apparatus, implements the steps of the methods of examples 1-9, in accordance with one or more embodiments of the present disclosure.
Example 12 provides, in accordance with one or more embodiments of the present disclosure, an electronic device, comprising: a storage apparatus having a computer program stored thereon; and a processing apparatus for executing the computer program in the storage apparatus to implement the steps of the methods of examples 1 to 9.
The foregoing description is merely an illustration of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to the particular combinations of the features described above, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, a technical solution formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.