Sample screening method and device


1. A method of screening a sample, comprising:

obtaining a sample question and a sample corpus containing answers corresponding to the sample question;

extracting text segments from the corpus text of the sample corpus as a text segment set of the corpus text, wherein the size of the text segments is determined according to the answer corresponding to the sample question;

and screening the text segments in the text segment set, taking the text segment containing the complete answer as a positive sample text of the sample question, and taking the text segment not containing the answer as a negative sample text of the sample question.

2. The sample screening method according to claim 1, wherein the extracting text segments from the corpus text of the sample corpus as the text segment set of the corpus text comprises:

determining an initial sliding position of a corpus text of the sample corpus;

and sliding the feature extraction window in the corpus text from the initial sliding position according to a preset sliding step length, and taking the text segment extracted by the feature extraction window in the sliding process as a text segment set of the corpus text.

3. The sample screening method according to claim 1 or 2, wherein the extracting text segments from the corpus text of the sample corpus as the text segment set of the corpus text comprises:

determining an initial sliding position of a corpus text of the sample corpus;

and starting a feature extraction window with a varying size from the initial sliding position, sliding it in the corpus text according to a preset sliding step length, and taking the text segments extracted by the feature extraction window in the sliding process as the text segment set of the corpus text.

4. The sample screening method according to claim 1 or 2, wherein the extracting text segments from the corpus text of the sample corpus as the text segment set of the corpus text comprises:

determining an initial sliding position of a corpus text of the sample corpus;

and starting a feature extraction window with a fixed size from the initial sliding position, sliding in the corpus text according to a preset sliding step length, and taking a text segment extracted by the feature extraction window in the sliding process as a text segment set of the corpus text.

5. The sample screening method according to claim 1, wherein the screening of the text segments in the text segment set comprises:

and screening the text segments in the text segment set by using a preset text segment screening algorithm.

6. The sample screening method according to claim 1 or 5, wherein the screening of the text segments in the text segment set comprises:

determining a starting position identifier and an ending position identifier of an answer corresponding to the sample question in the sample corpus;

and screening the text segments in the text segment set according to the starting position identification and the ending position identification.

7. The sample screening method according to claim 6, wherein the screening the text segments in the text segment set according to the start position identifier and the end position identifier comprises:

taking the text segment containing the starting position identification and the ending position identification in the text segment set as the positive sample text; and taking a text segment which does not contain the starting position identification and the ending position identification as the negative sample text.

8. The sample screening method according to claim 1, further comprising:

constructing a question text pair based on the sample question, the positive sample text, and the negative sample text;

and inputting the question text pair into an answer extraction model to be trained for training, to obtain the answer extraction model, wherein the answer extraction model associates the sample question with the positive sample text and/or the negative sample text.

9. A sample screening device, comprising:

the acquisition module is configured to acquire a sample question and a sample corpus containing answers corresponding to the sample question;

a sliding module configured to extract text segments from the corpus text of the sample corpus as a text segment set of the corpus text, wherein the size of the text segments is determined according to answers corresponding to the sample questions;

and the screening module is configured to screen the text segments in the text segment set, take the text segment containing the complete answer as the positive sample text of the sample question, and take the text segment not containing the answer as the negative sample text of the sample question.

10. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1-8 when executing the instructions.

11. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the method of any one of claims 1 to 8.

Background

For machine learning, especially deep learning, most algorithms need to be run on the basis of a large amount of sample data. The richness and accuracy of sample data are very important to machine learning.

In the current model training process, because sample labeling involves subjective human factors and the number of training samples is large, some inferior samples cannot be screened out manually one by one, which affects the model training effect. Alternatively, in some scenarios positive and negative samples need to be constructed for model training, but the accuracy of the trained model is not high because the positive and negative samples cannot be screened effectively.

Disclosure of Invention

In view of this, embodiments of the present application provide a sample screening method and apparatus, a computing device, and a computer-readable storage medium, so as to solve technical defects in the prior art.

According to a first aspect of embodiments of the present application, there is provided a sample screening method, including:

obtaining a sample question and a sample corpus containing answers corresponding to the sample question;

extracting text segments from the corpus text of the sample corpus as a text segment set of the corpus text, wherein the size of the text segments is determined according to answers corresponding to the sample questions;

and screening the text segments in the text segment set, taking the text segment containing the complete answer as a positive sample text of the sample question, and taking the text segment not containing the answer as a negative sample text of the sample question.

Optionally, the extracting text segments from the corpus text of the sample corpus as a text segment set of the corpus text includes:

determining an initial sliding position of a corpus text of the sample corpus;

and sliding the feature extraction window in the corpus text from the initial sliding position according to a preset sliding step length, and taking the text segment extracted by the feature extraction window in the sliding process as a text segment set of the corpus text.

Optionally, the extracting text segments from the corpus text of the sample corpus as a text segment set of the corpus text includes:

determining an initial sliding position of a corpus text of the sample corpus;

and starting a feature extraction window with a varying size from the initial sliding position, sliding it in the corpus text according to a preset sliding step length, and taking the text segments extracted by the feature extraction window in the sliding process as the text segment set of the corpus text.

Optionally, the extracting text segments from the corpus text of the sample corpus as a text segment set of the corpus text includes:

determining an initial sliding position of a corpus text of the sample corpus;

and starting a feature extraction window with a fixed size from the initial sliding position, sliding in the corpus text according to a preset sliding step length, and taking a text segment extracted by the feature extraction window in the sliding process as a text segment set of the corpus text.

Optionally, the screening the text segments in the text segment set includes:

and screening the text segments in the text segment set by using a preset text segment screening algorithm.

Optionally, the screening the text segments in the text segment set includes:

determining a starting position identifier and an ending position identifier of an answer corresponding to the sample question in the sample corpus;

and screening the text segments in the text segment set according to the starting position identification and the ending position identification.

Optionally, the screening the text segments in the text segment set according to the starting position identifier and the ending position identifier includes:

taking the text segment containing the starting position identification and the ending position identification in the text segment set as the positive sample text; and taking a text segment which does not contain the starting position identification and the ending position identification as the negative sample text.

Optionally, the sample screening method further comprises:

constructing a question text pair based on the sample question, the positive sample text, and the negative sample text;

and inputting the question text pair into an answer extraction model to be trained for training, to obtain the answer extraction model, wherein the answer extraction model associates the sample question with the positive sample text and/or the negative sample text.

According to a second aspect of embodiments of the present application, there is provided a sample screening apparatus comprising:

the acquisition module is configured to acquire a sample question and a sample corpus containing answers corresponding to the sample question;

a sliding module configured to extract text segments from the corpus text of the sample corpus as a text segment set of the corpus text, wherein the size of the text segments is determined according to answers corresponding to the sample questions;

and the screening module is configured to screen the text segments in the text segment set, take the text segment containing the complete answer as the positive sample text of the sample question, and take the text segment not containing the answer as the negative sample text of the sample question.

According to a third aspect of embodiments herein, there is provided a computing device comprising a memory, a processor and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the sample screening method when executing the instructions.

According to a fourth aspect of embodiments herein, there is provided a computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the sample screening method.

According to a fifth aspect of embodiments of the present application, there is provided a chip storing computer instructions that, when executed by the chip, implement the steps of the sample screening method.

In the embodiment of the application, a sample question and a sample corpus containing the answer corresponding to the sample question are obtained, and text segments are extracted from the corpus text of the sample corpus to serve as a text segment set of the corpus text, wherein the size of the text segments is determined according to the answer corresponding to the sample question. The text segments in the text segment set are then screened: a text segment containing the complete answer serves as a positive sample text of the sample question, and a text segment not containing the answer serves as a negative sample text of the sample question.

In this way, a long corpus text is divided into text fragments to obtain a plurality of short corpus texts, and the positive and negative samples used for model training are determined by screening these short corpus texts: a positive sample contains the complete answer to the sample question, while a negative sample contains no text from the text fragment corresponding to the answer to the sample question. Determining the positive and negative samples by screening the plurality of corpus texts in this way ensures the accuracy of the training result obtained when model training is performed with these positive and negative samples.

Drawings

FIG. 1 is a block diagram of a computing device provided by an embodiment of the present application;

FIG. 2 is a flow chart of a sample screening method provided in an embodiment of the present application;

FIG. 3(a) is a schematic diagram of a method for truncating a long text by means of a sliding window according to an embodiment of the present application;

FIG. 3(b) is a schematic diagram of truncating a long text in another manner according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a model training process provided by an embodiment of the present application;

FIG. 5 is a flow chart of a processing procedure of a sample screening method provided in an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a sample screening apparatus according to an embodiment of the present application.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the present application can be implemented in many ways other than those described herein, and those skilled in the art can make similar variations without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.

The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments of the present application to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application. The word "if," as used herein, may be interpreted as "responsive to a determination," depending on the context.

First, the terms involved in one or more embodiments of the present application are explained.

EQA: extractive question answering, i.e., extracting the answer to a question from a given text.

Positive sample: samples that are consistent with a true sample label, or samples that belong to a certain category.

Negative sample: samples that do not conform to the true sample label, or samples that do not belong to a certain category.

For example, in a classification problem such as face recognition, if face recognition needs to be performed on a user in an image, a positive sample is the region (rectangular frame) where a face is located in the image, and a negative sample is a region (rectangular frame) where an object other than the face is located. In a detection problem such as answer detection, if the answer to a question needs to be extracted from a piece of text, a text containing the complete answer to the question is a positive sample, and a text not containing any content of the answer corresponding to the question is a negative sample.

Sliding window: in processing corpus text, the sliding window may be one or more windows whose size can be specified; a window slides from the start position of the text all the way to the end position of the text.

Feature extraction: given a corpus text, feature extraction is a process of extracting a feature sequence, specifically, in a process of window sliding, content (i.e., character strings) in a window is extracted.

In the present application, a sample screening method and apparatus, a computing device and a computer-readable storage medium are provided, which are described in detail in the following embodiments one by one.

FIG. 1 shows a block diagram of a computing device 100 according to an embodiment of the present application. The components of the computing device 100 include, but are not limited to, memory 110 and processor 120. The processor 120 is coupled to the memory 110 via a bus 130 and a database 150 is used to store data.

Computing device 100 also includes access device 140, access device 140 enabling computing device 100 to communicate via one or more networks 160. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. Access device 140 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)), whether wired or wireless, such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.

In one embodiment of the present application, the above-mentioned components of the computing device 100 and other components not shown in fig. 1 may also be connected to each other, for example, by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 1 is for purposes of example only and is not limiting as to the scope of the present application. Those skilled in the art may add or replace other components as desired.

Computing device 100 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 100 may also be a mobile or stationary server.

Wherein the processor 120 may perform the steps of the sample screening method shown in fig. 2. Fig. 2 shows a flow chart of a sample screening method according to an embodiment of the present application, comprising steps 202 to 206.

Step 202, obtaining a sample question and a sample corpus containing answers corresponding to the sample question.

At present, machine reading comprehension is an important research topic in the field of natural language processing, and common tasks include cloze filling, multiple choice, answer extraction, free-form answers, and the like. In answer extraction (extractive question answering, EQA), a question (query) and a text segment (para) are organized into a question text pair and input into an answer extraction model, which extracts the answer to the question from the text segment.

In the traditional training process of answer extraction models, the training data are sampled mainly by using all samples for training, that is, the samples are not filtered; or some negative samples are randomly selected according to a certain proportion to participate in training. The answer extraction models obtained with these two training modes are not accurate enough.

When the sample corpus is divided into text segments, a batch of samples that interfere with the model training result may be produced: these samples are classified as negative samples even though they contain part of the text of the correct answer, and therefore they interfere with the model training result to a certain extent.

Based on this, in the sample screening method provided in the embodiments of the present application, for a question (query), a text containing the answer corresponding to the query is given. If the text is a long text (the number of characters contained in the text is greater than a preset threshold), a sliding window is used to truncate the long text into a plurality of short texts. Among these short texts, some contain the complete content of the answer corresponding to the query, some contain only part of that answer, and some contain no content of the answer at all. The short texts containing only part of the answer are filtered out, which improves the quality of the samples used for model training, reduces the interference of low-quality samples on model training, and helps improve the model training result.

Specifically, the sample corpus is a written text containing certain information content. It may be text of various scales, such as a sentence, a text segment, an article, or a plurality of articles, and may be text in various languages, such as Chinese, English, or Russian, which is not limited in this application.

The sample question is a question that needs to be answered or explained; it may be a question associated with the information content in the sample corpus or one that is not, which is not limited in this application.

Step 204, extracting text segments from the corpus text of the sample corpus as a text segment set of the corpus text.

The size of the text segments is determined according to the answer corresponding to the sample question.

In specific implementation, extracting text segments from the corpus text of the sample corpus as the text segment set of the corpus text may specifically be implemented in the following manner:

determining an initial sliding position of a corpus text of the sample corpus;

and sliding the feature extraction window in the corpus text from the initial sliding position according to a preset sliding step length, and taking the text segment extracted by the feature extraction window in the sliding process as a text segment set of the corpus text.

Specifically, the initial sliding position may be the position of the start character of the sample corpus, the position of the end character of the sample corpus, or the position of another character of the sample corpus.

Starting from the initial sliding position, the feature extraction window is slid according to the arrangement order of the characters in the sample corpus and a preset sliding step length, where the arrangement order is the order of the characters between the start character and the end character of the sample corpus. Generally, characters arranged in a certain order form a text expressing a certain meaning.
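
As a minimal illustrative sketch (not the claimed implementation itself), the fixed-step sliding extraction described above can be expressed in Python roughly as follows; the function name, window size, and sliding step used here are assumptions chosen for the example.

```python
def extract_text_segments(corpus_text: str, window_size: int, step: int, start: int = 0):
    """Slide a feature extraction window over the corpus text.

    Starting from the initial sliding position `start`, the window advances by
    `step` characters at a time, and every extracted window content (a character
    string) is collected into the text segment set of the corpus text.
    """
    segments = []
    position = start
    while position < len(corpus_text):
        segments.append((position, corpus_text[position:position + window_size]))
        if position + window_size >= len(corpus_text):
            break
        position += step
    return segments


# Toy example: a short corpus split with a window of 20 characters and a step of 10.
corpus = "The rural epidemic prevention work requires a special prevention and control class."
for offset, para in extract_text_segments(corpus, window_size=20, step=10):
    print(offset, repr(para))
```

The same sketch covers the fixed-size window case discussed later; the varying-size case only changes how the window width is chosen at each sliding step.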

In practical applications, for a sample question, a text (sample corpus) containing the answer corresponding to the sample question is given. When an answer extraction model is trained with the sample question and this text, since the text contains a large number of word units while the input length of the answer extraction model is fixed at 512 word units, a long text is generally truncated into a plurality of short texts by means of a sliding window.

The schematic diagram of truncating a long text by means of a sliding window provided in the embodiment of the present application is shown in fig. 3 (a).

A sample question is given, and a long text containing the answer corresponding to the sample question is given, wherein the position of the answer corresponding to the sample question in the text is shown as the labeling result in fig. 3 (a).

For a complete long text, the long text is divided into a plurality of text segments by a sliding window method, wherein the width of the window (i.e. the length of the text segment) is generally greater than or equal to the length of the character string of the answer. The segmentation result is shown in fig. 3(a), specifically, the long text is divided into 6 text segments, such as para1, para2, ..., and para6.

In the embodiment of the application, as long as the width of the feature extraction window is greater than or equal to the length of the answer character string, the specific size of the feature extraction window is not further limited. In practical applications, when the feature extraction window is used to extract text segments, its size may be a fixed value or may change continuously during sliding.

If text segments are extracted by using a feature extraction window with a varying size, extracting text segments from the corpus text of the sample corpus as the text segment set of the corpus text may be implemented through the following steps:

determining an initial sliding position of a corpus text of the sample corpus;

and starting a feature extraction window with a varying size from the initial sliding position, sliding it in the corpus text according to a preset sliding step length, and taking the text segments extracted by the feature extraction window in the sliding process as the text segment set of the corpus text.

Specifically, the text segment extraction process is similar to that of the foregoing embodiment except that the size of the feature extraction window changes during sliding; for the specific implementation, refer to the foregoing embodiment, which is not repeated here.

Provided that the width of the feature extraction window remains greater than or equal to the length of the answer character string, the width of the feature extraction window may change randomly during sliding, or it may be set to change according to a certain rule (for example, first gradually increasing and then gradually decreasing); this can be determined according to actual requirements and is not limited here. In addition, to ensure that complete text information can be extracted by the feature extraction window in the embodiment of the present application, the height of the feature extraction window may be determined according to the height of the text in the corpus text; specifically, the height of the feature extraction window may be set to be greater than the height of the text in the corpus text.
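
To make the varying-size window concrete, the following Python sketch (an assumption-level illustration, not the application's prescribed algorithm) slides a window whose width follows a given pattern while never falling below the answer length, as required above.

```python
import itertools

def extract_with_varying_window(corpus_text: str, answer_len: int, widths, step: int):
    """Slide a feature extraction window whose width changes while sliding.

    `widths` is a sequence of window widths applied in turn (and repeated if
    needed); each width is clamped so it never falls below the answer length.
    The extracted window contents form the text segment set.
    """
    segments = []
    position = 0
    for width in itertools.cycle(widths):        # repeat the width pattern as needed
        width = max(width, answer_len)           # window width >= answer string length
        segments.append((position, corpus_text[position:position + width]))
        if position + width >= len(corpus_text):
            break
        position += step
    return segments


# Example: widths that first shrink and then grow again, similar to fig. 3(b).
corpus = "x" * 30 + "THE-COMPLETE-ANSWER" + "y" * 30
for offset, para in extract_with_varying_window(corpus, answer_len=19, widths=[24, 20, 19, 20, 24], step=10):
    print(offset, len(para))
```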

A schematic diagram of truncating a long text is shown in fig. 3(b), taking as an example a feature extraction window whose size changes during sliding according to a rule of first gradually decreasing and then gradually increasing.

A sample question is given, and a long text containing the answer corresponding to the sample question is given, wherein the position of the answer corresponding to the sample question in the text is shown as the labeling result in fig. 3 (b).

For a complete long text, the long text is divided into a plurality of text segments by the sliding window method, where the width of the window (i.e. the length of the text segment) is generally greater than or equal to the length of the answer character string. On this premise, the width of the window may change during sliding according to the rule of first gradually decreasing and then gradually increasing, and the long text is divided according to the window size at each step of this change. The resulting division is shown in fig. 3(b): the long text is divided into 7 text segments, para1, para2, ..., para6, and para7, where the text segment para4 contains the complete content of the answer corresponding to the sample question. Compared with para3 in fig. 3(a), para4 contains relatively less content unrelated to the sample question while still containing the complete answer. In this case, if the text segment para4 is used as the positive sample text, then when the answer extraction model is trained with the sample question, the positive sample text para4, and the other negative sample texts, the model training efficiency is improved and the accuracy of the model training result is ensured.

Alternatively, if text segments are extracted by using a feature extraction window with a fixed size, extracting text segments from the corpus text of the sample corpus as the text segment set of the corpus text may also be implemented in the following manner:

determining an initial sliding position of a corpus text of the sample corpus;

and starting a feature extraction window with a fixed size from the initial sliding position, sliding in the corpus text according to a preset sliding step length, and taking a text segment extracted by the feature extraction window in the sliding process as a text segment set of the corpus text.

Specifically, the text segment extraction process is similar to that of the foregoing embodiment except that the size of the feature extraction window is fixed and does not change during sliding; for the specific implementation, refer to the foregoing embodiment, which is not repeated here.

A schematic diagram of the segmentation result generated by segmenting text segments with a fixed-size feature extraction window is shown in fig. 3(a); the text segments 1 (para1) to 6 (para6) generated by the segmentation in fig. 3(a) contain an equal number of text units.

In practical applications, if the size of the feature extraction window is fixed, it needs to be larger than the length of the answer corresponding to the sample question contained in the sample corpus; on the basis of meeting this condition, the actual size of the feature extraction window can be determined according to actual requirements, which is not limited here.

In the present application, dividing the corpus text into text segments with a fixed-size feature extraction window simplifies the text feature extraction process and helps improve the text feature extraction efficiency; in addition, dividing the corpus text into text segments with a feature extraction window of varying size helps ensure the diversity of the division results and helps improve the accuracy of the training result obtained by model training with the positive and negative samples.

And step 206, screening the text segments in the text segment set, taking the text segment containing the complete answer as a positive sample text of the sample question, and taking the text segment not containing the answer as a negative sample text of the sample question.

The positive sample text contains a complete corpus text of the answer, and the negative sample text does not contain any corpus text of the answer.

Specifically, after the corpus text is segmented with the feature extraction window to obtain a plurality of text segments, the text segments can be filtered: text segments containing an incomplete answer (i.e., only part of the answer) are filtered out, and the answer extraction model to be trained is trained with the remaining text segments and the sample question.

In specific implementation, screening the text segments in the text segment set can be realized through the following steps:

determining a starting position identifier and an ending position identifier of an answer corresponding to the sample question in the sample corpus;

and screening the text segments in the text segment set according to the starting position identification and the ending position identification.

Further, screening the text segments in the text segment set according to the starting position identifier and the ending position identifier includes:

taking the text segment containing the starting position identification and the ending position identification in the text segment set as the positive sample text; and taking a text segment which does not contain the starting position identification and the ending position identification as the negative sample text.

Specifically, after the corpus text is segmented with the feature extraction window to obtain a plurality of text segments, the text segments are filtered. To do so, the start position identifier and the end position identifier of the answer corresponding to the sample question in the sample corpus are determined according to the association between the word units in the sample corpus and the identifiers of the answer. After the sample question and the sample corpus are obtained, the answer corresponding to the sample question in the sample corpus can be determined according to the sample question, and corresponding identifiers are added to the answer; in particular, an association between the answer and the identifiers can be established, so that it can be determined from this association which word unit in the sample corpus is the answer start position and which is the answer end position (that is, which word units in the sample corpus form the answer corresponding to the sample question).

Then, whether each text segment contains the start position identifier and the end position identifier is judged: a text segment containing both identifiers is taken as a positive sample text, a text segment containing neither identifier is taken as a negative sample text, and a text segment containing only the start position identifier or only the end position identifier is removed.

As shown in fig. 3(a), after the long text is divided into text segments, the following situations may occur: 1) the text segment contains the complete answer; 2) the text segment contains incomplete answers, namely only part of the content of the answers; 3) the text fragment contains no answer at all.

As can be seen from fig. 3(a), the text segments para1 and para4 contain partial contents of answers, para2 and para3 contain complete answers, and para5 and para6 do not contain answers at all, so that para2 and para3 can be used as positive sample texts, para5 and para6 can be used as negative sample texts, and para1 and para4 can be removed.
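
The screening rule described above can be sketched as follows; this is a hedged illustration that assumes the start and end position identifiers are represented as character offsets of the answer in the corpus text, and that the (offset, text) pairs come from a sliding-window split such as the one in fig. 3(a).

```python
def screen_segments(segments, answer_start: int, answer_end: int):
    """Screen text segments using the answer's start/end position identifiers.

    `segments` holds (offset, text) pairs produced by the sliding window, and
    `answer_start`/`answer_end` are the character positions of the answer in
    the corpus text. Segments containing both identifiers become positive
    sample texts, segments containing neither become negative sample texts,
    and segments containing only one of them (a partial answer) are removed.
    """
    positives, negatives = [], []
    for offset, text in segments:
        span_end = offset + len(text)
        has_start = offset <= answer_start < span_end
        has_end = offset < answer_end <= span_end
        if has_start and has_end:
            positives.append(text)      # contains the complete answer
        elif not has_start and not has_end:
            negatives.append(text)      # contains no part of the answer
        # segments with exactly one identifier (partial answer) are filtered out
    return positives, negatives


corpus = "padding text " * 3 + "the complete answer" + " trailing text" * 3
answer_start = corpus.index("the complete answer")
answer_end = answer_start + len("the complete answer")
segments = [(p, corpus[p:p + 30]) for p in range(0, len(corpus), 15)]
positives, negatives = screen_segments(segments, answer_start, answer_end)
print(len(positives), "positive,", len(negatives), "negative")
```

With the window width kept at or above the answer length, a segment that overlaps the answer without covering it completely holds exactly one of the two identifiers, so checking the identifier pair is sufficient to drop every partial-answer segment.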

In specific implementation, after the positive sample texts and the negative sample texts of the sample question are obtained through screening, sample pairs can be constructed based on the sample question, the positive sample texts, and the negative sample texts, and model training can be performed with these sample pairs; this can be realized in the following manner:

constructing a question text pair based on the sample question, the positive sample text, and the negative sample text;

and inputting the question text pair into an answer extraction model to be trained for training, to obtain the answer extraction model, wherein the answer extraction model associates the sample question with the positive sample text and/or the negative sample text.

Specifically, after the corpus text of the sample corpus is divided into a plurality of text segments and the text segments are screened to obtain a plurality of positive sample texts and negative sample texts, question text pairs can be constructed based on the sample question, the positive sample texts, and the negative sample texts, and the question text pairs are input into the answer extraction model to be trained for training, so as to obtain the answer extraction model.
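
As a small sketch of the pair construction just described (the data structure and label convention here are assumptions for illustration; the source does not prescribe a particular format), the sample question can be spliced with each screened text segment as follows:

```python
from dataclasses import dataclass

@dataclass
class QuestionTextPair:
    question: str   # the sample question (query)
    text: str       # a positive or negative sample text (para)
    label: int      # 1 = positive sample (contains the complete answer), 0 = negative sample

def build_question_text_pairs(question, positive_texts, negative_texts):
    """Splice the sample question with each screened text segment."""
    pairs = [QuestionTextPair(question, text, 1) for text in positive_texts]
    pairs += [QuestionTextPair(question, text, 0) for text in negative_texts]
    return pairs

pairs = build_question_text_pairs(
    "How is the rural epidemic prevention work carried out?",
    positive_texts=["... a rural epidemic prevention and control special class is required ..."],
    negative_texts=["... local agricultural production resumed in the spring ..."],
)
for pair in pairs:
    print(pair.label, pair.text[:40])
```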

In order to describe the segmentation process of the corpus text more intuitively, the embodiment of the present application takes as an example a sample question (query) of "How is the rural epidemic prevention work carried out?", where the answer to the sample question contained in the sample corpus is "First, a rural epidemic prevention and control special class is required to be established; second, the function of the basic medical and health institutions is required to be fully exerted; third, the health management of the floating population is required to be enhanced."

The text segment 1 (para1) generated by segmenting the corpus text in the sample corpus is "... First, a rural epidemic prevention and control special class is required to be established"; since text segment 1 contains only part of the answer, para1 is not considered a correct answer.

The segmented text segment 2 (para2) is "First, a rural epidemic prevention and control special class is required to be established; second, the function of the basic medical and health institutions is required to be fully exerted; third, the health management of the floating population is required to be enhanced. ..."; para2 contains the complete answer, so para2 is a correct answer.

The implementation process of determining whether other text segments are correct answers is similar to the implementation process of the text segment 1 and the text segment 2, and is not described herein again.

In current sample selection schemes, one scheme is to use all samples for model training without screening them, that is, all 6 text segments marked in fig. 3(a) participate in the model training. Alternatively, a simple negative sample sampling mode is used, in which negative samples are randomly selected according to a certain proportion; taking the 6 text segments in fig. 3(a) as an example, para1, para4, para5 and para6 are used as negative samples (they do not contain the complete answer) while para2 and para3 are positive samples, all of the positive samples participate in training, and the negative samples may either all participate in training or only a randomly selected portion of them may participate.

However, since the para1 and the para4 in fig. 3(a) contain partial answer fragments, the training of the model using these fragments as negative samples interferes with the training result of the model. Therefore, in the sample screening scheme of the embodiment of the application, samples similar to para1 and para4 are filtered, only para2 and para3 are used as positive samples, and para5 and para6 are used as negative samples to perform model training.

Further, a schematic diagram of a model training process provided in the embodiment of the present application is shown in fig. 4.

The process comprises: segmenting the long text to obtain a text segment set; filtering and screening the text segments in the text segment set to obtain a new text segment set; splicing the sample question with each positive sample text and each negative sample text to generate question text pairs; and inputting the question text pairs into the answer extraction model to be trained for training.

Currently, in machine reading comprehension applications, the following methods are mainly used for sampling EQA model training data:

(1) All samples are used for training, i.e., the samples are not filtered.

(2) And a simple negative sample sampling mode is used, namely, the negative samples are randomly selected according to a certain proportion.

Compared with the current sample sampling method, the embodiment of the application uses a more reasonable sample screening strategy to filter out samples which cause interference to model training.

In the embodiment of the present application, a sample question and a sample corpus containing the answer corresponding to the sample question are obtained, a feature extraction window is slid in the corpus text of the sample corpus according to a preset sliding step length, and the text segments extracted by the feature extraction window during sliding are taken as the text segment set of the corpus text. The text segments in the text segment set are then screened: a first text segment obtained by screening is taken as a positive sample text of the sample question, and a second text segment obtained by screening is taken as a negative sample text of the sample question, wherein the first text segment contains the complete corpus text of the answer and the second text segment does not contain any corpus text of the answer.

In this way, a long corpus text is divided into text fragments to obtain a plurality of short corpus texts, and the positive and negative samples used for model training are determined by screening these short corpus texts: a positive sample contains the complete answer to the sample question, while a negative sample contains no text from the text fragment corresponding to the answer to the sample question. Determining the positive and negative samples by screening the plurality of corpus texts in this way ensures the accuracy of the training result obtained when model training is performed with these positive and negative samples.

The sample screening method provided in the present application is further described below with reference to fig. 5, taking the sample obtained by screening as an example for training an answer extraction model. Fig. 5 shows a flowchart of a processing procedure of a sample screening method according to an embodiment of the present application, specifically, steps 502 to 518.

Step 502, obtaining a sample question and a sample corpus containing answers corresponding to the sample question.

Step 504, determining the initial sliding position of the corpus text of the sample corpus.

And step 506, sliding the feature extraction window in the corpus text according to a preset sliding step length from the initial sliding position.

And step 508, using the text segment extracted from the feature extraction window in the sliding process as the text segment set of the corpus text.

The size of the feature extraction window is determined according to the answer corresponding to the sample question.

Step 510, determining a start position identifier and an end position identifier of an answer corresponding to the sample question in the sample corpus.

And step 512, screening the text segments in the text segment set according to the starting position identification and the ending position identification.

Step 514, taking the text segment in the text segment set that contains the start position identifier and the end position identifier as the positive sample text; and taking a text segment that contains neither the start position identifier nor the end position identifier as the negative sample text.

Step 516, construct question text pairs based on the sample question, the positive sample text, and the negative sample text.

Step 518, inputting the question text pairs into an answer extraction model to be trained for training, to obtain the answer extraction model, wherein the answer extraction model associates the sample question with the positive sample text and/or the negative sample text.
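
The following end-to-end sketch strings steps 502 to 516 together (window sliding, screening by the answer's start/end position identifiers, and pair construction); the helper names, default window size, and step length are illustrative assumptions, and the actual training of the answer extraction model in step 518 is only indicated by a comment since the source does not specify a model architecture.

```python
def sample_screening_pipeline(question: str, corpus: str, answer: str,
                              window_size: int = 64, step: int = 32):
    """End-to-end sketch of steps 502-516: slide, screen, and build pairs."""
    answer_start = corpus.index(answer)            # start position identifier
    answer_end = answer_start + len(answer)        # end position identifier
    window_size = max(window_size, len(answer))    # window must be able to cover the answer

    # Steps 504-508: slide a fixed-size feature extraction window over the corpus text.
    segments = [(pos, corpus[pos:pos + window_size]) for pos in range(0, len(corpus), step)]

    # Steps 510-514: screen the segments with the start/end position identifiers.
    pairs = []
    for pos, text in segments:
        has_start = pos <= answer_start < pos + len(text)
        has_end = pos < answer_end <= pos + len(text)
        if has_start and has_end:
            pairs.append({"question": question, "text": text, "label": 1})  # positive
        elif not has_start and not has_end:
            pairs.append({"question": question, "text": text, "label": 0})  # negative
        # partial-answer segments are dropped

    # Steps 516-518: the question text pairs would now be input into the answer
    # extraction model to be trained (model and training loop omitted here).
    return pairs


demo = sample_screening_pipeline(
    "How is the rural epidemic prevention work carried out?",
    "Intro text about the countryside. To prevent the epidemic, establish a special class at village level. "
    "Further measures are described later in the document.",
    "establish a special class",
)
print(sum(pair["label"] for pair in demo), "positive pairs,", len(demo), "pairs in total")
```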

In this way, a long corpus text is divided into text fragments to obtain a plurality of short corpus texts, and the positive and negative samples used for model training are determined by screening these short corpus texts: a positive sample contains the complete answer to the sample question, while a negative sample contains no text from the text fragment corresponding to the answer to the sample question. Determining the positive and negative samples by screening the plurality of corpus texts in this way ensures the accuracy of the training result obtained when model training is performed with these positive and negative samples.

Corresponding to the above method embodiments, the present application further provides sample screening apparatus embodiments, and fig. 6 shows a schematic structural diagram of a sample screening apparatus according to an embodiment of the present application. As shown in fig. 6, the apparatus 600 includes:

an obtaining module 602, configured to obtain a sample question and a sample corpus including answers corresponding to the sample question;

a sliding module 604, configured to extract a text segment from the corpus text of the sample corpus as a text segment set of the corpus text, wherein a size of the text segment is determined according to an answer corresponding to the sample question;

a screening module 606 configured to screen text segments in the text segment set, and use a text segment containing a complete answer as a positive sample text of the sample question, and use a text segment not containing the answer as a negative sample text of the sample question.

Optionally, the sliding module 604 includes:

a first determining submodule configured to determine a start sliding position of a corpus text of the sample corpus;

and the first sliding submodule is configured to slide the feature extraction window in the corpus text from the initial sliding position according to a preset sliding step length, and take the text segment extracted by the feature extraction window in the sliding process as a text segment set of the corpus text.

Optionally, the sliding module 604 further includes:

a second determining submodule configured to determine a start sliding position of a corpus text of the sample corpus;

and the second sliding submodule is configured to start a feature extraction window with a varying size from the initial sliding position, slide it in the corpus text according to a preset sliding step length, and use the text segments extracted by the feature extraction window in the sliding process as the text segment set of the corpus text.

Optionally, the sliding module 604 further includes:

a third determining submodule configured to determine a start sliding position of a corpus text of the sample corpus;

and the third sliding sub-module is configured to start the feature extraction window with a fixed size from the initial sliding position, slide in the corpus text according to a preset sliding step length, and use the text segment extracted by the feature extraction window in the sliding process as the text segment set of the corpus text.

Optionally, the screening module 606 further includes:

the first screening submodule is configured to screen the text segments in the text segment set by using a preset text segment screening algorithm.

Optionally, the screening module 606 further includes:

an identifier determining submodule configured to determine a start position identifier and an end position identifier of an answer corresponding to the sample question in the sample corpus;

and the second screening submodule is configured to screen the text segments in the text segment set according to the starting position identification and the ending position identification.

Optionally, the second screening submodule includes:

a sample determining unit configured to take a text segment containing the start position identifier and the end position identifier in the text segment set as the positive sample text; and taking a text segment which does not contain the starting position identification and the ending position identification as the negative sample text.

Optionally, the sample screening apparatus further comprises:

a construction module configured to construct a question text pair based on the sample question, the positive sample text, and the negative sample text;

a training module configured to input the question text pair into an answer extraction model to be trained for training, to obtain the answer extraction model, wherein the answer extraction model associates the sample question with the positive sample text and/or the negative sample text.

In this way, a long corpus text is divided into text fragments to obtain a plurality of short corpus texts, and the positive and negative samples used for model training are determined by screening these short corpus texts: a positive sample contains the complete answer to the sample question, while a negative sample contains no text from the text fragment corresponding to the answer to the sample question. Determining the positive and negative samples by screening the plurality of corpus texts in this way ensures the accuracy of the training result obtained when model training is performed with these positive and negative samples.

The above is a schematic scheme of a sample screening apparatus of this embodiment. It should be noted that the technical scheme of the sample screening apparatus is the same concept as the technical scheme of the sample screening method, and details that are not described in detail in the technical scheme of the sample screening apparatus can be referred to the description of the technical scheme of the sample screening method.

It should be noted that the components in the device claims should be understood as functional blocks which are necessary to implement the steps of the program flow or the steps of the method, and each functional block is not actually defined by functional division or separation. The device claims defined by such a set of functional modules are to be understood as a functional module framework for implementing the solution mainly by means of a computer program as described in the specification, and not as a physical device for implementing the solution mainly by means of hardware.

There is also provided in an embodiment of the present application a computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the sample screening method when executing the instructions.

The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the sample screening method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the sample screening method.

An embodiment of the present application further provides a computer readable storage medium storing computer instructions, which when executed by a processor, implement the steps of the sample screening method as described above.

The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical scheme of the storage medium and the technical scheme of the sample screening method belong to the same concept, and details that are not described in detail in the technical scheme of the storage medium can be referred to the description of the technical scheme of the sample screening method.

The embodiment of the application discloses a chip, which stores computer instructions, and the instructions are executed by a processor to realize the steps of the sample screening method.

The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

The computer instructions comprise computer program code which may be in the form of source code, object code, an executable file or some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.

It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and its practical applications, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.
