Transferable language model based on electronic medical record text

Document No. 8279 | Published: 2021-09-17

1. A migratable language model based on electronic medical record text, comprising the steps of:

S1: a template term separator matches the corresponding term parts from the medical text S, using the medical knowledge base as a dictionary, and replaces the matched terms to generate a text template and a set of professional terms;

S2: a template term encoder encodes the text template and the professional terms to obtain a fused vector representation E_{l+1} of the text and the terms;

S3: a pre-training downstream task layer performs pre-training using three tasks, namely template cloze (word masking and filling), term restoration, and negative-semantics judgment, outputs the downstream-task loss, and performs model training and optimization.

2. The electronic medical record text-based migratable language model of claim 1, wherein in step S1, the template term separator comprises the steps of:

S11: the inputs are the medical record text S_Doc, the field name S_key, and the field value S_value, and different processing modes are determined according to the field type;

S12: using a trie (dictionary tree) matching algorithm with the medical knowledge base KG as the dictionary, the corresponding term parts are matched from the medical text S and replaced, generating a text template S_pattern and a term set S_KG.

3. The electronic medical record text-based migratable language model of claim 1, wherein: in step S2, the template term encoder includes the following steps:

S21: the template term encoder uses Pattern Attention and KG Cross Attention in turn to capture the contextual semantic information of the template and to establish the association between the text template and the knowledge base; the formulas are as follows:

SelfAttention(X) = ln(mult_head_{h=12}(X, X, X, MASK) + X)

KGCrossAttention(X, K) = ln(mult_head_{h=12}(X, K, K, MASK) + X)

S22: the FFN layer applies one nonlinear transformation to the information captured in S21 to obtain the fused vector representation E_{l+1}; the formulas are as follows:

E_{l+1} = FFN(KGCrossAttention(SelfAttention(E_l), K, MASK))

E_1 = layer_norm(add([x_i]_{s_max}, [p_i]_{s_max}))

4. The electronic medical record text-based migratable language model of claim 1, wherein: in step S3, the pre-training of the downstream task layer includes the following steps:

S31: the template cloze (word masking and filling) task learns a context-dependent representation for each character in the template. Specifically, 15% of the ordinary characters in the template are randomly selected; each selected character is replaced with [MASK] with 80% probability, replaced with another character from the vocabulary with 10% probability, and left unchanged with 10% probability; the downstream task layer then restores the original characters. The specific calculation formula is as follows:

H_mlm = ln(relu(W_mlm D_{l+1} + b_mlm))

S32: the term restoration task learns the relation between the slots in the template and the terms that fill them. Specifically, 10% of the terms are randomly selected from the term set; each selected term is replaced with [MASK] with 80% probability, replaced with another term from the knowledge base with 10% probability, and left unchanged with 10% probability; the downstream task layer then restores the original terms. The calculation formula is as follows:

H_tmlm = ln(relu(W_tmlm D_{l+1} + b_tmlm))

S33: the semantic tendency judgment task proceeds as follows: paragraphs containing negative meanings are matched from the corpus by predefined rules and used as negative examples; other paragraphs are randomly selected as positive examples, with the number of sampled positive examples capped at the number of negative examples; the downstream task layer then predicts the tendency of each paragraph. The specific calculation formula is as follows:

P(para is positive | H_p) = sigmoid(W_neg H_p + b_neg)

S34: the model is pre-trained jointly using the three tasks described above.

Background

Electronic medical record text contains the patient's symptoms, examination results, and the doctor's description of the diagnosis and treatment process based on basic data such as symptoms and physicochemical indicators. This important information is stored in unstructured form and cannot be directly understood or processed by a computer.

Because of the confidentiality of medical data and the specialized nature of medical terminology, researchers need to minimize corpus annotation. Existing models, however, can only serve one field at a time, and data must be re-annotated whenever the field changes, which is time-consuming and labor-intensive. Meanwhile, medical record text has a particular structure, usually composed as "template + terms". For example, the gastric cancer operation text "Exploration: the liver, gallbladder, pancreas, spleen, large intestine and small intestine show no abnormality; the lesion is located on the anterior wall of the lesser curvature of the stomach" can be decomposed into the template "Exploration: [body organ] [abnormal condition]; the lesion is located on [body organ]" and the terms: liver, gallbladder, pancreas, spleen, large intestine, small intestine, anterior wall of the lesser curvature of the stomach, no abnormality. Within similar specialties the templates are almost identical and only the terms change, as in the intestinal cancer operation text "Exploration: the liver, gallbladder, pancreas and spleen show no obvious abnormality; the small intestine in the abdominal cavity is extensively adhered". If an information extraction model can be established that separates the template from the professional terms and models them separately, the difficulty of migrating the model across similar disciplines can be greatly reduced.

Disclosure of Invention

Aiming at the defects of existing language models, the invention provides a migratable language model based on electronic medical record text. It separates the electronic medical record text into a template and terms, so that the model can use a medical knowledge base to model the medical record text in separated form, thereby completing cross-discipline information extraction. Pre-training is performed in an unsupervised manner, which reduces the need for manual data annotation and lowers the migration difficulty of the model when it faces electronic medical record texts from different specialties.

The invention adopts the following technical scheme:

1. The overall workflow of the migratable language model based on electronic medical record text comprises the following steps:

S1: the template term separator takes the medical knowledge base as a dictionary, matches the corresponding term parts from the medical text S, and replaces the matched terms to generate a text template and a set of professional terms.

S2: the template term encoder takes the text template and the professional terms as input and outputs a fused vector representation.

S3: the pre-training downstream task layer performs pre-training using three tasks: template cloze, term restoration, and negative-semantics judgment. The output of the pre-training stage is the downstream-task loss; the output of the fine-tuning stage is the fused vector representation E_{l+1}.

2. Specifically, in step S1, the template term separator proceeds as follows:

S11: the inputs are the medical record text S_Doc, the field name S_key, and the field value S_value, and different processing modes are determined according to the field type. Specifically, the processing falls into three categories.

S12: using a trie (dictionary tree) matching algorithm with the medical knowledge base KG as the dictionary, the corresponding term parts are matched from the medical text S and replaced, generating the text template S_pattern and the term set S_KG.

3. In step S2, the template term encoder proceeds as follows:

S21: the template term encoder uses Pattern Attention and KG Cross Attention in turn to capture the contextual semantic information of the template and the association between the template and the knowledge base.

S22: carrying out one-time nonlinear transformation on the vector by using the FNN layer to obtain a fused vector characterization El+1. The concrete formula is as follows, the layer number l belongs to { x |1 is less than or equal to x is less than or equal to 12 }.

SelfAttention(X) = ln(mult_head_{h=12}(X, X, X, MASK) + X)

KGCrossAttention(X, K) = ln(mult_head_{h=12}(X, K, K, MASK) + X)

E_{l+1} = FFN(KGCrossAttention(SelfAttention(E_l), K, MASK))

E_1 = layer_norm(add([x_i]_{s_max}, [p_i]_{s_max}))

Here E_1 is the initial vector, obtained from the word-vector embeddings [x_i]_{s_max} of the input text X and the corresponding position encodings [p_i]_{s_max}. K is the vector representation of each term in the term set S_KG. MASK is a mask matrix that controls the attention range of each word; in KG Cross Attention it is used so that each template position attends only to the vector representation of the term at its corresponding replacement position.
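To make the role of this mask concrete, the following is a minimal sketch of how such an additive mask for KG Cross Attention might be built; the function name, the slot encoding, and the handling of non-slot rows are assumptions rather than details taken from the patent.

```python
import torch

def build_kg_cross_mask(template_slots, num_terms):
    """Build an additive mask for KG Cross Attention so that each template
    position attends only to the term filling its own slot.

    template_slots: list where template_slots[i] is the index of the term
        occupying position i, or -1 for ordinary (non-slot) characters.
    num_terms: number of terms in the term set S_KG.

    Returns a (len(template_slots), num_terms) matrix: 0 where attention
    is allowed, -inf where it is blocked.
    """
    seq_len = len(template_slots)
    mask = torch.full((seq_len, num_terms), float("-inf"))
    for i, t in enumerate(template_slots):
        if t >= 0:
            mask[i, t] = 0.0  # slot position i may attend to its own term t
    # Note: rows for non-slot positions are fully masked; the residual
    # connection preserves their representation (how an implementation
    # handles fully masked softmax rows, e.g. zeroing their attention
    # output, is left open here).
    return mask

# Example: a 6-character template whose 3rd and 4th characters form slot 0
# and whose 6th character is slot 1.
mask = build_kg_cross_mask([-1, -1, 0, 0, -1, 1], num_terms=2)
```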

4. In step S3, the pre-training downstream task layer proceeds as follows:

S31: the template cloze task learns a context-dependent representation for each word in the template. Specifically, 15% of the ordinary characters are randomly selected from the template; each is replaced with [MASK] with 80% probability, replaced with another character from the vocabulary with 10% probability, and left unchanged with 10% probability; the downstream task layer then restores the original characters.

S32: the term restoration task learns the relation between the slots in the template and the terms that fill them. Specifically, 10% of the terms are randomly selected from the term set; each is replaced with [MASK] with 80% probability, replaced with another term from the knowledge base with 10% probability, and left unchanged with 10% probability; the downstream task layer then restores the original terms. In addition, medical record text usually contains some structured information that can be inferred from the descriptive text; the structured terms in this information are corrupted by the same operation, forcing the model to learn the inference relation between the descriptive text and the structured text.

S33: the semantic tendency judgment task proceeds as follows: paragraphs containing negative meanings are matched from the corpus by predefined rules and used as negative examples; other paragraphs are randomly selected as positive examples, with the number of sampled positive examples capped at the number of negative examples; the downstream task layer then predicts the tendency of each paragraph.

S34: joint pre-training is performed with the three tasks. The output of the pre-training stage is the downstream-task loss; the output of the fine-tuning stage is the fused vector representation E_{l+1}.

Drawings

The various aspects of the present invention will become more apparent after reading the detailed description of the invention with reference to the accompanying drawings, in which:

FIG. 1 is a schematic diagram illustrating an electronic medical record text-based migratable language model, according to an embodiment of the present invention.

Detailed Description

In order to make the present disclosure more thorough and complete, reference is made to the accompanying drawings, in which like reference numerals indicate similar or analogous elements, and to the various embodiments of the invention described below. However, it will be understood by those of ordinary skill in the art that the examples provided below are not intended to limit the scope of the present invention. In addition, the drawings are for illustrative purposes only and are not drawn to scale.

Specific embodiments of various aspects of the present invention are described in further detail below with reference to the accompanying drawings.

FIG. 1 is a schematic diagram illustrating an electronic medical record text-based migratable language model, according to an embodiment of the present invention.

1. The overall workflow of the migratable language model based on electronic medical record text comprises the following steps:

S1: the template term separator matches the corresponding term parts from the medical text S, using the medical knowledge base as a dictionary, and replaces the matched terms to generate a text template and a set of professional terms.

S2: the template term encoder takes the text template and the professional terms as input and outputs a fused vector representation.

S3: the pre-training downstream task layer performs pre-training using three tasks: template cloze, term restoration, and negative-semantics judgment.

2. Specifically, the construction of the migratable language model based on electronic medical record text includes the following steps:

S11: construct the migratable language model based on electronic medical record text; the model structure is mainly divided into a template term separator, a template term encoder, and a pre-training downstream task layer.

S12: the template term separator matches corresponding term parts from the medical text S by using a dictionary tree matching algorithm and a medical knowledge base KG as a dictionary, and then replaces the corresponding term parts to generate the text template SpatternAnd term set SKG

For example, the gastric cancer operation text "Exploration: the liver, gallbladder, pancreas, spleen, large intestine and small intestine show no abnormality; the lesion is located on the anterior wall of the lesser curvature of the stomach" can be decomposed into the template "Exploration: [body organ] [abnormal condition]; the lesion is located on [body organ]" and the terms: liver, gallbladder, pancreas, spleen, large intestine, small intestine, anterior wall of the lesser curvature of the stomach, no abnormality.
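To illustrate, here is a minimal sketch of such a separator; it uses a greedy longest-match lookup over a plain dict rather than the trie the patent specifies, and the function name and toy knowledge base are assumptions.

```python
def separate(text, kg_terms):
    """Minimal template/term separator via greedy longest-match lookup.

    text:     raw medical text S.
    kg_terms: dict mapping a term string to its type in the knowledge base,
              e.g. {"liver": "body organ"}.

    Returns (template, terms): the text with each matched term replaced by
    its [type] slot, plus the ordered list of extracted terms.
    A production version would use a trie for efficient prefix matching.
    """
    max_len = max(map(len, kg_terms))
    template, terms, i = [], [], 0
    while i < len(text):
        # try the longest possible match starting at position i first
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in kg_terms:
                template.append(f"[{kg_terms[text[i:j]]}]")
                terms.append(text[i:j])
                i = j
                break
        else:
            template.append(text[i])  # ordinary character, keep in template
            i += 1
    return "".join(template), terms

pattern, kg_set = separate(
    "Exploration: liver shows no abnormality",
    {"liver": "body organ", "no abnormality": "abnormal condition"},
)
# pattern -> "Exploration: [body organ] shows [abnormal condition]"
# kg_set  -> ["liver", "no abnormality"]
```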

3. Specifically, in step S2, constructing the template term encoder includes the following steps:

S21: the template term encoder uses Pattern Attention and KG Cross Attention in turn to capture the contextual semantic information of the template and the association between the template and the knowledge base.

S22: carrying out one-time nonlinear transformation on the vector by using the FNN layer to obtain a fused vector characterization El+1. The concrete formula is as follows, the layer number l belongs to { x |1 is less than or equal to x is less than or equal to 12 }.

SelfAttention(X) = ln(mult_head_{h=12}(X, X, X, MASK) + X)

KGCrossAttention(X, K) = ln(mult_head_{h=12}(X, K, K, MASK) + X)

E_{l+1} = FFN(KGCrossAttention(SelfAttention(E_l), K, MASK))

E_1 = layer_norm(add([x_i]_{s_max}, [p_i]_{s_max}))

Here E_1 is the initial vector, obtained from the word-vector embeddings [x_i]_{s_max} of the input text X and the corresponding position encodings [p_i]_{s_max}. K is the vector representation of each term in the term set S_KG. MASK is a mask matrix that controls the attention range of each word; in KG Cross Attention it is used so that each template position attends only to the vector representation of the term at its corresponding replacement position.
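As an illustration, a minimal PyTorch sketch of one such encoder layer follows. The 12 attention heads and the ordering self-attention, then KG cross attention, then FFN come from the formulas above; the hidden sizes and class name are assumptions.

```python
import torch
import torch.nn as nn

class TemplateTermEncoderLayer(nn.Module):
    """One encoder layer: Pattern (self) attention over the template,
    KG Cross Attention to the term vectors K, then a feed-forward block.
    A sketch following the patent's formulas; the sizes are assumptions.
    """

    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, E, K, self_mask=None, cross_mask=None):
        # SelfAttention(X) = ln(mult_head(X, X, X, MASK) + X)
        h, _ = self.self_attn(E, E, E, attn_mask=self_mask)
        x = self.ln1(h + E)
        # KGCrossAttention(X, K) = ln(mult_head(X, K, K, MASK) + X)
        h, _ = self.cross_attn(x, K, K, attn_mask=cross_mask)
        x = self.ln2(h + x)
        # E_{l+1} = FFN(KGCrossAttention(SelfAttention(E_l), K, MASK))
        return self.ffn(x)

# Usage sketch: E_1 = layer_norm(word embeddings + position encodings),
# then stack layers l = 1..12:
#   E = TemplateTermEncoderLayer()(E, K, cross_mask=mask)
```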

4. Specifically, in step S3, the pre-training of the model includes the following steps:

S31: the template cloze task learns a context-dependent representation for each word in the template. Specifically, 15% of the ordinary characters are randomly selected from the template; each is replaced with [MASK] with 80% probability, replaced with another character from the vocabulary with 10% probability, and left unchanged with 10% probability; the downstream task layer then restores the original characters. The specific calculation formula is as follows:

H_mlm = ln(relu(W_mlm D_{l+1} + b_mlm))

This task enables the model, without supervision, to generate a context-dependent representation for each word of the input text.
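A minimal sketch of this 80/10/10 corruption scheme follows; the function name, the convention that slot markers begin with "[", and the handling of edge cases are assumptions.

```python
import random

def mask_template_chars(tokens, vocab, mask_token="[MASK]", seed=None):
    """Corrupt 15% of ordinary template characters with the 80/10/10 scheme.

    tokens: list of template characters (slot markers are left untouched).
    vocab:  list of characters to sample random replacements from.
    Returns (corrupted_tokens, labels) where labels[i] is the original
    character at a selected position and None elsewhere.
    """
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    candidates = [i for i, t in enumerate(tokens) if not t.startswith("[")]
    if not candidates:
        return corrupted, labels
    for i in rng.sample(candidates, max(1, int(0.15 * len(candidates)))):
        labels[i] = tokens[i]
        r = rng.random()
        if r < 0.8:
            corrupted[i] = mask_token          # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: random vocabulary character
        # else: 10% keep the original character unchanged
    return corrupted, labels
```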

S32: the term restoration task learns the relation between the slots in the template and the terms that fill them. Specifically, 10% of the terms are randomly selected from the term set; each is replaced with [MASK] with 80% probability, replaced with another term from the knowledge base with 10% probability, and left unchanged with 10% probability; the downstream task layer then restores the original terms. In addition, medical record text usually contains some structured information that can be inferred from the descriptive text; the structured terms in this information are corrupted by the same operation, forcing the model to learn the inference relation between the descriptive text and the structured text.

The overall calculation formula of the term restoration task is as follows:

H_tmlm = ln(relu(W_tmlm D_{l+1} + b_tmlm))

This task targets the fact that electronic medical record text contains a large number of terms; by learning from this characteristic of terms, the model's performance on electronic medical record text is enhanced.
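For comparison with the character-level scheme above, here is a hypothetical sketch of the same corruption applied at term granularity, as S32 describes; the names and the flat-list representation of the term set are assumptions.

```python
import random

def mask_terms(terms, kg_vocab, mask_token="[MASK]", seed=None):
    """Corrupt 10% of the terms with the 80/10/10 scheme (term restoration).

    terms:    ordered list of terms extracted by the separator; inferable
              structured fields would be corrupted with the same operation.
    kg_vocab: list of knowledge-base terms to sample random replacements from.
    """
    rng = random.Random(seed)
    corrupted, labels = list(terms), [None] * len(terms)
    if not terms:
        return corrupted, labels
    for i in rng.sample(range(len(terms)), max(1, int(0.10 * len(terms)))):
        labels[i] = terms[i]
        r = rng.random()
        if r < 0.8:
            corrupted[i] = mask_token             # 80%: [MASK] the whole term
        elif r < 0.9:
            corrupted[i] = rng.choice(kg_vocab)   # 10%: another KG term
        # else: 10% leave the term unchanged
    return corrupted, labels
```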

S33: the semantic tendency judgment task proceeds as follows: paragraphs containing negative meanings are matched from the corpus by predefined rules and used as negative examples; other paragraphs are randomly selected as positive examples, with the number of sampled positive examples capped at the number of negative examples; the downstream task layer then predicts the tendency of each paragraph. The specific calculation formula is as follows:

loss_neg = −Σ_{p ∈ positive para} log(P(para is positive | H_p)) − Σ_{p ∈ negative para} log(1 − P(para is positive | H_p))

Because the attention mechanism is biased toward capturing connections between words, it easily ignores overall contextual information; and the demands on handling negation in medicine are far higher than in the general domain, so this task improves the model's understanding of negative semantics.
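The pair of summed log terms above is a binary cross-entropy over the sigmoid classifier, so a minimal PyTorch sketch can lean on the built-in loss; the parameter names are assumptions, and this version averages over the batch rather than summing.

```python
import torch
import torch.nn.functional as F

def negative_semantics_loss(h_para, is_positive, w_neg, b_neg):
    """loss_neg for the semantic tendency task (S33).

    h_para:      (B, d) paragraph representations H_p.
    is_positive: (B,) float tensor, 1.0 for positive paragraphs, 0.0 for
                 rule-matched negative-meaning paragraphs.
    w_neg, b_neg: parameters of the sigmoid classifier head.
    """
    logits = h_para @ w_neg + b_neg  # (B,) pre-sigmoid scores
    # binary cross-entropy reproduces the two log terms of the patent
    # formula (averaged here instead of summed)
    return F.binary_cross_entropy_with_logits(logits, is_positive)

# Example:
# h = torch.randn(8, 768); y = torch.randint(0, 2, (8,)).float()
# w = torch.randn(768, requires_grad=True); b = torch.zeros(1, requires_grad=True)
# loss = negative_semantics_loss(h, y, w, b)
```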

S34: and (3) performing pre-training by using three tasks in a combined manner, wherein the final total pre-training loss is the weighted sum of the three tasks, and the formula is as follows:

loss = loss_mlm + loss_tmlm + loss_neg

In the migratable language model based on electronic medical record text, the medical knowledge base separates the electronic medical record text into two parts, a template and terms; the natural-language template is modeled independently with Pattern Attention, and the corresponding medical terms are then fused with KG Cross Attention, so that the model can model the medical record text in separated form through the medical knowledge base and complete cross-discipline information extraction. To make the model better suited to electronic medical record text, three pre-training tasks are designed; after pre-training with this method, the migration difficulty of the model within similar disciplines is greatly reduced.

The above are merely embodiments of the present invention and are not intended to limit its scope. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.
