Model training method, information extraction method, related device and storage medium
1. A method of model training, the method comprising:
acquiring N sample data, wherein each sample data comprises sub-data of M categories; the M categories of sub-data included in the N sample data correspond to M × N sub-data pairs; each sub-data pair comprises M sub-data of mutually different categories; the M sub-data included in each sub-data pair correspond to an association relationship; the M categories of sub-data included in each sample data are associated with each other; and M and N are positive integers greater than or equal to 2;
inputting the M × N sub-data pairs into a preset model for training, and generating a pre-training model corresponding to each category of sub-data; wherein the preset model is used for calculating the similarity between the M sub-data included in each sub-data pair, and determining a vector representation space corresponding to each data category according to the similarity between the M sub-data included in the M × N sub-data pairs.
2. The method of claim 1, wherein the inputting the M × N sub-data pairs into a preset model for training to generate a pre-training model corresponding to each category of sub-data comprises:
inputting the M × N sub-data pairs into the preset model;
and training the preset model according to the association relationship among the M sub-data included in the M × N sub-data pairs, to generate the pre-training model corresponding to each category of sub-data.
3. An information extraction method, characterized in that the method comprises:
acquiring data to be processed, wherein the data to be processed comprises sub-data to be processed of at least one category;
inputting the sub-data to be processed of the at least one category into a pre-training model corresponding to each category of the sub-data to be processed, to obtain vector information corresponding to the data to be processed; wherein the pre-training model is obtained by the model training method of claim 1;
and extracting target information carried by the data to be processed according to the vector information.
4. The method of claim 3, wherein the inputting the sub-data to be processed of the at least one category into a pre-training model corresponding to each category of the sub-data to be processed to obtain vector information corresponding to the data to be processed comprises:
inputting the sub-data to be processed of the at least one category into the pre-training model corresponding to each category of the sub-data to be processed, to obtain vector information corresponding to each category of the sub-data to be processed;
performing a feature fusion operation on the vector information respectively corresponding to the sub-data to be processed, to obtain the vector information corresponding to the data to be processed; wherein the feature fusion operation comprises at least one of: a splicing operation and a pooling operation.
5. The method of claim 3, wherein the extracting target information carried by the data to be processed according to the vector information comprises:
carrying out regularization processing on the vector information;
and extracting target information carried by the data to be processed according to the processed vector information.
6. The method of claim 4, wherein the data to be processed comprises sub-data to be processed of at least two categories;
after the sub-data to be processed of the at least one category is input into the pre-training model corresponding to each category of the sub-data to be processed, the method further comprises:
determining the similarity between the sub-data to be processed of the at least two categories;
and the obtaining of the vector information corresponding to the sub-data to be processed comprises:
generating the vector information corresponding to the sub-data to be processed according to the similarity and the vector representation space; wherein the vector representation space is obtained by the model training method of claim 1.
7. A model training apparatus, the apparatus comprising:
the first acquisition module is used for acquiring N sample data, wherein each sample data comprises sub-data of M categories; the M categories of sub-data included in the N sample data correspond to M × N sub-data pairs; each sub-data pair comprises M sub-data of mutually different categories; the M sub-data included in each sub-data pair correspond to an association relationship; the M categories of sub-data included in each sample data are associated with each other; and M and N are positive integers greater than or equal to 2;
the training module is used for inputting the M × N sub-data pairs into a preset model for training, to generate a pre-training model corresponding to each category of sub-data; wherein the preset model is used for calculating the similarity between the M sub-data included in each sub-data pair, and determining a vector representation space corresponding to each data category according to the similarity between the M sub-data included in the M × N sub-data pairs.
8. An information extraction apparatus, characterized in that the apparatus comprises:
the second acquisition module is used for acquiring data to be processed, wherein the data to be processed comprises sub-data to be processed of at least one category;
the output module is used for inputting the sub-data to be processed of the at least one category into a pre-training model corresponding to each category of the sub-data to be processed, to obtain vector information corresponding to the data to be processed; wherein the pre-training model is obtained by the model training method of claim 1;
and the extraction module is used for extracting the target information carried by the data to be processed according to the vector information.
9. A model training apparatus comprising a processor, a memory, and a communication interface:
the processor is connected to the memory and the communication interface;
the memory is used for storing executable program code;
and the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to perform the model training method according to any one of claims 1 to 2.
10. An information extraction device, comprising a processor, a memory, and a communication interface:
the processor is connected to the memory and the communication interface;
the memory is used for storing executable program code;
and the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to perform the information extraction method according to any one of claims 3 to 6.
11. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the model training method according to any one of claims 1 to 2 or the information extraction method according to any one of claims 3 to 6.
Background
At present, the internet is becoming increasingly widespread, the number of internet users keeps growing, and more and more people record and share their lives through multi-modal data such as videos. When creating a short video, a creator needs to prepare not only the video content, the audio content and the text, but also to consider how to generate a high-quality caption or title that attracts more users to watch. Existing caption generation methods rely mainly on manual writing and similar approaches, which generally leads to low quality of the generated captions and low generation efficiency.
Disclosure of Invention
The embodiments of the present application provide a model training method, an information extraction method, a related device and a storage medium, which can extract target information quickly and accurately.
In order to solve the above technical problem, the present application comprises the following technical solutions:
in a first aspect, an embodiment of the present application provides a model training method, where the method includes:
acquiring N sample data, wherein each sample data comprises sub-data of M categories; the M categories of sub-data included in the N sample data correspond to M × N sub-data pairs; each sub-data pair comprises M sub-data of mutually different categories; the M sub-data included in each sub-data pair correspond to an association relationship; the M categories of sub-data included in each sample data are associated with each other; and M and N are positive integers greater than or equal to 2;
inputting the M × N sub-data pairs into a preset model for training, and generating a pre-training model corresponding to each category of sub-data; wherein the preset model is used for calculating the similarity between the M sub-data included in each sub-data pair, and determining a vector representation space corresponding to each data category according to the similarity between the M sub-data included in the M × N sub-data pairs.
In a second aspect, an embodiment of the present application provides an information extraction method, where the method includes:
acquiring data to be processed, wherein the data to be processed comprises sub-data to be processed of at least one category;
inputting the sub-data to be processed of the at least one category into a pre-training model corresponding to each category of the sub-data to be processed, to obtain vector information corresponding to the data to be processed; wherein the pre-training model is obtained by the model training method according to the first aspect;
and extracting target information carried by the data to be processed according to the vector information.
In a third aspect, an embodiment of the present application provides a model training apparatus, where the apparatus includes:
the first acquisition module is used for acquiring N sample data, wherein each sample data comprises sub-data of M categories; the M categories of sub-data included in the N sample data correspond to M × N sub-data pairs; each sub-data pair comprises M sub-data of mutually different categories; the M sub-data included in each sub-data pair correspond to an association relationship; the M categories of sub-data included in each sample data are associated with each other; and M and N are positive integers greater than or equal to 2;
the training module is used for inputting the M × N sub-data pairs into a preset model for training, to generate a pre-training model corresponding to each category of sub-data; wherein the preset model is used for calculating the similarity between the M sub-data included in each sub-data pair, and determining a vector representation space corresponding to each data category according to the similarity between the M sub-data included in the M × N sub-data pairs.
In a fourth aspect, an embodiment of the present application provides an information extraction apparatus, where the apparatus includes:
the second acquisition module is used for acquiring data to be processed, wherein the data to be processed comprises sub-data to be processed of at least one category;
the output module is used for inputting the sub-data to be processed of the at least one category into a pre-training model corresponding to each category of the sub-data to be processed, to obtain vector information corresponding to the data to be processed; wherein the pre-training model is obtained by the model training method according to the first aspect;
and the extraction module is used for extracting the target information carried by the data to be processed according to the vector information.
In a fifth aspect, the present application provides another model training apparatus, the apparatus comprising a processor, a memory, and a communication interface:
the processor is connected to the memory and the communication interface;
the memory is used for storing executable program code;
and the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to perform the model training method according to the first aspect.
In a sixth aspect, the present application provides another information extraction apparatus, comprising a processor, a memory, and a communication interface:
the processor is connected to the memory and the communication interface;
the memory is used for storing executable program code;
and the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to perform the information extraction method according to the second aspect.
In a seventh aspect, an embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the model training method according to the first aspect or the information extraction method according to the second aspect.
According to the present application, a preset model is trained with a large amount of sample data containing sub-data of multiple categories, and a pre-training model corresponding to each category of sub-data is generated. In use, original data is input into the pre-training model to obtain vector information corresponding to the original data, and target information corresponding to the original data is finally extracted through a preset information extraction model. With the model training method provided by the present application, the pre-training model is constructed using the idea of contrastive learning, vector information corresponding to original data of various categories is generated, and a more accurate and compact representation of the original data is obtained. Extracting target information from this vector information yields target information of higher quality, so that the target information is extracted from the original data quickly and accurately, and the problems of low information quality and low generation efficiency caused by existing manual-writing-based approaches are solved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and that those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a model training method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a method for constructing sub-data pairs according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a vector representation space of a two-tower model according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an encoder according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of an information extraction method provided in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a decoder according to an embodiment of the present application;
FIG. 8 is a schematic diagram of the overall framework of a method for extracting information based on a pre-training model according to an embodiment of the present application;
FIG. 9 is a schematic view of the interface display of an electronic device during information extraction according to an embodiment of the present application;
FIG. 10 is a schematic flow chart of another information extraction method provided in an embodiment of the present application;
FIG. 11 is a schematic flow chart of another information extraction method provided in an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a model training apparatus according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of an information extraction apparatus according to an embodiment of the present application;
FIG. 14 is a schematic structural diagram of another model training apparatus provided in an embodiment of the present application;
FIG. 15 is a schematic structural diagram of another information extraction apparatus provided in an embodiment of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, embodiments of the present application are described in detail below with reference to the accompanying drawings.
The terms "first," "second," "third," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario provided in the present application. As shown in fig. 1, a video A may be input into the electronic device 10, and the electronic device 10 performs data analysis on video A and outputs a title or caption for video A. In the embodiment of the present application, the title or caption of video A may be referred to as the target information of video A, the data input into the electronic device for information extraction may be referred to as data to be processed, and the data to be processed is processed to obtain sub-data of at least one category, where the categories of sub-data may include, but are not limited to, video data, audio data, picture data, text data and the like.
The electronic device 10 may include, but is not limited to, a smart phone, a personal computer, a laptop computer, a smart tablet, a portable wearable device, and the like. The electronic device 10 has flexible access modes and high-bandwidth communication performance, and supports multiple communication modes, which may include, but are not limited to, communication through wireless cellular networks such as GSM, Code Division Multiple Access (CDMA) and Wideband Code Division Multiple Access (W-CDMA), as well as communication through wireless LAN, Bluetooth and infrared.
In an embodiment of the present application, the electronic device 10 may include an encoder and a decoder. The encoder is used for training on the sample data to obtain a pre-training model; in use, the encoder obtains vector information corresponding to the data to be processed according to the pre-training model, and the decoder generates target information according to the vector information. As shown in fig. 1, after a user uploads a video A to the electronic device, the electronic device can output the target information of video A after performing information extraction according to the information extraction method provided by the present application.
Next, a model training method and an information extraction method provided in the embodiment of the present application will be described with reference to an application scenario diagram shown in fig. 1.
Referring to fig. 2, fig. 2 is a schematic flow chart of a model training method in an embodiment of the present application, where the method includes:
s201, acquiring N sample data.
Specifically, the electronic device trains on sample data using an internal encoder, and first obtains N sample data. Each sample data comprises sub-data of M categories; the M categories of sub-data included in the N sample data correspond to M × N sub-data pairs; each sub-data pair comprises M sub-data of mutually different categories; the M sub-data included in each sub-data pair correspond to an association relationship; the M categories of sub-data included in each sample data are associated with each other; and M and N are positive integers greater than or equal to 2. In the embodiment of the present application, the sample data may include, but is not limited to, video data, audio data, picture data, text data and other categories of data.
Further, after acquiring the N sample data, the electronic device first processes all of the sample data, for example by batch processing. The electronic device first constructs a batch data set from the sample data. Taking sample data comprising 3 pieces of data, with video data and audio data as the sub-data categories, as an example: the 3 pieces of sample data are processed to obtain the sub-data corresponding to each piece of data, and sub-data pairs are constructed from the video data and audio data contained in each sample data, each sub-data pair comprising one piece of video data and one piece of audio data.
Fig. 3 shows a schematic diagram of a method for constructing sub-data pairs. Assume that the sample data includes sub-data of two categories, audio and video. As shown in fig. 3, there are 3 pieces of sample data: sample data A, sample data B and sample data C. First, the 3 pieces of data are decomposed to obtain 6 sub-data: video A, audio A, video B, audio B, video C and audio C. Similar sub-data pairs and dissimilar sub-data pairs are then constructed from these 6 sub-data. In a similar sub-data pair, the sub-data of different categories share the same source, i.e. they come from the same sample data. The similar sub-data pairs shown in fig. 3 are: (video A, audio A), (video B, audio B), (video C, audio C). In a dissimilar sub-data pair, the sub-data of different categories have different sources, i.e. they come from different sample data. The dissimilar sub-data pairs shown in fig. 3 are: (video A, audio B), (video A, audio C), (video B, audio A), (video B, audio C), (video C, audio A), (video C, audio B). Illustratively, the sub-data pair (video A, audio A) is a similar sub-data pair, where both video A and audio A are derived from sample data A; the sub-data pair (video B, audio C) is a dissimilar sub-data pair, where video B is derived from sample data B and audio C is derived from sample data C. It is to be understood that constructing the similar and dissimilar sub-data pairs in the above example may also be described as constructing positive and negative examples of the sample data set; specifically, the similar sub-data pairs may serve as positive examples and the dissimilar sub-data pairs as negative examples. In the embodiment of the present application, the similar sub-data pairs may also be referred to as similar instance pairs, and the dissimilar sub-data pairs as dissimilar instance pairs. The present application does not limit the method for processing the sample data or the method for constructing the sub-data pairs.
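For illustration only, the following Python sketch mirrors the pair construction described above for the three samples of fig. 3. The dict-based sample representation and all names are assumptions of this sketch rather than part of the original method:

```python
# A minimal sketch of the sub-data pair construction, assuming each sample
# is a dict with one entry per sub-data category (names are illustrative).
from itertools import product

samples = [
    {"video": "video_A", "audio": "audio_A"},  # sample data A
    {"video": "video_B", "audio": "audio_B"},  # sample data B
    {"video": "video_C", "audio": "audio_C"},  # sample data C
]

# Similar (positive) pairs: sub-data of different categories from the same sample.
similar_pairs = [(s["video"], s["audio"]) for s in samples]

# Dissimilar (negative) pairs: sub-data of different categories from different samples.
dissimilar_pairs = [
    (a["video"], b["audio"]) for a, b in product(samples, samples) if a is not b
]

print(similar_pairs)     # 3 pairs: ('video_A', 'audio_A'), ...
print(dissimilar_pairs)  # 6 pairs: ('video_A', 'audio_B'), ...
```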
S202, inputting the M × N sub-data pairs into the preset model for training, and generating a pre-training model corresponding to each category of sub-data.
Specifically, after the electronic device processes the N sample data to obtain the M × N sub-data pairs, the M × N sub-data pairs are input into a preset model for training, and a pre-training model corresponding to each category of sub-data is obtained after training. The preset model is used for calculating the similarity between the M sub-data included in each sub-data pair, and determining the vector representation space corresponding to each data category according to the similarity between the M sub-data included in the M × N sub-data pairs. The preset model may comprise a contrastive learning model obtained based on two-tower model training; any contrastive learning model suitable for the model training method may be used, and the present application does not limit its type. The method for model training based on the two-tower model may comprise: performing contrastive learning training based on data of any two categories.
Fig. 4 is a schematic diagram of the vector representation space of a two-tower model. In this embodiment, the vector representation space may be presented in the form of a coordinate system: a two-dimensional coordinate system in the case of a two-tower model, and possibly a three-dimensional coordinate system in the case of a three-tower model. In the vector representation space, sub-data from the same source are concentrated in one area, and sub-data from different sources are located in different areas. As shown in fig. 4, the coordinate system shows that the sub-data included in sample data A, sample data B and sample data C are each relatively concentrated: the sub-data included in sample data A are concentrated in the area circled for sample data A in the figure, and similarly, the sub-data included in sample data B and sample data C are concentrated in the areas circled for sample data B and sample data C. It should be noted that the present application does not limit the presentation form of the vector representation space. Sub-data of different categories from the same source have a higher similarity than sub-data of different categories from different sources. The higher the similarity, the closer the sub-data lie in the vector representation space, and the smaller the loss value of the loss function in the pre-training model; the lower the similarity, the farther apart the sub-data lie in the vector representation space, and the greater the loss value of the loss function in the pre-training model.
It can be understood that, in practical applications, data of one category is used as the reference tower for model training according to the business scenario, and contrastive learning training is then performed in combination with data of the other categories. For example, to train a pre-training model corresponding to each of video data, audio data, picture data and text data, the features corresponding to the video data are generally used as the reference tower in a two-tower model, and two-tower models of video tower and audio tower, video tower and picture tower, and video tower and text tower are respectively constructed in combination with the audio data, picture data and text data; each two-tower model is then trained on the sample data to obtain a pre-training model corresponding to each category of data. It should be noted that the embodiment of the present application does not limit the data category of the reference tower in the two-tower model training process; for example, but not limited to, the audio data may serve as the reference tower. It can also be understood that, in practical applications, the two-tower model may be extended to a multi-tower structure: when the sample data includes three categories of data, the two-tower model may be extended to a three-tower model, and the sample data is then trained based on the three-tower model, and so on. In the embodiment of the present application, the extension to a multi-tower structure can be chosen flexibly according to the business and the sample data in practical applications, which is not limited by the present application.
Further, inputting the M × N sub-data pairs into the preset model for training and generating a pre-training model corresponding to each category of sub-data includes: inputting the M × N sub-data pairs into the preset model; and training the preset model according to the association relationship between the M sub-data included in the M × N sub-data pairs, to generate the pre-training model corresponding to each category of sub-data.
Specifically, when the preset model is trained, the InfoNCE function may be used as the loss function for model training. In its standard form, the InfoNCE loss is calculated as:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_i') / \tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(z_i, z_j') / \tau\right)}$$

where $z_i$ and $z_i'$ denote a similar sub-data pair, $z_i$ and $z_j'$ (for $j \neq i$) denote a dissimilar sub-data pair, $\mathrm{sim}(\cdot,\cdot)$ is a similarity measure (typically cosine similarity), and $\tau$ is a temperature hyperparameter. Examples of similar sub-data pairs and dissimilar sub-data pairs are detailed in S201.
During model training, the InfoNCE loss function brings video data and audio data derived from the same sample data (i.e. the similar sub-data pairs mentioned in the foregoing embodiment) closer together in the current vector representation space, and pushes video data and audio data derived from different sample data (i.e. the dissimilar sub-data pairs) farther apart. For a similar sub-data pair, the higher the similarity, the closer the sub-data lie in the vector representation space and the smaller the loss value; the lower the similarity, the farther apart the sub-data lie and the larger the loss value. For a dissimilar sub-data pair, the lower the similarity, the farther apart the sub-data lie in the vector representation space and the smaller the loss value; the higher the similarity, the closer the sub-data lie and the larger the loss value. Training the contrastive learning model with the InfoNCE loss yields the vector representation space corresponding to each category of data, in which similar sub-data pairs have a high similarity and lie close together, while dissimilar sub-data pairs have a low similarity and lie far apart. During model training, the relevant parameters of the loss function are continuously adjusted according to the respective similarities of the pre-constructed similar and dissimilar sub-data pairs, so that the trained pre-training model reaches an optimal state. A pre-training model corresponding to each category of sub-data is obtained from this training process; in subsequent use, data whose information is to be extracted is input into the pre-training model to obtain the corresponding vector information for the downstream task.
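As a hedged illustration of this training objective, the following PyTorch sketch implements the standard batched form of the InfoNCE loss described above; the embedding dimensions, batch size and default temperature are assumptions of the sketch, not values prescribed by the method:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_video: torch.Tensor, z_audio: torch.Tensor,
                  tau: float = 0.07) -> torch.Tensor:
    """z_video[i] and z_audio[i] form a similar pair (same sample);
    z_video[i] and z_audio[j] with i != j form dissimilar pairs."""
    z_video = F.normalize(z_video, dim=-1)   # L2-normalize so dot product = cosine similarity
    z_audio = F.normalize(z_audio, dim=-1)
    logits = z_video @ z_audio.T / tau       # N x N similarity matrix scaled by temperature
    targets = torch.arange(z_video.size(0))  # diagonal entries are the positives
    return F.cross_entropy(logits, targets)

# Toy usage: a batch of 3 samples with 128-dimensional embeddings.
z_v = torch.randn(3, 128, requires_grad=True)
z_a = torch.randn(3, 128, requires_grad=True)
loss = info_nce_loss(z_v, z_a)
loss.backward()  # gradients pull similar pairs together, push dissimilar pairs apart
```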
The model training method provided in the embodiment of the present application is executed by an encoder inside the electronic device. The structure of the encoder executing the model training method is described below in connection with the above model training method.
Fig. 5 shows a schematic structure of an encoder. The encoder is mainly used for generating the pre-training model and for generating, with the pre-training model, the vectors corresponding to the data whose information is to be extracted. As shown in fig. 5, the electronic device first obtains sample data, which includes video data, audio data, picture data and text data. Assuming the current model training performs contrastive learning training based on a two-tower model to obtain the pre-training model, a video tower can be constructed with the video data as the reference tower, and two-tower models of video tower and audio tower, video tower and picture tower, and video tower and text tower are constructed in combination with the other three categories of data. Taking the two-tower model of video tower and audio tower as an example, the encoder in the electronic device performs batch processing on the sample data and classifies the video data, audio data, picture data and text data respectively to obtain the required video data and audio data. Suppose the sample data includes 3 pieces of data: sample data A, sample data B and sample data C. Processing them yields, for sample data A, the sub-data video A and audio A; likewise, the sub-data of sample data B are video B and audio B, and the sub-data of sample data C are video C and audio C. Therefore, when constructing the sub-data pairs, similar and dissimilar sub-data pairs can be constructed according to the source of the sub-data, where the similar sub-data pairs are, for example, (video A, audio A), (video B, audio B) and (video C, audio C), and the dissimilar sub-data pairs are, for example, (video A, audio B), (video B, audio C), and so on. If there are N sample data of M categories, M × N sub-data pairs can be obtained. After the sub-data pairs are constructed, the two-tower model is trained with the InfoNCE function as the loss function to obtain the pre-training model. As shown in fig. 5, the pre-training models obtained by training differ by data category: contrastive learning training is usually based on a 3D ResNet-50 for video data, on a BERT model for audio data and text data, and on a ResNet for picture data. After the pre-training model is obtained, the vector representation space corresponding to each category of data can be obtained; as shown in fig. 5, the representation heads for the video data, audio data, picture data and text data are a linear projector, [CLS], a linear projector and [CLS], respectively, and the vector corresponding to each sub-data can be obtained on this basis. As shown in fig. 5, taking one piece of data as an example, the vectors z_a, z_b, z_c and z_d converted from the sample data can be obtained from the video data, audio data, picture data and text data, respectively.
After the vector information is obtained, L2 regularization processing is usually performed on the data to avoid overfitting. After this processing, a feature fusion operation is performed on the individual vectors to obtain the final vectors corresponding to the sample data, namely (z_a, z_b), (z_a, z_c) and (z_a, z_d). The vector (z_a, z_b) represents the vector output by the two-tower model of video tower and audio tower, and its corresponding original data comprise video data and audio data; the vector (z_a, z_c) represents the vector output by the two-tower model of video tower and picture tower, and its corresponding original data comprise video data and picture data; the vector (z_a, z_d) represents the vector output by the two-tower model of video tower and text tower, and its corresponding original data comprise video data and text data. At this point the task of the encoder is complete, and the generated vectors may be sent to the pre-training model trained in the decoder based on the T5 model, so that the decoder generates the target information.
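For illustration, a minimal PyTorch sketch of this encoder stage follows. Simple linear projection heads stand in for the per-modality backbones (3D ResNet-50, BERT, ResNet) named above, and all dimensions and names are illustrative assumptions of the sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerEncoder(nn.Module):
    """Linear layers stand in for the per-modality backbones of the video
    tower and audio tower; dimensions are illustrative."""
    def __init__(self, video_dim: int = 512, audio_dim: int = 256, embed_dim: int = 128):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)  # video tower projection head
        self.audio_proj = nn.Linear(audio_dim, embed_dim)  # audio tower projection head

    def forward(self, video_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        z_a = F.normalize(self.video_proj(video_feat), dim=-1)  # L2-regularized video vector z_a
        z_b = F.normalize(self.audio_proj(audio_feat), dim=-1)  # L2-regularized audio vector z_b
        return torch.cat([z_a, z_b], dim=-1)  # fused vector (z_a, z_b) handed to the decoder

encoder = TwoTowerEncoder()
fused = encoder(torch.randn(3, 512), torch.randn(3, 256))
print(fused.shape)  # torch.Size([3, 256])
```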
According to the above model training method, after sample data is obtained, the sample data is first processed and sub-data pairs are constructed; the processed sub-data pairs then undergo contrastive learning training based on the two-tower model to obtain the pre-training model, while the vector representation space corresponding to each category of data is obtained from the two-tower model training and used in the downstream information extraction task. In this way, the vector information corresponding to each piece of data is generated in a more targeted manner, which lays a foundation for subsequent information extraction and improves the quality of the extracted target information.
The information extraction method provided in the embodiment of the present application is described below in combination with the model training method provided above. The information extraction method extracts information using the pre-training model obtained with the above model training method.
Referring to fig. 6, fig. 6 is a schematic flow chart of an information extraction method in an embodiment of the present application, where the method includes:
s601, acquiring data to be processed.
Specifically, when the user wants to perform an information extraction operation, the user may click a designated position on the interface of the electronic device and upload the data to be processed at that position. The electronic device then obtains the data to be processed, which comprises sub-data to be processed of at least one category. For example, the data to be processed may be any video clip in certain video software; after processing, the clip yields video data and audio data, and this processed video data and audio data are the sub-data to be processed.
S602, inputting the sub-data to be processed of the at least one category into the pre-training model corresponding to each category of the sub-data to be processed, to obtain vector information corresponding to the data to be processed.
Specifically, after the electronic device obtains the data to be processed: when the data to be processed comprises only one piece of data containing only one category of sub-data, batch processing is unnecessary; the data is input directly into the pre-training model corresponding to that category of sub-data, and the pre-training model outputs the vector information corresponding to the data to be processed. When the data to be processed comprises one piece of data containing sub-data of at least two categories, the sub-data of the at least two categories are respectively input into the pre-training models corresponding to their categories, and the vector information corresponding to each category of sub-data is output. When the data to be processed comprises at least two pieces of data, each containing sub-data of at least one category, the data to be processed is batch processed; for the specific batch processing method, reference is made to the above embodiments, which is not repeated here. The batch processing yields the at least one sub-data to be processed comprised by the data to be processed. The electronic device inputs each sub-data to be processed into the pre-training model corresponding to its category, to obtain the vector information corresponding to the data to be processed. For example, if the sub-data to be processed obtained after batch processing include video data and audio data, the video data is input into the video-tower pre-training model and the audio data into the audio-tower pre-training model, so as to obtain the vector information corresponding to the video data and to the audio data respectively. The pre-training model is obtained by the model training method shown in fig. 2; for the specific training method, reference is made to the foregoing embodiment, which is not repeated here.
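A minimal sketch of this routing step, assuming each pre-training model is exposed as a callable keyed by its sub-data category (all names are illustrative assumptions):

```python
from typing import Any, Callable, Dict

def encode_sub_data(sub_data: Dict[str, Any],
                    pretrained_models: Dict[str, Callable]) -> Dict[str, Any]:
    """Route each category of sub-data to the pre-training model for that
    category and collect the resulting vector information per category."""
    return {category: pretrained_models[category](data)
            for category, data in sub_data.items()}

# Usage (names hypothetical):
# vectors = encode_sub_data({"video": video_clip, "audio": audio_track},
#                           {"video": video_tower_model, "audio": audio_tower_model})
```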
S603, extracting target information carried by the data to be processed according to the vector information.
Specifically, after the vector information is obtained, regularization processing is first performed on the vector information to avoid overfitting; the encoder in the electronic device then transmits the processed vector information to the decoder, and the decoder extracts the target information from the vector information corresponding to the data to be processed based on a T5 (Text-to-Text Transfer Transformer) model.
Fig. 7 shows a schematic structure of a decoder. The decoder mainly extracts target information from the vector information generated by the encoder, based on a pre-training model for information extraction obtained by training a T5 model. After the encoder generates the vector information corresponding to the data to be processed, the vector information is transmitted to the decoder; the decoder receives the vector information and inputs it into the pre-training model for information extraction, and the model processes the vector information to generate the target information corresponding to the data to be processed. The T5 model can convert all natural language processing tasks into text-to-text tasks; it adopts a Transformer structure and has a very strong feature extraction capability. Natural language processing tasks may include, but are not limited to: text translation, text classification, text generation and automatic summarization. The text translation task translates input text into text of a specified language. The text classification task automatically classifies input text according to some criterion, for example classifying input words by semantic features. The text generation and automatic summarization tasks are similar in purpose to the information extraction method provided in the embodiment of the present application: their main purpose is to extract the target information of the data to be processed. The method for training the pre-training model for information extraction based on the T5 model comprises: creating a self-supervised task (such as language modeling or missing-word filling) and pre-training the model with a large amount of sample data to obtain the pre-training model for information extraction; then fine-tuning this pre-training model with a small amount of data comprising original data of several different categories and the target information generated from that original data, specifically adjusting all parameters of the pre-training model, so that the model is continuously optimized and its effect improved. The fine-tuned pre-training model for information extraction can then be used to extract target information. It should be noted that, since the T5 model is pre-trained on English sample data, the multilingual version mT5 (Multilingual T5) may be used in the embodiment of the present application.
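For illustration, the following sketch shows the text-to-text interface of the Hugging Face implementation of mT5 that could underlie such a decoder. Note that it feeds text rather than the fused vector information used by the method above, and in practice fine-tuning on (original data, target information) pairs would precede generation; the checkpoint name and prompt are assumptions of the sketch:

```python
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# A text prompt standing in for the encoder output; in the method above the
# decoder would instead consume the fused vector information.
inputs = tokenizer("summarize: <description of the video content>", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```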
Fig. 8 shows a schematic diagram of the overall framework of the method for extracting information based on the pre-training model, and fig. 9 shows a schematic view of the interface display of an electronic device during information extraction. As shown in fig. 8, the overall framework comprises an encoder and a decoder: the encoder mainly obtains the vector information corresponding to the data to be processed using the pre-training model obtained by the model training method, and the decoder mainly extracts the target information corresponding to the data to be processed from the vector information obtained by the encoder. For example, when a user inputs any piece of data into the electronic device, the electronic device processes the data and determines that it includes two categories of data, video data and audio data; it obtains the video vector information corresponding to the video data and the audio vector information corresponding to the audio data from the pre-training models corresponding to each, performs feature fusion processing on the video vector information and the audio vector information, and thus obtains the vector information of the piece of data. After the encoder obtains the vector information, it sends the vector information to the decoder; the decoder decodes the vector information of the data to be processed according to the pre-training model based on the T5 model and extracts the target information of the data to be processed from the vector information. As shown in fig. 9, according to the information extraction method provided in the present application, when a user uploads data A whose information is to be extracted at a designated position on the electronic device, the internal encoder and decoder extract the target information of data A by processing it according to the information extraction method, obtain the target information shown at B in fig. 9, namely "a precious family book", and display it on the screen of the electronic device for the user.
According to the above information extraction method, when a user inputs data whose information is to be extracted, the electronic device extracts the vector information of that data according to the pre-trained model, and the decoder extracts the corresponding target information from the vector information. By converting the data into vectors and then extracting information from the vectors, the method solves the problems of low quality of extracted information and low extraction efficiency caused by the limited accuracy of manual writing and similar approaches in current information extraction.
Referring to fig. 10, fig. 10 is a flow chart illustrating another information extraction method. The method comprises the following steps:
S1001, acquiring data to be processed.
Specifically, the user uploads the data to be processed to the electronic device, and the electronic device obtains the data to be processed, which comprises sub-data to be processed of at least one category. For the categories of the data to be processed, please refer to the above embodiments, which is not repeated here.
S1002, inputting the sub-data to be processed of the at least one category into the pre-training model corresponding to each category of the sub-data to be processed, to obtain vector information corresponding to each category of the sub-data to be processed.
Specifically, suppose a certain piece of data to be processed includes two sub-data to be processed: video data and audio data. After the electronic device obtains the sub-data to be processed, it inputs the video data into the pre-training model corresponding to video data and the audio data into the pre-training model corresponding to audio data, obtaining the vector information corresponding to the video data and to the audio data respectively. The pre-training model is obtained by training according to the model training method in the above embodiment. For the specific method of obtaining vector information from the pre-training model, please refer to the above embodiments, which is not repeated here.
S1003, performing a feature fusion operation on the vector information respectively corresponding to the sub-data to be processed, to obtain the vector information corresponding to the data to be processed.
Specifically, the electronic device performs a feature fusion operation on the obtained vector information corresponding to the sub-data to be processed. The feature fusion operation comprises at least one of: a splicing operation and a pooling operation. The splicing operation splices the vector information of at least two sub-data to be processed into one vector, and the spliced vector is the vector information corresponding to the data to be processed; the format of the spliced vector may include, but is not limited to, a list format. The pooling operation includes two methods. The first is the sum method: the vectors of all the sub-data to be processed are added element-wise, and the summed vector is the vector information corresponding to the data to be processed. The second is the average method: the vectors of all the sub-data to be processed are averaged, and the resulting vector is the vector corresponding to the data to be processed. For example, if a certain piece of data to be processed includes two sub-data to be processed with corresponding vectors z_a and z_b, then feature fusion by the splicing operation gives the fused vector (z_a, z_b); with the pooling operation, the sum method gives the fused vector z_a + z_b, and the average method gives (z_a + z_b) / 2.
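A minimal PyTorch sketch of the three fusion variants described above, with vector names following the example in the text (the dimensions are illustrative):

```python
import torch

z_a = torch.randn(128)  # vector for the video sub-data
z_b = torch.randn(128)  # vector for the audio sub-data

spliced = torch.cat([z_a, z_b])  # splicing operation -> 256-dim vector (z_a, z_b)
summed = z_a + z_b               # pooling operation, sum method
averaged = (z_a + z_b) / 2       # pooling operation, average method
```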
S1004, extracting target information carried by the data to be processed according to the vector information.
Specifically, after the vector information is obtained, the encoder in the electronic device transmits the processed vector information to the decoder, and the decoder extracts the target information from the vector information corresponding to the data to be processed based on the T5 model. For the process of extracting information based on the T5 model, please refer to the above embodiments, which is not repeated here.
According to the above method for obtaining the vector information of the data to be processed from its sub-data, the sub-data to be processed are input into the pre-training model to obtain the vector information corresponding to each sub-data, and feature fusion processing is then performed on this vector information to obtain the vector information corresponding to the data to be processed. By first extracting the vector information for each category of sub-data and then fusing the vectors of the several sub-data comprised by the data to be processed, the final vector of the data to be processed is more accurate, which in turn improves the accuracy of target information generation in subsequent information extraction.
Referring to fig. 11, fig. 11 is a flowchart illustrating another method for determining a vector corresponding to data to be processed according to sub-data to be processed. The method comprises the following steps:
S1101, acquiring data to be processed.
Specifically, the user uploads the data to be processed to the electronic device, and the electronic device obtains the data to be processed, which comprises sub-data to be processed of at least one category. For the categories of the data to be processed, please refer to the above embodiments, which is not repeated here.
S1102, inputting the sub-data to be processed of at least two categories into the pre-training model corresponding to each category of the sub-data to be processed.
Specifically, suppose a certain piece of data to be processed includes two sub-data to be processed: video data and audio data. After the electronic device obtains the sub-data to be processed, it inputs the video data into the pre-training model corresponding to video data and the audio data into the pre-training model corresponding to audio data. The pre-training model is obtained by training according to the model training method in the above embodiment.
S1103, determining the similarity between the sub-data to be processed of the at least two categories.
Specifically, the electronic device obtains the similarity between the video data and the audio data included in the data to be processed according to the pre-training model. The similarity characterizes how alike the sub-data of different categories in a sub-data pair are, and in particular can represent the likelihood that the sub-data of different categories come from the same original data. For a similar sub-data pair, the higher the similarity, the closer the sub-data lie in the vector representation space and the smaller the loss value of the loss function; the lower the similarity, the farther apart the sub-data lie and the larger the loss value. For a dissimilar sub-data pair, the lower the similarity, the farther apart the sub-data lie in the vector representation space and the smaller the loss value; the higher the similarity, the closer they lie and the larger the loss value. For example, consider two sub-data pairs (video A, audio A) and (video A, audio B). After analysis, the electronic device can determine that the video and audio sub-data in the first pair both come from data A, while in the second pair the video comes from data A and the audio from data B. Therefore, the similarity between the video data and audio data in the first sub-data pair is greater than that in the second sub-data pair.
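As a hedged illustration, the similarity between two categories of sub-data can be computed as cosine similarity once both embeddings lie in the shared vector representation space; the sketch below assumes random 128-dimensional embeddings for demonstration:

```python
import torch
import torch.nn.functional as F

video_vec = torch.randn(128)  # embedding of the video sub-data
audio_vec = torch.randn(128)  # embedding of the audio sub-data

similarity = F.cosine_similarity(video_vec, audio_vec, dim=0)
print(similarity.item())  # closer to 1.0 -> more likely from the same original data
```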
S1104, generating vector information corresponding to the sub-data to be processed according to the similarity and the vector representation space.
Specifically, after determining the similarity between the sub-data of different categories in the sub-data pair, the electronic device generates the vector information corresponding to the sub-data to be processed in combination with the vector representation space corresponding to each category of the sub-data to be processed.
S1105, performing a feature fusion operation on the vector information respectively corresponding to the sub-data to be processed, to obtain the vector information corresponding to the data to be processed.
Specifically, the electronic device performs a feature fusion operation on the obtained vector information corresponding to the sub-data to be processed, where the feature fusion operation comprises at least one of: a splicing operation and a pooling operation. For the specific methods of the splicing and pooling operations, please refer to the above embodiments, which is not repeated here.
S1106, extracting target information carried by the data to be processed according to the vector information corresponding to the data to be processed.
Specifically, after the vector information is obtained, the encoder in the electronic device transmits the processed vector information to the decoder, and the decoder extracts the target information from the vector information corresponding to the data to be processed based on the T5 model. For the process of extracting information based on the T5 model, please refer to the above embodiments, which is not repeated here.
According to the above method for determining the vector corresponding to the data to be processed from its sub-data, the similarity between sub-data of different categories in a sub-data pair is determined, and the vector information corresponding to the sub-data to be processed is determined from this similarity together with the vector representation space corresponding to each sub-data category. The generated vector information is therefore more accurate, which improves the accuracy of the target information subsequently extracted from the vector information, making the extraction faster and more accurate.
Referring to fig. 12, based on the model training method, fig. 12 is a schematic structural diagram of a model training apparatus provided in the present application. The function and effect of the model training apparatus in the embodiment of the present application are equivalent to those of the encoder mentioned in the above embodiments; the two belong to the same concept and are used for executing the model training method in the present application. The model training apparatus 1200 includes:
a first obtaining module 1201, configured to obtain N sample data; each sample data comprises sub data of M categories; the M types of sub data included in the N sample data correspond to M × N sub data pairs, each sub data pair includes M sub data, the types of the sub data are different, an association relationship corresponds to M sub data included in each sub data pair, the M types of sub data included in each sample data are associated with each other, and M and N are positive integers greater than or equal to 2;
a training module 1202, configured to input the M × N sub-data pairs into a preset model for training, and generate a pre-training model corresponding to each category of sub-data; the preset model is used for calculating the similarity between the M pieces of sub data included in each sub data pair, and determining the vector representation space corresponding to each data category according to the similarity between the M pieces of sub data included in each of the M × N sub data pairs.
In some embodiments, the training module 1202 includes:
an input unit configured to input the M × N sub-data pairs into a preset model;
and the generating unit is used for training the preset model according to the association relationship between the M sub data included in the M × N sub data pairs, and generating a pre-training model corresponding to each category of sub data.
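A minimal sketch of what these two units might amount to in code, reusing the hypothetical contrastive_loss from the earlier sketch; the encoders, dimensions, and training data below are placeholder assumptions, not the application's specified implementation:

```python
import torch

# Hypothetical per-category encoders standing in for the preset model's components.
video_encoder = torch.nn.Linear(1024, 256)   # stand-in for a real video encoder
audio_encoder = torch.nn.Linear(128, 256)    # stand-in for a real audio encoder
optimizer = torch.optim.Adam(
    list(video_encoder.parameters()) + list(audio_encoder.parameters()), lr=1e-4
)

# N associated (video, audio) sub-data pairs; random placeholders here.
loader = [(torch.randn(8, 1024), torch.randn(8, 128)) for _ in range(10)]

for video_batch, audio_batch in loader:
    # Train according to the association relationship: same-origin pairs score high.
    loss = contrastive_loss(video_encoder(video_batch), audio_encoder(audio_batch))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# After training, each encoder serves as the pre-training model for its category.
```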
Referring to fig. 13, based on the information extraction method, fig. 13 is a schematic structural diagram of an information extraction apparatus provided in the present application. The function and effect of the information extraction apparatus in the embodiment of the present application are equivalent to those of the decoder in the above embodiments; the two belong to the same concept and are used for executing the information extraction method in the present application. The information extraction apparatus 1300 includes:
a second obtaining module 1301, configured to obtain data to be processed, where the data to be processed includes at least one category of sub data to be processed;
an output module 1302, configured to input the to-be-processed sub data of the at least one category into a pre-training model corresponding to each category of the to-be-processed sub data, so as to obtain vector information corresponding to the to-be-processed data; wherein the pre-training model is obtained by the model training method of claim 1;
and the extracting module 1303 is configured to extract target information carried by the data to be processed according to the vector information.
In some embodiments, the output module 1302 includes:
the input unit is used for inputting the sub-data to be processed of the at least one type into a pre-training model corresponding to each type of the sub-data to be processed to obtain vector information corresponding to each type of the sub-data to be processed;
the fusion unit is used for performing a feature fusion operation on the vector information corresponding to each piece of sub-data to be processed to obtain the vector information corresponding to the data to be processed; wherein the feature fusion operation comprises at least one of: splicing operation and pooling operation.
In some embodiments, the extraction module 1303 includes:
the processing unit is used for carrying out regularization processing on the vector information;
and the extraction unit is used for extracting the target information carried by the data to be processed according to the processed vector information.
In some embodiments, the data to be processed includes at least two categories of sub-data to be processed;
the device further comprises:
a determining module, configured to determine a similarity between the at least two categories of the to-be-processed sub data after the output module 1302 inputs the at least one category of the to-be-processed sub data into the pre-training model corresponding to each category of the to-be-processed sub data;
the output module 1302 is specifically configured to:
generating vector information corresponding to the sub-data to be processed according to the similarity and the vector representation space; wherein the vector representation space is obtained by the model training method of claim 1.
Referring to fig. 14, fig. 14 is a schematic structural diagram of another model training apparatus 1400 provided in the embodiments of the present application. The model training apparatus may be integrated in the electronic device 10. The model training apparatus 1400 may include at least: at least one processor 1401 (e.g., a CPU), at least one network interface 1404, a user interface 1403, a memory 1405, and at least one communication bus 1402. The communication bus 1402 is used to realize connection and communication between these components. The user interface 1403 may include, but is not limited to, a camera, a display, a touch screen, a keyboard, a mouse, a joystick, and the like. The network interface 1404 may optionally include a standard wired interface or a wireless interface (e.g., a WiFi interface), and a communication connection may be established with the server via the network interface 1404. The memory 1405 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. As shown in fig. 14, the memory 1405, as a type of computer storage medium, may include an operating system, a network communication module, a user interface module, and program instructions.
It should be noted that the network interface 1404 may be connected to a receiver, a transmitter, or another communication module, and the other communication module may include, but is not limited to, a WiFi module, an operator network communication module, and the like.
Processor 1401 may be used to invoke program instructions stored in memory 1405 and may perform the following steps:
acquiring N sample data; each sample data comprises sub data of M categories; the M types of sub data included in the N sample data correspond to M × N sub data pairs, each sub data pair includes M sub data, the types of the sub data are different, an association relationship corresponds to M sub data included in each sub data pair, the M types of sub data included in each sample data are associated with each other, and M and N are positive integers greater than or equal to 2;
inputting the M × N sub-data pairs into a preset model for training, and generating a pre-training model corresponding to each category of sub-data; the preset model is used for calculating the similarity between the M pieces of sub data included in each sub data pair, and determining the vector representation space corresponding to each data category according to the similarity between the M pieces of sub data included in each of the M × N sub data pairs.
In a possible implementation, when the processor 1401 inputs the M × N sub-data pairs into a preset model for training and generates a pre-training model corresponding to each category of sub-data, it specifically performs:
inputting the M × N sub-data pairs into a preset model;
and training the preset model according to the association relationship between the M sub data included in the M × N sub data pairs to generate a pre-training model corresponding to each category of sub data.
Referring to fig. 15, fig. 15 is a schematic structural diagram of another information extraction apparatus 1500 provided in the embodiments of the present application. The information extraction apparatus may be integrated in the electronic device 10. The information extraction apparatus 1500 may include at least: at least one processor 1501 (e.g., a CPU), at least one network interface 1504, a user interface 1503, a memory 1505, and at least one communication bus 1502. The communication bus 1502 is used to realize connection and communication among these components. The user interface 1503 may include, but is not limited to, a camera, a display, a touch screen, a keyboard, a mouse, a joystick, and the like. The network interface 1504 may optionally include a standard wired interface or a wireless interface (e.g., a WiFi interface), and a communication connection may be established with the server through the network interface 1504. The memory 1505 may be a high-speed RAM memory or a non-volatile memory (e.g., at least one disk memory). As shown in fig. 15, the memory 1505, as a type of computer storage medium, may include an operating system, a network communication module, a user interface module, and program instructions.
It should be noted that the network interface 1504 may be connected to a receiver, a transmitter, or another communication module, and the other communication module may include, but is not limited to, a WiFi module, an operator network communication module, and the like.
Processor 1501 may be configured to call program instructions stored in memory 1505, and may perform the following steps:
acquiring data to be processed, wherein the data to be processed comprises at least one type of sub data to be processed;
inputting the to-be-processed subdata of at least one category into a pre-training model corresponding to each category of the to-be-processed subdata to obtain vector information corresponding to the to-be-processed data; wherein the pre-training model is obtained by the model training method of claim 1;
and extracting target information carried by the data to be processed according to the vector information.
In a possible implementation, when the processor 1501 inputs the sub-data to be processed of the at least one category into the pre-training model corresponding to each category of the sub-data to be processed to obtain vector information corresponding to the data to be processed, it specifically executes:
inputting the to-be-processed subdata of at least one category into a pre-training model corresponding to each category of the to-be-processed subdata to obtain vector information corresponding to each category of the to-be-processed subdata;
performing feature fusion operation on the vector information corresponding to the to-be-processed subdata to obtain the vector information corresponding to the to-be-processed data; wherein the feature fusion operation comprises at least one of: splicing operation and pooling operation.
In a possible implementation, when the processor 1501 extracts the target information carried by the data to be processed according to the vector information, it specifically executes:
carrying out regularization processing on the vector information;
and extracting target information carried by the data to be processed according to the processed vector information.
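The application does not specify the form of the regularization processing; one common choice, shown here purely as an assumption, is to L2-normalize the vector information before decoding:

```python
import torch
import torch.nn.functional as F

def regularize(vector_info: torch.Tensor) -> torch.Tensor:
    """One plausible reading of the regularization step: scale each vector
    to unit L2 norm so downstream extraction sees comparable magnitudes."""
    return F.normalize(vector_info, p=2, dim=-1)

processed = regularize(torch.randn(1, 512))  # hypothetical fused vector information
```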
In a possible implementation, the data to be processed comprises at least two categories of sub-data to be processed;
the processor 1501 is further configured to, after inputting the to-be-processed sub data of the at least one type into the pre-training model corresponding to each type of the to-be-processed sub data, execute:
determining the similarity between the sub-data to be processed of the at least two categories;
the processor 1501 obtains vector information corresponding to the to-be-processed sub-data, and specifically executes:
generating vector information corresponding to the sub-data to be processed according to the similarity and the vector representation space; wherein the vector representation space is obtained by the model training method of claim 1.
Embodiments of the present application also provide a computer-readable storage medium having instructions stored therein which, when executed on a computer or processor, cause the computer or processor to perform one or more steps of any one of the methods described above. If the modules of the model training apparatus and the information extraction apparatus are implemented in the form of software functional units and sold or used as independent products, they can be stored in the computer-readable storage medium.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the implementation may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in or transmitted over a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, or Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, or microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that incorporates one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a Digital Versatile Disk (DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), etc.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and which, when executed, can include the processes of the embodiments of the methods described above. The aforementioned storage medium includes various media capable of storing program code, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk. The technical features in the present examples and embodiments may be combined arbitrarily as long as they do not conflict.
The above-described embodiments are merely preferred embodiments of the present application, and are not intended to limit the scope of the present application, and various modifications and improvements made to the technical solutions of the present application by those skilled in the art without departing from the design spirit of the present application should fall within the protection scope defined by the claims of the present application.