Financial subject identification method, electronic device, and storage medium
1. A method for identifying a financial subject, comprising:
acquiring a financial document to be analyzed;
inputting the financial document to be analyzed into more than two different first subject identification models to obtain a first prediction result set, wherein the first prediction result set is composed of first prediction results corresponding to the first subject identification models, and each first prediction result comprises a plurality of financial subjects predicted by the corresponding first subject identification model;
and determining, according to the number of times each financial subject appears in the first prediction result set, whether that financial subject is output as an identification result.
2. The method of identifying financial subjects as recited in claim 1, further comprising:
acquiring a financial document to be trained, and acquiring a first character sequence and a second character sequence according to the financial document to be trained;
dividing the first character sequence into a training set and a verification set, and performing more than one round of training on more than two different second subject identification models according to the training set and the second character sequence to obtain a third subject identification model set, wherein the third subject identification model set consists of a plurality of third subject identification models corresponding to the second subject identification models, and each round of training of each second subject identification model yields one third subject identification model;
verifying each third subject identification model by using the verification set to obtain a recall rate and a second prediction result set of each third subject identification model, wherein the second prediction result set consists of each second prediction result corresponding to each third subject identification model, each second prediction result comprises a plurality of financial subjects obtained by predicting the corresponding third subject identification model, the third subject identification model meeting the recall rate requirement in the third subject identification model set is determined as a fourth subject identification model, and each second prediction result corresponding to the fourth subject identification model forms a third prediction result set;
determining, according to the number of times each financial subject appears in the third prediction result set, whether that financial subject is output as a prediction result;
and calculating the matching degree between the prediction results and the financial subjects labeled in the verification set, and determining each fourth subject identification model whose calculated matching degree meets the requirement as a first subject identification model.
3. The method of identifying financial subjects as claimed in claim 2, wherein the second subject identification model is constructed by at least one of:
BERT-BLSTM-CRF model and BERT-IDCNN-CRF model.
4. The method for identifying financial subjects as claimed in claim 2, wherein obtaining financial documents to be trained, and obtaining a first character sequence and a second character sequence according to the financial documents to be trained, specifically comprises:
acquiring a financial document to be trained, and preprocessing the financial document to be trained to obtain first text information;
and labeling the first text information to obtain a first character sequence and a second character sequence.
5. The method for identifying financial subjects as claimed in claim 4, wherein preprocessing the financial document to be trained to obtain first text information comprises:
removing redundant information from the financial document to be trained through regular-expression matching to obtain a processed financial document, wherein the processed financial document comprises a title and a text;
and acquiring the edit distance between the title and the text, and, if the edit distance is greater than a first threshold, splicing the title and the text to obtain the first text information.
6. The method of claim 4, wherein labeling the first text information to obtain a first character sequence and a second character sequence comprises:
marking the financial subject in the first text information to obtain a third character sequence, wherein the third character sequence comprises a title and a text;
marking, in the third character sequence, whether the financial subject appears in the text, the number of times it appears in the text, and whether it appears in the title, to obtain the second character sequence with marking information;
and marking the position information of the financial subject in the third character sequence to obtain the first character sequence with marking information.
7. The method of identifying financial subjects as claimed in claim 1, wherein
each first subject identification model comprises a trained first sub-model and a trained second sub-model;
and inputting the financial document to be analyzed into more than two different first subject identification models to obtain a first prediction result set comprises:
inputting the financial document to be analyzed into the trained first sub-model to obtain characteristic information corresponding to the financial document to be analyzed, wherein the trained first sub-model is obtained by training on the financial document to be trained;
and inputting the characteristic information corresponding to the financial document to be analyzed into the trained second sub-model to obtain the first prediction result set.
8. The method of claim 7, wherein the first sub-model is a BERT model;
and inputting the financial document to be analyzed into the trained first sub-model to obtain feature information corresponding to the financial document to be analyzed specifically comprises:
training one layer of pre-order code predictor in the BERT model by using the financial document to be trained to obtain a first weight value corresponding to the trained pre-order code predictor, wherein the BERT model has a plurality of layers of pre-order code predictors;
obtaining second weight values corresponding to a plurality of untrained pre-order code predictors in the BERT model;
obtaining a weight value of the BERT model according to the first weight value and each second weight value;
mapping the weight value of the BERT model to 512 dimensions through a fully connected layer to obtain a trained BERT model;
and inputting the financial document to be analyzed into the trained BERT model to obtain the characteristic information corresponding to the financial document to be analyzed.
9. The method of claim 8, wherein the trained pre-order code predictor is a bottom-most pre-order code predictor in the BERT model.
10. The method of claim 1, wherein determining whether the financial subject is output as a recognition result according to the number of times each financial subject appears in the first prediction result set comprises:
and determining a constant multiple of the number of the first prediction results as a second threshold, and outputting the financial subject as an identification result if the number of times of the financial subject appearing in the first prediction result set is greater than or equal to the second threshold.
11. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and the processor is configured to execute the computer program to perform the method of identifying a financial subject according to any one of claims 1 to 10.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of identification of a financial subject according to any one of claims 1 to 10.
Background
With the rapid progress of the internet and the rapid development of global finance, various forms of finance, represented by internet finance, have been integrated into all fields of economic and social development, and the volume of financial information has grown explosively. New forms of financial business such as P2P online lending platforms, micro-loan companies and equity investment institutions continue to emerge, financing and transaction scales continue to expand, and the transaction subjects involved are increasingly complex. Economic crime committed by means of the internet occurs more frequently, spreads more widely, and causes greater harm.
Financial supervision faces more difficulties than the supervision of traditional industries, and there is currently no effective solution for identifying the subjects of financial fraud information from massive volumes of financial information. In the related art, named entity recognition is performed based on deep learning, but this approach carries a risk of misjudging subjects; if it is used directly to identify the subjects of financial fraud information, those subjects cannot be identified accurately.
Disclosure of Invention
In this embodiment, a method for identifying a financial subject, an electronic device and a storage medium are provided to solve the problem in the related art that the subject of financial fraud information cannot be accurately identified.
In a first aspect, there is provided in this embodiment a method of identifying a financial subject, the method comprising:
acquiring a financial document to be analyzed;
inputting the financial document to be analyzed into more than two different first subject identification models to obtain a first prediction result set, wherein the first prediction result set is composed of first prediction results corresponding to the first subject identification models, and each first prediction result comprises a plurality of financial subjects predicted by the corresponding first subject identification model;
and determining, according to the number of times each financial subject appears in the first prediction result set, whether that financial subject is output as an identification result.
In some of these embodiments, the method further comprises:
acquiring a financial document to be trained, and acquiring a first character sequence and a second character sequence according to the financial document to be trained;
dividing the first character sequence into a training set and a verification set, and performing more than one round of training on more than two different second subject identification models according to the training set and the second character sequence to obtain a third subject identification model set, wherein the third subject identification model set consists of a plurality of third subject identification models corresponding to the second subject identification models, and each round of training of each second subject identification model yields one third subject identification model;
verifying each third subject identification model by using the verification set to obtain a recall rate and a second prediction result set of each third subject identification model, wherein the second prediction result set consists of each second prediction result corresponding to each third subject identification model, each second prediction result comprises a plurality of financial subjects obtained by predicting the corresponding third subject identification model, the third subject identification model meeting the recall rate requirement in the third subject identification model set is determined as a fourth subject identification model, and each second prediction result corresponding to the fourth subject identification model forms a third prediction result set;
determining, according to the number of times each financial subject appears in the third prediction result set, whether that financial subject is output as a prediction result;
and calculating the matching degree between the prediction results and the subjects of financial fraud information labeled in the verification set, and determining each fourth subject identification model whose calculated matching degree meets the requirement as a first subject identification model.
In some of these embodiments, the second subject recognition model is constructed by at least one of:
BERT-BLSTM-CRF model and BERT-IDCNN-CRF model.
In some embodiments, obtaining a financial document to be trained, and obtaining a first character sequence and a second character sequence according to the financial document to be trained specifically includes:
acquiring a financial document to be trained, and preprocessing the financial document to be trained to obtain first text information;
and labeling the first text information to obtain a first character sequence and a second character sequence.
In some embodiments, the preprocessing the financial document to be trained to obtain first text information specifically includes:
removing redundant information from the financial document to be trained through regular-expression matching to obtain a processed financial document, wherein the processed financial document comprises a title and a text;
and acquiring the edit distance between the title and the text, and, if the edit distance is greater than a first threshold, splicing the title and the text to obtain the first text information.
In some embodiments, labeling the first text information to obtain a first character sequence and a second character sequence includes:
marking the financial subject in the first text information to obtain a third character sequence, wherein the third character sequence comprises a title and a text;
marking, in the third character sequence, whether the financial subject appears in the text, the number of times it appears in the text, and whether it appears in the title, to obtain the second character sequence with marking information;
and marking the position information of the financial subject in the third character sequence to obtain the first character sequence with marking information.
In some of these embodiments, each first subject identification model comprises a trained first sub-model and a trained second sub-model;
inputting the financial document to be analyzed into more than two different first subject identification models to obtain a first prediction result set comprises:
inputting the financial document to be analyzed into the trained first sub-model to obtain characteristic information corresponding to the financial document to be analyzed, wherein the trained first sub-model is obtained by training on the financial document to be trained;
and inputting the characteristic information corresponding to the financial document to be analyzed into the trained second sub-model to obtain a first prediction result set.
In some of these embodiments, the first sub-model is a BERT model;
and inputting the financial document to be analyzed into the trained first sub-model to obtain feature information corresponding to the financial document to be analyzed specifically includes:
training one layer of pre-order code predictor in the BERT model by using the financial document to be trained to obtain a first weight value corresponding to the trained pre-order code predictor, wherein the BERT model has a plurality of layers of pre-order code predictors;
obtaining second weight values corresponding to a plurality of untrained pre-order code predictors in the BERT model;
obtaining a weight value of the BERT model according to the first weight value and each second weight value;
mapping the weight value of the BERT model to 512 dimensions through a fully connected layer to obtain a trained BERT model;
and inputting the financial document to be analyzed into the trained BERT model to obtain the characteristic information corresponding to the financial document to be analyzed.
In some of these embodiments, the trained pre-order code predictor is the bottom-most pre-order code predictor in the BERT model.
In some of these embodiments, determining whether each of the financial subjects is output as an identified result based on the number of times each of the financial subjects occurred in the first set of prediction results includes:
and determining a constant multiple of the number of the first prediction results as a second threshold, and outputting the financial subject as an identification result if the number of times of the financial subject appearing in the first prediction result set is greater than or equal to the second threshold.
In a second aspect, the present embodiment provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method for identifying a financial subject according to the first aspect when executing the computer program.
In a third aspect, there is provided in this embodiment a storage medium having stored thereon a computer program which, when executed by a processor, implements the method of identifying a financial subject as described in the first aspect above.
Compared with the related art, the financial subject identification method, the electronic device and the storage medium provided in this embodiment acquire a financial document to be analyzed and input it into two or more different first subject identification models to obtain a first prediction result set. The first prediction result set is composed of the first prediction results corresponding to the respective first subject identification models, and each first prediction result includes a plurality of financial subjects predicted by the corresponding first subject identification model. Whether each financial subject is output as an identification result is then determined according to the number of times it appears in the first prediction result set. This solves the problem that the subject of financial fraud information is easily misjudged and achieves more accurate identification of the subject of financial fraud information.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a block diagram of a hardware configuration of an application terminal of a method for identifying a financial subject according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for identifying a financial subject according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for obtaining a first subject identification model according to an embodiment of the present application;
FIG. 4 is a flowchart of a method for obtaining characteristic information corresponding to a financial document to be analyzed according to an embodiment of the present application;
FIG. 5 is a flowchart of another method for identifying a financial subject according to an embodiment of the present application;
FIG. 6 is a flowchart of yet another method for identifying a financial subject according to an embodiment of the present application;
FIG. 7 is a schematic diagram of BERT model dynamic weight fusion according to an embodiment of the present application.
Detailed Description
For a clearer understanding of the objects, aspects and advantages of the present application, reference is made to the following description and accompanying drawings.
Unless defined otherwise, technical or scientific terms used herein shall have the same general meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terms "a," "an," "the," and similar referents used in this application do not denote a limitation of quantity and may be singular or plural. The terms "comprises," "comprising," "has," "having," and any variations thereof, as used in this application, are intended to cover non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or modules, but may include other steps or modules (elements) not listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like used in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "A plurality" in this application means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In general, the character "/" indicates an "or" relationship between the objects before and after it. The terms "first," "second," "third," and the like in this application are used for distinguishing between similar objects and do not denote a particular order.
The method embodiments provided in this embodiment may be executed in a terminal, a computer, or a similar computing device. Taking execution on a terminal as an example, FIG. 1 is a block diagram of the hardware configuration of an application terminal of the method for identifying a financial subject according to an embodiment of the present application. As shown in FIG. 1, the terminal may include one or more processors 102 (only one is shown in FIG. 1) and a memory 104 for storing data, wherein the processor 102 may include, but is not limited to, a processing device such as a microprocessor (e.g., an MCU) or a programmable logic device (e.g., an FPGA). The terminal may also include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those of ordinary skill in the art that the structure shown in FIG. 1 is merely illustrative and does not limit the structure of the terminal described above. For example, the terminal may include more or fewer components than shown in FIG. 1, or have a different configuration from that shown in FIG. 1.
The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as a computer program corresponding to the identification method of the financial subject in the embodiment, and the processor 102 executes various functional applications and data processing by running the computer programs stored in the memory 104, so as to implement the above-mentioned method. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. The network may include a wireless network provided by a communication provider of the terminal. In one example, the transmission device 106 includes a Network Interface Controller (NIC) that can be connected to other network devices through a base station to communicate with the internet. In another example, the transmission device 106 may be a Radio Frequency (RF) module, which communicates with the internet wirelessly.
In the present embodiment, a method for identifying a financial subject is provided. FIG. 2 is a flowchart of a method for identifying a financial subject according to an embodiment of the present application. As shown in FIG. 2, the flow includes the following steps:
Step S201, acquiring a financial document to be analyzed.
Step S202, inputting the financial document to be analyzed into more than two different first subject identification models to obtain a first prediction result set, wherein the first prediction result set is composed of first prediction results corresponding to the first subject identification models, and each first prediction result comprises a plurality of financial subjects predicted by the corresponding first subject identification model.
Step S203, determining, according to the number of times each financial subject appears in the first prediction result set, whether that financial subject is output as an identification result.
In this embodiment, the financial subject is a subject of financial fraud information.
Through the above steps, the final financial subjects are determined, from the plurality of financial subjects predicted by more than two different first subject identification models, according to the number of times each financial subject appears in the first prediction result set. This solves the problem that the subject of financial fraud information is easily misjudged and enables more accurate identification of the subject of financial fraud information.
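The counting in step S203 can be illustrated with a minimal Python sketch. It assumes that each first prediction result is a list of subject names, that a subject counts at most once per model, and that the threshold is a constant multiple of the number of first prediction results; the function name `vote_subjects` and the constant `ratio=0.5` are hypothetical and are not specified by this application.

```python
from collections import Counter


def vote_subjects(prediction_results, ratio=0.5):
    """Keep a subject if it appears in at least ratio * N of the N results.

    prediction_results: one list of predicted subject names per model.
    ratio: hypothetical constant multiplier used to derive the threshold.
    """
    threshold = ratio * len(prediction_results)
    counts = Counter()
    for result in prediction_results:
        # Assumption: a model votes at most once for a given subject.
        counts.update(set(result))
    return sorted(s for s, c in counts.items() if c >= threshold)


# Three hypothetical models disagree on the subjects of one document.
results = [
    ["Alpha Lending", "Beta Capital"],
    ["Alpha Lending"],
    ["Alpha Lending", "Beta Capital", "Gamma Fund"],
]
print(vote_subjects(results))             # subjects named by at least 1.5 models
print(vote_subjects(results, ratio=1.0))  # subjects named by every model
```

A larger `ratio` makes the output more conservative, because a subject must be predicted by more of the models before it is output.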
In some embodiments, FIG. 3 is a flowchart of a method for obtaining the first subject identification model according to an embodiment of the present application. As shown in FIG. 3, obtaining the first subject identification model includes the following steps:
Step S301, acquiring a financial document to be trained, and obtaining a first character sequence and a second character sequence according to the financial document to be trained.
Step S302, dividing the first character sequence into a training set and a verification set, and performing more than one round of training on more than two different second subject identification models according to the training set and the second character sequence to obtain a third subject identification model set, wherein the third subject identification model set is composed of a plurality of third subject identification models corresponding to the second subject identification models, and each round of training of each second subject identification model yields one third subject identification model.
Step S303, verifying each third subject identification model by using a verification set to obtain a recall rate and a second prediction result set of each third subject identification model, wherein the second prediction result set is composed of each second prediction result corresponding to each third subject identification model, each second prediction result comprises a plurality of financial subjects obtained by predicting by the corresponding third subject identification model, and the third subject identification model meeting the recall rate requirement in the third subject identification model set is determined as a fourth subject identification model, wherein each second prediction result corresponding to the fourth subject identification model forms a third prediction result set.
Step S304, determining, according to the number of times each financial subject appears in the third prediction result set, whether that financial subject is output as a prediction result.
Step S305, calculating the matching degree between the prediction results and the subjects of financial fraud information labeled in the verification set, and determining each fourth subject identification model whose calculated matching degree meets the requirement as a first subject identification model.
Through the above steps, more than two different second subject identification models are trained for more than one round according to the training set and the second character sequence, and each round of training of each second subject identification model yields one third subject identification model. The third subject identification models together form the third subject identification model set. Each third subject identification model is verified with the verification set to obtain its recall rate and its second prediction result, and the third subject identification models that meet the recall rate requirement are taken as fourth subject identification models. This is the first screening of the third subject identification models.
The second prediction results corresponding to the fourth subject identification models form the third prediction result set. The predicted financial subjects are determined according to the number of times each financial subject predicted by the fourth subject identification models appears in the third prediction result set, and the matching degree between the predicted financial subjects and the subjects of financial fraud information labeled in the verification set is calculated. The fourth subject identification models that meet the matching degree requirement are determined as the first subject identification models. This is the second screening of the third subject identification models, so the first subject identification models are determined through two rounds of screening.
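The two rounds of screening can be sketched as follows. This is only one possible reading of steps S303 to S305: the matching degree is not defined in this passage, so the sketch assumes an F1 score between the voted prediction and the labeled subjects, and it assumes that all surviving models are accepted together when that score meets the requirement. The names `select_models`, `min_recall`, `min_match` and `ratio` are hypothetical.

```python
from collections import Counter


def recall(predicted, gold):
    """Fraction of labeled subjects that a model recovered."""
    gold = set(gold)
    return len(gold & set(predicted)) / len(gold) if gold else 0.0


def f1(predicted, gold):
    """Assumed stand-in for the 'matching degree'."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, rec = tp / len(predicted), tp / len(gold)
    return 2 * precision * rec / (precision + rec)


def select_models(model_predictions, gold, min_recall=0.8, min_match=0.8, ratio=0.5):
    """model_predictions maps a model name to its (doc_id, subject) predictions."""
    # First screening: keep models whose recall meets the requirement.
    fourth = {name: set(preds) for name, preds in model_predictions.items()
              if recall(preds, gold) >= min_recall}
    if not fourth:
        return [], 0.0
    # Vote over the surviving models' predictions.
    counts = Counter()
    for preds in fourth.values():
        counts.update(preds)
    voted = {item for item, c in counts.items() if c >= ratio * len(fourth)}
    # Second screening: compare the voted prediction with the labels.
    match = f1(voted, gold)
    return (sorted(fourth) if match >= min_match else []), match


gold = {("d1", "A"), ("d1", "B"), ("d2", "C"), ("d2", "D"), ("d3", "E")}
predictions = {
    "model_a": gold | {("d3", "X")},
    "model_b": gold - {("d3", "E")},
    "model_c": {("d1", "A"), ("d3", "X")},
}
selected, match = select_models(predictions, gold)
print(selected, round(match, 3))
```

In this toy run `model_c` fails the recall requirement in the first screening, and the remaining two models pass the second screening together.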
In some of these embodiments, the second subject recognition model is constructed by at least one of:
BERT-BLSTM-CRF model and BERT-IDCNN-CRF model.
It should be noted that Bidirectional Encoder Representations from Transformers (BERT) is a pre-trained model proposed by the Google AI research institute in October 2018.
Bi-directional Long Short-Term Memory (BiLSTM) is formed by combining a forward LSTM and a backward LSTM.
Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN).
Iterated Dilated CNN (IDCNN) stacks 4 dilated CNN blocks of identical structure, where each block consists of three dilated convolution layers with dilation widths of 1, 1 and 2; hence the name Iterated Dilated CNN.
A conditional random field (CRF) is a probabilistic graphical model, proposed in 2001, that satisfies the Markov property.
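To make the IDCNN block structure concrete, the following minimal sketch computes how many input characters one output position can see. It reads the widths 1, 1 and 2 as dilation rates, repeats the block four times as described above, and assumes a convolution kernel of size 3 with stride 1; the kernel size is not stated in this application.

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field of stacked 1-D dilated convolutions with stride 1."""
    field = 1
    for dilation in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation.
        field += (kernel_size - 1) * dilation
    return field


BLOCK = [1, 1, 2]   # dilation widths inside one block
IDCNN = BLOCK * 4   # the same block iterated four times

print(receptive_field(BLOCK))  # one block
print(receptive_field(IDCNN))  # four stacked blocks
```

Under these assumptions one block covers 9 characters and the four stacked blocks cover 33, which is how the iterated structure widens the context with only twelve convolution layers.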
In some embodiments, obtaining a financial document to be trained, and obtaining a first character sequence and a second character sequence according to the financial document to be trained specifically includes:
acquiring a financial document to be trained, and preprocessing the financial document to be trained to obtain first text information;
and labeling the first text information to obtain a first character sequence and a second character sequence.
In this way, the first text information is labeled to obtain the corresponding first character sequence and second character sequence, in preparation for obtaining the first subject identification model from the first character sequence and the second character sequence.
In some embodiments, the preprocessing the financial document to be trained to obtain the first text information specifically includes:
removing redundant information from the financial document to be trained through regular-expression matching to obtain a processed financial document, wherein the processed financial document comprises a title and a text;
and acquiring the edit distance between the title and the text, and, if the edit distance is greater than a first threshold, splicing the title and the text to obtain the first text information.
It should be noted that the Edit Distance (MED) is an index used to measure the similarity between two sequences in the fields of information theory, linguistics, and computer science.
In this embodiment, if the edit distance between the title and the body is less than or equal to the first threshold, only the text is retained, and the text is used as the first text information.
By the method, redundant information in the financial document to be trained is removed, whether the title and the text need to be spliced or not is judged according to the editing distance between the title and the text, the title and the text need to be spliced only when the editing distance is larger than a first threshold, and the title is similar to the text when the editing distance is smaller than or equal to the first threshold, so that the title and the text do not need to be spliced, the text is used as the first text information, and the redundant information in the first text information is avoided.
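The preprocessing decision described above can be sketched as follows. The noise patterns, the function names and the threshold values used in the example are hypothetical; the edit distance is the standard Levenshtein distance computed by dynamic programming.

```python
import re

def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical noise patterns; a real pipeline would need a longer list.
TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+")

def strip_noise(text):
    """Remove webpage tags and URLs, then collapse whitespace."""
    text = TAG.sub("", text)
    text = URL.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

def build_first_text(title, body, first_threshold):
    """Splice title and body only when they differ by more than the threshold."""
    title, body = strip_noise(title), strip_noise(body)
    if edit_distance(title, body) > first_threshold:
        return title + body
    return body

similar = build_first_text("Acme warning", "Acme warning", first_threshold=3)
different = build_first_text(
    "Alert", "Acme Finance promised guaranteed returns", first_threshold=3)
```

Here `similar` keeps only the body because the title repeats it, while `different` is the title followed by the body.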
In some embodiments, labeling the first text information to obtain the first character sequence and the second character sequence includes:
marking the financial main body in the first text information to obtain a third character sequence, wherein the third character sequence comprises a title and a text;
marking whether the financial main body appears in the text, the number of times of appearance in the text and whether the financial main body appears in the title in the third character sequence to obtain a second character sequence with marking information;
and marking the position information of the financial main body in the third character sequence to obtain the first character sequence with the marking information.
Through the method, the first character sequence comprises the marked financial main body and the position information of the financial main body, the second character sequence comprises the marked financial main body and the times of the marked financial main body appearing in the text and the title, and the second main body recognition model is trained according to the first character sequence and the second character sequence to obtain the first main body recognition model.
In some of these embodiments, the first subject recognition model comprises a trained first sub-model and a trained second sub-model;
inputting the financial document to be analyzed into more than two different first subject recognition models, and obtaining a first prediction result set comprises:
inputting the financial document to be analyzed into the trained first sub-model to obtain characteristic information corresponding to the financial document to be analyzed, wherein the trained first sub-model is obtained by training the financial document to be trained;
and inputting the characteristic information corresponding to the financial document to be analyzed into the trained second sub-model to obtain a first prediction result set.
By the method, the financial document to be trained is used for training the first sub-model to obtain the trained first sub-model, so that the trained first sub-model can more accurately acquire the characteristic information corresponding to the financial document to be analyzed, the more accurate characteristic information is input into the trained second sub-model, and the main body of financial fraud information in the financial document to be analyzed can be more accurately predicted.
In some of these embodiments, the first sub-model is a BERT model;
Fig. 4 is a flowchart of a method for acquiring feature information corresponding to a financial document to be analyzed according to an embodiment of the present application. As shown in Fig. 4, the method includes the following steps:
Step S401, training one layer of pre-order coding predictor in the BERT model by using a financial document to be trained to obtain a first weight value corresponding to the trained pre-order coding predictor, wherein the BERT model has multiple layers of pre-order coding predictors.
Step S402, obtaining the second weight values corresponding to the plurality of untrained pre-order coding predictors in the BERT model.
Step S403, obtaining a weight value of the BERT model according to the first weight value and each second weight value.
Step S404, mapping the weight value of the BERT model to 512 dimensions through a fully connected layer to obtain the trained BERT model.
Step S405, inputting the financial document to be analyzed into the trained BERT model to obtain the feature information corresponding to the financial document to be analyzed.
Through the above steps, one layer of pre-order coding predictor in the BERT model is trained to obtain the corresponding first weight value, and the trained BERT model is obtained according to the first weight value, so that the trained BERT model can more accurately extract the feature information corresponding to the financial document to be analyzed.
In some of these embodiments, the trained pre-order coding predictor is the bottom-most pre-order coding predictor in the BERT model.
In this embodiment, the pre-order coding predictors of the layers are not independent of one another: in addition to its own input features, the pre-order coding predictor of the next layer can combine the input of the pre-order coding predictor of the previous layer to obtain a merged feature, and outputs the merged feature.
Through the above method, the top-level pre-order coding predictor is trained to obtain the trained BERT model, so that the trained BERT model can more accurately extract the feature information corresponding to the financial document to be analyzed.
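One possible reading of the weight fusion in steps S401 to S404 is sketched below with NumPy. It is an interpretation rather than the method of the application: it assumes 12 layers with a hidden size of 768, treats each weight value as a scalar applied to that layer's representation, averages the weighted representations, and maps the result to 512 dimensions with a fully connected layer. The random arrays and the value 1.5 are placeholders, and the index of the trained layer is a hypothetical parameter.

```python
import numpy as np

NUM_LAYERS, HIDDEN, OUT_DIM = 12, 768, 512   # 12 and 768 are assumed sizes

def fuse_layers(layer_outputs, layer_weights):
    """Scale each layer's representation by its weight, then average over layers.

    layer_outputs: array of shape (num_layers, seq_len, hidden)
    layer_weights: array of shape (num_layers,)
    """
    scaled = layer_outputs * layer_weights[:, None, None]
    return scaled.mean(axis=0)               # shape (seq_len, hidden)

def project(features, kernel, bias):
    """Fully connected layer mapping hidden -> OUT_DIM."""
    return features @ kernel + bias

rng = np.random.default_rng(0)
seq_len = 16
layer_outputs = rng.normal(size=(NUM_LAYERS, seq_len, HIDDEN))  # placeholder data

trained_layer = NUM_LAYERS - 1               # hypothetical choice of trained layer
layer_weights = np.ones(NUM_LAYERS)          # second weight values (placeholders)
layer_weights[trained_layer] = 1.5           # first weight value (placeholder)

kernel = rng.normal(scale=0.02, size=(HIDDEN, OUT_DIM))
bias = np.zeros(OUT_DIM)

features = project(fuse_layers(layer_outputs, layer_weights), kernel, bias)
```

With a sequence of 16 tokens, `features` has shape (16, 512), that is, one 512-dimensional feature vector per token.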
In some of these embodiments, determining whether the financial subject is output as the recognition result based on the number of times each financial subject appears in the first prediction result set comprises:
and determining a constant multiple of the number of the first prediction results as a second threshold, and outputting the financial subject as the identification result if the number of times of the financial subject appearing in the first prediction result set is greater than or equal to the second threshold.
By the above manner, the financial subjects with the occurrence frequency greater than or equal to the second threshold value are output as the identification result in the first prediction result set, and the financial subjects with the occurrence frequency less than the second threshold value are removed, so that the subject of the financial fraud information can be accurately determined from the first prediction result set according to the second threshold value, the subject of the financial fraud information can be more accurately determined, and the inaccurate financial subjects are prevented from being output as the identification result.
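The threshold rule can be sketched in a few lines. The function name, the constant 0.6 and the sample predictions are hypothetical, and counting each model at most once per subject is an assumption about how occurrences are counted.

```python
from collections import Counter

def select_subjects(first_prediction_results, constant=0.6):
    """Output subjects whose occurrence count reaches constant * number of results."""
    counts = Counter()
    for result in first_prediction_results:
        counts.update(set(result))           # one vote per model and subject
    second_threshold = constant * len(first_prediction_results)
    return sorted(s for s, n in counts.items() if n >= second_threshold)

# Hypothetical first prediction results of three first subject recognition models.
first_prediction_results = [
    ["Acme Finance", "Beta Loans"],
    ["Acme Finance"],
    ["Acme Finance", "Gamma Pay", "Beta Loans"],
]
recognized = select_subjects(first_prediction_results)
```

With three models and a constant of 0.6 the second threshold is 1.8, so a subject must be predicted by at least two models to be output.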
Fig. 5 is a flowchart of a further method for identifying a financial subject according to an embodiment of the present application, where the flowchart includes the following steps, as shown in fig. 5:
Step S501, determining more than two different first subject recognition models according to financial documents to be trained, wherein each first subject recognition model comprises a trained first sub-model and a trained second sub-model.
In the embodiment, a financial document to be trained is obtained, and a first character sequence and a second character sequence are obtained according to the financial document to be trained;
dividing the first character sequence into a training set and a verification set, and performing more than one round of training on more than two different second main body recognition models according to the training set and the second character sequence to obtain a third main body recognition model set, wherein the third main body recognition model set consists of a plurality of third main body recognition models corresponding to the second main body recognition models, and one round of training is performed on each second main body recognition model to obtain one third main body recognition model;
verifying each third subject identification model by using a verification set to obtain the recall rate and a second prediction result set of each third subject identification model, wherein the second prediction result set consists of each second prediction result corresponding to each third subject identification model, each second prediction result comprises a plurality of financial subjects predicted by the corresponding third subject identification model, the third subject identification model meeting the recall rate requirement in the third subject identification model set is determined as a fourth subject identification model, and each second prediction result corresponding to the fourth subject identification model forms a third prediction result set;
determining whether the financial subjects are output as prediction results according to the times of the financial subjects appearing in the third prediction result set;
and calculating the matching degree of the prediction result and the financial fraud information body calibrated in the verification set, and determining a fourth body recognition model with the calculated matching degree meeting the requirement as the first body recognition model.
In one embodiment, the second subject recognition model is constructed by at least one of:
BERT-BLSTM-CRF model and BERT-IDCNN-CRF model.
In one embodiment, the obtaining a financial document to be trained, and obtaining a first character sequence and a second character sequence according to the financial document to be trained specifically includes:
acquiring a financial document to be trained, and preprocessing the financial document to be trained to obtain first text information;
and labeling the first text information to obtain a first character sequence and a second character sequence.
In one embodiment, the preprocessing the financial document to be trained to obtain the first text information specifically includes:
removing redundant information in the financial document to be trained through regular matching to obtain a processed financial document, wherein the processed financial document comprises a title and a text;
and acquiring the editing distance between the title and the text, and splicing the title and the text if the editing distance is greater than a first threshold value to obtain first text information.
In one embodiment, labeling the first text information to obtain the first character sequence and the second character sequence includes:
marking the financial main body in the first text information to obtain a third character sequence, wherein the third character sequence comprises a title and a text;
marking whether the financial main body appears in the text, the number of times of appearance in the text and whether the financial main body appears in the title in the third character sequence to obtain a second character sequence with marking information;
and marking the position information of the financial main body in the third character sequence to obtain the first character sequence with the marking information.
Step S502, inputting the financial document to be analyzed into the trained first sub-model to obtain the characteristic information corresponding to the financial document to be analyzed, wherein the trained first sub-model is obtained by training the financial document to be trained.
Step S503, inputting the characteristic information corresponding to the financial document to be analyzed into the trained second sub-model to obtain a first prediction result set.
In this embodiment, the first prediction result set is composed of first prediction results corresponding to the first subject recognition models, and each first prediction result includes a plurality of financial subjects predicted by the corresponding first subject recognition model.
In one embodiment, the first sub-model is a BERT model;
inputting the financial document to be analyzed into the trained first sub-model to obtain the characteristic information corresponding to the financial document to be analyzed, and the method specifically comprises the following steps:
training one layer of pre-order coding predictor in the BERT model by using a financial document to be trained to obtain a first weight value corresponding to the trained pre-order coding predictor, wherein the BERT model has multiple layers of pre-order coding predictors;
obtaining the second weight values corresponding to the plurality of untrained pre-order coding predictors in the BERT model;
obtaining a weight value of the BERT model according to the first weight value and each second weight value;
mapping the weight value of the BERT model to 512 dimensions through a fully connected layer to obtain a trained BERT model;
and inputting the financial document to be analyzed into the trained BERT model to obtain the feature information corresponding to the financial document to be analyzed.
In one embodiment, the trained pre-order coding predictor is the bottom-most pre-order coding predictor in the BERT model.
Step S504, determining a constant multiple of the number of the first prediction results as a second threshold, and outputting the financial subject as the identification result if the number of times of the financial subject appearing in the first prediction result set is greater than or equal to the second threshold.
Through the above steps, the financial document to be analyzed is input into more than two different first subject identification models to obtain a first prediction result set. The first prediction result set is composed of the first prediction results corresponding to the first subject identification models, and each first prediction result comprises a plurality of financial subjects predicted by the corresponding first subject identification model. Whether each financial subject is output as an identification result is determined according to the number of times it appears in the first prediction result set. This addresses the problem that the subjects of financial fraud information are prone to being misjudged: the final financial subjects are determined from the plurality of financial subjects predicted by the more than two different first subject identification models, so that the subjects of financial fraud information are identified more accurately.
Fig. 6 is a flowchart of a further method for identifying a financial subject according to an embodiment of the present application, where the flowchart includes the following steps, as shown in fig. 6:
Step S601, preprocessing the financial document to be trained to obtain first text information.
Step S602, labeling the first text information to obtain a first character sequence and a second character sequence.
Step S603, dividing the first character sequence into a training set and a verification set, and performing at least one round of training on two different second subject recognition models according to the training set and the second character sequence to obtain a third subject recognition model set, where the third subject recognition model set is composed of a plurality of third subject recognition models corresponding to the second subject recognition models, and each round of training is performed on the second subject recognition models to obtain one third subject recognition model.
Step S604, verifying each third subject identification model by using a verification set to obtain the recall rate and a second prediction result set of each third subject identification model, wherein the second prediction result set is composed of each second prediction result corresponding to each third subject identification model, each second prediction result comprises a plurality of financial subjects obtained by predicting by the corresponding third subject identification model, the third subject identification model meeting the recall rate requirement in the third subject identification model set is determined as a fourth subject identification model, and each second prediction result corresponding to the fourth subject identification model forms a third prediction result set.
Step S605, determining whether the financial subjects are output as the prediction results according to the number of times each financial subject appears in the third prediction result set.
Step S606, calculating the matching degree of the prediction result and the financial fraud information main body calibrated in the verification set, and determining a fourth main body recognition model with the calculated matching degree meeting the requirement as a first main body recognition model, wherein the first main body recognition model comprises a trained BERT model and a trained second sub-model.
In this embodiment, a financial document to be trained is used to train one layer of pre-order coding predictor in the BERT model to obtain a first weight value corresponding to the trained pre-order coding predictor, where the BERT model has multiple layers of pre-order coding predictors; the second weight values corresponding to the plurality of untrained pre-order coding predictors in the BERT model are obtained; a weight value of the BERT model is obtained according to the first weight value and each second weight value; and the weight value of the BERT model is mapped to 512 dimensions through a fully connected layer to obtain the trained BERT model.
Step S607, inputting the financial document to be analyzed into the trained BERT model to obtain the characteristic information corresponding to the financial document to be analyzed.
Step S608, inputting the feature information corresponding to the financial document to be analyzed into the trained second sub-model to obtain a first prediction result set.
Step S609, determining whether the financial subjects are output as the identification result according to the number of times each financial subject appears in the first prediction result set.
Through the above steps, the financial document to be analyzed is input into more than two different first subject identification models to obtain a first prediction result set. The first prediction result set is composed of the first prediction results corresponding to the first subject identification models, and each first prediction result comprises a plurality of financial subjects predicted by the corresponding first subject identification model. Whether each financial subject is output as an identification result is determined according to the number of times it appears in the first prediction result set, which addresses the problem that the subjects of financial fraud information are easily misjudged. This facilitates financial supervision of the Internet industry and allows the subjects of fraudulent financial information to be identified from massive amounts of financial information, so that the spread of economic crime can be controlled and prevented in time. It is therefore of great practical significance for preventing and combating Internet economic crime and for reducing property losses of the public.
All financial documents to be analyzed and financial documents to be trained come from financial information texts obtained by crawling specific financial webpages, and each financial information text comprises two parts, namely a title and a body text. Some webpages have a title but no body text, some have a body text but no title, and the texts differ in length; therefore, both the title and the body text need to be preprocessed.
In one embodiment, the financial document to be trained includes a title and a body text, and preprocessing the financial document to be trained to obtain the first text information specifically includes:
filtering noise in the financial document to be trained through regular-expression matching, wherein the noise comprises picture information, website information, webpage tags, dates, special characters, and symbols that are not Chinese characters, English characters or digits; judging whether the title and the body text have an inclusion relationship by calculating the edit distance between the title and the body text, and removing text data in which either the title or the body text is empty; specifically, calculating the edit distance between the title and the body text, retaining only the body text when the edit distance is less than 200, and splicing the title and the body text when the edit distance is greater than 200, to obtain processed text information;
and cutting the processed text information according to the priority of punctuation marks and recombining the pieces in their original order, generating a new data sample whenever the length of the recombined sentences would exceed 510 characters, and repeating this process on the remaining sentences until all of the processed text information has been assembled, to obtain the first text information.
By the above manner, noise in the financial document to be trained is filtered through regular-expression matching, and text data in which either the title or the body text is empty is removed by means of the edit distance between the title and the body text, which solves the problem of excessive redundant information in the financial document to be trained. The processed text information is cut according to the priority of punctuation marks, which solves the problem of a single text being too long while making full use of the data.
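The cutting and recombination step can be sketched as follows. The punctuation priority list and the function names are hypothetical; the description does not state which marks have which priority. The 510-character limit is taken from the text and presumably leaves room for the two special tokens of a 512-position BERT input. Pieces are split at the highest-priority marks first, split further at lower-priority marks only when still too long, and then packed greedily in their original order.

```python
import re

# Hypothetical priority: sentence enders first, then weaker breaks.
PRIORITY = ["。！？!?", "；;", "，,"]

def split_keep(text, marks):
    """Split after any character in marks, keeping the mark with its piece."""
    pieces = re.split(f"(?<=[{re.escape(marks)}])", text)
    return [p for p in pieces if p]

def cut(text, max_len, level=0):
    """Cut text into ordered pieces of at most max_len characters."""
    if len(text) <= max_len:
        return [text]
    if level == len(PRIORITY):               # no punctuation left: hard cut
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    pieces = []
    for part in split_keep(text, PRIORITY[level]):
        pieces.extend(cut(part, max_len, level + 1))
    return pieces

def make_samples(text, max_len=510):
    """Greedily recombine pieces in order; start a new sample on overflow."""
    samples, current = [], ""
    for piece in cut(text, max_len):
        if current and len(current) + len(piece) > max_len:
            samples.append(current)
            current = ""
        current += piece
    if current:
        samples.append(current)
    return samples

demo = make_samples("AAAAAA!BBBBBB!CCC!", max_len=12)
```

Concatenating the samples always reproduces the input, so no text is discarded; with the short demo limit of 12 the three sentences are packed into two samples.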
In one embodiment, labeling the first text information to obtain a first character sequence and a second character sequence specifically includes:
manually labeling the first text information: marking the financial subject entities contained in each financial document to form a financial subject entity list, and marking whether the information expresses fraudulent content to form a label column named negative, so that the original data set has four columns, namely title, text, entity and negative, and is recorded as the third character sequence;
mapping the third character sequence into the character labels 'O', 'B-ORG' and 'I-ORG', wherein the first character of an entity word in the third character sequence is labeled B-ORG, the remaining characters of the entity word are labeled I-ORG, and all other characters in the third character sequence are labeled O, so as to form a one-to-one mapping between characters and labels and obtain the first character sequence;
and in the third character sequence, marking the number of times the entity word occurs in the first 507 characters of the text, whether the entity word occurs in the text, and whether the entity word occurs in the title, so as to obtain the second character sequence.
By the method, the first text information is labeled to obtain a first character sequence and a second character sequence, and preparation is made for subsequently training the second main body recognition model according to the first character sequence and the second character sequence to obtain the first main body recognition model.
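The two kinds of annotation can be illustrated with a short sketch. The function names, the dictionary keys and the sample strings are hypothetical; the labels 'O', 'B-ORG' and 'I-ORG' and the 507-character window come from the description. Entity mentions are located by plain substring search, which is an assumption.

```python
def bio_tags(text, entities):
    """Character-level labels: B-ORG on the first character of each entity
    mention, I-ORG on its remaining characters, O elsewhere."""
    tags = ["O"] * len(text)
    for entity in entities:
        if not entity:
            continue
        start = text.find(entity)
        while start != -1:
            end = start + len(entity)
            if all(t == "O" for t in tags[start:end]):  # skip overlapping mentions
                tags[start] = "B-ORG"
                tags[start + 1:end] = ["I-ORG"] * (len(entity) - 1)
            start = text.find(entity, end)
    return tags

def entity_features(title, body, entity, window=507):
    """Per-entity annotations of the kind carried by the second character sequence."""
    return {
        "count_in_body": body[:window].count(entity),
        "in_body": entity in body,
        "in_title": entity in title,
    }

first_sequence = bio_tags("Acme sold Acme", ["Acme"])
second_sequence = entity_features("Acme warning", "Acme sold Acme", "Acme")
```

The first result pairs every character with exactly one label, and the second records how often and where the entity occurs.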
In one embodiment, the obtaining of the financial document to be analyzed specifically includes:
constructing a financial fraud information detection model based on the BERT model, specifically, connecting the output of the last fully connected layer of the BERT model to a Sigmoid activation function (0/1 output) to obtain the financial fraud information detection model, and inputting the first character sequence into the financial fraud information detection model to obtain the financial document to be analyzed.
In addition, in this embodiment, the financial document to be analyzed may be obtained by using a conventional machine learning model, where the conventional machine learning model includes an SVM model and a Logistic Regression model.
In the above manner, the financial document with the financial fraud information is used as the financial document to be analyzed, and preparation is made for subsequently identifying the main body of the financial fraud information according to the financial document to be analyzed.
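The 0/1 decision made by a Sigmoid output can be sketched as follows. The scores stand in for the output of the last fully connected layer, and the cutoff of 0.5, the function names and the document identifiers are assumptions for illustration.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def select_documents(docs, scores, cutoff=0.5):
    """Keep the documents whose fraud probability reaches the cutoff."""
    return [d for d, s in zip(docs, scores) if sigmoid(s) >= cutoff]

# Hypothetical documents and pre-activation scores.
docs = ["doc_a", "doc_b", "doc_c"]
scores = [2.0, -1.0, 0.0]
to_analyze = select_documents(docs, scores)
```

Only the documents flagged as containing financial fraud information are passed on as financial documents to be analyzed.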
In one embodiment, constructing two or more different second subject recognition models comprises:
In this embodiment, the second subject recognition model is constructed in four modes.
Mode one: a second subject recognition model is constructed based on the BERT-BLSTM-CRF model. Specifically, the token vectors learned by the BERT pre-trained model are input into the BiLSTM model for further learning so that the model can understand the contextual relationships of the text sequence, and the classification result of each token is finally obtained through the CRF model. The present application first uses the output features of the last fully connected layer of the original BERT model as the input of the BLSTM model, and then feeds the output of the fully connected layer of the BLSTM model into the CRF model to complete the identification of financial subject information. The three layers are as follows: first, BERT encodes the input text using the Transformer mechanism and obtains the semantic representation of the characters from the pre-trained model; second, the BiLSTM further extracts high-level features of the data on the basis of the BERT output; and third, the CRF applies state-transition constraints to the output of the BiLSTM layer.
Mode two: the native BERT model is partially improved. Each layer of the native BERT model understands the text differently, so the final weight of the BERT model is obtained by dynamic weight fusion. Fig. 7 is a schematic diagram of the dynamic weight fusion of the BERT model according to an embodiment of the present application. As shown in Fig. 7, the present application assigns a weight to the representation generated by the 12th Transformer layer of the BERT model and determines this first weight value through training; the second weight values corresponding to the 1st to 11th layers are then obtained, the weights corresponding to the 1st to 12th Transformer layers are averaged to obtain the final weight value, and the final weight value is reduced to 512 dimensions through one fully connected layer. The dynamically fused BERT model combined with a BLSTM-CRF model serves as the second mode of constructing a second subject recognition model.
Mode three: a second subject recognition model is constructed based on the BERT-IDCNN-CRF model. Although the IDCNN may lose some local information of the text, it can fully capture the long-distance information of a long text sequence, and is therefore suitable for recognizing long text data. Unlike the BiLSTM model, it needs only O(n) complexity to process a sentence of length n under parallel conditions; its precision is comparable to that of the BERT-BLSTM-CRF model, and its prediction speed is improved by half.
Mode four: a second subject recognition model is constructed from the improved BERT model of mode two combined with IDCNN-CRF.
In this embodiment, the second subject recognition model is not limited to the four modes described above. For example, a BiGRU model may be used to replace the BiLSTM model or the IDCNN model in the four second subject recognition models, so that features can be further extracted from the first character sequence and the second character sequence to implement the semantic encoding process.
Through the method, more than two different second main body identification models are built, and preparation is made for identifying the main body of the financial fraud information according to the different second main body identification models subsequently.
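The state-transition constraint that the CRF layer applies to the output of the preceding layer can be illustrated with a minimal Viterbi decoding sketch. In a trained CRF the transition scores are learned; here they are hard-coded (every transition scores 0 except the forbidden move from O to I-ORG), and the emission scores are invented for illustration.

```python
TAGS = ["O", "B-ORG", "I-ORG"]
NEG_INF = float("-inf")

def bio_transitions():
    """Transition scores that forbid O -> I-ORG; every other move scores 0."""
    scores = {(a, b): 0.0 for a in TAGS for b in TAGS}
    scores[("O", "I-ORG")] = NEG_INF
    return scores

def viterbi(emissions, transitions, start_forbidden=("I-ORG",)):
    """Return the highest-scoring label path.

    emissions: one dict per character, mapping label -> score.
    """
    best = {t: (NEG_INF if t in start_forbidden else emissions[0][t]) for t in TAGS}
    back = []
    for scores in emissions[1:]:
        step, pointers = {}, {}
        for cur in TAGS:
            prev = max(TAGS, key=lambda p: best[p] + transitions[(p, cur)])
            step[cur] = best[prev] + transitions[(prev, cur)] + scores[cur]
            pointers[cur] = prev
        best = step
        back.append(pointers)
    tag = max(TAGS, key=lambda t: best[t])
    path = [tag]
    for pointers in reversed(back):          # follow back-pointers to the start
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]

# Hypothetical per-character scores; picking the best label independently
# for each character would give the invalid sequence O, I-ORG, O.
emissions = [
    {"O": 1.0, "B-ORG": 0.8, "I-ORG": 0.1},
    {"O": 0.1, "B-ORG": 0.2, "I-ORG": 2.0},
    {"O": 1.5, "B-ORG": 0.1, "I-ORG": 0.3},
]
decoded = viterbi(emissions, bio_transitions())
```

Because an entity cannot begin with I-ORG, the decoder relabels the first character as B-ORG even though O has the higher individual score there.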
There is also provided in this embodiment an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
acquiring a financial document to be analyzed;
inputting financial documents to be analyzed into more than two different first main body identification models to obtain a first prediction result set, wherein the first prediction result set is composed of first prediction results corresponding to the first main body identification models, and each first prediction result comprises a plurality of financial main bodies predicted by the corresponding first main body identification models;
and determining whether the financial subjects are output as the identification result according to the times of the occurrence of the financial subjects in the first prediction result set.
It should be noted that, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and optional implementations, and details are not described again in this embodiment.
In addition, in combination with the method for identifying a financial subject provided in the above embodiments, a storage medium may also be provided in this embodiment. The storage medium has a computer program stored thereon; the computer program, when executed by a processor, implements the method of identifying a financial subject of any of the above embodiments.
It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to be limiting. All other embodiments, which can be derived by a person skilled in the art from the examples provided herein without any inventive step, shall fall within the scope of protection of the present application.
It is obvious that the drawings are only examples or embodiments of the present application, and that those skilled in the art can apply the present application to other similar cases according to the drawings without creative effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
The term "embodiment" is used herein to mean that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly or implicitly understood by one of ordinary skill in the art that the embodiments described in this application may be combined with other embodiments without conflict.
The above examples express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of patent protection. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.