OCR error correction method based on Chinese character-level features and a language model
1. A Chinese OCR error correction method based on Chinese character-level features and a language model, characterized by comprising the following steps:
S1, extracting characters together with the character structure and component information corresponding to each character from a Chinese character data set, and constructing character structure-component data;
S2, recognizing an image containing Chinese characters with an OCR model trained on an image data set, adding image noise with a data enhancement technique, re-recognizing the images carrying different noise with the OCR model, and constructing an error correction data set with a post-OCR error style from the recognition results of the original and noisy images, wherein each sample in the error correction data set contains an OCR-recognized erroneous character, its context, and the correct character;
S3, constructing a word vector training data set for word vector training from the character structure-component data and a Chinese corpus, the inputs being the context words of a target word, the characters corresponding to the context words, and the character structure and component information of the target word and of the characters corresponding to the context words, and obtaining, after training, word vectors able to distinguish character structures and components;
and S4, training a language model with the word vectors obtained in S3 as its word embedding layer, fine-tuning the language model with the error correction data set so that it adapts to the OCR recognition error style, and finally obtaining an error correction model that generates a character correction candidate set for an erroneous character and is used to correct erroneous characters produced by the OCR.
2. The Chinese OCR error correction method based on Chinese character-level features and a language model of claim 1, wherein, in the character structure-component data, each separable component is further decomposed in a recursive manner into the character structure and stroke information corresponding to that component, until none of the resulting components can be decomposed further; and wherein, during word vector training, the inputs are the context words of a target word, the characters corresponding to the context words, and the further-decomposed character structure, component and stroke information of the target word and of the characters corresponding to the context words.
3. The Chinese OCR error correction method based on Chinese character-level features and a language model of claim 1, wherein the OCR model comprises a target detection model and a character recognition model; the target detection model finds the center point, width and height of the text in the image and generates text boxes in the image from these parameters; the generated text boxes are then fed into the character recognition model, which recognizes and aligns each character in a box and outputs the most probable result.
4. The Chinese OCR error correction method based on Chinese character-level features and a language model of claim 1, wherein, before image noise is added by the data enhancement technique, different noise-adding combinations are tested to find the noise-adding mode that best simulates real quality-degraded images to be recognized, and this mode is used for data enhancement of the images.
5. The Chinese OCR error correction method based on Chinese character-level features and a language model of claim 1, wherein in S2 the images containing Chinese characters are noised using the imgaug framework.
6. The Chinese OCR error correction method based on Chinese character-level features and a language model of claim 1, wherein in S2 only text strings whose number of erroneous characters is less than one fifth of the total number of characters are selected as samples and included in the error correction data set.
7. The Chinese OCR error correction method based on Chinese character-level features and a language model of claim 1, wherein in S3 the word vector training is performed with the CBOW method.
8. The Chinese OCR error correction method based on Chinese character-level features and a language model of claim 1, wherein the language model is pre-trained on Chinese encyclopedia data before the fine-tuning.
9. The method of claim 1, wherein the language model comprises a word embedding layer, a bidirectional LSTM and a plurality of fully connected layers; after the word vectors from the word embedding layer are fed into the bidirectional LSTM, the probability distribution over candidate correction characters is output through the fully connected layers, thereby yielding the character correction candidate set.
10. The method of claim 1, wherein, for an erroneous character to be corrected that appears after OCR recognition, the correct character is selected from the character correction candidate set, by manual designation or automatic selection, to replace it.
Background
Optical Character Recognition (OCR) technology, an important part of text processing systems, acquires text information from paper or historical documents through optical input means such as scanning and photographing, and then converts it into computer-processable text with various pattern recognition algorithms. Its main application scenarios include recognition of identity cards and certificates, license plate number recognition, and the like.
OCR technology based on Deep Neural Networks (DNN) has achieved considerable accuracy; however, current work is still largely confined to relatively standard data sets. When applied to real scenes, many problems therefore arise that prevent DNN-based OCR systems from working correctly, such as loss of important picture information, overall deviation caused by picture tilt, and noise caused by poor picture quality. Many post-OCR error correction techniques have emerged to deal with the problems that image quality may cause.
However, most current error correction work targets English or other languages with few basic characters, where correction is easier because there are few character types and the similarity between characters is relatively limited. Correction is much harder for languages with many basic characters, such as Chinese and Japanese. In Chinese in particular, the 21003 basic characters of the GBK encoding make the candidate set of similar characters far too large during correction; even if only the 3755 first-level characters of the commonly used GB2312 standard are considered, this is still a huge number compared with the 52 upper- and lower-case English letters.
In addition, most current OCR error correction work operates only on data whose basic units are characters, for example language models that consider only the associations between characters and do not exploit the information inside the characters. Such work is especially common for English, since English characters are not complex. For Chinese character correction, however, the characters themselves carry rich internal information, so correction that relies on the language model alone still leaves room for improvement.
Word vectors trained with intra-character information also exist; for example, the character-enhanced Chinese Word Embedding model (CWE) improves embedding quality by jointly learning from Chinese words and the characters corresponding to each word. On the basis of CWE, the Joint learning Word Embedding (JWE) model was further proposed, which additionally uses the radical (component) information of Chinese characters to learn word vectors.
Applying these methods to the correction task brings some improvement over methods that use no character information, but for post-OCR character correction there is still room for progress, because post-OCR character errors usually involve more than one component of a character as well as the structural relations between the components. For example, the same strokes can form several different characters, such as "人" (human), "入" (enter) and "八" (eight), which are easily confused with one another during recognition, whereas a character that contains the same strokes but has a different structure should not be assigned a high similarity to them. When the stroke structure of Chinese characters is used for classification, characters can be distinguished by whether their strokes are arranged in, for example, a left-right, top-bottom or nested structure; this models the similarity between characters more faithfully and separates characters that have the same stroke information but different structure information. At present, however, related work does not use the structure information of characters.
Disclosure of Invention
To address the poor ability to correct characters with similar stroke sequences that, as described in the background art, results from not using character structure information, the invention provides a method for generating correction candidates with a language model based on the Chinese character-level stroke structure. The method can, to a certain extent, overcome recognition errors between characters that have similar stroke sequences but different character structures, and lets the model learn finer-grained character-level features, thereby improving the model's ability to correct post-OCR errors. The method can be applied to post-OCR character correction scenarios.
To achieve this purpose, the method comprises the following specific steps:
A Chinese OCR error correction method based on Chinese character-level features and a language model comprises the following steps:
S1, extracting characters together with the character structure and component information corresponding to each character from a Chinese character data set, and constructing character structure-component data;
S2, recognizing an image containing Chinese characters with an OCR model trained on an image data set, adding image noise with a data enhancement technique, re-recognizing the images carrying different noise with the OCR model, and constructing an error correction data set with a post-OCR error style from the recognition results of the original and noisy images, wherein each sample in the error correction data set contains an OCR-recognized erroneous character, its context, and the correct character;
S3, constructing a word vector training data set for word vector training from the character structure-component data and a Chinese corpus, the inputs being the context words of a target word, the characters corresponding to the context words, and the character structure and component information of the target word and of the characters corresponding to the context words, and obtaining, after training, word vectors able to distinguish character structures and components;
and S4, training a language model with the word vectors obtained in S3 as its word embedding layer, fine-tuning the language model with the error correction data set so that it adapts to the OCR recognition error style, and finally obtaining an error correction model that generates a character correction candidate set for an erroneous character and is used to correct erroneous characters produced by the OCR.
Preferably, in the character structure-component data, each separable component is further decomposed in a recursive manner into the character structure and stroke information corresponding to that component, until none of the resulting components can be decomposed further; during word vector training, the inputs are the context words of a target word, the characters corresponding to the context words, and the further-decomposed character structure, component and stroke information of the target word and of the characters corresponding to the context words.
Preferably, the OCR model comprises a target detection model and a character recognition model; the target detection model finds the center point, width and height of the text in the image and generates text boxes in the image from these parameters; the generated text boxes are then fed into the character recognition model, which recognizes and aligns each character in a box and outputs the most probable result.
Preferably, before image noise is added by the data enhancement technique, different noise-adding combinations are tested to find the noise-adding mode that best simulates real quality-degraded images to be recognized, and this mode is used for data enhancement of the images.
Preferably, in S2 the images containing Chinese characters are noised using the imgaug framework.
Preferably, in S2 only text strings whose number of erroneous characters is less than one fifth of the total number of characters are selected as samples and included in the error correction data set.
Preferably, in S3 the word vector training is performed with the CBOW method.
Preferably, the language model is pre-trained on Chinese encyclopedia data before the fine-tuning.
Preferably, the language model comprises a word embedding layer, a bidirectional LSTM and a plurality of fully connected layers; after the word vectors from the word embedding layer are fed into the bidirectional LSTM, the probability distribution over candidate correction characters is output through the fully connected layers, yielding the character correction candidate set.
Preferably, for an erroneous character to be corrected that appears after OCR recognition, the correct character is selected from the character correction candidate set, by manual designation or automatic selection, to replace it.
Compared with the prior art, the invention has the following beneficial effects:
1) through the data enhancement technique, the method exposes more salient characteristics of post-OCR errors and improves the model's performance;
2) the invention can resolve recognition errors between characters with the same strokes but different character structures;
3) the invention improves the quality of the generated correction candidate set when the context information is insufficient.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is a sample of the Chinese character structure and component data;
FIG. 3 is a sample of the Chinese character component and stroke structure data after further decomposition;
FIG. 4 is a schematic diagram of a word vector training model;
FIG. 5 is a schematic diagram of a character error correction model.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions will be described below with reference to the drawings in the embodiments of the present application. It should be noted that the described embodiments are only some of the embodiments in the present application, and not all of the embodiments.
As a preferred embodiment, the invention provides a Chinese OCR error correction method based on Chinese character-level features and a language model; its framework is shown in FIG. 1, and the specific steps are as follows:
S1, extracting characters and the character structure and component information corresponding to each character from the Chinese character data set to construct character structure-component data.
The components of a Chinese character are character-forming units that are built from strokes and used to assemble characters, and the character structure is the positional relation between the components within a character, such as the left-right structure, the top-bottom structure, the left-middle-right structure, the enclosing (nested) structure, and so on.
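As an illustration of the kind of character structure-component data built in S1, the following minimal Python sketch parses a decomposition table into a character-to-structure/components mapping. The assumed line format ("character, structure symbol, components"), the structure-symbol mapping and the file layout are assumptions for illustration; the actual format of the kanji-database files may differ.

```python
# Minimal sketch of building character structure-component data (S1).
# Assumed line format: "<character> <structure symbol> <component> <component> ...",
# e.g. "好 ⿰ 女 子". The real kanji-database layout may differ.

STRUCTURE_SYMBOLS = {
    "⿰": "left-right", "⿱": "top-bottom", "⿲": "left-middle-right",
    "⿳": "top-middle-bottom", "⿴": "full-enclosure", "⿵": "enclosure-from-above",
    "⿸": "enclosure-from-upper-left", "⿺": "enclosure-from-lower-left",
}

def load_structure_components(path):
    """Return {character: {"structure": ..., "components": [...]}}."""
    data = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.strip().split()
            if len(fields) < 3 or fields[1] not in STRUCTURE_SYMBOLS:
                continue  # skip characters that cannot be decomposed
            char, symbol, components = fields[0], fields[1], fields[2:]
            data[char] = {"structure": STRUCTURE_SYMBOLS[symbol],
                          "components": components}
    return data
```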
S2, recognizing the image containing Chinese characters with an OCR model trained on an image data set. By function, the OCR model can be divided into a target detection model and a character recognition model: the target detection model finds the center point, width and height of the text in the image and generates text boxes in the image from these parameters; the generated text boxes are then fed into the character recognition model, which recognizes and aligns each character in a box and outputs the most probable result, thereby recognizing the Chinese characters in the image. The concrete implementation of the OCR model is not limited; any existing model capable of recognizing Chinese characters in images can serve as the OCR model of the invention.
In the prior art, post-OCR error correction data suffers from a small number of samples, so an error correction model cannot be trained well, which limits the achievable performance of the correction task in the OCR field. Therefore, after the OCR model is trained, the invention uses it to construct a corresponding error correction data set: image noise is added with a data enhancement technique, the images carrying different noise are re-recognized with the OCR model, and an error correction data set with a post-OCR error style is constructed from the recognition results of the original and noisy images. Each sample in the error correction data set contains an OCR-recognized erroneous character, its context and the correct character; these samples are later used to fine-tune a general language model so that it adapts to the specific post-OCR correction domain, which further improves the language model's performance on the OCR correction task.
In addition, the specific form of the data enhancement technique is not limited; the invention uses the imgaug framework to add noise to the images containing Chinese characters. Because different noise-adding forms produce different degradation patterns, before image noise is added it is preferable to test different noise-adding combinations to find the optimal mode, i.e. the one that best simulates the real-world quality-degraded images to be recognized by OCR, and to use that mode for data enhancement of the images.
In addition, when the error correction data set is constructed, the samples are preferably screened in advance: if the number of erroneous characters is too large, accurate correction cannot be achieved later, so only text strings whose number of erroneous characters is less than one fifth of the total number of characters are selected as samples and included in the error correction data set. For example, if the total number of characters is 10, only OCR-recognized text strings with fewer than 2 erroneous characters should be included in the data set.
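A minimal sketch of this screening rule is shown below, assuming the strict "less than one fifth" criterion of the general description and equal-length OCR output and ground truth; the function names are illustrative.

```python
# Sketch of the sample-screening rule: keep only OCR output strings whose number
# of erroneous characters is less than one fifth of the total character count.

def count_errors(ocr_text, ground_truth):
    """Count character positions where the OCR output differs from the ground truth."""
    return sum(1 for o, g in zip(ocr_text, ground_truth) if o != g)

def keep_sample(ocr_text, ground_truth, max_ratio=0.2):
    if not ground_truth or len(ocr_text) != len(ground_truth):
        return False  # length mismatches are treated as unusable in this sketch
    return count_errors(ocr_text, ground_truth) / len(ground_truth) < max_ratio

# For a 10-character string, only outputs with fewer than 2 wrong characters pass.
```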
S3, constructing a word vector training data set from the character structure-component data and the Chinese corpus, and performing word vector training on it. The training can use the CBOW method; its inputs comprise the context words of a target word, the characters corresponding to the context words, and the character structure and component information of the target word and of the characters corresponding to the context words. After training, word vectors with the ability to distinguish character structures and components are obtained.
Here, Chinese encyclopedia data can serve as the Chinese corpus; based on the aforementioned character structure-component data, the characters in the corpus are supplemented with structure and component information at different granularity levels, so that each character carries its corresponding character structure and component information.
And S4, training a language model with the word vectors obtained in S3 as its word embedding layer, fine-tuning the language model with the error correction data set obtained in S2 so that it adapts to the OCR recognition error style, and finally obtaining an error correction model that generates a character correction candidate set for an erroneous character and is used to correct erroneous characters produced by the OCR.
It should be noted that, given the limited sample size of the error correction data set, the language model needs to be pre-trained before the fine-tuning. The invention pre-trains on Chinese encyclopedia data, whose sample size is large enough; with the help of a large amount of data, the word vector space of the input layer can better capture the relations between words and can be used to initialize the word vector input layer of other tasks. Subsequently fine-tuning the language model with the error correction data set obtained in S2 further adapts it to the particular OCR recognition error style.
The form of the language model used by the invention to build the error correction model is not limited; a conventional CBOW model, among others, is applicable. In the invention, the language model may comprise a word embedding layer, a bidirectional LSTM and several fully connected layers; after the word vectors from the word embedding layer are fed into the bidirectional LSTM, the probability distribution over candidate correction characters is output through the fully connected layers, and a candidate character set, i.e. the character correction candidate set, can be selected according to a preset probability threshold.
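A minimal PyTorch sketch of a correction model with this shape is given below; the layer sizes, pooling choice and threshold-based candidate selection are illustrative assumptions rather than the exact configuration of the invention.

```python
# Word embedding layer (initialized from the structure-aware word vectors),
# bidirectional LSTM, and fully connected layers producing a distribution over
# candidate correction characters.

import torch
import torch.nn as nn

class CorrectionLM(nn.Module):
    def __init__(self, pretrained_embeddings, vocab_size, hidden=256):
        super().__init__()
        emb_dim = pretrained_embeddings.size(1)
        self.embed = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=False)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, vocab_size),
        )

    def forward(self, context_ids):
        # context_ids: (batch, seq_len) token ids of the context around the error
        out, _ = self.bilstm(self.embed(context_ids))
        pooled = out.mean(dim=1)     # pool the BiLSTM states over the sequence
        return self.fc(pooled)       # logits over candidate characters

def candidate_set(model, context_ids, threshold=0.01):
    """Select candidates whose probability exceeds a preset threshold."""
    probs = torch.softmax(model(context_ids), dim=-1).squeeze(0)
    return (probs >= threshold).nonzero(as_tuple=True)[0].tolist()
```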
In actual use, the erroneous characters can be located in advance by an error-character localization algorithm; this algorithm is not the focus of the method and is not described further. For an erroneous character to be corrected after OCR recognition, the correct character is selected from the character correction candidate set, by manual designation or automatic selection, to replace it.
In addition, the character structure-component data obtained in S1 contains only characters and the character structure and component information corresponding to each character, so the minimum unit of decomposition is the component. In fact, the invention can further decompose each separable component in a character: a component is decomposed into its character structure and corresponding stroke information, and if a decomposed part can still be split, the decomposition continues recursively until none of the resulting parts can be decomposed further. Which components are treated as separable can be set according to the data set; components that are not to be split are simply excluded from the separable set and kept intact during further decomposition. The character structure-component data obtained by this further decomposition thus contains characters together with the character structure, stroke information (strokes resulting from decomposed components) and component information (the components that were not decomposed) corresponding to each character. When word vector training is performed on a training data set built from this character structure-component data, the inputs are the context words of a target word, the characters corresponding to the context words, and the further-decomposed character structure, component and stroke information of the target word and of the characters corresponding to the context words. In this way, the granularity of the character information is refined further, more character structure information is obtained, and the effect of the error correction model built from these vectors can be further improved.
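The recursive decomposition described above can be sketched as follows; the decomposition table and the set of components deliberately kept whole are assumptions for illustration.

```python
# Sketch of the recursive decomposition: every separable component is replaced by
# its structure plus sub-parts until only protected components and basic strokes remain.

KEEP_WHOLE = {"口", "门"}   # example components that are not split further

def decompose(unit, table):
    """table maps a separable component to (structure, [sub-components]);
    protected or unknown units are returned as-is (atomic components/strokes)."""
    if unit in KEEP_WHOLE or unit not in table:
        return unit
    structure, parts = table[unit]
    return {"structure": structure,
            "parts": [decompose(p, table) for p in parts]}
```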
The specific implementations and technical effects of the Chinese OCR error correction methods corresponding to these two kinds of character structure-component data are described below through an embodiment.
Examples
In this embodiment, the flow framework of the Chinese OCR error correction method based on Chinese character-level features and a language model is shown in FIG. 1; the specific steps are as follows:
1. Processing of Chinese character stroke and structure information
The Chinese character data set obtained from the open-source website kanji-database contains data for Chinese, Japanese and Korean, including commonly used character structures, pronunciations and the corresponding Unicode codes; the unneeded parts are removed, as shown in step 1 of FIG. 1. The detailed processing steps are as follows:
S1, extracting characters and the character structure and component information corresponding to each character from the Chinese character data set to construct character structure-component data;
1) From the Chinese character information in the obtained data set, the Unicode codes, the pronunciations in different languages, polyphonic readings and other unneeded information are removed, leaving only the required characters and the character structure and component information corresponding to each character, which together constitute the character structure-component data. A partial sample of this data is shown in FIG. 2; the data is denoted JSWE.
2) The character structure and component information of each character is traversed in turn, and every component that can be further decomposed is split into the character structure and stroke information corresponding to that component; parts that can still be split are split again, and this process is repeated recursively until none of the resulting parts can be split further, so that every character finally consists of component information, stroke information and character structures that cannot be decomposed further. The character structure-component data constructed in this way is denoted JSWE_meta; a partial sample is shown in FIG. 3.
Note that, when performing the further decomposition in 2), not all components are split into strokes: only the components preset as separable are decomposed further, and some components, for example "口" (mouth) and "门" (door), are kept intact.
The two kinds of character structure-component data obtained by this process, JSWE and JSWE_meta, contain Chinese character stroke and structure information at different granularities and are used in the subsequent word vector training task.
2. OCR model training and data enhancement
To handle post-OCR correction independently of the text domain, the method applies image enhancement techniques such as noise addition to the domain images and re-recognizes the noisy images with the OCR model, producing recognition errors close to those encountered in the real world. These errors alleviate the small-data problem of post-OCR correction, and with the growing data volume the model can gradually improve its correction performance in the domain. The OCR model training and data enhancement process is shown as step 2 in FIG. 1; the detailed steps are as follows:
1) The image data and the corresponding text segments are fed into an OCR model composed of an OCR text detection model and a text recognition model, and a usable OCR model is trained for subsequent use.
In this embodiment, OCR character recognition is based on CTPN and CTC. In the OCR text detection model, VGG-16 is first used as the pre-trained backbone to extract image features, an LSTM learns the sequential context features within the image features, and an RPN text-box generation algorithm finally computes the coordinates, width and height of the text boxes; this model extracts the region boxes containing text from the image. Text recognition is then performed with a CRNN-based model using CTC decoding: after the text boxes obtained in the previous stage are input, a CNN layer extracts image features, an LSTM learns the sequential features between characters, and the CTC layer finally aligns the characters and outputs them.
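The two-stage detect-then-recognize flow can be illustrated as below; `detector` and `recognizer` stand in for the CTPN-based detection model and the CRNN+CTC recognition model and are not real library APIs.

```python
# Illustrative two-stage OCR pipeline: the detector proposes text boxes as
# (center x, center y, width, height); each box is cropped and passed to a
# CTC-decoding recognizer that returns the aligned character string.

def ocr_pipeline(image, detector, recognizer):
    results = []
    for (cx, cy, w, h) in detector.detect(image):
        x0, y0 = int(cx - w / 2), int(cy - h / 2)   # convert center/size to a crop window
        crop = image[y0:y0 + int(h), x0:x0 + int(w)]
        text = recognizer.decode(crop)              # CTC-aligned character output
        results.append(((cx, cy, w, h), text))
    return results
```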
In this embodiment, the data sets used to train the OCR text detection model are ICDAR2011, ICDAR2013, ICDAR2015, The Multilingual and SWT. The ICDAR series are relatively standard picture-text data sets; it is worth mentioning that The Multilingual data contains multilingual text, while the SWT data set contains many text boxes of extremely small size and is very challenging. The data set used to train the CRNN+CTC text recognition model was organized by the GitHub user YCG09. It contains approximately 3.64 million images, with a training-to-validation ratio of 99:1; it is generated from a Chinese corpus by rendering character images with various random transformations; it covers 5990 characters, including Chinese characters, English letters and punctuation; each sample has a fixed length of 10 characters, cut from random sentences in the corpus. This embodiment subsequently refers to the data set by the user name, i.e. YCG09. The evaluation metric for the recognition model is accuracy, i.e. the proportion of correctly recognized samples among all samples: accuracy = number of correctly recognized samples / total number of samples.
the final trained model effect is shown in table 1:
TABLE 1  Final training results of the OCR recognition model

             Training set    Test set
  Accuracy   98.92%          98.05%
  Loss       0.2235          2.595
After the OCR model is trained, OCR recognition is performed on the YCG09 data set to obtain the first group of recognition results.
2) The imgaug framework is used to add noise to the images as a data enhancement technique. Several noise-adding modes, such as coordinate offset, Gaussian noise and pixel-value modification, are applied to the image data; the modes and their parameter values are combined randomly, and multiple combinations are tried to find the one that best simulates poor image quality and similar real-world conditions. The optimal noise-adding mode is then used to perform data enhancement on the image data in the data set.
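A minimal imgaug sketch of such a noise-adding combination is given below; the specific augmenters and parameter ranges are illustrative, and in practice several random combinations would be compared as described above.

```python
# Combine coordinate offset, Gaussian noise and pixel-value changes with imgaug.

import imgaug.augmenters as iaa

noiser = iaa.Sequential([
    iaa.Affine(translate_px={"x": (-3, 3), "y": (-3, 3)}),   # coordinate offset
    iaa.AdditiveGaussianNoise(scale=(0, 0.05 * 255)),        # Gaussian noise
    iaa.Multiply((0.8, 1.2)),                                # pixel-value change
], random_order=True)

def add_noise(images):
    """images: list of HxWxC uint8 numpy arrays; returns noisy copies."""
    return noiser.augment_images(images)
```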
3) The data-enhanced images are fed as new data into the trained OCR model for re-recognition, so that for the same image the OCR-recognized text strings before and after noise addition are both available. The erroneous characters in a text string can be determined from the ground-truth characters of the image, so the output strings are screened, and only strings whose number of erroneous characters is at most one fifth of the original total character count (the total is 10 in this embodiment, so the threshold is set to fewer than 3 erroneous characters) are used as error correction data for the subsequent fine-tune stage. In this way, an error correction data set YCG09_augmented with a post-OCR error style is constructed from the recognition results of the original and noisy images; each sample in the data set contains an OCR-recognized erroneous character, its context and the correct character.
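Turning the re-recognition output into correction samples can be sketched as follows, assuming the noisy recognition result and the ground truth have equal length; the context window size is illustrative.

```python
# Build (erroneous character, context, correct character) samples by comparing the
# recognition result of the noisy image with the ground-truth text.

def build_samples(noisy_ocr, ground_truth, window=2):
    samples = []
    for i, (wrong, right) in enumerate(zip(noisy_ocr, ground_truth)):
        if wrong != right:
            context = noisy_ocr[max(0, i - window):i] + noisy_ocr[i + 1:i + 1 + window]
            samples.append({"error": wrong, "context": context, "correct": right})
    return samples
```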
In this embodiment, a large amount of data is obtained by performing data enhancement on the test set, forming the error correction data set YCG09_augmented, in which each erroneous character is associated with both its context and the correct answer; this set is used for the subsequent character correction task. Its details are shown in Table 2:
TABLE 2  Data set obtained by data enhancement

                                     YCG09_augmented
  Average sample length              10
  Number of erroneous sentences      1853
  Number of erroneous characters     1913
  Data source type                   Chinese text / news
3. Error correction model based on Chinese character structure word vector
This step uses the previously processed Chinese character stroke and structure information to train word vectors, which then serve as the embedding layer of the error-correction language model. A series of data preprocessing steps then converts large amounts of open-domain Chinese text into a data set in which the target character is predicted from its context, and this data set is used to train the error correction model. As shown in step 3 of FIG. 1, training of the error correction model simultaneously uses the domain-specific error correction data, the stroke-structure information and the Chinese encyclopedia data, and the language model finally outputs the correction result. The specific implementation steps are as follows:
1) Chinese Wikipedia data is used as the training corpus. For each text segment, stop words and punctuation are removed and word segmentation is performed; the n words before and after each word are taken as its context (the context length at this stage is generally set to 2) and the word itself as the target, forming a preliminary data set. The character structure-component data JSWE and JSWE_meta at the two granularity levels are then combined with this data: the corresponding character structure, component and stroke information is attached to the characters in the corpus, forming the final word vector training data sets. In addition, a third word vector training data set is formed at another granularity level that contains only stroke information and no structure information, i.e. data similar to the JWE training data (Yu J., Jian X., Xin H., et al. Joint Embedding of Words, Characters, and Fine-grained Subcharacter Components. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017: 286-291). A language model is trained with each of the three word vector training data sets, yielding word vectors that capture Chinese character-level information. The word vector training process can be described as continuously back-propagating through a neural network language model that takes the context as the input sequence and the target word as the output; with the help of a large amount of data, the word vector space of the input layer can better capture the relations between words and can be used to initialize the word vector input layer of other tasks. The specific model structure used in this embodiment is shown in FIG. 4: on the basis of the CBOW training algorithm, component or stroke data is added, and the model inputs are the context words of the target word (w_{i-1}, w_{i+1}), the characters corresponding to the context words (c_{i-1}, c_{i+1}), and the character structure and component information of the target word and the context words (s_{i-1}, s_i, s_{i+1}), as sketched below.
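A minimal PyTorch sketch of a CBOW-style model of the kind shown in FIG. 4 is given below: embeddings of the context words, their characters and the structure/component/stroke units are averaged to predict the target word. The vocabulary sizes, dimensions and the plain full-softmax output layer are illustrative assumptions.

```python
# CBOW-style word-vector model with sub-character (structure/component/stroke) inputs.

import torch
import torch.nn as nn

class SubcharCBOW(nn.Module):
    def __init__(self, n_words, n_chars, n_subunits, dim=200):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, dim)
        self.char_emb = nn.Embedding(n_chars, dim)
        self.sub_emb = nn.Embedding(n_subunits, dim)  # structures, components, strokes
        self.out = nn.Linear(dim, n_words)

    def forward(self, ctx_words, ctx_chars, sub_units):
        # each input: (batch, k) id tensor for the respective unit type
        h = (self.word_emb(ctx_words).mean(1)
             + self.char_emb(ctx_chars).mean(1)
             + self.sub_emb(sub_units).mean(1)) / 3.0
        return self.out(h)            # logits over target words

# After training with cross-entropy against the target word id, word_emb.weight can
# initialize the word embedding layer of the correction language model.
```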
The method thus adds Chinese character structure information to word vector training: besides the context words of the target word, the components, strokes and structures of the characters corresponding to each word are fed into the network. This strengthens the word vectors' ability to distinguish characters with different structures or strokes, so that the resulting word vectors capture the structure and stroke-order properties of Chinese characters.
2) Each of the three word vectors obtained in 1) is then used to build a language model for error correction, i.e. the word vectors serve as the word embedding layer when training another language model; this language model is fine-tuned with the error correction data set so that it adapts to the OCR recognition error style, finally yielding an error correction model that generates a character correction candidate set for an erroneous character and is used to correct erroneous characters produced by the OCR.
In this embodiment, the specific structure of the language model used in this step is shown in FIG. 5: it comprises a word embedding layer, a bidirectional LSTM and several fully connected (Dense) layers. After the word vectors from the embedding layer are fed into the bidirectional LSTM, the probability distribution over candidate correction characters is output through the fully connected layers, giving the character correction candidate set. The model learns through the bidirectional LSTM and the Dense layers; the word vector training data set endows the model with the ability to distinguish character structures and strokes, while the model itself is responsible for learning the context information.
In this embodiment, the language model is pre-trained on the Chinese encyclopedia data before the fine-tuning is performed.
Using a common language-model data set in place of the scarce and highly task-specific error correction data gives the model more reliable context information, while the error correction data obtained from data enhancement is used to fine-tune the correction model so that it performs well in the specific text domain. The correction model initially relies more on the Chinese encyclopedia data than on domain-specific error data; as data enhancement brings more data, its performance in the domain keeps improving. In addition, the stroke-structure-based word vectors help the model better distinguish characters with similar strokes, further improving the correction effect.
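The pre-train-then-fine-tune schedule can be sketched as a simple two-phase loop; the optimizer settings, epoch counts and data loaders are placeholders, and `model` is assumed to return logits over candidate characters given a context batch.

```python
# Two-phase training: pre-train on encyclopedia text, then fine-tune on the
# (much smaller) OCR error-correction data set.

import torch

def run_epochs(model, loader, optimizer, loss_fn, epochs):
    model.train()
    for _ in range(epochs):
        for contexts, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(contexts), targets)
            loss.backward()
            optimizer.step()

def train_correction_model(model, encyclopedia_loader, correction_loader):
    loss_fn = torch.nn.CrossEntropyLoss()
    run_epochs(model, encyclopedia_loader,
               torch.optim.Adam(model.parameters(), lr=1e-3), loss_fn, epochs=5)
    run_epochs(model, correction_loader,
               torch.optim.Adam(model.parameters(), lr=1e-4), loss_fn, epochs=3)
    return model
```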
To further demonstrate the effect of the character structure-component data at the three granularity levels described above on word vector training and on the final error correction model, specific experimental results are given below.
For the word vector training task, the results are as follows:
In this embodiment, the data set used for both the word vectors and the error-correction candidate generation model is Chinese Wikipedia data, about 1.7 GB in size, which undergoes traditional-to-simplified conversion and word segmentation preprocessing before word vector training; the Chinese character stroke structures come from the kanji-database. The word vector training effect is verified with the wordsim-240, wordsim-297 and analogy data sets, which are manually constructed collections for Chinese word similarity comparison and Chinese word semantic-class comparison, respectively, and help evaluate the word vectors. The experimental results of CWE and JWE are compared as baselines.
When training the word vectors, the Chinese character structure data at the three different granularity levels is used. The first kind contains only stroke information and no structure information, i.e. it is similar to the JWE training data; the second adds the structure information of each character on top of the first, i.e. JSWE; the third recursively refines, on the basis of the second, all character stroke information that can still be decomposed, so that the stroke-structure information of every character consists only of structure information and the most basic strokes such as the horizontal stroke "一" and the left-falling stroke, i.e. JSWE_meta.
The effect of the trained word vector on the semantic similarity task is shown in table 3:
TABLE 3  Word vector results on the semantic similarity task

  Model        Wordsim-240    Wordsim-297
  CWE          0.5133         0.5805
  JWE          0.5367         0.6508
  JSWE         0.5513         0.6453
  JSWE_meta    0.5322         0.6474
The effect of the trained word vectors on the analogy task is shown in table 4:
TABLE 4  Word vector results on the analogy task

  Model        Total     Capital   State     Family
  CWE          0.7553    0.8420    0.8743    0.4632
  JWE          0.7651    0.8375    0.8057    0.5588
  JSWE         0.7731    0.8537    0.8000    0.5551
  JSWE_meta    0.7624    0.8463    0.7657    0.5514
The word vector semantic similarity and analogy tasks show that JSWE, which adds structure information, performs better than JWE, which has only stroke information, whereas JSWE_meta, which adds finer-grained stroke-structure information, performs worse. This indicates that an appropriate amount of character structure information has a positive effect in word-vector-level evaluation tasks, while an overly fine granularity degrades the word vectors.
For the error correction candidate generation task of the error correction model, the result is as follows:
The data sets used for the correction candidate generation task are ALLSIGHAN (the union of the three years of SIGHAN-2013, SIGHAN-2014 and SIGHAN-2015 data), the ocr_4575 data set, and the YCG09_augmented data set obtained by data enhancement in this embodiment; their details are shown in Table 5:
TABLE 5  Data sets used for the correction candidate generation task

                                     ALLSIGHAN        ocr_4575             YCG09_augmented
  Average sample length              46.14            10.15                10
  Number of erroneous sentences      8064             4575                 1853
  Number of erroneous characters     11340            5862                 1913
  Data source type                   Mandarin text    Article fragments    Mandarin text / news
For ALLSIGHAN, the erroneous characters were introduced manually, so the errors are mainly spelling errors, i.e. most erroneous characters are homophones or near-homophones of the original characters. ocr_4575 was obtained through OCR recognition, but its data comes from fairly common recent article fragments without rare text such as classical Chinese, so recognition is not difficult, the context available during correction is usually accurate, and the data set is therefore more helpful to the correction task. Part of the YCG09 data, by contrast, is drawn from dialect text, so a language model trained on Chinese Wikipedia data can hardly correct its recognition errors and needs other means of assistance.
In this embodiment, the error-correction candidate generation model is trained as a neural network using a loss whose accompanying metric is accuracy. For evaluation, considering that the number of basic Chinese characters is very large and plain accuracy cannot effectively distinguish the merits of different methods, the quality of candidate generation is evaluated in a top-N manner: if the correct character appears among the first N candidates ranked by confidence, the candidate generation is considered successful.
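The top-N criterion can be computed as in the sketch below; the function name and data layout are illustrative.

```python
# Top-N evaluation: a prediction counts as successful if the correct character
# appears among the N highest-confidence candidates.

def top_n_accuracy(candidate_lists, gold_chars, n=10):
    """candidate_lists: per-sample candidate characters sorted by confidence;
    gold_chars: the corresponding correct characters."""
    hits = sum(1 for cands, gold in zip(candidate_lists, gold_chars) if gold in cands[:n])
    return hits / len(gold_chars)
```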
In this embodiment, the candidate generation performance for N = 10, 20 and 50 is verified on the three data sets, with JWE used as the baseline for comparison; the experimental results are shown in Tables 6, 7 and 8:
TABLE 6  Correction candidate generation, top-10 results

  Top-10       ALLSIGHAN    ocr_4575    YCG09_augmented
  JWE          0.3654       0.3317      0.1716
  JSWE         0.3509       0.3213      0.1705
  JSWE_meta    0.3812       0.3384      0.1775
TABLE 7  Correction candidate generation, top-20 results

  Top-20       ALLSIGHAN    ocr_4575    YCG09_augmented
  JWE          0.4344       0.3822      0.2057
  JSWE         0.4195       0.3665      0.2030
  JSWE_meta    0.4428       0.3951      0.2116
TABLE 8  Correction candidate generation, top-50 results

  Top-50       ALLSIGHAN    ocr_4575    YCG09_augmented
  JWE          0.5306       0.4553      0.2555
  JSWE         0.5118       0.4360      0.2517
  JSWE_meta    0.5399       0.4682      0.2625
As can be seen from the results in Tables 6, 7 and 8, the candidate generation model built from the JSWE_meta word vectors, which further refine the granularity of the character information and therefore contain more structure information, performs best; JSWE, which only adds part of the structure information without refining the granularity, instead performs worse. This indicates that, for the correction candidate generation model, the structure data of characters is more important than the stroke data during word vector training, because in the finer-grained JSWE_meta data set every character that can still be split is recursively deconstructed, so all characters carry more structure data.
The above-described embodiments are merely preferred embodiments of the present invention, which should not be construed as limiting the invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, the technical scheme obtained by adopting the mode of equivalent replacement or equivalent transformation is within the protection scope of the invention.