Bill amount identification method and device, computer equipment and storage medium
1. A bill amount identification method, characterized by comprising the following steps:
identifying the bill through a character identification algorithm to obtain an identification text corresponding to one or more slices;
extracting upper-case number characters and unit characters from the recognition text;
calculating the upper-case amount value corresponding to the upper-case number characters and the unit characters;
extracting lower case amount characters from the identification text and determining a lower case amount numerical value;
verifying the upper case amount value and the lower case amount value;
and extracting the amount of the bill according to the verification result.
2. A bill amount recognition method according to claim 1, wherein the step of extracting upper-case numeric characters and unit characters from the recognition text comprises:
truncating the recognition text according to the characteristics of the head and the tail characters;
constructing a regular matching item to extract characters of the cut recognition text, wherein the regular matching item comprises a plurality of preset candidate upper-case-number characters and a plurality of candidate unit characters;
and correcting the extracted upper-case number characters or unit characters.
3. A bill amount recognition method according to claim 2, wherein the step of correcting the extracted upper-case numeric characters or unit characters comprises any one of:
correcting the shape-near characters in the upper-case numeric characters or the unit characters based on the shape-near character dictionary;
correcting the upper case number characters or unit characters based on a rule base, wherein the rule base comprises a structural sequence relation between upper case money and units;
correcting the error of the unit characters according to the sequence of the units arranged from big to small;
scoring and correcting the upper-case numeric characters or unit characters according to the four-corner coding and the FASPell coding; wherein the scoring formula is as follows:
$S = S_{code} + 0.5 \times S_{structure} + 0.25 \times S_{write}$
in the above formula, $S$ is the total score of a character in the candidate set; $S_{code}$ is the number of matching digits between the four-corner codes of the misrecognized character and the candidate character; $S_{structure}$ is the structural comparison coefficient; $S_{write}$ is the stroke similarity coefficient;
correcting the repeated or missing unit characters according to the structural relationship between the numbers and the units;
and expanding the incomplete recognition text according to the head and tail characters.
4. A bill amount recognition method according to any one of claims 1 to 3, wherein the step of calculating the upper-case amount value corresponding to the unit character comprises:
assigning corresponding attributes and numerical values to each of the unit characters, wherein the attributes comprise numerical attributes and unit attributes;
generating a number sequence containing a plurality of elements according to the corresponding attributes and numerical values of the unit characters;
calculating an upper-case amount value according to the sequence; the calculation formula is as follows:
where C represents an element in the sequence and len(C) represents the length of the sequence.
5. A bill amount recognition method according to claim 1, wherein the step of extracting lower case amount characters from the recognition text and determining a lower case amount value comprises:
extracting the lower case amount character based on the prefix character or the position information;
converting the lower case amount characters into lower case amount numerical values;
and checking the lower case amount value according to the unit character.
6. The bill amount recognition method according to claim 5, wherein the step of verifying the lower case amount character from the unit character comprises:
under the condition that the bill is a tax-exempt value-added tax invoice, if the bill contains two identical lower-case amount values, the two identical lower-case amount values are determined to be correct;
and under the condition that the bill is the non-tax-exempt value-added tax invoice, if the bill contains three lower-case amount values and the sum of the amounts of two lower-case amount values is equal to the third lower-case amount value, determining that the three lower-case amount values are correct.
7. The bill amount identification method according to claim 1, wherein the step of verifying the upper case amount value and the lower case amount value includes:
under the condition that the upper-case amount value and the lower-case amount value are consistent, taking the upper-case amount value or the lower-case amount value as a verification result;
under the condition that only the upper-case amount value or the lower-case amount value is analyzed, taking the analyzed upper-case amount value or the analyzed lower-case amount value as a verification result;
taking the lower case amount value as the verification result in the case that the upper case amount value is a component of the lower case amount value;
taking the upper case amount value as the verification result when the lower case amount value is smaller than the upper case amount value;
and taking the upper-case amount value as the verification result in all other cases.
8. An apparatus for recognizing an amount of a bill, comprising:
the bill acquisition module is suitable for identifying the bill through a character identification algorithm so as to obtain an identification text corresponding to one or more slices;
the upper-case extraction module is suitable for extracting upper-case number characters and unit characters from the recognition text;
the upper-case value calculation module is suitable for calculating the upper-case amount value corresponding to the upper-case number characters and the unit characters;
the lower case extraction module is suitable for extracting lower case amount characters from the identification text and determining a lower case amount value;
the numerical value checking module is used for checking the upper-case amount numerical value and the lower-case amount numerical value;
and the amount determining module is suitable for extracting the amount of the bill according to the verification result.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented by the processor when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
Background
Bill recognition is an important application field of artificial intelligence. By automatically recognizing each key field of a bill, especially the amount, it can effectively reduce the cost of manual accounting and enable automated expense reimbursement. The prior art generally offers the following solutions for structured recognition in bill scenarios: (1) performing OCR (Optical Character Recognition) on the whole image, correcting the OCR result based on a domain dictionary, and extracting the required key field information based on a fixed field slice or a fixed area; (2) making a custom recognition template, and recognizing characters in a target area based on reference characters with fixed positions and fixed contents, so as to achieve structured recognition of pictures with the same format; (3) customizing a positioning or segmentation model to search for the required field area, and customizing a recognition model according to the field type (such as digits, English letters, symbols and Chinese characters).
However, custom template matching is only suitable for rigid documents, such as identity cards, that are not easily folded or wrinkled; for paper bills, the recognition success rate is low because the paper deforms easily. Dedicated positioning, segmentation or recognition models are costly to develop and port poorly. When a bill is printed with an offset, amount extraction is hindered. There is a risk of misrecognizing other characters as upper-case number characters, and when any character in an upper-case amount is misrecognized, the amount cannot be converted into a numerical value. Finally, when the recognized upper-case and lower-case amount values are inconsistent, there is no rule for choosing between them.
Therefore, the conventional techniques have a limited ability to extract and correct bill amounts, and the success rate of parsing the total amount depends heavily on the OCR recognition result. This reduces efficiency in bill recognition scenarios such as reimbursement, requires manual intervention, lengthens the processing cycle, and increases enterprise management costs.
Disclosure of Invention
The invention aims to provide a technical scheme capable of quickly and accurately identifying the sum of bills so as to solve the problems in the prior art.
In order to achieve the aim, the invention provides a bill amount identification method, which comprises the following steps:
identifying the bill through a character identification algorithm to obtain an identification text corresponding to one or more slices;
extracting upper-case number characters and unit characters from the recognition text;
calculating the upper-case amount value corresponding to the upper-case number characters and the unit characters;
extracting lower case amount characters from the identification text and determining a lower case amount numerical value;
verifying the upper case amount value and the lower case amount value;
and extracting the amount of the bill according to the verification result.
According to the bill amount recognition method provided by the present invention, the step of extracting upper-case numeric characters and unit characters from the recognition text comprises:
truncating the recognition text according to the characteristics of the head and the tail characters;
constructing a regular matching item to extract characters of the cut recognition text, wherein the regular matching item comprises a plurality of preset candidate upper-case-number characters and a plurality of candidate unit characters;
and correcting the extracted upper-case number characters or unit characters.
According to the bill amount recognition method provided by the present invention, the step of correcting the extracted upper case number characters or unit characters includes any one of the following steps:
correcting the shape-near characters in the upper-case numeric characters or the unit characters based on the shape-near character dictionary;
correcting the upper case number characters or unit characters based on a rule base, wherein the rule base comprises a structural sequence relation between upper case money and units;
correcting the error of the unit characters according to the sequence of the units arranged from big to small;
scoring and correcting the upper-case numeric characters or unit characters according to the four-corner coding and the FASPell coding; wherein the scoring formula is as follows:
$S = S_{code} + 0.5 \times S_{structure} + 0.25 \times S_{write}$
in the above formula, $S$ is the total score of a character in the candidate set; $S_{code}$ is the number of matching digits between the four-corner codes of the misrecognized character and the candidate character; $S_{structure}$ is the structural comparison coefficient; $S_{write}$ is the stroke similarity coefficient;
correcting the repeated or missing unit characters according to the structural relationship between the numbers and the units;
and expanding the incomplete recognition text according to the head and tail characters.
According to the bill amount recognition method provided by the invention, the step of calculating the upper-case amount value corresponding to the unit characters comprises the following steps:
assigning corresponding attributes and numerical values to each of the unit characters, wherein the attributes comprise numerical attributes and unit attributes;
generating a number sequence containing a plurality of elements according to the corresponding attributes and numerical values of the unit characters;
calculating an upper-case amount value according to the sequence; the calculation formula is as follows:
where C represents an element in the sequence and len(C) represents the length of the sequence.
According to the bill amount recognition method provided by the invention, the steps of extracting the lower case amount characters from the recognition text and determining the lower case amount value comprise the following steps:
extracting the lower case amount character based on the prefix character or the position information;
converting the lower case amount characters into lower case amount numerical values;
and checking the lower case amount value according to the unit character.
According to the bill amount recognition method provided by the invention, the step of verifying the lower case amount character according to the unit character comprises the following steps:
under the condition that the bill is a tax-exempt value-added tax invoice, if the bill contains two identical lower-case amount values, the two identical lower-case amount values are determined to be correct;
and under the condition that the bill is the non-tax-exempt value-added tax invoice, if the bill contains three lower-case amount values and the sum of the amounts of two lower-case amount values is equal to the third lower-case amount value, determining that the three lower-case amount values are correct.
According to the method for extracting the bill amount provided by the invention, the step of verifying the upper case amount value and the lower case amount value comprises the following steps:
under the condition that the upper-case amount value and the lower-case amount value are consistent, taking the upper-case amount value or the lower-case amount value as a verification result;
under the condition that only the upper-case amount value or the lower-case amount value is analyzed, taking the analyzed upper-case amount value or the analyzed lower-case amount value as a verification result;
taking the lower case amount value as the verification result in the case that the upper case amount value is a component of the lower case amount value;
taking the upper case amount value as the verification result when the lower case amount value is smaller than the upper case amount value;
and taking the upper-case amount value as the verification result in all other cases.
In order to achieve the above object, the present invention further provides a bill amount recognition apparatus, including:
the bill acquisition module is suitable for identifying the bill through a character identification algorithm so as to obtain an identification text corresponding to one or more slices;
the upper-case extraction module is suitable for extracting upper-case number characters and unit characters from the recognition text;
the upper-case value calculation module is suitable for calculating the upper-case amount value corresponding to the upper-case number characters and the unit characters;
the lower case extraction module is suitable for extracting lower case amount characters from the identification text and determining a lower case amount value;
the numerical value checking module is used for calculating the lower case amount numerical value corresponding to the lower case amount character and checking the upper case amount numerical value and the lower case amount numerical value;
and the amount determining module is suitable for extracting the amount of the bill according to the verification result.
To achieve the above object, the present invention further provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
To achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the above method.
The bill amount recognition method, device, computer equipment and storage medium provided by the invention extract the upper-case number characters and the lower-case amount characters from the bill, calculate the upper-case amount value and the lower-case amount value respectively, and determine the bill amount by verifying the two values against each other. Because both the upper-case and the lower-case amounts are taken into account, an error in recognizing either one does not by itself produce a wrong output amount, which effectively improves the accuracy of amount recognition. The invention identifies the bill amount by searching on image features, without customizing a dedicated recognition model, which can effectively reduce development cost and improve the efficiency of bill amount recognition.
Drawings
FIG. 1 is a flow chart of a first embodiment of the bill amount identification method of the present invention;
FIG. 2 is a schematic flow chart of calculating an upper-case amount value according to one embodiment of the present invention;
FIG. 3 is a schematic flow chart of extracting upper-case number characters and unit characters according to one embodiment of the present invention;
FIG. 4 is a schematic diagram of program modules of a first embodiment of the bill amount extraction device of the present invention;
FIG. 5 is a schematic diagram of a hardware configuration of a first embodiment of the bill amount extraction device of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
Referring to fig. 1, the present embodiment provides a method for identifying a bill amount, including the following steps:
S100, identifying the bill through a character recognition algorithm to obtain recognition texts corresponding to one or more slices.
The character recognition algorithm in this embodiment may be any prior-art algorithm that can recognize text from paper or pictures, such as an OCR algorithm. OCR (Optical Character Recognition) is a technology that uses electronic equipment to convert a paper document into an image file and then converts the characters in the image into text by a character recognition method, for further editing and processing by word processing software. For the bill to be recognized, the OCR algorithm obtains slices of the character areas according to image features and recognizes the text contained in each slice.
S200, extracting upper-case number characters and unit characters from the recognition text.
It will be appreciated that an upper-case bill amount contains at least one upper-case number character and one unit character. For example, in "壹佰圆" (one hundred yuan), "壹" is the upper-case number character and "佰" and "圆" are unit characters. Further, since "圆" itself has no mathematical meaning, it can be deleted as a redundant character in subsequent processing. To improve recognition efficiency, this embodiment may first filter the recognition texts of all slices with a regular matching algorithm, retaining only the recognition texts that contain at least one preset upper-case number character and at least one preset unit character. The regular matching algorithm in this embodiment is implemented by constructing a regular matching item comprising a plurality of preset upper-case number characters, such as 零, 壹, 贰, 叁, 肆, 伍, 陆, 柒, 捌 and 玖, and a plurality of preset unit characters, such as 圆, 角, 分, 整, 拾, 佰, 仟, 万 and 亿. Filtering the recognition texts in this way effectively reduces the data volume and improves the efficiency of amount extraction.
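The filtering described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the two character sets are assumed from the standard Chinese financial numerals, and the names `UPPER_DIGITS`, `UNIT_CHARS` and `filter_slices` are hypothetical.

```python
import re

# Assumed character sets (standard Chinese financial numerals and units).
UPPER_DIGITS = "零壹贰叁肆伍陆柒捌玖"
UNIT_CHARS = "拾佰仟万亿圆元角分整"

# A slice is kept only if it contains at least one upper-case number
# character and at least one unit character.
HAS_DIGIT = re.compile(f"[{UPPER_DIGITS}]")
HAS_UNIT = re.compile(f"[{UNIT_CHARS}]")


def filter_slices(texts):
    """Keep the recognition texts that may contain an upper-case amount."""
    return [t for t in texts if HAS_DIGIT.search(t) and HAS_UNIT.search(t)]


# Hypothetical OCR output, one string per slice.
slices = ["价税合计(大写)", "肆仟伍佰圆整", "¥4500.00", "开票日期"]
candidates = filter_slices(slices)
print(candidates)
```

Only the slice holding the upper-case amount survives the filter, so the later steps operate on far less text.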
S300, calculating the upper-case amount value corresponding to the upper-case character string.
The upper-case character string in the present application may be composed of some of the above upper-case number characters and some of the unit characters; for example, "壹佰" (one hundred) is an upper-case character string. It will be appreciated that an upper-case character string is merely text in character form, and this step converts that text into a specific numerical value. FIG. 2 shows a schematic flow chart of the calculation of the upper-case amount value according to the present embodiment. As shown in FIG. 2, step S300 includes:
S310, assigning corresponding attributes and numerical values to each character, wherein the attributes comprise numerical attributes and unit attributes.
Whether a target character has the numerical attribute or the unit attribute is determined by whether it belongs to the preset upper-case number characters or to the preset unit characters. For example, the target character "叁" (three) belongs to the preset upper-case number characters, so its attribute is the numerical attribute; the target character "万" (ten thousand) belongs to the preset unit characters, so its attribute is the unit attribute. In this embodiment, each character is assigned a numerical value according to its attribute. The value of a character with the numerical attribute is the number it denotes; for example, the character "叁" has the numerical attribute and the corresponding value 3. The value of a character with the unit attribute is its order of magnitude; for example, the character "万" has the unit attribute and the corresponding value 10000.
S320, generating a number sequence containing a plurality of elements from the characters according to their corresponding attributes and numerical values.
For example, the upper-case character string is "肆仟伍佰" (four thousand five hundred), wherein "肆" has the numerical attribute and corresponds to the value 4; "仟" has the unit attribute and corresponds to the value 1000; "伍" has the numerical attribute and corresponds to the value 5; and "佰" has the unit attribute and corresponds to the value 100. Arranging the values of the characters in their original order gives the number sequence [4, 1000, 5, 100], whose length is 4, that is, it contains 4 elements.
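Steps S310 and S320 can be sketched as a pair of lookup tables followed by a single pass over the string. The tables and the function name `to_sequence` are assumptions made for illustration; the source does not specify how attributes are stored.

```python
# Assumed value tables for the two attributes described in S310.
DIGIT_VALUES = {"零": 0, "壹": 1, "贰": 2, "叁": 3, "肆": 4,
                "伍": 5, "陆": 6, "柒": 7, "捌": 8, "玖": 9}
UNIT_VALUES = {"拾": 10, "佰": 100, "仟": 1000, "万": 10000, "亿": 100000000}


def to_sequence(text):
    """Return (attribute, value) pairs for the characters of text, in order."""
    seq = []
    for ch in text:
        if ch in DIGIT_VALUES:
            seq.append(("number", DIGIT_VALUES[ch]))
        elif ch in UNIT_VALUES:
            seq.append(("unit", UNIT_VALUES[ch]))
        # Characters with no numeric meaning (e.g. 圆, 整) are skipped here.
    return seq


pairs = to_sequence("肆仟伍佰")
values = [value for _, value in pairs]
print(values)  # [4, 1000, 5, 100]
```

Keeping the attribute next to each value makes it possible for later steps to tell a number from a unit without consulting the tables again.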
S330, calculating an upper-case amount value according to the sequence; the calculation formula is as follows:
In the above formula, C represents an element in the number sequence, and len(C) represents the length of the number sequence. Taking the sequence [4, 1000, 5, 100] as an example, the length len(C) is 4, with C0 = 4, C1 = 1000, C2 = 5 and C3 = 100. The upper-case amount value corresponding to [4, 1000, 5, 100] is 4 × 1000 + 5 × 100 = 4500.
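The worked example is consistent with summing the products of consecutive number/unit pairs. The sketch below implements only that reading, which is an assumption: it requires a strictly alternating sequence and does not handle compound units (a unit applied to a whole group) or a trailing number without a unit. The function name `upper_amount` is hypothetical.

```python
def upper_amount(seq):
    """Sum number * unit over consecutive pairs of an alternating sequence."""
    if len(seq) % 2 != 0:
        raise ValueError("expected alternating number/unit elements")
    return sum(seq[i] * seq[i + 1] for i in range(0, len(seq), 2))


total = upper_amount([4, 1000, 5, 100])
print(total)  # 4500
```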
S400, extracting the lower-case amount characters from the recognition text and determining the lower-case amount value.
The lower-case amount characters in this embodiment may be extracted in two ways: based on prefix characters, or based on position information. The prefix characters may include the RMB symbol "¥", the Chinese characters for "lower case:" and the Chinese characters for "lower-case total:", etc. It will be appreciated that a bill typically has the amount printed in numeric form after such a prefix, so the lower-case amount characters can be extracted quickly via the prefix characters. In addition, for bills on which no prefix is printed, the lower-case amount characters can be extracted based on a fixed location, where the fixed location can be a fixed field location, such as the ordinal number of the field in the recognition text, or a fixed coordinate location, such as a preset coordinate range.
On the basis of extracting the lower case amount characters, the lower case amount characters can be directly converted into corresponding lower case amount numerical values in the step. Because the characters of the lowercase amount are characters in an Arabic numeral form, the characters in the Arabic numeral form can be directly converted into corresponding Arabic numerals, namely the lowercase amount values, according to the preset mapping relation. For example, for the lower case amount character 1234, the character 1 may be directly converted to the number 1, the character 2 may be directly converted to the number 2, the character 3 may be converted to the number 3, the character 4 may be converted to the number 4, and the numbers obtained after conversion may be arranged in order to obtain the lower case amount value 1234.
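Prefix-based extraction and conversion can be sketched with one regular expression. The prefixes "¥", "小写" and "小写合计" are assumed spellings of the prefix characters named above, and `extract_lower_amount` is a hypothetical helper; `Decimal` is used so that the cents are not subject to floating-point rounding.

```python
import re
from decimal import Decimal

# Assumed prefixes: the RMB sign and the labels 小写 / 小写合计, each
# optionally followed by a colon, then an Arabic-numeral amount.
PREFIX_AMOUNT = re.compile(
    r"(?:[¥￥]|小写合计[::]?|小写[::]?)\s*(\d[\d,]*(?:\.\d{1,2})?)"
)


def extract_lower_amount(text):
    """Return the lower-case amount following a known prefix, or None."""
    match = PREFIX_AMOUNT.search(text)
    if match is None:
        return None
    # Thousands separators are dropped before the numeric conversion.
    return Decimal(match.group(1).replace(",", ""))


print(extract_lower_amount("(小写)¥1234.00"))
print(extract_lower_amount("小写:4500"))
print(extract_lower_amount("开票日期"))
```

Position-based extraction is not shown; it would select a field by index or by coordinates instead of by prefix.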
The present embodiment may further verify the lower-case amount value according to a unit character. A bill typically contains both upper-case number characters and lower-case amount characters, so the lower-case amount value can be verified against the upper-case amount. To improve checking efficiency, this embodiment does not compare the upper-case number characters and the lower-case amount value digit by digit; it only uses the extracted unit characters to judge whether the order of magnitude of the lower-case amount value is correct. Specifically, the unit character used for verification is the first unit character in left-to-right order; for example, in "肆仟伍佰" the first unit character is "仟". According to step S300, "仟" corresponds to a value of 1000. Checking the extracted lower-case amount values against 1000 filters out values whose order of magnitude does not match, such as 500 or 450. It should be noted that, when the bill quality is poor, the leading characters of an upper-case amount may be lost through overlapping or blurred printing, so that only its trailing part can be extracted. This step therefore checks only a minimum order of magnitude and does not check a specific value. For example, when the unit character is "仟", the check passes if the lower-case amount is 4500, 5500, 34500, etc., but fails if the lower-case amount is 999.
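Read as a lower bound, the order-of-magnitude check reduces to a single comparison. The sketch below assumes that reading; `passes_magnitude_check` is a hypothetical name.

```python
def passes_magnitude_check(lower_amount, first_unit_value):
    """True if the lower-case amount is at least the first unit's value."""
    return lower_amount >= first_unit_value


# With a first unit character worth 1000:
results = {amount: passes_magnitude_check(amount, 1000)
           for amount in (4500, 5500, 34500, 999)}
print(results)  # {4500: True, 5500: True, 34500: True, 999: False}
```

Because only a minimum is enforced, an amount whose leading upper-case characters were lost (for example 34500 against a first unit of 1000) still passes.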
Further, in the case that the bill includes a plurality of lower-case amount values, this embodiment may also verify those values against one another. Specifically, if the bill is a tax-exempt value-added tax invoice and two identical lower-case amount values are extracted, the two identical lower-case amount values can be determined to be correct; if the bill is a non-tax-exempt value-added tax invoice and three lower-case amount values are extracted, of which the sum of two equals the third, the three lower-case amount values are determined to be correct. These checking rules follow from the characteristics of value-added tax invoices and allow such invoices to be checked quickly.
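The mutual check can be sketched as below, reading the first case as a tax-exempt invoice (amount equals total) and the second as a taxed invoice (amount plus tax equals total). That reading, the function name `cross_check` and the use of `Decimal` are assumptions for illustration.

```python
from decimal import Decimal


def cross_check(amounts, tax_exempt):
    """Check lower-case amounts of a VAT invoice against one another."""
    if tax_exempt:
        # Tax-exempt: the same amount is expected to appear twice.
        return len(amounts) == 2 and amounts[0] == amounts[1]
    if len(amounts) != 3:
        return False
    # Taxed: for non-negative values the largest must be the sum of the rest.
    smallest, middle, largest = sorted(amounts)
    return smallest + middle == largest


taxed_ok = cross_check(
    [Decimal("100.00"), Decimal("13.00"), Decimal("113.00")], tax_exempt=False)
exempt_ok = cross_check([Decimal("50.00"), Decimal("50.00")], tax_exempt=True)
print(taxed_ok, exempt_ok)  # True True
```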
S500, verifying the upper-case amount value and the lower-case amount value.
The verification in this step is stricter than the verification in step S400; it is a detailed verification of the specific values. It will be understood that the upper-case amount value and the lower-case amount value extracted from the bill may be the same or different, and this embodiment sets the following verification rules according to the relationship between the two values:
(1) When the upper-case amount value and the lower-case amount value are consistent, the upper-case amount value or the lower-case amount value is taken as the verification result.
(2) When only the upper-case amount value or only the lower-case amount value has been parsed, the parsed value is taken as the verification result.
(3) When the upper-case amount value is a component of the lower-case amount value, the lower-case amount value is taken as the verification result. For example, if the upper-case amount value is 23.45 and the lower-case amount value is 1223.45, the upper-case amount value is contained in the lower-case amount value, so the lower-case amount value 1223.45 is taken as the verification result.
(4) When the lower-case amount value is smaller than the upper-case amount value, the upper-case amount value is taken as the verification result.
(5) In all other cases, the upper-case amount value is taken as the verification result.
By setting the verification rule, the embodiment can ensure that the correct amount value is obtained under the condition that the upper-case amount and the lower-case amount are inconsistent.
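The five rules can be sketched as an ordered chain of checks. The sketch assumes amounts are carried as decimal strings, that "component" means the upper-case amount appears as a substring of the lower-case amount, and that the rules are tried in the order listed; `verify` is a hypothetical name.

```python
from decimal import Decimal


def verify(upper, lower):
    """Pick the amount to report; inputs are strings like "23.45" or None."""
    # Rule (2): only one of the two amounts was parsed.
    if upper is None or lower is None:
        return upper if upper is not None else lower
    # Rule (1): the two amounts agree numerically.
    if Decimal(upper) == Decimal(lower):
        return upper
    # Rule (3): the upper-case amount is a component of the lower-case one.
    if upper in lower:
        return lower
    # Rules (4) and (5): in every remaining case the upper-case amount wins.
    return upper


print(verify("23.45", "1223.45"))  # 1223.45
print(verify("4500", "450"))       # 4500
print(verify(None, "12.00"))       # 12.00
```

Rules (4) and (5) yield the same result, so the sketch does not need to distinguish them.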
S600, returning the extracted amount of the bill according to the verification result.
Specifically, the upper-case or lower-case amount value determined in the verification result is taken as the final extracted amount of the target bill and returned to the user.
In summary, the present embodiment fully considers the characteristics of the total amount field, has strong adaptability and error correction capability to the incorrect OCR recognition result, and can give the total amount with higher confidence by integrating the upper and lower case amounts of the ticket. The bill amount identification method provided by the embodiment can be widely applied to the scenes of using bills, such as financial reimbursement, automatic claim settlement entry and bookkeeping, so that the automatic extraction and entry of total amount are realized, the management efficiency is improved, and the labor cost of enterprises is effectively reduced.
FIG. 3 is a schematic flow chart of extracting upper-case number characters and unit characters according to an embodiment of the present invention. As shown in FIG. 3, step S200 includes:
S210, constructing a regular matching item to extract characters from all the recognition texts, wherein the regular matching item comprises a plurality of preset candidate upper-case number characters and a plurality of candidate unit characters.
The candidate upper-case number characters can include 零, 壹, 贰, 叁, 肆, 伍, 陆, 柒, 捌, 玖, etc., and the candidate unit characters can include 圆, 角, 分, 整, 拾, 佰, 仟, 万, 亿, etc. This step directly extracts the upper-case number characters and the unit characters in the recognition text by constructing a regular expression comprising the candidate upper-case number characters and the candidate unit characters.
S220: truncate or expand the extracted characters according to the characteristics of the head and tail characters.
It will be appreciated that an upper-case amount is generally composed of upper-case numeric characters and unit characters, with each numeric character placed before its unit character. This step therefore truncates or expands the recognition text based on the characteristic that an upper-case amount has a number at its head and a unit at its tail. Truncation removes redundant characters from the current recognition text; the redundant characters may include a zero preceding a numeric character and unit characters such as 元, 圆 and 整. For example, if a recognition text consists of the upper-case numeral "eight" (捌) followed by such unit characters, only "eight" is retained after truncation. Conversely, if the first character of the current recognition text is a unit character, or its last character is an upper-case numeric character, part of the upper-case amount has been cut off into the previous or the next recognition text. In that case the current recognition text is expanded so that the characters belonging to the upper-case amount in the previous or next recognition text are merged into it.
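The head/tail rule of step S220 can be sketched as follows. This is only an illustration under stated assumptions: the helper names `expand` and `truncate` are hypothetical, the extracted texts are assumed to be an ordered list, and redundant characters are assumed to occur only at the two ends of a text.

```python
# Assumed character sets; the embodiment gives them only as examples.
UPPER_DIGITS = set("零壹贰叁肆伍陆柒捌玖")
UNIT_CHARS = set("拾佰仟万亿元圆角分整")


def expand(runs: list[str], i: int) -> str:
    """Merge neighbouring texts into runs[i] when the amount was cut off."""
    run = runs[i]
    # A leading unit means the numeral before it sits in the previous text.
    if run and run[0] in UNIT_CHARS and i > 0:
        run = runs[i - 1] + run
    # A trailing numeral means its unit sits in the next text.
    if run and run[-1] in UPPER_DIGITS and i + 1 < len(runs):
        run = run + runs[i + 1]
    return run


def truncate(run: str) -> str:
    """Drop leading zeros and trailing redundant units (元, 圆, 整)."""
    return run.lstrip("零").rstrip("元圆整")


if __name__ == "__main__":
    pieces = ["伍佰叁", "拾元整"]
    print(truncate(expand(pieces, 0)))
```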
S230: correct the extracted upper-case amount character string.
This step corrects the upper case amount from several aspects, with the aim of improving the accuracy of the amount extraction. The specific error correction mode includes any one or more of the following modes:
(1) Correcting near-shape (visually similar) characters in the upper-case numeric characters or the unit characters based on a near-shape character dictionary. The near-shape character dictionary may contain mappings from a plurality of error-prone characters to the correct characters. When an error-prone character is found, it is replaced directly with the corresponding correct character; for example, a misrecognized character meaning 'other' is corrected to the similar-looking numeral 'eight'.
(2) Correcting the upper-case numeric characters or unit characters based on a rule base, wherein the rule base contains the structural order relationship between upper-case numerals and units. The rule base may be built on the 'number-unit-number-unit' structure of upper-case amounts. For example, the numeral 'five' (Wu) and the unit 'hundred' (Bai) are easily mistaken for each other, and are corrected according to the structural information 'digit-Bai-digit' or 'unit-Wu-unit'. For an upper-case amount that starts with zero, the first unit is yuan. To simplify the subsequent flow, the zero before a number, the yuan after a unit, and the final 'whole' (zheng) character are removed.
(3) Correcting the unit characters according to the order of the unit characters from large to small. From large to small, the unit characters include hundred million, ten thousand, thousand, hundred, ten, yuan, jiao, fen and the like. In this embodiment the unit characters are corrected against this order; for example, an erroneous unit that follows 'hundred' (bai) is corrected to 'ten' (shi), and an erroneous unit that precedes 'ten' (shi) is corrected to 'hundred' (bai). In addition, when the upper-case characters contain a zero, the unit characters may be discontinuous, and at least one unit character should be skipped at the corresponding position; for example, the unit 'ten' (shi) is skipped in 'five hundred zero three yuan whole'.
(4) Scoring and correcting the upper-case numeric characters or unit characters according to four-corner coding and FASPell coding.
It will be appreciated that text errors introduced by OCR are mainly near-shape character errors. On this basis, this embodiment calculates a score for each character in the candidate set using four-corner coding (which represents the single-stroke or multi-stroke shapes at the upper-left, upper-right, lower-left and lower-right corners of a Chinese character with 4 to 5 digits) and FASPell coding (which represents the structure and stroke information of a character with a variable-length string), and takes a character whose score is higher than the threshold as the error correction result. When two candidate characters have the same four-corner coding score, this step additionally refers to structural similarity (left-right structure, top-bottom structure, fully enclosed structure and the like) and stroke similarity (for example, the character for 'thousand', 仟, has 5 strokes). The specific scoring formula is as follows:
$$S = S_{code} + 0.5 \cdot S_{structure} + 0.25 \cdot S_{write}$$
In the above formula, $S$ is the total score of a candidate character; $S_{code}$ is the number of four-corner code digits that match between the misrecognized character and the candidate character; $S_{structure}$ is the structure consistency coefficient, which is 1 when the structures are consistent and 0 otherwise; and $S_{write}$ is the stroke similarity coefficient, which is 1 when the absolute value of the stroke-count difference is less than 3 and 0 otherwise. The weights (1, 0.5, 0.25) ensure decreasing significance among the three similarities. For a misrecognized character, a score is calculated for each candidate character in the candidate character set. Specifically, four-corner coding is applied to the misrecognized character and to the candidate character to obtain the number of matching digits $S_{code}$ between them, and FASPell coding is applied to both characters to obtain the structure consistency coefficient $S_{structure}$ and the stroke similarity coefficient $S_{write}$. The three coefficients $S_{code}$, $S_{structure}$ and $S_{write}$ are then combined by weighted summation to give the score between the misrecognized character and the candidate character. Finally, the candidate character with the highest score is taken as the error correction result.
(5) Correcting repeated or missing unit characters according to the structural relationship between numbers and units. OCR results may contain extra or missing characters. Except for the 'zero + number' pattern, there must be a unit between any two numbers, and a missing unit is completed according to the units before and after it. Where a single character has been recognized as two characters, the repeated unit is removed, for example correcting 'five yuan yuan three jiao' to 'five yuan three jiao'.
(6) When the upper-case amount value cannot be calculated, determining, based on four-corner coding or FASPell coding, a first candidate upper-case numeric character or a first candidate unit character that shares a partial structure with the recognized upper-case numeric character or unit character, and replacing the originally recognized character with that first candidate. Specifically, for characters with the same structure, only 2 digits of the four-corner code need to match; for example, left-right structured characters are matched on the left half or the right half. Suppose the OCR output is a misrecognized character whose position is inferred from the context to be a number, so that the candidate set is the upper-case numerals zero to nine. Because the misrecognized character and 'five' (伍) are both left-right structures and their four-corner codes match in the 2 digits of the left half, the character is corrected to 'five'. FASPell coding describes structure and stroke information, so error correction can also be performed when the FASPell codes match on a specified partial structure. For example, a misrecognized character that contains the component 员 is corrected to 圆 (yuan), since 圆 also contains the component 员.
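The scoring formula of mode (4) can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the `Glyph` record and the function names are hypothetical, the four-corner digits are assumed to be compared position by position, and the feature values in the usage example are placeholders rather than real four-corner or FASPell codes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    char: str
    four_corner: str  # four-corner code digits (placeholder values below)
    structure: str    # e.g. "left-right", "top-bottom"
    strokes: int      # stroke count


def score(wrong: Glyph, cand: Glyph) -> float:
    """S = S_code + 0.5 * S_structure + 0.25 * S_write."""
    # S_code: number of four-corner digits matching at the same position.
    s_code = sum(a == b for a, b in zip(wrong.four_corner, cand.four_corner))
    # S_structure: 1 when both characters share the same structure.
    s_structure = 1 if wrong.structure == cand.structure else 0
    # S_write: 1 when the stroke counts differ by less than 3.
    s_write = 1 if abs(wrong.strokes - cand.strokes) < 3 else 0
    return s_code + 0.5 * s_structure + 0.25 * s_write


def best_candidate(wrong: Glyph, candidates: list[Glyph]) -> Glyph:
    """Return the candidate with the highest total score."""
    return max(candidates, key=lambda c: score(wrong, c))


if __name__ == "__main__":
    wrong = Glyph("?", "1234", "left-right", 6)   # placeholder features
    a = Glyph("A", "1239", "left-right", 5)       # 3 digits + structure + strokes
    b = Glyph("B", "1234", "top-bottom", 12)      # 4 digits only
    print(score(wrong, a), score(wrong, b), best_candidate(wrong, [a, b]).char)
```

In the placeholder example one extra matching code digit outweighs structure and stroke similarity combined, which is the decreasing significance that the weights (1, 0.5, 0.25) are meant to ensure.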
The embodiment corrects the OCR recognition result in multiple modes, so that the recognition errors can be reduced to the greatest extent, and the accuracy of bill sum extraction is improved.
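Two of the simpler correction modes, the near-shape dictionary of mode (1) and the removal of repeated units in mode (5), can be sketched as follows. The dictionary entry 别 → 捌, the unit set, and the function names are assumptions chosen for illustration; a real dictionary would be built from observed OCR errors.

```python
import re

# Hypothetical near-shape dictionary: error-prone character -> correct character.
NEAR_SHAPE = {"别": "捌"}

# Assumed unit characters that should never appear twice in a row.
UNIT_CHARS = "拾佰仟万亿元圆角分"

# A unit character followed by one or more copies of itself.
REPEATED_UNIT = re.compile(f"([{UNIT_CHARS}])\\1+")


def fix_near_shape(text: str) -> str:
    """Replace each error-prone character with its dictionary correction."""
    return "".join(NEAR_SHAPE.get(ch, ch) for ch in text)


def drop_repeated_units(text: str) -> str:
    """Collapse a unit that OCR emitted several times into a single unit."""
    return REPEATED_UNIT.sub(r"\1", text)


if __name__ == "__main__":
    print(drop_repeated_units(fix_near_shape("别元元叁角")))
```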
With continued reference to fig. 4, a bill amount extraction device is shown. In this embodiment, the bill amount extraction device 40 may include or be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to carry out the present invention and implement the bill amount identification method described above. A program module, as referred to herein, is a series of computer program instruction segments capable of performing a specific function, and is better suited than the program itself for describing the execution of the bill amount extraction device 40 in the storage medium. The functions of the program modules of this embodiment are described below:
a bill acquiring module 41 adapted to acquire a bill image, and recognize the bill image by a character recognition algorithm to obtain one or more recognition texts;
an uppercase extraction module 42 adapted to extract uppercase numeric characters and unit characters from the recognition text;
a capital numerical value calculation module 43 adapted to calculate a capital amount numerical value corresponding to the capital numerical characters and the unit characters;
a lower case extraction module 44 adapted to extract lower case amount characters from the recognition text;
a numerical value checking module 45 adapted to calculate the lower case amount numerical value corresponding to the lower case amount characters and to check the upper case amount numerical value against the lower case amount numerical value;
and an amount determining module 46 adapted to extract the amount of the bill according to the verification result.
This embodiment also provides a computer device capable of executing programs, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of a plurality of servers). The computer device 50 of this embodiment includes at least, but is not limited to, a memory 51 and a processor 52, which may be communicatively coupled to each other via a system bus, as shown in FIG. 5. It is noted that fig. 5 only shows a computer device 50 with components 51-52, but it is to be understood that not all of the shown components are required, and that more or fewer components may be implemented instead.
In this embodiment, the memory 51 (i.e., a readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 51 may be an internal storage unit of the computer device 50, such as a hard disk or an internal memory of the computer device 50. In other embodiments, the memory 51 may be an external storage device of the computer device 50, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, or a Flash memory Card (Flash Card) provided on the computer device 50. Of course, the memory 51 may also include both an internal storage unit and an external storage device of the computer device 50. In this embodiment, the memory 51 is generally used for storing the operating system and the various types of application software installed on the computer device 50, such as the program code of the bill amount extraction device 40 of the first embodiment. Further, the memory 51 may also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 52 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 52 generally serves to control the overall operation of the computer device 50. In this embodiment, the processor 52 is configured to run the program code stored in the memory 51 or to process data, for example to run the bill amount extraction device 40, so as to implement the bill amount identification method of the first embodiment.
This embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an app store, etc., on which a computer program is stored that implements the corresponding functions when executed by a processor. The computer-readable storage medium of this embodiment is used for storing the bill amount extraction device 40, which, when executed by a processor, implements the bill amount identification method of the first embodiment.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable medium, and when executed, the program includes one or a combination of the steps of the method embodiments.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example" or "some examples" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.