Data processing method and device, electronic equipment and storage medium

Document No.: 7644 Publication date: 2021-09-17

1. A data processing method, comprising:

classifying text units in a table to be processed to obtain text units corresponding to keys of the table to be processed, wherein the obtained text units corresponding to the keys are first text units;

obtaining a target first text unit pair with master-slave relation based on the first text unit;

acquiring a master-slave relationship between the target first text units included in the target first text unit pair;

and obtaining key structured data corresponding to the table to be processed based on the master-slave relationship between the target first text units included in the target first text unit pair, wherein the key structured data is data representing the master-slave relationship between first text units.

2. The method of claim 1, wherein obtaining a target first text unit pair having a master-slave relationship based on the first text unit comprises:

obtaining at least one first text unit pair based on any two first text units, wherein every two first text units correspond to one first text unit pair;

acquiring a pair type corresponding to each first text unit pair;

and determining the first text unit pair with the pair type of the first pair type as a target first text unit pair with master-slave relation.

3. The method according to claim 2, wherein said obtaining respective corresponding pair types of each of the first text unit pairs comprises:

acquiring the feature vector corresponding to each first text unit pair;

and obtaining the pair type of the corresponding first text unit pair based on each feature vector.

4. The method according to claim 3, wherein said obtaining the feature vector corresponding to each of the first text unit pairs comprises:

extracting the features of each first text unit pair through a feature extraction layer of a classification model to obtain a feature vector corresponding to each first text unit pair;

the obtaining a pair type of a corresponding first text unit pair based on each feature vector comprises:

and processing the feature vector corresponding to each first text unit pair through a feature processing layer of the classification model to obtain the pair type corresponding to each first text unit pair.

5. The method of claim 4, wherein the classification model is obtained by:

acquiring a sample first text unit pair carrying a pair type label;

and training an initial model based on the sample first text unit pair and the carried pair type label to obtain the classification model.

6. The method according to claim 1, wherein obtaining the key structured data corresponding to the table to be processed based on the master-slave relationship between the target first text units included in each target first text unit pair comprises:

and carrying out hierarchical processing on each target first text unit based on the master-slave relationship between the target first text units included in each target first text unit pair to obtain key structured data corresponding to the to-be-processed table.

7. The method of claim 1, further comprising:

acquiring key value structured data corresponding to the to-be-processed table, wherein the key value structured data represent association relation between a first text unit and a second text unit, and the second text unit is a text unit corresponding to each value of the to-be-processed table;

and merging the key structured data and the key value structured data to obtain structured data corresponding to the to-be-processed table.

8. The method of claim 7, wherein the table to be processed includes a plurality of text units, each text unit having a respective corresponding coordinate attribute, the plurality of text units includes the first text unit and the second text unit, and the obtaining key-value structured data corresponding to the table to be processed includes:

acquiring each second text unit set, wherein each second text unit included in each second text unit set has at least one same coordinate attribute;

acquiring first text units corresponding to the second text unit sets;

and merging each second text unit set and the first text unit corresponding to each second text unit set to obtain key value structured data corresponding to the table to be processed.

9. The method according to claim 8, wherein the tables to be processed comprise a first type of table to be processed and a second type of table to be processed, the coordinate attributes of the first type of table to be processed comprise an abscissa left boundary, an abscissa right boundary and an abscissa middle value, and the coordinate attributes of the second type of table to be processed comprise an ordinate up-down boundary value.

10. The method according to claim 7, wherein after obtaining the structured data corresponding to the table to be processed, the method further comprises:

obtaining table labels corresponding to each first text unit included in the structured data based on the corresponding relation between the structured data and the table labels;

and generating a table with a complete frame corresponding to the table to be processed based on the table label.

11. The method according to any one of claims 1-10, wherein before classifying the text units in the table to be processed, the method further comprises:

analyzing a table to be processed to obtain text elements corresponding to the table to be processed and coordinates corresponding to the text elements;

and obtaining each text unit included in the table to be processed based on the text element corresponding to the table to be processed and the respective corresponding coordinate of each text element.

12. The method according to claim 11, wherein obtaining each text unit included in the table to be processed based on the text element corresponding to the table to be processed and the corresponding coordinate of each text element comprises:

obtaining the transverse distance between each text element based on the corresponding coordinate of each text element;

and distinguishing a plurality of text elements with the same horizontal coordinate based on the distance threshold and the horizontal distance between the text elements to obtain each text unit included in the to-be-processed table.

13. A data processing apparatus, characterized in that the apparatus comprises:

the first text unit obtaining module is used for classifying text units in a table to be processed to obtain text units corresponding to keys of the table to be processed, wherein the obtained text units corresponding to the keys are first text units;

the target first text unit pair obtaining module is used for obtaining a target first text unit pair with master-slave relation based on the first text unit;

a master-slave relationship obtaining module, configured to obtain a master-slave relationship between the target first text units included in the target first text unit pair;

and the key structured data obtaining module is used for obtaining key structured data corresponding to the table to be processed based on the master-slave relationship between the target first text units included in the target first text unit pair, wherein the key structured data is data representing the master-slave relationship between first text units.

14. An electronic device comprising a processor and a memory; one or more programs are stored in the memory and configured to be executed by the processor to implement the method of any of claims 1-12.

15. A computer-readable storage medium, having program code stored therein, wherein the program code when executed by a processor performs the method of any of claims 1-12.

Background

Structured data plays an important role in data post-processing. However, in the related art, the structured data obtained from some tables is not accurate.

Disclosure of Invention

In view of the foregoing problems, embodiments of the present application provide a data processing method, an apparatus, an electronic device, and a storage medium to address them.

In a first aspect, an embodiment of the present application provides a data processing method, where the method includes: classifying the text units in the table to be processed to obtain the text units corresponding to the keys of the table to be processed, wherein the obtained text units corresponding to the keys are first text units; obtaining a target first text unit pair with master-slave relation based on the first text unit; acquiring a master-slave relationship between target first text units included in each target first text unit pair; and obtaining key structured data corresponding to the table to be processed based on the master-slave relationship between the target first text units included in each target first text unit pair, wherein the key structured data is data representing the master-slave relationship between first text units.

In a second aspect, an embodiment of the present application provides a data processing apparatus, including: the first text unit obtaining module is used for classifying the text units in the table to be processed to obtain the text units corresponding to the keys of the table to be processed, wherein the obtained text units corresponding to the keys are the first text units; the target first text unit pair obtaining module is used for obtaining a target first text unit pair with master-slave relation based on the first text unit; the master-slave relationship obtaining module is used for obtaining the master-slave relationship between the target first text units included in each target first text unit pair; and the key structured data obtaining module is used for obtaining key structured data corresponding to the table to be processed based on the master-slave relationship between the target first text units included in each target first text unit pair, wherein the key structured data is data representing the master-slave relationship between first text units.

In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory; one or more programs are stored in the memory and configured to be executed by the processor to implement the methods described above.

In a fourth aspect, the present application provides a computer-readable storage medium having program code stored therein, where the program code executes the method described above when executed by a processor.

In a fifth aspect, embodiments of the present application provide a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the above-described method.

According to the data processing method, apparatus, electronic device and storage medium provided by the present application, the text units in the table to be processed are classified to obtain the first text units of the table; target first text unit pairs having a master-slave relationship are then obtained based on the first text units; the master-slave relationship between the target first text units included in each target first text unit pair is then obtained; and finally the key structured data corresponding to the table to be processed is obtained based on those master-slave relationships. Because the key structured data is obtained from target first text unit pairs having a master-slave relationship, it can accurately express the master-slave relationships between first text units, which improves the accuracy of the key structured data and, in turn, the accuracy of obtaining structured data from the table.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a diagram illustrating structured data stored in Json form according to an embodiment of the present application;

FIG. 2a is a diagram illustrating a simple table with default borders in an embodiment of the present application;

FIG. 2b is a diagram illustrating a table with normal borders obtained after division by dividing lines in the embodiment of the present application;

FIG. 3a is a diagram illustrating a complex table with default borders in an embodiment of the present application;

FIG. 3b is a diagram illustrating a complex table after incorrect division by dividing lines in an embodiment of the present application;

FIG. 3c is a diagram illustrating a complete table of frames after frame supplementation based on structured data in an embodiment of the present application;

FIG. 4 is a diagram illustrating another example of a complex table with default borders in an embodiment of the present application;

FIG. 5 is a flow chart illustrating a data processing method in an embodiment of the present application;

FIG. 6 is a schematic diagram of another table to be processed in an embodiment of the present application;

FIG. 7 shows a flow chart of a data processing method according to another embodiment of the present application;

FIG. 8 is a schematic diagram of a portion of a first text unit pair obtained from the to-be-processed table shown in FIG. 3 a;

FIG. 9 is a flow chart illustrating one embodiment of S230 in a data processing method as set forth in the embodiment of FIG. 7;

FIG. 10 is a diagram illustrating feature vectors corresponding to a first text unit pair proposed in this embodiment;

FIG. 11 is a flow chart illustrating another implementation of S230 in a data processing method as set forth in the embodiment shown in FIG. 7;

FIG. 12 is a schematic diagram illustrating a network structure of a convolutional neural network according to an embodiment of the present application;

FIG. 13 is a diagram illustrating a table with complete frames according to an embodiment of the present application;

FIG. 14 is a schematic diagram of sample first text unit pairs taken from the table of FIG. 13;

FIG. 15 is a schematic diagram illustrating a training process of a convolutional neural network according to an embodiment of the present application;

FIG. 16 is a flowchart illustrating a data processing method according to another embodiment of the present application;

FIG. 17 is a flowchart illustrating a data processing method according to an embodiment of the present application;

FIG. 18 is a block diagram of a data processing apparatus according to an embodiment of the present application;

FIG. 19 shows a block diagram of a data processing apparatus according to a further embodiment of the present application;

FIG. 20 is a block diagram showing another electronic device for executing a data processing method according to an embodiment of the present application;

FIG. 21 illustrates a storage unit for storing or carrying program codes for implementing a data processing method according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, machine learning/deep learning and the like.

With the development of artificial intelligence technology, many scenarios involving Natural Language Processing (NLP) and machine learning have emerged.

Natural Language Processing (NLP) is an important direction in computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.

Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specifically studies how a computer simulates or realizes human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to give computers intelligence, and it is applied throughout the fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.

The scheme provided by the embodiment of the present application relates to technologies such as Natural Language Processing (NLP) and machine learning of artificial intelligence, and is specifically described by the following embodiments:

Structured data is data that has a certain format or structure and is organized and ordered. It therefore has a wide range of applications in data post-processing; for example, it is needed in data processing operations such as graph construction, database construction, data analysis and statistics. Structured data may be stored in JSON (JavaScript Object Notation), CSV (Comma-Separated Values), or other formats.

Referring to FIG. 1, which shows a schematic diagram of structured data stored in JSON form: the structured data shown in FIG. 1 clearly and neatly expresses the correspondence between "position of the director who did not attend in person" and "independent director", between "name of the director who did not attend in person" and "Zhang San", between "reason for not attending the meeting in person" and "official business", and between "name of the delegate" and "Li Si". It can thus be seen that structured data facilitates data analysis and statistics.
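As a minimal sketch, a record like the one in FIG. 1 could be built and serialized to JSON as follows. The English key names are illustrative renderings of the keys in the figure and are assumptions, not identifiers taken from the application.

```python
import json

# Illustrative key/value pairs modelled on FIG. 1.
# The English key names are assumptions made for this sketch.
record = {
    "position of absent director": "independent director",
    "name of absent director": "Zhang San",
    "reason for absence": "official business",
    "name of delegate": "Li Si",
}

# Serialize to JSON text and read it back; the key/value structure survives the round trip.
serialized = json.dumps(record, ensure_ascii=False, indent=2)
restored = json.loads(serialized)
```

Because each value is addressed by its key, downstream analysis can read a field such as restored["name of delegate"] directly instead of re-parsing the table layout.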

With the rapid growth of data and the convenience of presenting data in tables, tables are increasingly used to present data in many fields, such as finance, the Internet and education; in finance, for example, they appear in documents such as prospectuses, quarterly reports, annual reports and semi-annual reports. Obtaining structured data from tables is therefore becoming more and more common. However, in studying how structured data is obtained from tables, the inventors found that the related approaches have certain limitations.

In the related art, there are several common scenarios for obtaining structured data from a table. One is obtaining structured data from a table with normal borders, that is, a table in which every text is surrounded by borders; for this type of table there are fairly mature methods, such as the visual information reverse-deduction method. Another is obtaining structured data from a table with default borders, that is, a table in which some borders are missing; such tables fall into two types, simple tables with default borders and complex tables with default borders. Table borders are divided into inner borders and outer borders. A missing outer border does not affect structured information extraction, so a table whose outer border is missing can be treated like a table with normal borders. The tables with default borders discussed in the embodiments of the present application are therefore mainly tables whose inner borders are missing.

A simple table is one in which only the first row or first column contains keys and the remaining rows or columns contain values. A complex table is one in which the first N rows or first N columns contain keys, where each row is the parent key of the next row, or each column is the parent key of the next column; these may be referred to in sequence as keys, child keys, grandchild keys and so on, and the remaining rows or columns contain values.

Referring to FIG. 2a, which shows a simple table with default borders: only the first row contains keys, namely "account age", "receivables billing rate (%)" and "other receivables billing rate (%)", and the other rows contain the values corresponding to each of these keys. The borders missing from this simple table are the vertical ones.

Referring to FIG. 3a, which shows a complex table with default borders in the embodiment of the present application: the first row contains the key "dividend year" (分红年度), the second row contains the sub-keys "2015 year", "2016 year" and "2017 year", and the remaining rows contain the values corresponding to "dividend year" and "2015 year", to "dividend year" and "2016 year", and to "dividend year" and "2017 year", respectively. The vertical borders of this complex table are omitted.

Referring to fig. 4, fig. 4 is a schematic diagram illustrating another complex table with default borders in the embodiment of the present application, in fig. 4, a first column includes key "year of tax," a second column includes sub-keys "2010," "2011," and "2012," and the remaining columns except for the first column and the second column are values, and the horizontal borders of the complex table are omitted.

For a simple table with default borders, the related approach detects the blank areas between columns (or rows) and places a vertical (or horizontal) dividing line at any position inside each blank area, which yields a table with normal borders; the visual information reverse-deduction method is then used to obtain the structured data. Referring to FIG. 2b, which shows a table with normal borders obtained after division by dividing lines in the embodiment of the present application: the vertical blank regions (region 1 and region 2 in FIG. 2b) are detected first, and a vertical dividing line y1 and a vertical dividing line y2 are then placed at any position in region 1 and region 2, respectively, so that a table with normal borders is obtained.
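The blank-region step described above can be sketched as follows. This is a minimal illustration, assuming each text unit is reduced to its horizontal extent (x_left, x_right); the function names and the sample coordinates are hypothetical, and the midpoint is just one of the "any position" choices the description allows.

```python
from typing import List, Tuple

Box = Tuple[float, float]  # (x_left, x_right) of one text unit; an assumed representation


def vertical_blank_regions(boxes: List[Box]) -> List[Tuple[float, float]]:
    """Return the x-intervals between occupied spans that no text unit covers."""
    merged: List[List[float]] = []
    for left, right in sorted(boxes):
        if merged and left <= merged[-1][1]:
            # Overlaps (or touches) the current occupied span: extend it.
            merged[-1][1] = max(merged[-1][1], right)
        else:
            merged.append([left, right])
    # The gaps between consecutive occupied spans are the blank regions.
    return [(a[1], b[0]) for a, b in zip(merged, merged[1:])]


def dividing_lines(boxes: List[Box]) -> List[float]:
    """Place one vertical dividing line per blank region, here at its midpoint."""
    return [(lo + hi) / 2 for lo, hi in vertical_blank_regions(boxes)]


# Hypothetical three-column table: two text units per column.
boxes = [(0, 10), (1, 9), (20, 30), (22, 28), (40, 55), (41, 50)]
lines = dividing_lines(boxes)
```

With these sample boxes the two blank regions are (10, 20) and (30, 40), so the lines fall at x = 15 and x = 35, playing the role of y1 and y2 in FIG. 2b.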

For a complex table with default borders, applying the same method as for a simple table with default borders produces errors, and the structured data obtained often differs from the standard structured data.

In order to solve the above problems, the inventors studied the characteristics of complex tables with default borders and the result of applying to them the same method as for simple tables with default borders. They found that the value part of a complex table is generally regularly arranged, so dividing it according to the detected blank regions generally causes no error, whereas the key part is irregularly arranged, so dividing it according to the detected blank regions does cause errors. Referring to FIG. 3b, which shows a complex table after incorrect division by dividing lines in the embodiment of the present application: the rows of the value region (the third, fourth and fifth rows of the table in FIG. 3b) are regularly arranged, while the rows of the key region (the first and second rows) are not, in that regions 11 and 13, above "2015 year" and "2017 year", are blank. The vertical dividing lines y3 and y4 therefore cut through the whole area that belongs to "dividend year" (分红年度), so regions 11, 12 and 13 in FIG. 3b are separated by mistake. As a result, only "2016 year" is obtained as a sub-key of "dividend year", and the resulting structured data is wrong.
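The failure described above can be reproduced with a small sketch. The coordinates and the column-assignment rule (a text unit belongs to the column that contains its horizontal center) are assumptions made for illustration; only the labels y3 and y4 and the key texts come from the description.

```python
from bisect import bisect_right

# Hypothetical positions of the dividing lines y3 and y4 from FIG. 3b.
lines = [15.0, 35.0]

# Hypothetical horizontal extents (x_left, x_right) of the key text units.
# The parent key sits only above the middle column, as in FIG. 3b.
units = {
    "dividend year": (22, 28),
    "2015": (0, 10),
    "2016": (20, 30),
    "2017": (40, 50),
}


def column_of(box, dividing_lines):
    """Index of the column whose span contains the horizontal center of the box."""
    center = (box[0] + box[1]) / 2
    return bisect_right(dividing_lines, center)


columns = {}
for text, box in units.items():
    columns.setdefault(column_of(box, lines), []).append(text)
```

Here the parent key shares a column only with "2016", so a column-wise reading pairs it with that single sub-key and loses its relation to "2015" and "2017".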

Based on the above findings, the inaccuracy of the structured data obtained in the related art is mainly caused by the key region. The inventors therefore propose the data processing method, apparatus, electronic device, and storage medium provided by the present application. In the method, after the first text units in a table to be processed are obtained, target first text unit pairs having a master-slave relationship are obtained based on the first text units, and the key structured data corresponding to the table to be processed is then obtained based on the target first text unit pairs.

In the foregoing manner, because the key structured data is obtained from target first text unit pairs having a master-slave relationship, it can accurately express the master-slave relationships between first text units, which improves the accuracy of the key structured data and, in turn, the accuracy of obtaining structured data from the table.
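The overall flow (form every pair of first text units, keep the pairs that have a master-slave relationship, then arrange the keys hierarchically) can be sketched as below. The two rule-based functions are hypothetical stand-ins for the trained classification model that the application uses; the row indices and key texts are illustrative.

```python
from itertools import combinations

# Hypothetical first text units (keys) with the key row in which each appears.
keys = {"dividend year": 0, "2015": 1, "2016": 1, "2017": 1}


def has_master_slave(a: str, b: str) -> bool:
    """Stand-in for the pair classifier: units in adjacent key rows are treated as related."""
    return abs(keys[a] - keys[b]) == 1


def master_of(a: str, b: str) -> str:
    """Stand-in rule: the unit in the upper row is the master."""
    return a if keys[a] < keys[b] else b


# Every two first text units form one candidate pair.
candidate_pairs = list(combinations(keys, 2))
# Keep only the pairs judged to have a master-slave relationship.
target_pairs = [pair for pair in candidate_pairs if has_master_slave(*pair)]

# Record, for each master, the slaves found in the target pairs.
children = {}
for a, b in target_pairs:
    master = master_of(a, b)
    slave = b if master == a else a
    children.setdefault(master, []).append(slave)


def build(node: str) -> dict:
    """Recursively nest each key's slaves under it."""
    return {child: build(child) for child in children.get(node, [])}


all_slaves = {s for slaves in children.values() for s in slaves}
roots = [k for k in keys if k not in all_slaves]
key_structure = {root: build(root) for root in roots}
```

In this example the six candidate pairs reduce to three target pairs, and the resulting nested dictionary places "2015", "2016" and "2017" under "dividend year". Sibling pairs such as ("2015", "2016") are discarded, matching the statement that child keys of the same key have no master-slave relationship.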

Embodiments of the present application will be described in detail below with reference to the accompanying drawings.

Referring to fig. 5, fig. 5 is a flowchart illustrating a data processing method according to an embodiment of the present application, where the method includes:

S110, classifying the text units in the table to be processed to obtain the text units corresponding to the keys of the table to be processed, wherein the obtained text units corresponding to the keys are first text units.

The table to be processed may be understood as a table file with default frame, which needs to perform structured data acquisition, where the table file may be from documents in multiple formats, such as a PDF (Portable Document Format) Document, and may also be a picture Format Document, where the picture Format includes common png (Portable Network Graphics), jpeg (Joint Photographic Experts Group), bmp (BitMaP), and the like.

A text unit may be understood as a collection of text elements, and a text element may be understood as a single digit, punctuation mark, character, etc. in a table. A text unit therefore consists of at least one text element, and a table file consists of a plurality of text units. Since a table generally does not contain only one text unit, a table file consists of at least two text units.

Illustratively, with continuing reference to FIG. 3a, the complex table in FIG. 3a is composed of text units such as "dividend year" (分红年度), "2015 year" (2015年) and "0.00%". The four characters "分", "红", "年" and "度" in the text unit "分红年度" are each one text element; the digits "2", "0", "1" and "5" and the character "年" in the text unit "2015年" are each one text element; and the three digits "0" and the punctuation marks "." and "%" in the text unit "0.00%" are each one text element.
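One possible in-memory representation of this relationship is sketched below. The class and field names are hypothetical, and the coordinates are made-up values; the only point illustrated is that a text unit is an ordered collection of single-character text elements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextElement:
    """A single digit, punctuation mark or character together with its position."""
    char: str
    x: float
    y: float


@dataclass
class TextUnit:
    """A collection of text elements that together form one cell text."""
    elements: List[TextElement]

    @property
    def text(self) -> str:
        return "".join(element.char for element in self.elements)


# The text unit "0.00%" is made of five text elements: three digits and two punctuation marks.
unit = TextUnit([TextElement(c, x=10.0 + 5 * i, y=0.0) for i, c in enumerate("0.00%")])
```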

As can be seen from the foregoing, each table includes at least an area corresponding to keys and an area corresponding to values. A text unit corresponding to a key in the key area can therefore be understood as a first text unit, and a text unit corresponding to a value in the value area can be understood as a second text unit. For a complex table, the key area spans at least two rows or two columns, and each row is the parent key of the next row, or each column is the parent key of the next column; these may be referred to in sequence as keys, child keys, grandchild keys and so on, and keys, child keys and grandchild keys can be considered to have a master-slave relationship. A table containing at least two first text units that have a master-slave relationship can therefore be considered a complex table. It should be noted that there is no master-slave relationship between child keys of the same key, or between grandchild keys of the same child key.

It should be noted that, as a mode, the table to be processed may be obtained by performing table extraction, border default screening, simple and complex table screening, and the like on a document, that is, after reading a document, first extracting the existing table, then performing border default screening on the table, selecting the table with the border default, and finally screening the complex table with the border default from the table with the border default.

As one method, a special character detection method may be adopted for extracting a table in a document, for example, special characters such as "table 1", "table 2", and "table" in the document may be detected, and then whether a line region exists in a preset distance threshold region above or below the special characters is detected, and the region where the line exists is extracted as the table. The table frame default screening and the simple and complex table screening steps can adopt a pre-trained neural network model for screening, namely, the table extracted from the document is input into the pre-trained neural network model to classify whether the frame is in default or not and classify the simple and complex table, and optionally, the neural network model can be a convolutional neural network. Furthermore, through the method, the table to be processed, namely the complex table with the default frame, can be obtained from the document.
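The caption-based extraction described above can be sketched as follows. This is a simplified illustration that assumes the document has already been reduced to a list of text lines plus a parallel list of flags saying whether a drawn line was detected at each position; the caption pattern, the function names and the distance threshold are all hypothetical.

```python
import re

# Assumed caption pattern: a line that starts with "Table" followed by a number.
CAPTION = re.compile(r"^\s*Table\s*\d+\b")


def caption_rows(text_lines):
    """Indices of lines that look like table captions."""
    return [i for i, line in enumerate(text_lines) if CAPTION.match(line)]


def table_regions(text_lines, has_ruling_line, max_distance=2):
    """Pair each caption with the nearest-by-index ruling line within max_distance rows.

    has_ruling_line[i] is True when a drawn line was detected at row i (assumed input).
    Returns (caption_row, ruling_line_row) tuples.
    """
    regions = []
    for i in caption_rows(text_lines):
        lo = max(0, i - max_distance)
        hi = min(len(text_lines) - 1, i + max_distance)
        near = [j for j in range(lo, hi + 1) if j != i and has_ruling_line[j]]
        if near:
            regions.append((i, near[0]))
    return regions


doc_lines = [
    "Annual report",
    "Table 1 Dividend history",
    "----",
    "2015 2016 2017",
    "See Table 2 for details",
]
has_line = [False, False, True, False, False]
regions = table_regions(doc_lines, has_line)
```

Only the line that begins with "Table 1" is treated as a caption; the in-sentence mention of "Table 2" is ignored because the pattern is anchored at the start of the line.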

Alternatively, the table to be processed may be provided directly. For example, the tables in a document may be extracted manually and subjected to missing-border screening and simple/complex table screening, and the resulting table may then be uploaded to the electronic device as the table to be processed.

In this embodiment, after the table to be processed, namely a complex table with missing borders, is obtained in the above manner, the electronic device may classify the text units in the table to be processed into first text units and second text units, so that each first text unit of the table to be processed is obtained.

As one approach, the text units in the table to be processed may be classified by a pre-trained neural network model: the text units obtained from the table to be processed are input into the model, which classifies the type of each unit. Optionally, the neural network model may be a convolutional neural network. In this way, the type of each text unit in the table to be processed is obtained, that is, whether each text unit is a first text unit or a second text unit.

In addition, because the table to be processed is a PDF document, an image-format document, or the like, the electronic device cannot directly recognize the text units in it at the underlying level. The table to be processed therefore needs to be parsed in order to obtain the text units.

As one approach, the table to be processed may first be parsed to obtain the text elements of the table and the coordinates of each text element. Optionally, the table to be processed may be parsed by a parser. For example, if the table to be processed is a PDF document, a PDF parser may be used to obtain the text elements and their coordinates; if the table to be processed is an image-format document, an image parser corresponding to that image format may be used to obtain them.

After the text elements of the table to be processed and the coordinates of each text element are obtained, each text unit included in the table to be processed can be obtained based on those text elements and coordinates.

As one approach, obtaining each text unit included in the table to be processed based on the text elements and their coordinates includes: obtaining the horizontal distance between text elements based on the coordinates of each text element; and separating the text elements lying in the same row, based on a distance threshold and the horizontal distance between the text elements, to obtain each text unit included in the table to be processed.

The horizontal distance between text elements can be understood as the difference between their horizontal coordinates. Because each text element occupies a range of horizontal coordinates, one reference must be used uniformly for the calculation: the left boundary of each text element, the right boundary of each text element, or the midpoint of each text element's horizontal range.

It can be understood that, when a table is edited, the horizontal spacing between the text elements within a text unit is preset. The distance threshold can therefore be understood as this preset horizontal spacing: if the horizontal distance between two text elements is greater than the distance threshold, the two text elements do not belong to the same text unit. In this way, it can be determined whether adjacent text elements belong to the same text unit, and each text unit included in the table to be processed is obtained.
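The grouping of text elements into text units can be sketched as follows. This is a minimal sketch under stated assumptions: the `TextElement` class, its field names, and the choice of the left boundary as the uniform reference are illustrative and are not taken from the embodiment.

```python
from dataclasses import dataclass


@dataclass
class TextElement:
    text: str     # a single parsed text element, e.g. one character
    left: float   # left boundary, used here as the uniform reference
    right: float  # right boundary (unused when the left boundary is the reference)
    row: float    # vertical coordinate identifying the row of the element


def group_into_units(elements, distance_threshold):
    """Merge adjacent elements of one row into text units.

    Two neighbouring elements stay in the same unit when the difference of
    their left boundaries does not exceed the distance threshold.
    """
    by_row = {}
    for element in elements:
        by_row.setdefault(element.row, []).append(element)

    units = []
    for row in sorted(by_row):
        row_elements = sorted(by_row[row], key=lambda e: e.left)
        current = [row_elements[0]]
        for previous, element in zip(row_elements, row_elements[1:]):
            if element.left - previous.left > distance_threshold:
                units.append("".join(e.text for e in current))  # close the unit
                current = [element]
            else:
                current.append(element)
        units.append("".join(e.text for e in current))
    return units


# Two units on one row: characters 10 apart inside a unit, 70 apart between units.
row_elements = [TextElement(c, x, x + 8, 0.0)
                for c, x in zip("20152016", [0, 10, 20, 30, 100, 110, 120, 130])]
text_units = group_into_units(row_elements, distance_threshold=12)
```

With a threshold of 12, the gap of 70 between the fourth and fifth elements splits the row into two text units.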

S120, obtaining a target first text unit pair having a master-slave relationship based on the first text units.

To obtain accurate key structured data subsequently, two first text units may be selected from the obtained first text units and a master-slave relationship judgment performed on them. The judgment has two possible results: either a master-slave relationship exists between the two first text units, or it does not. If a master-slave relationship exists, a target first text unit pair having the master-slave relationship is obtained from the two first text units. For example, if there is a master-slave relationship between the first text unit "dividend year" and the first text unit "2015", the target first text unit pair having the master-slave relationship is "dividend year-2015".

It should be noted that the target first text unit pair is used to indicate that a master-slave relationship exists between the two first text units, and does not necessarily indicate that the two first text units are to be merged or concatenated.

Considering that, in the usual case, a master-slave relationship exists only between first text units in two adjacent rows or two adjacent columns, as one approach the master-slave relationship judgment may be performed in sequence on first text units located in two adjacent rows or two adjacent columns. For example, it may be determined whether a master-slave relationship exists between a first text unit in the first row and a first text unit in the second row, and two first text units having a master-slave relationship form a target first text unit pair. Referring to fig. 6, which shows a schematic diagram of another table to be processed in the embodiment of the present application, and taking that table as an example, the master-slave relationship judgment may be performed between the first text units "job" and "director", between "job" and "manager", between "director" and "chief director", between "manager" and "deputy manager", and so on.

In addition, in the underlying implementation, whether two first text units are located in adjacent rows or adjacent columns must itself be determined by further calculation from the coordinates of the first text units. To save this step, as another approach the master-slave relationship judgment may be performed on any two of the obtained first text units; the target first text unit pairs having a master-slave relationship can still be obtained in this way. Referring to fig. 6, and taking the table to be processed shown in fig. 6 as an example, the master-slave relationship judgment may be performed between the first text units "job" and "director", between "director" and "manager", between "manager" and "deputy manager", and so on.
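The first approach, in which only first text units in adjacent rows are compared, can be sketched as follows. The function name and the row-list input format are assumptions made for illustration; the key names are modelled on the fig. 6 example, and the rows are assumed to have already been determined from the coordinates.

```python
def adjacent_row_candidates(key_rows):
    """Build candidate (upper, lower) pairs from first text units in adjacent rows.

    key_rows: list of rows, top row first; each row is a list of key texts.
    """
    candidates = []
    for upper_row, lower_row in zip(key_rows, key_rows[1:]):
        for upper in upper_row:
            for lower in lower_row:
                candidates.append((upper, lower))
    return candidates


# Illustrative key area with three rows.
key_rows = [
    ["job"],
    ["director", "manager"],
    ["chief director", "deputy director", "chief manager", "deputy manager"],
]
adjacent_candidates = adjacent_row_candidates(key_rows)
```

For this layout the adjacent-row approach yields 1 x 2 + 2 x 4 = 10 candidates, whereas comparing every ordered pair of the 7 first text units would yield 7 x 6 = 42; the second approach trades the extra candidates for not having to compute row adjacency.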

As one approach, the master-slave relationship judgment on two first text units may be performed by a pre-trained neural network model, which classifies the two first text units to determine whether a master-slave relationship exists between them.

S130, acquiring the master-slave relationship between the target first text units included in each target first text unit pair.

Wherein, two first text units included in the target first text unit pair can be understood as target first text units.

As can be seen from the foregoing, a target first text unit pair indicates that a master-slave relationship exists between two first text units. Based on each target first text unit pair, the electronic device can therefore extract the master-slave relationship between the two target first text units included in that pair, so as to obtain the master-slave relationship between the target first text units included in each target first text unit pair.

S140, obtaining key structured data corresponding to the table to be processed based on the master-slave relationship between the target first text units included in each target first text unit pair, wherein the key structured data is data representing the master-slave relationship between first text units.

It is understood that the key structured data is data representing a master-slave relationship between the first text unit and the first text unit, and the target first text unit pair includes the first text unit having the master-slave relationship therebetween, so that the key structured data corresponding to the table to be processed can be obtained based on the master-slave relationship between the target first text units included in each target first text unit pair.

As one approach, obtaining the key structured data corresponding to the table to be processed based on the master-slave relationship between the target first text units included in each target first text unit pair includes: performing hierarchical processing on the target first text units based on the master-slave relationship between the target first text units included in each target first text unit pair, to obtain the key structured data corresponding to the table to be processed.

Here, the master-slave relationship between two target first text units can be understood as an upper-lower level relationship between them. For example, there is a master-slave relationship between "dividend year" and "2015", so "dividend year" can be understood as the upper level of "2015", and "2015" as the lower level of "dividend year".

Therefore, after the master-slave relationship between the target first text units is obtained, the target first text units can be subjected to hierarchical processing according to the master-slave relationship between the target first text units, so that the key structured data corresponding to the to-be-processed table is obtained.

It will be appreciated that there are two possible scenarios in which the table to be processed includes a plurality of target first text unit pairs. In the first scenario, the table to be processed includes only a key and sub-keys, but the key corresponds to a plurality of sub-keys, so there are at least two target first text unit pairs. Illustratively, still taking the table to be processed shown in fig. 3a as an example, the target first text unit pairs "dividend year-2015", "dividend year-2016", and "dividend year-2017" can be obtained from the table. Based on "dividend year-2015", the master-slave relationship between the target first text units "dividend year" and "2015" is obtained, so that the upper level is "dividend year" and the lower level is "2015"; based on "dividend year-2016", the master-slave relationship between "dividend year" and "2016" is obtained, so that the upper level is "dividend year" and the lower level is "2016"; based on "dividend year-2017", the master-slave relationship between "dividend year" and "2017" is obtained, so that the upper level is "dividend year" and the lower level is "2017". Therefore, after hierarchical processing of the target first text units, the upper level is "dividend year" and its lower level includes "2015", "2016", and "2017"; that is, in the key structured data of the table to be processed, the key is "dividend year" and the sub-keys are "2015", "2016", and "2017". Compared with the key structured data obtained in fig. 3b, in which the key is "dividend year" and the only sub-key is "2016", this result is more complete, and the key structured data obtained in the embodiment of the present application therefore has higher accuracy.

In the second scenario, the table to be processed may include keys, child keys, grandchild keys, and so on; in this case, the first text units in the table to be processed can be understood as being nested level by level. Illustratively, still taking the table to be processed shown in fig. 6 as an example, the target first text unit pairs include "job-director", "job-manager", "director-chief director", "director-deputy director", "manager-chief manager", and "manager-deputy manager". Based on "job-director", the master-slave relationship between the target first text units "job" and "director" is obtained, so that "job" is the upper level relative to "director"; based on "job-manager", the master-slave relationship between "job" and "manager" is obtained, so that "job" is the upper level relative to "manager"; based on "director-chief director", the master-slave relationship between "director" and "chief director" is obtained, so that "director" is the upper level relative to "chief director"; based on "director-deputy director", the master-slave relationship between "director" and "deputy director" is obtained, so that "director" is the upper level relative to "deputy director"; based on "manager-chief manager", the master-slave relationship between "manager" and "chief manager" is obtained, so that "manager" is the upper level relative to "chief manager"; and based on "manager-deputy manager", the master-slave relationship between "manager" and "deputy manager" is obtained, so that "manager" is the upper level relative to "deputy manager".

Therefore, after hierarchical processing of the target first text units, the following is obtained: the first level includes "job"; the second level under "job" includes "director" and "manager"; the third level under the second-level "director" includes "chief director" and "deputy director"; and the third level under the second-level "manager" includes "chief manager" and "deputy manager". That is, in the key structured data finally obtained for the table to be processed, the key is "job", the child keys are "director" and "manager", the grandchild keys corresponding to the child key "director" are "chief director" and "deputy director", and the grandchild keys corresponding to the child key "manager" are "chief manager" and "deputy manager".
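The hierarchical processing in both scenarios can be sketched as building a nested mapping from (master, slave) pairs. This is an illustrative sketch only: the function name and the nested-dictionary representation of the key structured data are assumptions, and the sketch assumes the pairs contain no cycles.

```python
def build_key_hierarchy(target_pairs):
    """Turn (master, slave) pairs into nested dictionaries, one level per key level."""
    children = {}
    slaves = set()
    for master, slave in target_pairs:
        children.setdefault(master, []).append(slave)
        slaves.add(slave)

    def subtree(key):
        # Keys without any slave become empty dictionaries (lowest key level).
        return {child: subtree(child) for child in children.get(key, [])}

    # A top-level key is a master that is never the slave of another key.
    roots = [master for master in children if master not in slaves]
    return {root: subtree(root) for root in roots}


# Pair names modelled on the fig. 6 example.
job_pairs = [
    ("job", "director"), ("job", "manager"),
    ("director", "chief director"), ("director", "deputy director"),
    ("manager", "chief manager"), ("manager", "deputy manager"),
]
job_hierarchy = build_key_hierarchy(job_pairs)
```

The same function covers the first scenario: three pairs sharing one master produce a single key with three sub-keys.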

It can be understood that the table to be processed includes first text units and second text units, and the structured data corresponding to the table to be processed may include both the master-slave relationships between first text units and the association relationships between first text units and second text units. The key structured data is data representing the master-slave relationships between first text units. Therefore, to obtain the structured data corresponding to the table to be processed, in addition to the key structured data, key value structured data representing the association relationships between first text units and second text units may also be obtained; the association relationship between a first text unit and a second text unit can also be understood as the association between a key and its corresponding value. After the key value structured data corresponding to the table to be processed is obtained, the key structured data and the key value structured data can be merged to obtain the structured data corresponding to the table to be processed.

Illustratively, still taking the table to be processed shown in fig. 3a as an example, in the key structured data corresponding to the table to be processed, the key is "dividend year" and the sub-keys are "2015", "2016", and "2017"; in the key value structured data corresponding to the table to be processed, the sub-key "2015" corresponds to the values "0.00", "-17.63", and "0.00%", the sub-key "2016" corresponds to the values "0.00", "-11.86", and "0.00%", and the sub-key "2017" corresponds to the values "0.00", "35.00", and "0.00%". Merging the key structured data and the key value structured data gives the structured data corresponding to the table to be processed, in which the key is "dividend year", the sub-keys are "2015", "2016", and "2017", the values corresponding to the sub-key "2015" are "0.00", "-17.63", and "0.00%", the values corresponding to the sub-key "2016" are "0.00", "-11.86", and "0.00%", and the values corresponding to the sub-key "2017" are "0.00", "35.00", and "0.00%".
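The merging step can be sketched as follows. The nested-dictionary form of the key structured data, the dictionary form of the key value structured data, and the function name are all assumptions chosen for illustration.

```python
def merge_structured_data(key_structure, key_values):
    """Attach each lowest-level key's values to the key hierarchy.

    key_structure: nested dictionaries describing key -> sub-key levels.
    key_values: mapping from a lowest-level key to its list of values.
    """
    def attach(name, subtree):
        if not subtree:  # lowest-level key: replace the empty node by its values
            return list(key_values.get(name, []))
        return {child: attach(child, grand) for child, grand in subtree.items()}

    return {key: attach(key, sub) for key, sub in key_structure.items()}


# Values taken from the fig. 3a example in the text.
key_structure = {"dividend year": {"2015": {}, "2016": {}, "2017": {}}}
key_values = {
    "2015": ["0.00", "-17.63", "0.00%"],
    "2016": ["0.00", "-11.86", "0.00%"],
    "2017": ["0.00", "35.00", "0.00%"],
}
structured_data = merge_structured_data(key_structure, key_values)
```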

Because the borders of the table to be processed are missing, the table is inconvenient for a user to view. Therefore, after the structured data corresponding to the table to be processed is obtained, a table with complete borders corresponding to the table to be processed can be generated based on the structured data and then output, so that the content of the table can be viewed more clearly and intuitively.

As one approach, generating a table with complete borders corresponding to the table to be processed based on the structured data may include: obtaining the table tags corresponding to the first text units included in the structured data based on the correspondence between the structured data and the table tags; and generating the table with complete borders corresponding to the table to be processed based on the table tags.

The table tags may be used to generate a table. For example, when an HTML (HyperText Markup Language) table is generated, the table tags may include tags such as <tr>, <th>, and <td>, where <tr> defines a table row, <th> defines a table header, and <td> defines a table cell. It should be noted that the number of columns of the table is related to the number of table cells defined; for example, if 5 table cells are defined, the table has 5 columns.

It can be understood that the key is located in the first row of the structured data, the sub-keys are located in the second row, and the values are arranged in the following rows; correspondingly, the key is located in the header of the table, the sub-keys are located below the key, and the values are located below the sub-keys. Therefore, as one approach, the correspondence between the structured data and the table tags can be established according to the relative positions of the text units in the structured data.

After the correspondence between the structured data and the table tags is established, the table tags corresponding to the first text units included in the structured data can be obtained. After these table tags are obtained, the electronic device may automatically generate an HTML table based on the table tags, that is, a table with complete borders corresponding to the table to be processed.

By way of example, continuing with the table to be processed shown in fig. 3a, assume that in the structured data corresponding to the table to be processed the key is "dividend year", the sub-keys are "2015", "2016", and "2017", the values corresponding to the sub-key "2015" are "0.00", "-17.63", and "0.00%", the values corresponding to the sub-key "2016" are "0.00", "-11.86", and "0.00%", and the values corresponding to the sub-key "2017" are "0.00", "35.00", and "0.00%". According to the correspondence between the structured data and the table tags, the key corresponds to the first-level table row tag <tr> and the table header tag <th>; the sub-keys "2015", "2016", and "2017" correspond to the second-level table row tag <tr> and the table cell tag <td>; the values "0.00", "-17.63", and "0.00%" corresponding to the sub-key "2015" correspond respectively to the third-, fourth-, and fifth-level table row tags <tr> and the table cell tag <td>; the values "0.00", "-11.86", and "0.00%" corresponding to the sub-key "2016" correspond respectively to the third-, fourth-, and fifth-level table row tags <tr> and the table cell tag <td>; and the values "0.00", "35.00", and "0.00%" corresponding to the sub-key "2017" correspond respectively to the third-, fourth-, and fifth-level table row tags <tr> and the table cell tag <td>. Based on the table tags, the electronic device can automatically generate, in HTML, the table with complete borders shown in fig. 3c.
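The mapping from structured data to table tags can be sketched as follows. This is a minimal sketch rather than the embodiment's generator: the function name is hypothetical, and the `border` and `colspan` attributes are assumptions added so that the header spans all sub-key columns and the borders are drawn.

```python
from html import escape


def render_html_table(key, sub_keys, values_by_sub_key):
    """Render one key, its sub-keys, and their values as an HTML table."""
    column_count = len(sub_keys)
    lines = ['<table border="1">']
    # First level: one row holding the key as a table header.
    lines.append(f'  <tr><th colspan="{column_count}">{escape(key)}</th></tr>')
    # Second level: one row holding the sub-keys as table cells.
    lines.append("  <tr>" + "".join(f"<td>{escape(s)}</td>" for s in sub_keys) + "</tr>")
    # Third and later levels: one row per value position.
    row_count = max(len(values_by_sub_key[s]) for s in sub_keys)
    for i in range(row_count):
        cells = []
        for s in sub_keys:
            column = values_by_sub_key[s]
            cells.append(f"<td>{escape(column[i]) if i < len(column) else ''}</td>")
        lines.append("  <tr>" + "".join(cells) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


table_html = render_html_table(
    "dividend year",
    ["2015", "2016", "2017"],
    {
        "2015": ["0.00", "-17.63", "0.00%"],
        "2016": ["0.00", "-11.86", "0.00%"],
        "2017": ["0.00", "35.00", "0.00%"],
    },
)
```

The output contains five table rows: one for the key, one for the sub-keys, and three for the values, matching the five levels described above.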

According to the data processing method provided in this embodiment, after the text units in the table to be processed are classified to obtain each first text unit of the table to be processed, target first text unit pairs having a master-slave relationship are obtained based on the first text units, and the key structured data corresponding to the table to be processed is then obtained based on the target first text unit pairs. Because the key structured data is obtained from target first text unit pairs that have a master-slave relationship, the key structured data accurately expresses the master-slave relationships between first text units, which improves the accuracy of the key structured data and, in turn, the accuracy of the structured data obtained from the table.

Referring to fig. 7, fig. 7 is a flowchart illustrating a data processing method according to another embodiment of the present application, where the method includes:

S210, classifying the text units in the table to be processed to obtain the text units corresponding to the keys of the table to be processed, wherein the obtained text units corresponding to the keys are the first text units.

S220, obtaining at least one first text unit pair based on any two first text units, wherein every two first text units correspond to one first text unit pair.

In order to facilitate subsequent obtaining of the target first text unit pairs with master-slave relationship, after the first text units in the to-be-processed table are obtained, any two first text units can be combined to obtain at least one first text unit pair, namely each first text unit pair corresponds to two first text units, and then the target first text unit pairs with master-slave relationship are obtained from the first text unit pairs.

Illustratively, referring to fig. 8, fig. 8 shows a schematic diagram of some of the first text unit pairs obtained from the table to be processed shown in fig. 3a. Five first text unit pairs are shown in fig. 8: "dividend year-2015", "dividend year-2016", "dividend year-2017", "2015-dividend year", and "2015-2016".

It should be noted that the two first text units included in a first text unit pair are ordered. Even if two first text unit pairs include the same two first text units, the two pairs are different if the order of the two first text units differs. For example, the first text unit pair "dividend year-2015" is different from the first text unit pair "2015-dividend year".
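Forming ordered pairs from any two first text units can be sketched with the standard library. The function name is an assumption; `itertools.permutations` is used because it yields both orders of every two distinct units.

```python
from itertools import permutations


def all_ordered_pairs(first_text_units):
    """Return every ordered pair of two distinct first text units."""
    return list(permutations(first_text_units, 2))


ordered_pairs = all_ordered_pairs(["dividend year", "2015", "2016", "2017"])
```

For n first text units this gives n x (n - 1) pairs, so the four units above produce 12 pairs, including both "dividend year-2015" and "2015-dividend year".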

S230, acquiring the pair type corresponding to each first text unit pair.

Each first text unit pair corresponds to one pair type, which indicates whether a master-slave relationship exists between the two first text units included in the pair. The pair types may include a first pair type and a second pair type: if the pair type is the first pair type, a master-slave relationship exists between the two first text units included in the first text unit pair; if the pair type is the second pair type, no master-slave relationship exists between them. Therefore, to facilitate subsequently obtaining the target first text unit pairs having a master-slave relationship, the pair type corresponding to each first text unit pair may be obtained, and each first text unit pair whose pair type is the first pair type is directly determined to be a target first text unit pair having a master-slave relationship.
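Selecting the target pairs by pair type can be sketched as follows. The `PairType` enumeration, the function name, and the lookup table standing in for a trained classifier are all hypothetical.

```python
from enum import Enum


class PairType(Enum):
    FIRST = 1   # a master-slave relationship exists between the two units
    SECOND = 2  # no master-slave relationship exists


def select_target_pairs(pairs, pair_type_of):
    """Keep only the pairs that the classifier assigns to the first pair type."""
    return [pair for pair in pairs if pair_type_of(pair) is PairType.FIRST]


# A lookup table stands in for the trained classification model (assumption).
known_master_slave = {("dividend year", "2015"), ("dividend year", "2016")}


def toy_pair_type(pair):
    return PairType.FIRST if pair in known_master_slave else PairType.SECOND


target_pairs = select_target_pairs(
    [("dividend year", "2015"), ("2015", "dividend year"), ("2015", "2016"),
     ("dividend year", "2016")],
    toy_pair_type,
)
```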

As one way, as shown in fig. 9, obtaining the pair type corresponding to each first text unit pair includes:

S231, acquiring the feature vector corresponding to each first text unit pair.

It should be noted that, in this embodiment, the feature vector corresponding to a first text unit pair can be understood as a representation of that first text unit pair in vector form. The feature vector may have two attributes: an input sequence length and a word vector dimension.

As one approach, word segmentation may be performed on the first text unit pair to obtain the segmented texts corresponding to the first text unit pair and their number. For example, performing word segmentation on the first text unit pair "dividend year-2015" (分红年度-2015年 in the original Chinese) yields 6 segmented texts: "分", "红", "年", "度", "2015", and "年". The sequence length of the feature vector may be the same as the number of words included in the first text unit pair, that is, the input sequence length of the feature vector is 6.

After the plurality of segmented texts corresponding to the first text unit pair are obtained, a pre-established word vector configuration rule may be obtained by means of pre-trained word vectors, and the vector corresponding to each individual segmented text is then obtained according to the word vector configuration rule and used as the vector representation of that segmented text. The vector corresponding to each segmented text may be a multidimensional vector, for example a 64-dimensional or 128-dimensional vector.

Optionally, the word segmentation of the first text unit pair may use a word segmentation tool such as jieba word segmentation or Sogou word segmentation.

Optionally, the pre-established word vector configuration rule may be obtained in a word2vec-based manner or a GloVe (Global Vectors for Word Representation) based manner. It should be noted that word2vec is a family of related models used to train and generate word vectors. The word2vec model may be a three-layer neural network, in which the input layer performs one-hot encoding of words and the hidden layer is a linear unit. The output layer of the word2vec model has the same dimension as the input layer and is implemented based on softmax regression. Softmax regression is a generalization of logistic regression and can be used for multi-class classification. During training, an N-gram language model may be trained by the neural network, and the word vectors corresponding to the words are obtained in the course of training. After training based on word2vec is completed, the model can be used to map each word to a vector and to represent the relationships between words, so that the vector representation corresponding to each segmented text is obtained based on the word2vec model. GloVe is a word representation tool based on global word-frequency statistics; it represents each word as a vector of real numbers, and these vectors capture semantic properties of words such as similarity and analogy.

Referring to fig. 10, fig. 10 shows a schematic diagram of the feature vector corresponding to a first text unit pair in this embodiment. The cells along the vertical direction correspond to the words included in the first text unit pair, in order, and the cells along the horizontal direction represent the word vector dimensions of each word, where the dimension k may be 64, 128, or the like.
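The n x k layout of fig. 10 can be sketched as a list of n rows with k numbers each. This is only an illustration: the random vectors produced by `toy_word_vectors` stand in for pre-trained word2vec or GloVe vectors, and all names and the zero vector for unknown words are assumptions.

```python
import random


def toy_word_vectors(vocabulary, k, seed=0):
    """Stand-in for pre-trained word vectors: one random k-dimensional vector per word."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(k)] for word in vocabulary}


def pair_feature_matrix(tokens, word_vectors, k):
    """Stack the vectors of the segmented texts into an n x k matrix (n = len(tokens))."""
    unknown = [0.0] * k  # assumption: words without a vector map to zeros
    return [list(word_vectors.get(token, unknown)) for token in tokens]


# Six illustrative segmented texts; the third and sixth are the same word.
tokens = ["分", "红", "年", "度", "2015", "年"]
vectors = toy_word_vectors(set(tokens), k=64)
feature_matrix = pair_feature_matrix(tokens, vectors, k=64)
```

The resulting matrix has an input sequence length of 6 and a word vector dimension of 64, and repeated words share the same row values.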

S232, obtaining the pair type of the corresponding first text unit pair based on each feature vector.

It is to be understood that, after the feature vectors are obtained, the correspondence between feature vectors and pair types can be used to obtain the pair type of the first text unit pair corresponding to each feature vector.

As one mode, steps S231 to S232 in the present embodiment may be performed by a trained classification model. In this way, the correspondence between the feature vector and the pair type of the first text unit pair can be derived by the classification model.

In some approaches, the classification model may be a Bayesian classification model, a maximum entropy classifier, a convolutional neural network, or the like.

Alternatively, as shown in fig. 11, obtaining the pair type corresponding to each first text unit pair includes:

S233, performing feature extraction on each first text unit pair through a feature extraction layer of the classification model to obtain a feature vector corresponding to each first text unit pair.

The feature extraction layer is used for preprocessing original input data and then extracting features of the preprocessed data to obtain feature vectors corresponding to the original data.

S234, processing the feature vector corresponding to each first text unit pair through the feature processing layer of the classification model to obtain the pair type corresponding to each first text unit pair.

The feature processing layer is used for calculating scores of each pair type according to the feature vectors corresponding to each first text unit pair and normalizing the scores into the probability of each pair type.
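The normalization of scores into probabilities can be sketched as follows. The text does not name the normalization function used by the feature processing layer; softmax is assumed here as a common choice, and the function names and labels are illustrative.

```python
import math


def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the maximum avoids overflow in exp
    exponentials = [math.exp(score - highest) for score in scores]
    total = sum(exponentials)
    return [value / total for value in exponentials]


def predict_pair_type(scores, labels=("first pair type", "second pair type")):
    """Return the most probable pair type together with all probabilities."""
    probabilities = softmax(scores)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best], probabilities


predicted_label, predicted_probabilities = predict_pair_type([2.0, 0.5])
```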

As an alternative, the classification model may be a convolutional neural network. Referring to fig. 12, fig. 12 shows a schematic network structure of a convolutional neural network proposed in an embodiment of the present application. In fig. 12, the convolutional neural network includes a feature extraction layer and a feature processing layer, where the feature extraction layer may include an input layer, a convolutional layer, and a pooling layer, and the feature processing layer may include a fully-connected layer and an output layer. Steps S233 to S234 are further described below in conjunction with the convolutional neural network shown in fig. 12. As shown in fig. 12, each first text unit pair may be input into the convolutional neural network to obtain the classification label of the first text unit pair, that is, the pair type of the first text unit pair.

As shown in fig. 12, the input layer of the convolutional neural network is configured to receive each input first text unit pair and convert the natural text in the pair into a mathematical vector with an input sequence length of n and a word vector dimension of k, where n is the number of words obtained by performing word segmentation on the first text unit pair, and k is a preset dimension, for example 64 or 128. The convolutional layer is used for further extracting features from the vector produced by the input layer according to word sense, word order, and context. The pooling layer is used for screening the further extracted features according to their importance; the screened features form the feature vector corresponding to the first text unit pair. The fully-connected layer is used for calculating the score of each pair type from the screened feature vector, normalizing the scores into the probability of each pair type, and determining the pair type of the first text unit pair according to these probabilities. The output layer is used for outputting the classification label, i.e., the pair type of the first text unit pair. In operation, the convolutional neural network shown in fig. 12 applies one-dimensional convolutional layers with filter_size = (2, 3, 4) to the input first text unit pair, each filter size having 2 channels, which is equivalent to extracting two 2-gram features, two 3-gram features, and two 4-gram features.
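The n-gram interpretation of the convolutional and pooling layers can be sketched in plain Python. This is a minimal illustration, not the network of fig. 12: the random weights, the ReLU activation, and max-over-time pooling are assumptions, and the toy input uses n = 5 and k = 4 rather than 64 or 128.

```python
import random

def conv1d_max(seq, kernel):
    """Slide one filter over the word sequence and keep its strongest response."""
    width = len(kernel)  # number of consecutive words the filter covers
    responses = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        act = sum(w * x
                  for k_row, x_row in zip(kernel, window)
                  for w, x in zip(k_row, x_row))
        responses.append(max(0.0, act))  # ReLU (assumed activation)
    return max(responses)  # max-over-time pooling (assumed pooling rule)

def extract_features(seq, filter_sizes=(2, 3, 4), channels=2, seed=0):
    """Return one pooled feature per (filter size, channel): 3 * 2 = 6 values.

    Assumes len(seq) >= max(filter_sizes); shorter inputs would need padding.
    """
    rng = random.Random(seed)
    k = len(seq[0])  # word vector dimension
    feats = []
    for size in filter_sizes:
        for _ in range(channels):
            kernel = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(size)]
            feats.append(conv1d_max(seq, kernel))
    return feats

# Toy input: n = 5 words, each a k = 4 dimensional vector (random stand-ins)
data_rng = random.Random(1)
sequence = [[data_rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
features = extract_features(sequence)
```

Each of the six resulting values summarizes how strongly one 2-gram, 3-gram, or 4-gram filter responded anywhere in the pair's text; in a trained network the kernels would be learned rather than random.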

As one way, the classification model may be obtained by: acquiring a sample first text unit pair carrying a pair type label, namely training data; and training the initial model based on the sample first text unit pair and the carried pair type label to obtain a classification model.

The sample first text unit pair may be understood as a first text unit pair carrying a classification label, that is, a first text unit pair carrying a pair type label, where the pair type label may be a first pair type label or a second pair type label. Optionally, the sample first text unit pair may be obtained from a table with a complete frame.

For example, referring to fig. 13 and fig. 14 together, fig. 13 shows a schematic diagram of a table with a complete frame according to an embodiment of the present application, and fig. 14 shows a schematic diagram of sample first text unit pairs obtained from the table shown in fig. 13. As shown in fig. 14, the sample first text unit pairs carry labels 1 and 0, where label 1 represents the first pair type label and label 0 represents the second pair type label. When the sample first text unit pairs carrying pair type labels are used as training samples to train the convolutional neural network, a sample first text unit pair carrying the first pair type label can be used as a positive sample, and a sample first text unit pair carrying the second pair type label can be used as a negative sample.

Referring to fig. 15, fig. 15 is a schematic diagram illustrating a training process of a convolutional neural network according to an embodiment of the present application. As shown in fig. 15, the training process includes: S201, inputting training data into the input layer; S202, the input layer converts the natural text in the training data into mathematical vectors; S203, the convolutional layer further extracts features according to word sense, word order, and context; S204, the pooling layer screens the features according to their importance; S205, the fully-connected layer calculates the score of each category and normalizes the scores into the probability of each category; S206, the output layer outputs the predicted category; and S207, the error between the predicted category and the pair type label is evaluated through a loss function, and the parameter weights are updated by back-propagating the error and its gradient. Steps S201 to S207 complete one training pass of the convolutional neural network (the initial model), and the pass is repeated with a plurality of training samples to obtain the classification model. Optionally, the loss function may be a cross-entropy loss function.
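Steps S205 to S207 can be sketched for a single fully-connected layer. This is a simplified stand-in rather than the full network: the feature vectors, the learning rate, and the use of per-sample gradient descent are assumptions chosen only to show how a cross-entropy loss drives the weight update.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(weights, features, label, lr=0.5):
    """One forward and backward pass of a linear layer with cross-entropy loss."""
    # S205: score each category and normalize into probabilities
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    probs = softmax(scores)
    # S207: cross-entropy between the prediction and the pair type label
    loss = -math.log(probs[label])
    for cls, row in enumerate(weights):
        grad = probs[cls] - (1.0 if cls == label else 0.0)  # d(loss)/d(score)
        for j, x in enumerate(features):
            row[j] -= lr * grad * x  # gradient descent update
    return loss

# Hypothetical pooled feature vectors with pair type labels (1 = positive sample)
samples = [([1.0, 0.0, 1.0], 1), ([0.0, 1.0, 1.0], 0)]
weights = [[0.0] * 3 for _ in range(2)]  # one weight row per pair type
losses = []
for epoch in range(20):
    losses.append(sum(train_step(weights, f, y) for f, y in samples))
```

The summed loss recorded in `losses` falls over the epochs, which is the behaviour the repeated passes S201 to S207 aim for.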

When the convolutional neural network is trained, manual parameter adjustment can be carried out to optimize the convolutional neural network. Various manual parameter adjustment methods are available.

As one way, the manual parameter adjustment may include: trying different optimization algorithms, such as the SGD (stochastic gradient descent) optimization algorithm, the Adam optimization algorithm, the Nadam optimization algorithm, etc., to obtain better accuracy.

Alternatively, the manual parameter adjustment may include: trying different learning rates to obtain better accuracy and, where the accuracy is equivalent, selecting the learning rate that gives the higher training speed.

Alternatively, the manual parameter adjustment may include: adding a dropout layer after each layer of the network and trying different dropout parameters to obtain better generalization performance, where dropout means that, during training of the deep learning network, a neural network unit is temporarily discarded from the network with a certain probability.
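The dropout operation described above can be sketched as follows. The rescaling of the surviving units (so-called inverted dropout) is an assumption that the source does not specify, and `drop_prob` is a hypothetical parameter name.

```python
import random

def dropout(activations, drop_prob, training=True, rng=random):
    """Temporarily discard each unit with probability drop_prob during training.

    Surviving units are divided by the keep probability (inverted dropout,
    an assumed convention). Requires 0 <= drop_prob < 1.
    """
    if not training or drop_prob == 0.0:
        return list(activations)  # no units are discarded at inference time
    keep = 1.0 - drop_prob
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

# Trying different dropout parameters on the same hypothetical layer output
hidden = [0.8, 0.1, 0.5, 0.9]
for p in (0.1, 0.3, 0.5):
    dropped = dropout(hidden, p, rng=random.Random(0))
```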

As one way, the pair type of the first text unit pair may be determined from probabilities learned by training a language model; in this case, the pair type of each first text unit pair may be obtained by inputting each feature vector into the language model. With this approach, negative examples do not need to be constructed.

S240, determining the first text unit pair with the first pair type as a target first text unit pair with master-slave relationship.

Since the first pair type indicates that a master-slave relationship exists between the two first text units included in a first text unit pair, in this embodiment a first text unit pair whose pair type is the first pair type may be directly determined as a target first text unit pair having a master-slave relationship.
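Pair generation and the selection in S240 can be sketched as follows. The key names and the `classify` stand-in are hypothetical; in the described method the pair type would come from the trained classification model.

```python
from itertools import combinations

FIRST_PAIR_TYPE = 1   # label 1: a master-slave relationship exists
SECOND_PAIR_TYPE = 0  # label 0: no master-slave relationship

def target_pairs(first_text_units, classify):
    """Form one pair from every two first text units and keep the first pair type."""
    pairs = combinations(first_text_units, 2)
    return [pair for pair in pairs if classify(pair) == FIRST_PAIR_TYPE]

# Stand-in for the classification model: a hypothetical lookup of known relations
known = {frozenset(("Revenue", "Domestic")), frozenset(("Revenue", "Overseas"))}

def classify(pair):
    return FIRST_PAIR_TYPE if frozenset(pair) in known else SECOND_PAIR_TYPE

keys = ["Revenue", "Domestic", "Overseas", "Cost"]
targets = target_pairs(keys, classify)
```

With four keys, six pairs are classified and two of them are kept as target first text unit pairs. Which unit of each pair is the master is determined separately in S250.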

S250, acquiring the master-slave relationship between the target first text units included in the target first text unit pair.

S260, obtaining key structured data corresponding to the to-be-processed table based on the master-slave relationship between the target first text units included in the target first text unit pair, wherein the key structured data is data representing the master-slave relationship between first text units.

S270, obtaining key value structured data corresponding to the table to be processed, wherein the key value structured data is data representing the association relationship between the first text unit and the second text unit, and the second text unit is a text unit corresponding to each value of the table to be processed.

S280, merging the key structured data and the key value structured data to obtain structured data corresponding to the to-be-processed table.

In the data processing method provided by this embodiment, the text units in the table to be processed are classified to obtain the first text units of the table to be processed; at least one first text unit pair is obtained based on any two first text units, every two first text units corresponding to one first text unit pair; the pair type corresponding to each first text unit pair is obtained; the first text unit pairs whose pair type is the first pair type are determined as target first text unit pairs having a master-slave relationship; and finally the key structured data corresponding to the table to be processed is obtained based on the target first text unit pairs. Because a first text unit pair can be formed from any two first text units and the target first text unit pairs are selected based on the pair type, the process of obtaining the target first text unit pairs is simplified. Because the key structured data is obtained from target first text unit pairs having a master-slave relationship, it can accurately express the master-slave relationship between first text units, which improves the accuracy of the key structured data and thus the accuracy of the structured data obtained from the table.

Referring to fig. 16, fig. 16 is a flowchart illustrating a data processing method according to another embodiment of the present application, the method including:

S310, classifying the text units in the table to be processed to obtain the text units corresponding to the keys of the table to be processed, wherein the obtained text units corresponding to the keys are first text units.

S320, obtaining a target first text unit pair with master-slave relation based on the first text unit.

S330, acquiring the master-slave relationship between the target first text units included in the target first text unit pair.

S340, obtaining key structured data corresponding to the to-be-processed table based on the master-slave relationship between the target first text units included in the target first text unit pair, wherein the key structured data is data representing the master-slave relationship between first text units.

S350, acquiring each second text unit set, wherein the second text units included in each second text unit set have at least one same coordinate attribute, and the second text unit is a text unit corresponding to each value of the table to be processed.

It will be appreciated that the table to be processed is made up of a plurality of text units, each of which may have a corresponding coordinate attribute. Since the text units are classified into first text units and second text units, each first text unit and each second text unit has a corresponding coordinate attribute.

It should be further noted that the table to be processed may be a first type of table to be processed or a second type of table to be processed. In a first type of table to be processed, the structured data is presented in a vertical manner; for example, in the table shown in fig. 3a, the relationship between key, sub-key, and value is presented vertically. In a second type of table to be processed, the structured data is presented in a horizontal manner; for example, in the table shown in fig. 4, the relationship between key, sub-key, and value is presented horizontally.

As one way, a pre-trained neural network model may be used to determine whether the table to be processed is a first type of table to be processed or a second type of table to be processed; that is, the table to be processed is input into the pre-trained neural network model to determine its type. Optionally, the neural network model may be a convolutional neural network.

The second text unit set can be understood as a set of one column or one row of second text units, and all the second text units in a second text unit set belong to the same first text unit. For a first type of table to be processed, the second text unit set is a set of one column of second text units, and for a second type of table to be processed, the second text unit set is a set of one row of second text units.

The coordinate attribute can be understood as coordinate information corresponding to a preset position in each text unit. In some approaches, for a first type of table to be processed, the coordinate attribute may be the left boundary of the abscissa, the right boundary of the abscissa, and the middle value of the abscissa corresponding to the text unit, and for a second type of table to be processed, the coordinate attribute may be the upper and lower boundary values of the ordinate corresponding to the text unit.

With reference to the foregoing embodiment, each text element corresponding to the table to be processed corresponds to a coordinate, so that the coordinate range of each text unit can be calculated according to the coordinate corresponding to each text element, and the coordinate attribute corresponding to each text unit can be obtained according to the coordinate range of each text unit.
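A minimal sketch of deriving a text unit's coordinate range and coordinate attributes from its text elements follows. The element layout `(char, x0, y0, x1, y1)` and the class and field names are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class TextUnit:
    text: str
    left: float    # left boundary of the abscissa
    right: float   # right boundary of the abscissa
    top: float     # upper boundary of the ordinate
    bottom: float  # lower boundary of the ordinate

    @property
    def mid_x(self):
        """Middle value of the abscissa."""
        return (self.left + self.right) / 2

def unit_from_elements(elements):
    """Build a text unit whose coordinate range encloses all of its text elements.

    elements: list of (char, x0, y0, x1, y1), an assumed layout.
    """
    return TextUnit(
        text="".join(e[0] for e in elements),
        left=min(e[1] for e in elements),
        right=max(e[3] for e in elements),
        top=min(e[2] for e in elements),
        bottom=max(e[4] for e in elements),
    )

# Two hypothetical text elements that together form the text unit "17"
unit = unit_from_elements([("1", 10, 5, 14, 12), ("7", 14, 5, 18, 12)])
```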

Considering that an alignment mode, such as left alignment, right alignment, or center alignment, is usually selected when a table is edited: for a first type of table edited with left alignment, the left boundaries of the abscissa of the second text units in a second text unit set are the same; for a first type of table edited with right alignment, the right boundaries of the abscissa of the second text units in a second text unit set are the same; and for a first type of table edited with center alignment, the middle values of the abscissa of the second text units in a second text unit set are the same. Therefore, for the first type of table to be processed, as one way, each second text unit set may be obtained through the left boundary of the abscissa, the right boundary of the abscissa, or the middle value of the abscissa of the text units, that is, it is determined which second text units belong to the same column. By way of example, with continued reference to fig. 3a, the abscissa right boundaries of the second text units "0.00", "17.63", and "0.00%" are the same, the abscissa right boundaries of the second text units "0.00", "11.86", and "0.00%" are the same, and the abscissa right boundaries of the second text units "0.00", "35.00", and "0.00%" are the same, so that, based on the table to be processed shown in fig. 3a, 3 second text unit sets can be obtained: the first second text unit set includes the second text units "0.00", "17.63", and "0.00%"; the second second text unit set includes the second text units "0.00", "11.86", and "0.00%"; and the third second text unit set includes the second text units "0.00", "35.00", and "0.00%".
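Grouping second text units by a shared coordinate attribute can be sketched as follows. The coordinates are invented to mirror the right-aligned example above, and the small `tolerance` is an assumption added to absorb rounding in extracted coordinates; the source only requires the attribute to be the same.

```python
def group_by_attribute(units, attribute, tolerance=0.5):
    """Group text units whose chosen coordinate attribute is (nearly) equal.

    units: list of dicts with 'text' and coordinate attributes (assumed layout).
    """
    groups = []
    # sorted() is stable, so units sharing a value keep their input order
    for unit in sorted(units, key=lambda u: u[attribute]):
        if groups and abs(unit[attribute] - groups[-1][-1][attribute]) <= tolerance:
            groups[-1].append(unit)
        else:
            groups.append([unit])
    return groups

# Hypothetical coordinates for the nine values of the fig. 3a example
units = [
    {"text": "0.00", "right": 100.0, "top": 20.0},
    {"text": "0.00", "right": 200.0, "top": 20.0},
    {"text": "0.00", "right": 300.0, "top": 20.0},
    {"text": "17.63", "right": 100.0, "top": 40.0},
    {"text": "11.86", "right": 200.0, "top": 40.0},
    {"text": "35.00", "right": 300.0, "top": 40.0},
    {"text": "0.00%", "right": 100.2, "top": 60.0},
    {"text": "0.00%", "right": 200.0, "top": 60.0},
    {"text": "0.00%", "right": 300.0, "top": 60.0},
]
columns = group_by_attribute(units, "right")
```

The same function applies to a second type of table by grouping on an ordinate attribute such as `"top"` instead of `"right"`.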

For a second type of table to be processed, the second text unit set is a set of one row of second text units. Therefore, for the second type of table to be processed, as one way, each second text unit set may be obtained through the upper and lower boundary values of the ordinate of the text units, that is, it is determined which second text units belong to the same row. By way of example, with continued reference to fig. 4, the ordinate upper and lower boundary values of the second text units "0.98", "0.88", and "0.90" are the same, the ordinate upper and lower boundary values of the second text units "0.96", "0.95", and "0.94" are the same, and the ordinate upper and lower boundary values of the second text units "0.75", "0.76", and "0.80" are the same, so that, based on the table to be processed shown in fig. 4, 3 second text unit sets can be obtained: the first second text unit set includes the second text units "0.98", "0.88", and "0.90"; the second second text unit set includes the second text units "0.96", "0.95", and "0.94"; and the third second text unit set includes the second text units "0.75", "0.76", and "0.80".

It should be noted that one first text unit may correspond to one second text unit or to a plurality of second text units; therefore, each second text unit set may include at least one second text unit.

S360, acquiring the first text units corresponding to the second text unit sets.

Since all the second text units in the second text unit sets belong to the same first text unit, obtaining the first text unit corresponding to each second text unit set can be understood as a process of obtaining the first text unit to which each second text unit set belongs, that is, a process of obtaining a key to which a value belongs.

It will be appreciated that, for a first type of table to be processed, since the structured data is presented in a vertical manner, as one way, the second text unit in the uppermost row of each second text unit set may be determined first according to the coordinate range of each second text unit, and then the first text unit adjacent to that uppermost second text unit may be determined as the first text unit corresponding to the second text unit set according to the coordinate range of the first text unit and the coordinate range of the uppermost second text unit.

For the second type of table to be processed, since the structured data is presented in a horizontal manner, as one manner, the second text unit in the leftmost column of each second text unit set may be determined according to the coordinate range of each second text unit, and then the first text unit adjacent to the second text unit in the leftmost column of each second text unit set may be determined as the first text unit corresponding to each second text unit set according to the coordinate range of the first text unit and the coordinate range of the second text unit in the leftmost column of each second text unit set.
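The lookup of the adjacent first text unit can be sketched for the vertical case. The source does not define "adjacent" precisely; this sketch assumes it means the nearest key lying above the uppermost value whose horizontal range overlaps that value, with the ordinate growing downward. All coordinates and names are hypothetical.

```python
def key_for_column(column, keys):
    """Return the first text unit (key) adjacent to the uppermost value of a column.

    column, keys: lists of dicts with text, left, right, top, bottom (assumed layout).
    """
    top_value = min(column, key=lambda u: u["top"])  # uppermost second text unit
    candidates = [
        k for k in keys
        if k["bottom"] <= top_value["top"]           # key lies above the value
        and k["left"] < top_value["right"]           # horizontal ranges overlap
        and top_value["left"] < k["right"]
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda k: k["bottom"])  # nearest key above

keys = [
    {"text": "Total", "left": 80, "right": 300, "top": -15, "bottom": -5},
    {"text": "Item", "left": 0, "right": 60, "top": 0, "bottom": 10},
    {"text": "Amount", "left": 80, "right": 140, "top": 0, "bottom": 10},
]
column = [
    {"text": "17.63", "left": 100, "right": 140, "top": 40, "bottom": 50},
    {"text": "0.00", "left": 110, "right": 140, "top": 20, "bottom": 30},
]
owner = key_for_column(column, keys)
```

For a second type of table, the analogous lookup would take the leftmost value of each row and search for the nearest key to its left with an overlapping vertical range.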

S370, merging each second text unit set and the first text unit corresponding to each second text unit set to obtain key value structured data corresponding to the table to be processed.

The process of merging each second text unit set and the first text units corresponding to each second text unit set can be understood as a process of establishing an association relationship between the second text unit set and the corresponding first text units, and after the process of establishing the association relationship between the second text unit set and the corresponding first text units, key value structured data representing the association relationship between the first text units and the second text units can be obtained.

S380, merging the key structured data and the key value structured data to obtain the structured data corresponding to the to-be-processed table.
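The merge of key structured data and key value structured data can be sketched with plain dictionaries. The representations used here (a master-to-slaves mapping and a key-to-values mapping) and the sample keys are assumptions; the source does not prescribe a data format.

```python
def merge_structured(key_tree, key_values):
    """Merge key structured data with key value structured data into one nested dict.

    key_tree:   {master_key: [slave_key, ...]}  (assumed representation)
    key_values: {key: [value, ...]}             (assumed representation)
    Assumes the master-slave relations form a tree (no cycles). Values attached
    directly to a key that also has slave keys are not kept in this sketch.
    """
    slaves = {s for children in key_tree.values() for s in children}

    def build(key):
        children = key_tree.get(key, [])
        if children:
            return {child: build(child) for child in children}
        return key_values.get(key, [])

    # Top-level keys are those that are not the slave of any other key
    roots = [k for k in list(key_tree) + list(key_values) if k not in slaves]
    return {root: build(root) for root in dict.fromkeys(roots)}

key_tree = {"Revenue": ["Domestic", "Overseas"]}
key_values = {"Domestic": ["17.63"], "Overseas": ["11.86"], "Cost": ["35.00"]}
structured = merge_structured(key_tree, key_values)
```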

In the data processing method provided by the application, after the text units in the table to be processed are classified to obtain the text units corresponding to the keys of the table to be processed as the first text units, a target first text unit pair having a master-slave relationship is first obtained based on the first text units, and key structured data corresponding to the table to be processed is obtained based on the target first text unit pair. Each second text unit set is then obtained, the first text unit corresponding to each second text unit set is obtained, and each second text unit set is merged with its corresponding first text unit to obtain key value structured data corresponding to the table to be processed. Finally, the key structured data and the key value structured data are merged to obtain the structured data corresponding to the table to be processed. The key value structured data of the table to be processed is obtained by merging the second text unit sets and the first text units corresponding to the second text unit sets, so that the accuracy of the key value structured data is improved, and the accuracy of the structured data obtained from the table is further improved.

Referring to fig. 17, fig. 17 is a flowchart illustrating a data processing method according to an embodiment of the present application, where the method includes a network training stage and a model application stage.

In the network training stage, a complex table with a complete frame is first obtained; training samples, namely sample first text unit pairs, are then extracted from the complex table with the complete frame, each training sample carrying a classification label indicating whether a master-slave relationship exists; and finally a plurality of training samples are input into a classification network (for example, a convolutional neural network) for training, so that a text classification model is obtained.

In the model application stage, a table to be processed is first obtained and processed to obtain first text units, and first text unit pairs are generated based on any two first text units and serve as test data. The test data is then input into the classification model to obtain the output of the classification model, namely the pair types of the first text unit pairs. Whether a master-slave relationship exists between the first text units is then judged based on the pair types of the first text unit pairs, and key structured data corresponding to the table to be processed is generated based on the master-slave relationships between the first text units.

Referring to fig. 18, fig. 18 is a block diagram of a data processing apparatus 400 according to an embodiment of the present application, where the apparatus 400 includes:

The first text unit obtaining module 410 is configured to classify text units in the table to be processed to obtain text units corresponding to keys of the table to be processed, where the obtained text units corresponding to the keys are first text units.

The target first text unit pair obtaining module 420 is configured to obtain a target first text unit pair having a master-slave relationship based on the first text unit.

As one way, the target first text unit pair obtaining module 420 includes:

The first text unit pair obtaining submodule is configured to obtain at least one first text unit pair based on any two first text units, where every two first text units correspond to one first text unit pair.

The pair type obtaining submodule is configured to obtain the pair type corresponding to each first text unit pair.

The target first text unit pair determining submodule is configured to determine the first text unit pair whose pair type is the first pair type as the target first text unit pair having a master-slave relationship.

As one way, the pair type obtaining submodule is specifically configured to obtain a feature vector corresponding to each first text unit pair, and to obtain, based on each feature vector, the pair type of the corresponding first text unit pair.

As one way, the pair type obtaining submodule is specifically configured to perform feature extraction on each first text unit pair through a feature extraction layer of a classification model to obtain a feature vector corresponding to each first text unit pair, and process the feature vector corresponding to each first text unit pair through a feature processing layer of the classification model to obtain a pair type corresponding to each first text unit pair. The classification model is obtained by: acquiring a sample first text unit pair carrying a pair type label; and training the initial model based on the sample first text unit pair and the carried pair type label to obtain the classification model.

A master-slave relationship obtaining module 430, configured to obtain a master-slave relationship between the target first text units included in the target first text unit pair.

The key structured data obtaining module 440 is configured to obtain key structured data corresponding to the table to be processed based on the master-slave relationship between the target first text units included in the target first text unit pair, where the key structured data is data representing the master-slave relationship between first text units.

As one mode, the key structured data obtaining module 440 is specifically configured to perform hierarchical processing on each target first text unit based on a master-slave relationship between the target first text units included in each target first text unit pair, so as to obtain the key structured data corresponding to the to-be-processed table.
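The hierarchical processing can be sketched as building a tree from directed master-slave pairs and assigning each key a level. The pair orientation `(master, slave)`, the key names, and the level numbering are assumptions for illustration only.

```python
def build_key_tree(master_slave_pairs):
    """Collect (master, slave) pairs into a master -> slaves mapping."""
    tree = {}
    for master, slave in master_slave_pairs:
        tree.setdefault(master, []).append(slave)
    return tree

def key_levels(tree):
    """Assign level 0 to top-level keys and level + 1 to each of their slaves.

    Assumes the relations form a tree (no cycles).
    """
    slaves = {s for children in tree.values() for s in children}
    level = {k: 0 for k in tree if k not in slaves}
    frontier = list(level)
    while frontier:
        next_frontier = []
        for key in frontier:
            for child in tree.get(key, []):
                level[child] = level[key] + 1
                next_frontier.append(child)
        frontier = next_frontier
    return level

# Hypothetical target first text unit pairs with their master-slave direction
pairs = [("Revenue", "Domestic"), ("Revenue", "Overseas"), ("Domestic", "Q1")]
tree = build_key_tree(pairs)
levels = key_levels(tree)
```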

As one way, fig. 19 shows a block diagram of a data processing apparatus 400 according to another embodiment of the present application, where the apparatus 400 further includes:

The key-value structured data obtaining module 450 is configured to obtain key-value structured data corresponding to a table to be processed, where the key-value structured data is data representing an association relationship between a first text unit and a second text unit, and the second text unit is a text unit corresponding to each value of the table to be processed.

Optionally, the to-be-processed table includes a plurality of text units, each text unit has a corresponding coordinate attribute, and the plurality of text units includes first text units and second text units. In this case, the key-value structured data obtaining module 450 is specifically configured to obtain each second text unit set, where the second text units included in each second text unit set have at least one same coordinate attribute; acquire the first text units corresponding to the second text unit sets; and merge the second text unit sets and the first text units corresponding to the second text unit sets to obtain key value structured data corresponding to the to-be-processed table. The to-be-processed tables include a first type of to-be-processed table and a second type of to-be-processed table; the coordinate attributes of the first type of to-be-processed table include the left boundary of the abscissa, the right boundary of the abscissa, and the middle value of the abscissa, and the coordinate attributes of the second type of to-be-processed table include the upper and lower boundary values of the ordinate.

The structured data obtaining module 460 is configured to merge the key structured data and the key value structured data to obtain structured data corresponding to the to-be-processed table.

A frame generating module 470, configured to obtain a table label corresponding to each first text unit included in the structured data based on a corresponding relationship between the structured data and the table label; and generating a table with a complete frame corresponding to the table to be processed based on the table label.

As one mode, the apparatus 400 further includes:

The table analysis module is configured to analyze the table to be processed to obtain the text elements corresponding to the table to be processed and the coordinates corresponding to each text element.

The text unit obtaining module is configured to obtain each text unit included in the table to be processed based on the text elements corresponding to the table to be processed and the coordinates corresponding to each text element.

As one way, the text unit obtaining module is specifically configured to obtain the lateral distance between text elements based on the coordinates corresponding to each text element, and to distinguish a plurality of text elements with the same horizontal coordinate based on a distance threshold and the lateral distance between the text elements, so as to obtain each text unit included in the to-be-processed table.
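Splitting text elements into text units by lateral distance can be sketched as follows. This sketch assumes the elements passed in lie on one text line and are given as `(char, x_left, x_right)`; the threshold value and the coordinates are hypothetical.

```python
def merge_elements(elements, distance_threshold):
    """Join neighbouring text elements into text units.

    elements: list of (char, x_left, x_right) on one text line (assumed layout).
    A gap wider than distance_threshold starts a new text unit.
    """
    units = []
    for char, left, right in sorted(elements, key=lambda e: e[1]):
        if units and left - units[-1]["right"] <= distance_threshold:
            units[-1]["text"] += char      # close enough: same text unit
            units[-1]["right"] = right
        else:
            units.append({"text": char, "left": left, "right": right})
    return units

# Hypothetical line: "17" and "0%" separated by a wide gap
line = [("1", 0, 4), ("7", 4, 8), ("0", 40, 44), ("%", 44, 50)]
text_units = merge_elements(line, distance_threshold=3)
```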

According to the data processing device, after text units in a table to be processed are classified to obtain text units corresponding to keys of the table to be processed as first text units, target first text unit pairs with master-slave relations are obtained on the basis of the first text units, and then master-slave relations among the target first text units included in each target first text unit pair are obtained; obtaining key structured data corresponding to the table to be processed based on the master-slave relationship between the target first text units included by each target first text unit pair, then obtaining each second text unit set, then obtaining first text units corresponding to each second text unit set, then combining each second text unit set and the first text units corresponding to each second text unit set to obtain key value structured data corresponding to the table to be processed, and finally combining the key structured data and the key value structured data to obtain structured data corresponding to the table to be processed. The key value structured data of the table to be processed is obtained by combining the second text unit sets and the first text units corresponding to the second text unit sets, so that the accuracy of the key value structured data is improved, and the accuracy of the structured data obtained from the table is further improved.

It should be noted that the device embodiment and the method embodiment in the present application correspond to each other, and specific principles in the device embodiment may refer to the contents in the method embodiment, which is not described herein again.

An electronic device provided by the present application will be described below with reference to fig. 20.

Referring to fig. 20, based on the data processing method, an embodiment of the present application further provides an electronic device 200 including a processor 102 capable of executing the data processing method, where the electronic device 200 may be a smart phone, a tablet computer, a portable computer, or the like. The electronic device 200 also includes a memory 104, a network module 106, and a screen 108. The memory 104 stores programs that can execute the content of the foregoing embodiments, and the processor 102 can execute the programs stored in the memory 104.

The processor 102 may include one or more cores for processing data and a message matrix unit. The processor 102 connects to various components throughout the electronic device 200 using various interfaces and circuitry, and performs various functions of the electronic device 200 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 104 and invoking data stored in the memory 104. Optionally, the processor 102 may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 102 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is used for rendering and drawing display content; and the modem is used to handle wireless communications. It is understood that the modem may not be integrated into the processor 102, but may instead be implemented by a separate communication chip.

The memory 104 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 104 may be used to store instructions, programs, code sets, or instruction sets. The memory 104 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the method embodiments described above, and the like. The stored data area may also store data created by the electronic device 200 in use, such as a phonebook, audio-video data, chat log data, and the like.

The network module 106 is configured to receive and transmit electromagnetic waves, and achieve interconversion between the electromagnetic waves and the electrical signals, so as to communicate with a communication network or other devices, for example, an audio playing device. The network module 106 may include various existing circuit elements for performing these functions, such as an antenna, a radio frequency transceiver, a digital signal processor, an encryption/decryption chip, a Subscriber Identity Module (SIM) card, memory, and so forth. The network module 106 may communicate with various networks such as the internet, an intranet, a wireless network, or with other devices via a wireless network. The wireless network may comprise a cellular telephone network, a wireless local area network, or a metropolitan area network. For example, the network module 106 may interact with a base station.

The screen 108 may display interface content and may also be used to respond to touch gestures.

It should be noted that, in order to implement more functions, the electronic device 200 may also include more components, for example, a structured light sensor for acquiring face information or a camera for acquiring an iris image.

Referring to fig. 21, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable storage medium 1100 stores program code that can be called by a processor to execute the method described in the above method embodiments.

The computer-readable storage medium 1100 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. Optionally, the computer-readable storage medium 1100 includes a non-volatile computer-readable storage medium. The computer-readable storage medium 1100 has storage space for program code 1110 for performing any of the method steps of the method described above. The program code can be read from or written to one or more computer program products. The program code 1110 may, for example, be compressed in a suitable form.

Based on the above-mentioned data processing method, according to an aspect of an embodiment of the present application, there is provided a computer program product or a computer program including computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations described above.

To sum up, in the data processing method, apparatus, electronic device, storage medium, and computer program product or computer program provided in the embodiments of the present application, text units in a table to be processed are first classified to obtain the text units corresponding to keys of the table to be processed as first text units. Target first text unit pairs having a master-slave relationship are then obtained based on the first text units, and the master-slave relationship between the target first text units included in each target first text unit pair is acquired. Key structured data corresponding to the table to be processed is obtained based on the master-slave relationship between the target first text units included in each target first text unit pair. Next, each second text unit set is obtained, the first text unit corresponding to each second text unit set is obtained, and each second text unit set is combined with its corresponding first text unit to obtain key-value structured data. Finally, the key structured data and the key-value structured data are combined to obtain the structured data corresponding to the table to be processed. Because the key-value structured data of the table to be processed is obtained by combining the second text unit sets with the first text units corresponding to them, the accuracy of the key-value structured data is improved, and the accuracy of the structured data obtained from the table is further improved.
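The flow summarized above can be illustrated with a minimal Python sketch. It is not the implementation of the embodiments: it assumes the classification and pair-type models have already run, so the key texts, the (master, slave) key pairs, and the value sets per key are supplied directly as inputs. The names `KeyNode`, `build_key_structure`, `attach_values`, and `structure_table` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class KeyNode:
    """A key (first text unit) with its subordinate keys and its values."""
    text: str
    children: List["KeyNode"] = field(default_factory=list)
    values: List[str] = field(default_factory=list)

    def to_dict(self) -> dict:
        out: dict = {"key": self.text}
        if self.values:
            out["values"] = list(self.values)
        if self.children:
            out["children"] = [child.to_dict() for child in self.children]
        return out


def build_key_structure(
    keys: List[str], master_slave_pairs: List[Tuple[str, str]]
) -> Tuple[List[KeyNode], Dict[str, KeyNode]]:
    """Build key structured data from (master, slave) key pairs.

    Assumption: the pairs are already the output of the pair-type and
    master-slave steps, which are not modelled here.
    """
    nodes = {k: KeyNode(k) for k in keys}
    slaves = set()
    for master, slave in master_slave_pairs:
        nodes[master].children.append(nodes[slave])
        slaves.add(slave)
    # Keys that are never a slave form the top level of the structure.
    roots = [nodes[k] for k in keys if k not in slaves]
    return roots, nodes


def attach_values(nodes: Dict[str, KeyNode], value_sets: Dict[str, List[str]]) -> None:
    """Combine each value set (second text unit set) with its key."""
    for key, values in value_sets.items():
        nodes[key].values.extend(values)


def structure_table(
    keys: List[str],
    master_slave_pairs: List[Tuple[str, str]],
    value_sets: Dict[str, List[str]],
) -> List[dict]:
    """Merge key structured data and key-value data into one structure."""
    roots, nodes = build_key_structure(keys, master_slave_pairs)
    attach_values(nodes, value_sets)
    return [root.to_dict() for root in roots]


if __name__ == "__main__":
    # Hypothetical table: "Contact" is the master key of "Phone" and "Email".
    print(structure_table(
        ["Contact", "Phone", "Email"],
        [("Contact", "Phone"), ("Contact", "Email")],
        {"Phone": ["555-0100"], "Email": ["a@example.com"]},
    ))
```

In this sketch the key hierarchy is built before any values are attached, mirroring the order of steps in the summary; a key that never appears as a slave is treated as a top-level key.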

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions in the embodiments of the present application.
