Incremental acquisition method for data of multi-source heterogeneous relational database
1. An incremental acquisition method for multi-source heterogeneous relational database data is characterized by comprising the following steps:
s1, target table checking step: reading metadata of a source database table, checking whether a target database table meets the acquisition requirement, entering a subsequent data reading step if the target database table meets the acquisition requirement, and otherwise entering a target table adjusting step;
s2, target table adjusting step: deleting a target table in the target table under the condition that the target table does not meet the acquisition requirement, automatically generating a DDL statement of the target table according to a metadata structure of a source database table and the type of the target database, and creating the target table, wherein the DDL statement also establishes an expanded hash field in the target table;
s3, reading data: reading data of a source table and data of a target table, performing character string splicing calculation on record values according to a field sequence in metadata for each record in the source table, performing hash operation on the spliced character strings to obtain hash values, storing all the record values and the hash values obtained through the operation in a memory, and arranging all the hash values in an ascending order according to a character string sequence to form a record set of the source data table; for data in the target table, directly performing ascending sequencing according to the expanded hash fields and reading the data into a memory, thereby forming a target data table record set;
s4, data comparison: comparing a source data table record set and a target data table record set stored in a memory, and marking records only appearing in the source data table record set as new records; for records that appear only in the target data table record set, marking as deleted;
s5, history data recording step: when data adding and deleting operations are carried out on the target table, the added and deleted data of the target table are stored in the historical data table;
s6, data updating step: for the record marked as new added in the record set of the source data table, executing the operation of the new added record in the target table; for records in the target data table record set marked as deleted, a delete record operation is performed in the target table.
2. The method of claim 1, wherein the method is performed in a loop.
3. The incremental acquisition method for the multi-source heterogeneous relational database data according to claim 1 or 2, wherein the step S1 specifically includes: reading the structure metadata of the source data table in a JDBC mode, updating the metadata information of the source database table to be consistent with the type of the target table according to the field type mapping relation between the source database and the target database, checking whether the data table meeting the acquisition requirement exists in the target database, continuously checking the structure of the data table if the data table exists, and directly creating the target table if the data table does not exist.
4. The incremental acquisition method for the data of the multi-source heterogeneous relational database according to claim 3, wherein the checking whether the data table meeting the acquisition requirement exists in the target library specifically comprises:
firstly, obtaining metadata information f of data table from metadata of source table1(l1,t1),f2(l2,t2),…,fn(ln,tn) Wherein t represents a field type, l represents a field length, and f represents a field name; check again whether the same table f 'is present in the target repository'1(l’1,t’1),f’2(l’2,t’2),…,f’n(l’n,t’n),f’n+1(l’n+1,t’n+1) Wherein t ' represents a field type, l ' represents a field length, and f ' represents a field name; the metadata comparison rule of the source table and the target table is as follows:
fn(ln,tn)=f’n(l’n,t’n),n=1,2,3…;
if the condition is satisfied, the extended hash field f 'is removed from the target table'n+1(l’n+1,t’n+1) Otherwise, if there is no redundant field, the check is passed, otherwise it is not passed.
5. The incremental acquisition method for the multi-source heterogeneous relational database data according to claim 3, wherein the subsequent step is performed if the acquisition requirement is met in step S1, and the step of adjusting the target table is performed if the acquisition requirement is not met, specifically comprising: and if the acquisition requirement is met, performing a third step, if the first step does not meet the acquisition requirement and the target table does not exist, rebuilding the target table, and if part of the fields do not meet the acquisition requirement, deleting the target table and rebuilding.
6. The incremental acquisition method of the multi-source heterogeneous relational database data according to claim 4 or 5, wherein the automatically generating a target library table DDL statement in S2 specifically comprises: and generating a DDL statement of the target database type according to the field type mapping relation between the source database and the target database.
7. The method for incremental acquisition of data from a multi-source heterogeneous relational database according to claim 6, wherein the hash operation in S3 is Md5 hash calculation.
8. The method for incrementally collecting data from a multisource heterogeneous relational database as recited in claim 6, wherein the historical data table in step S5 is stored using a NoSQL database.
9. The method for incrementally collecting data from a multisource heterogeneous relational database according to claim 8, wherein the historical data table in step S5 is stored using an HBase database.
10. The incremental acquisition method for multi-source heterogeneous relational database data according to any one of claims 7 to 9, wherein the step S6 specifically includes: executing corresponding operation on the target table according to the data marks; when the target is operated, different SQL operation statements are generated according to different marking modes, and a batch submission mode is adopted during execution.
Background
In recent years, with the acceleration of the digital transformation of enterprises and public institutions, data middleboxes become the advantages of the digital transformation, and a plurality of enterprises propose a "middlebox strategy", one of the purposes of which is to realize the centralized storage and control of business data.
The key for realizing incremental acquisition is to accurately and quickly acquire change data, and an excellent incremental acquisition mechanism requires that the change data in a service system can be accurately acquired according to a certain frequency, and meanwhile, too much pressure cannot be caused on the original service system, so that the existing service is influenced. Compared with full extraction, the design of incremental extraction is more complex, and a method for automatically converting the full extraction process into the incremental extraction process is realized, provided that changed data must be captured, and a method for capturing the changed data commonly used in incremental data acquisition is shown in table 1:
TABLE 1 incremental data extraction general method
As can be seen from the above table, each method has certain limitations and different applicable scenes, and if a general data increment acquisition method is provided, the basic balance is achieved on the indexes of data completeness, low invasiveness, high extraction performance and the like, and the method is a good choice in engineering application. The invention is generated under the background, and improves the availability in engineering practice by innovating the acquisition mode and improving the performance of a full-table comparison mode and adding the low invasiveness and the high completeness of data.
Disclosure of Invention
Technical problem to be solved
The invention aims to solve the technical problem of how to provide an incremental acquisition method for data of a multi-source heterogeneous relational database, and the data acquisition requirement among the multi-source heterogeneous databases is met.
(II) technical scheme
In order to solve the technical problem, the invention provides an incremental acquisition method for data of a multi-source heterogeneous relational database, which comprises the following steps:
s1, target table checking step: reading metadata of a source database table, checking whether a target database table meets the acquisition requirement, entering a subsequent data reading step if the target database table meets the acquisition requirement, and otherwise entering a target table adjusting step;
s2, target table adjusting step: deleting a target table in the target table under the condition that the target table does not meet the acquisition requirement, automatically generating a DDL statement of the target table according to a metadata structure of a source database table and the type of the target database, and creating the target table, wherein the DDL statement also establishes an expanded hash field in the target table;
s3, reading data: reading data of a source table and data of a target table, performing character string splicing calculation on record values according to a field sequence in metadata for each record in the source table, performing hash operation on the spliced character strings to obtain hash values, storing all the record values and the hash values obtained through the operation in a memory, and arranging all the hash values in an ascending order according to a character string sequence to form a record set of the source data table; for data in the target table, directly performing ascending sequencing according to the expanded hash fields and reading the data into a memory, thereby forming a target data table record set;
s4, data comparison: comparing a source data table record set and a target data table record set stored in a memory, and marking records only appearing in the source data table record set as new records; for records that appear only in the target data table record set, marking as deleted;
s5, history data recording step: when data adding and deleting operations are carried out on the target table, the added and deleted data of the target table are stored in the historical data table;
s6, data updating step: for the record marked as new added in the record set of the source data table, executing the operation of the new added record in the target table; for records in the target data table record set marked as deleted, a delete record operation is performed in the target table.
Further, the method is performed cyclically.
Further, the step S1 specifically includes: reading the structure metadata of the source data table in a JDBC mode, updating the metadata information of the source database table to be consistent with the type of the target table according to the field type mapping relation between the source database and the target database, checking whether the data table meeting the acquisition requirement exists in the target database, continuously checking the structure of the data table if the data table exists, and directly creating the target table if the data table does not exist.
Further, the checking whether the data table meeting the acquisition requirement exists in the target library specifically includes:
firstly, obtaining metadata information f of data table from metadata of source table1(l1,t1),f2(l2,t2),…,fn(ln,tn) Wherein t represents a field type, l represents a field length, and f represents a field name; check again whether the same table f 'is present in the target repository'1(l’1,t’1),f’2(l’2,t’2),…,f’n(l’n,t’n),f’n+1(l’n+1,t’n+1) Wherein t ' represents a field type, l ' represents a field length, and f ' represents a field name; the metadata comparison rule of the source table and the target table is as follows:
fn(ln,tn)=f’n(l’n,t’n),n=1,2,3…;
if the condition is satisfied, the extended hash field f 'is removed from the target table'n+1(l’n+1,t’n+1) Otherwise, if there is no redundant field, the check is passed, otherwise it is not passed.
Further, if the acquisition requirement is met in step S1, the subsequent step is performed, otherwise, the target table adjustment step specifically includes: and if the acquisition requirement is met, performing a third step, if the first step does not meet the acquisition requirement and the target table does not exist, rebuilding the target table, and if part of the fields do not meet the acquisition requirement, deleting the target table and rebuilding.
Further, the automatically generating a target library table DDL statement in S2 specifically includes: and generating a DDL statement of the target database type according to the field type mapping relation between the source database and the target database.
Further, the hash operation in S3 is Md5 hash calculation.
Further, the history data table in the step S5 is stored using a NoSQL database.
Further, the history data table in the step S5 is stored using the HBase database.
Further, the step S6 specifically includes: executing corresponding operation on the target table according to the data marks; when the target is operated, different SQL operation statements are generated according to different marking modes, and a batch submission mode is adopted during execution.
(III) advantageous effects
The invention provides an incremental acquisition method of multi-source heterogeneous relational database data. Fast incremental data acquisition can be achieved. The method is suitable for various databases by utilizing a general data reading mode and calculating in the memory for a storage technology independent of a multi-source heterogeneous database. Therefore, the method is suitable for rapid large-batch incremental acquisition in the multi-source heterogeneous relational database.
Drawings
FIG. 1 is a flow chart of an incremental acquisition method for multi-source heterogeneous relational database data according to the present invention.
Detailed Description
In order to make the objects, contents and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.
The invention relates to an incremental acquisition method for data of a multi-source heterogeneous relational database, which belongs to the field of data processing technology during enterprise and public institution digital transformation and is used for leading external data sources to adopt an incremental data acquisition method during data warehouse or data center establishment.
The purpose of the invention is: the incremental acquisition method for the relational database is provided, and the data acquisition requirements among multi-source heterogeneous databases are met.
In order to achieve the above object, the present invention provides a method for incremental acquisition of a relational database, which is executed in a loop, and comprises:
and S1, checking the target table. Reading metadata of a source database table, checking whether a target database table meets the acquisition requirement, entering a subsequent data reading step if the target database table meets the acquisition requirement, and otherwise entering a target table adjusting step;
and S2, adjusting the target table. And deleting the target table in the target table under the condition that the target table does not accord with the acquisition requirement, automatically generating a DDL statement of the target table according to the metadata structure of the source database table and the type of the target database, and creating the target table, wherein the DDL statement also establishes an expanded hash field in the target table.
And S3, reading data. Reading data of a source table and data of a target table, performing character string splicing calculation on record values according to a field sequence in metadata for each record in the source table, performing hash operation on the spliced character strings to obtain hash values, storing all the record values and the hash values obtained through the operation in a memory, and arranging all the hash values in an ascending order according to a character string sequence to form a record set of the source data table. And directly sorting the data in the target table in an ascending order according to the expanded hash fields and reading the data into the memory, thereby forming a target data table record set.
And S4, data comparison. Comparing a source data table record set and a target data table record set stored in a memory, and marking records only appearing in the source data table record set as new records; for records that appear only in the target data table record set, marking as deleted;
and S5, history data recording. When data adding and deleting operations are carried out on the target table, the added and deleted data of the target table are stored in the historical data table;
and S6, data updating. For the record marked as new added in the record set of the source data table, executing the operation of the new added record in the target table; for records in the target data table record set marked as deleted, a delete record operation is performed in the target table.
In the data acquisition process, records of a source database table are utilized to carry out Hash operation to obtain a recorded Hash value, the recorded Hash value is compared with the Hash value stored in a target table, and the time complexity of comparison between a source data table record set and a target data table record set can be accelerated by sequencing the record sets. Fast incremental data acquisition can be achieved. The method is suitable for various databases by utilizing a general data reading mode and calculating in the memory for a storage technology independent of a multi-source heterogeneous database. Therefore, the method is suitable for rapid large-batch incremental acquisition in the multi-source heterogeneous relational database.
FIG. 1 is a flow chart of a method for incremental acquisition of data from a multi-source heterogeneous relational database. The method is a circularly executed process, and as shown in fig. 1, the method includes:
(1) and checking the target table. Reading the metadata of the source database table, checking whether the target database table meets the data acquisition requirement, entering a subsequent data reading step if the target database table meets the data acquisition requirement, and otherwise entering a target table adjusting step.
In specific implementation, firstly, reading the structure metadata of the source data table in the JDBC mode, then updating the metadata information of the source database table to be consistent with the type of the target table according to the field type mapping relation between the source database and the target database, then checking whether the target database has a data table meeting the acquisition requirement, if so, continuing checking the structure of the data table, and if not, directly creating the target table. The checking method comprises obtaining data from metadata of the source tableMetadata information f of data table1(l1,t1),f2(l2,t2),…,fn(ln,tn) Where t represents the field type, l represents the field length, and f represents the field name. Check again whether the same table f 'is present in the target repository'1(l’1,t’1),f’2(l’2,t’2),…,f’n(l’n,t’n),f’n+1(l’n+1,t’n+1) Where t ' represents the field type, l ' represents the field length, and f ' represents the field name. The metadata comparison rule of the source table and the target table is as follows:
fn(ln,tn)=f’n(l’n,t’n),(n=1,2,3…)
if the condition is satisfied, the extended hash field f 'is removed from the target table'n+1(l’n+1,t’n+1) Otherwise, if there is no redundant field, the check is passed, otherwise it is not passed.
(2) And adjusting the target table. And deleting the target table in the target table under the condition that the target table does not accord with the acquisition requirement, automatically generating a DDL statement of the target table according to the metadata structure of the source database table and the type of the target database, and creating the target table. Wherein the DDL statement also establishes an extended hash field in the target table.
And during specific implementation, generating a DDL statement of the target database type according to the field type mapping relation between the source database and the target database, if the first step meets the acquisition requirement, performing a third step, if the first step does not meet the acquisition requirement and the target table does not exist, reconstructing the target table, and if part of the fields do not meet the acquisition requirement, deleting the target table and reconstructing the target table.
(3) And reading data. Reading data of a source table and data of a target table, performing character string splicing calculation on record values according to a field sequence in metadata for each record in the source table, performing hash operation on the spliced character strings to obtain hash values, storing all the record values and the hash values obtained through the operation in a memory, and arranging all the hash values in an ascending order according to a character string sequence to form a record set of the source data table. And directly sorting the data in the target table in an ascending order according to the expanded hash fields and reading the data into the memory, thereby forming a target data table record set.
In specific implementation, for each record in the source table, the character string splicing calculation is performed on the record value according to the field sequence in the metadata. Suppose fn(n-1, 2,3 …) for each record the value Dn(n is 1,2,3 …), the concatenated string S, S is D1 D2…DnIf D isnIf null or empty, using a placeholder 'learning', then carrying out Md5 hash calculation on S to obtain S ', storing the hash value S' and the key value relation of the fields and the field values into a memory database, and arranging in an ascending order to obtain a source data table record set; and then, directly sequencing the data in the target table in an ascending order according to the expanded hash fields and reading the data into the memory database, thereby forming a target data table record set.
And for the newly created target table, the content is empty, the target data table record set is empty, and the data in the source data table record set is updated to the target table during subsequent data comparison and updating. In the second loop, the target table already has data, and the target data table record set can be obtained by sorting according to the hash fields in an ascending order.
(4) And (5) data comparison. Comparing a source data table record set and a target data table record set stored in a memory, and marking records only appearing in the source data table record set as new records; for records that appear only in the target data table record set, marking as deleted;
in specific implementation, comparing the two sets according to the source data table record set and the target table record set which are sequenced in the third step, and marking the records which only appear in the source data table record set as new records; for records that appear only in the target data table record set, the flag is deleted. Sorting and aligning can change the time complexity from O (n)2) Reduced to O (N);
(5) and recording historical data. When data adding and deleting operations are carried out on the target table, the added and deleted data of the target table are stored in the historical data table;
in specific implementation, before adding and deleting data in the target table, the marked data set in the step four is stored in the historical data table, and as the data in the historical data table is likely to be larger and larger, a NoSQL database, such as HBase, can be selected for storage when the data is actually used; when the query is presented, the change condition of the whole data can be queried according to the service main key.
(6) And (5) updating data. For the record marked as new added in the record set of the source data table, executing the operation of the new added record in the target table; for records in the target data table record set marked as deleted, a delete record operation is performed in the target table.
In specific implementation, corresponding operation is executed on the target table according to the data marks. When the target is operated, different SQL operation statements are generated according to different marking modes, a batch submission mode is adopted during execution, the number of batch submissions can be dynamically set, the operation on the target data table is optimized, and the data processing time is saved.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.