Big data query method, system and medium based on ClickHouse dictionary table
1. A big data query method based on a ClickHouse dictionary table is characterized in that,
(1) creating a dictionary table: setting fields needing to be synchronized to a dictionary table according to key-value mapping requirements; synchronizing data of corresponding fields from a specified data source to a dictionary table;
(2) configuring dataCube: adding a dictionary table name, a field name corresponding to the dimension field in the dictionary table and a field name needing to be displayed in a mapping value mode to the dimension field needing to be displayed in a key-value mapping mode in the dataCube configuration, and generating a dictionary structure associated with the dimension field;
(3) big data query: receiving a query instruction, wherein the query instruction comprises a queried field A, reading dataCube configuration information, if the field A is matched with a dimension field which is configured with a dictionary table, constructing an SQL query statement for acquiring the field A from a database, and the SQL query statement also comprises a statement for acquiring a field which is displayed in a mapping value mode and corresponds to the field A from the dictionary table; the query results are then returned.
2. The ClickHouse dictionary table-based big data query method according to claim 1, wherein the query instruction further includes a restrictive query condition for a field that needs to be displayed in a mapping value manner, and the SQL query statement includes a statement that is queried using the restrictive query condition.
3. The big data query method based on the clickwouse dictionary table according to claim 1, wherein when the dictionary table is created, a port, an account and a password are configured, and data is automatically synchronized from a specified data source.
4. The big data query method based on the ClickHouse dictionary table as claimed in claim 1, wherein the dictionary table is configured with an update period when being created, and data is automatically synchronized from a specified data source according to the update period.
5. The big data query method based on the ClickHouse dictionary table as claimed in claim 1, wherein when the field A is data type data and is calculated by other fields, a measure field is set to obtain a calculation formula of the field A, and the SQL query statement includes statements of other fields required for calculating the field A and obtained from a database.
6. A big data query system based on a ClickHouse dictionary table is characterized by comprising a dictionary table creating module, a dataCube configuration module and an SQL statement generating and executing module;
the dictionary table creating module sets fields needing to be synchronized to the dictionary table according to the key-value mapping requirement; synchronizing data of corresponding fields from a specified data source to a dictionary table;
the method comprises the steps that a database configuration module adds a dimension field in database configuration aiming at a field needing to be displayed in a key-value mapping mode, adds a dictionary table name in configuration information of the dimension field, a field name corresponding to the dimension field in the dictionary table and a field name needing to be displayed in a mapping value mode, and generates a dictionary structure relevant to the dimension field;
the SQL sentence generating and executing module receives a query instruction, reads dataCube configuration information, if a field A included in the query instruction is matched with a dimension field, and the dimension field is configured with a dictionary table, an SQL query sentence for acquiring the field A from a database is constructed, and the SQL query sentence also includes a sentence for acquiring a field displayed in a mapping value mode corresponding to the field A from the dictionary table; the SQL statement is then executed and the query results are returned.
7. The ClickHouse dictionary table-based big data query system according to claim 6, wherein the query instruction further includes a restrictive query condition for a field that needs to be displayed in a mapping value manner, and the SQL query statement constructed by the SQL statement generation and execution module includes a statement that is queried using the restrictive query condition.
8. The ClickHouse dictionary table-based big data query system according to claim 6, wherein the dictionary table creation module automatically synchronizes data from a specified data source according to the configured port, account and password, and/or the dictionary table creation module automatically synchronizes data from a specified data source according to the configured update period.
9. The ClickHouse dictionary table-based big data query system according to claim 6, wherein the field A is calculated from other fields, the measure field added by the dataCube configuration module includes a calculation formula for obtaining the field A, and the SQL query statement generated by the SQL statement generation and execution module includes statements for obtaining other fields required for calculating the field A from the database.
10. A storage medium having stored thereon a computer program which, when executed, implements the ClickHouse dictionary table-based big data query method of any one of claims 1 to 5.
Background
When data analysis is performed, operations such as searching, classifying, and counting various dimensions of data are required to find and analyze differences between data. Data of various dimensions are usually named by using a number, namely an ID which is often called by us, in consideration of coding, security and other problems in use and transmission. In performing data analysis, the purely numerical ID results are difficult for a user without specialized training to understand their meaning, thus creating a key-value mapping problem.
For example, a table A containing m rows of financial data, which includes currency fields, which typically deals with multiple currencies, represented by numerical codes, where 156 represents the RMB and 840 represents the U.S. dollar, when analyzing the table A, additional currency attributes, such as currency name and country, are associated with the currency fields for better data analysis, which can present the following problems:
1. if the country and currency names are not in the table A, a table B containing the currency number codes and the corresponding country and currency names needs to be created, and the method is called a multi-table connection method. Supposing that the table B has n rows, the fact that the country and currency names exist when the data result is returned is realized through the JOIN (connection) of the table A and the table B during query, and then the data processing amount of a background is increased from m rows to m x n rows, so that query performance is greatly influenced, and the data processing amount can hardly be used during big data analysis;
2. in the prior art, ID and name matching is also realized by a large-width table method, but if the table a has country and currency names, a user cannot read the country and currency name languages displayed by default in the system, or the country names change, and the country names correspond to different abbreviations, and the like, the data size of the table a needs to be modified, generally speaking, the table a has a large data size, and the modification difficulty is large.
Disclosure of Invention
The invention provides a big data query method based on a ClickHouse dictionary table, which can directly use a mapping value to query data during big data query, improve the data query efficiency and improve the overall performance of a system.
The invention provides a big data query method based on a ClickHouse dictionary table, which comprises the following steps:
(1) creating a dictionary table: setting fields needing to be synchronized to a dictionary table according to key-value mapping requirements; synchronizing data of corresponding fields from a specified data source to a dictionary table;
(2) configuring dataCube: adding a dictionary table name, a field name corresponding to the dimension field in the dictionary table and a field name needing to be displayed in a mapping value mode to the dimension field needing to be displayed in a key-value mapping mode in the dataCube configuration, and generating a dictionary structure associated with the dimension field;
(3) big data query: receiving a query instruction, wherein the query instruction comprises a queried field A, reading dataCube configuration information, if the field A is matched with a dimension field which is configured with a dictionary table, constructing an SQL query statement for acquiring the field A from a database, and the SQL query statement also comprises a statement for acquiring a field which is displayed in a mapping value mode and corresponds to the field A from the dictionary table; the query results are then returned.
According to the invention, by setting the dictionary table with data stored in the memory, relevant data is obtained by querying the database, and meanwhile, some associated data is obtained through the dictionary table, so that the effect of multi-table JOIN in the data query process is realized, meanwhile, the loss of performance in multi-table JOIN is effectively solved, and the data query efficiency is greatly improved.
Preferably, the query instruction further includes a restrictive query condition for a field that needs to be displayed in a mapping value manner, and the SQL query statement includes a statement that is queried by using the restrictive query condition.
Preferably, when the dictionary table is created, the port, account and password are configured, and data is automatically synchronized from the specified data source.
Preferably, when the dictionary table is created, an update cycle is configured, and data is automatically synchronized from the specified data source according to the update cycle.
Preferably, the field a is data type data and is calculated by other fields, the measure field is set to obtain a calculation formula of the field a, and the SQL query statement includes statements of other fields required for calculating the field a and obtained from the database.
The invention also provides a big data query system based on the ClickHouse dictionary table, which comprises a dictionary table creating module, a dataCube configuration module and an SQL statement generating and executing module;
the dictionary table creating module sets fields needing to be synchronized to the dictionary table according to the key-value mapping requirement; synchronizing data of corresponding fields from a specified data source to a dictionary table;
the method comprises the steps that a database configuration module adds a dimension field in database configuration aiming at a field needing to be displayed in a key-value mapping mode, adds a dictionary table name in configuration information of the dimension field, a field name corresponding to the dimension field in the dictionary table and a field name needing to be displayed in a mapping value mode, and generates a dictionary structure relevant to the dimension field;
the SQL sentence generating and executing module receives a query instruction, reads dataCube configuration information, if a field A included in the query instruction is matched with a dimension field, and the dimension field is configured with a dictionary table, an SQL query sentence for acquiring the field A from a database is constructed, and the SQL query sentence also includes a sentence for acquiring a field displayed in a mapping value mode corresponding to the field A from the dictionary table; the SQL statement is then executed and the query results are returned.
Preferably, the query instruction further includes a restrictive query condition for a field that needs to be displayed in a mapping value manner, and the SQL query statement constructed by the SQL statement generation and execution module includes a statement that is queried by using the restrictive query condition.
Preferably, the dictionary table creating module automatically synchronizes data from the specified data source according to the configured port, account and password, and/or the dictionary table creating module automatically synchronizes data from the specified data source according to the configured update period.
Preferably, the field a is calculated from other fields, the measure field added by the dataCube configuration module includes a calculation formula for obtaining the field a, and the SQL query statement generated by the SQL statement generation and execution module includes statements for obtaining other fields required for calculating the field a from the database.
The invention also provides a storage medium, which stores a computer program, when the program is executed, the big data query method based on the ClickHouse dictionary table provided by the invention is realized.
Compared with the prior art, the invention has the advantages that:
in the prior art, codes are usually required to be written to realize the function of synchronizing data from different databases, and the scripts need to be separately deployed and started at regular time and pay attention to the script execution condition; according to the technical scheme, data synchronization can be achieved only by creating the dictionary table, the database, the synchronization condition, the synchronization period and the like can be specified, the ClickHouse can automatically execute periodical updating of the dictionary table and dynamic loading of missing values according to the sentences for creating the dictionary table, the execution result is recorded, and the later maintenance cost is low.
The invention is realized based on the dictionary table of the ClickHouse, the data is stored in the memory in the form of the Hash table, and can contain any number of elements with any identifiers, thereby realizing the effect of multi-table JOIN in the data query process, and effectively solving the loss of performance when the multi-table is associated. The invention provides a scheme for directly using a mapping value to analyze data during large data visualization analysis based on a ClickHouse, which is not a scheme for scanning a whole table, effectively solves the loss of performance during multi-table association, and has the advantages of simple implementation, no need of maintenance and almost no loss of performance.
Drawings
FIG. 1 is a flow diagram illustrating an update of an external dictionary table according to one embodiment of the present invention;
FIG. 2 is a flow diagram illustrating configuration of an external dictionary table according to one embodiment of the present invention;
FIG. 3 is a UML diagram of an external dictionary table and data table according to one embodiment of the present invention;
FIG. 4 is a flowchart illustrating a query with a field according to an embodiment of the present invention.
Detailed Description
The invention is described in detail below with reference to the drawings and specific examples.
The embodiment takes the example of creating a dictionary table for the currency fields in the financial data as an example, and details a data analysis method based on the clickwouse dictionary table, which includes the creation, update, configuration and query of the dictionary table. In the embodiment, a currency dictionary table is created based on a ClickHouse dictionary table, data contents in the dictionary table are configured, an update period is set to update parameters in the dictionary table, and when big data is inquired, a mapping value matched with the parameters is quickly read by calling field parameters in the currency dictionary table, so that quick reading and processing of the data are realized, and the data analysis efficiency is improved.
The ClickHouse dictionary table is a simple and practical storage medium provided by the ClickHouse, data is defined in the form of key value and attribute mapping, and the data in the dictionary can be actively or passively loaded into a memory and supports dynamic updating. Due to the characteristic that dictionary data is resident in a memory, the method is suitable for storing constant or frequently-used dimension table data, so that unnecessary JOIN connection data is avoided, and the data processing efficiency is improved.
Creating a dictionary table:
creating a currency dictionary table dit _ currenty, which is responsible for the dictionary table creation module, as shown in fig. 1, comprising the following steps:
1. creating a script to generate a currency DICTIONARY table using a CREATE DICTIONARY interface
(1) The data SOURCE of the currency symbol and the name is on the currenty table of the MySQL database, so the SOURCE interface is set to designate the data SOURCE as MySQL, and information such as a port, an account number, a password and the like is configured. The ClickHouse supports data acquisition by connecting a designated database, an http request API or reading a file.
(2) Fields required for synchronization from the currenty table, such as id, name, count to the currency dictionary table dit _ currenty, are set according to the key-value requirements. The present embodiment needs to return the currency id and name to the data analyst, so the dictionary table includes the field id and name.
(3) According to the update frequency of the key-value data, the update cycle of the LIFETIME interface designation dictionary table dit _ currency is set. Because the name and the country information corresponding to the currency symbol are updated rarely, the name and the country information are updated once every 1-2 hours;
the complete SQL statement that creates the dictionary table is as follows:
CREATE DICTIONARY dict_currency
(
id String,
name String,
country String
)
PRIMARY KEY id
SOURCE(MYSQL(
port 3306
user 'user'
password '1234'
db 'medsdk'
table 'currency'
))
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(MIN 3600 MAX 7200);
the above statements create a dictionary table named as "dit _ currenty", which includes three fields of id, name, and count, the data source is a currenty table of the medsdk database, the port 3306, the user name is user, the password is 1234, and the update period is 3600 seconds to 7200 seconds.
2. The execution script generates an external dictionary table, dit _ currency, in the ClickHouse, and at this time, data is synchronized in the dictionary table.
The flow chart of the created dictionary table updating is shown in fig. 2. The method comprises the steps of firstly judging whether an update period set when a dictionary table is created is reached or not when the dictionary table is updated, waiting for a period of time and judging again if the update period is not reached, judging whether the load of a ClickHouse server meets the condition that the dictionary table is diode _ current can be updated or not if the update period is reached, entering a waiting procedure for the update process if the load of the ClickHouse server cannot meet the follow-up processing condition when the dictionary table is updated, and synchronizing corresponding field data in the data table current into the dictionary table when the load of the ClickHouse server meets the update condition to update the dictionary table.
Configuring dataCube:
analyzing and configuring the currency dictionary table dit _ currency information, wherein the information is realized by a dataCube configuration module, and the method comprises the following steps:
(1) and loading the configuration of the dataCube, circularly reading and analyzing the dimension field, and finding a currentid field which is a field needing to be displayed in a key-value mapping mode.
(2) Adding configuration of an external dictionary table dit _ current to configuration information of a current _ id field, setting table (external dictionary table name) as dit _ current, setting name (field name displayed in a mapping value mode) as field name of dit _ current, and setting id (field name corresponding to current _ id field) as field name id of dit _ current on a map structure.
Configuration sample, yaml format:
- id: currency_id
name: Currency
kind: string
map:
table: dict_currency
name: name
id: id
(3) because the external dictionary table dit _ currenty is configured in the currentid field, a related dictionary structure is generated by analysis, and the currentid can be related to a corresponding dictionary table; if the dimension field does not have an external dictionary table configured, no associated dictionary structure is generated.
The dimension field may also include other fields other than the current _ id, and if other fields need to be displayed in a key-value mapping manner as in the current _ id, they are configured as in (2). The loop reads and configures all dimension fields to end.
Referring to fig. 3, the database table finish _ table shown in fig. 3 has a current _ id field, and when a query instruction of a data analyst queries the current _ id, the measurement field includes the current _ id field, and the current _ id in the query instruction can be directly associated to the dictionary table. However, in order to facilitate the data analyst to use the system, the query instruction issued by the data analyst using the query page may not be currentjd, but may be a term that is convenient for the data analyst to understand, such as "currency ID", and the relevant data table does not have a "currency ID" field, so that mapping configuration needs to be performed in the dimension field, the "currency ID" is mapped to currentjd, and then the configuration in (2) is performed.
The query instruction sent by the data analyst may also be a result value, which needs to be calculated through a field value in the database, if the result value needs to be queried, a measure field is introduced for calculation, the measure field represents numerical data, such as a request number of 100, a click number of 10, a click rate of 0.1 and the like, the type of the data is a numerical value, a dictionary table does not need to be configured, and calculation can be performed according to the requirement of the query instruction; for example, the working year X of a person in a unit needs to be known, but the data base stores the year of the person in employment X1 and the year of the person out of employment X2, and the values of X2-X1 can be directly calculated through the measure field.
FIG. 4 is a UML diagram of an external dictionary table and data table. And the dit _ current is a dictionary table of a current _ id column of the financial statement finance _ table, and the dit _ customer is a dictionary table of a customer _ id column of the financial statement finance _ table. When big data query currentjd column is performed, the name of dictionary table dit _ currentdy is added to the data returned to the front end, and if necessary, a count can be returned at the same time.
Data query:
the following describes a data query for a field using a dictionary table.
FIG. 4 is a schematic flow chart. Carrying out data query by Currency, and realizing by an SQL statement generation and execution module, wherein the method comprises the following steps:
(1) the server side receives the income data which needs to be returned by the front end according to Currency grouping, the grouping parameter is 'Currency _ id', and the searching parameter is 'people'.
(2) The server side reads the configuration information of the dataCube, a dimension field comprises a currentjd and dictionary table dic _ currentjy information is configured, and the currentjd field has an associated dictionary structure.
1) And dynamically generating an SQL query statement conforming to the ClickHouse specification, wherein the statement is as follows:
SELECT currency_id AS id,
sum(income) AS income
FROM finance_table
GROUP BY currency_id
2) generating a statement for acquiring a currentjd mapping value name in a SELECT expression according to a dictionary structure associated with a currentjd field, wherein the name is a mapping value field name in a dictionary table; the modified SQL query statement is as follows:
SELECT currency_id AS id,
dictGet('dict_currency', 'name', tuple(currency_id)) AS name,
sum(income) AS income
FROM finance_table
GROUP BY currency_id
3) adding a statement for searching and matching a 'people' field by using a LIKE operator and a wildcard '%' in a WHERE expression, wherein the modified SQL query statement comprises the following steps:
SELECT currency_id AS id,
dictGet('dict_currency', 'name', tuple(currency_id)) AS name,
sum(income) AS income
WHERE name LIKE '% people'
FROM finance_table
GROUP BY currency_id
If the query command that does not need to be passed by the front end does not have the search parameter "people", the query statement does not include the statement "WHERE name LIKE"% people' "".
4) Executing an inquiry statement, and returning the obtained id, name and income data to the front end in a list manner; the structure is as follows:
[
{
"id": 156,
"name" means "RMB",
"income": 1000
}
]
5) the front end displays the value of the name field to be checked by a data analyzer, and the data analyzer can see that the data is 1000 yuan of income RMB.
Considering that there are many names corresponding to the currency id, and it takes a long time for the data analyst to select all the names to display to the front end, the above example performs the contraction query through the parameter "people", performs the regular matching by using the two characters "people", and the currency list finally seen by the data analyst only contains the "renminbi" in "two characters.
Assuming that the server side receives the income data which the front end needs to return and is grouped according to Currency, the grouping parameter is not 'currentjd' but 'C _ id', and the searching parameter is 'people'. At this time, in the dataCube configuration, the dimension field needs to map C _ id to current _ id, and at the same time, dictionary table information is configured. When the dataCube configuration information is read, it is found that the C _ id maps currentjd, which can still be implemented by the foregoing SQL query statement.
Assuming that the front-end packet parameter received by the server end is 'X', meaning working year, and there is no X field in the database table, only including X1 for the year of employment and X2 for the year of departure, X can be obtained from X2 for the year of employment-X1 for the year of employment. At this time, because the result value needs to be inquired, when the dacCube configuration is performed, the measure field is set to perform the calculation of X2-X1, because the data type of the measure field is a numerical value, the calculation can be directly performed, and when the configuration information of the dacCube is read, the data obtained by inquiring X2 and X1 are subtracted again to obtain the final value, and the final value is returned to the front end. The data configuration mode of the invention is flexible, and the dictionary table can be selected to be configured or not configured according to different requirements, so as to obtain the fastest query efficiency.
The invention and its embodiments have been described above schematically, without limitation, and the invention can be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The representation in the drawings is only one of the embodiments of the invention, the actual construction is not limited thereto, and any reference signs in the claims shall not limit the claims concerned. Therefore, if a person skilled in the art receives the teachings of the present invention, without inventive design, a similar structure and an embodiment to the above technical solution should be covered by the protection scope of the present patent.