Spark SQL multi-table connection optimization method and device, computer equipment and storage medium
1. A Spark SQL multi-table connection optimization method is characterized by comprising the following steps:
acquiring an SQL multi-table connection query statement set designed according to a target data warehouse;
dividing the SQL multi-table connection query statement set into a training set and a test set;
training a pre-constructed Markov decision model according to the training set and the test set to obtain a multi-table connection query plan selection strategy of the target data warehouse;
and connecting the data tables of the target data warehouse according to the multi-table connection query plan selection strategy.
2. The method of claim 1, wherein the Markov decision model is constructed by:
combining SQL multi-table connection query statements input by a Spark program with data warehouse information, and abstracting the SQL multi-table connection query statements into state characteristics of a Markov decision model in a connection tree form;
combining the original logic plan selection and physical plan selection stages of Spark SQL, redesigning a plan selection method, and abstracting the plan selection method into action characteristics of a Markov decision model;
according to the physical realization of Spark SQL connection, a time cost model for evaluating the time cost of executing actions is designed and used as the reward of a Markov decision model;
constructing the Markov decision model based on the state features, the action features, and the reward.
3. The method of claim 2, wherein the step of combining the SQL multi-table connection query statements input by the Spark program with the data warehouse information and abstracting them into the state characteristics of the Markov decision model in the form of a connection tree comprises:
combining SQL multi-table connection query statements input by a Spark program with data warehouse information, and representing each corresponding data table in the form of a connection tree, wherein the characteristics of the connection trees are represented in vector form as follows:
wherein the first m bits are a one-hot encoding of the connection keys contained in the connection tree, and the last bit, Ca, is the estimated cardinality of the connection tree's execution result, obtained from a cardinality estimator provided by the data warehouse;
the forest, consisting of each of the connection trees, is represented as a state characteristic of the Markov decision model.
4. The method according to claim 3, wherein the step of redesigning the plan selection method by combining the logical plan selection stage and the physical plan selection stage of Spark SQL, and abstracting the plan selection method into the action characteristics of the Markov decision model, comprises:
selecting two connection trees L and R from the forest of the state characteristics for connection, and designating a connection physical realization method pa which is expressed in a vector form [ L, R, pa ] as an action characteristic of a Markov decision model;
wherein, L represents a connection left table, R represents a connection right table, and pa represents a connection physical implementation method.
5. The method of claim 4, wherein the connection physical implementation method comprises:
and performing any one or more of: hash join after broadcasting the small table, hash join after a data shuffle, and sort-merge join after a data shuffle.
6. The method according to claim 4, wherein the step of designing a time cost model for evaluating a time cost of performing an action as a reward for the Markov decision model according to a physical implementation of a Spark SQL connection comprises:
and for the action characteristics [ L, R, pa ], taking the estimated cardinalities of the connection trees L and R as input, selecting a cost model according to the selected connection physical implementation method pa, setting parameters of the cost model according to the performance of the Spark cluster computing nodes to obtain an estimated action time cost, and taking the estimated action time cost, after normalization, as the reward of the Markov decision model.
7. The method of claim 1, wherein the step of training a pre-constructed Markov decision model based on the training set and the test set to obtain a multi-table connection query plan selection strategy for the target data warehouse comprises:
and training a pre-constructed Markov decision model in a cross-validation training mode according to the training set and the test set to obtain the multi-table connection query plan selection strategy of the target data warehouse.
8. A Spark SQL multi-table connection optimization apparatus, comprising:
the statement set acquisition module is used for acquiring an SQL multi-table connection query statement set designed according to the target data warehouse;
the dividing module is used for dividing the SQL multi-table connection query statement set into a training set and a test set;
the training module is used for training a pre-constructed Markov decision model according to the training set and the test set to obtain a multi-table connection query plan selection strategy of the target data warehouse;
and the connection module is used for selecting a strategy according to the multi-table connection query plan and connecting the data tables of the target data warehouse.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
Background
Spark SQL is the Spark component used to process structured data and one of the most commonly used components in Spark. Spark SQL originated as a quick and convenient tool for engineers who are familiar with traditional relational databases but not with the MapReduce paradigm, allowing Spark clusters to process large-scale data through a structured query language. It is implemented by the Catalyst query optimizer, which parses and optimizes the input SQL-like statements and finally produces the underlying Spark program for execution. Like traditional database query optimization, Spark SQL query optimization suffers from poor execution performance on multi-table join queries. For a multi-table join query, the join order among the tables and the physical implementation of each join are the major factors affecting execution time, and the optimizer must quickly and effectively select an optimal execution plan.
The existing Spark SQL approach to the multi-table connection query plan selection problem is cost-based optimization: the cost of connecting two tables is estimated in advance by a cost model, a dynamic programming algorithm selects the connection order with the minimum estimated cost to construct the logical plan, and the physical plan is then chosen against a preset threshold. However, the search space of multi-table connection execution plans grows explosively with the number of connected tables, so selecting the optimal plan takes too long when a query connects many tables, and the threshold for selecting the physical connection implementation must be tuned manually based on developer experience. With the development of complex analysis methods in big data, these problems of Spark SQL multi-table connection queries have become increasingly prominent. In recent years, methods that optimize multi-table join queries with deep reinforcement learning have emerged in the traditional database field. However, these methods optimize only the join order; in Spark SQL, a query is converted into an underlying Spark RDD program for execution, and the physical implementation chosen for each connection operation is also an important factor affecting multi-table connection query performance.
Therefore, the existing Spark SQL query optimizer is inefficient at selecting the optimal connection execution plan, and its selection of the optimal physical plan depends on developer experience.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a Spark SQL multi-table connection optimization method, apparatus, computer device, and storage medium capable of improving the efficiency of optimal connection execution planning.
A Spark SQL multi-table connection optimization method comprises the following steps:
acquiring an SQL multi-table connection query statement set designed according to a target data warehouse;
dividing the SQL multi-table connection query statement set into a training set and a test set;
training a pre-constructed Markov decision model according to the training set and the test set to obtain a multi-table connection query plan selection strategy of the target data warehouse;
and connecting the data tables of the target data warehouse according to the multi-table connection query plan selection strategy.
In one embodiment, the Markov decision model is constructed by:
combining SQL multi-table connection query statements input by a Spark program with data warehouse information, and abstracting the SQL multi-table connection query statements into state characteristics of a Markov decision model in a connection tree form;
combining the original logic plan selection and physical plan selection stages of Spark SQL, redesigning a plan selection method, and abstracting the plan selection method into action characteristics of a Markov decision model;
according to the physical realization of Spark SQL connection, a time cost model for evaluating the time cost of executing actions is designed and used as the reward of a Markov decision model;
constructing the Markov decision model based on the state features, the action features, and the reward.
In one embodiment, the step of combining the SQL multi-table connection query statement input by the Spark program with the data warehouse information and abstracting it into the state features of the Markov decision model in the form of a connection tree includes:
combining SQL multi-table connection query statements input by a Spark program with data warehouse information, and representing each corresponding data table in the form of a connection tree, wherein the characteristics of the connection trees are represented in vector form as follows:
wherein the first m bits are a one-hot encoding of the connection keys contained in the connection tree, and the last bit, Ca, is the estimated cardinality of the connection tree's execution result, obtained from a cardinality estimator provided by the data warehouse;
the forest, consisting of each of the connection trees, is represented as a state feature of the Markov decision model.
In one embodiment, the step of combining the two original stages of logical plan selection and physical plan selection of Spark SQL, redesigning the plan selection method, and abstracting it into the action characteristics of the Markov decision model includes:
selecting two connection trees L and R from the forest of the state characteristics for connection, and designating a connection physical realization method pa which is expressed in a vector form [ L, R, pa ] as an action characteristic of a Markov decision model;
wherein, L represents a connection left table, R represents a connection right table, and pa represents a connection physical implementation method.
In one embodiment, the connection physical implementation method includes:
and performing any one or more of: hash join after broadcasting the small table, hash join after a data shuffle, and sort-merge join after a data shuffle.
In one embodiment, the step of designing a time cost model for evaluating the time cost of performing an action, according to the physical implementation of a Spark SQL connection, as the reward for the Markov decision model includes:
and for the action characteristics [ L, R, pa ], taking the estimated cardinalities of the connection trees L and R as input, selecting a cost model according to the selected connection physical implementation method pa, setting parameters of the cost model according to the performance of the Spark cluster computing nodes to obtain an estimated action time cost, and taking the estimated action time cost, after normalization, as the reward of the Markov decision model.
In one embodiment, the step of training a pre-constructed Markov decision model according to the training set and the test set to obtain the multi-table connection query plan selection strategy of the target data warehouse includes:
and training a pre-constructed Markov decision model by adopting a cross-checking training mode according to the training set and the test set to obtain a multi-table connection query plan selection strategy of the target data warehouse.
A Spark SQL multi-table connection optimization apparatus, the apparatus comprising:
the statement set acquisition module is used for acquiring an SQL multi-table connection query statement set designed according to the target data warehouse;
the dividing module is used for dividing the SQL multi-table connection query statement set into a training set and a test set;
the training module is used for training a pre-constructed Markov decision model according to the training set and the test set to obtain a multi-table connection query plan selection strategy of the target data warehouse;
and the connection module is used for selecting a strategy according to the multi-table connection query plan and connecting the data tables of the target data warehouse.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method.
According to the Spark SQL multi-table connection optimization method, device, computer equipment and storage medium, the SQL multi-table connection query statement set designed according to the target data warehouse is obtained; dividing an SQL multi-table connection query statement set into a training set and a test set; training a pre-constructed Markov decision model according to the training set and the test set to obtain a multi-table connection query plan selection strategy of a target data warehouse; the Markov decision model is solved by using a deep reinforcement learning algorithm to obtain a multi-table connection query plan selection strategy of a target data warehouse, and the data tables of the target data warehouse are connected according to the multi-table connection query plan selection strategy. The problems that the efficiency of existing query optimization of Spark SQL is low when a multi-table connection query plan is selected and the existing query optimization of Spark SQL depends on the experience of developers are solved, and the multi-table connection query execution plan can be quickly and effectively constructed, so that the Spark SQL execution efficiency is improved.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating a Spark SQL multi-table join optimization method in one embodiment;
FIG. 2 is a diagram illustrating an example of a join tree construction for a multi-table join SQL query, in accordance with an embodiment;
FIG. 3 is a diagram illustrating relationships between modules of a training method according to an embodiment;
FIG. 4 is a graph of experimental test plan build time results in one embodiment;
FIG. 5 is a graph illustrating results of execution times for an experimental test plan in one embodiment;
FIG. 6 is a block diagram illustrating an exemplary Spark SQL multi-table connection optimization apparatus.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a Spark SQL multi-table connection optimization method is provided, which includes the following steps:
step S220, obtaining an SQL multi-table connection query statement set designed according to the target data warehouse.
The target data warehouse refers to a database which needs to be connected with data tables.
Step S240, dividing the SQL multi-table connection query statement set into a training set and a test set.
And step S260, training the pre-constructed Markov decision model according to the training set and the test set to obtain a multi-table connection query plan selection strategy of the target data warehouse.
The Markov decision model is used for Spark SQL multi-table connection query plan selection. Through the model, the Spark SQL multi-table connection query execution plan selection problem can be converted into a Markov decision process, and training with a deep reinforcement learning algorithm yields a strategy that guides construction of the optimal plan, namely the multi-table connection query plan selection strategy of the target data warehouse.
Specifically, when a Spark SQL query is applied to a certain data warehouse, the SQL multi-table connection query statement of that warehouse is abstracted into an initial state, represented in matrix form, according to the data table information in the warehouse and the data table cardinality information obtained by Spark SQL through histograms. For that data warehouse, training is performed with the pre-constructed Markov decision model: an action to be executed in the current state is selected, yielding a new state. If the new state is the termination state, the optimal execution plan has been established; otherwise, the next action to be executed is selected according to the action selection strategy, until the new state obtained is the termination state and the optimal execution plan is established. This establishes the multi-table connection query plan selection strategy of the data warehouse.
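As a non-authoritative illustration, the plan-construction loop described above can be sketched as follows; `policy` and `apply_action` are hypothetical stand-ins for the trained selection strategy and the warehouse-aware state transition, neither of which is specified in code by this application:

```python
def build_plan(initial_trees, policy, apply_action):
    """Repeatedly apply the policy's chosen action until a single join
    tree (the complete execution plan) remains.

    policy(trees)             -> an action triple (L, R, pa)
    apply_action(trees, act)  -> the new forest after merging trees L and R
    Returns the ordered list of chosen actions and the final join tree.
    """
    trees, plan = list(initial_trees), []
    while len(trees) > 1:          # termination state: one join tree left
        action = policy(trees)
        plan.append(action)
        trees = apply_action(trees, action)
    return plan, trees[0]
```

A query over k tables thus always terminates after exactly k − 1 actions, since each action reduces the forest by one tree.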
Step S280, connecting the data tables of the target data warehouse according to the multi-table connection query plan selection strategy.
According to the Spark SQL multi-table connection optimization method, an SQL multi-table connection query statement set designed according to a target data warehouse is obtained; dividing an SQL multi-table connection query statement set into a training set and a test set; training a pre-constructed Markov decision model according to the training set and the test set to obtain a multi-table connection query plan selection strategy of a target data warehouse; the Markov decision model is solved by using a deep reinforcement learning algorithm to obtain a multi-table connection query plan selection strategy of a target data warehouse, and the data tables of the target data warehouse are connected according to the multi-table connection query plan selection strategy. The problems that the efficiency of existing query optimization of Spark SQL is low when a multi-table connection query plan is selected and the existing query optimization of Spark SQL depends on the experience of developers are solved, and the multi-table connection query execution plan can be quickly and effectively constructed, so that the Spark SQL execution efficiency is improved.
In one embodiment, the Markov decision model is constructed by: combining SQL multi-table connection query statements input by a Spark program with data warehouse information, and abstracting the SQL multi-table connection query statements into state characteristics of a Markov decision model in a connection tree form; combining the original logic plan selection and physical plan selection stages of Spark SQL, redesigning a plan selection method, and abstracting the plan selection method into action characteristics of a Markov decision model; according to the physical realization of Spark SQL connection, a time cost model for evaluating the time cost of executing actions is designed and used as the reward of a Markov decision model; a Markov decision model is constructed from the state features, the action features, and the reward.
In order to convert the execution plan selection process of the multi-table connection into a markov decision process and solve the problem by using a reinforcement learning method, firstly, an input multi-table connection query SQL statement needs to be abstracted.
In one embodiment, the step of combining the SQL multi-table connection query statements input by the Spark program with the data warehouse information and abstracting them into the state features of the Markov decision model in the form of a connection tree comprises:
combining SQL multi-table connection query statements input by a Spark program with data warehouse information, and representing each corresponding data table in the form of a connection tree, wherein the characteristics of the connection trees are represented in vector form as follows:
wherein the first m bits are a one-hot encoding of the connection keys contained in the connection tree, and the last bit, Ca, is the estimated cardinality of the connection tree's execution result, obtained from a cardinality estimator provided by the data warehouse;
the forest, consisting of the connection trees, is represented as the state feature of the Markov decision model.
The execution plan of a multi-table connection can generally be abstracted as a connection tree: a binary tree in which each non-leaf node represents an intermediate connection result and has two child nodes, the leaf nodes represent the base data tables participating in the connection, and the root node represents the final result of the query. A novel connection tree construction method is designed on the basis of this structure. First, the initial state is the set of base data tables contained in the multi-table connection query; each base table can be abstracted as a connection tree with only a root node, so the initial state can be regarded as a forest consisting of the connection trees representing the base tables. Second, each step of the operation selects two connection trees from the forest and performs a connection operation; the obtained intermediate result forms a new connection binary tree, which together with the remaining connection trees forms a new forest, transferring to a new state. The operation continues until only one connection tree remains, i.e., one connection plan for the connection query has been constructed. In this construction method, the forest at each step can be regarded as the current state, and selecting two connection trees to connect can be regarded as the action; since the probability of selecting an action in each state and the state change after the action depend only on the current state, the construction method can be shown to have the Markov property.
For a state in the connection tree construction method designed above, a connection tree T_i is expressed in vector form:

T_i = [ … 0 1 0 … Ca ]

where the first m bits are a one-hot encoding of the connection keys contained in the connection tree, and the last bit Ca is the estimated cardinality of the connection tree's execution result. The state is treated as a set of connection trees and represented in matrix form, encoded as an n × (m + 1) matrix of the form:

S_i = [ T_1 ; T_2 ; … ; T_n ]

where S_i is the state at time i; n is the number of relational tables in the applied data warehouse, which is also the maximum number of connection trees that can appear; m is the number of attributes that can be used as connection keys in the data warehouse; and Ca_i is the estimated cardinality of connection tree i. Each row represents one connection tree in the current state: a bit set to 1 indicates that the connection key represented by that column has been merged into the connection tree represented by that row, and an all-zero row indicates that no connection tree exists at that position in the current state. When only one row of the matrix is not all zeros, the termination state has been reached and a connection plan tree has been produced.
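A minimal sketch, under assumed data structures, of the n × (m + 1) state encoding described above; each connection tree is represented here as a (set-of-join-key-indices, estimated-cardinality) pair, an illustrative choice rather than the application's own representation:

```python
def encode_state(join_trees, n, m):
    """Encode a forest of join trees as the n x (m+1) state matrix:
    the first m columns one-hot the join keys merged into each tree,
    the last column holds the tree's estimated cardinality Ca_i, and
    all-zero rows mark positions with no join tree."""
    state = [[0.0] * (m + 1) for _ in range(n)]
    for row, (keys, cardinality) in enumerate(join_trees):
        for k in keys:
            state[row][k] = 1.0      # join key k is in this tree
        state[row][m] = float(cardinality)
    return state

def is_terminal(state):
    # Termination: exactly one non-zero row, i.e. a single join tree remains.
    return sum(1 for row in state if any(row)) == 1
```

For example, a warehouse with n = 3 tables and m = 2 join-key attributes yields a 3 × 3 matrix for each state.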
In one embodiment, the method for redesigning the plan selection method by combining the two stages of logical plan selection and physical plan selection originally existing in Spark SQL, and abstracting the behavior characteristics of the markov decision model includes the steps of: selecting two connection trees L and R from a forest of state characteristics for connection, and designating a connection physical realization method pa, which is expressed in a vector form [ L, R, pa ] as an action characteristic of a Markov decision model; wherein, L represents a connection left table, R represents a connection right table, and pa represents a connection physical implementation method.
The Markov decision model also needs to abstract the connection tree merge operation, which produces a state change, into an action vector. The action feature vector is designed as a triple of the form:

A(S_i, j) = [ L, R, pa ]

where A(S_i, j) denotes action j selected in state S_i; L and R are the row numbers in the state matrix of the left and right subtrees participating in the connection, i.e., the sequence numbers of the connection trees, each with value range [0, n); and pa is the selected underlying implementation of the connection operation. In Spark SQL, an equi-join has three physical implementations, denoted by 0, 1, and 2 respectively. For a data warehouse with n data tables there are therefore at most 3·n·(n − 1) possible actions, and which actions are selectable is judged from the current state. As shown in FIG. 2, the process of constructing a connection tree for a multi-table connection SQL query is illustrated by example, with the state and action characteristics expressed in the form of a matrix and a vector set in the manner described above. In the example, the warehouse contains three tables A, B, and C, each having two attributes usable as connection keys, and the three tables are connected on attribute k; the initial state therefore contains three connection trees, and the column set to 1 in each row marks the position of the connection attribute A.k, B.k, or C.k in the warehouse. The cardinalities of the three base tables are obtained from the data warehouse through the cardinality estimation module and placed in the last column, forming the matrix representing the initial state. For this state, the selectable actions in the action list are determined and one is selected.

In the example, the first action selected is [0, 1, pa], i.e., connection tree 0 and connection tree 1 are connected using underlying implementation method pa. Two connection trees then exist in the second state, the cardinality attribute of the newly obtained connection tree being supplied by the cardinality estimation module of the data warehouse. Action selection continues, and only one connection tree exists in the third state, i.e., the termination state is reached.
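The action space and state transition just described can be sketched as follows (hypothetical helper names; the cardinality of a merged tree is stubbed out, since the application obtains it from the warehouse's cardinality estimation module):

```python
def candidate_actions(trees, n_physical=3):
    """All ordered (left, right) pairs of distinct join trees, combined
    with each physical join method pa in {0, 1, 2}."""
    return [(l, r, pa)
            for l in range(len(trees)) for r in range(len(trees)) if l != r
            for pa in range(n_physical)]

def apply_action(trees, action, estimate_cardinality):
    """Merge join trees L and R into one tree; trees are
    (join_key_set, cardinality) pairs and action is (L, R, pa)."""
    L, R, pa = action
    left, right = trees[L], trees[R]
    merged = (left[0] | right[0], estimate_cardinality(left, right, pa))
    rest = [t for i, t in enumerate(trees) if i not in (L, R)]
    return rest + [merged]
```

With three trees this enumerates 3·2 ordered pairs times 3 physical methods, i.e. 18 candidate actions, matching the 3·n·(n − 1) bound for n trees present.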
In one embodiment, the connection physical implementation method comprises: any one or more of hash join after broadcasting the small table, hash join after a data shuffle, and sort-merge join after a data shuffle.
Wherein Spark SQL provides three physical implementations for equi-joins: Broadcast Join (hash join after broadcasting the small table), Shuffle Hash Join (hash join after a data shuffle), and Sort Merge Join (sort and merge join after a data shuffle). pa takes values in {0, 1, 2}, representing the three physical implementations respectively: pa = 0 denotes hash join after broadcasting the small table, pa = 1 denotes hash join after a data shuffle, and pa = 2 denotes sort-merge join after a data shuffle.
In one embodiment, the step of designing a time cost model for evaluating the time cost of performing an action, according to the physical implementation of a Spark SQL connection, as the reward for the Markov decision model comprises: for the action characteristics [ L, R, pa ], taking the estimated cardinalities of the connection trees L and R as input, selecting a cost model according to the selected connection physical implementation method pa, setting parameters of the cost model according to the performance of the Spark cluster computing nodes to obtain an estimated action time cost, and taking the estimated action time cost, after normalization, as the reward of the Markov decision model.
In one embodiment, the step of training the pre-constructed Markov decision model to obtain the multi-table connection query plan selection strategy for the target data warehouse based on the training set and the test set comprises: training the pre-constructed Markov decision model in a cross-validation training mode according to the training set and the test set to obtain the multi-table connection query plan selection strategy of the target data warehouse.
As shown in fig. 3, in an embodiment, the method for training the pre-constructed markov decision model according to the training set and the test set specifically includes the following steps:
the training method comprises a multi-table connection query statement abstraction module (such as an abstraction model shown in fig. 3), a radix estimation and connection cost design module (such as a cost model shown in fig. 3) and a deep reinforcement learning training module (such as a training model shown in fig. 3), wherein the multi-table connection query statement abstraction module is responsible for converting an optimal plan selection problem of multi-table connection into three parts of state, action and state transition in a markov decision process, and abstracting state features capable of representing the current connection tree condition, action features capable of representing connection sequence selection and bottom layer execution method selection and a state transition equation capable of representing state change after action execution from an SQL multi-table connection query statement input by a Spark SQL program and an applied data warehouse. The radix estimation and connection cost design module designs a cost model for judging the resource consumption of a certain action executed in a certain state by analyzing the execution process of the Spark SQL connection operation bottom layer, and calculates by taking the radix of the data table and the intermediate result as input to obtain the estimation cost as the reward in the reinforcement learning. For three physical implementations of Broadcast Join, Shuffle Hash Join and Sort target Join of Spark SQL equivalent connection, the radix estimation and connection cost design module is respectively set as:
CostSMJ(A,B)=(Ca(A)*size(A)+Ca(B)*size(B))*T(SS)+(Ca(i)+Ca(j))*T(c)
Among them, CostBJ(A, B) is the time cost for Spark SQL to execute a Broadcast Join on data table A and data table B in the computing cluster; CostSHJ(A, B) is the time cost to execute a Shuffle Hash Join on data table A and data table B; and CostSMJ(A, B) is the time cost to execute a Sort Merge Join on data table A and data table B. Ca(i) is the cardinality (number of rows) of table i and size(i) is the average size of one piece of data in table i, where i is the number of data table A or B; Ca(j) is the cardinality of table j, i ≠ j. T(bc) is the time constant for the Spark cluster to broadcast one unit of data; CostHJ(A, Bk) is the time cost for a child node of the Spark computing cluster to perform a Hash Join on table A and the partial table Bk, where Bk is the part of table B assigned to that child node; CostHJ(Ak, Bk) is the time cost for a child node to perform a Hash Join on the partial tables Ak and Bk, where Ak is the part of table A assigned to that child node. T(hs) is the time constant for one unit of cluster data to undergo a Hash Shuffle, T(ss) is the time constant for one unit of cluster data to undergo a Sort Shuffle, and T(c) is the time constant for a cluster compute node to compare one unit of data. The time cost model CostHJ(i, j) is set as:

CostHJ(i, j) = Ca(i)*T(hash) + Ca(j)*T(c), when Ca(i)*size(i) ≤ M

CostHJ(i, j) = Ca(i)*T(hash) + Ca(i)*size(i)*T(w) + Ca(j)*T(c), otherwise

where T(hash) is the time constant for a cluster compute node to insert one unit of data into a hash table, M is the memory size provided by a cluster child node, and T(w) is the time constant for a cluster child node to write one unit of data to disk. When the memory Ca(i)*size(i) occupied by the hash table built from table i is at most the memory size M provided by the child node, the time cost of the Hash Join is the time to build the hash table on table i plus the time to probe it with table j; otherwise, the cost of spilling the hash table to disk is added. All of the above time constants should be set according to the configuration environment of the Spark cluster. After an action is executed in a given state, the cost value obtained from the time cost model is normalized, and the normalized value serves as the reward in reinforcement learning.
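As a minimal illustration, the cost models above can be sketched in Python. All constants and cluster parameters below (the T(...) values, the node memory M, and the node count N) are hypothetical placeholders, not values from the application; in practice they must be calibrated to the configuration environment of the target Spark cluster.

```python
# Sketch of the time cost models for the three equi-join implementations.
# Every constant here is a hypothetical placeholder; calibrate per cluster.
T_BC, T_HS, T_SS, T_C, T_HASH, T_W = 0.5, 0.2, 0.3, 0.01, 0.02, 0.1
M = 64 * 1024 * 1024   # memory per cluster child node, in bytes (assumed)
N = 8                  # number of cluster child nodes (assumed)

def cost_hj(ca_i, size_i, ca_j):
    """Hash Join on one node: build a hash table on table i, probe with j;
    add a disk-spill cost when the hash table exceeds node memory M."""
    cost = ca_i * T_HASH + ca_j * T_C
    if ca_i * size_i > M:
        cost += ca_i * size_i * T_W   # spill the hash table to disk
    return cost

def cost_bj(ca_a, size_a, ca_b, size_b):
    """Broadcast Join: ship table A to every node, then each node
    hash-joins A with its local partial table B_k."""
    return ca_a * size_a * T_BC + N * cost_hj(ca_a, size_a, ca_b / N)

def cost_shj(ca_a, size_a, ca_b, size_b):
    """Shuffle Hash Join: hash-shuffle both tables, then join A_k with B_k."""
    shuffle = (ca_a * size_a + ca_b * size_b) * T_HS
    return shuffle + N * cost_hj(ca_a / N, size_a, ca_b / N)

def cost_smj(ca_a, size_a, ca_b, size_b):
    """Sort Merge Join: sort-shuffle both tables, then merge by comparison."""
    return (ca_a * size_a + ca_b * size_b) * T_SS + (ca_a + ca_b) * T_C

def reward(cost, max_cost):
    """Normalize the estimated cost into [0, 1] so that a cheaper action
    yields a higher reinforcement-learning reward."""
    return 1.0 - cost / max_cost
```

Given estimated cardinalities for the two inputs, the optimizer can evaluate all three costs and take the cheapest physical implementation as the action's cost, then normalize it into the reward.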
The multi-table connection query statement abstraction module and the radix estimation and connection cost design module together convert the multi-table connection plan construction problem into a Markov decision process, which is then trained with a deep reinforcement learning algorithm. In this implementation, the deep reinforcement learning DQN algorithm is used for training; after training converges, a strategy matrix is obtained, with which Spark SQL can select the optimal execution plan more efficiently when optimizing and guiding multi-table connection queries.
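How a converged policy could drive plan construction can be sketched as follows. This is a simplified illustration: the state is reduced to a set of join trees, and `toy_q_value` is a hypothetical stand-in for the trained DQN strategy matrix, not the application's actual network.

```python
import itertools

JOIN_METHODS = ("BJ", "SHJ", "SMJ")  # the three physical implementations

def greedy_plan(tables, q_value):
    """Build a join plan step by step, always taking the action with the
    highest value under the trained policy. Each state is the set of join
    trees built so far; each action joins two trees with one method."""
    state = frozenset((t,) for t in tables)   # start: every table is its own tree
    plan = []
    while len(state) > 1:
        actions = [(a, b, m)
                   for a, b in itertools.combinations(sorted(state), 2)
                   for m in JOIN_METHODS]
        a, b, m = max(actions, key=lambda act: q_value(state, act))
        plan.append((a, b, m))
        state = (state - {a, b}) | {tuple(sorted(a + b))}  # merge the two trees
    return plan

# Hypothetical stand-in for the converged DQN: prefer joining small trees first.
def toy_q_value(state, action):
    a, b, m = action
    return -(len(a) + len(b)) + (0.1 if m == "SMJ" else 0.0)
```

Because one action is chosen per step, a query over n tables is planned in exactly n - 1 greedy steps, which is the source of the near-linear construction time discussed below.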
To verify the improvement in Spark SQL multi-table connection query optimization performance, experimental verification was carried out with the JOB SQL multi-table connection query set built on the IMDb online movie database. The experiment uses cross-validation: 80 SQL multi-table connection query statements form the training set and 33 form the test set, and four differently divided groups of training and test sets are constructed, ensuring that every data table appears in the training set so as to avoid insufficient training. The number of MLP network layers in the DQN is set to 3, which gave the best results in the experiment. The present application is compared with the existing CBO module of Spark SQL for optimizing multi-table connections:
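One way such train/test divisions with full table coverage could be constructed is by rejection sampling. Everything below is illustrative: `tables_of` is a hypothetical helper that returns the set of data tables referenced by a query, and the resampling loop simply rejects any division that leaves some table untrained.

```python
import random

def make_divisions(queries, tables_of, n_divisions=4, train_size=80, seed=0):
    """Sample train/test divisions, resampling until every data table is
    referenced by at least one training query (avoids insufficient training)."""
    rng = random.Random(seed)
    all_tables = set().union(*(tables_of(q) for q in queries))
    divisions = []
    for _ in range(n_divisions):
        while True:
            train = rng.sample(queries, train_size)
            covered = set().union(*(tables_of(q) for q in train))
            if covered == all_tables:   # every table seen during training
                break
        test = [q for q in queries if q not in train]
        divisions.append((train, test))
    return divisions
```

With the 113 JOB queries this yields 80 training and 33 test statements per division, matching the experimental setup described above.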
Fig. 4 compares the time spent selecting the optimal plan, at different connection numbers, by the conventional dynamic programming method of Spark SQL (CBO-dp) and by the deep-reinforcement-learning-based Spark SQL multi-table connection optimization method designed in the present application (CBO-drl); the abscissa is the number of connections and the ordinate is the plan construction time (ms) on a logarithmic scale. The figure shows that the time of the dynamic-programming-based method grows exponentially with the number of connections, while the time complexity of the deep-reinforcement-learning-based method is approximately linear, because it constructs the plan greedily using the trained Markov decision model. In terms of plan construction efficiency, when the number of connections is less than 5 the dynamic-programming-based method is superior, since loading and evaluating the policy neural network in the model takes a certain amount of time; but when the number of connections exceeds 5, the deep-reinforcement-learning-based method is significantly superior to the traditional method. Fig. 5 compares the total time of CBO-dp and CBO-drl for constructing and executing the plan at different connection numbers. It can be seen that the deep-reinforcement-learning-based method accurately selects the optimal plan for execution in most cases, and its overall performance is significantly better than that of the conventional method when the number of connections is 3, 6, 8, 9, and 11. On average, CBO-drl achieves a 15.2% speedup over CBO-dp.
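The exponential-versus-near-linear gap can be made concrete with a simple count. Under the simplifying assumption that the dynamic programming optimizer fills one entry per non-empty table subset and considers every way of splitting a subset into two disjoint sub-plans, the work compares as follows (this is an illustrative model of Selinger-style enumeration, not Spark's exact CBO implementation):

```python
from math import comb

def dp_entries(n):
    """Subset/split pairs a Selinger-style dynamic program examines for n
    tables: each subset of size k >= 2 has 2**k - 2 proper binary splits."""
    return sum(comb(n, k) * (2 ** k - 2) for k in range(2, n + 1))

def greedy_steps(n):
    """A trained policy builds a left-deep plan greedily in n - 1 joins."""
    return n - 1
```

The closed form is dp_entries(n) = 3**n - 2**(n+1) + 1, so at 10 tables the dynamic program examines 57002 subset/split pairs while the greedy policy takes 9 steps, which matches the crossover behaviour observed in fig. 4.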
It should be understood that, although the steps in the flowchart of fig. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited to the order shown, and they may be performed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily completed at the same moment but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or in alternation with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 6, there is provided a Spark SQL multi-table connection optimization apparatus, including: a statement set acquisition module 310, a partitioning module 320, a training module 330, and a connection module 340.
The statement set acquisition module 310 is configured to acquire an SQL multi-table join query statement set designed according to the target data warehouse.
The dividing module 320 is configured to divide the SQL multi-table join query statement set into a training set and a test set.
The training module 330 is configured to train the pre-constructed Markov decision model according to the training set and the test set to obtain a multi-table connection query plan selection strategy of the target data warehouse.
The connection module 340 is configured to connect the data tables of the target data warehouse according to the multi-table connection query plan selection strategy.
For specific limitations of the Spark SQL multi-table connection optimization device, reference may be made to the limitations of the Spark SQL multi-table connection optimization method above, which are not repeated here. Each module in the Spark SQL multi-table connection optimization device may be implemented wholly or partly by software, by hardware, or by a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program; the processor, when executing the computer program, implements the steps of the Spark SQL multi-table connection optimization method described above.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; the computer program, when executed by a processor, implements the steps of the Spark SQL multi-table connection optimization method described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing relevant hardware; the program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of this specification.
The above-mentioned embodiments express only several implementations of the present application, and their description is specific and detailed, but should not therefore be construed as limiting the scope of the invention. It should be noted that a person skilled in the art may make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.