Distributed database access processing method and apparatus

By constructing execution plan features and node features, combined with index configuration features, the execution plan cost of distributed databases is predicted, solving the problem of accuracy in query request routing and improving the efficiency and timeliness of database access processing.

CN117762986B · Active · Publication Date: 2026-06-30 · BEIJING OCEANBASE TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: BEIJING OCEANBASE TECHNOLOGY CO LTD
Filing Date: 2023-12-05
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

In existing technologies, during the query request routing process of distributed databases, the accuracy of query cost routing selection is low due to network latency and cross-storage node access bandwidth, which affects the efficient operation of the database.

Method used

By constructing execution plan features and node features, and combining them with the index configuration features of the target data table, the execution cost of each execution plan is predicted, thereby determining the target execution plan and executing the access statements, thus improving the accuracy of execution plan prediction.

Benefits of technology

It improves the efficiency and timeliness of distributed database access processing, optimizes query request routing through accurate execution cost prediction, and enhances the overall performance of the database.


Abstract

This specification provides a method and apparatus for accessing a distributed database. One method includes: during data access to a distributed database via an access statement, constructing execution plan features based on the execution logic, execution nodes, and data partitions of each execution plan of the access statement; generating node features based on the data partitions and node communication bandwidth of each execution node; merging the execution plan features and node features into a merged feature according to the data partitions; predicting the execution cost of each execution plan based on the merged feature and the index configuration features of the target data table of the access statement; thereby determining the target execution plan and performing the execution processing of the access statement according to the target execution plan.

Description

Technical Field

[0001] This document relates to the field of database technology, and in particular to a method and apparatus for accessing a distributed database.

Background Technology

[0002] Distributed databases provide distributed storage and computing functions through a service cluster composed of several storage nodes. Different data in a distributed database table can be allocated to different storage nodes for storage according to a partitioning strategy. Query requests for different data need to be routed to their corresponding storage nodes for processing as much as possible. To ensure the efficient operation of a distributed database, the routing process for query requests is often based on the query cost. However, due to network latency and cross-storage node access bandwidth, the accuracy of routing based on query cost is relatively low.

Summary of the Invention

[0003] This specification provides one or more embodiments of a distributed database access processing method, comprising: constructing execution plan features based on the execution logic, execution nodes, and data partitions of each execution plan of an access statement in the distributed database; generating node features based on the data partitions and node communication bandwidth of the execution nodes of each execution plan; merging the execution plan features and the node features according to the data partitions to obtain merged features; and predicting the execution cost of each execution plan based on the merged features and the index configuration features of the target data table of the access statement, so as to determine the target execution plan and execute the access statement based on the execution cost.

[0004] This specification provides one or more embodiments of a distributed database access processing apparatus, comprising: a plan feature construction module configured to construct execution plan features based on the execution logic, execution nodes, and data partitions of each execution plan of an access statement in the distributed database; a node feature generation module configured to generate node features based on the data partitions and node communication bandwidth of the execution nodes of each execution plan; a feature merging module configured to merge the execution plan features and the node features according to data partitions to obtain merged features; and an execution cost prediction module configured to predict the execution cost of each execution plan based on the merged features and the index configuration features of the target data table of the access statement, so as to determine the target execution plan and execute the access statement based on the execution cost.

[0005] This specification provides one or more embodiments of a distributed database access processing device, including: a processor; and a memory configured to store computer-executable instructions, which, when executed, cause the processor to: construct execution plan features based on the execution logic, execution nodes, and data partitions of various execution plans for an access statement of the distributed database; generate node features based on the data partitions and node communication bandwidths of the execution nodes of the various execution plans; merge the execution plan features and the node features according to the data partitions to obtain merged features; and predict the execution cost of each execution plan based on the merged features and the index configuration features of the target data table of the access statement, so as to determine the target execution plan based on the execution cost and execute the access statement.

[0006] This specification provides one or more embodiments of a storage medium for storing computer-executable instructions, which, when executed by a processor, implement the following process: constructing execution plan features based on the execution logic, execution nodes, and data partitions of each execution plan for an access statement in a distributed database; generating node features based on the data partitions and node communication bandwidth of the execution nodes for each execution plan; merging the execution plan features and the node features according to the data partitions to obtain merged features; and predicting the execution cost of each execution plan based on the merged features and the index configuration features of the target data table of the access statement, so as to determine the target execution plan based on the execution cost and execute the access statement.

Attached Figure Description

[0007] To more clearly illustrate the technical solutions in one or more embodiments of this specification or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this specification. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0008] Figure 1 is a schematic diagram of the implementation environment of a distributed database access processing method provided in one or more embodiments of this specification;

[0009] Figure 2 is a flowchart of a distributed database access processing method provided in one or more embodiments of this specification;

[0010] Figure 3 is a deployment architecture diagram of the OceanBase distributed database provided in one or more embodiments of this specification;

[0011] Figure 4 is a schematic diagram of the execution of a query plan provided in one or more embodiments of this specification;

[0012] Figure 5 is an architectural block diagram of a cost prediction model provided in one or more embodiments of this specification;

[0013] Figure 6 is a flowchart of a distributed database access processing method applied to a database query scenario, provided in one or more embodiments of this specification;

[0014] Figure 7 is a schematic diagram of an embodiment of a distributed database access processing apparatus provided in one or more embodiments of this specification;

[0015] Figure 8 is a schematic structural diagram of a distributed database access processing device provided in one or more embodiments of this specification.

Detailed Implementation

[0016] To enable those skilled in the art to better understand the technical solutions in one or more embodiments of this specification, the technical solutions in one or more embodiments of this specification will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this specification, and not all of the embodiments. Based on one or more embodiments of this specification, all other embodiments obtained by those skilled in the art without creative effort should fall within the protection scope of this document.

[0017] The distributed database access processing method provided in one or more embodiments of this specification is applicable to the implementation environment of a distributed database system. Referring to Figure 1, the implementation environment includes at least:

[0018] The distributed database 101 includes a plan generation unit 102 that generates one or more execution plans for access statements, a plan scheduling unit 103 that selects a target execution plan for scheduling based on the execution cost of the execution plan, and a cost prediction model 104 that predicts the execution cost of the execution plan. The distributed database 101 is deployed with at least one processing cluster, each processing cluster consisting of one or more processing nodes, and each processing node containing multiple data partitions.

[0019] In this implementation environment, for an access statement that accesses data in the distributed database 101, the plan generation unit 102 generates execution plans for the access statement and sends them to the plan scheduling unit 103 for scheduling. During scheduling, the plan scheduling unit 103 calls the cost prediction model 104 to predict the execution cost of each execution plan and, based on the predicted execution cost, determines the target execution plan with which to execute the access statement.

[0020] Specifically, in the process of predicting the execution cost of the execution plans of an access statement, on the one hand, execution plan features are constructed based on the execution logic, execution nodes, and data partitions of each execution plan; on the other hand, node features are generated based on the data partitions and node communication bandwidth of the execution nodes of each execution plan. The execution plan features and node features are then merged according to the data partitions to obtain merged features, and the execution cost of each execution plan is predicted based on the merged features and the index configuration features of the target data table of the access statement. In this way, the prediction combines execution logic, execution nodes, data partitions, and node communication bandwidth, which makes the predicted execution cost of each execution plan more accurate and helps improve the timeliness and efficiency of processing access statements that access data in the distributed database 101.

[0021] This specification provides one or more embodiments of a distributed database access processing method, as follows:

[0022] Referring to Figure 2, the distributed database access processing method provided in this embodiment specifically includes steps S202 to S208.

[0023] Step S202: Construct execution plan features based on the execution logic, execution nodes, and data partitions of each execution plan of the access statements in the distributed database.

[0024] The distributed database access statement described in this embodiment refers to a statement that accesses data in the distributed database. This access statement can be a data operation statement that performs add, delete, modify, or query operations on the distributed database, such as the SQL query statement for querying data in the OceanBase distributed database: "SELECT * FROM t1, t2 WHERE t1.c1=t2.c1".

[0025] Optionally, the access statements of the distributed database are executed by at least one processing cluster of the distributed database; the processing cluster consists of one or more processing nodes, and each processing node is configured with at least one data partition corresponding to a data table in the distributed database. The processing node can be a compute node or a storage node, specifically a server responsible for performing data operations.

[0026] For example, the deployment architecture of the OceanBase distributed database shown in Figure 3 includes three server clusters: server cluster 1, server cluster 2, and server cluster 3. Each server cluster consists of 3 server nodes, and each server node is configured with 4 data partitions, which correspond to data tables. Paxos groups are also set up.

[0027] In the specific execution process, after the access statement of the distributed database is obtained, execution plans for the access statement are generated. Executing an access statement may involve multiple data operations, and different data operations can be executed on different processing nodes. Because the distributed database contains multiple processing nodes, the same access statement can be executed on different combinations of processing nodes, and the execution plans generated for it differ accordingly. This embodiment predicts the execution cost of each execution plan of the access statement and determines, based on the execution cost, which execution plan to use to execute the access statement.
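The selection step described above can be sketched as follows. This is only an illustration: the plan dictionaries, the `choose_target_plan` helper, and the placeholder costs are hypothetical, and choosing the plan with the lowest predicted cost is an assumption (the text states only that the target plan is determined based on the execution cost).

```python
from typing import Callable, Dict, List


def choose_target_plan(plans: List[Dict], predict_cost: Callable[[Dict], float]) -> Dict:
    """Return the candidate plan with the lowest predicted execution cost.

    Picking the minimum is an assumption made for this sketch.
    """
    if not plans:
        raise ValueError("at least one candidate execution plan is required")
    return min(plans, key=predict_cost)


# Hypothetical candidate plans for one access statement.
candidate_plans = [
    {"plan_id": "A", "nodes": ["node1", "node2", "node3"]},
    {"plan_id": "B", "nodes": ["node1", "node2"]},
]

# Stand-in for the cost prediction model: fixed placeholder costs, not real measurements.
placeholder_costs = {"A": 12.5, "B": 20.0}

target_plan = choose_target_plan(candidate_plans, lambda plan: placeholder_costs[plan["plan_id"]])
print(target_plan["plan_id"])
```

Passing the predictor in as a callable keeps the scheduling logic independent of how the cost is actually predicted.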

[0028] In one optional implementation of this embodiment, the execution plans for each access statement are generated as follows: the access statement is parsed to obtain a syntax tree structure, and the syntax tree is converted into an access tree structure; the access tree structure is optimized to obtain an execution tree structure, and the execution plans are generated based on the execution tree structure and the node topology of the execution nodes.

[0029] For example, in the OceanBase distributed database, the SQL query "SELECT * FROM t1, t2 WHERE t1.c1 = t2.c1" undergoes syntax parsing during the execution plan generation process. This parsing yields a hierarchical tree structure, i.e., a syntax tree.

[0030] Then, the syntax tree structure is converted into a query tree structure for querying data from the OceanBase distributed database, and the query is optimized using optimization rules to obtain the optimized query tree structure. Finally, based on the characteristics of the query, the topology information of the OceanBase distributed database, and the optimized query tree structure, multiple query plans for the SQL query statement are generated.

[0031] As shown in Figure 4, the execution of one of the multiple query plans involves three server nodes: server node 1, server node 2, and server node 3 in server cluster 1.

[0032] During the execution of the query plan, server node 1 sends a data operation to server node 2, and server node 2 performs a data query on table t1 of the OceanBase distributed database. Also, server node 1 sends a data operation to server node 3 in server cluster 1, and server node 3 performs a data query on table t2 of the OceanBase distributed database. Server node 2 and server node 3 communicate with each other during the data query process, and finally return the queried data to server node 1.

[0033] In practice, based on the execution plan of the generated access statement, the execution plan features are constructed from the execution logic, execution nodes and data partitions of each execution plan of the access statement. This is to obtain the execution plan features carrying the execution logic, execution nodes and data partitions of the execution plan, and to provide a data foundation for the subsequent calculation of the execution cost of the execution plan.

[0034] In one optional implementation of this embodiment, execution plan features are constructed based on the execution logic, execution nodes, and data partitions of each execution plan of the access statements in the distributed database, including:

[0035] Based on the execution logic, execution nodes, and data partitions of each execution plan, construct graph-structured data;

[0036] The graph structure data is transformed into a vector, and the resulting execution plan vector is used as the execution plan feature.

[0037] In this context, the execution nodes of an execution plan are the processing nodes in the distributed database that participate in executing the execution plan. For example, in the schematic diagram of a query plan for the SQL query statement shown in Figure 4, the execution nodes involved in executing the query plan are server node 1, server node 2, and server node 3 in server cluster 1.

[0038] The execution logic of an execution plan refers to the data operations that need to be performed at each execution node during the execution of the execution plan. For example, the query plan for the SQL query statement shown in Figure 4 includes the following execution logic: server node 1 in server cluster 1 sends data operations to server node 2 and server node 3; server node 2 in server cluster 1 performs a data query on table t1 of the OceanBase distributed database; and server node 3 in server cluster 1 performs a data query on table t2 of the OceanBase distributed database.

[0039] The data partition of the execution plan refers to the data partition on the processing node corresponding to the data table of the distributed database targeted by the access statement. For example, the data partitions corresponding to the data table t1 of the OceanBase distributed database include: data partitions P7 and P8 of server node 2 in server cluster 1; the data partitions corresponding to the data table t2 of the OceanBase distributed database include: data partitions P11 and P12 of server node 3 in server cluster 1.

[0040] The data partitions of the current execution plan include data partitions P7 and P8 of server node 2 in server cluster 1, and data partitions P11 and P12 of server node 3 in server cluster 1.

[0041] In the specific execution process, the execution plan data of each execution plan of the access statement consists of three dimensions: execution logic, execution nodes, and data partitions. The execution plan data of these three dimensions is integrated in the form of a graph structure to obtain graph-structured data carrying all three dimensions. The graph-structured data is then transformed into vector form to obtain the three-dimensional execution plan features in vector form.

[0042] Alternatively, when obtaining the three-dimensional execution plan features from the execution logic, execution nodes, and data partitions of each execution plan of the access statement, the execution plan data of the three dimensions can be encoded separately, and the encoded results can be converted into vector form to obtain the three-dimensional execution plan features in vector form.
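One possible way to build such graph-structured data and flatten it into vectors is sketched below, using the Figure 4 example (server node 1 dispatches, server node 2 scans t1 on P7 and P8, server node 3 scans t2 on P11 and P12). The `PlanGraph` class, the operator codes, and the flat `(operation, node, partition)` encoding are assumptions made for illustration; the source does not specify the encoding, and a real system might use a learned graph embedding instead.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical operator vocabulary; the source does not define operator codes.
OP_CODES = {"DISPATCH": 0, "TABLE_SCAN": 1}
NO_PARTITION = -1  # marker for a step that touches no data partition


@dataclass
class PlanGraph:
    # execution node id -> (operation name, data partitions touched on that node)
    steps: Dict[int, Tuple[str, List[int]]] = field(default_factory=dict)
    # directed edges (from_node, to_node) describing communication between nodes
    edges: List[Tuple[int, int]] = field(default_factory=list)


def plan_to_vectors(graph: PlanGraph) -> List[Tuple[int, int, int]]:
    """Flatten the graph into rows of (execution logic, execution node, data partition)."""
    rows = []
    for node_id, (operation, partitions) in sorted(graph.steps.items()):
        for partition in partitions or [NO_PARTITION]:
            rows.append((OP_CODES[operation], node_id, partition))
    return rows


figure4_plan = PlanGraph(
    steps={
        1: ("DISPATCH", []),
        2: ("TABLE_SCAN", [7, 8]),
        3: ("TABLE_SCAN", [11, 12]),
    },
    edges=[(1, 2), (1, 3), (2, 3)],
)

plan_rows = plan_to_vectors(figure4_plan)
print(plan_rows)
```

Note that this flat encoding keeps the edges in the graph object but does not carry them into the rows.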

[0043] Step S204: Generate node features based on the data partitions and node communication bandwidth of the execution nodes of each execution plan.

[0044] In practical implementation, the execution of an execution plan of an access statement may involve multiple processing nodes. When an execution plan is executed by multiple processing nodes, those nodes may communicate or transmit data with one another during execution, and this communication or transmission takes a certain amount of processing time. To predict the execution cost of the execution plan more accurately, features related to the node communication bandwidth of the execution nodes of the execution plan can therefore be extracted, so that network overhead in the distributed database environment is taken into account. Using these bandwidth-related features in the subsequent prediction of the execution cost of the execution plan helps improve the accuracy and comprehensiveness of the prediction.

[0045] In one optional implementation of this embodiment, node features are generated based on the data partitions and node communication bandwidth of the execution nodes of each execution plan, including:

[0046] Graph-structured data is constructed based on the data partitions of the execution nodes and the node communication bandwidth; the graph nodes in the graph-structured data correspond to the execution nodes, the node attributes of the graph nodes correspond to the data partitions of the execution nodes, and the connection weights between the graph nodes correspond to the node communication bandwidth.

[0047] The graph structure data is transformed into a vector, and the resulting node vectors are used as the node features.

[0048] Here, node communication bandwidth refers to the communication bandwidth for data communication or data transmission between the execution nodes included in the execution plan. The node communication bandwidth can be read from the pre-stored node configuration of the processing nodes. Based on the node communication bandwidth and the number of data operations allocated to each execution node in the actual execution scenario, the transmission time for data communication or data transmission between processing nodes can be calculated.
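As a rough illustration of the transmission-time calculation mentioned above, the sketch below divides the transferred volume by the link bandwidth. The source does not give a formula; the `transfer_time_seconds` function and the fixed payload per data operation are assumptions made only for this example.

```python
def transfer_time_seconds(num_operations: int,
                          bytes_per_operation: int,
                          bandwidth_bytes_per_second: float) -> float:
    """Estimate transfer time as (operations * payload per operation) / bandwidth.

    Assumes every data operation moves the same number of bytes, which is a
    simplification introduced for this sketch.
    """
    if bandwidth_bytes_per_second <= 0:
        raise ValueError("bandwidth must be positive")
    return num_operations * bytes_per_operation / bandwidth_bytes_per_second


# Illustrative values: 1000 operations of 4096 bytes over a 1 Gbit/s link (125,000,000 bytes/s).
estimated = transfer_time_seconds(1000, 4096, 125_000_000)
print(estimated)
```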

[0049] In the specific execution process, based on the data partitions and node communication bandwidth of the execution nodes of each execution plan of the access statement, the processing node data of these two dimensions is integrated in the form of a graph structure to obtain graph-structured data carrying processing node data in the two dimensions of data partition and node communication bandwidth. The graph-structured data is then transformed into vector form, and the resulting node vectors are used as the node features of the execution plan. Alternatively, the data partitions and node communication bandwidth of the execution nodes of each execution plan can be encoded separately, the encoded results converted into vector form, and the resulting node vectors used as the node features of the execution plan.
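The node graph described above (execution nodes as graph nodes, data partitions as node attributes, bandwidth as connection weights) could be flattened into two-dimensional node vectors roughly as follows. The bandwidth values are placeholders, and summarizing each node by the slowest link it participates in is an assumption of this sketch, not something the source prescribes.

```python
from typing import Dict, List, Tuple


def node_vectors(partitions_by_node: Dict[int, List[int]],
                 bandwidth_by_link: Dict[Tuple[int, int], float]) -> List[Tuple[int, float]]:
    """Return rows of (data partition, node communication bandwidth).

    Each node's bandwidth is taken as the minimum weight over its incident
    links (a bottleneck heuristic assumed for illustration).
    """
    rows = []
    for node, partitions in sorted(partitions_by_node.items()):
        incident = [bw for (a, b), bw in bandwidth_by_link.items() if node in (a, b)]
        bottleneck = min(incident) if incident else 0.0
        for partition in partitions:
            rows.append((partition, bottleneck))
    return rows


# Graph node attributes: data partitions held by each execution node (from the Figure 4 example).
partitions_by_node = {2: [7, 8], 3: [11, 12]}
# Connection weights: placeholder bandwidths (e.g. in Gbit/s) between execution nodes.
bandwidth_by_link = {(1, 2): 10.0, (1, 3): 25.0, (2, 3): 40.0}

node_rows = node_vectors(partitions_by_node, bandwidth_by_link)
print(node_rows)
```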

[0050] Step S206: Merge the execution plan features and the node features according to the data partition to obtain merged features.

[0051] The above describes how to construct execution plan features based on the execution logic, execution nodes, and data partitions of each execution plan of the access statements in the distributed database, thereby obtaining graph structure data carrying execution plan data in three dimensions: execution logic, execution nodes, and data partitions. The graph structure data is then transformed into vector form to obtain three-dimensional execution plan features in vector form. In addition, node features are generated based on the data partitions and node communication bandwidth of the execution nodes of each execution plan, thereby obtaining processing node data carrying the two dimensions of data partitions and node communication bandwidth.

[0052] Based on this, the data partitions carried by both execution plan features and node features are used as the basis for merging these two parts of data. Specifically, the execution logic and execution nodes carried in the execution plan features are merged with the node communication bandwidth corresponding to the same data partition in the node features. Alternatively, according to the data partitions, the node communication bandwidth carried in the node features is merged into the execution logic and execution nodes corresponding to the same data partition in the execution plan features. This yields execution plan data carrying four dimensions of data: execution logic, execution nodes, data partitions, and node communication bandwidth. This allows for better consideration of network overhead in a distributed database environment, thus enabling more accurate prediction of execution cost based on network overhead considerations in a distributed database environment.

[0053] In one optional implementation of this embodiment, the execution plan features and the node features are merged according to data partitioning to obtain merged features, including:

[0054] By parsing the execution plan vector and the node vector, the vector elements in the execution plan vector and the node vector that correspond to the same data partition are determined;

[0055] The vector elements corresponding to the same data partition in the execution plan vector and the node vector are merged, and the merged vector is used as the merge feature.

[0056] In the specific execution process, the execution plan vector is a three-dimensional vector carrying execution plan data in three dimensions: execution logic, execution nodes, and data partitions. The node vector is a two-dimensional vector carrying data partitions and node communication bandwidth. For each node vector, the execution plan vector containing the same data partition element is determined based on the data partition element of the node vector. The remaining element of the node vector, namely the node communication bandwidth, is then added to the determined execution plan vector as a new vector element. This yields a four-dimensional execution plan vector, that is, execution plan data carrying four dimensions: execution logic, execution nodes, data partitions, and node communication bandwidth.
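The partition-keyed merge described above can be sketched as a simple join. The row layouts `(operation, node, partition)` and `(partition, bandwidth)` and all values are illustrative assumptions; plan rows whose partition has no matching node vector are dropped here, which is a choice made for this sketch only.

```python
from typing import List, Tuple

PlanRow = Tuple[int, int, int]            # (execution logic, execution node, data partition)
NodeRow = Tuple[int, float]               # (data partition, node communication bandwidth)
MergedRow = Tuple[int, int, int, float]   # plan row extended with bandwidth


def merge_by_partition(plan_rows: List[PlanRow], node_rows: List[NodeRow]) -> List[MergedRow]:
    """Append the bandwidth element to every plan row that shares its data partition."""
    bandwidth_of = {partition: bandwidth for partition, bandwidth in node_rows}
    return [
        (operation, node, partition, bandwidth_of[partition])
        for operation, node, partition in plan_rows
        if partition in bandwidth_of
    ]


example_plan_rows = [(1, 2, 7), (1, 2, 8), (1, 3, 11), (1, 3, 12)]
example_node_rows = [(7, 10.0), (8, 10.0), (11, 25.0), (12, 25.0)]

merged_rows = merge_by_partition(example_plan_rows, example_node_rows)
print(merged_rows)
```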

[0057] Step S208: Based on the merged features and the index configuration features of the target data table of the access statement, predict the execution cost of each execution plan, so as to determine the target execution plan and execute the access statement based on the execution cost.

[0058] The target data table of the access statement in this embodiment refers to the data table in the distributed database corresponding to the data table identifier carried by the access statement. For example, in the OceanBase distributed database, the SQL query statement "SELECT * FROM t1, t2 WHERE t1.c1 = t2.c1" carries data table identifiers t1 and t2. Therefore, the target data table of this SQL query statement is data table t1 and data table t2 in the OceanBase distributed database.

[0059] The index configuration features of the target data table are features of the index configuration information of the target data table. The index configuration information includes the index name, index type, table name, and/or number of rows. In the OceanBase distributed database, the index configuration information can be obtained by querying the OceanBase system views. The index configuration information can be configured according to the actual access requirements of the distributed database, and different index configuration information can be configured for different query scenarios.

[0060] This embodiment, based on the execution plan data obtained through data merging that carries four-dimensional data including execution logic, execution nodes, data partitions, and node communication bandwidth, obtains merged features carrying execution logic, execution nodes, data partitions, and node communication bandwidth. In the process of predicting the execution cost of the execution plan, the execution cost is predicted based on the merged features and the underlying logical configuration of the distributed database. Specifically, the execution cost of the execution plan of the access statement is predicted based on the merged features and the index configuration features of the target data table of the access statement, thereby improving the accuracy of the execution cost prediction and thus improving the access processing efficiency of the distributed database.

[0061] When the execution cost of an execution plan of an access statement is predicted based on the merged features and the index configuration features of the target data table, the index configuration features required for the prediction can be obtained from the access statement before the execution cost is predicted. Alternatively, to improve the efficiency of execution cost prediction, the index configuration features of each data table in the distributed database can be generated and stored in advance, so that only the index configuration features of the target data table of the access statement need to be read when the execution cost is predicted.

[0062] In one optional implementation of this embodiment, the index configuration features of the target data table of the access statement are obtained in the following manner: based on the data table identifier carried by the access statement, the data table corresponding to the data table identifier in the distributed database is determined to be the target data table, and the index configuration features of the pre-generated target data table are read.
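A minimal sketch of this lookup, assuming the pre-generated features are held in an in-memory mapping keyed by data table identifier. The `PRECOMPUTED_INDEX_FEATURES` store, the `index_features_for` helper, and the feature values are hypothetical placeholders; how the table identifiers are extracted from the access statement is outside the sketch.

```python
from typing import Dict, List

# Hypothetical store of pre-generated index configuration features, keyed by table identifier.
# The numbers are placeholders, not real feature values.
PRECOMPUTED_INDEX_FEATURES: Dict[str, List[float]] = {
    "t1": [0.2, 0.7],
    "t2": [0.5, 0.1],
}


def index_features_for(table_ids: List[str],
                       store: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Read the stored index configuration features of each target data table."""
    missing = [table_id for table_id in table_ids if table_id not in store]
    if missing:
        raise KeyError(f"no pre-generated index configuration features for: {missing}")
    return {table_id: store[table_id] for table_id in table_ids}


# Table identifiers carried by "SELECT * FROM t1, t2 WHERE t1.c1 = t2.c1".
target_features = index_features_for(["t1", "t2"], PRECOMPUTED_INDEX_FEATURES)
print(target_features)
```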

[0063] When index configuration features are generated for each data table in the distributed database, an attention mechanism can be introduced to learn the local relationships between indexes. This makes the representation of the underlying logical configuration of the distributed database more comprehensive, improves the expressiveness of the index configuration features, and thereby helps improve the accuracy of execution cost prediction based on the merged features and the index configuration features. The following description uses the determination of the index configuration features of the target data table of the access statement as an example. The index configuration features of the other data tables in the distributed database are determined in a similar way and are not described in detail here.

[0064] In one optional implementation of this embodiment, the index configuration features of the target data table are determined in the following manner:

[0065] A configuration embedding vector is generated based on the index configuration information of the target data table, and a configuration embedding matrix is constructed based on the configuration embedding vector.

[0066] Self-attention calculation is performed on the configuration embedding matrix to obtain an attention sequence, and an index configuration vector is generated based on the attention sequence as the index configuration feature.

[0067] For example, for any data table in the distributed database, the index configuration sequence is $\mathrm{Index} = \{I_1, I_2, \ldots, I_k\}$, where $I_1, I_2, \ldots, I_k$ represent the index configuration information of the data table. The index configuration sequence is used to generate the index configuration embedding matrix $E_{\mathrm{Index}} = (E_1^T, E_2^T, \ldots, E_k^T)$, in which each index configuration $I_i$ corresponds to an embedding vector $E_i^T$;

[0068] Then, self-attention computation is performed on the index configuration embedding matrix $E_{\mathrm{Index}}$ to learn the internal correlations between the index configuration information of the data table. The self-attention mechanism allows the index configuration information at each position in the sequence to interact and exchange information with the index configuration information at the other positions, thereby capturing the long-range dependencies between index configuration information at different positions. The self-attention computation yields a vector sequence in which each vector represents one piece of index configuration information fused with its correlations to the other positions. The vector sequence is then weighted and summed to obtain the index configuration vector, that is, the index configuration feature carrying the local relationships and long-range dependencies between the index configuration information.
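
The computation described above can be sketched as follows. This is only an illustration under stated assumptions: the specification does not give the attention variant, projection matrices, or pooling weights, so the sketch uses scaled dot-product self-attention with identity query/key/value projections and uniform pooling weights, and the embedding values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(E):
    """Scaled dot-product self-attention over the rows of E.
    Assumption: queries, keys and values are the embeddings themselves."""
    d = len(E[0])
    out = []
    for q in E:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in E])
        out.append([sum(w * v[j] for w, v in zip(weights, E)) for j in range(d)])
    return out


def pool(seq, weights=None):
    """Weighted sum of a vector sequence into a single vector.
    Assumption: uniform weights unless others are supplied."""
    if weights is None:
        weights = [1.0 / len(seq)] * len(seq)
    d = len(seq[0])
    return [sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)]


# Hypothetical embeddings E_1..E_3 for three index configurations I_1..I_3.
E_index = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attention_sequence = self_attention(E_index)
index_config_vector = pool(attention_sequence)
```

Each row of `attention_sequence` mixes information from every index configuration, and `index_config_vector` has the same dimension as a single embedding regardless of how many indexes the table has.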

[0069] In a specific implementation, when the execution cost of the execution plans of the access statement is predicted from the merged features, which carry four dimensions of data (execution logic, execution nodes, data partitions, and node communication bandwidth), together with the index configuration features of the target data table of the access statement, the efficiency of execution cost prediction can be improved by vectorizing the data. In an optional implementation provided in this embodiment, predicting the execution cost of each execution plan based on the merged features and the index configuration features of the target data table of the access statement includes:

[0070] The merged vector of the merged feature is concatenated with the index configuration vector of the index configuration feature to obtain the prediction input vector;

[0071] The prediction input vector is input into the cost prediction algorithm to predict the execution cost, thereby obtaining the execution cost of each execution plan.
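
A minimal sketch of these two steps is given below. The specification does not say which cost prediction algorithm is used, so a linear model with hypothetical weights stands in for it; the plan names and vector values are illustrative only.

```python
def build_prediction_input(merged_vector, index_config_vector):
    """Concatenate the merged vector with the index configuration vector."""
    return list(merged_vector) + list(index_config_vector)


def predict_cost(x, weights, bias=0.0):
    """Stand-in cost prediction algorithm (assumption): a linear model."""
    if len(x) != len(weights):
        raise ValueError("input and weight dimensions differ")
    return bias + sum(w * xi for w, xi in zip(weights, x))


# Hypothetical merged vectors for two execution plans of one access statement.
merged_vectors = {
    "plan_a": [0.2, 0.5, 0.1],
    "plan_b": [0.4, 0.1, 0.3],
}
index_config_vector = [0.7, 0.3]         # shared by all plans of the statement
weights = [1.0, 2.0, 0.5, 0.1, 0.1]      # hypothetical trained parameters

costs = {
    name: predict_cost(build_prediction_input(vec, index_config_vector), weights)
    for name, vec in merged_vectors.items()
}
```

The index configuration vector is the same for every plan of a statement, so differences between the predicted costs come from the merged vectors alone.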

[0072] For how the merged vector of the merged features and the index configuration vector of the index configuration features are determined, refer to the processing procedures for generating the merged vector and the index configuration vector provided above.

[0073] In practical applications, once the execution cost of each execution plan of the access statement has been predicted from the merged features (carrying execution logic, execution nodes, data partitions, and node communication bandwidth) and the index configuration features of the target data table of the access statement, the target execution plan for the current access statement can be determined among the execution plans based on their execution costs. The access statement is then executed according to the determined target execution plan, thereby improving the efficiency with which the distributed database processes and responds to access statements.

[0074] Specifically, in one optional implementation of this embodiment, determining the target execution plan based on the execution cost and executing the access statement includes:

[0075] The execution cost that satisfies the execution conditions is determined from the execution costs of each execution plan and taken as the target execution cost;

[0076] The access statement is executed according to the execution plan corresponding to the target execution cost, and the execution result of the access statement is obtained.

[0077] Here, the execution cost that satisfies the execution conditions may be the execution cost that is lower than that of every other execution plan, for example the execution cost of the first execution plan after all execution plans are sorted in ascending order of execution cost.
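
The selection rule in the example above (sort in ascending order of execution cost and take the first plan) can be sketched as follows; the plan names and cost values are hypothetical.

```python
def select_target_plan(costs):
    """Return (plan, cost) for the plan with the lowest predicted execution cost.
    `costs` maps an execution plan identifier to its predicted execution cost."""
    if not costs:
        raise ValueError("no execution plans to choose from")
    ranked = sorted(costs.items(), key=lambda item: item[1])  # ascending cost
    return ranked[0]


# Hypothetical predicted execution costs for three execution plans.
costs = {"plan_a": 12.5, "plan_b": 7.25, "plan_c": 9.0}
target_plan, target_cost = select_target_plan(costs)
```

Other execution conditions, such as a cost threshold, would replace only the selection logic and leave the rest of the flow unchanged.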

[0078] In this embodiment, the prediction of the execution cost of the execution plan for the access statement is performed after the execution plan of the access statement is generated but before the access statement is scheduled according to the execution plan. The target execution plan for executing the access statement is determined by calculating the execution cost of each execution plan, thereby enabling the access statement to be scheduled according to the target execution plan and executed on the corresponding processing node for the corresponding data operation. To automate the scheduling of access statements, the execution cost of each execution plan for the access statement can be predicted using a cost prediction model, thereby improving the processing efficiency of access statements through scheduling automation.

[0079] Optionally, the distributed database access processing method provided in this embodiment is executed through a cost prediction model;

[0080] The cost prediction model includes: a plan feature extraction unit, a node feature extraction unit, a feature merging unit, an index feature extraction unit, and a cost prediction unit;

[0081] The step of constructing execution plan features based on the execution logic, execution nodes, and data partitions of each execution plan of the access statements in the distributed database is executed by the plan feature extraction unit.

[0082] The step of generating node features based on the data partitioning and node communication bandwidth of the execution nodes of each execution plan is executed by the node feature extraction unit;

[0083] The step of merging the execution plan features and the node features according to data partitioning to obtain merged features is executed by the feature merging unit.

[0084] The index configuration features of the target data table are obtained through the index feature extraction unit;

[0085] The step of predicting the execution cost of each execution plan based on the merged features and the index configuration features of the target data table of the access statement is executed by the cost prediction unit.

[0086] For example, the following model function is established for the cost prediction model:

[0087] C = LearnedCost(qry, svr, idx)

[0088] Where qry represents the execution plan data of each query plan of the query statement, svr represents the data partition and node communication bandwidth of the processing node of each query plan, idx represents the index configuration information of the target data table in the OceanBase distributed database carried by the query statement, and C represents the query cost of each query plan of the query statement.

[0089] As shown in Figure 5, the cost prediction model obtained by training the model function specifically includes: a plan feature extraction layer, a node feature extraction layer, a feature merging layer, an index feature extraction layer, and a cost prediction layer;

[0090] The plan feature extraction layer is used to construct query plan features based on the execution plan data, which consists of the execution logic, execution nodes, and data partitions of each query plan in the input query statement.

[0091] The node feature extraction layer is used to generate node features based on the data partitioning and node communication bandwidth of the execution nodes of each query plan;

[0092] The feature merging layer is used to merge query plan features and node features according to data partitioning to obtain merged features;

[0093] The index feature extraction layer is used to extract the index configuration features of the target data table corresponding to the data table identifier carried by the query statement from the pre-generated and stored set of index configurations of the OceanBase distributed database.

[0094] The cost prediction layer is used to predict the query cost of each query plan of the query statement based on the merged features and the index configuration features of the target data table of the query statement, and to output the predicted query cost of each query plan.
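
How the five layers compose into C = LearnedCost(qry, svr, idx) can be sketched structurally as below. This is not the trained model: each layer is a placeholder callable, `idx` is treated as the data table identifier used for the lookup (an assumption), and all field names and values are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class CostModel:
    """Structural sketch of the five-layer cost prediction model."""
    plan_layer: Callable[[dict], Vector]             # plan feature extraction
    node_layer: Callable[[dict], Vector]             # node feature extraction
    merge_layer: Callable[[Vector, Vector], Vector]  # feature merging
    index_layer: Callable[[str], Vector]             # index feature extraction
    cost_layer: Callable[[Vector], float]            # cost prediction

    def learned_cost(self, qry: Sequence[dict], svr: Sequence[dict], idx: str) -> List[float]:
        """Return one predicted query cost per query plan."""
        index_vec = self.index_layer(idx)
        costs = []
        for plan, nodes in zip(qry, svr):
            merged = self.merge_layer(self.plan_layer(plan), self.node_layer(nodes))
            costs.append(self.cost_layer(merged + index_vec))
        return costs


# Toy stand-ins for each layer (assumptions, for illustration only).
model = CostModel(
    plan_layer=lambda plan: [float(plan["ops"])],
    node_layer=lambda nodes: [1.0 / nodes["bandwidth"]],
    merge_layer=lambda a, b: a + b,
    index_layer={"orders": [0.5]}.__getitem__,
    cost_layer=sum,
)

qry = [{"ops": 3}, {"ops": 5}]
svr = [{"bandwidth": 10.0}, {"bandwidth": 2.0}]
C = model.learned_cost(qry, svr, "orders")
```

Keeping the layers as separate callables mirrors the description: the index feature extraction layer runs once per statement, while the other layers run once per query plan.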

[0095] The distributed database access processing method provided in this embodiment is further described below with reference to Figure 6, taking its application in a database query scenario as an example. As shown in Figure 6, the method, as applied to access processing of a distributed database in a database query scenario, specifically includes the following steps.

[0096] Step S602: Construct graph structure data based on the execution logic, execution nodes, and data partitions of multiple query plans of the query statement in the distributed database.

[0097] Step S604: Perform vector transformation on the graph structure data to obtain the query plan vector.

[0098] Step S606: Construct graph structure data based on the data partitioning and node communication bandwidth of the execution nodes of multiple query plans.

[0099] Step S608: Perform vector transformation on the graph structure data to obtain node vectors.

[0100] Step S610: Determine the vector elements corresponding to the same data partition in the query plan vector and node vector by parsing.

[0101] Step S612: Merge the vector elements corresponding to the same data partition in the query plan vector and node vector to obtain the merged vector.

[0102] Step S614: Determine the target data table corresponding to the data table identifier carried by the query statement in the distributed database, and read the index configuration vector of the target data table that has been pre-generated and stored.

[0103] Step S616: Perform vector concatenation between the merged vector and the index configuration vector to obtain the prediction input vector.

[0104] Step S618: Input the predicted input vector into the cost prediction algorithm to predict the execution cost and obtain the execution cost of each query plan.
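
Steps S606 to S612 can be sketched as follows. The vector transformation of the graph structure data and the exact merge operation are not detailed in this passage, so the sketch makes two assumptions: each data partition's node vector element is the mean bandwidth of the links touching its host node, and merging concatenates the elements that refer to the same data partition. Node names, partition names, and values are hypothetical.

```python
def node_vector(partitions_by_node, bandwidth):
    """S606-S608 stand-in: graph nodes are execution nodes, node attributes are
    their data partitions, and edge weights are node communication bandwidth.
    Assumed vectorization: per partition, mean bandwidth of its host's links."""
    vec = {}
    for node, partitions in partitions_by_node.items():
        links = [bw for (a, b), bw in bandwidth.items() if node in (a, b)]
        mean_bw = sum(links) / len(links) if links else 0.0
        for partition in partitions:
            vec[partition] = [mean_bw]
    return vec


def merge_by_partition(plan_vec, node_vec):
    """S610-S612: find elements that refer to the same data partition in both
    vectors and merge them (assumed here to mean concatenation)."""
    shared = sorted(plan_vec.keys() & node_vec.keys())
    merged = []
    for partition in shared:
        merged.extend(plan_vec[partition] + node_vec[partition])
    return merged


partitions_by_node = {"n1": ["p0"], "n2": ["p1", "p2"]}
bandwidth = {("n1", "n2"): 10.0}                 # edge weight between n1 and n2
plan_vec = {"p0": [1.0, 0.0], "p1": [0.0, 1.0]}  # hypothetical output of S602-S604

node_vec = node_vector(partitions_by_node, bandwidth)
merged_vector = merge_by_partition(plan_vec, node_vec)
```

Partition `p2` appears only in the node vector, so it contributes nothing to the merged vector; only partitions touched by the query plan are carried forward to S616.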

[0105] It should be pointed out that the distributed database query processing method shown in Figure 6 can be applied to a cost prediction model, which includes a plan feature extraction layer, a node feature extraction layer, a feature merging layer, an index feature extraction layer, and a cost prediction layer. Accordingly, steps S602 to S604 can be executed by the plan feature extraction layer, steps S606 to S608 can be executed by the node feature extraction layer, steps S610 to S612 can be executed by the feature merging layer, step S614 can be executed by the index feature extraction layer, and steps S616 to S618 can be executed by the cost prediction layer.

[0106] The following is an embodiment of a distributed database access processing apparatus provided in this specification:

[0107] In the above embodiments, a distributed database access processing method is provided. Correspondingly, a distributed database access processing apparatus is also provided, which is described below with reference to the accompanying drawings. Figure 7 is a schematic diagram of an embodiment of the distributed database access processing apparatus provided in this embodiment.

[0108] Since the apparatus embodiments correspond to the method embodiments, the descriptions are relatively simple. For relevant parts, please refer to the corresponding descriptions of the method embodiments provided above. The apparatus embodiments described below are merely illustrative.

[0109] This embodiment provides a distributed database access processing apparatus, the apparatus comprising:

[0110] The plan feature construction module 702 is configured to construct execution plan features based on the execution logic, execution nodes, and data partitions of each execution plan of the access statements in the distributed database.

[0111] The node feature generation module 704 is configured to generate node features based on the data partitioning and node communication bandwidth of the execution nodes of each execution plan;

[0112] The feature merging module 706 is configured to merge the execution plan features and the node features according to the data partition to obtain merged features;

[0113] The execution cost prediction module 708 is configured to predict the execution cost of each execution plan based on the merging features and the index configuration features of the target data table of the access statement, so as to determine the target execution plan and execute the access statement based on the execution cost.

[0114] The following is an embodiment of a distributed database access processing device provided in this specification:

[0115] Corresponding to the distributed database access processing method described above, and based on the same technical concept, one or more embodiments of this specification also provide a distributed database access processing device, which is used to execute the distributed database access processing method provided above. Figure 8 is a schematic structural diagram of a distributed database access processing device provided by one or more embodiments of this specification.

[0116] This embodiment provides a distributed database access processing device, including:

[0117] As shown in Figure 8, the distributed database access processing device may vary considerably depending on configuration or performance. It may include one or more processors 801 and a memory 802, and one or more application programs or data may be stored in the memory 802. The memory 802 may be transient or persistent storage. The application programs stored in the memory 802 may include one or more modules (not shown), each of which may include a series of computer-executable instructions for the distributed database access processing device. Furthermore, the processor 801 may be configured to communicate with the memory 802 and to execute, on the distributed database access processing device, the series of computer-executable instructions in the memory 802. The distributed database access processing device may also include one or more power supplies 803, one or more wired or wireless network interfaces 804, one or more input / output interfaces 805, one or more keyboards 806, etc.

[0118] In one specific embodiment, the distributed database access processing device includes a memory and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs may include one or more modules, and each module may include a series of computer-executable instructions for use in the distributed database access processing device, and is configured to be executed by one or more processors. The one or more programs include computer-executable instructions for performing the following:

[0119] Based on the execution logic, execution nodes, and data partitions of each execution plan of the access statements in the distributed database, construct the execution plan characteristics;

[0120] Node characteristics are generated based on the data partitioning and node communication bandwidth of the execution nodes of each execution plan;

[0121] The execution plan features and node features are merged according to the data partitioning to obtain merged features;

[0122] Based on the merging characteristics and the index configuration characteristics of the target data table of the access statement, the execution cost of each execution plan is predicted, so as to determine the target execution plan and execute the access statement based on the execution cost.

[0123] This specification provides an example of a storage medium as follows:

[0124] Corresponding to the distributed database access processing method described above, based on the same technical concept, one or more embodiments of this specification also provide a storage medium.

[0125] The storage medium provided in this embodiment is used to store computer-executable instructions, which, when executed by a processor, implement the following process:

[0126] Based on the execution logic, execution nodes, and data partitions of each execution plan of the access statements in the distributed database, construct the execution plan characteristics;

[0127] Node characteristics are generated based on the data partitioning and node communication bandwidth of the execution nodes of each execution plan;

[0128] The execution plan features and node features are merged according to the data partitioning to obtain merged features;

[0129] Based on the merging characteristics and the index configuration characteristics of the target data table of the access statement, the execution cost of each execution plan is predicted, so as to determine the target execution plan and execute the access statement based on the execution cost.

[0130] It should be noted that the embodiment of a storage medium in this specification and the embodiment of a distributed database access processing method in this specification are based on the same inventive concept. Therefore, the specific implementation of this embodiment can be referred to the implementation of the corresponding method described above, and the repeated parts will not be described again.

[0131] The various embodiments in this specification are described in a progressive manner. The same or similar parts between the various embodiments can be referred to each other. Each embodiment focuses on describing the differences from other embodiments. For example, the device embodiment, equipment embodiment and storage medium embodiment are all similar to the method embodiment, so the description is relatively simple. When reading the relevant content of the device embodiment, equipment embodiment and storage medium embodiment, please refer to the description of the method embodiment.

[0132] The foregoing has described specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are possible or may be advantageous.

[0133] In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (e.g., an improvement to the circuit structure of diodes, transistors, switches, etc.) or a software improvement (an improvement to a method flow). However, with technological advancements, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented with a hardware physical module. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic function is determined by the user programming the device. Designers can program a digital system themselves to "integrate" it onto a PLD, without needing a chip manufacturer to design and manufacture a dedicated integrated circuit chip. Furthermore, instead of manually making integrated circuit chips, this programming is nowadays mostly implemented using "logic compiler" software. Like the software compiler used in program development, it requires the source code before compilation to be written in a specific programming language, called a Hardware Description Language (HDL). There are many HDLs, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language). Currently, the most commonly used are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should understand that by simply performing some logic programming on the method flow using one of these hardware description languages and programming it into an integrated circuit, the hardware circuit implementing the logical method flow can be easily obtained.

[0134] The controller can be implemented in any suitable manner. For example, it can take the form of a microprocessor or processor and a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, application-specific integrated circuits (ASICs), programmable logic controllers, and embedded microcontrollers. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller can also be implemented as part of the control logic of the memory. Those skilled in the art will also recognize that, in addition to implementing the controller in purely computer-readable program code form, the same functionality can be achieved by logically programming the method steps to make the controller take the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers. Therefore, such a controller can be considered a hardware component, and the means included therein for implementing various functions can also be considered as structures within the hardware component. Alternatively, the means for implementing various functions can be considered as both software modules implementing the method and structures within the hardware component.

[0135] The systems, devices, modules, or units described in the above embodiments can be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer. Specifically, a computer can be, for example, a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or any combination of these devices.

[0136] For ease of description, the above apparatus is described by dividing it into various functional units. Of course, when implementing the embodiments of this specification, the functions of each unit can be implemented in one or more software and / or hardware.

[0137] Those skilled in the art will understand that one or more embodiments of this specification can be provided as a method, system, or computer program product. Therefore, one or more embodiments of this specification may take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this specification may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0138] This specification is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this specification. It will be understood that each flow and / or block of the flowchart illustrations and / or block diagrams, and combinations of flows and / or blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0139] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0140] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0141] In a typical configuration, a computing device includes one or more processors (CPU), input / output interfaces, network interfaces, and memory.

[0142] Memory may include non-persistent storage in computer-readable media, such as random access memory (RAM) and / or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.

[0143] Computer-readable media include both permanent and non-permanent, removable and non-removable media that can store information using any method or technology. The information can be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

[0144] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising at least one…" does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0145] One or more embodiments of this specification can be described in the general context of computer-executable instructions, such as program modules, that are executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform a particular task or implement a particular abstract data type. One or more embodiments of this specification can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.

[0146] The above description is merely an embodiment of this document and is not intended to limit the scope of this document. Various modifications and variations can be made to this document by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this document should be included within the scope of the claims of this document.

Claims

1. A method for accessing a distributed database, comprising: Based on the execution logic, execution nodes, and data partitions of each execution plan of the access statements in the distributed database, construct the execution plan characteristics; Node characteristics are generated based on the data partitioning and node communication bandwidth of the execution nodes of each execution plan; By parsing the execution plan features and the node features, vector elements corresponding to the same data partition in the execution plan features and the node features are determined, and the vector elements are merged to obtain merged features; Based on the merging characteristics and the index configuration characteristics of the target data table of the access statement, the execution cost of each execution plan is predicted, so as to determine the target execution plan and execute the access statement based on the execution cost.

2. The distributed database access processing method according to claim 1, wherein constructing execution plan features based on the execution logic, execution nodes, and data partitions of each execution plan of the distributed database access statement includes: Based on the execution logic, execution nodes, and data partitions of each execution plan, construct graph-structured data; The graph structure data is transformed into a vector, and the resulting execution plan vector is used as the execution plan feature.

3. The distributed database access processing method according to claim 2, wherein generating node characteristics based on the data partitioning and node communication bandwidth of the execution nodes of each execution plan includes: A graph structure data is constructed based on the data partitioning and node communication bandwidth of the execution node; In the graph structure data, the graph nodes correspond to the execution nodes, the node attributes of the graph nodes correspond to the data partitions of the execution nodes, and the connection weights between graph nodes correspond to the node communication bandwidth. The graph structure data is transformed into a vector, and the resulting node vectors are used as the node features.

4. The distributed database access processing method according to claim 1, wherein each execution plan of the access statement is generated in the following manner: The access statement is parsed to obtain a syntax tree structure, and the syntax tree is converted into an access tree structure; The access tree structure is optimized to obtain the execution tree structure, and the execution plans are generated based on the execution tree structure and the node topology of the execution nodes.

5. The distributed database access processing method according to claim 1, wherein the access statement of the distributed database is executed by at least one processing cluster of the distributed database; The processing cluster consists of one or more processing nodes, and each processing node is configured with at least one data partition corresponding to a data table in the distributed database.

6. The distributed database access processing method according to claim 1, wherein, before predicting the execution cost of each execution plan based on the merged features and the index configuration features of the target data table of the access statement so as to determine the target execution plan based on the execution cost and execute the access statement, the method further comprises: determining, based on the data table identifier carried by the access statement, the data table in the distributed database that corresponds to the data table identifier as the target data table, and reading the pre-generated index configuration features of the target data table.

7. The distributed database access processing method according to claim 6, wherein the index configuration features of the target data table are determined in the following manner: generating configuration embedding vectors based on the index configuration information of the target data table, and constructing a configuration embedding matrix from the configuration embedding vectors; and performing self-attention calculation on the configuration embedding matrix to obtain an attention sequence, and generating an index configuration vector based on the attention sequence as the index configuration feature.
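The embed, attend, and pool sequence of this claim can be sketched in plain Python. Everything concrete here is an assumption: the index configuration fields (`n_columns`, `unique`, `local`) are hypothetical, the embedding is hand-written rather than learned, the attention uses the embedding matrix directly as queries, keys, and values (no learned projections), and mean pooling stands in for whatever step produces the final index configuration vector.

```python
import math


def embed(cfg):
    """Hypothetical embedding of one index configuration as a 3-element vector."""
    return [
        float(cfg["n_columns"]),
        1.0 if cfg["unique"] else 0.0,
        1.0 if cfg["local"] else 0.0,
    ]


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention(matrix):
    """Scaled dot-product self-attention with Q = K = V = matrix."""
    d = len(matrix[0])
    out = []
    for q in matrix:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in matrix]
        w = softmax(scores)
        out.append([sum(wi * v[c] for wi, v in zip(w, matrix)) for c in range(d)])
    return out


def index_config_vector(index_configs):
    """Embedding matrix -> attention sequence -> mean-pooled index configuration vector."""
    matrix = [embed(cfg) for cfg in index_configs]
    sequence = self_attention(matrix)
    return [sum(col) / len(sequence) for col in zip(*sequence)]


index_vec = index_config_vector([
    {"n_columns": 1, "unique": True, "local": True},
    {"n_columns": 2, "unique": False, "local": True},
])
```

Because each attention output row is a convex combination of the embedding rows, every component of `index_vec` stays within the range spanned by the corresponding embedding column.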

8. The distributed database access processing method according to claim 1, wherein predicting the execution cost of each execution plan based on the merged features and the index configuration features of the target data table of the access statement comprises: concatenating the merged vector of the merged features with the index configuration vector of the index configuration features to obtain a prediction input vector; and inputting the prediction input vector into a cost prediction algorithm to predict the execution cost, thereby obtaining the execution cost of each execution plan.
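A minimal sketch of the concatenate-then-predict step follows. The claim leaves the cost prediction algorithm open, so the linear model used here, along with the names `predict_cost`, `weights`, and `bias` and all numeric values, is a placeholder assumption rather than the algorithm the patent relies on.

```python
def predict_cost(merged_vec, index_vec, weights, bias=0.0):
    """Concatenate the two feature vectors and score them with a linear model.

    The linear model is a stand-in for the unspecified cost prediction algorithm.
    """
    prediction_input = list(merged_vec) + list(index_vec)  # concatenation step
    if len(prediction_input) != len(weights):
        raise ValueError("weights must match the length of the prediction input vector")
    return bias + sum(w * x for w, x in zip(weights, prediction_input))


# Illustrative values only.
cost = predict_cost(
    merged_vec=[1.0, 2.0],
    index_vec=[0.5],
    weights=[2.0, 1.0, 4.0],
    bias=1.0,
)
```

Running the same function once per candidate execution plan yields the per-plan execution costs that the later selection step compares.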

9. The distributed database access processing method according to claim 1, wherein determining the target execution plan based on the execution cost and executing the access statement comprises: determining, from the execution costs of the execution plans, the execution cost that satisfies an execution condition as the target execution cost; and executing the access statement according to the execution plan corresponding to the target execution cost to obtain the execution result of the access statement.
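The selection step can be sketched as below, assuming the execution condition is simply "lowest predicted cost"; the claim itself does not define the condition. The plan identifiers and the `executors` mapping of callables are hypothetical stand-ins for real execution plans.

```python
def select_target_plan(plan_costs):
    """Return (plan_id, cost) for the plan with the lowest predicted cost.

    Assumes the execution condition is minimum cost.
    """
    if not plan_costs:
        raise ValueError("at least one execution plan is required")
    plan_id = min(plan_costs, key=plan_costs.get)
    return plan_id, plan_costs[plan_id]


def execute_with_target_plan(plan_costs, executors):
    """Run the callable registered for the target plan and return its result."""
    plan_id, _ = select_target_plan(plan_costs)
    return executors[plan_id]()


plan_costs = {"local_scan": 12.5, "remote_join": 7.0, "broadcast_join": 9.2}
executors = {name: (lambda n=name: f"rows via {n}") for name in plan_costs}
target, target_cost = select_target_plan(plan_costs)
result = execute_with_target_plan(plan_costs, executors)
```

Other conditions, such as a cost ceiling or a tie-break on node load, would slot into `select_target_plan` without changing the execution step.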

10. The distributed database access processing method according to claim 1, wherein the method is executed by a cost prediction model, the cost prediction model comprising a plan feature extraction unit, a node feature extraction unit, a feature merging unit, an index feature extraction unit, and a cost prediction unit; wherein the step of constructing execution plan features based on the execution logic, execution nodes, and data partitions of each execution plan of the access statement of the distributed database is executed by the plan feature extraction unit; the step of generating node features based on the data partitions and node communication bandwidth of the execution nodes of each execution plan is executed by the node feature extraction unit; the step of parsing the execution plan features and the node features to determine the vector elements in the execution plan features and the node features that correspond to the same data partition, and merging the vector elements to obtain merged features, is executed by the feature merging unit; the index configuration features of the target data table are obtained by the index feature extraction unit; and the step of predicting the execution cost of each execution plan based on the merged features and the index configuration features of the target data table of the access statement is executed by the cost prediction unit.
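The work of the feature merging unit, pairing up vector elements that belong to the same data partition, might look like the following sketch. Representing each feature as a dictionary keyed by partition identifier, and merging by concatenation, are both assumptions; the claim only requires that elements for the same data partition be identified and merged.

```python
def merge_features(plan_by_partition, node_by_partition):
    """Merge plan and node vector elements that refer to the same data partition.

    Both arguments map a partition id to the vector elements for that partition.
    Partitions present in only one of the two features are skipped.
    """
    merged = []
    shared = plan_by_partition.keys() & node_by_partition.keys()
    for partition in sorted(shared):  # fixed order keeps the vector layout stable
        merged.extend(plan_by_partition[partition])
        merged.extend(node_by_partition[partition])
    return merged


# Illustrative values: partition 2 appears only in the node features.
plan_features = {0: [0.2, 0.8], 1: [0.5, 0.5]}
node_features = {0: [10.0], 1: [30.0], 2: [5.0]}
merged_vec = merge_features(plan_features, node_features)
```

Sorting the shared partition identifiers keeps the merged layout deterministic, which matters if the result is fed to a model with fixed input positions.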

11. An access processing apparatus for a distributed database, comprising: a plan feature construction module configured to construct execution plan features based on the execution logic, execution nodes, and data partitions of each execution plan of an access statement of the distributed database; a node feature generation module configured to generate node features based on the data partitions and node communication bandwidth of the execution nodes of each execution plan; a feature merging module configured to parse the execution plan features and the node features, determine the vector elements in the execution plan features and the node features that correspond to the same data partition, and merge the vector elements to obtain merged features; and an execution cost prediction module configured to predict the execution cost of each execution plan based on the merged features and the index configuration features of the target data table of the access statement, so as to determine a target execution plan based on the execution cost and execute the access statement.

12. An access processing device for a distributed database, comprising: a processor; and a memory configured to store computer-executable instructions which, when executed, cause the processor to: construct execution plan features based on the execution logic, execution nodes, and data partitions of each execution plan of an access statement of the distributed database; generate node features based on the data partitions and node communication bandwidth of the execution nodes of each execution plan; parse the execution plan features and the node features to determine the vector elements in the execution plan features and the node features that correspond to the same data partition, and merge the vector elements to obtain merged features; and predict the execution cost of each execution plan based on the merged features and the index configuration features of the target data table of the access statement, so as to determine a target execution plan based on the execution cost and execute the access statement.

13. A storage medium for storing computer-executable instructions which, when executed by a processor, implement the following process: constructing execution plan features based on the execution logic, execution nodes, and data partitions of each execution plan of an access statement of the distributed database; generating node features based on the data partitions and node communication bandwidth of the execution nodes of each execution plan; parsing the execution plan features and the node features to determine the vector elements in the execution plan features and the node features that correspond to the same data partition, and merging the vector elements to obtain merged features; and predicting the execution cost of each execution plan based on the merged features and the index configuration features of the target data table of the access statement, so as to determine a target execution plan based on the execution cost and execute the access statement.