A query cardinality estimation method based on a hybrid autoregressive model and sampling

By combining a multi-path parallel mask autoencoder model with offline join key sampling, the problem of high quantile error in cardinality estimation under large data volumes is solved, achieving higher cardinality estimation accuracy and database query stability.

CN118568129B · Active · Publication Date: 2026-06-30 · FUDAN UNIVERSITY +1


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: FUDAN UNIVERSITY
Filing Date: 2024-05-21
Publication Date: 2026-06-30


IPC: G06F16/2453; G06F16/2455; G06F18/27; G06F18/25; G06N3/0455; G06N3/084; G06N3/048

AI Tagging

Technology Topics

Database query; Query plan


AI Technical Summary

Technical Problem

Existing cardinality estimation methods suffer from high quantile errors on large datasets, especially for low-selectivity queries and multi-table joins. As a result, the query optimizer may select an incorrect execution plan, making it difficult to meet database optimization needs.

Method used

A hybrid approach combining a multi-path parallel masked autoencoder model and offline join key sampling is adopted. The joint probability distribution between different columns is obtained through training, and virtual tuples are generated by combining offline join key sampling for cardinality estimation. The estimation accuracy is improved by using a hybrid autoregressive model and a weighted average of sampling.

Benefits of technology

In large-scale OLAP analytical databases, it significantly reduces high quantile errors, improves the accuracy of cardinality estimation, reduces the possibility of erroneous execution plans, and enhances the stability and execution efficiency of database queries.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention belongs to the field of database query technology, specifically a query cardinality estimation method based on a hybrid autoregressive model and sampling. The invention includes constructing a multi-path parallel masked autoencoder model, which can better learn the joint probability distribution in the data; an offline join key sampling cardinality estimation method, which performs single-table sampling in multi-table joins according to the join key to complete cardinality estimation; and a cardinality estimation method that combines the advantages of autoregressive models and join key sampling, thereby improving the overall accuracy of cardinality estimation. This invention can reduce the query quantile error when using traditional autoregressive models for cardinality estimation, improve the stability of cardinality estimation, enhance the quality of query plans generated by the query optimizer, and accelerate database query execution.


Description

Technical Field

[0001] This invention belongs to the field of database query technology, specifically relating to a query cardinality estimation method based on a hybrid autoregressive model and sampling.

Background Technology

[0002] With the development of the information age, the amount of data stored in databases is increasing every day. Under such circumstances, optimizers need to have higher accuracy and stability and reduce the proportion of large errors in order to improve the quality of generated plans and reduce the additional execution costs caused by the optimizer generating incorrect plans and the system oscillations caused by excessive errors.

[0003] However, both traditional cardinality estimation methods and current machine learning-based cardinality estimation methods have large high quantile errors. For a small fraction of queries, these high quantile errors can cause the query optimizer to generate incorrect execution plans, increasing query execution time.

[0004] Currently, data-driven methods based on autoregressive models can adapt to more workload scenarios and have higher accuracy. However, during training, autoregressive models require reading all data of a single database table, or all data of the outer join of all tables to be estimated. With today's ever-increasing data volumes, this learning cost is enormous. Therefore, when faced with low-selectivity queries or queries that join only a small number of tables in a multi-table schema, they produce large high quantile errors, which in turn cause the query optimizer to select the wrong plan, making it difficult to meet current database optimization needs.

Summary of the Invention

[0005] The purpose of this invention is to provide a query cardinality estimation method, based on a hybrid autoregressive model and sampling, that has high stability and meets the requirements of database query optimization.

[0006] The query cardinality estimation method based on a hybrid autoregressive model and sampling proposed in this invention includes the following steps: design and training of a multi-path parallel masked autoencoder model, offline join key sampling, and cardinality estimation based on a hybrid autoregressive model and offline join key sampling.

[0007] (I) Design and Training of Multi-path Parallel Masked Autoencoder Model:

[0008] The aforementioned multi-path parallel mask autoencoder model, hereinafter referred to as the model, is specifically a multi-layer neural network model constructed in parallel using multiple mask autoencoder layers. It is trained using the original dataset to obtain the joint probability distribution between different columns in the original database data table.

[0009] The multi-path parallel mask autoencoder model includes an embedding encoding layer and a computation module; the computation module includes a data feature extraction layer, a data feature enhancement layer and a final computation layer; the final output of the model is the conditional probability distribution between different columns.

[0010] The feature extraction layer includes multiple masked autoencoder layers and an average pooling layer; wherein, the masked autoencoder layer is derived from the single-layer structure of the masked autoregressive model, and the fully connected layer acquires autoregressive properties by setting masks between layers;

[0011] The feature enhancement layer has the same structure as the feature extraction layer, including multiple masked autoencoder layers and an average pooling layer. The masked autoencoder layer is derived from the single-layer structure of the masked autoregressive model. By setting masks between layers, the fully connected layer can acquire autoregressive properties.

[0012] The final computation layer includes a complete and large mask autoencoder layer W;

[0013] The model first encodes the row data of the input raw data table through an embedding encoding layer. Typically, the per-column length $L_{em}$ of the embedding layer is set to at most 64, and the data are encoded according to their value range: columns whose value range is smaller than $L_{em}$ use one-hot encoding, and columns whose value range is greater than $L_{em}$ use embedding encoding. The encoded data sequence is the output of the encoding layer.

[0014] The embedding encoding reduces the vector data of a single column with value range $L_c$ to a 64-dimensional vector through a single matrix multiplication; each column whose value range exceeds 64 has its own trainable dimensionality-reduction matrix of size $(L_c, 64)$.
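The encoding rule in paragraphs [0013] and [0014] can be illustrated with a minimal NumPy sketch. The function name `make_encoder`, the random initialisation of the matrix, and the treatment of a value range exactly equal to 64 as one-hot are assumptions made for illustration; the patent text does not specify them.

```python
import numpy as np

L_EM = 64  # per-column encoding length threshold stated in the text

def make_encoder(domain_size, rng):
    """Return encode(value_index) for one column (hypothetical helper)."""
    if domain_size <= L_EM:
        # Small value range: plain one-hot vector of length domain_size.
        def encode(idx):
            vec = np.zeros(domain_size)
            vec[idx] = 1.0
            return vec
        return encode
    # Large value range: (L_c, 64) matrix. It would be trainable in the
    # model; here it is only randomly initialised (assumption).
    matrix = rng.normal(size=(domain_size, L_EM))
    def encode(idx):
        one_hot = np.zeros(domain_size)
        one_hot[idx] = 1.0
        return one_hot @ matrix  # single matrix multiplication
    return encode

rng = np.random.default_rng(0)
small = make_encoder(5, rng)      # column with 5 distinct values
large = make_encoder(1000, rng)   # column with 1000 distinct values
row_encoding = np.concatenate([small(2), large(417)])
```

In this sketch a row with one small and one large column is encoded into 5 + 64 = 69 values, so the encoded row length stays bounded no matter how large a column's value range is.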

[0015] After encoding, the data is input into the model's computational part, which consists of three layers: a data feature extraction layer, a data feature enhancement layer, and a final computation layer. The model's final output is the conditional probability distribution between different columns. The overall model structure is shown in Figure 1.

[0016] The feature extraction layer comprises multiple masked autoencoder layers and an average pooling layer; the masked autoencoder layers used are derived from the single-layer structure of the masked autoregressive model (Mathieu Germain's paper "MADE: Masked Autoencoder for Distribution Estimation" published at ICML 2015); they enable the fully connected layers to acquire autoregressive properties by setting masks between layers. The connection mask between two neurons is shown in the following equation:

[0017] $\mathrm{Mask}_i[y][x] = \begin{cases} 1, & \mathrm{layer}_i[y] \ge \mathrm{layer}_{i-1}[x] \\ 0, & \text{otherwise} \end{cases}$

[0018] Where x is the node number in layer i-1, y is the node number in layer i, and $\mathrm{layer}_{i-1}[x]$ and $\mathrm{layer}_i[y]$ are the values assigned to those nodes. When the node value $\mathrm{layer}_i[y]$ is greater than or equal to $\mathrm{layer}_{i-1}[x]$, the information of node x in layer i-1 can be propagated to node y in layer i, and the mask value is 1. When $\mathrm{layer}_i[y]$ is less than $\mathrm{layer}_{i-1}[x]$, the information of node x in layer i-1 cannot be propagated to node y in layer i, and the mask value is 0. The internal details are shown in Figure 2.
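The connection rule described above (mask value 1 when the node value in layer i is greater than or equal to the node value in layer i-1, otherwise 0) can be sketched in a few lines of Python. The function name `hidden_mask` and the example node values are hypothetical and chosen only for illustration.

```python
def hidden_mask(prev_values, next_values):
    """mask[y][x] = 1 if next_values[y] >= prev_values[x], else 0."""
    return [[1 if vy >= vx else 0 for vx in prev_values]
            for vy in next_values]

prev_layer = [1, 2, 3]      # node values of layer i-1, indexed by x
next_layer = [1, 1, 2, 3]   # node values of layer i, indexed by y
mask = hidden_mask(prev_layer, next_layer)
```

Multiplying a fully connected weight matrix elementwise by such a mask removes the forbidden connections, which is how the layer obtains its autoregressive property.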

[0019] For the last layer of the neural network, in order to ensure that the final output value is autoregressive, that is, the data at position y=i in the last layer will not accept inputs from any data at position x≤i in the last layer, the connection mask of the last layer is as follows:

[0020]

[0021] The input to the feature extraction layer is the data sequence after embedding encoding. The data is passed through multiple parallel masked autoencoder layers, whose outputs are concatenated vertically and fed into an average pooling layer. The number of masked autoencoder layers is typically five. Finally, ReLU activation is applied to obtain the output of this layer. The layer structure is shown in Figure 3. The calculation is given by equation (1), which is expressed in terms of the operation that the i-th masked autoencoder layer performs on the input:

[0022]
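As a rough illustration of the feature extraction layer described in paragraph [0021], the NumPy sketch below runs five parallel masked fully connected layers on the same input, stacks their outputs, averages across the paths, and applies ReLU. The names `masked_linear` and `feature_extraction`, the use of one shared mask, and the interpretation of "average pooling" as a mean across the parallel paths are assumptions; equation (1) itself is not reproduced in this text.

```python
import numpy as np

def masked_linear(x, weight, mask, bias):
    # Masked fully connected layer: masked-out weights contribute nothing.
    return (weight * mask) @ x + bias

def feature_extraction(x, paths):
    """paths: list of (weight, mask, bias), one tuple per parallel layer."""
    outputs = np.stack([masked_linear(x, w, m, b) for w, m, b in paths])
    pooled = outputs.mean(axis=0)     # average pooling across the paths
    return np.maximum(pooled, 0.0)    # ReLU activation

rng = np.random.default_rng(0)
dim = 4
values = np.arange(1, dim + 1)
# mask[y][x] = 1 when values[y] >= values[x] (lower-triangular here)
mask = (values[:, None] >= values[None, :]).astype(float)
paths = [(rng.normal(size=(dim, dim)), mask, np.zeros(dim)) for _ in range(5)]
x = rng.normal(size=dim)
out = feature_extraction(x, paths)
```

Because every path uses a mask of the same form, output position y depends only on input positions up to y, so averaging the paths keeps the autoregressive ordering intact. The feature enhancement layer would follow the same pattern with sigmoid in place of ReLU.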

[0023] The input to the feature enhancement layer is the data sequence computed by the feature extraction layer. The data is passed through parallel masked autoencoder layers, whose outputs are concatenated vertically and fed into an average pooling layer. The number of masked autoencoder layers is typically five. Finally, sigmoid activation is applied to obtain the layer's output. The layer structure is shown in Figure 4. The calculation is given by equation (2), which is expressed in terms of the operation that the i-th masked autoencoder layer performs on the input:

[0024]

[0025] The final computation layer takes as input the horizontal combination of the encoded data and the output vectors of the feature extraction layer and the feature enhancement layer. Specifically, these are joined with the concatenate function and random Dropout is applied. This design allows the model to generalize to different features and ensures that the parallel layers learn different data features. The input is represented as:

[0026]

[0027] The above features are processed by a single masked autoencoder layer W, and the output is a vector of the same length as the embedding-encoded vector, as shown in the following formula (4):

[0028]

[0029] The output is then decoded by the subsequent decoding layer to obtain the conditional probabilities of different values for each column. The final computation layer structure is shown in Figure 5.

[0030] Training this model requires complete data sequences from the original database, obtained by batch random sampling over the multi-table join. The original database here is the relational database whose join cardinalities the user wants to learn; it contains the individual tables, foreign key constraints, the database schema, and table indexes. In a relational database, joins are performed on equality keys, for example T1.A = T2.A: a row pair is joined and added to the result set only when the value in column A of T1 equals the value in column A of T2. The join relationships in the schema can therefore be viewed as a graph. For the autoregressive model, generating the complete data sequences for training requires traversing the data of all tables in the schema in sequence. This traversal has a start and an end, so a root table is chosen as the starting point of the joins. Starting from the root table, the whole dataset is traversed and joined along the equi-join keys, and the outer join is sampled. During the join, duplicate keys in the next table amplify the tuples of the previously joined table, so each distinct key has a different share of the final training space. The join weight of each table on the path must therefore be recorded during sampling: when a subsequent table $T_{i+1}$ is equi-joined with the preceding table $T_i$, some key values of $T_i$ are amplified because $T_{i+1}$ contains more than one row with the same key value, and this amplification coefficient is the join weight. Once the join weights are determined, complete data sequences are drawn by random sampling from each table according to the join weights, producing complete tuples of the outer join. More detailed sampling and cardinality estimation methods can be found in the section on weight sampling and cardinality estimation in Zongheng Yang's 2020 VLDB paper, "NeuroCard: One Cardinality Estimator for All Tables". The randomly sampled complete data sequences are then input into the model, the cross-entropy between the probability distributions is computed at its output, and the model is trained by backpropagating this cross-entropy.
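The join weight idea in paragraph [0030] can be made concrete with a small Python sketch: for each row of $T_i$, count how many rows of $T_{i+1}$ share its join key, then sample rows in proportion to that count. The helper name `join_weights`, the toy key lists, and the convention that an unmatched row keeps weight 1 (it still appears once in an outer join) are assumptions for illustration, not the patent's exact procedure.

```python
import random
from collections import Counter

def join_weights(parent_keys, child_keys):
    """For each parent row, count the child rows sharing its join key."""
    child_counts = Counter(child_keys)
    # Outer join: a parent row without a match still yields one tuple.
    return [max(child_counts[k], 1) for k in parent_keys]

t1_keys = [10, 11, 12]       # join key values of T_i
t2_keys = [10, 10, 10, 12]   # join key values of T_{i+1}
weights = join_weights(t1_keys, t2_keys)   # [3, 1, 1]

# Draw parent rows in proportion to their share of the join result.
rng = random.Random(0)
sampled_rows = rng.choices(range(len(t1_keys)), weights=weights, k=1000)
```

Here the row with key 10 is amplified three times by the join, so it is drawn roughly three times as often as the other rows, matching its share of the full outer join.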

[0031] (II) Offline join key sampling:

[0032] The offline join key sampling, hereinafter referred to as sampling, includes the following steps: constructing the schema graph and determining the sampling order, confirming the sampling ratio, and performing join-based sampling.

[0033] First, the database schema graph for the cardinality estimation task in step (I) is taken as input. Each table in the schema is regarded as a node of the graph, and each join key between tables is regarded as a directed edge between nodes. The final output is a schema graph in the form of a directed acyclic graph.

[0034] To construct this directed acyclic graph, the root table used as the sampling starting point during model training in step (I) is taken as the starting point of the schema graph. From this starting point, a breadth-first traversal searches the entire database schema with the join keys as edges, recording the order in which each edge is visited, and finally generates the directed acyclic graph.
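The construction in paragraph [0034] can be sketched as a standard breadth-first traversal that records each join edge in the order it is first used. The function name `build_sampling_order`, the edge tuples, and the simplification that each table is reached through exactly one edge are assumptions for this example.

```python
from collections import deque

def build_sampling_order(schema_edges, root):
    """Breadth-first traversal over join-key edges, starting at root.

    Returns directed edges (parent, child, key) in visiting order.
    """
    neighbours = {}
    for a, b, key in schema_edges:
        neighbours.setdefault(a, []).append((b, key))
        neighbours.setdefault(b, []).append((a, key))
    visited, order, queue = {root}, [], deque([root])
    while queue:
        table = queue.popleft()
        for other, key in neighbours.get(table, []):
            if other not in visited:
                visited.add(other)
                order.append((table, other, key))  # edge directed away from root
                queue.append(other)
    return order

edges = [("T_root", "T1", "k1"), ("T1", "T2", "k2"), ("T_root", "T3", "k3")]
order = build_sampling_order(edges, "T_root")
```

Because every recorded edge points from an already visited table to a new one, the result contains no cycles, and the list order is the order in which tables can later be sampled.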

[0035] Next, the overall sampling ratio f is set; it can typically be set to 0.1% so that the total sample volume is not too large. Starting from the root table, data tuples are sampled at random at ratio f to form the root table sample $S_{root}$.

[0036] For the first sampling iteration, let $S_{i-1} = S_{root}$. The distinct values of the join key in sample $S_{i-1}$ that corresponds to the next join table $T_i$ in the recorded order are computed and equi-joined with $T_i$; from the join result, $n_i$ tuples are then drawn to form the current table sample $S_i$. Here count(X) is the total number of rows in the table or result set X, and the sample count $n_i$ and the recording ratio $f_i$ are calculated as follows:

[0037] $n_i = \mathrm{count}(T_i) \times f$, (5)

[0038]

[0039] For subsequent iterations, the order of edges and nodes recorded in the directed acyclic graph is followed. Each iteration takes the sample of the previous iteration as $S_{i-1}$, computes the distinct values of the join key between $S_{i-1}$ and $T_i$, and obtains the sample $S_i$ of the current table $T_i$ together with its recording ratio $f_i$.

[0040] After the loop finishes, the samples of all tables are collected into a list $List_S$. When joining on the distinct values, if the number of join results is less than $n_i$, all join results are used as the sample $S_i$ and $f_i$ is set to 1.
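A minimal Python sketch of one sampling iteration is shown below. It draws the root sample, collects the distinct join key values of the previous sample, equi-joins them with the next table, and draws $n_i$ tuples from the join result. The function names, the toy tables, and in particular the form $f_i = n_i / \mathrm{count}(\text{join result})$ are assumptions, since equation (6) is not reproduced in this text.

```python
import random

def sample_root(table, f, rng):
    """Draw count(T_root) * f rows at random (at least one row)."""
    return rng.sample(table, max(1, round(len(table) * f)))

def sample_next(prev_sample, table, key, f, rng):
    """Return (S_i, f_i) for table T_i given the previous sample S_{i-1}."""
    distinct = {row[key] for row in prev_sample}
    join_result = [row for row in table if row[key] in distinct]
    n_i = round(len(table) * f)               # equation (5)
    if len(join_result) <= n_i:
        return join_result, 1.0               # too few results: keep all, f_i = 1
    sample = rng.sample(join_result, n_i)
    return sample, n_i / len(join_result)     # ASSUMED form of f_i

rng = random.Random(0)
t_root = [{"k1": i % 4} for i in range(20)]
t1 = [{"k1": i % 8, "v": i} for i in range(40)]
s_root = sample_root(t_root, 0.5, rng)            # large f only for the demo
s1, f1 = sample_next(s_root, t1, "k1", 0.25, rng)
```

Sampling only rows whose key already appears in the previous sample keeps the per-table samples joinable with one another, which is what allows a join query to be run directly on the samples later.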

[0041] (III) Cardinality estimation using a hybrid autoregressive model and offline join key sampling:

[0042] The aforementioned cardinality estimation based on a hybrid autoregressive model and offline join key sampling, hereinafter referred to as the hybrid method, specifically includes: cardinality estimation of the query with the autoregressive model, cardinality estimation based on offline join key sampling, and hybrid cardinality estimation; the specific estimation process is as follows:

[0043] First, the cardinality of the query is estimated using the multi-path parallel masked autoencoder model trained in step (I). A certain number of virtual tuples are generated under the predicate constraints of the query, and the conditional probabilities of these virtual tuples in the overall data distribution are obtained. The cardinality estimate of the query is obtained by summing the probabilities of these virtual tuples.

[0044] The predicate constraint of the query has the form of the following equation (7), where $T_i.C_j$ represents the data in the j-th column of the i-th table and $R_{ij}$ represents the range of query constraint values on this column:

[0045] $Q = \{T_1.C_1 \in R_{11}, \ldots, T_n.C_m \in R_{nm}\}$, (7)

[0046] Within the constraint range, the autoregressive model generates n virtual tuples, where n is typically 4000. The distribution proportion of these virtual tuples is then obtained, giving the estimated selectivity of the query, sel′(Q):

[0047]

[0048] The cardinality can be obtained by multiplying the query selectivity by the total cardinality:

[0049] $card'_{AR}(Q) = sel'(Q) \times card_{ALL}$, (9)

[0050] For the cardinality estimate on the offline join key sample, it can be obtained by simply executing the cardinality estimation target query Q on the completed sample:

[0051] $card'_{S}(Q) = Q(\mathrm{sample})$, (10)

[0052] Finally, the hybrid cardinality estimation method obtains the final cardinality value by weighting the larger and the smaller of the two estimates:

[0053] $card'(Q) = W_1 \times \max(card'_{S}(Q), card'_{AR}(Q)) + W_2 \times \min(card'_{S}(Q), card'_{AR}(Q))$, (11)

[0055] Wherein, W1 and W2 are weighting coefficients, with W1 typically set to 0.75 and W2 typically set to 0.25.
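Equations (9) and (11) are simple enough to check with a short Python sketch. The function names and the example numbers are hypothetical; the default weights are the typical values 0.75 and 0.25 stated in the text.

```python
def ar_cardinality(selectivity, card_all):
    """Equation (9): selectivity times the total cardinality."""
    return selectivity * card_all

def hybrid_cardinality(card_s, card_ar, w1=0.75, w2=0.25):
    """Equation (11): weight the larger estimate by w1, the smaller by w2."""
    return w1 * max(card_s, card_ar) + w2 * min(card_s, card_ar)

card_ar = ar_cardinality(0.25, 8000)                       # 2000.0
card = hybrid_cardinality(card_s=400.0, card_ar=card_ar)   # 1600.0
```

With these weights the result leans toward the larger of the two estimates, and it does not matter which method produced that larger value.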

[0056] The query cardinality estimation method based on a hybrid autoregressive model and sampling in this invention has the following advantages:

[0057] When optimizing queries over large-scale datasets in OLAP analytical databases, the multi-path parallel masked autoencoder model of this invention can improve the accuracy of cardinality estimation compared to traditional models. The offline join key sampling method can compensate for the high quantile errors on queries with few joins and low selectivity that remain in the aforementioned autoregressive model-based methods. Finally, the hybrid method proposed in this invention can provide more accurate subquery cardinality estimates, reduce high quantile errors, decrease the possibility of catastrophic execution plans generated by the database optimizer because of such errors, and improve the stability of database query execution.

Description of the Attached Figures

[0058] Figure 1 is a diagram of the multi-path parallel masked autoencoder model of the present invention.

[0059] Figure 2 shows the masked autoencoder layer structure used in this invention and the corresponding conditional probabilities of the output.

[0060] Figure 3 shows the feature extraction layer structure of the multi-path parallel masked autoencoder model.

[0061] Figure 4 shows the feature enhancement layer structure of the multi-path parallel masked autoencoder model.

[0062] Figure 5 shows the final computation layer structure of the multi-path parallel masked autoencoder model.

[0063] Figure 6 is an example of a directed acyclic graph used for sampling.

[0064] Figure 7 is a detailed example of the sampling process.

Detailed Implementation

[0065] The following are specific examples of the present invention, which further describe the present invention.

[0066] The IMDB Job-Light dataset contains information on numerous films and television programs from 1880 to 2019, including details about the companies, actors, and directors associated with each. This information is stored in different tables and can be accessed via foreign key joins, making it a commonly used multi-table dataset.

[0067] PostgreSQL database: This is an open-source object-relational database that uses standard SQL for queries. It has a query optimization module and a corresponding cardinality estimation module. The deployment location of this invention is its cardinality estimation module.

[0068] First, the IMDB dataset is processed using the weighted join method to generate a dataset that can be used to train the multi-path parallel masked autoencoder model. Then, the model is trained using this dataset. Training is considered complete once the model's cross-entropy stabilizes.

[0069] During the training process described above, a breadth-first traversal algorithm is used to traverse the complete database schema starting from the root table of the training set, generating the directed acyclic graph and join order required for sampling. The construction process is shown in Figure 6.

[0070] After the sampling order is determined, the overall sampling ratio f is set, typically to 0.1%. Random sampling begins from $T_{root}$, with a sample size equal to the total number of rows of $T_{root}$ multiplied by f. After this sampling is complete, the next edge of the directed acyclic graph is followed, for example join edge 1 in Figure 6, whose join key is $T_{root}.k1$. The distinct values of k1 in the sampled data are extracted to obtain the unique-key table k1, which is then joined with the table $T_1$ pointed to by the edge on the condition $T_{root}.k1 = T_1.k1$, and the sample $S_1$ is drawn at random from the join result JoinResult1.

[0071] Then, following the order of the join edges in the directed acyclic graph and starting from the tables already sampled, the sampling process is repeated for each new table, for example from $T_1$ to $T_2$ via a join edge as in Figure 6 and Figure 7, until all tables have been sampled.

[0072] In the online estimation phase, the method is connected to the cardinality estimation module of PostgreSQL. PostgreSQL provides the query content (i.e., the subquery) whose cardinality is to be estimated. The cardinality estimation module first transforms the query into query constraint ranges, then uses the trained multi-path parallel masked autoencoder model to generate virtual tuples within those ranges and sums their probabilities to obtain the autoregressive cardinality estimate $card'_{AR}(Q)$. The subquery is then executed on the sample, and the result is divided by the proportion of the sample used by this query to obtain the sampling cardinality estimate $card'_{S}(Q)$. Finally, the hybrid cardinality estimation module computes the weighted average of the two to obtain the final cardinality estimate card′(Q).
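The sampling-side estimate in the online phase (run the subquery on the sample, then divide by the sampled proportion) can be sketched as follows. The function name `sample_cardinality`, the representation of the subquery as a Python predicate, and the use of a single `proportion` value are simplifying assumptions; the text does not spell out how the proportion is derived for multi-table queries.

```python
def sample_cardinality(sample_rows, predicate, proportion):
    """Count matching rows in the sample and scale up by the proportion."""
    matches = sum(1 for row in sample_rows if predicate(row))
    return matches / proportion

# Toy sample that is assumed to contain 1/8 of the underlying rows.
sample = [{"year": y} for y in (1995, 2001, 2004, 2010, 2012, 2015)]
card_s = sample_cardinality(sample, lambda r: r["year"] >= 2010, 0.125)
```

In this toy case three sampled rows satisfy the predicate, so the scaled estimate is 3 / 0.125 = 24 rows. This value would then be combined with the autoregressive estimate by the weighted formula of equation (11).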

[0073] Table 1 shows the relative errors of the test workload Job-Light on the following methods: the native cardinality estimation method in PostgreSQL, the query-driven MSCN method based on multi-convolutional neural networks, the data-driven DeepDB method based on composite networks, the data- and query-driven UAE method based on autoregressive models, the data-driven NeuroCard method based on autoregressive models, and the hybrid method HAS-CE of this invention. On the Job-Light dataset, the high quantile estimation errors (maximum error, 99th quantile, 95th quantile, and 90th quantile) of this invention are significantly lower than the other five existing methods, which can reduce database optimization errors caused by high quantile errors. At the same time, its median error is second only to DeepDB, and it can perform cardinality estimation well.

[0074] Table 1. Comparison of the relative test error of the present invention with other existing methods on the Job-Light test set.

[0075]

Claims

1. A query cardinality estimation method based on a hybrid autoregressive model and sampling, characterized in that the specific steps are as follows:
(I) Design and training of a multi-path parallel masked autoencoder model: The aforementioned multi-path parallel masked autoencoder model, hereinafter referred to as the model, is specifically a multi-layer neural network model constructed in parallel using multiple masked autoencoder layers. It is trained using the original dataset to obtain the joint probability distribution between different columns in the original database data table. The multi-path parallel masked autoencoder model includes an embedding encoding layer and a computation module; the computation module includes a data feature extraction layer, a data feature enhancement layer, and a final computation layer; the final output of the model is the conditional probability distribution between different columns. The feature extraction layer includes multiple masked autoencoder layers and an average pooling layer; wherein the masked autoencoder layer is derived from the single-layer structure of the masked autoregressive model, and the fully connected layer acquires autoregressive properties by setting masks between layers; the feature enhancement layer has the same structure as the feature extraction layer, including multiple masked autoencoder layers and an average pooling layer, wherein the masked autoencoder layer is derived from the single-layer structure of the masked autoregressive model and the fully connected layer acquires autoregressive properties by setting masks between layers; the final computation layer includes a complete and relatively large masked autoencoder layer W;
(II) Offline join key sampling: The offline join key sampling, hereinafter referred to as sampling, includes the following steps: constructing the schema graph and determining the sampling order, confirming the sampling ratio, and performing join-based sampling;
(III) Cardinality estimation using the hybrid autoregressive model and offline join key sampling: The aforementioned cardinality estimation specifically includes: cardinality estimation of the query with the autoregressive model, cardinality estimation based on offline join key sampling, and hybrid cardinality estimation, to obtain the final query cardinality estimate;
The specific process for step (III) is as follows: First, the cardinality of the query is estimated using the multi-path parallel masked autoencoder model trained in step (I). A certain number of virtual tuples are generated under the predicate constraints of the query, and the conditional probabilities of these virtual tuples in the overall data distribution are obtained. Summing the probabilities of these virtual tuples gives the cardinality estimate of the query. The predicate constraint of the query has the form of the following equation (7), where $T_i.C_j$ represents the data in the j-th column of the i-th table and $R_{ij}$ represents the range of query constraint values on this column:
$Q = \{T_1.C_1 \in R_{11}, \ldots, T_n.C_m \in R_{nm}\}$, (7)
Within the constraint range, the autoregressive model generates n virtual tuples, and the distribution proportion of these virtual tuples is obtained, giving the estimated query selectivity sel′(Q) according to equation (8). The cardinality is obtained by multiplying the query selectivity by the total cardinality:
$card'_{AR}(Q) = sel'(Q) \times card_{ALL}$, (9)
The cardinality estimate on the offline join key sample is obtained by executing the target query Q on the completed sample:
$card'_{S}(Q) = Q(\mathrm{sample})$, (10)
The final cardinality value is obtained by weighting the two values according to their magnitude:
$card'(Q) = W_1 \times \max(card'_{S}(Q), card'_{AR}(Q)) + W_2 \times \min(card'_{S}(Q), card'_{AR}(Q))$, (11)
wherein $W_1$ and $W_2$ are the weighting coefficients.

2. The query cardinality estimation method according to claim 1, characterized in that, in step (I), the model first encodes the row data in the input original data table through the embedding encoding layer; the per-column data length $L_{em}$ of the encoding layer is set to at most 64, and the data are encoded according to their value range: columns whose value range is smaller than $L_{em}$ use one-hot encoding, columns whose value range is greater than $L_{em}$ use embedding encoding, and the encoding layer outputs the encoded data sequence; the aforementioned embedding encoding layer uses a single matrix multiplication to reduce the vector data of a single column with value range $L_c$ to a 64-dimensional vector, and each column whose value range exceeds 64 corresponds to a trainable dimensionality-reduction matrix of size $(L_c, 64)$;
After encoding, the data is input into the computation module of the model, where:
The input to the feature extraction layer is the data sequence after embedding encoding. The data is passed through multiple masked autoencoder layers in parallel, and the outputs are vertically concatenated and input into an average pooling layer. Finally, ReLU is used for activation to obtain the output of this layer. The calculation is given by equation (1), expressed in terms of the operation that the i-th masked autoencoder layer of the extraction layer performs on the input;
The input to the feature enhancement layer is the data sequence calculated by the feature extraction layer. The data is passed through parallel masked autoencoder layers, and the outputs are vertically concatenated and input into an average pooling layer; finally, sigmoid is used for activation to obtain the output of this layer; the calculation is given by equation (2), expressed in terms of the operation that the i-th masked autoencoder layer of the enhancement layer performs on the input;
The final computation layer takes as input the horizontal combination of the encoded data and the output vectors of the feature extraction and feature enhancement layers; specifically, they are joined with the concatenate function while random Dropout is applied, and the input is given by equation (3). A single masked autoencoder layer W processes the above features, and the output is a vector of the same length as the embedding-encoded vector, as given by equation (4). The subsequent decoding layer performs decoding to obtain the conditional probabilities of different values for each column.

3. The query cardinality estimation method according to claim 2, characterized in that, in step (I), model training requires as input complete data sequences obtained by batch random sampling over the multi-table join of the original database. The original database refers to the relational database whose join cardinalities the user wants to learn, which includes single tables, foreign key constraints, the database schema, and table indexes. In a relational database, join operations are performed on equality keys, and the join relationships in the schema can be regarded as a graph. For autoregressive models, generating the complete data sequences for training requires sequential traversal of the data in all tables of the database schema. This process has a start and an end point, so a root table is set as the starting point of the joins. Starting from the root table, the entire dataset is traversed and joined along the equi-join keys, and the outer join is sampled. During the join, duplicate keys in the next table amplify the tuples of the previously joined table, so the proportion of each distinct key in the final training space is different. Accordingly, the join weight of each table on the path must be recorded during sampling. After the join weights are determined, complete data sequences are sampled at random from each table according to the join weights, generating complete tuples of the outer join. After obtaining these randomly sampled complete data sequences, the data is input into the model, the cross-entropy between the probability distributions is output, and model training is completed by backpropagating the cross-entropy.

4. The query cardinality estimation method according to claim 3, characterized in that the specific process for step (II) is as follows:
First, the database schema graph for the cardinality estimation task in step (I) is taken as input. Each table in the schema is regarded as a node of the graph, and each join key between tables is regarded as a directed edge between nodes. The final output is a schema graph in the form of a directed acyclic graph. Constructing this directed acyclic graph requires taking the sampling starting point table of the model training process in step (I) as the starting point of the schema graph. From this starting point, a breadth-first traversal searches the entire database schema with the join keys as edges, recording the order in which each edge is visited, and finally a directed acyclic graph is generated.
Subsequently, the overall sampling ratio f is set. Starting from the root table, data tuples are sampled at random at ratio f to form the root table sample $S_{root}$;
For the first sampling iteration, let $S_{i-1} = S_{root}$. The distinct values of the join key in the previous sample $S_{i-1}$ that corresponds to the next join table $T_i$ in the recorded order are computed and equi-joined with $T_i$, and from the join result $n_i$ tuples are drawn to form the current sample $S_i$. Here count(X) is the total number of rows of a table or result set X; the number of samples $n_i$ is calculated by equation (5) and the recording ratio $f_i$ by equation (6):
$n_i = \mathrm{count}(T_i) \times f$, (5)
For subsequent iterations, the order of edges and nodes recorded in the directed acyclic graph is followed. Each iteration takes the sample of the previous iteration as $S_{i-1}$, computes the distinct values of the join key between $S_{i-1}$ and $T_i$, and obtains the current sample $S_i$ of the current table $T_i$ and its recording ratio $f_i$;
After the loop finishes, the samples of all tables are collected into a list $List_S$. When joining on the distinct values, if the number of join results is less than $n_i$, all join results are used as the current sample $S_i$ and $f_i$ is set to 1.