A cardinality estimation method and system for cross-database and concept drift

By constructing feature extraction and context-adaptive inference modules, and utilizing Bayesian meta-training and a dynamic context memory library, the cardinality estimation problem under cross-database and concept drift conditions is solved, achieving fast adaptive and high-precision cardinality prediction, and reducing the deployment and maintenance costs of new databases.

CN122045172BActive Publication Date: 2026-06-16NORTHEASTERN UNIV CHINA

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NORTHEASTERN UNIV CHINA
Filing Date
2026-04-16
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing technologies struggle to achieve fast, adaptive cardinality estimation online in scenarios involving cross-database differences and time-varying concepts, leading to unstable query optimization results and issues such as update delays and high maintenance costs.

Method used

A feature extraction module and a context-adaptive inference module are constructed. Through Bayesian meta-training and multi-head self-attention mechanism, SQL queries and database statistics are transformed into high-dimensional dense vectors. Cardinality prediction is performed using Bayesian inference, and fast adaptation is achieved by combining dynamic context memory.

🎯Benefits of technology

In cross-database migration and concept drift scenarios, it significantly reduces the deployment cost of new databases, quickly corrects estimation bias, reduces parameter updates and resource consumption, and improves estimation accuracy and stability.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122045172B_ABST
    Figure CN122045172B_ABST
Patent Text Reader

Abstract

The application discloses a cross-database and concept drift-oriented cardinality estimation method and system, and relates to the technical field of database cardinality estimation. A feature extraction module is constructed and pre-trained to obtain a trained feature extraction module; the feature extraction module serves as a SQL query encoder to convert SQL queries and database statistical information into a high-dimensional dense vector; a context adaptive reasoning module is constructed and subjected to Bayesian meta-training to obtain a trained context adaptive reasoning module; the context adaptive reasoning module performs cardinality prediction based on the high-dimensional dense vector to obtain a predicted cardinality; and SQL queries are acquired, and online cardinality estimation is performed by using the trained feature extraction module and the context adaptive reasoning module to obtain a predicted cardinality. The application can solve the problem that the prior art is difficult to realize online and rapid adaptation in an environment where cross-database differences and data drifts over time coexist.
Need to check novelty before this filing date? Find Prior Art

Description

Technical Field

[0001] This invention belongs to the field of database cardinality estimation technology, and in particular relates to a cardinality estimation method and system for cross-database and concept drift. Background Technology

[0002] In database management systems, query optimizers typically use cost-based optimization to generate execution plans, with cardinality estimation and selectivity prediction being key inputs for cost estimation. A large deviation in cardinality estimation can lead to errors in critical decisions such as join order, join algorithm, and index selection, resulting in query plan degradation and significantly increased execution latency.

[0003] In real-world production environments, on the one hand, databases exhibit significant differences across different business systems, schemas, and data characteristics. Estimation models often face cross-database drift (cross-domain differences) when migrating from the training database to a new one. On the other hand, even within the same database, continuous data insertion, deletion, and updates cause changes in attribute distribution, column correlations, and query workload patterns over time, resulting in temporal conceptual drift. These two types of drift often coexist: the model needs both cross-database portability and rapid adaptive capability when the same database changes over time. Failure to effectively address these issues will lead to a continuous decline in estimation accuracy in new databases or at new stages, further impacting query optimization and system stability.

[0004] In existing technologies, one approach uses deep learning to construct cardinality estimators. For example, Chinese patent CN115269639A, "A Cardinality Estimation Method for Relational Databases Based on Deep Learning," establishes multiple cardinality estimators for nodes and connected subgraphs of a relational topology graph. It trains these estimators by creating SQL statements for training and obtaining a training set. When database data changes or the amount of changed data reaches a set scale, the estimator to be updated needs to be determined and retrained to complete the update. This type of method can improve estimation accuracy to some extent, but it typically relies on the distribution of training samples under a specific database or workload, and its patterns are fixed in the model parameters. When migrating to a new database, the feature distribution, connection-related structures, and predicate patterns change, easily leading to insufficient cross-database generalization ability. When the same database experiences concept drift over time, it often requires periodically constructing training SQL, executing and collecting supervisory signals, and retraining for updates, resulting in additional computational and maintenance costs, and a time delay window where "drift has occurred but the model has not yet been updated." The fundamental reason is that the model mainly fits the historical distribution through parameterization, making it difficult to handle cross-database differences and time drift simultaneously under conditions of no training or low training cost.

[0005] Another approach employs machine learning for selectivity prediction and relies on real feedback for closed-loop updates. For example, Chinese patent CN112380243A, "A Machine Learning-Based Method for Predicting SQL Query Selectivity," constructs feature vectors by collecting SQL queries and database statistics, and trains a regression model using the actual selectivity of the query as a supervisory signal, thereby outputting a selectivity prediction. This type of method is highly dependent on feature design and supervisory signals: in cross-database drift scenarios, the difference in distribution between training samples and the target database leads to unstable model transfer performance; in scenarios where the concept drifts over time, maintaining accuracy requires continuous collection of real feedback and repeated training or iterative updates, introducing problems such as data collection costs, training and update resource consumption, and update latency. When the distribution changes faster than the collection and retraining speed, prediction errors accumulate over time and affect optimizer decisions. This is because the performance of supervised learning models is highly dependent on "labeled samples under the current distribution," and the update loop is difficult to balance low overhead and real-time performance.

[0006] Another type of existing technology attempts to optimize queries from the perspective of "execution plan-cost". For example, Chinese patent CN115617830A, "A Data Query Optimization Processing Method and Apparatus Based on Machine Learning", transforms the execution plan and execution environment parameters into representation vectors, inputs them into a query optimizer model based on a long short-term memory network to obtain cost information, and selects a target execution plan based on the cost information; the model is trained from historical execution plan samples and their corresponding actual costs. This type of method can improve the plan selection effect under certain conditions, but the cost is strongly correlated with data distribution, query load, and hardware and software environment. When migrating to a new database or environment, the representativeness of historical samples for the current scenario decreases, and cost estimation is prone to systematic bias; when the same database drifts over time or environmental parameters change, the model usually needs to be supplemented with new samples and updated with model parameters. This is because the model mainly relies on offline historical data to learn mapping relationships, which is prone to mismatch under cross-database differences and time drift.

[0007] Furthermore, some existing technologies offer model update frameworks from the perspective of a general "learning system update." For example, Chinese patent CN111798001A, "Learning-Based Data Processing System and Model Update Method," proposes generating learning models from learning datasets and performing recognition processing, and involves cascading models generated from different learning datasets and subsequent updates. While such frameworks can provide a certain update mechanism, in scenarios with continuous data drift, they typically still rely on the accumulation of new data and training update processes, making it difficult to simultaneously achieve online real-time performance and low maintenance costs, especially in the cross-database migration stage where it is difficult to achieve "low sample size, low cost, and rapid adaptation."

[0008] Besides learning model approaches, traditional statistical or indexing structures are also used for cardinality estimation. For example, Chinese patent CN103544258A, "A Method and Apparatus for Cardinality Estimation under Multi-Interval Query Conditions in Big Data," estimates cardinality under multi-interval query conditions through pre-partitioning, tree-structured indexing, and cardinality estimators, emphasizing online incremental update efficiency. These methods typically rely on pre-defined statistical structures or approximate estimators. When faced with complex multi-table joins, strong attribute correlations, and situations where correlation structures change over time, their ability to characterize higher-order correlations is limited. Furthermore, when cross-database migrations or concept drift alter the original partitioning and statistical assumptions, estimation errors may still increase. This is because the information source for these methods is primarily static or semi-static statistical structures, which have limited expressive power and struggle to quickly form new correlation expressions after a drift occurs.

[0009] Another type of method employs explicit concept drift detection followed by adaptation. For example, Chinese patent CN101827002A, "A Concept Drift Detection Method for Data Stream Classification," uses mechanisms such as data stream segmentation, sliding windows, and voting to detect drift; CN112000705B, "An Unbalanced Data Stream Mining Method Based on Active Drift Detection," includes drift detection and drift adaptation processes. This type of method can be used for general data stream tasks, but it typically requires additional detection modules and window maintenance, and may face instability due to false positives, false negatives, or detection delays. Furthermore, its problem setting is more geared towards general data stream mining, not directly addressing tasks strongly coupled with SQL structure, such as cardinality estimation required for database query optimization. This is because this approach usually follows a "detect first, then adjust" processing chain, inevitably introducing additional steps and system overhead, and making it difficult to guarantee immediate and effective correction of subsequent queries during real-time database optimization. More importantly, this approach typically focuses on time drift within the same database and cannot naturally solve the domain difference problem caused by cross-database migration.

[0010] In summary, existing technologies often have the following common shortcomings in drift scenarios:

[0011] First, many methods solidify statistical regularities in model parameters or static statistical structures, making it difficult to simultaneously adapt to cross-database differences and time-varying concepts.

[0012] Secondly, relying on offline or periodic retraining, real feedback data collection, or historical sample accumulation can easily lead to update delays and additional maintenance costs.

[0013] Third, introducing explicit drift detection and re-adaptation links increases system complexity and may lead to detection instability. Therefore, a method that can... Figure 1The method and system described herein demonstrate a way to achieve rapid online adaptation with low overhead and stably support key database tasks such as cardinality estimation when cross-database drift and time-dependent drift of the same database coexist. Summary of the Invention

[0014] To address the shortcomings of existing technologies, this invention provides a cardinality estimation method and system for cross-database and concept drift, thereby solving the problem that existing technologies struggle to achieve rapid online adaptation in environments where cross-database differences and data drift over time coexist.

[0015] The technical solution of this invention is as follows:

[0016] On the one hand, this invention provides a cardinality estimation method for cross-database and concept drift, comprising the following steps:

[0017] A feature extraction module is constructed and pre-trained to obtain the trained feature extraction module. The feature extraction module acts as an SQL query encoder to transform SQL queries and database statistics into a fixed-length high-dimensional dense vector.

[0018] A context-adaptive inference module is constructed and trained using Bayesian meta-engineering to obtain the trained context-adaptive inference module; the context-adaptive inference module is used to perform cardinality prediction based on a high-dimensional dense vector to obtain the predicted cardinality;

[0019] The SQL query is obtained, and the trained feature extraction module and context adaptive inference module are used to perform online cardinality estimation to obtain the predicted cardinality.

[0020] Furthermore, the database statistics include attribute value distribution, join scaling factor, predicate features, and metadata; the attribute value distribution is a one-dimensional histogram of the numerical distribution of each attribute involved in the SQL query; the join scaling factor is the scaling ratio distribution of the cardinality caused by the join operation corresponding to the join condition in the SQL query; the predicate features are used to represent the filtering conditions in the SQL query, including the operator type of the filtering predicate, the upper and lower bounds of the normalized filter value, and the single-table selectivity; the metadata is the number and size of the tables involved in the SQL query; and the filtering predicate is the predicate in the filtering conditions.

[0021] Furthermore, the feature extraction module includes an input embedding layer, a connection layer, and a filtering layer;

[0022] The input embedding layer is a linear layer used to map database statistical information into an initial embedding vector of a unified dimension that can be processed by the connection layer and the filtering layer. Specifically, for attributes appearing in the connection predicate... , attributes The attribute value distribution and the connection scaling factor are concatenated into a connection vector. , For the table number, The attribute is numbered, and the join predicate is the predicate in the join condition; for attributes appearing in the filter predicate... , attributes The attribute value distribution and predicate features are concatenated into a filter vector. At the same time, metadata is mapped to metadata vectors. ;

[0023] The connection layer combines all connection vectors and a randomly generated vector. As input, the attention scores between attributes are calculated using a multi-head self-attention mechanism to simulate the multi-table join process, and the final output is a vector reflecting the join relationship between tables and an intermediate context vector.

[0024] (1);

[0025] (2);

[0026] (3);

[0027] (4);

[0028] in, For the input matrix, For connection vectors, This is the number of the connection vector; The number of connection vectors, For transpose, For the first One's attention, For the attention head number, For the softmax function, , , It is a learnable parameter matrix; As a dimension, For the output matrix, For the sake of the bulls' self-attention, For the number of attention heads, For a learnable parameter matrix, For the intermediate context vector, A vector that reflects the join relationships between tables;

[0029] The filtering layer includes vectors reflecting inter-table joins output by the connection layer, intermediate context vectors, filter vectors output by the input embedding layer, and metadata vectors. As input, a multi-head self-attention mechanism is used to simulate the process of applying filtering conditions to the joint distribution; finally, the output matrix is ​​used to... Extracting intermediate context vectors The vector at the corresponding position is used as a high-dimensional dense vector. ;

[0030] (5);

[0031] (6);

[0032] (7);

[0033] (8);

[0034] (9);

[0035] in, The input matrix of the filter layer; For the filter vector, This is the number of the filter vector. This represents the number of filter vectors. The number of metadata vectors; Represents the intermediate context vector In the input matrix row index in To attract attention, This is the output matrix.

[0036] Furthermore, the specific method for pre-training the feature extraction module is as follows:

[0037] First, databases from different domains are collected. For each database, several SQL queries are generated, including varying numbers of join and filter conditions. These queries are then executed in the databases to obtain the true cardinality as labels. During training, a supervised learning approach is used, where the SQL queries and their corresponding database statistics are input into the feature extraction module and encoded into high-dimensional dense vectors. Subsequently, high-dimensional dense vectors The prediction base is determined using a temporary MLP prediction head, and the loss function during training is:

[0038] (10);

[0039] in, The loss function; As a base for prediction, As the true base, The training sample number, This represents the number of training samples;

[0040] After training is complete, remove the temporary MLP prediction head and freeze all parameters of the feature extraction module.

[0041] Furthermore, the construction and training of the context-adaptive inference module follow Bayesian inference theory, setting the cardinality distribution pattern of the current database as determined by a set of observed query contexts. To characterize, This is a high-dimensional dense vector output by the feature extraction module. This represents the corresponding true cardinality. The sample number. To query the number of samples in the context set; for a new input high-dimensional dense vector The context-adaptive inference module aims to model its posterior prediction distribution to obtain the prediction cardinality, which is generated by considering all possible latent data generation mechanisms. Points are earned through a specific method:

[0042] (11);

[0043] in, As the base for prediction; Let be the posterior prediction distribution, representing the set of observed query contexts. Then, for a new input high-dimensional dense vector The output distribution; Indicates a given data generation mechanism The following is a high-dimensional dense vector as a new input. The output distribution; Let be the posterior distribution, representing the set of observed query contexts. Data generation mechanism The probability distribution; This indicates that the data generation mechanism is The current query context set is observed at that time. The likelihood; For data generation mechanism The prior distribution; For the prior distribution space;

[0044] Furthermore, the method for constructing the prior distribution space is as follows:

[0045] In the construction process, Bayesian neural networks and structural causal model generative mechanisms are introduced to generate different prior distributions, thereby constructing the prior distribution space.

[0046] Furthermore, the Bayesian meta-training process is specifically as follows:

[0047] First, starting from the constructed prior distribution space Random sampling data generation mechanism prior distribution This serves as the current task distribution; subsequently, the sampled task distribution is used to generate batches of synthetic data, a portion of which is used as a query context set. A portion is used as the target sample, which includes high-dimensional dense vectors. And as the true cardinality of the labels, the generated query context set is then used. and high-dimensional dense vectors The constituent sequence is input into the context-adaptive reasoning model for forward propagation, and the model outputs a label.

[0048] The training process uses the negative log-likelihood loss of the regression task as the optimization objective. :

[0049] (12);

[0050] in, Represents the true cardinality. This indicates an operation to calculate the expected value. and These are the cardinal mean and standard deviation predicted by the context-adaptive inference module, respectively.

[0051] Furthermore, the process of obtaining the SQL query, utilizing the trained feature extraction module and context-adaptive inference module to perform online cardinality estimation, and obtaining the predicted cardinality, specifically involves:

[0052] A dynamic context memory that reflects the current data distribution in the database is pre-established and maintained. This dynamic context memory is a first-in-first-out queue structure used to store a set of known query samples obtained from the target database. Each query sample is encapsulated as a sample pair including a high-dimensional dense vector and the true cardinality. And it is stored in the dynamic context memory. Represents a high-dimensional dense vector. Represents the true cardinality. Indicates the ID of the queried sample; the dynamic context memory is kept up-to-date. Each query sample forms the query context set in the current dynamic context memory. ;

[0053] When a target SQL query for estimating the cardinality is received At that time, first perform the target SQL query. The input is fed into the feature extraction module, which uses its frozen parameters to perform forward computation and extract a high-dimensional dense vector. Subsequently, the current query context set is read from the aforementioned dynamic context memory. And combine all query samples with the current high-dimensional dense vector The sequences are spliced ​​and assembled to form an input sequence with a temporal structure. The input sequence is then fed into the context-adaptive inference module; during a single forward propagation, the context-adaptive inference module dynamically infers the prediction cardinality; finally, the context-adaptive inference module directly outputs the target SQL query. Prediction base This predicted cardinality is then returned to the database query optimizer, which actually executes the target SQL query. To obtain the true cardinality and simultaneously pair the new query samples Write to the tail of the first-in-first-out queue structure, remove the oldest query sample from the head of the queue structure, and complete the update of the dynamic context memory.

[0054] On the other hand, the present invention also provides a cardinality estimation system for cross-database and concept drift, used to implement a cardinality estimation method for cross-database and concept drift, including:

[0055] The database management system receives SQL queries and sends SQL queries and database statistics to external adaptive components.

[0056] The adaptive components include a feature extraction module, a context-adaptive inference module, a first-in-first-out queue structure, and a result write-back module. Upon receiving an SQL query, it does not interfere with the original parsing process of the database management system. Instead, it uses the feature extraction module in parallel to transform the SQL query and database statistics into a fixed-length high-dimensional dense vector. The context-adaptive inference module is then used to perform cardinality prediction based on the high-dimensional dense vector to obtain the predicted cardinality. For use by the query optimizer; the actual cardinality after the query is executed. The result will be written back to the module into a first-in-first-out queue structure.

[0057] Thirdly, this application proposes an electronic device comprising: one or more processors, and a memory for storing instructions that, when executed by the one or more processors, cause the one or more processors to perform the aforementioned cardinality estimation method for cross-database and concept drift.

[0058] Fourthly, this application proposes a computer-readable storage medium storing executable instructions that, when executed, cause a processor to perform the aforementioned cardinality estimation method for cross-database and concept drift.

[0059] Fifthly, this application proposes a computer program product, including a computer program or instructions that, when executed by a processor, implement the aforementioned cardinality estimation method for cross-database and concept drift.

[0060] Compared with the prior art, the present invention has at least the following beneficial effects, especially in that it can simultaneously address cross-database drift and concept drift within the same database over time:

[0061] 1. Strong cross-database migration capability, significantly reducing the deployment cost of new databases. This invention transforms query structure information and database statistics into transferable, compact representation vectors, enabling the input representation of cardinality estimation to have consistent semantics across schemas and data distributions. This reduces reliance on large amounts of training SQL, labeled samples, or long retraining processes in the target database when migrating from the source database to the target database, lowering the cost of deploying new databases and model implementation, and improving the stability of estimation accuracy in cross-database scenarios.

[0062] 2. It possesses rapid adaptive capabilities to concept drift, avoiding the "drift-retraining" lag window. When continuous insertion, deletion, and updates of data in the same database cause changes in attribute distribution, inter-column correlation, and workload patterns, this invention can utilize a small number of recent executed queries as contextual samples for inference. This allows the estimation results to quickly correct for the current distribution without waiting for offline data collection, periodic retraining, or parameter updates. This significantly shortens the time window between drift and estimation recovery, reducing the risk of plan degradation caused by estimation bias during the drift phase.

[0063] 3. It eliminates or significantly reduces the number of parameter updates and retraining cycles, lowering maintenance costs and resource consumption. Existing methods typically require continuous construction of training sets and training to update model parameters in drift scenarios. This invention achieves adaptation during the inference phase through a context learning mechanism, improving performance with new samples without updating or with minimal model parameter updates. This reduces training resource consumption, lowers operational complexity, and mitigates the risk of system interference and unavailability caused by frequent training.

[0064] 4. It can simultaneously adapt to scenarios involving "cross-database discrepancies + time drift," enhancing long-term robustness. In actual production, databases may undergo cross-database migration or continuously evolve during operation. This invention, through a combination of "transferable representations + online sample closure," allows usable accuracy to be obtained through representation transfer during the cross-database startup phase, and continuously adapts to time drift as the sample database is incrementally updated during subsequent operation, maintaining stable estimation performance over long-term operation.

[0065] 5. The project is easy to implement and can be integrated without changing the core execution engine of the DBMS. The cardinality estimation results output by this invention can be used as a substitute or supplementary input for the query optimizer, and the context samples can be naturally obtained through post-execution feedback. The overall chain requires minimal modification to the database kernel, has high deployment flexibility, is suitable for various databases and various workload scenarios, and has strong scalability. Attached Figure Description

[0066] Figure 1 This is a schematic diagram illustrating scenarios of cross-database drift and time concept drift in existing technologies;

[0067] Figure 2 This is a flowchart of a cardinality estimation method for cross-database and concept drift in an embodiment of the present invention;

[0068] Figure 3 This is a structural diagram of the feature extraction module in an embodiment of the present invention;

[0069] Figure 4 This is a structural diagram of a cardinality estimation system for cross-database and concept drift in an embodiment of the present invention. Detailed Implementation

[0070] The present invention will now be described in detail with reference to the accompanying drawings and embodiments.

[0071] Example 1:

[0072] A cardinality estimation method for cross-database and concept drift, such as Figure 2 As shown, it includes the following steps:

[0073] Step 1: Construct a feature extraction module and pre-train it to obtain a trained feature extraction module;

[0074] The feature extraction module, acting as a general-purpose SQL query encoder, transforms heterogeneous, structured SQL queries and database statistics into a fixed-length, high-dimensional, dense vector. This high-dimensional dense vector It highly condenses the semantic information of SQL queries and the distribution information of underlying data in the database; the format of the SQL query in this embodiment is as follows: , For tables in the database, and As a condition;

[0075] The database statistics include attribute value distribution, join scaling factor, predicate features, and metadata; the attribute value distribution is a one-dimensional histogram of the numerical distribution of each attribute involved in the SQL query; the join scaling factor is the distribution of the scaling ratio of the join operation to the cardinality for each join condition in the SQL query, for example, for each join condition ( (in the middle) , and For tables in the database, and For the table number, and For the attributes in the table, and The attribute number, whose scaling factor represents "table". A certain record in the table The predicate features are used to represent the filtering conditions in the SQL query, including the operator type of the filtering predicate, the upper and lower bounds of the normalized filter value, and the single table selectivity; the metadata is the number of tables involved in the SQL query and the size of the tables; the filtering predicate is the predicate in the filtering conditions.

[0076] The feature extraction module, as shown below Figure 3 As shown, it specifically includes an input embedding layer, a connection layer, and a filtering layer;

[0077] The input embedding layer is a linear layer responsible for mapping database statistical information into an initial embedding vector of a unified dimension that can be processed by the connection layer and the filtering layer. Specifically, for attributes appearing in the connection predicate... , attributes The attribute value distribution and the connection scaling factor are concatenated into a connection vector. , The attribute is numbered, and the join predicate is the predicate in the join condition; for attributes appearing in the filter predicate... , attributes The attribute value distribution and predicate features are concatenated into a filter vector. At the same time, metadata is mapped to metadata vectors. ;

[0078] The connection layer combines all connection vectors and a randomly generated vector. As input, the attention scores between attributes are calculated using a multi-head self-attention mechanism to simulate the multi-table join process, and the final output is a vector reflecting the join relationship between tables and an intermediate context vector.

[0079] (1);

[0080] (2);

[0081] (3);

[0082] (4);

[0083] in, For the input matrix, For connection vectors, This is the number of the connection vector; The number of connection vectors, For transpose, For the first One's attention, For the attention head number, For the softmax function, , , It is a learnable parameter matrix; As a dimension, For the output matrix, For the sake of the bulls' self-attention, For the number of attention heads, For a learnable parameter matrix, For the intermediate context vector, A vector that reflects the join relationships between tables;

[0084] The filtering layer includes vectors reflecting inter-table joins output by the connection layer, intermediate context vectors, filter vectors output by the input embedding layer, and metadata vectors. As input, a multi-head self-attention mechanism is used to simulate the process of applying filtering conditions to the joint distribution (in a database, the common distribution of two or more attributes is called the joint distribution, and the function that represents the joint distribution is called the joint probability density function); finally, the output matrix... Extracting intermediate context vectors The vector at the corresponding position is used as a high-dimensional dense vector. This high-dimensional dense vector integrates the entire query plan structure and data distribution prediction information, rather than simple statistical values;

[0085] (5);

[0086] (6);

[0087] (7);

[0088] (8);

[0089] (9);

[0090] in, The input matrix of the filter layer; For the filter vector, This is the number of the filter vector. This represents the number of filter vectors. The number of metadata vectors; Represents the intermediate context vector In the input matrix row index in To attract attention, The output matrix is ​​used; in the filtering layer, the intermediate context vector is used. Viewed as a carrier of the "global query vector": it interacts with all tokens as a query within self-attention, thus simulating the process of "applying filtering conditions to the joint distribution after concatenation". The final result is a high-dimensional dense vector. That is, the output matrix In and intermediate context vector The state corresponding to that row is a high-dimensional dense latent representation that integrates "connection structure + predicate constraints + table data".

[0091] To make high-dimensional dense vectors To achieve cross-database versatility, the feature extraction module requires large-scale pre-training. This begins with data preparation, collecting numerous databases from different domains. For each database, a massive number of SQL queries are generated, including varying numbers of join and filter conditions. These queries are then executed in the databases to obtain the true cardinality as labels. The goal of this training process is to employ supervised learning, inputting the SQL queries and their corresponding database statistics into the feature extraction module, encoding them into high-dimensional dense vectors. Subsequently, high-dimensional dense vectors The prediction base is determined using a temporary MLP prediction head, and the loss function during training is:

[0092] (10);

[0093] in, The loss function; As a base for prediction, As the true base, The training sample number, This represents the number of training samples. After training, the temporary MLP prediction head is removed, and all parameters of the feature extraction module are frozen. At this point, the feature extraction module is capable of transforming any SQL query into a high-quality, high-dimensional, dense vector. The ability.

[0094] Step 2: Construct a context-adaptive inference module and train it using Bayesian meta-training to obtain a trained context-adaptive inference module; the context-adaptive inference module is used to perform cardinality prediction based on a high-dimensional dense vector to obtain the predicted cardinality;

[0095] In a preferred embodiment of the present invention, the core component of inference prediction is constructed as a context-adaptive inference module, which adopts a prior data fitting network based on the Transformer architecture. Its core technical feature is that the context-adaptive inference module is endowed with context adaptability in the inference stage through Bayesian meta-training. Unlike the supervised learning based on real SQL logs in the front-end feature extraction module, the context-adaptive inference module is pre-trained based on the database. Its training goal is not to memorize specific database knowledge, but to learn a general Bayesian inference algorithm, that is, how to infer the label of unknown samples based on given context samples.

[0096] The construction and training of the context-adaptive inference module follow Bayesian inference theory. Logically, it is assumed that the cardinality distribution of the current database is determined by a set of observed query contexts. To characterize, This is a high-dimensional dense vector output by the feature extraction module. This represents the corresponding true cardinality. The sample number. To query the number of samples in the context set; for a new input high-dimensional dense vector The context-adaptive inference module aims to model its posterior prediction distribution to obtain the prediction cardinality, which is generated by considering all possible latent data generation mechanisms. ( (This is a hidden variable with no actual meaning) is obtained by integration, specifically in the form:

[0097] (11);

[0098] in, As the base for prediction; Let be the posterior prediction distribution, representing the set of observed query contexts. Then, for a new input high-dimensional dense vector The output distribution; Indicates a given data generation mechanism The following is a high-dimensional dense vector as a new input. The output distribution; Let be the posterior distribution, representing the set of observed query contexts. Data generation mechanism The probability distribution; This indicates that the data generation mechanism is The current query context set is observed at that time. The likelihood; For data generation mechanism The prior distribution; For the prior distribution space;

[0099] To enable the context-adaptive inference module to adapt to various complex database distribution scenarios, this invention constructs a rich and diverse prior distribution space during the Bayesian meta-training phase. Specifically, to capture complex dependencies and reveal potential causal mechanisms, Bayesian neural networks (BNNs) and structural causal models (SCMs) are introduced into the construction of the prior distribution space to generate different prior distributions: the former is used to generate prior distributions with complex nonlinear mappings, while the latter is used to characterize the possible causal dependency structures between variables and generate prior distributions, thereby enabling the synthesis task to cover a wider range of database attribute correlations and distribution variation patterns.

[0100] Based on the above theory, the Bayesian meta-training process of the context-adaptive inference module is a one-time offline process. In the specific implementation of training, it first starts from the constructed prior distribution space... Random sampling data generation mechanism prior distribution This serves as the current task distribution. Subsequently, the task distribution obtained from this sampling is used to generate batches of synthetic data, a portion of which is used as a query context set. A portion is used as the target sample, which includes high-dimensional dense vectors. And as the true cardinality of the labels, the generated query context set is then used. and high-dimensional dense vectors The constituent sequence is input into the context-adaptive inference module for forward propagation, and the model outputs a label. Since cardinality estimation is essentially a regression task, this embodiment uses the negative log-likelihood loss of a regression task as the optimization objective. Specifically, this loss function aims to minimize the difference between the predicted distribution and the true label, and is typically formalized as a Gaussian likelihood loss:

[0101] (12);

[0102] in, Represents the true cardinality. This indicates an operation to calculate the expected value. and These are the cardinality mean and standard deviation predicted by the context-adaptive inference module, respectively. The optimization objective is to minimize this. The context-adaptive inference module is trained to adapt to a given set of query contexts. In the case of maximizing The likelihood probability of occurrence;

[0103] Through the aforementioned Bayesian meta-training based on synthetic data, the context-adaptive inference module has gained a series of key technological advantages, effectively addressing the pain points of traditional cardinality estimation models. Firstly, it possesses the ability to quickly adapt to new concepts. Having been exposed to massive amounts of complex distribution variations generated by Bayesian neural networks and structural causal models during the training phase, the context-adaptive inference module has learned how to utilize query context sets. To calibrate predictions. When concept drift occurs in the database, only the query context set needs to be updated. First, the context-adaptive inference module can automatically adapt to new distributions through inference logic without requiring parameter updates. Second, this meta-training is a one-time investment; once trained, the context-adaptive inference module can serve as a general-purpose inference engine. Combined with the feature extraction module, it can be directly deployed to any new database instance, achieving the technical effect of "train once, use forever."

[0104] Step 3: Obtain the SQL query, and use the trained feature extraction module and context adaptive inference module to perform online cardinality estimation to obtain the predicted cardinality;

[0105] First, a dynamic context memory that reflects the current data distribution state of the database is pre-established and maintained. This dynamic context memory is essentially a first-in-first-out (FIFO) queue structure used to store a set of known query samples obtained from the historical query logs of the target database or from real-time sampling execution. Each query sample is encapsulated as a sample pair including a high-dimensional dense vector and the true cardinality. And it is stored in the dynamic context memory. Represents a high-dimensional dense vector. Represents the true cardinality. Indicates the ID of the queried sample; the dynamic context memory is kept up-to-date. These samples form the query context set in the current dynamic context memory. This set of query contexts In effect, this constitutes an implicit description of the underlying data distribution patterns of the target database at the current moment. When data insertion, deletion, or updates occur in the target database, causing distribution drift, the query context set is used. It will update automatically, thus making the query context set It can capture and reflect these changes in real time;

[0106] Based on this, when a target SQL query for estimating the cardinality is received... Upon arrival, the online inference process is immediately triggered. First, the target SQL query is performed. The input is fed into the feature extraction module, which uses its frozen parameters to perform forward computation, extracting a high-dimensional dense vector representing the semantic and structural information of the query. Subsequently, the current query context set is read from the aforementioned dynamic context memory. And combine all query samples with the current high-dimensional dense vector The sequences are spliced ​​and assembled to form an input sequence with a temporal structure. The input sequence is then fed into the context-adaptive inference module; during a single forward propagation, the context-adaptive inference module allows the high-dimensional dense vector to... By combining the true cardinality in the query context set, a predicted cardinality that conforms to the current data distribution pattern is dynamically inferred. Finally, the context-adaptive inference module directly outputs the target SQL query. Prediction base This predicted cardinality is then returned to the database query optimizer, which actually executes the target SQL query. To obtain the true cardinality and simultaneously pair the new query samples The query sample is written to the tail of the first-in-first-out (FIFO) queue structure, and the oldest query sample is removed from the head of the queue structure, thus updating the dynamic context memory. The entire process requires no backpropagation or parameter fine-tuning of the model; it achieves adaptive perception and accurate prediction of new data distributions simply by updating the input sequence content.

[0107] Table 1 shows the performance of this invention on the cardinality estimation task;

[0108]

[0109] To rigorously evaluate the robustness of this method, two scenarios were designed based on the degree and nature of the changes: Mild drift involved rearranging and sampling 50% of the dataset to test the model's response to moderate but significant data changes without introducing additional data manipulation. Severe drift further increased the challenge by not only rearranging and sampling 60% of the data but also introducing an additional 10% of random data manipulation (including insertions, deletions, and value replacements) to assess the model's ability under severe data transformations. Table 1 reports the overall performance comparison results in the cardinality estimation task. The results show that the present invention consistently demonstrates superior performance across all datasets and dynamic scenarios, generally performing comparably to, or even better than, fine-tuning methods. Specifically, the present invention achieved best performance on 29 out of 32 quantile metrics. Even when fine-tuning methods are included in the comparison, it still surpasses nearly half of the evaluation results in almost all metrics, highlighting its extremely high accuracy in dynamic environments.

[0110] Example 2:

[0111] A cardinality estimation system oriented towards cross-database and concept drift is provided to implement a cardinality estimation method oriented towards cross-database and concept drift, such as... Figure 4 As shown, it includes:

[0112] The database management system receives SQL queries and sends SQL queries and database statistics to external adaptive components.

[0113] The adaptive components include a feature extraction module, a context-adaptive inference module, a first-in-first-out queue structure, and a result write-back module. Upon receiving an SQL query, it does not interfere with the original parsing process of the database management system. Instead, it uses the feature extraction module in parallel to transform the SQL query and database statistics into a fixed-length high-dimensional dense vector. The context-adaptive inference module is then used to perform cardinality prediction based on the high-dimensional dense vector to obtain the predicted cardinality. For use by the query optimizer; the actual cardinality after the query is executed. The result will be written back to the module into a first-in-first-out queue structure.

[0114] A series of data modification operations (such as batch inserts or deletes) occurred in the database, causing the data generation distribution to change from the original distribution. Mutation into distribution At this point, relying solely on historical static patterns will lead to biased model predictions. However, under the mechanism of this invention, when the first query after the drift is executed, although its predicted value may still be affected by the old distribution, the true cardinality obtained after execution is accurate. This will directly reflect the distribution Due to this characteristic, the sample pair carrying "new distribution knowledge" is then pushed into the context memory queue.

[0115] As subsequent queries are executed, the context memory queue... The composition of the samples will gradually change, that is, reflect The old samples were gradually removed, reflecting... The new samples gradually become dominant, thus enabling the prediction results to shift to the new distribution through dynamic updates of the input context without the need for offline retraining. The model achieves rapid convergence and adaptive bias correction. Mathematically, this process is equivalent to the model inferring the underlying data generation mechanism in real time by observing contextual samples. This ensures the robustness of the base estimate in long-term operation.

[0116] Example 3:

[0117] This embodiment proposes an electronic device, including: one or more processors, and a memory for storing instructions, which, when executed by the one or more processors, cause the one or more processors to perform a cardinality estimation method for cross-database and concept drift.

[0118] The electronic device may be a mobile phone, computer, or tablet computer, etc., and includes a memory and a processor. The memory stores a computer program that, when executed by the processor, implements a cardinality estimation method for cross-database and concept drift as described in the embodiments. It is understood that the electronic device may also include input / output (I / O) interfaces and communication components.

[0119] The processor is used to execute all or part of the steps in the cardinality estimation method for cross-database and concept drift as described in the above embodiments. The memory is used to store various types of data, which may include, for example, instructions for any application or method in an electronic device, as well as application-related data.

[0120] The processor may be implemented as an Application Specific Integrated Circuit (ASIC), Digital Signal Processor (DSP), Programmable Logic Device (PLD), Field Programmable Gate Array (FPGA), controller, microcontroller, microprocessor, or other electronic components, and is used to execute the cardinality estimation method for cross-database and concept drift described in the above embodiments.

[0121] Example 4:

[0122] This embodiment proposes a computer-readable storage medium that stores executable instructions. When these instructions are executed, if they are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.

[0123] The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of a cardinality estimation method for cross-database and concept drift as described in various embodiments of this application.

[0124] The aforementioned storage media include: flash memory, hard disks, multimedia cards, card-type memory (e.g., SD (Secure Digital Memory Card) or DX (Memory Data Register, MDR) memory), random access memory (RAM), static random-access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic storage, disks, optical discs, servers, APP (Application) application stores, and other media capable of storing program checksums. These media store computer programs, which, when executed by a processor, can implement the various steps of the aforementioned cardinality estimation method for cross-database and concept drift.

[0125] Example 5:

[0126] This embodiment proposes a computer program product, including a computer program or instructions, which, when executed by a processor, implements the aforementioned cardinality estimation method for cross-database and concept drift.

[0127] Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or part of the technical solution, can be embodied in the form of a computer program product.

[0128] The various embodiments in this application are described in a progressive manner. The same or similar parts between the various embodiments can be referred to each other. Each embodiment focuses on describing the differences from other embodiments.

[0129] The scope of protection of this application is not limited to the embodiments described above. Obviously, those skilled in the art can make various modifications and variations to this disclosure without departing from the scope and spirit of this disclosure. If such modifications and variations fall within the scope of this disclosure and its equivalents, then the intent of this disclosure also includes these modifications and variations.

Claims

1. A cardinality estimation method for cross-database and concept drift, characterized in that, Includes the following steps: A feature extraction module is constructed and pre-trained to obtain the trained feature extraction module. The feature extraction module acts as an SQL query encoder to transform SQL queries and database statistics into a fixed-length high-dimensional dense vector. A context-adaptive inference module is constructed and trained using Bayesian meta-engineering to obtain the trained context-adaptive inference module; the context-adaptive inference module is used to perform cardinality prediction based on a high-dimensional dense vector to obtain the predicted cardinality; The construction and training of the context-adaptive inference module follow Bayesian inference theory, setting the cardinality distribution of the current database as determined by a set of observed query contexts. To characterize, This is a high-dimensional dense vector output by the feature extraction module. This represents the corresponding true cardinality. The sample number. To query the number of samples in the context set; for a new input high-dimensional dense vector The context-adaptive reasoning module is designed to adapt to the query context set. Model its posterior prediction distribution to obtain the prediction cardinality; The SQL query is obtained, and the trained feature extraction module and context adaptive inference module are used to perform online cardinality estimation to obtain the predicted cardinality.

2. The cardinality estimation method for cross-database and concept drift as described in claim 1, characterized in that, The database statistics include attribute value distribution, join scaling factor, predicate features, and metadata; the attribute value distribution is a one-dimensional histogram of the numerical distribution of each attribute involved in the SQL query; the join scaling factor is the distribution of the scaling ratio of the cardinality caused by the join operation corresponding to the join condition in the SQL query; the predicate features are used to represent the filtering conditions in the SQL query, including the operator type of the filtering predicate, the upper and lower bounds of the normalized filter values, and the single-table selectivity; the metadata is the number of tables involved in the SQL query and the size of the tables; The filtering predicate is the predicate in the filtering conditions.

3. The cardinality estimation method for cross-database and concept drift as described in claim 1, characterized in that, The feature extraction module includes an input embedding layer, a connection layer, and a filtering layer; The input embedding layer is a linear layer used to map database statistical information into an initial embedding vector of a unified dimension that can be processed by the connection layer and the filtering layer. Specifically, for attributes appearing in the connection predicate... , attributes The attribute value distribution and the connection scaling factor are concatenated into a connection vector. , For the table number, The attribute is numbered, and the join predicate is the predicate in the join condition; for attributes appearing in the filter predicate... , attributes The attribute value distribution and predicate features are concatenated into a filter vector. At the same time, metadata is mapped to metadata vectors. ; The connection layer combines all connection vectors and a randomly generated vector. As input, the attention scores between attributes are calculated using a multi-head self-attention mechanism to simulate the multi-table join process, and the final output is a vector reflecting the join relationship between tables and an intermediate context vector. (1); (2); (3); (4); in, For the input matrix, For connection vectors, This is the number of the connection vector; The number of connection vectors, For transpose, For the first One point of attention, For the attention head number, For the softmax function, , , It is a learnable parameter matrix; As a dimension, For the output matrix, For the sake of the bulls' self-attention, For the number of attention heads, For a learnable parameter matrix, For the intermediate context vector, A vector that reflects the join relationships between tables; The filtering layer includes vectors reflecting inter-table joins output by the connection layer, intermediate context vectors, filter vectors output by the input embedding layer, and metadata vectors. As input, a multi-head self-attention mechanism is used to simulate the process of applying filtering conditions to the joint distribution; finally, the output matrix is ​​used to... Extracting intermediate context vectors The vector at the corresponding position is used as a high-dimensional dense vector. ; (5); (6); (7); (8); (9); in, The input matrix of the filter layer; For the filter vector, This is the number of the filter vector. This represents the number of filter vectors. The number of metadata vectors; Represents the intermediate context vector In the input matrix row index in To attract attention, This is the output matrix.

4. The cardinality estimation method for cross-database and concept drift as described in claim 1, characterized in that, The specific method for pre-training the feature extraction module is as follows: First, databases from different domains are collected. For each database, several SQL queries are generated, including varying numbers of join and filter conditions. These queries are then executed in the databases to obtain the true cardinality as labels. During training, a supervised learning approach is used, where the SQL queries and their corresponding database statistics are input into the feature extraction module and encoded into high-dimensional dense vectors. Subsequently, high-dimensional dense vectors The prediction base is determined using a temporary MLP prediction head, and the loss function during training is: (10); in, The loss function; As a base for prediction, As the true base, The training sample number, This represents the number of training samples; After training is complete, remove the temporary MLP prediction head and freeze all parameters of the feature extraction module.

5. The cardinality estimation method for cross-database and concept drift according to claim 1, characterized in that, The posterior prediction distribution is generated through a mechanism that considers all possible potential data. Points are earned through a specific method: (11); in, As the base for prediction; Let be the posterior prediction distribution, representing the set of observed query contexts. Then, for a new input high-dimensional dense vector The output distribution; Indicates a given data generation mechanism The following is a high-dimensional dense vector as a new input. The output distribution; Let be the posterior distribution, representing the set of observed query contexts. Data generation mechanism The probability distribution; This indicates that the data generation mechanism is The current query context set is observed at that time. The likelihood; For data generation mechanism The prior distribution; For the prior distribution space.

6. The cardinality estimation method for cross-database and concept drift according to claim 5, characterized in that, The method for constructing the prior distribution space is as follows: In the construction process, Bayesian neural networks and structural causal model generative mechanisms are introduced to generate different prior distributions, thereby constructing the prior distribution space.

7. The cardinality estimation method for cross-database and concept drift according to claim 1, characterized in that, The Bayesian meta-training process is specifically as follows: First, starting from the constructed prior distribution space Random sampling data generation mechanism prior distribution As the current task distribution; Subsequently, the task distribution obtained from this sampling is used to generate batches of synthetic data, a portion of which is used as a query context set. A portion is used as the target sample, which includes high-dimensional dense vectors. And as the true cardinality of the labels, the generated query context set is then used. and high-dimensional dense vectors The constituent sequence is input into the context-adaptive inference module for forward propagation, and the model outputs a label; The training process uses the negative log-likelihood loss of the regression task as the optimization objective. : (12); in, Represents the true cardinality. This indicates an operation to calculate the expected value. and These are the cardinal mean and standard deviation predicted by the context-adaptive inference module, respectively.

8. The cardinality estimation method for cross-database and concept drift according to claim 1, characterized in that, The process of obtaining the SQL query involves using the trained feature extraction module and context-adaptive inference module to perform online cardinality estimation, thereby obtaining the predicted cardinality. Specifically: A dynamic context memory that reflects the current data distribution in the database is pre-established and maintained. This dynamic context memory is a first-in-first-out queue structure used to store a set of known query samples obtained from the target database. Each query sample is encapsulated as a sample pair including a high-dimensional dense vector and the true cardinality. And it is stored in the dynamic context memory. Represents a high-dimensional dense vector. Represents the true cardinality. Indicates the sample number being queried; The dynamic context memory is kept up-to-date. Each query sample forms the query context set in the current dynamic context memory. ; When a target SQL query for estimating the cardinality is received At that time, first perform the target SQL query. The input is fed into the feature extraction module, which uses its frozen parameters to perform forward computation and extract a high-dimensional dense vector. Subsequently, the current query context set is read from the aforementioned dynamic context memory. And combine all query samples with the current high-dimensional dense vector The sequences are spliced ​​and assembled to form an input sequence with a temporal structure. The input sequence is then fed into the context-adaptive inference module; during a single forward propagation, the context-adaptive inference module dynamically infers the prediction cardinality; finally, the context-adaptive inference module directly outputs the target SQL query. Prediction base This predicted cardinality is then returned to the database query optimizer, which actually executes the target SQL query. To obtain the true cardinality and simultaneously pair the new query samples Write to the tail of the first-in-first-out queue structure, remove the oldest query sample from the head of the queue structure, and complete the update of the dynamic context memory.

9. A cardinality estimation system for cross-database and concept drift, used to implement the cardinality estimation method for cross-database and concept drift as described in any one of claims 1-8, characterized in that, include: The database management system receives SQL queries and sends SQL queries and database statistics to external adaptive components. The adaptive components include a feature extraction module, a context-adaptive inference module, a first-in-first-out queue structure, and a result write-back module. Upon receiving an SQL query, it does not interfere with the original parsing process of the database management system. Instead, it uses the feature extraction module in parallel to transform the SQL query and database statistics into a fixed-length high-dimensional dense vector. The context-adaptive inference module is then used to perform cardinality prediction based on the high-dimensional dense vector to obtain the predicted cardinality. For use by the query optimizer; the actual cardinality after the query is executed. The result will be written back to the module into a first-in-first-out queue structure.

10. A computer program product, characterized in that, Includes a computer program or instructions that, when executed by a processor, implement the cardinality estimation method for cross-database and concept drift as described in any one of claims 1-8.