Automated query predicate selectivity prediction using machine learning models

By using the cumulative frequency function (CFF) of a machine learning model to predict cardinality in relational databases, the problem of inaccurate cardinality estimation is solved, the query execution plan is optimized, and query performance is improved.

CN116057518BActive Publication Date: 2026-06-16INTERNATIONAL BUSINESS MACHINE CORPORATION

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
INTERNATIONAL BUSINESS MACHINE CORPORATION
Filing Date
2021-08-03
Publication Date
2026-06-16

Smart Images

  • Figure CN116057518B_ABST
    Figure CN116057518B_ABST
Patent Text Reader

Abstract

Methods, computer systems, and computer program products for cardinality estimation are provided. Embodiments of the invention include accessing database relations (302). The database relations are used to collect random samples from each of the database relations (304). Training data is then generated from the random samples (306). The training data is used to build a cumulative frequency function (CFF) model (308). The cumulative frequency function (CFF) model then provides a cardinality estimate for the output of a SQL operator (310).
Need to check novelty before this filing date? Find Prior Art

Description

Background Technology

[0001] This invention generally relates to the field of computing, and more specifically to query predicate selectivity using machine learning. Query optimization is a feature that is effective when executing queries by creating a management system in a relational database.

[0002] Structured Query Language (SQL) is a programming language created and designed to manage data stored in relational databases. Structured Query Language (SQL) can include specific programming languages ​​used to manage data stored in relational database management systems (RDBMS).

[0003] Relational Database Management Systems (RDBMS) typically include single-column statistics collected on individual columns within a relation. A relation can include tuples or attributes that describe relational or defining characteristics of tables or relationships between tables. For example, a relation includes data values ​​on a table, and a relational database can store data values ​​as relations or tables. A collection of relations or tables can be stored on the database as a relational model.

[0004] Relational models can be based on predicate logic, which can include conditions to be applied to tables or relations. In relational models, if statistics for multiple columns in a table are not collected, predicates can be assumed to be independent when calculating cardinality. Predicates can typically be represented in the form of ranges requiring lower and upper bounds. BETWEEN and range predicates can be formulated as separate ranges of values ​​for each predicate. Equality predicates can have equal lower and upper bounds. This can provide some improvement in cardinality estimation; however, implementing improved cardinality estimation is computationally intensive, expensive, and time-consuming.

[0005] Cardinality estimation plays a crucial role in Structured Query Language (SQL) optimization. In the context of SQL, cardinality refers to the number of rows processed by different operators in the query execution plan. The number of rows entering an operator can decrease or increase as the number of rows processed by different operators increases until the final result is returned to the application or the user who issued the query. This differs from column cardinality, which refers to the statistical count of the number of distinct values ​​in a column of a database table. A database table scan can impose a predicate that reads all rows in the table; however, the output may only include rows that satisfy the predicate conditions. Cardinality estimation is a prediction of how many rows satisfy the conditions versus how many rows are input into the scan operators.

[0006] Cost-based optimizers can rely on cardinality estimation to calculate the execution cost of an access plan and select the optimal access plan. Cost-based optimizers can use statistics based on table data values ​​in database tables to identify effective execution plans for structured queries. Therefore, the accuracy of cardinality for structured queries can have an impact on cost-based optimizers. Cardinality accuracy contributes to the overall goal of selecting the best execution plan through cardinality estimation. Achieving accurate cardinality estimation can be difficult, or the information required to simplify this process can be expensive, or both. Cardinality accuracy is particularly difficult to achieve on tables with correlated or skewed columns. Inaccurate cardinality estimation can lead to the selection of slow access plans and unpredictable query performance. Summary of the Invention

[0007] Exemplary embodiments disclose methods, computer systems, and computer program products for cardinality estimation. In one aspect of an exemplary embodiment, the invention includes accessing database relations. The database relations are used to collect random samples from each relation in the database relations. Training data is then generated from the random samples. The training data is used to build a cumulative frequency function (CFF) model. The cumulative frequency function (CFF) model then provides a cardinality estimate of the output of an SQL operator.

[0008] In another aspect of the exemplary embodiments, the method, computer system, and program product include an extended cumulative frequency function (CFF) model to predict cardinality estimates of SQL operators that serve as equality predicates.

[0009] In another aspect of the exemplary embodiments, the method, computer system, and program product include an extended cumulative frequency function (CFF) model to predict cardinality estimates of the join operator.

[0010] In another aspect of the exemplary embodiments, the method, computer system, and program product include an extended cumulative frequency function (CFF) model to predict cardinality estimates for lossless join operators. Attached Figure Description

[0011] These and other objects, features, and advantages of the invention will become clear from the following detailed description of exemplary embodiments of the invention, which will be read in conjunction with the accompanying drawings. The various features in the drawings are not to scale, as they are illustrated for clarity by those skilled in the art in conjunction with specific embodiments. In the drawings:

[0012] Figure 1 A networked computer environment according to at least one embodiment is shown;

[0013] Figure 2 This is a block diagram example generated from sampled and training data for model training according to at least one embodiment;

[0014] Figure 3 This is an operation flowchart illustrating the process of an adaptive method for automatically predicting query predicates using a machine learning model, according to at least one embodiment.

[0015] Figure 4 According to at least one embodiment Figure 1 A block diagram depicting the internal and external components of a computer and server;

[0016] Figure 5 Includes embodiments according to this disclosure Figure 1 A block diagram illustrating a cloud computing environment for a computer system; and

[0017] Figure 6 According to embodiments of this disclosure Figure 5 A block diagram illustrating the functional layers of an illustrative cloud computing environment. Detailed Implementation

[0018] Detailed embodiments of the claimed structures and methods are disclosed herein; however, it should be understood that the disclosed embodiments are merely illustrative of the claimed structures and methods, which can be embodied in various forms. The invention can be embodied in many different forms and should not be construed as being limited to the exemplary embodiments set forth herein. Rather, these exemplary embodiments are provided to make this disclosure comprehensive and complete, and to fully convey the scope of the invention to those skilled in the art. Details of well-known features and techniques may be omitted in the description to avoid unnecessarily obscuring the presented embodiments.

[0019] As mentioned earlier, query optimization is a feature that is effective when executing queries by creating a management system in a relational database. Structured Query Language (SQL) is a programming language created and designed to manage data stored in a relational database. Structured Query Language (SQL) can include specific programming languages ​​used to manage data stored in a relational database management system (RDBMS).

[0020] Relational Database Management Systems (RDBMS) typically include single-column statistics collected on individual columns within a relation. A relation can include tuples or attributes that describe relational or defining characteristics of tables or relationships between tables. For example, a relation includes data values ​​on a table, and a relational database can store data values ​​as relations or tables. A collection of relations or tables can be stored on the database as a relational model.

[0021] Relational models can be based on predicate logic, which can include conditions to be applied to tables or relations. In relational models, if statistics for multiple columns in a table are not collected, predicates can be assumed to be independent when calculating cardinality. Predicates can typically be represented in the form of ranges requiring lower and upper bounds. BETWEEN predicates and range predicates can be formulated as separate ranges of values ​​for each predicate. Equality predicates can have equal lower and upper bounds. This can provide improvements in cardinality estimation; however, implementing improved cardinality estimation is computationally intensive, expensive, and time-consuming.

[0022] Cardinality estimation plays a crucial role in Structured Query Language (SQL) optimization. Cardinality, in its context as SQL, refers to the number of rows processed by different operators in the query execution plan. The number of rows entering an operator can decrease or increase as the number of rows processed by different operators increases until the final result is returned to the application or the user who issued the query. This differs from column cardinality, which refers to the statistical count of the number of distinct values ​​in a column of a database table. A database table scan can impose a predicate that reads all rows in the table; however, the output may only include rows that satisfy the predicate conditions. Cardinality estimation is a prediction of how many rows satisfy the conditions versus how many rows are input into the scan operators.

[0023] Cost-based optimizers can rely on cardinality estimation to calculate the execution cost of an access plan and select the optimal access plan. Cost-based optimizers can use statistics based on table data values ​​in database tables to identify effective execution plans for structured queries. Therefore, the accuracy of cardinality for structured queries can have an impact on cost-based optimizers. Cardinality accuracy contributes to the overall goal of selecting the best execution plan through cardinality estimation. Achieving accurate cardinality estimation can be difficult, or the information required to simplify this process can be expensive, or both. Cardinality accuracy is particularly difficult to achieve on tables with correlated or skewed columns. Inaccurate cardinality estimation can lead to the selection of slow access plans and unpredictable query performance. Therefore, creating systems that leverage machine learning to predict more accurate cardinality estimates can be advantageous.

[0024] The exemplary embodiments described below provide systems, methods, and program products for machine learning and cardinality estimation. Accordingly, embodiments of the present invention have the ability to improve the field of machine learning and query optimization by creating machine learning models to learn and predict or approximate a multivariate cumulative frequency function (CFF) of relations that will help estimate the cardinality of all relation operators. More specifically, the cumulative frequency function (CFF) model is created to learn the data distribution in the relation, and the learned data distribution can be used by the model to accurately estimate the cardinality or predict the selectivity of query predicates.

[0025] The cardinality of a table is the number of rows, or row flow, processed by operators in the database. Queries typically apply predicates to a table to obtain rows of interest. These predicates can involve different relational comparison operators; for example, `age=20` and `height>=6` are predicates on the `age` and `height` columns, respectively. Predicate selectivity is the percentage of rows selected by the query after applying the predicate. In the context of predicate filtering, cardinality estimation is an estimate of the number of rows after applying the predicate, relative to the number of rows before the predicate was applied.

[0026] Columns can be statistically correlated, and this statistical correlation can lead to selectivity in combined predicates, replacing the typical independence assumption. The typical independence assumption can be produced using a single selectivity. Selectivity can include statistical analysis or calculations using cardinality collected from database tables or indexes. Multiple column indexes can be used to provide statistical correlation information, but defining many indexes for statistical correlation results in high costs for maintaining them. Collecting data on multiple column statistics avoids estimation errors for multiple predicates; however, deciding which column combinations to collect and store can be difficult and expensive. For example, assuming a relationship with N correlated columns, the number of multiple column statistics covering all column combinations is 2^N. As the number of column combinations increases, relational database management systems (RDBMS) may rely on user input to determine which column groups to collect statistics from. Relying on user input may be impractical due to the high skill level and deep knowledge of the data distribution required of users, and additionally, statistical correlation factors between multiple columns will be needed.

[0027] According to embodiments, a cumulative frequency function (CFF) or cumulative distribution function (CDF) can be used to estimate the selectivity of predicates applied in a table, in an index scan, or for other filtering operations. Cumulative frequency can include statistical analysis to identify or calculate query selectivity when each column is greater than or less than or equal to a specific or predetermined value. A cumulative frequency function (CFF) can describe a function that identifies, for example, the probability that a value is less than or equal to a specific value. A multivariate cumulative frequency function (CFF) can include an identifier of a probability function covering values ​​in two or more columns.

[0028] Machine learning can be used to approximate multivariate cumulative frequency functions (CFFs) of relationships. Different types of machine learning models can be built and used to create predictive analytics and results from CFFs. Models can include, for example, deep learning models, neural networks, artificial neural networks (ANNs), and regression models. Deep learning can also be associated with or referred to as hierarchical learning or deep structured learning. The term "model" can be used to indicate different types of machine learning models, such as deep learning models, neural network models, trained models, attention-based models, classification models, regression models, or decision tree models. Regression models can be used to analyze and compute statistical data and trends, or to quantify relationships or connections between data. Deep learning models can be used, for example, to learn the cumulative frequency function (CFF) for prediction or to provide outputs that may include cardinality estimation.

[0029] Training and updating models can include supervised, unsupervised, and semi-supervised machine learning processes. Supervised learning can use labeled datasets or labeled training sets to build, train, and update models. Unsupervised learning can use all unlabeled data to train models. Semi-supervised learning can use both labeled and unlabeled datasets to train models.

[0030] Deep learning is a type of machine learning that can classify information based on input or training data. Input data can include structured or unstructured data. Structured data can include highly organized data, such as spreadsheets, relational databases, or data stored in fixed fields. Unstructured data can include unorganized data with an unconventional internal structure, such as portable document formats (PDFs), images, presentations, web pages, video content, audio content, emails, word processing documents, or multimedia content.

[0031] According to an embodiment, inequality predicates (e.g., predicates with the ≤ operator) can be used to create or generate a cumulative frequency function (CFF) model. Other types of predicates will be provided as alternative embodiments. According to alternative embodiments, the functionality of the established cumulative frequency function (CFF) model can be extended to estimate the cardinality of additional SQL predicate comparison operators, such as equality, BETWEEN, or strict inequalities (i.e., greater than or less than).

[0032] According to an alternative embodiment, the established Cumulative Frequency Function (CFF) model using ≤ can be extended to estimate the cardinality of SQL predicates or SQL operators with more complex predicates (such as IN, OR, and NOT predicates). Using the same established Cumulative Frequency Function (CFF) model, additional complex predicates can also be considered by combining multiple basic and complex predicates.

[0033] According to an alternative embodiment, the established cumulative frequency function (CFF) model can be extended to estimate the cardinality of joins. Typically, modern database systems rely on using assumptions when calculating join cardinality to simplify the calculation of join distributions across tables. Three commonly used examples of assumptions when there are no statistics to identify correlations between predicates can include independence, consistency, and inclusion.

[0034] The independence assumption treats multiple predicate column values ​​as independent, and the combined selectivity is the product of the individual predicate selectivity. The consistency assumption can ignore or skew statistics when calculating join outcome selectivity. The inclusion assumption simplifies the calculation of join predicate selectivity when it is assumed that all distinct values ​​in a column with fewer distinct values ​​exist in a join column in another table containing a higher number of distinct values. A direct application of the cumulative frequency function (CFF) involves using join columns to build a model. Here, predicates applied to columns not involved in the join can be evaluated to obtain information about the join columns, aiding in join cardinality estimation.

[0035] According to an alternative embodiment, the established cumulative frequency function (CFF) model may include estimating the cardinality of the predicates to be extended to lossless joins or lossless join predicates. For example, if the joins between facts and a set of dimension relations are lossless, a model can be built on the joins between the fact table and all dimension tables.

[0036] Reference Figure 1 This describes an exemplary networked computer environment 100 according to one embodiment. The networked computer environment 100 may include a computer 102 having a processor 104 and a data storage device 106 and a query optimization program 110a enabled to run software program 108. The networked computer environment 100 may also include a server 112 and a communication network 116, the server 112 being enabled to run a query optimization program 110b that can interact with a database 114. The networked computer environment 100 may include multiple computers 102 and servers 112, only one of which is shown. The communication network 116 may include different types of communication networks, such as wide area networks (WANs), local area networks (LANs), telecommunications networks, wireless networks, public switched networks, and / or satellite networks. It should be understood that... Figure 1 This illustration provides only one possible implementation and does not imply any limitation regarding the environment in which different embodiments may be implemented. Many modifications can be made to the depicted environment based on design and implementation requirements.

[0037] Client computer 102 can communicate with server computer 112 via communication network 116. Communication network 116 may include connections such as wired, wireless communication links, or fiber optic cables. (See reference...) Figure 4As discussed, server computer 112 may include internal component 902a and external component 904a, and client computer 102 may include internal component 902b and external component 904b. Server computer 112 can also operate in cloud computing service models, such as Software as a Service (SaaS), Analytics as a Service (AaaS), Blockchain as a Service (BaaS), Platform as a Service (PaaS), or Infrastructure as a Service (IaaS). Server 112 can also reside in cloud computing deployment models, such as private cloud, community cloud, public cloud, or hybrid cloud. Client computer 102 can be, for example, a mobile device, telephone, personal digital assistant, netbook, laptop computer, tablet computer, desktop computer, or any type of computing device capable of running programs, accessing networks, and accessing database 114. Depending on different implementations of this embodiment, query optimization programs 110a and 110b can interact with database 114, which can be embedded in different storage devices, such as, but not limited to, computer / mobile device 102, networked server 112, or cloud storage service.

[0038] According to this embodiment, a user using client computer 102 or server computer 112 can use query optimization programs 110a and 110b (respectively) to create machine learning models to approximate the multivariate cumulative frequency function (CFF) of the relation, thereby accurately predicting the cardinality of the query predicate. See below for further details. Figure 2 and Figure 3 A more detailed explanation of query optimization methods.

[0039] Now for reference Figure 2 The diagram illustrates an example of sampled and training data generated by query optimizers 110a and 110b for model training, according to at least one embodiment. Database relation 202 may include a database containing values ​​in rows and columns, such as structured data or a collection of structured data values. Database relation 202 may have a specified number of rows and columns. Database relation 202 may be scanned for a process to obtain random sample 204 to generate combinations of all or some columns in database relation 202. Random sample 204 of data may include data used to generate training data 206 for a machine learning model. Randomly collected samples of relations can be used to train the model. Random sample 204 can be selected without needing to know how to execute queries to build a predictive model.

[0040] Now for reference Figure 3 The document describes an operational flowchart illustrating an exemplary adaptive method for automating a query selective prediction process 300, used by query optimization programs 110a, 110b according to at least one embodiment.

[0041] In step 302, database relation 202 is accessed as input. The accessed database relation 202 can have varying sizes, ranging from small to large. The accessed and collected database relation 202 could, for example, contain a sample of 10,000 rows. A sample of 10,000 rows can be considered a small sample, and this sample can be obtained from a much larger amount of relation data. This can be for a dataset with n columns c1, c2, ... c n The relation R collects data. For simplicity and as an example, Table 1 provides the data accessed and prepared for a single relation (relation R) with two columns (c1 and c2) and the value of each column.

[0042]

[0043] Data in a relation can be character data, or may include character data; therefore, relational data can include numeric data, character data, or both. Data in a relation can also include date data, time data, and special types of data such as Boolean data. If the data in a relation is a character data type or contains character data types, then the data in the relation can be encoded in numeric form.

[0044] At step 304, a random sample 204 is collected for each relation. For example, the random sample 204 can be identified and collected for a large relation containing a large number of data point combinations that can be used as training data or to train a model. Database relations 202 can be scanned, and the collected samples can include combinations of columns that generate all or part of the data. For example, if database relations 202 consists of one million (1,000,000) rows for a given relation R, the random sample could include 2,000 rows of random samples or reservoir samples for each relation during the collection of statistics or values. The random sample 204 process can be used to train the model, and it may not be necessary to store or maintain training data for continuous cardinality prediction. Training data may also be stored if some models or model training require it.

[0045] In an alternative embodiment, a non-random sampling technique can be used. For example, every 10 rows in the table can be sampled, or 2 rows from every 20 pages of the table can be sampled.

[0046] At position 306, training data is generated. Training data can be generated using a regression model with multiple columns represented as vectors. The training data can represent predicates using multiple columns represented as vectors with values ​​from random samples. The expected or optimal output of the training data can be obtained by calculating or determining selectivity from the sample data. The input to the regression model can include values ​​for each column, and these values ​​can represent, for example, an upper bound (i.e., <=). The output from the regression model can include selectivity, for example, a cardinality relative to the number of eligible rows if multiplied by the size of the table. Selectivity of the predicates, or the selectivity of combined predicates, can represent the statistical correlation of the columns, which are represented as vectors. Selectivity can be a statistical analysis using cardinality values ​​collected from these columns.

[0047] For example, the input to a regression model is a vector of length c, where c is the number of columns in the relation. Training data points can represent any number of predicates. For columns without predicates, the maximum value of the column can be used. For example, considering a table with 3 columns c1, c2, and c3, data points representing c1≤v1 and c3≤v3 can be represented by c1≤v1 and c2≤max(c2) and c3≤v3, where max(c2) is the maximum value in the second column. Data points can be generated randomly, either by considering all combinations of values ​​in the relation or by considering some or some combinations of values ​​in the relation. Data points can be labeled by finding the cumulative frequency function (CFF) frequency points (e.g., the CFF column in the generated training data 206) divided by the total number of rows satisfying the predicates represented in the data point.

[0048] The training data 206 can be obtained by computing the n-dimensional joint cumulative frequency function (CFF) for each row (e.g., rows 1-3 in Table 1) or for each row from the random samples from step 304. To generate. The cumulative frequency function (CFF) can be defined as a non-decreasing and right-continuous function, for example, useful for model learning and approximation.

[0049] Other models besides regression models can be used to build upon a single relationship or multiple relationships, and the other models used can depend on the number of columns used as training data. A single model can be built, or multiple models can be built.

[0050] In step 308, a model is trained to create a Cumulative Frequency Function (CFF) model. The CFF model can take a vector representing the values ​​of the predicates created in step 306, and this model can be created to predict the selectivity of each query predicate. The CFF model can be created or generated using inequality predicates (e.g., predicates with the ≤ operator). Other types of predicates will be provided as alternative embodiments. The CFF model can be built or generated by formulating the problem as a regression problem, which can be used with a regression model to solve regression problems. Data in the relation can be converted to a floating-point representation that allows handling character data types in addition to numeric data types. Many encoding techniques can be used in the conversion process. Additionally, for categorical values, a mapping table can be used for a set of ordinal numbers.

[0051] The Cumulative Frequency Function (CFF) identifies the frequency of values ​​in a relation that are less than a reference value. Consider a relation R with N columns; the CFF for R is defined as follows:

[0052]

[0053] in, It corresponds to a set of values The set of predicates, the set of values It is used together with the ≤ operator in column sets. Above, represented as .here, It is the set of predicates that satisfy the relation. The number of rows in the relation compared to the total number of rows in the relation.

[0054] According to the alternative embodiment, 308a, the functionality of the established Cumulative Frequency Function (CFF) model can be extended to estimate the cardinality of additional Structured Query Language (SQL) operators, such as those used for equality predicates. In Structured Query Language (SQL) form, the Cumulative Frequency Function (CFF) can be expressed as follows:

[0055]

[0056] The functionality of the established Cumulative Frequency Function (CFF) model can be extended to predict additional SQL operators, such as equality predicates and BETWEEN predicates. Since the CFF model established at step 308 is trained on join operators (e.g., ≤), the CFF model at 308 may not directly predict other predicate types. Other predicate types can be rewritten based on the predicate ≤ in order to compute or calculate selectivity.

[0057] To achieve better cardinality estimation of the equation predicate, additional training points can be generated by including a small constant δ. For example, for data points... It can generate additional points. Or it can generate other combinations with a maximum value, such as , , .

[0058] Table 2 provides an example of training data for the cumulative frequency function (CFF) model (e.g., a regression model) of the data provided in Table 1, where δ=0.1.

[0059]

[0060] Extended cumulative frequency function (CFF) models for other predicate types can be shown, for example, in Table 3.

[0061]

[0062] According to an alternative embodiment, 308a, the established Cumulative Frequency Function (CFF) model can be extended to estimate the cardinality of SQL operators using more complex predicates, such as IN, OR, and NOT predicates. These more complex predicates (such as IN, OR, and NOT predicates) can consist of one or more sub-terms that are processed or computed by selectively deriving selectivity from the basic components applied to the Cumulative Frequency Function (CFF) model created in step 308. For example, the basic components of the Cumulative Frequency Function (CFF) of relation R are described above. .

[0063] Examples of applying basic components to more complex predicates (e.g., IN, OR, and NOT) can be seen in...

[0064]

[0065]

[0066] Application to the NOT predicate:

[0067]

[0068] And its application to OR predicates:

[0069]

[0070]

[0071] And its application to the IN predicate:

[0072] .

[0073] Complex predicates can also be considered by combining multiple basic and complex predicates as follows:

[0074] ;

[0075] or

[0076] ;

[0077] or

[0078]

[0079]

[0080] +

[0081] +

[0082]

[0083] +

[0084] .

[0085] Each set of inferences can be computed in a single batch to minimize the overhead of computing multiple inferences for complex scenarios. Computing each batch of inferences in a single batch minimizes the round-trip calls to the Cumulative Frequency Function (CFF) model (i.e., known as inference calls), resulting in reduced computation time.

[0086] According to an alternative embodiment, 308b, the established cumulative frequency function (CFF) model can be extended by estimating the cardinality applied to the joins. A single table model can be stored to capture the distribution of non-joined columns. The join distribution can be estimated using the most frequent value of each joined column determined using a column probability density function (PDF). The probability density function (PDF) can be used to extend the cumulative frequency function (CFF) model to joins. For example, the estimation of the join distribution may involve calculating the selectivity of binary combinations with non-joined columns for each value in the join outcome. For example, suppose there are two tables T1(c1, c2, c3, c4) and T2(c1, c2, c5, c6), where c1 is a joined column and c2 represents a non-joined column. Selectivity = Length(T1(c1) == T1T2(c1) and T1(c2) == T1T2(c2)). The selectivity value is the value calculated and predicted using the established cumulative frequency function (CFF) model.

[0087] Each selectivity is multiplied by the frequency of that particular value. The result can be an array containing every value of c2 in T1 and all the changed values ​​of c1 after the join. The distribution of two unjoined columns can be captured in a similar step or can be captured as the distribution of an n-way join. The c2 selectivity can be used to establish a forward join operation on c2 with another table or with multiple other tables.

[0088] Traditional implementations of using statistical views to improve the estimated cardinality of joins utilize relational database management systems (RDBMS) to leverage multi-column statistics and statistics collected on materialized query tables (MQTs). Instances of RDBMS can include IBM® DB2® (IBM® (IBM DB2 and all IBM DB2-based trademarks and logos are trademarks or registered trademarks of International Business Machines Corporation and / or its subsidiaries)). It can store multi-column statistics and provide the number of distinct groupings of a set of values. The number of distinct groupings can be used to interpret the correlation between multiple equality join predicates applied to each pair of joins, such as T1 JOIN T2 ON T1.A=T2.A AND T1.B=T2.b.

[0089] The established cumulative frequency function (CFF) model can provide additional benefits for pairwise joins because it can be used to account for any local predicates applied to the tables involved in the join. Local predicates can include, for example, range predicates and predicates with expressions. Using multi-column statistics might assume a global consistency across values, which may not be the case for models that lead to reduced error in estimation.

[0090] According to an alternative embodiment, 308c, the established cumulative frequency function (CFF) model can be extended by estimating the cardinality of the lossless joins. The lossless join predicates may, for example, include a star schema. If there are enforced or discovered constraints between the tables, a single model can be defined on all tables in the schema. Alternatively, a single model can be defined on a set of stars within the schema.

[0091] A model can be built for each star, and the optimizer can use or leverage the model to estimate the cardinality of any connection permutations covered by the model. For example, consider a schema containing a fact table F and a set of dimension tables {D1, D2, ..., Dn}. If every connection between the facts and dimensions Di is lossless, a model can be built on the connections between the fact table F and all dimension tables. Predictions for columns or subsets of tables covered by the model can be computed by setting the input to the maximum value of each column that does not participate in the connections.

[0092] At position 310, a prediction of the cardinality of the query predicate is provided as output. The established cumulative frequency function (CFF) model can provide a cardinality estimate that does not require storage, although this cardinality estimate can be stored. While the data can be stored, it is not necessary to store or maintain the training data and cardinality estimate for continuous cardinality prediction for the model.

[0093] Understandable. Figure 2 and 3 The illustration provides only one embodiment and does not imply any limitation on how different embodiments may be implemented. Many modifications can be made to the depicted embodiment based on design and implementation requirements.

[0094] Figure 4 This is an illustrative embodiment of the present invention. Figure 1 Block diagram 900 depicts the internal and external components of a computer. It should be understood that... Figure 4 This illustration provides only one possible implementation and does not imply any limitation regarding the environment in which different embodiments may be implemented. Many modifications can be made to the depicted environment based on design and implementation requirements.

[0095] Data processing systems 902 and 904 represent any electronic device capable of executing machine-readable program instructions. Data processing systems 902 and 904 may represent smartphones, computer systems, PDAs, or other electronic devices. Examples of computing systems, environments, and / or configurations that data processing systems 902 and 904 may represent include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, network PCs, minicomputer systems, and distributed cloud computing environments that include any of the above systems or devices.

[0096] The user client computer 102 and the network server 112 may include Figure 4 The internal components 902a, 902b and external components 904a, 904b of the corresponding groups are shown. Each group of internal components 902a, 902b includes one or more processors 906, one or more computer-readable RAMs 908 and one or more computer-readable ROMs 910 on one or more buses 912, as well as one or more operating systems 914 and one or more computer-readable tangible storage devices 916. One or more operating systems 914, software programs 108 and query optimizers 110a in client computer 102 and query optimizers 110b in network server 112 may be stored on one or more computer-readable tangible storage devices 916 for execution by one or more processors 906 via one or more RAMs 908 (which typically include cache memory). Figure 4In the illustrated embodiment, each computer-readable tangible storage device 916 is a disk storage device of an internal hard disk drive. Alternatively, each computer-readable tangible storage device 916 is a semiconductor storage device, such as ROM 910, EPROM, flash memory, or any other computer-readable tangible storage device capable of storing computer programs and digital information.

[0097] Each set of internal components 902a, 902b also includes a R / W driver or interface 918 for reading from and writing to one or more portable computer-readable tangible storage devices 920, such as CD-ROMs, DVDs, Memory Sticks, magnetic tapes, disks, optical discs, or semiconductor storage devices. Software programs (such as software program 108 and query optimization programs 110a, 110b) can be stored on one or more corresponding portable computer-readable tangible storage devices 920, read from and loaded into corresponding computer-readable tangible storage devices 916 via the corresponding R / W driver or interface 918.

[0098] Each set of internal components 902a, 902b may also include a network adapter (or switch port card) or interface 922, such as a TCP / IP adapter card, a wireless Wi-Fi interface card, or a 3G or 4G wireless interface card, or other wired or wireless communication links. Software program 108 and query optimization program 110a in client computer 102 and query optimization program 110b in network server computer 112 can be downloaded from an external computer (e.g., a server) via a network (e.g., the Internet, a local area network, or another wide area network) and the corresponding network adapter or interface 922. From the network adapter (or switch port adapter) or interface 922, software program 108 and query optimization program 110a in client computer 102 and query optimization program 110b in network server computer 112 are loaded into the corresponding hard disk drive 916. The network may include copper wire, fiber optic, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge servers.

[0099] Each set of external components 904a, 904b may include a computer display monitor 924, a keyboard 926, and a computer mouse 928. External components 904a, 904b may also include a touchscreen, a virtual keyboard, a touchpad, a pointing device, and other human-machine interface devices. Each set of internal components 902a, 902b also includes a device driver 930 that interfaces with the computer display monitor 924, keyboard 926, and computer mouse 928. The device driver 930, R / W driver or interface 918, and network adapter or interface 922 include hardware and software (stored in storage device 916 and / or ROM 910).

[0100] It should be understood in advance that while this disclosure includes a detailed description of cloud computing, the implementation of the teachings cited herein is not limited to cloud computing environments. Rather, embodiments of the invention can be implemented in conjunction with any other type of computing environment now known or developed hereafter.

[0101] Cloud computing is a service delivery model that enables convenient, on-demand network access to a shared pool of configurable computing resources, such as networks, network bandwidth, servers, processing power, storage, applications, virtual machines, and services. This shared pool of configurable computing resources can be rapidly provisioned and released with minimal management effort or interaction with service providers. This cloud model may include at least five features, at least three service models, and at least four deployment models.

[0102] The features are as follows:

[0103] On-demand self-service: Cloud consumers can unilaterally and automatically provide computing power, such as server time and network storage, as needed, without requiring human interaction with the service provider.

[0104] Extensive network access: Capabilities are available through the network and accessed via standard mechanisms that facilitate the use of heterogeneous thin client platforms or thick client platforms (e.g., mobile phones, laptops, and PDAs).

[0105] Resource pooling: A provider's computing resources are pooled to serve multiple consumers using a multi-tenant model, where different physical and virtual resources are dynamically assigned and reassigned as needed. There is a sense of location independence because consumers typically do not have control or knowledge of the exact location of the resources provided, but may be able to specify the location at a higher level of abstraction (e.g., country, state, or data center).

[0106] Rapid flexibility: The ability to provide capacity quickly and flexibly, automatically scaling down and up rapidly in some situations to scale up rapidly. For consumers, the available supply capacity often appears unlimited and can be purchased in any quantity at any time.

[0107] Measuring services: Cloud systems automatically control and optimize resource usage by leveraging metering capabilities at a level of abstraction appropriate to the service type (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency to both the providers and consumers of the services being utilized.

[0108] The service model is as follows:

[0109] Software as a Service (SaaS): This provides consumers with the ability to use the provider's applications running on cloud or hybrid cloud infrastructure. The applications can be accessed from different client devices via a thin client interface such as a web browser (e.g., web-based email). Consumers do not manage or control the underlying cloud infrastructure, including the network, servers, operating system, storage, or even individual application capabilities, with possible exceptions such as limited user-specific application configuration settings.

[0110] Platform as a Service (PaaS): This provides consumers with the ability to deploy applications created or acquired by the consumer, using programming languages ​​and tools supported by the provider, onto cloud infrastructure. Consumers do not manage or control the underlying cloud infrastructure, including networks, servers, operating systems, or storage, but they have control over the deployed applications and the configuration of possible application hosting environments.

[0111] Analytics as a Service (AaaS): This provides consumers with the ability to access an analytics platform using a web-based or cloud-based network (i.e., infrastructure). The analytics platform may include access to analytics software resources or access to related databases, corpora, servers, operating systems, or storage. Consumers do not manage or control the underlying web-based or cloud-based infrastructure (including databases, corpora, servers, operating systems, or storage), but they have control over the deployed applications and, possibly, the configuration of the application hosting environment.

[0112] Infrastructure as a Service (IaaS): This provides consumers with the capability to deliver processing, storage, networking, and other basic computing resources that enable them to deploy and run arbitrary software, which may include operating systems and applications. Consumers do not manage or control the underlying cloud infrastructure, but rather have control over the operating system, storage, deployed applications, and potentially limited control over selected networking components (e.g., host firewalls).

[0113] The deployment model is as follows:

[0114] Private cloud: A cloud infrastructure that operates solely for an organization. It can be managed by the organization or a third party and can exist on-site or off-site.

[0115] Community cloud: A cloud infrastructure shared by several organizations and supporting a specific community with shared concerns (e.g., tasks, security requirements, policies, and compliance considerations). It can be managed by an organization or a third party and can exist on-site or off-site.

[0116] Public cloud: Makes cloud infrastructure available to the public or large industry groups and is owned by an organization that sells cloud services.

[0117] Hybrid cloud: A cloud infrastructure is a combination of two or more clouds (private, community, or public) that remain a single entity but are bound together by standardized or proprietary technologies that enable data and applications to be ported (e.g., cloud bursting for load balancing between clouds).

[0118] Cloud computing environments are service-oriented, focusing on statelessness, loose coupling, modularity, and semantic interoperability. At the heart of cloud computing is the infrastructure comprising a network of interconnected nodes.

[0119] Now for reference Figure 5 This describes an illustrative cloud computing environment 1000. As shown, the cloud computing environment 1000 includes one or more cloud computing nodes 100 that can communicate with local computing devices used by cloud consumers, such as, for example, personal digital assistants (PDAs) or cellular phones 1000A, desktop computers 1000B, laptop computers 1000C, and / or automotive computer systems 1000N. The nodes 100 can communicate with each other. They can be physically or virtually grouped (not shown) in one or more networks, such as private clouds, community clouds, public clouds, or hybrid clouds, or combinations thereof, as described above. This allows the cloud computing environment 1000 to provide infrastructure, platforms, and / or software as services that cloud consumers do not need to maintain for them on their local computing devices. It should be understood that... Figure 5 The types of computing devices 1000A-N shown are intended to be illustrative only, and computing node 100 and cloud computing environment 1000 can communicate with any type of computerized device via any type of network and / or network-addressable connection (e.g., using a web browser).

[0120] Now for reference Figure 6 This illustrates a set of functional abstraction layers 1100 provided by the cloud computing environment 1000. It should be understood in advance that... Figure 6 The components, layers, and functions shown are intended to be illustrative only, and embodiments of the invention are not limited thereto. As described, the following layers and corresponding functions are provided:

[0121] The hardware and software layer 1102 includes hardware and software components. Examples of hardware components include: a mainframe 1104; a server 1106 based on a RISC (Reduced Instruction Set Computer) architecture; a server 1108; a blade server 1110; a storage device 1112; and a network and networking component 1114. In some embodiments, the software components include network application server software 1116 and database software 1118.

[0122] The virtualization layer 1120 provides an abstraction layer from which the following examples of virtual entities can be provided: virtual server 1122; virtual storage 1124; virtual network 1126, including virtual private network; virtual application and operating system 1128; and virtual client 1130.

[0123] In one example, management layer 1132 may provide the functionality described below. Resource Provisioning 1134 provides dynamic procurement of computing resources and other resources used to perform tasks within the cloud computing environment. Metering and Pricing 1136 provides cost tracking as resources are utilized within the cloud computing environment and bills or invoices for the consumption of these resources. In one example, these resources may include application software licenses. Security provides authentication for cloud consumers and tasks, as well as protection for data and other resources. User Portal 1138 provides access to the cloud computing environment for consumers and system administrators. Service Level Management 1140 provides cloud resource allocation and management to ensure that required service levels are met. Service Level Agreement (SLA) Planning and Fulfillment 1142 provides pre-scheduling and procurement of cloud resources, anticipating future requirements for those resources according to the SLA.

[0124] Workload layer 1144 provides examples of functionalities that can leverage a cloud computing environment. Examples of workloads and functionalities that can be provided from this layer include: mapping and navigation 1146; software development and lifecycle management 1148; virtual classroom education delivery 1150; data analytics and processing 1152; transaction processing 1154; and query optimization 1156. Query optimization procedures 110a and 110b provide a method for forecast cardinality estimation.

[0125] This invention can be a system, method, and / or computer program product at any possible level of technical detail integration. The computer program product may include a computer-readable storage medium (or media) having computer-readable program instructions thereon for causing a processor to execute aspects of the invention.

[0126] Computer-readable storage media can be tangible devices capable of retaining and storing instructions for use by an instruction execution device. Computer-readable storage media can be, for example, but not limited to, electronic storage devices, magnetic storage devices, optical storage devices, electromagnetic storage devices, semiconductor storage devices, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of computer-readable storage media includes: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read-only memory (CD-ROM), digital universal disk (DVD), memory sticks, floppy disks, mechanical encoding devices such as punched cards or protrusions in slots having instructions recorded thereon, and any suitable combination of the foregoing. As used herein, computer-readable storage media should not be construed as transient signals themselves, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses passing through fiber optic cables), or electrical signals emitted through wires.

[0127] The computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to a corresponding computing / processing device, or downloaded via a network (e.g., the Internet, a local area network, a wide area network, and / or a wireless network) to an external computer or external storage device. The network may include copper transmission cables, optical transmission fibers, wireless transmissions, routers, firewalls, switches, gateway computers, and / or edge servers. A network adapter card or network interface in each computing / processing device receives computer-readable program instructions from the network and forwards them to a computer-readable storage medium within the corresponding computing / processing device.

[0128] Computer-readable program instructions used to perform the operations of this invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, integrated circuit configuration data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages ​​(such as Smalltalk, C++, etc.) and procedural programming languages ​​(such as the "C" programming language, the Python programming language, or similar programming languages). The computer-readable program instructions may be executed entirely on a user's computer, partially on a user's computer, as a standalone software package, partially on a user's computer and partially on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer via any type of network (including a local area network (LAN) or a wide area network (WAN)) or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs) may execute computer-readable program instructions by utilizing state information from the computer-readable program instructions to personalize the electronic circuitry in order to perform aspects of this invention.

[0129] The present invention will now be described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It should be understood that each block of the flowchart illustrations and / or block diagrams, as well as combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer-readable program instructions.

[0130] These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions / actions specified in one or more boxes of a flowchart and / or block diagram. These computer-readable program instructions may also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and / or other device to operate in a particular manner, such that the computer-readable storage medium storing the instructions comprises an article of manufacture containing instructions that implement aspects of the functions / actions specified in one or more boxes of a flowchart and / or block diagram.

[0131] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other device to produce computer-implemented processing, such that the instructions executed on the computer, other programmable apparatus, or other device perform the functions / actions specified in one or more boxes of a flowchart and / or block diagram.

[0132] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. Each block in a flowchart or block diagram may represent a module, segment, or portion of instructions, including one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a non-linear order. For example, depending on the function involved, two consecutively shown blocks may actually be executed substantially simultaneously, or these blocks may sometimes be executed in reverse order. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or action or executes a combination of dedicated hardware and computer instructions.

[0133] Various embodiments of the invention have been described for illustrative purposes, but are not intended to be exhaustive or limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope of the described embodiments. The terminology used herein has been chosen to best explain the principles of the embodiments, their practical application, or technical improvements to technologies found in the market, or to enable those skilled in the art to understand the embodiments disclosed herein.

Claims

1. A method for using machine learning for cardinality estimation, the method comprising: Accessing database relationships; Collect random samples from each of the database relationships; Training data for creating a cumulative frequency function model is generated from the random samples collected from each of the database relationships; The cumulative frequency function model is created from the generated training data using machine learning; and Predict cardinality estimates for multiple structured query language predicates, the cardinality estimates being based on the cumulative frequency function model. The cumulative frequency function is defined as follows:

2. The method according to claim 1, further comprising: The cumulative frequency function model is extended to predict cardinality estimates of some of the predicates in the plurality of structured query language predicates, wherein some of the predicates in the plurality of structured query language predicates are equality predicates.

3. The method according to claim 1, further comprising: The cumulative frequency function model is extended to predict the cardinality estimate of the connective predicate.

4. The method according to claim 1, further comprising: The cumulative frequency function model is extended to predict the cardinality estimate of lossless connection predicates.

5. The method according to claim 1, wherein, Some of the operators in the multiple Structured Query Language operators are inequality predicates.

6. The method according to claim 1, wherein, The training data is generated by using multiple columns of a vector represented as values ​​from the random samples.

7. The method according to claim 6, wherein, The value from the random sample represents the predicate and the optimal output of the training data, wherein the optimal output of the training data is obtained by calculating selectivity from the random sample.

8. The method according to claim 1, wherein, The cumulative frequency function model determines the selectivity of each query predicate.

9. The method according to claim 1, wherein, The database relationships include numerical data and character data.

10. A computer system that uses machine learning for cardinality estimation, comprising: One or more processors, one or more computer-readable storage devices, one or more computer-readable tangible storage media, and program instructions stored on at least one of the one or more computer-readable tangible storage media for execution by at least one of the one or more processors via at least one of the one or more computer-readable storage devices, wherein the computer system is capable of performing a method including the following steps: Accessing database relationships; Collect random samples from each of the database relationships; Training data for creating a cumulative frequency function model is generated from the random samples collected from each of the database relationships; The cumulative frequency function model is created from the generated training data using machine learning; and Predict cardinality estimates for multiple structured query language predicates, the cardinality estimates being based on the cumulative frequency function model. The cumulative frequency function is defined as follows:

11. The computer system of claim 10, further comprising: The cumulative frequency function model is extended to predict cardinality estimates of some of the predicates in the plurality of structured query language predicates, wherein some of the predicates in the plurality of structured query language predicates are equality predicates.

12. The computer system of claim 10, further comprising: The cumulative frequency function model is extended to predict the cardinality estimate of the connective predicate.

13. The computer system of claim 10, further comprising: The cumulative frequency function model is extended to predict the cardinality estimate of lossless connection predicates.

14. The computer system according to claim 10, wherein, Some of the structured query language operators are inequality predicates.

15. The computer system according to claim 10, wherein, The training data is generated by using multiple columns of a vector that are represented as values ​​from the random samples.

16. The computer system according to claim 15, wherein, The value from the random sample represents the predicate and the optimal output of the training data, wherein the optimal output of the training data is obtained by calculating selectivity from the random sample.

17. The computer system according to claim 10, wherein, The cumulative frequency function model determines the selectivity of each query predicate.

18. The computer system according to claim 10, wherein, The database relationships include numerical data and character data.

19. A computer program product that uses machine learning for cardinality estimation, comprising: One or more computer-readable tangible storage media and program instructions stored on at least one of the one or more computer-readable tangible storage media, the program instructions being executable by a processor to cause the processor to perform a method comprising: Accessing database relationships; Collect random samples from each of the database relationships; Training data for creating a cumulative frequency function model is generated from the random samples collected from each of the database relationships; The cumulative frequency function model is created from the generated training data using machine learning; and Predict cardinality estimates for multiple structured query language predicates, the cardinality estimates being based on the cumulative frequency function model. The cumulative frequency function is defined as follows:

20. The computer program product of claim 19, further comprising: The cumulative frequency function model is extended to predict cardinality estimates of some of the predicates in the plurality of structured query language predicates, wherein some of the predicates in the plurality of structured query language predicates are equality predicates.

21. The computer program product of claim 19, further comprising: The cumulative frequency function model is extended to predict the cardinality estimate of the connective predicate.

22. The computer program product of claim 19, further comprising: The cumulative frequency function model is extended to predict the cardinality estimate of lossless connection predicates.

23. The computer program product according to claim 19, wherein, Some of the operators in the multiple Structured Query Language operators are inequality predicates.

24. The computer program product according to claim 19, wherein, The training data is generated by using multiple columns of a vector that are represented as values ​​from the random samples.

25. The computer program product according to claim 24, wherein, The value from the random sample represents the predicate and the optimal output of the training data, wherein the optimal output of the training data is obtained by calculating selectivity from the random sample.