Pre-computation-based query methods and systems
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI JIAOTONG UNIV
- Filing Date
- 2023-05-05
- Publication Date
- 2026-06-30
Figure CN118897843B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of data query technology, specifically to a query method and system based on pre-computation. More particularly, it relates to a method for accelerating queries based on pre-computation in scenarios involving online analytical processing of multi-source data.
Background Technology
[0002] With the explosive growth of business data and the expansion of company organizational structures, data management platforms face increasing problems and challenges. There is a demand for cost-effective and efficient methods for rapid querying and complex analysis of massive amounts of data. Currently, data warehouses and data lakes are active research areas in data storage and management. A data warehouse is a traditional data storage architecture. After extraction, transformation, and loading, it effectively supports online analytical processing (OLAP). It is primarily used for processing structured data and offers high data consistency and controllability. However, it suffers from high infrastructure investment costs, cumbersome subsequent maintenance, and poor data timeliness. A data lake, on the other hand, is a relatively new data storage architecture. It is primarily used for processing semi-structured and unstructured data and is highly scalable, enabling rapid and flexible ingestion of various data types. However, it is significantly lacking in data aggregation and processing capabilities and is difficult to manage.
[0003] To address the aforementioned issues, the lakehouse architecture emerged. Built upon the low-cost data storage architecture of a data lake, it inherits the data processing and management functions of a data warehouse, providing strong flexibility and data management capabilities. While the lakehouse architecture appears sufficient to meet current needs, it still leaves significant room for optimization. First, although some query optimization techniques exist for the lakehouse architecture, its development is relatively recent, its ecosystem is less mature than that of data warehouses, and the corresponding technologies are not yet fully developed. Second, for large-scale cloud object storage, analysis tasks that start from raw data are extremely time-consuming and place a heavy load on cloud servers. Therefore, designing a suitable caching strategy and auxiliary structures to accelerate query analysis tasks is essential. Materialized views, a pre-computation technique widely used in databases, can be used to help improve query performance.
[0004] In multi-source data online analytical processing (OLAP) scenarios, different users, business lines, and departments typically need to generate specific business reports periodically. They send the corresponding OLAP queries to the cloud and have certain quality of service (QoS) requirements, primarily time-related. However, OLAP queries often involve complex multi-table join and aggregation operations, requiring substantial computing resources and significant time. Furthermore, such queries cannot reuse identical intermediate results, so two similar query requests repeatedly compute the same data.
[0005] Therefore, there is an urgent need in the market for a pre-computation-based query method that can improve the performance of online analytical processing in this scenario and increase query speed.
Summary of the Invention
[0006] In view of the shortcomings of the prior art, the purpose of this invention is to provide a query method and system based on pre-computation.
[0007] A pre-computation-based query method provided by the present invention includes:
[0008] Indexing step: Construct a hierarchical index based on the information of the pre-computed tables, and filter out through the index a candidate set of pre-computed tables that meet preset conditions. In addition, when a user creates, updates, or deletes a pre-computed table, update the index information corresponding to that pre-computed table;
[0009] Pre-computation step: Based on the candidate set of pre-computed tables, perform custom matching and rewriting of the query SQL statement on top of the current query engine;
[0010] Task scheduling step: Dynamically schedule pre-computation tasks based on historical workloads and determine the timing for initializing and updating the pre-computed tables.
[0011] Preferably, the hierarchical index has at least three levels, each level including a Hasse diagram and corresponding filtering conditions, and each level performs a matching check;
[0012] The matching check includes data source matching, predicate matching, and output expression matching.
[0013] Preferably, the indexing step includes the following sub-steps:
[0014] Step S1.1: Receive the query SQL statement submitted by the user and parse the query SQL statement into a logical plan;
[0015] Step S1.2: Obtain key information from the data service architecture according to the logical plan;
[0016] Step S1.3: Filter the key information layer by layer through the data source layer index, predicate layer index, and output expression layer index to obtain a candidate set of pre-computed tables that meets the filtering conditions of the corresponding layer index.
[0017] Preferably, the custom matching and rewriting are performed using multiple types of matchers and rewriters, including predicate matching and rewriting, projection matching and rewriting, join matching and rewriting, aggregation matching and rewriting, and grouping matching and rewriting.
[0018] The candidate set of pre-computed tables is matched through a pipeline of multiple matchers and multiple rewriters. Only when a query SQL statement passes the checks of all matchers is its logical plan rewritten based on the pre-computed tables in the data warehouse; otherwise, the original SQL statement is used to query the original tables in the data lake.
[0019] Preferably, the task scheduling step includes a resource monitoring sub-step and a scheduling sub-step;
[0020] Resource monitoring sub-step: Monitor the cluster's resource utilization rate, and select the pre-computed task to be scheduled based on the scheduling algorithm when the resource utilization rate is low;
[0021] Scheduling sub-step: Record the historical workload of query statements within a preset time period, select the optimal query template based on the historical workload, encapsulate it into a pre-computation task, and schedule it.
[0022] According to the present invention, a pre-computation-based query system includes:
[0023] Index module: Constructs a hierarchical index based on the information of the pre-computed tables, and filters out through the index a candidate set of pre-computed tables that meet preset conditions. In addition, when a user creates, updates, or deletes a pre-computed table, the index information corresponding to that pre-computed table is updated;
[0024] Pre-computation module: Based on the candidate set of pre-computed tables, performs custom matching and rewriting of the query SQL statement on top of the current query engine;
[0025] Task scheduling module: Dynamically schedules pre-computation tasks based on historical workloads and determines the timing for initializing and updating the pre-computed tables.
[0026] Preferably, the hierarchical index has at least three levels, each level including a Hasse diagram and corresponding filtering conditions, and each level performs a matching check;
[0027] The matching check includes data source matching, predicate matching, and output expression matching.
[0028] Preferably, the index module includes the following sub-modules:
[0029] Module S1.1: Receives the query SQL statement submitted by the user and parses the query SQL statement into a logical plan;
[0030] Module S1.2: Obtain key information from the data service architecture according to the logical plan;
[0031] Module S1.3: Filters the key information layer by layer through the data source layer index, predicate layer index, and output expression layer index to obtain a candidate set of pre-computed tables that meets the filtering conditions of the corresponding layer index.
[0032] Preferably, the custom matching and rewriting are performed using multiple types of matchers and rewriters, including predicate matching and rewriting, projection matching and rewriting, join matching and rewriting, aggregation matching and rewriting, and grouping matching and rewriting.
[0033] The candidate set of pre-computed tables is matched through a pipeline of multiple matchers and multiple rewriters. Only when a query SQL statement passes the checks of all matchers is its logical plan rewritten based on the pre-computed tables in the data warehouse; otherwise, the original SQL statement is used to query the original tables in the data lake.
[0034] Preferably, the task scheduling module includes a resource monitoring submodule and a scheduling submodule;
[0035] Resource monitoring submodule: Monitors the resource utilization rate of the cluster, and selects the pre-computed task to be scheduled according to the scheduling algorithm when the resource utilization rate is low;
[0036] Scheduling submodule: Records the historical workload of query statements within a preset time period, selects the optimal query template based on the historical workload, encapsulates it into a pre-computed task, and schedules it.
[0037] Compared with the prior art, the present invention has the following beneficial effects:
[0038] 1. This invention optimizes the query engine based on pre-computation and trades space for time to improve the performance of online analytical processing tasks under the lakehouse architecture.
[0039] 2. This invention establishes an efficient index for pre-computed tables. Without an index, the pre-computation-optimized query engine would have to traverse all pre-computed tables and analyze each one, which is very time-consuming. Therefore, this invention reduces the matching process to finding superset and subset relationships, and constructs a hierarchical index for quickly filtering candidate sets of pre-computed tables, further improving the performance of online analytical processing tasks.
[0040] 3. Based on historical workloads, this invention schedules high-priority pre-computation tasks only when the cluster's resource utilization is low, which keeps data timeliness within a certain range and makes use of the data center's resources during off-peak periods.
Attached Figure Description
[0041] Other features, objects, and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
[0042] Figure 1 is a schematic diagram of the system architecture of the present invention.
[0043] Figure 2 is a schematic diagram of the three-level indexing process in this invention.
[0044] Figure 3 is a schematic diagram of the logical plan rewriting process in this invention.
Detailed Implementation
[0045] The present invention will now be described in detail with reference to specific embodiments. These embodiments will help those skilled in the art to further understand the present invention, but do not limit the invention in any way. It should be noted that those skilled in the art can make several changes and improvements without departing from the concept of the present invention. These all fall within the protection scope of the present invention.
[0046] This invention relates to lakehouse (integrated data lake and data warehouse) architecture, pre-computed table management, and task scheduling. Specifically, in enterprise online analytical processing (OLAP) scenarios, to serve users' specific business reporting needs, the cloud matches and rewrites complex SQL queries and uses pre-computed tables to compute and return the query results, thereby accelerating queries.
[0047] As shown in Figures 1 to 3, a pre-computation-based query method provided according to the present invention includes:
[0048] Indexing step: A hierarchical index is constructed based on the information from the pre-computed tables, and a candidate set of pre-computed tables that meet preset conditions is filtered out using this index. The step also covers index management: when a user creates, updates, or deletes a pre-computed table, the corresponding index information is updated to keep the index valid. The preset conditions are the filtering conditions of each level of the hierarchical index, used to filter out pre-computed tables that do not meet them. The hierarchical index has at least three levels, each including a Hasse diagram and corresponding filtering conditions, and each level performs a matching check. The matching check includes data source matching, predicate matching, and output expression matching. The Hasse diagram and the filtering conditions of each level are constructed from the information of all pre-computed tables. Specifically, as shown in Figure 2, the indexing step includes the following sub-steps:
[0049] Step S1.1: Receive the query SQL statement submitted by the user and parse the query SQL statement into a logical plan.
[0050] Step S1.2: Obtain key information from the data service architecture according to the logical plan. Key information includes the original data tables referenced by the pre-computed tables, the conditions contained in the predicate statements, and the output expressions.
[0051] Step S1.3: Filter the key information layer by layer through the data source layer index, predicate layer index, and output expression layer index to obtain a pre-computed table candidate set that meets the filtering conditions of the corresponding layer index.
[0052] Specifically, the filtering conditions for each level of the index are as follows: Data source layer: the data table referenced by the query SQL should be a subset of the data tables referenced by the pre-computed table; Predicate layer: each equivalence class of the query SQL needs to be a superset of some equivalence class in the pre-computed table, and the value range of that equivalence class should be a subset of the value range of the corresponding equivalence class; Output expression layer: both the set of group by fields and the set of output fields in the query SQL should be subsets of the pre-computed table. More specifically, the formal settings for matching at each level are described below:
[0053] A three-tiered index—data source layer, predicate layer, and output expression layer—was constructed using data from the pre-computed table. The filtering conditions for each layer were abstracted into a problem of searching for supersets / subsets within the index. Before rewriting, candidate sets of pre-computed tables that meet the conditions are first filtered based on the index. This accelerates the search process and reduces the number of candidate sets, thereby indirectly improving rewriting efficiency.
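The layer-by-layer filtering described above can be illustrated with a minimal Python sketch. It is an illustration under simplifying assumptions, not the patented implementation: the `TableInfo` layout and all function names are hypothetical, each equivalence class is reduced to a single column with a closed value range, and a linear scan stands in for the Hasse-diagram lookup.

```python
from dataclasses import dataclass


@dataclass
class TableInfo:
    """Key information of a query or a pre-computed table (hypothetical layout)."""
    name: str
    sources: frozenset   # original data tables referenced
    ranges: dict         # column -> (low, high) closed value range
    outputs: frozenset   # output fields
    group_by: frozenset  # group by fields


def source_layer(query, table):
    # Tables referenced by the query must be a subset of those
    # referenced by the pre-computed table.
    return query.sources <= table.sources


def predicate_layer(query, table):
    # Every range the pre-computed table restricts must also be restricted
    # by the query, and the query's range must lie inside the table's range.
    for col, (t_lo, t_hi) in table.ranges.items():
        if col not in query.ranges:
            return False
        q_lo, q_hi = query.ranges[col]
        if q_lo < t_lo or q_hi > t_hi:
            return False
    return True


def output_layer(query, table):
    # Output fields and group by fields of the query must be subsets.
    return query.outputs <= table.outputs and query.group_by <= table.group_by


def filter_candidates(query, tables):
    """Apply the three layers in order, shrinking the candidate set each time."""
    candidates = list(tables)
    for layer in (source_layer, predicate_layer, output_layer):
        candidates = [t for t in candidates if layer(query, t)]
    return candidates
```

Because each layer only sees the survivors of the previous one, the more expensive matcher pipeline later runs on a much smaller candidate set.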
[0054] Predicate matching: Predicate matching concerns the WHERE clause in SQL, whose predicates are combined with AND and OR operations. Using conjunction and disjunction from discrete mathematics, the WHERE clause of each SQL statement can be represented as $W = P_1 \wedge P_2 \wedge \dots \wedge P_n$. Predicates can be divided into three categories: equivalence predicates $P_{equal}$, range predicates $P_{range}$, and other predicates $P_{other}$. The predicate can then be represented as $P_{equal} \wedge P_{range} \wedge P_{other}$, which satisfies the following constraints:
[0055]
[0056] Here, $P_i^q$, $i \in \{equal, range, other\}$, represents the predicate conjunctive normal form of the query SQL, and its counterpart represents the predicate conjunctive normal form used to create the pre-computed table. Successful predicate matching requires satisfying...
[0057] Data source matching: The data tables required by the query SQL can be derived from the data tables of a pre-computed table. Let the set of data tables referenced by the query SQL be $T_1 = \{t_1, t_2, \dots, t_n\}$ and the set of data tables referenced by the pre-computed table be $T_2 = \{s_1, s_2, \dots, s_m\}$. When $T_1 = T_2$, the data sources are obviously consistent; when $T_1 \subset T_2$, let $R = T_2 \setminus T_1 = \{r_1, r_2, \dots, r_{m-n}\}$.
[0058] In that case, the data source match succeeds only if the tables in $R$ do not change the number of rows in the data table.
[0059] Output expression matching: The output fields of the query SQL can be derived from the output columns of a pre-computed table. Let the set of output expressions of the query SQL be $C_1 = \{c_{11}, c_{12}, \dots, c_{1m}\}$ and the set of fields in its group by clause be $G_1 = \{g_{11}, g_{12}, \dots, g_{1p}\}$. Let the set of output expressions of the pre-computed table be $C_2 = \{c_{21}, c_{22}, \dots, c_{2m}\}$ and the set of fields in its group by clause be $G_2 = \{g_{21}, g_{22}, \dots, g_{2q}\}$. Successful output expression matching then requires $G_1 = G_2$.
[0060] Pre-computation step: Based on the candidate set of pre-computed tables, custom matching and rewriting of the query SQL statement are performed on top of the current query engine, such as the Spark SQL query engine. Custom matching and rewriting are performed through multiple types of matchers and rewriters, including predicate matching and rewriting, projection matching and rewriting, join matching and rewriting, aggregation matching and rewriting, and grouping matching and rewriting. The candidate set of pre-computed tables is matched through a pipeline of multiple matchers and rewriters. Only when a query SQL statement passes all matcher checks is its logical plan rewritten based on the pre-computed tables in the data warehouse; otherwise, the original SQL statement is used to query the original tables in the data lake. When subsequent query SQL tasks arrive, they are first parsed into a logical plan, then rewritten into a new logical plan based on the pre-computed tables, and finally handed over to the computation engine for calculation.
[0061] By integrating pre-computation technology with the Spark SQL engine, a query SQL statement is first parsed into a logical plan. Then, based on the pre-computation table and custom rewriting requirements and matcher rewriter, the logical plan is rewritten into a new logical plan and handed over to the computation engine for calculation. The performance of online analytical processing is optimized by using the pre-computation table to trade space for time.
[0062] Specifically, the matchers and rewriters are the core of the pre-computation optimization technique. They are organized as a pipeline, and the query SQL is processed by these objects in sequence. Two common parent classes, ExpressionMatcher and ExpressionRewriter, are abstracted for all matchers and rewriters; each concrete matcher or rewriter inherits from the corresponding parent class and overrides its logic according to its own characteristics. To let the stages of the pipeline share context, this invention abstracts a rewrite context, rewriteContext, which stores information about the pre-computed table and the various components of the query SQL during pipeline operation, for use by the different matching and rewriting logics. The components into which the pre-computed table and the query SQL are split are shown in Table 1 below:
[0063] Table 1
[0064]
Member | Meaning
queryPredicate | Predicate of the query statement in conjunctive normal form
tablePredicate | Predicate of the pre-computed table in conjunctive normal form
queryProjection | Output fields of the query statement
tableProjection | Output fields of the pre-computed table
queryGroup | Grouping fields of the query statement
tableGroup | Grouping fields of the pre-computed table
queryAggregate | Aggregate functions of the query statement
tableAggregate | Aggregate functions of the pre-computed table
queryJoin | Join conditions of the query statement
tableJoin | Join conditions of the pre-computed table
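The pipeline of matchers and rewriters sharing a rewrite context can be sketched as follows. This is a toy illustration in Python rather than the Spark SQL implementation: the class names ExpressionMatcher and ExpressionRewriter and the member names come from the description, while `RewriteContext` as a dataclass, `ProjectionMatcher`, `ProjectionRewriter`, `run_pipeline`, and the list-of-tuples "logical plan" are hypothetical simplifications.

```python
from dataclasses import dataclass, field


@dataclass
class RewriteContext:
    """Shared state for the pipeline (only two Table 1 members shown)."""
    queryProjection: list
    tableProjection: list
    mappings: dict = field(default_factory=dict)  # filled by matchers


class ExpressionMatcher:
    def match(self, ctx):
        raise NotImplementedError


class ExpressionRewriter:
    def rewrite(self, plan, ctx):
        raise NotImplementedError


class ProjectionMatcher(ExpressionMatcher):
    def match(self, ctx):
        # All query output fields must exist in the pre-computed table.
        ok = set(ctx.queryProjection) <= set(ctx.tableProjection)
        if ok:
            ctx.mappings["projection"] = {c: c for c in ctx.queryProjection}
        return ok


class ProjectionRewriter(ExpressionRewriter):
    def rewrite(self, plan, ctx):
        # Append a new Project node built from the recorded mapping.
        return plan + [("Project", sorted(ctx.mappings["projection"].values()))]


def run_pipeline(ctx, stages, table_name):
    """Return a rewritten toy plan, or None to fall back to the original SQL."""
    # Every matcher must pass before any rewriter runs.
    if not all(matcher.match(ctx) for matcher, _ in stages):
        return None
    plan = [("Scan", table_name)]
    for _, rewriter in stages:
        plan = rewriter.rewrite(plan, ctx)
    return plan
```

A `None` result corresponds to the fallback path in which the original SQL statement queries the original tables in the data lake.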
[0065] The predicate part compares `queryPredicate` and `tablePredicate` to see if they meet the rewrite requirements. In the matcher, for equality predicates, if the query statement and the pre-computed table's equality predicates are completely identical, the predicate is deleted; otherwise, it is retained only if it exists in the query statement. For range predicates, it checks if the range of each equivalence class in the query SQL is a subset of the range of each equivalence class in the pre-computed table. For other predicates, they are converted to string form for comparison. To ensure the validity of the rewritten logical plan, all conditions appearing in the predicates must also appear in the output expression fields of the pre-computed table; otherwise, the rewritten logical plan will fail to be executed correctly by the Spark engine because the pre-computed table lacks these fields. Semantic comparisons are performed on the above predicates; if they are identical, the field mapping is recorded. In the rewriter, for the above Filter operators, the field references within the operators are replaced according to the field mappings recorded in the matcher, while the position of the operators in the logical plan remains unchanged.
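The equality and range checks of the predicate matcher can be sketched as below. This is a hedged illustration, not the patented logic: equality predicates are assumed to be already normalized to strings, ranges are closed intervals per column, the requirement that the table's equality predicates all appear in the query is an assumption made for the sketch, and `match_predicates` and `range_contained` are hypothetical names.

```python
import math


def range_contained(query_rng, table_rng):
    """True when the query interval lies inside the table interval."""
    q_lo, q_hi = query_rng
    t_lo, t_hi = table_rng
    return t_lo <= q_lo and q_hi <= t_hi


def match_predicates(query_eq, table_eq, query_rng, table_rng):
    """Return the residual predicates to keep above the pre-computed table,
    or None when the predicates do not match."""
    # Assumption: a table equality predicate missing from the query means
    # the table dropped rows that the query still needs.
    if not table_eq <= query_eq:
        return None
    for col, t_rng in table_rng.items():
        # An unrestricted query column is treated as an unbounded range.
        q_rng = query_rng.get(col, (-math.inf, math.inf))
        if not range_contained(q_rng, t_rng):
            return None
    # Identical predicates are deleted; the rest stay only if the query has them.
    residual_eq = query_eq - table_eq
    residual_rng = {c: r for c, r in query_rng.items() if table_rng.get(c) != r}
    return residual_eq, residual_rng
```

For example, a query restricted to years 2021 to 2022 matches a table built for 2020 to 2023, and the narrower range remains as a residual Filter on the pre-computed table.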
[0066] Projection Part: The projection part compares queryProjection and tableProjection to see if they meet the rewrite requirements. In the matcher, the output fields of the query statement and the pre-computed table are first extracted from the context. Then, the semantics of the output fields are compared to check if all output fields of the query statement exist in the pre-computed table, and not just aliases. After passing the matcher's verification, the rewriter caches the mappings with the same semantics for the output fields. New Project operators are then constructed based on these mappings to replace the original Project operators, ensuring that child nodes remain unchanged.
[0067] Join Part: The join part compares queryJoin and tableJoin to see if they meet the rewrite requirements. Before performing this operation, the pre-rule PushPredicateThroughJoin needs to be executed first. Only joins between two or more tables are considered. In the matcher, the subtrees rooted at the Join operators of the query statement and of the pre-computed table are obtained, all Filter operators in these subtrees are deleted, and the number of leaf nodes in each subtree is counted. If the two subtrees have the same number of leaf nodes, their left and right subtrees are compared recursively; if the pre-computed table has fewer leaf nodes, the match fails directly; if the pre-computed table has more leaf nodes, consistency is first checked against the foreign key join graph, and if the subtrees are consistent, the redundant Join conditions in the pre-computed table are logically deleted before the left and right subtrees of the two trees are compared recursively. When the matcher determines that two Join operators match, the rewriter looks up the data of the corresponding pre-computed table, replaces the Join subtree in the logical plan of the query statement with the pre-computed table, and inserts the previously deleted Filter operators above the pre-computed table, thus completing the rewriting of the join part.
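The leaf-counting decision in the join matcher can be sketched on a toy plan tree. This is a simplified illustration: plan nodes are nested tuples (`("Scan", name)`, `("Filter", condition, child)`, `("Join", left, right)`), join conditions and the foreign key join graph check are omitted, and all function names are hypothetical.

```python
def strip_filters(node):
    """Remove every Filter operator from a toy plan tree."""
    kind = node[0]
    if kind == "Filter":
        return strip_filters(node[2])
    if kind == "Join":
        return ("Join", strip_filters(node[1]), strip_filters(node[2]))
    return node  # ("Scan", table_name)


def count_leaves(node):
    """Count the Scan leaves below a filter-free node."""
    if node[0] == "Join":
        return count_leaves(node[1]) + count_leaves(node[2])
    return 1


def join_match(query_join, table_join):
    """Classify the comparison between the two Join subtrees."""
    q = strip_filters(query_join)
    t = strip_filters(table_join)
    n_query, n_table = count_leaves(q), count_leaves(t)
    if n_table < n_query:
        return "no_match"
    if n_table == n_query:
        # Tuple equality compares left and right subtrees recursively.
        return "match" if q == t else "no_match"
    # More leaves in the pre-computed table: the foreign key join graph
    # would have to be consulted, which this sketch does not model.
    return "needs_foreign_key_check"
```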
[0068] Aggregation Section: The aggregation section compares `queryAggregate` and `tableAggregate` to see if they meet the rewrite requirements. For `count(*)`, the number of occurrences of `count(*) as C` in the query statement must be less than or equal to the number in the pre-computed table. Then, aggregate functions with the same field references are replaced with `sum(C)`, while those with different references are retained. For the `avg(k)` aggregate function, the condition for rewriting it is that the `count(*)` aggregate function must exist in the output field of the pre-computed table, and then it is rewritten as `sum(k) / count(*)`. For the `sum` aggregate function, there must be an alias `s` in the pre-computed table, and then it is replaced with `sum(s)`. For the `min` and `max` aggregate functions, their behavior changes depending on the grouping fields. When the grouping fields are completely identical, only the internal field references need to be replaced; when the grouping fields are inconsistent, it is considered that matching and rewriting are not possible.
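The arithmetic behind the count, sum, and avg rewrites can be checked with a small worked example. It is a sketch with made-up data: the pre-computed table is assumed to store `s = sum(k)` and `C = count(*)` per group, and, purely to show why re-aggregation is valid, it is grouped at a finer granularity (region, city) than the query (region). The function names are hypothetical.

```python
from collections import defaultdict

# (region, city, k): illustrative rows only.
ROWS = [
    ("east", "nyc", 10),
    ("east", "nyc", 20),
    ("east", "bos", 30),
    ("west", "sf", 40),
]


def precompute(rows):
    """Pre-computed table: s = sum(k) and C = count(*) per (region, city)."""
    table = defaultdict(lambda: {"s": 0, "C": 0})
    for region, city, k in rows:
        table[(region, city)]["s"] += k
        table[(region, city)]["C"] += 1
    return dict(table)


def query_from_raw(rows):
    """count(*), sum(k), avg(k) per region, computed on the original rows."""
    acc = defaultdict(lambda: [0, 0])
    for region, _, k in rows:
        acc[region][0] += k
        acc[region][1] += 1
    return {r: {"count": c, "sum": s, "avg": s / c} for r, (s, c) in acc.items()}


def query_from_precomputed(table):
    """Same query after rewriting: count(*) -> sum(C), sum(k) -> sum(s),
    avg(k) -> sum(s) / sum(C)."""
    acc = defaultdict(lambda: [0, 0])
    for (region, _), cell in table.items():
        acc[region][0] += cell["s"]
        acc[region][1] += cell["C"]
    return {r: {"count": c, "sum": s, "avg": s / c} for r, (s, c) in acc.items()}
```

Both paths return the same result for every region, which is why avg must be rebuilt from a sum and a count rather than by averaging stored averages.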
[0069] Grouping Section: The grouping section compares `queryGroup` and `tableGroup` to see if they meet the rewrite requirements. In the matcher, the `Group By` fields of both the query statement and the pre-computed table are extracted, and the grouping fields of the query statement and the pre-computed table are checked to be identical. The mapping relationship between the two sets is also recorded. In the rewriter, based on this mapping relationship, the `groupingExpression` in the `Aggregate` operator of the query SQL is replaced with a field reference from the pre-computed table.
[0070] Rewriting the logical plan: Custom rules can be used to transform the abstract syntax tree of the logical plan into another abstract syntax tree, thereby completing the rewriting of the logical plan. The matching and rewriting process between the query SQL and the pre-computed table is completed in the logical plan phase of Spark SQL. Three types of rewriting rules are defined based on two criteria: whether it contains aggregation operators and whether it contains join operators, as shown in Table 2 below:
[0071] Table 2
[0072]
Rule name | Applicable SQL types
WithoutJoinGroupRule | Excludes join and aggregation operations
WithoutJoinRule | Includes only aggregation operations
Standard Rule | Includes join and aggregation operations
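The choice among the three rules in Table 2 can be expressed as a small dispatch function. This is only a sketch: `select_rule` is a hypothetical helper, and the case of a join without aggregation is not listed in Table 2, so the sketch returns `None` for it rather than guessing.

```python
def select_rule(has_join, has_aggregate):
    """Map the two criteria of Table 2 to a rule name."""
    if not has_join and not has_aggregate:
        return "WithoutJoinGroupRule"
    if not has_join and has_aggregate:
        return "WithoutJoinRule"
    if has_join and has_aggregate:
        return "Standard Rule"
    # Join without aggregation is not covered by Table 2.
    return None
```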
[0073] The logical plan rewriting process is shown in Figure 3. The logical plan of the query SQL statement and the pre-computed table candidate set obtained after index filtering are matched through a pipeline of multiple matchers and rewriters. Only when the query SQL statement passes the checks of all matchers is its logical plan rewritten based on the pre-computed tables in the data warehouse; otherwise, the original SQL statement is used to query the original tables of the data lake.
[0074] Task scheduling steps: Dynamically schedule pre-computed tasks based on historical workload, and determine the timing for initializing and updating the pre-computed table. The task scheduling steps include resource monitoring sub-steps and scheduling sub-steps.
[0075] Resource monitoring sub-step: Monitor the cluster's resource utilization rate. When the resource utilization rate is low, select the pre-computation task to be scheduled based on the scheduling algorithm. Specifically, a real-time metrics monitoring system is used. When each SQL query statement arrives, the latest metrics are obtained. If the resource utilization rate is below the threshold, the scheduling module is called to carry out the pre-computation task.
[0076] Scheduling sub-step: Record the historical workload of query statements within a preset time period, select the optimal query template from the historical workload, and encapsulate it into a pre-computation task for scheduling. Specifically, when each SQL query statement arrives, the time window of the historical workload records whether the rewriting of that statement succeeded; if it failed, the statement is added to the historical workload set. The optimal pre-computation task is selected from the historical workload according to a cost model. The historical workload is essentially a sliding time window: it records the query statements that arrived during the most recent period, parses out the corresponding query templates, and records the frequency of each template. Each type of query template is assigned a weight; a higher weight indicates higher optimization efficiency for the corresponding pre-computation task. A weighted sum is then calculated from the frequency of occurrences within the most recent time window, and the query template with the highest overall optimization efficiency is selected and encapsulated into a pre-computation task for scheduling. When called by the resource monitoring module, the workload set within the time window is denoted as $W = \{Q_1, Q_2, \dots, Q_n\}$ and the set of all query templates as $Temps = \{temp_1, temp_2, \dots, temp_k\}$. Each template $temp_i$ has weight $weight_i$ and occurs $C(temp_i)$ times in this workload. Each pre-computation task has a lifespan, denoted for the templates as $E = \{t_1, t_2, \dots, t_k\}$. The score of each template present in the historical workload is $score_i = weight_i \cdot C(temp_i) + t_i$. The cost model only needs to find the template with the highest score for scheduling.
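The sliding window, the score formula, and the utilization threshold can be combined in a short sketch. It is an illustration under assumptions: timestamps are plain numbers in seconds, weights and lifespans are supplied by the caller as dictionaries, and `HistoryWindow`, `pick_template`, and `maybe_schedule` are hypothetical names.

```python
from collections import Counter, deque


class HistoryWindow:
    """Sliding time window over query templates whose rewrite failed."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, template), oldest first

    def record(self, timestamp, template, rewrite_succeeded):
        # Only queries that could not be rewritten enter the workload set.
        if not rewrite_succeeded:
            self.events.append((timestamp, template))
        # Drop entries that have slid out of the window.
        while self.events and self.events[0][0] < timestamp - self.window_seconds:
            self.events.popleft()

    def counts(self):
        """C(temp_i): occurrences of each template inside the window."""
        return Counter(template for _, template in self.events)


def pick_template(counts, weights, lifespans):
    """Return the template maximizing weight_i * C(temp_i) + t_i, or None."""
    if not counts:
        return None
    return max(counts, key=lambda t: weights[t] * counts[t] + lifespans[t])


def maybe_schedule(utilization, threshold, window, weights, lifespans):
    """Schedule only when cluster utilization is below the threshold."""
    if utilization >= threshold:
        return None
    return pick_template(window.counts(), weights, lifespans)
```

Keeping the threshold check outside `pick_template` mirrors the split between the resource monitoring sub-step and the scheduling sub-step.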
[0077] The present invention also provides a pre-computation-based query system. Those skilled in the art can implement the pre-computation-based query system by executing the steps of the pre-computation-based query method. That is, the pre-computation-based query method can be understood as a preferred implementation of the pre-computation-based query system.
[0078] According to the present invention, a pre-computation-based query system includes:
[0079] The index module constructs a hierarchical index based on the information from the pre-computed tables and filters out candidate pre-computed tables that meet preset conditions. It also updates the corresponding index information when a user creates, updates, or deletes a pre-computed table. The hierarchical index has at least three levels, each including a Hasse diagram and corresponding filtering conditions, and each level performs a matching check. The matching check includes data source matching, predicate matching, and output expression matching. The index module includes the following sub-modules: Module S1.1: Receives user-submitted query SQL statements and parses them into a logical plan. Module S1.2: Obtains key information from the data service architecture based on the logical plan. Module S1.3: Filters the key information layer by layer using the data source layer index, predicate layer index, and output expression layer index, thereby obtaining a candidate set of pre-computed tables that meet the filtering conditions of the corresponding hierarchical index.
[0080] The pre-computation module performs custom matching and rewriting of query SQL statements based on the candidate set of the pre-computation table and the current query engine. Custom matching and rewriting are performed using multiple types of matchers and rewriters, including predicate matching and rewriting, projection matching and rewriting, join matching and rewriting, aggregation matching and rewriting, and grouping matching and rewriting. The candidate set of the pre-computation table undergoes pipeline matching by multiple matchers and rewriters. Only query SQL statements that pass all matcher checks are used in the logic plan rewriting based on the pre-computation table in the data warehouse; otherwise, the original SQL statement is used to query the original tables in the data lake.
[0081] The task scheduling module dynamically schedules pre-computation tasks based on historical workload, determining the timing for initializing and updating the pre-computation table. The task scheduling module includes a resource monitoring submodule and a scheduling submodule. The resource monitoring submodule monitors the cluster's resource utilization and selects pre-computation tasks to schedule based on a scheduling algorithm when resource utilization is low. The scheduling submodule records the historical workload of query statements within a preset time period and selects the optimal query template based on the historical workload, encapsulating it into a pre-computation task for scheduling.
[0082] Those skilled in the art will understand that, in addition to implementing the system, apparatus, and their modules provided by this invention purely as computer-readable program code, the same functions can be realized by logically programming the method steps so that the system, apparatus, and their modules take the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers. Therefore, the system, apparatus, and their modules provided by this invention can be regarded as a hardware component, and the modules included therein for implementing various functions can also be regarded as structures within the hardware component; alternatively, the modules for implementing various functions can be regarded both as software programs implementing the method and as structures within the hardware component.
[0083] Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments described above, and those skilled in the art can make various changes or modifications within the scope of the claims, which do not affect the essence of the present invention. Unless otherwise specified, the embodiments and features described in this application can be arbitrarily combined with each other.
Claims
1. A query method based on pre-computation, characterized in that it comprises: An indexing step: constructing a hierarchical index based on the information of the pre-computed tables, and selecting, through the index, a candidate set of pre-computed tables that meet preset conditions; and, when a user creates, updates, or deletes a pre-computed table, updating the index information corresponding to that pre-computed table; A pre-computation step: based on the candidate set of pre-computed tables, performing custom matching and rewriting of the query SQL statement on the basis of the current query engine; A task scheduling step: dynamically scheduling pre-computation tasks based on historical workload, and determining the timing for initializing and updating the pre-computed tables; The hierarchical index has at least three levels, each level including a Hasse diagram and corresponding filtering conditions, and each level performs a matching check. The matching check includes data source matching, predicate matching, and output expression matching; The indexing step includes the following sub-steps: Step S1.1: receiving the query SQL statement submitted by the user and parsing the query SQL statement into a logical plan; Step S1.2: obtaining key information from the data service architecture according to the logical plan; Step S1.3: filtering the key information layer by layer using the data source layer index, the predicate layer index, and the output expression layer index to obtain a candidate set of pre-computed tables that meet the filtering conditions of the corresponding layer index; The custom matching and rewriting are performed through multiple types of matchers and rewriters, including predicate matching and rewriting, projection matching and rewriting, join matching and rewriting, aggregation matching and rewriting, and grouping matching and rewriting. The candidate set of pre-computed tables is matched through a pipeline of multiple matchers and multiple rewriters. Only query SQL statements that pass the checks of all matchers are used to rewrite the logical plan based on the pre-computed tables in the data warehouse; otherwise, the original SQL statements are used to query the original tables in the data lake.
2. The query method based on pre-computation according to claim 1, characterized in that the task scheduling step includes a resource monitoring sub-step and a scheduling sub-step; Resource monitoring sub-step: monitoring the cluster's resource utilization rate and, when the resource utilization rate is low, selecting the pre-computation task to be scheduled according to a scheduling algorithm; Scheduling sub-step: recording the historical workload of query statements within a preset time period, selecting the optimal query template based on the historical workload, encapsulating it into a pre-computation task, and scheduling it.
3. A query system based on pre-computation, characterized in that it comprises: An index module: constructs a hierarchical index based on the information of the pre-computed tables, and selects, through the index, a candidate set of pre-computed tables that meet preset conditions; and, when a user creates, updates, or deletes a pre-computed table, updates the index information corresponding to that pre-computed table; A pre-computation module: based on the candidate set of pre-computed tables, performs custom matching and rewriting of the query SQL statement on the basis of the current query engine; A task scheduling module: dynamically schedules pre-computation tasks based on historical workload, and determines the timing for initializing and updating the pre-computed tables; The hierarchical index has at least three levels, each level including a Hasse diagram and corresponding filtering conditions, and each level performs a matching check. The matching check includes data source matching, predicate matching, and output expression matching; The index module includes the following sub-modules: Module S1.1: receives the query SQL statement submitted by the user and parses the query SQL statement into a logical plan; Module S1.2: obtains key information from the data service architecture according to the logical plan; Module S1.3: filters the key information layer by layer using the data source layer index, the predicate layer index, and the output expression layer index to obtain a candidate set of pre-computed tables that meet the filtering conditions of the corresponding layer index; The custom matching and rewriting are performed through multiple types of matchers and rewriters, including predicate matching and rewriting, projection matching and rewriting, join matching and rewriting, aggregation matching and rewriting, and grouping matching and rewriting. The candidate set of pre-computed tables is matched through a pipeline of multiple matchers and multiple rewriters. Only query SQL statements that pass the checks of all matchers are used to rewrite the logical plan based on the pre-computed tables in the data warehouse; otherwise, the original SQL statements are used to query the original tables in the data lake.
4. The query system based on pre-computation according to claim 3, characterized in that the task scheduling module includes a resource monitoring submodule and a scheduling submodule; Resource monitoring submodule: monitors the cluster's resource utilization rate and, when the resource utilization rate is low, selects the pre-computation task to be scheduled according to a scheduling algorithm; Scheduling submodule: records the historical workload of query statements within a preset time period, selects the optimal query template based on the historical workload, encapsulates it into a pre-computation task, and schedules it.