Storage computing separation database query optimization method and system for massive data analysis

By collecting and analyzing query logs and metadata of massive amounts of data, fingerprint sequences and query topology graphs are constructed. The query plan is optimized by combining the multidimensional knapsack problem, which solves the problem of balancing query acceleration and view maintenance costs in the context of massive data, and improves query response speed and resource utilization.

CN121786068BActive Publication Date: 2026-06-26CHINA RAILWAY ELECTRIFICATION ENGINEERING GROUP CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHINA RAILWAY ELECTRIFICATION ENGINEERING GROUP CO LTD
Filing Date
2026-03-05
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing technologies struggle to effectively balance the benefits of query acceleration with the costs of view maintenance in environments with massive amounts of data, resulting in low system resource utilization and high query response latency.

Method used

By collecting business query logs and metadata, a syntax tree is constructed and a fingerprint sequence is generated to identify common substructures. The multidimensional knapsack problem is used to determine the combination of target substructures and generate query rewriting rules to optimize the query plan.

Benefits of technology

It optimizes query response speed under strict resource constraints, avoids frequent view failures, reduces system maintenance load, and improves resource utilization and performance.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN121786068B_ABST
    Figure CN121786068B_ABST
Patent Text Reader

Abstract

The application relates to the technical field of computer data processing, and provides a storage-computing separation database query optimization method and system for massive data analysis, which solves the technical problems of low system resource utilization and high query response delay. The method comprises the following steps: collecting business query logs and metadata information of all basic tables; constructing a syntax tree based on a query text, and calculating fluctuation indexes of each basic table; constructing a query topology graph based on a fingerprint sequence to obtain a candidate substructure set; taking the remaining available space as a space constraint, determining a target substructure combination through a multi-dimensional knapsack problem, and generating a query rewriting rule; matching the target fingerprint sequence with the query rewriting rule to determine a target substructure; replacing the target substructure with a read operation of pre-computed data corresponding to the target substructure combination; and generating an execution plan for optimizing the query, so that transparent rewriting of the query based on the pre-computed result is realized, and the query performance is maximally improved under the condition of limited maintenance cost.
Need to check novelty before this filing date? Find Prior Art

Description

Technical Field

[0001] This application relates to the field of computer data processing technology, and in particular to a storage-computation separation database query optimization method and system for massive data analysis. Background Technology

[0002] With the development of big data technology, cloud-native database systems based on storage-compute separation architecture have become the mainstream choice for enterprise-level data warehouses and real-time analysis, and are widely used in massive data processing scenarios such as high-frequency business intelligence reports and real-time dashboard monitoring.

[0003] To improve query response speed, existing technologies typically employ materialized view mechanisms, which mainly rely on database administrators to manually define views based on experience or to pre-compute and store hot queries based on heuristic algorithms such as simple query frequency statistics. This approach attempts to directly reuse the results in subsequent queries by pre-materializing and storing the calculation results of high-frequency queries, thereby reducing repeated scanning and complex aggregation calculations of the underlying massive amounts of raw data.

[0004] However, existing solutions focus solely on query frequency, neglecting the data change frequency of the underlying data source and the rigid constraints of overall system storage space and backend maintenance time windows. This limitation can easily lead to excessively high maintenance costs due to frequent view failures in storage-compute separation environments with frequent data updates or limited resources, or it can crowd out critical computing resources due to a lack of global resource planning, failing to achieve globally optimal performance improvements. Therefore, existing technologies suffer from the technical problem of struggling to effectively balance query acceleration benefits and view maintenance costs in massive data environments, resulting in low system resource utilization and high query response latency. Summary of the Invention

[0005] The purpose of this application is to provide a storage-computation separation database query optimization method and system for massive data analysis, in order to solve the technical problem in the prior art that it is difficult to effectively balance the benefits of query acceleration and the cost of view maintenance in a massive data environment, resulting in low system resource utilization and high query response latency.

[0006] Firstly, this application provides a storage-computation separation database query optimization method for massive data analysis, including:

[0007] Collect business query logs and metadata information from all basic tables. The business query logs include query text and execution time, and the metadata information includes total data information and data increment information.

[0008] A syntax tree is constructed based on the query text. Key operator nodes in the syntax tree are parameterized to generate fingerprint sequences. Fluctuation indicators for each base table are calculated based on incremental data information.

[0009] A query topology graph is constructed based on fingerprint sequences. Common substructures are identified from the query topology graph through frequent subgraph mining. Based on the fluctuation index, the common substructures are pruned to obtain a set of candidate substructures. Each substructure in the candidate substructure set includes query frequency, maintenance cost and storage usage.

[0010] The remaining available space is used as a space constraint, the preset system maintenance time is used as a time constraint, the time-saving of each substructure in the candidate substructure set determined by query frequency and execution time is mapped to the item value, and the storage occupation and maintenance cost are used as the item weight. The target substructure combination is determined by the multidimensional knapsack problem and the query rewriting rules are generated.

[0011] The target fingerprint sequence generated based on the target query request is matched with the query rewriting rules to determine the target substructure. The target substructure is then replaced with a read operation on the pre-computed data corresponding to the combination of target substructures, generating an execution plan for optimizing the query.

[0012] Optionally, a syntax tree is constructed based on the query text, and key operator nodes in the syntax tree are parameterized to generate fingerprint sequences, including:

[0013] By scanning the character stream of the query text, keywords, identifiers, and literals are extracted to obtain a lexical unit sequence. The lexical unit sequence is then recursively parsed to construct a syntax tree.

[0014] Extract nodes of type connection, aggregation, and filtering from the syntax tree as key operator nodes;

[0015] The constant parameters in each key operator node are replaced with preset placeholder symbols, and formatting spaces and comment information in the syntax tree are removed to obtain the standard syntax tree.

[0016] The standard syntax tree is converted into a string sequence by preorder traversal, and the string sequence is encoded by a hash algorithm to obtain the fingerprint sequence.

[0017] Optionally, the incremental data information includes the number of new rows and the number of deleted rows in all base tables for each time slice;

[0018] The volatility indicators for each base table are calculated based on incremental data information, including:

[0019] Obtain the total number of rows in the base table within each time slice, and calculate the ratio of the number of new rows to the total number of rows in each base table within each time slice to obtain the rate of change sequence for each base table.

[0020] The variance of each rate of change sequence is used as the volatility index for the corresponding base table.

[0021] Optionally, a query topology graph is constructed based on the fingerprint sequence, and common substructures are identified from the query topology graph through frequent subgraph mining, including:

[0022] Using the operator nodes in the fingerprint sequence as graph nodes and the data flow relationship between operator nodes as directed edges, a query topology graph is constructed.

[0023] Add the query topology graph to the historical query graph set to obtain the updated historical query graph set. Use a depth-first search strategy to find multi-class connected subgraphs with the same topology in the updated historical query graph set.

[0024] Count the frequency of each type of connected subgraph in the updated historical query graph set, and identify the connected subgraphs whose frequency exceeds a preset frequency threshold as common substructures.

[0025] Optionally, a set of candidate substructures is obtained by pruning the common substructures based on the volatility index, including:

[0026] parse the graph nodes in the common substructure to determine the target base table referenced by the common substructure;

[0027] If the volatility index of the target base table is greater than the preset stability threshold, the corresponding common substructure is removed. If the volatility index of the target base table is less than or equal to the stability threshold, the corresponding common substructure is retained, and a candidate substructure set is obtained.

[0028] The frequency of occurrence of each substructure in the candidate substructure set in the updated historical query graph set is determined as the query frequency.

[0029] Based on the incremental data information of the target base table referenced by each substructure in the candidate substructure set, calculate the estimated time required to process the incremental data information of the target base table, and use the estimated time as the maintenance cost of the corresponding substructure in the candidate substructure set.

[0030] Based on the row record length and total data volume of the target base table, calculate the estimated data volume of the corresponding substructure in the candidate substructure set, and use the estimated data volume as the storage usage.

[0031] Optionally, the remaining available space is used as a spatial constraint, the preset system maintenance time is used as a time constraint, the time-saving of each substructure in the candidate substructure set determined based on query frequency and execution time is mapped to the item value, and storage occupation and maintenance costs are used as the item weight. A multidimensional knapsack problem is used to determine the target substructure combination and generate query rewriting rules, including:

[0032] The remaining available space of the storage system is used as a space constraint, the preset system maintenance time is used as a time constraint, and the storage occupation and maintenance cost are used as the weight of the item.

[0033] The average execution time of each substructure in the candidate substructure set in the business query log is taken as the single execution time of each substructure. Based on the single execution time, the preset estimated reading time and the query frequency, the time saved for each substructure in the candidate substructure set is calculated, and the time saved is taken as the value of the item.

[0034] Enumerate multiple substructure combinations based on the candidate substructure set, and calculate the total storage usage, total maintenance cost, and total item value of all substructures in each substructure combination;

[0035] The multidimensional knapsack problem algorithm is used to determine the target substructure combination that satisfies both spatial and time constraints and maximizes the total value of the items.

[0036] Assign a physical storage path to each target substructure in the target substructure combination, establish a mapping relationship between the fingerprint sequence of the target substructure and the physical storage path, and use the mapping relationship as the query rewrite rule.

[0037] Optionally, the target fingerprint sequence generated based on the target query request is matched with the query rewriting rules to determine the target substructure. The target substructure is then replaced with a read operation on the pre-computed data corresponding to the combination of target substructures, generating an execution plan for optimizing the query, including:

[0038] The target syntax tree corresponding to the target query request is traversed using a bottom-up traversal strategy, and the target fingerprint sequence of the subtree nodes in the target syntax tree is generated using a hash algorithm.

[0039] By hashing the target fingerprint sequence with the fingerprint sequence in the query rewriting rule, the target substructure in the target syntax tree that generates the target fingerprint sequence is determined;

[0040] Based on the physical storage path in the query rewrite rules, a view read operator is created to perform pre-computed data read operations. The view read operator is used to replace the target substructure, and the parent node of the target substructure is updated to reference the view read operator, resulting in the updated target syntax tree.

[0041] Based on the updated target syntax tree, an execution plan is generated to optimize the query.

[0042] Secondly, this application provides a storage-computation separation database query optimization system for massive data analysis, including:

[0043] The data collection module is used to collect business query logs and metadata information from all basic tables. The business query logs include query text and execution time, and the metadata information includes total data information and data increment information.

[0044] The processing module is used to construct a syntax tree based on the query text, parameterize the key operator nodes in the syntax tree to generate fingerprint sequences, and calculate the fluctuation index of each base table based on the incremental data information.

[0045] The module is used to build a query topology graph based on fingerprint sequences, identify common substructures from the query topology graph through frequent subgraph mining, and prune the common substructures based on fluctuation indicators to obtain a candidate substructure set. Each substructure in the candidate substructure set includes query frequency, maintenance cost and storage usage.

[0046] The determination module is used to use the remaining available space as a space constraint, the preset system maintenance time as a time constraint, map the time-saving of each substructure in the candidate substructure set determined according to the query frequency and execution time to the item value, and use the storage occupation and maintenance cost as the item weight. It determines the target substructure combination through a multidimensional knapsack problem and generates query rewriting rules.

[0047] The matching module is used to match the target fingerprint sequence generated based on the target query request with the query rewriting rules to determine the target substructure, replace the target substructure with the read operation of the pre-computed data corresponding to the combination of target substructures, and generate an execution plan for optimizing the query.

[0048] Thirdly, this application provides an electronic device, comprising:

[0049] Memory, used to store computer programs;

[0050] A processor is used to execute computer programs to implement the steps of the storage-computation separation database query optimization method for massive data analysis as described in the first aspect above.

[0051] Fourthly, this application provides a computer-readable storage medium storing a computer program, which, when executed by a processor, can implement the steps of the storage-computation separation database query optimization method for massive data analysis as described in the first aspect above.

[0052] The beneficial effects of this application are:

[0053] The storage-computation separation database query optimization method for massive data analysis provided in this application ensures that the optimization model can not only perceive users' hot query patterns, but also the physical scale and dynamic changes of the underlying data by synchronously collecting business query logs and metadata information. This avoids the one-sidedness of optimizing solely based on query frequency. It solves the problem of frequent view failures caused by neglecting data change frequency in existing technologies. It prevents the generation of invalid or high-cost views, reducing the system's maintenance load from the source. It solves the technical problem of resource congestion or inability to complete view refreshes within limited maintenance windows due to a lack of global planning in existing technologies, achieving global optimization of resource utilization and performance improvement. It reduces I / O overhead and computing cluster load, greatly improving query response speed.

[0054] Furthermore, by quantifying the remaining available space of the storage system and the preset system maintenance time into spatial and temporal constraints respectively, quantifying the time savings calculated based on historical logs into item value, and quantifying storage occupation and maintenance costs into item weight, a multi-dimensional knapsack problem model is constructed. Then, by enumerating various combinations of candidate substructure sets, under the premise of strictly satisfying spatial and temporal constraints, the algorithm automatically solves for the target substructure combination that maximizes the total item value, and finally establishes a mapping relationship between fingerprint sequences and physical storage paths to generate query rewriting rules. This solves the technical barrier of balancing view maintenance costs and query response speed in high-concurrency scenarios. Attached Figure Description

[0055] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0056] Figure 1 A flowchart illustrating the storage-computation separation database query optimization method for massive data analysis provided in this application embodiment;

[0057] Figure 2 This application provides a schematic diagram of a process for obtaining storage occupancy in an embodiment of the present application.

[0058] Figure 3 A flowchart illustrating a process for obtaining query rewriting rules, provided as an embodiment of this application;

[0059] Figure 4 A schematic diagram of the structure of a storage-computation separation database query optimization system for massive data analysis provided in this application embodiment;

[0060] Figure 5This is a schematic diagram of the hardware structure of the electronic device provided in the embodiments of this application. Detailed Implementation

[0061] In high-frequency business intelligence reporting scenarios, where storage and computation are separated in massive data environments, existing materialized view mechanisms focus only on query frequency while ignoring data change frequency and rigid system resource constraints. This leads to excessively high view maintenance costs, resource congestion, and suboptimal global performance. This application addresses these issues by fully collecting business query logs and metadata to accurately quantify the volatility of data sources and the structural characteristics of queries. Then, by utilizing frequent subgraph mining combined with volatility index pruning, it intelligently identifies common computational substructures that are both high-frequency and relatively stable, eliminating invalid views with high maintenance costs at the source. Finally, the complex view selection problem is abstracted into a multidimensional knapsack mathematical model. Within strictly limited storage space and maintenance time windows, it solves for the most cost-effective target substructure combination from a global perspective. Through query transparent rewriting technology, it achieves automatic reuse of pre-calculated results without changing business code, thereby maximizing the response speed of massive data queries while ensuring controllable system resources.

[0062] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0063] The core of this application is to provide a storage-computation separation database query optimization method for massive data analysis. A flowchart of one specific implementation is shown below. Figure 1 As shown, the method includes:

[0064] Step 101: Collect business query logs and metadata information for all basic tables. Business query logs include query text and execution time, while metadata information includes total data volume and data increment information.

[0065] In this step, the base table refers to the entity table in the database that stores the original business data. As the source of data analysis, it forms the physical basis for generating materialized views. The business query log refers to all user-initiated operations recorded by the database during operation, including but not limited to query requests, execution status, and performance metrics for the base table. Metadata information describes the attributes of the base table itself and its data change status, reflecting data scale and activity levels. Query text is the core content of the business query log, specifically the string of structured query language statements submitted by the user. Execution time is the actual time required for the structured query language statement recorded in the business query log to be submitted and completed. Total data volume information is part of the metadata information, representing the current physical storage size or total number of rows in the base table. Data increment information is another part of the metadata information, reflecting the data changes in the base table within a specific time window, including the number of newly added and deleted rows.

[0066] In this embodiment, firstly, business query logs of all base tables are collected in real-time or periodically through an audit module or log interface connected to the database. During the collection process, the query text and execution time fields are extracted from each log entry. These data directly reflect the user's query intent and the current system's performance bottlenecks. Simultaneously, the metadata management module of the database storage engine is connected to obtain metadata information for all underlying base tables. This process focuses on collecting the total data volume information of each base table, such as the total number of rows or the disk space occupied, as well as data increment information, such as the number of records inserted, updated, or deleted in the table over a past period. This data provides a quantitative basis for subsequent assessment of data volatility and view maintenance costs. As an optional implementation, assuming there are base tables A and B in the database, the collected business query log includes a record whose query text is the sum of field C1 queried from base table A, with the filter condition that field C2 is greater than 10, and the execution time is 500 milliseconds. Simultaneously, the collected metadata information for base table A shows that its total data volume is 10 million rows, and the data increment information is 446,000 rows added in the past 30 minutes.

[0067] Step 102: Construct a syntax tree based on the query text, parameterize the key operator nodes in the syntax tree to generate fingerprint sequences, and calculate the fluctuation index of each base table based on the incremental data information.

[0068] In this step, the syntax tree refers to the tree-like data structure generated after syntactic analysis of the structured query language text, where each node represents a different operation operator in the query. Key operator nodes are those representing the core data operation logic within the syntax tree, including joins, aggregations, and filtering. The fingerprint sequence is a sequence obtained by encoding the parameterized key operator nodes, used to uniquely identify the logical structural characteristics of the query. The volatility index is a value calculated based on incremental data information, used to reflect the drastic degree of change in the underlying table data.

[0069] Step 201: By scanning the character stream of the query text, keywords, identifiers and literals are extracted to obtain a lexical unit sequence, and the lexical unit sequence is recursively parsed to construct a syntax tree.

[0070] In this step, keywords are reserved words predefined in the programming language with specific meanings, such as SELECT, FROM, WHERE, etc. Identifiers are symbolic strings used to name database objects. Literals are fixed data values ​​directly given in the query, such as the number 100 or the string "abc". The lexical unit sequence is an ordered list of the smallest syntactic units, including type and value, generated by the scanning process.

[0071] In this embodiment, a lexical analyzer is first used to scan the query text as a character stream. The analyzer ignores whitespace, identifies consecutive letter sequences as keywords or identifiers, and identifies number sequences or content enclosed in quotation marks as literals. Each identified portion is encapsulated as a lexical unit, including its type and original value, thus forming a sequence of lexical units. For example, for the query text `SELECT sum(c1) FROM T1 WHERE c2>10`, the resulting lexical unit sequence is `{SELECT,sum,(,c1,),FROM,T1,WHERE,c2,>,10}`.

[0072] Next, the sequence is fed into a recursive descent parser. The parser starts with the root rules of the query language and calls the corresponding subroutines to match lexical units. For example, the parsing function for the SELECT clause checks in sequence for the existence of the SELECT keyword, the list of column names, the FROM keyword, and the table name. If all lexical units conform to the predefined syntax rules during parsing, the parser constructs a syntax tree reflecting the query structure, where each node represents a syntactic component, such as the SelectStmt node which includes TargetList and FromClause child nodes.

[0073] Step 202: Extract nodes of type connection, aggregation, and filtering from the syntax tree as key operator nodes.

[0074] In this step, join nodes are nodes in the syntax tree that represent multi-table join operations, typically corresponding to equality conditions in the JOIN or WHERE clauses. Aggregate nodes are nodes that represent data summarization calculations, typically corresponding to the GROUP BY clause or aggregate functions. Filter nodes are nodes that represent data filtering conditions, typically corresponding to predicates in the WHERE or HAVING clauses. TableScan nodes are nodes that represent full table scan operations, usually serving as leaf nodes in the query plan, used to read data from the base tables.

[0075] In this embodiment, a depth-first or breadth-first traversal algorithm is used to perform a full scan of the constructed syntax tree. During the traversal, the type attribute of each node is checked. If the type of a node is marked as JOIN, AGGREGATE, FILTER, or TABLESCAN, it is identified as a key operator node and extracted. These nodes carry the core computational logic of the query and are the basis for subsequent fingerprint generation and identification of common substructures. For example, in the syntax tree, the AGGREGATE node representing the SUM function, the FILTER node representing the greater than sign, and the TABLESCAN node representing table T1 are identified and extracted to form the key operator node list {AGGREGATE, FILTER}.

[0076] Step 203: Replace the constant parameters in each key operator node with preset placeholder symbols, and remove formatting spaces and comment information from the syntax tree to obtain the standard syntax tree.

[0077] In this step, constant parameters refer to the specific numerical values ​​or string literals involved in the key operator nodes. These values ​​may vary in different queries, but the structural logic remains the same. Placeholder symbols are predefined special characters used to uniformly replace all constant parameters to eliminate differences in specific values. A standard syntax tree refers to a syntax tree that has undergone parameterization and cleaning, containing only the core logical structure.

[0078] In this embodiment, key operator nodes are traversed, and the literal portion of their child nodes is accessed. Regardless of whether the literal is a number, string, or date, its value is replaced with a preset placeholder symbol. For example, the condition c2>10 in the aforementioned filter node is replaced with c2>#. If another query c2>20 exists, its processing result will also be c2>#, thus making the two structurally identical. Simultaneously, all formatting spaces, newlines, and developer-added comments that do not affect semantics are removed from the syntax tree; that is, spaces in the standardized string representation are eliminated. After these two steps, a standard syntax tree is obtained, free from specific data interference and irrelevant information.

[0079] Step 204: Convert the standard syntax tree into a string sequence by preorder traversal, and encode the string sequence by a hash algorithm to obtain the fingerprint sequence.

[0080] In this step, preorder traversal is a tree traversal strategy that follows the order of visiting the root node, left subtree, and right subtree. The string sequence is a linear text formed by concatenating the node types visited during traversal with their normalized content. A hash algorithm is an algorithm that maps an input of arbitrary length to a fixed-length output, used to generate unique identifiers.

[0081] In this embodiment, a preorder traversal algorithm is used to serialize the standard syntax tree. Starting from the root node, the type name and key attributes of each node are concatenated into a string buffer according to the access order, with specific delimiters used between nodes. After traversal, a string sequence describing the entire tree structure is obtained. For example, the string sequence obtained after traversing the above standard syntax tree is Aggregate(Sum) Filter(GT) TableScan(T1). Then, a hash algorithm is applied to calculate this string sequence to generate a fixed-length hash value. This hash value is the fingerprint sequence, for example, the fingerprint sequence is {FP001}. It can uniquely identify the logical structure of the query, so that queries with the same structure but different parameters have the same fingerprint.

[0082] Step 211: Obtain the total number of rows in the base table within each time slice, and calculate the ratio of the number of new rows to the total number of rows in each base table within each time slice to obtain the rate of change sequence for each base table.

[0083] In this step, the rate of change sequence is a list of values ​​composed of the ratios of data changes within multiple time slices arranged in chronological order, used to reflect the trend of data changes.

[0084] In this embodiment of the application, for each base table, each time slice within the previously defined historical time window is traced back. For the first... Each time slice is used to obtain the total number of rows in the table at the end of that slice from the metadata information, denoted as . And the number of new rows that occur within that slice, denoted as And the number of rows deleted within that slice, denoted as Then, using the formula Calculate the rate of change of the data for this slice. Repeat this process for all slices, resulting in a series of ratios. The rate of change sequence of this base table is composed in chronological order.

[0085] For example, for base table T1, suppose the past 30 minutes are divided into three 10-minute time slices: the first slice adds 100,000 rows and deletes 0 rows, with a total of 10 million rows and a ratio of 0.01; the second slice adds 196,000 rows and deletes 0 rows, with a total of 10,196,000 rows and a ratio of 0.0192; the third slice adds 150,000 rows and deletes 0 rows, with a total of 10,346,000 rows and a ratio of 0.0145. Then the rate of change sequence for base table T1 is {0.01, 0.0192, 0.0145}.

[0086] Step 212: Determine the variance of each rate of change series as the volatility index of the corresponding base table.

[0087] In this step, variance is a statistical measure of the dispersion of a set of data; here, it is used to quantify the volatility of the rate of change series. The volatility index is the final numerical value used to assess the stability of the underlying table; a larger value indicates more drastic data fluctuations.

[0088] In this embodiment of the application, a rate of change sequence is received. First, calculate the arithmetic mean of the sequence. Then, the variance value is calculated using the variance formula. As shown in formula (1):

[0089] (1);

[0090] in, It is the total number of time slices. It is the first The rate of change of each slice. For example, for the aforementioned rate of change sequence {0.01, 0.0196, 0.0145}, first calculate the mean. Then, the sum of squares of the differences between each item and the mean is calculated and averaged to obtain a variance value of approximately 0.000015. Finally, this calculated variance value of 0.000015 is determined as the volatility index for the base table T1. If the variance value is small, it indicates that the data growth of the table is relatively stable; if the variance value is large, it indicates that the table frequently experiences sudden large amounts of data being written, resulting in poor stability.

[0091] Step 103: Construct a query topology graph based on the fingerprint sequence, identify common substructures from the query topology graph through frequent subgraph mining, and prune the common substructures based on the fluctuation index to obtain a candidate substructure set. Each substructure in the candidate substructure set includes query frequency, maintenance cost and storage usage.

[0092] In this step, the query topology graph refers to a directed graph structure constructed with operator nodes corresponding to fingerprint sequences as graph nodes and data flow directions between operators as edges. Frequent subgraph mining is a data mining technique that discovers frequently occurring subgraph patterns from graph datasets. Common substructures refer to connected subgraphs that repeatedly appear in multiple query topologies, identified by mining algorithms. The candidate substructure set is a collection of common substructures retained after pruning by volatility indicators. Query frequency refers to the number of times a substructure appears in historical queries. Maintenance cost refers to the estimated computational resources or time required to update the materialized view corresponding to the substructure. Storage footprint refers to the estimated storage space required for the materialized view corresponding to the substructure.

[0093] Step 301: Using the operator nodes in the fingerprint sequence as graph nodes and the data flow relationship between operator nodes as directed edges, construct a query topology graph.

[0094] In this step, a graph node refers to a vertex in the query topology graph that represents a specific operation operator, corresponding to the key operator reconstructed from the fingerprint sequence. A directed edge is an arrow connecting graph nodes, used to indicate the processing path of data flowing from one operator to another. The query topology graph is a directed graph composed of graph nodes and directed edges, used to visually represent the logical structure of a single query.

[0095] In this embodiment, the logical structure corresponding to the fingerprint sequence is first parsed to reconstruct the key operator information. For each operator, a corresponding graph node is created. Then, based on the input-output dependencies between operators, such as the JOIN operator's input coming from two TableScan operators, connections are established between the corresponding graph nodes. If the output of operator A is the input of operator B, a directed edge is created from A to B. In this way, the abstract logical fingerprint sequence is transformed into a query topology graph with a spatial structure, which clearly reflects the execution order and dependencies of each operation in the query. For example, for the logical Aggregate(Sum) Filter(GT)TableScan(T1) corresponding to the fingerprint sequence {FP001}, the constructed query topology graph G1 contains nodes {TableScan(T1), Filter(GT), Aggregate(Sum)} and edges {TableScan->Filter, Filter->Aggregate}.

[0096] Step 302: Add the query topology graph to the historical query graph set to obtain the updated historical query graph set. Use a depth-first search strategy to find multi-class connected subgraphs with the same topology in the updated historical query graph set.

[0097] In this step, the historical query graph set refers to a database or list that stores the topology of all queries processed within a past period. Depth-first search is an algorithm used to traverse or search graph structures, traversing the nodes of the graph along the depth of the tree, searching the branches of the graph as deeply as possible. A connected subgraph is a substructure in the query topology where any two nodes are connected by a path, and this structure appears multiple times in the set.

[0098] In this embodiment, the query topology graph is appended to a preset historical query graph set to update the set. Then, a frequent subgraph mining algorithm is applied, which employs a depth-first search strategy. It starts with the simplest single-node subgraph, recursively attempting to add edges and nodes to expand the subgraph, and performs matching throughout the entire set. During the search, the algorithm identifies connected subgraph patterns that are structurally identical, i.e., have the same topological structure. These patterns may include simple table scans and filtering structures, or complex multi-table joins and aggregation structures. For example, in the updated set, the algorithm identifies that the aforementioned topology G1 (TableScan->Filter->Aggregate) appears repeatedly in multiple historical queries.

[0099] Step 303: Count the frequency of each type of connected subgraph in the updated historical query graph set, and identify the connected subgraphs whose frequency exceeds the preset frequency threshold as common substructures.

[0100] In this step, frequency of occurrence refers to the total number of times a specific type of connected subgraph is included in the entire historical query graph set. The preset frequency threshold is a numerical limit, either manually set or automatically calculated by the system, used to distinguish between low-frequency and high-frequency patterns. Common substructures refer to those high-frequency connected subgraphs that meet the frequency requirements, representing typical hotspot query logic in the system.

[0101] In this embodiment, for each type of connected subgraph, a counter counts the number of times it appears in the set. Then, the counted frequency is compared with a preset frequency threshold. If the frequency of a subgraph type is greater than the threshold, it is identified as a common substructure and retained; otherwise, it is considered an occasional query pattern and ignored. The final output set of common substructures contains all high-frequency query logic fragments. For example, if the count shows that structure G1 appears a total of 600 times in the set, and the preset frequency threshold is 100, since 600 > 100, structure G1 is identified as a common substructure, denoted as type A.

[0102] like Figure 2 As shown, Figure 2 This is a schematic diagram illustrating a process for obtaining storage occupancy, provided as an embodiment of this application.

[0103] Step 311: parse the graph nodes in the common substructure to determine the target base table referenced by the common substructure.

[0104] In this step, the target base table refers to the underlying physical table directly associated with data read operations within the common substructure.

[0105] In this embodiment, each common substructure is traversed, and all graph nodes it contains are examined. Nodes of type TableScan are identified, and the table name or table identifier recorded in the node's attributes is extracted. These table names correspond to the target base table on which the common substructure depends. For example, for common substructure type A, its TableScan node is parsed, and the target base table it references is determined to be T1.

[0106] Step 312: If the volatility index of the target base table is greater than the preset stability threshold, the corresponding common substructure is removed. If the volatility index of the target base table is less than or equal to the stability threshold, the corresponding common substructure is retained, and a candidate substructure set is obtained.

[0107] In this step, the stability threshold is a boundary value used to determine whether the data in the base table is too active.

[0108] In this embodiment, for each common substructure, the volatility indicators corresponding to all target base tables it references are queried. These volatility indicators are then compared with a preset stability threshold. The logical judgment rule is: if the volatility indicator of any target base table referenced by the substructure exceeds the threshold, it indicates that the data source on which the substructure depends is extremely unstable and the maintenance cost is too high; therefore, the common substructure is directly removed. Only when the volatility indicators of all target base tables referenced by the substructure are less than or equal to the threshold is the common substructure retained. After this round of filtering, what remains is the candidate substructure set.

[0109] For example, the volatility index of the target base table T1 is found to be 0.000015, and the preset stability threshold is 0.01. Therefore, the data of T1 is determined to be stable, and the common substructure type A is retained and added to the candidate substructure set.

[0110] Step 313: Determine the frequency of occurrence of each substructure in the candidate substructure set in the updated historical query graph set as the query frequency.

[0111] In this step, query frequency is an attribute of the candidate substructure, and the occurrence frequency count obtained earlier is directly reused.

[0112] In this embodiment, for each substructure in the candidate substructure set, the frequency of its occurrence in the historical query graph set is directly obtained based on statistical results. This value is assigned to the query frequency attribute of the substructure. For example, for candidate substructure type A, its previously statistically calculated frequency of 600 is directly determined as its query frequency. This metric reflects the potential benefits of materializing the view.

[0113] Step 314: Based on the incremental data information of the target base table referenced by each substructure in the candidate substructure set, calculate the estimated time required to process the incremental data information of the target base table, and use the estimated time as the maintenance cost of the corresponding substructure in the candidate substructure set.

[0114] In this step, the estimated time consumption refers to the computation time required to complete the view refresh based on the incremental data. Maintenance cost is a metric measuring the system resources consumed to maintain the freshness of the view data.

[0115] In this embodiment, for each candidate substructure, the incremental data of the target base table it references within a unit of time, i.e., the number of newly added and deleted rows, is obtained. Combining the I / O performance parameters and computational performance parameters of the database system, the time required to read this incremental data and re-execute the substructure logic is calculated, as shown in formula (2):

[0116] (2);

[0117] in, Indicates the estimated time. Indicates the number of new rows. Indicates the number of rows deleted. This indicates the system I / O read rate (lines per second). This represents the total number of rows to be processed, usually equal to... , This represents the system's calculation rate (lines per second). The calculated... This refers to the maintenance cost of the substructure. For example, for candidate substructure type A, assume the number of new rows added per unit time... There were 446,000 rows; the number of rows deleted was [number missing]. For row 0, the system I / O rate The system's computing speed is 2 million lines per second. If the rate is 2 million lines per second, then calculate its maintenance cost. It takes 0.446 seconds.

[0118] Step 315: Based on the row record length and total data amount information of the target base table, calculate the estimated data amount of the corresponding substructure in the candidate substructure set, and use the estimated data amount as the storage occupancy.

[0119] In this embodiment, the average length of each record and the current total number of rows are obtained based on the metadata of the target base table. The size of the result set is estimated by combining the filtering rate of the substructure and the aggregation compression ratio. The calculation formula is as follows: ,in, This indicates the estimated amount of data. This represents the total number of rows in the original table that forms the target base table. This indicates the estimated filtering or aggregation ratio, which is the proportion of the number of rows in the result set to the number of rows in the original table. This indicates the average length of the result rows. (Calculated) That is, the storage space occupied by this substructure.

[0120] For example, for candidate substructure type A, assume the total number of rows in the original table T1 is... For 10 million rows, aggregation ratio The average row length of the result B, the average length of the resulting rows, then calculate the storage usage. .

[0121] Step 104: Using the remaining available space as a space constraint, the preset system maintenance time as a time constraint, the time-saving of each substructure in the candidate substructure set determined based on query frequency and execution time is mapped to the item value, and the storage occupation and maintenance cost are used as the item weight. The target substructure combination is determined through the multidimensional knapsack problem and query rewriting rules are generated.

[0122] In this step, "remaining available space" refers to the currently unused storage capacity of the storage device. "Space constraint" refers to a limiting dimension on the total weight of items in the knapsack problem, corresponding to the remaining available space. "Preset system maintenance time" refers to the length of the time window allowed for background tasks to run. "Time constraint" refers to another limiting dimension on the total weight of items in the knapsack problem, corresponding to system maintenance time. "Time saving" refers to the time reduction achieved by using a materialized view compared to directly executing the original query. "Item value" is the optimization objective in the knapsack problem, corresponding to time saving. "Item weight" refers to the constraints in the knapsack problem, including both storage usage and maintenance costs. The multidimensional knapsack problem is a type of combinatorial optimization problem aiming to select items under multiple constraints to maximize the total value. "Target substructure combination" refers to the optimal set of substructures selected through algorithmic solutions. "Query rewriting rule" refers to the established mapping relationship from fingerprint sequences to physical storage paths.

[0123] like Figure 3 As shown, Figure 3 This is a flowchart illustrating a process for obtaining query rewriting rules, provided as an embodiment of this application.

[0124] Step 401: Use the remaining available space of the storage system as a space constraint, the preset system maintenance time as a time constraint, and the storage occupancy and maintenance cost as the item weight.

[0125] In this step, remaining available space refers to the unused disk space in the database storage layer that can be allocated to materialized views. The space constraint is the first dimension of the knapsack model, representing the maximum weight capacity of the knapsack. The preset system maintenance time refers to the maximum time window allowed for the background view refresh task to run. The time constraint is the second dimension of the knapsack model. Item weight refers to the amount of system resources consumed by each candidate substructure, including storage usage and maintenance costs.

[0126] In this embodiment, the remaining available disk space (e.g., 500GB) is first obtained by querying the storage system's monitoring interface and set as the space constraint for the multidimensional knapsack problem. Simultaneously, a preset system maintenance time (e.g., 2 hours) is read from the system configuration file and set as the time constraint. Next, for each substructure in the candidate substructure set, storage usage and maintenance costs are extracted. These two metrics are then mapped to a two-dimensional item weight vector for that substructure in the knapsack problem. In this way, the resource constraints and consumption in the physical world are transformed into constraints and parameters in the mathematical model.

[0127] Step 402: Take the average execution time of each substructure in the candidate substructure set in the business query log as the single execution time of each substructure, and calculate the time saved for each substructure in the candidate substructure set based on the single execution time, the preset estimated reading time and the query frequency, and take the time saved as the item value.

[0128] In this step, average execution time refers to the average response time of a certain type of query in its unoptimized state, based on historical log statistics. Single execution time is a baseline value used to evaluate the cost of the original query. The preset estimated read time refers to the theoretical time required to directly read the materialized view result, which is usually much shorter than the original query time. Time saved refers to the total time saved across all historical queries through the materialized view mechanism. Item value is the objective function to be maximized in the knapsack model.

[0129] In this embodiment, for each candidate substructure, the business query log is traced back, all historical query records matching the fingerprint of that substructure are selected, and the arithmetic mean of their execution times is calculated as the single execution time. The system's preset view read baseline time is obtained. Combined with the query frequency of that substructure, the time saved is calculated using a formula. ,in, It means saving time. Indicates the time taken for a single execution. This indicates the estimated read time. This indicates the query frequency. The calculated frequency... This means that the value of the items in the knapsack problem is mapped to the value of the substructure.

[0130] For example, for candidate substructure type A, assume its single execution time is... The estimated read time is 10 seconds. The query frequency is 0.1 seconds. If the value is 100, then calculate the value of the item. For 990 seconds; for candidate substructure type B, assume its It lasts for 20 seconds. It takes 0.1 seconds. If the value is 50, then calculate the value of the item. It lasts for 995 seconds.

[0131] Step 403: Enumerate multiple substructure combinations based on the candidate substructure set, and calculate the total storage usage, total maintenance cost, and total item value of all substructures in each substructure combination.

[0132] In this step, a substructure combination refers to a subset of one or more substructures selected from the candidate substructure set. Total storage usage is the sum of the storage usage of all substructures in this combination. Total maintenance cost is the sum of the maintenance costs of all substructures in this combination. Total item value is the sum of the item values ​​of all substructures in this combination.

[0133] In this embodiment, the algorithm begins by exploring all possible subsets of the candidate substructure set. While theoretically it involves enumerating all permutations, in practice, it is typically constructed step-by-step using dynamic programming. For each considered substructure combination, the attributes of all individual substructures it contains are summed. For example, if the combination contains the aforementioned substructure types A and B, the storage footprint of type A is known. For 1GB, maintenance cost For 500 seconds, the value of the item The storage usage is 990; for type B. For 5GB, maintenance cost For 3000 seconds, the value of the item The value is 995. Then calculate the total storage usage of this combination. 6GB; Total maintenance cost For 3500 seconds; Total item value It was 1985.

[0134] Step 404: Use the multidimensional knapsack problem algorithm to determine the target substructure combination that satisfies both spatial and time constraints and maximizes the total value of the items.

[0135] In this step, the multidimensional knapsack problem algorithm refers to an algorithm that solves a value maximization problem under multiple capacity constraints, typically implemented using dynamic programming or branch and bound methods. The target substructure combination refers to the optimal solution set output by the algorithm, representing the most cost-effective view configuration scheme.

[0136] In this embodiment, a multidimensional knapsack problem algorithm is applied to filter and compare various combinations. First, it is checked whether the total storage usage of each combination is less than or equal to the space constraint, and whether the total maintenance cost is less than or equal to the time constraint. Only combinations that simultaneously satisfy both conditions are considered feasible solutions. Among all feasible solutions, their total item value is compared, and the combination with the largest value is found. All substructures contained in this combination are determined as the target substructure combination. For example, assume the space constraint is 10GB and the time constraint is 3600 seconds.

[0137] Option A only: Total space 1GB <= 10GB, total time 500s <= 3600s, total value 990. Feasible.

[0138] Option B only: Total space 5GB <= 10GB, total time 3000s <= 3600s, total value 995. Feasible.

[0139] Choosing A and B: Total space 6GB <= 10GB, total time 3500s <= 3600s, total value 1985. Feasible and with the highest value, therefore, the algorithm determines the target substructure combination as {Type A, Type B}.

[0140] Step 405: Assign a physical storage path to each target substructure in the target substructure combination, establish a mapping relationship between the fingerprint sequence of the target substructure and the physical storage path, and use the mapping relationship as the query rewrite rule.

[0141] In this step, the physical storage path refers to the specific location of the materialized view data file in the underlying file system or object storage. The query rewrite rule is a key-value pair mapping used to guide the query optimizer in transparent rewriting; the key is a fingerprint sequence, and the value is the physical storage path.

[0142] In this embodiment, for each substructure in the selected target substructure combination, a unique storage location or file path is assigned to it in the storage system. For example, the path / views / vA is assigned to substructure type A, and the path / views / vB is assigned to type B. Next, the fingerprint sequence of the substructure is extracted as the key, and the assigned path is used as the value to construct a mapping record. All these records are collected together to form a query rewriting rule base, as shown in Table 1:

[0143] Table 1 Query Rewrite Rule Base

[0144]

[0145] Step 105: Match the target fingerprint sequence generated based on the target query request with the query rewriting rules to determine the target substructure, replace the target substructure with the read operation of the pre-computed data corresponding to the combination of target substructures, and generate an execution plan for optimizing the query.

[0146] In this step, the target query request refers to the newly submitted query to be executed by the user. The target fingerprint sequence is a structural feature identifier generated for the target query request. The target substructure refers to the part of the syntax tree of the target query request that matches the query rewriting rules. The pre-computed data refers to the result data that has been pre-computed and stored based on the target substructure combination. The execution plan used to optimize the query refers to the database execution path that, after rewriting, uses the pre-computed data to replace some of the computational logic.

[0147] Step 501: Traverse the target syntax tree corresponding to the target query request using a bottom-up traversal strategy, and generate the target fingerprint sequence of the subtree nodes in the target syntax tree using a hash algorithm.

[0148] In this step, the bottom-up traversal strategy refers to a tree traversal method that starts from a leaf node and visits parent nodes level by level upwards, ensuring that all child nodes of a parent node have been processed before processing the parent node. A subtree node is any non-leaf node in the target syntax tree; it serves as the root node and together with all its descendant nodes, forms a subtree. The target fingerprint sequence is a unique hash identifier calculated for the logical structure of this subtree.

[0149] In this embodiment, the system first receives a target query request submitted by the user and parses it to generate a target syntax tree. Then, it traverses the tree using a bottom-up strategy. When a subtree node is accessed, it recursively obtains the fingerprint information of all its child nodes, concatenating the current node's type, parameterized attributes, and the fingerprints of its child nodes into a string. Then, using the same hash algorithm as previously used to generate the fingerprint sequence, it calculates the target fingerprint sequence for that subtree node. This process ensures that the fingerprint reflects the structural characteristics of the entire subtree rooted at that node.

[0150] For example, for query Q2: SELECT sum(c1) FROM T1 WHERE c2>10, when traversing to the subtree node containing the logic of Aggregate(Sum) Filter(GT) TableScan(T1), the target fingerprint sequence is calculated to be {FP001}.

[0151] Step 502: Determine the target substructure in the target syntax tree that generates the target fingerprint sequence by hash matching the target fingerprint sequence with the fingerprint sequence in the query rewrite rule.

[0152] In this step, hash matching refers to the operation of comparing whether two hash values ​​are completely equal, used to quickly determine whether two structures are consistent. The target substructure refers to the part of the target syntax tree where the generated fingerprint sequence successfully matches a record already existing in the query rewrite rule.

[0153] In this embodiment, after generating the target fingerprint sequence for each subtree node, it is immediately compared with the keys in the pre-loaded query rewriting rule base. If a target fingerprint sequence is found to exist in the rule base, it means that the structure of the subtree is completely consistent with a previously materialized view. At this time, the subtree node and all its descendant nodes are marked as the target substructure. The matching process is usually implemented using a hash table, which has a time complexity of O(1) and can achieve millisecond-level lookup in a large-scale rule base. For example, the system searches for the aforementioned target fingerprint sequence {FP001} in the query rewriting rule base and finds a successful match. The corresponding physical storage path is / views / vA, so the subtree is determined as the target substructure.

[0154] Step 503: Based on the physical storage path in the query rewrite rules, create a view read operator to perform pre-computed data read operations, replace the target substructure with the view read operator, and update the parent node of the target substructure's reference pointer to the view read operator to obtain the updated target syntax tree.

[0155] In this step, the view read operator refers to a special query execution operator whose configured data source is the physical storage path pointing to the pre-computed results file. A reference pointer refers to the memory address or logical link between a parent node and its child nodes.

[0156] In this embodiment, once the target substructure is determined, the physical storage path matching the fingerprint is extracted from the query rewrite rules. Next, a new view read operator is instantiated, and this path is configured as its data source parameter. Then, a tree structure modification operation is performed: the parent node of the target substructure is found, its original reference pointer pointing to the root node of the target substructure is disconnected, and it is redirected to the newly created view read operator. This operation logically cuts away the complex original computational subtree and pastes a simple read operation, thus obtaining the updated target syntax tree.

[0157] For example, the system creates a view reader operator ScanView, sets its path parameter to / views / vA, and points the parent node of the target substructure, such as SelectStmt, to this ScanView operator to complete the replacement.

[0158] Step 504: Generate an execution plan for optimizing the query based on the updated target syntax tree.

[0159] In this embodiment, the updated target syntax tree, after the replacement operation, is submitted to the query optimizer. The optimizer performs routine optimization on this new tree, but since the most computationally intensive part has been replaced by view read operators, the execution plan generated by the optimizer will directly include scan tasks on pre-computed data, and will no longer include the original multi-table joins or complex aggregation tasks. The final execution plan is then sent to the execution engine for scheduling and execution.

[0160] For example, based on the updated syntax tree, the generated execution plan is: Task 1: Scan / views / vA -> Task 2: Return Result. This plan directly reads the pre-calculated result, avoiding scans and aggregation calculations on table T1, thus achieving query optimization.

[0161] This application's embodiments, by synchronously collecting business query logs and metadata information, ensure that the optimization model can not only perceive users' hot query patterns but also the physical scale and dynamic changes of the underlying data, thereby avoiding the one-sidedness of optimizing solely based on query frequency. It solves the problem of frequent view failures caused by neglecting data change frequency in existing technologies. It prevents the generation of invalid or high-cost views, reducing the system's maintenance load from the source. It solves the technical problem of resource congestion or inability to complete view refreshes within limited maintenance windows due to a lack of global planning in existing technologies, achieving global optimization of resource utilization and performance improvement. It reduces I / O overhead and computing cluster load, greatly improving query response speed.

[0162] Figure 4 This is a schematic diagram illustrating a specific implementation of the storage-computation separation database query optimization system for massive data analysis provided in this application embodiment, with reference to... Figure 4 The system may include:

[0163] The data collection module 21 is used to collect business query logs and metadata information of all basic tables. The business query logs include query text and execution time, and the metadata information includes total data information and data increment information.

[0164] Processing module 22 is used to construct a syntax tree based on the query text, perform parameterization processing on the key operator nodes in the syntax tree to generate fingerprint sequences, and calculate the fluctuation index of each base table based on the data increment information.

[0165] Module 23 is used to construct a query topology graph based on fingerprint sequences, identify common substructures from the query topology graph through frequent subgraph mining, and prune the common substructures based on fluctuation indicators to obtain a candidate substructure set. Each substructure in the candidate substructure set includes query frequency, maintenance cost and storage usage.

[0166] The determination module 24 is used to use the remaining available space as a space constraint, the preset system maintenance time as a time constraint, map the time-saving of each substructure in the candidate substructure set determined according to the query frequency and execution time to the item value, use the storage occupation and maintenance cost as the item weight, determine the target substructure combination through the multidimensional knapsack problem and generate query rewriting rules.

[0167] The matching module 25 is used to match the target fingerprint sequence generated according to the target query request with the query rewriting rules to determine the target substructure, replace the target substructure with the reading operation of the pre-calculated data corresponding to the combination of target substructures, and generate an execution plan for optimizing the query.

[0168] The storage-computation separation database query optimization system for massive data analysis in this application embodiment is used to implement the aforementioned storage-computation separation database query optimization method for massive data analysis. Therefore, the specific implementation of the storage-computation separation database query optimization system for massive data analysis can be found in the embodiment section of the storage-computation separation database query optimization method for massive data analysis above. The specific implementation can be referred to the description of the corresponding embodiment, and will not be repeated here.

[0169] Figure 5 A schematic diagram of the hardware structure of the electronic device provided in an embodiment of this application is shown.

[0170] This application also provides an electronic device, including: a memory for storing a computer program; and a processor for executing the computer program to implement the storage-computation separation database query optimization method for massive data analysis as described above.

[0171] The electronic device may include a processor 510 and a memory 520 storing computer program instructions.

[0172] Specifically, the processor 510 may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits that can be configured to implement the embodiments of this application.

[0173] Memory 520 may include mass storage for data or instructions. For example, and not limitingly, memory 520 may include a hard disk drive (HDD), floppy disk drive, flash memory, optical disk, magneto-optical disk, magnetic tape, or Universal Serial Bus (USB) drive, or a combination of two or more of these. Where appropriate, memory 520 may include removable or non-removable (or fixed) media. Where appropriate, memory 520 may be internal or external to the integrated gateway disaster recovery device. In a particular embodiment, memory 520 is non-volatile solid-state memory.

[0174] Memory may include read-only memory (ROM), random access memory (RAM), disk storage media devices, optical storage media devices, flash memory devices, and electrical, optical, or other physical / tangible memory storage devices. Therefore, typically, memory includes one or more tangible (non-transitory) computer-readable storage media (e.g., memory devices) encoded with software including computer-executable instructions, and when the software is executed (e.g., by one or more processors), it is operable to perform the operations described with reference to the method according to the first aspect of this disclosure.

[0175] The processor 510 reads and executes computer program instructions stored in the memory 520 to implement the storage-computation separation database query optimization method for arbitrary massive data analysis in the above embodiments.

[0176] In one example, the electronic device may also include a communication interface 530 and a bus 540. Wherein, such as Figure 5 As shown, the processor 510, memory 520, and communication interface 530 are connected through bus 540 and complete communication with each other.

[0177] The communication interface 530 is mainly used to realize communication between various modules, devices, units and / or equipment in the embodiments of this application.

[0178] Bus 540 includes hardware, software, or both, that couples components of an online data traffic metering device together. For example, and not limitingly, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an Infinite Bandwidth Interconnect, a Low Pin Count (LPC) bus, a memory bus, a Microchannel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCI-X) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local (VLB) bus, or other suitable buses, or combinations of two or more of these. Where appropriate, bus 540 may include one or more buses. Although specific buses are described and illustrated in embodiments of this application, any suitable bus or interconnect is contemplated herein.

[0179] This application also provides a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps of the storage-computation separation database query optimization method for massive data analysis described above.

[0180] In one exemplary embodiment, the aforementioned computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as USB flash drives, read-only memory, random access memory, portable hard drives, magnetic disks, or optical disks.

[0181] Embodiments of the present invention also provide a computer program product, which includes a computer program that, when executed by a processor, implements the steps in the above-described embodiment of the storage-computation separation database query optimization method for massive data analysis.

[0182] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.

[0183] The above provides a detailed description of the storage-computation separation database query optimization method and system for massive data analysis provided in this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the embodiments above are merely for the purpose of helping to understand the method and its core ideas. It should be noted that those skilled in the art can make various improvements and modifications to this application without departing from its principles, and these improvements and modifications also fall within the protection scope of this application.

Claims

1. A storage-computation separation database query optimization method for massive data analysis, characterized in that, include: Collect business query logs and metadata information from all basic tables. The business query logs include query text and execution time, and the metadata information includes total data information and data increment information. A syntax tree is constructed based on the query text, and the key operator nodes in the syntax tree are parameterized to generate fingerprint sequences. The fluctuation index of each base table is calculated based on the data increment information. A query topology graph is constructed based on the fingerprint sequence. Common substructures are identified from the query topology graph through frequent subgraph mining. Based on the fluctuation index, the common substructures are pruned to obtain a set of candidate substructures. Each substructure in the set of candidate substructures includes query frequency, maintenance cost, and storage usage. Using the remaining available space as a space constraint, the preset system maintenance time as a time constraint, the time-saving of each substructure in the candidate substructure set determined according to the query frequency and the execution time is mapped to the item value, the storage occupation and the maintenance cost are used as the item weight, and the target substructure combination is determined by the multidimensional knapsack problem and query rewriting rules are generated. The target fingerprint sequence generated based on the target query request is matched with the query rewriting rules to determine the target substructure. The target substructure is then replaced with a read operation on the pre-computed data corresponding to the combination of the target substructures to generate an execution plan for optimizing the query. Using the remaining available space as a space constraint and the preset system maintenance time as a time constraint, the time-saving of each substructure in the candidate substructure set determined based on the query frequency and execution time is mapped to item value. The storage occupation and maintenance cost are used as item weight. The target substructure combination is determined through a multidimensional knapsack problem, and query rewriting rules are generated, including: The average execution time of each substructure in the candidate substructure set in the business query log is taken as the single execution time of each substructure. Based on the single execution time, the preset estimated reading time, and the query frequency, the time saved for each substructure in the candidate substructure set is calculated, and the time saved is taken as the value of the item. Based on the candidate substructure set, enumerate multiple substructure combinations and calculate the total storage usage, total maintenance cost, and total item value of all substructures in each substructure combination. The substructure combination that satisfies the spatial and temporal constraints and maximizes the total value of the items is determined as the target substructure combination using the multidimensional knapsack problem algorithm. Assign a physical storage path to each target substructure in the target substructure combination, establish a mapping relationship between the fingerprint sequence of the target substructure and the physical storage path, and use the mapping relationship as a query rewrite rule.

2. The storage-computation separation database query optimization method for massive data analysis according to claim 1, characterized in that, A syntax tree is constructed based on the query text, and key operator nodes in the syntax tree are parameterized to generate fingerprint sequences, including: By scanning the character stream of the query text, keywords, identifiers, and literals are extracted to obtain a lexical unit sequence. The lexical unit sequence is then recursively parsed to construct a syntax tree. Extract nodes of type connection, aggregation, and filtering from the syntax tree as key operator nodes; The constant parameters in each key operator node are replaced with preset placeholder symbols, and formatting spaces and comment information in the syntax tree are removed to obtain a standard syntax tree; The standard syntax tree is converted into a string sequence by preorder traversal, and the string sequence is encoded by a hash algorithm to obtain a fingerprint sequence.

3. The storage-computation separation database query optimization method for massive data analysis according to claim 1, characterized in that, The incremental data information includes the number of new rows and the number of deleted rows in each time slice for all base tables; Based on the incremental data information, the volatility index of each base table is calculated, including: Obtain the total number of rows in the base table in each time slice, and calculate the ratio of the number of new rows to the total number of rows in each base table in each time slice to obtain the change rate sequence of each base table; The variance of each rate of change sequence is used as the volatility index for the corresponding base table.

4. The storage-computation separation database query optimization method for massive data analysis according to claim 1, characterized in that, A query topology graph is constructed based on the fingerprint sequence, and common substructures are identified from the query topology graph through frequent subgraph mining, including: Using the operator nodes in the fingerprint sequence as graph nodes and the data flow relationship between the operator nodes as directed edges, a query topology graph is constructed. The query topology graph is added to the historical query graph set to obtain an updated historical query graph set. A depth-first search strategy is used to search for multi-class connected subgraphs with the same topology in the updated historical query graph set. The frequency of occurrence of each type of connected subgraph in the updated historical query graph set is counted, and the connected subgraphs whose occurrence frequency exceeds a preset frequency threshold are identified as common substructures.

5. The storage-computation separation database query optimization method for massive data analysis according to claim 4, characterized in that, Based on the volatility index, the common substructure is pruned to obtain a set of candidate substructures, including: The graph nodes in the common substructure are parsed to determine the target base table referenced by the common substructure; If the volatility index of the target base table is greater than the preset stability threshold, the corresponding common substructure is removed; if the volatility index of the target base table is less than or equal to the stability threshold, the corresponding common substructure is retained, and a candidate substructure set is obtained. The frequency of occurrence of each substructure in the candidate substructure set in the updated historical query graph set is determined as the query frequency. Based on the incremental data information corresponding to the target base table referenced by each substructure in the candidate substructure set, calculate the estimated time required to process the incremental data information corresponding to the target base table, and use the estimated time as the maintenance cost of the corresponding substructure in the candidate substructure set. Based on the row record length of the target base table and the total data amount information, the estimated data amount of the corresponding substructure in the candidate substructure set is calculated, and the estimated data amount is used as the storage occupancy.

6. The storage-computation separation database query optimization method for massive data analysis according to claim 1, characterized in that, The target fingerprint sequence generated based on the target query request is matched with the query rewriting rules to determine the target substructure. The target substructure is then replaced with a read operation on the pre-computed data corresponding to the combination of the target substructures, generating an execution plan for optimizing the query, including: The target syntax tree corresponding to the target query request is traversed using a bottom-up traversal strategy, and the target fingerprint sequence of the subtree nodes in the target syntax tree is generated using a hash algorithm. By performing hash matching between the target fingerprint sequence and the fingerprint sequence in the query rewriting rule, the target substructure in the target syntax tree that generates the target fingerprint sequence is determined. Based on the physical storage path in the query rewrite rule, a view read operator is created to perform pre-computed data read operations. The view read operator is used to replace the target substructure, and the parent node of the target substructure is updated to reference the view read operator, resulting in an updated target syntax tree. Based on the updated target syntax tree, an execution plan is generated to optimize the query.

7. A storage-computation separation database query optimization system for massive data analysis, characterized in that, include: The data collection module is used to collect business query logs and metadata information of all basic tables. The business query logs include query text and execution time, and the metadata information includes total data information and data increment information. The processing module is used to construct a syntax tree based on the query text, perform parameterization processing on the key operator nodes in the syntax tree to generate fingerprint sequences, and calculate the fluctuation index of each base table based on the data increment information. The construction module is used to construct a query topology graph based on the fingerprint sequence, identify common substructures from the query topology graph through frequent subgraph mining, and prune the common substructures based on the fluctuation index to obtain a candidate substructure set. Each substructure in the candidate substructure set includes query frequency, maintenance cost, and storage usage. The determination module is used to use the remaining available space as a space constraint, the preset system maintenance time as a time constraint, map the time-saving of each substructure in the candidate substructure set determined according to the query frequency and the execution time to the item value, and use the storage occupation and the maintenance cost as the item weight. It determines the target substructure combination and generates query rewriting rules through a multidimensional knapsack problem. Specifically, the determination module is used to use the average execution time of each substructure in the candidate substructure set in the business query log as the single execution time of each substructure, and calculate the time-saving of each substructure in the candidate substructure set according to the single execution time, the preset estimated read time and the query frequency, and use the time-saving as the item value. Based on the candidate substructure set, enumerate multiple substructure combinations and calculate the total storage usage, total maintenance cost, and total item value of all substructures in each substructure combination. Using the multidimensional knapsack problem algorithm, the substructure combination that satisfies the spatial constraints and the time constraints and maximizes the total item value is determined as the target substructure combination; physical storage paths are allocated to each target substructure in the target substructure combination, and a mapping relationship between the fingerprint sequence of the target substructure and the physical storage path is established, and the mapping relationship is used as the query rewriting rule; The matching module is used to match the target fingerprint sequence generated according to the target query request with the query rewriting rules to determine the target substructure, replace the target substructure with the reading operation of the pre-calculated data corresponding to the combination of the target substructure, and generate an execution plan for optimizing the query.

8. An electronic device, characterized in that, include: Memory, used to store computer programs; A processor, configured to implement the steps of the storage-computation separation database query optimization method for massive data analysis as described in any one of claims 1 to 6 when executing the computer program.

9. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program that, when executed by a processor, enables the implementation of the storage-computation separation database query optimization method for massive data analysis as described in any one of claims 1 to 6.