A method, device, and medium for proactive hierarchical management of database buffer pools.
By generating page access sequence diagrams and hierarchical cost models, the problem of low resource utilization efficiency of the database buffer pool manager in the DRAM-CXL-NVM architecture was solved, resulting in improved query throughput and reduced latency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HIGHGO SOFTWARE
- Filing Date
- 2026-05-29
- Publication Date
- 2026-06-30
AI Technical Summary
Existing database buffer pool managers suffer from problems such as lag in OS hot detection, oscillations in database LRU and OS hierarchical migration, and unutilized operator access pattern heterogeneity when dealing with DRAM-CXL-NVM three-tier main memory architectures, resulting in low efficiency in computing resource utilization.
By generating a Page Access Timing Graph (PATG), and combining the capacity and access characteristics of the three-tier main memory system, a tiered cost model is used to calculate and generate an active tiered scheduling strategy. The scheduling operation of data pages is executed asynchronously before the query is executed, and deviations are monitored and corrected.
It improved query throughput, reduced tail query latency, enhanced memory resource utilization efficiency, and achieved precise matching with hardware.
Smart Images

Figure CN122309573A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of database technology, and in particular to a method, device and medium for proactive hierarchical management of database buffer pools. Background Technology
[0002] Existing database buffer pool managers based on a single-layer main memory of Dynamic Random Access Memory (DRAM) (using replacement algorithms such as Least Recently Used (LRU), Least Frequently Used (LFU), Clock Replacement Algorithm (CLOCK), and Adaptive Replacement Cache (ARC)) face the DRAM-CXL-NVM three-layer main memory architecture introduced by the Compute Express Link (CXL) 4.0 standard. Their layering decisions are entirely triggered by post-hoc sampling at the operating system level, for example, Meta TPP's millisecond-level sampling based on NUMA hint page faults. Furthermore, existing solutions do not consider the differences in access patterns of different operators, such as sequential scanning, hash construction, and B+ tree lookups, using a single "hotness" criterion for migration between all layers. This leads to three types of problems: lag in operating system (OS) hotness detection, oscillations in database LRU and OS layering migration, and failure to utilize the heterogeneity of operator access patterns. Summary of the Invention
[0003] This application provides a proactive hierarchical management method, device, and medium for a database buffer pool to solve the following technical problems: how to solve the problems of OS hot detection lag, database LRU and OS hierarchical migration oscillation, and heterogeneity of unutilized operator access patterns.
[0004] In a first aspect, embodiments of this application provide an active hierarchical management method for a database buffer pool. The method includes: receiving a query request for a database and generating a query plan based on the query request; extracting features from the query plan to obtain operator features, and generating a page access timing diagram corresponding to the query request based on the operator features; based on the page access timing diagram, and combining the capacity of DRAM, CXL memory, and NVM in the three-tier main memory system, calculating using a hierarchical cost model to allocate a target storage layer for the data page corresponding to the query plan, thereby generating an active hierarchical scheduling strategy; within a preset time window before the query plan begins execution, asynchronously executing the scheduling operation of the data page between the three-tier main memory system according to the active hierarchical scheduling strategy; during query execution, accessing the data page according to the target storage layer, and monitoring the deviation between the actual access and the predicted access, wherein the predicted access is determined based on the page access timing diagram; and triggering incremental correction scheduling when the deviation exceeds a preset deviation threshold.
[0005] Secondly, embodiments of this application also provide an active hierarchical management device for a database buffer pool, the device comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform an active hierarchical management method for a database buffer pool as described in the first aspect above.
[0006] Thirdly, embodiments of this application also provide a computer storage medium storing computer-executable instructions, which, when executed, implement an active hierarchical management method for a database buffer pool as described in the first aspect above.
[0007] The active hierarchical management method, device, and medium for a database buffer pool provided in this application have the following beneficial effects: In this embodiment, by parsing the query plan and extracting operator features, a page access sequence diagram is generated, thereby solving the problem of OS post-sampling lag. Subsequently, based on the page access sequence diagram and combined with the hardware characteristics of the three-tier memory system, a proactive hierarchical scheduling strategy is generated through calculation using a hierarchical cost model. This strategy allocates a unique target storage layer to each data page, ensuring that the database kernel and the underlying memory management follow the same optimization objectives, eliminating strategy conflicts and data thrashing. Furthermore, the hierarchical cost model can distinguish the memory sensitivity of operators, achieving precise matching between requirements and hardware. Then, in specific execution, data pre-migration is asynchronously completed according to the proactive hierarchical scheduling strategy before the actual query execution, eliminating query "cold start" latency; during execution, prediction deviations are monitored and incremental corrections are triggered, ensuring the system's robustness under dynamic loads. Thus, in a working set scenario with over DRAM capacity, compared to the traditional combination of "OS TPP + database LRU", query throughput can be increased by more than 2 times, tail query latency reduced by more than 50%, and overall memory resource utilization efficiency significantly improved. Attached Figure Description
[0008] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, illustrate exemplary embodiments and are used to explain this application, but do not constitute an undue limitation of this application. In the drawings: Figure 1 A flowchart illustrating an active hierarchical management method for a database buffer pool provided in this application embodiment; Figure 2 A flowchart illustrating another proactive hierarchical management method for a database buffer pool provided in this application embodiment; Figure 3 A schematic diagram of a DRAM-CXL-NVM three-layer main memory system and a six-way migration path provided for embodiments of this application; Figure 4 A schematic diagram illustrating a PATG example and cross-query merging provided for embodiments of this application; Figure 5 A schematic diagram of a transaction consistency migration protocol state machine provided in an embodiment of this application; Figure 6 This is a schematic diagram of the internal structure of an active hierarchical management device for a database buffer pool, provided in an embodiment of this application. Detailed Implementation
[0009] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below in conjunction with specific embodiments and corresponding drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0010] In practical applications, the database buffer pool loads data pages into DRAM on demand and manages them using replacement algorithms such as LRU, LFU, CLOCK, and ARC. Compute Express Link (CXL) is a cache-consistent interconnect protocol between the CPU and accelerator / memory extension devices; the CXL.mem sub-protocol allows CXL-attached memory (typically with a latency of 170–300ns, between DRAM's approximately 80ns and SSD's approximately 100μs) to be managed by the kernel as part of the system memory. CXL 4.0, released in November 2025, will increase bandwidth to 128 GT / s (PCIe 7.0), driving the "DRAM-CXL-NVM / SSD" three-tier main memory architecture towards mainstream adoption in cloud environments. The Tiered Memory subsystem in Linux kernel 5.18+ already supports promotion / demotion based on data page popularity, with Meta TPP (Transparent Page Placement, ASPLOS 2023) being a representative solution, but migration decisions are triggered by the operating system rather than the database. NVM stands for Non-Volatile Memory in Chinese and Non-Volatile Memory in English.
[0011] IEEE 754 double-precision floating-point, relational algebraic physical operators (seq_scan, idx_scan, hash_build, hash_probe, sort, merge_join, nestloop), cost estimation frameworks (cardinality estimation, access cost model), and multi-version concurrency control (MVCC) are all existing well-known technologies.
[0012] One existing technology is a combination of Meta TPP (ASPLOS 2023) and a standard database LRU buffer pool, and its operation steps are as follows: Step S1 (Buffer Pool Initialization): The database manages the DRAM buffer pool using standard LRU, retaining hot data pages and evicting cold data pages to the solid state drive (SSD). Step S2 (OS sampling data page heat): The Linux TPP subsystem periodically samples process memory based on NUMA balancing hint pagefault; Step S3 (OS Passive Demotion): TPP demotes cold data pages from DRAM to the CXL layer according to the heat threshold; Step S4 (Data Failure Triggers Promotion): A data failure is triggered when a cold data page is accessed again, and the OS promotes the data page from CXL back to DRAM; Step S5 (Database unaware): The database buffer pool is unaware of the existence of the CXL layer and still uses LRU for eviction and reading back from SSD.
[0013] The shortcomings of existing technologies include the following five points: 1. OS-driven page hotspot detection is lagging and semantically disconnected from Structured Query Language (SQL): Because TPP samples page hotspots in process memory through NUMA hint page faults, the sampling frequency is limited to milliseconds by kernel scheduling and only reflects accesses that have already occurred; while database query plans can predict the page access sequence within the next 10–100 milliseconds the moment the query arrives. As a result, CXL layer migration decisions are always hindsight, and by the time hot pages are migrated back to DRAM from CXL, the access is already at its end, resulting in an unavoidable remote access delay and significantly increasing tail latency.
[0014] 2. LRU buffer pool and OS hierarchical memory cause migration oscillations: Since the database LRU is evicted based on the most recently accessed memory while OSTPP is migrated based on the long-term frequency, the two criteria are different and they are not visible to each other; when a page is evicted from DRAM to SSD by the database, the OS may have just promoted the CXL copy to DRAM, and vice versa; this causes bidirectional page migration oscillations between DRAM, CXL, and SSD, wasting interconnect bandwidth and requiring additional synchronization overhead for page copy consistency.
[0015] 3. Failure to utilize the heterogeneity of query plan bandwidth requirements: Because the existing solution treats all pages equally and determines the layer based on access frequency, while the access patterns of different database operators are very different: sequential scans require high bandwidth and low randomness, which is suitable for CXL (high bandwidth and tolerable latency); Hash Join build side requires low-latency random access, which must be DRAM; B+ tree lookup has short paths and repeatedly accesses the same batch of index pages, which must be DRAM. As a result, the existing single hotness criterion cannot match the most suitable level for different operators, and the overall system throughput does not reach the theoretical peak of the hardware.
[0016] 4. The choice of dirty page write-back level is ignored: Since the existing solution focuses on the layering of read paths, dirty page write-back only considers the "DRAM→SSD" path; while when the CXL layer exists, writing back dirty pages that may be read again in the short term to CXL is about 100 times faster than writing back to SSD; as a result, the tail latency of Write-Ahead Log (WAL) flushing and dirty page sinking does not benefit from the optimization brought by CXL.
[0017] 5. Layered migration process disrupts transaction consistency: Since the existing layered system treats pages as stateless storage objects, the cross-layer migration process does not consider the MVCC version chain; when a transaction reads the page during the migration, it may read a torn version chain or an expired tuple; as a result, the existing OS layered scheme is difficult to be compatible with database transaction isolation semantics, and actual deployment requires sacrificing certain migration features or introducing heavyweight locks.
[0018] The technical solutions proposed in the embodiments of this application will be described in detail below with reference to the accompanying drawings.
[0019] Figure 1 This application provides a flowchart illustrating an active hierarchical management method for a database buffer pool. This method can be applied to a three-tier main memory system consisting of Dynamic Random Access Memory (DRAM), Compute-X Memory (CXL) interconnected via Compute-X Fast Interconnect (CXL), and Non-Volatile Memory (NVM). Figure 1 As shown in the figure, the active hierarchical management method for a database buffer pool provided in this application embodiment specifically includes the following steps: Step 101: Receive a query request for the database and generate a query plan based on the query request.
[0020] In practical applications, a database query request is an instruction from a user or application to initiate an operation on the database, which can be an SQL statement. In this embodiment, a query plan can be generated based on the aforementioned database query request. Through the above steps, the transformation from user intent to a specific, analyzable, and optimizable execution plan within the database is completed. This query plan serves as the information input and decision-making basis for the entire proactive hierarchical management method.
[0021] In practical applications, after receiving a database query request, the database optimizer can parse and optimize it to generate a physical execution plan tree. This is a tree-like data structure, such as PostgreSQL's Plan / PlanState tree or MySQL's JOIN / QEP_TAB tree. Each node represents a physical operator, or simply an operator, which is the specific operation actually executed by the database engine, such as Seq Scan, Index Scan, Hash Join, and Sort. The edges represent the direction of data flow, from child nodes (data source) to parent nodes (data processing).
[0022] Step 102: Extract features from the query plan to obtain operator features, and generate a page access sequence diagram corresponding to the query request based on the operator features.
[0023] In the embodiments of this application, the Page Access Time Graph (PATG) described above can characterize the access order, concurrency, and correlation with physical operators of data pages in a three-tier main memory system, determined by the query plan. That is, PATG can translate the query plan into an access sequence that the storage system can understand and predict.
[0024] In one possible implementation, the feature extraction of the query plan to obtain operator features includes: The physical execution plan tree corresponding to the query plan is traversed from bottom to top; For each physical operator node encountered during the traversal, a feature tuple corresponding to the physical operator node is constructed, wherein the feature tuple includes the operator type, estimated number of rows to be processed, access mode, reuse factor and deadline hint corresponding to the physical operator node.
[0025] In the above embodiments, operator features specifically refer to the feature tuples (op_type, est_rows, scan_mode, reuse_factor, deadline_hint) in the form of a 5-tuple, which are extracted and assembled for each physical operator node op.
[0026] All five fields mentioned above are derived from existing outputs of the query compilation, without introducing runtime sampling overhead.
[0027] (1) op_type, i.e. operator type, refers to the specific type of physical operator. It can be directly read from the plan node type enumeration, such as PostgreSQL's nodeTag(plan) and MySQL's JOIN_TAB::type.
[0028] (2) est_rows, which is the estimated number of rows to be processed, can be directly read from the optimizer cost estimator output. For example, PostgreSQL Plan->plan_rows, MySQL JOIN_TAB::rows_estimate.
[0029] (3) scan_mode, or access mode, is a classification label derived from op_type in conjunction with its specific scanning conditions. For example, SeqScan → sequential, IndexScan + equality condition → point_lookup, IndexScan + range condition → random, BitmapHeapScan → bulk, HashJoin probe → random, HashJoin build → sequential. In this way, the specific database operation semantics are summarized into general access modes that are instructive for selecting memory levels.
[0030] (4) reuse_factor, which is a value that is derived from the operator semantics and the estimated value and can characterize the degree to which data is accessed repeatedly. For example, HashJoin construction side data page = estimated number of rows on the probe side / number of construction side data pages (i.e., N_probe), SortMergeJoin temporary data page = number of merge rounds, B+ leaf data page = 1, and the rest are assigned default values at compile time according to the operator semantics. For example, the operator for a one-time scan may be 1.
[0031] (5) deadline_hint, i.e. the deadline hint, is obtained by subtracting the previous cumulative total_cost from the total_cost accumulated by this operator in the above plan tree, and multiplying by the current conversion factor of cost unit to milliseconds.
[0032] In one possible implementation, generating the page access sequence diagram corresponding to the query request based on the operator features includes: For each leaf operator in the physical execution plan tree, the data pages of the leaf operator are enumerated based on the estimated number of rows of the leaf operator, the physical data page information of the associated table or index, so as to construct the set of data page numbers corresponding to the leaf operator; For each data page of the leaf operator, based on the start time and end time of the leaf operator and the relative position of the data page in the data page number set, the first access timestamp and the last access timestamp of the data page are assigned by linear interpolation, and the access count and dominant access pattern of the data page are recorded to generate the page access sequence diagram of the leaf operator; Based on the connection semantics of the intermediate operator, the page access timing diagram of the left subtree and the page access timing diagram of the right subtree are combined to form the page access timing diagram of the intermediate operator, wherein the left subtree is the left subtree of the intermediate operator and the right subtree is the right subtree of the intermediate operator; The access records of the same data page accessed by different operators are merged to generate a page access sequence diagram based on data pages. At a preset time before the query plan begins execution, the page access sequence diagram corresponding to the query request is completed and locked.
[0033] In the above embodiments, PATG construction can be broken down into the following five sub-steps: (1) Enumeration of Leaf Operator Data Page Sets: For each leaf operator in the above physical execution plan tree, enumerate the set of data page numbers Pages(op) that the leaf operator will access, based on its est_rows, the number of relational data pages (relpages) of the table / index, and the number of tuples per data page (tuples_per_page). For example, sequential scan retrieves a continuous range of physical data pages, index scan infers heap data pages from the statistical histogram, and bitmap scan retrieves the target heap data page set. Note that all leaf operators are operators, but not all operators are leaf operators. Leaf operators are a special type of operator at the lowest level of the tree structure, responsible for data extraction.
[0034] (2) Assigning time stamps to leaf operators aims to predict the access time of each data page in the aforementioned data page number set during the execution of the operator. Using `deadline_hint` as the end time of this operator `t_end(op)` and the end time of the preceding operator as its start time `t_start(op)`, assuming that the access of the data page number set by the operator is uniform, the first (first_access_ms) and last (last_access_ms) access timestamps are calculated by linear interpolation based on the relative position of the data page in `Pages(op)`. The number of times the data page is accessed in this operation (access_count=1) and the dominant access mode (scan_mode_dominant) are recorded. Thus, each leaf operator obtains a time-stamped list of data page accesses, which is the PATG corresponding to that leaf operator.
[0035] (3) Intermediate operator synthesis, that is, according to the execution semantics of the intermediate operator, the PATG of its left and right subtrees is synthesized into the complete PATG of the operator. This mainly involves synthesizing the PATG of the left and right subtrees of internal operators such as HashJoin, MergeJoin, and NestLoop according to their semantics. For example, taking HashJoin as an example, on the construction side, the access_count of its data pages is multiplied by reuse_factor, because the constructed hash table can be repeatedly searched by the probe side. At the same time, timeline merging is required. The last_access_ms of all data pages on the construction side is extended to the end time (t_end) of the probe side, because the hash table needs to reside throughout the entire probe phase. On the probe side, its scan_mode_dominant is set to random, because hash probe is a random access. In this way, a time-series prediction of data page access can be obtained with the intermediate operator as the root. Among them, the intermediate operator receives the data stream (rows or rows) generated from the lower-level operator (which may be a leaf operator or other intermediate operator), performs calculations such as connection, sorting, and aggregation, and then passes the results to the higher-level operator.
[0036] (4) Merging data pages: Multiple access records of the same physical data page by different operators are merged to form a unique and complete file for that data page. For the case where multiple operators access the same page_id, the data is merged according to (page_id, min(first_access_ms), max(last_access_ms), Σaccess_count, dominant(scan_mode)). Wherein, min(first_access_ms) is the earliest first access time among all records, max(last_access_ms) is the latest last access time among all records, Σaccess_count refers to the cumulative number of accesses, and dominant(scan_mode) is the dominant access mode determined based on the number of accesses and the access mode.
[0037] (5) Output and timestamp locking: The merged set of data page tuples can serve as the PATG corresponding to this query plan. This set is generated and locked at time T0 before the query begins, making it a static and consistent benchmark. This will be immediately used to calculate the hierarchical strategy and can also serve as a subsequent correction benchmark. During query execution, the actual access will be compared with the locked predictions to trigger incremental correction. These steps, through bottom-up synthesis and data page-centric aggregation, transform the abstract physical execution plan tree into a concrete, quantifiable PATG that can guide memory scheduling.
[0038] Thus, the PATG described above is a collection of tuples with complete time-series and access feature predictions, organized by data page. Each data page p is accompanied by a tuple (first_access_ms, last_access_ms, access_count, scan_mode_dominant). Each data page tuple in the PATG must include only the following five required fields: (1) page_id, a unique identifier of a data page, serves as the primary key of a tuple and uniquely identifies the physical data page that is predicted to be accessed; (2) first_access_ms, the first access timestamp, determines the start time of subsequent asynchronous pre-migration. If it is missing, the lead time required for pre-migration cannot be calculated. (3) last_access_ms, the last access timestamp, determines the earliest time when the data page can be downgraded or archived. If it is missing, the subsequent reverse migration path cannot be triggered. (4) access_count, total number of accesses, can determine the reuse benefit of the data page in the greedy scoring in the subsequent strategy generation. If it is missing, it is equivalent to treating all data pages with the same priority, degenerating into the existing popularity stratification. (5) scan_mode_dominant, the dominant access mode, determines the value of the bw_match coefficient of the subsequent hierarchical cost model. If it is missing, it is impossible to distinguish the hierarchical preference of sequential / random / point search.
[0039] The above five items are essential components of the PATG data page tuple. Optional fields can be added in practical applications, such as the deadline_uncertainty estimation uncertainty interval and the transaction identifier to which tx_id belongs. However, optional fields do not affect the minimum working set of PATG.
[0040] Step 103: Based on the page access timing diagram, and combined with the capacity of DRAM, CXL memory, and NVM in the three-tier main memory system, calculate using the tiered cost model to allocate the target storage layer to the data page corresponding to the query plan, thereby generating an active tiered scheduling strategy.
[0041] In practical applications, the above-mentioned proactive hierarchical scheduling strategy can allocate a corresponding target storage layer to each data page in the query plan. At the same time, it can also plan the operation and timing of the above data pages to be pre-migrated, resided or degraded between DRAM, CXL memory and NVM.
[0042] In one possible implementation, after generating the page access sequence diagram corresponding to the query request based on the operator features, the method further includes: For multiple pending query requests in the current concurrent query pool, based on the page access sequence diagram corresponding to the query requests, list all the data pages that will be accessed to form a data page set; Conflict analysis is performed on each of the data pages in the data page set. If the data page appears in the page access sequence diagrams of two or more query requests at the same time, the data page is identified as a shared page. Based on the page access sequence diagram of the two or more query requests, determine the number of times the shared page was accessed and the timestamp of the first access. Based on the shared pages, a shared working set is formed, and in the active hierarchical scheduling strategy, a storage layer with a target priority is allocated to the data pages in the shared working set so that the shared pages reside in DRAM; The data pages in the page access sequence diagram corresponding to multiple query requests, excluding the shared pages, are treated as independent scheduling units, and active hierarchical scheduling decisions are made for each.
[0043] In the above embodiment, for the current concurrent query pool Q={q1, q2, …, qK}, the PATG corresponding to each of the currently pending query requests q1, q2, …, qK can be obtained. All data page access records in these PATGs are merged to form a global data page set containing all data pages accessed by all query requests. Then, this data page set is traversed, and conflict analysis is performed to check how many query request PATGs each data page p appears in. If p appears in two or more query request PATGs, it is determined to be a shared page. This identifies data pages needed by multiple queries that may be potential I / O bottlenecks. For each identified shared page p, its access characteristics in multiple PATGs can be merged into a unified description. The access count is calculated as max(access_count), which is the maximum predicted access count for p across all queries, ensuring that the needs of the most frequently accessed query are met. The first access timestamp is calculated as min(first_access_ms), which is the minimum predicted first access time for p across all queries. This ensures that data is ready before the first query requiring it begins. The shared pages, merged according to the above rules, form a shared working set (W_share). This set represents critical and protected public data under the current concurrent load. All data pages in this set are generally assigned the highest priority and preferentially reside in DRAM. This avoids bandwidth waste and latency caused by multiple queries repeatedly loading the same data page from the slow layer (CXL or NVM); it also eliminates I / O blocking caused by queries waiting for the same hot data page, improving concurrency efficiency. Furthermore, the non-intersecting portions of K PATGs are grouped into independent scheduling units. These units can independently decide their target storage layer based on their own PATG characteristics and hierarchical cost model. This maintains scheduling flexibility while ensuring global hotspots, allowing optimization of private data for each query request.
[0044] In practical applications, the above merging is completed during the query admission phase O(Σpages_in_PATG). That is, it occurs when a query request is accepted into the system for execution but has not yet started actual execution. O(Σpages_in_PATG) indicates that its time is linearly related to the sum of the total number of data pages in the PATG of all queries, which can avoid the overhead of repeated coordination during execution.
[0045] In this way, shared pages can be globally identified and given priority resource guarantees (DRAM residency) to the target, allowing each query's unique data pages to be flexibly scheduled according to their own needs. Through global collaboration, the PATG prediction capability of a single query is extended to a concurrent environment, which can resolve core conflicts when multiple tasks compete for I / O resources, thereby improving the throughput and response speed of the entire database system.
[0046] In practical applications, a unified data page directory (PD) can be maintained, with each data page p recording (tier ∈ {DRAM, CXL, NVM}, last_access, hotness_score, version, dirty_bit, lock_state). The capacities of DRAM, CXL, and NVM are C_D, C_C, and C_N, respectively.
[0047] Each record in the data page directory PD corresponds to a physical data page p, and the meanings of each field are as follows: (1) tier∈{DRAM, CXL, NVM} is the currently resident memory tier; (2) last_access, the most recently accessed logical clock (system global monotonic timestamp or transaction log sequence number LSN), can be used for online learning of hierarchical cost models and fallback cold and hot determination; (3) hotness_score, the hotness score derived from last_access and historical frequency, is only used as a fallback stratification basis when PATG is missing or prediction fails. Normal paths do not participate in the generation of subsequent stratification strategies. (4) version, MVCC version number, is incremented each time it is migrated across layers or written, and is the key to locating the historical copy in the subsequent (page_id, version) double mapping; (5) dirty_bit, dirty data page marker, indicates whether the data page has not been written back to update relative to the persistence layer, and is the input for subsequent write-back determination; (6) lock_state∈{NORMAL, MIGRATING, PINNED, IO_IN_PROGRESS} are state machine flags, corresponding to normal read / write, cross-layer migration, operator-locked migration prohibited, and persistent layer I / O, respectively, used to prevent conflicts between migration and concurrent transaction read / write.
[0048] The capacity limits of the three layers DRAM, CXL, and NVM are denoted as C_D, C_C, and C_N, respectively, and can be used as hard constraints for subsequent solutions.
[0049] In one possible implementation, the formula for calculating the equivalent cost of a single access to a data page in the storage layer in the hierarchical cost model is as follows: cost(p,op,tier)=base_latency(tier)+(1 bw_match(tier,scan_mode))·penalty Where cost(p, op, tier) represents the estimated cost of operator op accessing data page p once on storage tier tier; base_latency(tier) represents the base access latency of storage tier tier; bw_match(tier, scan_mode) represents the bandwidth matching coefficient between storage tier tier and operator op's access mode scan_mode; penalty represents the additional cost penalty introduced when bandwidth mismatch occurs.
[0050] In the above embodiments, a single access cost can be defined, where cost(p, op, tier) represents the equivalent cost of operator op accessing data page p in one access at storage tier, in nanoseconds (ns), with a smaller value being better. The above formula uses the same standardized cost score to measure the expected performance loss of a specific data access (which operator, in what mode, and at which memory layer). base_latency(tier) represents the base access latency of storage tier, reflecting the latency of a single random access under perfect bandwidth matching conditions; it is generally a hardware-dependent baseline value. bw_match(tier, scan_mode)∈[0,1] represents the bandwidth matching coefficient between the storage layer and the operator access mode, where 1 indicates perfect matching and 0 indicates complete mismatch. penalty is the equivalent additional cost for bandwidth mismatch, in ns, converting bandwidth mismatch into an equivalent cost that can be added to the base latency.
[0051] In practical applications, the default values and bases for the above parameters are as follows: (1) base_latency: DRAM=80 ns (based on the typical value of single random read latency of JEDEC DDR5), CXL=250 ns (based on the measured latency range of CXL 2.0 / 3.0 170–300 ns, taking the upper limit as a conservative estimate), NVM=4 μs (based on the typical value of random read latency of 4 KB persistent memory); (2) bw_match default value: (DRAM,*)=1.0; (CXL,sequential)=0.95 (close to DRAM under sequential access of CXL.mem channel). Bandwidth utilization); (CXL,random)=0.3 (small-granularity random access is affected by protocol header overhead and remote hop count); (NVM,bulk)=0.6; (NVM,random)=0.05; (3) penalty=500ns: approximately twice the remote round-trip delay of one CXL, sufficient to drive the solver to avoid high mismatch combinations without suppressing all CXL / NVM candidates. The above default values are only initial values and can be continuously updated by exponential moving average from measured data in subsequent processes.
[0052] In one possible implementation, generating the proactive hierarchical scheduling strategy includes: Under the constraints of the DRAM capacity limit and the CXL memory capacity limit, the target storage layer is allocated to the data page with the goal of minimizing the total access cost, wherein the single access cost is calculated using the formula in the hierarchical cost model.
[0053] For each data page p in PATG, the union operator and the current storage layer tier are used to allocate the target storage layer for the data page. The calculation formula is as follows:
[0054] Here, cost(p, op, tier_p) represents the estimated cost of operator op accessing data page p on the target storage layer tier_p. The total access cost is the sum of the latency costs incurred when all data pages are accessed. This cost depends on the frequency and pattern of page access; for example, a page frequently accessed randomly will have a higher cost if placed on a slower layer, and also on the performance of the storage layer where the page resides (quantified by the tiered cost model cost(p, op, tier)). Generally, the access cost is low in DRAM and high in NVM. A suitable page placement strategy is needed to minimize this total cost. tier_p represents the target storage layer to which data page p is allocated. The decision variable in the above formula lies in allocating a target storage layer tier_p ∈ {DRAM, CXL, NVM} for each data page p. size(p) represents the size of data page p, and I(condition) is an indicator function; it takes the value 1 when the condition is true and 0 otherwise. C_D represents the upper limit of the DRAM layer capacity, and C_C represents the upper limit of the CXL layer capacity. It should be noted that the NVM layer is usually used as a backup layer with larger capacity, and hard capacity constraints may not be set, or C_N constraints may be set according to the actual device capacity.
[0055] In practical applications, the aforementioned hierarchical cost model can distinguish the sensitivity of different operator access modes (such as sequential / random) to memory bandwidth and latency when generating proactive hierarchical strategies. It can intelligently guide high-bandwidth data to CXL, lock low-latency sensitive data in DRAM, and settle cold data in NVM, thus solving the drawbacks of resource mismatch in traditional solutions and achieving precise matching between demand and hardware.
[0056] In the above embodiments, a greedy algorithm can be used for allocation when generating the proactive hierarchical scheduling strategy, including: A revenue score is calculated for the data page based on the reuse factor of the data page and the difference between the basic access latency of the current storage layer and the target storage layer. The data pages are sorted in descending order of earnings score; Each data page is processed sequentially. For the current data page, the data page is attempted to be placed into the DRAM, the CXL memory, or the NVM in descending order of storage layer performance until the first storage layer with sufficient remaining capacity is found, and the storage layer is used as the target storage layer for the data page. When two data pages have the same benefit score, the shared page in the shared working set is preferentially allocated to the DRAM.
[0057] In practical applications, the aforementioned data pages can include each data page in the shared working set (W_share) and the independent scheduling unit, without specific restrictions.
[0058] In practical applications, the greedy algorithm described above is a concrete, efficient, and approximate solution strategy for achieving the global objective of minimizing the total access cost under capacity constraints. The idea behind the greedy algorithm is to use an incremental approach, pursuing a local optimum at each step, to approximate the global optimum.
[0059] In practical applications, the formula for calculating the payout score is as follows: reuse_factor × (base_latency(current tier) base_latency(targettier)) The physical meaning of the above formula is: the total latency savings that can be accumulated in subsequent predicted accesses of data page p after it is migrated from the current storage tier to the target storage tier. This score quantifies the total latency savings that can be achieved by migrating data page p from the current storage tier to the target storage tier. (base_latency(tier_current)) `base_latency(tier_target)` represents the latency saved per access. For example, migrating from NVM to DRAM can save approximately 4μs-80ns of latency per access.
[0060] Then, p can be enumerated in descending order of score(p) and placed sequentially into the target layer with the highest remaining capacity; if DRAM is full, it is downgraded to CXL, and if CXL is full, it is downgraded to NVM. Among data pages with the same score, shared pages of the shared working set W_share are given priority to reside in DRAM.
[0061] The greedy algorithm described above uses a reward score as a guide, allocating valuable fast storage space (DRAM, CXL) to data pages with high access frequency and greater benefit from speed improvements, while satisfying the hard constraints of hardware resources. Furthermore, this greedy algorithm is a standard approximation algorithm for the capacity-constrained multidimensional 0 / 1 knapsack problem, and can be proven to be 2-approximately optimal under the hard capacity constraint. The algorithm's time consumption is mainly in the sorting operation, with a time complexity of O(N log N), where N is the total number of data pages in PATG. Compared to the exact solution of the dynamic programming algorithm (complexity O(N·C), where C is the capacity), it is orders of magnitude faster, completing the query admission phase in milliseconds, leaving ample time for subsequent asynchronous pre-migration scheduling, and achieving a balance between decision speed and optimization quality.
[0062] Step 104: Within a preset time window before the query plan begins execution, the data page is asynchronously scheduled among the three-tier main memory system according to the active hierarchical scheduling strategy.
[0063] In practical applications, constructing a PATG (Parallel Access Graph) through a query plan allows us to clearly define the data pages to be accessed, as well as the expected access pattern and time for each data page. Based on the PATG and the characteristics of each memory layer, we can calculate which layer (DRAM, CXL, or NVM) offers the best performance for each data page, thus generating a proactive hierarchical scheduling strategy. Within a pre-defined time window before the query plan actually begins execution—for example, after query compilation but before the executor starts—these data pages can be asynchronously (without blocking query execution) prefetched or migrated from the current layer (which might be the slow NVM) to the target layer (such as fast DRAM or CXL memory) according to the aforementioned strategy.
[0064] In this embodiment, by utilizing the idle time before the query begins, data is proactively and predictively moved to the target storage layer in advance, thereby reducing the waiting delay caused by the data not being in the target location when the query is actually executed, and thus improving performance.
[0065] In practical applications, the following six-way asynchronous scheduling can be initiated before query execution begins, ΔT (default 2 milliseconds), based on the proactive hierarchical scheduling strategy: DRAM→CXL Evict, CXL→DRAM Pre-Upgrade, NVM→CXL Asynchronous Load, NVM→DRAM Direct Pull, CXL→NVM Archive, DRAM→NVM Deep Archive.
[0066] Among them, (1) DRAM→CXL (eviction) occurs when DRAM data page p in PATG is close to the end of the current batch and is squeezed out of DRAM by a new batch of high-score data pages.
[0067] (2) CXL→DRAM (pre-upgrade) appears in CXL data page p when first_access_ms is about to arrive in the new query PATG, reuse_factor is high and scan_mode_dominant is random or point_lookup.
[0068] (3) NVM→CXL (asynchronous loading) occurs when the NVM data page p is accessed by the operator, scan_mode_dominant is sequential or bulk, and the DRAM capacity is already full of high-yield data pages.
[0069] (4) NVM→DRAM (direct pull) occurs when NVM data pages have both a high reuse_factor (such as HashJoin construction side, B+ leaf data pages) and random / point_lookup mode, skipping CXL transfer and directly loading into DRAM.
[0070] (5) CXL→NVM (archive) appears when the last_access_ms of the CXL data page has expired and is no longer available in the merge working set prediction.
[0071] (6) When DRAM→NVM (deep archiving) appears in DRAM data pages that are confirmed to be used only once (such as sorting temporary data pages, HashJoin final result data pages) and reuse_factor=0, skip CXL relay to save CXL link bandwidth.
[0072] In practical applications, the above paths (1) and (2) use the CXL.mem memcpy channel, which is initiated by the CPU and does not block the DRAM read path; paths (3), (4), (5) and (6) involve the NVM side and are asynchronously transported through Direct Memory Access (DMA).
[0073] The default value of 2ms for ΔT is based on the fact that a typical migration of a 4KB data page in CXL.mem takes about 0.5–1μs, and 2ms is sufficient to complete the pre-migration of PATG with a data page size of 1k–10k concurrently. In this way, ΔT covers most of the migration latency, and during the execution phase, the vast majority of data page accesses hit the target storage layer.
[0074] Step 105: During query execution, access the data page according to the target storage layer and monitor the deviation between the actual access and the predicted access.
[0075] The predicted access is determined based on the page access sequence diagram.
[0076] In practical applications, once the query actually begins execution, each operator can access data pages according to the target storage layer indicated in the proactive hierarchical scheduling strategy. For example, when an operator needs data page p, it can obtain the target storage layer of p, such as DRAM, and then directly send a load request to DRAM. This achieves lower access latency.
[0077] In one possible implementation, when an operator accesses a data page, it can read the storage layer identifier of the data page through the data page directory (PD) and initiate a data loading request load to the storage layer with O(1) time complexity. If the migration operation of the data page has not been completed and tier ≠ expected, resulting in the actual storage layer being inconsistent with the target storage layer determined in the above scheduling strategy, it will fall back to the next level performance storage layer of the actual storage layer of the data page for access, DRAM expected → CXL, CXL expected → NVM, in order to avoid blocking query execution.
[0078] Step 106: If the deviation exceeds the preset deviation threshold, trigger incremental deviation correction scheduling.
[0079] In one possible implementation, during query execution, the actual number of rows processed by each operator can be compared in real time with the estimated number of rows processed recorded in the PATG. If the ratio of the actual number of rows processed to the estimated number of rows processed exceeds a preset upper limit threshold or falls below a preset lower limit threshold, incremental rescheduling can be triggered. The incremental rescheduling includes: re-executing the hierarchical scheduling strategy calculation for data pages in the PATG that have not yet been accessed, and triggering incremental migration operations only for data pages that have not yet been accessed.
[0080] In practical applications, during execution, the actual number of rows compared to the estimated value for each operator is monitored. If the ratio is >2× or <0.5×, the scheduler can incrementally regenerate the scheduling policy for the remaining PATG and trigger incremental migration; only unaccessed data pages are adjusted, while accessed data pages remain unchanged. For example, a query plan performs an IndexScan (random access) followed by a SeqScan (sequential scan). During scheduling decisions, the optimizer estimates that IndexScan will return 100 rows and places them in the CXL. During actual execution, monitoring reveals that IndexScan actually returned 500 rows (ratio=5>2), indicating that the random access load is much higher than expected. The system immediately re-evaluates the subsequent SeqScan and remaining operations. Because IndexScan consumes more resources, it may be crowding out the space in the CXL for pages required by subsequent operations. The system recalculates and may decide to downgrade some data pages that SeqScan needs to access but have not yet been loaded from the originally planned CXL to NVM, while prefetching more index pages that IndexScan still needs to access into the CXL. The remainder of the query is executed using an updated, more accurate page layout, thus avoiding the severe performance degradation that could have been caused by an initial prediction error.
[0081] In one possible implementation, the method further includes: When a dirty data page in the DRAM triggers a write-back, the access characteristic information of the dirty data page is read. The access feature information is input into the probability model to estimate the probability that the dirty data page will be read again within a future preset time window; If the probability is higher than a preset probability threshold, the dirty data page is written back to the CXL memory, wherein the dirty data page written to the CXL memory is asynchronously persisted to the NVM by a background checkpoint thread; If the probability is not higher than the preset probability threshold, the dirty data page is written back to the NVM.
[0082] When a dirty DRAM data page triggers a write-back, such as due to buffer pool replacement, checkpointing, or background flushing, the access characteristics of the dirty data page are read, namely its (last_access, access_count, scan_mode) and the flush position in the WAL log. The flush position in the WAL log indicates that the modification record of the data page in the log has been safely written to disk (NVM), and can be recovered through the log even if the data page data itself has not yet been written to the NVM. Subsequently, a probabilistic model is used to estimate the probability P_reuse of the dirty data page being read again within the future T_reuse time window.
[0083] The specific fitting form of P_reuse in the probabilistic model is as follows: P_reuse = f(access_count, Δt_since_last, scan_mode_dominant) Here, f(·) is a monotonic function for the subsequent online learning closed-loop maintenance. The inputs are the historical access count of the data page, the time difference since the most recent access, and the dominant access pattern. The output is P_reuse ∈ [0, 1]. f(·) can continuously adjust its parameters by collecting real access results (whether the data page is accessed again after being written back) as feedback, thereby adapting to load changes and improving prediction accuracy.
[0084] The principle behind the write-back decision is to minimize the total expected cost. For a dirty data page, if it is written back to CXL, the total expected cost is: E_cxl = write latency (CXL) + P_reuse × read cost (CXL); if it is written back to NVM, the total expected cost is: E_nvm = write latency (NVM) + P_reuse × read cost (NVM).
[0085] Wherein, write latency (tier) ≈ base_latency(tier) (which can be approximated as the write operation latency equal to the base latency), and read cost (tier) = base_latency(tier) + (1 - bw_match(tier, scan_mode))·penalty, is calculated using the formula of the hierarchical cost model described above. This hierarchical cost model is used here for the write path to estimate the difference between P_reuse and the expected cost of the target layer. Both share the three sets of parameters—base_latency, bw_match, and penalty—and the same fitted curve.
[0086] The theoretical formula for calculating the preset probability threshold τ_wb is as follows. τ_wb = [Write latency (NVM) - Write latency (CXL)] / [Read cost (CXL) - Read cost (NVM)] In practical applications, the selection of the preset probability threshold τ_wb=0.4 is based on the following: when P_reuse=0.4, the expected cost of writing back to CXL is approximately equal to the expected cost of writing back to NVM at a trade-off point (calculated based on the default base_latency).
[0087] If P_reuse > the preset probability threshold τ_wb (default 0.4), then write back to CXL. This way, on the next read, the data can be quickly retrieved directly from CXL memory, avoiding the overhead of reading from NVM and greatly improving read performance. Otherwise, write back to NVM. This saves CXL memory space and write bandwidth, and direct persistence also completes the final data writing to disk.
[0088] It's important to note that dirty data pages written to CXL are ultimately persisted to NVM by the background checkpoint thread, without affecting ACID durability. Database durability does not require immediate synchronous writing to the persistent device for every modification to a dirty data page; instead, it is guaranteed through write-ahead logging and periodic flushing of dirty data. As long as the modification record for that data page has been written to the persistent WAL log, even if the data page itself temporarily resides in CXL memory, the correct state of the data page can be recovered by replaying the WAL log in the event of a system crash.
[0089] In one possible implementation, the method further includes: When performing cross-layer migration on any of the aforementioned data pages, the following steps are specifically executed: Before the migration begins, the status of the data page is marked as "migration in progress," and the version number of the data page is incremented. A free frame is pre-allocated in the target storage layer, and an atomic memory copy operation is performed to copy the contents of the data page to the free frame; If a transaction initiates a write operation on the data page during the migration, the write operation will be applied to a new copy of the target storage layer; Maintain a multi-version concurrency control version chain indexed by data page identifier and version number. When a read transaction accesses the data page, it checks the visibility of its version number to the current transaction snapshot. After the migration is complete, the data page directory is atomically updated to point the storage location of the data page to a new copy of the target storage layer, and the original version of the data page in its current storage layer is released.
[0090] In practical applications, before any cross-layer migration is executed: Mark data page p as entering the MIGRATING state during migration, and increment PD.version by 1; In the target storage layer, a free frame p' is pre-allocated, and an atomic memory copy (memcpy) is performed. If a transaction performs a write operation on p during the migration, the write is applied to p' (instead of p) according to the copy-on-migrate (CoM) rule. The MVCC version chain is indexed by the (page_id, version) tuple. When a read transaction reads a tier through the data page directory (PD), it simultaneously checks whether the version is visible in the current snapshot. If not, it rolls back to read the old version copy (the old copy must be released to meet the GC watermark). After migration, the PD is atomically switched to point to p', and p is released. This protocol guarantees that migration is transparent to transactions and is compatible with RC / RR / SI isolation levels.
[0091] In practical applications, the above method also includes online fitting of the hierarchical cost model parameters. For example, the measured (latency, bw) samples of each operator / level combination can be accumulated, and the base_latency(tier) and penalty in the hierarchical cost model can be updated with an exponential moving average, so that the hierarchical cost model can be adapted to different hardware generations (CXL2.0 / 3.0 / 4.0) and vendor characteristics.
[0092] In practical applications, the update formula for the Exponential Moving Average (EMA) is as follows: base_latency(tier)_new=(1 α)×base_latency(tier)_old+α×latency_measured; penalty_new=(1 α)×penalty_old+α×max(0,latency_measured base_latency(tier)); bw_match(tier,scan_mode)_new=(1 α)×old+α×(bw_measured / bw_peak(tier)).
[0093] Wherein, α is the smoothing coefficient, with a value range of (0, 1]. It is recommended to default to α = 0.2 (a commonly used empirical value that balances stability and response speed, which can be adjusted in the implementation examples).
[0094] Numerical Example 1: The original base_latency(CXL)_old = 250 ns. In this case, the measured average latency of a certain sequential scan operator in the CXL layer is latency_measured = 300 ns. Therefore, base_latency(CXL)_new = 0.8×250+0.2×300 = 260 ns. The next round of calculation of the layered cost model solution will use 260 ns.
[0095] Numerical Example 2: Given an initial penalty_old = 500 ns, a CXL random access test latency_measured = 800 ns, and base_latency(CXL) = 250 ns, then penalty_new = 0.8 × 500 + 0.2 × max(0, 800) 250) = 510 ns.
[0096] In practical applications, a cold start rule can be followed: if the cumulative number of samples for a certain (tier, scan_mode) combination is less than 32, the initial default value is still used; if it is greater than or equal to 32, EMA updates are started to avoid severe jitter during cold starts.
[0097] In this embodiment, the query plan is parsed, operator features are extracted, and a page access sequence diagram is generated. This solves the problem of OS post-sampling lag. Subsequently, based on and combined with the hardware characteristics of the three-layer memory system, a proactive hierarchical scheduling strategy is generated through calculation using a hierarchical cost model. This strategy allocates a unique target storage layer to each data page, ensuring that the database kernel and the underlying memory management follow the same optimization objectives, eliminating strategy conflicts and data thrashing. Furthermore, the hierarchical cost model can distinguish the memory sensitivity of operators, achieving precise matching between requirements and hardware. Then, in specific execution, data pre-migration is asynchronously completed according to the proactive hierarchical scheduling strategy before the actual query execution, eliminating query "cold start" latency. During execution, prediction deviations are monitored and incremental corrections are triggered, ensuring the system's robustness under dynamic loads. Thus, in a working set scenario with over DRAM capacity, compared to the traditional combination of "OS TPP + database LRU", query throughput can be increased by more than 2 times, tail query latency reduced by more than 50%, and overall memory resource utilization efficiency significantly improved.
[0098] To provide a more detailed explanation of the proactive hierarchical management method for the database buffer pool in this application embodiment, the following supplementary descriptions are also provided in this application embodiment: Figure 2 This diagram illustrates a flowchart of another proactive hierarchical management method for database buffer pools. Figure 2 The demonstration showcases the collaborative process of four modules: Query Plan Awareness Module, Three-Layer Page Directory and Cost Model Module, Look-Ahead Prefetching and Migration Scheduler Module, and Transaction Consistency Migration Protocol Module. It presents a complete closed loop from query request, query compilation, PATG generation, cross-query merging, cost calculation, asynchronous migration, execution-time routing, row count deviation correction, dirty data page write-back, transaction consistency protocol, and online learning of the layered cost model.
[0099] Specifically, after the query is compiled, the query plan awareness module extracts access features from the physical execution plan tree and generates a page access sequence diagram (PATG); the PATG queues of multiple concurrent queries are merged to obtain a shared working set prediction; the hierarchical cost model solves for the optimal level for each page based on the operator access mode and the hierarchical bandwidth matching degree; the scheduler starts a six-way asynchronous pre-migration ΔT before execution begins; during execution, access is routed according to the O(1) hierarchical level, and incremental correction is triggered when the row number deviation exceeds the deviation threshold; dirty page write-back also follows the hierarchical cost model evaluation to select the target storage layer; all cross-layer migrations are guaranteed to be MVCC correct through the transaction consistency protocol (copy-on-migrate and version chain double mapping during migration).
[0100] Figure 3 A schematic diagram of a DRAM-CXL-NVM three-layer main memory system and a six-way migration path is shown. Figure 3This demonstrates the capacity / latency / bandwidth relationship of the three layers and the six cross-layer paths in this application.
[0101] Specifically, Figure 3 The system defines three distinct memory tiers: DRAM (80ns): serving as the performance gold standard, handling the hottest data and most urgent requests. CXL (250ns): acting as a capacity expansion tier, utilizing CXL technology to provide larger capacity and slightly higher latency than DRAM, suitable for handling warm data or sequential read-intensive workloads. NVM (4us): acting as a persistent capacity tier, providing memory-like access speeds and large capacity while maintaining data persistence, suitable for cold data or data requiring rapid recovery. Unlike traditional memory management, data flow here is multi-directional, implementing sinking and escalation strategies.
[0102] The sinking strategy refers to the system's approach where, when data cools down in DRAM, it is not directly discarded but instead prioritized for rapid sinking to CXL (Current Limiting), or a write-back operation is performed to CXL. If the data remains unaccessed in CXL for an extended period, it will be archived or flushed back to NVM (NVM) to free up expensive CXL resources.
[0103] The upgrade strategy refers to the system's pre-upgrade capability. When it detects that data in CXL is about to be frequently accessed, it can be moved back to DRAM in advance to avoid query lag. For data in NVM, the system supports asynchronous loading, allowing background threads to pre-warm the data to the upper layer without blocking user requests.
[0104] Figure 4 A schematic diagram of a PATG example with cross-query merging is shown. Figure 4 Display the PATG of two query requests q1 and q2, as well as the shared working set W_share and the independent working set obtained after merging.
[0105] Specifically, q1 PATG: p1(seq,1), p2(seq,1), p3(hash,8); q2 PATG: p3(hash,4), p4(idx,3), p5(idx,3); by merging the PATGs of q1 and q2, we can obtain W_share={p3(hash,max=8)} and independent={p1,p2,p4,p5}. The W_share portion follows the "DRAM priority" principle, while the independent portion is processed using a "hierarchical cost model". Finally, the two results are merged and enter the allocation stage.
[0106] In the above, the format of each data set q1 and q2 is: Page Name + (Access Pattern, Access Count). The page name (p1, p2, p3...) represents the specific data page identifier that needs to be accessed during the query process. In the access pattern (seq, hash, idx), seq refers to sequential access, hash refers to hash access, and idx refers to index access. The PATG of query request q1 can be interpreted as query request q1 performing one sequential access each to p1 and p2, and eight hash accesses to p3 during execution. The PATG of query request q2 can be interpreted as query request q2 performing four hash accesses to p3, and three index accesses each to p4 and p5. During the merging process, q1 needs to access p3 eight times, and q2 needs to access p3 four times. This identifies the high-frequency access characteristic of p3 (maximum single access count is 8), and it is categorized as W_share (shared working set).
[0107] Figure 5 A schematic diagram of a state machine for a transactional consistency migration protocol is shown. (The above...) Figure 5 The atomic switching process of data page p from NORMAL through MIGRATING to NORMAL' (new frame) is shown, along with the routing rules for read / write transactions during this process.
[0108] Specifically, p: NORMAL tier=T1, where NORMAL tier can be translated as normal tier (old frame), indicating that data page p is in a normal service state and is currently located at tier T1 (e.g., DRAM tier). The migration begins, triggering a migration instruction to prepare to migrate page p to the target tier T2. p's status changes to MIGRATING.
[0109] Subsequently, memory allocation and copying can be performed. A new memory space, named p', is allocated at the target level T2. An atomic memory copy (memcpy) is executed to copy the data from p to p'. The system is in an intermediate state during the p→p' copy process.
[0110] Next, logging and dual write (Consistency Protocol) are performed. To ensure no data loss during the migration process (in case of potential crashes), the system executes a write transaction COM route to p'. This means that all write operations will not only be written to the original address p, but also synchronously written to the new address p', maintaining the dual write state, thus achieving the dual write ready state.
[0111] Finally, an atomic switch is performed, instantly redirecting the logical address pointing to p to p' by updating the memory address mapping table. p' then officially takes over the service, and its state is updated to p': NORMAL tier=T2.
[0112] When the process ends, there are two branches. Branch 1 is resource reclamation, which can release p. In this way, the source resource is reclaimed and the process ends.
[0113] Branch 2 is an older snapshot. The system retains a pointer to the old page p, marked as an old snapshot readback p. This provides a rollback window; if problems are found with the migrated data, it can be recovered by reading the old snapshot p.
[0114] The above are embodiments of the method proposed in this application. Based on the same inventive concept, embodiments of this application also provide an active hierarchical management device for a database buffer pool, the structure of which is as follows: Figure 6 As shown.
[0115] Figure 6 This is a schematic diagram of the internal structure of an active hierarchical management device for a database buffer pool, provided as an embodiment of this application. Figure 2 As shown, the device includes: At least one processor 601; And a memory 602 that is communicatively connected to at least one processor; The memory 602 stores instructions that can be executed by at least one processor. The instructions are executed by at least one processor 601 to enable at least one processor 601 to: execute the above-described active hierarchical management method for the database buffer pool.
[0116] In one possible implementation, the processor is capable of receiving a query request for a database and generating a query plan based on the query request; extracting features from the query plan to obtain operator features, and generating a page access timing diagram corresponding to the query request based on the operator features; based on the page access timing diagram, and in conjunction with the capacity of DRAM, CXL memory, and NVM in the three-tier main memory system, calculating using a tiered cost model to allocate a target storage layer for the data page corresponding to the query plan, thereby generating an active tiered scheduling strategy; within a preset time window before the query plan begins execution, asynchronously executing the scheduling operation of the data page between the three-tier main memory system according to the active tiered scheduling strategy; during query execution, accessing the data page according to the target storage layer, and monitoring the deviation between the actual access and the predicted access, wherein the predicted access is determined according to the page access timing diagram; and triggering incremental error correction scheduling if the deviation exceeds a preset deviation threshold.
[0117] Some embodiments of this application provide corresponding to Figure 1 A non-volatile computer storage medium stores computer-executable instructions, which are configured to execute the above-mentioned active hierarchical management method for the database buffer pool.
[0118] In one possible implementation, the computer-executable instructions are configured to receive a query request for a database and generate a query plan based on the query request; extract features from the query plan to obtain operator features, and generate a page access timing diagram corresponding to the query request based on the operator features; based on the page access timing diagram, and in conjunction with the capacity of DRAM, CXL memory, and NVM in the three-tier main memory system, calculate using a tiered cost model to allocate a target storage layer for the data page corresponding to the query plan, thereby generating an active tiered scheduling strategy; within a preset time window before the query plan begins execution, asynchronously execute the scheduling operation of the data page between the three-tier main memory system according to the active tiered scheduling strategy; during query execution, access the data page according to the target storage layer and monitor the deviation between the actual access and the predicted access, wherein the predicted access is determined according to the page access timing diagram; if the deviation exceeds a preset deviation threshold, trigger incremental correction scheduling.
[0119] The various embodiments in this application are described in a progressive manner. Similar or identical parts between embodiments can be referred to mutually. Each embodiment focuses on describing the differences from other embodiments. In particular, the embodiments for IoT devices and media are basically similar to the method embodiments, so the description is relatively simple; relevant parts can be referred to the descriptions of the method embodiments.
[0120] The systems, media, and methods provided in this application are one-to-one correspondences. Therefore, the systems and media also have similar beneficial technical effects as their corresponding methods. Since the beneficial technical effects of the methods have been described in detail above, the beneficial technical effects of the systems and media will not be repeated here.
[0121] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0122] This application is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, generate instructions for implementing the flowchart... Figure 1 One or more processes and / or boxes Figure 1 A device that provides the functions specified in one or more boxes.
[0123] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which are implemented in a process Figure 1 One or more processes and / or boxes Figure 1 The function specified in one or more boxes.
[0124] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, thereby providing instructions that execute on the computer or other programmable equipment for implementing the process. Figure 1 One or more processes and / or boxes Figure 1 The steps of the function specified in one or more boxes.
[0125] In a typical configuration, a computing device includes one or more processors (CPU), input / output interfaces, network interfaces, and memory.
[0126] Memory may include non-persistent storage in computer-readable media, such as random access memory (RAM) and / or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.
[0127] Computer-readable media include both permanent and non-permanent, removable and non-removable media that can store information by any method or technology. Information can be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile optical disc (DVD) or other optical storage, magnetic tape, magnetic magnetic disk storage or other magnetic storage devices, or any other non-transferable medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transient computer-readable media, such as modulated data signals and carrier waves.
[0128] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0129] The above description is merely an embodiment of this application and is not intended to limit the scope of this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of the claims of this application.
Claims
1. A proactive hierarchical management method for a database buffer pool, characterized in that, The method is applied to a three-tier main memory system consisting of Dynamic Random Access Memory (DRAM), Computational Fast Interconnect (CXL) memory, and Non-Volatile Memory (NVM), and the method includes: Receive a query request for the database and generate a query plan based on the query request; Feature extraction is performed on the query plan to obtain operator features, and a page access sequence diagram corresponding to the query request is generated based on the operator features; Based on the page access timing diagram, and combined with the capacity of DRAM, CXL memory and NVM in the three-tier main memory system, a tiered cost model is used to calculate and allocate the target storage layer to the data page corresponding to the query plan, so as to generate an active tiered scheduling strategy. Within a preset time window before the query plan begins execution, the scheduling operation of the data pages among the three-tier main memory system is executed asynchronously according to the active hierarchical scheduling strategy. During query execution, the data pages are accessed according to the target storage layer, and the deviation between the actual access and the predicted access is monitored, wherein the predicted access is determined based on the page access sequence diagram; If the deviation exceeds a preset deviation threshold, incremental correction scheduling is triggered.
2. The method according to claim 1, characterized in that, The feature extraction of the query plan to obtain operator features includes: The physical execution plan tree corresponding to the query plan is traversed from bottom to top; For each physical operator node encountered during the traversal, a feature tuple corresponding to the physical operator node is constructed, wherein the feature tuple includes the operator type, estimated number of rows to be processed, access mode, reuse factor and deadline hint corresponding to the physical operator node.
3. The method according to claim 2, characterized in that, The step of generating the page access sequence diagram corresponding to the query request based on the operator features includes: For each leaf operator in the physical execution plan tree, the data pages of the leaf operator are enumerated based on the estimated number of rows of the leaf operator, the physical data page information of the associated table or index, so as to construct the set of data page numbers corresponding to the leaf operator; For each data page of the leaf operator, based on the start time and end time of the leaf operator and the relative position of the data page in the data page number set, the first access timestamp and the last access timestamp of the data page are assigned by linear interpolation, and the access count and dominant access pattern of the data page are recorded to generate the page access sequence diagram of the leaf operator; Based on the connection semantics of the intermediate operator, the page access timing diagram of the left subtree and the page access timing diagram of the right subtree are combined to form the page access timing diagram of the intermediate operator, wherein the left subtree is the left subtree of the intermediate operator and the right subtree is the right subtree of the intermediate operator; The access records of the same data page accessed by different operators are merged to generate a page access sequence diagram based on data pages. At a preset time before the query plan begins execution, the page access sequence diagram corresponding to the query request is completed and locked.
4. The method according to claim 1, characterized in that, After generating the page access sequence diagram corresponding to the query request based on the operator features, the method further includes: For multiple pending query requests in the current concurrent query pool, based on the page access sequence diagram corresponding to the query requests, list all the data pages that will be accessed to form a data page set; Conflict analysis is performed on each of the data pages in the data page set. If the data page appears in the page access sequence diagrams of two or more query requests at the same time, the data page is identified as a shared page. Based on the page access sequence diagram of the two or more query requests, determine the number of times the shared page was accessed and the timestamp of the first access. Based on the shared pages, a shared working set is formed, and in the active hierarchical scheduling strategy, a storage layer with a target priority is allocated to the data pages in the shared working set so that the shared pages reside in the DRAM; The data pages in the page access sequence diagram corresponding to multiple query requests, excluding the shared pages, are treated as independent scheduling units, and active hierarchical scheduling decisions are made for each.
5. The method according to claim 4, characterized in that, In the hierarchical cost model, the formula for calculating the equivalent cost of a single access to a data page in the storage layer is as follows: cost(p,op,tier)=base_latency(tier)+(1 bw_match(tier,scan_mode))·penalty Where cost(p, op, tier) represents the estimated cost of operator op accessing data page p once on storage tier tier; base_latency(tier) represents the base access latency of storage tier tier; bw_match(tier, scan_mode) represents the bandwidth matching coefficient between storage tier tier and access mode scan_mode of operator op; penalty represents the additional cost penalty introduced when bandwidth mismatch occurs.
6. The method according to claim 5, characterized in that, The generation of the proactive hierarchical scheduling strategy includes: Under the constraints of the DRAM capacity limit and the CXL memory capacity limit, the target storage layer is allocated to the data page with the goal of minimizing the total access cost, wherein the single access cost is calculated using the formula in the hierarchical cost model; The allocation is performed using a greedy algorithm, including: A revenue score is calculated for the data page based on the reuse factor of the data page and the difference between the basic access latency of the current storage layer and the target storage layer. The data pages are sorted in descending order of earnings score; Each data page is processed sequentially. For the current data page, the data page is attempted to be placed into the DRAM, the CXL memory, or the NVM in descending order of storage layer performance until the first storage layer with sufficient remaining capacity is found, and the storage layer is used as the target storage layer for the data page. When two data pages have the same benefit score, the shared page in the shared working set is preferentially allocated to the DRAM.
7. The method according to claim 1, characterized in that, The method further includes: When a dirty data page in the DRAM triggers a write-back, the access characteristic information of the dirty data page is read. The access feature information is input into the probability model to estimate the probability that the dirty data page will be read again within a future preset time window; If the probability is higher than a preset probability threshold, the dirty data page is written back to the CXL memory, wherein the dirty data page written to the CXL memory is asynchronously persisted to the NVM by a background checkpoint thread; If the probability is not higher than the preset probability threshold, the dirty data page is written back to the NVM.
8. The method according to claim 1, characterized in that, The method further includes: When performing cross-layer migration on any of the aforementioned data pages, the following steps are specifically executed: Before the migration begins, the status of the data page is marked as "migration in progress," and the version number of the data page is incremented. A free frame is pre-allocated in the target storage layer, and an atomic memory copy operation is performed to copy the contents of the data page to the free frame; If a transaction initiates a write operation on the data page during the migration, the write operation will be applied to a new copy of the target storage layer; Maintain a multi-version concurrency control version chain indexed by data page identifier and version number. When a read transaction accesses the data page, it checks the visibility of its version number to the current transaction snapshot. After the migration is complete, the data page directory is atomically updated to point the storage location of the data page to a new copy of the target storage layer, and the original version of the data page in its current storage layer is released.
9. A proactive hierarchical management device for a database buffer pool, characterized in that, The device includes: At least one processor; And, a memory communicatively connected to the at least one processor; The memory stores instructions that can be executed by the at least one processor, which are executed by the at least one processor to enable the at least one processor to perform an active hierarchical management method for a database buffer pool as described in any one of claims 1-8.
10. A computer storage medium storing computer-executable instructions, characterized in that, When the computer-executable instructions are executed, an active hierarchical management method for a database buffer pool as described in any one of claims 1-8 is implemented.