A method and apparatus for data scheduling in a database system

By adopting the LSM-Tree architecture in the distributed database and scheduling cold and hot data at the data block level, the resource waste and scheduling lag caused by partition-level management are solved, achieving efficient cold and hot data management and improving the performance of the database system and the utilization of storage resources.

CN122309462A · Pending · Publication Date: 2026-06-30 · BEIJING OCEANBASE TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: BEIJING OCEANBASE TECHNOLOGY CO LTD
Filing Date: 2026-03-30
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing technologies in distributed databases manage hot and cold data at the partition level, resulting in low storage resource utilization and scheduling lag. They are unable to respond to changes in data popularity in real time, leading to resource waste and performance bottlenecks.

Method used

The database system adopts a log-structured merge-tree (LSM-Tree) storage architecture. Hot and cold data are scheduled at the data block level: the hotness of each data block is determined by comparing its most recent update time with a boundary time threshold. Hot data blocks are stored in high-performance storage media, while cold data blocks are stored in low-cost storage media. The scheduling is independent of the merge operation, which enables flexible, real-time data scheduling.

Benefits of technology

It improves the utilization rate of storage resources, reduces storage resource waste, enhances the performance of hot data access, balances the performance and cost of the database system, and achieves flexibility and real-time response in hot and cold data scheduling.

✦ Generated by Eureka AI based on patent content.

Abstract

This specification provides a method for data scheduling in a database system. The database system's storage architecture is a log-structured merge-tree (LSM-Tree). The method includes a target scheduling operation performed on a data table. The target scheduling operation includes: obtaining a subset of data blocks belonging to a first partition of the data table, where each data block is used to store a portion of the data rows in the first partition; determining a boundary time threshold for the subset of data blocks based on the metadata of the data table; and, for any first data block in the subset of data blocks, if the most recent update time of the first data block is later than the boundary time threshold, storing the first data block in a first storage medium, and if the most recent update time of the first data block is not later than the boundary time threshold, storing the first data block in a second storage medium, where the read / write performance of the first storage medium is higher than that of the second storage medium.

Description

Technical Field

[0001] This specification relates to the field of computer technology, and more particularly to a method and apparatus for data scheduling of a database system.

Background Technology

[0002] A distributed database is a database that runs across multiple computers. It distributes data across multiple computer nodes, which can be in a cluster environment in the same geographical location or distributed across multiple data centers in different regions. Compared to traditional single-node databases, distributed databases effectively overcome the storage capacity limitations and computing performance bottlenecks of single nodes, significantly improving the scalability, availability, and access performance of the database system. They also support large-scale data storage and processing, and are therefore widely used in fields such as e-commerce, finance, and the Internet of Things that require massive data storage and high-performance data access.

[0003] With the rapid development of cloud computing and big data technologies, the compute-storage separation architecture has become an important design paradigm for achieving high scalability and low-cost storage in distributed databases. In this architecture, compute nodes and data nodes are decoupled, and data is persistently stored in a remote shared storage service (e.g., Object Storage Service). Shared storage services offer advantages such as low cost, high scalability, and large storage capacity, but their access latency is significantly higher than that of local storage media. While the compute-storage separation architecture can significantly reduce storage costs, network access latency also leads to poor read performance, especially for hot data. To address this, the industry has proposed a tiered storage solution for hot and cold data in database systems. The core idea is to cache or migrate frequently accessed hot data to local storage media (e.g., local disks), while retaining infrequently accessed cold data in low-cost shared storage services, thereby optimizing the overall performance and resource costs of the database system.

[0004] In database systems, partitioning is a common data organization and management concept. It involves dividing the rows of a table into smaller, more manageable logical units according to specific rules (e.g., time range, key-value hashing) to improve query performance, simplify data maintenance, and enhance availability. Currently, industry-standard hot and cold data management solutions are typically based on partitioning: database systems use table partitions as the smallest granularity for determining and scheduling hot and cold data. Specifically, database systems usually mark an entire partition as hot or cold based on its time range or other predefined rules. Data in a partition marked as hot is loaded or stored on the local storage medium of the compute node, while data in a partition marked as cold resides in a shared storage service.

[0005] However, setting the granularity of hot and cold data management at the partition level has obvious limitations in practical applications, mainly in the following two aspects:

[0006] First, low storage resource utilization leads to resource waste. As a logical management unit of numerous data rows, a partition's internal data access frequency is often uneven. Commonly, only recently written data within a partition may be active, hot data rows, while the majority of the remaining data rows are essentially "cooled down." Treating a partition as a whole for scheduling inevitably forces the entire partition, containing a large amount of cold data, to be placed on limited local storage media to ensure access performance for the small number of hot data rows. This results in a serious waste of storage space and increases overall resource costs.

[0007] Second, the conditions under which a policy takes effect are stringent, and data scheduling lags. To determine hot and cold data at the partition level, related technologies typically rely on explicitly defined partitioning strategies, such as time-based range partitioning. For example, a database system might by default define partitions containing data rows generated within the last 7 days as hot partitions. This requires that data tables be created with a specific time partitioning key, making the approach unsuitable for historical data tables that do not employ such partitioning strategies. Furthermore, these technologies couple the hot / cold determination process with the database system's background compaction (merge) process, meaning that user-configured hot / cold policies must wait for a merge to complete before taking effect and cannot respond in real time to changes in data popularity.

[0008] Therefore, it is hoped that a solution can be provided that can achieve more refined and flexible data scheduling in the database system through technical means, break through the partitioning limitation, and realize the identification and storage of hot and cold data on smaller data units, so as to maximize the utilization of storage resources while improving the access performance of hot data, and balance the performance and cost of the database system.

Summary of the Invention

[0009] This specification describes one or more embodiments of a method and apparatus for data scheduling in a database system, which can solve the above-mentioned technical problems.

[0010] According to a first aspect, a method for data scheduling in a database system is provided, wherein the storage architecture of the database system is a log-structured merge-tree (LSM-Tree), and the method includes a target scheduling operation performed on a data table, wherein the target scheduling operation includes:

[0011] Obtain a subset of data blocks belonging to the first partition of the data table, where each data block is used to store a portion of the data rows of the first partition.

[0012] Based on the metadata of the data table, determine the boundary time threshold for the subset of data blocks.

[0013] For any first data block in the subset of data blocks, if the most recent update time of the first data block is later than the boundary time threshold, then the first data block is stored in the first storage medium; if the most recent update time of the first data block is not later than the boundary time threshold, then the first data block is stored in the second storage medium; the read / write performance of the first storage medium is higher than that of the second storage medium.
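The three steps of the first aspect can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `DataBlock`, `Medium`, `boundary_threshold`, and `schedule_blocks` are hypothetical, and the rule that derives the threshold from a `hot_window_seconds` metadata field is an assumption, since the text above only states that the threshold is determined from the metadata of the data table.

```python
from dataclasses import dataclass
from enum import Enum


class Medium(Enum):
    FIRST = "first storage medium (higher read/write performance)"
    SECOND = "second storage medium (lower read/write performance)"


@dataclass(frozen=True)
class DataBlock:
    block_id: str
    last_update_ts: int  # most recent update time, e.g. seconds since epoch


def boundary_threshold(table_metadata: dict, now_ts: int) -> int:
    # Assumed rule (not from the source): anything updated within the
    # configured window counts as hot.
    return now_ts - table_metadata["hot_window_seconds"]


def schedule_blocks(blocks, table_metadata: dict, now_ts: int) -> dict:
    """Map each block id in one partition's block subset to a storage medium."""
    threshold = boundary_threshold(table_metadata, now_ts)
    placement = {}
    for block in blocks:
        # Strictly later than the threshold: first medium; otherwise second.
        if block.last_update_ts > threshold:
            placement[block.block_id] = Medium.FIRST
        else:
            placement[block.block_id] = Medium.SECOND
    return placement
```

Note that a block whose update time equals the threshold is "not later than" it and therefore goes to the second storage medium.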

[0014] According to a second aspect, this specification provides an apparatus for data scheduling in a database system, wherein the storage architecture of the database system is a log-structured merge-tree (LSM-Tree), and the apparatus is used to perform a target scheduling operation on a data table. The apparatus includes:

[0015] The acquisition module is configured to acquire a subset of data blocks belonging to the first partition of the data table, wherein each data block is used to store a portion of the data rows of the first partition.

[0016] The determination module is configured to determine the boundary time threshold for the subset of data blocks based on the metadata of the data table.

[0017] The scheduling module is configured to, for any first data block in the subset of data blocks, if the most recent update time of the first data block is later than the boundary time threshold, then the first data block is stored in a first storage medium; if the most recent update time of the first data block is not later than the boundary time threshold, then the first data block is stored in a second storage medium; the read / write performance of the first storage medium is higher than that of the second storage medium.

[0018] According to a third aspect, a computer program product is provided, including a computer program / instructions that, when executed by a processor, implement the steps of the method described in the first aspect.

[0019] According to a fourth aspect, a computing device is provided, including a memory and a processor, characterized in that the memory stores executable code, and when the processor executes the executable code, it implements the method described in the first aspect.

[0020] In summary, the embodiments provided in this specification present a method for data scheduling in a database system. This method refines the hot and cold data scheduling operation in the database system down to the data blocks constituting the partition. This allows the operation to proceed independently of the LSM-Tree merging process, proactively executing on demand without waiting for the merging operation to be triggered. This ensures that the hot and cold data scheduling strategy takes effect quickly after configuration, responding in real time to dynamic changes in data popularity. Furthermore, since the data scheduling operation works at the data block level, rather than on the entire partition, it can accurately schedule hot data to high-performance storage media without loading the entire partition, significantly reducing storage resource waste. This approach ensures the performance of hot data access while improving the space utilization of the storage media, balancing the performance requirements of the database system with storage costs.

Attached Figure Description

[0021] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are merely some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without any creative effort.

[0022] Figure 1 is a schematic diagram of a typical LSM-Tree storage architecture disclosed in this specification;

[0023] Figure 2 is a schematic diagram of an exemplary distributed database implementation architecture disclosed in this specification;

[0024] Figure 3 is a schematic diagram of an exemplary implementation architecture, provided in this specification, for a distributed database employing a shared storage model;

[0025] Figure 4A is a schematic diagram of a data block storage architecture disclosed in this specification;

[0026] Figure 4B is a schematic diagram of an implementation framework, provided in this specification, for a data scheduling method for a database system;

[0027] Figure 5 is a flowchart illustrating a method for data scheduling in a database system according to embodiments of this specification;

[0028] Figure 6 is a schematic diagram of an apparatus for data scheduling of a database system according to an embodiment of this specification.

Detailed Implementation

[0029] The solutions provided in the embodiments of this specification will now be described with reference to the accompanying drawings.

[0030] Currently, in many enterprises, data analysis and operational activities are all based on databases as their data foundation. Databases can efficiently organize, store, and manage massive amounts of data. Through flexible query interfaces and transaction mechanisms, they provide efficient data read and write services for upper-layer applications while ensuring data consistency and reliability.

[0031] From a database architecture perspective, databases can employ an append-only write strategy to handle new data. This means that insert, update, and delete operations performed on a database instance are all stored as newly added data, and once stored, the data is not modified. For example, data insertion can be achieved by directly appending new data records, updating historical data can be achieved by appending new-version data records, and deleting historical data can be achieved by appending delete-marker records.
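The append-only strategy can be sketched as follows. This is an illustrative toy, assuming an in-memory list as the storage and a monotonically increasing version counter; the names `Record` and `AppendOnlyLog` are hypothetical and not taken from the specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    key: str
    value: Optional[str]
    version: int
    deleted: bool = False  # True marks a delete-marker record


class AppendOnlyLog:
    def __init__(self):
        self._records = []
        self._next_version = 1

    def _append(self, key, value, deleted=False):
        record = Record(key, value, self._next_version, deleted)
        self._records.append(record)  # existing records are never modified
        self._next_version += 1
        return record

    def insert(self, key, value):
        return self._append(key, value)

    def update(self, key, value):
        # An update is just a newer version of the same key.
        return self._append(key, value)

    def delete(self, key):
        # A delete appends a marker instead of removing old records.
        return self._append(key, None, deleted=True)

    def get(self, key):
        # The newest record for a key decides what a reader sees.
        for record in reversed(self._records):
            if record.key == key:
                return None if record.deleted else record.value
        return None

    def __len__(self):
        return len(self._records)
```

After one insert, one update, and one delete of the same key, the log holds three records and a read of that key returns nothing, because the newest record is the delete marker.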

[0032] At the data storage level, besides organizing data in data pages and persisting it to storage media, databases can also use log-structured storage structures for data persistence. Typically, an LSM-Tree (Log-Structured Merge-Tree) storage architecture can be used. Under this architecture, new data written to the database is first sequentially appended to a MEMTable in memory, forming incremental data. This incremental data is then frozen in an immutable form (Immutable MEMTable) and persisted to storage media, creating multiple SSTable (Sorted String Table) files generated in time order. In other words, the MEMTable resides in memory to store incremental data, providing read and write operations. When the space occupied by the incremental data stored in the MEMTable reaches a certain threshold, the incremental data in the MEMTable is frozen, forming an Immutable MEMTable, and further persisted to SSTable files. SSTables can be stored on non-volatile storage media for storing static data. Unlike RAM, non-volatile storage media retain data after power failure, offer large storage capacity at relatively low cost, and provide high sequential write bandwidth but have high random read / write latency. In this storage architecture, all data insertion, update, and deletion operations can be transformed into new data records and ultimately appended to the SSTable file, aligning with the append-only write strategy described above. In some practices, the SSTable file can be further divided into several fixed-size blocks at the logical level, with each block containing several data records.

[0033] The basic data unit for database operations can be the SSTable file described above, or the blocks that make up the SSTable file. In some embodiments of this specification, data blocks will be used to represent the basic data unit for database operations, and their specific meaning may vary depending on the database architecture design in the actual application.

[0034] For a concrete implementation example of the LSM-Tree database described above, refer to Figure 1, which shows a typical LSM-Tree storage architecture. Database systems using this architecture can transform random write operations into batch sequential write operations, greatly improving data writing speed.

[0035] Referring to Figure 1, data updates over a period of time (corresponding to the write operations shown in the figure) are persisted sequentially to a log file (corresponding to the Write-Ahead Log (WAL) shown in the figure) and written to a data structure in memory (corresponding to the MEMTable shown in the figure). When the amount of data in the MEMTable exceeds a certain threshold, the MEMTable is frozen and transformed into an immutable MEMTable (corresponding to the Immutable MEMTable shown in the figure). At the same time, to avoid blocking database write operations, a new MEMTable is generated to serve subsequent data writes. Next, background tasks in the database system persist the data in the in-memory Immutable MEMTable (corresponding to the flush operation shown in the figure) to an SSTable file on disk without blocking the processing of database foreground tasks. In this way, random write operations performed by foreground tasks on different data in the database system can be batch-flushed to SSTables on disk. This flushing process is a sequential append write to the SSTable, thus transforming the originally scattered random write operations into batch, continuous sequential write operations, significantly improving data write efficiency.
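The write path described above (log first, then MEMTable, then freeze and flush) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the WAL is a plain list, the freeze threshold counts entries rather than bytes, and `flush` is called explicitly instead of running as a background task. The class name `LsmWriter` and its attributes are hypothetical.

```python
class LsmWriter:
    def __init__(self, memtable_limit: int):
        self.memtable_limit = memtable_limit  # assumed unit: number of entries
        self.wal = []          # stand-in for the write-ahead log
        self.memtable = {}     # active, mutable in-memory table
        self.immutable = []    # frozen memtables waiting to be flushed
        self.l0_sstables = []  # each SSTable is a key-sorted list of (key, value)

    def put(self, key, value):
        self.wal.append((key, value))  # sequential log write comes first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._freeze()

    def _freeze(self):
        # Freeze the full memtable and start a fresh one so that
        # subsequent writes are not blocked.
        self.immutable.append(self.memtable)
        self.memtable = {}

    def flush(self):
        # Persist every frozen memtable as one sorted, immutable run.
        while self.immutable:
            frozen = self.immutable.pop(0)
            self.l0_sstables.append(sorted(frozen.items()))
```

Writes that arrive in random key order leave the sketch as sorted runs, which is the sense in which random writes become batched sequential writes.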

[0036] Continuing with Figure 1, the SSTables on disk are organized in a multi-level structure (the example in the figure includes three levels: L0, L1, and L2). The number of levels can be set according to specific needs. Generally, the total capacity of the SSTables at each higher level is smaller than that at the next lower level. The L0 level SSTable is usually generated directly by flushing the Immutable MEMTable. SSTables at other levels (e.g., L1, L2) are generated through compaction operations. Specifically, when the data capacity of the L0 SSTable reaches or approaches its limit, the data of the L0 SSTable can be written to the L1 SSTable through compaction. Similarly, when the data capacity of the L1 SSTable reaches its limit, the data of the L1 SSTable can be written to the L2 SSTable through compaction, and so on. In this way, the data stored in the SSTable at each higher level is newer than that at the next lower level. Overall, the newest data is stored in memory, the second newest data is stored in the L0 SSTable, and, as compaction operations gradually migrate data to lower levels, the oldest data is stored in the lowest level SSTable.
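The capacity-triggered cascade between levels can be illustrated with a small sketch. It makes strong simplifying assumptions that are not from the specification: each level is modeled as a single dictionary, capacity is counted in entries, and the function name `compact_if_full` is hypothetical.

```python
def compact_if_full(levels, limits):
    """Push full levels downward, newest data winning on key conflicts.

    levels[i] is a dict for level i (index 0 is L0, the newest level);
    limits[i] is the assumed maximum number of entries for that level.
    The last level is never compacted further in this sketch.
    """
    for i in range(len(levels) - 1):
        if len(levels[i]) >= limits[i]:
            merged = dict(levels[i + 1])  # start from the older data
            merged.update(levels[i])      # newer level overrides old values
            levels[i + 1] = merged
            levels[i] = {}
    return levels
```

Because the loop walks from L0 downward, a compaction that fills L1 can immediately trigger the L1 to L2 compaction in the same pass.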

[0037] Continuing with Figure 1, in LSM-Tree databases the execution of read operations is also closely related to the aforementioned multi-level storage architecture. When processing a read operation, the database system first queries the MEMTable in memory. If no match is found, it continues to query the Immutable MEMTable in memory. If no matching data is found in memory, the system searches the multi-level SSTables on disk, typically starting from a certain level (e.g., the L0 level) and traversing downward level by level (e.g., from L0 to L1, then to L2) until matching data is found.
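The lookup order described above can be sketched as follows. The sketch assumes that each SSTable is a key-sorted list of `(key, value)` pairs, that newer files within a level are appended last, and it omits delete markers for brevity. The function names `search_sstable` and `lsm_get` are hypothetical.

```python
import bisect

_MISSING = object()  # sentinel so that stored None values are not confused with "absent"


def search_sstable(sstable, key):
    """Binary-search one key-sorted SSTable for a key."""
    keys = [k for k, _ in sstable]
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return sstable[i][1]
    return _MISSING


def lsm_get(key, memtable, immutables, levels):
    # 1. Active memtable holds the newest data.
    if key in memtable:
        return memtable[key]
    # 2. Frozen memtables, newest first.
    for frozen in reversed(immutables):
        if key in frozen:
            return frozen[key]
    # 3. On-disk levels from L0 downward; stop at the first hit.
    for level in levels:
        for sstable in reversed(level):
            found = search_sstable(sstable, key)
            if found is not _MISSING:
                return found
    return None
```

Stopping at the first match is what makes the newest version win: a stale copy of the same key at a lower level is never reached.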

[0038] The above is a detailed introduction to the LSM-Tree storage architecture in databases. Database systems based on this storage architecture can convert random data writes into sequential append writes and reorganize data using background compaction (merge) operations, thereby achieving extremely high data write throughput. The architecture is particularly suitable for sustaining continuous writes of massive amounts of data and is widely used in write-intensive scenarios such as time-series data recording and operation log archiving.

[0039] Next, we will first introduce the implementation architecture of distributed databases, then the typical implementation architecture of distributed databases under the shared storage mode, and finally, based on the above content, explain the technical problems that distributed databases face in hot and cold data scheduling scenarios under the shared storage mode. It should be noted that the distributed database architectures shown in Figure 2 and Figure 3 of this specification are only examples, and the number of nodes (including compute nodes and data nodes), the partitioning, and the region affiliation are not limiting. In practice, these can be flexibly configured according to needs. For example, using a single node to implement the functions of both compute and data nodes can transform a distributed database with shared storage into a single-node database.

[0040] Figure 2 shows an exemplary distributed database implementation architecture. Referring to Figure 2, a distributed database system consists of several nodes, each typically a single physical machine. These nodes often belong to different regions, with each node belonging to one region. For example, Figure 2 shows region A, which contains nodes A1 and A2, and region B, which contains nodes B1 and B2. A region is a logical concept, typically representing a set of nodes with similar network conditions or geographical locations. That is, a region can have different meanings depending on the deployment model. For example, when a database system is deployed in a single data center, nodes in a region can be machines belonging to the same rack or machines attached to the same switch. When a database system is deployed in multiple data centers, each region can correspond to one data center, and the nodes in a region are the machines deployed within that data center.

[0041] In some distributed databases, data in a table can be split (horizontally partitioned) into multiple data shards according to certain partitioning rules. Each data shard is a table partition, or simply a partition (or Tablet). Any row of data in a table belongs to one and only one partition. Multiple partitions of a data table can be distributed and stored across multiple different nodes. For example, in Figure 2, data table 1 is divided into 4 partitions and data table 2 is divided into 2 partitions, and these partitions are distributed and stored on various nodes.
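As one possible illustration of a partitioning rule, the sketch below assigns rows to partitions by hashing a key column. Hash partitioning is only one of the rules a database might use, and the function names `hash_partition` and `split_table` are hypothetical. `zlib.crc32` is used instead of Python's built-in `hash()` because the latter is salted per process for strings and would not give a stable assignment.

```python
import zlib


def hash_partition(key: str, num_partitions: int) -> int:
    """Deterministically map a key to a partition index."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def split_table(rows, key_field: str, num_partitions: int) -> dict:
    """Horizontally split rows so that each row lands in exactly one partition."""
    partitions = {p: [] for p in range(num_partitions)}
    for row in rows:
        index = hash_partition(str(row[key_field]), num_partitions)
        partitions[index].append(row)
    return partitions
```

Every row is appended to exactly one list, which mirrors the property that any row belongs to one and only one partition.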

[0042] In practical applications, to achieve high availability of a database system, a partition can have multiple replicas. Typically, these replicas are distributed across multiple different regions, and only one of them can accept data modification operations; this replica is the primary replica, and the others are secondary replicas. In the example shown in Figure 2, a rounded rectangle represents the primary replica of a partition, and a parallelogram represents a secondary replica of a partition.

[0043] The node hosting the primary replica can be considered the master node, and the nodes hosting the secondary replicas can be considered secondary nodes. Once the node hosting the primary replica begins providing services, user data update operations will generate corresponding logs / write-ahead logs on that node. These logs are then synchronized to all secondary replica nodes based on a distributed consensus protocol (e.g., Raft, Paxos), ensuring data consistency across replicas. When the node hosting the primary replica fails, the database system can automatically initiate an election through the consensus protocol to select a new master node from the surviving secondary nodes to continue providing services, achieving automatic fault recovery and continuous service availability.

[0044] Based on the above description of the distributed database implementation architecture, it is clear that distributed database systems achieve high availability and distributed processing through data sharding and multi-replica deployment. However, traditional distributed database systems often adopt a shared-nothing (SN) architecture, where each node needs to store a complete copy of the data. This not only leads to data redundancy and increased storage costs, but also necessitates complex data redistribution operations during node expansion or fault recovery, which is time-consuming and impacts the business continuity of the database system. Furthermore, during node fault recovery, complete data shards need to be synchronized from other nodes, resulting in low efficiency. To address this, the industry has proposed a shared storage architecture for distributed database systems. This architecture decouples data storage functions from nodes, forming a unified, shareable data storage layer (hereinafter referred to as the shared storage layer). This allows nodes to focus on computational tasks without the responsibility of data persistence and synchronization, thus achieving stateless characteristics. While effectively reducing data redundancy, it also improves elastic scalability.

[0045] Figure 3 shows an exemplary implementation architecture of a distributed database system employing a shared storage model. Referring to Figure 3, from an architectural perspective, the database system can be divided into a compute node layer and a shared storage layer. Additionally, it can include a metadata service layer and a log service layer. The log service layer and metadata service layer can serve as shared layers between multiple compute nodes and data nodes, achieving data sharing and synchronization based on a distributed consensus protocol. From an engineering implementation perspective, the log service layer and metadata service layer can be implemented as sub-modules within the shared storage layer, or they can be implemented as separate modules; this specification does not limit this. The aforementioned layers can be deployed independently and work collaboratively through standardized interfaces to achieve separation of computation and storage in the database system.

[0046] The compute node layer can consist of several stateless compute nodes, responsible for handling computationally intensive tasks such as querying and transaction execution. The shared storage layer provides persistent data storage and sharing services, ensuring data consistency. Specifically, the shared storage layer can be implemented using object storage services, which can be a distributed, scalable storage cluster storing database data based on an LSM-Tree write strategy. For any tenant, only complete copies of data and logs need to be stored in the shared storage layer to reduce data redundancy and lower storage costs. Simultaneously, to improve the performance of the database system in Transaction Processing (TP) scenarios, each compute node in the compute node layer can store hot data of partitions in its local cache, reducing dependence on access to the shared storage layer. Under this architecture, due to the stateless nature of the compute nodes, when scaling up or recovering from a fault, there is no need for data redistribution; only newly added data that has not yet been persisted in the shared storage layer needs to be retrieved from other nodes. This significantly improves the efficiency of node addition or fault recovery, reducing the impact on the business continuity of the database system.

[0047] Continuing with Figure 3, in a distributed database system employing a shared storage model, compute nodes can be divided into read-write compute nodes (RW nodes) and read-only compute nodes (RO nodes) based on their read and write permissions to replicas. These two types of nodes work together to provide high-concurrency transaction processing and query services for partitioned data within the replicas. In Figure 3, rounded rectangles represent compute nodes managing primary replicas, and parallelograms represent compute nodes managing secondary replicas; a replica can include several partitions of data. In Figure 3, the letter P denotes partition data, such as P1, P2, etc. Compute nodes manage replicas using Log Streams (LS). It can be understood that, since a compute node can manage multiple partitions, it may act as an RW node or an RO node depending on the type of replica it manages for a particular partition. In other words, the RW / RO attribute is determined by the partition replica type. For example, if a compute node manages the primary replica of partition P1, it is an RW node for partition P1. If it also manages the secondary replica of partition P4, it is also an RO node for partition P4.
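The point that the RW / RO attribute is per partition rather than per node can be shown with a very small sketch. The mapping format and the function name `node_role` are hypothetical and chosen only for illustration.

```python
def node_role(node_replicas: dict, partition: str) -> str:
    """Return the role of one compute node with respect to one partition.

    node_replicas maps a partition name to the replica type this node
    manages for it: "primary" or "secondary".
    """
    replica_type = node_replicas.get(partition)
    if replica_type == "primary":
        return "RW"
    if replica_type == "secondary":
        return "RO"
    return "none"  # this node manages no replica of the partition
```

With `{"P1": "primary", "P4": "secondary"}`, the same node is an RW node for P1 and an RO node for P4, matching the example in the paragraph above.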

[0048] RW nodes manage the primary replica of the partition, primarily responsible for handling data update transactions. They can perform write operations and generate commit logs (CLog, a log instantiated based on the WAL mechanism). RO nodes manage the secondary replicas of the partition, primarily providing read-only query services. They obtain data changes from the shared storage layer through synchronization mechanisms (e.g., distributed consensus protocols) to ensure data consistency, thereby achieving read-write separation, distributing the read load of RW compute nodes, and providing multi-read, multi-write capabilities.

[0049] Specifically, the commit log is a core component ensuring transaction durability and data consistency. It records all data operations performed during the transaction commit process, including but not limited to the transaction operation type, data changes, and timestamp information. The commit log is a crucial basis for fault recovery and data synchronization in distributed database systems. In shared storage mode, the working mechanism of the commit log differs from the traditional SN architecture. When executing a transaction, the RW node follows a log-first principle, pre-writing the data changes involved in the transaction to the commit log, and then updating the MEMTable based on the transaction execution. In this way, the commit log can record the generation process of incremental data in the MEMTable, and any node can recover data by replaying the commit log. The commit log generated by the RW node can then be uploaded to the shared storage layer (e.g., through a log service implemented based on the Multi-Paxos protocol) for archiving, allowing other computing nodes to pull the archived commit log from the shared storage layer for data replay, thereby maintaining data consistency between nodes.
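The log-first principle and recovery by replay can be sketched as follows. This is a toy model under stated assumptions: the commit log is an in-memory list of `(op, key, value)` tuples, deletes remove the key directly instead of writing a delete marker, and archiving to the shared storage layer is not modeled. The names `RwNode`, `apply_entry`, and `replay` are hypothetical.

```python
def apply_entry(memtable: dict, entry) -> None:
    """Apply one commit-log entry to a memtable."""
    op, key, value = entry
    if op == "put":
        memtable[key] = value
    elif op == "delete":
        memtable.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")


class RwNode:
    def __init__(self):
        self.commit_log = []
        self.memtable = {}

    def execute(self, op, key, value=None):
        entry = (op, key, value)
        self.commit_log.append(entry)      # write the log first
        apply_entry(self.memtable, entry)  # then update the memtable


def replay(commit_log) -> dict:
    """Rebuild a memtable on another node by replaying the log in order."""
    memtable = {}
    for entry in commit_log:
        apply_entry(memtable, entry)
    return memtable
```

Because both the writer and the replayer go through the same `apply_entry` function, a node that replays the full log ends up with the same incremental data as the node that produced it.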

[0050] In terms of data persistence, based on the storage characteristics of LSM-Tree, the SSTables in the shared storage layer adopt a multi-level organization method (as mentioned above, three levels: L0, L1, and L2). The generation and merging of SSTables at each level are achieved through different types of compaction operations.

[0051] Referring to Figure 3, when an RW node processes write transactions for primary replica partition data, it first writes the data to a MEMTable in memory. When the memory usage of the MEMTable reaches a preset threshold, to avoid blocking subsequent write operations, the RW node can freeze the MEMTable as an Immutable MEMTable. Then, through a merge operation (referred to as a mini compaction to distinguish it from other types of merge operations), the data in the Immutable MEMTable is persisted to an SSTable file (referred to as a mini SSTable to distinguish it from SSTables generated by other types of merge operations). The execution of a mini compaction signifies that the incremental data in memory is officially written to disk; a mini compaction can also be called an incremental merge. After the mini compaction is completed, the mini SSTable is stored locally on the RW node. A background process can upload / write this mini SSTable to the L0 level of the shared storage layer, completing the persistence of the partition replica data in the MEMTable. As mentioned earlier, the LSM-Tree database system uses an append-only approach for data updates; therefore, the data in the mini SSTable can also be considered incremental data for the partition replicas.

[0052] When the number of mini SSTable files at the L0 level of the shared storage layer accumulates to a preset threshold, or when file fragmentation affects data query efficiency to a certain extent, a merge operation (referred to as a minor compaction to distinguish it from other types of merge operations) can be performed to merge multiple mini SSTables. More specifically, minor compaction covers two scenarios. In one, multiple fragmented mini SSTables at the L0 level are merged into a larger mini SSTable with stronger data continuity (not shown in Figure 3), in order to reduce the number of files at the L0 level. In the other, multiple mini SSTables at the L0 level are merged with the SSTable file at the L1 level (called a minor SSTable to distinguish it from other types of SSTables) to generate a new minor SSTable at the L1 level, in order to reduce the overhead of cross-level queries.

[0053] In addition, the database system can periodically initiate a full merge operation (called a major compaction) to merge the SSTables at each level in the shared storage layer (in some practices, dynamic MEMTable data may also be included) into a single SSTable with the same version (called a major SSTable) as a new data baseline, thereby cleaning up expired data and optimizing the storage layout.
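The three merge types can be sketched as follows. This is a simplified illustration under stated assumptions: each SSTable is modelled as a Python dict from row key to value, later tables override earlier ones on key conflicts, only the L0-into-L1 form of minor compaction is shown, and deletion markers and expired-data cleanup are not modelled. The class `SharedStorage` and the parameter `l0_limit` are hypothetical names.

```python
def merge_tables(tables):
    """Merge SSTables given oldest first; later tables win on key conflicts."""
    merged = {}
    for table in tables:
        merged.update(table)
    return dict(sorted(merged.items()))   # keep rows ordered by row key


class SharedStorage:
    def __init__(self, l0_limit=2):
        self.l0 = []              # mini SSTables, oldest first
        self.l1 = {}              # minor SSTable
        self.l2 = {}              # major SSTable (data baseline)
        self.l0_limit = l0_limit  # preset threshold on the number of L0 files

    def mini_compaction(self, immutable_memtable):
        """Persist a frozen MEMTable as a new mini SSTable at L0."""
        self.l0.append(dict(sorted(immutable_memtable.items())))
        if len(self.l0) >= self.l0_limit:
            self.minor_compaction()

    def minor_compaction(self):
        """Merge all L0 mini SSTables with the L1 minor SSTable."""
        self.l1 = merge_tables([self.l1] + self.l0)
        self.l0 = []

    def major_compaction(self):
        """Merge every level into a single new baseline SSTable at L2."""
        self.l2 = merge_tables([self.l2, self.l1] + self.l0)
        self.l1, self.l0 = {}, []


storage = SharedStorage(l0_limit=2)
storage.mini_compaction({"b": 1, "a": 1})
storage.mini_compaction({"a": 2, "c": 1})   # second L0 file triggers a minor compaction
storage.mini_compaction({"c": 3})
storage.major_compaction()
```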

[0054] In the shared storage architecture described above, the RW node itself holds a complete data view of the partition, namely the sum of the incremental data in the current MEMTable and the data in the persistent SSTable in the shared storage. This allows the RW node to handle real-time read and write requests for the partition. The RO node, on the other hand, relies on the SSTable files in the shared storage and asynchronously synchronized commit logs to replay the incremental data of the partition, thus building a near-real-time read-only data view. In other words, the RO node can pull the commit logs corresponding to the partition from the shared storage layer (or log service layer) and replay them locally, thereby updating the data state of the partition. This mechanism ensures that the RO node holds near-real-time data of the partition without bearing the partition's write load, providing low-latency query services for the partition.

[0055] Therefore, in database systems using shared storage, when a transaction triggers a read operation on a data row, the compute node responsible for providing query services needs to retrieve the relevant data from the shared storage layer. Since the shared storage layer is designed to provide high-capacity, low-cost storage capabilities, while offering high scalability, its access latency is typically higher than local storage media. Specifically, in LSM-Tree query scenarios, if the queried data does not exist in the compute node's in-memory table (MEMTable, Immutable MEMTable), the compute node needs to initiate I / O requests to the shared storage layer, traversing multiple levels of SSTable files. This process inevitably introduces network latency and the processing latency of the shared storage layer. For frequently accessed hot data rows, repeated remote reads become a bottleneck for database system performance, directly impacting transaction processing response time and overall throughput.
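The read path described above can be sketched as a lookup that falls through from memory to the shared storage layer. This is a rough illustration only: the function name `lookup` is hypothetical, every SSTable probed in shared storage is counted as one remote read, and real systems typically use indexes or filters to skip files, which this sketch omits.

```python
def lookup(key, memtable, immutable_memtables, shared_levels):
    """Return (value, remote_reads) for a row key.

    shared_levels is a list of levels (L0, L1, L2), each a list of
    SSTables modelled as dicts, ordered oldest first within a level.
    """
    if key in memtable:
        return memtable[key], 0
    for table in reversed(immutable_memtables):   # newest frozen table first
        if key in table:
            return table[key], 0
    remote_reads = 0
    for level in shared_levels:                   # L0, then L1, then L2
        for sstable in reversed(level):           # newest file first
            remote_reads += 1                     # one I/O request to shared storage
            if key in sstable:
                return sstable[key], remote_reads
    return None, remote_reads


memtable = {"a": 1}
immutable = [{"e": 5}]
levels = [[{"b": 2}], [{"c": 3}], [{"d": 4}]]

hit_in_memory = lookup("a", memtable, immutable, levels)
hit_in_l2 = lookup("d", memtable, immutable, levels)
miss = lookup("zz", memtable, immutable, levels)
```

A key found only in the deepest level costs three remote reads in this example, which is the kind of repeated remote access that local caching of hot data is meant to avoid.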

[0056] To address this, a common industry optimization approach is to configure high-performance storage media, such as local caches, on the compute nodes to cache identified frequently accessed data rows. When subsequent queries request these data rows again, the compute node can directly read them from the local cache, thereby avoiding remote access to the shared storage layer and improving the responsiveness of the database system. Therefore, accurately, flexibly, and efficiently scheduling frequently accessed (i.e., high-frequency accessed) data rows to the compute node's local cache while storing low-frequency accessed data rows in the shared storage layer becomes crucial for balancing database system performance and storage costs.

[0057] In some related technologies, hot and cold data management typically uses data table partitions as the basic unit. The database system classifies a partition as either a high-frequency or low-frequency access partition based on its time information or other predefined fields. Data from partitions identified as high-frequency access is loaded into the local cache of the compute nodes, while data from partitions identified as low-frequency access is stored in the shared storage layer. As mentioned earlier, while this partition-based hot and cold data management is simple to implement, it has several limitations in practical use, including but not limited to the following:

[0058] On the one hand, as a logical management unit of numerous data rows, a partition's internal data access frequency is often uneven. Commonly, only recently written data within a partition may be active, hot data rows, while the majority of the remaining data rows are essentially "cooled down." Treating a partition as a whole for scheduling inevitably forces the entire partition, containing a large amount of cold data, to be placed on local storage media with limited space in order to ensure access performance for the small amount of hot data. This means that a large amount of rarely accessed cold data occupies space in the compute node's local cache. Local caches are typically limited in capacity and expensive. This crude strategy of managing hot and cold data by partition leads to inefficient use of the local cache, preventing it from being used to cache more hot data in other partitions, resulting in low overall resource utilization.

[0059] On the other hand, to achieve partition-level hot / cold data determination, related technologies typically rely on explicitly defined partitioning strategies, such as time-based range partitioning. For example, a database system might define partitions that generate new data rows within the last 7 days as hot partitions by default. This requires that data tables use a specific time partitioning key when created, making this approach unsuitable for historical data tables that do not employ such partitioning strategies. Furthermore, due to continuous merge operations, data from a partition may be scattered across multiple SSTable files at different levels, and these SSTable files are constantly reorganized as merge operations are performed. To ensure data consistency, partition-level hot / cold data scheduling schemes often couple their execution with the merge operation. For example, after a full merge is completed, hot / cold scheduling of partitions is performed based on the new baseline SSTable. This can cause user-configured hot / cold strategies to fail to respond in real-time to changes in data popularity, reducing the flexibility of the database system.

[0060] In view of this, the inventors have proposed a method for data scheduling in a database system in the embodiments of this specification. This method can flexibly determine the hot and cold data subsets in a partition and perform data scheduling, thereby achieving fine-grained and flexible data scheduling. It breaks through the partition limitation, realizes the identification and storage of hot and cold data on smaller data units, improves the access performance of hot data, maximizes the utilization of storage resources, and balances the performance and cost of the database system. Figure 4A and Figure 4B together illustrate the implementation framework of this method.

[0061] First, refer to Figure 4A, which shows a schematic data block storage structure. Database partitions are persistently stored using SSTables. The dashed boxes in the figure represent the storage levels of the data blocks, where an SSTable can also be considered a type of data block. Assuming the storage level corresponding to an SSTable is N (in a typical LSM-Tree storage architecture, N is usually 0), the data rows in the SSTable, ordered by row key, can be further divided into smaller data blocks, corresponding to storage level N+1. If higher performance is required in practical applications, the data blocks at storage level N+1 can be further divided into even smaller blocks, corresponding to storage level N+2, and so on, forming a multi-level nested logical structure of data blocks. It should be noted that, in the example given in Figure 4A, each data block is divided based on its row key. Since data in an SSTable is stored ordered by row key according to a preset sorting rule, the row key range of each data block is continuous and unique. Furthermore, the row key ranges of data blocks within the same storage level do not overlap, and together they completely cover the row key range of the data block in the previous level to which they belong. It should also be noted that the number of storage levels and the amount of data contained in each data block shown in Figure 4A are merely exemplary configurations to illustrate the storage structure of the database system. In practice, the granularity of data block division and the number of levels can be flexibly set according to factors such as the storage performance of the database system, the scale of data, and compression algorithms.

[0062] It can be understood that, in the exemplary database system shown in Figure 4A, from a logical perspective, data is divided into multiple data blocks based on row keys for storage, and these multiple data blocks correspond to several storage levels. Any data block in any storage level is composed of several data blocks in the next storage level, and the row key ranges of the data rows contained in each data block of the same storage level are continuous and do not overlap.
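The nesting property stated above (child blocks tile the row key range of their parent without gaps or overlap) can be checked with a short sketch. The `Block` class and `is_well_formed` function are hypothetical, and row keys are assumed to be integers so that "continuous" can be tested as "next first key equals previous last key plus one".

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    first_key: int                      # inclusive lower bound of the row key range
    last_key: int                       # inclusive upper bound of the row key range
    children: list = field(default_factory=list)


def is_well_formed(block):
    """True if, at every level, children exactly tile the parent's key range."""
    if block.first_key > block.last_key:
        return False
    if not block.children:
        return True                     # leaf block: nothing further to check
    kids = sorted(block.children, key=lambda b: b.first_key)
    if kids[0].first_key != block.first_key or kids[-1].last_key != block.last_key:
        return False                    # children do not cover the parent range
    for left, right in zip(kids, kids[1:]):
        if right.first_key != left.last_key + 1:
            return False                # gap or overlap between neighbours
    return all(is_well_formed(kid) for kid in kids)


# Level N (SSTable) -> level N+1 -> level N+2
sstable = Block(0, 99, [
    Block(0, 49, [Block(0, 24), Block(25, 49)]),
    Block(50, 99),
])
overlapping = Block(0, 99, [Block(0, 49), Block(40, 99)])
with_gap = Block(0, 99, [Block(0, 40), Block(50, 99)])
```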

[0063] In a specific implementation, database system data can be logically divided into three storage levels. The first storage level consists of data blocks called SSTables, corresponding to partitions. The second storage level consists of data blocks called macroblocks, which are the storage units that make up an SSTable. The third storage level consists of data blocks called microblocks, which are the storage units that make up a macroblock. In practice, data in an SSTable is organized in units of macroblocks, each with a preset size, such as 2MB. A macroblock can be further divided into several microblocks, each with a preset size, such as 16KB. A microblock contains multiple data rows and is the basic unit of I / O within the database system.
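As a quick worked example of the sizes mentioned above, the following sketch computes how many microblocks fit in one macroblock and how many macroblocks an SSTable of a given size would need. It assumes binary units (1 MB = 1024 KB) and ignores headers, padding, and compression, none of which are specified in the text; the constant and function names are illustrative.

```python
MACROBLOCK_BYTES = 2 * 1024 * 1024   # 2 MB, the example macroblock size
MICROBLOCK_BYTES = 16 * 1024         # 16 KB, the example microblock size

microblocks_per_macroblock = MACROBLOCK_BYTES // MICROBLOCK_BYTES


def macroblocks_needed(sstable_bytes):
    """Number of macroblocks needed to hold an SSTable of the given size."""
    return -(-sstable_bytes // MACROBLOCK_BYTES)   # ceiling division


five_mb_sstable = macroblocks_needed(5 * 1024 * 1024)
```

With these example sizes, one macroblock holds 128 microblocks, and a 5 MB SSTable spans three macroblocks.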

[0064] Next, refer to Figure 4B, which shows a schematic framework for data scheduling in a database system. Data tables in the database system are divided into multiple partitions and persistently stored as SSTable files. The metadata of each data table stores its definition information, such as the table's field structure, primary key configuration, and hot / cold data scheduling strategy. As mentioned earlier, the data in an SSTable can be logically divided into multiple storage levels of data blocks (illustrated as rounded rectangles in the figure). Depending on the specific implementation, these data blocks can be macroblocks, microblocks, or data rows, as described above.

[0065] Continuing with Figure 4B, in this embodiment, hot and cold data scheduling can be performed on a subset of data blocks (hereinafter referred to as the data block subset) of a partition in a data table. The selection of the data block subset can be flexibly determined according to actual needs. For example, it can be all newly generated data blocks in the partition since the previous hot and cold data scheduling operation was completed; it can also be all data blocks currently existing in the partition; or it can be a set of data blocks containing potentially hot data rows predicted based on a certain query. When performing hot and cold data scheduling operations, the hot and cold data boundary conditions for the data block subset can first be determined based on the metadata of the data table (hereinafter, the boundary time threshold is used as an example). For the LSM-Tree architecture, since data blocks are not updated once they are generated, the last modified time (LMT) of a data block can usually be treated as equivalent to the latest write time of all data rows in that data block.

[0066] The inventors discovered that the "hotness" or "coldness" of data is typically related to its creation time, especially in TP (transaction processing) scenarios with clear time patterns. Newly generated data blocks are usually read more frequently, while older data blocks are read less frequently. Therefore, by using the LMT (last modified time) of a data block to determine its most recent update time and comparing it with a threshold time, the "hotness" or "coldness" of the data block can be determined. If the most recent update time of a data block is later than the threshold time, it can be classified as a hot data block and scheduled to the first storage medium with higher read / write performance (e.g., the local cache of a compute node) to ensure low-latency response for high-frequency access. If the most recent update time of a data block is not later than the threshold time, it can be classified as a cold data block and scheduled to the second storage medium with lower storage costs (e.g., object storage in a shared storage layer) to optimize the utilization of storage resources.

[0067] In this way, hot and cold data scheduling operations in the database system can be refined to the data blocks that make up the partition. This allows the operation to be performed independently of the LSM-Tree's merge process, proactively and on demand without waiting for the merge operation to be triggered. This ensures that the hot and cold data scheduling strategy takes effect quickly after configuration and responds in real time to dynamic changes in data popularity. Furthermore, because the data scheduling operation operates at the data block level, rather than the entire partition, hot data can be accurately scheduled to high-performance storage media without loading the entire partition, greatly reducing the waste of storage resources and balancing the performance requirements of the database system with storage costs.

[0068] Based on the above technical framework, Figure 5 shows a flowchart of a method for data scheduling in a database system according to an embodiment of this specification. It is understood that the method disclosed in this specification can be executed by any device, equipment, platform, or cluster of devices with computing and processing capabilities. The storage architecture of the database system is a log-structured merge-tree (LSM-Tree).

[0069] Referring to Figure 5, in one embodiment, the method includes a target scheduling operation performed on a data table, the target scheduling operation including the following steps: S501: Obtaining a subset of data blocks belonging to a first partition of the data table, wherein each data block is used to store a portion of the data rows of the first partition. S503: Determining a boundary time threshold for the subset of data blocks based on the metadata of the data table. S505: For any first data block in the subset of data blocks, if the most recent update time of the first data block is later than the boundary time threshold, then the first data block is stored in a first storage medium; if the most recent update time of the first data block is not later than the boundary time threshold, then the first data block is stored in a second storage medium; the read / write performance of the first storage medium is higher than that of the second storage medium.
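The decision made in step S505 can be sketched as a pure function that maps each block in the subset to a storage medium. This is an illustrative sketch, not the claimed implementation: `DataBlock`, `plan_target_scheduling`, and the labels `"first"` and `"second"` are hypothetical names, and the subset (S501) and boundary time threshold (S503) are taken as given inputs.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DataBlock:
    block_id: str
    last_update: datetime   # most recent update time of the block


def plan_target_scheduling(block_subset, boundary):
    """S505: choose a storage medium for every block in the subset.

    Blocks updated strictly later than the boundary go to the first
    (higher-performance) medium; all others go to the second medium.
    """
    return {
        block.block_id: "first" if block.last_update > boundary else "second"
        for block in block_subset
    }


boundary = datetime(2026, 3, 23)
subset = [
    DataBlock("m1", datetime(2026, 3, 29)),   # later than the boundary: hot
    DataBlock("m2", datetime(2026, 3, 23)),   # equal to the boundary: cold
    DataBlock("m3", datetime(2026, 3, 1)),    # earlier than the boundary: cold
]
plan = plan_target_scheduling(subset, boundary)
```

Note that a block whose most recent update time equals the boundary is treated as cold, matching the "not later than" wording of S505.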

[0070] As previously mentioned, the storage architecture of the database system is a log structure merge tree (LSM-Tree). The data writing process, persistent storage method, and merging mechanism implemented in this storage architecture have been described above and will not be repeated here. As mentioned earlier, the database system can consist of several nodes using a shared storage architecture, including data nodes and compute nodes. The following will describe the cold and hot data scheduling method for any data table in the database system, using specific embodiments. It can be understood that in a distributed database system using a shared storage architecture, the method can be executed by the first compute node among the several compute nodes constituting the database system. The first compute node is the replica management node for the first partition of the data table. The method can also be executed by the data nodes constituting the database system, communicating with the compute nodes that manage the replicas of the data table partitions to achieve cold and hot data scheduling between the compute node's local cache and the data nodes.

[0071] In step S501, a subset of data blocks belonging to the first partition of the data table is obtained, and each data block is used to store a portion of the data rows of the first partition.

[0072] A data block represents a basic data unit. Depending on the actual usage needs, it can represent a macroblock that makes up a partition, a microblock that makes up a macroblock, or a data row. This embodiment does not make any specific limitations in this regard. The data block stores a portion of the data rows of the first partition.

[0073] The data block subset refers to a subset of data blocks selected from all data blocks corresponding to the first partition according to a specific strategy, which serves as the processing object for this hot / cold data scheduling (hereinafter referred to as the target scheduling) operation. In a specific implementation, the data blocks that meet the target scheduling execution conditions can be indexed by scanning the metadata (MetaData) of the SSTable file corresponding to the first partition, thus forming the data block subset. For example, there are typically the following two scenarios.

[0074] In one scenario, the subset of data blocks may include newly generated data blocks in the first partition since the completion of the previous target scheduling operation. During the continuous operation of the database system, new data is constantly generated with the execution of transactions and formed into new SSTable files through MEMTable flush or merge operations, thus generating new data blocks. In this scenario, the newly generated data blocks in the first partition since the last target scheduling operation can be identified by tracing the checkpoint or log sequence number recorded after the completion of the last target scheduling operation, and the subset of data blocks can be determined based on these newly generated data blocks.
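One way to picture the incremental selection is to filter blocks by a checkpoint. The sketch below assumes that each block's metadata carries the log sequence number (LSN) at which it was generated; that field, the tuple layout, and the function name `blocks_since_checkpoint` are assumptions made for illustration.

```python
def blocks_since_checkpoint(blocks, checkpoint_lsn):
    """Select blocks generated after the last target scheduling operation.

    blocks is a list of (block_id, creation_lsn) pairs. Returns the
    selected block ids and the checkpoint to record for the next run.
    """
    subset = [block_id for block_id, lsn in blocks if lsn > checkpoint_lsn]
    newest = max((lsn for _, lsn in blocks), default=checkpoint_lsn)
    return subset, max(newest, checkpoint_lsn)


partition_blocks = [("m1", 10), ("m2", 25), ("m3", 31)]
first_run, checkpoint = blocks_since_checkpoint(partition_blocks, 20)
second_run, checkpoint_after = blocks_since_checkpoint(partition_blocks, checkpoint)
```

Running the selection again with the returned checkpoint yields an empty subset until new blocks are generated.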

[0075] In another scenario, the subset of data blocks may include all data blocks of the first partition. This scenario is typically applicable during the initial hot / cold data scheduling operation on the first partition, or when the hot / cold data scheduling strategy for the data table changes. By defining all data blocks within the first partition as a subset, a complete hot / cold data scheduling process can be performed on all data in the first partition at the data block granularity. All data blocks are scheduled to the appropriate storage medium according to the current unified hot / cold data scheduling strategy.

[0076] Next, in step S503, the boundary time threshold for the subset of data blocks can be determined based on the metadata of the data table.

[0077] As mentioned earlier, the metadata of a data table stores its table definition information, which can be configured during the table's creation or maintenance phases. Taking the creation of a data table as an example, the SQL statement could be:

[0078] CREATE TABLE table_name (table_schema)

[0079] HOT_RETENTION = interval_num,

[0080] GRANULARITY = unit;

[0081] In this example statement, uppercase words (CREATE, TABLE, etc.) represent SQL keywords, and lowercase words (table_name, table_schema, etc.) represent user-specified parameters. Here, table_name is the name of the data table, and table_schema defines the data columns of the table, i.e., the table structure. interval_num is the threshold for the lifespan of hot data; data created within this timeframe can be considered hot data. This parameter is usually an integer, and its time unit can be flexibly set by the database system, such as SECOND (seconds), MINUTE (minutes), HOUR (hours), DAY (days), etc. The unit parameter indicates the type of data block to be scheduled, such as MACROBLOCK (macroblocks), MICROBLOCK (microblocks), ROW (rows), etc.

[0082] It should be noted that, in order to maintain the brevity of the description, unless otherwise specified, each embodiment will be described using macroblocks as data blocks in the following text.

[0083] By default, the database system uses the last modified time (LMT) of a data block as its most recent update time and compares it with the hot data lifetime defined in the table metadata to determine whether a data block is considered "hot" or "cold." In some implementations, to provide a more flexible hot / cold data scheduling strategy, users can specify a data column to determine the most recent update time of a data block. An example SQL statement could be:

[0084] CREATE TABLE table_name (table_schema)

[0085] BOUNDARY_COLUMN = col_name,

[0086] HOT_RETENTION = interval_num,

[0087] GRANULARITY = unit;

[0088] In this exemplary statement, `col_name` is the column name of a defined data column in the data table. In practice, the data type of this column can be data types such as `TIMESTAMP`, `DATETIME`, or `INT`, which are data types capable of recording or converting time information. Users can specify `BOUNDARY_COLUMN` to use a custom data column as the basis for determining the most recent update time of a data block in the data table.

[0089] Returning to step S503, for the subset of data blocks, the hot and cold data boundary condition, i.e., the boundary time threshold, can be determined based on the metadata of the data table. Specifically, the duration threshold configured on the data table is obtained, such as the interval_num specified by HOT_RETENTION in the previous example. Based on the obtained duration threshold and the current system time, the boundary time threshold for the subset of data blocks is determined. Specifically, the boundary time threshold can be obtained by subtracting the duration threshold from the current system time.
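The subtraction described above can be written out directly. In this sketch the function name `boundary_time_threshold` and the mapping from time-unit keywords to `timedelta` arguments are assumptions; the text only states that the duration threshold is subtracted from the current system time.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from time-unit keywords to timedelta keyword arguments.
UNIT_TO_TIMEDELTA_ARG = {
    "SECOND": "seconds",
    "MINUTE": "minutes",
    "HOUR": "hours",
    "DAY": "days",
}


def boundary_time_threshold(now, interval_num, time_unit="DAY"):
    """Boundary time threshold = current system time - duration threshold."""
    duration = timedelta(**{UNIT_TO_TIMEDELTA_ARG[time_unit]: interval_num})
    return now - duration


now = datetime(2026, 3, 30, 12, 0)
seven_days = boundary_time_threshold(now, 7, "DAY")
six_hours = boundary_time_threshold(now, 6, "HOUR")
```

Because the threshold is derived from the current system time, it moves forward each time target scheduling runs, so a block that was hot in one run can become cold in a later run without being rewritten.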

[0090] After determining the subset of data blocks and their corresponding boundary time thresholds through the above steps, in step S505, for any first data block in the subset of data blocks, if the most recent update time of the first data block is later than the boundary time threshold, then the first data block is stored in the first storage medium; if the most recent update time of the first data block is not later than the boundary time threshold, then the first data block is stored in the second storage medium; the read / write performance of the first storage medium is higher than that of the second storage medium.

[0091] In a specific practice, the first storage medium can be a local disk of the compute node, deployed locally on the compute node, and the second storage medium can be remote (relative to the compute node) shared storage, typically such as object storage, deployed on the data nodes that make up the shared storage layer.

[0092] In this step, determining the most recent update time of the first data block is crucial for assessing its "hot" or "cold" status. The most recent update time can represent the timeliness of data within a data block, based on its creation time. In practice, the most recent update time of the first data block can be determined in at least the following two ways, depending on the user's configuration of the data table.

[0093] In one scenario, a user specifies a data column (hereinafter referred to as the target column) for a data table to determine the most recent update time of a data block. The target column indicates time information related to the data rows, such as order creation time or log timestamps. The first data block may contain a massive number of data rows, each with an independent time value on the target column. To quickly determine the most recent update time of the first data block, a sparse skip index can be created and maintained for the target column. This index is a lightweight, block-level aggregated index, typically generated during data block persistence. After generating the sparse skip index for the target column, key statistics can be quickly obtained without traversing all data rows within the data block. For example, after specifying the target column, a sparse skip index can be created on that column for each data block (including the first data block) to aggregate and record the maximum time value (or, in some practices, the minimum time value) of all data rows within the data block on the target column. When determining the most recent update time of the first data block, it is not necessary to read the value of each data row on the target column one by one. Instead, it is only necessary to query the sparse skip index of the first data block on the target column to directly obtain the maximum time value of all data rows in the first data block on the target column, and to use that maximum time value as the most recent update time of the first data block.

[0094] In another scenario, if the user has not specified a data column for determining the most recent update time in the data table, the LMT of the first data block can be directly used as the most recent update time. As mentioned earlier, in the LSM-Tree storage architecture, due to the append-only write method, once a data block is flushed or persisted to an SSTable file through a merge operation, its content will not be changed again. Therefore, the LMT of a data block is usually the creation time of the data block. In some optimization practices, if a reuse mechanism is designed for data blocks, for example, to reduce write amplification, some data rows in an obsolete data block are reused, then the LMT of the data block can also be the last update time of the data block.
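The two ways of obtaining the most recent update time can be sketched together. The example assumes a block is a non-empty list of row dicts, that time values are integers (for instance epoch seconds), and that the per-block aggregate is computed once when the block is persisted; `persist_block`, `most_recent_update_time`, and the dict layout are hypothetical.

```python
def persist_block(rows, lmt, boundary_column=None):
    """Persist a block and, if a target column is configured, build its
    block-level min/max aggregate (a stand-in for the sparse skip index)."""
    block = {"rows": list(rows), "lmt": lmt, "skip_index": {}}
    if boundary_column is not None:
        values = [row[boundary_column] for row in block["rows"]]
        block["skip_index"][boundary_column] = {
            "min": min(values),
            "max": max(values),
        }
    return block


def most_recent_update_time(block, boundary_column=None):
    """Read the aggregated maximum when a target column is configured;
    otherwise fall back to the block's LMT. No rows are scanned here."""
    if boundary_column is not None:
        return block["skip_index"][boundary_column]["max"]
    return block["lmt"]


rows = [
    {"order_id": 1, "created_at": 100},
    {"order_id": 2, "created_at": 250},
    {"order_id": 3, "created_at": 180},
]
with_column = persist_block(rows, lmt=300, boundary_column="created_at")
without_column = persist_block(rows, lmt=300)

from_skip_index = most_recent_update_time(with_column, "created_at")
from_lmt = most_recent_update_time(without_column)
```

The cost of the aggregate is paid once at persistence time, so each later scheduling run needs only a constant-time lookup per block.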

[0095] After determining the most recent update time of the first data block, it can be compared with the boundary time threshold to determine the hot / cold status of the first data block, and then the corresponding scheduling operation can be performed between different storage media. In specific applications, the first data block may already exist in a certain storage medium. To avoid data redundancy and wasted storage resources, and to ensure the idempotency of data access to the first data block, the target scheduling can keep the first data block in only a single storage medium.

[0096] Specifically, if the first data block is determined to be hot data (i.e., its most recent update time is later than the boundary time threshold), it can be scheduled to the first storage medium. During target scheduling, if the first data block is already stored in the second storage medium, it can be deleted from the second storage medium after being written to the first storage medium. Similarly, if the first data block is determined to be cold data (i.e., its most recent update time is not later than the boundary time threshold), it can be scheduled to the second storage medium. During target scheduling, if the first data block is already stored in the first storage medium, it can be deleted from the first storage medium after being written to the second storage medium.
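The write-then-delete ordering can be sketched with two dicts standing in for the storage media. The function name `place_block` is hypothetical, and network transfer, failures, and concurrent readers are not modelled.

```python
def place_block(block_id, is_hot, first_medium, second_medium):
    """Keep the block in exactly one medium, chosen by its hot/cold status."""
    if is_hot:
        target, other = first_medium, second_medium
    else:
        target, other = second_medium, first_medium
    if block_id in other:
        target[block_id] = other[block_id]   # 1. write to the target medium first
        del other[block_id]                  # 2. only then delete the old copy
    elif block_id not in target:
        raise KeyError(f"block {block_id!r} is not stored in either medium")


local_cache = {}                          # first storage medium
object_storage = {"m1": b"rows..."}       # second storage medium

place_block("m1", True, local_cache, object_storage)    # hot: move to the cache
place_block("m1", True, local_cache, object_storage)    # repeating changes nothing
state_when_hot = (dict(local_cache), dict(object_storage))

place_block("m1", False, local_cache, object_storage)   # cold: move back
state_when_cold = (dict(local_cache), dict(object_storage))
```

Deleting only after the write completes means the block is always readable from at least one medium, and repeating the same call leaves the state unchanged.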

[0097] In terms of specific engineering implementation, depending on the architecture of the database system and the division of responsibilities among its components, the aforementioned target scheduling operations can be performed by different nodes.

[0098] In one implementation, the target scheduling operation can be performed by the first compute node, which is the replica management node of the first partition. When the first data block is determined to be hot data, and the first data block is currently stored in the second storage medium of the data node, the first compute node can initiate a data retrieval request to retrieve the first data block from the second storage medium of the data node via the network / interface, and store it in the first storage medium of the first compute node. Optionally, after completing the storage of the first data block, the first compute node can also send a data deletion command to the data node, so that the data node can remove the first data block from the second storage medium. Similarly, if the first data block is determined to be cold data, and the first data block is currently stored in the first storage medium of the first compute node, the first compute node can send the first data block to the data node via a secure transmission protocol, so that the data node can store the received first data block in the second storage medium. Optionally, after the data node returns a successful storage response, the first compute node can delete the first data block from the first storage medium.

[0099] In another implementation, the target scheduling operation can be performed by the data node. The data node can communicate with the first compute node, which is the replica management node of the first partition, to obtain the hot / cold determination result of the first data block. If the first data block is determined to be hot data, and the first data block is currently stored in the data node's second storage medium, the data node can send the first data block to the first compute node, so that the first compute node can store the first data block in the first storage medium. Optionally, after receiving confirmation of successful storage of the first data block from all replica management nodes of the first partition, the data node can delete the first data block from the second storage medium. Similarly, if the first data block is determined to be cold data, and the first data block is currently stored in the first storage medium, the data node can send a data retrieval request to the first compute node to retrieve the first data block stored in the first storage medium from the first compute node and store the first data block in the second storage medium. Optionally, after completing the write of the first data block to the second storage medium, the data node can send a deletion command to the first compute node, so that the first compute node can delete the first data block from the first storage medium.

[0100] The above describes a method for data scheduling in a database system based on one or more embodiments. It should be noted that as the database system continues to run, the target scheduling operations described in steps S501-S505 can be invoked and executed multiple times, continuously performing hot and cold scheduling of data blocks to dynamically adapt to changes in the access frequency of data blocks, ensuring that the performance and storage costs of the database system are always optimized.

[0101] The target scheduling operation described in the above embodiments can be refined to the data blocks that constitute the partition, so that the operation can be executed independently of the LSM-Tree merging process and can be actively executed on demand without waiting for the merging operation to be triggered, thereby improving the flexibility and real-time effectiveness of cold and hot data scheduling in the database system.

[0102] In a practical application, target scheduling can be executed periodically at preset time intervals. Users can configure an execution cycle for target scheduling in the database system (e.g., every minute, every 5 minutes, or every hour, etc.) to trigger hot and cold data scheduling for each data table at regular intervals. This ensures that the database system can evaluate the hot and cold status of data blocks at a stable frequency and schedule hot and cold data blocks according to the currently determined boundary time thresholds. Because the execution granularity of hot and cold data scheduling is at the data block level rather than the entire partition, and the scheduling process is independent of the resource-intensive merging process, even with frequent execution, its resource overhead is relatively controllable, enabling continuous optimization.

[0103] In another practical application, target scheduling can also be triggered when a preset condition is met. For example, the preset conditions may include:

[0104] In response to a first command sent by the user: users can proactively initiate a cold / hot data scheduling operation for a specific data table, a specific partition, or the entire database through management commands or APIs. For example, when a user anticipates running large-scale analytical queries on a batch of data, they can manually trigger cold / hot data scheduling to preheat the relevant data blocks into the corresponding storage medium; or, after adjusting the cold / hot scheduling strategy of a data table, the user can execute the scheduling immediately so that the new strategy takes effect.

[0105] In response to the generation of at least one of the data blocks: when a new data block is generated in the partition (e.g., when a flush produces a new data block), a cold and hot data scheduling operation is automatically triggered to classify the newly generated data block as cold or hot and schedule it accordingly. This reduces the delay between the generation of a data block and its scheduling, so that the new data block reaches the corresponding storage medium in a timely manner. Hot data can be cached quickly in the first storage medium, and cold data is prevented from occupying space in the first storage medium for too long.
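The two preset conditions above can be sketched as event handlers on a small scheduler object. This is an illustrative sketch only: the class `TieringScheduler`, its method names, and the `"first"` / `"second"` placement labels are assumptions and do not correspond to any interface named in this specification.

```python
from datetime import datetime, timedelta


class TieringScheduler:
    """Hypothetical scheduler that reacts to events instead of a timer."""

    def __init__(self, hot_window):
        self.hot_window = hot_window
        self.blocks = {}     # block id -> most recent update time
        self.placement = {}  # block id -> "first" or "second" storage medium

    def _place(self, block_ids, now):
        boundary = now - self.hot_window
        for block_id in block_ids:
            is_hot = self.blocks[block_id] > boundary
            self.placement[block_id] = "first" if is_hot else "second"

    def on_block_generated(self, block_id, updated_at, now):
        # Trigger: a new block appeared (for example after a flush);
        # only the new block is evaluated.
        self.blocks[block_id] = updated_at
        self._place([block_id], now)

    def on_user_command(self, now, hot_window=None):
        # Trigger: explicit user request, optionally with a new strategy;
        # every known block is re-evaluated.
        if hot_window is not None:
            self.hot_window = hot_window
        self._place(list(self.blocks), now)


now = datetime(2026, 1, 1, 12, 0)
scheduler = TieringScheduler(hot_window=timedelta(hours=1))
scheduler.on_block_generated("new", now - timedelta(minutes=10), now)
scheduler.on_block_generated("old", now - timedelta(hours=3), now)
before = dict(scheduler.placement)
scheduler.on_user_command(now, hot_window=timedelta(hours=4))
after = dict(scheduler.placement)
```

Widening the window through the user command moves the `old` block to the first storage medium immediately, which mirrors the case where a changed strategy takes effect without waiting for a merge.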

[0106] Furthermore, it should be noted that although the execution granularity of the target scheduling method described above can be refined to data blocks such as macroblocks, microblocks, or even data rows, in practical applications, considering the balance between the execution complexity of target scheduling and the performance improvement of the database system that target scheduling can bring, it is preferable to set the data block as a macroblock.

[0107] The foregoing description, based on one or more embodiments, details a method for data scheduling in a database system. Using the method provided in the embodiments of this specification, the hot and cold data scheduling operation in the database system can be refined to the data blocks constituting the partition. This allows the operation to be performed independently of the LSM-Tree merging process, proactively and on demand without waiting for the merging operation to be triggered. This ensures that the hot and cold data scheduling strategy takes effect quickly after configuration and responds in real-time to dynamic changes in data popularity. Furthermore, since the data scheduling operation operates at the data block level, rather than the entire partition, hot data can be accurately scheduled to high-performance storage media without loading the entire partition, greatly reducing the waste of storage resources. While ensuring the performance of hot data access, it also improves the utilization rate of the local cache on the computing node, balancing the performance requirements and storage costs of the database system.

[0108] In this specification, the term "first" in "first partition", "first data block", etc., and the corresponding terms "second" and "third" (if any), are used merely for convenience of distinction and description and do not have any limiting meaning.

[0109] The foregoing description describes specific embodiments of this specification; other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than those shown in the embodiments, and the desired result may still be achieved. Furthermore, the processes depicted in the drawings do not necessarily need to follow the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are possible or may be advantageous.

[0110] Figure 6 is a schematic diagram of an apparatus for data scheduling in a database system according to an embodiment of this specification. The apparatus 600 is deployed in a computing device, which can be implemented using any apparatus, device, platform, or device cluster with computing and processing capabilities. The storage architecture of the database system is a log-structured merge tree (LSM-Tree). The apparatus is used to perform target scheduling operations on data tables. This apparatus embodiment corresponds to the method embodiment shown in Figure 5. The apparatus 600 includes:

[0111] The acquisition module 601 is configured to acquire a subset of data blocks belonging to the first partition of the data table, wherein each data block is used to store a portion of the data rows of the first partition.

[0112] The determination module 602 is configured to determine the boundary time threshold for the subset of data blocks based on the metadata of the data table.

[0113] The scheduling module 603 is configured to, for any first data block in the subset of data blocks, if the most recent update time of the first data block is later than the boundary time threshold, then the first data block is stored in the first storage medium; if the most recent update time of the first data block is not later than the boundary time threshold, then the first data block is stored in the second storage medium; the read / write performance of the first storage medium is higher than that of the second storage medium.
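The division of work among the three modules can be sketched as three functions. This is a minimal sketch under assumptions: the `Block` record, the `hot_duration` metadata key, and the derivation of the boundary as the current time minus a configured duration are illustrative names and choices, not the apparatus itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Block:
    block_id: str
    partition: str
    updated_at: datetime  # most recent update time of the block


def acquire(blocks, partition):
    """Acquisition module: select the blocks belonging to one partition."""
    return [b for b in blocks if b.partition == partition]


def determine_boundary(table_metadata, now):
    """Determination module: derive the boundary time threshold from metadata."""
    return now - table_metadata["hot_duration"]


def schedule(subset, boundary):
    """Scheduling module: later than the boundary -> first medium, else second."""
    first, second = [], []
    for block in subset:
        target = first if block.updated_at > boundary else second
        target.append(block.block_id)
    return first, second


now = datetime(2026, 1, 1, 12, 0)
blocks = [
    Block("m1", "p1", now - timedelta(minutes=5)),
    Block("m2", "p1", now - timedelta(hours=1)),  # exactly on the boundary
    Block("m3", "p1", now - timedelta(days=2)),
    Block("m4", "p2", now),                       # different partition
]
subset = acquire(blocks, "p1")
boundary = determine_boundary({"hot_duration": timedelta(hours=1)}, now)
first, second = schedule(subset, boundary)
```

Block `m2` sits exactly on the boundary and is placed in the second storage medium, because the condition for the first medium is strictly "later than" the threshold.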

[0114] According to another embodiment, this specification also provides a computer program product, including a computer program / instructions that, when executed by a processor, implement the steps of the method described above in conjunction with Figure 5.

[0115] According to yet another embodiment, this specification also provides a computing device including a memory and a processor, characterized in that the memory stores executable code, and when the processor executes the executable code, it implements the steps of the method described above in conjunction with Figure 5.

[0116] Those skilled in the art will recognize that the functions described in the embodiments of the present invention in one or more of the above examples can be implemented using hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.

[0117] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above descriptions are merely specific embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, or improvements made based on the technical solutions of the present invention should be included within the scope of protection of the present invention.

Claims

1. A method for data scheduling in a database system, wherein the storage architecture of the database system is a log structure merge tree (LSM-Tree), the method comprising target scheduling operations performed on data tables, the target scheduling including: Obtain a subset of data blocks belonging to the first partition of the data table, wherein each data block is used to store a portion of the data rows of the first partition; Based on the metadata of the data table, determine the boundary time threshold for the subset of data blocks; For any first data block in the subset of data blocks, if the most recent update time of the first data block is later than the boundary time threshold, then the first data block is stored in the first storage medium; if the most recent update time of the first data block is not later than the boundary time threshold, then the first data block is stored in the second storage medium. The read / write performance of the first storage medium is higher than that of the second storage medium.

2. The method according to claim 1, wherein, The target scheduling operation is executed periodically at preset time intervals; or, The target scheduling operation is executed under preset conditions; The preset conditions include: responding to a first instruction sent by the user, or responding to the generation of at least one of the data blocks.

3. The method according to claim 1, wherein, The database system includes a plurality of computing nodes and data nodes. The first computing node among the plurality of computing nodes is the replica management node of the first partition. The first storage medium is deployed on the first computing node, and the second storage medium is deployed on the data node.

4. The method according to claim 3, wherein, The method is executed by the first computing node; The method of storing the first data block in the first storage medium includes: If the first data block is currently stored in the second storage medium and is not stored in the first storage medium, then the first data block is obtained from the second storage medium and stored in the first storage medium; And, storing the first data block in the second storage medium includes: If the first data block is currently stored in the first storage medium and not in the second storage medium, then the first data block is sent to the data node, so that the data node stores the first data block in the second storage medium; and the first data block is deleted from the first storage medium.

5. The method according to claim 3, wherein, The method is executed by the data node; The method of storing the first data block in the first storage medium includes: If the first data block is currently stored in the second storage medium and not in the first storage medium, then the first data block is sent to the first computing node, so that the first computing node stores the first data block in the first storage medium; And, storing the first data block in the second storage medium includes: If the first data block is currently stored in the first storage medium and not in the second storage medium, then the first data block is retrieved from the first storage medium and stored in the second storage medium.

6. The method according to claim 1, wherein, The method of storing the first data block in the first storage medium includes: After storing the first data block in the first storage medium, if the first data block is currently still stored in the second storage medium, then delete the first data block from the second storage medium; And, storing the first data block in the second storage medium includes: After storing the first data block in the second storage medium, if the first data block is still currently stored in the first storage medium, then delete the first data block from the first storage medium.

7. The method according to claim 1, wherein, Based on the metadata of the data table, determine the boundary time threshold for the subset of data blocks, including: Obtain the duration threshold configured on the data table; Based on the duration threshold and the current system time, determine the boundary time threshold for the subset of data blocks.

8. The method according to claim 1, wherein, The data table is configured with a first attribute to indicate the type of the data block; the first attribute is any one of the following: macroblock, microblock, data row.

9. The method according to claim 1, wherein, The data table contains target data columns, which indicate time information related to data rows; the most recent update time is determined in the following way: Based on all the data rows contained in the first data block, the maximum time value of all data rows on the target data column is determined by sparse skip index; The maximum time value is determined as the most recent update time of the first data block.

10. The method according to claim 1, wherein, The most recent update time is the creation time of the first data block.

11. The method according to claim 1, wherein, The subset of data blocks includes the data blocks newly generated in the first partition after the previous target scheduling operation was completed.

12. The method according to claim 1, wherein, The subset of data blocks contains all the data blocks of the first partition.

13. The method according to claim 1, wherein, The first storage medium is a local disk, and the second storage medium is object storage.

14. An apparatus for data scheduling in a database system, wherein the storage architecture of the database system is a log structure merge tree (LSM-Tree), the apparatus being used to perform target scheduling operations on data tables, the apparatus comprising: an acquisition module configured to acquire a subset of data blocks belonging to the first partition of the data table, wherein each data block is used to store a portion of the data rows of the first partition; a determination module configured to determine the boundary time threshold for the subset of data blocks based on the metadata of the data table; and a scheduling module configured to, for any first data block in the subset of data blocks, store the first data block in the first storage medium if the most recent update time of the first data block is later than the boundary time threshold, and store the first data block in the second storage medium if the most recent update time of the first data block is not later than the boundary time threshold; wherein the read / write performance of the first storage medium is higher than that of the second storage medium.

15. A computer program product comprising a computer program / instructions that, when executed by a processor, implement the steps of the method according to any one of claims 1-13.

16. A computing device, comprising a memory and a processor, characterized in that, The memory stores executable code, and when the processor executes the executable code, it implements the method of any one of claims 1-13.