A method for decentralized storage of database metadata
By mapping each table's metadata and data to the same virtual node through a consistent hashing algorithm, replicating metadata with the Raft protocol, and combining multi-level caching and indexing, the method solves the performance bottleneck and scalability problems of traditional centralized metadata management architectures and achieves efficient, reliable decentralized storage and horizontal scaling of metadata.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TAOS DATA
- Filing Date
- 2026-03-13
- Publication Date
- 2026-06-26
AI Technical Summary
Traditional centralized metadata management architectures suffer from performance bottlenecks, limited scalability, and reliability risks in large-scale data scenarios. Existing distributed databases still have inefficiencies in metadata management, making it difficult to meet the needs of high concurrency, low latency, and elastic scaling.
A consistent hashing algorithm is used to force metadata and data tables to be mapped to the same virtual node. Combined with the Raft protocol, strong consistency of metadata replicas is achieved. A multi-level caching system and multi-dimensional indexes are designed. Through virtual node splitting and dynamic expansion, decentralized storage and horizontal scaling of metadata are realized.
It enables seamless horizontal scaling of metadata, reduces cross-node access latency, improves system reliability and response speed, and supports efficient management of massive tables.
Figure CN121833713B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of databases, and specifically to a method for horizontal scaling of database clusters and distributed storage of metadata based on consistent hashing sharding, particularly a method for decentralized storage of database metadata.
Background Technology
[0002] In the field of database technology, especially in time-series database architecture, the way metadata (including table structure, tag information, etc.) is managed has a decisive impact on system performance, scalability, and reliability.
[0003] Traditional time-series databases generally use a centralized metadata management architecture, as shown in Figure 1. While this architecture performs stably with limited data volume, its inherent flaws become increasingly apparent as the business scales and the number of tables reaches millions or even tens of millions. First, the system faces severe performance bottlenecks. All metadata queries rely on a single server, leading to a linear increase in response latency with the number of tables. At the same time, the massive amount of metadata puts enormous pressure on single-machine memory capacity, making CPU and I/O more likely to become system bottlenecks in high-concurrency scenarios. Second, the system's scalability is severely limited. Metadata storage and processing capacity cannot be scaled horizontally by adding nodes, and it is difficult to distribute the access pressure from hot tables. Finally, the system presents significant reliability risks. A single point of failure in the central node can affect the entire system, and the backup and recovery of massive amounts of metadata is excessively time-consuming, making it difficult to meet strict recovery time objectives (RTO).
[0004] To address the challenges of large-scale data, while some existing distributed databases have implemented data sharding, their metadata management solutions remain incomplete, exhibiting the following main problems:
[0005] Metadata is separated from data: The metadata itself is still stored in a centralized manner and is not distributed in coordination with data sharding, which makes metadata service a new performance bottleneck in the global system.
[0006] Lack of fine-grained sharding mechanism: Most systems only support coarse-grained sharding at the database or tablespace level, and cannot achieve fine-grained metadata distribution and management at the table level, which limits the system's flexibility and resource utilization.
[0007] Weak dynamic scalability: When cluster expansion or sharding strategy adjustment is required, large-scale data migration is often involved, which is costly, time-consuming, and affects business continuity.
[0008] CN113886037A discloses a method and system for implementing data distribution in a distributed database cluster. This patented technology manages the overall data by dividing it into multiple data shards. Scheduling instances, computing instances, and a distributed storage system are interconnected via a network. Computing instances and scheduling instances share cluster metadata, and the database cluster's failover, node replacement, expansion, and contraction are achieved by controlling the correspondence between computing instances and data shards. However, this patent lacks flexibility in setting up data shards and cannot dynamically adjust the sharding strategy according to actual needs.
[0009] CN114443643A discloses a method and system for implementing hash distribution tables in a distributed database. This patented technology creates a distribution table in the form of a parent-child table. The parent table stores metadata information of the distribution table and related information of the child table, while the child table stores the table data of the distribution table according to hash values. Furthermore, it achieves the storage of hash distribution tables in a distributed database by creating indexes for the distribution table and storing the indexes in the corresponding child tables. However, this patent still suffers from inefficiency in the creation and update operations of the hash distribution table.
[0010] In summary, existing centralized or semi-distributed metadata management solutions are no longer adequate for the core requirements of high concurrency, low latency, elastic scaling, and high availability in massive data scenarios. Therefore, there is an urgent need in this field for an innovative technical solution that can achieve decentralized metadata storage, support seamless horizontal scaling, and efficiently manage massive tables.
Summary of the Invention
[0011] This invention aims to provide a method for decentralized storage of database metadata that supports horizontal scaling, in order to solve the fundamental problems of traditional centralized metadata architectures, such as performance bottlenecks, limited scalability, and reliability risks.
[0012] A method for decentralized storage of database metadata according to the present invention includes:
[0013] The global hash space is divided into multiple consecutive hash value ranges using a consistent hashing algorithm, and these ranges are then assigned to multiple virtual nodes (vnodes) on the server, so that each virtual node corresponds to a hash value range.
[0014] The metadata of each data table and the generated time-series data are forcibly mapped and stored in the same virtual node, so that any operation on the time-series data on the virtual node can obtain the metadata on the virtual node, thereby avoiding the latency and overhead caused by remote access to metadata across nodes;
[0015] At least two virtual nodes (Vnodes) are combined into a virtual node group (Vgroup), and strong consistency and automatic failover of metadata replicas are achieved within the virtual node group (Vgroup) through the Raft protocol.
[0016] A multi-level caching system spanning both the server and client is used to cache hot data and minimize query response time, while also building multi-dimensional indexes.
[0017] When an overloaded virtual node is detected, a new virtual node is created, and a split operation is performed on the overloaded virtual node, moving a segment of hash values out of its hash value range.
[0018] Preferably, the metadata of each data table and the generated time-series data are forcibly mapped and stored in the same virtual node, specifically as follows:
[0019] In the Mnode storage engine SDB on the server-side management node, metadata entries for the super table are created and the structural information of the metadata entries is persistently stored.
[0020] Create metadata entries for a sub-table on each virtual node to store time-series data, and write metadata entries containing the UID of the super table to which the data belongs and the specific tag value into the metadata master table of the storage engine TDB on each virtual node.
[0021] The sub-table index of each virtual node is associated with the super table, thereby storing the metadata of the sub-table and the generated time-series data in the same virtual node.
[0022] Preferably, the hash value of each sub-table falls within the hash value range of the virtual node to which it resides.
[0023] Preferably, a multi-level caching system spanning both the server and client is used to cache frequently accessed data to minimize query response time and to build multi-dimensional indexes, including:
[0024] The client sends a query request to the supertable to obtain the metadata of the table structure of the supertable and caches it so that subsequent identical query requests can first locate the list of sub-table UIDs that meet the conditions from the supertable metadata cached by the client.
[0025] The supertable quickly locates a list of all sub-table UIDs that meet the conditions based on the tag conditions in the query request, using its maintained tag index and sub-table index, and returns it to the client.
[0026] The client sends a data query request to the virtual node (Vnode) where the sub-table that meets the conditions is located, and reads time-series data from the sub-table using the sub-table metadata stored in the virtual node.
[0027] Preferably, forming a virtual node group (Vgroup) from at least two virtual nodes (Vnode) includes:
[0028] A virtual node group Vgroup is formed by combining at least two virtual nodes, including a leader virtual node (Leader) and at least one follower virtual node (Follower), to store the same time-series data and sub-table metadata. The client uses the Leader address to read the time-series data in the sub-table.
[0029] Preferably, achieving strong consistency and automatic failover of metadata replicas within the virtual node group (Vgroup) via the Raft protocol includes:
[0030] When a virtual group (Vgroup) detects a failure of the Leader, it elects a new Leader from the Followers according to the Raft protocol.
[0031] The election results are notified to the Mnode so that the Mnode can update the cluster topology and inform all clients of the new Leader address, so that the clients can use the new Leader address to read the time-series data in the sub-table.
[0032] Preferably, the splitting operation of moving a segment of hash values out of the hash value range for virtual nodes with excessive load includes:
[0033] A segment of hash value is split from the hash value range handled by the overloaded virtual node, forming a new hash value range, and the new hash value range is assigned to the new virtual node.
[0034] Preferably, the splitting operation of moving a segment of hash values out of the hash value range for virtual nodes with excessive load further includes:
[0035] Migrate the metadata and time-series data corresponding to the new hash value range in the overloaded virtual nodes to the new virtual nodes.
[0036] Preferably, the super table serves as the metadata entity of the template. It does not store time-series data itself, but defines a data table structure schema and tag structure tags common to all its sub-tables.
[0037] Preferably, the metadata is used to describe the structure and attributes of the database table itself, including table structure schema, tag information, table name, unique identifier UID, version number, super table to which it belongs, creation time, and time to live (TTL).
[0038] The beneficial technical effects of this invention are as follows: by calculating the hash value of the table name, the metadata of each table and its time-series data are distributed collaboratively on the same vnode, realizing the design principle of "where the data is, the metadata is there" and fundamentally avoiding cross-node access; the Raft consistency protocol is used to ensure the consistency of metadata among replicas; at the same time, by caching hot data in memory and building multi-dimensional indexes, the response time of most query operations is minimized.
Description of the Drawings
[0039] Figure 1 is a schematic diagram of the existing centralized metadata management architecture;
[0040] Figure 2 is a schematic diagram of the decentralized storage architecture for database metadata of the present invention;
[0041] Figure 3 is a schematic diagram of the method for decentralized storage of database metadata according to the present invention.
Detailed Implementation
[0042] Basic Concepts
[0043] Metadata: Describes the structure and attributes of a database table, including but not limited to table structure (Schema), tags, table name, unique identifier (UID), version number, super table to which it belongs, creation time, and time to live (TTL).
[0044] Virtual node (vnode): An independent logical entity and the basic management unit in a distributed database cluster. Each vnode is responsible for the complete metadata storage, time-series data management, and related computing tasks of all data tables within a continuous hash range. It is an autonomous unit that integrates storage, computing, and caching capabilities.
[0045] Management node (mnode): In this invention, it specifically refers to the system component responsible for maintaining the global metadata of the cluster. Its core functions include maintaining and distributing the hash range mapping table of virtual nodes (vnode), cluster topology information, user authentication and permissions, but it does not persistently store the business metadata of the user data table.
[0046] Metadata and data collaborative distribution: In this invention, it specifically refers to a data distribution design principle, which uses the same mapping rules (such as consistent hashing) to force the metadata of any data table and its generated time-series data to be located and stored in the same virtual node (vnode), ensuring that data operations do not require accessing metadata across nodes.
[0047] Consistent hashing sharding mechanism: It is a distributed sharding method that divides the entire hash space into multiple continuous intervals and assigns them to different virtual nodes (vnodes); by calculating the hash value of the table name, its metadata and data are permanently and uniformly mapped to a specific vnode.
[0048] Table-level sharding: A sharding strategy that uses a single data table as the smallest unit for data distribution and migration. Compared to sharding by database or tablespace, it enables more granular load balancing and resource scheduling.
[0049] Virtual group (vgroup): A highly available logical group consisting of multiple virtual node (vnode) replicas that uses a consistency protocol such as Raft to ensure strong data consistency. Within the group, one node acts as the Leader, responsible for handling write requests, while the others are Followers, providing data redundancy and read services.
[0050] TDB Storage Engine: An embedded storage engine designed specifically for managing metadata. It adopts a three-tier architecture of memory caching, data files (based on B+Tree), and write-ahead log (WAL), and integrates a metadata master table, a history table, and multiple index tables internally.
[0051] Multi-level index system: A set of index tables built inside the TDB storage engine to support fast multi-dimensional queries of metadata, including but not limited to UID index (pUidIdx), table name index (pNameIdx), super table index (pSuidIdx), sub-table index (pCtbIdx), tag index (pTagIdx), etc.
[0052] Write-Ahead Logging (WAL) Mechanism: A technique employed by the TDB storage engine to ensure the atomicity and durability of metadata updates. Before any data page modification is written to the data file, the modified content must first be persisted to disk as a log record. This is used for data recovery after system failures.
[0053] Multi-level caching mechanism: A hierarchical caching system that runs through the client and server sides, including the server-side memory metadata cache and super table statistics cache, as well as the client-side local schema cache and Vnode location information cache, which aims to minimize the underlying I / O and network latency.
[0054] Vnode splitting: An operation that dynamically expands cluster capacity. It refers to splitting the hash range handled by a heavily loaded virtual node (vnode) from the middle and migrating all metadata and data corresponding to the second half of the range to a new vnode, thereby achieving horizontal scaling of capacity and redistribution of load.
[0055] Super Table: A metadata entity that serves as a template. It does not store time-series data itself, but defines the data structure (Schema) and tag structure (Tags) common to all its child tables.
[0056] Child Table: An instantiation of a supertable that inherits the schema of its supertable but has its own specific tag values and actually stores time-series data.
[0057] The core of this solution lies in a decentralized metadata sharding storage method. This method divides the global hash space into multiple contiguous intervals using a consistent hashing algorithm, with each interval managed by a virtual node (vnode). By calculating the hash value of the table name, the metadata of each table and its time-series data are collaboratively distributed on the same vnode, realizing the design principle of "where the data is, the metadata is there," fundamentally avoiding cross-node access.
[0058] To ensure high system availability, multiple vnode replicas form a virtual group (vgroup), and the Raft consistency protocol is used to guarantee metadata consistency among replicas. Meanwhile, to address the need for efficient querying of massive amounts of metadata, this invention designs a multi-level caching mechanism and a multi-level indexing system. By caching hot data in memory and building multi-dimensional indexes, the response time of most query operations is minimized.
[0059] The overall architecture of this invention enables the management capacity and processing capability of metadata to increase linearly with the increase in the number of vnodes, achieving true horizontal scaling and effectively distributing access pressure and eliminating single points of failure.
[0060] To achieve the above overall solution, this invention has designed the following key technical modules, which together constitute the core of this invention's innovation, specifically including:
[0061] 1) A collaboratively distributed, consistent hashing sharding module, which is the cornerstone for achieving decentralized storage and horizontal scaling of metadata, has the following sub-modules:
[0062] Metadata and data co-location submodule: Unlike existing technologies that store metadata and data separately, this module uses a unified hash algorithm (such as MD5) to force a table's metadata and its generated time-series data onto the same vnode. This ensures that any data operation can retrieve metadata directly from the local vnode, avoiding the latency and overhead caused by cross-node remote access.
[0063] Fine-grained table-level sharding submodule: Unlike coarse-grained sharding schemes based on databases or tablespaces, this module implements fine-grained sharding at the table level. Each table independently calculates its hash value and maps it to a vnode, achieving an extremely even distribution of load and accurately addressing hotspot table issues.
[0064] The dynamic scaling submodule with minimal migration: When scaling up the cluster, this module supports migrating only a small number of tables affected by changes in hash range through vnode splitting or hash range adjustment strategies, thus achieving "minimal data migration" and ensuring the continuity and smoothness of business during the scaling process.
[0065] 2) A metadata storage engine module that supports multi-dimensional queries. This module is responsible for the efficient and reliable storage and retrieval of metadata, and has the following characteristics:
[0066] Multi-functional, integrated TDB engine: The TDB engine is not a single storage structure but an integrated storage system comprising the metadata master table (pTbDb), the schema history table (pSkmDb), and seven dedicated index tables (such as pUidIdx and pNameIdx). By isolating and managing different types of metadata and their indexes within one engine, this design keeps data organization clear and operation efficient.
[0067] Metadata structure design for time-series data: A metadata entry structure was designed to flexibly adapt to the super table, sub-tables, and ordinary tables, specifically for the time-series data model. In particular, a suid field pointing to the super table and a STag encoding structure for efficient storage of tag values were designed for sub-tables, perfectly supporting the unique business model of "one super table, multiple sub-tables" in time-series databases.
[0068] Transactional metadata write process: A strict write process based on an operation sequence (SMetaTableOp) is designed. Under the guarantee of the write-ahead log (WAL), the updates to the master table, the history table, and the multiple index tables are completed as one atomic transaction, which prevents the tables from becoming inconsistent with each other.
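The all-or-nothing behaviour of such an operation sequence can be illustrated with a minimal Python sketch. It is not the TDB implementation: the `MetaStore` class, the `(table, key, value)` operation tuples, and the in-memory list standing in for a durable write-ahead log are all assumptions made for illustration. Only the table names pTbDb, pNameIdx and pCtbIdx come from the text.

```python
class MetaStore:
    """Toy metadata store: log the operation sequence first, then apply it atomically."""

    def __init__(self):
        self.wal = []  # in-memory stand-in for a durable write-ahead log (assumption)
        self.tables = {"pTbDb": {}, "pNameIdx": {}, "pCtbIdx": {}}

    def apply(self, ops):
        """Apply a sequence of (table, key, value) operations as one transaction."""
        ops = list(ops)
        # Validate everything before anything is logged or changed.
        for table, _key, _value in ops:
            if table not in self.tables:
                raise KeyError(f"unknown table: {table}")
        # 1. Record the whole sequence in the log before touching any table.
        self.wal.append(tuple(ops))
        # 2. Apply to staged copies, then swap them in together.
        staged = {name: dict(rows) for name, rows in self.tables.items()}
        for table, key, value in ops:
            staged[table][key] = value
        self.tables = staged


store = MetaStore()
store.apply([
    ("pTbDb", (1001, 1), {"name": "meter_001", "suid": 1}),  # master table entry
    ("pNameIdx", "meter_001", 1001),                         # name -> uid index
    ("pCtbIdx", (1, 1001), True),                            # super table -> child index
])
print(len(store.wal), store.tables["pNameIdx"])
```

A sequence that contains an invalid operation is rejected as a whole, so the master table and its indexes never reflect only part of a transaction.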
[0069] 3) A multi-level caching module spanning both the client and server sides. This module is crucial for ensuring low-latency queries in the system, and its characteristics are reflected in its multi-layered and intelligent caching system:
[0070] Server-side tiered memory caching:
[0071] Level 1 metadata cache: Employs a dynamically scaling hash bucket to cache core table information (uid, version, etc.) to handle the most frequent accesses.
[0072] Second-level super table statistics cache: Independently caches the aggregation information of the super table (such as the number of child tables) to avoid frequent aggregation queries on the main table.
[0073] Intelligent client-side caching: The client not only caches the complete table schema but also the location information of vnodes. By introducing version number verification and TTL mechanisms, it fully utilizes local caching to improve performance while intelligently detecting changes in metadata or cluster topology, ensuring strong cache consistency.
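As an illustration of how version number verification and a TTL can work together on the client, the following Python sketch treats a cached schema as valid only while it is unexpired and its version matches the version reported by the server. The class name `ClientMetaCache`, its methods, and the 30-second TTL are hypothetical; the text does not specify the client cache interface.

```python
import time


class ClientMetaCache:
    """Client-side schema cache validated by TTL and version number (illustrative)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # table name -> (schema, version, expires_at)

    def put(self, table, schema, version):
        self._entries[table] = (schema, version, self._clock() + self._ttl)

    def get(self, table, current_version=None):
        """Return the cached schema, or None if it is missing, expired, or outdated."""
        entry = self._entries.get(table)
        if entry is None:
            return None
        schema, version, expires_at = entry
        expired = self._clock() >= expires_at
        changed = current_version is not None and current_version != version
        if expired or changed:
            del self._entries[table]  # force a refetch from the server
            return None
        return schema


now = [0.0]  # controllable clock so the example is deterministic
cache = ClientMetaCache(ttl_seconds=30, clock=lambda: now[0])
cache.put("power_grid.meters", ["ts", "current", "voltage"], version=1)
print(cache.get("power_grid.meters", current_version=1))  # cache hit
print(cache.get("power_grid.meters", current_version=2))  # None: schema version changed
```

In this sketch, a miss simply tells the caller to fetch the metadata again; how the client learns the current version (for example, from a response header) is left open.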
[0074] 4) A metadata consistency module that ensures high availability and historical traceability. This module guarantees the reliability and maintainability of metadata in a distributed environment; its features include:
[0075] A Raft-based distributed state machine: Each vgroup is a Raft group, and all metadata modifications are persisted as logs on a majority of nodes within the group before being applied to the state machine (i.e., the TDB engine). This ensures that even if some nodes fail, metadata will not be lost, and the cluster can quickly recover services through automatic leader election, achieving high availability.
[0076] Multi-version metadata management: A composite primary key {uid, version} is used in the metadata master table to record all historical versions of metadata. Combined with a dedicated schema history table (pSkmDb), the system natively supports time travel queries and schema change tracking, advanced features that are difficult to achieve in traditional centralized architectures.
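A minimal Python sketch of the {uid, version} composite key follows. It assumes versions are integers that start at 1 and increase by 1 on each schema change; the `VersionedMeta` class and its methods are hypothetical names, and a dictionary stands in for the metadata master table.

```python
class VersionedMeta:
    """Keeps every schema version under a composite (uid, version) key."""

    def __init__(self):
        self._rows = {}    # (uid, version) -> schema
        self._latest = {}  # uid -> latest version number

    def put(self, uid, schema):
        """Store a new schema version for uid and return its version number."""
        version = self._latest.get(uid, 0) + 1
        self._rows[(uid, version)] = list(schema)
        self._latest[uid] = version
        return version

    def get(self, uid, version=None):
        """Return the schema at a given version, or the latest one if omitted."""
        if version is None:
            version = self._latest[uid]
        return self._rows[(uid, version)]


meta = VersionedMeta()
meta.put(1, ["ts", "current", "voltage"])           # version 1
meta.put(1, ["ts", "current", "voltage", "phase"])  # version 2 after a schema change
print(meta.get(1, version=1))  # historical schema
print(meta.get(1))             # current schema
```

Because old versions are never overwritten, a query against historical data can be parsed with the schema that was in force when the data was written.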
[0077] Example
[0078] Taking a smart grid monitoring platform built by a power company as an example, the platform connects to millions of smart meters throughout the city. Each meter, as an independent device, is modeled as a sub-table in the database. All these sub-tables belong to a super table called "meters". As the business developed, the number of meters grew from the initial 100,000 to tens of millions, and the traditional centralized metadata database could no longer handle the load.
[0079] Implementation process after adopting the solution of this invention
[0080] 1. System initialization and table creation
[0081] 1.1: Creating the database and initializing vnodes
[0082] The system administrator creates the database `power_grid` and sets the initial number of virtual groups `numOfVgroups` to 4. Based on this, the system evenly divides the hash space [0, UINT32_MAX] into four intervals and creates four initial vnodes (e.g., V1, V2, V3, V4), each vnode responsible for one interval. The management node (mnode) records and maintains this mapping relationship: V1: [0, 0x3FFFFFFF], V2: [0x40000000, 0x7FFFFFFF], ...
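The even division of the hash space in step 1.1 can be reproduced with a few lines of Python. This is a sketch of the arithmetic only; the function name `split_hash_space` is an assumption and not part of the described system.

```python
UINT32_MAX = 0xFFFFFFFF


def split_hash_space(num_vgroups):
    """Divide [0, UINT32_MAX] into num_vgroups contiguous, inclusive ranges."""
    if num_vgroups < 1:
        raise ValueError("num_vgroups must be at least 1")
    total = UINT32_MAX + 1
    ranges = []
    for i in range(num_vgroups):
        begin = i * total // num_vgroups
        end = (i + 1) * total // num_vgroups - 1
        ranges.append((begin, end))
    return ranges


for name, (begin, end) in zip(["V1", "V2", "V3", "V4"], split_hash_space(4)):
    print(f"{name}: [0x{begin:08X}, 0x{end:08X}]")
```

With four virtual groups, the first two ranges match the mapping given above, and the remaining two follow the same pattern: V3 covers [0x80000000, 0xBFFFFFFF] and V4 covers [0xC0000000, 0xFFFFFFFF].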
[0083] 1.2: Creating a Super Table
[0084] Application executes SQL:
[0085] CREATE STABLE meters (ts TIMESTAMP, current FLOAT, voltage FLOAT) TAGS (location BINARY(50), customer_id INT);
[0086] The system creates a metadata entry for the super table power_grid.meters in the Mnode's SDB (SDB storage engine) and persists its structure information (schemaRow: ts, current, voltage; schemaTag: location, customer_id).
[0087] 1.3: Creating Sub-Tables (Batch Access to Smart Meters)
[0088] Create a sub-table for the electricity meter device_001:
[0089] CREATE TABLE meter_001 USING meters TAGS ("Shanghai_Pudong", 10001);
[0090] The system calculates the MD5 hash value of the sub-table with the full name power_grid.meter_001, assuming that its hash value H_001 falls within the hash range of V1.
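The lookup from table name to vnode can be sketched in Python as follows. The text names MD5 but does not say how the 128-bit digest is reduced to the 32-bit hash space, so taking the first four bytes as a big-endian integer is an assumption made here, as are the function names. The four ranges are those of step 1.1.

```python
import hashlib
from bisect import bisect_right

# Inclusive hash ranges from step 1.1: (vnode name, begin, end)
VNODE_RANGES = [
    ("V1", 0x00000000, 0x3FFFFFFF),
    ("V2", 0x40000000, 0x7FFFFFFF),
    ("V3", 0x80000000, 0xBFFFFFFF),
    ("V4", 0xC0000000, 0xFFFFFFFF),
]


def table_hash(full_name):
    """Map a full table name to a 32-bit value (digest reduction is an assumption)."""
    digest = hashlib.md5(full_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def locate_vnode(full_name):
    """Return the vnode whose hash range contains the table's hash value."""
    h = table_hash(full_name)
    starts = [begin for _name, begin, _end in VNODE_RANGES]
    index = bisect_right(starts, h) - 1
    return VNODE_RANGES[index][0]


name = "power_grid.meter_001"
print(f"hash=0x{table_hash(name):08X} -> {locate_vnode(name)}")
```

Because the mapping depends only on the table name and the range table, any client or server holding the same range table resolves a given table to the same vnode. The vnode printed by this sketch depends on the assumed digest reduction and need not be V1 as in the example above.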
[0091] Collaborative distribution takes effect here: the system creates a metadata entry for the sub-table meter_001 on V1. Key operations are as follows:
[0092] In V1's TDB, write entries to the metadata master table (pTbDb), which contains the UID (suid, pointing to the meters table in Mnode) of its super table and the specific tag value ("Shanghai_Pudong", 10001).
[0093] Update the sub-table index (pCtbIdx) on V1 to establish a relationship with the supertable meters.
[0094] Important Note: At this point, the metadata for meter_001 is stored in V1, while the metadata for its super table meters is stored in the Mnode. All future time-series data generated by this meter (such as current and voltage readings) will also be stored in V1.
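The co-location described in step 1.3 can be modelled with a small Python sketch in which one vnode object holds the sub-table metadata, the sub-table index, and the time-series rows. Plain dictionaries stand in for the pTbDb and pCtbIdx tables, and the UID values 1 and 1001 are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ChildTableMeta:
    uid: int    # UID of the sub-table
    name: str
    suid: int   # UID of the super table it belongs to
    tags: dict  # concrete tag values of this sub-table


@dataclass
class Vnode:
    name: str
    tb_db: dict = field(default_factory=dict)    # uid -> ChildTableMeta (stand-in for pTbDb)
    ctb_idx: dict = field(default_factory=dict)  # suid -> set of child uids (stand-in for pCtbIdx)
    series: dict = field(default_factory=dict)   # uid -> list of time-series rows

    def create_child_table(self, meta):
        self.tb_db[meta.uid] = meta
        self.ctb_idx.setdefault(meta.suid, set()).add(meta.uid)
        self.series[meta.uid] = []

    def insert(self, uid, row):
        if uid not in self.tb_db:  # metadata is checked locally, no remote lookup
            raise KeyError(f"table {uid} is not stored on {self.name}")
        self.series[uid].append(row)


v1 = Vnode("V1")
v1.create_child_table(ChildTableMeta(
    uid=1001, name="meter_001", suid=1,
    tags={"location": "Shanghai_Pudong", "customer_id": 10001},
))
v1.insert(1001, ("2026-01-01 00:00:00", 10.2, 220.5))
print(v1.ctb_idx[1], len(v1.series[1001]))
```

Every write to meter_001 is validated against metadata held by the same object that stores the rows, which is the "where the data is, the metadata is there" principle in miniature.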
[0095] 2. Data query process (taking querying "voltage of all electricity meters in Pudong area" as an example)
[0096] 2.1: Client-side parsing and cache lookup
[0097] The application initiates an SQL query: SELECT voltage FROM meters WHERE location = "Shanghai_Pudong";
[0098] The client-side parser recognizes that this query involves the super table `meters` and a filter on its tags. It first looks up the schema information and tag index cache for `meters` in the local client cache; assume here that no match is found.
[0099] 2.2: Locating Supertable Metadata
[0100] The client sends a request (TDMT_MND_TABLE_META) to the Mnode to obtain the schema definition (schemaRow: ts, current, voltage; schemaTag: location, customer_id) of the supertable power_grid.meters. The Mnode then queries the supertable metadata from its own SDB storage and returns it.
[0101] The client's Catalog module caches the schema of the super table locally, and subsequent identical requests will retrieve it from the local cache first to avoid repeated access to the Mnode.
[0102] 2.3: Locating and Querying Data in the Sub-Tables
[0103] Based on the tag condition "Shanghai_Pudong", the Mnode quickly locates a list of all sub-table UIDs that meet the condition through its maintained tag index (pTagIdx) and sub-table index (pCtbIdx), such as [UID of meter_001, UID of meter_005, ...].
[0104] The system determines that the UID of meter_001 is located in V1 (because when the sub-table was created, its UID, data, and metadata were all placed in V1). The client then sends data query requests in parallel to the vnodes holding the target sub-tables, such as V1 and V5.
[0105] The key advantage is that when a request arrives at V1 to query data for meter_001, V1 does not need to initiate a remote metadata query to any other node (including the Mnode). This is because it stores the complete metadata for meter_001 locally, allowing it to directly parse the position of the voltage column and read data from its local time-series data file. This significantly reduces query latency.
[0106] 2.4: Returning Results
[0107] Each vnode returns the queried data to the client, which then aggregates and presents it to the user.
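The query path of steps 2.3 and 2.4 can be sketched in Python as follows. The sketch is simplified in several ways that are assumptions rather than part of the described system: the tag index and the UID-to-vnode map are plain dictionaries, the vnodes are contacted one after another instead of in parallel, and the UIDs, vnode names, and readings are invented sample data.

```python
class Vnode:
    """Holds sub-table metadata and rows together, so column lookup is local."""

    def __init__(self):
        self.meta = {}  # uid -> column names of the sub-table
        self.rows = {}  # uid -> list of row tuples

    def add_table(self, uid, columns):
        self.meta[uid] = list(columns)
        self.rows[uid] = []

    def read_column(self, uid, column):
        position = self.meta[uid].index(column)  # resolved from local metadata only
        return [row[position] for row in self.rows[uid]]


def query(vnodes, uid_to_vnode, tag_index, tag_value, column):
    """Find matching sub-tables by tag, then read the column from each owning vnode."""
    result = {}
    for uid in tag_index.get(tag_value, []):
        result[uid] = vnodes[uid_to_vnode[uid]].read_column(uid, column)
    return result


columns = ["ts", "current", "voltage"]
vnodes = {"V1": Vnode(), "V2": Vnode()}
vnodes["V1"].add_table(1001, columns)  # meter_001
vnodes["V2"].add_table(1005, columns)  # meter_005
vnodes["V1"].rows[1001].append(("2026-01-01 00:00:00", 10.2, 220.5))
vnodes["V2"].rows[1005].append(("2026-01-01 00:00:00", 9.8, 219.8))

tag_index = {"Shanghai_Pudong": [1001, 1005]}
uid_to_vnode = {1001: "V1", 1005: "V2"}
print(query(vnodes, uid_to_vnode, tag_index, "Shanghai_Pudong", "voltage"))
```

Each vnode finds the position of the voltage column from its own copy of the sub-table metadata, so the read involves no request to another node.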
[0108] 3. Dynamic cluster expansion (to cope with a surge in the number of electricity meters)
[0109] Scenario: As the number of electricity meters exceeded ten million, system monitoring detected that V2 was overloaded.
[0110] 3.1: vnode splitting
[0111] The system automatically or manually triggers the split operation of V2. The original hash range of V2, [0x40000000, 0x7FFFFFFF], is split at the midpoint, and the latter half, [0x60000000, 0x7FFFFFFF], is assigned to the newly created vnode V5.
[0112] 3.2: Minimize Data Migration
[0113] The system only needs to migrate tables and metadata in V2 whose hash values fall within the range [0x60000000, 0x7FFFFFFF] to V5. For example, if the hash value H_777 of the sub-table meter_777 falls within the new range, its metadata (stored in the TDB in V2) and all its time-series data will be migrated to V5.
[0114] Tables whose hash values still fall within [0x40000000, 0x5FFFFFFF] will remain in V2 and will not need to be moved.
[0115] The mnode notifies all clients of the new vnode distribution (now V1, V2, V3, V4, V5). In subsequent queries, clients will automatically route the calculated hash value to the new, correct vnode.
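The midpoint split and the selection of tables to migrate in steps 3.1 and 3.2 can be checked with a short Python sketch. The function names and the two sample hash values are assumptions chosen so that one table falls in each half.

```python
def split_range(begin, end):
    """Split an inclusive hash range at its midpoint into two inclusive halves."""
    mid = begin + (end - begin + 1) // 2
    return (begin, mid - 1), (mid, end)


def plan_migration(table_hashes, new_range):
    """Return the tables whose hash falls in the range handed to the new vnode."""
    low, high = new_range
    return sorted(name for name, h in table_hashes.items() if low <= h <= high)


keep, moved = split_range(0x40000000, 0x7FFFFFFF)
print(f"V2 keeps [0x{keep[0]:08X}, 0x{keep[1]:08X}]")
print(f"V5 takes [0x{moved[0]:08X}, 0x{moved[1]:08X}]")

# Hypothetical hash values for two sub-tables currently stored on V2
tables_on_v2 = {"meter_123": 0x41000000, "meter_777": 0x6ABCDEF0}
print(plan_migration(tables_on_v2, moved))
```

Only tables whose hash lies in the upper half are selected for migration; everything else on V2, and everything on the other vnodes, stays in place.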
[0116] 4. High availability and consistency guarantee (simulating V1 Leader node failure)
[0117] Scenario: Physical node Node_A (which carries the Leader replica of V1) goes down due to a network failure.
[0118] 4.1: Automatic Failover
[0119] The virtual group (vgroup) where V1 resides immediately detects that the Leader has lost contact. The remaining Follower replicas in the group (e.g., on Node_B and Node_C) initiate an election according to the Raft protocol, assuming that the replica on Node_B becomes the new Leader.
[0120] 4.2: Seamless Service Recovery
[0121] After a successful election, mnode updates the cluster topology, informing all clients that the new Leader address for V1 is Node_B.
[0122] All subsequent client requests destined for V1 (such as querying data for meter_001) will be automatically redirected to Node_B. Because the Raft protocol guarantees the synchronization of metadata logs on the Followers, the TDB engine state on Node_B is completely consistent with the Leader before the failure, thus continuing to provide accurate service. The entire failover process is transparent to the application.
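The failover scenario can be illustrated with a deliberately simplified Python sketch. It models only two ideas from the text: a write counts as committed once a majority of replicas hold it, and after the Leader fails a surviving replica with a complete log takes over. It is not an implementation of Raft: terms, voting, heartbeats, and log repair are omitted, and the rule of choosing the live replica with the longest log (ties broken by name) is an assumption that stands in for a real election.

```python
def majority(replica_count):
    return replica_count // 2 + 1


class Vgroup:
    """Toy replica group: majority-acknowledged writes and a stand-in election."""

    def __init__(self, replicas, leader):
        self.logs = {replica: [] for replica in replicas}
        self.alive = set(replicas)
        self.leader = leader

    def write(self, entry):
        """Append entry on all live replicas; succeed only with a live leader and a majority."""
        if self.leader not in self.alive:
            return False
        if len(self.alive) < majority(len(self.logs)):
            return False
        for replica in self.alive:
            self.logs[replica].append(entry)
        return True

    def fail(self, replica):
        self.alive.discard(replica)

    def elect(self):
        """Pick the live replica with the longest log (simplified stand-in for an election)."""
        if len(self.alive) < majority(len(self.logs)):
            raise RuntimeError("no majority available")
        self.leader = min(self.alive, key=lambda r: (-len(self.logs[r]), r))
        return self.leader


group = Vgroup(["Node_A", "Node_B", "Node_C"], leader="Node_A")
group.write("create meter_001")
group.fail("Node_A")                    # the Leader goes down
topology = {"V1": group.elect()}        # the mnode would publish this address
print(topology, group.logs[topology["V1"]])
```

The new Leader already holds every committed entry, which is why clients redirected to it see the same metadata as before the failure.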
[0123] The distributed metadata storage solution of TDengine in this invention solves the bottleneck of traditional centralized architecture through the following key technologies:
[0124] Metadata and data collaborative distribution mechanism: This protects the method of forcibly distributing data table metadata (such as table structure and tag information) and its corresponding time-series data in the same storage unit (virtual node / vnode) through the same mapping rules (such as consistent hashing). This mechanism is the cornerstone of this solution, fundamentally eliminating the overhead of accessing metadata across nodes and realizing "where the data is, the metadata is there".
[0125] Fine-grained sharding and dynamic expansion method based on consistent hashing: This method protects data tables at the smallest granularity and distributes them evenly across multiple virtual nodes (vnodes) using a consistent hashing algorithm. It includes hash space partitioning, table-to-vnode mapping algorithms, and dynamic expansion strategies that migrate only a portion of the data and achieve smooth expansion through vnode splitting.
[0126] An integrated storage and indexing engine designed specifically for metadata: Protection is sought for the internal architecture of the TDB storage engine, especially its integrated design of a metadata master table, a schema history table, and multi-dimensional indexes (UID index, name index, and tag index), as well as a write process that ensures the durability and consistency of metadata through a write-ahead log (WAL) and atomic transactions.
[0127] A smart multi-level caching system that spans both the server and client sides: Protection is sought for a multi-level caching structure that caches core table metadata and super table statistics on the server side, and caches the complete schema and vnode location information on the client side. In particular, it combines version number verification with a TTL mechanism to keep the client cache consistent.
[0128] A distributed consistency framework that ensures high availability and historical traceability: Protection is sought for the mechanism that achieves strong consistency of metadata replicas and automatic failover within virtual groups (vgroups) through the Raft protocol, and that saves historical versions of metadata through multi-version control (such as {uid, version} composite primary keys) to support time travel queries and schema change traceability.
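The {uid, version} composite primary key mentioned above can be illustrated with a minimal sketch of a schema history table. The class name SchemaHistory, its methods, and the column names in the example are hypothetical, and an in-memory dictionary stands in for the storage engine.

```python
import bisect


class SchemaHistory:
    """Keeps every schema version under a (uid, version) composite key."""

    def __init__(self):
        self._rows = {}       # (uid, version) -> schema
        self._versions = {}   # uid -> sorted list of stored versions

    def put(self, uid: int, version: int, schema: tuple) -> None:
        key = (uid, version)
        if key in self._rows:
            raise ValueError(f"duplicate key {key}")
        self._rows[key] = schema
        bisect.insort(self._versions.setdefault(uid, []), version)

    def get_as_of(self, uid: int, version: int):
        """Return the newest schema whose version is <= the requested one."""
        versions = self._versions.get(uid, [])
        i = bisect.bisect_right(versions, version)
        if i == 0:
            return None  # the table did not exist at that version
        return self._rows[(uid, versions[i - 1])]


history = SchemaHistory()
history.put(1001, 1, ("ts", "current"))
history.put(1001, 3, ("ts", "current", "voltage"))
```

Because old rows are never overwritten, a time travel query for version 2 resolves to the schema stored under version 1, and the full list of versions for a UID is the schema change trail.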
[0129] In summary, the present invention provides a method for decentralized storage of database metadata, as shown in Figure 3, which includes:
[0130] The global hash space is divided into multiple consecutive hash value ranges using a consistent hashing algorithm, and these ranges are then assigned to multiple virtual nodes (vnodes) on the server, so that each virtual node corresponds to a hash value range.
[0131] The metadata of each data table and the generated time-series data are forcibly mapped and stored in the same virtual node, so that any operation on the time-series data on the virtual node can obtain the metadata on the virtual node, thereby avoiding the latency and overhead caused by remote access to metadata across nodes;
[0132] At least two virtual nodes (Vnodes) are combined into a virtual node group (Vgroup), and strong consistency and automatic failover of metadata replicas are achieved within the virtual node group (Vgroup) through the Raft protocol.
[0133] A multi-level caching system spanning both the server and client is used to cache hot data and minimize query response time, while also building multi-dimensional indexes.
[0134] When an overloaded virtual node is detected, a new virtual node is created, and a split operation is performed on the overloaded virtual node, moving a segment of hash values out of its hash value range.
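The first two steps above, range partitioning and co-located mapping, can be sketched as follows. This is an illustration under stated assumptions: the 32-bit hash space, the use of CRC32 as the hash function, and the names HashRing, table_hash, and locate are not taken from the source, which does not name a hash function.

```python
import bisect
import zlib

HASH_SPACE = 2 ** 32  # assumed size of the global hash space


def table_hash(table_name: str) -> int:
    # CRC32 is a stand-in; the source does not specify the hash function.
    return zlib.crc32(table_name.encode("utf-8")) % HASH_SPACE


class HashRing:
    """Splits [0, HASH_SPACE) into contiguous ranges, one per vnode."""

    def __init__(self, vnode_ids):
        n = len(vnode_ids)
        # starts[i] is the first hash value owned by vnodes[i].
        self.starts = [i * HASH_SPACE // n for i in range(n)]
        self.vnodes = list(vnode_ids)

    def locate(self, table_name: str) -> str:
        h = table_hash(table_name)
        # Find the last range whose start is <= h.
        return self.vnodes[bisect.bisect_right(self.starts, h) - 1]


ring = HashRing(["V1", "V2", "V3", "V4"])
# Metadata and time-series data use the same rule, so they land together.
vnode_for_metadata = ring.locate("meter_001")
vnode_for_data = ring.locate("meter_001")
```

Using one locate function for both kinds of lookup is the point of the sketch: there is no separate metadata placement rule that could send the metadata to a different node.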
[0135] Specifically, the metadata of each data table and the generated time-series data are forcibly mapped and stored in the same virtual node as follows:
[0136] In the Mnode storage engine SDB on the server-side management node, metadata entries for the super table are created and the structural information of the metadata entries is persistently stored.
[0137] Create metadata entries for a sub-table on each virtual node to store time-series data, and write metadata entries containing the UID of the super table to which the data belongs and the specific tag value into the metadata master table of the storage engine TDB on each virtual node.
[0138] The sub-table index of each virtual node is associated with the super table, thereby storing the metadata of the sub-table and the generated time-series data in the same virtual node.
[0139] Each sub-table's hash value falls within the hash value range of its corresponding virtual node.
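A minimal sketch of what one virtual node holds after these steps is shown below. The class and field names are hypothetical, plain dictionaries stand in for the TDB metadata master table and its indexes, and the tag name "location" is an invented example. The tag index follows the engine description earlier in this document.

```python
from collections import defaultdict


class Vnode:
    def __init__(self):
        self.meta = {}                      # uid -> metadata entry (master table)
        self.sub_tables = defaultdict(set)  # super table uid -> sub-table uids
        self.tag_index = defaultdict(set)   # (tag name, value) -> sub-table uids
        self.rows = defaultdict(list)       # uid -> [(timestamp, value), ...]

    def create_sub_table(self, uid, name, suid, tags):
        # The entry records the owning super table's UID and the tag values.
        self.meta[uid] = {"name": name, "suid": suid, "tags": dict(tags)}
        self.sub_tables[suid].add(uid)
        for item in tags.items():
            self.tag_index[item].add(uid)

    def insert(self, uid, ts, value):
        # The metadata check is local: no other node is contacted.
        if uid not in self.meta:
            raise KeyError(f"no metadata for table uid {uid} on this vnode")
        self.rows[uid].append((ts, value))

    def find(self, suid, tag, value):
        """UIDs of sub-tables of `suid` whose tag equals `value`."""
        owned = self.sub_tables.get(suid, set())
        tagged = self.tag_index.get((tag, value), set())
        return sorted(owned & tagged)


v1 = Vnode()
v1.create_sub_table(1, "meter_001", 100, {"location": "Beijing"})
v1.create_sub_table(2, "meter_002", 100, {"location": "Shanghai"})
v1.insert(1, 1700000000, 10.5)
```

Both the metadata entry and the rows for meter_001 live in the same object, which mirrors the co-location the steps above require.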
[0140] Using the multi-level caching system spanning the server and the client to cache hot data, minimize query response time, and build multi-dimensional indexes includes the following steps: the client sends a query request to the super table to obtain and cache metadata containing the super table's table structure, so that subsequent identical query requests first locate the list of qualifying sub-table UIDs from the cached super table metadata; the super table, based on the tag conditions in the query request and using the tag index and sub-table index it maintains, quickly locates the list of all sub-table UIDs that meet the conditions and returns it to the client; and the client sends a data query request to the virtual node (Vnode) containing the qualifying sub-tables and reads time-series data from the sub-tables using the sub-table metadata stored in that virtual node.
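The client-side caching step can be sketched as follows. This is a minimal illustration, not the TDengine implementation: the class name ClientMetaCache, the ttl_seconds parameter, the injectable clock, and the way the server's version number reaches the client are assumptions. It combines the two checks this document names for client cache consistency, version number verification and TTL.

```python
import time


class ClientMetaCache:
    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock   # injectable so expiry can be tested
        self._entries = {}    # table name -> (version, schema, expires_at)

    def put(self, name, version, schema):
        self._entries[name] = (version, schema, self._clock() + self._ttl)

    def get(self, name, server_version=None):
        """Return the cached schema, or None on a miss, expiry, or version mismatch."""
        entry = self._entries.get(name)
        if entry is None:
            return None
        version, schema, expires_at = entry
        expired = self._clock() >= expires_at
        stale = server_version is not None and server_version != version
        if expired or stale:
            del self._entries[name]  # drop it so the caller refetches
            return None
        return schema


now = [0.0]  # fake clock for the example
cache = ClientMetaCache(ttl_seconds=10, clock=lambda: now[0])
cache.put("meters", 1, ("ts", "current"))
hit = cache.get("meters", server_version=1)
miss_after_schema_change = cache.get("meters", server_version=2)
```

A None result means the client should fetch the metadata again and call put with the new version; the TTL bounds how long a client can use an entry when no version number is available to compare.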
[0141] The process of forming a virtual node group (Vgroup) from at least two virtual nodes (Vnode) includes: forming a virtual node group (Vgroup) from at least two virtual nodes, comprising a leader virtual node (Leader) and at least one follower virtual node (Follower), for storing the same time-series data and sub-table metadata, wherein the client uses the Leader address to read the time-series data in the sub-table.
[0142] The implementation of strong consistency and automatic failover of metadata replicas within the virtual node group (Vgroup) via the Raft protocol includes: when the Vgroup detects a failure of the Leader, it elects a new Leader from the Followers according to the Raft protocol; the election result is notified to the Mnode so that the Mnode updates the cluster topology and informs all clients of the new Leader address so that the clients can use the new Leader address to read time-series data in the sub-table.
[0143] The split operation that moves a segment of hash values out of the overloaded virtual node's hash value range includes: splitting off a segment of hash values from the hash value range managed by the overloaded virtual node, forming a new hash value range from that segment, and allocating the new hash value range to the new virtual node.
[0144] The split operation also includes: migrating the metadata and time-series data that fall within the new hash value range from the overloaded virtual node to the new virtual node.
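The split and migration can be sketched as below. Splitting at the midpoint of the range is an assumption, since the source does not say where the range is cut, and the Vnode fields and the small integer hash values are chosen only to keep the example readable.

```python
from dataclasses import dataclass, field


@dataclass
class Vnode:
    name: str
    lo: int   # inclusive start of the hash value range
    hi: int   # exclusive end of the hash value range
    tables: dict = field(default_factory=dict)  # table hash -> (metadata, rows)


def split(overloaded: Vnode, new_name: str) -> Vnode:
    """Move the upper half of the overloaded vnode's range to a new vnode."""
    mid = (overloaded.lo + overloaded.hi) // 2  # assumed split point
    new = Vnode(new_name, mid, overloaded.hi)
    overloaded.hi = mid
    # Metadata and time-series data move together, so they stay co-located.
    for h in [h for h in overloaded.tables if h >= mid]:
        new.tables[h] = overloaded.tables.pop(h)
    return new


v1 = Vnode("V1", 0, 100)
v1.tables = {
    10: ({"name": "meter_001"}, [(1, 10.5)]),
    60: ({"name": "meter_002"}, [(1, 11.0)]),
    99: ({"name": "meter_003"}, []),
}
v5 = split(v1, "V5")
```

Only the tables whose hash falls in the moved segment are migrated; every other vnode in the cluster, and the tables remaining on the overloaded vnode, are untouched.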
[0145] The super table serves as a template metadata entity. It does not store time-series data itself, but defines the data table structure (schema) and tag structure (tags) that are common to all its sub-tables.
[0146] The metadata is used to describe the structure and attributes of the database table itself, including table structure schema, tag information, table name, unique identifier UID, version number, super table to which it belongs, creation time and time to live (TTL).
[0147] Although the present invention has been described in detail above, it is not limited thereto, and those skilled in the art can make various modifications based on the principles of the present invention. Therefore, all modifications made in accordance with the principles of the present invention should be understood to fall within the protection scope of the present invention.
Claims
1. A method for decentralized storage of database metadata, comprising:
the global hash space is divided into multiple consecutive hash value intervals using a consistent hashing algorithm, and these intervals are assigned to multiple virtual nodes (vnodes) on the server, so that each virtual node corresponds to one hash value interval;
the metadata of each data table and the generated time-series data are forcibly mapped and stored in the same virtual node, so that any operation on the time-series data on the virtual node can obtain the metadata on the virtual node, thereby avoiding the latency and overhead caused by remote access to metadata across nodes;
at least two virtual nodes (Vnodes) are combined into a virtual node group (Vgroup), and strong consistency and automatic failover of metadata replicas are achieved within the virtual node group (Vgroup) through the Raft protocol;
a multi-level caching system spanning both the server and client is used to cache hot data and minimize query response time, while also building multi-dimensional indexes;
when a virtual node with excessive load is detected, a new virtual node is created, and the virtual node with excessive load is split by moving a segment of hash values out of its hash value range;
wherein the metadata of each data table and the generated time-series data are forcibly mapped and stored in the same virtual node as follows:
in the Mnode storage engine SDB on the server-side management node, metadata entries for the super table are created and the structural information of the metadata entries is persistently stored;
on each virtual node, a metadata entry for a sub-table used to store time-series data is created, and a metadata entry containing the unique identifier UID of the supertable to which it belongs and the specific tag value is written into the metadata master table of the storage engine TDB on each virtual node;
the sub-table index of each virtual node is associated with the super table, thereby storing the metadata of the sub-table and the generated time-series data in the same virtual node.
2. The method according to claim 1, wherein the hash value of each sub-table falls within the hash value range of the virtual node in which it resides.
3. The method according to claim 1, wherein using a multi-level caching system spanning the server and client to cache hot data, minimize query response time, and establish a multi-dimensional index comprises: the client sends a query request to the supertable to obtain the metadata of the table structure of the supertable and caches it, so that subsequent identical query requests can first locate the list of sub-table UIDs that meet the conditions from the supertable metadata cached by the client; the supertable quickly locates a list of all sub-table UIDs that meet the conditions based on the tag conditions in the query request, using the tag index and sub-table index it maintains, and returns it to the client; and the client sends a data query request to the virtual node (Vnode) where the sub-table that meets the conditions is located, and reads time-series data from the sub-table using the sub-table metadata stored in the virtual node.
4. The method according to claim 2, wherein combining at least two virtual nodes (Vnodes) into a virtual node group (Vgroup) comprises: forming the virtual node group (Vgroup) from at least two virtual nodes, including a leader virtual node (Leader) and at least one follower virtual node (Follower), which store the same time-series data and sub-table metadata; and the client uses the Leader address to read the time-series data in the sub-table.
5. The method according to claim 4, wherein achieving strong consistency and automatic failover of metadata replicas within the virtual node group (Vgroup) through the Raft protocol comprises: when the virtual node group (Vgroup) detects a failure of the Leader, it elects a new Leader from the Followers according to the Raft protocol; and the election result is notified to the Mnode, so that the Mnode updates the cluster topology and informs all clients of the new Leader address, and the clients use the new Leader address to read the time-series data in the sub-table.
6. The method according to claim 4, wherein the split operation on the overloaded virtual node, which moves a segment of hash values out of its hash value range, comprises: splitting off a segment of hash values from the hash value range managed by the overloaded virtual node, forming a new hash value range from that segment, and assigning the new hash value range to the new virtual node.
7. The method according to claim 6, wherein the split operation further comprises: migrating the metadata and time-series data corresponding to the new hash value range from the overloaded virtual node to the new virtual node.
8. The method according to claim 1, wherein the supertable serves as a template metadata entity that defines the data table structure (schema) and tag structure (tags) common to all its sub-tables, and does not itself store time-series data.
9. The method according to claim 1, wherein the metadata describes the structure and attributes of the database table itself, including table structure schema, tag information, table name, unique identifier UID, version number, super table to which it belongs, creation time, and time to live (TTL).