Data balancing method and apparatus for redirect-on-write distributed storage engine
By employing a data balancing method based on logical topologies and write-time redirection in a distributed storage system, the problem of data imbalance is solved, thereby improving system performance and scalability.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- CHINA TELECOM CLOUD TECH CO LTD
- Filing Date
- 2025-11-24
- Publication Date
- 2026-06-11
Description
A method and apparatus for data balancing in a write-time redirection distributed storage engine.
[0001] Related applications
[0002] This application claims priority to the Chinese patent application filed on December 3, 2024, application number 2024117656806, entitled "A method and apparatus for data balancing in a distributed storage engine with write-time redirection", the entire contents of which are incorporated herein by reference.
Technical Field
[0003] This application relates to the field of distributed storage technology, specifically to a method and apparatus for data balancing in a write-time redirection distributed storage engine.
Background Technology
[0004] Distributed storage, also known in the storage industry as Software Defined Storage (SDS), is defined by the Storage Networking Industry Association (SNIA) as: a type of virtualized storage with a service management interface. SDS includes storage pooling functionality and allows for the definition of data service characteristics within the storage pool through the service management interface. The most crucial aspect of virtualized storage is the virtualization of the storage hardware. Compared to traditional storage, distributed storage no longer relies on proprietary hardware (such as storage controllers) and proprietary networks (such as FC networks), and can provide professional storage services through general-purpose hardware and a unified network.
[0005] Distributed storage technology is now widely used in traditional data centers, public cloud data centers, private cloud data centers, and hyperconverged workstations. Among them, the distributed storage engine provides high-performance, highly reliable, and efficient data access capabilities for distributed storage, and is the core component of distributed storage.
[0006] Distributed storage engines offer various implementation options, such as the widely used Ceph distributed storage system and the Google File System (GFS). Both employ a distributed storage architecture, enabling data to be distributed across multiple servers, making them suitable for scenarios like big data analytics, cloud computing, and large-scale data storage. However, they both face the problem of data imbalance, meaning that data is unevenly distributed among the nodes in the distributed system. This can cause some nodes to bear an excessive storage or processing burden, significantly impacting the system's performance, scalability, and fault tolerance.
Summary of the Invention
[0007] According to various embodiments of the present application, a method and apparatus for data balancing of a write-time redirection distributed storage engine are provided.
[0008] In a first aspect, the present application provides a method for data balancing of a write-time redirection distributed storage engine, executed by a computer device, comprising:
[0009] executing a preset disk selection balancing strategy based on a logical topology structure, wherein the logical topology structure is one of a plurality of different types of logical topology structures determined after reorganizing a heterogeneous cluster physical topology structure, and each logical topology structure includes a plurality of logically isolated sub-topologies;
[0010] after executing the preset disk selection balancing strategy, obtaining a PG used capacity and a hard disk used capacity;
[0011] based on the PG used capacity and the hard disk used capacity, executing a preset space allocation balancing strategy;
[0012] after executing the preset space allocation balancing strategy, based on a star-type read-write mode, executing a preset usage balancing strategy; and,
[0013] after executing the preset usage balancing strategy, if the cluster still does not reach a balanced state, executing a preset background balancing strategy.
[0014] In an optional implementation, executing the preset disk selection balancing strategy based on the logical topology structure comprises:
[0015] step a, determining a weight and a failure domain of each topology node in the logical topology structure;
[0016] step b, selecting an Nth layer in the logical topology structure, wherein N = 1;
[0017] step c, determining a first topology node in a maximum weight set in the Nth layer and a failure domain corresponding to the first topology node; the first topology node is any node in the maximum weight set;
[0018] step d, if the failure domain corresponding to the first topology node is a set failure domain, determining whether the first topology node meets a failure requirement;
[0019] step e1, if the first topology node meets the failure requirement, selecting the first topology node, setting N = N + 1, returning to step c, until N is the last layer number in the logical topology structure, and completing disk selection of the logical topology; and,
[0020] step e2, if the first topology node does not meet the failure requirement, returning to step b, and taking a second largest weight set in the Nth layer as the maximum weight set.
[0021] In one optional implementation, after completing the disk selection for the logical topology, the method further includes:
[0022] The parity shards and data shards of all PGs in the cluster are readjusted so that each hard drive in the logical topology is used as an EC parity shard and as an EC data shard an equal number of times.
[0023] In one optional implementation, executing the preset space allocation balancing strategy based on the PG used capacity and the hard disk used capacity includes:
[0024] Determine whether the difference in PG used capacity exceeds a first preset capacity;
[0025] If the difference in PG used capacity exceeds the first preset capacity, determine whether the time for which it has exceeded the first preset capacity exceeds a first preset time;
[0026] If that time exceeds the first preset time, establish a first cyclic PG group, which is used to balance the used capacity of the PGs;
[0027] Determine whether the difference in hard disk used capacity exceeds a second preset capacity;
[0028] If the difference in hard disk used capacity exceeds the second preset capacity, determine whether the time for which it has exceeded the second preset capacity exceeds a second preset time; and,
[0029] If that time exceeds the second preset time, establish a second cyclic PG group on the hard disks that have exceeded the second preset capacity for longer than the second preset time. The second cyclic PG group is used to balance the used capacity of the hard disks.
[0030] In one optional implementation, executing the preset usage balancing strategy based on the star topology read/write method includes:
[0031] Obtain the I/O data of each hard drive under the star topology read/write method;
[0032] Based on the I/O data, determine data types with different levels of popularity; and,
[0033] Assign the data types with different levels of popularity to the corresponding logical topologies based on the type of each logical topology.
[0034] In one optional implementation, executing the preset background balancing strategy includes:
[0035] Obtain statistical information for all PGs and all hard drives within the cluster;
[0036] Based on the statistical information of all PGs, identify hot PGs;
[0037] Based on the statistical information of all hard drives, identify hot hard drives; and,
[0038] If a hot PG and a hot hard drive both remain hot for longer than a third preset time, migrate the hot PG from the hot hard drive to a non-hot hard drive through PG migration.
[0039] In a second aspect, this application provides a data balancing device for a write-time redirection distributed storage engine, the device comprising:
[0040] The disk selection balancing strategy execution module is used to execute a preset disk selection balancing strategy based on the logical topology. The logical topology is one of several different types of logical topologies determined after reorganizing the physical topology of the heterogeneous cluster, and each logical topology includes multiple logically isolated sub-topologies.
[0041] The space allocation balancing strategy execution module is used to obtain the used capacity of the PG and the used capacity of the hard disk after executing the preset disk selection balancing strategy, and to execute the preset space allocation balancing strategy based on the used capacity of the PG and the used capacity of the hard disk.
[0042] The usage balancing strategy execution module is used to execute a preset usage balancing strategy based on the star topology read/write method after the preset space allocation balancing strategy has been executed; and,
[0043] The background balancing strategy execution module is used to execute a preset background balancing strategy if the cluster still fails to reach a balanced state after the preset usage balancing strategy has been executed.
[0044] In a third aspect, this application provides a computer device, including a memory and a processor that are communicatively connected to each other. The memory stores computer instructions, and the processor performs the following steps by executing the computer instructions:
[0045] A preset disk selection balancing strategy is executed based on a logical topology. The logical topology is one of several different types of logical topologies determined after reorganizing the physical topology of the heterogeneous cluster, and each logical topology includes multiple logically isolated sub-topologies.
[0046] After executing the preset disk selection balancing strategy, obtain the used capacity of the PG and the used capacity of the hard disk;
[0047] Based on the used capacity of the PG and the used capacity of the hard disk, a preset space allocation balancing strategy is executed;
[0048] After executing the preset space allocation balancing strategy, based on the star topology read/write method, a preset usage balancing strategy is executed; and,
[0049] If the cluster still fails to reach a balanced state after executing the preset usage balancing strategy, the preset background balancing strategy is executed.
[0050] In a fourth aspect, this application provides a computer-readable storage medium storing computer instructions that, when executed by a processor, perform the following steps:
[0051] A preset disk selection balancing strategy is executed based on a logical topology. The logical topology is one of several different types of logical topologies determined after reorganizing the physical topology of the heterogeneous cluster, and each logical topology includes multiple logically isolated sub-topologies.
[0052] After executing the preset disk selection balancing strategy, obtain the used capacity of the PG and the used capacity of the hard disk;
[0053] Based on the used capacity of the PG and the used capacity of the hard disk, a preset space allocation balancing strategy is executed;
[0054] After executing the preset space allocation balancing strategy, based on the star topology read/write method, a preset usage balancing strategy is executed; and,
[0055] If the cluster still fails to reach a balanced state after executing the preset usage balancing strategy, the preset background balancing strategy is executed.
[0056] In a fifth aspect, this application provides a computer program product, including computer instructions, which, when executed by a processor, perform the following steps:
[0057] A preset disk selection balancing strategy is executed based on a logical topology. The logical topology is one of several different types of logical topologies determined after reorganizing the physical topology of the heterogeneous cluster, and each logical topology includes multiple logically isolated sub-topologies.
[0058] After executing the preset disk selection balancing strategy, obtain the used capacity of the PG and the used capacity of the hard disk;
[0059] Based on the used capacity of the PG and the used capacity of the hard disk, a preset space allocation balancing strategy is executed;
[0060] After executing the preset space allocation balancing strategy, based on the star topology read/write method, a preset usage balancing strategy is executed; and,
[0061] If the cluster still fails to reach a balanced state after executing the preset usage balancing strategy, the preset background balancing strategy is executed.
Attached Figure Description
[0062] To more clearly illustrate the technical solutions in the embodiments of this application or the conventional technology, the drawings used in the description of the embodiments or the conventional technology will be briefly introduced below. Obviously, the drawings described below are only embodiments of this application. For those skilled in the art, other drawings can be obtained based on the disclosed drawings without creative effort.
[0063] Figure 1 is a flowchart illustrating a data balancing method for a write-time redirection distributed storage engine according to an embodiment of this application;
[0064] Figure 2 is a schematic diagram of the physical topology according to an embodiment of this application;
[0065] Figure 3 is a schematic diagram of one logical topology according to an embodiment of this application;
[0066] Figure 4 is a schematic diagram of disk selection according to an embodiment of this application;
[0067] Figure 5 is a schematic diagram of the readjustment of parity shards and data shards according to an embodiment of this application;
[0068] Figure 6 is a schematic diagram of the adjusted parity shards and data shards according to an embodiment of this application;
[0069] Figure 7 is a schematic diagram of the PG loop according to an embodiment of this application;
[0070] Figure 8 is a schematic diagram of star topology read/write according to an embodiment of this application;
[0071] Figure 9 is a schematic diagram of usage balancing according to an embodiment of this application;
[0072] Figure 10 is a schematic diagram of PG migration according to an embodiment of this application;
[0073] Figure 11 is a structural block diagram of a data balancing device for a write-time redirection distributed storage engine according to an embodiment of this application;
[0074] Figure 12 is a schematic diagram of the hardware structure of a computer device according to an embodiment of this application.
Detailed Implementation
[0075] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0076] Currently, Ceph manages cluster balancing at a relatively coarse granularity, which means that the hardware cannot be fully utilized during actual production operation. Alleviating the data balancing problem even to some extent requires maintenance by people who understand Ceph at the design and code level. Furthermore, due to the Ceph architecture, some optimizations, such as balancing data shards and parity shards in EC, cannot be achieved.
[0077] GFS offers more granular management, enabling cluster data balancing through both space allocation and background adjustment. However, it consumes significant master memory resources, and because GFS is primarily a replica architecture, it has no design for balancing data shards and parity shards under EC (erasure coding), and thus fails to achieve optimal data balancing.
[0078] In view of this, according to the embodiments of this application, an embodiment of a data balancing method for a write-time redirection distributed storage engine is provided. It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions. Furthermore, although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order than that shown here.
[0079] This embodiment provides a data balancing method for a write-time redirection distributed storage engine, which can be executed by a cluster management center in devices such as servers, terminals, and mobile terminals. Figure 1 is a flowchart of the write-time redirection distributed storage engine data balancing method according to an embodiment of this application. As shown in Figure 1, the process includes the following steps:
[0080] Step S101: Execute a preset disk selection balancing strategy based on the logical topology.
[0081] Here, the logical topology is one of several different types of logical topologies determined after reorganizing the physical topology of the heterogeneous cluster, and each logical topology includes multiple logically isolated sub-topologies.
[0082] The physical topology reflects the mapping relationship between physical hardware servers and hard drives. All server information within the cluster, together with the information on the hard drives each server manages, is collected at the chunkmaster. (The chunkmaster is the cluster management center; in the distributed storage engine of this embodiment it is responsible for data distribution, data balancing, fault recovery, cluster scaling, cluster monitoring, virtual space allocation, and virtual space to physical space conversion. Different storage engines use different names for it, such as master and CM (cluster master).) Server information includes CPU, memory, number of hard drives, network bandwidth information, and hard drive information. Hard drive information includes the hard drive medium (e.g., TLC SSD, QLC+SLC SSD, HDD), the hard drive protocol (SAS, SATA, NVMe), and the hard drive capacity. A schematic diagram of the physical topology is shown in Figure 2. Based on the physical topology and the actual reliability requirements, similar hardware is grouped into a logical topology pool. A logical topology is a logically isolated fault domain whose computing resources, network resources, and storage resources are logically isolated. That is, servers of the same type are organized into logically isolated logical topologies. One such logical topology, an ultra-high performance logical topology, is shown in Figure 3. Logical topologies can also include high performance logical topologies, ordinary logical topologies, and large IO capacity logical topologies.
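As an illustration of how a physical topology might be regrouped into logical topologies, the following is a minimal Python sketch. The class names, the `build_logical_topologies` function, and the choice of (medium, protocol) as the grouping key are assumptions made for this example; the application itself only states that similar hardware is grouped together.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Disk:
    disk_id: str
    medium: str       # e.g. "TLC SSD", "HDD"
    protocol: str     # e.g. "NVMe", "SATA"
    capacity_gb: int


@dataclass(frozen=True)
class Server:
    name: str
    disks: tuple


def build_logical_topologies(servers):
    """Group disks by (medium, protocol) so each pool holds similar hardware.

    The grouping key is a hypothetical stand-in for "hardware of the same type".
    """
    pools = defaultdict(lambda: defaultdict(list))
    for server in servers:
        for disk in server.disks:
            pools[(disk.medium, disk.protocol)][server.name].append(disk.disk_id)
    return {key: dict(by_server) for key, by_server in pools.items()}


servers = [
    Server("s1", (Disk("d1", "TLC SSD", "NVMe", 3840),
                  Disk("d2", "TLC SSD", "NVMe", 3840))),
    Server("s2", (Disk("d3", "HDD", "SATA", 16000),)),
    Server("s3", (Disk("d4", "TLC SSD", "NVMe", 3840),)),
]
topologies = build_logical_topologies(servers)
```

Each key of `topologies` then plays the role of one logically isolated pool, mapping server names to the disks they contribute.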
[0083] Balanced disk selection is performed within each logical topology. Since the physical hardware characteristics are consistent within a topology, the disk weight alone can serve as the selection criterion for completing disk selection in the logical topology. The disk weight is calculated as capacity / number of PGs. Because the disks are selected randomly, logical topology disk selection leads to imbalances in the distribution of EC (erasure coding). EC is a coding technique that adds m parity blocks to n original data blocks and can restore the original data from any n of the n+m blocks. Mathematically, encoding is represented by constructing a system of multivariate linear equations, and decoding by solving that system; further abstraction reduces both to matrix operations. Current erasure coding techniques are mainly divided into two categories: one is erasure coding based on Galois field operations (such as RS); the other is erasure coding based on XOR operations (such as LDPC). Under EC protection mode, from the cluster perspective, the disks may not be evenly distributed between EC parity shards and EC data shards. Therefore, further balancing adjustments are needed to ensure that each disk is used as an EC parity shard and as an EC data shard an equal number of times, thus completing the disk selection balancing strategy.
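To make the erasure coding idea concrete, the following is a minimal Python sketch of the simplest XOR-based case, with n data shards and a single parity shard (m = 1). It is not the RS or LDPC coding named above, and the function names are hypothetical; it only illustrates that a lost shard can be rebuilt from the remaining ones.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally sized shards."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data_shards):
    """Return one parity shard: the XOR of all data shards."""
    parity = data_shards[0]
    for shard in data_shards[1:]:
        parity = xor_bytes(parity, shard)
    return parity


def recover(shards, parity):
    """Rebuild the single missing shard (marked None) from the others plus parity."""
    missing = shards.index(None)
    rebuilt = parity
    for i, shard in enumerate(shards):
        if i != missing:
            rebuilt = xor_bytes(rebuilt, shard)
    return rebuilt


data = [b"abcd", b"efgh", b"ijkl"]      # n = 3 data shards
parity = encode(data)                    # m = 1 parity shard
damaged = [data[0], None, data[2]]       # one shard lost
restored = recover(damaged, parity)
```

In a healthy cluster only the data shards are read, which is why the placement of the parity shard affects how read load spreads over the disks.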
[0084] Step S102: After executing the preset disk selection balancing strategy, obtain the used capacity of the PG and the used capacity of the hard disk.
[0085] Step S103: Based on the used capacity of the PG and the used capacity of the hard disk, execute the preset space allocation balancing strategy.
[0086] In this embodiment, virtual addresses (chunks) are allocated uniformly by the cluster management center. A chunk, meaning a large block, here describes a byte stream space allocated from a PG; it is typically at most 128MB, its reliability is guaranteed by the PG, and it is distinguished by a unique identifier (chunkID). A PG (Protect Group) here describes a logical protection group composed of a group of independent virtual hard disks, organized using EC or replicas to provide reliability guarantees and to provide read/write space externally. Chunks are allocated with large-loop priority, on the principle that the number of chunks allocated on each PG is equal. For a normally functioning cluster, the number and capacity of chunks are the same. A chunkserver here describes a single-machine storage server that manages the storage space and the byte stream chunk space on that machine. When the difference in PG used capacity reported by a chunkserver exceeds 10GB and remains so for more than one day, a "small-loop PG group" is established to catch up in capacity. When monitoring shows that the difference in hard drive used capacity reported by a chunkserver exceeds 10GB and remains so for more than two days, a "small-loop PG group" is established on these hard drives to catch up in capacity, thereby implementing the space allocation balancing strategy.
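The "exceeds a capacity threshold and stays there for longer than a time threshold" condition can be sketched as follows. This is a minimal Python illustration, assuming that the "difference in used capacity" is the gap between the largest and smallest reported values; the class name `ImbalanceTracker` and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

GB = 1024 ** 3
DAY = 24 * 60 * 60


@dataclass
class ImbalanceTracker:
    capacity_threshold: int            # e.g. 10 GB, as in this embodiment
    duration_threshold: float          # e.g. one day for PGs, two days for disks
    exceeded_since: Optional[float] = None

    def observe(self, used_capacities, now):
        """Return True once the spread has stayed above the threshold long enough."""
        spread = max(used_capacities) - min(used_capacities)
        if spread <= self.capacity_threshold:
            self.exceeded_since = None          # balanced again: reset the timer
            return False
        if self.exceeded_since is None:
            self.exceeded_since = now           # first time over the threshold
        return now - self.exceeded_since > self.duration_threshold


pg_tracker = ImbalanceTracker(capacity_threshold=10 * GB, duration_threshold=1 * DAY)
first = pg_tracker.observe([50 * GB, 65 * GB], now=0.0)
later = pg_tracker.observe([50 * GB, 65 * GB], now=DAY + 1.0)
recovered = pg_tracker.observe([50 * GB, 55 * GB], now=DAY + 2.0)
```

A `True` result would be the point at which the cluster management center establishes a small-loop PG group; a second tracker with a two-day duration could cover the hard drive case.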
[0087] Step S104: After executing the preset space allocation balancing strategy, execute the preset usage balancing strategy based on the star topology read / write method.
[0088] Reading and writing a chunk actually means reading and writing a group of independent virtual hard disks. Each virtual hard disk handles only the reads and writes that concern itself, while I/O data aggregation and distribution are handled by an external client that is independent of the virtual hard disks. This embodiment calls this read/write method star topology. In this embodiment, once data has been written to a chunk location by the ROW (redirect-on-write) engine, that data cannot be changed, and star topology read/write is supported. Combined with evenly distributed parity disks, further balancing can be achieved. In addition, when heterogeneous servers exist within a cluster, logical topologies are used to logically isolate servers of different types, and when space is used, different logical topologies are chosen according to the characteristics of the data types, thereby balancing the heterogeneous cluster.
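The star topology read/write pattern can be sketched in Python as a client that fans data out to the virtual disks and gathers it back, with no disk forwarding traffic to another. The `VirtualDisk` and `StarClient` classes are hypothetical, parity computation is omitted, and rejecting a second write to the same location is a simplified stand-in for the immutability that ROW provides.

```python
class VirtualDisk:
    """Stores only its own slice of each write; knows nothing about its peers."""

    def __init__(self):
        self.blocks = {}

    def write(self, chunk_id, offset, payload):
        if (chunk_id, offset) in self.blocks:
            raise ValueError("written data is immutable under redirect-on-write")
        self.blocks[(chunk_id, offset)] = payload

    def read(self, chunk_id, offset):
        return self.blocks[(chunk_id, offset)]


class StarClient:
    """The hub of the star: distributes on write, aggregates on read."""

    def __init__(self, disks):
        self.disks = disks

    def write(self, chunk_id, offset, data):
        size = -(-len(data) // len(self.disks))      # ceiling division
        for i, disk in enumerate(self.disks):
            disk.write(chunk_id, offset, data[i * size:(i + 1) * size])

    def read(self, chunk_id, offset):
        return b"".join(disk.read(chunk_id, offset) for disk in self.disks)


disks = [VirtualDisk() for _ in range(3)]
client = StarClient(disks)
client.write("chunk-1", 0, b"abcdefgh")
round_trip = client.read("chunk-1", 0)
```

Because every disk talks only to the client, no single disk acts as a primary that concentrates I/O, which is the property the usage balancing strategy relies on.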
[0089] Step S105: If the cluster still fails to reach a balanced state after executing the preset usage balancing strategy, execute the preset background balancing strategy.
[0090] Specifically, the cluster management center collects statistical information on all PGs and all hard drives within the cluster. Based on this information, it identifies the PGs and hard drives that have become hot due to sudden traffic spikes. If a hot PG and a hot hard drive stay hot, the hot PG is migrated from the hot hard drive to a non-hot hard drive through PG migration, thereby achieving performance and capacity balance within the cluster.
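The background balancing decision can be sketched as a small planning function. This is a minimal Python illustration under several assumptions: load is a single number per PG and per disk, each PG is mapped to one disk for simplicity, and the thresholds and the `plan_migrations` name are hypothetical.

```python
def plan_migrations(pg_load, disk_load, placement, hot_seconds,
                    pg_limit, disk_limit, min_hot_seconds):
    """Return (pg, source_disk, target_disk) moves for persistently hot PGs."""
    hot_disks = {d for d, load in disk_load.items() if load > disk_limit}
    # Candidate targets: non-hot disks, coolest first.
    cool_disks = sorted((d for d in disk_load if d not in hot_disks),
                        key=disk_load.get)
    plan = []
    for pg, load in pg_load.items():
        source = placement[pg]
        persistent = hot_seconds.get(pg, 0) > min_hot_seconds
        if load > pg_limit and source in hot_disks and persistent and cool_disks:
            plan.append((pg, source, cool_disks[0]))
    return plan


plan = plan_migrations(
    pg_load={"pg1": 900, "pg2": 100, "pg3": 850},
    disk_load={"dA": 1000, "dB": 100, "dC": 850},
    placement={"pg1": "dA", "pg2": "dB", "pg3": "dC"},
    hot_seconds={"pg1": 7200, "pg3": 60},   # pg3 has only just become hot
    pg_limit=500, disk_limit=800, min_hot_seconds=3600,
)
```

Here only `pg1` is moved: `pg3` is hot but has not stayed hot long enough, which keeps short traffic bursts from triggering migrations.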
[0091] Among related technologies, Google's publicly disclosed distributed storage engine GFS (Google File System) provides an append-only file system. GFS uses a unified management node to manage the file-to-space mapping metadata, and the management node queries data block information from the individual data block servers at startup. Data balancing is achieved through the management node's balanced allocation of data blocks and direct migration of data blocks. As a result, when the GFS cluster is large, the management node must manage a huge number of data blocks. Furthermore, because GFS serves data as files, the data block and file information consumes significant memory. The lack of a multi-logical-topology concept also leads to complex handling when heterogeneous servers appear in the cluster. Another widely used technology is Ceph, a distributed storage system that uses a distributed algorithm (the CRUSH algorithm) to distribute data objects according to the weight of each storage device, making the distribution approximately uniform. Ceph calculates a list of OSDs (Object Storage Devices) for reading and writing. Since these OSD lists are calculated, it is difficult to guarantee that each OSD appears evenly at each position. Therefore, during normal EC (erasure coding) reads in the cluster, the OSDs that hold parity are not read, which causes uneven read distribution. Furthermore, because the number of Placement Groups (PGs) in Ceph is limited compared to the vast number of objects, each OSD may act as the primary of a PG a different number of times. Since Ceph handles both read and write operations through the primary OSD, this can lead to cluster imbalance. While adjusting weights can address this imbalance, it causes data migration across multiple PGs, making it difficult to achieve the fine-grained control available in a centrally managed cluster.
[0092] This application proposes a hierarchical data balancing method for write-time redirection distributed storage engines. It breaks down cluster data balancing into four levels: disk selection balancing, space allocation balancing, usage balancing, and background balancing adjustment. A unified cluster management center monitors the cluster hardware, and combined with the ROW storage engine and hierarchical storage architecture, achieves data balancing within the cluster. When applied to a distributed storage engine, this method distributes the capacity pressure on each storage medium and the performance pressure across each network, computing, and storage hardware within the cluster as evenly as possible. In other words, it achieves balanced data distribution within the cluster through fine-grained control of the cluster by a unified cluster management center.
[0093] In some optional implementations, executing the preset disk selection balancing strategy based on the logical topology includes:
[0094] Step a: Determine the weight and fault domain of each topology node in the logical topology structure;
[0095] Step b: Select the Nth layer in the logical topology, where N = 1;
[0096] Step c: Determine the first topology node in the maximum weight set of the Nth layer and the fault domain corresponding to the first topology node; the first topology node is any node in the maximum weight set.
[0097] Step d: If the fault domain corresponding to the first topology node is the set fault domain, determine whether the first topology node meets the fault requirements.
[0098] Step e1: If the fault requirements are met, select the first topology node, set N = N + 1, and return to step c, repeating until N is the last layer number in the logical topology structure, which completes the disk selection for the logical topology.
[0099] Step e2: If the fault requirements are not met, return to step b and take the second largest weight set in the Nth layer as the largest weight set.
[0100] Referring to Figure 4, this embodiment uses a greedy algorithm for selecting the disk:
[0101] Step 1: Starting from the root layer, randomly scatter the topological points layer by layer and select them sequentially according to the weight order of the logical topology points;
[0102] Step 2: When selecting down to the next level, if the logical topology point is found to be a fault domain, decide whether to select it according to the fault domain requirements;
[0103] Step 3: After discarding a fault domain that does not meet the requirements, return to Step 1 and select the logical topology point with the next highest weight.
[0104] Step 4: When the fault domain requirements are met, the layer is randomly shuffled downwards, and the logical topology point with the highest weight is selected until the hard drive is selected.
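The greedy walk described in Steps 1 to 4 can be sketched in Python as follows. The `Node` class and `select_disk` function are hypothetical, the "fault domain requirement" is assumed to mean that a fault domain may be used at most once per PG, and ties are broken by list order rather than randomly to keep the example deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    weight: float
    is_fault_domain: bool = False
    children: list = field(default_factory=list)


def select_disk(root, used_domains):
    """Walk from the root to a disk, taking the heaviest acceptable node per layer."""
    node, entered = root, []
    while node.children:
        ranked = sorted(node.children, key=lambda c: c.weight, reverse=True)
        # Skip fault domains already used by this PG and fall back to the next weight.
        chosen = next((c for c in ranked
                       if not (c.is_fault_domain and c.name in used_domains)), None)
        if chosen is None:
            return None
        if chosen.is_fault_domain:
            entered.append(chosen.name)
        node = chosen
    used_domains.update(entered)
    return node.name


def disk(name, capacity_gb, pg_count):
    # Disk weight as stated in this embodiment: capacity / number of PGs.
    return Node(name, weight=capacity_gb / pg_count)


root = Node("root", weight=0.0, children=[
    Node("server-a", 3.0, True, [disk("a1", 4000, 2), disk("a2", 4000, 4)]),
    Node("server-b", 2.0, True, [disk("b1", 4000, 4)]),
    Node("server-c", 1.0, True, [disk("c1", 4000, 8)]),
])
used = set()
pg_disks = [select_disk(root, used) for _ in range(3)]
exhausted = select_disk(root, used)
```

With three server-level fault domains, the three picks land on three different servers, and a fourth pick fails because no unused fault domain remains.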
[0105] Disk selection balancing organizes hardware of the same type into a logical topology. Large heterogeneous storage server clusters are managed through logical topology isolation and storage federation, which moves the balancing of the heterogeneous cluster up to the usage balancing level. Disk selection balancing considers only the balancing of hard drives of the same type. The number of available PGs in the cluster is determined by disk selection within the storage cluster, ensuring that each hard drive appears an equal number of times in each position within the PGs. Even in EC (erasure coding) scenarios, this keeps read and write operations relatively evenly distributed across the cluster.
[0106] In some optional implementations, after completing the selection of disks for the logical topology, the following steps are also included:
[0107] The parity shards and data shards of all PGs in the cluster are readjusted so that each hard drive in the logical topology is used as an EC parity shard and as an EC data shard an equal number of times.
[0108] In the storage field, EC protection groups have been found to provide higher read concurrency and higher disk utilization than replicas (three replicas <33%, while EC 4+2 can provide >60% utilization). The limitation that EC writes require full-stripe writes is gradually being resolved by the industry, and the use of EC protection groups in distributed storage engines is becoming mainstream in the distributed storage industry. EC data shards serve normal cluster reads, while parity shards serve only repair reads. When parity shards and data shards are unevenly distributed across disks, the disk read pressure within the cluster is uneven while the cluster is in its normal state. The cluster is in a normal state for more than 99.99% of the time; database OLTP workloads are generally about 70% reads and OLAP workloads generally more than 90% reads, so read balance can significantly improve cluster throughput. For example, the distributed storage engines used in related technologies, such as Ceph, use random object names and consistent hash calculations to assign objects to a group of storage resources, and the I/O flow is first concentrated on the master of that group. Although, from a mathematical probability point of view, capacity and performance in the cluster are balanced given complete randomness and a sufficiently large number of objects, actual workloads do not follow a perfect mathematical model. Furthermore, the EC protection mode widely used in the storage industry divides shards into data shards and parity shards. When the cluster is normal, only data shards serve reads; when the cluster is abnormal, parity shards are used for repair reads. The cluster is normal more than 99.999% of the time, which causes read imbalance. The number of PGs in Ceph is not large, so mathematically it cannot be guaranteed that parity is balanced across every medium, and hence read balance cannot be achieved.
[0109] Therefore, as shown in Figures 5 and 6, this embodiment re-adjusts the parity shards and data shards of all PGs within the cluster, ensuring that all hard drives appear the same number of times in each position within the PG. This guarantees that read performance is evenly distributed across all network, computing, and storage resources within the cluster. In the EC scenario, disk selection ensures a balanced distribution of all disks across the EC's parity shards and data shards from the perspective of the logical topology cluster, effectively improving the overall performance balance of the cluster.
[0110] In this embodiment, the hard drives for erasure coding are selected so that every hard drive in the cluster appears the same number of times in each position across all erasure coding protection groups, thereby achieving erasure coding read / write balance. This method can effectively prevent uneven hard drive load and improve the performance and reliability of the overall storage system.
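The effect of position balancing can be illustrated with a minimal sketch. It assumes a simple rotation rule (PG i is rotated by i positions), which is a stand-in chosen for illustration and not necessarily the adjustment procedure of this embodiment. Rotation yields an exact balance only in the simplified case shown, where every PG spans the same disks and the number of PGs is a multiple of the PG width. The function names and disk labels are hypothetical.

```python
from collections import Counter


def rotate_positions(pgs):
    """Rotate the disk order of PG i by i positions (illustrative rule only)."""
    balanced = []
    for i, disks in enumerate(pgs):
        shift = i % len(disks)
        balanced.append(disks[shift:] + disks[:shift])
    return balanced


def parity_counts(pgs, data_shards):
    """Count how many times each disk occupies a parity position."""
    counts = Counter()
    for disks in pgs:
        for disk in disks[data_shards:]:
            counts[disk] += 1
    return counts


# Six PGs over the same six disks, EC 4+2: positions 0-3 hold data shards,
# positions 4-5 hold parity shards.
pgs = [["d0", "d1", "d2", "d3", "d4", "d5"] for _ in range(6)]
before = parity_counts(pgs, data_shards=4)
after = parity_counts(rotate_positions(pgs), data_shards=4)

print("parity roles before:", dict(before))
print("parity roles after: ", dict(after))
```

Before rotation, only two disks ever hold parity shards and therefore serve no normal reads; after rotation, every disk holds parity in the same number of PGs.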
[0111] In some optional implementations, a preset space allocation balancing strategy is executed based on the used capacity of the PG and the used capacity of the hard disk, including:
[0112] Determine whether the difference in used capacity of PG exceeds the first preset capacity;
[0113] If the difference in the used capacity of PGs exceeds the first preset capacity, determine whether the duration for which it has exceeded the first preset capacity exceeds the first preset time;
[0114] If the duration exceeds the first preset time, a first cyclic PG group is established. The first cyclic PG group is used to balance the used capacity of PGs;
[0115] Determine whether the difference in used hard drive capacity exceeds the second preset capacity;
[0116] If the difference in used hard drive capacity exceeds the second preset capacity, determine whether the duration for which it has exceeded the second preset capacity exceeds the second preset time;
[0117] If the duration exceeds the second preset time, a second cyclic PG group is established on the hard drives whose capacity difference has exceeded the second preset capacity for longer than the second preset time. The second cyclic PG group is used to balance the used capacity of the hard drives.
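The two-stage check in the steps above (a capacity difference that must also persist for a preset time) can be sketched as follows. This is a minimal illustration under stated assumptions: the class ImbalanceMonitor and its fields are hypothetical names, the difference is taken as maximum minus minimum used capacity, and the thresholds are the example values of this embodiment (10 GB, with one day for PGs and two days for hard drives).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImbalanceMonitor:
    capacity_threshold: float  # preset capacity difference, in bytes
    time_threshold: float      # preset duration, in seconds
    exceeded_since: Optional[float] = None

    def should_create_small_loop(self, used: dict, now: float) -> bool:
        """Return True once the spread has stayed above the threshold long enough."""
        spread = max(used.values()) - min(used.values())
        if spread <= self.capacity_threshold:
            self.exceeded_since = None  # imbalance cleared, restart the clock
            return False
        if self.exceeded_since is None:
            self.exceeded_since = now   # first observation above the threshold
        return now - self.exceeded_since > self.time_threshold


GB = 1024 ** 3
DAY = 24 * 3600

# Assumed example thresholds: 10 GB for both, one day for PGs, two days for disks.
pg_monitor = ImbalanceMonitor(capacity_threshold=10 * GB, time_threshold=1 * DAY)
disk_monitor = ImbalanceMonitor(capacity_threshold=10 * GB, time_threshold=2 * DAY)
```

Resetting the clock when the difference falls back under the threshold is a design choice of this sketch; it prevents a short-lived imbalance from triggering a cyclic PG group.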
[0118] Specifically, a unified cluster management center allocates virtual address chunks, each distinguished by a unique identifier, the chunkID. The disk groups contained in the PGs to which adjacent chunkIDs are allocated are largely non-overlapping, ensuring that no disk hotspots occur when chunks are used sequentially under normal conditions. Allocation through the large loop takes priority and follows the principle that each PG is allocated an equal number of chunks. For a normally functioning cluster, the number and capacity of chunks in each PG are the same. When the cluster management center detects that the difference in used capacity of PGs reported by a single storage server exceeds 10 GB and has lasted for more than one day, it establishes a "small-loop PG group" so that the lagging PGs can catch up. When the difference in used capacity of hard drives reported by a single storage server exceeds 10 GB and has lasted for more than two days, a "small-loop PG group" is established for the PGs on these hard drives so that they can catch up. The specific processing procedure is shown in Figure 7. Valid PGs are allocated chunks in a "round-robin" manner. When PG capacities are inconsistent, 90% of the requested chunks are allocated through the large loop and 10% through the small loop, which lets the lagging PGs catch up in capacity. The loop range is uniformly controlled by the cluster management center. Furthermore, the cluster management center monitors the available capacity and usage percentage of each disk. When disk capacity is uneven, the relevant PGs are isolated from the large loop to form a small loop, so that the chunks allocated by the cluster management center are writable as far as possible.
[0119] In this embodiment, the cluster space allocation is uniformly performed by the cluster management center based on statistical information within the cluster. Chunks are allocated through a "large loop" to ensure that the number of chunks within each PG is consistent. When recovering from a failure after a period of time or when scaling up, chunks are added through a small loop while the large loop is being used for allocation. If the number of chunks is consistent but the capacity is unbalanced for more than one day, the PG with the lower capacity is individually added to the small loop to achieve balance.
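The interplay of the large loop and the small loop can be sketched as a pair of round-robin iterators. This is a simplified illustration, not the allocator of this embodiment: the function allocate_chunks is a hypothetical name, every tenth request is routed to the small loop to approximate the 90% / 10% split, and lagging PGs are assumed to remain members of the large loop.

```python
from collections import Counter
from itertools import cycle


def allocate_chunks(all_pgs, lagging_pgs, num_chunks):
    """Assign each chunk request to a PG using a large loop and a small loop."""
    large_loop = cycle(all_pgs)
    small_loop = cycle(lagging_pgs) if lagging_pgs else None
    assignments = []
    for i in range(num_chunks):
        # Every tenth request (10%) goes to the small loop when one exists;
        # all other requests are served round-robin by the large loop.
        if small_loop is not None and i % 10 == 9:
            assignments.append(next(small_loop))
        else:
            assignments.append(next(large_loop))
    return assignments


balanced = Counter(allocate_chunks(["pg0", "pg1", "pg2"], [], 99))
catching_up = Counter(allocate_chunks(["pg0", "pg1", "pg2"], ["pg2"], 100))

print("no small loop:  ", dict(balanced))
print("with small loop:", dict(catching_up))
```

With no lagging PG, every PG receives the same number of chunks. With pg2 in the small loop, pg2 receives its normal share plus the 10% reserved for catching up.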
[0120] In some optional implementations, a preset usage balancing strategy is executed based on the star topology read / write method, including:
[0121] Obtain the I / O data for each hard drive in a star topology read / write mode;
[0122] Based on IO data, the data types with different popularity levels are determined;
[0123] Data types with different popularity are assigned to the corresponding logical topology based on the type of the logical topology.
[0124] Referring to Figure 8, this embodiment replaces chained writes with a star topology, resulting in more balanced read / write operations within the cluster. When heterogeneous servers exist within a cluster, the logical topology isolates servers with different structures. When space is used, IO type analysis is performed, and different topologies are used according to the characteristics of the business, so that the heterogeneous cluster operates in a balanced manner. For example, based on long-term statistics, garbage collection (GC) is used to place hot data into a high-performance topology and cold data into a normal-performance topology, as shown in Figure 9.
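The assignment of data with different popularity to different topology types can be sketched as a simple threshold classification. This is a minimal illustration: the function name assign_topology, the single hot_threshold parameter, and the chunk labels are assumptions, since the text does not specify how popularity is measured or where the hot / cold boundary lies.

```python
def assign_topology(io_counts, hot_threshold):
    """Map each data item to a topology type by its observed IO count."""
    placement = {}
    for item, count in io_counts.items():
        if count >= hot_threshold:
            placement[item] = "high-performance"    # hot data
        else:
            placement[item] = "normal-performance"  # cold data
    return placement


# Hypothetical long-term IO statistics per chunk.
io_counts = {"chunk-a": 12000, "chunk-b": 35, "chunk-c": 980}
placement = assign_topology(io_counts, hot_threshold=1000)

for chunk, topology in placement.items():
    print(chunk, "->", topology)
```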
[0125] By organizing similar hardware into different topologies, a heterogeneous storage cluster federation approach is used to manage ultra-large heterogeneous storage clusters. Different business types and hardware characteristics allow data to be evenly distributed across different federated clusters, thereby achieving balanced cluster performance. For example, in public clouds, there may be timed massive I / O operations where the business's I / O performance requirements are not high, but the impact on other business performance within the cluster can be mitigated by gradually redirecting traffic to a sandbox cluster without the business being aware of the impact. Through long-term hot and cold I / O analysis, the amount of hot and cold I / O can be distributed, achieving overall cluster performance balance.
[0126] In this embodiment, a star topology is used for read / write operations, without a master node responsible for forwarding read / write operations. In scenarios with high read volumes, the master node will not become a read hotspot. When heterogeneous servers exist within a cluster, logical topology isolates servers with different structures. When using space, different topologies are used based on the characteristics of the business to achieve heterogeneous cluster load balancing, which can further effectively optimize data balance.
[0127] In some optional implementations, a preset background balancing strategy is executed, including:
[0128] Obtain statistical information for all PGs and all hard drives within the cluster;
[0129] Based on the statistical information of all PGs, hot PGs are identified;
[0130] Based on the statistical information of all hard drives, hot hard drives are identified;
[0131] If a hot PG and a hot hard drive persist for a third preset time, the hot PG is migrated from the hot hard drive to a non-hot hard drive through PG migration.
[0132] If the above three layers of data balancing still fail to achieve cluster balance, the cluster management center will analyze cluster information. If the cluster experiences prolonged periods of hard drive hotspots and PG hotspots, or if hard drive and PG capacities are inconsistent, the cluster management center can perform PG migration, as shown in Figure 10, migrating hot PGs to non-hotspot hard drives. Because the number of PGs migrated and the migration bandwidth can be configured, the impact on the entire cluster is small and controllable, and it can further effectively improve data balance. Here, authoritative shards represent reliable data.
[0133] The cluster management center monitors information about each component of the cluster. For example, if it finds that certain hard drives and certain PGs on those hard drives will have performance hotspots at fixed intervals of 1 hour, while other hard drives and PGs will not have this problem, then the hot PGs will be migrated to other hard drives without hotspots, and the non-hotspot PGs will be migrated back. This design can also solve the problem of periodically hot PGs on public clouds to some extent.
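The background migration decision can be sketched as follows. This is a simplified illustration under stated assumptions: plan_migrations and all of its parameters are hypothetical names, each PG is treated as residing on a single hard drive, the target is the non-hot hard drive currently holding the fewest PGs, and max_migrations stands in for the configurable number of migrated PGs mentioned above. Constraints such as fault domains and migration bandwidth are not modeled.

```python
def plan_migrations(pg_to_disk, hot_pgs, hot_disks, disk_pg_count,
                    hot_seconds, min_hot_seconds, max_migrations):
    """Return (pg, source_disk, target_disk) moves for persistently hot PGs."""
    counts = dict(disk_pg_count)  # work on a copy so the caller's data is untouched
    candidates = [d for d in counts if d not in hot_disks]
    moves = []
    for pg in sorted(hot_pgs):
        if len(moves) >= max_migrations or not candidates:
            break
        source = pg_to_disk[pg]
        # Migrate only when the PG sits on a hot drive and has stayed hot
        # for at least the preset time.
        if source not in hot_disks or hot_seconds.get(pg, 0) < min_hot_seconds:
            continue
        target = min(candidates, key=lambda d: counts[d])
        moves.append((pg, source, target))
        counts[source] -= 1
        counts[target] += 1  # account for the move so later moves spread out
    return moves


moves = plan_migrations(
    pg_to_disk={"pg1": "d1", "pg2": "d1", "pg3": "d2"},
    hot_pgs={"pg1", "pg2"},
    hot_disks={"d1"},
    disk_pg_count={"d1": 2, "d2": 1, "d3": 0},
    hot_seconds={"pg1": 7200, "pg2": 600},
    min_hot_seconds=3600,
    max_migrations=2,
)
print(moves)
```

In this example only pg1 is moved, because pg2 has not yet been hot for the preset time.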
[0134] This embodiment also provides a data balancing device for a write-time redirection distributed storage engine. This device is used to implement the above embodiments and preferred embodiments, and details already described will not be repeated. As used below, the term "module" can be a combination of software and / or hardware that implements a predetermined function. Although the device described in the following embodiments is preferably implemented in software, hardware implementation, or a combination of software and hardware, is also possible and contemplated.
[0135] This application embodiment also provides a data balancing device for a write-time redirection distributed storage engine, as shown in Figure 11. The device includes: a balancing strategy execution module 201, a space allocation balancing strategy execution module 202, a usage balancing strategy execution module 203, and a background balancing strategy execution module 204. Wherein:
[0136] The balancing strategy execution module 201 is used to execute a preset disk selection balancing strategy based on the logical topology. The logical topology is one of several different types of logical topologies determined after reorganizing the physical topology of the heterogeneous cluster, and each logical topology includes multiple logically isolated sub-topologies.
[0137] The space allocation balancing strategy execution module 202 is used to obtain the used capacity of the PG and the used capacity of the hard disk after executing the preset disk selection balancing strategy; and to execute the preset space allocation balancing strategy based on the used capacity of the PG and the used capacity of the hard disk.
[0138] The usage balancing strategy execution module 203 is used to execute the preset usage balancing strategy based on the star topology read / write method after executing the preset space allocation balancing strategy.
[0139] The background balancing strategy execution module 204 is used to execute a preset background balancing strategy if the cluster still fails to achieve balance after executing the preset usage balancing strategy.
[0140] In this embodiment, the data balancing device for the write-time redirection distributed storage engine is presented in the form of a functional unit. Here, a unit refers to an ASIC circuit, a processor and memory that execute one or more software or fixed programs, and / or other devices that can provide the above functions.
[0141] Further functional descriptions of the above modules and units are the same as those in the corresponding embodiments described above, and will not be repeated here.
[0142] This application embodiment also provides a computer device having the write-time redirection distributed storage engine data balancing device shown in Figure 11 above.
[0143] Please refer to Figure 12, which is a schematic diagram of the structure of a computer device provided in an optional embodiment of this application. As shown in Figure 12, the computer device includes: one or more processors 10, a memory 20, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components communicate with each other using different buses and can be installed on a common motherboard or otherwise as needed. The processor can process instructions executed within the computer device, including instructions stored in or on memory to display graphical information of a GUI on an external input / output device (such as a display device coupled to the interface). In some optional embodiments, multiple processors and / or multiple buses can be used with multiple memories and multiple memory modules, if desired. Similarly, multiple computer devices can be connected, each providing some of the necessary operations (e.g., as a server array, a group of blade servers, or a multiprocessor system). Figure 12 shows an example of a single processor 10.
[0144] Processor 10 may be a central processing unit, a network processor, or a combination thereof. Processor 10 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The programmable logic device may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
[0145] The memory 20 stores instructions executable by at least one processor 10 to cause the at least one processor 10 to perform the method shown in the above embodiments.
[0146] The memory 20 may include a program storage area and a data storage area. The program storage area may store the operating system and applications required for at least one function; the data storage area may store data created based on the use of the computer device. Furthermore, the memory 20 may include high-speed random access memory and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, the memory 20 may optionally include memory remotely located relative to the processor 10, and these remote memories may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
[0147] The memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk or solid-state drive; the memory 20 may also include a combination of the above types of memory.
[0148] The computer device also includes a communication interface 30 for communicating with other devices or communication networks.
[0149] This application also provides a computer-readable storage medium. The methods described in this application can be implemented in hardware or firmware, or as computer code recorded on a storage medium, or as computer code that is downloaded over a network, originally stored on a remote storage medium or a non-transitory machine-readable storage medium, and subsequently stored on a local storage medium. Thus, the methods described herein can be processed by software stored on a storage medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware. The storage medium can be a magnetic disk, optical disk, read-only memory, random access memory, flash memory, hard disk, or solid-state drive, etc.; further, the storage medium can also include combinations of the above types of memory. It is understood that computers, processors, microprocessor controllers, or programmable hardware include storage components capable of storing or receiving software or computer code. When the software or computer code is accessed and executed by the computer, processor, or hardware, the methods shown in the above embodiments are implemented.
[0150] A portion of this application can be applied as a computer program product, such as computer program instructions, which, when executed by a computer, can invoke or provide the methods and / or technical solutions according to this application through the operation of the computer. Those skilled in the art will understand that the forms in which computer program instructions exist in a computer-readable medium include, but are not limited to, source files, executable files, installation package files, etc. Correspondingly, the ways in which computer program instructions are executed by a computer include, but are not limited to: the computer directly executing the instructions, or the computer compiling the instructions and then executing the corresponding compiled program, or the computer reading and executing the instructions, or the computer reading and installing the instructions and then executing the corresponding installed program. Here, the computer-readable medium can be any available computer-readable storage medium or communication medium accessible to a computer.
[0151] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0152] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application should be determined by the appended claims.
Claims
1. A data balancing method for a write-time redirection distributed storage engine, executed by a computer device, wherein the method includes: executing a preset disk selection balancing strategy based on a logical topology, wherein the logical topology is one of several different types of logical topologies determined after reorganizing the physical topology of the heterogeneous cluster, and each logical topology includes multiple logically isolated sub-topologies; after executing the preset disk selection balancing strategy, obtaining the used capacity of the PG and the used capacity of the hard disk; based on the used capacity of the PG and the used capacity of the hard disk, executing a preset space allocation balancing strategy; after executing the preset space allocation balancing strategy, executing a preset usage balancing strategy based on the star topology read / write method; and, if the cluster still fails to achieve balance after executing the preset usage balancing strategy, executing a preset background balancing strategy.
2. The method according to claim 1, wherein the execution of the preset disk selection balancing strategy based on the logical topology includes: Step a: determine the weight and fault domain of each topology node in the logical topology structure; Step b: select the Nth layer in the logical topology, where N = 1; Step c: determine the first topology node in the maximum weight set of the Nth layer and the fault domain corresponding to the first topology node, the first topology node being any node in the maximum weight set; Step d: if the fault domain corresponding to the first topology node is a set fault domain, determine whether the first topology node meets the fault requirements; Step e1: if the fault requirements are met, select the first topology node and set N = N + 1, then return to step c, until N is the last layer number in the logical topology structure, completing the selection of the logical topology; and, Step e2: if the fault requirements are not met, return to step b and take the second largest weight set in the Nth layer as the largest weight set.
3. The method according to claim 2, wherein, after completing the selection of the logical topology, the method further includes: readjusting the parity shards and data shards of all PGs in the cluster so that each hard disk in the logical topology serves as an EC parity shard and as an EC data shard an equal number of times.
4. The method according to claim 1, wherein the step of executing a preset space allocation balancing strategy based on the used capacity of the PG and the used capacity of the hard disk includes: determining whether the difference in used capacity of the PG exceeds the first preset capacity; if the difference in the used capacity of the PG exceeds the first preset capacity, determining whether the duration for which it has exceeded the first preset capacity exceeds the first preset time; if the duration exceeds the first preset time, establishing a first cyclic PG group, which is used to balance the used capacity of the PG; determining whether the difference in used hard drive capacity exceeds the second preset capacity; if the difference in used hard drive capacity exceeds the second preset capacity, determining whether the duration for which it has exceeded the second preset capacity exceeds the second preset time; and, if the duration exceeds the second preset time, establishing a second cyclic PG group on the hard drive that exceeds the second preset capacity and exceeds the second preset time, the second cyclic PG group being used to balance the used capacity of the hard drive.
5. The method according to claim 1, wherein the execution of the preset usage balancing strategy based on the star topology read / write method includes: obtaining the I / O data of each hard disk in the star topology read / write mode; based on the I / O data, determining the data types with different levels of popularity; and, assigning the data types with different popularity to the corresponding logical topology structures according to the type of the logical topology structure.
6. The method according to claim 1, wherein the execution of the preset background balancing strategy includes: obtaining statistical information for all PGs and all hard drives within the cluster; based on the statistical information of all the PGs, identifying hot PGs; based on the statistical information of all the hard drives, identifying hot hard drives; and, if the hot PG and the hot hard drive continue to exist for a third preset time, migrating the hot PG from the hot hard drive to a non-hot hard drive through PG migration.
7. A data balancing device for a write-time redirection distributed storage engine, wherein the device includes: a balancing strategy execution module, used to execute a preset disk selection balancing strategy based on a logical topology structure, wherein the logical topology structure is one of several different types of logical topologies determined after reorganizing the physical topology structure of the heterogeneous cluster, and each logical topology structure includes multiple logically isolated sub-topologies; a space allocation balancing strategy execution module, used to obtain the used capacity of the PG and the used capacity of the hard disk after executing the preset disk selection balancing strategy, and to execute the preset space allocation balancing strategy based on the used capacity of the PG and the used capacity of the hard disk; a usage balancing strategy execution module, used to execute a preset usage balancing strategy based on a star topology read / write method after executing the preset space allocation balancing strategy; and a background balancing strategy execution module, used to execute a preset background balancing strategy if the cluster still fails to achieve balance after the preset usage balancing strategy has been executed.
8. A computer device comprising a memory and a processor, wherein the memory and the processor are communicatively connected to each other, the memory stores computer-readable instructions, and the processor executes the steps of the method according to any one of claims 1-6 by executing the computer-readable instructions.
9. A computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions are used to cause the computer to perform the steps of the method according to any one of claims 1-6.
10. A computer program product, wherein the computer program product includes computer instructions for causing a computer to perform the steps of the method according to any one of claims 1-6.