Data storage method and apparatus, computer device, storage medium, and program product

By deploying data nodes across multiple availability zones and dynamically maintaining the remaining space ratio in a distributed database system, and employing layered routing and uniform storage, the problems of long data node determination time and hotspots are solved, thereby improving storage efficiency and system availability.

WO2026124149A1 · PCT designated stage · Publication Date: 2026-06-18 · CHINA TELECOM CLOUD TECH CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
CHINA TELECOM CLOUD TECH CO LTD
Filing Date
2025-11-19
Publication Date
2026-06-18


Abstract

The present application relates to a data storage method and apparatus, a computer device, a storage medium, and a program product. The method comprises: in response to a storage request for target data, a management node of a distributed database system determines a target availability zone among multiple availability zones of the distributed database system; on the basis of the remaining space rate of each data node group in the target availability zone, the management node determines candidate data node groups for the target data; and the management node determines a target data node group among the candidate data node groups, and stores multiple copies of the target data in the target data node group. Each availability zone comprises multiple data node groups.

Description

Data storage methods, devices, computer equipment, storage media and software products

[0001] Related applications

[0002] This application claims priority to the Chinese patent application filed on December 11, 2024, with application number 202411814952.7, entitled "Data storage method, apparatus, computer equipment, storage medium and program product", the entire contents of which are incorporated herein by reference.

Technical Field

[0003] This application relates to the field of data storage technology, and in particular to a data storage method, apparatus, computer equipment, storage medium, and program product.

Background Technology

[0004] A distributed database system consists of a management node and multiple data nodes. The management node is responsible for the overall management and monitoring of the system, while the data nodes are responsible for the actual data storage and retrieval. The two work together to ensure the stable operation of the system.

[0005] In related technologies, when data is stored in a distributed database system, the management node usually replicates the data into multiple copies, each called a replica, and writes them to different data nodes. This redundant backup prevents the data from becoming inaccessible if a copy is accidentally lost.

[0006] However, the related technologies take a long time to determine the data nodes, resulting in low overall storage efficiency of distributed database systems.

Summary of the Invention

[0007] According to various embodiments of this application, a data storage method, apparatus, computer device, storage medium, and program product are provided.

[0008] Firstly, this application provides a data storage method applied to the management node of a distributed database system, the method comprising:

[0009] In response to a storage request for target data, the target availability zone is determined from multiple availability zones in the distributed database system; each availability zone comprises multiple data node groups.

[0010] Based on the remaining space rate of each data node group in the target availability zone, determine the candidate data node groups for the target data;

[0011] The target data node group is determined from the candidate data node groups, and multiple copies of the target data are stored in the target data node group.

[0012] In one embodiment, candidate data node groups for the target data are determined based on the remaining space rate of each data node group in the target availability zone, including:

[0013] The data node groups are sorted by remaining space ratio in descending order.

[0014] From the sorted data node grouping sequence, multiple candidate data node groups that match the preset grouping number are determined sequentially.

[0015] In one embodiment, multiple data nodes in the target data node group are distributed in different data centers; storing multiple copies of the target data in the target data node group includes:

[0016] The number of nodes allocated to each data center is determined based on the number of replicas of the target data and the number of data centers corresponding to the target data node groups.

[0017] If the number of available data nodes in each data center is greater than the number of nodes allocated to each data center, the data nodes in each data center that match the number of nodes allocated will be identified as target data nodes.

[0018] Multiple copies of the target data are stored separately in the target data nodes of each data center.

[0019] In one embodiment, the number of nodes allocated to each data center is determined based on the number of replicas of the target data and the number of data centers corresponding to the target data node group, including:

[0020] Based on a uniformly distributed storage method, the quotient is calculated according to the number of replicas of the target data and the number of data centers corresponding to the target data node groupings;

[0021] If the result of the quotient has no remainder, then the result of the quotient is determined as the number of nodes allocated to each data center;

[0022] If the result of the quotient has a remainder, the integer value in the result is determined as the initial number of nodes allocated to each data center. The initial number of nodes allocated to the data centers that match the remainder is then adjusted to obtain the number of nodes allocated to each data center.
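As a minimal sketch of the quotient-and-remainder allocation described in this embodiment, the Python function below splits a replica count across data centers. The function name `allocate_nodes_per_data_center` is hypothetical. The choice to give one extra node to the first `remainder` data centers is also an assumption: the application only states that the allocation of the data centers matching the remainder is adjusted.

```python
def allocate_nodes_per_data_center(replica_count: int, data_center_count: int) -> list[int]:
    """Return the number of nodes (one replica per node) allocated to each data center."""
    if replica_count < 0 or data_center_count <= 0:
        raise ValueError("replica_count must be >= 0 and data_center_count must be > 0")

    # Integer part of the quotient and the remainder of the division.
    base, remainder = divmod(replica_count, data_center_count)

    # Every data center starts with the integer part of the quotient.
    allocation = [base] * data_center_count

    # Assumption: the first `remainder` data centers each receive one extra node.
    for index in range(remainder):
        allocation[index] += 1

    return allocation
```

For example, under these assumptions three replicas over three data centers give one node per data center, while five replicas over three data centers give an allocation of 2, 2, and 1.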

[0023] In one embodiment, multiple copies of the target data are stored respectively in the target data nodes of each data center, including:

[0024] For any data center, sort the data nodes according to the remaining available space of each data node in the data center to obtain the data node sequence;

[0025] From the sorted sequence of data nodes, determine the candidate data nodes that match the preset number of nodes;

[0026] Randomly select target data nodes that match the number of nodes allocated from among the candidate data nodes;

[0027] Store replicas matching the number allocated to each node in the target data node of the data center.
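The per-data-center selection in this embodiment can be sketched roughly as follows. The `DataNode` type, the function name `pick_target_nodes`, and its parameters are hypothetical. Sorting in descending order of remaining space is also an assumption, since the text above only says the nodes are sorted by remaining available space.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DataNode:
    node_id: str
    remaining_space: int  # remaining available space, e.g. in bytes


def pick_target_nodes(nodes, preset_candidate_count, allocated_count, rng=None):
    """Pick `allocated_count` target nodes at random from the top candidates."""
    if rng is None:
        rng = random.Random()

    # Assumption: nodes with more remaining space come first.
    ordered = sorted(nodes, key=lambda node: node.remaining_space, reverse=True)

    # Keep only the preset number of candidate nodes.
    candidates = ordered[:preset_candidate_count]
    if allocated_count > len(candidates):
        raise ValueError("not enough candidate nodes for the allocated count")

    # Random choice among the candidates spreads writes and avoids a single hot node.
    return rng.sample(candidates, allocated_count)
```

Choosing at random within a short candidate list, rather than always taking the single emptiest node, is what keeps repeated writes from landing on the same node.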

[0028] In one embodiment, the method further includes:

[0029] If the number of available data nodes in any data center is less than the number of nodes allocated to that data center, the target data node group is removed from the candidate data node groups, and a new target data node group is determined from the candidate data node groups that remain after the removal.

[0030] Secondly, this application also provides a data storage device, the device comprising:

[0031] The request-response module is used to respond to storage requests for target data by determining the target availability zone from multiple availability zones in the distributed database system; each availability zone includes multiple data node groups.

[0032] The grouping evaluation module is used to determine candidate data node groups for the target data based on the remaining space rate of each data node group in the target availability zone.

[0033] The data storage module is used to determine the target data node group from each candidate data node group and store multiple copies of the target data in the target data node group.

[0034] In one embodiment, the grouping evaluation module includes: a node grouping sorting unit and a node grouping determination unit, wherein:

[0035] The node grouping sorting unit is used to sort the data node groups in descending order of remaining space ratio.

[0036] The node grouping determination unit is used to sequentially determine multiple candidate data node groups that match the preset number of groups from the sorted data node grouping sequence.

[0037] In one embodiment, multiple data nodes in the target data node group are distributed in different data centers; the data storage module includes: a node quantity determination unit, a target node determination unit, and a data replica storage unit, wherein:

[0038] The node quantity determination unit is used to determine the number of nodes allocated to each data center based on the number of replicas of the target data and the number of data centers corresponding to the target data node group.

[0039] The target node determination unit is used to determine, when the number of available data nodes in each data center is greater than the number of nodes allocated to that data center, the data nodes in each data center that match the allocated number as target data nodes.

[0040] The data replica storage unit is used to store multiple copies of the target data into the target data nodes in each data center.

[0041] In one embodiment, the node number determination unit includes: a node allocation operation subunit, a first allocation subunit, and a second allocation subunit, wherein:

[0042] The node allocation operation subunit is used to calculate the quotient based on the number of replicas of the target data and the number of data centers corresponding to the target data node group, according to a uniformly distributed storage method.

[0043] The first allocation subunit is used to determine the result of the quotient as the number of nodes allocated to each data center if the result of the quotient has no remainder.

[0044] The second allocation subunit is used to determine the integer value in the result of the quotient as the initial number of nodes allocated to each data center if there is a remainder, and to adjust the initial number of nodes allocated to the data centers that match the remainder, so as to obtain the number of nodes allocated to each data center.

[0045] In one embodiment, the data replica storage unit includes: a node sorting subunit, a candidate node determination subunit, a target node determination subunit, and a target node storage subunit, wherein:

[0046] The node sorting subunit is used, for any data center, to sort the data nodes according to the remaining available space of each data node in that data center to obtain a data node sequence.

[0047] The candidate node determination sub-unit is used to determine the candidate data nodes that match the preset number of nodes from the sorted data node sequence.

[0048] The target node determination sub-unit is used to randomly determine the target data node that matches the number of nodes allocated from each candidate data node;

[0049] The target node storage subunit is used to store replicas matching the number allocated to each node in the data center's target data nodes.

[0050] In one embodiment, the target node determination unit is further configured to remove the target data node group from the candidate data node groups if the number of available data nodes in any data center is less than the number of nodes allocated to that data center, and to determine a new target data node group from the candidate data node groups that remain after the removal.

[0051] Thirdly, this application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the method in any of the embodiments of the first aspect described above.

[0052] Fourthly, this application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method in any of the embodiments of the first aspect described above.

[0053] Fifthly, this application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the method in any of the embodiments of the first aspect described above.

[0054] Details of one or more embodiments of this application are set forth in the following drawings and description, and other features, objects and advantages of this application will become apparent from the specification, drawings and claims.

Attached Figure Description

[0055] To more clearly illustrate the technical solutions in the embodiments of this application or the conventional technology, the drawings used in the description of the embodiments or the conventional technology will be briefly introduced below. Obviously, the drawings described below are only embodiments of this application. For those skilled in the art, other drawings can be obtained based on the disclosed drawings without creative effort.

[0056] Figure 1 is an application environment diagram of a data storage method in one embodiment;

[0057] Figure 2 is a schematic diagram of the first process of a data storage method in one embodiment;

[0058] Figure 3 is a schematic diagram of the grouping of data nodes in the availability zone in one embodiment;

[0059] Figure 4 is a flowchart illustrating the data node grouping and sorting steps in one embodiment;

[0060] Figure 5 is a schematic diagram of the deployment of the availability zone data center in one embodiment;

[0061] Figure 6 is a schematic diagram of the second process of a data storage method in one embodiment;

[0062] Figure 7 is a flowchart illustrating the steps for determining the number of data center nodes in one embodiment;

[0063] Figure 8 is a schematic diagram of the third process of a data storage method in one embodiment;

[0064] Figure 9 is a schematic diagram of the fourth process of a data storage method in one embodiment;

[0065] Figure 10 is a structural block diagram of a data storage device in one embodiment;

[0066] Figure 11 is an internal structure diagram of a computer device in one embodiment.

Detailed Implementation

[0067] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0068] The technical background of this application will be explained below.

[0069] Distributed database systems store data across multiple different data nodes. Unlike centralized databases, the data nodes in a distributed database are distributed across different servers, which are interconnected via a network, forming a logically unified whole. Furthermore, a distributed database system includes multiple clients, allowing users to access the global data of the distributed database system from any location.

[0070] For the cluster architecture of distributed database systems, physical resources (such as data, servers, and applications) are typically distributed across different regions, each called an availability zone. Existing technical solutions usually deploy all data nodes of the distributed database system in the same region, meaning one distributed database system corresponds to one availability zone (AZ). This deployment method has at least the following problems:

[0071] (1) High-risk concentration. All data nodes are deployed in the same Availability Zone (AZ). If the AZ fails (e.g., power outage, network failure, etc.), the entire distributed database system will be unavailable.

[0072] (2) Uneven load. All traffic is concentrated in one Availability Zone (AZ), which may overload that AZ because the load cannot be distributed to other Availability Zones.

[0073] (3) Limited scalability. As business grows and nodes need to be scaled horizontally, the limited resources of a single Availability Zone (AZ) may not be able to provide sufficient computing, storage, and network resources, thus limiting the horizontal scaling capability of the database system.

[0074] Furthermore, when the above cluster architecture is used for data storage, the data is copied into multiple copies at write time to guard against accidental loss. Each copy is called a replica and is written to a different data node, which provides data redundancy for backup and disaster recovery.

[0075] Data routing algorithms determine how data replicas are distributed across multiple data nodes. Effective data routing algorithms can improve the scalability and fault tolerance of a system. However, for distributed database systems, existing data routing algorithms are mostly applied to small-scale database systems and do not address data balance and hotspot issues. For example, using a remaining space-first strategy, unavailable data nodes are first removed when selecting data nodes. Then, all available nodes are traversed to obtain their remaining space, and the node with the largest remaining space in the cluster is selected as the target node.

[0076] In practical applications, data routing strategies in related technologies require traversing all data nodes in the cluster when selecting a target data node. In scenarios with large data volumes, the number of data nodes can reach hundreds. Traversing all nodes and obtaining their remaining space is too time-consuming and inefficient, making it unsuitable for large-scale distributed database systems. Furthermore, selecting the node with the largest remaining space may lead to hotspot issues. For example, in scenarios with frequent data writes, if the routing algorithm repeatedly selects the same node with the largest remaining space as the target data node, it will cause excessive load on that node, potentially leading to node failure.

[0077] Based on this, embodiments of this application provide a distributed database system architecture that improves the reliability, availability, and scalability of the distributed database system by deploying different data nodes across multiple Availability Zones (AZs). Furthermore, in a large-scale distributed database system deployed across AZs, a data routing strategy for data storage is proposed to improve data storage efficiency and ensure data and load balancing among data nodes.

[0078] Before describing the technical solutions of the embodiments of this application, the application scenarios of the embodiments of this application will be described first.

[0079] The data storage method provided in this application embodiment can be applied to the application environment shown in Figure 1, which is a schematic diagram of the architecture of a distributed database system. The distributed database system includes a management node and multiple data nodes. The management node communicates with each data node, and the multiple data nodes are divided into different availability zones according to their regions. Figure 1 uses a distributed database system with Q availability zones as an example; there is no data transmission between different availability zones. This communication method can effectively avoid the situation where a single availability zone fails, such as due to hardware failure or network problems, causing the entire distributed database system to become unavailable, thus improving the high availability of the distributed database system.

[0080] In practical applications, multiple data nodes are deployed in each Availability Zone (AZ), and all replicas of a data block are distributed across data nodes in the same AZ. The number of data nodes in each AZ can be planned based on the frequency of user requests in different regions. For example, the number of deployed nodes can be increased for Availability Zones with higher user request frequencies. Data nodes can be servers used to store data.

[0081] The technical solution of this application and how it solves the above-mentioned technical problems will be described in detail below with specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments. The embodiments of this application will be described below with reference to the accompanying drawings.

[0082] In an exemplary embodiment, as shown in Figure 2, a data storage method is provided, applied to the management node of a distributed database system. The method includes:

[0083] S201, in response to a storage request for target data, determine the target availability zone from multiple availability zones in the distributed database system; each availability zone includes multiple data node groups.

[0084] A distributed database system includes multiple clients, which can be located in different regions for user access. In practical applications, a distributed database system is logically a unified whole, and users can access all the data in the distributed database system from any client.

[0085] Within a distributed database system, there is a management node and multiple data nodes. The management node is used to respond to access requests triggered on any client in the distributed database system, and based on the access request, it searches for or stores the corresponding data from each data node and returns the data processing results to the client.

[0086] It should be noted that, in this embodiment, the multiple data nodes of the distributed database system are divided into different availability zones according to geographical scope, and each availability zone can be regarded as a region. Furthermore, within each availability zone, the multiple data nodes are equally divided into multiple data node groups. Taking an example where each availability zone in the distributed database system includes M×N data nodes, please refer to Figure 3. Figure 3 is an architecture diagram of the distributed database system. The data nodes are divided into M groups according to the equal division method, and each data node group includes N data nodes.

[0087] In this embodiment, the storage request is triggered when target data is uploaded by any client corresponding to the distributed database system. In response to the storage request, the management node determines the physical location of the request based on IP address, geolocation service, and other triggering information. Then, based on the distance between the physical locations of each availability zone and the requested physical location, it determines the availability zone closest to the requested physical location as the target availability zone to reduce network latency and improve access speed.
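A minimal sketch of the nearest-zone selection might look like the following. It assumes that both the request and each availability zone have already been resolved to latitude and longitude coordinates, and it uses the haversine great-circle distance as the distance measure; the application does not specify either. The names `haversine_km` and `nearest_zone` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_zone(request_location, zone_locations):
    """Return the id of the availability zone closest to the request location.

    zone_locations maps a zone id to its (lat, lon) coordinates (assumed known).
    """
    if not zone_locations:
        raise ValueError("no availability zones configured")
    return min(zone_locations,
               key=lambda zone: haversine_km(request_location, zone_locations[zone]))
```

Geographic distance is only a proxy for network latency, so a deployment could equally substitute measured round-trip times for `haversine_km` without changing the selection logic.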

[0088] S202, determine the candidate data node groups for the target data based on the remaining space rate of each data node group in the target availability zone.

[0089] For each data node group in the target availability zone, obtain the remaining available space and initial capacity of each data node in the data node group, calculate the sum of the remaining available space of each data node as the total remaining space of the data node group, calculate the sum of the initial capacity of each data node as the total capacity of the data node group, and use the ratio between the total remaining space and the total capacity as the remaining space rate of the data node group.

[0090] For each data node group in the target availability zone, the remaining space rate is expressed by the following formula:

$$\mathrm{AVG}(\mathrm{groupID}) = \frac{\mathrm{totalRemainingSpaces}}{\mathrm{totalCapacities}}$$

[0091] Where AVG(groupID) is the remaining space rate of the data node group, totalRemainingSpaces is the sum of the remaining available space of all data nodes in the data node group, and totalCapacities is the sum of the total capacity of all data nodes in the data node group.

[0092] Given the remaining space rate of each data node group in the target availability zone, multiple data node groups with remaining space rates within a preset range can be identified as candidate data node groups, or multiple data node groups with remaining space rates greater than a preset space rate threshold can be identified as candidate data node groups, so as to ensure that each candidate data node group can store multiple copies of the target data.

[0093] To avoid traversing all data nodes in all data node groups for each routing decision, this embodiment dynamically maintains the remaining space rate of each data node group. When a data node in a group writes data, deletes data, or goes down, the remaining space rate of the group to which the data node belongs is recalculated.
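One way to picture this dynamic maintenance is a small per-group tracker that keeps running totals, so the rate can be read without visiting every node. The class `GroupSpaceTracker` and its method names are hypothetical. How a node failure affects the totals is an assumption as well: this sketch simply drops the failed node's capacity and remaining space from the group totals.

```python
class GroupSpaceTracker:
    """Keeps running totals for one data node group so the rate is O(1) to read."""

    def __init__(self):
        self._nodes = {}            # node_id -> (capacity, remaining)
        self.total_capacity = 0
        self.total_remaining = 0

    def add_node(self, node_id, capacity, remaining):
        self._nodes[node_id] = (capacity, remaining)
        self.total_capacity += capacity
        self.total_remaining += remaining

    def _adjust(self, node_id, delta):
        capacity, remaining = self._nodes[node_id]
        self._nodes[node_id] = (capacity, remaining + delta)
        self.total_remaining += delta

    def on_write(self, node_id, size):
        self._adjust(node_id, -size)    # a write consumes space

    def on_delete(self, node_id, size):
        self._adjust(node_id, size)     # a delete frees space

    def on_node_down(self, node_id):
        # Assumption: a failed node no longer counts toward the group totals.
        capacity, remaining = self._nodes.pop(node_id)
        self.total_capacity -= capacity
        self.total_remaining -= remaining

    @property
    def remaining_space_rate(self):
        if self.total_capacity == 0:
            return 0.0
        return self.total_remaining / self.total_capacity
```

With this structure, each write, delete, or failure event costs a constant-time update, and the routing step only reads one number per group.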

[0094] S203, determine the target data node group from each candidate data node group, and store multiple copies of the target data into the target data node group.

[0095] It should be noted that in this embodiment, multiple copies of the target data are stored in the same data node group. Therefore, when multiple candidate data node groups are determined, one candidate data node group can be randomly selected as the target data node group to avoid repeatedly storing different target data in the same data node group, thereby avoiding the hotspot problem of data nodes within the data node group.

[0096] Furthermore, after determining the target data node group, multiple copies of the target data are randomly stored in different data nodes of the target data node group, with one copy stored on each data node.

[0097] Optionally, based on the remaining available space of each data node within the target data node group, a set of multiple candidate data nodes that support the storage of replicas is determined, and multiple target data nodes matching the number of replicas are determined from the set of candidate data nodes, and each replica is stored in each target data node.

[0098] In this embodiment, the management node of the distributed database system, in response to a storage request for target data, determines a target availability zone from multiple availability zones within the distributed database system; determines candidate data node groups for the target data based on the remaining space ratio of each data node group in the target availability zone; determines the target data node group from among the candidate data node groups; and stores multiple copies of the target data in the target data node group. In this method, in a target data storage scenario, in response to a storage request for target data, the system first determines the target availability zone from multiple availability zones, then determines candidate data node groups from the target availability zone, then determines the target data node group from among the candidate data node groups, and finally determines the data nodes for storing copies of the target data from the target data node group. This layered routing approach gradually narrows down the range of data nodes for storing copies of the target data, shortening the data node determination time, quickly storing multiple copies of the target data, and thus improving the data storage efficiency of the distributed database system. Furthermore, when determining the target availability zone, the zone closest to the region triggering the storage request can be selected based on proximity to reduce network latency. When determining the target data node grouping, candidate data node groups are determined based on the remaining space ratio of each data node group within the target availability zone, ensuring the rationality of each candidate grouping. 
Moreover, selecting the target data node group from among the candidate data node groups not only keeps the choice of target group reasonable but also adds flexibility and diversity to the selection, which helps avoid data hotspots. In addition, the distributed database system includes multiple availability zones, and data transmission is isolated between availability zones, which prevents the entire distributed database system from becoming unavailable because of a failure in a single availability zone, such as a hardware failure or a network problem, thus improving the availability of the distributed database system.

[0099] The following describes one possible implementation of candidate data node grouping through an embodiment. As shown in Figure 4, determining candidate data node groups for the target data based on the remaining space rate of each data node group in the target availability zone includes:

[0100] S401, sort the data node groups by remaining space ratio in descending order.

[0101] Sorting the data node groups by remaining space ratio in descending order yields a data node group sequence. In this sequence, a lower remaining space ratio means less remaining space in the data node group, so copies of the target data stored in that group are more likely to concentrate on one or a few data nodes, leading to unbalanced storage or even to the inability to store all copies. Conversely, a higher remaining space ratio means more remaining space in the data node group, making it easier to store the copies of the target data in a balanced way.

[0102] S402, from the sorted data node grouping sequence, determine multiple candidate data node groups that match the preset number of groups in sequence.

[0103] The preset number of groups is an integer not less than 1. The preset number of groups can be set based on experience or it can be a parameter carried in the storage request.

[0104] After obtaining the sorted data node grouping sequence, multiple candidate data node groups matching the preset grouping number are determined sequentially from high to low.

[0105] Considering that the distributed database system is a large system, in the embodiments of this application each candidate data node group includes a sufficient number of data nodes and sufficient storage space. In other words, each candidate data node group supports storing multiple copies of the target data.

[0106] In this embodiment, the data node groups are sorted in descending order of remaining space ratio. Then, multiple candidate data node groups matching the preset number of groups are determined from the sorted data node group sequence, further narrowing the range from which the target data node group is subsequently determined. This ensures the rationality of the determined target data node group while improving the flexibility of the way it is determined.
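Steps S401 and S402, followed by the random choice of a target group described for S203, can be sketched as below. The function names and the dictionary of per-group rates are hypothetical; the rates are assumed to be already maintained by the management node.

```python
import random


def select_candidate_groups(group_rates, preset_group_count):
    """Return the ids of the groups with the highest remaining space rate.

    group_rates maps a group id to its remaining space rate (assumed precomputed).
    """
    if preset_group_count < 1:
        raise ValueError("the preset number of groups must be at least 1")
    # S401: sort the groups by remaining space rate, largest first.
    ordered = sorted(group_rates, key=group_rates.get, reverse=True)
    # S402: take the leading groups up to the preset number.
    return ordered[:preset_group_count]


def choose_target_group(candidates, rng=None):
    """Pick one candidate group at random to avoid always hitting the same group."""
    if rng is None:
        rng = random.Random()
    return rng.choice(candidates)
```

A larger preset number of groups spreads writes over more groups at the cost of sometimes choosing a group with less free space; the application leaves the value to experience or to a parameter carried in the storage request.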

[0107] The target data node group includes multiple data nodes, far exceeding the number of replicas of the target data. In this case, it is necessary to further select a subset of target data nodes from the multiple data nodes in the target data node group to store multiple replicas of the target data.

[0108] In an exemplary embodiment, as shown in Figure 5, multiple data nodes in the target data node group are distributed in different data centers; Figure 5 is a schematic diagram of the data center distribution of a single availability zone.

[0109] As shown in Figure 6, when multiple data nodes in a target data node group are distributed across different data centers, storing multiple copies of the target data into the target data node group includes the following steps:

[0110] S601, determine the number of nodes allocated to each data center based on the number of replicas of the target data and the number of data centers corresponding to the target data node group.

[0111] Based on the number of data centers, the number of copies of the target data is divided into multiple parts, resulting in multiple sub-copy numbers consistent with the number of data centers. Each sub-copy number is then allocated as the number of nodes in one data center.

[0112] S602: When the number of available data nodes in each data center is greater than the number of nodes allocated to each data center, the data nodes in each data center that match the number of nodes allocated are determined as the target data nodes.

[0113] Before determining the target number of data nodes for each data center, it is necessary to first eliminate unavailable nodes in each data center and determine the number of available data nodes in each data center.

[0114] For any data center, nodes that fail, are overloaded, or have insufficient remaining available space are removed from the data center to obtain the number of available data nodes. Insufficient remaining available space means that the actual remaining available space of a node minus the planned capacity for writing data replicas to that node is less than the minimum reserved space.

[0115] Next, the number of available data nodes in each data center is compared with the number of nodes allocated to that data center. If the number of available data nodes in each data center is greater than the number of nodes allocated to it, each data center has enough data nodes to store its share of the replicas. In this case, the data nodes in each data center that match the number of nodes allocated are determined as the target data nodes.
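
The elimination rule of paragraph [0114] can be sketched as follows, assuming each node exposes a failure flag, an overload flag, its actual remaining space, and the capacity already planned for pending replica writes. The `DataNode` type, its fields, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DataNode:
    # Hypothetical per-node status record.
    name: str
    failed: bool
    overloaded: bool
    free_bytes: int     # actual remaining available space
    planned_bytes: int  # capacity already planned for replicas to be written


def is_available(node, min_reserved_bytes):
    """A node is unavailable if it failed, is overloaded, or lacks space."""
    if node.failed or node.overloaded:
        return False
    # Insufficient space: actual remaining space minus planned capacity
    # is less than the minimum reserved space.
    return node.free_bytes - node.planned_bytes >= min_reserved_bytes


def available_nodes(nodes, min_reserved_bytes):
    """Keep only the nodes of one data center that pass the availability check."""
    return [n for n in nodes if is_available(n, min_reserved_bytes)]
```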

[0116] In an exemplary embodiment, if the number of available data nodes in any data center is less than the number of nodes allocated to the data center, the target data node group is removed from the candidate data node groups, and a new target data node group is determined from the remaining candidate data node groups.

[0117] If the number of available data nodes in any data center is less than the number of nodes allocated to the data center, the currently available data nodes in that data center cannot hold the replicas allocated to it. As a result, the target data node group cannot store all the replicas of the target data. In this case, it is necessary to determine a new target data node group. To prevent the new target data node group from being the same as the initially determined target data node group, the initially determined target data node group is removed from the candidate data node groups. A new target data node group is then determined from the remaining candidate data node groups to ensure the validity of the new target data node group.

[0118] In this embodiment of the application, when the number of available data nodes in any data center is less than the number of nodes allocated to the data center, the target data node group is removed from the candidate data node groups, and a new target data node group is determined from the remaining candidate data node groups. This avoids repeatedly determining the same target data node group among the candidate data node groups, ensuring the validity of the new target data node group. Furthermore, when determining the new target data node group, the number of candidate data node groups is reduced, which can improve the efficiency of determining the new target data node group to a certain extent.
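
A minimal sketch of this remove-and-retry behavior is given here. The function name and the `is_feasible` callback (which stands for the per-data-center node count check) are hypothetical, and returning `None` once every candidate has been rejected is an assumption, since the application does not state what happens in that case.

```python
import random


def pick_group_with_retry(candidates, is_feasible, rng=None):
    """Pick a target group at random; drop it and retry if it is infeasible."""
    rng = rng or random.Random()
    remaining = list(candidates)  # work on a copy of the candidate groups
    while remaining:
        target = rng.choice(remaining)
        if is_feasible(target):
            return target
        # Remove the rejected group so it cannot be selected again.
        remaining.remove(target)
    return None  # assumption: no candidate group can hold all replicas
```

Because a rejected group is removed before the next draw, each candidate is checked at most once.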

[0119] S603, store multiple copies of the target data into the target data nodes in each data center.

[0120] Based on the storage method of one replica corresponding to one data node, multiple replicas of the target data are stored in the target data nodes of each data center.

[0121] In this embodiment, considering the poor disaster recovery capability of a single data center, if an unexpected event such as a power outage causes all servers in the entire fault domain to crash, the fault domain will be unable to read or write data, affecting service quality. Therefore, multiple data centers are configured for an availability zone, with multiple data nodes in the target data node group distributed across different data centers to improve the system's disaster recovery capability. Simultaneously, based on the number of replicas of the target data and the number of data centers corresponding to the target data node group, the number of nodes allocated to each data center is determined. If the number of available data nodes in each data center is greater than the allocated number of nodes, the data nodes in each data center matching the allocated number are designated as target data nodes. Multiple replicas of the target data are evenly stored in the target data nodes of each data center. Thus, if a data center fails, only a portion of the replicas stored in the data nodes of the failed data center become invalid, without affecting the reading of replica data from target data nodes in other non-failed data centers, thereby improving the high availability of the distributed database system.

[0122] In an exemplary embodiment, one possible implementation of the aforementioned embodiment S601, "determining the number of nodes allocated to each data center based on the number of replicas of the target data and the number of data centers corresponding to the target data node group," is described, as shown in Figure 7, including:

[0123] S701, using a uniformly distributed storage method, calculate the quotient of the number of replicas of the target data and the number of data centers corresponding to the target data node group.

[0124] To ensure that each replica is evenly distributed across multiple data centers, the ratio between the number of replicas of the target data and the number of data centers corresponding to the target data node group is calculated according to the storage method of evenly storing multiple replicas of the target data in each data center. The quotient result is obtained. In this embodiment, the quotient operation is a floor-rounding calculation.

[0125] S702, if the result of the quotient has no remainder, then the result of the quotient is determined as the number of nodes allocated to each data center.

[0126] S703, if the result of the quotient has a remainder, then the integer value in the result is determined as the initial number of nodes allocated to each data center, and the initial number of nodes allocated to as many data centers as the remainder is adjusted to obtain the number of nodes allocated to each data center.

[0127] Taking the number of replicas as r and the number of data centers as k as an example, calculate the quotient (r / k). If r is divisible by k, that is, the quotient (r / k) is a positive integer, then the number of nodes allocated to each data center is determined to be (r / k).

[0128] For scenarios where r / k is not divisible, r / k is rounded down, and the remainder is denoted as (r%k). The integer result is used to determine the initial node allocation for each data center. Then, (r%k) data centers are randomly selected from the k data centers as storage locations for the remaining replicas. The initial node allocation for the randomly selected (r%k) data centers is incremented by 1 to obtain the adjusted node allocation for the (r%k) data centers. In this embodiment, the node allocation for the i-th data center is denoted as num(i), where i = 1, 2, ..., k.

[0129] For example, multiple data nodes of the target data node group are distributed in 5 data centers. If the number of replicas is 10, the number of nodes allocated to each data center is determined to be 2 (10 / 5); if the number of replicas is 11, the number of nodes allocated to four data centers is determined to be 2 (11 / 5 rounded down), and the number of nodes allocated to one data center is determined to be 3 (11 / 5 rounded down, plus 1).
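
The allocation of steps S701 to S703 can be sketched in Python as follows. The function name `allocate_nodes` and the optional `rng` parameter are hypothetical; the list index stands for the i-th data center, so the returned list plays the role of num(i).

```python
import random


def allocate_nodes(replicas, data_centers, rng=None):
    """Split r replicas over k data centers as evenly as possible."""
    if data_centers < 1:
        raise ValueError("at least one data center is required")
    rng = rng or random.Random()
    # Floor quotient and remainder: r // k and r % k.
    base, remainder = divmod(replicas, data_centers)
    allocation = [base] * data_centers
    # Randomly choose (r % k) data centers and give each one extra node.
    for i in rng.sample(range(data_centers), remainder):
        allocation[i] += 1
    return allocation
```

For 10 replicas and 5 data centers every entry is 2. For 11 replicas, four entries are 2 and one randomly chosen entry is 3.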

[0130] In this embodiment, a uniformly distributed storage method is used. The quotient is calculated based on the number of replicas of the target data and the number of data centers corresponding to the target data node group. If there is no remainder in the quotient result, the result is directly determined as the number of nodes allocated to each data center. If there is a remainder in the quotient result, the integer value in the result is determined as the initial number of nodes allocated to each data center. The initial number of nodes allocated to the data centers that match the remainder is adjusted to distribute the data replicas as evenly as possible across different data centers, ensuring a balanced distribution of data nodes and avoiding service unavailability due to a single data center failure.

[0131] In an exemplary embodiment, one possible implementation of the aforementioned embodiment S603, "storing multiple copies of the target data into the target data nodes of each data center," is described, as shown in Figure 8, including:

[0132] S801: For any data center, sort the data nodes according to the remaining available space of each data node in the data center to obtain a data node sequence.

[0133] For any data center, sort the data nodes in the data center according to the remaining available space of each data node in descending order to obtain the data node sequence of the data center.

[0134] S802, determine the candidate data nodes that match the preset number of nodes from the sorted data node sequence.

[0135] The preset number of nodes is an integer not less than the number of nodes allocated to a single data center. After obtaining the sorted data node sequence, multiple candidate data nodes matching the preset number of nodes are determined sequentially, starting from the data node with the most remaining available space.

[0136] S803, randomly select, from the candidate data nodes, target data nodes matching the number of nodes allocated.

[0137] Furthermore, after determining the target data node group, multiple copies of the target data are randomly stored in different data nodes of the target data node group, with one copy stored on each data node.

[0138] S804, store replicas matching the number of nodes allocated in the target data nodes of the data center.

[0139] Replicas equal in number to the number of nodes allocated are stored across the target data nodes in the data center, with one replica corresponding to one data node.

[0140] In this embodiment of the application, for any data center, each data node is sorted according to the remaining available space of each data node in the data center to obtain a data node sequence, which provides a quantitative basis for determining the target data node. Then, from the sorted data node sequence, candidate data nodes that match the preset number of nodes are determined to narrow down the range of target data nodes. Then, target data nodes that match the node allocation number are randomly determined from each candidate data node to ensure load balancing among target data nodes and avoid hotspot issues.
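
Steps S801 to S803 for a single data center can be sketched as follows. The function name, the dictionary layout (node name mapped to remaining available space), and the parameter names are hypothetical and serve only to illustrate the sort, truncate, and random selection sequence.

```python
import random


def pick_target_nodes(free_space_by_node, allocated, preset_node_count, rng=None):
    """Select target nodes in one data center from the top-ranked candidates."""
    if preset_node_count < allocated:
        raise ValueError("the preset number of nodes must not be less than the allocation")
    rng = rng or random.Random()
    # S801: sort nodes in descending order of remaining available space.
    ranked = sorted(free_space_by_node, key=free_space_by_node.get, reverse=True)
    # S802: keep the preset number of candidate nodes.
    candidates = ranked[:preset_node_count]
    if len(candidates) < allocated:
        raise ValueError("not enough data nodes in this data center")
    # S803: randomly select as many distinct target nodes as were allocated.
    return rng.sample(candidates, allocated)
```

Each returned node would then receive exactly one replica, as in S804.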

[0141] In an exemplary embodiment, a data storage method is provided, as shown in Figure 9, comprising the following steps:

[0142] S901, in response to a storage request for target data, determine the target availability zone from multiple availability zones in the distributed database system based on the principle of proximity.

[0143] S902, based on the remaining space rate of each data node group in the target availability zone, determine multiple candidate data node groups that match the preset number of groups.

[0144] To prevent the need to traverse all data nodes under all data node groups for each data routing, this embodiment dynamically maintains the remaining space ratio of each data node group. When a data node group writes or deletes data, or when all data nodes crash, the remaining space ratio of the data node group is recalculated.
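
One way to maintain the ratio incrementally is sketched here. The sketch assumes that the remaining space ratio of a group is the sum of the remaining space of its nodes divided by the sum of their total capacity, and that a crashed node no longer counts toward the group; both are assumptions made for illustration. The `GroupSpaceTracker` class and its method names are hypothetical.

```python
class GroupSpaceTracker:
    """Caches the remaining space ratio of one data node group."""

    def __init__(self, capacity_by_node):
        # capacity_by_node: {node_name: (total_bytes, free_bytes)}
        self._total = {n: t for n, (t, _) in capacity_by_node.items()}
        self._free = {n: f for n, (_, f) in capacity_by_node.items()}
        self.remaining_ratio = self._recalculate()

    def _recalculate(self):
        total = sum(self._total.values())
        return sum(self._free.values()) / total if total else 0.0

    def on_write(self, node, size):
        # Writing data consumes space on the node.
        self._free[node] -= size
        self.remaining_ratio = self._recalculate()

    def on_delete(self, node, size):
        # Deleting data releases space on the node.
        self._free[node] += size
        self.remaining_ratio = self._recalculate()

    def on_node_down(self, node):
        # Assumption: a crashed node is excluded from the group totals.
        self._total.pop(node, None)
        self._free.pop(node, None)
        self.remaining_ratio = self._recalculate()
```

Routing then reads the cached `remaining_ratio` of each group instead of traversing every data node on each request.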

[0145] Based on the remaining space ratio of each data node group, the top M candidate data node groups with the highest remaining space ratios are obtained. Here, `topM` can be configured by the user. Its minimum value is 1, which means that the data node group with the highest remaining space ratio is always selected; this ensures data balance among different data node groups, but cannot prevent a particular data node group from being frequently selected as a write target, which leads to hotspot issues, especially in scenarios with large data write volumes and high concurrency. The maximum value of `topM` is the total number of data node groups within the target availability zone, which means that data balance among data node groups is not considered and the target is selected randomly among all data node groups.

[0146] S903, randomly select a target data node group from the candidate data node groups.

[0147] One candidate data node group is randomly selected from the candidate data node groups as the target data node group, which will be used as the write location for the copies of the target data.

[0148] S904, remove unavailable nodes from the target data node group and determine the available data nodes in multiple data centers corresponding to the target data node group.

[0149] S905, determine the number of nodes allocated to each data center based on the number of replicas of the target data and the number of data centers corresponding to the target data node group.

[0150] S906, determine whether the number of available data nodes in each data center is greater than or equal to the number of nodes allocated to the corresponding data center.

[0151] S907, if not, remove the target data node group from the candidate data node groups and return to step S903.

[0152] S908, if so, then based on the remaining available space of the data nodes in each data center, determine a preset number of candidate data nodes from each data center.

[0153] For each data center, the data nodes are sorted in descending order of remaining available space, and the top N data nodes are selected as candidate data nodes. The size of topN is configurable, with a minimum value being the number of nodes allocated to that data center. If topN is the number of nodes allocated to that data center, it means selecting the data nodes with the largest remaining available space from that data center, which can ensure data balance among nodes, but cannot avoid hotspot issues in high-concurrency scenarios. If topN is the maximum number of all available data nodes in the data center, it means that data balance among nodes is not considered, and nodes are randomly selected from all available data nodes.

[0154] S909, randomly select, from the candidate data nodes in each data center, target data nodes matching the number of nodes allocated to that data center.

[0155] S910, store multiple copies of the target data in the target data nodes in each data center.
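
Steps S902 to S909 can be combined into one routing sketch. It assumes that unavailable nodes have already been removed (S904) and that the input is a nested dictionary of the form group, data center, node, remaining space. The function names, the `top_m` and `top_n` parameters, and returning `None` when every candidate group is rejected are hypothetical choices made for illustration.

```python
import random


def allocate(replicas, dc_names, rng):
    """S905: spread replicas over data centers as evenly as possible."""
    base, extra = divmod(replicas, len(dc_names))
    bumped = set(rng.sample(dc_names, extra))
    return {dc: base + (1 if dc in bumped else 0) for dc in dc_names}


def route(groups, ratios, replicas, top_m, top_n, rng=None):
    """groups: {group: {data_center: {node: free_bytes}}}, available nodes only.
    ratios: {group: remaining space ratio}.
    Returns (group, {data_center: [target nodes]}) or None."""
    rng = rng or random.Random()
    # S902: keep the top_m groups with the highest remaining space ratio.
    candidates = sorted(groups, key=ratios.get, reverse=True)[:top_m]
    while candidates:
        group = rng.choice(candidates)                   # S903
        dcs = groups[group]
        plan = allocate(replicas, sorted(dcs), rng)      # S905
        if any(len(dcs[dc]) < plan[dc] for dc in dcs):   # S906
            candidates.remove(group)                     # S907
            continue
        targets = {}
        for dc, nodes in dcs.items():
            # S908: top_n is never smaller than the allocation of this data center.
            limit = max(top_n, plan[dc])
            ranked = sorted(nodes, key=nodes.get, reverse=True)[:limit]
            targets[dc] = rng.sample(ranked, plan[dc])   # S909
        return group, targets
    return None  # assumption: no candidate group can hold all replicas
```

Raising `top_m` or `top_n` widens the random draw and favors load balancing; lowering them toward their minimum favors data balance.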

[0156] In this embodiment, a cross-AZ deployment approach for a distributed database system is adopted, deploying data nodes across multiple AZs. Fault isolation between AZs prevents the entire database system from becoming unavailable due to a single AZ failure. Simultaneously, data nodes are deployed in various locations to allow users to access data services from the nearest AZ, improving access speed. Furthermore, data nodes within a single Availability Zone are further subdivided into multiple data node groups, avoiding the need for the data routing algorithm to traverse all data nodes across the entire AZ, thus improving the efficiency of data node determination. In addition, each AZ is equipped with multiple data centers, with multiple copies of the data distributed across different data centers, preventing service unavailability due to a single data center failure. In summary, the data storage scheme implemented using the above deployment method can, on the one hand, select the fault domain closest to the user based on the user's location to reduce network latency and improve access speed. On the other hand, when selecting target data node groups within an Availability Zone (AZ) and target data nodes within those groups, compared to existing technologies that select the group with the most remaining space or randomly select from the entire cluster, the data storage scheme in this application uses an adjustable parameter (preset number of groups / preset number of nodes) to first determine candidate data node groups / candidate data nodes, and then randomly select target data node groups / target data nodes from among the candidate data node groups / candidate data nodes. This achieves a compromise between data balancing and load balancing to a certain extent.

[0157] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.

[0158] Based on the same inventive concept, this application also provides a data storage device for implementing the data storage method described above. The solution provided by this device is similar to the solution described in the above method; therefore, the specific limitations in one or more data storage device embodiments provided below can be found in the limitations of the data storage method described above, and will not be repeated here.

[0159] In an exemplary embodiment, as shown in Figure 10, a data storage device is provided, including: a request-response module 1001, a group evaluation module 1002, and a data storage module 1003, wherein:

[0160] The request-response module 1001 is used to determine the target availability zone from multiple availability zones in the distributed database system in response to a storage request for target data; each availability zone includes multiple data node groups.

[0161] The group evaluation module 1002 is used to determine the candidate data node groups of the target data based on the remaining space rate of each data node group in the target availability zone.

[0162] The data storage module 1003 is used to determine the target data node group from each candidate data node group and store multiple copies of the target data into the target data node group.

[0163] In an exemplary embodiment, the group evaluation module 1002 includes: a node grouping sorting unit and a node grouping determination unit, wherein:

[0164] The node grouping sorting unit is used to sort the data node groups in descending order of remaining space ratio.

[0165] The node grouping determination unit is used to sequentially determine multiple candidate data node groups that match the preset number of groups from the sorted data node grouping sequence.

[0166] In an exemplary embodiment, multiple data nodes in the target data node group are distributed in different data centers; the data storage module 1003 includes: a node quantity determination unit, a target node determination unit, and a data replica storage unit, wherein:

[0167] The node quantity determination unit is used to determine the number of nodes allocated to each data center based on the number of replicas of the target data and the number of data centers corresponding to the target data node group.

[0168] The target node determination unit is used to determine the data nodes in each data center that match the number of nodes allocated to that data center as target data nodes when the number of available data nodes in each data center is greater than the number of nodes allocated to that data center.

[0169] The data replica storage unit is used to store multiple copies of the target data into the target data nodes in each data center.

[0170] In an exemplary embodiment, the node quantity determination unit includes: a node allocation operation subunit, a first allocation subunit, and a second allocation subunit, wherein:

[0171] The node allocation operation subunit is used to calculate the quotient of the number of replicas of the target data and the number of data centers corresponding to the target data node group, according to a uniformly distributed storage method.

[0172] The first allocation subunit is used to determine the result of the quotient as the number of nodes allocated to each data center if the result of the quotient has no remainder.

[0173] The second allocation subunit is used to determine the integer value in the result of the quotient as the initial number of nodes allocated to each data center if there is a remainder, and to adjust the initial number of nodes allocated to as many data centers as the remainder, so as to obtain the number of nodes allocated to each data center.

[0174] In an exemplary embodiment, the data replica storage unit includes: a data node sorting subunit, a candidate node determination subunit, a target node determination subunit, and a target node storage subunit, wherein:

[0175] The data node sorting subunit is used, for any data center, to sort the data nodes according to the remaining available space of each data node in the data center to obtain a data node sequence.

[0176] The candidate node determination subunit is used to determine the candidate data nodes that match the preset number of nodes from the sorted data node sequence.

[0177] The target node determination subunit is used to randomly determine, from the candidate data nodes, the target data nodes that match the number of nodes allocated.

[0178] The target node storage subunit is used to store replicas matching the number of nodes allocated in the target data nodes of the data center.

[0179] In one embodiment, the target node determination unit is further configured to remove the target data node group from the candidate data node groups if the number of available data nodes in any data center is less than the number of nodes allocated to the data center, and to determine a new target data node group from the remaining candidate data node groups.

[0180] Each module in the aforementioned data storage device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the operations corresponding to each module.

[0181] In an exemplary embodiment, a computer device is provided, which may be a server, and its internal structure diagram is shown in Figure 11. The computer device includes a processor, memory, input/output (I/O) interfaces, and a communication interface. The processor, memory, and I/O interfaces are connected via a system bus, and the communication interface is connected to the system bus via the I/O interfaces. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes non-volatile storage media and internal memory. The non-volatile storage media stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device stores data used for data routing and data storage. The I/O interfaces of the computer device are used for exchanging information between the processor and external devices. The communication interface of the computer device is used for communicating with external terminals via a network connection. When the computer program is executed by the processor, it implements a data storage method.

[0182] Those skilled in the art will understand that the structure shown in Figure 11 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0183] In one exemplary embodiment, a computer device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the above-described method embodiments.

[0184] In one exemplary embodiment, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the steps in the above-described method embodiments.

[0185] In one exemplary embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps in the above-described method embodiments.

[0186] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with relevant regulations.

[0187] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile memory and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, artificial intelligence (AI) processors, etc., and are not limited to these.

[0188] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this application.

[0189] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. A data storage method, characterized in that, The method, applied to the management node of a distributed database system, includes: In response to a storage request for target data, a target availability zone is determined from multiple availability zones in the distributed database system; each availability zone includes multiple data node groups. Based on the remaining space rate of each data node group in the target availability zone, candidate data node groups for the target data are determined. A target data node group is determined from each of the candidate data node groups, and multiple copies of the target data are stored in the target data node group.

2. The method according to claim 1, characterized in that, The step of determining candidate data node groups for the target data based on the remaining space rate of each data node group in the target availability zone includes: The data node groups are sorted according to their remaining space ratio from largest to smallest. From the sorted data node group sequence, multiple candidate data node groups that match the preset number of groups are determined sequentially.

3. The method according to claim 1 or 2, characterized in that, The target data node group consists of multiple data nodes distributed in different data centers; The step of storing multiple copies of the target data to the target data node group includes: The number of nodes allocated to each data center is determined based on the number of replicas of the target data and the number of data centers corresponding to the target data node group. If the number of available data nodes in each of the data centers is greater than the number of nodes allocated to each of the data centers, the data nodes in each of the data centers that match the number of nodes allocated to each of the data centers shall be determined as target data nodes; Multiple copies of the target data are stored in the target data nodes of each of the data centers.

4. The method according to claim 3, characterized in that, The step of determining the number of nodes allocated to each data center based on the number of replicas of the target data and the number of data centers corresponding to the target data node group includes: Based on a uniformly distributed storage method, the quotient is calculated according to the number of replicas of the target data and the number of data centers corresponding to the target data node group; If the result of the quotient has no remainder, then the result of the quotient is determined as the number of nodes allocated to each of the data centers; If the result of the quotient has a remainder, the integer value in the result is determined as the initial number of nodes allocated to each of the data centers, and the initial number of nodes allocated to the data centers that match the remainder is adjusted to obtain the number of nodes allocated to each of the data centers.

5. The method according to claim 3, characterized in that, The step of storing multiple copies of the target data into target data nodes in each of the data centers includes: For any data center, the data nodes are sorted according to the remaining available space of each data node in the data center to obtain a data node sequence; From the sorted sequence of data nodes, determine the candidate data nodes that match the preset number of nodes; Randomly select target data nodes from the candidate data nodes that match the number of nodes allocated to them; The number of replicas matching the number allocated to the nodes is stored in each target data node of the data center.

6. The method according to claim 3, characterized in that the method further includes: if the number of available data nodes in any data center is less than the number of nodes allocated to that data center, removing the target data node group from the candidate data node groups, and determining a new target data node group from the candidate data node groups remaining after the removal.
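
The fallback in claim 6 can be sketched as a loop over the candidate data node groups. Everything here is illustrative: `select_feasible_group` is a hypothetical name, each group is modeled as a mapping from data center name to its number of available data nodes, the first remaining candidate is taken as the next target group, and a data center with exactly as many available nodes as allocated is treated as sufficient (claims 3 and 6 do not address equality).

```python
from typing import Optional


def select_feasible_group(
    candidate_groups: list[dict[str, int]],
    replica_count: int,
) -> Optional[dict[str, int]]:
    """Return the first candidate group able to host all replicas, else None."""
    remaining = list(candidate_groups)  # work on a copy; do not mutate the input

    while remaining:
        group = remaining[0]  # assumption: try candidates in their given order
        if group:
            # Nodes needed per data center: integer quotient, with the
            # remainder spread over the first data centers (assumption).
            base, extra = divmod(replica_count, len(group))
            needed = [base + 1 if i < extra else base for i in range(len(group))]
            available = list(group.values())
            if all(a >= n for a, n in zip(available, needed)):
                return group
        # Some data center lacks nodes: remove this group and re-determine.
        remaining.pop(0)

    return None
```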

7. A data storage apparatus, characterized in that the apparatus includes: a request-response module, used to determine a target availability zone from multiple availability zones of a distributed database system in response to a storage request for target data, each availability zone including multiple data node groups; a grouping evaluation module, used to determine candidate data node groups for the target data based on the remaining space ratio of each data node group in the target availability zone; and a data storage module, used to determine a target data node group from the candidate data node groups and to store multiple copies of the target data in the target data node group.

8. The apparatus according to claim 7, characterized in that the grouping evaluation module includes a node grouping sorting unit and a node grouping determination unit, wherein: the node grouping sorting unit is used to sort the data node groups in descending order of remaining space ratio; and the node grouping determination unit is used to sequentially determine, from the sorted data node group sequence, multiple candidate data node groups matching a preset number of groups.
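
The behavior of the two units in claim 8 can be sketched as a sort followed by a slice. The sketch assumes that the sorting applies to data node groups (as the reference to a sorted grouping sequence suggests) and that each group's remaining space ratio is already known. The names `select_candidate_groups`, `ratio_by_group`, and `preset_group_count` are hypothetical.

```python
def select_candidate_groups(
    ratio_by_group: dict[str, float],
    preset_group_count: int,
) -> list[str]:
    """Return the groups with the highest remaining space ratio (illustrative)."""
    # Node grouping sorting unit: descending order of remaining space ratio.
    ordered = sorted(
        ratio_by_group,
        key=lambda group: ratio_by_group[group],
        reverse=True,
    )
    # Node grouping determination unit: take groups in sequence up to the preset count.
    return ordered[:preset_group_count]
```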

9. The apparatus according to claim 7 or 8, characterized in that the target data node group consists of multiple data nodes distributed across different data centers; the data storage module includes a node quantity determination unit, a target node determination unit, and a data replica storage unit, wherein: the node quantity determination unit is used to determine the number of nodes allocated to each data center based on the number of replicas of the target data and the number of data centers corresponding to the target data node group; the target node determination unit is used to determine, when the number of available data nodes in each data center is greater than the number of nodes allocated to that data center, the data nodes in each data center matching the number of nodes allocated to that data center as target data nodes; and the data replica storage unit is used to store the multiple copies of the target data in the target data nodes of each data center.

10. The apparatus according to claim 9, characterized in that the node quantity determination unit includes a node allocation operation subunit, a first allocation subunit, and a second allocation subunit, wherein: the node allocation operation subunit is used to calculate, based on a uniformly distributed storage method, a quotient from the number of replicas of the target data and the number of data centers corresponding to the target data node group; the first allocation subunit is used to determine the quotient as the number of nodes allocated to each data center if the quotient has no remainder; and the second allocation subunit is used to determine, if the quotient has a remainder, the integer part of the quotient as the initial number of nodes allocated to each data center, and to adjust the initial number of nodes allocated to the data centers that match the remainder, so as to obtain the number of nodes allocated to each data center.

11. The apparatus according to claim 9, characterized in that the data replica storage unit includes a node sorting subunit, a candidate node determination subunit, a target node determination subunit, and a target node storage subunit, wherein: the node sorting subunit is used to sort, for any data center, the data nodes according to the remaining available space of each data node in the data center to obtain a data node sequence; the candidate node determination subunit is used to determine, from the sorted data node sequence, candidate data nodes matching a preset number of nodes; the target node determination subunit is used to randomly select, from the candidate data nodes, target data nodes matching the number of nodes allocated to the data center; and the target node storage subunit is used to store replicas matching the number of nodes allocated to the data center in the target data nodes of the data center.

12. The apparatus according to claim 9, characterized in that the target node determination unit is further configured to, if the number of available data nodes in any data center is less than the number of nodes allocated to that data center, remove the target data node group from the candidate data node groups and determine a new target data node group from the candidate data node groups remaining after the removal.

13. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 6.

14. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 6.

15. A computer program product comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 6.