Data writing method, device and equipment after failure of distributed storage system, and medium

The invention dynamically redirects write requests to healthy placement groups in a distributed storage system and combines this with placement group obstacle avoidance selection and a background load balancing mechanism. This addresses the data recovery storm caused by failures, improving system performance and resource utilization and reducing the impact of failures on business.

CN122308740A · Pending · Publication Date: 2026-06-30 · JINAN INSPUR DATA TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
JINAN INSPUR DATA TECH CO LTD
Filing Date
2026-04-02
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

In distributed storage systems, data recovery storms caused by failures can lead to a sharp decline in business I/O performance, impacting the high-performance service of online businesses.

Method used

By dynamically redirecting write requests to healthy placement groups on the client side, and combining a preset placement group obstacle avoidance selection algorithm and a background balancing mechanism, parallelization and redundancy of data writing are achieved.

Benefits of technology

While ensuring no data redundancy is lost, the impact of failures on business is significantly reduced, system performance and customer experience are improved, resource contention during recovery is reduced, and system stability and resource utilization are enhanced.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a method, apparatus, device, and medium for writing data after a distributed storage system failure, relating to the field of computer technology and applied to a client. The method includes: after initiating a write request for the current data write service, performing a mapping query on the monitor component in the distributed storage system to obtain the target placement group of the object to be written; if the system detects a failure, determining the current cluster mapping information based on the incremental placement group mapping information subscribed locally by the client, and performing write redirection in conjunction with a preset placement group obstacle avoidance selection algorithm and a preset background load balancing mechanism to obtain an updated placement group; writing the object to be written in parallel to multiple storage service nodes based on a preset network protocol, the current cluster mapping information, and the object metadata fed back by the updated placement group; and receiving the data write verification result sent by the updated placement group after the data write is completed. This invention improves the performance of normal business I/O after a distributed storage system failure.

Description

Technical Field

[0001] This invention relates to the field of computer technology, and in particular to a method, apparatus, device, and medium for writing data after a failure in a distributed storage system.

Background Technology

[0002] In traditional distributed storage architectures (such as the classic Ceph), data management and transmission are tightly coupled. When a write request arrives at an OSD (Object Storage Device, a storage service node responsible for storing data objects and handling data read and write requests), it is processed by a specific PG (Placement Group) unit.

[0003] However, when an OSD or disk in the system fails, all PGs (Placement Groups) running on it enter a "degraded" state. To ensure data redundancy, the system must immediately initiate a data recovery process. The node containing the primary replica of these PGs reads data from the surviving replicas and rewrites it to the newly replaced OSD to complete data reconstruction. This process competes fiercely with normal business I/O (Input/Output) for the same physical resources (CPU, memory, network, and disk bandwidth), causing a sharp drop in business I/O performance, known as a "recovery storm." Business performance typically does not return to normal until data recovery is complete, which is unacceptable for online businesses that require continuous high-performance service.

[0004] It is evident that improving the performance of normal business I/O after a distributed storage system failure is a problem that needs to be solved by those skilled in the art.

Summary of the Invention

[0005] The purpose of this invention is to provide a data writing method, apparatus, device, and medium after a distributed storage system failure that improve the performance of normal business I/O after such a failure. The specific solution is as follows:

In a first aspect, the present invention provides a data writing method after a failure in a distributed storage system, applied to a client, comprising:

After initiating the write request corresponding to the current data write business, performing a mapping query on the monitor component in the distributed storage system to determine the target placement group corresponding to the object to be written in the write request;

If the distributed storage system detects a current failure, determining the current cluster mapping information based on the incremental placement group mapping information subscribed to locally by the client;

Based on the current cluster mapping information, the preset placement group obstacle avoidance selection algorithm, and the preset background load balancing mechanism, performing write redirection to determine the updated placement group corresponding to the object to be written;

Based on the preset network protocol, the current cluster mapping information, and the object metadata fed back by the updated placement group, writing the object to be written to multiple storage service nodes in parallel, wherein the object metadata includes location and permissions;

After completing the data writing, receiving the data write verification result sent by the updated placement group.

[0006] Optionally, determining the current cluster mapping information based on the incremental placement group mapping information subscribed to locally by the client, if the distributed storage system detects a current failure, includes: if the distributed storage system detects a failure in a current storage service node or disk, obtaining the cluster mapping update information published by the monitor component based on the incremental placement group mapping information subscribed to locally by the client, wherein the cluster mapping update information includes the blocked placement group table, the failed storage service node table, and the placement group weight table; and updating the locally cached cluster mapping information using the cluster mapping update information to determine the current cluster mapping information.

[0007] Optionally, the method further includes: if the distributed storage system detects a failure in any storage service node or disk, recording the failed storage service node through the monitor component to determine the failed storage service node table; analyzing and marking, through the monitor component, the placement groups affected by the failed storage service node to determine the blocked placement group table; updating the placement group weights using the monitor component and the used-capacity information reported by the storage service nodes to determine the placement group weight table; determining the cluster mapping update information based on the blocked placement group table, the failed storage service node table, and the placement group weight table; and pushing the cluster mapping update information to each client using the monitor component and the incremental placement group mapping information.

[0008] Optionally, performing write redirection based on the current cluster mapping information, the preset placement group obstacle avoidance selection algorithm, and the preset background load balancing mechanism to determine the updated placement group corresponding to the object to be written includes: determining, based on the current cluster mapping information, whether the target placement group is available, to obtain a placement group judgment result; if the placement group judgment result is unavailable, and the monitor component detects that the difference in usage among the disks in the cluster corresponding to the distributed storage system is greater than a preset threshold, performing write redirection based on the current cluster mapping information, the placement group weight table, and the preset placement group obstacle avoidance selection algorithm to determine the first updated placement group corresponding to the object to be written; or, if the placement group judgment result is unavailable, and the monitor component detects that the usage difference is greater than the preset threshold, starting a low-priority background task and performing write redirection based on the background task, the current cluster mapping information, the placement group weight table, and the preset placement group obstacle avoidance selection algorithm to determine the second updated placement group corresponding to the object to be written.

[0009] Optionally, writing the object to be written in parallel to multiple storage service nodes based on the preset network protocol, the current cluster mapping information, and the object metadata fed back by the updated placement group includes: sending a first metadata request for the object to be written to the updated placement group; receiving the first object metadata returned by the updated placement group; and, based on the preset network protocol, the current cluster mapping information, and the first object metadata, writing the object to be written in parallel to multiple storage service nodes managed by the updated placement group.

[0010] Optionally, the method further includes: when the distributed storage system does not detect a fault, sending a second metadata request for the object to be written to the target placement group; receiving the second object metadata returned by the target placement group; based on the preset network protocol, the second object metadata, and the cluster mapping information cached locally on the client, writing the object to be written in parallel to multiple storage service nodes managed by the target placement group; checking, through the target placement group, whether all objects to be written have been written successfully, to determine the write check result; and, when the write check result is yes, sending a data write success notification to the client through the target placement group.

[0011] Optionally, after receiving the data write verification result sent by the updated placement group, the method further includes: if the data write verification result indicates that writing the object to be written has failed, then upon receiving new incremental placement group mapping information, triggering the write redirection operation for the object to be written in combination with the preset placement group obstacle avoidance selection algorithm and the preset background load balancing mechanism.

[0012] In a second aspect, the present invention provides a data writing device after a failure of a distributed storage system, applied to a client, comprising: a placement group determination module, used to perform a mapping query on the monitor component in the distributed storage system after initiating a write request corresponding to the current data write business, so as to determine the target placement group corresponding to the object to be written in the write request; a mapping update module, used to determine the current cluster mapping information based on the incremental placement group mapping information subscribed locally by the client if the distributed storage system detects a current failure; a redirection module, used to perform write redirection based on the current cluster mapping information, the preset placement group obstacle avoidance selection algorithm, and the preset background load balancing mechanism, so as to determine the updated placement group corresponding to the object to be written; a data writing module, used to write the object to be written to multiple storage service nodes in parallel based on the preset network protocol, the current cluster mapping information, and the object metadata fed back by the updated placement group, wherein the object metadata includes location and permissions; and a verification result receiving module, used to receive the data write verification result sent by the updated placement group after the data writing is completed.

[0013] In a third aspect, the present invention provides an electronic device, comprising: a memory, used to store a computer program; and a processor, used to execute the computer program to implement the steps of the aforementioned data writing method after a distributed storage system failure.

[0014] In a fourth aspect, the present invention provides a computer-readable storage medium for storing a computer program which, when executed by a processor, implements the steps of the aforementioned data writing method after a failure of a distributed storage system.

[0015] As can be seen, in this invention, applied to the client, the process includes: after initiating a write request corresponding to the current data write service, performing a mapping query on the monitor component in the distributed storage system to determine the target placement group corresponding to the object to be written in the write request; if the distributed storage system detects a current fault, determining the current cluster mapping information based on the incremental placement group mapping information subscribed locally by the client; performing write redirection based on the current cluster mapping information, a preset placement group obstacle avoidance selection algorithm, and a preset background load balancing mechanism to determine the updated placement group corresponding to the object to be written; writing the object to be written in parallel to multiple storage service nodes based on a preset network protocol, the current cluster mapping information, and the object metadata fed back by the updated placement group, wherein the object metadata includes location and permissions; and receiving the data write verification result sent by the updated placement group after the data write is completed.

[0016] As can be seen from the above technical solution, after the client initiates the write request corresponding to the current data write business, it queries the monitor component in the distributed storage system to determine the target placement group corresponding to the object to be written in the write request. Then, if the distributed storage system detects a current failure, the client determines the current cluster mapping information based on the subscribed incremental placement group mapping information. Next, based on the current cluster mapping information, the preset placement group obstacle avoidance selection algorithm, and the preset background load balancing mechanism, write redirection is performed to obtain the updated placement group for the object to be written. Finally, based on the preset network protocol, the current cluster mapping information, and the object metadata fed back by the updated placement group, the object to be written is written in parallel to the storage service nodes, and the data write verification result sent by the updated placement group is received. This improves the performance of normal business I/O after a distributed storage system failure: the redundancy of written data is not lost, the impact of data recovery on client business during failures is greatly reduced, the customer experience is improved, and conditions are created for failure recovery.

Description of the Drawings

[0017] To more clearly illustrate the embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] Figure 1 is a flowchart of a data writing method after a failure in a distributed storage system provided by the present invention;

Figure 2 is a flowchart of a specific data writing method after a failure in a distributed storage system provided by the present invention;

Figure 3 is a timing diagram of data writing under fault-free conditions provided by the present invention;

Figure 4 is a schematic structural diagram of a data writing device after a failure in a distributed storage system provided by the present invention;

Figure 5 is a structural diagram of an electronic device provided by the present invention.

Detailed Implementation

[0019] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the protection scope of the present invention.

[0020] The terms "comprising" and "having," and any variations thereof, in the specification and accompanying drawings of this invention are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or apparatus that includes a series of steps or units is not limited to the steps or units listed, but may include steps or units not listed.

[0021] To enable those skilled in the art to better understand the present invention, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0022] In traditional distributed storage architectures (such as the classic Ceph), data management and transmission are tightly coupled. When a write request arrives at an OSD, it is handled by a specific PG (Placement Group) unit. However, when an OSD or disk in the system fails, all PGs running on it enter a "degraded" state. To ensure data redundancy, the system must immediately initiate a data recovery process. The nodes containing the primary replicas of these PGs read data from the surviving replicas and rewrite it to the newly replaced OSD to complete data reconstruction. This process competes intensely with normal business I/O for the same physical resources (CPU, memory, network, and disk bandwidth), causing a sharp decline in business I/O performance, a phenomenon known as a "recovery storm." Business performance typically does not return to normal until data recovery is complete, which is unacceptable for online services that require continuous high-performance delivery.

[0023] To address this, the present invention provides a data writing scheme after a distributed storage system failure, which can effectively improve the performance of normal business I/O after a distributed storage system failure. This ensures that the redundancy of written data is not lost, while greatly reducing the impact of data recovery on client business during failure, improving the customer experience, and thus creating conditions for failure recovery.

[0024] As shown in Figure 1, this embodiment of the invention discloses a data writing method after a distributed storage system failure, applied to a client, including:

Step S11: After initiating the write request corresponding to the current data write service, perform a mapping query on the monitor component in the distributed storage system to determine the target placement group corresponding to the object to be written in the write request.

[0025] Specifically, in this embodiment, as shown in Figure 2, the client performs normal data I/O operations and initiates a write request corresponding to the current data write operation. Then, the client obtains the cluster mapping from the Monitor (a core management node in a distributed storage system, responsible for maintaining the global state of the entire cluster, including cluster mapping, OSD status, PG distribution, and other metadata). Based on the object ID (Identifier) of the object to be written in the write request, the client obtains the placement group corresponding to the object to be written, i.e., the target placement group.
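The mapping from object ID to target placement group described above can be illustrated with a minimal sketch. The function name `object_to_pg`, the use of SHA-256, and the `pg_count` parameter are assumptions made for illustration only; the source does not specify which hash function or mapping rule is used.

```python
import hashlib


def object_to_pg(object_id: str, pg_count: int) -> int:
    """Map an object ID to a placement group index with a stable hash.

    Hypothetical helper: SHA-256 is a stand-in for whatever hash the
    real system uses; only the "hash, then reduce to a PG" idea is shown.
    """
    if pg_count <= 0:
        raise ValueError("pg_count must be positive")
    digest = hashlib.sha256(object_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer and fold into the PG range.
    return int.from_bytes(digest[:8], "big") % pg_count
```

Because the hash is deterministic, every client that holds the same cluster mapping resolves the same object ID to the same target placement group.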

[0026] Step S12: If the distributed storage system detects a current failure, it determines the current cluster mapping information based on the incremental placement group mapping information subscribed locally by the client.

[0027] In this embodiment, the distributed storage system performs fault detection on itself. When the distributed storage system detects a failure in a storage service node or disk using the corresponding fault detection mechanism, the monitor component updates the data in real time and reports the updated data through the incremental placement group mapping information subscribed to locally by the client. That is, if the distributed storage system detects a failure in the current storage service node or disk, it obtains the cluster mapping update information published by the monitor component based on the incremental placement group mapping information subscribed to locally by the client. The cluster mapping update information includes a blocked placement group table, a failed storage service node table, and a placement group weight table. The cluster mapping update information is used to update the locally cached cluster mapping information to determine the current cluster mapping information. It can be understood that the fault detection mechanism can be a mechanism that determines whether a failure has occurred by monitoring whether the heartbeat times out, or other mechanisms can be selected or customized according to actual needs to implement the logic.

[0028] It should be understood that a new subscription manager is added on the client at startup. It continuously subscribes to incremental pgmaps, i.e., the incremental placement group mapping information. This allows the client to use these incremental updates to promptly obtain the latest cluster mapping and the list of unavailable PGs from the Monitor, facilitating I/O redirection.

[0029] Furthermore, regarding the data update of the monitor component when a fault is detected, in this embodiment, the monitor component rapidly updates the cluster status, including recording the faulty unit and marking the affected placement groups. That is: if the distributed storage system detects a fault in any storage service node or disk, the monitor component records the faulty storage service node to determine the faulty storage service node table; the monitor component analyzes and marks the placement groups affected by the faulty storage service node to determine the blocked placement group table; the monitor component, using the usage capacity information reported by the storage service node, updates the placement group weights to determine the placement group weight table; based on the blocked placement group table, the faulty storage service node table, and the placement group weight table, cluster mapping update information is determined; and the monitor component, using the incremental placement group mapping information, pushes the cluster mapping update information to each client.

[0030] Understandably, in this embodiment, when a fault is detected, the monitor component quickly updates the cluster status: 1) Recording the faulty unit: The monitor component internally maintains a list of faulty OSDs, i.e., a faulty storage service node table. 2) Marking affected placement groups: The monitor component analyzes all PGs containing faulty OSDs, marks these PGs as unavailable, and compiles a list of unavailable PGs, i.e., a blocked placement group table. This list of unavailable PGs serves as new metadata for the cluster (new pgmap incremental update entries) and is quickly distributed to all clients.
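As a rough sketch of the marking step described above, the monitor can derive the blocked placement group table from the failed storage service node table and the PG-to-OSD layout. The function name `compute_blocked_pgs` and the dictionary layout `pg_to_osds` are hypothetical and are not taken from the source.

```python
def compute_blocked_pgs(
    pg_to_osds: dict[int, list[int]],
    failed_osds: set[int],
) -> list[int]:
    """Return the sorted IDs of all PGs that contain at least one failed OSD.

    pg_to_osds: hypothetical layout mapping each PG ID to its replica OSD IDs.
    failed_osds: the failed storage service node table.
    """
    return sorted(
        pg for pg, osds in pg_to_osds.items() if failed_osds.intersection(osds)
    )


layout = {0: [1, 2, 3], 1: [4, 5, 6], 2: [3, 4, 7]}
blocked = compute_blocked_pgs(layout, failed_osds={3})  # PGs 0 and 2 contain OSD 3
```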

[0031] Furthermore, subsequent redirections can cause uneven data distribution: data writes to faulty PGs are redirected, so the capacity growth of those PGs stagnates, while other healthy PGs, especially those frequently selected as redirection targets, grow faster. Over time, this can lead to severe cluster capacity skew. To avoid this, a PG weight calculation is added to the monitor component. For a PG, its weight value W can be designed to be positively correlated with its remaining capacity or idle rate, i.e., negatively correlated with utilization. For example, W = 1 - Utilization, or a more complex function. PGs with smaller data volumes (larger free space) have higher weight values. This PG weight table is also carried in the incremental pgmap and is actively pushed to each client after updates.
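The example weight W = 1 - Utilization can be written out as a short sketch. The function name `pg_weight` and the `used_bytes` / `capacity_bytes` parameters are assumptions for illustration; the source only states that the weight should be negatively correlated with utilization.

```python
def pg_weight(used_bytes: float, capacity_bytes: float) -> float:
    """Weight W = 1 - utilization, so emptier PGs receive higher weights.

    capacity_bytes is a hypothetical per-PG capacity budget; utilization
    is clamped to [0, 1] so the weight always stays within [0, 1].
    """
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    utilization = min(max(used_bytes / capacity_bytes, 0.0), 1.0)
    return 1.0 - utilization


# Build a weight table from (used, capacity) pairs reported per PG.
usage_report = {0: (25, 100), 1: (50, 100), 2: (100, 100)}
weight_table = {pg: pg_weight(u, c) for pg, (u, c) in usage_report.items()}
```

With this linear form, a PG that is 25% full gets weight 0.75 and a full PG gets weight 0, so a full PG would never be preferred as a redirection target.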

[0032] In other words, the Monitor component adds an incremental pgmap that maintains fields such as failed_osd_list (the failed storage service node table), blocked_pg_list (the blocked placement group table), and the PG weight table. When an OSD fails, the Monitor calculates the value of each field, updates the incremental map, and pushes it to each client. Because the incremental map is small, pushing it is efficient. Furthermore, clients subscribe to the incremental pgmap upon startup, receive the incremental pgmap pushed by the Monitor when the system detects a failure, and update their local cache.
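A minimal sketch of the incremental pgmap and of the client-side cache update might look as follows. The field names `failed_osd_list` and `blocked_pg_list` come from the text; the `epoch` version counter, the class names, and the rule that each increment carries the full current contents of the three tables are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class IncrementalPgMap:
    epoch: int  # hypothetical version counter, not named in the source
    failed_osd_list: set[int] = field(default_factory=set)
    blocked_pg_list: set[int] = field(default_factory=set)
    pg_weights: dict[int, float] = field(default_factory=dict)


@dataclass
class ClientClusterMap:
    """Locally cached view that the client consults before each write."""
    epoch: int = 0
    failed_osds: set[int] = field(default_factory=set)
    blocked_pgs: set[int] = field(default_factory=set)
    pg_weights: dict[int, float] = field(default_factory=dict)

    def apply(self, inc: IncrementalPgMap) -> bool:
        """Apply a pushed increment; stale or duplicate epochs are ignored."""
        if inc.epoch <= self.epoch:
            return False
        self.epoch = inc.epoch
        self.failed_osds = set(inc.failed_osd_list)
        self.blocked_pgs = set(inc.blocked_pg_list)
        self.pg_weights = dict(inc.pg_weights)
        return True


cache = ClientClusterMap()
cache.apply(IncrementalPgMap(epoch=1, failed_osd_list={3}, blocked_pg_list={0, 2}))
```

Ignoring increments with an older epoch is one simple way to keep the cache consistent if pushes arrive out of order; the source does not say how ordering is handled.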

[0033] Step S13: Based on the current cluster mapping information, the preset placement group obstacle avoidance selection algorithm, and the preset background balancing mechanism, perform write redirection to determine the updated placement group corresponding to the object to be written.

[0034] In this embodiment, data recovery and business writing are decoupled using a data control separation architecture. When a fault occurs, the system does not immediately initiate traditional data reconstruction. Instead, it dynamically redirects business writes to healthy PGs, thereby isolating the impact of the fault and keeping business I/O smooth, and it gradually resolves the capacity imbalance caused by redirection through a background intelligent data balancing mechanism. Specifically: based on the current cluster mapping information, the system determines whether the target placement group is available, to obtain a placement group judgment result; if the placement group judgment result is unavailable, and the monitor component detects that the difference in usage among the disks in the cluster corresponding to the distributed storage system is greater than a preset threshold, write redirection is performed based on the current cluster mapping information, the placement group weight table, and the preset placement group obstacle avoidance selection algorithm to determine the first updated placement group corresponding to the object to be written; or, if the placement group judgment result is unavailable, and the monitor component detects that the usage difference is greater than the preset threshold, a low-priority background task is started, and write redirection is performed based on the background task, the current cluster mapping information, the placement group weight table, and the preset placement group obstacle avoidance selection algorithm to determine the second updated placement group corresponding to the object to be written.

[0035] It should be understood that, as shown in Figure 2, to dynamically redirect to a healthy placement group (i.e., a placement group not listed in the blocked placement group table), the client performs the following based on the updated current cluster mapping information: 1) PG selection and obstacle avoidance: When hashing to the target PG based on the object ID, the client first checks whether the PG is in the unavailable PG list. If so, it reselects a healthy PG as the target according to the new PG selection algorithm. 2) Requesting metadata and writing: The client requests new object write space from the PG and then writes the data to the set of healthy OSD replicas managed by the new PG via NoF (NVMe over Fabrics, a network protocol). Regarding the new PG selection algorithm, the client adds intelligent routing logic, overlaying redirection logic on top of the original PG mapping algorithm. A `get_target_pg(object_id)` function is implemented, which internally includes obstacle avoidance and weight selection algorithms.
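One possible shape of `get_target_pg` is sketched below. Only the function name and the two ingredients (obstacle avoidance and weight-based selection) come from the text; the extra parameters, the SHA-256 hash, and the "highest weight, lowest ID on ties" rule are assumptions made for illustration.

```python
import hashlib


def get_target_pg(
    object_id: str,
    pg_count: int,
    blocked_pgs: set[int],
    pg_weights: dict[int, float],
) -> int:
    """Return the PG to write to, redirecting away from blocked PGs."""
    digest = hashlib.sha256(object_id.encode("utf-8")).digest()
    original = int.from_bytes(digest[:8], "big") % pg_count

    # Obstacle avoidance: keep the original mapping if the PG is healthy.
    if original not in blocked_pgs:
        return original

    healthy = [pg for pg in range(pg_count) if pg not in blocked_pgs]
    if not healthy:
        raise RuntimeError("no healthy placement group available")

    # Weight selection: prefer the emptiest healthy PG; break ties by lowest ID.
    return max(healthy, key=lambda pg: (pg_weights.get(pg, 0.0), -pg))
```

Returning the original PG whenever it is healthy keeps the normal mapping untouched, so redirection only affects objects whose target PG appears in the blocked placement group table.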

[0036] Furthermore, to address the capacity imbalance caused by redirection, this embodiment proposes a weight-based background balancing mechanism, namely the preset background balancing mechanism. This mechanism includes not only the placement group weight calculation performed by the monitor component, but also intelligent PG selection and filling on the client and the initiation of background data migration.

[0037] Regarding client-side PG intelligent selection and filling: In two scenarios, the client will prioritize PGs with higher weights: 1) Writing garbage collection data: In append-only mode (most distributed storage systems currently operate in append-only mode), after a large number of business writes, the same business object may be written to multiple internal objects. Only the latest data is valid, and the previous data is garbage. Garbage collection means that the client starts a thread to scan the valid data fragments of each internal object, extracts the valid data fragments from multiple internal objects, writes them to a new internal object, and releases the original internal object. During this process, when selecting a PG to write to, the client prioritizes writing to PGs with higher weights to fill capacity. 2) Load balancing during normal writes: Even in a fault-free state, the client can probabilistically select PGs (these PGs are all fault-free and normal PGs) based on their weights to fill capacity and achieve load balancing. This can be understood as the `get_target_pg(object_id)` function being reused during garbage collection to select the PG with the highest weight for writing.

[0038] Regarding the background data migration, when the usage of disks within the cluster differs significantly (e.g., the difference reaches 50%, as calculated and published by the monitor component), a low-priority background task is initiated to migrate data from high-load PGs to low-load PGs to further optimize data distribution.
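The trigger condition for background migration can be sketched as a simple spread check. The function names, the `disk_usage` dictionary of usage fractions, and the strict "greater than" comparison (following the wording of the summary) are assumptions; the 0.5 default mirrors the 50% example in the text.

```python
def needs_background_migration(
    disk_usage: dict[int, float], threshold: float = 0.5
) -> bool:
    """True when the gap between the fullest and emptiest disk exceeds threshold.

    disk_usage maps a disk ID to its usage fraction in [0, 1] (hypothetical layout).
    """
    if not disk_usage:
        return False
    return max(disk_usage.values()) - min(disk_usage.values()) > threshold


def pick_migration_pair(disk_usage: dict[int, float]) -> tuple[int, int]:
    """Return (source, target): the fullest disk and the emptiest disk."""
    source = max(disk_usage, key=lambda d: disk_usage[d])
    target = min(disk_usage, key=lambda d: disk_usage[d])
    return source, target


usage = {1: 0.9, 2: 0.3, 3: 0.6}
if needs_background_migration(usage):
    src, dst = pick_migration_pair(usage)  # would be handed to a low-priority task
```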

[0039] This ensures that the cluster can automatically and smoothly return to a capacity-balanced state during and after a failure, without the need for urgent and performance-impacting data migration when a failure occurs.

[0040] Step S14: Based on the preset network protocol, the current cluster mapping information, and the object metadata fed back by the updated placement group, write the object to be written to multiple storage service nodes in parallel; wherein, the object metadata includes location and permissions.

[0041] In this embodiment, as shown in Figure 2, after determining a new healthy PG (i.e., the updated placement group), the client requests object metadata from this placement group. Then, using a preset network protocol and the object metadata, it writes data directly to the new OSD group. Specifically: it sends a first metadata request for the object to be written to the updated placement group; receives the first object metadata returned by the updated placement group; and, based on the preset network protocol, the current cluster mapping information, and the first object metadata, writes the object to be written in parallel to multiple storage service nodes managed by the updated placement group. It can be understood that the preset network protocol can be NoF (NVMe over Fabrics) or another protocol selected based on actual needs.
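The parallel write to the replica storage service nodes can be sketched with a thread pool. The `send` callable is a hypothetical stand-in for the real transport (for example NoF); no actual network protocol is implemented here, and the function name `write_object_parallel` is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def write_object_parallel(
    data: bytes,
    osd_ids: list[int],
    send: Callable[[int, bytes], bool],
) -> dict[int, bool]:
    """Write the same object to every replica OSD concurrently.

    send(osd_id, data) is a hypothetical transport call that returns True
    on success; an exception raised by it is recorded as a failed replica.
    """
    if not osd_ids:
        raise ValueError("at least one OSD is required")
    results: dict[int, bool] = {}
    with ThreadPoolExecutor(max_workers=len(osd_ids)) as pool:
        futures = {osd: pool.submit(send, osd, data) for osd in osd_ids}
        for osd, future in futures.items():
            try:
                results[osd] = bool(future.result())
            except Exception:
                results[osd] = False
    return results
```

The per-replica result map is what a coordinating placement group could inspect when deciding whether the write as a whole succeeded, for example with `all(results.values())`.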

[0042] Step S15: After completing the data writing, receive the data writing verification result sent by the updated placement group.

[0043] In this embodiment, as shown in Figure 3, after the data is written, the placement group verifies whether the write succeeded and returns the result to the client. If the result shows that the write to the PG failed, the client does not reselect a PG immediately; the write is redirected only once a new incremental pgmap is received. That is, if the data write verification result indicates that writing the object failed, the write redirection operation for that object is triggered after new incremental placement group mapping information is received, in conjunction with the preset placement group obstacle avoidance selection algorithm and the preset background balancing mechanism.
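
The deferred-redirect behavior described above can be sketched in Python as follows. The class `PendingWriteQueue`, its epoch counter, and the `select_pg` callback are hypothetical names introduced for illustration; the specification does not define how the client tracks failed writes or versions its incremental pgmap.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class PendingWriteQueue:
    """Parks failed writes until a newer incremental pgmap arrives."""

    select_pg: Callable[[str, Set[int]], int]  # obstacle-avoidance selector
    epoch: int = 0
    blocked: Set[int] = field(default_factory=set)
    pending: List[str] = field(default_factory=list)

    def on_write_result(self, object_id: str, ok: bool) -> None:
        # A failed write is only recorded here; no PG is reselected yet.
        if not ok:
            self.pending.append(object_id)

    def on_incremental_pgmap(self, epoch: int, newly_blocked: Set[int]) -> Dict[str, int]:
        """Apply a newer map and return the redirect target for each parked write."""
        if epoch <= self.epoch:
            return {}  # stale or duplicate update: keep waiting
        self.epoch = epoch
        self.blocked |= newly_blocked
        redirected = {obj: self.select_pg(obj, self.blocked) for obj in self.pending}
        self.pending.clear()
        return redirected
```

Waiting for the map update, rather than retrying at once, means the redirect decision is always made against the monitor's latest view of which PGs are blocked.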

[0044] Furthermore, in this embodiment, in addition to the data writing process after a failure, the data writing process of the distributed storage system based on data-control separation when no failure occurs is shown in Figure 3. In this process, after the client obtains the cluster mapping from the Monitor and the object metadata from the primary PG, it can write the data in parallel and directly to multiple replica OSDs via the NoF protocol. In the architecture of Figure 3, the primary placement group (PG) is only responsible for coordination and final confirmation, while the data is written to the replica OSDs (backup OSD1 and backup OSD2). This architecture eliminates the need for data to flow through the OSD where the primary PG resides, greatly improving write performance and reducing bottlenecks. Specifically: when the distributed storage system does not detect a fault, the client sends a second metadata request for the object to be written to the target placement group; receives the second object metadata returned by the target placement group; and, based on the preset network protocol, the second object metadata, and the cluster mapping information cached locally on the client, writes the object to be written directly and in parallel to the multiple storage service nodes managed by the target placement group. The target placement group then checks whether the object has been successfully written to all of these nodes to determine the write check result; when the write check result is positive, the target placement group sends a data write success notification to the client.
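
A minimal Python sketch of this write path is given below, assuming a caller-supplied `send` function that stands in for the actual transport (the specification uses NoF; no real network call is made here). The names `parallel_write` and `confirm_write` are illustrative and do not appear in the specification.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def parallel_write(
    data: bytes,
    osds: List[str],
    send: Callable[[str, bytes], bool],
) -> Dict[str, bool]:
    """Client side: send the payload to every replica OSD concurrently."""
    with ThreadPoolExecutor(max_workers=max(1, len(osds))) as pool:
        futures = {osd: pool.submit(send, osd, data) for osd in osds}
        # result() blocks until each send finishes (and re-raises its errors).
        return {osd: fut.result() for osd, fut in futures.items()}


def confirm_write(results: Dict[str, bool]) -> bool:
    """PG side: the write counts as successful only if every replica acknowledged."""
    return bool(results) and all(results.values())
```

Because the payload goes straight from the client to each replica, the PG only has to aggregate acknowledgements, which matches its coordination-and-confirmation role in the text.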

[0045] In summary, this embodiment proposes a rapid fault recovery scheme for a distributed storage system based on data-control separation. Through fault reporting and client-side write redirection logic, it ensures that data written by clients during a fault is not degraded, largely eliminating the impact of data recovery on client services in fault scenarios, improving the stability of system performance, and enhancing the competitiveness of the distributed storage system. (1) Client dynamic write redirection strategy: the client subscribes to and caches the list of unavailable PGs published by the Monitor. When writing business data, it actively performs an obstacle avoidance check using a built-in redirection algorithm; when a PG is unavailable, it immediately selects a new healthy PG so that business data is not written to a degraded PG. This preserves the redundancy of written data (in traditional schemes, writes during a failure are degraded writes with reduced redundancy) while greatly reducing the impact of data recovery on client business during a failure and improving customer experience.

[0046] (2) Gradual data balancing mechanism: capacity balancing of the storage system is folded into the business write path (garbage collection and weight-based PG selection), which avoids the capacity imbalance that write redirection would otherwise cause. The work that is traditionally done by brute-force background data migration to level the data distribution immediately is instead carried out gradually in the write path, which greatly reduces disk pressure and system resource occupation and improves resource utilization.

[0047] Compared to existing technologies, this solution has the following advantages: (1) Performance Enhancement: Ultimate Stability of Business Performance. In traditional solutions, data recovery traffic overlaps with business traffic, competing for resources and causing a sharp drop in business performance. This solution, through write redirection, completely avoids large-scale data reconstruction during fault recovery, ensuring that the performance curve of business I / O remains stable during fault occurrence and recovery, fundamentally guaranteeing service quality, especially suitable for latency-sensitive core production businesses. At the same time, it ensures that the redundancy of customer write data is not reduced, improving data reliability during continuous failures.

[0048] (2) Improved resource utilization efficiency: traditional recovery storms can exhaust the cluster's network bandwidth and disk IOPS (Input / Output Operations Per Second), affecting the business of the entire cluster. This solution confines the resource impact of recovery to the background and to low-load periods, so that valuable hardware resources remain available to serve business requests. This improves the effective utilization of resources, which in turn lowers the overall hardware configuration the storage system requires and saves costs.

[0049] (3) Simplified operation and maintenance: Operation and maintenance personnel no longer need to make a difficult trade-off between 'rapid recovery' and 'business impact', but can handle hardware failures with ease, and even manually trigger more thorough data balancing operations during off-peak business periods, thereby reducing the complexity of operation and maintenance and improving the overall availability of the system.

[0050] Therefore, in this embodiment of the invention, after initiating a write request corresponding to the current data write service through the client, the target placement group corresponding to the object to be written in the write request is determined using the monitor component in the distributed storage system. Then, if the distributed storage system detects a current fault, the current cluster mapping information is determined based on the incremental placement group mapping information subscribed to locally by the client. Next, based on the current cluster mapping information, a preset placement group obstacle avoidance selection algorithm, and a preset background load balancing mechanism, write redirection is performed to obtain the updated placement group for the object to be written. Finally, based on a preset network protocol, the current cluster mapping information, and the object metadata fed back by the updated placement group, the object to be written is written in parallel to the storage service node, and the data write verification result sent by the updated placement group is received. This improves the performance of normal business I / O after a distributed storage system fault, thereby ensuring that write data redundancy is not lost while significantly reducing the impact of data recovery on client services during a fault, improving the customer experience, and creating conditions for fault recovery.

[0051] As a preferred embodiment, to prevent clients from writing to incorrect nodes based on expired mappings when the propagation of mapping information is delayed during a failure, the client can maintain two types of mappings locally: a stable view (based on normal CRUSH or rule-based mappings) and an incremental view (obtained from the incremental placement group mapping information pushed by the monitor component, with a version number and effective range). During a write, the client sends the version numbers of both views to the primary OSD of the target PG. Before accepting the write, the OSD verifies whether its local view versions are consistent with the client's; if they are inconsistent, it rejects the write and returns the latest view. The client then updates its views based on the returned view and retries. This not only prevents clients from writing to incorrect nodes based on expired mappings but also ensures that all writes are consistent from the cluster's perspective, providing a reliable underlying view synchronization mechanism for redirection.
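
The version handshake described above can be sketched in Python as follows. The types `ViewVersions` and `WriteReply`, the integer version numbers, and the single-retry default are assumptions for illustration; the specification does not define the message format or the retry policy.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ViewVersions:
    stable: int       # version of the stable (CRUSH or rule-based) view
    incremental: int  # version of the incremental view pushed by the monitor


@dataclass
class WriteReply:
    accepted: bool
    latest: Optional[ViewVersions] = None  # set only when the write is rejected


def osd_check_write(osd_view: ViewVersions, client_view: ViewVersions) -> WriteReply:
    """OSD side: accept only if both view versions match the local ones."""
    if client_view == osd_view:
        return WriteReply(accepted=True)
    return WriteReply(accepted=False, latest=osd_view)


def client_write_with_retry(
    client_view: ViewVersions, osd_view: ViewVersions, max_retries: int = 1
) -> Tuple[bool, ViewVersions]:
    """Client side: on rejection, adopt the returned view and retry."""
    for _ in range(max_retries + 1):
        reply = osd_check_write(osd_view, client_view)
        if reply.accepted:
            return True, client_view
        client_view = reply.latest  # update local views before retrying
    return False, client_view
```

Rejecting on any mismatch, rather than only when the client is older, keeps the check simple and leaves the OSD's view as the single reference point for a given write.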

[0052] As shown in Figure 4, this embodiment of the invention also discloses a data writing device after a distributed storage system failure, applied to a client, comprising: a placement group determination module, used to perform a mapping query on the monitor component in the distributed storage system after initiating a write request corresponding to the current data write service, so as to determine the target placement group corresponding to the object to be written in the write request; a mapping update module, used to determine the current cluster mapping information based on the incremental placement group mapping information subscribed locally by the client if the distributed storage system detects a current failure; a redirection module, used to perform write redirection based on the current cluster mapping information, the preset placement group obstacle avoidance selection algorithm, and the preset background load balancing mechanism, so as to determine the updated placement group corresponding to the object to be written; a data writing module, used to write the object to be written to multiple storage service nodes in parallel based on the preset network protocol, the current cluster mapping information, and the object metadata fed back by the updated placement group, wherein the object metadata includes location and permissions; and a verification result receiving module, used to receive the data write verification result sent by the updated placement group after the data writing is completed.

[0053] For more detailed information on the working process of each of the above modules, please refer to the relevant content disclosed in the foregoing embodiments, which will not be repeated here.

[0054] Therefore, this embodiment of the invention, through the client, firstly, after initiating a write request corresponding to the current data write service, uses the monitor component in the distributed storage system to determine the target placement group corresponding to the object to be written in the write request. Then, if the distributed storage system detects a current fault, it determines the current cluster mapping information based on the incremental placement group mapping information subscribed to locally by the client. Next, based on the current cluster mapping information, a preset placement group obstacle avoidance selection algorithm, and a preset background load balancing mechanism, write redirection is performed to obtain the updated placement group for the object to be written. Finally, based on a preset network protocol, the current cluster mapping information, and the object metadata fed back by the updated placement group, the object to be written is written in parallel to the storage service node, and the data write verification result sent by the updated placement group is received. This improves the performance of normal business I / O after a distributed storage system fault, thereby ensuring that write data redundancy is not lost while significantly reducing the impact of data recovery on client services during a fault, improving the customer experience, and creating conditions for fault recovery.

[0055] Furthermore, embodiments of the present invention also disclose an electronic device. Figure 5 is a structural diagram of an electronic device according to an exemplary embodiment; the content of the diagram should not be construed as limiting the scope of the invention. Specifically, the electronic device may include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input / output interface 25, and a communication bus 26. The memory 22 stores a computer program, which is loaded and executed by the processor 21 to implement the relevant steps in the data writing method after a failure of the distributed storage system disclosed in any of the foregoing embodiments. Furthermore, the electronic device in this embodiment may specifically be an electronic computer.

[0056] In this embodiment, the power supply 23 is used to provide operating voltage for various hardware devices on the electronic device; the communication interface 24 can create a data transmission channel between the electronic device and external devices, and the communication protocol it follows can be any communication protocol applicable to the technical solution of this invention, and is not specifically limited here; the input / output interface 25 is used to acquire external input data or output data to the outside world, and its specific interface type can be selected according to specific application needs, and is not specifically limited here.

[0057] In addition, the memory 22, as a carrier for resource storage, can be a read-only memory, random access memory, disk or optical disk, etc. The resources stored thereon can include operating system 221, computer program 222, etc., and the storage method can be temporary storage or permanent storage.

[0058] The operating system 221, which may be Windows Server, Netware, Unix, Linux, etc., is used to manage and control the various hardware devices on the electronic device as well as the computer program 222. In addition to including a computer program capable of performing the data writing method after a distributed storage system failure disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs capable of performing other specific tasks.

[0059] Furthermore, the present invention also discloses a computer-readable storage medium for storing a computer program; wherein, when the computer program is executed by a processor, it implements the aforementioned data writing method after a distributed storage system failure. The specific steps of this method can be found in the corresponding content disclosed in the foregoing embodiments, and will not be repeated here.

[0060] Furthermore, the present invention also discloses a computer program product, including a computer program / instructions; wherein, when the computer program / instructions are executed by a processor, they implement the aforementioned data writing method after a distributed storage system failure. Specific steps of this method can be found in the corresponding content disclosed in the foregoing embodiments, and will not be repeated here.

[0061] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For the apparatus disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, the description is relatively simple; relevant parts can be referred to in the method section.

[0062] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.

[0063] The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein can be implemented directly by hardware, a software module executed by a processor, or a combination of both. The software module can be located in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art.

[0064] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element.

[0065] The technical solution provided by the present invention has been described in detail above. Specific examples have been used to illustrate the principle and implementation of the present invention. The description of the above embodiments is only intended to help in understanding the method and core idea of the present invention. At the same time, those skilled in the art may, based on the idea of the present invention, make changes to the specific implementation and application scope. Therefore, the content of this specification should not be construed as a limitation of the present invention.

Claims

1. A method for writing data after a failure in a distributed storage system, characterized in that it is applied to a client and includes: after initiating a write request corresponding to the current data write service, performing a mapping query on the monitor component in the distributed storage system to determine the target placement group corresponding to the object to be written in the write request; if the distributed storage system detects a current failure, determining the current cluster mapping information based on the incremental placement group mapping information subscribed locally by the client; performing write redirection based on the current cluster mapping information, a preset placement group obstacle avoidance selection algorithm, and a preset background balancing mechanism, to determine the updated placement group corresponding to the object to be written; writing the object to be written in parallel to multiple storage service nodes based on a preset network protocol, the current cluster mapping information, and the object metadata fed back by the updated placement group, wherein the object metadata includes location and permissions; and, after the data writing is completed, receiving the data write verification result sent by the updated placement group.

2. The data writing method after a failure in a distributed storage system according to claim 1, characterized in that, If the distributed storage system detects a current failure, it determines the current cluster mapping information based on the incremental placement group mapping information subscribed locally by the client, including: If the distributed storage system detects a failure in the current storage service node or disk, it obtains the cluster mapping update information published by the monitor component based on the incremental placement group mapping information subscribed locally by the client; wherein, the cluster mapping update information includes a blocked placement group table, a failed storage service node table, and a placement group weight table. The cluster mapping update information is used to update the locally cached cluster mapping information to determine the current cluster mapping information.

3. The data writing method after a failure in a distributed storage system according to claim 2, characterized in that, Also includes: If the distributed storage system detects a failure in any of the storage service nodes or the disk, it records the failed storage service node through the monitor component to determine the failed storage service node table. The monitoring component is used to analyze and mark the placement groups affected by the faulty storage service node in order to determine the blocked placement group table; The placement group weight is updated using the monitor component and the usage capacity information reported by the storage service node to determine the placement group weight table. Based on the blocked placement group table, the fault storage service node table, and the placement group weight table, the cluster mapping update information is determined; The cluster mapping update information is pushed to each client via the monitor component and using the incremental placement group mapping information.

4. The data writing method after a failure in a distributed storage system according to claim 2, characterized in that, The write redirection based on the current cluster mapping information, a preset placement group obstacle avoidance selection algorithm, and a preset background load balancing mechanism, to determine the updated placement group corresponding to the object to be written, includes: Based on the current cluster mapping information, determine whether the target placement group is available, and thus determine the placement group determination result; If the placement group determination result is unavailable, and when the monitor component detects that the difference in usage of each disk in the cluster corresponding to the distributed storage system is greater than a preset threshold, write redirection is performed based on the current cluster mapping information, the placement group weight table and the preset placement group obstacle avoidance selection algorithm to determine the first updated placement group corresponding to the object to be written. Alternatively, if the placement group's judgment result indicates that it is unavailable, and when the monitor component detects that the usage difference is greater than the preset threshold, a low-priority background task is initiated. Based on the background task, the current cluster mapping information, the placement group weight table, and the preset placement group obstacle avoidance selection algorithm, write redirection is performed to determine the second updated placement group corresponding to the object to be written.

5. The data writing method after a failure in a distributed storage system according to claim 1, characterized in that, The step of writing the object to be written to multiple storage service nodes in parallel, based on a preset network protocol, current cluster mapping information, and object metadata fed back by the updated placement group, includes: Send the first metadata request corresponding to the object to be written to the updated placement group; Receive the first object metadata returned by the updated placement group; Based on the preset network protocol, the current cluster mapping information, and the metadata of the first object, the object to be written is written directly to the multiple storage service nodes managed after the update in parallel.

6. The data writing method after a failure in a distributed storage system according to claim 1, characterized in that, Also includes: When the distributed storage system does not detect a fault, it sends a second metadata request corresponding to the object to be written to the target placement group. Receive the second object metadata returned by the target placement group; Based on the preset network protocol, the second object metadata, and the cluster mapping information cached locally on the client, the object to be written is written directly to multiple storage service nodes managed by the target placement group in parallel. By checking whether all the objects to be written have been successfully written through the target placement group, the write check result is determined. When the write check result is yes, a data write success notification is sent to the client through the target placement group.

7. The data writing method after a failure in a distributed storage system according to any one of claims 1 to 6, characterized in that, After receiving the data write verification result sent by the updated placement group, the process further includes: If the data write verification result indicates that the write of the object to be written has failed, then upon receiving new incremental placement group mapping information, the write redirection operation of the object to be written is triggered in conjunction with the preset placement group obstacle avoidance selection algorithm and the preset background balancing mechanism.

8. A data writing device after a failure in a distributed storage system, characterized in that, Applied to the client side, including: The placement group determination module is used to perform a mapping query on the monitor component in the distributed storage system after initiating a write request corresponding to the current data write service, so as to determine the target placement group corresponding to the object to be written in the write request. The mapping update module is used to determine the current cluster mapping information based on the incremental placement group mapping information subscribed locally by the client if the distributed storage system detects a current failure. The redirection module is used to perform write redirection based on the current cluster mapping information, the preset placement group obstacle avoidance selection algorithm, and the preset background balancing mechanism, so as to determine the updated placement group corresponding to the object to be written. The data writing module is used to write the object to be written to multiple storage service nodes in parallel based on a preset network protocol, current cluster mapping information, and object metadata fed back by the updated placement group; wherein, the object metadata includes location and permissions; The verification result receiving module is used to receive the data write verification result sent by the updated placement group after the data writing is completed.

9. An electronic device, characterized in that it includes: a memory, used to store a computer program; and a processor, used to execute the computer program to implement the steps of the data writing method after a failure of the distributed storage system as described in any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps of the data writing method after a failure of the distributed storage system as described in any one of claims 1 to 7.