Fault management method, apparatus, device, and medium for a storage system

By using multidimensional performance data and an anomaly detection model to predict the failure probability of each node in a distributed storage system, nodes are marked in advance with a predictive failure state and their weights are adjusted. This addresses the latency and resource contention problems of passive, reactive failure handling and enables proactive failure management and stability assurance for storage clusters.

CN122308741A · Pending · Publication Date: 2026-06-30 · JINAN INSPUR DATA TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: JINAN INSPUR DATA TECH CO LTD
Filing Date: 2026-04-02
Publication Date: 2026-06-30

Abstract

This application discloses a fault management method, apparatus, device, and medium for a storage system, relating to the field of computer technology. The method includes: acquiring multidimensional performance data of each storage node in the storage system; inputting the multidimensional performance data into a pre-trained anomaly detection model to predict the probability that each storage node fails within a target time window; when the probability that a storage node fails within the target time window exceeds a preset threshold, marking the node state of that storage node as a predictive failure state before it is determined to be a failed node, and adjusting its node weight to a preset value so that new input/output (I/O) requests are not routed to it; and determining a target migration rate for that storage node and migrating the data stored on it to other storage nodes at the target migration rate. This application thereby achieves proactive management before a fault occurs.

Description

Technical Field

[0001] This invention relates to the field of computer technology, and in particular to a method, apparatus, device, and medium for fault management of a storage system.

Background Technology

[0002] Currently, distributed storage systems generally adopt a passive, reactive fault handling mechanism. Its core process is as follows: 1) the Monitor (MON) or an adjacent Object Storage Daemon (OSD) checks the liveness of an OSD node by periodically sending heartbeat packets; 2) if no response is received from the OSD node within a preset time window, the OSD node is determined to have failed; 3) the MON marks the failed OSD node as unavailable in the cluster map and excludes it from the system's valid service components; 4) after cluster peering is completed, the system starts data backfilling and data recovery operations.

[0003] This fault handling mechanism has several shortcomings. First, there is a time gap between the failure of an OSD node and its being marked as unavailable by the system. During this period, client input/output (I/O) requests sent to the failed node remain blocked or time out, leading to severe fluctuations in front-end service performance or even service interruption. Second, the data recovery process intensively consumes system resources such as network bandwidth, disk I/O, and CPU, competing fiercely with normal service I/O requests. This can easily lead to a "recovery storm" and may even overload healthy nodes, triggering a chain of failures and seriously threatening the overall stability of the distributed storage cluster. Third, this mechanism can only remediate after a failure occurs; it cannot provide effective early warning or proactive intervention before a failure actually occurs, so fault handling lacks foresight.

Summary of the Invention

[0004] In view of this, the purpose of this invention is to provide a method, apparatus, device, and medium for fault management of a storage system, which realizes proactive management before a fault occurs. The specific solution is as follows:

[0005] In a first aspect, this application discloses a fault management method for a storage system, comprising:

[0006] Obtain multidimensional performance data of each storage node in the storage system;

[0007] Input the multidimensional performance data into a pre-trained anomaly detection model to predict the probability that each storage node fails within a target time window;

[0008] If the probability that a storage node fails within the target time window exceeds a preset threshold, mark the node state of that storage node as a predictive failure state before it is determined to be a failed node, and adjust its node weight to a preset value so that new input/output (I/O) requests are not routed to it;

[0009] Determine a target migration rate for that storage node, and migrate the data stored on it to other storage nodes at the target migration rate.

[0010] Optionally, the multidimensional performance data of each storage node includes at least three of the following: disk I/O wait time, CPU I/O wait percentage, network packet loss rate, number of memory allocation failures, internal queue depth of storage processes, and heartbeat response latency fluctuation.

[0011] Optionally, before inputting the multidimensional performance data into the pre-trained anomaly detection model, the following steps may also be taken:

[0012] Collect historical multidimensional performance data of each storage node in the storage system;

[0013] Each set of historical multidimensional performance data is labeled with a corresponding running status label to obtain a training dataset containing multiple sets of data samples and corresponding status labels; wherein, the running status label represents the running status of the corresponding storage node within the target time window;

[0014] Train an anomaly detection model based on the training dataset;

[0015] Deploy the model parameters of the trained anomaly detection model to the management node of the storage system.

[0016] Optionally, the process of migrating the data stored on the storage node to other storage nodes based on the target migration rate further includes:

[0017] Detect whether the storage node has actually failed;

[0018] If the storage node has actually failed, recover the incremental data generated during the migration and migrate it to other storage nodes;

[0019] If the storage node has not actually failed, clear its predictive failure state and restore its node weight.

[0020] Optionally, determining the target migration rate for the storage node and migrating the data stored on it to other storage nodes based on the target migration rate includes:

[0021] Obtain the pre-configured resource usage limit parameters and migration priority level parameter;

[0022] Determine the target migration rate based on the resource usage limit parameters, wherein the resource usage limit parameters include one or a combination of a disk I/O bandwidth limit, a network transmission bandwidth limit, and a CPU utilization limit;

[0023] Set the data migration priority according to the migration priority level parameter;

[0024] Based on the target migration rate and the data migration priority, migrate the data stored on the storage node to other storage nodes;

[0025] Correspondingly, the fault management method for the storage system further includes:

[0026] Monitor in real time the resource utilization of each resource dimension, wherein the resource dimensions include one or a combination of the disk I/O dimension, the network transmission dimension, and the CPU dimension;

[0027] When the resource utilization of any resource dimension exceeds the corresponding resource usage limit parameter, dynamically reduce the target migration rate according to preset control rules.

[0028] Optionally, marking the node state of the storage node as a predictive failure state and adjusting its node weight to a preset value includes:

[0029] Invoke a target management command through the management node of the storage system;

[0030] Use the target management command to mark the node state of the storage node as a predictive failure state and adjust its node weight to the preset value.

[0031] Optionally, after marking the node state of the storage node as a predictive failure state and adjusting its node weight to the preset value, the method further includes:

[0032] Write the updated node state and node weight to the cluster map of the storage system in real time;

[0033] Broadcast the updated cluster map to the storage system so that each storage node in the storage system can update its local routing information according to the updated cluster map.

[0034] In a second aspect, this application discloses a fault management apparatus for a storage system, comprising:

[0035] The data acquisition module is used to acquire multi-dimensional performance data of each storage node in the storage system;

[0036] The probability determination module is used to input multidimensional performance data into a pre-trained anomaly detection model, so as to predict the probability of each storage node failing within the target time window through the anomaly detection model.

[0037] The status marking module is used, when the probability that a storage node fails within the target time window exceeds a preset threshold, to mark the node state of that storage node as a predictive failure state before it is determined to be a failed node, and to adjust its node weight to a preset value so that new I/O requests are not routed to it.

[0038] The data migration module is used to determine the target migration rate for that storage node and to migrate the data stored on it to other storage nodes based on the target migration rate.

[0039] In a third aspect, this application discloses an electronic device, including:

[0040] A memory, used to store a computer program;

[0041] A processor, used to execute the computer program to implement the aforementioned fault management method for the storage system.

[0042] In a fourth aspect, this application discloses a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the aforementioned fault management method for the storage system.

[0043] Thus, this application proposes a fault management method for a storage system, comprising: acquiring multidimensional performance data of each storage node in the storage system; inputting the multidimensional performance data into a pre-trained anomaly detection model to predict the probability that each storage node fails within a target time window; when the probability that a storage node fails within the target time window exceeds a preset threshold, marking the node state of that storage node as a predictive failure state before it is determined to be a failed node, and adjusting its node weight to a preset value so that new I/O requests are not routed to it; and determining a target migration rate for that storage node and migrating the data stored on it to other storage nodes based on the target migration rate. In traditional mechanisms there is a time gap between a node failing and its being marked as unavailable, during which client I/O requests destined for that node are blocked or time out, impacting front-end services. This application addresses this by using fault probability prediction to mark a storage node as being in a predictive failure state before it is determined to have failed, and by adjusting its node weight to a preset value. This prevents new I/O requests from being routed to that node and resolves the abnormal service performance during that gap. Furthermore, this application determines a target migration rate and starts data migration before a node actually fails, replacing centralized data recovery after a failure; this avoids intense resource competition and helps ensure the overall stability of the storage cluster. In addition, this application inputs the multidimensional performance data of the storage nodes into a pre-trained anomaly detection model to predict the failure probability of each storage node within the target time window, which enables proactive management before failures occur and addresses the lack of foresight.

Brief Description of the Drawings

[0044] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0045] Figure 1 is a flowchart of a fault management method for a storage system disclosed in this application;

[0046] Figure 2 is an architecture diagram of a proactive fault suppression system for a distributed storage cluster disclosed in this application;

[0047] Figure 3 is a schematic structural diagram of a fault management apparatus for a storage system disclosed in this application;

[0048] Figure 4 is a structural diagram of an electronic device disclosed in this application.

Detailed Description

[0049] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0050] As noted above, passive, reactive fault handling mechanisms have three drawbacks. First, there is a time gap between the failure of an OSD node and its being marked as unavailable, during which client I/O requests sent to the failed node remain blocked or time out, causing severe fluctuations in front-end service performance or even service interruption. Second, data recovery intensively consumes network bandwidth, disk I/O, and CPU, competing with normal service I/O requests; this can lead to a "recovery storm", overload healthy nodes, and trigger a chain of failures that threatens the stability of the whole cluster. Third, the mechanism can only remediate after a failure occurs and offers no early warning or proactive intervention beforehand.

[0051] Therefore, this application proposes a fault management scheme for a storage system, which can achieve proactive management before a fault occurs.

[0052] This application discloses a fault management method for a storage system. As shown in Figure 1, the method includes:

[0053] Step S11: Obtain multidimensional performance data of each storage node in the storage system.

[0054] In this embodiment, the storage system includes, but is not limited to, a distributed storage system, and the multidimensional performance data includes at least three of the following metrics of a storage node:
Disk I/O wait time: the average time from submission to completion of a disk I/O request on the storage node, characterizing disk I/O processing efficiency and the degree of blocking.
CPU I/O wait percentage: the percentage of time the CPU is idle while waiting for I/O operations to complete, reflecting how long the CPU waits on resources because of I/O blocking.
Network packet loss rate: the proportion of packets lost during inter-node communication relative to the total number of packets sent, used to assess network transmission stability.
Memory allocation failure count: the cumulative number of failed memory allocation requests while the storage process is running, reflecting whether node memory is sufficient and the system is stable.
Internal queue depth of the storage process: the number of I/O requests waiting to be processed inside the storage process (for example, an OSD), reflecting the load pressure on the storage process.
Heartbeat response latency fluctuation: the variation in heartbeat response time between storage nodes, used to judge inter-node communication status and node health.

[0055] Furthermore, historical multidimensional performance data of each storage node in the storage system is collected; each set of historical multidimensional performance data is labeled with a corresponding running status label, yielding a training dataset of data samples and their status labels, wherein the running status label represents the running status of the corresponding storage node within the target time window; an anomaly detection model is trained on the training dataset; and the model parameters of the trained anomaly detection model are deployed to the management node of the storage system. In some specific embodiments, taking a Ceph-based distributed storage cluster as an example, historical multidimensional performance data of all OSD storage nodes over the past 3 months is collected, including disk I/O wait time, CPU I/O wait percentage, network packet loss rate, memory allocation failure count, OSD internal queue depth, and heartbeat response latency fluctuation. Each set of data is labeled as normal or fault according to whether the OSD node actually fails within the next hour (the target time window), producing a training dataset of 100,000 samples. A time-series anomaly detection algorithm is used to train the model and obtain model parameters usable for fault probability prediction. Finally, the model parameters are deployed to the Ceph Manager (MGR) management node for online real-time inference.
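The labeling rule described above can be sketched as follows. This is a minimal illustration and not the disclosed implementation: the `MetricSample` class, its field names and units, and the `label_samples` helper are all hypothetical, and the failure timestamps are assumed to come from an operations log.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MetricSample:
    """One multidimensional performance sample for a storage node.

    Field names and units are illustrative assumptions.
    """
    node_id: str
    timestamp: float            # seconds since epoch
    disk_io_wait_ms: float
    cpu_iowait_pct: float
    packet_loss_rate: float
    mem_alloc_failures: int
    queue_depth: int
    heartbeat_jitter_ms: float

    def features(self) -> List[float]:
        """Return the six metrics as a flat feature vector."""
        return [
            self.disk_io_wait_ms,
            self.cpu_iowait_pct,
            self.packet_loss_rate,
            float(self.mem_alloc_failures),
            float(self.queue_depth),
            self.heartbeat_jitter_ms,
        ]


def label_samples(
    samples: List[MetricSample],
    failure_times: Dict[str, List[float]],
    window_s: float = 3600.0,
) -> List[Tuple[List[float], int]]:
    """Label a sample 1 (fault) if its node failed within window_s seconds
    after the sample was taken, otherwise 0 (normal)."""
    dataset = []
    for s in samples:
        failed = any(
            s.timestamp < t <= s.timestamp + window_s
            for t in failure_times.get(s.node_id, [])
        )
        dataset.append((s.features(), 1 if failed else 0))
    return dataset
```

The default window of 3600 seconds mirrors the one-hour target time window of the example; any supervised training routine could then consume the resulting `(features, label)` pairs.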

[0056] Step S12: Input the multidimensional performance data into the pre-trained anomaly detection model to predict the probability of each storage node failing within the target time window.

[0057] In this embodiment, the target time window is the interval from the current moment to a preset future duration, such as 30 minutes or 1 hour. The anomaly detection model predicts, ahead of time, the probability that a storage node experiences a failure such as downtime or I/O blocking during this future period.
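To make the input and output of this step concrete, the sketch below scores a feature vector and selects the nodes whose probability exceeds a threshold. The source does not specify the model, so the logistic scorer here is only a stand-in with the same shape (features in, probability out); the function names, weights, and bias are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def failure_probability(features: Sequence[float],
                        weights: Sequence[float],
                        bias: float) -> float:
    """Stand-in scorer: logistic function of a weighted feature sum.

    This is a placeholder for the trained anomaly detection model,
    not the model described in the source.
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Numerically stable logistic function for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def nodes_over_threshold(probabilities: Dict[str, float],
                         threshold: float) -> List[str]:
    """Return the ids of nodes whose predicted probability exceeds threshold."""
    return sorted(n for n, p in probabilities.items() if p > threshold)
```

In a deployment the stand-in scorer would be replaced by inference against the deployed model parameters, while the threshold comparison would stay the same.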

[0058] Step S13: When the probability that a storage node fails within the target time window exceeds a preset threshold, mark the node state of that storage node as a predictive failure state before it is determined to be a failed node, and adjust its node weight to a preset value so that new I/O requests are not routed to it.

[0059] In this embodiment, the management node of the storage system invokes a target management command, which marks the node state of the storage node as a predictive failure state and then adjusts its node weight to a preset value (which can be 0), so that the storage node no longer participates in load scheduling for new I/O requests. The predictive failure state is an intermediate warning state that this embodiment adds to the traditional storage node state machine. Compared with the traditional binary states of "up" (normal) and "down" (failed), this state identifies a potential fault before the node actually crashes or is determined to have failed, and provides the state trigger for proactive fault suppression.
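A three-state node state machine of this kind might be modeled as below. This is a sketch under stated assumptions: `NodeState`, `StorageNode`, and `mark_predicted_failure` are hypothetical names, and remembering the previous weight in `saved_weight` is an added convenience for a later rollback rather than something the source prescribes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class NodeState(Enum):
    UP = "up"
    DOWN = "down"
    PREDICTED_FAILURE = "predicted_failure"   # intermediate warning state


@dataclass
class StorageNode:
    node_id: str
    weight: float = 1.0
    state: NodeState = NodeState.UP
    saved_weight: Optional[float] = None      # remembered for a later rollback


def mark_predicted_failure(node: StorageNode, probability: float,
                           threshold: float,
                           preset_weight: float = 0.0) -> bool:
    """Move an UP node into PREDICTED_FAILURE when probability > threshold.

    Returns True if the node was marked. Nodes that are already DOWN or
    already marked are left untouched.
    """
    if node.state is not NodeState.UP or probability <= threshold:
        return False
    node.saved_weight = node.weight
    node.weight = preset_weight
    node.state = NodeState.PREDICTED_FAILURE
    return True
```

Guarding on `NodeState.UP` keeps the transition idempotent, so repeated predictions for the same node do not overwrite the saved weight.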

[0060] Furthermore, the updated node state and node weight are written into the cluster map of the storage system in real time; the Monitor (MON) node broadcasts the updated cluster map to the storage system, so that each storage node and each client updates its local routing information according to the updated cluster map, ensuring that subsequent I/O requests are no longer routed to the node in the predictive failure state.
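The effect of broadcasting an updated map can be illustrated with a toy model. Everything here is an assumption made for illustration: `ClusterMap`, `Client`, and `broadcast` are hypothetical, and the hash-based `route` method is a deliberately simple placement rule that only honors "weight greater than zero"; it is not CRUSH or any real placement algorithm.

```python
import hashlib
from typing import Dict, List


class ClusterMap:
    """Versioned node-weight table; a stand-in for a real cluster map."""

    def __init__(self, weights: Dict[str, float]):
        self.epoch = 1
        self.weights = dict(weights)

    def set_weight(self, node_id: str, weight: float) -> None:
        self.weights[node_id] = weight
        self.epoch += 1          # every change produces a new map version


class Client:
    """Keeps a local copy of the map and routes requests from it."""

    def __init__(self) -> None:
        self.epoch = 0
        self.weights: Dict[str, float] = {}

    def on_map_update(self, cmap: ClusterMap) -> None:
        if cmap.epoch > self.epoch:          # ignore stale broadcasts
            self.epoch = cmap.epoch
            self.weights = dict(cmap.weights)

    def route(self, object_key: str) -> str:
        """Pick a node with weight > 0 by hashing the object key."""
        eligible = sorted(n for n, w in self.weights.items() if w > 0)
        if not eligible:
            raise RuntimeError("no eligible storage node")
        digest = hashlib.sha256(object_key.encode()).digest()
        return eligible[int.from_bytes(digest[:8], "big") % len(eligible)]


def broadcast(cmap: ClusterMap, clients: List[Client]) -> None:
    """Deliver the current map to every subscriber."""
    for c in clients:
        c.on_map_update(cmap)
```

Once a client has applied a map in which a node's weight is 0, that node drops out of `eligible` and receives no further requests from that client.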

[0061] Step S14: Determine the target migration rate for that storage node, and migrate the data stored on it to other storage nodes based on the target migration rate.

[0062] The system detects whether the storage node has actually failed. If it has, the incremental data generated during the migration is recovered and migrated to other storage nodes. If it has not, the predictive failure state of the storage node is cleared and its node weight is restored. In other words, after predictive fault isolation and data migration have been triggered, the system checks the actual status of the storage node to confirm whether a real fault exists. If the check shows that the node has indeed failed, the system completes incremental recovery and migration of the existing data that has not yet been migrated and of the data added during the migration window, ensuring data integrity. If the check shows that the node is operating normally, the predictive failure flag is removed, the node's service weight is restored, and the node is reintegrated into the cluster service system.

[0063] The technical advantages of this processing mechanism are as follows. On the one hand, when the prediction is correct, only lightweight incremental data migration needs to be performed, avoiding the resource contention and recovery storms caused by large-scale data recovery and thus preserving business I/O continuity. On the other hand, when the prediction is incorrect, the node can be rolled back without loss, avoiding the waste of cluster resources and loss of service capacity caused by isolating a healthy node, thereby improving the accuracy of proactive fault management and the overall stability of the cluster.
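The two outcomes of the post-prediction check can be sketched as a single function. This is an illustrative sketch, not the disclosed implementation: the dictionary layout of `node`, the helper name `resolve_prediction`, and the choice to set the state to "down" on a confirmed failure are assumptions.

```python
from typing import Dict, Set


def resolve_prediction(node: Dict[str, object],
                       actually_failed: bool,
                       all_objects: Set[str],
                       migrated: Set[str],
                       written_during_migration: Set[str]) -> Set[str]:
    """Finish handling a node that is in the predictive failure state.

    Returns the set of objects that still has to be recovered elsewhere.
    """
    if actually_failed:
        node["state"] = "down"
        # Unmigrated existing data plus data added during the migration window.
        return (all_objects - migrated) | written_during_migration
    # False alarm: clear the flag and restore the saved weight.
    node["state"] = "up"
    node["weight"] = node.pop("saved_weight")
    return set()
```

When the prediction was correct, the recovery set shrinks as migration progresses, which is why only a lightweight incremental pass remains; when it was a false alarm, nothing needs to be recovered.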

[0064] In this embodiment, the pre-configured resource usage limit parameters and migration priority level parameter are obtained; the target migration rate is determined based on the resource usage limit parameters, which include one or a combination of a disk I/O bandwidth limit, a network transmission bandwidth limit, and a CPU utilization limit; the data migration priority is set according to the migration priority level parameter; and the data stored on the storage node is migrated to other storage nodes based on the target migration rate and the data migration priority. Correspondingly, the fault management method further includes: monitoring in real time the resource utilization of each resource dimension, where the resource dimensions include one or a combination of the disk I/O dimension, the network transmission dimension, and the CPU dimension; and, when the resource utilization of any resource dimension exceeds the corresponding resource usage limit parameter, dynamically reducing the target migration rate according to preset control rules. In other words, the system rate-limits and load-manages the data migration according to preset resource usage thresholds and a task priority policy, and continuously monitors the utilization of core resources such as disk, network, and CPU during the migration. Once resource utilization exceeds the preset safety threshold, the migration rate is automatically reduced so that the migration task does not crowd out the resources needed by business operations. This isolates resources between data migration and business I/O and schedules them in balance, avoiding recovery storms at the source, keeping front-end business performance stable, and allowing the data migration to complete safely.

[0065] In some specific embodiments, the disk I/O bandwidth limit is pre-configured as 20 MB/s, the network bandwidth limit as 50 Mbps, and the CPU utilization limit as 30%, and the migration priority is set to the lowest level. The system migrates data off the node predicted to fail at a rate of 15 MB/s. If real-time monitoring finds that network bandwidth usage has reached 58 Mbps, exceeding the 50 Mbps limit, the system immediately reduces the migration rate to 8 MB/s according to the control rules. The migration rate is held steady only once resource utilization has dropped back into the safe range, so that business I/O is not affected.
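The throttling behavior in this example can be sketched as below. The source does not state its control rule, so the multiplicative backoff used here (halve the rate, with a floor) is purely an assumed rule; it reduces 15 MB/s to 7.5 MB/s rather than reproducing the 8 MB/s of the example. The class and function names are hypothetical; the default limits are taken from the example above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ResourceLimits:
    disk_mb_s: float = 20.0     # disk I/O bandwidth limit
    net_mbps: float = 50.0      # network bandwidth limit
    cpu_pct: float = 30.0       # CPU utilization limit


@dataclass(frozen=True)
class ResourceUsage:
    disk_mb_s: float
    net_mbps: float
    cpu_pct: float


def exceeded(usage: ResourceUsage, limits: ResourceLimits) -> List[str]:
    """Names of the resource dimensions whose usage is above its limit."""
    pairs = [
        ("disk", usage.disk_mb_s, limits.disk_mb_s),
        ("network", usage.net_mbps, limits.net_mbps),
        ("cpu", usage.cpu_pct, limits.cpu_pct),
    ]
    return [name for name, used, limit in pairs if used > limit]


def next_rate(rate_mb_s: float, usage: ResourceUsage, limits: ResourceLimits,
              backoff: float = 0.5, floor_mb_s: float = 1.0) -> float:
    """Assumed control rule: multiply the rate by `backoff` whenever any
    dimension is over its limit, but never drop below `floor_mb_s`."""
    if exceeded(usage, limits):
        return max(floor_mb_s, rate_mb_s * backoff)
    return rate_mb_s
```

Calling `next_rate` once per monitoring interval keeps lowering the rate while any dimension remains over its limit and leaves it unchanged once usage is back within all limits.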

[0066] Taking a distributed storage system as an example, this application proposes a machine-learning-based proactive fault suppression method for distributed storage clusters. The method aims to overcome the shortcomings of existing passive fault recovery technologies. Through predictive analysis, it proactively isolates a storage node (OSD) and starts smooth data migration before a substantive failure occurs, avoiding I/O blocking and recovery storms and improving the availability and stability of the distributed storage cluster. As shown in Figure 2, the method is integrated into the Manager (MGR) module of the distributed storage system, with a predictive fault management module at its core. The module is deployed as a Python plugin within the MGR daemon and uses the MGR plugin framework and API interfaces for data acquisition and cluster management. It also communicates with the existing Monitor and OSD services in the cluster via a message mechanism. Its core process includes:

[0067] 1. Data acquisition and feature extraction: Multidimensional performance indicators are acquired for each OSD in real time, covering disk I/O indicators (such as await time, IOPS, throughput, I/O queue depth, and error count), system resource indicators (such as memory utilization and allocation failure count, network packet loss rate and error count), and internal OSD indicators (such as OSD heartbeat response latency fluctuation, internal operation queue length, and peering state change frequency).

[0068] 2. Machine learning model prediction: The collected time-series data is input into the pre-trained, supervised time-series anomaly detection model, which outputs a failure probability score and a predictive health status label (such as "healthy", "warning", or "high risk") for each OSD. The model parameters are deployed to the MGR module after being trained on offline historical data.

[0069] 3. Proactive decision-making and degradation: When the predicted failure probability of an OSD within the target time window exceeds a preset threshold, a proactive process is triggered: the MON marks the OSD status as predicted_failure, sets the node weight to 0, and writes this status to the cluster map, which is broadcast to the entire cluster. Once the update has been received, subsequent I/O requests are no longer routed to this OSD, avoiding the I/O blocking that a node failure would cause. At the same time, data migration is started at the lowest priority and under strict resource rate limits so that the migration does not affect the I/O performance of front-end business.

[0070] 4. Post-prediction processing: A comprehensive status verification is performed on each OSD marked as being in the predictive failure state. If an actual fault is confirmed, only the existing data that has not yet been migrated and the incremental data within the migration window need to be recovered and migrated, avoiding a large-scale recovery storm. If the node is running normally, the predictive failure state is automatically cleared, the node weight is restored, and the node is reintegrated into the cluster service system. The solution also provides a configurable policy interface supporting customizable parameters such as prediction confidence thresholds, prediction time windows, and migration rate limits, so that it can be adapted to different business scenarios and hardware environments.
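The configurable policy and the health status labels mentioned in the steps above could be represented as follows. This is a hypothetical sketch: `PredictionPolicy`, `health_label`, and every default value are illustrative assumptions, since the source names the parameters but gives no concrete interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictionPolicy:
    """Tunable parameters; every default here is an illustrative assumption."""
    warning_threshold: float = 0.5
    high_risk_threshold: float = 0.8     # probability above which action is taken
    window_minutes: int = 60             # prediction time window
    max_migration_mb_s: float = 20.0     # migration rate limit

    def __post_init__(self) -> None:
        if not 0.0 <= self.warning_threshold <= self.high_risk_threshold <= 1.0:
            raise ValueError("thresholds must satisfy 0 <= warning <= high_risk <= 1")
        if self.window_minutes <= 0 or self.max_migration_mb_s <= 0:
            raise ValueError("window and migration limit must be positive")


def health_label(probability: float, policy: PredictionPolicy) -> str:
    """Map a failure probability to one of the three health status labels."""
    if probability > policy.high_risk_threshold:
        return "high risk"
    if probability > policy.warning_threshold:
        return "warning"
    return "healthy"
```

Validating the parameters at construction time means a misconfigured policy is rejected before it can influence isolation or migration decisions.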

[0071] By introducing the predictive failure state, machine-learning-driven proactive decision-making, and a designed degradation process, this solution shifts the operation and maintenance model from "post-event remediation" to "pre-event early warning". Its effects are as follows: (1) Eliminating I/O blocking caused by faults: traffic switching is completed before a node actually fails, keeping client I/O continuous and smooth and improving the user experience. (2) Eliminating recovery storms: the traditional high-intensity, centralized recovery after a fault becomes a proactive, smooth, resource-controlled background data migration, greatly reducing the impact on cluster resources and helping ensure the overall stability of the cluster. (3) Raising the level of intelligent operation and maintenance: operations personnel gain a sufficient time window for fault intervention, making distributed storage cluster operations more intelligent and proactive. (4) Enhancing system availability and reliability: the service degradation time caused by node failures and fault handling is reduced, improving the overall availability and reliability of the distributed storage system.

[0072] Furthermore, this solution can continuously iterate and optimize the anomaly detection model by combining it with the actual operating characteristics of the distributed storage cluster. By collecting new fault samples and prediction results online and regularly updating model parameters, the accuracy and recall of fault prediction are continuously improved. Simultaneously, during data migration and incremental recovery, this solution can automatically select the optimal target storage node based on the replica distribution strategy and node load, avoiding excessive data concentration and load skew, further improving cluster balance and reliability. In addition, this solution can log and visualize information such as predictive fault events, node status changes, and data migration progress, providing operations and maintenance personnel with full-process traceability evidence and enhancing the observability and maintainability of distributed storage cluster fault management.
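Target selection by replica distribution and node load could look like the following sketch. The selection rule (least-loaded node that does not already hold a replica) is one plausible reading of the paragraph above rather than the disclosed policy, and `pick_target` and its parameters are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set


def pick_target(load: Dict[str, float],
                replica_holders: Set[str],
                excluded: Iterable[str] = ()) -> Optional[str]:
    """Choose the least-loaded node that does not already hold a replica
    of the object and is not excluded (for example, a node in the
    predictive failure state). Returns None if no node qualifies."""
    banned = set(replica_holders) | set(excluded)
    candidates = [n for n in load if n not in banned]
    if not candidates:
        return None
    # Tie-break on node id so the choice is deterministic.
    return min(candidates, key=lambda n: (load[n], n))
```

Skipping nodes that already hold a replica avoids concentrating copies of the same data, and preferring the least-loaded candidate limits load skew.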

[0073] Therefore, this application proposes a fault management method for a storage system, comprising: acquiring multi-dimensional performance data of each storage node in the storage system; inputting the multi-dimensional performance data into a pre-trained anomaly detection model to predict the probability of each storage node failing within a target time window; when the probability of any storage node failing within the target time window exceeds a preset threshold, marking the node state of that storage node as a predictive failure state before it is determined to be a failed node, and adjusting its node weight to a preset value to prevent new input / output requests from being routed to it; and determining the target migration rate corresponding to that storage node, and migrating the stored data on it to other storage nodes based on the target migration rate. In traditional mechanisms, there is a time gap between a node failing and its being marked as unavailable, during which client I / O requests destined for that node block or time out, impacting front-end services. This application addresses this issue by using fault probability prediction to mark a storage node as being in the predictive failure state before it is determined to be a failed node, and by adjusting its node weight to the preset value. This directly prevents new I / O requests from being routed to that node, resolving the abnormal business performance that would otherwise occur during this time gap. Furthermore, this application determines the target migration rate and initiates data migration before the node actually fails, replacing the concentrated data recovery operations that would otherwise follow a failure. This avoids intense resource competition and helps ensure the overall stability of the storage cluster. In addition, this application inputs the multi-dimensional performance data of the storage nodes into a pre-trained anomaly detection model to predict the failure probability of each storage node within the target time window. This enables proactive management before failures occur, addressing the lack of foresight in reactive fault handling.
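The marking and weight-adjustment step summarized above can be sketched as follows. This is a minimal illustration only: the anomaly detection model is replaced by a dictionary of already-predicted probabilities, and the threshold of 0.8, the preset weight of 0.0, and all identifiers are assumptions chosen for the example rather than values given in this application.

```python
from dataclasses import dataclass

NORMAL = "normal"
PREDICTIVE_FAILURE = "predictive_failure"


@dataclass
class NodeEntry:
    state: str = NORMAL
    weight: float = 1.0


def apply_predictions(cluster_map, probabilities, threshold=0.8, preset_weight=0.0):
    """Mark nodes whose predicted failure probability exceeds the threshold.

    cluster_map: dict mapping node name to NodeEntry (a stand-in for the
        cluster mapping held by the management node).
    probabilities: dict mapping node name to the model's predicted
        probability of failure within the target time window.
    Returns the names of the nodes newly marked in this pass.
    """
    marked = []
    for name, p in probabilities.items():
        entry = cluster_map[name]
        if p > threshold and entry.state == NORMAL:
            entry.state = PREDICTIVE_FAILURE
            entry.weight = preset_weight   # assumed: zero weight receives no new I/O
            marked.append(name)
    return marked


def routable_nodes(cluster_map):
    """Nodes that may still receive new I/O requests (weight above zero)."""
    return sorted(name for name, e in cluster_map.items() if e.weight > 0)
```

For example, with probabilities `{"n1": 0.1, "n2": 0.93}` only `n2` is marked, and `routable_nodes` then no longer returns it, so new requests are routed elsewhere while migration from `n2` proceeds.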

[0074] Accordingly, this application also discloses a fault management device for a storage system. As shown in Figure 3, the device includes:

[0075] The data acquisition module 11 is used to acquire multi-dimensional performance data of each storage node in the storage system;

[0076] The probability determination module 12 is used to input the multi-dimensional performance data into a pre-trained anomaly detection model in order to predict the probability of each storage node failing within a target time window;

[0077] The status marking module 13 is used to, when the probability of any storage node failing within the target time window exceeds a preset threshold, mark the node state of that storage node as a predictive failure state before it is determined to be a failed node, and adjust its node weight to a preset value to prevent new input / output requests from being routed to it;

[0078] The data migration module 14 is used to determine the target migration rate corresponding to that storage node, and to migrate the stored data on that storage node to other storage nodes based on the target migration rate.

[0079] For more detailed information on the working process of each of the above modules, please refer to the relevant content disclosed in the foregoing embodiments, which will not be repeated here.

[0080] Therefore, this application proposes a fault management method for a storage system, comprising: acquiring multi-dimensional performance data of each storage node in the storage system; inputting the multi-dimensional performance data into a pre-trained anomaly detection model to predict the probability of each storage node failing within a target time window; when the probability of any storage node failing within the target time window exceeds a preset threshold, marking the node state of that storage node as a predictive failure state before it is determined to be a failed node, and adjusting its node weight to a preset value to prevent new input / output requests from being routed to it; and determining the target migration rate corresponding to that storage node, and migrating the stored data on it to other storage nodes based on the target migration rate. In traditional mechanisms, there is a time gap between a node failing and its being marked as unavailable, during which client I / O requests destined for that node block or time out, impacting front-end services. This application addresses this issue by using fault probability prediction to mark a storage node as being in the predictive failure state before it is determined to be a failed node, and by adjusting its node weight to the preset value. This directly prevents new I / O requests from being routed to that node, resolving the abnormal business performance that would otherwise occur during this time gap. Furthermore, this application determines the target migration rate and initiates data migration before the node actually fails, replacing the concentrated data recovery operations that would otherwise follow a failure. This avoids intense resource competition and helps ensure the overall stability of the storage cluster. In addition, this application inputs the multi-dimensional performance data of the storage nodes into a pre-trained anomaly detection model to predict the failure probability of each storage node within the target time window. This enables proactive management before failures occur, addressing the lack of foresight in reactive fault handling.
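To illustrate how a target migration rate might be bounded so that migration does not compete with front-end services, the following minimal sketch derives an initial rate from configured bandwidth limits and reduces it when any monitored resource dimension exceeds its upper limit. The halving back-off, the minimum rate, the units, and all identifiers are assumptions for this example; this application only requires that the rate be reduced according to preset control rules.

```python
def initial_rate(disk_limit_mbps, net_limit_mbps):
    """Start at the tighter of the two configured bandwidth limits (MB/s)."""
    return min(disk_limit_mbps, net_limit_mbps)


def adjust_rate(current_rate, usage, limits, backoff=0.5, floor=1.0):
    """Reduce the migration rate when any resource dimension is over its limit.

    usage:  dict of current utilization per dimension, e.g. {"disk": 0.9, ...}
    limits: dict of upper-limit parameters for the same dimensions.
    backoff and floor are hypothetical control-rule parameters.
    """
    over_limit = any(usage[dim] > limits[dim] for dim in limits)
    if over_limit:
        # Multiplicative decrease, but never below the minimum rate.
        return max(floor, current_rate * backoff)
    return current_rate
```

A control loop would call `adjust_rate` at each monitoring interval; how and when the rate is raised again after utilization falls is left open here.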

[0081] Furthermore, embodiments of this application also provide an electronic device. Figure 4 is a structural diagram of an electronic device 20 according to an exemplary embodiment. The content of the diagram should not be construed as limiting the scope of this application.

[0082] Figure 4 is a schematic diagram of the structure of an electronic device 20 provided in an embodiment of this application. Specifically, the electronic device 20 may include: at least one processor 21, at least one memory 22, a display screen 23, an input / output interface 24, a communication interface 25, a power supply 26, and a communication bus 27. The memory 22 stores a computer program, which is loaded and executed by the processor 21 to implement the relevant steps in the fault management method of the storage system disclosed in any of the foregoing embodiments. Furthermore, the electronic device 20 in this embodiment may specifically be an electronic computer.

[0083] In this embodiment, the power supply 26 is used to provide operating voltage for each hardware device on the electronic device 20; the communication interface 25 can create a data transmission channel between the electronic device 20 and external devices, and the communication protocol it follows can be any communication protocol applicable to the technical solution of this application, and is not specifically limited here; the input / output interface 24 is used to acquire external input data or output data to the outside world, and its specific interface type can be selected according to specific application needs, and is not specifically limited here.

[0084] Furthermore, the memory 22, as a carrier for resource storage, can be a read-only memory, random access memory, disk, or optical disk, etc. The resources stored thereon may include computer programs 221, and the storage method may be temporary storage or permanent storage. The computer programs 221 may include, in addition to computer programs capable of performing the fault management method of the storage system executed by the electronic device 20 as disclosed in any of the foregoing embodiments, computer programs capable of performing other specific tasks.

[0085] Furthermore, embodiments of this application also disclose a computer-readable storage medium for storing a computer program; wherein, when the computer program is executed by a processor, it implements the aforementioned fault management method for the storage system.

[0086] For the specific steps of this method, please refer to the relevant content disclosed in the foregoing embodiments, which will not be repeated here.

[0087] The various embodiments in this application are described in a progressive manner, with each embodiment focusing on its differences from the other embodiments. For parts that are the same or similar across embodiments, the embodiments may be referred to one another. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the method section.

[0088] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0089] The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein can be implemented directly by hardware, a software module executed by a processor, or a combination of both. The software module can be located in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art.

[0090] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0091] The above provides a detailed description of a fault management method, apparatus, device, and storage medium for a storage system provided in this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of this application. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.

Claims

1. A fault management method for a storage system, characterized in that it comprises: acquiring multi-dimensional performance data of each storage node in the storage system; inputting the multi-dimensional performance data into a pre-trained anomaly detection model to predict the probability of each storage node failing within a target time window; when the probability of any storage node failing within the target time window exceeds a preset threshold, marking the node state of the storage node as a predictive failure state before the storage node is determined to be a failed node, and adjusting the node weight of the storage node to a preset value to prevent new input / output requests from being routed to the storage node; and determining the target migration rate corresponding to the storage node, and migrating the stored data on the storage node to other storage nodes based on the target migration rate.

2. The fault management method for a storage system according to claim 1, characterized in that the multi-dimensional performance data of each storage node includes at least three of the following: disk I / O wait time, CPU I / O wait percentage, network packet loss rate, number of memory allocation failures, internal queue depth of the storage process, and heartbeat response latency fluctuation.

3. The fault management method for a storage system according to claim 1, characterized in that, before inputting the multi-dimensional performance data into the pre-trained anomaly detection model, the method further comprises: collecting historical multi-dimensional performance data of each storage node in the storage system; labeling each set of historical multi-dimensional performance data with a corresponding running status label to obtain a training dataset containing multiple sets of data samples and corresponding status labels, wherein the running status label represents the running status of the corresponding storage node within the target time window; training the anomaly detection model based on the training dataset; and deploying the model parameters of the trained anomaly detection model to the management node of the storage system.

4. The fault management method for a storage system according to claim 1, characterized in that the process of migrating the stored data on the storage node to other storage nodes based on the target migration rate further comprises: detecting whether the storage node has an actual fault; if the storage node has an actual fault, recovering the incremental data generated during the migration and migrating it to the other storage nodes; and if the storage node does not have an actual fault, clearing the predictive failure state of the storage node and restoring the node weight of the storage node.

5. The fault management method for a storage system according to claim 1, characterized in that the step of determining the target migration rate corresponding to the storage node, and migrating the stored data on the storage node to other storage nodes based on the target migration rate, comprises: obtaining pre-configured resource usage limit parameters and a migration priority level parameter; determining the target migration rate based on the resource usage limit parameters, wherein the resource usage limit parameters include one or a combination of a disk input / output bandwidth limit, a network transmission bandwidth limit, and a CPU utilization limit; setting the data migration priority according to the migration priority level parameter; and migrating the stored data on the storage node to other storage nodes based on the target migration rate and the data migration priority; correspondingly, the fault management method for the storage system further comprises: monitoring, in real time, the resource utilization in each resource dimension, wherein the resource dimensions include one or a combination of a disk input / output dimension, a network transmission dimension, and a central processing unit dimension; and when the resource utilization in any of the resource dimensions exceeds the corresponding resource usage limit parameter, dynamically reducing the target migration rate according to preset control rules.

6. The fault management method for a storage system according to claim 1, characterized in that the step of marking the node state of the storage node as a predictive failure state and adjusting the node weight of the storage node to a preset value comprises: invoking a target management command through the management node of the storage system; and, through the target management command, marking the node state of the storage node as the predictive failure state and adjusting the node weight of the storage node to the preset value.

7. The fault management method for a storage system according to any one of claims 1 to 6, characterized in that, after marking the node state of the storage node as a predictive failure state and adjusting the node weight of the storage node to a preset value, the method further comprises: writing the updated node state and node weight to the cluster mapping of the storage system in real time; and broadcasting the updated cluster mapping to the storage system so that each storage node in the storage system updates its local routing information according to the updated cluster mapping.

8. A fault management device for a storage system, characterized in that it comprises: a data acquisition module, used to acquire multi-dimensional performance data of each storage node in the storage system; a probability determination module, used to input the multi-dimensional performance data into a pre-trained anomaly detection model, so as to predict, through the anomaly detection model, the probability of each storage node failing within a target time window; a status marking module, used to, when the probability of any storage node failing within the target time window exceeds a preset threshold, mark the node state of the storage node as a predictive failure state before the storage node is determined to be a failed node, and adjust the node weight of the storage node to a preset value to prevent new input / output requests from being routed to the storage node; and a data migration module, used to determine the target migration rate corresponding to the storage node, and to migrate the stored data on the storage node to other storage nodes based on the target migration rate.

9. An electronic device, characterized in that it comprises: a memory, used to store a computer program; and a processor, used to execute the computer program to implement the fault management method for the storage system as described in any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that it is used to store a computer program, wherein, when the computer program is executed by a processor, it implements the fault management method for the storage system as described in any one of claims 1 to 7.