Fault analysis method, device, equipment, storage medium and program product
By setting up observation points in the distributed storage system to automatically monitor the status of repair tasks, the problem of data repair tasks being blocked for a long time due to faults is solved, achieving efficient fault analysis and rapid fault location, and improving the system's security and fault handling efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SUGON INFORMATION IND
- Filing Date
- 2026-02-05
- Publication Date
- 2026-06-16
AI Technical Summary
In distributed storage systems, data repair tasks are prone to prolonged blocking due to insufficient disk space, network anomalies, task concurrency, and resource shortages. Existing fault analysis methods are inefficient and require manual intervention.
By setting observation points in each node of the distributed storage system, data identifiers and statistics of repair tasks are obtained, the repair status is automatically monitored, fault observation points are located and fault causes are determined, and a solution strategy is generated using a preset fault decision model.
It enables efficient fault analysis without human intervention, quickly locates the fault location and cause, and improves the efficiency of fault analysis for data repair tasks and the security of the system.
Figure CN122220132A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of storage technology, and in particular to a fault analysis method, apparatus, device, storage medium, and program product.
Background Technology
[0002] In a distributed storage system, when corrupted object data is generated due to a fault, a data repair task needs to be initiated promptly. This task is responsible for timely and accurate repair of all corrupted object data, thereby preventing high-risk incidents caused by such data. However, during the data repair process, the task may encounter various limitations such as insufficient disk space, network anomalies, insufficient task concurrency and resources, task mutual exclusion, or triggering software defects in the system. These limitations can lead to situations where the data repair task is blocked for an extended period and cannot execute.
[0003] Currently, when a data repair task encounters an anomaly, the location and cause of the problem must be analyzed manually in order to troubleshoot it, which is inefficient.
Summary of the Invention
[0004] In view of the above technical problems, it is therefore necessary to provide a fault analysis method, apparatus, equipment, storage medium, and program product that can improve the efficiency of fault analysis for data repair tasks.
[0005] In a first aspect, this application provides a fault analysis method, which includes:
[0006] Obtain the data identifiers for each data segment in the repair task;
[0007] Obtain statistical data corresponding to each observation point; the observation points are pre-set in the nodes of the distributed storage system.
[0008] Based on the data identifiers and the statistical data corresponding to each observation point, the first repair status of the repair task is determined;
[0009] If the first repair status indicates that the repair task has a fault, the fault observation point is determined according to the statistical data corresponding to each observation point, and the cause of the fault in the repair task is determined according to the statistical data corresponding to the fault observation point.
[0010] In the above embodiment, firstly, the data identifiers of each data segment in the repair task are obtained. Then, the statistical data corresponding to each observation point is obtained, wherein each observation point is pre-set in each node of the distributed storage system. Next, based on each data identifier and the statistical data corresponding to each observation point, the first repair state of the repair task is determined. Finally, if the first repair state indicates that the repair task has a fault, the fault observation point is determined based on the statistical data corresponding to each observation point, and the cause of the fault in the repair task is determined based on the statistical data corresponding to the fault observation point. In this way, by obtaining the statistical data corresponding to the observation points set in each node of the distributed storage system to determine the first repair state of the repair task, and if the first repair state indicates that the repair task has a fault, the fault observation point is further determined to determine the fault location and cause of the fault in the repair task. This achieves monitoring of the running status of the repair task through preset observation points, eliminating the need for manual fault analysis and improving the efficiency of fault analysis for the repair task.
[0011] In one embodiment, the first repair status of the repair task is determined based on each data identifier and the statistical data corresponding to each observation point, including:
[0012] Based on the data identifiers and the statistical data corresponding to each observation point, determine the second repair status corresponding to each data segment;
[0013] If all second repair states are repair complete, then the first repair state is determined to be repair complete.
[0014] If any second repair status is repair failure, then the first repair status is determined to be repair failure.
[0015] In the above embodiments, the repair data in the repair task is distributed in data segments as the smallest granularity. The repair of multiple data segments can be carried out concurrently to improve the efficiency of data repair. The first repair state of the repair task can be determined based on the second repair state of all data segments, thereby realizing the monitoring of the overall operation status of the repair task.
[0016] In one embodiment, the second repair state corresponding to each data segment is determined based on each data identifier and the statistical data corresponding to each observation point, including:
[0017] For each data segment, based on the data identifier corresponding to the data segment, the target observation point corresponding to the data segment is determined from each observation point;
[0018] Based on the statistical data corresponding to the target observation point, determine the second repair state corresponding to the data segment.
[0019] In the above embodiments, the target observation point is determined by the data identifier corresponding to the data segment, and the second repair state corresponding to the data segment is determined according to the statistical data corresponding to the target observation point, thereby realizing the monitoring of the repair process of each data segment.
[0020] In one embodiment, the method further includes:
[0021] Based on the parameter information corresponding to the judgment conditions of each observation point, determine the statistical data corresponding to each observation point.
[0022] In one embodiment, each observation point includes a cache area, and the method further includes:
[0023] The cache area is divided into a preset number of sub-cache areas, and each sub-cache area is used to store statistical data within different time ranges;
[0024] For each sub-cache, when the statistical data stored in the sub-cache reaches the upper limit of the corresponding time range, the sub-cache is converted into the sub-cache corresponding to the next time range, and the statistical data in the sub-cache that exceeds the preset time range is reclaimed.
[0025] In the above embodiments, by setting up different sub-cache areas to store statistical data for different time ranges, and by reclaiming statistical data that exceeds the preset time range, the statistical data of each observation point can be managed.
[0026] In one embodiment, fault observation points are determined based on statistical data corresponding to each observation point, and the cause of the fault in the repair task is determined based on the statistical data corresponding to the fault observation points, including:
[0027] Based on the statistical data corresponding to the fault observation points, determine the abnormal information of the fault observation points;
[0028] Based on the anomaly information, determine the cause of the failure in the repair task.
[0029] In the above embodiments, by determining the fault observation point, the cause of the fault can be located efficiently and accurately. Combined with the preset fault decision model, a solution strategy is obtained. Based on the solution strategy, the system operation fault is restored, which brings great safety assurance to safe production.
[0030] In a second aspect, this application also provides a fault analysis apparatus, which includes:
[0031] The first acquisition module is used to acquire the data identifiers of each data segment in the repair task;
[0032] The second acquisition module is used to acquire statistical data corresponding to each observation point; the observation points are pre-set in the nodes of the distributed storage system.
[0033] The first determination module is used to determine the first repair status of the repair task based on each data identifier and the statistical data corresponding to each observation point.
[0034] The second determination module is used to determine the fault observation point based on the statistical data corresponding to each observation point if the first repair status indicates that the repair task has a fault, and to determine the cause of the fault in the repair task based on the statistical data corresponding to the fault observation point.
[0035] In a third aspect, this application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of any of the fault analysis methods described in the first aspect above.
[0036] In a fourth aspect, this application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of any of the fault analysis methods described in the first aspect above.
[0037] In a fifth aspect, this application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of any of the fault analysis methods described in the first aspect above.
[0038] The aforementioned fault analysis method, apparatus, equipment, storage medium, and program product first acquire the data identifiers of each data segment in the repair task. Then, they acquire the statistical data corresponding to each observation point, where each observation point is pre-set in each node of the distributed storage system. Next, based on each data identifier and the statistical data corresponding to each observation point, the first repair state of the repair task is determined. Finally, if the first repair state indicates a fault in the repair task, fault observation points are determined based on the statistical data corresponding to each observation point, and the cause of the fault in the repair task is determined based on the statistical data corresponding to the fault observation points. In this way, by acquiring the statistical data corresponding to the observation points set in each node of the distributed storage system to determine the first repair state of the repair task, and if the first repair state indicates a fault in the repair task, further determining the fault observation points to determine the fault location and cause of the fault in the repair task, the system achieves monitoring of the operating status of the repair task through preset observation points, eliminating the need for manual fault analysis and improving the efficiency of fault analysis for repair tasks.
Attached Figure Description
[0039] To more clearly illustrate the technical solutions in the embodiments of this application or related technologies, the drawings used in the description of the embodiments of this application or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.
[0040] Figure 1 is a diagram illustrating the application environment of the fault analysis method in one embodiment;
[0041] Figure 2 is a flowchart illustrating a fault analysis method in one embodiment;
[0042] Figure 3 is a schematic diagram showing the setting of observation points for a fault analysis method in one embodiment;
[0043] Figure 4 is a flowchart illustrating the first repair state determination step in one embodiment;
[0044] Figure 5 is a flowchart illustrating the second repair state determination step in one embodiment;
[0045] Figure 6 is a flowchart illustrating the observation point cache management steps in one embodiment;
[0046] Figure 7 is a flowchart of the state transition of the observation point cache area in one embodiment;
[0047] Figure 8 is a flowchart illustrating the fault cause analysis steps in one embodiment;
[0048] Figure 9 is a flowchart illustrating the fault analysis method in another embodiment;
[0049] Figure 10 is a structural block diagram of a fault analysis device in one embodiment;
[0050] Figure 11 is an internal structural diagram of a computer device in one embodiment.
Detailed Implementation
[0051] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0052] It should be noted that the terms "first," "second," etc., used in this application can be used to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish the first element from the second element. The terms "comprising" and "having," and any variations thereof, used in this application, are intended to cover non-exclusive inclusion. The term "multiple" used in this application refers to two or more. The term "and / or" used in this application refers to one of the embodiments, or any combination of multiple embodiments.
[0053] The fault analysis method provided in the embodiments of this application can be applied to the application environment shown in Figure 1. The application environment includes a distributed storage system, which can be a distributed cluster system capable of providing block storage services. Optionally, the distributed storage system includes multiple nodes, which are connected via communication methods such as Wi-Fi or mobile network connections. In this embodiment, to improve communication performance and security, the communication method is a 10 Gigabit network, a private network connection, or the like. Each node in the distributed storage system can be, but is not limited to, an independent server, a server cluster composed of multiple servers, or a cloud server providing cloud computing services. This embodiment does not limit the specific form of each node. Optionally, the distributed storage system can include one master node and multiple slave nodes, where the master node can be a management node. Figure 1 illustrates, as an example, a distributed storage system with three slave nodes.
[0054] In one exemplary embodiment, as shown in Figure 2, a fault analysis method is provided. The method is described by taking its application to the master node of the distributed storage system in Figure 1 as an example, and includes the following steps 201 to 204:
[0055] Step 201: Obtain the data identifiers of each data segment in the repair task.
[0056] The repair task is a task initiated to repair bad object data in the distributed storage system. The repair task includes repair data for repairing the bad object data. The repair data includes multiple data segments, and multiple data segments can be repaired in parallel.
[0057] When each data segment initiates repair, a data identifier is generated. Each data segment's data identifier is unique, and the route of the data segment in the distributed storage system can be determined based on this data identifier.
[0058] Step 202: Obtain the statistical data corresponding to each observation point.
[0059] The observation points are pre-set on the nodes of the distributed storage system. Optionally, based on the function of the repair task, the repair task flow can be determined to include at least one process, each of which completes one stage of the flow. Observation points are set at key locations in each process on each node, so that the state of the data segments of the repair task can be observed as they pass through an observation point, thereby enabling monitoring of the operational status of the repair task. Figure 3 illustrates the setup of observation points, using as an example a repair task comprising a task initiation process, a task execution process, and a data layer process. Different processes on different nodes include multiple observation points. It is understood that when a repair task includes other processes, observation points can be set at different locations according to the actual process requirements; this embodiment imposes no limitation in this respect.
[0060] For example, statistical data for each observation point is determined based on the parameter information corresponding to the judgment conditions of each observation point. The judgment conditions for each observation point may include the target disk status, disk space, the number of available healthy objects, node status, disk freeze, data read / write, etc. The parameter information corresponding to the judgment conditions can be specific judgment parameters; for example, when the judgment condition is disk space, the corresponding parameter information is a disk space threshold. When a repair task passes through an observation point, if the observation point determines that the disk space meets the required disk space threshold, then the observation point detection is valid, i.e., the observation point detection is normal. Similarly, if the observation point determines that the disk space does not meet the required disk space threshold, then the observation point detection is invalid, i.e., the observation point detection is abnormal. Based on the detection results of the observation point, the statistical data corresponding to the observation point can be determined.
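The disk space example above can be sketched as a small judgment condition. This is only an illustrative sketch: the class name `DiskSpaceCondition`, the field `min_free_bytes`, and the 10 GiB threshold are assumptions introduced here and do not appear in the application.

```python
from dataclasses import dataclass


@dataclass
class DiskSpaceCondition:
    """Hypothetical judgment condition: free disk space must reach a threshold."""

    # Parameter information of the judgment condition (assumed name and unit).
    min_free_bytes: int

    def check(self, free_bytes: int) -> bool:
        # True  -> the observation point detection is normal (valid).
        # False -> the observation point detection is abnormal (invalid).
        return free_bytes >= self.min_free_bytes


# Assumed threshold of 10 GiB, chosen only for illustration.
condition = DiskSpaceCondition(min_free_bytes=10 * 1024**3)
normal = condition.check(free_bytes=50 * 1024**3)
abnormal = condition.check(free_bytes=1 * 1024**3)
```

Other judgment conditions named above, such as node status or disk freeze, could follow the same shape with their own parameter information.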
[0061] Optionally, the statistical data for observation points may include observation point name, number of hits, number of failed hits, time of anomaly detection, anomaly status code, and the influence weight of observation point location. The number of hits refers to the number of times the repair task passes through the observation point; the number of failed hits refers to the number of times the observation point detected anomalies; the time of anomaly detection refers to the time corresponding to the anomaly detection at the observation point; and the anomaly status code can be determined based on the judgment conditions of the observation point. Different judgment conditions can correspond to different status codes. When an anomaly is detected, the anomaly status code is determined according to the status code corresponding to the judgment conditions. The influence weight of observation point location is used to characterize the degree of influence of the observation point on the overall repair task. It can be determined by the current location of the observation point. If the observation point passes the retry (i.e., the judgment is re-evaluated based on the judgment conditions and the detection is deemed normal), the location influence weight coefficient is low. If the retry still results in anomaly detection, the location influence weight coefficient is high. If the anomaly detected at the observation point is not persistent and can be automatically recovered, the location influence weight coefficient is in the middle range. The specific coefficient value can be set according to actual needs, and this application does not impose any restrictions on it.
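The three-level influence weight described above can be sketched as a simple lookup. The enum `RetryOutcome`, the function name, and the coefficient values 0.2, 0.5, and 0.9 are assumptions chosen for illustration; the application leaves the specific coefficient values open.

```python
from enum import Enum


class RetryOutcome(Enum):
    """Hypothetical outcomes of re-evaluating a judgment condition."""

    PASSED_ON_RETRY = "passed_on_retry"      # retry detects normal
    AUTO_RECOVERABLE = "auto_recoverable"    # anomaly is not persistent
    STILL_ABNORMAL = "still_abnormal"        # retry still detects an anomaly


# Assumed coefficients: only their ordering (low < middle < high) follows the text.
LOCATION_WEIGHTS = {
    RetryOutcome.PASSED_ON_RETRY: 0.2,
    RetryOutcome.AUTO_RECOVERABLE: 0.5,
    RetryOutcome.STILL_ABNORMAL: 0.9,
}


def location_influence_weight(outcome: RetryOutcome) -> float:
    """Return the location influence weight coefficient for a retry outcome."""
    return LOCATION_WEIGHTS[outcome]
```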
[0062] During the repair process of each data segment, each observation point points to the next observation point after the current position. When the data segment reaches an observation point, the judgment condition is used to determine whether the observation point is passed successfully. If it is, the hit count is incremented by one, and the process proceeds to the next observation point for monitoring. If the judgment fails, the business judgment condition for that observation point position is determined not to be met, the failed hit count is incremented by one, and the anomaly status is recorded. Each observation point records its own statistical data. The observation points within the same process are treated as one statistical unit, meaning that the statistical data of multiple observation points within the same process are packaged and sent together to reduce the amount of data transmitted during communication. The master node receives the data sent by each statistical unit and, based on the received data, obtains the statistical data for each observation point.
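The recording and packaging behaviour described above might look roughly like the following sketch. The class `ObservationPoint`, the function `package_by_process`, the status code strings, and the context dictionary are all hypothetical names introduced for illustration, not the applicant's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ObservationPoint:
    """Hypothetical observation point that records its own statistics."""

    name: str
    process: str                          # process the point belongs to
    condition: Callable[[dict], bool]     # judgment condition
    status_code: str                      # status code recorded on anomaly
    hits: int = 0
    failed_hits: int = 0
    anomalies: list = field(default_factory=list)

    def observe(self, data_id: str, context: dict, now: float) -> bool:
        if self.condition(context):
            self.hits += 1                # passed successfully
            return True
        self.failed_hits += 1             # judgment condition not met
        self.anomalies.append((data_id, now, self.status_code))
        return False


def package_by_process(points):
    """Group the statistics of points in the same process into one unit."""
    units = defaultdict(list)
    for p in points:
        units[p.process].append({
            "name": p.name,
            "hits": p.hits,
            "failed_hits": p.failed_hits,
            "anomalies": list(p.anomalies),
        })
    return dict(units)


# A data segment walks the points in order and stops at the first failure.
a = ObservationPoint("A", "task_initiation", lambda c: c["node_up"], "NODE_DOWN")
b = ObservationPoint("B", "task_initiation", lambda c: c["free_bytes"] >= 100, "DISK_FULL")
context = {"node_up": True, "free_bytes": 10}
for step, point in enumerate([a, b]):
    if not point.observe("seg-1", context, now=float(step)):
        break
units = package_by_process([a, b])
```

Packaging per process means the master node receives one message per statistical unit rather than one per observation point.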
[0063] Step 203: Determine the first repair status of the repair task based on the data identifiers and the statistical data corresponding to each observation point.
[0064] For example, the data segments in the repair task can be repaired in parallel, and each data segment can correspond to a different route. Referring to Figure 3, and taking the observation point settings in Figure 3 as an example, the data segment routes can include the following cases. It should be understood that the following routes are only illustrative, and the actual repair process also includes other routes. The observation points along Route 1 are:
[0065] NodeA(AB)->NodeA(CD)->NodeC(E)->NodeA(FG)|NodeB(FG).
[0066] Observation points along Route 2:
[0067] NodeA(AB)->NodeB(CD)->NodeC(E)->NodeA(FG)|NodeC(FG).
[0068] Observation points along Route 3:
[0069] NodeA(AB)->NodeC(CD)->NodeC(E)->NodeA(FG)|NodeC(FG).
[0070] Observation points along Route 4:
[0071] NodeA(AB)->NodeC(CD)->NodeC(E)->NodeB(FG)|NodeC(FG).
[0072] The first repair status of a repair task indicates whether the repair task has been completed. If the repair of all data segments in the repair task is completed, the repair task ends normally, and the first repair status is "Repair Task Completed Normally". If any data segment is blocked or fails to repair during the process and cannot continue, the repair task will be blocked and unable to end normally, and the first repair status is "Repair Task Faulty". The statistical data corresponding to each observation point also includes the data identifiers of the data segments that passed through that observation point. Based on the data identifiers corresponding to each data segment in the repair task, the observation points passed by each data segment and the route of the repair process of each data segment can be determined. Based on the detection results of each data segment at each observation point and whether the repair route of each data segment is complete, the first repair status can be determined.
[0073] Step 204: If the first repair status indicates that the repair task has a fault, determine the fault observation point according to the statistical data corresponding to each observation point, and determine the cause of the fault in the repair task according to the statistical data corresponding to the fault observation point.
[0074] If the first repair status indicates that the repair task has a fault, there are data segments that have not been repaired. The fault observation point is determined based on the data identifiers of the unrepaired data segments and the statistical data corresponding to each observation point. Based on the statistical data corresponding to the fault observation point, such as the hit anomaly time, the hit anomaly status code, and the influence weight of the observation point location, the reason why the data segment repair failed is determined, thereby determining the cause of the fault in the repair task. Since the position of each observation point is preset, the fault location of the repair task can be determined from the position of the fault observation point.
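A minimal sketch of locating fault observation points from the statistical data is given below. The record layout `HitRecord`, the status codes, and the `CAUSES` table are assumptions made for illustration; ordering the failures by influence weight is likewise one possible choice, not something the application prescribes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HitRecord:
    """Hypothetical per-segment entry in an observation point's statistics."""

    point: str
    data_id: str
    success: bool
    anomaly_time: Optional[float] = None
    status_code: Optional[str] = None
    weight: float = 0.0


# Assumed mapping from anomaly status codes to fault causes.
CAUSES = {
    "DISK_FULL": "insufficient disk space",
    "NET_ERR": "network anomaly",
}


def locate_faults(unrepaired_ids, records):
    """Return (fault observation point, cause) pairs, highest weight first."""
    wanted = set(unrepaired_ids)
    failures = [r for r in records if r.data_id in wanted and not r.success]
    failures.sort(key=lambda r: r.weight, reverse=True)
    return [(r.point, CAUSES.get(r.status_code, "unknown cause")) for r in failures]


records = [
    HitRecord("NodeA(AB)", "seg-1", True),
    HitRecord("NodeC(E)", "seg-1", False, 12.0, "DISK_FULL", 0.9),
    HitRecord("NodeA(AB)", "seg-2", True),
]
faults = locate_faults(["seg-1"], records)
```

Because each observation point has a preset position, the point name in each pair also identifies the fault location.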
[0075] In the above embodiment, firstly, the data identifiers of each data segment in the repair task are obtained. Then, the statistical data corresponding to each observation point is obtained, wherein each observation point is pre-set in each node of the distributed storage system. Next, based on each data identifier and the statistical data corresponding to each observation point, the first repair state of the repair task is determined. Finally, if the first repair state indicates that the repair task has a fault, the fault observation point is determined based on the statistical data corresponding to each observation point, and the cause of the fault in the repair task is determined based on the statistical data corresponding to the fault observation point. In this way, by obtaining the statistical data corresponding to the observation points set in each node of the distributed storage system to determine the first repair state of the repair task, and if the first repair state indicates that the repair task has a fault, the fault observation point is further determined to determine the fault location and cause of the fault in the repair task. This achieves monitoring of the running status of the repair task through preset observation points, eliminating the need for manual fault analysis and improving the efficiency of fault analysis for the repair task.
[0076] In the embodiments of this application, determining the first repair state of the repair task based on each data identifier and the statistical data corresponding to each observation point, as shown in Figure 4, includes:
[0077] Step 401: Determine the second repair status corresponding to each data segment based on the data identifiers and the statistical data corresponding to each observation point.
[0078] For example, the statistical data corresponding to each observation point may include the observation point name, number of hits, number of failed hits, data identifier, hit status, hit anomaly time, hit anomaly status code, and the influence weight of the observation point location. Among them, the hit status of the data segment at the observation point can be determined according to the data identifier, that is, whether the data segment is detected successfully or failed at the observation point. If the detection fails, the hit anomaly time and hit anomaly status code are recorded.
[0079] For any given data segment, the statistical data of each observation point is queried based on the data segment's data identifier to determine all the observation points the data segment has passed through. That is, the statistical data of the observation points includes the data identifier. Based on the statistical data of these observation points, the second repair status of the data segment can be determined.
[0080] Step 402: If all second repair states are repair completed, then the first repair state is determined to be repair completed.
[0081] Each data segment in the repair task can be repaired in parallel. If the second repair status of each data segment in the repair task is "repair completed", that is, all data segments have been successfully repaired, then the first repair status of the repair task is determined to be "repair completed".
[0082] Step 403: If any second repair status is repair failure, then the first repair status is determined to be repair failure.
[0083] If any data segment has a second repair status of repair failure, that is, a data segment has not been successfully repaired, then the repair task is determined to be incomplete and the first repair status is repair failure.
[0084] In the above embodiments, the repair data in the repair task is distributed in data segments as the smallest granularity. The repair of multiple data segments can be carried out concurrently to improve the efficiency of data repair. The first repair state of the repair task can be determined based on the second repair state of all data segments, thereby realizing the monitoring of the overall operation status of the repair task.
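Steps 402 and 403 amount to an aggregation over the second repair states, which can be sketched as follows. The `IN_PROGRESS` value is an assumption added to cover the case where no segment has failed but not all have completed; the text names only the completed and failed states.

```python
from enum import Enum


class RepairStatus(Enum):
    COMPLETE = "repair complete"
    FAILED = "repair failure"
    IN_PROGRESS = "in progress"  # assumption: not named in the text


def first_repair_status(second_statuses):
    """Derive the task-level status from the per-segment statuses."""
    statuses = list(second_statuses)
    # Step 403: any failed data segment makes the whole task a failure.
    if any(s is RepairStatus.FAILED for s in statuses):
        return RepairStatus.FAILED
    # Step 402: the task is complete only if every data segment is complete.
    if statuses and all(s is RepairStatus.COMPLETE for s in statuses):
        return RepairStatus.COMPLETE
    return RepairStatus.IN_PROGRESS
```

Checking for failure first means a single failed segment decides the result regardless of how many other segments have already completed.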
[0085] In one embodiment, as shown in Figure 5, determining the second repair state corresponding to each data segment includes:
[0086] Step 501: For each data segment, based on the data identifier corresponding to the data segment, determine the target observation point corresponding to the data segment from each observation point.
[0087] In this context, the target observation point is the observation point that the data segment passes through during the repair process. For ease of description, this embodiment uses a single data segment as an example, referred to as the target data segment, and the data identifier corresponding to the target data segment is called the target data identifier. Based on the data identifiers recorded at each observation point, if the data identifier recorded at an observation point includes the target data identifier, meaning that the target data segment passed through that observation point during the repair process, then that observation point is the target observation point.
[0088] Step 502: Determine the second repair state corresponding to the data segment based on the statistical data corresponding to the target observation point.
[0089] For example, the statistical data corresponding to the target observation point is parsed to determine the hit status of the target data identifier at that target observation point. If the hit status is "hit failure," meaning the target data segment fails to be detected at that target observation point, the second repair status corresponding to the target data segment is determined to be "repair failure." If the hit status is "hit success," meaning the target data segment is successfully detected at that target observation point, the process continues to check the next target observation point until all target observation points corresponding to the target data segment in the last stage of the repair process are successfully detected. Then, the second repair status corresponding to the target data segment is determined to be "repair complete."
[0090] In the above embodiments, the target observation point is determined by the data identifier corresponding to the data segment, and the second repair state corresponding to the data segment is determined according to the statistical data corresponding to the target observation point, thereby realizing the monitoring of the repair process of each data segment.
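Steps 501 and 502 can be sketched as a lookup followed by an ordered walk. The data layout (an ordered list of observation point names paired with a mapping from data identifier to hit status), the `final_points` set marking the last stage, and the "in progress" result are assumptions introduced for this sketch.

```python
def second_repair_status(target_id, point_stats, final_points):
    """Determine the repair status of one data segment.

    point_stats:  ordered list of (point_name, {data_id: hit_success}).
    final_points: names of observation points in the last stage (assumed input).
    """
    # Step 501: target observation points are those whose statistics
    # contain the target data identifier.
    targets = [
        (name, hits[target_id])
        for name, hits in point_stats
        if target_id in hits
    ]
    # Step 502: a hit failure at any target observation point means failure.
    for _name, success in targets:
        if not success:
            return "repair failure"
    # All target points passed, including one in the last stage.
    if any(name in final_points for name, _ in targets):
        return "repair complete"
    return "in progress"  # assumption: segment has not reached the last stage


stats = [
    ("NodeA(AB)", {"seg-1": True, "seg-2": True, "seg-3": True}),
    ("NodeC(E)", {"seg-1": True, "seg-2": False}),
    ("NodeB(FG)", {"seg-1": True}),
]
final = {"NodeA(FG)", "NodeB(FG)", "NodeC(FG)"}
```

With these sample statistics, `seg-1` reaches a last-stage point with every hit successful, `seg-2` fails at `NodeC(E)`, and `seg-3` has only been seen at the first point.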
[0091] In one embodiment, each observation point includes a cache area for storing its historical statistical data. To manage the data in the cache area, as shown in Figure 6, the method further includes:
[0092] Step 601: Divide the cache area into a preset number of sub-cache areas.
[0093] Each sub-cache is used to store statistical data for different time ranges. For example, the cache can be divided into three segments, or three sub-caches, according to different time ranges, to store statistical data for different time ranges respectively.
[0094] Step 602: For each sub-cache, when the statistical data stored in the sub-cache reaches the upper limit of the corresponding time range, the sub-cache is converted into the sub-cache corresponding to the next time range, and the statistical data in the sub-cache that exceeds the preset time range is recycled.
[0095] The system periodically monitors the statistical data stored in each sub-cache. If the statistical data stored in a sub-cache reaches the upper limit of its corresponding time range, the sub-cache is converted to the next sub-cache with a longer time range. If a sub-cache has exceeded its preset time range, the statistical data in that sub-cache is reclaimed, and the sub-cache is converted to the sub-cache with the shortest time range. For example, the cache can be divided into a first sub-cache, a second sub-cache, and a third sub-cache. The first sub-cache stores statistical data for one hour, the second sub-cache stores statistical data for two hours, and the third sub-cache stores statistical data for three hours. The preset time range is three hours; this time is just an example. When the statistical data in the first sub-cache reaches the upper limit of its corresponding time range (i.e., exceeds one hour), the first sub-cache is converted to the second sub-cache. When the statistical data in the second sub-cache reaches the upper limit of its corresponding time range (i.e., exceeds two hours), the second sub-cache is converted to the third sub-cache. When the statistical data in the third sub-cache exceeds the preset time range (i.e., exceeds three hours), the third sub-cache is converted back to the first sub-cache, and this conversion is repeated cyclically to manage the statistical data.
[0096] Alternatively, the expiration time of the statistical data stored in each sub-cache can be monitored periodically. If the statistical data has expired (i.e., exceeded the corresponding time range), the state of the sub-cache is changed. Each sub-cache cycles through three states, RUNNING, FINISH, and READY, so that each sub-cache stores statistical data for a different time range and expired statistical data is cleaned up periodically. The specific process may be as shown in Figure 7.
[0097] In the above embodiments, by setting up different sub-caches to store statistical data for different time ranges, and by reclaiming statistical data that exceeds the preset time range, the statistical data of each observation point can be managed.
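The cyclic conversion of sub-caches can be sketched as a ring of buckets. The following is a simplified illustration under stated assumptions rather than the implementation of this application: the class name `RotatingStatsCache`, the one-hour window, and the explicit `now` timestamp argument are all hypothetical, and the RUNNING, FINISH, and READY states are not modeled.

```python
from collections import deque
from typing import Any, Deque, List


class RotatingStatsCache:
    """Ring of sub-caches; index 0 is the youngest time range."""

    def __init__(self, num_sub_caches: int = 3,
                 window_seconds: float = 3600.0, start: float = 0.0) -> None:
        self.window = window_seconds
        self.window_start = start
        self.sub_caches: Deque[List[Any]] = deque(
            [] for _ in range(num_sub_caches))

    def _rotate(self, now: float) -> None:
        # Each elapsed window ages every sub-cache by one time range.
        while now - self.window_start >= self.window:
            oldest = self.sub_caches.pop()
            oldest.clear()                       # reclaim expired statistics
            self.sub_caches.appendleft(oldest)   # reuse as youngest sub-cache
            self.window_start += self.window

    def record(self, now: float, stat: Any) -> None:
        """Store one statistic in the youngest sub-cache."""
        self._rotate(now)
        self.sub_caches[0].append(stat)

    def snapshot(self) -> List[List[Any]]:
        """Return the contents ordered from youngest to oldest."""
        return [list(bucket) for bucket in self.sub_caches]


cache = RotatingStatsCache(num_sub_caches=3, window_seconds=3600.0)
cache.record(10.0, "a")
cache.record(3700.0, "b")
cache.record(7300.0, "c")
cache.record(10900.0, "d")   # "a" is now older than three hours
print(cache.snapshot())
```

Passing the timestamp explicitly keeps the sketch deterministic; a real system would more likely read a monotonic clock.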
[0098] In the embodiments of this application, determining the fault observation point based on the statistical data corresponding to each observation point, and determining the cause of the fault in the repair task based on the statistical data corresponding to the fault observation point, as shown in Figure 8, includes:
[0099] Step 801: Determine the abnormal information of the fault observation point based on the statistical data corresponding to the fault observation point.
[0100] After the fault observation point is identified, the statistical data of the fault observation point is parsed. The abnormal information of the fault observation point can include the data identifier of each data segment whose hit status is "hit failure," as well as the hit anomaly time and hit anomaly status code of that data segment.
[0101] Step 802: Determine the cause of the failure in the repair task based on the abnormal information.
[0102] By querying the preset status code mapping information based on the hit anomaly time and hit anomaly status code of the data segment, the reason for the data segment repair failure can be determined. The status code mapping information can include the correspondence between different judgment conditions, status codes, and fault causes, and can be established after pre-analysis of the judgment conditions.
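A minimal sketch of such a lookup is given below for illustration only. The judgment-condition names, the numeric status codes, and the `AbnormalInfo` structure are hypothetical; the text does not specify concrete codes or a storage format for the mapping information.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical mapping: (judgment condition, status code) -> fault cause.
STATUS_CODE_MAPPING: Dict[Tuple[str, int], str] = {
    ("disk_space_check", 1001): "insufficient disk space",
    ("network_check", 2001): "network anomaly",
    ("concurrency_check", 3001): "insufficient task concurrency",
}


@dataclass
class AbnormalInfo:
    """Abnormal information parsed from a fault observation point."""
    data_id: str
    anomaly_time: float      # hit anomaly time, e.g. a Unix timestamp
    condition: str           # judgment condition that was evaluated
    status_code: int         # hit anomaly status code


def fault_cause(info: AbnormalInfo) -> str:
    """Look up the fault cause; unmapped codes are reported as unknown."""
    key = (info.condition, info.status_code)
    return STATUS_CODE_MAPPING.get(key, "unknown cause")


info = AbnormalInfo("seg-2", 1_700_000_000.0, "disk_space_check", 1001)
print(fault_cause(info))
```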
[0103] Based on the location of fault observation points, the distribution of error locations in the repair process can be determined. The frequency of errors occurring at each observation point can be determined by the number of anomaly hits, thus generating a fault statistics chart for the repair task, including the location of fault observation points, error frequency, and fault causes. After determining the fault causes, a corresponding solution strategy can be generated based on a pre-trained fault decision model. This fault decision model can be obtained by training an initial neural network model based on sample fault causes and sample solution strategies. The aforementioned fault statistics chart, fault locations, fault causes, and solution strategies can all be displayed on the user interface.
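The error-frequency part of the fault statistics can be illustrated with a simple count of anomaly hits per observation point. This is a sketch under stated assumptions: the function name `fault_statistics`, the input format of (observation point, fault cause) pairs, and the output field names are hypothetical, and the fault decision model is not covered.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def fault_statistics(anomalies: List[Tuple[str, str]]) -> List[dict]:
    """Summarize anomaly hits as rows of a fault statistics chart.

    Each input item is (observation point name, fault cause).
    Rows are ordered from the most to the least frequent fault location.
    """
    frequency = Counter(point for point, _ in anomalies)
    causes: Dict[str, Set[str]] = {}
    for point, cause in anomalies:
        causes.setdefault(point, set()).add(cause)
    return [
        {"observation_point": point,
         "error_count": count,
         "causes": sorted(causes[point])}
        for point, count in frequency.most_common()
    ]


rows = fault_statistics([
    ("write_target", "insufficient disk space"),
    ("read_source", "network anomaly"),
    ("write_target", "insufficient disk space"),
    ("write_target", "network anomaly"),
])
for row in rows:
    print(row)
```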
[0104] In the above embodiments, determining the fault observation point allows the cause of the fault to be located efficiently and accurately. Combined with the preset fault decision model, a solution strategy is obtained, and the system operation fault is recovered based on that strategy, which helps ensure safe production.
[0105] In embodiments of this application, a fault analysis method is provided which, as shown in Figure 9, includes:
[0106] Step 901: Obtain the data identifiers of each data segment in the repair task.
[0107] Step 902: Obtain the statistical data corresponding to each observation point.
[0108] Step 903: For each data segment, based on the data identifier corresponding to the data segment, determine the target observation point corresponding to the data segment from each observation point.
[0109] Step 904: Determine the second repair state corresponding to the data segment based on the statistical data corresponding to the target observation point.
[0110] Step 905: If all second repair states are repair completed, then the first repair state is determined to be repair completed.
[0111] Step 906: If any second repair status is repair failure, then determine that the first repair status is repair failure.
[0112] Step 907: If the first repair status indicates that the repair task has a fault, determine the abnormal information of the fault observation point based on the statistical data corresponding to the fault observation point.
[0113] Step 908: Determine the cause of the failure in the repair task based on the abnormal information.
[0114] In the above embodiments, by acquiring the data identifiers of each data segment in the repair task and the statistical data corresponding to each observation point, the running status of the repair task can be monitored, and the fault analysis of the repair task can be realized without manual analysis, thus improving the efficiency of fault analysis.
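The aggregation in Steps 905 and 906 can be sketched as follows. This is an illustrative example only: the state strings and the `in_progress` result for a task whose segments have neither all completed nor failed are assumptions, since the text defines only the "repair completed" and "repair failure" outcomes.

```python
from typing import Iterable

REPAIR_COMPLETE = "repair_complete"
REPAIR_FAILURE = "repair_failure"
IN_PROGRESS = "in_progress"  # hypothetical state, not named in the text


def first_repair_state(second_states: Iterable[str]) -> str:
    """Combine per-segment (second) states into the task-level (first) state."""
    states = list(second_states)
    # Step 906: any failed segment marks the whole repair task as failed.
    if any(state == REPAIR_FAILURE for state in states):
        return REPAIR_FAILURE
    # Step 905: the task is complete only when every segment is complete.
    if states and all(state == REPAIR_COMPLETE for state in states):
        return REPAIR_COMPLETE
    return IN_PROGRESS


print(first_repair_state([REPAIR_COMPLETE, REPAIR_COMPLETE]))
print(first_repair_state([REPAIR_COMPLETE, REPAIR_FAILURE]))
```

Only when the result is a failure would Steps 907 and 908 be run to extract the abnormal information and determine the cause.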
[0115] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages in other steps. It is understood that the steps in different embodiments can be freely combined as needed, and all non-contradictory solutions formed by such combinations are within the scope of protection of this application.
[0116] Based on the same inventive concept, this application also provides a fault analysis apparatus for implementing the fault analysis method described above. The solution provided by this apparatus is similar to the implementation scheme described in the above method; therefore, the specific limitations in one or more fault analysis apparatus embodiments provided below can be found in the limitations of the fault analysis method described above, and will not be repeated here.
[0117] In one exemplary embodiment, as shown in Figure 10, a fault analysis device 1000 is provided, including: a first acquisition module 1001, a second acquisition module 1002, a first determination module 1003, and a second determination module 1004, wherein:
[0118] The first acquisition module 1001 is used to acquire the data identifiers of each data segment in the repair task;
[0119] The second acquisition module 1002 is used to acquire statistical data corresponding to each observation point, each observation point being a node preset in the distributed storage system;
[0120] The first determination module 1003 is used to determine the first repair status of the repair task based on each data identifier and the statistical data corresponding to each observation point.
[0121] The second determining module 1004 is used to determine the fault observation point based on the statistical data corresponding to each observation point if the first repair status indicates that the repair task has a fault, and to determine the cause of the fault in the repair task based on the statistical data corresponding to the fault observation point.
[0122] In one embodiment, the first determining module 1003 is specifically used to determine the second repair status corresponding to each data segment based on each data identifier and the statistical data corresponding to each observation point; if each second repair status is repair completed, then the first repair status is determined to be repair completed; if any second repair status is repair failed, then the first repair status is determined to be repair failed.
[0123] In one embodiment, the first determining module 1003 is specifically used to determine the target observation point corresponding to the data segment from each observation point based on the data identifier corresponding to the data segment for each data segment; and to determine the second repair state corresponding to the data segment based on the statistical data corresponding to the target observation point.
[0124] In one embodiment, the device further includes a third determining module, used to determine the statistical data corresponding to each observation point based on the parameter information corresponding to the judgment conditions of each observation point.
[0125] In one embodiment, each observation point includes a cache area, and the device also includes a cache recycling module for dividing the cache area into a preset number of sub-cache areas, each sub-cache area for storing statistical data within different time ranges; for each sub-cache area, when the statistical data stored in the sub-cache area reaches the upper limit of the corresponding time range, the sub-cache area is converted into the sub-cache area corresponding to the next time range, and the statistical data in the sub-cache area that exceeds the preset time range is recycled.
[0126] In one embodiment, the second determining module 1004 is specifically used to determine the abnormal information of the fault observation point based on the statistical data corresponding to the fault observation point; and to determine the cause of the fault in the repair task based on the abnormal information.
[0127] Each module in the aforementioned fault analysis device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in the processor of a computer device in hardware form or independent of it, or stored in the memory of a computer device in software form, so that the processor can call and execute the operations corresponding to each module.
[0128] In one exemplary embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 11. The computer device includes a processor, memory, an input/output (I/O) interface, and a communication interface. The processor, memory, and I/O interface are connected via a system bus, and the communication interface is also connected to the system bus via the I/O interface. The processor provides computational and control capabilities. The memory includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores the operating system, computer programs, and a database. The internal memory provides the environment for running the operating system and computer programs stored in the non-volatile storage medium. The database stores the statistical data of each observation point. The I/O interface is used for exchanging information between the processor and external devices. The communication interface is used for communication with external terminals via a network connection. When the computer program is executed by the processor, it implements a fault analysis method.
[0129] Those skilled in the art will understand that the structure shown in Figure 11 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.
[0130] In an exemplary embodiment, a computer device is provided, including a memory and a processor. The memory stores a computer program, and the processor executes the computer program to perform the following steps: obtaining data identifiers for each data segment in a repair task; obtaining statistical data corresponding to each observation point; each observation point is a node in a distributed storage system that is pre-set; determining a first repair state of the repair task based on each data identifier and the statistical data corresponding to each observation point; if the first repair state indicates that the repair task has a fault, determining a fault observation point based on the statistical data corresponding to each observation point, and determining the cause of the fault in the repair task based on the statistical data corresponding to the fault observation point.
[0131] In one embodiment, when the processor executes the computer program, it further implements the following steps: determining the second repair status corresponding to each data segment based on each data identifier and the statistical data corresponding to each observation point; if each second repair status is repair completed, then determining the first repair status as repair completed; if any second repair status is repair failed, then determining the first repair status as repair failed.
[0132] In one embodiment, when the processor executes the computer program, it further performs the following steps: for each data segment, based on the data identifier corresponding to the data segment, determine the target observation point corresponding to the data segment from each observation point; and determine the second repair state corresponding to the data segment based on the statistical data corresponding to the target observation point.
[0133] In one embodiment, when the processor executes the computer program, it further performs the following steps: determining the statistical data corresponding to each observation point based on the parameter information corresponding to the judgment conditions of each observation point.
[0134] In one embodiment, each observation point includes a cache area, and when the processor executes the computer program, it further implements the following steps: dividing the cache area into a preset number of sub-cache areas, each sub-cache area being used to store statistical data within different time ranges; for each sub-cache area, when the statistical data stored in the sub-cache area reaches the upper limit of the corresponding time range, converting the sub-cache area into a sub-cache area corresponding to the next time range, and reclaiming the statistical data in the sub-cache areas that exceed the preset time range.
[0135] In one embodiment, when the processor executes the computer program, it further performs the following steps: determining the abnormal information of the fault observation point based on the statistical data corresponding to the fault observation point; and determining the cause of the fault in the repair task based on the abnormal information.
[0136] In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When the computer program is executed by a processor, it performs the following steps: obtaining data identifiers for each data segment in a repair task; obtaining statistical data corresponding to each observation point; each observation point is a node in a distributed storage system that is pre-set; determining a first repair state of the repair task based on each data identifier and the statistical data corresponding to each observation point; if the first repair state indicates that the repair task has a fault, determining a fault observation point based on the statistical data corresponding to each observation point, and determining the cause of the fault in the repair task based on the statistical data corresponding to the fault observation point.
[0137] In one embodiment, when the computer program is executed by the processor, it further performs the following steps: determining the second repair status corresponding to each data segment based on each data identifier and the statistical data corresponding to each observation point; if each second repair status is repair completed, then determining the first repair status as repair completed; if any second repair status is repair failed, then determining the first repair status as repair failed.
[0138] In one embodiment, when the computer program is executed by the processor, it further performs the following steps: for each data segment, based on the data identifier corresponding to the data segment, determine the target observation point corresponding to the data segment from each observation point; and determine the second repair state corresponding to the data segment based on the statistical data corresponding to the target observation point.
[0139] In one embodiment, when the computer program is executed by the processor, it further performs the following steps: determining the statistical data corresponding to each observation point based on the parameter information corresponding to the judgment conditions of each observation point.
[0140] In one embodiment, each observation point includes a cache area, and when the computer program is executed by the processor, it further implements the following steps: dividing the cache area into a preset number of sub-cache areas, each sub-cache area being used to store statistical data within different time ranges; for each sub-cache area, when the statistical data stored in the sub-cache area reaches the upper limit of the corresponding time range, converting the sub-cache area into a sub-cache area corresponding to the next time range, and reclaiming the statistical data in the sub-cache areas that exceed the preset time range.
[0141] In one embodiment, when the computer program is executed by the processor, it further performs the following steps: determining the abnormal information of the fault observation point based on the statistical data corresponding to the fault observation point; and determining the cause of the fault in the repair task based on the abnormal information.
[0142] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, performs the following steps: obtaining data identifiers for each data segment in a repair task; obtaining statistical data corresponding to each observation point; each observation point being a node pre-set in a distributed storage system; determining a first repair state of the repair task based on each data identifier and the statistical data corresponding to each observation point; if the first repair state indicates a fault in the repair task, determining a fault observation point based on the statistical data corresponding to each observation point, and determining the cause of the fault in the repair task based on the statistical data corresponding to the fault observation point.
[0143] In one embodiment, when the computer program is executed by the processor, it further performs the following steps: determining the second repair status corresponding to each data segment based on each data identifier and the statistical data corresponding to each observation point; if each second repair status is repair completed, then determining the first repair status as repair completed; if any second repair status is repair failed, then determining the first repair status as repair failed.
[0144] In one embodiment, when the computer program is executed by the processor, it further performs the following steps: for each data segment, based on the data identifier corresponding to the data segment, determine the target observation point corresponding to the data segment from each observation point; and determine the second repair state corresponding to the data segment based on the statistical data corresponding to the target observation point.
[0145] In one embodiment, when the computer program is executed by the processor, it further performs the following steps: determining the statistical data corresponding to each observation point based on the parameter information corresponding to the judgment conditions of each observation point.
[0146] In one embodiment, each observation point includes a cache area, and when the computer program is executed by the processor, it further implements the following steps: dividing the cache area into a preset number of sub-cache areas, each sub-cache area being used to store statistical data within different time ranges; for each sub-cache area, when the statistical data stored in the sub-cache area reaches the upper limit of the corresponding time range, converting the sub-cache area into a sub-cache area corresponding to the next time range, and reclaiming the statistical data in the sub-cache areas that exceed the preset time range.
[0147] In one embodiment, when the computer program is executed by the processor, it further performs the following steps: determining the abnormal information of the fault observation point based on the statistical data corresponding to the fault observation point; and determining the cause of the fault in the repair task based on the abnormal information.
[0148] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with relevant regulations.
[0149] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile memory and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, artificial intelligence (AI) processors, etc., and are not limited to these.
[0150] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this application.
[0151] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.
Claims
1. A fault analysis method, characterized in that the method includes: obtaining the data identifier of each data segment in a repair task; obtaining statistical data corresponding to each observation point, each observation point being a node preset in a distributed storage system; determining a first repair status of the repair task based on the data identifiers and the statistical data corresponding to each observation point; and if the first repair status indicates that the repair task has a fault, determining a fault observation point according to the statistical data corresponding to each observation point, and determining the cause of the fault in the repair task according to the statistical data corresponding to the fault observation point.
2. The method according to claim 1, characterized in that the step of determining the first repair status of the repair task based on the data identifiers and the statistical data corresponding to each observation point includes: determining, based on the data identifiers and the statistical data corresponding to each observation point, the second repair status corresponding to each data segment; if each second repair status is repair complete, determining that the first repair status is repair complete; and if any second repair status is repair failure, determining that the first repair status is repair failure.
3. The method according to claim 2, characterized in that the step of determining the second repair status corresponding to each data segment based on each data identifier and the statistical data corresponding to each observation point includes: for each data segment, determining, based on the data identifier corresponding to the data segment, the target observation point corresponding to the data segment from among the observation points; and determining, based on the statistical data corresponding to the target observation point, the second repair status corresponding to the data segment.
4. The method according to claim 1, characterized in that the step of obtaining statistical data corresponding to each observation point includes: determining the statistical data corresponding to each observation point based on the parameter information corresponding to the judgment conditions of each observation point.
5. The method according to claim 1, characterized in that each observation point includes a cache area, and the method further includes: dividing the cache area into a preset number of sub-cache areas, each sub-cache area being used to store statistical data within a different time range; and for each sub-cache area, when the statistical data stored in the sub-cache area reaches the upper limit of the corresponding time range, converting the sub-cache area into the sub-cache area corresponding to the next time range, and reclaiming the statistical data in any sub-cache area that exceeds the preset time range.
6. The method according to claim 1, characterized in that the step of determining the fault observation point based on the statistical data corresponding to each observation point, and determining the cause of the fault in the repair task based on the statistical data corresponding to the fault observation point, includes: determining the abnormal information of the fault observation point based on the statistical data corresponding to the fault observation point; and determining the cause of the fault in the repair task based on the abnormal information.
7. A fault analysis device, characterized in that the device includes: a first acquisition module, used to acquire the data identifier of each data segment in a repair task; a second acquisition module, used to acquire statistical data corresponding to each observation point, each observation point being a node preset in a distributed storage system; a first determining module, used to determine a first repair status of the repair task based on the data identifiers and the statistical data corresponding to each observation point; and a second determining module, used to determine a fault observation point based on the statistical data corresponding to each observation point if the first repair status indicates that the repair task has a fault, and to determine the cause of the fault in the repair task based on the statistical data corresponding to the fault observation point.
8. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, the processor implements the steps of the method according to any one of claims 1 to 6.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 6.
10. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 6.