A slow disk detection method, device, storage medium and computer program product
By collecting multi-dimensional disk performance indicators in real time in a distributed storage system and calculating Mahalanobis distance, and combining multiple sampling results to identify slow disks, the problem of high false positive rate and missed detection in existing technologies is solved, and more accurate and stable slow disk detection is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHINA MOBILE (SUZHOU) SOFTWARE TECH CO LTD
- Filing Date
- 2026-02-28
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies suffer from high false positive rates or missed detections in slow disk detection, are difficult to adapt to fluctuations in disk performance over time and load, and lack effective identification methods that integrate features from multiple real-time sample datasets, resulting in insufficient detection accuracy and efficiency.
By acquiring multidimensional disk performance metrics at multiple time points to form a sample dataset, the Mahalanobis distance of each disk to the sample dataset is calculated to reflect its deviation from the normal disk group. The results of multiple samplings are combined to make a comprehensive judgment and identify slow disks.
It improves the accuracy and robustness of slow disk identification, avoids the impact of outliers at a single moment, and enhances the stability and performance of distributed storage systems.
[Figure: abstract drawing, CN122308724A]
Description
Technical Field
[0001] This application relates to the field of computer storage and data processing technology, and in particular to a method, device, storage medium and computer program product for detecting slow disks.
Background Technology
[0002] In distributed storage systems, as the number of nodes and disks increases, system stability and performance management become particularly important. Slow disks, as one of the key factors affecting overall system performance, require timely detection to ensure the efficient operation of the storage system. Traditional slow disk detection methods often rely on analyzing input/output (I/O) request service times, disk parameters, or historical data to identify disks with abnormal performance.
[0003] Among related technologies, some solutions dynamically set slow disk thresholds by statistically analyzing the average service time of different types of hard drives within a preset period; others rely on stress tests to build models for prediction; still others combine disk parameters to determine if a disk is slow; and some technologies identify anomalies by comparing the similarity between the current disk and historical disks. However, these methods have certain limitations in handling real-time performance, diversity, and changes in disk behavior.
[0004] The existing technologies mainly rely on single indicators or static thresholds, which adapt poorly to fluctuations in disk performance over time and under load, resulting in high false positive rates or missed detections. Furthermore, when data is collected from a large number of disks in parallel, there is no method that effectively identifies slow disks by integrating features from multiple real-time sample datasets, which limits further improvements in detection accuracy and efficiency.
Summary of the Invention
[0005] This application provides a method, device, storage medium, and computer program product for detecting slow disks, which can improve the accuracy of slow disk identification.
[0006] The technical solution of the embodiments of this application is implemented as follows:
In a first aspect, embodiments of this application provide a slow disk detection method, the method comprising:
obtaining M sample datasets, wherein each sample dataset is obtained by collecting, in real time and at a different collection point, the first indicator information of all first disks in the storage cluster, and M is a positive integer;
for each sample dataset, calculating a first distance between each second disk in the sample dataset and the sample dataset, wherein the first distance characterizes the similarity between each second disk and the center of the sample dataset, and the first disks include the second disks; and
determining the slow disk from the second disks based on the first distances.
[0007] In a second aspect, embodiments of this application provide a slow disk detection device, which includes a processor and a memory, wherein the memory is used to store computer programs that can run on the processor, and the processor is configured to execute the slow disk detection method described above when running the computer program.
[0008] In a third aspect, embodiments of this application provide a computer-readable storage medium storing computer program code which, when executed by a computer, implements the slow disk detection method described above.
[0009] In a fourth aspect, embodiments of this application provide a computer program product, including a computer program that, when executed by a processor, implements the slow disk detection method described above.
[0010] This application provides a slow disk detection method, device, storage medium, and computer program product. The method includes: acquiring M sample datasets, wherein each sample dataset is obtained by collecting, in real time and at a different collection point, the first indicator information of all first disks in a storage cluster, and M is a positive integer; for each sample dataset, calculating a first distance between each second disk in the sample dataset and the sample dataset, wherein the first distance characterizes the similarity between each second disk and the center of the sample dataset, and the first disks include the second disks; and determining the slow disk from the second disks based on the first distances. The embodiments of this application can therefore construct sample datasets by collecting multi-dimensional disk performance indicators in real time at multiple time points, and calculate the Mahalanobis distance (i.e., the first distance) of each disk to the sample dataset to reflect its deviation from the normal disk group. In a distributed storage system, normal disks of the same type should perform very similarly under the same business scenario, while slow disks deviate significantly because of performance degradation. The Mahalanobis distance can therefore effectively measure how far a disk lies from the normal disk group, which improves the accuracy of the judgment. Furthermore, combining multiple sampling results for a comprehensive judgment enhances the robustness of the judgment process and avoids the influence of outliers at a single moment.
Attached Figure Description
[0011] Figure 1 is a first schematic diagram of the slow disk detection method proposed in the embodiments of this application;
Figure 2 is a second schematic diagram of the slow disk detection method proposed in the embodiments of this application;
Figure 3 is a third schematic diagram of the slow disk detection method proposed in the embodiments of this application;
Figure 4 is a schematic diagram of the slow disk processing device proposed in the embodiments of this application;
Figure 5 is a first schematic diagram of the composition and structure of the slow disk detection device proposed in the embodiments of this application;
Figure 6 is a second schematic diagram of the composition and structure of the slow disk detection device proposed in the embodiments of this application.
Detailed Implementation
[0012] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are only for explaining the relevant application and not for limiting the application. Furthermore, it should be noted that, for ease of description, only the parts related to the relevant application are shown in the accompanying drawings.
[0013] In distributed storage systems, as the number of nodes and disks increases, system stability and performance management become particularly important. Slow disks, as one of the key factors affecting overall system performance, require timely detection to ensure the efficient operation of the storage system. Traditional slow disk detection methods often rely on analyzing I/O request service times, disk parameters, or historical data to identify disks with abnormal performance.
[0014] Among related technologies, some solutions dynamically set slow disk thresholds by statistically analyzing the average service time of different types of hard drives within a preset period; others rely on stress tests to build models for prediction; still others combine disk parameters to determine if a disk is slow; and some technologies identify anomalies by comparing the similarity between the current disk and historical disks. However, these methods have certain limitations in handling real-time performance, diversity, and changes in disk behavior.
[0015] The existing technologies mainly rely on single indicators or static thresholds, which adapt poorly to fluctuations in disk performance over time and under load, resulting in high false positive rates or missed detections. Furthermore, when data is collected from a large number of disks in parallel, there is no method that effectively identifies slow disks by integrating features from multiple real-time sample datasets, which limits further improvements in detection accuracy and efficiency.
[0016] To address the high false positive rates and missed detections of slow disks in the related technologies, this application provides a slow disk detection method, device, storage medium, and computer program product. The method includes: acquiring M sample datasets, where each sample dataset is obtained by collecting, in real time and at a different collection point, the first indicator information of all first disks in a storage cluster, and M is a positive integer; for each sample dataset, calculating a first distance between each second disk in the sample dataset and the sample dataset, where the first distance characterizes the similarity between each second disk and the center of the sample dataset, and the first disks include the second disks; and identifying slow disks from the second disks based on the first distances. The embodiments of this application can therefore construct sample datasets by collecting multi-dimensional disk performance indicators in real time at multiple time points and calculate the Mahalanobis distance (i.e., the first distance) from each disk to the sample dataset to reflect its deviation from the normal disk group. Since, in a distributed storage system, normal disks of the same type should perform very similarly under the same business scenario while slow disks deviate significantly because of performance degradation, the Mahalanobis distance can effectively measure how far a disk lies from the normal disk group, which improves the accuracy of the judgment. Furthermore, combining multiple sampling results for a comprehensive judgment enhances the robustness of the discrimination process and avoids the influence of outliers at a single moment.
[0017] The technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings.
[0018] This application provides a method for detecting slow disks. Figure 1 is a first schematic diagram of the slow disk detection method proposed in the embodiments of this application. As shown in Figure 1, the slow disk detection method may include the following steps:
Step 101: Obtain M sample datasets, where each sample dataset is obtained by collecting, in real time and at a different collection point, the first indicator information of all first disks in the storage cluster, and M is a positive integer.
[0019] In the embodiments of this application, the slow disk detection device can acquire M sample datasets.
[0020] It should be noted that, in the embodiments of this application, the slow disk detection device can be any terminal or device with storage and communication functions. For example, the slow disk detection device can be a personal computer (PC). This application does not specifically limit the type of slow disk detection device.
[0021] It should be noted that, in the embodiments of this application, the sample dataset can refer to a set of multi-dimensional performance indicators of all first disks in the storage cluster collected at a certain point in time, used to represent the performance status of the first disks of the entire cluster at that point in time. Each sample dataset comes from a different collection point, i.e., a different point in time, so that a statistical model across the first disks can be constructed. For example, in a distributed storage system, the input/output operations per second (IOPS), read/write bandwidth, latency, and other indicators of the first disks are collected every 5 minutes. The data collected each time constitutes one sample dataset, so M collections (e.g., M=10) form 10 sample datasets. These sample datasets can subsequently be used to calculate the Mahalanobis distance and thereby identify possible slow disks.
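The collection scheme in this paragraph can be sketched as a sliding window that keeps the most recent M sample datasets. This is a minimal illustration, not the applicant's implementation: the names `SampleDataset`, `collect_sample_datasets` and `fake_read_metrics` are hypothetical, and the metric layout (IOPS, bandwidth, latency) simply follows the example in the text.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class SampleDataset:
    """Indicators of all first disks at one collection point (hypothetical type)."""
    collected_at: float               # collection point, seconds since start
    metrics: Dict[str, List[float]]   # disk id -> [iops, bandwidth_mbps, latency_ms]


def collect_sample_datasets(
    read_metrics: Callable[[float], Dict[str, List[float]]],
    collection_points: Iterable[float],
    m: int,
) -> List[SampleDataset]:
    """Keep only the most recent m sample datasets, one per collection point."""
    window = deque(maxlen=m)
    for ts in collection_points:
        window.append(SampleDataset(collected_at=ts, metrics=read_metrics(ts)))
    return list(window)


def fake_read_metrics(ts: float) -> Dict[str, List[float]]:
    # Stand-in for a real collector; the values are illustrative only.
    return {"disk-a": [1200.0, 180.0, 4.0], "disk-b": [300.0, 40.0, 35.0]}


# One collection every 5 minutes (300 s), keeping the latest M = 10 datasets.
samples = collect_sample_datasets(fake_read_metrics, [i * 300.0 for i in range(12)], m=10)
```

A bounded `deque` is used here so that older collections drop out automatically once M datasets are held; the source does not say how the window is maintained.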
[0022] It should be noted that, in the embodiments of this application, the first disk refers to all disks whose performance indicators are collected in the current sampling period, which usually includes normal disks and potentially slow disks. This application does not specifically limit the types of disks included in the first disk.
[0023] It should be noted that, in the embodiments of this application, the first indicator information may include M-dimensional indicator information corresponding to each first disk. The M-dimensional indicator information may include multi-dimensional performance indicators collected on the first disk at a certain moment, such as the average response time of I/O requests, read and write bandwidth, throughput, number of input/output operations processed per second (TPS), etc. These multi-dimensional performance indicators together reflect the overall performance of the first disk. This application does not specifically limit the number and type of information included in the first indicator information.
[0024] It should be noted that, in the embodiments of this application, by collecting multidimensional performance indicators of the first disk from multiple collection points and constructing multiple sample datasets, basic data can be provided for subsequent similarity calculations, which helps to improve the accuracy and robustness of slow disk identification.
[0025] Step 102: For each sample dataset, calculate the first distance between each second disk in the sample dataset and the sample dataset; wherein the first distance is used to characterize the similarity between each second disk and the center of the sample dataset, and the first disk contains the second disk.
[0026] In the embodiments of this application, after acquiring M sample datasets, the slow disk detection device can calculate a first distance between each second disk in the sample dataset and the sample dataset for each sample dataset.
[0027] It should be noted that, in the embodiments of this application, the second disk may be a specific disk individual contained in a certain sample dataset, and this application does not specifically limit the type of the second disk.
[0028] It should be noted that, in the embodiments of this application, the first distance may refer to the Mahalanobis distance, a distance measure used to quantify the similarity between a data point and a dataset. It reflects the distance between a disk and the center of the current sample set. The smaller the Mahalanobis distance, the more similar the disk is to the sample set; conversely, a larger distance indicates that the disk may be abnormal (such as a slow disk). This application does not specifically limit the type of the first distance.
[0029] It should be noted that, in the embodiments of this application, by calculating the Mahalanobis distance of each disk to the current sample dataset, the consistency of the disk's performance with the overall performance of disks of the same type can be quantitatively evaluated, thereby determining whether it may be a slow disk. This method overcomes the limitations of traditional single-indicator judgment methods and improves the reliability of the judgment results.
[0030] Optionally, in an embodiment of this application, when the slow disk detection device calculates the first distance between each second disk in the sample dataset and the sample dataset, it can obtain the second indicator information corresponding to each second disk in the sample dataset; wherein, the second indicator information includes N-dimensional indicator information corresponding to the second disk, where N is a positive integer; then, a first vector can be determined based on the N-dimensional indicator information, and the first distance can be determined based on the first vector and the first indicator information.
[0031] It should be noted that, in the embodiments of this application, the second indicator information corresponding to the second disk is used to describe the performance of the second disk at a certain moment. The second indicator information may include N-dimensional indicator information corresponding to the second disk, such as disk read/write bandwidth, I/O request response time, number of input/output operations processed per second (TPS), disk queue depth, etc.
[0032] It should be noted that, in the embodiments of this application, when the slow disk detection device determines the first vector based on N-dimensional indicator information, it can combine the performance parameters of multiple dimensions in the N-dimensional indicator information into a vector data structure to facilitate subsequent distance calculation and similarity analysis. By integrating multi-dimensional indicators into a vector form, the overall performance status of the second disk can be reflected more comprehensively, rather than relying solely on a single indicator (such as average service time), thereby improving the accuracy of slow disk detection and avoiding misjudgments caused by anomalies in a single indicator.
[0033] In other words, in the embodiments of this application, multi-dimensional performance indicators are collected for each second disk, and a first vector is generated based on these indicators. This first vector is a quantitative expression of the performance status of the second disk. Then, the first vector can be compared with the mean vector of the sample dataset to calculate the Mahalanobis distance between the second disk and the overall sample set, in order to determine whether the second disk is an abnormal disk or a potentially slow disk. By fusing multi-dimensional indicators and calculating statistical distance, a comprehensive evaluation of the performance status of the second disk is achieved, thereby improving the reliability and stability of the detection results.
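As a small illustration of how N-dimensional indicator information could be packed into a first vector, the sketch below fixes an order for the indicators so that every disk's vector has the same layout. The indicator names and the helper `to_first_vector` are assumptions made for this example; the source only lists the kinds of indicators involved.

```python
from typing import Dict, List, Sequence

# Hypothetical indicator names; a fixed order keeps every disk's vector comparable.
METRIC_ORDER = ("read_write_bandwidth_mbps", "io_response_time_ms", "tps", "queue_depth")


def to_first_vector(indicators: Dict[str, float],
                    order: Sequence[str] = METRIC_ORDER) -> List[float]:
    """Arrange one disk's N-dimensional indicators into a vector of floats."""
    return [float(indicators[name]) for name in order]


vector = to_first_vector({
    "tps": 950,
    "queue_depth": 4,
    "io_response_time_ms": 6.5,
    "read_write_bandwidth_mbps": 180,
})
```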
[0034] Optionally, in embodiments of this application, when the slow disk detection device determines the first distance based on the first vector and the first indicator information, it can determine the second vector and the first matrix based on the M-dimensional indicator information of each first disk in the sample dataset; wherein, the second vector includes a mean vector, and the first matrix is used to characterize the linear correlation between the M-dimensional indicator information; then the first distance can be determined based on the first vector, the second vector, and the first matrix.
[0035] It should be noted that, in the embodiments of this application, the second vector can be the mean vector calculated from the M-dimensional index information of all first disks. The mean vector can represent the average performance of all first disks in each dimension within the current sampling period. The mean vector is the center position of the sample set and represents the typical behavior pattern of a normal first disk.
[0036] It should be noted that, in the embodiments of this application, the first matrix can be a covariance matrix, used to measure the linear correlation between different performance indicators. For example, if there is a strong negative correlation between the average service time of I/O requests and the TPS of a certain first disk, then this relationship will be reflected in the covariance matrix. The covariance matrix reflects the distribution characteristics between data points and helps to identify the first disk that deviates from the normal distribution.
[0037] It should be noted that, in the embodiments of this application, by introducing a mean vector and a covariance matrix, the degree of anomaly of the first disk can be evaluated in a statistical manner. Compared with comparison using only the original indicators, the method of introducing a mean vector and a covariance matrix is more robust, especially in the face of noise or transient performance fluctuations. The method of introducing a mean vector and a covariance matrix can effectively filter out interfering factors, thereby improving detection accuracy.
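The mean vector and covariance matrix described above can be computed from the indicator rows of one sample dataset as follows. This is a plain-Python sketch; the use of the unbiased (n-1) estimator is an assumption, since the source does not specify which covariance estimator is used.

```python
from typing import List


def mean_vector(rows: List[List[float]]) -> List[float]:
    """Per-dimension average over all disks in one sample dataset."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows: List[List[float]]) -> List[List[float]]:
    """Sample covariance between indicator dimensions (n-1 denominator, an assumption)."""
    n = len(rows)
    mu = mean_vector(rows)
    centered = [[v - m for v, m in zip(row, mu)] for row in rows]
    dims = len(mu)
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]


# Three disks, two indicators each: [service_time_ms, tps_in_hundreds] (illustrative values).
rows = [[1.0, 10.0], [2.0, 8.0], [3.0, 6.0]]
mu = mean_vector(rows)
cov = covariance_matrix(rows)
```

In this toy input the service time rises as TPS falls, so the off-diagonal covariance entries come out negative, which is the kind of relationship the paragraph on the first matrix refers to.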
[0038] For example, in an embodiment of this application, when the slow disk detection device determines the first distance based on the first vector, the second vector and the first matrix, it can calculate the first distance using the following formula (1).
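[0039] $$D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \, \Sigma^{-1} \, (\mathbf{x} - \boldsymbol{\mu})} \quad (1)$$
where $D_M(\mathbf{x})$ denotes the first distance, $\mathbf{x}$ denotes the first vector, $\boldsymbol{\mu}$ denotes the second vector, and $\Sigma$ denotes the covariance matrix (i.e., the first matrix).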
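The Mahalanobis distance between a first vector and the sample dataset can be sketched in plain Python as below. The helper names `solve_linear` and `mahalanobis` are hypothetical, and the sketch assumes the covariance matrix is invertible; the source does not say how a singular matrix (for example, perfectly correlated indicators) is handled, so this version simply raises an error.

```python
import math
from typing import List


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    y = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (aug[i][n] - tail) / aug[i][i]
    return y


def mahalanobis(x: List[float], mu: List[float], cov: List[List[float]]) -> float:
    """sqrt((x - mu)^T cov^{-1} (x - mu)), without forming the inverse explicitly."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    y = solve_linear(cov, diff)  # y = cov^{-1} (x - mu)
    return math.sqrt(max(0.0, sum(d * v for d, v in zip(diff, y))))


# With an identity covariance the result reduces to the Euclidean distance.
distance = mahalanobis([3.0, 4.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Solving the linear system instead of inverting the matrix is a design choice of this sketch for numerical stability, not something stated in the source.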
[0040] It should be noted that, in the embodiments of this application, the slow disk detection device can calculate the first distance between each second disk in the sample dataset and the sample dataset for each sample dataset. By calculating the Mahalanobis distance (i.e., the first distance), not only are the absolute values of various indicators of the first disk considered, but also the interrelationships between these indicators are combined, so as to more accurately identify those disks that deviate significantly from the normal range in the multidimensional space, which facilitates more accurate identification of slow disks, thereby improving the stability and business continuity of the distributed storage system.
[0041] Step 103: Determine the slow disk from each of the second disks based on each first distance.
[0042] In the embodiments of this application, after calculating a first distance between each second disk in the sample dataset and the sample dataset for each sample dataset, the slow disk detection device can determine the slow disk from each second disk based on each first distance.
[0043] Optionally, in an embodiment of this application, after calculating the first distance between each second disk in the sample dataset and the sample dataset for each sample dataset, the slow disk detection device can sort the first distance based on a preset sorting rule, and determine the slow disk in the second disk based on the sorted first distance.
[0044] It should be noted that, in the embodiments of this application, the preset sorting rule may refer to sorting by Mahalanobis distance from largest to smallest, so as to prioritize the identification of disks that differ significantly from the center of the sample set. For example, assuming that each sample dataset contains 100 disks, the disks are sorted by Mahalanobis distance from largest to smallest after calculation, the top L disks (e.g., L=5) are selected as suspected slow disks, and their suspected slow disk values are calculated and added to the suspected slow disk set (i.e., the first set).
[0045] Optionally, in an embodiment of this application, when the slow disk detection device determines the slow disk in the second disk based on the sorted first distance, it can select the first L first distances from the sorted first distances; where L is a positive integer; then it can calculate the first value of the third disk corresponding to each of the first L first distances; where the third disk is a part of the disks in the second disk, and the first value is used to determine whether the third disk is a slow disk; then the first value and the disk information of the third disk can be added to the first set, and the first set is used to determine whether the third disk is a slow disk.
[0046] For example, in the embodiments of this application, by arranging these distances (i.e., the first distance) in descending order and selecting the L largest distance values, the candidate disks most likely to be slow disks can be quickly located. The parameter L is a positive integer, and its specific value can be dynamically adjusted according to system scale, business type, and historical data distribution. For example, in high-concurrency scenarios, to ensure detection sensitivity, the parameter L can be set to a higher value; while in low-load scenarios, the value of the parameter L can be appropriately reduced to reduce computational overhead. This application does not specifically limit the size of L.
[0047] It should be noted that, in the embodiments of this application, the third disk may be a portion of the disks selected from the second disk set, and the third disk corresponds to the first L largest first distance values. This application does not specifically limit the number of third disks.
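The sorting and selection step can be sketched as follows: sort the first distances in descending order and keep the L largest. The function name `top_l_suspects` and the disk ids are hypothetical.

```python
from typing import Dict, List, Tuple


def top_l_suspects(distances: Dict[str, float], l: int) -> List[Tuple[str, float]]:
    """Return the l disks with the largest first distances, largest first."""
    ranked = sorted(distances.items(), key=lambda item: item[1], reverse=True)
    return ranked[:l]


# First distances of four second disks in one sample dataset (illustrative values).
suspects = top_l_suspects({"disk-a": 1.2, "disk-b": 5.0, "disk-c": 0.7, "disk-d": 3.1}, l=2)
```

The disks returned here play the role of the third disks in the text.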
[0048] It should be noted that, in the embodiments of this application, the first value is a quantitative indicator used to further evaluate the degree of disk anomaly. This value can be calculated in various ways, and this application does not specifically limit the calculation method of the first value.
[0049] It should be noted that, in the embodiments of this application, the first set can be a data structure for recording information about suspicious disks. The first set contains a first value corresponding to each suspicious disk and basic information about each suspicious disk (such as disk ID, node, last detection time, etc.). In subsequent processing, statistical analysis operations can be performed based on the data in the first set, specifically including calculating the number of times each disk has been added to the first set in the most recent M samples and the sum of the first values corresponding to that specific disk. This application does not specifically limit the type of information contained in the first set.
[0050] In other words, in the embodiments of this application, potentially abnormal disks are identified by calculating the distance between the disk sample and the center point; then, the degree of abnormality is further quantified by calculating a first value; finally, a first set is established, and data from multiple samplings are combined for comprehensive judgment, thereby improving the accuracy and stability of the detection.
[0051] For example, in an embodiment of this application, when the slow disk detection device calculates the first value of the third disk corresponding to the first L first distances, it can perform a first operation on the first L first distances to obtain a second value; then it can perform a second operation on the i-th first distance and the second value among the first L first distances to obtain a first value; wherein i is less than or equal to L.
[0052] For example, in the embodiments of this application, the first arithmetic processing may refer to performing specific mathematical operations on a set of data (i.e., the first L first distances) to extract certain statistical feature values or intermediate results. Such operations may include averaging, summing, taking the maximum value, the minimum value, weighted average, etc. For example, in the embodiments of this application, the summation of the first L first distances can be selected as the second value for subsequent judgment and comparison. This application does not specifically limit the type of the first arithmetic processing.
[0053] For example, in an embodiment of this application, when the slow disk detection device performs a second operation on the i-th first distance and the second value in the first L first distances to obtain the first value, it can be calculated by the following formula (2).
[0054] (2) [formula not preserved in the source text]
In formula (2), the terms denote, in order: the first value; the i-th of the L disk data items corresponding to the selected first distances; the squared Mahalanobis distance of that item; and the sum of the squared Mahalanobis distances (i.e., the first distances) of the first L disk data items selected.
[0055] It should be noted that, in the embodiments of this application, the slow disk detection device can first perform a first operation on the first L first distances to obtain a representative second value; then, it can perform a second operation on a specific first distance and the representative second value to obtain a first value used to determine performance anomalies. The above two steps complement each other. The first step provides global statistical features, and the second step focuses on local deviation analysis, thus together constituting an efficient and accurate disk performance anomaly detection mechanism.
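The two-stage calculation can be illustrated with the sketch below. Formula (2) is not legible in this text, so the sketch makes an explicit assumption: the first operation sums the squared first distances of the L selected disks, and the second operation divides each disk's squared distance by that sum. Other second operations (for example, a ratio to the mean) would be equally compatible with the description, so this is one possible reading rather than the applicant's formula.

```python
from typing import Dict, List, Tuple


def suspect_values(top_distances: List[Tuple[str, float]]) -> Dict[str, float]:
    """Assumed reading: value_i = d_i^2 / sum_j d_j^2 over the L selected disks."""
    second_value = sum(d * d for _, d in top_distances)  # first operation (assumed: sum of squares)
    if second_value == 0.0:
        return {disk: 0.0 for disk, _ in top_distances}
    # Second operation (assumed: division of the i-th squared distance by the second value).
    return {disk: (d * d) / second_value for disk, d in top_distances}


values = suspect_values([("disk-b", 3.0), ("disk-d", 4.0)])
```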
[0056] It should be noted that, in the embodiments of this application, after the slow disk detection device calculates the first value of the third disk corresponding to the first L first distances, it can add the first value and the disk information of the third disk to the first set, and determine whether the third disk is a slow disk based on the first set.
[0057] Optionally, in embodiments of this application, when the slow disk detection device determines whether a third disk is a slow disk based on the first set, it can count the first count, i.e., the number of times the same target disk appears in the first set, together with the third values corresponding to the target disk; wherein the third disks include the target disk, and the first values include the third values; then, it can determine whether the target disk is a slow disk based on the first count and the third values.
[0058] It should be noted that, in the embodiments of this application, the first count refers to the number of times the target disk is marked as a suspicious slow disk in the most recent M consecutive sampling indicators (i.e., the number of times it appears in the first set). By counting the first count, it can be determined whether the target disk continues to show abnormal performance in multiple sampling periods. For example, in a distributed storage system, if the target disk is marked as a suspicious slow disk in the most recent 5 samplings, then the first count is 5.
[0059] It should be noted that, in the embodiments of this application, the third value refers to the suspected slow disk value of the target disk in each sample. The third values are used to comprehensively determine whether a disk is a slow disk. By combining the first count with the suspected slow disk values, disks with real performance problems can be identified more accurately, avoiding misjudgment caused by the error of a single sampling.
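The statistics over the first set can be sketched as a simple aggregation: for each disk, count how often it appears and sum its suspected slow disk values. Representing the first set as a list of `(disk_id, value)` pairs is an assumption of this example; the source also mentions other fields such as node and last detection time, which are omitted here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def aggregate_first_set(
    first_set: List[Tuple[str, float]],
) -> Tuple[Dict[str, int], Dict[str, float]]:
    """Return (first count per disk, sum of third values per disk)."""
    counts: Dict[str, int] = defaultdict(int)
    sums: Dict[str, float] = defaultdict(float)
    for disk_id, value in first_set:
        counts[disk_id] += 1
        sums[disk_id] += value
    return dict(counts), dict(sums)


# Entries accumulated over several samplings (illustrative values).
counts, sums = aggregate_first_set([("disk-b", 0.6), ("disk-d", 0.4), ("disk-b", 0.7)])
```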
[0060] Optionally, in embodiments of this application, when the slow disk detection device determines whether a target disk is a slow disk based on the first count and each third value, it may determine that the target disk is a slow disk if the first count is greater than a first preset threshold and the sum of the third values is greater than a second preset threshold; or, if the first count is less than or equal to the first preset threshold, or the sum of the third values is less than or equal to the second preset threshold, it may determine that the target disk is not a slow disk.
[0061] It should be noted that, in the embodiments of this application, the first preset threshold refers to the threshold on the number of suspected slow disk occurrences set when determining whether the target disk is a slow disk. Only when the target disk is marked as a suspected slow disk more times than the first preset threshold in the most recent M consecutive samplings will the next step of judgment be performed. The first preset threshold is typically less than or equal to half of M (i.e., the number of samplings) to ensure the robustness of the judgment. For example, if M=10, the first preset threshold can be set to 5, meaning that further judgment on whether the target disk is a slow disk will only be performed if the target disk is marked as a suspected slow disk in more than 5 samplings. This application does not specifically limit the size of the first preset threshold.
[0062] It should be noted that, in the embodiments of this application, the second preset threshold refers to the threshold on the cumulative suspected slow disk value (i.e., the sum of the third values) set when determining whether the target disk is a slow disk. Only when the first count of the target disk exceeds the first preset threshold and the sum of the suspected slow disk values of the target disk is greater than the second preset threshold will it be finally marked as a slow disk. This application does not specifically limit the size of the second preset threshold.
[0063] For example, in an embodiment of this application, assume that disk B (i.e., the target disk) has been marked as a suspected slow disk 6 times in the most recent 10 samplings (i.e., the first count of disk B is 6), and the sum of the suspected slow disk values for disk B is 8.2. If the first preset threshold is set to 5 and the second preset threshold is set to 8, then the target disk B will be determined to be a slow disk. The mechanism formed based on the above judgment logic can effectively filter the influence of random fluctuations on the judgment result and further improve the accuracy of slow disk detection.
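The judgment in paragraph [0060], applied to the disk B example, can be sketched as below. This is an illustrative sketch, not the applicant's implementation: the function name `is_slow_disk` is hypothetical, the strict comparisons follow paragraph [0060], and the six individual values are invented so that they sum to roughly 8.2 (the application gives only the total).

```python
def is_slow_disk(first_count, third_values, count_threshold, sum_threshold):
    """Return True only if both conditions of paragraph [0060] hold.

    first_count     -- times the disk appears in the first set
    third_values    -- its suspected slow disk values
    count_threshold -- first preset threshold
    sum_threshold   -- second preset threshold
    """
    return first_count > count_threshold and sum(third_values) > sum_threshold

# Disk B: marked 6 times in the last 10 samplings.
# The individual values are hypothetical; only their sum (about 8.2) matters.
disk_b_values = [1.5, 1.2, 1.4, 1.3, 1.6, 1.2]
disk_b_is_slow = is_slow_disk(len(disk_b_values), disk_b_values, 5, 8)

# A disk marked exactly 5 times fails the strict count comparison,
# even though its summed value (8.5) exceeds the second threshold.
borderline_is_slow = is_slow_disk(5, [1.7] * 5, 5, 8)
```

Requiring both conditions means a disk that appears often with small values, or rarely with large ones, is not reported.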
[0064] It should be noted that, in the embodiments of this application, by calculating the suspected slow disk value based on the Mahalanobis distance and combining it with the first count and the preset thresholds for judgment, random interference factors can be filtered out, the accuracy of slow disk detection can be improved, and thus the operating efficiency of the distributed storage system can be optimized. In other words, this application can detect slow disks in a distributed storage system accurately and in real time without relying on historical data to train a model, thereby improving the performance and availability of the distributed storage system.
[0065] This application provides a slow disk detection method, which includes: acquiring M sample datasets; wherein each sample dataset is obtained by real-time collection of first indicator information of all first disks in a storage cluster based on different collection points, and M is a positive integer; for each sample dataset, calculating a first distance between each second disk in the sample dataset and the sample dataset; wherein the first distance is used to characterize the similarity between each second disk and the center of the sample dataset, and the first disk contains the second disk; and determining the slow disk from each second disk based on each first distance. Therefore, the embodiments of this application can construct sample datasets by real-time collection of multi-dimensional disk performance indicators at multiple time points, and calculate the Mahalanobis distance (i.e., the first distance) of each disk to the sample dataset to reflect its deviation from the normal disk group. In a distributed storage system, the performance of normal disks of the same type should be highly similar under the same business scenario, while slow disks deviate significantly due to performance degradation; the Mahalanobis distance can therefore effectively measure the distance between a disk and the normal disk group, thereby improving the accuracy of the judgment. Furthermore, combining multiple sampling results for comprehensive judgment can enhance the robustness of the judgment process and avoid the influence of outliers at a single moment.
[0066] Based on the above embodiments, another embodiment of this application provides a slow disk detection method. Figure 2 is a schematic diagram of the slow disk detection method proposed in the embodiments of this application. As shown in Figure 2, the slow disk detection method may include the following steps:
Step 201: Under load balancing of the distributed storage cluster and in different disk scenarios, collect multi-dimensional performance indicators of the cluster disks in real time;
Step 202: Measure the similarity of distributed storage disks of the same type at the same time based on the Mahalanobis distance. Sort the disks (i.e., the second disks) in descending order of their distance from the center of the disk sample set (i.e., the sample dataset). The farther a disk is from the center of the sample set, the greater its difference from the sample set, and the more likely it is to be a slow disk. Select the information of the first L disks according to the sorting (i.e., the disk information of the third disks), and calculate the suspected slow disk value. Add the disk information and the suspected slow disk value to the suspected slow disk set (i.e., the first set);
Step 203: After the most recent M consecutive rounds of disk information collection and calculation, obtain the complete suspected slow disk set;
Step 204: If the number of times a disk appears in the suspected slow disk set is greater than the first threshold (i.e., the first preset threshold), and the sum of its suspected slow disk values is greater than the second threshold (i.e., the second preset threshold), the disk is judged to be a slow disk.
[0067] Optionally, in embodiments of this application, Figure 3 is a schematic diagram of the slow disk detection method proposed in the embodiments of this application. As shown in Figure 3, the slow disk detection method may include the following steps:
Step 301: Collect the performance indicators of the disks (i.e., the first disks) of each node in the storage cluster;
Step 302: Select the most recent M consecutive collection points to form M sample datasets D(1), ..., D(M);
Step 303: Initialize variable i to 1;
Step 304: Determine whether i is less than or equal to M; if so, proceed to the next step; otherwise, jump to step 309;
Step 305: Calculate the mean μ (i.e., the second vector) and the covariance matrix Σ (i.e., the first matrix) of the sample dataset D(i);
Step 306: Calculate the squared Mahalanobis distance (i.e., the first distance) from each disk (i.e., the second disk) in D(i) to the sample dataset D(i);
Step 307: Sort the distances from largest to smallest, select the first L entries (i.e., the first L first distances), calculate the suspected slow disk value (i.e., the first value), and add the disk information and the suspected slow disk value to the suspected slow disk set T (i.e., the first set);
Step 308: Set i to i plus 1 and jump to step 304;
Step 309: For each disk (i.e., the target disk) in T, calculate the number of repetitions (i.e., the first count) and the sum of its suspected slow disk values (i.e., the third values), and record them in the result set R;
Step 310: Output the disks in R whose number of repetitions is greater than the first threshold and whose sum of suspected slow disk values is greater than the second threshold as slow disks, then jump to step 302; where the first threshold and the second threshold are both less than the number of samplings M.
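Steps 302 to 310 can be sketched end to end as follows. This is a hedged illustration under several stated assumptions, not the method as claimed: each disk is described by only two indicators so that the covariance matrix can be inverted in closed form; the "first operation" is assumed to be the mean of the top L distances and the "second operation" a ratio to that mean, since this section does not define them; and all names (`mean_and_cov`, `sq_mahalanobis`, `detect_slow_disks`) and all data are hypothetical.

```python
from collections import defaultdict

def mean_and_cov(rows):
    """Mean vector and sample covariance matrix of two-indicator rows (step 305)."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def sq_mahalanobis(x, mu, cov):
    """Squared Mahalanobis distance using the closed-form 2x2 inverse (step 306)."""
    (a, b), (_, d) = cov
    det = a * d - b * b  # non-zero unless the two indicators are perfectly correlated
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    return (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det

def detect_slow_disks(datasets, top_l, count_threshold, sum_threshold):
    """datasets: list of {disk_id: (indicator_1, indicator_2)}, one per collection point."""
    suspected = defaultdict(list)  # set T: disk -> suspected slow disk values
    for sample in datasets:  # loop of steps 304 to 308
        mu, cov = mean_and_cov(list(sample.values()))
        dist = {disk: sq_mahalanobis(x, mu, cov) for disk, x in sample.items()}
        top = sorted(dist, key=dist.get, reverse=True)[:top_l]  # step 307
        ref = sum(dist[d] for d in top) / len(top)  # ASSUMED first operation: mean
        for d in top:
            suspected[d].append(dist[d] / ref)  # ASSUMED second operation: ratio
    # Steps 309 and 310: count repetitions, sum values, apply both thresholds.
    return sorted(
        d for d, vals in suspected.items()
        if len(vals) > count_threshold and sum(vals) > sum_threshold
    )

# Hypothetical data: 4 collection points, 5 healthy disks and one degraded disk "sdf".
# Indicators are (latency in ms, utilization in %).
datasets = []
for k in range(4):
    sample = {
        f"sd{'abcde'[j]}": (4.8 + 0.1 * ((3 * j + k) % 5), 40.0 + (2 * j + k) % 5)
        for j in range(5)
    }
    sample["sdf"] = (30.0 + k, 45.0)
    datasets.append(sample)

slow = detect_slow_disks(datasets, top_l=2, count_threshold=2, sum_threshold=4.0)
```

With more than two indicators, the closed-form inverse would be replaced by a general matrix inverse or linear solve; the surrounding loop stays the same.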
[0068] For example, in an embodiment of this application, assume that disk B (i.e., the target disk) has been marked as a suspected slow disk 6 times in the most recent 10 samplings (i.e., the first count of disk B is 6), and the sum of the suspected slow disk values for disk B is 8.2. If the first preset threshold is set to 5 and the second preset threshold is set to 8, then the target disk B will be determined to be a slow disk. The mechanism formed based on the above judgment logic can effectively filter the influence of random fluctuations on the judgment result and further improve the accuracy of slow disk detection.
[0069] It should be noted that, in the embodiments of this application, by calculating the suspected slow disk value based on the Mahalanobis distance and combining it with the first count and the preset thresholds for judgment, random interference factors can be filtered out, the accuracy of slow disk detection can be improved, and thus the operating efficiency of the distributed storage system can be optimized. In other words, this application can detect slow disks in a distributed storage system accurately and in real time without relying on historical data to train a model, thereby improving the performance and availability of the distributed storage system.
[0070] For example, in the embodiments of this application, the Mahalanobis distance from a data point to a dataset is calculated using formula (1) above. The Mahalanobis distance takes into account the correlation between indicator dimensions and is scale-independent, that is, it is not affected by the measurement scale.
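The scale independence noted above can be checked numerically. The sketch below assumes the standard squared form D² = (x − μ)ᵀ Σ⁻¹ (x − μ), with μ and Σ the mean and covariance named in step 305; formula (1) itself lies outside this section, so that correspondence is an assumption. The function name `sq_mahalanobis_2d` and the data are hypothetical, and only two indicators are used so the inverse can be written in closed form.

```python
import math

def sq_mahalanobis_2d(x, rows):
    """Squared Mahalanobis distance from point x to a two-indicator dataset."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    a = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)            # var(indicator 1)
    d = sum((r[1] - my) ** 2 for r in rows) / (n - 1)            # var(indicator 2)
    b = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)   # covariance
    det = a * d - b * b
    dx, dy = x[0] - mx, x[1] - my
    return (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det

# Hypothetical indicators: (latency in milliseconds, utilization in percent).
rows_ms = [(5.0, 40.0), (5.2, 42.0), (4.8, 41.0),
           (5.1, 39.0), (4.9, 43.0), (30.0, 45.0)]
# Same measurements with latency expressed in microseconds.
rows_us = [(lat * 1000.0, util) for lat, util in rows_ms]

d_ms = sq_mahalanobis_2d(rows_ms[-1], rows_ms)
d_us = sq_mahalanobis_2d(rows_us[-1], rows_us)
unchanged = math.isclose(d_ms, d_us, rel_tol=1e-9)
```

A Euclidean distance on the same data would grow by orders of magnitude after the unit change, which is why indicators with different units can be combined here without manual normalization.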
[0071] Optionally, in embodiments of this application, Figure 4 is a schematic diagram of the slow disk processing device proposed in the embodiments of this application. As shown in Figure 4, the slow disk processing device may include: a disk information acquisition module 310, a disk information processing module 320, a suspected slow disk set acquisition module 330, and a slow disk marking module 340; wherein the disk information acquisition module 310 is used to acquire disk information in the current distributed cluster in real time, and the disk information includes the disk's own parameter values and disk-related parameter values; the disk information processing module 320 is used to process the acquired disk information, calculate and select the top L suspected slow disk values and the corresponding disk information, and add them to the suspected slow disk set; the suspected slow disk set acquisition module 330 is used to record the disk information and suspected slow disk values added to the suspected slow disk set over M samplings; and the slow disk marking module 340 is used to mark a disk as a slow disk when the disk is detected to meet the preset slow disk detection conditions.
[0072] This application provides a slow disk detection method, which includes: acquiring M sample datasets; wherein each sample dataset is obtained by real-time collection of first indicator information of all first disks in a storage cluster based on different collection points, and M is a positive integer; for each sample dataset, calculating a first distance between each second disk in the sample dataset and the sample dataset; wherein the first distance is used to characterize the similarity between each second disk and the center of the sample dataset, and the first disk contains the second disk; and determining the slow disk from each second disk based on each first distance. Therefore, the embodiments of this application can construct sample datasets by real-time collection of multi-dimensional disk performance indicators at multiple time points, and calculate the Mahalanobis distance (i.e., the first distance) of each disk to the sample dataset to reflect its deviation from the normal disk group. In a distributed storage system, the performance of normal disks of the same type should be highly similar under the same business scenario, while slow disks deviate significantly due to performance degradation; the Mahalanobis distance can therefore effectively measure the distance between a disk and the normal disk group, thereby improving the accuracy of the judgment. Furthermore, combining multiple sampling results for comprehensive judgment can enhance the robustness of the judgment process and avoid the influence of outliers at a single moment.
[0073] Based on the above embodiments, this application provides a slow disk detection device. Figure 5 is a first schematic diagram of the composition and structure of the slow disk detection device. As shown in Figure 5, the slow disk detection device 10 includes: an acquisition unit 11, a calculation unit 12, a sorting unit 13, and a determination unit 14; wherein the acquisition unit 11 is used to acquire M sample datasets, wherein each sample dataset is obtained by real-time collection of first indicator information of all first disks in the storage cluster based on different collection points, and M is a positive integer; the calculation unit 12 is used to calculate, for each sample dataset, a first distance between each second disk in the sample dataset and the sample dataset, wherein the first distance is used to characterize the similarity between each second disk and the center of the sample dataset, and the first disk contains the second disk; the sorting unit 13 is used to sort the first distances based on a preset sorting rule; and the determination unit 14 is used to determine the slow disk from each of the second disks based on each of the first distances.
[0074] Further, in the embodiments of this application, Figure 6 is a second schematic diagram of the composition and structure of the slow disk detection device. As shown in Figure 6, the slow disk detection device 10 proposed in the embodiments of this application may further include a processor 15 and a memory 16 storing instructions executable by the processor 15; further, the slow disk detection device 10 may also include a communication interface 17 and a bus 18 for connecting the processor 15, the memory 16, and the communication interface 17.
[0075] In the embodiments of this application, the processor 15 can be at least one of the following: Application-Specific Integrated Circuit (ASIC), Digital Signal Processor (DSP), Digital Signal Processing Device (DSPD), Programmable Logic Device (PLD), Field Programmable Gate Array (FPGA), Central Processing Unit (CPU), Controller, Microcontroller, and Microprocessor. It is understood that for different devices, the electronic device used to implement the above-mentioned processor function can also be other types, and this application embodiment does not specifically limit this. The slow disk detection device 10 may also include a memory 16, which can be connected to the processor 15. The memory 16 is used to store executable program code, which includes computer operation instructions. The memory 16 may include high-speed RAM memory and may also include non-volatile memory, such as at least two disk drives.
[0076] In embodiments of this application, bus 18 is used to connect communication interface 17, processor 15, and memory 16, and to enable mutual communication among these components.
[0077] In embodiments of this application, memory 16 is used to store instructions and data.
[0078] Further, in an embodiment of this application, the processor 15 is configured to acquire M sample datasets; wherein each sample dataset is obtained by real-time acquisition of first indicator information of all first disks in the storage cluster based on different acquisition points, and M is a positive integer; for each sample dataset, a first distance is calculated between each second disk in the sample dataset and the sample dataset; wherein the first distance is used to characterize the similarity between each second disk and the center of the sample dataset, and the first disk contains the second disk; and the slow disk is determined from each second disk based on each first distance.
[0079] In practical applications, the aforementioned memory 16 can be volatile memory, such as random-access memory (RAM); or non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or a combination of the above types of memory, and it provides instructions and data to the processor 15.
[0080] This application provides a slow disk detection device that acquires M sample datasets. Each sample dataset is obtained by real-time collection of first indicator information of all first disks in a storage cluster based on different collection points, where M is a positive integer. For each sample dataset, a first distance is calculated between each second disk in the sample dataset and the sample dataset. The first distance characterizes the similarity between each second disk and the center of the sample dataset, and the first disk includes the second disk. Slow disks are identified from each second disk based on each first distance. Therefore, this application can construct a sample dataset by real-time collection of multi-dimensional disk performance indicators at multiple time points and calculate the Mahalanobis distance (i.e., the first distance) of each disk to the sample dataset to reflect its deviation from the normal disk group. In a distributed storage system, the performance of normal disks of the same type should be highly similar under the same business scenario, while slow disks deviate significantly due to performance degradation. Therefore, Mahalanobis distance can effectively measure the distance between a disk and the normal disk group, thereby improving the accuracy of the judgment. Furthermore, combining multiple sampling results for comprehensive judgment can enhance the robustness of the judgment process and avoid the influence of outliers at a single moment.
[0081] This application provides a computer-readable storage medium storing a program thereon, which, when executed by a processor, implements the slow disk detection method described above.
[0082] Specifically, the program instructions corresponding to a slow disk detection method in this embodiment can be stored on storage media such as optical discs, hard disks, and USB flash drives. When the program instructions corresponding to a slow disk detection method in the storage media are read or executed by an electronic device, the following steps are included: Obtain M sample datasets; wherein each sample dataset is obtained in real time based on the first indicator information of all first disks in the storage cluster from different collection points, and M is a positive integer; For each sample dataset, a first distance is calculated between each second disk in the sample dataset and the sample dataset; wherein the first distance is used to characterize the similarity between each second disk and the center of the sample dataset, and the first disk contains the second disk; The slow disk is determined from each of the second disks based on each of the first distances.
[0083] This application also provides a computer program product, including a computer program that can be executed by the processor 15 of the slow disk detection device 10 to complete the steps described in any of the foregoing methods.
[0084] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of hardware embodiments, software embodiments, or embodiments combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage and optical storage) containing computer-usable program code.
[0085] This application is described with reference to flow diagrams and/or block diagrams of implementations of methods, apparatus (systems), and computer program products according to embodiments of this application. It should be understood that each flow and/or block in the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a device for implementing the functions specified in one or more flows of the flow diagram and/or one or more blocks of the block diagram.
[0086] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which implement the functions specified in one or more flows of the flow diagram and/or one or more blocks of the block diagram.
[0087] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flow diagram and/or one or more blocks of the block diagram.
[0088] The above description is merely a preferred embodiment of this application and is not intended to limit the scope of protection of this application.
Claims
1. A method for detecting slow disks, characterized in that the method includes: obtaining M sample datasets, wherein each sample dataset is obtained by real-time collection of first indicator information of all first disks in the storage cluster based on different collection points, and M is a positive integer; for each sample dataset, calculating a first distance between each second disk in the sample dataset and the sample dataset, wherein the first distance is used to characterize the similarity between each second disk and the center of the sample dataset, and the first disk contains the second disk; and determining the slow disk from each of the second disks based on each of the first distances.
2. The method according to claim 1, characterized in that calculating the first distance between each second disk in the sample dataset and the sample dataset includes: for each of the second disks in the sample dataset, obtaining the second indicator information corresponding to the second disk, wherein the second indicator information includes N-dimensional indicator information corresponding to the second disk, where N is a positive integer; and determining a first vector based on the N-dimensional indicator information, and determining the first distance based on the first vector and the first indicator information.
3. The method according to claim 2, characterized in that the first indicator information includes M-dimensional indicator information corresponding to each of the first disks, where M is a positive integer; and determining the first distance based on the first vector and the first indicator information includes: determining a second vector and a first matrix based on the M-dimensional indicator information of each of the first disks in the sample dataset, wherein the second vector contains a mean vector, and the first matrix is used to characterize the linear correlation between the M-dimensional indicator information; and determining the first distance based on the first vector, the second vector, and the first matrix.
4. The method according to claim 1, characterized in that determining the slow disk from each of the second disks based on each of the first distances includes: selecting the first L first distances from the sorted first distances, where L is a positive integer; calculating the first value of the third disk corresponding to each of the first L first distances, wherein the third disk is a portion of the second disks, and the first value is used to determine whether the third disk is a slow disk; and adding the first value and the disk information of the third disk to the first set, and determining whether the third disk is a slow disk based on the first set.
5. The method according to claim 4, characterized in that calculating the first value of the third disk corresponding to each of the first L first distances includes: performing a first operation on the first L first distances to obtain a second value; and performing a second operation on the i-th first distance among the first L first distances and the second value to obtain the first value, wherein i is less than or equal to L.
6. The method according to claim 4 or 5, characterized in that determining whether the third disk is a slow disk based on the first set includes: counting the number of times the same target disk appears in the first set (the first count) and the third values corresponding to the target disk, wherein the third disk includes the target disk and the first value includes the third value; and determining whether the target disk is a slow disk based on the first count and each of the third values.
7. The method according to claim 6, characterized in that determining whether the target disk is a slow disk based on the first count and each third value includes: if the first count is greater than the first preset threshold and the sum of the third values is greater than the second preset threshold, determining that the target disk is a slow disk; or, if the first count is less than or equal to the first preset threshold, or the sum of the third values is less than or equal to the second preset threshold, determining that the target disk is not a slow disk.
8. A slow disk detection device, characterized in that the slow disk detection device includes: a processor and a memory; wherein the memory is used to store a computer program that can run on the processor; and the processor is configured to perform the method as described in any one of claims 1-7 when running the computer program.
9. A computer-readable storage medium, characterized in that the storage medium stores computer program code, which, when executed by a computer, performs the method described in any one of claims 1-7.
10. A computer program product, comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1-7.