A slow disk detection method and device, electronic equipment and storage medium
A Gaussian model is built from the performance data of sample disks in the storage system and used to detect slow disks. This addresses the insufficient accuracy of slow disk detection in existing technologies and enables adaptive, real-time slow disk detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA TELECOM CLOUD TECH CO LTD
- Filing Date
- 2022-07-18
- Publication Date
- 2026-06-12
AI Technical Summary
In existing technologies, the method of setting I/O response time thresholds for slow disk detection has low accuracy and cannot adapt to the impact of factors such as hardware architecture differences, hardware aging, and business pressure.
Performance data is acquired for each disk, and the disks are divided into sample disks and disks to be tested. A Gaussian model is established from the performance data of the sample disks and is then used to detect slow disks among the disks to be tested, thereby improving detection accuracy.
It achieves accurate, adaptive, and real-time detection of slow disks, effectively identifying slow disks under hardware aging and changing business pressure, and ensuring the stable operation of the storage system.
Figure CN115269289B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of computer technology, and in particular to a method, apparatus, electronic device, and storage medium for detecting slow disks.
Background Technology
[0002] During disk use, due to factors such as magnetic degradation, bad sectors, or vibration, disks may experience slow I/O response and reduced performance; such disks are called slow disks. When a slow disk exists in a storage system, the read and write operations of the entire storage system will slow down, affecting the host's business processing efficiency and, in severe cases, even causing host service interruption. Therefore, it is necessary to monitor the disks in the storage system in real time to ensure the host's business processing efficiency.
[0003] In related technologies, slow disk detection typically employs a method of setting I/O response time thresholds. This involves acquiring the I/O response time of the disk under test over multiple testing cycles; if the I/O response time consistently exceeds a preset threshold, the disk is identified as slow. However, factors such as hardware architecture variations, hardware aging, business workloads, and even data center environments can all affect disk I/O response time. Therefore, the accuracy of slow disk detection using the method of setting I/O response time thresholds is relatively low.
Summary of the Invention
[0004] To address the existing technical problems, embodiments of this application provide a slow disk detection method, apparatus, electronic device, and storage medium, which can improve the accuracy of slow disk detection.
[0005] To achieve the above objectives, the technical solution of this application embodiment is implemented as follows:
[0006] In a first aspect, embodiments of this application provide a slow disk detection method, the method comprising:
[0007] For multiple disks of any one type in the server, obtain at least one set of performance data for each disk; each set of performance data corresponds to one preset disk performance metric.
[0008] Based on the performance data of each disk and the disk performance index corresponding to each set of performance data, at least one disk to be tested and at least one sample disk are determined from the plurality of disks respectively;
[0009] Based on the performance data of the at least one sample disk, establish a Gaussian model corresponding to each disk performance metric.
[0010] Slow disk detection is performed on each disk to be tested based on the Gaussian model corresponding to each disk performance metric.
[0011] The slow disk detection method provided in this application involves acquiring at least one set of performance data for each of multiple disks of any type in a server. Based on the performance data of each disk and the disk performance index corresponding to each set of performance data, at least one disk to be tested and at least one sample disk are determined from the multiple disks. A Gaussian model corresponding to each disk performance index is established based on the performance data of the at least one sample disk. Then, slow disk detection is performed on each disk to be tested based on the Gaussian model corresponding to each disk performance index. Since the performance data corresponding to the sample disks included in the server can be used to establish the Gaussian model, and the established Gaussian model is used to perform slow disk detection on the disk to be tested, it can accurately and effectively detect whether the disk to be tested is a slow disk, improving the accuracy of slow disk detection. Furthermore, it can adaptively and in real-time perform slow disk detection on multiple disks included in the server.
[0012] In one optional embodiment, determining at least one disk to be tested and at least one sample disk from the plurality of disks based on the performance data of each disk and the disk performance index corresponding to each set of performance data includes:
[0013] Based on the performance data of each disk, invalid disks that are faulty or unused are identified, and the invalid disks are removed from the plurality of disks, with the remaining disks being used as target disks;
[0014] For each target disk, the performance data of the target disk is weighted and summed according to the weight of the first indicator corresponding to each disk performance indicator to obtain the disk performance value of the target disk.
[0015] According to the disk performance value of each target disk in descending order, a set number of target disks are selected from the target disks, and the set number of target disks are used as the disks to be tested;
[0016] Use all or part of the target disks, excluding the disk to be tested, as sample disks.
[0017] In this embodiment, firstly, based on the performance data of each disk, invalid disks that are faulty or unused are identified, and these invalid disks are removed from the pool of disks. The remaining disks are then designated as target disks. Next, for each target disk, the performance data is weighted and summed according to the weight of a first indicator corresponding to each disk's performance metric to obtain the target disk's performance value. Finally, a predetermined number of target disks are selected from the target disks in descending order of their performance values. These predetermined number of target disks are designated as disks to be tested, and all or some of the target disks other than those to be tested are designated as sample disks. This allows for the accurate and reasonable identification of sample disks and disks to be tested from the multiple disks included in the server.
[0018] In one optional embodiment, the step of establishing a Gaussian model corresponding to each disk performance metric based on the performance data of the at least one sample disk includes:
[0019] For each disk performance metric, perform the following operations:
[0020] From the performance data of the at least one sample disk, select the first set of performance data corresponding to the disk performance index, and determine the mean and standard deviation of the multiple performance data contained in the first set of performance data respectively.
[0021] Based on the mean and standard deviation of multiple performance data included in the first set of performance data, a Gaussian model corresponding to the disk performance index is established.
[0022] In this embodiment, for each disk performance metric, a first set of performance data corresponding to that metric can be selected from the performance data of at least one sample disk. The mean and standard deviation of the multiple performance data points included in the first set are then determined. Based on the mean and standard deviation of the multiple performance data points included in the first set, a Gaussian model corresponding to that disk performance metric is established. Thus, for each disk performance metric, a reasonable Gaussian model that reflects the data distribution of the multiple performance data points corresponding to that metric can be established.
[0023] In one optional embodiment, slow disk detection is performed on each disk to be detected based on a Gaussian model corresponding to each disk performance metric, including:
[0024] For each disk to be tested, perform the following operations:
[0025] Each set of performance data of the disk to be tested is input into the Gaussian model corresponding to each disk performance index for testing, and the initial test value corresponding to each disk performance index is obtained.
[0026] Based on the weight of the second indicator corresponding to each disk performance indicator, the initial detection value corresponding to each disk performance indicator is weighted and summed to obtain the target detection value corresponding to the disk to be detected.
[0027] If the target detection value is greater than the set threshold, the disk to be detected is determined to be a slow disk.
[0028] In this embodiment, for each disk to be tested, each set of performance data can be input into a Gaussian model corresponding to each disk performance metric for testing, obtaining an initial detection value for each disk performance metric. Based on the weight of the second metric corresponding to each disk performance metric, the initial detection values for each disk performance metric are weighted and summed to obtain a target detection value for the disk to be tested. If the target detection value is greater than a set threshold, the disk to be tested is determined to be a slow disk. Since the Gaussian model corresponding to each disk performance metric can be established using the performance data of sample disks corresponding to that disk performance metric to test the performance data of the disk to be tested, it is possible to effectively and accurately determine whether the disk to be tested is a slow disk, improving the accuracy of slow disk detection.
[0029] In one optional embodiment, the step of inputting each set of performance data of the disk to be tested into a Gaussian model corresponding to each disk performance metric for testing, and obtaining an initial test value corresponding to each disk performance metric, includes:
[0030] For each disk performance metric, perform the following operations:
[0031] From the performance data of the disk to be tested, select the second set of performance data corresponding to the disk performance index, and take the average of the multiple performance data contained in the second set of performance data as the average of the second set of performance data.
[0032] Based on the Gaussian model corresponding to the disk performance metric, determine the performance threshold range corresponding to the disk performance metric;
[0033] If the mean of the second set of performance data is within the performance threshold range, then the initial detection value corresponding to the disk performance index is determined to be the set first detection value.
[0034] If the mean of the second set of performance data is outside the performance threshold range, then the initial detection value corresponding to the disk performance indicator is determined to be the set second detection value.
[0035] In this embodiment, for each disk performance metric, a second set of performance data corresponding to the disk performance metric can be selected from the performance data of the disk to be tested. The mean of multiple performance data points contained in the second set of performance data is used as the mean of the second set of performance data. Based on the Gaussian model corresponding to the disk performance metric, a performance threshold range corresponding to the disk performance metric is determined. If the mean of the second set of performance data is within the performance threshold range, the initial detection value corresponding to the disk performance metric is determined as a set first detection value. If the mean of the second set of performance data is outside the performance threshold range, the initial detection value corresponding to the disk performance metric is determined as a set second detection value. Because a Gaussian model corresponding to each disk performance metric is used to test the performance data of the disk to be tested, it is possible to accurately and reasonably determine whether the performance data of the disk to be tested conforms to the data distribution corresponding to each disk performance metric.
[0036] In a second aspect, embodiments of this application also provide a slow disk detection device, the device comprising:
[0037] The data acquisition unit is used to acquire at least one set of performance data for each of the multiple disks of any type in the server; each set of performance data corresponds to a set disk performance metric.
[0038] The data processing unit is used to determine at least one disk to be tested and at least one sample disk from the plurality of disks based on the performance data of each disk and the disk performance index corresponding to each set of performance data.
[0039] The model building unit is used to build a Gaussian model for each disk performance index based on the performance data of the at least one sample disk.
[0040] The slow disk detection unit is used to perform slow disk detection on each disk to be tested based on the Gaussian model corresponding to each disk performance metric.
[0041] In one optional embodiment, the data processing unit is specifically used for:
[0042] Based on the performance data of each disk, invalid disks that are faulty or unused are identified, and the invalid disks are removed from the plurality of disks, with the remaining disks being used as target disks;
[0043] For each target disk, the performance data of the target disk is weighted and summed according to the weight of the first indicator corresponding to each disk performance indicator to obtain the disk performance value of the target disk.
[0044] According to the disk performance value of each target disk in descending order, a set number of target disks are selected from the target disks, and the set number of target disks are used as the disks to be tested;
[0045] Use all or part of the target disks, excluding the disk to be tested, as sample disks.
[0046] In one optional embodiment, the model building unit is specifically used for:
[0047] For each disk performance metric, perform the following operations:
[0048] From the performance data of the at least one sample disk, select the first set of performance data corresponding to the disk performance index, and determine the mean and standard deviation of the multiple performance data contained in the first set of performance data respectively.
[0049] Based on the mean and standard deviation of multiple performance data included in the first set of performance data, a Gaussian model corresponding to the disk performance index is established.
[0050] In one optional embodiment, the slow disk detection unit is specifically used for:
[0051] For each disk to be tested, perform the following operations:
[0052] Each set of performance data of the disk to be tested is input into the Gaussian model corresponding to each disk performance index for testing, and the initial test value corresponding to each disk performance index is obtained.
[0053] Based on the weight of the second indicator corresponding to each disk performance indicator, the initial detection value corresponding to each disk performance indicator is weighted and summed to obtain the target detection value corresponding to the disk to be detected.
[0054] If the target detection value is greater than the set threshold, the disk to be detected is determined to be a slow disk.
[0055] In an optional embodiment, the slow disk detection unit is further configured to:
[0056] For each disk performance metric, perform the following operations:
[0057] From the performance data of the disk to be tested, select the second set of performance data corresponding to the disk performance index, and take the average of the multiple performance data contained in the second set of performance data as the average of the second set of performance data.
[0058] Based on the Gaussian model corresponding to the disk performance metric, determine the performance threshold range corresponding to the disk performance metric;
[0059] If the mean of the second set of performance data is within the performance threshold range, then the initial detection value corresponding to the disk performance index is determined to be the set first detection value.
[0060] If the mean of the second set of performance data is outside the performance threshold range, then the initial detection value corresponding to the disk performance indicator is determined to be the set second detection value.
[0061] In a third aspect, embodiments of this application also provide a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the slow disk detection method of the first aspect.
[0062] In a fourth aspect, embodiments of this application also provide an electronic device, including a memory and a processor, wherein the memory stores a computer program that can run on the processor, and when the computer program is executed by the processor, the processor implements the slow disk detection method of the first aspect.
[0063] The technical effects of any of the implementation methods in the second to fourth aspects can be found in the technical effects of the corresponding implementation methods in the first aspect, and will not be repeated here.
Description of the Drawings
[0064] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0065] Figure 1 is a flowchart of a slow disk detection method provided in an embodiment of this application;
[0066] Figure 2 is a flowchart for determining the disks to be tested and the sample disks, provided in an embodiment of this application;
[0067] Figure 3 is a flowchart for establishing a Gaussian model, provided in an embodiment of this application;
[0068] Figure 4 is a flowchart for performing slow disk detection on a disk to be tested, provided in an embodiment of this application;
[0069] Figure 5 is a flowchart of another slow disk detection method provided in an embodiment of this application;
[0070] Figure 6 is a schematic structural diagram of a slow disk detection device provided in an embodiment of this application;
[0071] Figure 7 is a schematic structural diagram of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0072] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0073] It should be noted that the terms "comprising" and "having" and their variations used in this application are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those steps or units that are explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to such process, method, product, or device.
[0074] The technical solutions provided in the embodiments of this application will now be described in detail with reference to the accompanying drawings.
[0075] The term "exemplary" as used below means "serving as an example, embodiment, or illustration." Any embodiment illustrated as "exemplary" is not necessarily to be construed as superior to or better than other embodiments. In the description of embodiments of this application, unless otherwise stated, "a plurality of" means two or more.
[0076] This application provides a slow disk detection method which, as shown in Figure 1, includes the following steps:
[0077] Step S101: For multiple disks of any one type in the server, obtain at least one set of performance data for each disk.
[0078] There can be many types of disks, such as hard disk drives (HDDs), solid state drives (SSDs), Non-Volatile Memory Express (NVMe) drives, and Optane drives.
[0079] Within a sampling period, for the multiple disks of any one type included in the server, the performance data of each of those disks can be collected.
[0080] After collecting performance data for each disk, the performance data for each disk can be divided into at least one set of performance data based on at least one set of disk performance metrics. Each set of performance data corresponds to a set disk performance metric.
[0081] For example, the disk performance metrics used in this application embodiment may include five performance metrics: r/s, w/s, avgqu-sz, await, and util. Specifically, r/s represents the number of read I/O operations completed per second, w/s represents the number of write I/O operations completed per second, avgqu-sz represents the average I/O queue length, await represents the average waiting time for each device I/O operation, and util represents the percentage of time per second spent on I/O operations, i.e., the proportion of time during which the device is busy handling I/O requests.
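The division of collected data into one set per metric can be sketched as follows. This is a minimal illustration, assuming each sample taken in the sampling period is a dictionary keyed by the five metric names above; the function name `group_by_metric` and the sample values are hypothetical and do not come from the application.

```python
# Metric names follow the five examples in the text.
METRICS = ("r/s", "w/s", "avgqu-sz", "await", "util")


def group_by_metric(samples):
    """Turn a list of per-sample dicts into {metric: [values...]}.

    Each resulting list is one "set of performance data" for one metric.
    """
    groups = {metric: [] for metric in METRICS}
    for sample in samples:
        for metric in METRICS:
            groups[metric].append(float(sample[metric]))
    return groups


# Hypothetical samples for a single disk (values are illustrative only).
samples = [
    {"r/s": 120.0, "w/s": 80.0, "avgqu-sz": 1.2, "await": 4.5, "util": 35.0},
    {"r/s": 110.0, "w/s": 95.0, "avgqu-sz": 1.6, "await": 5.1, "util": 41.0},
]
grouped = group_by_metric(samples)
```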
[0082] Step S102: Based on the performance data of each disk and the disk performance index corresponding to each set of performance data, determine at least one disk to be tested and at least one sample disk from multiple disks.
[0083] Specifically, the disk to be tested and the sample disk can be determined from the multiple disks according to the process shown in Figure 2. As shown in Figure 2, the process includes the following steps:
[0084] Step S201: Based on the performance data of each disk, identify invalid disks that are faulty or unused.
[0085] After collecting performance data for each disk in multiple disks of any type, invalid disks that are faulty or unused can be identified based on the performance data of each disk.
[0086] Specifically, if all the data values of the performance data of a certain disk collected are 0, it can be determined that the disk is faulty or unused, and the disk will be regarded as an invalid disk.
[0087] Step S202: Remove invalid disks from multiple disks and use the remaining disks as the target disks.
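Steps S201 and S202 can be sketched as a simple filter. This is a minimal sketch, assuming each disk's data is a mapping from metric name to a list of collected values; the names `is_invalid` and `select_target_disks` are hypothetical. Treating a disk with no collected values as invalid is also an assumption, since the text only covers the all-zero case.

```python
def is_invalid(disk_data):
    """Return True when every collected value of the disk is 0.

    A disk with no values at all is also treated as invalid here
    (assumption; all() over an empty sequence is True).
    """
    return all(value == 0 for values in disk_data.values() for value in values)


def select_target_disks(all_disks):
    """Drop invalid disks and keep the rest as target disks."""
    return {name: data for name, data in all_disks.items() if not is_invalid(data)}


# Hypothetical data: sdb reports only zeros and is therefore removed.
all_disks = {
    "sda": {"r/s": [12.0, 10.0], "await": [3.2, 2.9]},
    "sdb": {"r/s": [0.0, 0.0], "await": [0.0, 0.0]},
}
target_disks = select_target_disks(all_disks)
```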
[0088] Step S203: For each target disk, the performance data of the target disk is weighted and summed according to the weight of the first indicator corresponding to each disk performance indicator to obtain the disk performance value of the target disk.
[0089] For example, disk performance metrics may include five performance metrics: M1, M2, M3, M4, and M5. The first metric weight for each performance metric is set to 1. Then, based on the first metric weight corresponding to each disk performance metric, the performance data of a target disk is weighted and summed to obtain the disk performance value of the target disk as M1+M2+M3+M4+M5.
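The weighted sum in step S203 can be sketched as below. The text does not say how a set of several values per metric is reduced to a single number before weighting; this sketch assumes the mean of each set is used, and the function name `disk_performance_value` is hypothetical.

```python
from statistics import fmean


def disk_performance_value(disk_data, first_weights):
    """Weighted sum over metrics of the per-metric mean (assumed reduction)."""
    return sum(first_weights[metric] * fmean(values)
               for metric, values in disk_data.items())


# With every first metric weight set to 1, the result is M1 + M2 + ... + M5.
disk_data = {"M1": [1.0, 3.0], "M2": [2.0, 2.0], "M3": [0.5, 1.5],
             "M4": [4.0], "M5": [1.0]}
first_weights = {"M1": 1, "M2": 1, "M3": 1, "M4": 1, "M5": 1}
value = disk_performance_value(disk_data, first_weights)
```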
[0090] Step S204: Select a set number of target disks from the target disks according to the disk performance value of each target disk from high to low, and use the set number of target disks as the disks to be tested.
[0091] Step S205: Use all or part of the target disks, excluding the disk to be tested, as sample disks.
[0092] For example, the server contains six target disks of type HDD: D1, D2, D3, D4, D5, and D6. Based on the weight of the first indicator corresponding to each disk's performance metric, the performance data of each target disk is weighted and summed to obtain the disk performance value V1 of target disk D1 as 10, the disk performance value V2 of target disk D2 as 9, the disk performance value V3 of target disk D3 as 5, the disk performance value V4 of target disk D4 as 4, the disk performance value V5 of target disk D5 as 3.5, and the disk performance value V6 of target disk D6 as 3.
[0093] Assuming the quantity is set to 2, then target disks D1 and D2 can be used as disks to be tested, and target disks D3, D4, D5 and D6 can all be used as sample disks.
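The selection in steps S204 and S205 amounts to ranking the target disks by performance value and splitting the ranking. The following minimal sketch reproduces the D1 to D6 example; the function name `split_disks` is hypothetical, and it uses all remaining target disks as sample disks (the text also allows using only part of them).

```python
def split_disks(performance_values, set_number):
    """Rank disks from high to low value; the top ones are the disks to be tested."""
    ranked = sorted(performance_values, key=performance_values.get, reverse=True)
    return ranked[:set_number], ranked[set_number:]


# Disk performance values from the example in the text.
values = {"D1": 10, "D2": 9, "D3": 5, "D4": 4, "D5": 3.5, "D6": 3}
to_test, sample_disks = split_disks(values, 2)
```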
[0094] Step S103: Based on the performance data of at least one sample disk, establish a Gaussian model corresponding to each disk performance metric.
[0095] Specifically, for each disk performance metric, the Gaussian model corresponding to the disk performance metric can be established according to the process shown in Figure 3. As shown in Figure 3, the process includes the following steps:
[0096] Step S301: Select the first set of performance data corresponding to the disk performance index from the performance data of at least one sample disk.
[0097] Step S302: Determine the mean and standard deviation of the multiple performance data included in the first set of performance data.
[0098] For each disk performance metric, the mean μ and standard deviation σ of the multiple performance data included in the first set of performance data corresponding to that disk performance metric in at least one sample disk can be determined using the following formula:
[0099]
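For reference, a standard form of these two statistics is sketched below. The symbols $x_i$ (the $i$-th value in the first set of performance data) and $n$ (the number of values in the set) are introduced here for illustration, and the population convention (division by $n$ rather than $n-1$) is an assumption that the text does not specify.

```latex
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i ,
\qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(x_i - \mu\right)^2}
```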
[0100] Step S303: Based on the mean and standard deviation of the multiple performance data contained in the first set of performance data, establish a Gaussian model corresponding to the disk performance index.
[0101] The formula for the Gaussian model is: $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. That is, the Gaussian model satisfies a Gaussian distribution: $X \sim N(\mu, \sigma^2)$.
[0102] For example, for disk performance metric M1, based on the first set of performance data corresponding to disk performance metric M1 included in the performance data of at least one sample disk, the mean μ of the first set of performance data is determined to be 2, and the standard deviation σ is determined to be 4. Then, the Gaussian model G1 corresponding to disk performance metric M1 is established as follows: $f(x) = \frac{1}{4\sqrt{2\pi}}\exp\left(-\frac{(x-2)^2}{2\cdot 4^2}\right)$. That is, the Gaussian model G1 satisfies a Gaussian distribution: $X \sim N(2, 4^2)$.
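Steps S301 to S303 can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the values of one metric are pooled across all sample disks, the population standard deviation is used, and the names `GaussianModel`, `fit_gaussian`, and `fit_models` as well as the sample values are hypothetical.

```python
import math
from dataclasses import dataclass
from statistics import fmean, pstdev


@dataclass(frozen=True)
class GaussianModel:
    mu: float     # mean of the first set of performance data
    sigma: float  # standard deviation of the first set of performance data

    def pdf(self, x):
        """Gaussian density; undefined when sigma is 0 (all samples identical)."""
        z = (x - self.mu) / self.sigma
        return math.exp(-0.5 * z * z) / (self.sigma * math.sqrt(2.0 * math.pi))


def fit_gaussian(first_set):
    # Assumption: population standard deviation (divide by n).
    return GaussianModel(mu=fmean(first_set), sigma=pstdev(first_set))


def fit_models(sample_disks, metrics):
    """Pool each metric's values across the sample disks and fit one model per metric."""
    models = {}
    for metric in metrics:
        pooled = [v for disk in sample_disks.values() for v in disk[metric]]
        models[metric] = fit_gaussian(pooled)
    return models


# Hypothetical sample-disk data for one metric.
sample_disks = {
    "D3": {"M1": [2.0, 4.0, 4.0, 4.0]},
    "D4": {"M1": [5.0, 5.0, 7.0, 9.0]},
}
models = fit_models(sample_disks, ["M1"])
```

Fitting one model per metric keeps the later threshold check independent for each metric, which matches the per-metric initial detection values described in the following steps.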
[0103] Step S104: Perform slow disk detection on each disk to be tested based on the Gaussian model corresponding to each disk performance metric.
[0104] Specifically, for each disk to be tested, slow disk detection can be performed on the disk to be tested according to the process shown in Figure 4. As shown in Figure 4, the process includes the following steps:
[0105] Step S401: For each disk performance metric, select the second set of performance data corresponding to the disk performance metric from the performance data of the disk to be tested, and take the average of the multiple performance data contained in the second set of performance data as the average of the second set of performance data.
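[0106] The mean μ of the multiple performance data included in the second set of performance data can be determined by the following formula:
[0107] $\mu = \frac{1}{m}\sum_{j=1}^{m} x_j$, where $x_j$ is the $j$-th performance data value in the second set of performance data and $m$ is the number of values in that set.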
[0108] Step S402: Determine the performance threshold range corresponding to the disk performance indicator based on the Gaussian model corresponding to the disk performance indicator.
[0109] The Gaussian model corresponding to the disk performance metric follows a Gaussian distribution $X \sim N(\mu, \sigma^2)$, from which the performance threshold range corresponding to the disk performance metric can be determined as [μ-3σ, μ+3σ]. Here, μ is the mean of the multiple performance data points included in the first set of performance data, and σ is the standard deviation of the multiple performance data points included in the first set of performance data.
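The choice of three standard deviations corresponds to the well-known three-sigma rule for a Gaussian distribution. As a brief illustration (this justification is not stated in the application itself), a value drawn from the model falls inside the range with probability of about 99.73%, so a mean outside the range is unlikely for a disk that behaves like the sample disks:

```latex
P\left(\mu - 3\sigma \le X \le \mu + 3\sigma\right) \approx 0.9973,
\qquad X \sim N(\mu, \sigma^2)
```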
[0110] Step S403: If the mean value of the second set of performance data is within the performance threshold range, then the initial detection value corresponding to the disk performance indicator is determined to be the set first detection value; if the mean value of the second set of performance data is outside the performance threshold range, then the initial detection value corresponding to the disk performance indicator is determined to be the set second detection value.
[0111] For example, disk performance metrics may include five metrics: M1, M2, M3, M4, and M5. For performance metric M1, the mean μ1 of the second set of performance data corresponding to performance metric M1 in the disk performance data to be tested is determined to be 3. For performance metric M2, the mean μ2 of the second set of performance data corresponding to performance metric M2 in the disk performance data to be tested is determined to be 4. For performance metric M3, the mean μ3 of the second set of performance data corresponding to performance metric M3 in the disk performance data to be tested is determined to be 1. For performance metric M4, the mean μ4 of the second set of performance data corresponding to performance metric M4 in the disk performance data to be tested is determined to be 1.5. For performance metric M5, the mean μ5 of the second set of performance data corresponding to performance metric M5 in the disk performance data to be tested is determined to be 0.6.
[0112] Assume the Gaussian model corresponding to performance index M1 satisfies a Gaussian distribution $X \sim N(2, 0.2^2)$, the Gaussian model corresponding to performance index M2 satisfies a Gaussian distribution $X \sim N(3, 0.6^2)$, the Gaussian model corresponding to performance index M3 satisfies a Gaussian distribution $X \sim N(2, 0.4^2)$, the Gaussian model corresponding to performance index M4 satisfies a Gaussian distribution $X \sim N(3, 0.3^2)$, and the Gaussian model corresponding to performance index M5 satisfies a Gaussian distribution $X \sim N(3, 0.5^2)$.
[0113] Based on the Gaussian model corresponding to performance index M1, the performance threshold range for performance index M1 can be determined as [1.4, 2.6]; based on the Gaussian model corresponding to performance index M2, the performance threshold range for performance index M2 can be determined as [1.2, 4.8]; based on the Gaussian model corresponding to performance index M3, the performance threshold range for performance index M3 can be determined as [0.8, 3.2]; based on the Gaussian model corresponding to performance index M4, the performance threshold range for performance index M4 can be determined as [2.1, 3.9]; based on the Gaussian model corresponding to performance index M5, the performance threshold range for performance index M5 can be determined as [1.5, 4.5].
[0114] Since the mean μ1 of the second set of performance data corresponding to performance index M1 is 3, which is outside the performance threshold range [1.4, 2.6] corresponding to performance index M1, the initial detection value corresponding to performance index M1 can be determined to be 1; since the mean μ2 of the second set of performance data corresponding to performance index M2 is 4, which is within the performance threshold range [1.2, 4.8] corresponding to performance index M2, the initial detection value corresponding to performance index M2 can be determined to be 0; since the mean μ3 of the second set of performance data corresponding to performance index M3 is 1, which is within the performance threshold range [0.8, 3.2] corresponding to performance index M3, the initial detection value corresponding to performance index M3 can be determined to be 0; since the mean μ4 of the second set of performance data corresponding to performance index M4 is 1.5, which is outside the performance threshold range [2.1, 3.9] corresponding to performance index M4, the initial detection value corresponding to performance index M4 can be determined to be 1; since the mean μ5 of the second set of performance data corresponding to performance index M5 is 0.6, which is outside the performance threshold range [1.5, 4.5] corresponding to performance index M5, the initial detection value corresponding to performance index M5 can be determined to be 1.
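The worked example for steps S401 to S403 can be reproduced with a short sketch. The function names `threshold_range` and `initial_detection_value` are hypothetical; the first and second detection values are taken as 0 and 1 as in the example, and the model parameters and means are the ones given in the text.

```python
def threshold_range(mu, sigma, k=3.0):
    """Performance threshold range [mu - k*sigma, mu + k*sigma], with k = 3 as in the text."""
    return mu - k * sigma, mu + k * sigma


def initial_detection_value(second_set_mean, mu, sigma,
                            first_value=0, second_value=1):
    """Return first_value when the mean lies inside the range, else second_value."""
    low, high = threshold_range(mu, sigma)
    return first_value if low <= second_set_mean <= high else second_value


# (mu, sigma) of the Gaussian model for each metric, from the example.
models = {"M1": (2, 0.2), "M2": (3, 0.6), "M3": (2, 0.4),
          "M4": (3, 0.3), "M5": (3, 0.5)}
# Means of the second sets of performance data of the disk to be tested.
means = {"M1": 3, "M2": 4, "M3": 1, "M4": 1.5, "M5": 0.6}

initial = {m: initial_detection_value(means[m], *models[m]) for m in models}
```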
[0115] Step S404: Based on the weight of the second indicator corresponding to each disk performance indicator, the initial detection values corresponding to each disk performance indicator are weighted and summed to obtain the target detection value corresponding to the disk to be detected.
[0116] For example, disk performance metrics may include five performance metrics: M1, M2, M3, M4, and M5. The second indicator weight is set to 1 for performance metrics M1, M2, and M3, and the second indicator weight is set to 3.5 for performance metrics M4 and M5. Then, based on the second indicator weight corresponding to each disk performance metric, the initial detection value corresponding to each disk performance metric is weighted and summed to obtain the target detection value for the disk to be tested as 1+0+0+3.5+3.5=8.
[0117] Step S405: If the target detection value corresponding to the disk to be detected is greater than the set threshold, then the disk to be detected is determined to be a slow disk.
[0118] For example, assuming the set threshold is 7, the target detection value of the disk to be detected obtained by the weighted summation above is 8, which is greater than the set threshold of 7. Therefore, the disk to be detected can be determined to be a slow disk.
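Steps S404 and S405 of this example can be checked with a minimal Python sketch. The initial detection values, second indicator weights, and set threshold are those given in the text; the names `target_detection_value` and `SET_THRESHOLD` are assumptions introduced for illustration.

```python
# Initial detection values and second indicator weights from the example.
initial_values = {"M1": 1, "M2": 0, "M3": 0, "M4": 1, "M5": 1}
second_weights = {"M1": 1, "M2": 1, "M3": 1, "M4": 3.5, "M5": 3.5}
SET_THRESHOLD = 7


def target_detection_value(values, weights):
    """Weighted sum of the per-metric initial detection values."""
    return sum(weights[metric] * values[metric] for metric in values)


target = target_detection_value(initial_values, second_weights)
is_slow = target > SET_THRESHOLD
print(target, is_slow)  # 8.0 True
```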
[0119] In one embodiment, the slow disk detection method provided in this application can also be implemented according to the process shown in Figure 5. As shown in Figure 5, the process includes the following steps:
[0120] Step S501: Obtain performance data for each disk in at least one disk included in the server.
[0121] Disk performance data is collected via a scheduled task, and performance data for all disks on the server is obtained within a sampling period.
[0122] Optionally, the iostat system tool can be used to collect disk performance data in real time. The iostat system tool is primarily used to output statistics on disk I/O and CPU usage.
[0123] Step S502: Determine the disk type of each disk based on the performance data of each disk.
[0124] The collected disk performance data includes the disk type, so the disk type of each disk can be determined based on its performance data.
[0125] The disk types can include HDD, SSD, NVMe, and OPTANE.
[0126] Step S503: For the performance data of multiple disks corresponding to any disk type, determine the invalid disks that are faulty or unused.
[0127] For each disk type, the performance data of multiple disks can be processed to identify faulty or unused invalid disks.
[0128] For example, a server contains 80 disks of four types: HDD, SSD, NVMe, and OPTANE. Based on the performance data of each disk, it can be determined that 20 disks are HDDs, 10 are SSDs, 25 are NVMe, and 25 are OPTANE. The performance data of the 20 HDD disks, 10 SSD disks, 25 NVMe disks, and 25 OPTANE disks can then be processed separately to identify any faulty or unused invalid disks within each category.
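The per-type grouping in this example can be sketched in Python as follows. The record layout (`name`, `type`) and the helper `group_by_type` are hypothetical; only the disk types and counts come from the text.

```python
from collections import defaultdict

# Hypothetical inventory matching the counts in the example above.
type_counts = {"HDD": 20, "SSD": 10, "NVMe": 25, "OPTANE": 25}
disks = [
    {"name": f"{disk_type.lower()}-{i}", "type": disk_type}
    for disk_type, count in type_counts.items()
    for i in range(count)
]


def group_by_type(disk_records):
    """Group disk records by their reported disk type."""
    groups = defaultdict(list)
    for record in disk_records:
        groups[record["type"]].append(record)
    return dict(groups)


groups = group_by_type(disks)
for disk_type, members in groups.items():
    print(disk_type, len(members))
```

Each group can then be processed on its own, so that disks are only compared with disks of the same type.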
[0129] Step S504: Remove invalid disks from the multiple disks and use the remaining disks as the target disks.
[0130] Since the performance data of invalid disks is invalid and will interfere with the subsequent process of building Gaussian models and detecting slow disks, it is necessary to remove invalid disks from multiple disks and use the remaining disks as target disks.
[0131] Step S505: Based on the weight of the first indicator corresponding to each disk performance indicator, perform a weighted summation on the performance data of each target disk to obtain the disk performance value of each target disk.
[0132] Step S506: Select a set number of target disks from the target disks according to the disk performance value of each target disk from high to low, and use the set number of target disks as the disks to be tested.
[0133] Optionally, after obtaining the disk performance value of each target disk, the target disk with the highest disk performance value can be selected as the disk to be tested.
[0134] Step S507: Use all or part of the target disks, excluding the disk to be tested, as sample disks.
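Steps S505 to S507 can be illustrated with the following Python sketch. It assumes that each target disk is summarized by one value per disk performance metric and that all remaining target disks become sample disks; the metric values, first indicator weights, and function names are hypothetical.

```python
def disk_performance_value(metric_values, first_weights):
    """Weighted sum of one disk's per-metric values (step S505)."""
    return sum(first_weights[m] * metric_values[m] for m in first_weights)


def split_test_and_sample(target_disks, first_weights, set_number=1):
    """Rank disks from high to low and split them (steps S506 and S507)."""
    ranked = sorted(
        target_disks,
        key=lambda name: disk_performance_value(target_disks[name], first_weights),
        reverse=True,
    )
    return ranked[:set_number], ranked[set_number:]


# Hypothetical per-disk metric values and first indicator weights.
first_weights = {"M1": 1.0, "M2": 2.0}
target_disks = {
    "sda": {"M1": 2.0, "M2": 1.0},  # performance value 4.0
    "sdb": {"M1": 9.0, "M2": 4.0},  # performance value 17.0
    "sdc": {"M1": 2.5, "M2": 1.5},  # performance value 5.5
}
to_test, samples = split_test_and_sample(target_disks, first_weights, set_number=1)
print(to_test, samples)  # ['sdb'] ['sdc', 'sda']
```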
[0135] Step S508: Based on the performance data of the sample disks, establish a Gaussian model corresponding to each disk performance metric.
[0136] Assuming there are five disk performance metrics: M1, M2, M3, M4, and M5, the performance data of the sample disks can be divided into five groups: sample data group M1, sample data group M2, sample data group M3, sample data group M4, and sample data group M5. Each sample data group contains N performance data points. Therefore, the performance data of the sample disks can be represented as:
[0137]
[0138] Then, based on the five sets of sample data obtained from the division, five Gaussian models can be established.
[0139] There are two common types of Gaussian modeling: the first is to perform one-dimensional Gaussian modeling on a single set of data, and the second is to perform multi-dimensional Gaussian modeling on multiple sets of data.
[0140] The first type of Gaussian modeling requires multiple iterations to collect enough sample data to ensure the accuracy of the Gaussian model. However, in the context of distributed storage systems, the timeliness of the multi-cycle iterative modeling method cannot meet the requirements, and it cannot detect slow disks in a timely manner or avoid the impact of slow disks on the disk cluster.
[0141] The second type of Gaussian modeling is complex and difficult to implement when performing Gaussian modeling in 3D or higher dimensions.
[0142] Therefore, this application embodiment uses multiple sets of sample data to establish multiple Gaussian models, that is, a Gaussian model is established based on each set of sample data. This technique is simple to implement, requiring only a single-cycle performance data collection. For a single Gaussian model, the sample data collected within a single cycle will sacrifice some of the model's accuracy; however, multiple Gaussian models reflect the current disk's workload and response speed in multiple dimensions. Using multiple Gaussian models to evaluate disk performance can ensure timeliness without sacrificing accuracy. Furthermore, the Gaussian models automatically adjust according to the hardware environment and business scenarios, demonstrating adaptability.
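Building one Gaussian model per sample data group can be sketched in Python as follows. The sample values are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not state which estimator is used.

```python
from statistics import fmean, pstdev


def build_gaussian_models(sample_groups):
    """Map each metric to (mu, sigma) estimated from its sample data group."""
    return {
        metric: (fmean(values), pstdev(values))
        for metric, values in sample_groups.items()
    }


# Hypothetical sample data groups collected within a single cycle.
sample_groups = {
    "M1": [2, 4, 4, 4, 5, 5, 7, 9],
    "M2": [1.0, 2.0, 3.0],
}
models = build_gaussian_models(sample_groups)
for metric, (mu, sigma) in models.items():
    print(metric, mu, sigma)
```

Because each model needs only the mean and standard deviation of one group, all models can be rebuilt from the data of a single collection cycle.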
[0143] Step S509: For each disk to be tested, determine the initial test value corresponding to each disk performance indicator based on the Gaussian model corresponding to each disk performance indicator.
[0144] The Gaussian model corresponding to each disk performance metric follows a Gaussian distribution X ~ N(μ, σ²), where μ and σ are determined based on the sample data group of the sample disks corresponding to that disk performance metric.
[0145] Based on the Gaussian model corresponding to each disk performance metric, the corresponding performance threshold range for each disk performance metric can be determined as [μ-3σ, μ+3σ].
[0146] For each disk performance metric, if the mean of the performance data of the disk under test corresponding to that disk performance metric is within the corresponding performance threshold range of that disk performance metric, then the initial detection value corresponding to that disk performance metric can be determined to be 0. If the mean of the performance data of the disk under test is outside the corresponding performance threshold range of that disk performance metric, then the initial detection value corresponding to that disk performance metric can be determined to be 1.
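The threshold range and the resulting initial detection value can be sketched in Python as follows. The model parameters and measurements are hypothetical, and the function names are introduced only for illustration.

```python
from statistics import fmean


def threshold_range(mu, sigma):
    """Performance threshold range [mu - 3*sigma, mu + 3*sigma]."""
    return mu - 3 * sigma, mu + 3 * sigma


def initial_detection_value(test_data, mu, sigma):
    """Return 0 if the mean of the disk's data is inside the range, else 1."""
    low, high = threshold_range(mu, sigma)
    return 0 if low <= fmean(test_data) <= high else 1


# Hypothetical model parameters and measurements for one metric.
mu, sigma = 3.0, 0.5
print(threshold_range(mu, sigma))                           # (1.5, 4.5)
print(initial_detection_value([2.8, 3.1, 3.4], mu, sigma))  # 0
print(initial_detection_value([0.5, 0.6, 0.7], mu, sigma))  # 1
```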
[0147] Step S510: Based on the weight of the second indicator corresponding to each disk performance indicator, the initial detection values corresponding to each disk performance indicator are weighted and summed to obtain the target detection value corresponding to the disk to be detected.
[0148] Step S511: If the target detection value corresponding to the disk to be detected is greater than the set threshold, then the disk to be detected is determined to be a slow disk.
[0149] Optionally, slow disk checks can be performed periodically on the disks in the server. For example, slow disk checks can be performed on the disks in the server every 2 minutes. That is, the performance data of the disks in the server can be obtained every 2 minutes, and multiple Gaussian models can be built based on the performance data of the sample disks in the server. Slow disk checks can then be performed on the disks to be tested in the server based on the established multiple Gaussian models to determine whether the disk to be tested is a slow disk.
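The periodic check can be sketched in Python as a simple loop. The helper `run_periodically` and its parameters are hypothetical; `check_once` stands in for one full detection cycle (data collection, model building, and detection), and the `sleep` argument is injectable so that the loop can be exercised without actually waiting.

```python
import time


def run_periodically(check_once, interval_seconds=120, cycles=3, sleep=time.sleep):
    """Run check_once() `cycles` times, waiting interval_seconds between runs."""
    results = []
    for cycle in range(cycles):
        results.append(check_once())
        if cycle < cycles - 1:
            sleep(interval_seconds)
    return results


# Demonstration with a stub check and a fake sleep that only records the waits.
waits = []
outcomes = run_periodically(lambda: [], interval_seconds=120, cycles=3, sleep=waits.append)
print(outcomes, waits)  # [[], [], []] [120, 120]
```

In a deployment the loop would typically be replaced by the scheduled task mentioned in the text; the sketch only shows that each cycle is independent of the previous one.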
[0150] This application provides a slow disk detection method that can solve the problems of the inability to dynamically adjust the latency threshold and insufficient timeliness in current slow disk detection methods. By using a Gaussian model to analyze disk performance data, it can detect slow disks in real time and adaptively without affecting business operations, thus ensuring the stable operation of the cluster.
[0151] Compared with related technologies, the slow disk detection method proposed in this application has the following advantages:
[0152] First, the adaptability is reflected in the fact that, as the pressure of cluster business fluctuates, or as server hardware ages or server hardware architecture varies, the slow disk detection method proposed in this application can adaptively adjust the data model to effectively detect slow disks.
[0153] Second, the real-time performance is reflected in the following two aspects: 1. The slow disk detection method proposed in this application performs parallel detection during cluster operation, without the need for post-event stress testing; 2. The slow disk detection method proposed in this application can identify slow disks within one detection cycle, without the need for multiple cycles of judgment.
[0154] Based on the same inventive concept as the slow disk detection method shown in Figure 1, this application also provides a slow disk detection device. Since this device corresponds to the slow disk detection method of this application, and the principle by which this device solves the problem is similar to that of the method, the implementation of this device can refer to the implementation of the above method; repeated details will not be elaborated further.
[0155] Figure 6 is a schematic diagram of the structure of a slow disk detection device provided by an embodiment of this application. As shown in Figure 6, the slow disk detection device includes a data acquisition unit 601, a data processing unit 602, a model building unit 603, and a slow disk detection unit 604.
[0156] The data acquisition unit 601 is used to acquire at least one set of performance data for each of the multiple disks of any type in the server; each set of performance data corresponds to a set disk performance index.
[0157] The data processing unit 602 is used to determine at least one disk to be tested and at least one sample disk from multiple disks based on the performance data of each disk and the disk performance index corresponding to each set of performance data.
[0158] The model building unit 603 is used to build a Gaussian model for each disk performance index based on the performance data of at least one sample disk.
[0159] The slow disk detection unit 604 is used to perform slow disk detection on each disk to be tested based on the Gaussian model corresponding to each disk performance metric.
[0160] In an optional embodiment, the data processing unit 602 is specifically used for:
[0161] Based on the performance data of each disk, identify invalid disks that are faulty or unused, remove invalid disks from multiple disks, and use the remaining disks as target disks;
[0162] For each target disk, the performance data of the target disk is weighted and summed according to the weight of the first indicator corresponding to each disk performance indicator to obtain the disk performance value of the target disk.
[0163] Select a set number of target disks from the target disks according to their disk performance values from high to low, and use the set number of target disks as the disks to be tested;
[0164] Use all or part of the target disks, excluding the disk to be tested, as sample disks.
[0165] In an optional embodiment, the model building unit 603 is specifically used for:
[0166] For each disk performance metric, perform the following operations:
[0167] From the performance data of at least one sample disk, select the first set of performance data corresponding to the disk performance index, and determine the mean and standard deviation of the multiple performance data contained in the first set of performance data respectively.
[0168] Based on the mean and standard deviation of multiple performance data included in the first set of performance data, a Gaussian model corresponding to the disk performance index is established.
[0169] In an optional embodiment, the slow disk detection unit 604 is specifically used for:
[0170] For each disk to be tested, perform the following operations:
[0171] Each set of performance data for the disk to be tested is input into the Gaussian model corresponding to each disk performance metric for testing, and the initial test value corresponding to each disk performance metric is obtained.
[0172] Based on the weight of the second indicator corresponding to each disk performance indicator, the initial detection values corresponding to each disk performance indicator are weighted and summed to obtain the target detection value corresponding to the disk to be detected.
[0173] If the target detection value is greater than the set threshold, the disk to be detected is determined to be a slow disk.
[0174] In an optional embodiment, the slow disk detection unit 604 is further configured to:
[0175] For each disk performance metric, perform the following operations:
[0176] From the performance data of the disk to be tested, select the second set of performance data corresponding to the disk performance indicators, and take the average of the multiple performance data contained in the second set of performance data as the average of the second set of performance data.
[0177] Based on the Gaussian model corresponding to the disk performance metrics, determine the performance threshold range corresponding to the disk performance metrics.
[0178] If the mean of the second set of performance data is within the performance threshold range, then the initial detection value corresponding to the disk performance index is determined to be the set first detection value.
[0179] If the mean of the second set of performance data is outside the performance threshold range, then the initial detection value corresponding to the disk performance index is determined to be the set second detection value.
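The cooperation of units 602 to 604 can be sketched as a single Python class. This is a minimal illustration under several assumptions: the input already contains only valid disks of one type (the output of unit 601 after invalid disks are removed), the data of all sample disks is pooled per metric, the first and second detection values are 0 and 1, and the population standard deviation is used. The class name, method names, and data are hypothetical.

```python
from statistics import fmean, pstdev


class SlowDiskDetector:
    def __init__(self, first_weights, second_weights, set_threshold, set_number=1):
        self.first_weights = first_weights
        self.second_weights = second_weights
        self.set_threshold = set_threshold
        self.set_number = set_number

    def split(self, disks):
        """Data processing unit 602: pick disks to be tested and sample disks."""
        def score(name):
            return sum(w * fmean(disks[name][m]) for m, w in self.first_weights.items())
        ranked = sorted(disks, key=score, reverse=True)
        return ranked[: self.set_number], ranked[self.set_number:]

    def build_models(self, disks, samples):
        """Model building unit 603: one (mu, sigma) pair per metric."""
        models = {}
        for m in self.second_weights:
            pooled = [x for name in samples for x in disks[name][m]]
            models[m] = (fmean(pooled), pstdev(pooled))
        return models

    def detect(self, disks):
        """Slow disk detection unit 604: weighted sum of out-of-range metrics."""
        to_test, samples = self.split(disks)
        models = self.build_models(disks, samples)
        slow = []
        for name in to_test:
            target = 0.0
            for m, (mu, sigma) in models.items():
                mean = fmean(disks[name][m])
                if not (mu - 3 * sigma <= mean <= mu + 3 * sigma):
                    target += self.second_weights[m]
            if target > self.set_threshold:
                slow.append(name)
        return slow


# Hypothetical single-cycle data: three similar disks and one outlier.
disks = {
    "sda": {"M1": [1.9, 2.0, 2.1], "M2": [3.0, 3.1, 2.9]},
    "sdb": {"M1": [2.0, 2.2, 1.8], "M2": [3.2, 3.0, 2.8]},
    "sdc": {"M1": [2.1, 1.9, 2.0], "M2": [2.9, 3.1, 3.0]},
    "sdx": {"M1": [9.0, 9.5, 10.0], "M2": [12.0, 11.0, 13.0]},
}
detector = SlowDiskDetector({"M1": 1, "M2": 1}, {"M1": 1, "M2": 3.5}, set_threshold=3)
print(detector.detect(disks))  # ['sdx']
```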
[0180] Based on the same inventive concept as the above-described method embodiments, this application also provides an electronic device. This electronic device can be used to perform slow disk detection on a server's disks. In this embodiment, the structure of the electronic device can be as shown in Figure 7, and includes a memory 701 and one or more processors 702.
[0181] The memory 701 is used to store computer programs executed by the processor 702. The memory 701 may mainly include a program storage area and a data storage area. The program storage area may store the operating system and programs required to run instant messaging functions, etc.; the data storage area may store various instant messaging information and operation instruction sets, etc.
[0182] Memory 701 may be volatile memory, such as random-access memory (RAM); memory 701 may also be non-volatile memory, such as read-only memory, flash memory, hard disk drive (HDD), or solid-state drive (SSD); or memory 701 may be any other medium capable of carrying or storing desired program code in the form of instructions or data structures and accessible by a computer, but is not limited thereto. Memory 701 may be a combination of the above-described memories.
[0183] The processor 702 may include one or more central processing units (CPUs) or digital processing units, etc. The processor 702 is used to implement the aforementioned slow disk detection method when calling the computer program stored in the memory 701.
[0184] The embodiments of this application do not limit the specific connection medium between the memory 701 and the processor 702 described above. In Figure 7, the memory 701 and the processor 702 are connected via a bus 703, which is drawn as a thick line; the connection manner between other components is shown for illustration only and is not intended to be limiting. The bus 703 can be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the bus is represented in Figure 7 by a single thick line, but this does not mean that there is only one bus or one type of bus.
[0185] According to one aspect of this application, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the slow disk detection method described in the above embodiments.
[0186] The program product may employ any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example—but not limited to—an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: electrical connections having one or more wires, portable disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.
[0187] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application.
Claims
1. A method for detecting slow disks, characterized in that it includes: obtaining performance data for each disk in at least one disk included in the server, and determining the disk type of each disk based on the performance data of each disk; for multiple disks of any type in the server, obtaining at least one set of performance data for each disk, wherein each set of performance data corresponds to a set disk performance metric; based on the performance data of each disk, identifying invalid disks that are faulty or unused, removing the invalid disks from the multiple disks, and using the remaining disks as target disks; for each target disk, performing a weighted summation on the performance data of the target disk according to the first indicator weight corresponding to each disk performance metric to obtain the disk performance value of the target disk; selecting a set number of target disks from the target disks in descending order of disk performance value, and using the set number of target disks as the disks to be tested; using all or some of the target disks, excluding the disks to be tested, as sample disks, wherein the disk performance value of each disk to be tested is higher than that of the sample disks; based on the performance data of the sample disks, establishing a Gaussian model corresponding to each disk performance metric; and for each disk to be tested, performing the following operations: inputting each set of performance data of the disk to be tested into the Gaussian model corresponding to each disk performance metric for testing, to obtain the initial detection value corresponding to each disk performance metric; performing a weighted summation on the initial detection values corresponding to each disk performance metric according to the second indicator weight corresponding to each disk performance metric, to obtain the target detection value corresponding to the disk to be tested; and if the target detection value is greater than the set threshold, determining that the disk to be tested is a slow disk.
2. The method as described in claim 1, characterized in that the step of establishing a Gaussian model corresponding to each disk performance metric based on the performance data of the sample disks includes: for each disk performance metric, performing the following operations: selecting, from the performance data of the sample disks, the first set of performance data corresponding to the disk performance metric, and determining the mean and standard deviation of the multiple performance data contained in the first set of performance data; and establishing the Gaussian model corresponding to the disk performance metric based on the mean and standard deviation of the multiple performance data contained in the first set of performance data.
3. The method as described in claim 1, characterized in that the step of inputting each set of performance data of the disk to be tested into the Gaussian model corresponding to each disk performance metric for testing, to obtain the initial detection value corresponding to each disk performance metric, includes: for each disk performance metric, performing the following operations: selecting, from the performance data of the disk to be tested, the second set of performance data corresponding to the disk performance metric, and taking the mean of the multiple performance data contained in the second set of performance data as the mean of the second set of performance data; determining, based on the Gaussian model corresponding to the disk performance metric, the performance threshold range corresponding to the disk performance metric; if the mean of the second set of performance data is within the performance threshold range, determining that the initial detection value corresponding to the disk performance metric is the set first detection value; and if the mean of the second set of performance data is outside the performance threshold range, determining that the initial detection value corresponding to the disk performance metric is the set second detection value.
4. A slow disk detection device, characterized in that it includes: a data acquisition unit, used to acquire the performance data of each disk in at least one disk included in the server, determine the disk type of each disk based on the performance data of each disk, and, for multiple disks of any type in the server, acquire at least one set of performance data for each disk, wherein each set of performance data corresponds to a set disk performance metric; a data processing unit, used to identify invalid disks that are faulty or unused based on the performance data of each disk, remove the invalid disks from the multiple disks, and use the remaining disks as target disks; for each target disk, perform a weighted summation on the performance data of the target disk according to the first indicator weight corresponding to each disk performance metric to obtain the disk performance value of the target disk; select a set number of target disks from the target disks in descending order of disk performance value, and use the set number of target disks as the disks to be tested; and use all or some of the target disks, excluding the disks to be tested, as sample disks, wherein the disk performance value of each disk to be tested is higher than that of the sample disks; a model building unit, used to establish a Gaussian model corresponding to each disk performance metric based on the performance data of the sample disks; and a slow disk detection unit, used to perform the following operations for each disk to be tested: inputting each set of performance data of the disk to be tested into the Gaussian model corresponding to each disk performance metric for testing, to obtain the initial detection value corresponding to each disk performance metric; performing a weighted summation on the initial detection values corresponding to each disk performance metric according to the second indicator weight corresponding to each disk performance metric, to obtain the target detection value corresponding to the disk to be tested; and if the target detection value is greater than the set threshold, determining that the disk to be tested is a slow disk.
5. The device as described in claim 4, characterized in that the model building unit is specifically used for: for each disk performance metric, performing the following operations: selecting, from the performance data of the sample disks, the first set of performance data corresponding to the disk performance metric, and determining the mean and standard deviation of the multiple performance data contained in the first set of performance data; and establishing the Gaussian model corresponding to the disk performance metric based on the mean and standard deviation of the multiple performance data contained in the first set of performance data.
6. An electronic device, characterized in that it includes a processor and a memory, wherein the memory stores program code that, when executed by the processor, causes the processor to perform the steps of the method described in any one of claims 1 to 3.
7. A computer-readable storage medium, characterized in that it includes program code that, when run on an electronic device, causes the electronic device to perform the steps of the method described in any one of claims 1 to 3.