A method and system for proactive failure detection for computer servers

By using dynamic time planning and fault risk analysis, the detection frequency is adaptively adjusted, solving the problem of fixed detection frequency for computer servers. This enables efficient and timely fault detection, improving the operational stability and reliability of the server.

CN122285367A · Pending · Publication Date: 2026-06-26 · CHENGDU POLYTECHNIC +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: CHENGDU POLYTECHNIC
Filing Date: 2026-05-29
Publication Date: 2026-06-26

IPC: G06F11/07; G06F11/30; G06F11/32; G06F9/48

AI Tagging

Technology Topics

Self adaptive; Active sensing

Technical Efficacy Phrases

Realize adaptive adjustment; avoid it happening again

AI Technical Summary

Technical Problem

In existing technologies, the fault detection frequency of computer servers cannot be adaptively adjusted, resulting in redundant data burden when the detection frequency is too high, and failure to detect faults in a timely manner when the frequency is too low, thus affecting operational stability and reliability.

Method used

By extracting detection data from the previous detection time, dynamic time planning and detection control are performed, the detection frequency is adaptively adjusted, non-standard detection data is identified and fault risk values are calculated, and idle time warnings are provided.

Benefits of technology

This reduces redundant detection data, improves operational efficiency, ensures timely fault detection, enhances server stability and reliability, and minimizes the impact on normal business operations.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to the field of server fault detection technology, providing a method and system for proactive fault detection in computer servers. The invention updates the phased detection data of the server cluster; performs dynamic time planning and detection control for proactive fault detection based on the previous detection time and data; extracts non-standard detection data from the phased detection data; performs fault risk analysis on the non-standard detection data; and identifies risky servers and provides off-peak warnings when there are hidden fault risks. It can extract previous detection data from the phased detection data for dynamic time planning and detection control, achieving adaptive adjustment of the detection frequency. This avoids generating a large amount of redundant detection data, reduces computational burden, improves operating efficiency, and ensures timely detection of server faults, enhancing operational stability and reliability.


Description

Technical Field

[0001] This invention belongs to the field of server fault detection technology, and in particular relates to a method and system for active fault detection of computer servers.

Background Technology

[0002] Server fault detection is a technical process that collects, analyzes, and judges server operating status data to identify whether the server has experienced any abnormalities. This process aims to provide early warnings of server faults, ensure business continuity and data security, and is of great significance for improving server operational reliability and reducing maintenance costs.

[0003] In existing technologies, active fault detection of computer servers typically involves periodic data collection and analysis at a fixed frequency, without the ability to adaptively adjust the detection frequency. In practical applications, if the detection frequency is too high, a large amount of redundant detection data will be generated, increasing the computational burden of data processing and analysis and reducing operating efficiency; if the detection frequency is too low, server faults may not be detected in a timely manner, affecting the stability and reliability of operation.

Summary of the Invention

[0004] The purpose of this invention is to provide a method and system for proactive fault detection in computer servers, aiming to solve the technical problems existing in the prior art mentioned in the background.

[0005] The embodiments of the present invention are implemented as follows: a method for proactive fault detection in computer servers, the method specifically including the following steps: updating the stage detection data of the server cluster and extracting the previous detection data at the previous detection time; performing dynamic time planning and detection control for proactive fault detection based on the previous detection time and the previous detection data; comparing and analyzing the stage detection data against the preset standard threshold data, and extracting non-standard detection data from the stage detection data; performing fault risk analysis on the non-standard detection data to calculate multiple fault risk values and determine whether there is a hidden fault risk; and, when there is a hidden fault risk, identifying the risk servers and issuing early warnings during off-peak hours.

[0006] As a further limitation of the technical solution of this embodiment of the invention, the step of updating the stage detection data of the server cluster and extracting the previous detection data at the previous detection time specifically includes the following steps: updating the detection record data of the server cluster; extracting stage detection data from the detection record data according to the preset stage cycle; performing detection time identification on the stage detection data to determine the previous detection time; and extracting the previous detection data at the previous detection time from the stage detection data.

[0007] As a further limitation of the technical solution of this invention, the dynamic time planning and detection control for active fault detection based on the previous detection time and the previous detection data specifically includes the following steps: extracting representative detection data from the previous detection data according to the preset representative monitoring type; performing dynamic time planning for proactive fault detection based on the previous detection time and the representative detection data, to calculate the next detection time; performing active fault detection control according to the next detection time and acquiring active detection data; and updating the stage detection data based on the active detection data.

[0008] As a further limitation of the technical solution of this embodiment of the invention, the formula for calculating the next detection time (not reproduced here) uses the following quantities: the next detection time; the preset standard detection cycle; the previous detection time; the preset overall representative threshold; the representative detection value of a given computer server in the server cluster at the specified time, taken over the total number of computer servers in the server cluster; and the representative standard value of that computer server.

[0009] As a further limitation of the technical solution of this embodiment of the invention, the step of performing fault risk analysis on the non-standard detection data, calculating multiple fault risk values, and determining whether there is a hidden fault risk specifically includes the following steps: performing time identification on the non-standard detection data to determine the non-standard start times of multiple computer servers; calculating the fault risk values of the multiple computer servers based on the non-standard detection data and the multiple non-standard start times; comparing the multiple fault risk values with a preset hidden fault threshold; and determining whether there is a hidden fault risk.

[0010] As a further limitation of the technical solution of this embodiment of the invention, the formula for calculating the multiple fault risk values (not reproduced here) uses the following quantities: the fault risk value of a given computer server in the server cluster; a preset adjustment factor; the non-standard start time of that computer server; the current time; the temperature of that computer server at the specified time; and the resource utilization rate of that computer server at the specified time.

[0011] As a further limitation of the technical solution of this invention, the step of identifying the risk server and providing an off-peak warning when there is a hidden fault risk specifically includes the following steps: when there is a hidden fault risk, determining a target risk value from the multiple fault risk values; determining the corresponding risk server from the multiple computer servers in the server cluster based on the target risk value; obtaining the operating status data of the risk server; determining the relatively idle time based on the operating status data; pausing the work tasks of the risk server during the relatively idle time; and issuing a risk warning for the risk server.

[0012] A proactive fault detection system for computer servers, the system comprising a detection data update module, a dynamic time planning module, a detection data extraction module, a hidden risk assessment module, and a risk idle time early warning module, wherein: The detection data update module is used to update the phase detection data of the server cluster and extract the previous detection data at the previous detection time. The dynamic time planning module is used to perform dynamic time planning and detection control for proactive fault detection based on the previous detection time and the previous detection data. The detection data extraction module is used to compare and analyze the stage detection data according to preset standard threshold data, and extract non-standard detection data from the stage detection data. The hidden risk assessment module is used to perform fault risk analysis on the non-standard test data, calculate multiple fault risk values, and determine whether there is a hidden fault risk. The risk idle time early warning module is used to identify risky servers and issue idle time early warnings when there is a hidden risk of failure.

[0013] As a further limitation of the technical solution of this embodiment of the invention, the dynamic time planning module specifically includes: The representative data extraction unit is used to extract representative detection data from the previous detection data according to a preset representative monitoring type. The dynamic time planning unit is used to perform dynamic time planning for proactive fault detection based on the previous detection time and the representative detection data, and to calculate the next detection time. An active detection control unit is used to perform active fault detection control according to the next detection time and to acquire active detection data; The data update unit is used to update the stage detection data based on the active detection data.

[0014] As a further limitation of the technical solution of this embodiment of the invention, the risk idle time early warning module specifically includes: The target risk value determination unit is used to determine a target risk value from a plurality of said fault risk values when there is a hidden fault risk; The risk server determination unit is used to determine the corresponding risk server from multiple computer servers in the server cluster according to the target risk value. A status data acquisition unit is used to acquire the operating status data of the risk server; The idle time determination unit is used to determine the relative idle time based on the running status data. A task pause processing unit is used to pause the working tasks of the risk server during the relatively idle time. The risk warning unit is used to provide risk warnings for the risk server.

[0015] Compared with the prior art, the beneficial effects of the present invention are: (1) This invention extracts the previous detection data from the stage detection data and performs dynamic time planning and detection control for active fault detection, thereby achieving adaptive adjustment of the detection frequency. This not only avoids generating a large amount of redundant detection data, reduces the computational burden, and improves operating efficiency, but also ensures timely detection of server faults and improves the stability and reliability of operation. (2) This invention extracts non-standard detection data from the stage detection data, and then calculates the fault risk values of multiple computer servers according to the non-standard detection data and the multiple non-standard start times. Based on the multiple fault risk values, it determines whether there is a hidden fault risk. It can therefore identify hidden fault risks before an obvious actual fault occurs, which allows early warning and intervention before the risk evolves into an actual anomaly, thereby reducing the impact on normal business operations and further improving the stability and reliability of server operation.

Attached Figure Description

[0016] Figure 1 is a flowchart of the proactive fault detection method for computer servers provided in an embodiment of the present invention; Figure 2 is an application architecture diagram of the proactive fault detection system for computer servers provided in an embodiment of the present invention.

Detailed Implementation

[0017] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.

[0018] Understandably, current proactive fault detection for computer servers typically employs periodic data collection and analysis at a fixed frequency, rather than adaptively adjusting the detection frequency. In practical applications, if the detection frequency is too high, it will generate a large amount of redundant detection data, increasing the computational burden of data processing and analysis and reducing operational efficiency; if the detection frequency is too low, it may lead to the inability to detect server faults in a timely manner, affecting operational stability and reliability.

[0019] To address the aforementioned issues, this invention discloses a method and system for proactive fault detection in computer servers. This method involves updating the phased detection data of the server cluster and extracting data from previous detection times. Based on the previous detection time and data, dynamic time planning and detection control are performed for proactive fault detection. The phased detection data is compared and analyzed according to preset standard threshold data to extract non-standard detection data. Fault risk analysis is performed on the non-standard detection data, calculating multiple fault risk values and determining whether there is a hidden fault risk. If a hidden fault risk is identified, the risk server is identified, and an off-peak warning is issued. This method can extract data from previous detection times from the phased detection data for dynamic time planning and detection control, enabling adaptive adjustment of the detection frequency. This avoids generating a large amount of redundant detection data, reduces computational burden, improves operating efficiency, and ensures timely detection of server faults, thereby enhancing operational stability and reliability.

[0020] Specifically, Figure 1 shows a flowchart of the proactive fault detection method for computer servers provided by an embodiment of the present invention.

[0021] In a preferred embodiment of the present invention, a method for proactive fault detection in a computer server specifically includes the following steps: Step S101: Update the phase detection data of the server cluster and extract the previous detection data at the previous detection time.

[0022] In this embodiment of the invention, the fault detection of the server cluster is updated and recorded, the detection record data is obtained, and the stage detection data is extracted from the detection record data according to the preset stage cycle. Then, the detection time of the stage detection data is identified to determine multiple stage detection times. From the multiple stage detection times, the previous detection time that is closest to the current time is determined, and the corresponding previous detection data is extracted from the stage detection data according to the previous detection time.
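The extraction in step S101 can be sketched as below. This is a minimal illustration, not the patent's implementation: the record layout (`time`, `server`, `metrics`), the function name `extract_previous_detection`, and the modelling of the stage period as a trailing 24-hour window are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def extract_previous_detection(records, now, phase_period=timedelta(days=1)):
    """Return (previous_time, previous_records) from the current stage window.

    records: list of dicts with hypothetical keys "time", "server", "metrics".
    """
    # Stage detection data: records that fall inside the preset stage period.
    stage = [r for r in records if now - phase_period <= r["time"] <= now]
    if not stage:
        return None, []
    # Previous detection time: the stage detection time closest to now.
    previous_time = max(r["time"] for r in stage)
    # Previous detection data: every record taken at that time.
    previous = [r for r in stage if r["time"] == previous_time]
    return previous_time, previous

# Invented sample records for illustration only.
now = datetime(2026, 5, 29, 12, 0)
records = [
    {"time": datetime(2026, 5, 27, 9, 0), "server": "s1", "metrics": {"temperature": 48.0}},
    {"time": datetime(2026, 5, 29, 8, 0), "server": "s1", "metrics": {"temperature": 51.0}},
    {"time": datetime(2026, 5, 29, 11, 0), "server": "s1", "metrics": {"temperature": 55.0}},
    {"time": datetime(2026, 5, 29, 11, 0), "server": "s2", "metrics": {"temperature": 47.0}},
]
previous_time, previous = extract_previous_detection(records, now)
```

Here the record from 27 May falls outside the window, and the two records taken at 11:00 on 29 May form the previous detection data.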

[0023] It is understood that, in the embodiments of the present invention, the stage period can be one working day.

[0024] It is understood that, in this embodiment of the invention, the fault detection of the server cluster is a proactive detection process carried out according to a planned schedule, which includes the detection of data types such as temperature, voltage, resource utilization (CPU utilization, memory utilization, etc.), and network status.

[0025] Specifically, in another preferred embodiment provided by the present invention, the step of updating the phase detection data of the server cluster and extracting the previous detection data at the previous detection time specifically includes the following steps: Update the detection record data of the server cluster; According to the preset phase cycle, phase detection data is extracted from the detection record data; The detection time of the previous detection is identified by performing detection time identification on the stage detection data; Extract the previous detection data from the previous detection time from the stage detection data.

[0026] Furthermore, the active fault detection method for computer servers also includes the following steps: Step S102: Based on the previous detection time and the previous detection data, perform dynamic time planning and detection control for active fault detection.

[0027] In this embodiment of the invention, representative detection data is extracted from the previous detection data according to a preset representative monitoring type. Then, based on the previous detection time and the representative detection data, dynamic time planning for proactive fault detection is performed to calculate the next detection time. A corresponding detection control signal is then generated according to the next detection time. When the next detection time is reached, the system responds to the detection control signal, performs proactive fault detection control, acquires proactive detection data, and integrates the proactive detection data into the stage detection data, so that the stage detection data is updated automatically. Specifically, the formula for calculating the next detection time (not reproduced here) uses the following quantities: the next detection time; the preset standard detection cycle; the previous detection time; the preset overall representative threshold; the representative detection value of a given computer server in the server cluster at the specified time, taken over the total number of computer servers in the server cluster; and the representative standard value of that computer server.

[0028] It is understood that, in the embodiments of the present invention, the monitoring type represents one of the data types such as temperature, voltage, resource utilization, and network status.

[0029] It is understood that, in another embodiment of the present invention, fluctuation analysis can be performed on data types such as temperature, voltage, resource utilization, and network status in the previous detection data, and the data type with the largest fluctuation can be selected as the representative monitoring type.
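Selecting the representative monitoring type by fluctuation can be sketched as follows. The text does not say how fluctuation is measured, so this example assumes the coefficient of variation (population standard deviation divided by the mean) across the servers' values in the previous detection data; the function name and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def select_representative_type(previous_data):
    """previous_data maps a monitoring type to its values across servers."""
    def fluctuation(values):
        # Assumed fluctuation measure: coefficient of variation.
        m = mean(values)
        return pstdev(values) / abs(m) if m else 0.0
    # The monitoring type with the largest fluctuation is chosen.
    return max(previous_data, key=lambda t: fluctuation(previous_data[t]))

# Invented sample values for illustration only.
previous_data = {
    "temperature": [50.0, 52.0, 48.0, 50.0],
    "voltage": [12.0, 12.1, 11.9, 12.0],
    "cpu_utilization": [0.20, 0.80, 0.35, 0.65],
}
representative = select_representative_type(previous_data)
```

A relative measure is used so that types with different units (degrees, volts, ratios) can be compared; an absolute measure such as the raw range would favour whichever type has the largest numeric scale.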

[0030] It is understood that, in this embodiment of the invention, integrating active detection data into stage detection data enables synchronous updates of stage detection data and detection record data.

[0031] Specifically, in another preferred embodiment provided by the present invention, the dynamic time planning and detection control for active fault detection based on the previous detection time and the previous detection data specifically includes the following steps: According to the preset representative monitoring type, extract representative detection data from the previous detection data; Based on the previous detection time and the representative detection data, a dynamic time planning process for proactive fault detection is performed to calculate the next detection time. According to the next detection time, perform active fault detection control and acquire active detection data; The phase detection data is updated based on the active detection data.

[0032] Furthermore, the active fault detection method for computer servers also includes the following steps: Step S103: Compare and analyze the stage detection data according to the preset standard threshold data, and extract non-standard detection data from the stage detection data.

[0033] In this embodiment of the invention, preset standard threshold data and fault threshold data are loaded. By comparing the stage detection data with the standard threshold data and fault threshold data, non-standard detection data that deviates from the standard threshold range corresponding to the standard threshold data but does not reach the fault threshold standard corresponding to the fault threshold data is extracted from the stage detection data.

[0034] Understandably, standard threshold data is a numerical range representing a healthy standard state, composed of multiple pre-set standard threshold ranges. If the detected data is within the standard threshold range, it indicates that the corresponding computer server is in a healthy standard state.

[0035] Understandably, standard threshold data has corresponding standard threshold ranges set according to different monitoring types. Non-standard detection data has values for one or more monitoring types that are outside the corresponding standard threshold range. In this case, the corresponding computer server is not in a healthy standard state.

[0036] It is understandable that the fault threshold data is a fault threshold standard set according to different monitoring types. If the detected data exceeds the fault threshold standard, it indicates that the corresponding computer server is in a state of explicit actual failure.

[0037] It is understood that, in the embodiments of the present invention, non-standard detection data has one or more monitoring types of values that are outside the corresponding standard threshold range. However, the values of all monitoring types do not exceed the corresponding fault threshold standard, indicating that the corresponding computer server is not in a healthy standard state, but has not reached a manifest actual fault state. Non-standard detection data is detection data corresponding to the computer server not being in a healthy standard state, but not reaching a manifest actual fault state.
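The three states described above (healthy standard state, non-standard state, manifest actual fault) can be sketched as a simple classifier. This is an illustrative sketch: the function name, the threshold values, and the treatment of each fault threshold as an upper limit are assumptions, not values from the patent.

```python
def classify(metrics, standard_ranges, fault_limits):
    """Return "fault", "non_standard" or "healthy" for one server's readings."""
    # Any monitoring type beyond its fault threshold means a manifest fault.
    if any(value > fault_limits[t] for t, value in metrics.items()):
        return "fault"
    # Outside a standard range, but below every fault threshold: non-standard.
    for t, value in metrics.items():
        low, high = standard_ranges[t]
        if not (low <= value <= high):
            return "non_standard"
    return "healthy"

# Hypothetical thresholds and readings for illustration only.
standard_ranges = {"temperature": (20.0, 60.0), "cpu_utilization": (0.0, 0.75)}
fault_limits = {"temperature": 85.0, "cpu_utilization": 0.98}
stage_data = {
    "s1": {"temperature": 45.0, "cpu_utilization": 0.40},
    "s2": {"temperature": 68.0, "cpu_utilization": 0.50},
    "s3": {"temperature": 90.0, "cpu_utilization": 0.60},
}
non_standard = {
    server: metrics
    for server, metrics in stage_data.items()
    if classify(metrics, standard_ranges, fault_limits) == "non_standard"
}
```

Only `s2` is extracted as non-standard detection data: `s1` is within every standard range, and `s3` has already crossed the temperature fault threshold.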

[0038] Step S104: Perform fault risk analysis on the non-standard test data, calculate multiple fault risk values, and determine whether there is a hidden fault risk.

[0039] In this embodiment of the invention, time identification is performed on the non-standard detection data to determine the non-standard start times, that is, the times from which the multiple computer servers have not been in a healthy standard state while not yet reaching a manifest actual fault state. Then, according to the non-standard detection data and the multiple non-standard start times, the fault risk values of the multiple computer servers are calculated. These fault risk values are compared with a preset hidden fault threshold. If no fault risk value is greater than the hidden fault threshold, it is determined that there is no hidden fault risk; if one or more fault risk values are greater than the hidden fault threshold, it is determined that there is a hidden fault risk. Specifically, the formula for calculating the multiple fault risk values (not reproduced here) uses the following quantities: the fault risk value of a given computer server in the server cluster; a preset adjustment factor; the non-standard start time of that computer server; the current time; the temperature of that computer server at the specified time; and the resource utilization rate of that computer server at the specified time.
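The threshold comparison in step S104 can be sketched as follows. The sketch takes already-computed fault risk values as input and does not attempt to reproduce the risk formula; the function name, the sample risk values, and the threshold of 0.6 are hypothetical.

```python
def assess_hidden_risk(risk_values, hidden_fault_threshold):
    """Return (has_hidden_risk, risk_servers) from per-server risk values."""
    # A server is at risk only if its value is strictly greater than the
    # hidden fault threshold; a value equal to the threshold is not at risk.
    risk_servers = sorted(
        server
        for server, value in risk_values.items()
        if value > hidden_fault_threshold
    )
    # A hidden fault risk exists as soon as one value exceeds the threshold.
    return bool(risk_servers), risk_servers

# Invented risk values for illustration only.
risk_values = {"s1": 0.12, "s2": 0.71, "s3": 0.64, "s4": 0.60}
has_risk, servers = assess_hidden_risk(risk_values, hidden_fault_threshold=0.6)
```

The values that exceed the threshold correspond to the target risk values used later to look up the risk servers.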

[0040] Specifically, in another preferred embodiment provided by the present invention, the step of performing fault risk analysis on the non-standard test data, calculating multiple fault risk values, and determining whether there is a hidden fault risk specifically includes the following steps: Time identification is performed on the non-standard detection data to determine the non-standard start time of multiple computer servers; Based on the non-standard test data and multiple non-standard start times, calculate the failure risk value of multiple computer servers; The multiple fault risk values are compared with a preset hidden fault threshold; Determine whether there is a hidden fault risk.

[0041] Furthermore, the active fault detection method for computer servers also includes the following steps: Step S105: When there is a risk of hidden faults, identify the risk server and issue an early warning during off-peak hours.

[0042] In this embodiment of the invention, when a hidden fault risk is determined, a target risk value greater than the hidden fault threshold is determined from multiple fault risk values. Then, according to the target risk value, the corresponding risk server is determined from multiple computer servers in the server cluster. At the same time, the operating status data of the risk server is obtained. By analyzing the operating status data, the relative idle time of the risk server is determined. Then, during the relative idle time, the working tasks of the risk server are suspended, and a risk warning is issued to the risk server to remind staff to intervene and handle it in a timely manner.

[0043] It is understood that, in this embodiment of the invention, the relatively idle time can be the time when the resource utilization of the risk server, such as CPU utilization and memory utilization, is less than 50%.
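The 50% rule for relatively idle time can be sketched as below. It assumes that both CPU and memory utilization must be under the limit, and that operating status data arrives as time-labelled samples; the function name and the sample values are hypothetical.

```python
def relatively_idle_times(samples, limit=0.5):
    """samples: list of (time_label, cpu_utilization, memory_utilization)."""
    # Assumed rule: a sample counts as idle when both utilizations are
    # strictly below the limit (50% by default).
    return [t for t, cpu, mem in samples if cpu < limit and mem < limit]

# Invented operating status samples for illustration only.
samples = [
    ("09:00", 0.82, 0.60),
    ("13:00", 0.45, 0.55),
    ("02:00", 0.10, 0.30),
    ("03:00", 0.20, 0.35),
]
idle = relatively_idle_times(samples)
```

In this sample the 13:00 reading is excluded because memory utilization is above the limit even though CPU utilization is not.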

[0044] Specifically, in another preferred embodiment provided by the present invention, the step of identifying the risk server and providing off-peak warning when there is a hidden fault risk specifically includes the following steps: When there is a hidden fault risk, a target risk value is determined from multiple fault risk values; Based on the target risk value, the corresponding risk server is determined from multiple computer servers in the server cluster; Obtain the operational status data of the risk server; Based on the aforementioned operational status data, determine the relative idle time; During the relatively idle time, the work tasks of the risk server are paused. Risk warnings are issued for the aforementioned risky servers.

[0045] Furthermore, Figure 2 shows an application architecture diagram of the active fault detection system for computer servers provided in an embodiment of the present invention.

[0046] Specifically, in another preferred embodiment provided by the present invention, a proactive fault detection system for a computer server includes: The detection data update module 101 is used to update the phase detection data of the server cluster and extract the previous detection data from the previous detection time.

[0047] In this embodiment of the invention, the detection data update module 101 updates and records the fault detection data of the server cluster, obtains the detection record data, and extracts the stage detection data from the detection record data according to the preset stage cycle. Then, it identifies the detection time of the stage detection data, determines multiple stage detection times, and determines the previous detection time closest to the current time from the multiple stage detection times. According to the previous detection time, it extracts the corresponding previous detection data from the stage detection data.

[0048] The dynamic time planning module 102 is used for dynamic time planning and detection control for active fault detection based on the previous detection time and the previous detection data.

[0049] In this embodiment of the invention, the dynamic time planning module 102 extracts corresponding representative detection data from the previous detection data according to a preset representative monitoring type. Then, based on the previous detection time and the representative detection data, it performs dynamic time planning for proactive fault detection, calculates the next detection time, and generates a corresponding detection control signal according to the next detection time. When the next detection time is reached, it responds to the detection control signal to perform proactive fault detection control, acquires proactive detection data, and integrates the proactive detection data into the stage detection data, so that the stage detection data is updated automatically. Specifically, the formula for calculating the next detection time (not reproduced here) uses the following quantities: the next detection time; the preset standard detection cycle; the previous detection time; the preset overall representative threshold; the representative detection value of a given computer server in the server cluster at the specified time, taken over the total number of computer servers in the server cluster; and the representative standard value of that computer server.

[0050] Specifically, in another preferred embodiment provided by the present invention, the dynamic time planning module 102 includes: a representative data extraction unit, used to extract representative detection data from the previous detection data according to a preset representative monitoring type; a dynamic time planning unit, used to perform dynamic time planning for proactive fault detection based on the previous detection time and the representative detection data, and to calculate the next detection time; a proactive detection control unit, used to perform proactive fault detection control according to the next detection time and to acquire proactive detection data; and a data update unit, used to update the stage detection data based on the proactive detection data.

[0051] Furthermore, the proactive fault detection system for computer servers also includes the detection data extraction module 103, which is used to compare and analyze the stage detection data against the preset standard threshold data and to extract non-standard detection data from the stage detection data.

[0052] In this embodiment of the invention, the detection data extraction module 103 loads the preset standard threshold data and fault threshold data. It compares the stage detection data with both and extracts, as non-standard detection data, the data that deviates from the standard threshold range defined by the standard threshold data but does not reach the fault threshold defined by the fault threshold data.
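The two-threshold filter can be sketched in Python as follows. The record layout (a numeric `value` per record), the representation of the standard threshold data as a `(low, high)` range, and the representation of the fault threshold as a single upper bound above that range are assumptions; the embodiment does not state how the thresholds are encoded.

```python
def extract_non_standard(stage_data, standard_range, fault_threshold):
    """Keep records outside the standard range that are not yet faults.

    stage_data      -- list of dicts with a numeric "value" (hypothetical)
    standard_range  -- (low, high) bounds of the healthy standard state
    fault_threshold -- assumed upper bound at which a manifest fault begins
    """
    low, high = standard_range
    non_standard = []
    for record in stage_data:
        value = record["value"]
        outside_standard = not (low <= value <= high)
        below_fault = value < fault_threshold
        if outside_standard and below_fault:
            non_standard.append(record)
    return non_standard
```

With a standard range of 30 to 70 and a fault threshold of 90, a reading of 80 is reported as non-standard, whereas readings of 40 (healthy) and 95 (already a manifest fault) are not.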

[0053] The hidden risk assessment module 104 is used to perform fault risk analysis on the non-standard detection data, calculate multiple fault risk values, and determine whether there is a hidden fault risk.

[0054] In this embodiment of the invention, the hidden risk assessment module 104 performs time identification on the non-standard detection data to determine the non-standard start times, that is, the times from which multiple computer servers have been outside the healthy standard state without yet reaching a manifest fault state. Then, according to the non-standard detection data and the multiple non-standard start times, it calculates the fault risk values of the multiple computer servers. It compares these fault risk values with a preset hidden fault threshold: if no fault risk value is greater than the hidden fault threshold, it is determined that there is no hidden fault risk; if one or more fault risk values are greater than the hidden fault threshold, it is determined that there is a hidden fault risk. Specifically, the fault risk values are calculated by a formula (the formula itself is missing from this text) whose quantities are, in order: the fault risk value of each indexed computer server in the server cluster; a preset adjustment factor; the non-standard start time of that computer server; the current time; the temperature of that computer server at the time indicated in the formula; and the resource utilization rate of that computer server at the time indicated in the formula.
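The threshold decision described above is fully specified, but the risk formula is not legible in this text. The Python sketch below therefore uses a placeholder risk function, a simple product of the adjustment factor, the time spent in the non-standard state, the temperature, and the resource utilization rate. That product is an assumption chosen only because it uses the listed quantities; it is not the formula of the invention, and all names are hypothetical.

```python
def fault_risk(adjust, start_time, now, temperature, utilization):
    """Placeholder risk value, NOT the formula of the patent.

    Assumed to grow with the time spent in the non-standard state,
    the temperature, and the resource utilization rate.
    """
    return adjust * (now - start_time) * temperature * utilization


def has_hidden_fault_risk(risks, hidden_threshold):
    """True if one or more fault risk values exceed the hidden fault threshold.

    risks -- dict mapping server name to fault risk value
    """
    return any(value > hidden_threshold for value in risks.values())


def assess(servers, now, adjust, hidden_threshold):
    """Compute a risk value per server and the overall decision."""
    risks = {
        name: fault_risk(adjust, s["start_time"], now,
                         s["temperature"], s["utilization"])
        for name, s in servers.items()
    }
    return risks, has_hidden_fault_risk(risks, hidden_threshold)
```

A value exactly equal to the hidden fault threshold does not count as a hidden fault risk, matching the "not greater than" wording of the paragraph.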

[0055] The risk idle time early warning module 105 is used to identify risky servers and issue idle time early warnings when there is a hidden fault risk.

[0056] In this embodiment of the invention, when a hidden fault risk is determined, the risk idle time early warning module 105 determines, from the multiple fault risk values, each target risk value that is greater than the hidden fault threshold, and then determines the corresponding risk server from the multiple computer servers in the server cluster according to the target risk value. It also obtains the operating status data of the risk server, analyzes that data to determine the relative idle time of the risk server, suspends the working tasks of the risk server during the relative idle time, and issues a risk warning for the risk server to prompt staff to intervene in a timely manner.
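The selection of risk servers and of their relative idle time can be sketched in Python as follows. The representation of the operating status data as a mapping from hour of day to average load, and the choice of the lowest-load hour as the relative idle time, are assumptions; the embodiment does not specify how the idle time is derived. All function names are hypothetical, and the sketch only produces a plan rather than pausing real tasks.

```python
def find_risk_servers(risks, hidden_threshold):
    """Servers whose fault risk value exceeds the hidden fault threshold."""
    return sorted(name for name, value in risks.items()
                  if value > hidden_threshold)


def relative_idle_hour(hourly_load):
    """Hour with the lowest average load (assumed definition of idle time).

    hourly_load -- dict mapping hour of day to average load (hypothetical)
    """
    return min(hourly_load, key=hourly_load.get)


def plan_idle_warnings(risks, hidden_threshold, status):
    """Build one pause-and-warn entry per risk server."""
    plan = []
    for name in find_risk_servers(risks, hidden_threshold):
        hour = relative_idle_hour(status[name])
        plan.append({
            "server": name,
            "pause_at_hour": hour,
            "warning": f"hidden fault risk on {name}, risk={risks[name]}",
        })
    return plan
```

Servers whose risk value does not exceed the threshold never appear in the plan, so their tasks are left untouched.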

[0057] Specifically, in another preferred embodiment provided by the present invention, the risk idle time early warning module 105 includes: a target risk value determination unit, used to determine a target risk value from the multiple fault risk values when there is a hidden fault risk; a risk server determination unit, used to determine the corresponding risk server from the multiple computer servers in the server cluster according to the target risk value; a status data acquisition unit, used to acquire the operating status data of the risk server; an idle time determination unit, used to determine the relative idle time based on the operating status data; a task pause processing unit, used to pause the working tasks of the risk server during the relative idle time; and a risk warning unit, used to issue a risk warning for the risk server.

[0058] The embodiments described above are merely illustrative of several implementations of the present invention, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these modifications and improvements all fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent should be determined by the appended claims.

Claims

1. A method for proactive fault detection in computer servers, characterized in that the method includes the following steps: updating the stage detection data of a server cluster and extracting the previous detection data at the previous detection time; performing dynamic time planning and detection control for proactive fault detection based on the previous detection time and the previous detection data; comparing and analyzing the stage detection data against preset standard threshold data, and extracting non-standard detection data from the stage detection data; performing fault risk analysis on the non-standard detection data, calculating multiple fault risk values, and determining whether there is a hidden fault risk; and, when there is a hidden fault risk, identifying the risk server and issuing an idle time early warning.

2. The proactive fault detection method for computer servers according to claim 1, characterized in that updating the stage detection data of the server cluster and extracting the previous detection data at the previous detection time specifically includes the following steps: updating the detection record data of the server cluster; extracting the stage detection data from the detection record data according to a preset stage cycle; performing detection time identification on the stage detection data to determine the previous detection time; and extracting, from the stage detection data, the previous detection data at the previous detection time.

3. The proactive fault detection method for computer servers according to claim 1, characterized in that performing dynamic time planning and detection control for proactive fault detection based on the previous detection time and the previous detection data specifically includes the following steps: extracting representative detection data from the previous detection data according to a preset representative monitoring type; performing dynamic time planning for proactive fault detection based on the previous detection time and the representative detection data, and calculating the next detection time; performing proactive fault detection control according to the next detection time and acquiring proactive detection data; and updating the stage detection data based on the proactive detection data.

4. The proactive fault detection method for computer servers according to claim 3, characterized in that the next detection time is calculated by a formula (the formula itself is missing from this text) whose quantities are, in order: the next detection time; the preset standard detection cycle; the previous detection time; the preset overall representative threshold; the representative detection value of each indexed computer server in the server cluster at the previous detection time, together with the total number of computer servers in the server cluster; and the representative standard value of each indexed computer server in the server cluster.

5. The proactive fault detection method for computer servers according to claim 1, characterized in that performing fault risk analysis on the non-standard detection data, calculating multiple fault risk values, and determining whether there is a hidden fault risk specifically includes the following steps: performing time identification on the non-standard detection data to determine the non-standard start times of multiple computer servers; calculating the fault risk values of the multiple computer servers based on the non-standard detection data and the multiple non-standard start times; comparing the multiple fault risk values with a preset hidden fault threshold; and determining whether there is a hidden fault risk.

6. The proactive fault detection method for computer servers according to claim 5, characterized in that the multiple fault risk values are calculated by a formula (the formula itself is missing from this text) whose quantities are, in order: the fault risk value of each indexed computer server in the server cluster; a preset adjustment factor; the non-standard start time of that computer server; the current time; the temperature of that computer server at the time indicated in the formula; and the resource utilization rate of that computer server at the time indicated in the formula.

7. The proactive fault detection method for computer servers according to claim 1, characterized in that identifying the risk server and issuing an idle time early warning when there is a hidden fault risk specifically includes the following steps: when there is a hidden fault risk, determining a target risk value from the multiple fault risk values; determining the corresponding risk server from the multiple computer servers in the server cluster based on the target risk value; obtaining the operating status data of the risk server; determining the relative idle time based on the operating status data; pausing the working tasks of the risk server during the relative idle time; and issuing a risk warning for the risk server.

8. A proactive fault detection system for computer servers, characterized in that the system includes a detection data update module, a dynamic time planning module, a detection data extraction module, a hidden risk assessment module, and a risk idle time early warning module, wherein: the detection data update module is used to update the stage detection data of a server cluster and extract the previous detection data at the previous detection time; the dynamic time planning module is used to perform dynamic time planning and detection control for proactive fault detection based on the previous detection time and the previous detection data; the detection data extraction module is used to compare and analyze the stage detection data against preset standard threshold data, and to extract non-standard detection data from the stage detection data; the hidden risk assessment module is used to perform fault risk analysis on the non-standard detection data, calculate multiple fault risk values, and determine whether there is a hidden fault risk; and the risk idle time early warning module is used to identify the risk server and issue an idle time early warning when there is a hidden fault risk.

9. The proactive fault detection system for computer servers according to claim 8, characterized in that the dynamic time planning module specifically includes: a representative data extraction unit, used to extract representative detection data from the previous detection data according to a preset representative monitoring type; a dynamic time planning unit, used to perform dynamic time planning for proactive fault detection based on the previous detection time and the representative detection data, and to calculate the next detection time; a proactive detection control unit, used to perform proactive fault detection control according to the next detection time and to acquire proactive detection data; and a data update unit, used to update the stage detection data based on the proactive detection data.

10. The proactive fault detection system for computer servers according to claim 8, characterized in that the risk idle time early warning module specifically includes: a target risk value determination unit, used to determine a target risk value from the multiple fault risk values when there is a hidden fault risk; a risk server determination unit, used to determine the corresponding risk server from the multiple computer servers in the server cluster according to the target risk value; a status data acquisition unit, used to acquire the operating status data of the risk server; an idle time determination unit, used to determine the relative idle time based on the operating status data; a task pause processing unit, used to pause the working tasks of the risk server during the relative idle time; and a risk warning unit, used to issue a risk warning for the risk server.