Intelligent diagnosis method, system, and storage medium for server motherboard faults

By constructing anomaly symptom mapping maps for motherboard voltage, temperature, and power consumption and building a time-series correlation network over them, the invention addresses the difficulty of capturing the temporal transmission and causal propagation of abnormal symptoms in multi-source monitoring data from server motherboards. This enables accurate localization and intelligent diagnosis of fault root causes, improving fault localization accuracy and system stability.

CN122240381A · Pending · Publication Date: 2026-06-19 · GUIZHOU UNIVERSITY OF FINANCE AND ECONOMICS +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
GUIZHOU UNIVERSITY OF FINANCE AND ECONOMICS
Filing Date
2026-05-14
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing technologies, it is difficult to capture the temporal transmission and causal propagation of abnormal symptoms across multi-source monitoring data from server motherboards. As a result, maintenance personnel struggle to distinguish primary faults from secondary anomalies when faced with multiple alarms, which reduces the accuracy and timeliness of fault localization.

Method used

By acquiring data stream segments of motherboard voltage, temperature, and power consumption monitoring, voltage anomaly symptom mapping maps, temperature anomaly symptom mapping maps, and power consumption anomaly symptom mapping maps are constructed. Based on these maps, a multi-source anomaly symptom time-series correlation network is constructed to perform fault root cause localization analysis, generate fault diagnosis and recovery command streams, and realize intelligent fault diagnosis.

Benefits of technology

It significantly improves the accuracy and reliability of fault source location in complex fault scenarios, avoids misjudgment or omission, and enhances the operational stability and maintainability of the server system.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122240381A_ABST
Patent Text Reader

Abstract

This invention provides a method, system, and storage medium for intelligent fault diagnosis of server motherboards. It acquires timing-marked data stream segments of motherboard voltage, temperature, and power consumption monitoring generated during server motherboard operation and performs fault symptom mapping processing to generate corresponding voltage anomaly symptom mapping maps, temperature anomaly symptom mapping maps, and power consumption anomaly symptom mapping maps. Based on these three types of symptom mapping maps, a multi-source anomaly symptom timing correlation network is constructed. Fault root cause localization analysis is performed on this network to obtain a localization result containing the identifier of the monitoring data stream segment where the fault root cause is located, the anomaly symptom category identifier, and the propagation impact range characterization information. Based on the localization result, a fault diagnosis recovery command stream containing voltage adjustment, temperature control, and power consumption limitation recovery instructions is generated. This invention improves the accuracy of fault diagnosis and the targeting of recovery operations.

Description

Technical Field

[0001] This invention relates to the field of intelligent diagnostic technology, specifically to an intelligent diagnostic method, system, and storage medium for server motherboard faults.

Background Technology

[0002] As server motherboard integration and workload continue to increase, the motherboard's operational status exhibits complex dynamic coupling characteristics across multiple dimensions, including voltage supply, temperature control, and power consumption management. Real-time analysis of data generated during motherboard operation can identify abnormal motherboard states and determine the cause of failures. Currently, separate alarm thresholds are typically set for different collected data. When any monitored data exceeds the corresponding preset threshold range, an alarm is triggered, and maintenance personnel determine the possible fault location based on the alarm type and their experience. However, this independent threshold alarm mechanism struggles to capture the temporal transmission and causal propagation relationships of abnormal symptoms between different monitored data points. This makes it difficult for maintenance personnel to distinguish between primary faults and secondary abnormalities when faced with multiple alarms, thus affecting the accuracy of fault location and the timeliness of fault recovery.

Summary of the Invention

[0003] In view of this, the present invention provides a method, system and storage medium for intelligent diagnosis of server motherboard faults.

[0004] In one aspect, embodiments of the present invention provide a method for intelligent diagnosis of server motherboard faults, including:
Acquiring the raw operating status monitoring data stream generated by the target server motherboard in its running state, where the raw operating status monitoring data stream includes continuously collected motherboard voltage monitoring data stream segments, motherboard temperature monitoring data stream segments, and motherboard power consumption monitoring data stream segments, each with time-series markers.
Performing operating status fault symptom mapping on the motherboard voltage monitoring data stream segment, motherboard temperature monitoring data stream segment, and motherboard power consumption monitoring data stream segment in the raw operating status monitoring data stream to obtain the voltage anomaly symptom mapping map, temperature anomaly symptom mapping map, and power consumption anomaly symptom mapping map corresponding to the target server motherboard.
Constructing a multi-source anomaly symptom time-series correlation network for the target server motherboard based on the voltage anomaly symptom mapping map, temperature anomaly symptom mapping map, and power consumption anomaly symptom mapping map.
Performing fault root cause localization analysis on the multi-source anomaly symptom time-series correlation network to obtain the fault root cause localization analysis result of the target server motherboard, where the result includes the identifier of the monitoring data stream segment where the fault root cause is located, the identifier of the anomaly symptom category corresponding to the fault root cause, and the propagation influence range of the fault root cause in the multi-source anomaly symptom time-series correlation network.
Generating, based on the fault root cause localization analysis result, a fault diagnosis and recovery command stream for the target server motherboard, where the command stream includes voltage adjustment recovery commands, temperature control recovery commands, and power consumption limit recovery commands corresponding to the fault root cause.

[0005] In another aspect, embodiments of the present invention provide a computer system including a memory and a processor. The memory stores a computer program that can run on the processor, and the processor executes the program to implement the steps in the above method.

[0006] In a third aspect, the present invention provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps in the above method.

[0007] This invention performs operating status fault symptom mapping on the data stream segments monitoring motherboard voltage, temperature, and power consumption to obtain the corresponding anomaly symptom mapping maps. This transforms the raw monitoring data into a graph-based representation with fault symptom semantics, characterizing the mapping relationship between monitoring trajectories and fault symptoms. Based on the multi-source anomaly symptom mapping maps, a time-series correlation network is constructed that characterizes the temporal dependencies and propagation path relationships between different anomaly symptoms, integrating isolated anomaly symptoms into a causal, temporally ordered correlation structure and providing a complete propagation context for fault root cause tracing. Fault root cause localization analysis is then performed on the time-series correlation network to obtain a localization result containing the identifier of the monitoring data stream segment where the fault root cause is located, the anomaly symptom category identifier, and the representation of the propagation impact range. This enables the diagnostic process to trace back along the propagation path to the true fault origin, improving the accuracy and reliability of root cause localization in complex fault scenarios. Finally, based on the localization result, a fault diagnosis and recovery command stream is generated, which includes voltage adjustment, temperature control, and power consumption limitation recovery instructions. The result is an end-to-end intelligent diagnosis flow, from monitoring data collection through anomaly symptom correlation mining and root cause localization to targeted recovery command generation. It helps avoid secondary damage to the motherboard hardware caused by improper recovery operations that follow misjudgment or omission of the fault root cause, and enhances the stability and maintainability of the server system.
Attached Figure Description

[0008] Figure 1 is a schematic diagram illustrating the implementation process of an intelligent diagnostic method for server motherboard faults provided in an embodiment of the present invention.

[0009] Figure 2 is a schematic diagram of the composition structure of a fault diagnosis device provided in an embodiment of the present invention.

[0010] Figure 3 is a schematic diagram of the hardware entity of a computer system provided in an embodiment of the present invention.

Detailed Implementation

[0011] The intelligent diagnostic method for server motherboard faults provided in this invention can be executed by the processor of a computer system. The computer system can refer to a server of any form, such as a rack-mounted high-performance computing server, a high-density blade server, an artificial intelligence training and inference server, a distributed storage server, a virtualized converged architecture server, or an edge computing server, etc., without any specific limitation.

[0012] Figure 1 is a schematic diagram illustrating the implementation process of an intelligent fault diagnosis method for server motherboards provided in an embodiment of the present invention. As shown in Figure 1, the method includes: Step S100: Obtain the raw operating status monitoring data stream generated by the target server motherboard in the running state. The raw operating status monitoring data stream includes continuously collected motherboard voltage monitoring data stream segments, motherboard temperature monitoring data stream segments, and motherboard power consumption monitoring data stream segments with time-series markers.

[0013] In this embodiment, the target server motherboard refers to the printed circuit board assembly deployed on a data center computing node, whose surface carries multiple sets of sensors for physical state monitoring. The motherboard voltage monitoring data stream segment originates from analog-to-digital converter sample values obtained by the onboard baseboard management controller (BMC), which polls the voltage regulator outputs, processor core power supply phases, and memory slot power supply pins over its inter-integrated circuit (I2C) bus. Each sample value is encapsulated as a data tuple containing a voltage amplitude field and a Coordinated Universal Time (UTC) timestamp field, forming a timestamped discrete time series. The motherboard temperature monitoring data stream segment is generated by digital temperature sensing nodes mounted on the processor substrate, in the gaps between memory modules, and on the chipset heat sink. The sensing nodes return temperature measurements and their corresponding counter timestamps at a fixed sampling period. The motherboard power consumption monitoring data stream segment is formed by obtaining the instantaneous current value of each power rail through a current-sensing amplifier, multiplying it by the corresponding rail voltage, and appending a sampling timestamp.
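As a minimal sketch of the data tuples described above, the following Python fragment models a timestamped sample and derives a power sample from a rail voltage and rail current. The names `Sample` and `power_sample` are hypothetical, and reusing the voltage sample's timestamp for the power sample is an assumption; the text only says that a sampling timestamp is appended.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Sample:
    """One monitoring tuple: a measured value plus its UTC sampling time."""
    value: float         # volts, degrees Celsius, or watts, depending on the stream
    timestamp: datetime  # UTC timestamp of the sample


def power_sample(rail_voltage: Sample, rail_current_amps: float) -> Sample:
    """Instantaneous rail power = current x voltage.

    Assumption: the power sample reuses the voltage sample's timestamp.
    """
    return Sample(rail_voltage.value * rail_current_amps, rail_voltage.timestamp)


t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
voltage = Sample(12.0, t0)
power = power_sample(voltage, 2.5)
```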

[0014] Step S200: Perform operational status fault symptom mapping processing on the motherboard voltage monitoring data stream segment, motherboard temperature monitoring data stream segment, and motherboard power consumption monitoring data stream segment in the original operational status monitoring data stream to obtain the voltage anomaly symptom mapping map, temperature anomaly symptom mapping map, and power consumption anomaly symptom mapping map corresponding to the target server motherboard. The voltage anomaly symptom mapping map represents the mapping relationship between voltage fluctuation trajectory and fault symptom category, the temperature anomaly symptom mapping map represents the mapping relationship between temperature change trend and fault symptom category, and the power consumption anomaly symptom mapping map represents the mapping relationship between power consumption transient jump and fault symptom category.

[0015] In an optional embodiment, step S200 specifically includes steps S210 to S260: Step S210: Extract the voltage amplitude change trajectory corresponding to adjacent timing marks in the motherboard voltage monitoring data stream segment, input the voltage amplitude change trajectory into the pre-constructed voltage anomaly symptom mapping model, and obtain voltage fluctuation trajectory characterization information. The voltage fluctuation trajectory characterization information includes voltage drop trajectory segment, voltage rise trajectory segment and voltage ripple trajectory segment.

[0016] For example, a motherboard voltage monitoring data stream segment might be a one-dimensional floating-point sequence indexed by sampling timestamps. The process of extracting the voltage amplitude change trajectory corresponding to adjacent time markers involves performing a first-order difference operation on the original sampling sequence to generate a voltage increment sequence between adjacent sampling points. Continuous sampling intervals in the voltage increment sequence with negative amplitudes and absolute values exceeding a preset dead-zone threshold are initially marked as candidate voltage decrease intervals, while continuous sampling intervals with positive amplitudes and exceeding the preset dead-zone threshold are marked as candidate voltage increase intervals.
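The differencing and dead-zone marking described above can be sketched as follows. The function names, the run representation `(kind, start, end)`, and the sample values are illustrative assumptions, not part of the patent text.

```python
def first_difference(samples):
    """Voltage increments between adjacent sampling points."""
    return [b - a for a, b in zip(samples, samples[1:])]


def candidate_intervals(increments, dead_zone):
    """Group consecutive increments beyond the dead zone into runs.

    Returns (kind, start, end) tuples, where start and end are inclusive
    indices into the increment sequence and kind is "decrease" or "increase".
    """
    runs = []
    current_kind, start = None, 0
    for i, d in enumerate(increments):
        if d < -dead_zone:
            kind = "decrease"
        elif d > dead_zone:
            kind = "increase"
        else:
            kind = None  # inside the dead zone: no candidate
        if kind != current_kind:
            if current_kind is not None:
                runs.append((current_kind, start, i - 1))
            current_kind, start = kind, i
    if current_kind is not None:
        runs.append((current_kind, start, len(increments) - 1))
    return runs


volts = [12.0, 12.0, 11.5, 11.0, 11.0, 11.6, 12.0]
runs = candidate_intervals(first_difference(volts), dead_zone=0.1)
```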

[0017] The pre-built voltage anomaly symptom mapping model can employ a one-dimensional temporal convolutional neural network architecture, which consists of stacked temporal convolutional modules, residual connection branches, and pointwise feedforward layers. The temporal convolutional modules are composed of alternating stacks of causal convolutional layers, dilated convolutional layers, and gated linear units. The causal convolutional layers ensure that only input information from the current and historical time points is used during feature extraction. The dilated convolutional layers enlarge the receptive field with an exponentially growing dilation factor to capture voltage fluctuation patterns at different time scales. The gated linear units receive the output channels of the dilated convolutional layers, split them evenly into two parts along the channel dimension, apply a sigmoid activation function to one part, and then multiply it element-wise with the other part to control how much information is retained or forgotten during forward propagation. The residual connection branches perform element-wise addition between the input and output feature maps of the temporal convolutional modules to alleviate vanishing gradients during deep network training. The pointwise feedforward layer contains two linear transformation layers. The first linear transformation layer maps the input feature dimension to a higher-dimensional feature space and is followed by a rectified linear unit (ReLU) activation. The second linear transformation layer projects the high-dimensional features back to the original dimension to maintain feature dimension consistency.
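To make the causal dilated convolution and the gated linear unit concrete, here is a small sketch in plain Python. It assumes zero padding for samples before the start of the sequence and fixed example weights. It illustrates the two operations only and is not the patent's trained model.

```python
import math


def causal_dilated_conv(x, kernel, dilation):
    """y[t] = sum_k kernel[k] * x[t - k * dilation].

    Only the current and earlier samples contribute; samples before t = 0
    are treated as zero (an assumed padding choice).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def glu(channels):
    """Gated linear unit over a list of channels (each a list over time).

    The channels are split in half; the second half is passed through a
    sigmoid and gates the first half element-wise.
    """
    half = len(channels) // 2
    return [
        [a * sigmoid(b) for a, b in zip(values, gates)]
        for values, gates in zip(channels[:half], channels[half:])
    ]


conv_out = causal_dilated_conv([1.0, 2.0, 3.0, 4.0], kernel=[1.0, 1.0], dilation=2)
gated = glu([[2.0, 4.0], [0.0, 0.0]])
```

Stacking such layers with dilations 1, 2, 4, and so on is what gives the exponentially growing receptive field mentioned in the text.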

[0018] Voltage amplitude variation trajectories are input as a one-dimensional array to the voltage anomaly symptom mapping model, flowing sequentially through three stacked temporal convolutional modules. The output feature map of the first temporal convolutional module is passed to the second temporal convolutional module to further abstract the mid-frequency components of voltage fluctuations, and the output feature map of the second temporal convolutional module is passed to the third temporal convolutional module to capture the low-frequency trend components of voltage fluctuations. After concatenating the output feature maps of the three temporal convolutional modules along the channel dimension, a global average pooling layer is used to compress the temporal dimension information, generating a fixed-length global feature vector. This global feature vector is fed into three parallel classification head networks, corresponding to the voltage drop trajectory segment classification head, voltage rise trajectory segment classification head, and voltage ripple trajectory segment classification head, respectively. Each classification head network consists of two fully connected layers. The first fully connected layer is followed by a dropout regularization operation, and the second fully connected layer is followed by a softmax (normalized exponential) function that outputs a confidence score for the corresponding trajectory segment category. Trajectory segments with confidence scores exceeding a preset threshold are selected as output components of the voltage fluctuation trajectory characterization information.
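The pooling and head-selection logic at the end of this pipeline can be sketched as below. The sketch assumes each head emits two logits (absent, present) so that the softmax yields a presence confidence; the text does not state the head output width, and the logit values shown are made up for illustration.

```python
import math


def global_average_pool(feature_map):
    """Collapse the time axis: one mean value per channel."""
    return [sum(channel) / len(channel) for channel in feature_map]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_segments(head_logits, threshold):
    """Keep the segment types whose presence confidence exceeds the threshold.

    head_logits maps a segment name to [absent_logit, present_logit]
    (an assumed two-way head layout).
    """
    selected = {}
    for name, logits in head_logits.items():
        confidence = softmax(logits)[1]
        if confidence > threshold:
            selected[name] = confidence
    return selected


pooled = global_average_pool([[1.0, 3.0], [2.0, 2.0]])
chosen = select_segments(
    {"drop": [0.0, 2.0], "rise": [2.0, 0.0], "ripple": [0.0, 0.0]},
    threshold=0.6,
)
```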

[0019] The voltage drop trajectory segment describes the rapid drop in voltage amplitude from the nominal operating voltage to a range significantly below the nominal value within a short time window. Its waveform envelope exhibits a three-segment structure: a steep falling edge, a low-level plateau, and a recovering rising edge. The voltage rise trajectory segment describes the rapid rise in voltage amplitude from the nominal operating voltage to a range significantly above the nominal value within a short time window. Its waveform envelope also exhibits a three-segment structure: a steep rising edge, a high-level plateau, and a recovering falling edge. The voltage ripple trajectory segment describes the range where the voltage amplitude exhibits periodic or quasi-periodic small-amplitude oscillations near the nominal operating voltage. Its waveform appears as an AC component with a certain repetition frequency and limited amplitude superimposed on a DC component.

[0020] Step S220: Extract the temperature gradient change trend corresponding to adjacent time markers in the motherboard temperature monitoring data stream segment, input the temperature gradient change trend into the pre-built temperature anomaly symptom mapping model, and obtain the temperature change trend characterization information. The temperature change trend characterization information includes temperature step-up trend segments, temperature oscillation trend segments, and temperature continuous deviation trend segments.

[0021] For example, a motherboard temperature monitoring data stream segment is a one-dimensional floating-point sequence indexed by sampling timestamps. The process of extracting the temperature gradient change trend corresponding to adjacent time markers involves performing a first-order difference operation on the original temperature sequence to obtain a temperature change rate sequence, and then differencing the temperature change rate sequence once more (a second-order difference of the original sequence) to obtain a temperature change acceleration sequence. The temperature change rate sequence and the temperature change acceleration sequence together constitute a phase space trajectory describing the temperature dynamics.
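A short sketch of the rate and acceleration sequences follows. Pairing each acceleration value with the rate at the same sampling instant is an assumed alignment convention, and the function names are hypothetical.

```python
def diff(seq):
    """First-order difference of a sequence."""
    return [b - a for a, b in zip(seq, seq[1:])]


def phase_space_trajectory(temps):
    """Return (rate, acceleration) pairs built from a temperature sequence.

    rate is the first difference of temps; acceleration is the first
    difference of rate. acceleration[i] is paired with rate[i + 1] so both
    refer to the same sampling instant (an assumed alignment).
    """
    rate = diff(temps)
    accel = diff(rate)
    return list(zip(rate[1:], accel))


trajectory = phase_space_trajectory([40.0, 41.0, 43.0, 46.0])
```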

[0022] The pre-built temperature anomaly symptom mapping model employs a hybrid architecture combining a gated recurrent unit (GRU) network and a self-attention pooling layer. The GRU network receives a temperature gradient trend sequence as input. Internally, its update gate vector controls the proportion of information transferred from the historical hidden state to the current hidden state, and its reset gate vector controls how much the historical hidden state participates in computing the current candidate hidden state. At each time step, the GRU network outputs the hidden state vector for that time step, and the hidden state vectors from all time steps constitute the hidden state sequence. This hidden state sequence is then input to the self-attention pooling layer, which maps the sequence to a query matrix, a key matrix, and a value matrix using three learnable linear projection matrices. The query matrix is multiplied by the transpose of the key matrix, and the result is normalized with a softmax (normalized exponential) function to obtain the attention weight matrix. This attention weight matrix is multiplied with the value matrix to obtain a weighted context vector. The weighted context vector is then mapped to a three-dimensional output probability vector through a fully connected layer. These three dimensions correspond to the recognition confidence of temperature step-up trend segments, temperature oscillation trend segments, and temperature continuous deviation trend segments, respectively. Temperature trend segments with confidence levels exceeding a preset confidence threshold are taken as output components of the temperature change trend characterization information.
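The attention step above can be sketched in plain Python as follows. Two points are assumptions rather than statements from the text: the per-step context rows are averaged over time to obtain a single context vector, and no scaling of the query-key product is applied. Identity projection matrices stand in for the learnable ones.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden, w_q, w_k, w_v):
    """Self-attention over a hidden state sequence, then mean pooling.

    hidden is T x d (one row per time step); w_q, w_k, w_v are d x d.
    """
    q, k, v = matmul(hidden, w_q), matmul(hidden, w_k), matmul(hidden, w_v)
    k_t = [list(col) for col in zip(*k)]                # transpose of the key matrix
    weights = [softmax(row) for row in matmul(q, k_t)]  # T x T attention weights
    context = matmul(weights, v)                        # one context row per step
    steps = len(context)
    pooled = [sum(col) / steps for col in zip(*context)]  # assumed mean pooling
    return weights, pooled


identity = [[1.0, 0.0], [0.0, 1.0]]
hidden_states = [[1.0, 0.0], [0.0, 1.0]]
weights, pooled = attention_pool(hidden_states, identity, identity, identity)
```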

[0023] The temperature step-up trend segment describes a temperature series where a significant, discontinuous jump occurs near a specific time point, followed by a sustained high temperature level. The temperature oscillation trend segment describes a temperature series that fluctuates around a central temperature value over a longer time window. The temperature continuous deviation trend segment describes a monotonic drift pattern where a temperature series gradually moves away from a preset temperature baseline without showing any tendency to return.

[0024] Step S230: Extract the power transient jump trajectory corresponding to adjacent timing markers in the motherboard power consumption monitoring data stream segment, input the power transient jump trajectory into the pre-built power anomaly symptom mapping model, and obtain power transient jump characterization information. The power transient jump characterization information includes power peak pulse segments, power drop trough segments, and power periodic fluctuation segments.

[0025] For example, the motherboard power consumption monitoring data stream segment is a one-dimensional floating-point number sequence indexed by the sampling timestamp. The process of extracting the power consumption transient jump trajectory corresponding to adjacent time markers involves applying a sliding window difference algorithm to the original power consumption sequence, calculating the peak-to-valley difference between the maximum and minimum power consumption values within each sliding window, and recording the start and end positions of the time window when the peak-to-valley difference exceeds a preset jump threshold. At the same time, the short-time zero-crossing rate of the power consumption sequence within the sliding window is calculated to help distinguish between spike-type jumps and periodic jumps.
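The sliding-window extraction can be sketched as below. Because power values are always positive, this sketch subtracts the window mean before counting zero crossings; that centering step is an assumption, as the text does not say how the zero level is defined. Window length, threshold, and sample values are illustrative.

```python
def jump_windows(power, window, jump_threshold):
    """Return (start, end) index pairs of windows whose peak-to-valley
    difference exceeds the jump threshold (end is inclusive)."""
    hits = []
    for start in range(len(power) - window + 1):
        chunk = power[start:start + window]
        if max(chunk) - min(chunk) > jump_threshold:
            hits.append((start, start + window - 1))
    return hits


def zero_crossing_rate(chunk):
    """Fraction of adjacent sample pairs whose mean-removed values change sign."""
    mean = sum(chunk) / len(chunk)
    centered = [p - mean for p in chunk]
    crossings = sum(1 for a, b in zip(centered, centered[1:]) if a * b < 0)
    return crossings / (len(chunk) - 1)


spike = [100.0, 100.0, 180.0, 100.0, 100.0]
periodic = [90.0, 110.0, 90.0, 110.0, 90.0]
hits = jump_windows(spike, window=3, jump_threshold=50.0)
```

In this toy data the periodic sequence has a higher zero-crossing rate than the single spike, which is the kind of contrast the text relies on to separate the two jump types.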

[0026] The pre-built power anomaly symptom mapping model can be a multi-scale one-dimensional convolutional neural network architecture containing three parallel feature extraction branches. Each branch has a different convolutional kernel size to capture power jump patterns across different time spans. The first branch uses a smaller one-dimensional convolutional kernel to capture power spikes, the second branch uses a medium-sized one-dimensional convolutional kernel to capture power dips, and the third branch uses a larger one-dimensional convolutional kernel to capture periodic power fluctuations. Each branch consists of a one-dimensional convolutional layer, a batch normalization layer, a rectified linear unit (ReLU) activation layer, and a max-pooling layer stacked sequentially. The output feature maps from the three parallel feature extraction branches are concatenated along the channel dimension. The concatenated feature map is then compressed along the time dimension by a global max-pooling layer to generate a fixed-length feature vector. This feature vector is fed into a multi-label classifier consisting of two fully connected layers. The first fully connected layer includes batch normalization and ReLU activation, while the second fully connected layer is followed by an element-wise sigmoid activation function that simultaneously outputs the presence probabilities of three independent classes. Power transient jump types with a probability exceeding a preset discrimination threshold are selected as output components of the power transient jump characterization information.

[0027] Power peak pulse segments describe positive pulse waveforms in a power consumption sequence that are extremely short in duration, have a steep rising edge, and have a peak amplitude significantly higher than the average power consumption level. Power drop trough segments describe negative pulse waveforms in a power consumption sequence that are finite in duration, have a steep falling edge, and have a trough amplitude significantly lower than the average power consumption level. Power periodic fluctuation segments describe fluctuation patterns in a power consumption sequence that exhibit a stable repetitive period within a specific frequency band.

[0028] Step S240: Based on the voltage drop trajectory segment, voltage rise trajectory segment, and voltage ripple trajectory segment in the voltage fluctuation trajectory characterization information, perform symptom category matching processing with the preset voltage fault symptom category library to obtain the power supply link abnormal symptom mapping association corresponding to voltage drop, the voltage regulation abnormal symptom mapping association corresponding to voltage rise, and the filter circuit abnormal symptom mapping association corresponding to voltage ripple. Then, perform graph-based organization processing on the power supply link abnormal symptom mapping association, the voltage regulation abnormal symptom mapping association, and the filter circuit abnormal symptom mapping association to generate a voltage abnormal symptom mapping graph.

[0029] The preset voltage fault symptom category library stores three types of fault symptom template clusters in key-value pair format. The power supply link anomaly symptom template cluster contains multiple reference template vectors with different combinations of sag amplitude and sag duration characteristics. Each reference template vector is associated with a semantic tag for a power supply link anomaly subtype. The voltage regulation anomaly symptom template cluster contains multiple reference template vectors with different combinations of surge amplitude and surge recovery time characteristics. Each reference template vector is associated with a semantic tag for a voltage regulation anomaly subtype. The filter circuit anomaly symptom template cluster contains multiple reference template vectors with different combinations of ripple frequency and ripple amplitude characteristics. Each reference template vector is associated with a semantic tag for a filter circuit anomaly subtype.

[0030] Graph-based organization refers to organizing the discrete mapping association triples generated by symptom category matching into a directed graph data structure with nodes and edges. The power supply link anomaly symptom mapping association, the voltage regulation anomaly symptom mapping association, and the filter circuit anomaly symptom mapping association serve as three source nodes in the graph. Each source node is connected to its mapped fault symptom category target node via categorized edges. The weight attribute of the edge stores the matching similarity value, and the additional attribute of the edge stores a quantitative summary of the joint distribution representation of the association features. The voltage anomaly symptom mapping graph is the serialized representation of this directed graph data structure.
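One possible shape for this directed graph and its serialized form is sketched below. The class names, node names, the JSON layout, and the numeric values are all hypothetical; the text specifies only that edges carry a similarity weight and a quantitative summary attribute.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class MappingEdge:
    source: str    # mapping association node (source node)
    target: str    # fault symptom category node (target node)
    weight: float  # matching similarity value
    summary: dict = field(default_factory=dict)  # summary of the joint distribution


@dataclass
class SymptomMappingGraph:
    edges: list = field(default_factory=list)

    def add_edge(self, source, target, weight, summary=None):
        self.edges.append(MappingEdge(source, target, weight, summary or {}))

    def nodes(self):
        """All node names that appear as an edge source or target."""
        return sorted({e.source for e in self.edges} | {e.target for e in self.edges})

    def serialize(self):
        """Serialized representation of the directed graph (JSON is an assumed format)."""
        return json.dumps({"nodes": self.nodes(), "edges": [asdict(e) for e in self.edges]})


graph = SymptomMappingGraph()
graph.add_edge("power_supply_link_association", "input_rail_sag", 0.91, {"mean_sag": 0.15})
graph.add_edge("filter_circuit_association", "output_capacitor_aging", 0.78)
serialized = graph.serialize()
```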

[0031] In an optional embodiment, step S240 specifically includes steps S241 to S246: Step S241: Extract the correlation features of voltage sag amplitude and sag duration from the voltage sag trajectory segment to obtain voltage sag correlation feature representation information, which includes the joint distribution representation of sag amplitude and sag duration.

[0032] After extracting the voltage sag trajectory segment, the voltage values of all sampling points within the segment are scanned to locate the lowest voltage value. The difference between the nominal operating voltage and the lowest voltage value is taken as the sag amplitude, and the sag amplitude is normalized based on the nominal operating voltage to obtain the relative sag amplitude. The moment when the voltage amplitude drops from the nominal operating voltage and crosses the preset sag judgment threshold is recorded as the sag start time, and the moment when the voltage amplitude recovers from below the sag judgment threshold to cross the threshold is recorded as the sag end time. The time span between the sag end time and the sag start time is taken as the sag duration. The sag amplitude and sag duration constitute a two-dimensional feature vector. The two-dimensional feature vectors corresponding to multiple voltage sag trajectory segments are processed by a kernel density estimation algorithm to generate a two-dimensional probability density function, which constitutes the joint distribution representation of the sag amplitude and sag duration.
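A minimal sketch of the per-segment feature extraction follows. It takes the first sample below the threshold as the sag start and the first later sample back at or above the threshold as the sag end; using sample times rather than interpolated crossing times is a simplifying assumption, and the numbers are illustrative.

```python
def sag_features(times, volts, nominal, sag_threshold):
    """Return (relative_sag_amplitude, sag_duration) for one sag segment,
    or None if the segment never crosses the threshold and recovers."""
    start = end = None
    for t, v in zip(times, volts):
        if start is None and v < sag_threshold:
            start = t                      # first sample below the threshold
        elif start is not None and v >= sag_threshold:
            end = t                        # first sample back above the threshold
            break
    if start is None or end is None:
        return None
    lowest = min(volts)
    relative_amplitude = (nominal - lowest) / nominal
    return relative_amplitude, end - start


times_ms = [0, 1, 2, 3, 4, 5]
volts = [12.0, 11.0, 10.2, 10.8, 11.9, 12.0]
features = sag_features(times_ms, volts, nominal=12.0, sag_threshold=11.4)
```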

[0033] Step S242: Input the voltage drop correlation feature characterization information into the power supply link abnormality symptom matching branch in the voltage fault symptom category library, calculate the symptom feature similarity distribution between the voltage drop correlation feature characterization information and each candidate power supply link abnormality symptom template in the power supply link abnormality symptom matching branch, and determine the power supply link abnormality symptom mapping association corresponding to the voltage drop trajectory segment based on the similarity peak position in the symptom feature similarity distribution.

[0034] After receiving the voltage sag correlation feature representation information, the power supply link anomaly symptom matching branch uses the two-dimensional probability density function contained in that information as the query distribution. Each candidate power supply link anomaly symptom template in the matching branch stores a reference two-dimensional probability density function representing a specific fault mode. The symptom feature similarity distribution is calculated by using the Bhattacharyya distance or the earth mover's distance to quantify the difference between the query distribution and each reference distribution. The quantified difference is mapped to a similarity score through a monotonically decreasing function. The similarity scores corresponding to all candidate templates constitute the symptom feature similarity distribution vector. Peak detection is performed on this vector, and the semantic label of the candidate power supply link anomaly symptom template at the similarity peak position is selected as the power supply link anomaly symptom mapping association bound to the voltage sag trajectory segment. This mapping association is, for example, a data structure that contains the start and end timestamps of the sag trajectory segment, the fault category identifier of the selected template, and the matching similarity score.
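The matching step can be sketched with the Bhattacharyya distance as follows. The sketch assumes the query and reference densities have been discretized onto the same grid and normalized to sum to one, uses exp(-distance) as the monotonically decreasing mapping, and treats peak detection as picking the maximum score. The template labels and values are made up.

```python
import math


def bhattacharyya_distance(p, q):
    """Distance between two discrete distributions defined on the same grid."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    coefficient = min(coefficient, 1.0)  # guard against rounding slightly above 1
    return -math.log(coefficient) if coefficient > 0.0 else math.inf


def similarity_scores(query, templates):
    """Map each template label to exp(-distance), which decreases as distance grows."""
    return {
        label: math.exp(-bhattacharyya_distance(query, reference))
        for label, reference in templates.items()
    }


def best_match(query, templates):
    """Pick the template at the similarity peak (here: the maximum score)."""
    scores = similarity_scores(query, templates)
    label = max(scores, key=scores.get)
    return label, scores[label]


query = [0.1, 0.6, 0.3]
templates = {
    "template_a": [0.1, 0.6, 0.3],
    "template_b": [0.7, 0.2, 0.1],
}
label, score = best_match(query, templates)
```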

[0035] Step S243: Extract the correlation features of voltage surge amplitude and voltage surge recovery time from the voltage surge trajectory segment to obtain voltage surge correlation feature representation information, which includes the joint distribution representation of voltage surge amplitude and voltage surge recovery time.

[0036] After extracting the voltage surge trajectory segment, the voltage values of all sampling points within the segment are scanned to locate the highest voltage value. The difference between the highest voltage value and the nominal operating voltage is taken as the surge amplitude, and the surge amplitude is normalized by the nominal operating voltage to obtain the relative surge amplitude. The moment when the voltage rises from the nominal operating voltage and crosses the preset surge judgment threshold is recorded as the surge start time, and the moment when the voltage falls back from above the surge judgment threshold and crosses the threshold again is recorded as the surge end time. The time span between the surge start time and the surge end time is taken as the surge recovery time. The surge amplitude and surge recovery time constitute a two-dimensional feature vector. The two-dimensional feature vectors corresponding to multiple voltage surge trajectory segments are processed by a kernel density estimation algorithm to generate a two-dimensional probability density function, which constitutes the joint distribution representation of the surge amplitude and surge recovery time.
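
As a non-limiting illustration, the extraction of the surge amplitude and the surge recovery time from one trajectory segment can be sketched as follows. The function name surge_features and its parameters are hypothetical, and the sketch assumes that the segment is given as two equal-length lists of timestamps and voltage samples.

```python
def surge_features(times, volts, nominal, threshold):
    """Return (relative surge amplitude, surge recovery time) for one segment.

    Returns None when the segment never crosses the threshold upward and
    then back downward.
    """
    peak = max(volts)
    relative_amplitude = (peak - nominal) / nominal

    start = end = None
    for i in range(1, len(volts)):
        if start is None and volts[i - 1] < threshold <= volts[i]:
            start = times[i]   # upward crossing: surge start time
        elif start is not None and volts[i - 1] >= threshold > volts[i]:
            end = times[i]     # downward crossing: surge end time
            break

    if start is None or end is None:
        return None
    return relative_amplitude, end - start
```

The crossing times are taken at the first sample on the far side of the threshold; interpolating between neighboring samples would give finer resolution at the cost of a slightly longer routine.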

[0037] Step S244: Input the voltage surge correlation feature characterization information into the voltage regulation abnormality symptom matching branch in the voltage fault symptom category library, calculate the symptom feature similarity distribution between the voltage surge correlation feature characterization information and each candidate voltage regulation abnormality symptom template in the voltage regulation abnormality symptom matching branch, and determine the voltage regulation abnormality symptom mapping association corresponding to the voltage surge trajectory segment based on the similarity peak position in the symptom feature similarity distribution.

[0038] After receiving the voltage surge correlation feature representation information, the voltage regulation anomaly symptom matching branch uses the two-dimensional probability density function contained in that information as the query distribution. Each candidate voltage regulation anomaly symptom template in the voltage regulation anomaly symptom matching branch stores a reference two-dimensional probability density function representing a specific fault mode. The symptom feature similarity distribution can be calculated by using the maximum mean discrepancy (MMD) metric to quantify the difference between the query distribution and each reference distribution. The quantified difference value is mapped to a similarity score through a radial basis function (RBF) kernel. The similarity scores corresponding to all candidate templates constitute the symptom feature similarity distribution vector. Peak detection is performed on this vector, and the semantic label of the candidate voltage regulation anomaly symptom template corresponding to the similarity peak position is selected as the voltage regulation anomaly symptom mapping association bound to the voltage surge trajectory segment. This mapping association is, for example, a data structure that contains the start and end timestamps of the surge trajectory segment, the fault category identifier of the selected template, and the matching similarity score.
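
As a non-limiting illustration, the maximum mean discrepancy between two distributions can be estimated from samples. The sketch below assumes that each distribution is represented by a list of two-dimensional feature vectors (for example, the vectors from which the density was estimated), uses the biased estimator of the squared discrepancy, and maps it to a similarity score with a Gaussian (radial basis) function. The names rbf, mmd_squared, and similarity, and the parameters gamma and sigma, are hypothetical.

```python
import math


def rbf(x, y, gamma):
    """Radial basis function kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd_squared(xs, ys, gamma=1.0):
    """Biased sample estimate of the squared maximum mean discrepancy."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / (len(xs) * len(xs))
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / (len(ys) * len(ys))
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return max(kxx + kyy - 2.0 * kxy, 0.0)  # guard against rounding below zero


def similarity(xs, ys, gamma=1.0, sigma=0.5):
    """Map the squared discrepancy to a score in (0, 1]; 1 means identical samples."""
    return math.exp(-mmd_squared(xs, ys, gamma) / (2.0 * sigma ** 2))
```

In practice the two features would usually be rescaled to comparable ranges before the kernel is applied, since the kernel treats both coordinates with the same width.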

[0039] Step S245: Extract the correlation features of ripple frequency and ripple amplitude from the voltage ripple trajectory segment to obtain voltage ripple correlation feature representation information, which includes the joint distribution representation of ripple frequency and ripple amplitude.

[0040] Specifically, after extracting voltage ripple trajectory segments, the DC component is removed from the voltage sequence within these segments. The resulting AC component sequence is transformed to the frequency domain, for example, using a Fast Fourier Transform (FFT). In the frequency domain representation, the frequency corresponding to the dominant peak of the power spectral density is identified as the ripple fundamental frequency. The peak-to-peak amplitude of the AC component sequence is taken as the ripple amplitude, and the ripple amplitude is normalized using the nominal operating voltage as a reference to obtain the relative ripple amplitude. The ripple fundamental frequency and the relative ripple amplitude constitute a two-dimensional feature vector. The two-dimensional feature vectors corresponding to multiple voltage ripple trajectory segments are processed by a kernel density estimation algorithm to generate a two-dimensional probability density function. This two-dimensional probability density function constitutes the joint distribution representation of the ripple frequency and ripple amplitude.
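
As a non-limiting illustration, the ripple feature extraction can be sketched as follows. To keep the example self-contained, a direct discrete Fourier transform is used in place of a Fast Fourier Transform; the two yield the same spectrum, and an FFT routine would be preferred for long segments. The function name ripple_features and its parameters are hypothetical, and a uniform sampling rate is assumed.

```python
import cmath
import math


def ripple_features(volts, sample_rate, nominal):
    """Return (ripple fundamental frequency in Hz, relative ripple amplitude)."""
    n = len(volts)
    mean = sum(volts) / n
    ac = [v - mean for v in volts]           # DC component removal

    best_bin, best_power = 0, -1.0
    for k in range(1, n // 2 + 1):           # positive-frequency bins only
        coeff = sum(ac[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        power = abs(coeff) ** 2              # unnormalized power spectrum
        if power > best_power:
            best_bin, best_power = k, power

    frequency = best_bin * sample_rate / n   # dominant spectral peak
    relative_amplitude = (max(ac) - min(ac)) / nominal  # peak-to-peak / nominal
    return frequency, relative_amplitude
```

The frequency resolution of this estimate is sample_rate / n, so a ripple whose frequency falls between two bins is reported at the nearest bin.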

[0041] Step S246: Input the voltage ripple correlation feature characterization information into the filter circuit abnormality symptom matching branch in the voltage fault symptom category library, calculate the symptom feature similarity distribution between the voltage ripple correlation feature characterization information and each candidate filter circuit abnormality symptom template in the filter circuit abnormality symptom matching branch, and determine the filter circuit abnormality symptom mapping association corresponding to the voltage ripple trajectory segment based on the similarity peak position in the symptom feature similarity distribution.

[0042] After receiving the voltage ripple correlation feature representation information, the filter circuit anomaly symptom matching branch uses the two-dimensional probability density function contained in that information as the query distribution. Each candidate filter circuit anomaly symptom template in the filter circuit anomaly symptom matching branch stores a reference two-dimensional probability density function representing a specific fault mode. The symptom feature similarity distribution is calculated by using the earth mover's distance (Wasserstein distance) metric to quantify the difference between the query distribution and each reference distribution. The quantified difference value is mapped to a similarity score through a monotonically decreasing function. The similarity scores corresponding to all candidate templates constitute the symptom feature similarity distribution vector. Peak detection is performed on this vector, and the semantic label of the candidate filter circuit anomaly symptom template corresponding to the similarity peak position is selected as the filter circuit anomaly symptom mapping association bound to the voltage ripple trajectory segment. This mapping association is, for example, a data structure that contains the start and end timestamps of the ripple trajectory segment, the fault category identifier of the selected template, and the matching similarity score.

[0043] Step S250: Based on the temperature step increase trend segment, temperature oscillation trend segment, and temperature continuous deviation trend segment in the temperature change trend characterization information, perform symptom category matching processing with the preset temperature fault symptom category library to obtain the heat dissipation path blockage symptom mapping association corresponding to temperature step increase, the heat dissipation fan operation abnormality symptom mapping association corresponding to temperature oscillation, and the heat dissipation medium failure symptom mapping association corresponding to temperature continuous deviation. Then, perform atlas organization processing on the heat dissipation path blockage symptom mapping association, the heat dissipation fan operation abnormality symptom mapping association, and the heat dissipation medium failure symptom mapping association to generate a temperature anomaly symptom mapping atlas.

[0044] Similar to the previous example, the pre-defined temperature fault symptom category library can store three types of fault symptom template clusters in key-value pairs. The heat dissipation path blockage symptom template cluster contains multiple reference template vectors with different combinations of step start time and step slope characteristics. Each reference template vector is associated with a semantic label for a heat dissipation path blockage subtype. The cooling fan malfunction symptom template cluster contains multiple reference template vectors with different combinations of oscillation frequency and oscillation amplitude characteristics. Each reference template vector is associated with a semantic label for a cooling fan malfunction subtype. The heat dissipation medium failure symptom template cluster contains multiple reference template vectors with different combinations of deviation from the baseline and deviation duration characteristics. Each reference template vector is associated with a semantic label for a heat dissipation medium failure subtype.

[0045] In an optional embodiment, step S250 specifically includes steps S251 to S256: Step S251: Extract the correlation features of the step start time and step slope of the temperature step upward trend segment to obtain temperature step correlation feature representation information, which includes the joint distribution representation of the step start time and step slope.

[0046] The temperature step upward trend segment is traversed with sliding windows. Within each sliding window, a univariate linear regression is performed with the sampling time within the window as the independent variable and the temperature within the window as the dependent variable, and the slope of the regression line is recorded. When the slope of a sliding window first exceeds a preset step slope threshold and the slopes of subsequent sliding windows remain at a high level, the start time of that sliding window is marked as the step start time. The step start time is expressed as a relative time offset in multiples of the sampling period. Within an observation interval after the step start time, the temperature sequence is linearly fitted, and the slope of the fitted line is taken as the step slope. The step start time and the step slope constitute a two-dimensional feature vector. The two-dimensional feature vectors corresponding to multiple temperature step upward trend segments are processed by a kernel density estimation algorithm to generate a two-dimensional probability density function, which constitutes the joint distribution representation of the step start time and the step slope.
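
As a non-limiting illustration, the detection of the step start time and the step slope can be sketched as follows. The sketch assumes uniformly sampled temperatures, uses the sample index as the regression variable so that times are expressed in multiples of the sampling period, and interprets "remain at a high level" as the next few window slopes also exceeding the threshold. The names ols_slope and step_features and the parameters confirm and observe are hypothetical.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def step_features(temps, window, slope_threshold, confirm=2, observe=5):
    """Return (step start index, step slope per sample), or None if no step."""
    idx = list(range(len(temps)))
    slopes = [ols_slope(idx[i:i + window], temps[i:i + window])
              for i in range(len(temps) - window + 1)]

    for i in range(len(slopes) - confirm):
        # First window above the threshold whose successors stay above it.
        if all(s > slope_threshold for s in slopes[i:i + confirm + 1]):
            end = min(i + observe, len(temps))
            return i, ols_slope(idx[i:end], temps[i:end])
    return None
```

For example, a sequence that is flat at 40 for six samples and then rises by 2 per sample yields a step start index of 5 and a step slope of 2 with window=3 and slope_threshold=1.5.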

[0047] Step S252: Input the temperature step correlation feature representation information into the heat dissipation path blockage symptom matching branch in the temperature fault symptom category library, calculate the symptom feature similarity distribution between the temperature step correlation feature representation information and each candidate heat dissipation path blockage symptom template in the heat dissipation path blockage symptom matching branch, and determine the heat dissipation path blockage symptom mapping association corresponding to the temperature step upward trend segment based on the similarity peak position in the symptom feature similarity distribution.

[0048] After receiving the temperature step correlation feature representation information, the heat dissipation path blockage symptom matching branch uses its contained two-dimensional probability density function as the query distribution. For each candidate heat dissipation path blockage symptom template in the heat dissipation path blockage symptom matching branch, the template itself stores a reference two-dimensional probability density function representing a specific fault mode. The calculation process of the symptom feature similarity distribution can use the Bhattacharyya distance metric to quantify the difference between the query distribution and each reference distribution. The quantified difference value is mapped to a similarity score through a monotonically decreasing function. The similarity scores corresponding to all candidate templates constitute the symptom feature similarity distribution vector. Peak detection is performed on this vector, and the semantic label of the candidate heat dissipation path blockage symptom template corresponding to the similarity peak position is selected as the heat dissipation path blockage symptom mapping association bound to the temperature step upward trend segment. This mapping association is specifically, for example, a data structure that contains the start and end timestamps of the step trend segment, the fault category identifier of the selected template, and the matching similarity score.

[0049] Step S253: Extract the correlation features of oscillation frequency and oscillation amplitude from the temperature oscillation trend segment to obtain temperature oscillation correlation feature characterization information, which includes the joint distribution characterization of oscillation frequency and oscillation amplitude.

[0050] After extracting temperature oscillation trend segments, the temperature sequence within these segments is detrended. A locally weighted scatterplot smoothing (LOWESS) algorithm is used to fit the slowly varying trend component of the temperature sequence, and this trend component is subtracted from the original temperature sequence to obtain the detrended residual sequence. A Fast Fourier Transform (FFT) is performed on the detrended residual sequence to obtain its power spectral density estimate. In the power spectral density curve, the frequencies corresponding to spectral peaks exceeding a preset significance level are identified as the dominant oscillation frequencies. The peak-to-peak amplitude of the detrended residual sequence is taken as the oscillation amplitude. The dominant oscillation frequency and the oscillation amplitude constitute a two-dimensional feature vector. The two-dimensional feature vectors corresponding to multiple temperature oscillation trend segments are processed by a kernel density estimation algorithm to generate a two-dimensional probability density function. This two-dimensional probability density function constitutes the joint distribution representation of the oscillation frequency and oscillation amplitude.

[0051] Step S254: Input the temperature oscillation correlation feature characterization information into the cooling fan operation abnormality symptom matching branch in the temperature fault symptom category library, calculate the symptom feature similarity distribution between the temperature oscillation correlation feature characterization information and each candidate cooling fan operation abnormality symptom template in the cooling fan operation abnormality symptom matching branch, and determine the cooling fan operation abnormality symptom mapping association corresponding to the temperature oscillation trend segment based on the similarity peak position in the symptom feature similarity distribution.

[0052] After receiving the temperature oscillation correlation feature characterization information, the cooling fan operation abnormality symptom matching branch uses the two-dimensional probability density function contained in that information as the query distribution. Each candidate cooling fan operation abnormality symptom template in the cooling fan operation abnormality symptom matching branch stores a reference two-dimensional probability density function representing a specific fault mode. The symptom feature similarity distribution is calculated by using the maximum mean discrepancy (MMD) metric to quantify the difference between the query distribution and each reference distribution. The quantified difference value is mapped to a similarity score through a radial basis function (RBF) kernel. The similarity scores corresponding to all candidate templates constitute the symptom feature similarity distribution vector. Peak detection is performed on this vector, and the semantic label of the candidate cooling fan operation abnormality symptom template corresponding to the similarity peak position is selected as the cooling fan operation abnormality symptom mapping association bound to the temperature oscillation trend segment. This mapping association is, for example, a data structure that contains the start and end timestamps of the oscillation trend segment, the fault category identifier of the selected template, and the matching similarity score.

[0053] Step S255: Extract the correlation features of deviation from baseline and duration of deviation for the temperature continuous deviation trend segment to obtain the temperature continuous deviation correlation feature representation information, which includes the joint distribution representation of deviation from baseline and duration of deviation.

[0054] After extracting the temperature continuous deviation trend segment, a historical temperature sequence prior to the start time of this segment is obtained, and the mean of this historical temperature sequence is calculated as the deviation baseline. For each sampling point within the temperature continuous deviation trend segment, the difference between its temperature value and the deviation baseline is calculated, and the average of the differences for all sampling points is taken as the deviation magnitude, that is, the deviation from baseline. The time span of the temperature continuous deviation trend segment is taken as the deviation duration. The deviation magnitude and the deviation duration constitute a two-dimensional feature vector. The two-dimensional feature vectors corresponding to multiple temperature continuous deviation trend segments are processed by a kernel density estimation algorithm to generate a two-dimensional probability density function. This two-dimensional probability density function constitutes the joint distribution representation of the deviation from baseline and the deviation duration.
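
As a non-limiting illustration, the deviation features and the kernel density estimation step can be sketched as follows. The sketch assumes a product Gaussian kernel with fixed, manually chosen bandwidths; the original text does not specify the kernel or the bandwidth selection rule. The names deviation_features and gaussian_kde_2d are hypothetical.

```python
import math


def deviation_features(history, seg_times, seg_temps):
    """Return (deviation magnitude, deviation duration) for one segment."""
    baseline = sum(history) / len(history)             # deviation baseline
    magnitude = sum(t - baseline for t in seg_temps) / len(seg_temps)
    duration = seg_times[-1] - seg_times[0]            # time span of segment
    return magnitude, duration


def gaussian_kde_2d(points, bandwidths):
    """Build a two-dimensional density function from feature vectors."""
    hx, hy = bandwidths
    norm = 1.0 / (len(points) * 2.0 * math.pi * hx * hy)

    def density(x, y):
        return norm * sum(
            math.exp(-0.5 * (((x - px) / hx) ** 2 + ((y - py) / hy) ** 2))
            for px, py in points)

    return density
```

The same density construction applies to the other two-dimensional feature vectors described in this embodiment, with bandwidths chosen to suit the scale of each feature.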

[0055] Step S256: Input the temperature continuous deviation associated feature characterization information into the heat dissipation medium failure symptom matching branch in the temperature fault symptom category library, calculate the symptom feature similarity distribution between the temperature continuous deviation associated feature characterization information and each candidate heat dissipation medium failure symptom template in the heat dissipation medium failure symptom matching branch, and determine the heat dissipation medium failure symptom mapping association corresponding to the temperature continuous deviation trend segment based on the similarity peak position in the symptom feature similarity distribution.

[0056] After receiving the temperature continuous deviation associated feature characterization information, the heat dissipation medium failure symptom matching branch uses the two-dimensional probability density function contained in that information as the query distribution. Each candidate heat dissipation medium failure symptom template in the heat dissipation medium failure symptom matching branch stores a reference two-dimensional probability density function representing a specific failure mode. The symptom feature similarity distribution is calculated by using the earth mover's distance (Wasserstein distance) metric to quantify the difference between the query distribution and each reference distribution. The quantified difference value is mapped to a similarity score through a monotonically decreasing function. The similarity scores corresponding to all candidate templates constitute the symptom feature similarity distribution vector. Peak detection is performed on this vector, and the semantic label of the candidate heat dissipation medium failure symptom template corresponding to the similarity peak position is selected as the heat dissipation medium failure symptom mapping association bound to the temperature continuous deviation trend segment. This mapping association is, for example, a data structure that contains the start and end timestamps of the continuous deviation trend segment, the fault category identifier of the selected template, and the matching similarity score.

[0057] Step S260: Based on the power consumption peak pulse segment, power consumption trough segment, and power consumption periodic fluctuation segment in the power consumption transient jump characterization information, perform symptom category matching processing with the preset power consumption fault symptom category library to obtain the power supply link short circuit symptom mapping association corresponding to the power consumption peak pulse, the power supply link open circuit symptom mapping association corresponding to the power consumption trough, and the load link abnormal symptom mapping association corresponding to the power consumption periodic fluctuation. Then, perform graph-based organization processing on the power supply link short circuit symptom mapping association, the power supply link open circuit symptom mapping association, and the load link abnormal symptom mapping association to generate a power consumption abnormal symptom mapping graph.

[0058] Similarly, the pre-defined power consumption fault symptom category library stores three types of fault symptom template clusters in key-value pair format. The power supply link short circuit symptom template cluster contains multiple reference template vectors with different combinations of pulse amplitude and pulse width characteristics, each reference template vector associated with a semantic label for a power supply link short circuit subtype. The power supply link open circuit symptom template cluster contains multiple reference template vectors with different combinations of valley depth and valley width characteristics, each reference template vector associated with a semantic label for a power supply link open circuit subtype. The load link anomaly symptom template cluster contains multiple reference template vectors with different combinations of fluctuation period and fluctuation amplitude characteristics, each reference template vector associated with a semantic label for a load link anomaly subtype.

[0059] The symptom category matching process employs a nearest neighbor classification strategy based on template matching. For power consumption peak pulse segments, the ratio of peak power consumption to average power consumption is extracted as the pulse amplitude, and the width of the pulse measured at half the peak amplitude is extracted as the pulse width. The Euclidean distance between the two-dimensional feature vector formed by the pulse amplitude and pulse width and the reference feature vector of each candidate template in the power supply link short circuit symptom template cluster is calculated one by one, and the candidate template with the smallest distance is selected as the matching result. For power consumption trough segments, the ratio of average power consumption to valley power consumption is extracted as the valley depth, and the width of the valley measured at half the depth is extracted as the valley width. The Euclidean distance between the two-dimensional feature vector formed by the valley depth and valley width and the reference feature vector of each candidate template in the power supply link open circuit symptom template cluster is calculated one by one, and the candidate template with the smallest distance is selected as the matching result. For power consumption periodic fluctuation segments, the dominant fluctuation period is calculated using the autocorrelation function, and the root mean square amplitude of the fluctuation sequence is extracted as the fluctuation amplitude. The Euclidean distance between the two-dimensional feature vector formed by the dominant fluctuation period and the fluctuation amplitude and the reference feature vector of each candidate template in the load link anomaly symptom template cluster is calculated one by one, and the candidate template with the smallest distance is selected as the matching result.
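
As a non-limiting illustration, the nearest neighbor matching step can be sketched as follows. The sketch assumes that each template cluster is a mapping from a semantic label to a reference feature vector; the function name nearest_template and the labels in the usage note are hypothetical.

```python
import math


def nearest_template(feature, templates):
    """Return (label, distance) of the template closest to the feature vector.

    `templates` maps a semantic label to a reference feature vector of the
    same dimension as `feature`.
    """
    label, reference = min(templates.items(),
                           key=lambda item: math.dist(feature, item[1]))
    return label, math.dist(feature, reference)
```

For instance, matching the feature vector (3.0, 0.02) against the hypothetical templates {"hard_short": (5.0, 0.01), "intermittent_short": (3.2, 0.02)} selects "intermittent_short". Because the two features generally have different units and ranges, they would typically be scaled before the distance is computed; the original text does not describe such scaling.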

[0060] The graph-based organization process organizes the discrete mapping association triples generated from the symptom category matching process into a directed graph data structure with nodes and edges. The power supply link short-circuit symptom mapping association, the power supply link open-circuit symptom mapping association, and the load link anomaly symptom mapping association serve as three source nodes in the graph. Each source node is connected to its mapped fault symptom category target node via categorized edges. The weight attribute of the edge stores the matching similarity value, and the additional attribute of the edge stores a quantified summary of the association features. The power consumption anomaly symptom mapping graph is the serialized representation of this directed graph data structure.
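
As a non-limiting illustration, the graph-based organization and its serialization can be sketched as follows. The sketch assumes that each mapping association is a dictionary with the hypothetical keys kind, fault_category, similarity, and features, and it assumes JSON as the serialization format; neither is specified in the original text.

```python
import json


def build_symptom_graph(associations):
    """Organize mapping associations into a directed graph of nodes and edges."""
    nodes, edges = set(), []
    for assoc in associations:
        source = f"association:{assoc['kind']}"           # source node
        target = f"category:{assoc['fault_category']}"    # target node
        nodes.update((source, target))
        edges.append({
            "source": source,
            "target": target,
            "weight": assoc["similarity"],    # matching similarity value
            "features": assoc["features"],    # summary of association features
        })
    return {"nodes": sorted(nodes), "edges": edges}


def serialize(graph):
    """Serialized representation of the directed graph."""
    return json.dumps(graph, sort_keys=True)
```

Each edge carries the matching similarity as its weight and the feature summary as an additional attribute, mirroring the edge attributes described above.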

[0061] Step S300: Construct a multi-source anomaly time-series correlation network for the target server motherboard based on the voltage anomaly symptom mapping map, temperature anomaly symptom mapping map, and power consumption anomaly symptom mapping map. The multi-source anomaly time-series correlation network is used to characterize the temporal dependency correlation of anomalies between different monitoring data stream segments and the correlation of the propagation path of anomalies.

[0062] The multi-source anomaly symptom timing correlation network can adopt a heterogeneous directed graph data structure. Its node set is composed of nodes associated with power supply link anomalies, voltage regulation anomalies, and filter circuit anomalies from the voltage anomaly symptom mapping graph; nodes associated with heat dissipation path blockage, cooling fan operation anomalies, and heat dissipation medium failure from the temperature anomaly symptom mapping graph; and nodes associated with power supply link short circuits, power supply link open circuits, and load link anomalies from the power consumption anomaly symptom mapping graph. Network edges represent the temporal order and causal relationship of the anomaly events corresponding to two nodes. The edge direction points from the earlier-occurring source anomaly node to the later-occurring target anomaly node. The edge weight stores the statistical distribution characteristics of the temporal offset between the source and target anomaly nodes.

[0063] In an optional embodiment, step S300 specifically includes steps S310 to S360: Step S310: Extract the voltage anomaly symptom occurrence time marker corresponding to each voltage anomaly symptom mapping association in the voltage anomaly symptom mapping map, and pair the voltage anomaly symptom occurrence time markers with the voltage anomaly symptom mapping associations to obtain a voltage anomaly symptom time sequence with a time index.

[0064] The voltage anomaly symptom occurrence time marker is the start timestamp of the voltage drop trajectory segment, voltage surge trajectory segment, or voltage ripple trajectory segment corresponding to the voltage anomaly symptom mapping association. For power supply link anomaly symptom mapping association nodes, the drop start time of the bound voltage drop trajectory segment is extracted as the voltage anomaly symptom occurrence time marker. For voltage regulation anomaly symptom mapping association nodes, the surge start time of the bound voltage surge trajectory segment is extracted as the voltage anomaly symptom occurrence time marker. For filter circuit anomaly symptom mapping association nodes, the start time of the bound voltage ripple trajectory segment is extracted as the voltage anomaly symptom occurrence time marker. The node identifier, fault category identifier, and voltage anomaly symptom occurrence time marker of each voltage anomaly symptom mapping association node are encapsulated into a timing event tuple. All timing event tuples are sorted in ascending order of their voltage anomaly symptom occurrence time markers, forming a voltage anomaly symptom time sequence with a time index. The time index is an integer sequence number that starts from 1 and increments by 1, corresponding one-to-one with the sorted event tuples.
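
As a non-limiting illustration, the construction of the indexed event sequence can be sketched as follows. The names TimingEvent and build_timing_sequence are hypothetical, and timestamps are assumed to be numeric values on a common time base.

```python
from typing import NamedTuple


class TimingEvent(NamedTuple):
    node_id: str          # identifier of the mapping association node
    fault_category: str   # fault category identifier
    occurred_at: float    # anomaly symptom occurrence time marker


def build_timing_sequence(events):
    """Sort events by occurrence time and pair each with an index starting at 1."""
    ordered = sorted(events, key=lambda event: event.occurred_at)
    return list(enumerate(ordered, start=1))
```

Because Python's sort is stable, events with identical occurrence times keep their input order, which gives a deterministic index assignment.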

[0065] Step S320: Extract the time markers of the occurrence of temperature anomalies corresponding to each temperature anomaly sign mapping association in the temperature anomaly sign mapping map, and pair the temperature anomaly sign occurrence time markers with the temperature anomaly sign mapping associations to obtain a time series sequence of temperature anomalies with a time index.

[0066] The time stamp for the occurrence of an abnormal temperature symptom is defined as the start timestamp of the temperature step increase trend segment, temperature oscillation trend segment, or temperature continuous deviation trend segment corresponding to the temperature anomaly symptom mapping association. For the heat dissipation path blockage symptom mapping association node, the start time of the step increase trend segment bound to it is extracted as the time stamp for the occurrence of the abnormal temperature symptom. For the cooling fan malfunction symptom mapping association node, the start time of the temperature oscillation trend segment bound to it is extracted as the time stamp for the occurrence of the abnormal temperature symptom. For the heat dissipation medium failure symptom mapping association node, the start time of the temperature continuous deviation trend segment bound to it is extracted as the time stamp for the occurrence of the abnormal temperature symptom. The node identifier, fault category identifier, and temperature anomaly symptom occurrence time stamp of each temperature anomaly symptom mapping association node are encapsulated into a time-series event tuple. All time-series event tuples are arranged in ascending order according to the chronological order of the temperature anomaly symptom occurrence time stamps to obtain a time-series sequence of temperature anomalies with a time-series index.

[0067] Step S330: Extract the power consumption anomaly symptom occurrence time marker corresponding to each power consumption anomaly symptom mapping association in the power consumption anomaly symptom mapping map, and pair the power consumption anomaly symptom occurrence time markers with the power consumption anomaly symptom mapping associations to obtain a power consumption anomaly symptom time sequence with a time index.

[0068] The time stamp for the occurrence of power consumption anomalies is defined as the start timestamp of the power consumption spike pulse segment, power consumption trough segment, or power consumption periodic fluctuation segment corresponding to the power consumption anomaly symptom mapping association. For power supply link short circuit symptom mapping association nodes, the time when the rising edge of its bound power consumption spike pulse segment crosses the half-peak amplitude point is extracted as the power consumption anomaly symptom occurrence time stamp. For power supply link open circuit symptom mapping association nodes, the time when the falling edge of its bound power consumption trough segment crosses the half-depth point is extracted as the power consumption anomaly symptom occurrence time stamp. For load link anomaly symptom mapping association nodes, the start time of its bound power consumption periodic fluctuation segment is extracted as the power consumption anomaly symptom occurrence time stamp. The node identifier, fault category identifier, and power consumption anomaly symptom occurrence time stamp of each power consumption anomaly symptom mapping association node are encapsulated into a time-series event tuple. All time-series event tuples are sorted in ascending order according to the order of the power consumption anomaly symptom occurrence time stamps to obtain a power consumption anomaly symptom time sequence with a time-series index.

[0069] Step S340: Input the time series sequences of voltage anomaly symptoms, temperature anomaly symptoms, and power consumption anomaly symptoms into the pre-constructed time series correlation analysis model. Perform time series alignment processing on the time markers of voltage anomaly symptoms, temperature anomaly symptoms, and power consumption anomaly symptoms to obtain the temporal dependency correlation characterization information between different anomalies. The temporal dependency correlation characterization information is used to characterize the temporal interval distribution of voltage anomaly symptoms occurring before temperature anomalies and the temporal interval distribution of power consumption anomalies occurring before voltage anomalies.

[0070] In an optional embodiment, step S340 specifically includes steps S341 to S346: Step S341: Input the voltage anomaly sign time series and the temperature anomaly sign time series into the first time series comparison branch of the time series correlation analysis model, perform cross-time series matching processing on the voltage anomaly sign occurrence time marker and the temperature anomaly sign occurrence time marker, calculate the first time offset distribution between the voltage anomaly sign occurrence time marker and the temperature anomaly sign occurrence time marker, and determine the first time series interval distribution in which the voltage anomaly sign occurs before the temperature anomaly sign based on the peak interval of the first time offset distribution.

[0071] After receiving the voltage anomaly sign time series and the temperature anomaly sign time series, the first time series comparison branch constructs a two-dimensional time offset analysis matrix. The row indices of the matrix correspond to each voltage anomaly sign event in the voltage anomaly sign time series, and the column indices correspond to each temperature anomaly sign event in the temperature anomaly sign time series. The matrix element values are the signed time offsets obtained by subtracting the occurrence timestamps of the row index events from the occurrence timestamps of the column index events; positive values indicate that the temperature anomaly sign occurs after the voltage anomaly sign, and negative values indicate that the temperature anomaly sign occurs before the voltage anomaly sign. All element values in the matrix are aggregated, and an adaptive bandwidth kernel density estimation method is used to generate a probability density curve for the time offsets; this curve is the first time offset distribution. In the first time offset distribution, continuous positive offset intervals with probability densities exceeding a preset density threshold are identified as peak intervals. The positive time offset range corresponding to these peak intervals is extracted and normalized into a probability density function of the time series interval distribution. This probability density function is the first time series interval distribution, which represents the statistical regularity that, given the occurrence of a voltage anomaly sign event, a temperature anomaly sign event will occur with a specific probability density within a specific time window thereafter.
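
The offset matrix, density estimate, and peak interval extraction described above can be illustrated with the following simplified Python sketch. It substitutes a fixed-bandwidth Gaussian kernel for the adaptive bandwidth estimator named in the text. The function names, the bandwidth of 2.0, the density threshold of 0.05, the evaluation grid, and the timestamps are all assumptions chosen for the example.

```python
import math

def offset_matrix(source_ts, target_ts):
    """Rows: source events; columns: target events; entry = target - source."""
    return [[t - s for t in target_ts] for s in source_ts]

def gaussian_kde(samples, bandwidth):
    """Fixed-bandwidth Gaussian KDE (a simplification of adaptive bandwidth)."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                   for s in samples) / norm
    return density

def peak_intervals(density, grid, threshold):
    """Contiguous runs of positive grid points whose density exceeds threshold."""
    intervals, run = [], []
    for x in grid:
        if x > 0 and density(x) > threshold:
            run.append(x)
        elif run:
            intervals.append((run[0], run[-1]))
            run = []
    if run:
        intervals.append((run[0], run[-1]))
    return intervals

def normalize_on(density, lo, hi, step):
    """Renormalize the density restricted to [lo, hi] so it integrates to 1."""
    xs = [lo + i * step for i in range(int(round((hi - lo) / step)) + 1)]
    vals = [density(x) for x in xs]
    area = sum(vals) * step
    return xs, [v / area for v in vals]

voltage_ts = [10.0, 50.0, 90.0]       # made-up onset times, seconds
temperature_ts = [15.0, 56.0, 94.0]
matrix = offset_matrix(voltage_ts, temperature_ts)
offsets = [v for row in matrix for v in row]
density = gaussian_kde(offsets, bandwidth=2.0)
grid = [i * 0.5 for i in range(-200, 201)]
intervals = peak_intervals(density, grid, threshold=0.05)
xs, interval_pdf = normalize_on(density, intervals[0][0], intervals[0][1], 0.5)
```

In this sample only the offsets clustered near five seconds exceed the threshold, so a single peak interval is retained and renormalized. The same routine would apply unchanged to the second and third comparison branches.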

[0072] Step S342: Input the time series sequence of abnormal temperature symptoms and the time series sequence of abnormal power consumption symptoms into the second time series comparison branch of the time series correlation analysis model, perform cross-time series matching processing on the time markers of the occurrence of abnormal temperature symptoms and the time markers of the occurrence of abnormal power consumption symptoms, calculate the second time offset distribution between the time markers of the occurrence of abnormal temperature symptoms and the time markers of the occurrence of abnormal power consumption symptoms, and determine the second time series interval distribution in which the abnormal temperature symptoms occur before the abnormal power consumption symptoms based on the peak interval of the second time offset distribution.

[0073] After receiving the time-series sequences of temperature anomaly symptoms and power consumption anomaly symptoms, the second time series comparison branch constructs a two-dimensional time offset analysis matrix. The row indices of the matrix correspond to each temperature anomaly event in the temperature anomaly symptoms time-series sequence, and the column indices correspond to each power consumption anomaly event in the power consumption anomaly symptoms time-series sequence. The matrix element values are the signed time offsets obtained by subtracting the occurrence timestamps of the row index events from the occurrence timestamps of the column index events; positive values indicate that the power consumption anomaly occurred after the temperature anomaly, and negative values indicate that the power consumption anomaly occurred before the temperature anomaly. All element values in the matrix are aggregated, and an adaptive bandwidth kernel density estimation method is used to generate a probability density curve for the time offsets; this curve is the second time offset distribution. In the second time offset distribution, continuous positive offset intervals with probability densities exceeding a preset density threshold are identified as peak intervals. The positive time offset range corresponding to these peak intervals is extracted and normalized to a probability density function of the time-series interval distribution; this probability density function is the second time-series interval distribution.

[0074] Step S343: Input the power consumption anomaly symptom time series and the voltage anomaly symptom time series into the third time series comparison branch of the time series correlation analysis model, perform cross-time series matching processing on the power consumption anomaly symptom occurrence time marker and the voltage anomaly symptom occurrence time marker, calculate the third time offset distribution between the power consumption anomaly symptom occurrence time marker and the voltage anomaly symptom occurrence time marker, and determine the third time series interval distribution in which the power consumption anomaly symptom occurs before the voltage anomaly symptom based on the peak interval of the third time offset distribution.

[0075] After receiving the power consumption anomaly event time series and the voltage anomaly event time series, the third time series comparison branch constructs a two-dimensional time offset analysis matrix. The row indices of the matrix correspond to each power consumption anomaly event in the power consumption anomaly event time series, and the column indices correspond to each voltage anomaly event in the voltage anomaly event time series. The matrix element values are the signed time offsets obtained by subtracting the occurrence timestamps of the row index events from the occurrence timestamps of the column index events; positive values indicate that the voltage anomaly event occurred after the power consumption anomaly event, and negative values indicate that the voltage anomaly event occurred before the power consumption anomaly event. All element values in the matrix are aggregated, and an adaptive bandwidth kernel density estimation method is used to generate a probability density curve for the time offsets; this curve is the third time offset distribution. In the third time offset distribution, continuous positive offset intervals with probability densities exceeding a preset density threshold are identified as peak intervals. The positive time offset range corresponding to these peak intervals is extracted and normalized to a probability density function of the timing interval distribution; this probability density function is the third timing interval distribution.

[0076] Step S344: Perform time-dependent consistency verification on the first time-series interval distribution, the second time-series interval distribution, and the third time-series interval distribution to detect any time-transfer consistency conflict among them, and perform time constraint adjustment on the time-series interval distributions that exhibit a time-transfer consistency conflict to obtain the adjusted first time-series interval distribution, the adjusted second time-series interval distribution, and the adjusted third time-series interval distribution.

[0077] The timing-dependent consistency check is based on the mathematical constraint of timing transitivity. In an ideal causal chain, if the time interval between voltage anomalies and temperature anomalies follows the first timing interval distribution, and the time interval between temperature anomalies and power consumption anomalies follows the second timing interval distribution, then the indirect timing interval distribution obtained by convolving these two distributions, which describes voltage anomalies preceding power consumption anomalies, should be statistically consistent with the third timing interval distribution. In actual calculations, continuous convolution is performed on the probability density functions of the first and second timing interval distributions to obtain the indirect timing interval distribution. The difference between the indirect and third timing interval distributions is calculated using the earth mover's distance (also known as the Wasserstein distance). When the difference exceeds a preset consistency tolerance threshold, a timing transitivity consistency conflict is identified.
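
A minimal numerical sketch of this check, under stated assumptions, is given below. The probability density functions are sampled on a uniform grid starting at zero with step `dt`, the convolution is discretized, and the distance between the two distributions is computed as the one-dimensional earth mover's (Wasserstein-1) distance, that is, the integrated absolute difference of the cumulative distributions. The function names, the sample distributions, and the tolerance of 0.5 are hypothetical.

```python
def convolve_pdf(p, q, dt):
    """Discrete approximation of the continuous convolution of two PDFs
    sampled on the same uniform grid (step dt, starting at offset 0)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj * dt
    return out

def earth_movers_distance(p, q, dt):
    """1-D earth mover's distance: integral of |CDF_p - CDF_q|."""
    n = max(len(p), len(q))
    p = list(p) + [0.0] * (n - len(p))
    q = list(q) + [0.0] * (n - len(q))
    cdf_gap, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += (pi - qi) * dt
        total += abs(cdf_gap) * dt
    return total

dt = 1.0
first = [0.0, 0.0, 1.0, 0.0]                     # all mass at a delay of 2
second = [0.0, 0.0, 0.0, 1.0]                    # all mass at a delay of 3
third = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]      # all mass at a delay of 6

indirect = convolve_pdf(first, second, dt)       # mass lands at 2 + 3 = 5
distance = earth_movers_distance(indirect, third, dt)
conflict = distance > 0.5                        # hypothetical tolerance
```

Here the indirect distribution concentrates at a delay of 5 while the third concentrates at 6, so the distance equals one time unit and the sample tolerance flags a conflict.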

[0078] When a temporal consistency conflict is detected, the temporal constraint adjustment process employs the alternating direction method of multipliers (ADMM) to jointly optimize and adjust the first, second, and third temporal interval distributions. The optimization objective function consists of three terms: the first term is the information divergence penalty term between the adjusted distribution and the original distribution; the second term is the consistency constraint term between the convolution result of the adjusted first and second distributions and the adjusted third distribution; and the third term is the smoothness regularization term. This optimization problem is solved iteratively until convergence or until the preset maximum number of iterations is reached, and the adjusted first, second, and third temporal interval distributions are output.
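
One possible way to write the three-term objective described above is shown below, for illustration only. The text does not fix the concrete form of any term, so every choice here is an assumption: the Kullback-Leibler divergence is used as the information divergence, the distance W stands for the consistency measure between distributions, the squared second derivative is used as the smoothing penalty, and λ and μ are hypothetical weighting coefficients. The symbols p_k denote the original distributions and p̃_k the adjusted ones.

```latex
\min_{\tilde p_1,\,\tilde p_2,\,\tilde p_3}\;
  \sum_{k=1}^{3} D_{\mathrm{KL}}\!\left(\tilde p_k \,\middle\|\, p_k\right)
  \;+\; \lambda\, W\!\left(\tilde p_1 * \tilde p_2,\; \tilde p_3\right)
  \;+\; \mu \sum_{k=1}^{3} \int \left(\tilde p_k''(\tau)\right)^{2} d\tau
\qquad \text{s.t.}\quad \tilde p_k(\tau) \ge 0,\quad
  \int \tilde p_k(\tau)\, d\tau = 1,\quad k = 1,2,3 .
```

The constraints simply keep each adjusted function a valid probability density.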

[0079] Step S345: Based on the adjusted first time interval distribution, the adjusted second time interval distribution, and the adjusted third time interval distribution, construct a time dependency array between the voltage anomaly symptom time sequence, the temperature anomaly symptom time sequence, and the power consumption anomaly symptom time sequence. The row vectors of the time dependency array represent the source anomaly symptom categories, the column vectors of the time dependency array represent the target anomaly symptom categories, and the element values of the time dependency array represent the time interval distribution between the source anomaly symptom categories and the target anomaly symptom categories.

[0080] The temporal dependency array is a three-dimensional data structure. Its first dimension is the source anomaly category index, the second dimension is the target anomaly category index, and the third dimension is a sampling representation of the probability density function of the temporal interval distribution. The source anomaly categories include three subcategories for voltage anomalies, three subcategories for temperature anomalies, and three subcategories for power consumption anomalies, totaling nine source categories. The target anomaly categories are also divided into the same nine categories. For any combination of source and target anomaly categories, if the combination has been explicitly modeled in the temporal alignment across the three modes, the array element values are taken from the corresponding adjusted temporal interval distribution. If the combination has not been explicitly modeled, the array element values are indirectly obtained by convolving the temporal interval distributions on the modeled paths, or assigned a zero probability distribution to indicate that there is no direct or indirect temporal dependency between the source and target categories.

[0081] Step S346: Convert the temporal dependency array into temporal sequential dependency association representation information with a directed acyclic graph structure. The direction of the directed edges in the temporal sequential dependency association representation information represents the order in which the abnormal symptoms appear, and the weight of the directed edges represents the central tendency measure of the temporal interval distribution.

[0082] The process of converting the temporal dependency array into a directed acyclic graph (DAG) structure employs a combination of threshold filtering and topological sorting. It iterates through all elements in the temporal dependency array. For source-target category combinations with non-zero element values, the expected value of the temporal interval distribution represented by that element value is calculated as the average time delay between the source-target category pair. When the average time delay is positive and the expected value exceeds a preset minimum delay threshold, a candidate directed edge is created from the source anomaly category node to the target anomaly category node. A depth-first search is performed on the initial directed graph formed by the candidate directed edges to detect the existence of directed loops. If a directed loop is detected, the directed edge with the lowest confidence in the temporal interval distribution within the loop is pruned until no directed loops exist in the graph. In the final DAG, the weight of each directed edge is assigned the expected value of the corresponding temporal interval distribution. This DAG and its edge weights constitute the temporal dependency association information.
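
The threshold filtering, cycle detection, and pruning described above can be sketched as follows, by way of example only. The text does not define how the confidence of a temporal interval distribution is measured, so the `confidence` values are treated as given inputs. The function name, the category labels "V", "T", and "P", and all numeric values are hypothetical.

```python
def build_category_dag(mean_delay, confidence, min_delay):
    """mean_delay / confidence: dict mapping (src, dst) to a float.
    Returns the surviving edges with their mean delay as weight."""
    edges = {e: d for e, d in mean_delay.items() if d > 0 and d > min_delay}

    def find_cycle():
        adj = {}
        for u, v in edges:
            adj.setdefault(u, []).append(v)
        color, stack = {}, []          # 1 = on current DFS path, 2 = finished

        def dfs(u):
            color[u] = 1
            stack.append(u)
            for v in adj.get(u, []):
                if color.get(v, 0) == 1:           # back edge closes a cycle
                    cyc = stack[stack.index(v):] + [v]
                    return list(zip(cyc, cyc[1:]))
                if color.get(v, 0) == 0:
                    found = dfs(v)
                    if found:
                        return found
            color[u] = 2
            stack.pop()
            return None

        for node in list(adj):
            if color.get(node, 0) == 0:
                found = dfs(node)
                if found:
                    return found
        return None

    while True:
        cycle = find_cycle()
        if not cycle:
            return edges
        weakest = min(cycle, key=lambda e: confidence[e])
        del edges[weakest]             # prune the least confident edge

mean_delay = {("V", "T"): 5.0, ("T", "P"): 8.0, ("P", "V"): 3.0, ("V", "P"): 0.2}
confidence = {("V", "T"): 0.9, ("T", "P"): 0.8, ("P", "V"): 0.4, ("V", "P"): 0.7}
dag = build_category_dag(mean_delay, confidence, min_delay=0.5)
```

In this sample the edge from "V" to "P" is dropped by the delay threshold, and the remaining three edges form a cycle that is broken by removing the least confident edge, from "P" to "V".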

[0083] Step S350: Based on the mapping associations of voltage anomaly symptoms, temperature anomaly symptoms, and power consumption anomaly symptoms, and combined with the temporal dependency association representation information, construct a multi-source anomaly symptom temporal association network for the target server motherboard. The multi-source anomaly symptom temporal association network uses the mapping associations of voltage anomaly symptoms, temperature anomaly symptoms, and power consumption anomaly symptoms as network nodes, and the temporal dependency association representation information as network edges. The network edges are used to represent the propagation path and propagation time delay of anomalies between network nodes.

[0084] The construction process of the multi-source anomaly symptom temporal correlation network instantiates the category-level directed acyclic graph in the temporal dependency association representation information into an event-level network topology. For each category-level directed edge in the temporal dependency association representation information, candidate edges are established between all specific anomaly symptom event nodes corresponding to the source anomaly symptom category and all specific anomaly symptom event nodes corresponding to the target anomaly symptom category. For each candidate edge, the actual time offset between the occurrence time marker of the source event node and the occurrence time marker of the target event node is calculated, and this actual time offset is compared with the category-level temporal interval distribution. If the actual time offset is within the high probability density interval of the category-level temporal interval distribution, the candidate edge is retained as a formal network edge, and the propagation time delay attribute of the edge is set to the actual time offset. If the actual time offset deviates significantly from the category-level temporal interval distribution, the candidate edge is discarded. Finally, all the retained network nodes and all formal network edges together constitute the multi-source anomaly symptom temporal correlation network.
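
As a non-limiting illustration of this instantiation step, the sketch below represents the high probability density interval of each category-level edge as a simple closed range `(low, high)` and keeps a candidate edge only when its actual offset falls inside that range. The function name, the category label `voltage_drop`, and the sample data are assumptions for the example.

```python
def instantiate_edges(events, category_edges):
    """events: list of (node_id, category, onset_ts).
    category_edges: dict mapping (src_category, dst_category) to the
    (low, high) high-density offset interval of that category pair.
    Returns formal edges as (src_id, dst_id, propagation_delay)."""
    network = []
    for (src_cat, dst_cat), (low, high) in category_edges.items():
        for s_id, s_cat, s_ts in events:
            if s_cat != src_cat:
                continue
            for d_id, d_cat, d_ts in events:
                if d_cat != dst_cat:
                    continue
                offset = d_ts - s_ts              # actual time offset
                if low <= offset <= high:         # inside high-density range
                    network.append((s_id, d_id, offset))
    return network

events = [
    ("V1", "voltage_drop", 10.0),
    ("V2", "voltage_drop", 50.0),
    ("T1", "heat_path_blockage", 15.0),
    ("T2", "heat_path_blockage", 56.0),
]
category_edges = {("voltage_drop", "heat_path_blockage"): (4.0, 6.0)}
formal_edges = instantiate_edges(events, category_edges)
```

Of the four candidate pairs in this sample, only the two whose offsets are 5 and 6 seconds are retained; the pairs with offsets of 46 and -35 seconds are discarded.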

[0085] Step S360: Perform network topology analysis on the multi-source anomaly time series correlation network, extract the key convergence nodes and key divergence nodes in the multi-source anomaly time series correlation network, and generate an anomaly propagation path map of the multi-source anomaly time series correlation network based on the key convergence nodes and key divergence nodes. The anomaly propagation path map is used to characterize the path hierarchy and path branch distribution of anomalies propagating from key divergence nodes to key convergence nodes.

[0086] In an optional embodiment, step S360 specifically includes steps S361 to S366: Step S361: Calculate the number of in-degree edges and the number of out-degree edges for each network node in the multi-source anomaly symptom time-series correlation network, and construct the degree distribution representation information of the network node based on the number of in-degree edges and the number of out-degree edges.

[0087] For each node in the multi-source anomaly symptom time-series correlation network, all its incoming edges are traversed and counted cumulatively to obtain the number of in-degree edges. All its outgoing edges are traversed and counted cumulatively to obtain the number of out-degree edges. The numbers of in-degree edges of all nodes in the network are aggregated into an in-degree sequence, and the mean and standard deviation of the in-degree sequence are calculated. The numbers of out-degree edges of all nodes in the network are aggregated into an out-degree sequence, and the mean and standard deviation of the out-degree sequence are calculated. The degree distribution representation information includes the number of in-degree edges, the number of out-degree edges, the empirical cumulative distribution function of the in-degree sequence, and the empirical cumulative distribution function of the out-degree sequence for each node.

[0088] Step S362: Perform node importance sorting on the degree distribution representation information, mark network nodes with more in-degree edges than the in-degree threshold as candidate convergence node set, and mark network nodes with more out-degree edges than the out-degree threshold as candidate divergence node set.

[0089] The node importance ranking process is based on the concept of degree centrality. The in-degree threshold is set to the mean of the in-degree sequence plus a dynamically adjustable multiple of the standard deviation of the in-degree sequence, which is dynamically determined according to the network size. For each network node, if the number of its in-degree edges exceeds this in-degree threshold, the node is added to the candidate set of convergent nodes. The out-degree threshold is set to the mean of the out-degree sequence plus a dynamically adjustable multiple of the standard deviation of the out-degree sequence. For each network node, if the number of its out-degree edges exceeds this out-degree threshold, the node is added to the candidate set of divergent nodes.
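
The degree counting and the mean-plus-multiple-of-standard-deviation thresholds can be illustrated with the following sketch. The multiple `k` is fixed at 1.0 here, whereas the text leaves it dynamically adjustable; the population standard deviation is used as one possible choice. The function name and the sample graph are hypothetical.

```python
from statistics import mean, pstdev

def degree_candidates(nodes, edges, k=1.0):
    """Count in/out degree per node and mark nodes whose degree exceeds
    mean + k * standard deviation of the corresponding degree sequence."""
    indeg = {n: 0 for n in nodes}
    outdeg = {n: 0 for n in nodes}
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
    in_seq, out_seq = list(indeg.values()), list(outdeg.values())
    in_thr = mean(in_seq) + k * pstdev(in_seq)
    out_thr = mean(out_seq) + k * pstdev(out_seq)
    convergent = {n for n in nodes if indeg[n] > in_thr}
    divergent = {n for n in nodes if outdeg[n] > out_thr}
    return indeg, outdeg, convergent, divergent

nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("A", "D"),
         ("B", "E"), ("C", "E"), ("D", "E")]
indeg, outdeg, convergent, divergent = degree_candidates(nodes, edges)
```

In this sample both degree sequences have mean 1.2 and standard deviation of about 0.98, so only node "E" (in-degree 3) enters the candidate convergence set and only node "A" (out-degree 3) enters the candidate divergence set.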

[0090] Step S363: Perform convergence path backtracking analysis on each candidate convergence node in the candidate convergence node set, extract all predecessor network nodes connected to the candidate convergence node and the edge weight distribution between the predecessor network nodes and the candidate convergence node, and calculate the convergence strength metric of the candidate convergence node based on the edge weight distribution.

[0091] For each candidate convergence node in the candidate convergence node set, a reverse breadth-first traversal is performed along the incoming edge direction in the multi-source anomaly symptom temporal correlation network to collect all predecessor network nodes reachable by directed paths from that candidate convergence node. For each predecessor network node, all directed paths between it and the candidate convergence node are determined, and the sum of the edge weights on each path is calculated as the cumulative propagation delay of that path. The minimum cumulative propagation delay among all directed paths is taken as the effective time distance between the predecessor network node and the candidate convergence node. The convergence strength metric is defined as a composite index that is proportional to the total number of predecessor network nodes and inversely proportional to the harmonic mean of the effective time distances of predecessor network nodes. Specifically, the convergence strength metric is the product of the total number of predecessor network nodes and the reciprocal of the harmonic mean of the effective time distances of predecessor network nodes.
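
For illustration only, the convergence strength metric can be sketched as below. Instead of enumerating every directed path as the text describes, the sketch uses Dijkstra's algorithm on the reversed graph, which yields the same minimum cumulative delay when all edge delays are positive. Note that the product of the predecessor count and the reciprocal of the harmonic mean reduces to the sum of the reciprocal effective time distances. Function names and sample values are assumptions.

```python
import heapq

def effective_distances(edges, target):
    """Minimum cumulative delay from every predecessor to `target`.
    edges: list of (src, dst, delay) with strictly positive delays."""
    reverse = {}
    for src, dst, w in edges:
        reverse.setdefault(dst, []).append((src, w))
    dist = {target: 0.0}
    heap = [(0.0, target)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                      # stale heap entry
        for pred, w in reverse.get(node, []):
            nd = d + w
            if nd < dist.get(pred, float("inf")):
                dist[pred] = nd
                heapq.heappush(heap, (nd, pred))
    del dist[target]
    return dist

def convergence_strength(edges, target):
    """Predecessor count times the reciprocal of the harmonic mean
    of the effective time distances."""
    dist = effective_distances(edges, target)
    if not dist:
        return 0.0
    n = len(dist)
    harmonic_mean = n / sum(1.0 / d for d in dist.values())
    return n * (1.0 / harmonic_mean)

edges = [("A", "C", 2.0), ("B", "C", 4.0), ("A", "B", 1.0)]
distances = effective_distances(edges, "C")
strength = convergence_strength(edges, "C")
```

In this sample node "A" reaches "C" directly with delay 2 and via "B" with delay 5, so its effective time distance is 2. Running the same routine on the forward graph instead of the reversed one would give the corresponding quantities for successor nodes.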

[0092] Step S364: Perform forward tracing of divergent paths for each candidate divergent node in the candidate divergent node set, extract all successor network nodes connected to the candidate divergent node and the edge weight distribution between the candidate divergent node and its successor network nodes, and calculate the divergence strength metric of the candidate divergent node based on the edge weight distribution.

[0093] For each candidate diverging node in the candidate diverging node set, a forward breadth-first traversal is performed along the outgoing edges in the multi-source anomaly symptom temporal correlation network to collect all successor network nodes reachable by directed paths from the candidate diverging node. For each successor network node, all directed paths between the candidate diverging node and that successor network node are determined. The sum of the edge weights on each path is calculated as the cumulative propagation delay of that path, and the minimum cumulative propagation delay among all directed paths is taken as the effective time distance between the candidate diverging node and the successor network node. The divergence intensity metric is defined as a composite index that is directly proportional to the total number of successor network nodes and inversely proportional to the harmonic mean of the effective time distances of successor network nodes. Specifically, the divergence intensity metric is the product of the total number of successor network nodes and the reciprocal of the harmonic mean of the effective time distances of successor network nodes.

[0094] Step S365: The candidate convergence node set is screened according to the convergence intensity metric. Candidate convergence nodes whose convergence intensity metric exceeds the convergence intensity threshold are identified as key convergence nodes in the multi-source anomaly symptom time-series correlation network. Key convergence nodes represent the concentrated area of fault impact in the propagation path of anomalies.

[0095] The convergence intensity threshold is adaptively determined based on the distribution characteristics of the convergence intensity metrics of all candidate convergence nodes. The median and median absolute deviation of the convergence intensity metrics for all nodes in the candidate convergence node set are calculated, and the convergence intensity threshold is the median plus an adjustable multiple of the median absolute deviation. For each candidate convergence node, if its convergence intensity metric exceeds the convergence intensity threshold, it is marked as a critical convergence node. Critical convergence nodes indicate the location in the multi-source anomaly temporal correlation network where the impacts of faults from multiple different physical subsystems and multiple different anomaly categories ultimately converge and manifest in both time and space.

[0096] Step S366: The candidate divergent node set is screened according to the divergence intensity metric. Candidate divergent nodes whose divergence intensity metric exceeds the divergence intensity threshold are identified as key divergent nodes in the multi-source anomaly symptom time-series correlation network. Key divergent nodes represent the source of fault impact diffusion in the anomaly symptom propagation path.

[0097] The divergence intensity threshold is adaptively determined based on the distribution characteristics of the divergence intensity metrics of all candidate divergence nodes. The median and median absolute deviation of the divergence intensity metrics for all nodes in the candidate divergence node set are calculated, and the divergence intensity threshold is taken as the median plus an adjustable multiple of the median absolute deviation. For each candidate divergence node, if its divergence intensity metric exceeds the divergence intensity threshold, it is marked as a critical divergence node. A critical divergence node indicates, in a multi-source anomaly temporal correlation network, the initial location where an early anomaly of a single physical subsystem acts as a fault source, radiating the fault's impact to multiple different physical subsystems through physical coupling, thermal coupling, or load coupling.
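
The median-plus-multiple-of-median-absolute-deviation rule used for both the convergence and the divergence intensity thresholds can be sketched as follows. The multiple `k`, the function names, and the sample metric values are hypothetical.

```python
from statistics import median

def mad_threshold(values, k):
    """Median plus k times the median absolute deviation (MAD)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return med + k * mad

def select_key_nodes(metric_by_node, k=3.0):
    """Keep the candidate nodes whose metric exceeds the adaptive threshold."""
    threshold = mad_threshold(list(metric_by_node.values()), k)
    return {n for n, m in metric_by_node.items() if m > threshold}

# Made-up intensity metrics for five candidate nodes.
intensity = {"N1": 0.2, "N2": 0.3, "N3": 0.4, "N4": 2.0, "N5": 0.25}
key_nodes = select_key_nodes(intensity, k=3.0)
```

With these sample values the median is 0.3 and the MAD is about 0.1, so the threshold is about 0.6 and only "N4" is marked. Because the median and the MAD are robust statistics, one very large metric does not inflate the threshold the way a mean and standard deviation would.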

[0098] Step S400: Perform fault root cause localization analysis on the multi-source abnormal symptom time-series correlation network to obtain the fault root cause localization analysis results of the target server motherboard. The fault root cause localization analysis results include the identifier of the monitoring data stream segment where the fault root cause is located, the identifier of the abnormal symptom category corresponding to the fault root cause, and the propagation influence range representation information of the fault root cause in the multi-source abnormal symptom time-series correlation network.

[0099] The goal of fault root cause analysis is to identify the node with the highest causal priority as the fault root cause node in the multi-source anomaly symptom temporal correlation network. This root cause node should meet two core characteristics: first, it has no incoming edges in the network topology, or although it has incoming edges, the predecessor nodes associated with these edges are temporal rather than causal; second, it has significant fault impact propagation strength, meaning its outgoing edges can reach a considerable number of successor nodes in the network. The monitoring data stream segment identifier indicates which segment the root cause node was initially detected from: the motherboard voltage monitoring data stream segment, the motherboard temperature monitoring data stream segment, or the motherboard power consumption monitoring data stream segment. The anomaly symptom category identifier corresponding to the root cause node indicates the specific fault semantic label of the voltage anomaly symptom mapping association, temperature anomaly symptom mapping association, or power consumption anomaly symptom mapping association corresponding to the root cause node. The propagation impact range characterization information describes the scale and structural hierarchy of the network subgraph that can be covered from the fault root cause node.

[0100] In an optional embodiment, step S400 includes steps S410 to S460: Step S410: Extract the key divergent nodes and the edge weight distribution of the subsequent network nodes connected to the key divergent nodes in the multi-source anomaly symptom temporal correlation network, and determine the propagation intensity of the fault impact of the key divergent nodes on the subsequent network nodes based on the edge weight distribution.

[0101] For each critical divergent node, all outgoing edges are retrieved in the multi-source anomaly symptom temporal correlation network. The successor network nodes pointed to by each outgoing edge are collected, and the weight value of each outgoing edge is extracted. The edge weight distribution is represented by a statistical histogram of these outgoing edge weight values. The fault impact propagation strength is defined as the weighted sum of the influence of the critical divergent node on all its direct successor network nodes. The influence contribution value of each direct successor network node is calculated using a decay function with the edge weight as the independent variable. The function value monotonically decreases as the edge weight increases, indicating that the greater the propagation delay, the weaker the instantaneous impact. The fault impact propagation strength is the sum of the influence contribution values of all direct successor network nodes.

[0102] Step S420: Extract the key convergence nodes and the edge weight distribution of the predecessor network nodes connected to the key convergence nodes in the multi-source anomaly symptom time-series correlation network, and determine the convergence intensity of the fault impact of the predecessor network nodes on the key convergence nodes based on the edge weight distribution.

[0103] For each critical convergence node, all incoming edges are retrieved in the multi-source anomaly symptom temporal correlation network. The starting predecessor network node of each incoming edge is collected, and the weight value of each incoming edge is extracted. The edge weight distribution is represented by a statistical histogram of these incoming edge weight values. The convergence intensity of the fault impact is defined as the weighted sum of the influence received by the critical convergence node from all its direct predecessor network nodes. The influence contribution value of each direct predecessor network node is calculated using a decay function with the edge weight as the independent variable. The function value decreases monotonically as the edge weight value increases. The convergence intensity of the fault impact is the sum of the influence contribution values of all direct predecessor network nodes.
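
The text requires only that the decay function decrease monotonically with the edge weight; its concrete form is not specified. As one possible instance, the sketch below assumes an exponential decay with a hypothetical time constant `tau`. The same function serves for the fault impact propagation strength of step S410 (applied to outgoing edge delays) and the fault impact convergence intensity of step S420 (applied to incoming edge delays).

```python
import math

def impact_strength(edge_delays, tau=10.0):
    """Sum of per-edge influence contributions, each computed with an
    exponential decay exp(-delay / tau). The exponential form and the
    value of tau are assumptions; any monotonically decreasing function
    of the delay would fit the description."""
    return sum(math.exp(-delay / tau) for delay in edge_delays)

outgoing_delays = [0.0, 10.0]     # made-up delays of a divergent node's edges
incoming_delays = [5.0, 5.0, 20.0]
propagation = impact_strength(outgoing_delays)
convergence = impact_strength(incoming_delays)
```

A shorter delay contributes more than a longer one, which matches the statement that a greater propagation delay implies a weaker instantaneous impact.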

[0104] Step S430: Input the key diverging node, key converging node, fault impact propagation intensity, and fault impact convergence intensity into the pre-constructed fault root cause localization analysis model, perform reverse tracing of the fault propagation path in the multi-source anomaly symptom time-series correlation network, and locate the fault root cause node of the target server motherboard. The fault root cause node is a network node in the multi-source anomaly symptom time-series correlation network that has fault impact propagation intensity but does not have the fault impact convergence intensity of the predecessor network node.

[0105] The pre-built fault root cause localization analysis model maintains a set of candidate fault root cause nodes. The size of the candidate set is gradually reduced by iteratively applying topological constraints, strength constraints, and time-series constraints until it converges to a unique fault root cause node or a minimal set of root cause nodes. The fault propagation path reverse tracing process starts from the identified key convergence node and backtracks step-by-step along the incoming edges. At each backtracking step, it evaluates whether the current node possesses sufficient conditions to be a fault root cause.

[0106] In an optional embodiment, step S430 specifically includes steps S431 to S436: Step S431: Initialize all network nodes in the multi-source anomaly symptom time-series correlation network as a set of candidate fault root source nodes, which includes critical diverging nodes and non-critical diverging nodes.

[0107] During the initialization phase, the candidate fault root cause node set is assigned the complete set of nodes in the multi-source anomaly symptom time-series correlation network. This set includes network nodes marked as critical divergent nodes in step S360, non-critical divergent nodes that do not meet the critical divergent node criteria, and network nodes marked as critical convergence nodes.

[0108] Step S432: Perform predecessor node existence detection on each candidate fault root cause node in the candidate fault root cause node set. If a candidate fault root cause node has a predecessor network node in the multi-source anomaly symptom time-series correlation network, remove it from the candidate fault root cause node set to obtain the initially screened candidate fault root cause node set.

[0109] The predecessor node existence detection process traverses each candidate node in the candidate fault root cause node set and checks whether it has any incoming edges in the multi-source anomaly symptom time-series correlation network. An in-degree greater than zero indicates that other anomaly symptom events occurred in the network before the anomaly symptom event represented by the candidate node, so the candidate node cannot be the starting root cause in time sequence and is removed from the candidate fault root cause node set. After all candidate nodes have been traversed, the nodes remaining in the set are exactly the nodes with an in-degree of 0 in the network, and they form the initially screened candidate fault root cause node set.
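The in-degree screening described above can be illustrated with a minimal Python sketch. It assumes the network is stored as a successor adjacency dictionary; the function name `screen_by_in_degree` and the node labels are hypothetical and do not come from the embodiment.

```python
from typing import Dict, List, Set

def screen_by_in_degree(nodes: Set[str], edges: Dict[str, List[str]]) -> Set[str]:
    """Keep only the candidate nodes that have no incoming edge (in-degree 0)."""
    # Every node that appears as a successor has at least one predecessor.
    has_predecessor = {dst for successors in edges.values() for dst in successors}
    return {n for n in nodes if n not in has_predecessor}

# Hypothetical network: V1 -> T1 -> P1, V1 -> P1, and a separate chain T2 -> P2.
edges = {"V1": ["T1", "P1"], "T1": ["P1"], "T2": ["P2"]}
nodes = {"V1", "T1", "P1", "T2", "P2"}
initial_candidates = screen_by_in_degree(nodes, edges)
```

In this example only V1 and T2 survive, because every other node is the target of at least one edge.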

[0110] Step S433: Perform fault impact propagation strength verification on each candidate fault root cause node in the initially screened candidate fault root cause node set. If the fault impact propagation strength of a candidate fault root cause node is lower than the propagation strength threshold, remove it from the initially screened set to obtain the candidate fault root cause node set after secondary screening.

[0111] For each candidate node in the initial set of candidate root cause nodes, the fault impact propagation strength value calculated in step S410 is queried. If the candidate node is a non-critical diverging node and its propagation strength has not been explicitly calculated, a restricted forward traversal is performed starting from the candidate node, and the number of successor nodes within a finite number of steps is calculated as the approximate propagation strength. The propagation strength threshold is set as a multiple of the average out-degree of nodes in the network. If the fault impact propagation strength of the candidate root cause node is lower than the propagation strength threshold, it indicates that although the candidate node is earlier in time, its influence is extremely limited and insufficient to constitute a system-level fault root cause requiring special handling. The candidate node is removed from the set, and the remaining nodes constitute the second set of candidate root cause nodes.
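As a hedged illustration of this strength check, the sketch below approximates the propagation strength by counting the successors reachable within a bounded number of hops and derives the threshold from the average out-degree. The step bound `max_steps`, the factor `multiple`, and all names are assumptions chosen for the example, not values given in the embodiment.

```python
from collections import deque
from typing import Dict, Iterable, List

def approx_propagation_strength(start: str, edges: Dict[str, List[str]], max_steps: int) -> int:
    """Count distinct successors reachable from start within max_steps hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_steps:
            continue  # do not expand beyond the step limit
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return len(seen) - 1  # exclude the start node itself

def strength_threshold(nodes: Iterable[str], edges: Dict[str, List[str]], multiple: float) -> float:
    """Threshold expressed as a multiple of the average out-degree of the network."""
    node_list = list(nodes)
    avg_out = sum(len(edges.get(n, [])) for n in node_list) / len(node_list)
    return multiple * avg_out

# Hypothetical network and parameters.
edges = {"V1": ["T1", "P1"], "T1": ["P1"], "T2": ["P2"]}
nodes = ["V1", "T1", "P1", "T2", "P2"]
threshold = strength_threshold(nodes, edges, multiple=2.0)
second_candidates = {c for c in ("V1", "T2")
                     if approx_propagation_strength(c, edges, max_steps=2) >= threshold}
```

With four edges over five nodes the average out-degree is 0.8, so the threshold is 1.6; V1 reaches two successors and is kept, while T2 reaches only one and is removed.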

[0112] Step S434: Perform fault impact propagation path length evaluation on each candidate fault root cause node in the candidate fault root cause node set after secondary screening. Calculate the propagation path level depth from the candidate fault root cause node to its farthest successor network node in the multi-source anomaly symptom time-series correlation network. If the propagation path level depth is lower than the path depth threshold, remove the candidate fault root cause node from the set to obtain the candidate fault root cause node set after tertiary screening.

[0113] For each candidate node in the set of candidate root cause nodes after secondary screening, a depth-first search is performed in the multi-source anomaly symptom temporal correlation network, starting from that candidate node, to calculate the maximum number of edges in all directed paths reachable from that starting node. This maximum number of edges is the propagation path depth. The path depth threshold is set according to the network size. If the propagation path depth of a candidate root cause node is lower than this path depth threshold, it indicates that the fault impact of that candidate node naturally decays and disappears after propagation for several steps, failing to form an impact chain with significant depth, and its importance as a root cause is insufficient. This candidate node is removed from the set, and the remaining nodes constitute the set of candidate root cause nodes after the third screening.
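The depth evaluation can be sketched as a memoised depth-first search. The sketch assumes the correlation network is acyclic, which is plausible for edges ordered by occurrence time but is not stated explicitly; the function name and sample graph are hypothetical.

```python
from functools import lru_cache
from typing import Dict, List

def propagation_depth(start: str, edges: Dict[str, List[str]]) -> int:
    """Maximum number of edges on any directed path starting at start (acyclic graph assumed)."""
    @lru_cache(maxsize=None)
    def depth(node: str) -> int:
        successors = edges.get(node, [])
        if not successors:
            return 0  # a leaf contributes no further edges
        return 1 + max(depth(s) for s in successors)
    return depth(start)

# Hypothetical network: the longest path from V1 is V1 -> T1 -> P1 -> P2.
edges = {"V1": ["T1", "P1"], "T1": ["P1"], "P1": ["P2"], "T2": []}
depth_v1 = propagation_depth("V1", edges)
depth_t2 = propagation_depth("T2", edges)
```

A candidate whose depth falls below the path depth threshold would then be dropped; for very deep networks an iterative traversal would avoid Python's recursion limit.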

[0114] Step S435: If the set of candidate fault root cause nodes after three screenings contains only one candidate fault root cause node, then the candidate fault root cause node is determined as the fault root cause node of the target server motherboard.

[0115] After completing the above three stages of screening, the cardinality of the candidate fault root cause node set after the three screenings is checked. If only one candidate node remains in the set, the fault root cause localization analysis model outputs this candidate node as the uniquely determined fault root cause node.

[0116] Step S436: If the set of candidate root cause nodes after three screenings contains multiple candidate root cause nodes, extract the time mark of the occurrence of abnormal symptoms for each candidate root cause node, and determine the candidate root cause node with the earliest time mark of the occurrence of abnormal symptoms as the root cause node of the target server motherboard.

[0117] When the set of candidate root cause nodes after three rounds of screening contains more than one candidate node, it indicates the existence of multiple disconnected or weakly connected anomalous symptom propagation components in the network, each with its own potential root cause node. In this case, the fault root cause localization analysis model further invokes the time priority decision rule. For each candidate root cause node in the set, the time marker of the occurrence of the anomalous symptom is extracted from its corresponding anomalous symptom mapping association data structure. The time markers of the occurrence of the anomalous symptom of all candidate nodes are compared, and the candidate node with the earliest time marker is selected as the root cause node of the target server motherboard. If the time markers of the occurrence of the anomalous symptom of multiple candidate nodes are the same or within a very small time tolerance, the propagation intensity of the fault impact of these candidate nodes is further compared, and the candidate node with the strongest propagation intensity is selected as the root cause node.
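The time priority decision rule, including the tie-break on propagation strength, might look like the following sketch. The `Candidate` record, the `tolerance` parameter, and the sample values are illustrative assumptions.

```python
from typing import List, NamedTuple

class Candidate(NamedTuple):
    node_id: str
    onset_time: float  # time marker of the abnormal symptom occurrence, in seconds
    strength: int      # fault impact propagation strength

def pick_root_cause(candidates: List[Candidate], tolerance: float) -> Candidate:
    """Earliest onset wins; onsets within the tolerance are ranked by strength."""
    earliest = min(c.onset_time for c in candidates)
    tied = [c for c in candidates if c.onset_time - earliest <= tolerance]
    return max(tied, key=lambda c: c.strength)

# Hypothetical candidates left after the three screening stages.
candidates = [
    Candidate("A", 10.00, 3),
    Candidate("B", 10.02, 5),
    Candidate("C", 12.00, 9),
]
root_with_tolerance = pick_root_cause(candidates, tolerance=0.05)
root_strict = pick_root_cause(candidates, tolerance=0.0)
```

With a 0.05 s tolerance, A and B count as simultaneous and B wins on strength; with zero tolerance, A wins on time alone. A single remaining candidate is returned unchanged, which also covers step S435.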

[0118] Step S440: Extract the node identifier of the fault root cause node in the multi-source abnormal symptom timing association network, and determine the monitoring data stream segment identifier where the fault root cause is located based on the node identifier. The monitoring data stream segment identifier is used to indicate the target monitoring data stream segment in the motherboard voltage monitoring data stream segment, motherboard temperature monitoring data stream segment, or motherboard power consumption monitoring data stream segment corresponding to the fault root cause node.

[0119] Each fault root cause node has a unique node identifier in the multi-source anomaly symptom timing correlation network. The encoding rule of this node identifier embeds the monitoring data stream type field from which the node originates. The fault root cause localization analysis model parses the type field of this node identifier. If the type field value is 1, the monitoring data stream segment identifier points to the motherboard voltage monitoring data stream segment. If the type field value is 2, the monitoring data stream segment identifier points to the motherboard temperature monitoring data stream segment. If the type field value is 3, the monitoring data stream segment identifier points to the motherboard power consumption monitoring data stream segment.

[0120] Step S450: Extract the abnormal symptom category identifier corresponding to the fault root cause node. The abnormal symptom category identifier is used to indicate the target abnormal symptom mapping association in the voltage abnormal symptom mapping association, temperature abnormal symptom mapping association, or power consumption abnormal symptom mapping association corresponding to the fault root cause node.

[0121] In addition to the monitoring data stream type field, the node identifier of the fault root cause node also contains an embedded abnormal symptom category subtype field. The fault root cause localization analysis model parses the category subtype field of this node identifier and uniquely determines the target abnormal symptom mapping association based on the combination of the type field and the category subtype field. For voltage abnormal symptoms, the abnormal symptom category identifier points to a power supply link abnormal symptom mapping association, a voltage regulation abnormal symptom mapping association, or a filter circuit abnormal symptom mapping association. For temperature abnormal symptoms, the abnormal symptom category identifier points to a heat dissipation path blockage symptom mapping association, a cooling fan operation abnormal symptom mapping association, or a heat dissipation medium failure symptom mapping association. For power consumption abnormal symptoms, the abnormal symptom category identifier points to a power supply link short circuit symptom mapping association, a power supply link open circuit symptom mapping association, or a load link abnormal symptom mapping association.
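A possible decoding of the node identifier is sketched below. The type field values 1, 2, and 3 follow the description of step S440; the `type-subtype-serial` string layout and the subtype numbering 1 to 3 are assumptions made only for illustration, since the encoding rule itself is not disclosed.

```python
from typing import Tuple

STREAM_TYPES = {1: "voltage", 2: "temperature", 3: "power consumption"}

# Assumed subtype numbering; the description lists the categories but not their codes.
SUBTYPES = {
    (1, 1): "power supply link abnormal symptom",
    (1, 2): "voltage regulation abnormal symptom",
    (1, 3): "filter circuit abnormal symptom",
    (2, 1): "heat dissipation path blockage symptom",
    (2, 2): "cooling fan operation abnormal symptom",
    (2, 3): "heat dissipation medium failure symptom",
    (3, 1): "power supply link short circuit symptom",
    (3, 2): "power supply link open circuit symptom",
    (3, 3): "load link abnormal symptom",
}

def decode_node_id(node_id: str) -> Tuple[str, str]:
    """Split a hypothetical 'type-subtype-serial' identifier into its two category fields."""
    type_text, subtype_text, _serial = node_id.split("-")
    type_field, subtype_field = int(type_text), int(subtype_text)
    return STREAM_TYPES[type_field], SUBTYPES[(type_field, subtype_field)]

stream_type, category = decode_node_id("2-2-0017")
```

Keying the subtype table on the (type, subtype) pair mirrors the statement that the target mapping association is determined by the combination of both fields.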

[0122] Step S460: Based on the root cause node of the fault and all subsequent network nodes connected to the root cause node of the fault, extract the fault propagation subtree starting from the root cause node in the multi-source anomaly symptom time-series association network, and generate the propagation influence range representation information of the root cause of the fault in the multi-source anomaly symptom time-series association network according to the number of nodes and the edge level depth of the fault propagation subtree.

[0123] Using the fault root cause node as the root node, a breadth-first traversal is performed in the multi-source anomaly symptom time-series correlation network to collect all successor network nodes reachable from the root node via directed paths. The root node, all collected successor network nodes, and the edges connecting them together form a directed subgraph originating from the root node; this directed subgraph is the fault propagation subtree. The total number of network nodes in the fault propagation subtree is counted as the node count. The edge level depth is the number of edges on the longest directed path from the root node to any leaf node of the fault propagation subtree. The propagation influence range representation information is a data structure containing the node count attribute, the edge level depth attribute, and an adjacency list representation of the fault propagation subtree.
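The extraction of the fault propagation subtree can be sketched as follows. Because a breadth-first traversal yields shortest hop counts rather than longest paths, the sketch computes the edge level depth in a separate memoised pass and assumes an acyclic network; the function name and dictionary keys are hypothetical.

```python
from collections import deque
from typing import Dict, List

def propagation_scope(root: str, edges: Dict[str, List[str]]) -> dict:
    """Collect the subgraph reachable from root and summarise its size and depth."""
    # Breadth-first traversal collects every node reachable from the root.
    reachable, queue = {root}, deque([root])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in reachable:
                reachable.add(nxt)
                queue.append(nxt)
    adjacency = {n: list(edges.get(n, [])) for n in sorted(reachable)}

    # Longest path from each node, memoised; valid only for acyclic graphs.
    memo: Dict[str, int] = {}
    def longest(node: str) -> int:
        if node not in memo:
            memo[node] = max((1 + longest(s) for s in adjacency[node]), default=0)
        return memo[node]

    return {
        "node_count": len(reachable),
        "edge_level_depth": longest(root),
        "adjacency": adjacency,
    }

# Hypothetical network; the chain T2 -> P2 is not reachable from V1.
edges = {"V1": ["T1", "P1"], "T1": ["P1"], "T2": ["P2"]}
scope = propagation_scope("V1", edges)
```

Here the subtree rooted at V1 has three nodes and a depth of two, since the longest path is V1, T1, P1.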

[0124] Step S500: Generate a fault diagnosis and recovery command stream for the target server motherboard based on the fault root cause location analysis results. The fault diagnosis and recovery command stream includes voltage regulation recovery command, temperature control recovery command, and power consumption limit recovery command corresponding to the fault root cause. The fault diagnosis and recovery command stream is used to instruct the target server motherboard to perform recovery operations that match the fault root cause.

[0125] In an optional embodiment, step S500 specifically includes steps S510 to S570: Step S510: Parse the monitoring data stream segment identifier in the fault root cause location analysis result, and determine the monitoring data stream type corresponding to the fault root cause. The monitoring data stream type is one of the following: motherboard voltage monitoring data stream type, motherboard temperature monitoring data stream type, or motherboard power consumption monitoring data stream type.

[0126] The fault diagnosis and recovery instruction generation module reads the monitoring data stream segment identifier field from the fault root cause localization analysis result data structure. Based on the enumerated value of this field, the monitoring data stream type is parsed into a three-valued enumeration variable whose values represent the motherboard voltage monitoring data stream type, the motherboard temperature monitoring data stream type, and the motherboard power consumption monitoring data stream type, respectively. This monitoring data stream type determines the retrieval scope of the subsequent recovery strategy template and the dominant direction of instruction generation.

[0127] Step S520: Parse the abnormal symptom category identifier in the fault root cause location analysis result and determine the specific category of abnormal symptom corresponding to the fault root cause. The specific category of abnormal symptom is the power supply link abnormal symptom category, voltage regulation abnormal symptom category, or filter circuit abnormal symptom category defined in the voltage abnormal symptom mapping map; or the heat dissipation path blockage symptom category, cooling fan operation abnormal symptom category, or heat dissipation medium failure symptom category defined in the temperature abnormal symptom mapping map; or the power supply link short circuit symptom category, power supply link open circuit symptom category, or load link abnormal symptom category defined in the power consumption abnormal symptom mapping map.

[0128] The fault diagnosis and recovery instruction generation module reads the abnormal symptom category identifier field from the fault root cause localization analysis result data structure and performs joint decoding in conjunction with the monitoring data stream type field. For each of the three monitoring data stream types (motherboard voltage, motherboard temperature, and motherboard power consumption), the abnormal symptom category identifier resolves to exactly one of that type's three candidate subcategories. This specific abnormal symptom category provides semantic information at the level of the fault's physical mechanism, which is used to retrieve the most suitable refined recovery strategy in subsequent steps.

[0129] Step S530: Based on the monitoring data stream type and the specific category of abnormal symptoms, retrieve the fault recovery strategy template that matches the monitoring data stream type and the specific category of abnormal symptoms from the preset fault recovery strategy mapping library. The fault recovery strategy template includes voltage regulation recovery strategy template, temperature control recovery strategy template and power consumption limitation recovery strategy template.

[0130] The pre-defined fault recovery strategy mapping library is a key-value pair database. The key is a tuple of the monitoring data stream type and the specific category of the abnormal symptom, and the value is the storage path or serialized data block of the corresponding fault recovery strategy template. A fault recovery strategy template is a parameterized instruction generation blueprint, containing fields for target parameter range, adjustment step rate, and execution timing constraints. Depending on the monitoring data stream type, the retrieved fault recovery strategy template is categorized as one of three: voltage regulation recovery strategy template, temperature control recovery strategy template, or power consumption limitation recovery strategy template.
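A minimal in-memory stand-in for the fault recovery strategy mapping library is sketched below. The tuple key follows the description; the `StrategyTemplate` fields, the category strings, and every numeric value are hypothetical placeholders rather than parameters from the embodiment.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass
class StrategyTemplate:
    kind: str                          # which of the three template families this is
    target_range: Tuple[float, float]  # target parameter range (lower, upper)
    step: float                        # adjustment step size
    interval_s: float                  # execution timing constraint between steps

# Key: (monitoring data stream type, specific abnormal symptom category).
LIBRARY: Dict[Tuple[str, str], StrategyTemplate] = {
    ("voltage", "voltage_regulation_abnormal"):
        StrategyTemplate("voltage_regulation", (11.8, 12.2), 0.05, 2.0),
    ("temperature", "cooling_fan_abnormal"):
        StrategyTemplate("temperature_control", (55.0, 65.0), 1.0, 10.0),
}

def retrieve_template(stream_type: str, category: str) -> Optional[StrategyTemplate]:
    """Return the matching template, or None when the library has no entry."""
    return LIBRARY.get((stream_type, category))

template = retrieve_template("voltage", "voltage_regulation_abnormal")
```

A persistent key-value store could replace the dictionary without changing the lookup, since only the composite key and the returned template matter to the later steps.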

[0131] Step S540: If the fault recovery strategy template is a voltage regulation recovery strategy template, extract the voltage regulation target range representation information and voltage regulation step rate representation information from the voltage regulation recovery strategy template, and generate a voltage regulation recovery instruction based on the voltage regulation target range representation information and voltage regulation step rate representation information. The voltage regulation recovery instruction is used to instruct the voltage regulation link of the target server motherboard to perform a progressive voltage regulation operation.

[0132] In an optional embodiment, step S540 specifically includes steps S541 to S545: Step S541: Perform template parameter parsing processing on the voltage regulation recovery strategy template, extract the preset voltage regulation upper limit characterization information and voltage regulation lower limit characterization information in the voltage regulation recovery strategy template, and construct voltage regulation target range characterization information based on the voltage regulation upper limit characterization information and voltage regulation lower limit characterization information.

[0133] The template parameter parsing process reads the parameter dictionary from the voltage regulation recovery strategy template, locates the entry with the upper voltage regulation limit as the key, and assigns its value to the upper voltage regulation limit representation information. The entry with the lower voltage regulation limit as the key is assigned its value to the lower voltage regulation limit representation information. The voltage regulation target range representation information is a closed-interval data structure composed of the lower voltage regulation limit representation information and the upper voltage regulation limit representation information.

[0134] Step S542: Perform adjustment rate parsing processing on the voltage regulation recovery strategy template, extract the preset voltage regulation single step amplitude representation information and the time interval representation information of adjacent step operations in the voltage regulation recovery strategy template, and construct voltage regulation step rate representation information based on the voltage regulation single step amplitude representation information and the time interval representation information of adjacent step operations.

[0135] The adjustment rate parsing process reads the parameter dictionary from the voltage regulation recovery strategy template. For entries where the key is the single step size, its value is assigned to the voltage regulation single step size representation information. For entries where the key is the step time interval, its value is assigned to the time interval representation information of adjacent step operations. The voltage regulation step rate representation information is a structure containing the above two fields.

[0136] Step S543: Obtain the current real-time voltage monitoring value of the target server motherboard, calculate the voltage deviation between the real-time voltage monitoring value and the violated limit (the voltage regulation upper limit representation information or the voltage regulation lower limit representation information) of the voltage regulation target range representation information, and determine the number of step operations required for voltage regulation based on the voltage deviation and the voltage regulation single step amplitude representation information.

[0137] The fault diagnosis and recovery command generation module obtains the real-time voltage monitoring value of the target server motherboard at the current sampling moment through the sensor data reading interface of the baseboard management controller. It compares the real-time voltage monitoring value with the voltage regulation upper limit representation information. If the real-time voltage monitoring value is higher than the voltage regulation upper limit representation information, the voltage deviation is the difference between the real-time voltage monitoring value and the voltage regulation upper limit representation information. If the real-time voltage monitoring value is lower than the voltage regulation lower limit representation information, the voltage deviation is the difference between the voltage regulation lower limit representation information and the real-time voltage monitoring value. If the real-time voltage monitoring value is within the voltage regulation target range representation information, the voltage deviation is zero. The number of step operations is calculated by dividing the absolute value of the voltage deviation by the voltage regulation single step amplitude representation information and rounding up.
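The deviation and step count calculation can be sketched as below. Representing voltages as integer millivolts is a choice made for this example so that the ceiling division is exact; the function names and sample values are hypothetical.

```python
def voltage_deviation_mv(v_now: int, lower: int, upper: int) -> int:
    """Distance from the target range in millivolts; zero when already inside it."""
    if v_now > upper:
        return v_now - upper
    if v_now < lower:
        return lower - v_now
    return 0

def step_count(v_now: int, lower: int, upper: int, step: int) -> int:
    """Number of step operations: deviation divided by step size, rounded up."""
    deviation = voltage_deviation_mv(v_now, lower, upper)
    return -(-deviation // step)  # integer ceiling division

# Hypothetical 12 V rail with a target range of 11.80 V to 12.20 V and 50 mV steps.
steps_over = step_count(12360, 11800, 12200, 50)
steps_under = step_count(11700, 11800, 12200, 50)
steps_inside = step_count(12000, 11800, 12200, 50)
```

A 160 mV overshoot therefore needs four steps of 50 mV, while a value inside the range needs none. With floating-point volts, rounding error in the quotient can push an exact multiple up by one extra step, which is why integers are used here.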

[0138] Step S544: Based on the step operation count and the time interval between adjacent step operations, generate a voltage regulation step operation instruction sequence with a timing mark sequence. Each step operation instruction in the voltage regulation step operation instruction sequence includes the target voltage value of the step operation and the execution time mark of the step operation.

[0139] Using the current system time as the base time, the target voltage value is calculated for each integer sequence number from 1 to the number of step operations. If the current voltage is higher than the voltage regulation upper limit representation information, the target voltage value equals the current voltage minus the product of the sequence number and the voltage regulation single step amplitude representation information, and is limited so that it does not fall below the voltage regulation upper limit representation information. If the current voltage is lower than the voltage regulation lower limit representation information, the target voltage value equals the current voltage plus the product of the sequence number and the voltage regulation single step amplitude representation information, and is limited so that it does not rise above the voltage regulation lower limit representation information. The execution time marker of the step operation with a given sequence number equals the base time plus (sequence number minus 1) multiplied by the time interval representation information of adjacent step operations, so the first step operation executes at the base time. Each step operation instruction includes a target voltage value field and an execution time marker field. All step operation instructions are arranged in ascending order of sequence number to form the voltage regulation step operation instruction sequence.
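One way to generate the step operation instruction sequence is sketched below, again with integer millivolts. The final step is clamped to the violated limit, which is this sketch's reading of the stopping condition; names and values are hypothetical.

```python
from typing import List, Tuple

def build_step_sequence(v_now: int, lower: int, upper: int, step: int,
                        interval_s: float, base_time: float) -> List[Tuple[int, float]]:
    """Return (target voltage in mV, execution time) pairs in ascending step order."""
    if v_now > upper:
        bound, direction = upper, -1   # step down toward the upper limit
    elif v_now < lower:
        bound, direction = lower, +1   # step up toward the lower limit
    else:
        return []                      # already inside the target range
    count = -(-abs(v_now - bound) // step)  # ceiling division
    sequence = []
    for k in range(1, count + 1):
        target = v_now + direction * k * step
        # Clamp so the last step lands exactly on the limit instead of overshooting.
        target = max(target, bound) if direction < 0 else min(target, bound)
        sequence.append((target, base_time + (k - 1) * interval_s))
    return sequence

# Hypothetical case: 12.36 V measured, range 11.80 V to 12.20 V, 50 mV steps every 2 s.
sequence = build_step_sequence(12360, 11800, 12200, 50, 2.0, 100.0)
```

The result steps through 12310, 12260, and 12210 mV and finishes at 12200 mV, with execution times 100, 102, 104, and 106 s.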

[0140] Step S545: Encapsulate the voltage regulation step operation instruction sequence and the voltage regulation target range characterization information into an instruction to generate a voltage regulation recovery instruction. The voltage regulation recovery instruction is used to instruct the voltage regulation link of the target server motherboard to perform a gradual voltage regulation operation from the real-time voltage monitoring value to the voltage regulation target range characterization information according to the voltage regulation step operation instruction sequence.

[0141] The instruction encapsulation process takes the voltage regulation step operation instruction sequence as the instruction payload, uses the voltage regulation target range characterization information as a verification reference field, and adds an instruction type identifier and an instruction validity period timestamp, encapsulating them into a voltage regulation recovery instruction conforming to the baseboard management controller command protocol format. This instruction is sent to the voltage regulator controller on the target server motherboard via the inter-integrated circuit (I2C) bus or the Platform Environment Control Interface (PECI). The voltage regulator controller parses the instruction payload and executes the step adjustment of the output voltage item by item according to the predetermined timing mark sequence.

[0142] Step S550: If the fault recovery strategy template is a temperature control recovery strategy template, extract the temperature control target range representation information and the temperature control fan speed adjustment curve representation information from the temperature control recovery strategy template, and generate a temperature control recovery instruction based on the temperature control target range representation information and the temperature control fan speed adjustment curve representation information. The temperature control recovery instruction is used to instruct the heat dissipation control link of the target server motherboard to perform a gradual temperature control operation.

[0143] In an optional embodiment, step S550 specifically includes steps S551 to S556: Step S551: Perform template parameter parsing processing on the temperature control recovery strategy template, extract the preset upper limit and lower limit characterization information of temperature control in the temperature control recovery strategy template, and construct the target range characterization information of temperature control based on the upper limit and lower limit characterization information of temperature control.

[0144] The template parameter parsing process reads the parameter dictionary from the temperature control recovery strategy template, locates the entry whose key is the temperature control upper limit, and assigns its value to the temperature control upper limit representation information. It likewise locates the entry whose key is the temperature control lower limit and assigns its value to the temperature control lower limit representation information. The temperature control target range representation information is a closed-interval data structure composed of the lower limit representation information and the upper limit representation information.

[0145] Step S552: Perform fan speed curve analysis on the temperature control recovery strategy template, and extract the preset temperature and fan speed mapping relationship characterization information in the temperature control recovery strategy template. The temperature and fan speed mapping relationship characterization information is used to characterize the target speed characterization information of the cooling fan corresponding to different temperature ranges.

[0146] The fan speed curve parsing and processing reads the segmented mapping table data structure from the temperature control recovery strategy template. This mapping table consists of several entries, each containing a lower bound of a temperature range, an upper bound of a temperature range, and the target speed representation information of the cooling fan corresponding to that temperature range. The temperature ranges do not overlap and continuously cover the entire range from the lowest operating temperature to the highest operating temperature.

[0147] Step S553: Based on the mapping relationship between temperature and fan speed, construct temperature control fan speed adjustment curve representation information with temperature as the independent variable and the target fan speed representation information as the dependent variable. The temperature control fan speed adjustment curve representation information includes a fan speed increasing curve segment for the temperature rising phase and a fan speed decreasing curve segment for the temperature falling phase.

[0148] The temperature-controlled fan speed adjustment curve is constructed by introducing hysteresis control logic based on the temperature-fan speed mapping relationship. For the temperature rise process, a monotonically non-decreasing stepped curve is generated according to the temperature ranges in the mapping table and the corresponding target fan speed; this stepped curve represents the fan speed increasing segment. For the temperature fall process, the lower boundary of each temperature range is shifted downwards by a preset hysteresis band width. A monotonically non-decreasing stepped curve is then generated according to the shifted temperature range and the corresponding target fan speed; this stepped curve represents the fan speed decreasing segment. The introduction of the hysteresis band prevents frequent oscillations and switching of the fan speed near the temperature threshold.
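The hysteresis logic can be illustrated with a stepped lookup in which the falling segment uses range boundaries shifted down by the hysteresis band width. The table values, the 5-degree band, and the function name are hypothetical, and the table is assumed to be sorted by ascending temperature.

```python
from typing import List, Tuple

# Each entry: (lower bound in deg C, upper bound in deg C, target fan speed in RPM).
Table = List[Tuple[int, int, int]]

def speed_on_curve(table: Table, temp: int, rising: bool, hysteresis: int) -> int:
    """Stepped lookup; while cooling, each lower boundary is shifted down by the band."""
    shift = 0 if rising else hysteresis
    speed = table[0][2]
    for lower, _upper, rpm in table:
        if temp >= lower - shift:
            speed = rpm  # the highest range whose (shifted) lower bound is reached wins
    return speed

table = [(0, 40, 2000), (40, 60, 4000), (60, 100, 8000)]
rising_58 = speed_on_curve(table, 58, rising=True, hysteresis=5)
falling_58 = speed_on_curve(table, 58, rising=False, hysteresis=5)
falling_54 = speed_on_curve(table, 54, rising=False, hysteresis=5)
```

At 58 degrees the rising segment still requests 4000 RPM, whereas the falling segment holds 8000 RPM until the temperature drops below 55 degrees, so the fan does not toggle when the temperature hovers around 60 degrees.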

[0149] Step S554: Obtain the current real-time temperature monitoring value of the target server motherboard, input the real-time temperature monitoring value into the temperature control fan speed adjustment curve characterization information, determine the real-time cooling fan target speed characterization information corresponding to the real-time temperature monitoring value, and generate a fan speed adjustment step operation command based on the speed deviation between the real-time cooling fan target speed characterization information and the current speed characterization information of the cooling fan.

[0150] The fault diagnosis and recovery command generation module obtains the real-time temperature monitoring value of the target server motherboard at the current sampling moment through the sensor data reading interface of the baseboard management controller. Based on the direction of temperature change, it selects either a fan speed increasing curve segment or a fan speed decreasing curve segment as the currently valid curve segment. The real-time temperature monitoring value is substituted into the currently valid curve segment for table lookup or linear interpolation to obtain the real-time target speed characterization information of the cooling fan. The module reads the current speed characterization information of the cooling fan and calculates the absolute value of the difference between the real-time target speed characterization information and the current speed characterization information as the speed deviation. If the speed deviation exceeds a preset speed adjustment dead zone threshold, a fan speed adjustment step operation command is generated to adjust the fan speed to the real-time target speed characterization information. If the speed deviation does not exceed the dead zone threshold, no adjustment command is generated to avoid control jitter caused by fine-tuning.
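The segment selection and dead zone check might be sketched as follows. Treating an unchanged temperature as the rising case is an assumption of this sketch, and the command dictionary layout is hypothetical.

```python
from typing import Optional

def select_segment(temp_now: float, temp_prev: float) -> str:
    """Choose the curve segment from the direction of temperature change."""
    # Assumption: an unchanged temperature is handled as the rising case.
    return "increasing" if temp_now >= temp_prev else "decreasing"

def fan_adjust_command(target_rpm: int, current_rpm: int, dead_zone_rpm: int) -> Optional[dict]:
    """Emit a command only when the speed deviation exceeds the dead zone threshold."""
    deviation = abs(target_rpm - current_rpm)
    if deviation > dead_zone_rpm:
        return {"op": "set_fan_speed", "target_rpm": target_rpm}
    return None  # deviation too small; suppress the adjustment to avoid jitter

segment = select_segment(61.0, 58.5)
small = fan_adjust_command(4000, 3900, dead_zone_rpm=150)
large = fan_adjust_command(4000, 3000, dead_zone_rpm=150)
```

A 100 RPM deviation falls inside the 150 RPM dead zone and produces no command, while a 1000 RPM deviation produces one.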

[0151] Step S555: Based on the fan speed adjustment step operation command and the temperature control target range characterization information, generate a temperature control step operation command sequence with a time sequence mark. Each step operation command in the temperature control step operation command sequence includes the target temperature value of the step operation, the target speed characterization information of the cooling fan corresponding to the step operation, and the execution time mark of the step operation.

[0152] The generation process of the temperature control step operation command sequence is an iterative prediction process. It starts with the current real-time temperature monitoring value as the initial state, uses the midpoint of the temperature control target range representation information as the target state, and uses a preset temperature sampling period as the time step. At each time step, the temperature value for the next time step is predicted based on a simplified thermodynamic model, and the target fan speed representation information corresponding to that predicted temperature is determined using the temperature control fan speed adjustment curve representation information. The operation command for that time step includes the predicted temperature value as the target temperature value, the target fan speed representation information, and the current time plus the time step as the execution time marker. The iterative prediction process continues until the predicted temperature value enters the temperature control target range representation information and remains stable. All iteratively generated operation commands are arranged in chronological order to form the temperature control step operation command sequence.
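The iterative prediction loop can be sketched with the thermal model and the fan curve passed in as functions. The embodiment does not specify the simplified thermodynamic model, so the example substitutes a first-order relaxation toward the range midpoint; `stable_steps`, `max_steps`, and all values are likewise assumptions.

```python
from typing import Callable, List, Tuple

def build_temperature_sequence(
    temp_now: float, lower: float, upper: float, period_s: float, base_time: float,
    predict_next: Callable[[float], float], speed_for: Callable[[float], int],
    stable_steps: int = 2, max_steps: int = 100,
) -> List[Tuple[float, int, float]]:
    """Return (target temperature, target fan speed, execution time) per time step."""
    sequence = []
    temp, in_range_run = temp_now, 0
    for k in range(1, max_steps + 1):          # max_steps guards against non-convergence
        temp = predict_next(temp)              # predicted temperature for the next step
        sequence.append((temp, speed_for(temp), base_time + k * period_s))
        in_range_run = in_range_run + 1 if lower <= temp <= upper else 0
        if in_range_run >= stable_steps:       # in range and stable: stop iterating
            break
    return sequence

# Stand-in model: each period closes half the gap to the midpoint (60 deg C).
midpoint = (55.0 + 65.0) / 2
sequence = build_temperature_sequence(
    80.0, 55.0, 65.0, 10.0, 0.0,
    predict_next=lambda t: t + 0.5 * (midpoint - t),
    speed_for=lambda t: 8000 if t >= 60 else 4000,
)
```

Starting from 80 degrees the predictions are 70, 65, and 62.5 degrees; the last two lie within the range, so the loop stops after three commands.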

[0153] Step S556: The temperature control step operation instruction sequence and the temperature control target range characterization information are encapsulated and processed to generate a temperature control recovery instruction. The temperature control recovery instruction is used to instruct the heat dissipation control link of the target server motherboard to perform a gradual temperature control operation from the real-time temperature monitoring value to the temperature control target range characterization information according to the temperature control step operation instruction sequence.

[0154] The instruction encapsulation process takes the temperature control step operation instruction sequence as the instruction payload, uses the temperature control target range representation information as the convergence judgment reference field, and adds an instruction type identifier and instruction validity period timestamp, encapsulating it into a temperature control recovery instruction conforming to the baseboard management controller command protocol format. This instruction is sent to the fan controller on the target server motherboard through the baseboard management controller's pulse width modulation fan control interface. The fan controller adjusts the fan drive duty cycle item by item according to the timing marks in the instruction sequence.

[0155] Step S560: If the fault recovery strategy template is a power limit recovery strategy template, extract the power limit threshold representation information and power limit duration representation information from the power limit recovery strategy template, and generate a power limit recovery instruction based on the power limit threshold representation information and power limit duration representation information. The power limit recovery instruction is used to instruct the power management link of the target server motherboard to perform a temporary power limit operation.

[0156] The power limit recovery strategy template is applicable when the root cause of the fault is determined to be one of the following categories: power supply link short circuit symptom, power supply link open circuit symptom, or load link abnormality symptom. This template presets a threshold for imposing a temporary upper limit on the total power consumption of the motherboard or the power consumption of a specific power rail, as well as the duration for which this limitation needs to be maintained. The template parameter parsing process reads the parameter dictionary from the power limit recovery strategy template, locates entries with the power limit threshold as the location key, and assigns their values to the power limit threshold characterization information. It also locates entries with the limitation duration as the location key and assigns their values to the power limit duration characterization information. The process of generating the power limit recovery instruction involves writing the power limit threshold characterization information into the processor's average operating power limit register and writing the power limit duration characterization information into the limitation release timer. This instruction is sent to the power management link of the target server motherboard via the power management bus interface of the baseboard management controller.
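The template parameter parsing step can be illustrated with a short sketch. The dictionary layout, the key names `power_limit_threshold` and `limit_duration`, the category strings, and the units are assumptions chosen for the example. The actual register and timer writes over the power management bus are not modeled.

```python
from dataclasses import dataclass

# Root cause categories for which the template is described as applicable.
# The string spellings are hypothetical.
APPLICABLE_CATEGORIES = {
    "power_supply_link_short_circuit",
    "power_supply_link_open_circuit",
    "load_link_abnormality",
}


@dataclass(frozen=True)
class PowerLimitInstruction:
    limit_watts: float   # value destined for the power limit register
    duration_s: float    # value destined for the limitation release timer


def build_power_limit_instruction(root_cause_category, template):
    """Parse the template's parameter dictionary into an instruction."""
    if root_cause_category not in APPLICABLE_CATEGORIES:
        raise ValueError(f"template not applicable to {root_cause_category!r}")
    params = template["parameters"]  # assumed location of the dictionary
    return PowerLimitInstruction(
        limit_watts=float(params["power_limit_threshold"]),
        duration_s=float(params["limit_duration"]),
    )
```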

[0157] Step S570: Combine and arrange the voltage regulation recovery instruction, temperature control recovery instruction, and power consumption limit recovery instruction according to the propagation influence range characterization information in the fault root cause location analysis results to generate a fault diagnosis and recovery instruction stream with an instruction execution priority order.

[0158] Optionally, step S570 includes steps S571 to S576: Step S571: Analyze the number of nodes and the depth of the connection layer of the fault propagation subtree in the propagation impact range representation information. Determine the propagation impact range level representation information of the fault root cause based on the number of nodes and the depth of the connection layer. The propagation impact range level representation information includes local impact level representation information, regional impact level representation information and global impact level representation information.

[0159] The fault diagnosis and recovery instruction generation module reads the node quantity attribute value and the connection layer depth attribute value from the propagation impact range characterization information. It compares the node quantity with preset first and second threshold values, and simultaneously compares the connection layer depth with preset first and second depth threshold values. If the node quantity is less than the first threshold value and the connection layer depth is less than the first depth threshold value, the propagation impact range level characterization information is assigned the value of local impact level characterization information. If the node quantity is between the first and second threshold values, or the connection layer depth is between the first and second depth threshold values, the propagation impact range level characterization information is assigned the value of regional impact level characterization information. If the node quantity is greater than the second threshold value and the connection layer depth is greater than the second depth threshold value, the propagation impact range level characterization information is assigned the value of global impact level characterization information.
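The threshold comparison can be sketched as a small classifier. The paragraph does not say how to treat values exactly equal to a threshold, nor mixed cases such as a small node quantity combined with a very deep connection layer. In this sketch both are assigned to the regional level, which is an assumption rather than something stated in the embodiment. The threshold values in the usage note are also hypothetical.

```python
from enum import Enum


class ImpactLevel(Enum):
    LOCAL = "local"
    REGIONAL = "regional"
    GLOBAL = "global"


def classify_impact(node_count, depth, n1, n2, d1, d2):
    """Map subtree size and depth to a propagation impact range level.

    n1 < n2 are the node quantity thresholds; d1 < d2 the depth thresholds.
    """
    if node_count < n1 and depth < d1:
        return ImpactLevel.LOCAL
    if node_count > n2 and depth > d2:
        return ImpactLevel.GLOBAL
    # Either value lies between its thresholds, or the case is not covered
    # by the stated rules (assumption: treat it as regional).
    return ImpactLevel.REGIONAL
```

For example, with thresholds `n1=3, n2=8, d1=2, d2=4`, a subtree of 2 nodes at depth 1 is local, 5 nodes at depth 3 is regional, and 10 nodes at depth 5 is global.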

[0160] Step S572: If the propagation impact range level characterization information is local impact level characterization information, then set the instruction execution priority order of voltage regulation recovery instruction, temperature control recovery instruction and power consumption limit recovery instruction to single instruction sequential execution mode. The single instruction sequential execution mode is used to indicate that only a single recovery instruction that matches the monitoring data stream type corresponding to the root cause of the fault is executed.

[0161] Under the local impact level characterization information, the impact of the fault is confined to a single physical subsystem, with no significant cross-subsystem propagation. In this case, the instruction orchestration process selects only the recovery instruction that matches the monitoring data stream type corresponding to the root cause of the fault; the other two recovery instructions are suppressed and not executed. The instruction execution priority order only includes this single recovery instruction, which has the highest and unique priority.

[0162] Step S573: If the propagation impact range level characterization information is the regional impact level characterization information, then set the instruction execution priority order of the voltage regulation recovery instruction, temperature control recovery instruction and power consumption limit recovery instruction to the multi-instruction serial execution mode. The multi-instruction serial execution mode is used to indicate that multiple recovery instructions matching all monitoring data stream types involved in the fault propagation subtree are executed sequentially according to the preset instruction execution order.

[0163] Under the regional impact level characterization information, the fault impact has propagated to several related subsystems, but the overall system remains under control. The instruction combination orchestration process identifies all monitoring data stream types involved in the fault propagation subtree and extracts all recovery instructions corresponding to these data stream types. The extracted recovery instructions are sorted in order of priority: first, recovery instructions from the root cause, then those along the propagation path. The sorted sequence of recovery instructions is assigned decreasing priority levels. The multi-instruction serial execution mode requires the baseboard management controller to execute one recovery instruction and wait for it to stabilize before initiating the execution of the next recovery instruction.

[0164] Step S574: If the propagation impact range level characterization information is global impact level characterization information, then set the instruction execution priority order of voltage regulation recovery instruction, temperature control recovery instruction and power consumption limit recovery instruction to multi-instruction parallel execution mode. The multi-instruction parallel execution mode is used to indicate the simultaneous execution of multiple recovery instructions that match all monitoring data stream types involved in the fault propagation subtree.

[0165] Under the global impact level characterization information, the fault has spread over a wide area and may cause multiple subsystems to simultaneously deviate from their normal operating points. The instruction orchestration process identifies all monitoring data stream types involved in the fault propagation subtree and extracts all recovery instructions corresponding to these monitoring data stream types. All extracted recovery instructions are assigned the same highest priority level and marked as executable in parallel. The multi-instruction parallel execution mode requires the baseboard management controller to simultaneously issue corresponding recovery instructions to multiple control links to suppress further expansion of the fault's impact as quickly as possible.

[0166] Step S575: Assign the voltage regulation recovery instruction, temperature control recovery instruction, and power consumption limit recovery instruction the corresponding instruction execution priority flag and instruction execution timing offset characterization information, respectively, and sort the voltage regulation recovery instruction, temperature control recovery instruction, and power consumption limit recovery instruction according to the instruction execution priority flag and instruction execution timing offset characterization information to generate a recovery instruction sequence with instruction execution order.

[0167] For each recovery instruction included in the execution plan, the instruction orchestration process assigns it an instruction execution priority flag, which is an integer, with smaller values indicating higher priority. In multi-instruction serial execution mode, recovery instructions with adjacent priorities are also assigned instruction execution timing offset information, representing the time interval required between the completion of the previous instruction and the start of the next. All recovery instructions are first arranged in ascending order according to their instruction execution priority flags, and instructions with the same priority flag are then arranged in ascending order according to their instruction execution timing offset information, forming a recovery instruction sequence.
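The three execution modes and the two-key sort can be sketched together. The level names, the `settle_s` waiting interval, and the convention that `path_kinds` lists the monitoring data stream types of the fault propagation subtree with the root cause first are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryInstruction:
    kind: str        # "voltage", "temperature", or "power"
    priority: int    # smaller value means higher priority
    offset_s: float  # wait after the previous instruction completes


def orchestrate(level, path_kinds, settle_s=2.0):
    """Build the ordered recovery instruction sequence for an impact level.

    path_kinds: data stream types in the propagation subtree, root cause first.
    """
    if level == "local":
        # Single-instruction mode: only the root cause's instruction runs.
        plan = [RecoveryInstruction(path_kinds[0], 0, 0.0)]
    elif level == "regional":
        # Serial mode: decreasing priority along the propagation path.
        plan = [
            RecoveryInstruction(kind, i, 0.0 if i == 0 else settle_s)
            for i, kind in enumerate(path_kinds)
        ]
    else:
        # Parallel mode: every instruction shares the highest priority.
        plan = [RecoveryInstruction(kind, 0, 0.0) for kind in path_kinds]
    # Ascending priority flag first, then ascending timing offset.
    return sorted(plan, key=lambda r: (r.priority, r.offset_s))
```

Because Python's `sorted` is stable, instructions that tie on both keys keep their propagation path order.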

[0168] Step S576: Combine and encapsulate the recovery instruction sequence with the instruction execution priority order to generate a fault diagnosis and recovery instruction stream. The fault diagnosis and recovery instruction stream is used to instruct the target server motherboard to perform recovery operations that match the root cause of the fault according to the instruction execution priority order and the recovery instruction sequence.

[0169] The combined encapsulation process uses the recovery instruction sequence as the instruction list payload and the instruction execution priority order as the execution mode description field, and adds a global identifier for the instruction stream, a generation timestamp, and a digital signature verification field. The result is encapsulated into a fault diagnosis and recovery instruction stream conforming to the input format of the baseboard management controller's task scheduler. After receiving this instruction stream, the task scheduler parses the execution mode description field and, according to the instruction execution priority order and the timing constraints in the recovery instruction sequence, distributes the recovery instructions one by one or in parallel to the voltage regulation link driver, thermal control link driver, or power management link driver, completing the fault diagnosis and recovery operation for the target server motherboard.
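The encapsulation fields can be illustrated with the sketch below. The embodiment does not define the scheduler's input format or the signature scheme, so the JSON layout and field names are hypothetical. An HMAC-SHA256 tag over a shared key is used here as a stand-in for the digital signature verification field; it is a message authentication code, not an asymmetric signature.

```python
import hashlib
import hmac
import json
import time
import uuid


def _canonical(body):
    """Serialize the body deterministically so both sides hash the same bytes."""
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()


def encapsulate(instructions, execution_mode, key):
    """Wrap the recovery instruction sequence into a signed instruction stream."""
    body = {
        "stream_id": str(uuid.uuid4()),    # global identifier
        "generated_at": time.time(),       # generation timestamp
        "execution_mode": execution_mode,  # execution mode description field
        "instructions": instructions,      # instruction list payload
    }
    tag = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    return dict(body, signature=tag)


def verify(stream, key):
    """Recompute the tag over every field except the signature itself."""
    body = {k: v for k, v in stream.items() if k != "signature"}
    expected = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, stream.get("signature", ""))
```

A receiver that shares the key can call `verify` before parsing the execution mode field. Any change to the payload or the mode after signing causes verification to fail.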

[0170] Based on the foregoing embodiments, this invention provides a fault diagnosis device. The units and modules included in the device can be implemented by a processor in a computer device; of course, they can also be implemented by specific logic circuits. In the implementation process, the processor can be a central processing unit (CPU), a microprocessor unit (MPU), a digital signal processor (DSP), or a field programmable gate array (FPGA), etc.

[0171] Figure 2 is a schematic diagram of the composition structure of a fault diagnosis device provided in an embodiment of the present invention. As shown in Figure 2, the fault diagnosis device 200 includes:
The data acquisition module 210, used to acquire the raw operating status monitoring data stream generated by the target server motherboard in the running state. The raw operating status monitoring data stream includes continuously collected motherboard voltage monitoring data stream segments, motherboard temperature monitoring data stream segments, and motherboard power consumption monitoring data stream segments with time sequence marks.
The data mapping module 220, used to perform operating status fault symptom mapping processing on the motherboard voltage monitoring data stream segment, motherboard temperature monitoring data stream segment, and motherboard power consumption monitoring data stream segment in the raw operating status monitoring data stream, to obtain the voltage abnormality symptom mapping map, temperature abnormality symptom mapping map, and power consumption abnormality symptom mapping map corresponding to the target server motherboard.
The network construction module 230, used to construct a multi-source abnormal symptom time-series correlation network of the target server motherboard based on the voltage abnormality symptom mapping map, temperature abnormality symptom mapping map, and power consumption abnormality symptom mapping map.
The fault location module 240, used to perform fault root cause location analysis on the multi-source abnormal symptom time-series correlation network to obtain the fault root cause location analysis results of the target server motherboard. The fault root cause location analysis results include the identifier of the monitoring data stream segment where the fault root cause is located, the identifier of the abnormal symptom category corresponding to the fault root cause, and the propagation influence range characterization information of the fault root cause in the multi-source abnormal symptom time-series correlation network.
The instruction generation module 250, used to generate a fault diagnosis and recovery instruction stream for the target server motherboard based on the fault root cause location analysis results. The fault diagnosis and recovery instruction stream includes the voltage regulation recovery instruction, temperature control recovery instruction, and power consumption limit recovery instruction corresponding to the fault root cause.

[0172] The description of the above device embodiments is similar to the description of the above method embodiments, and has similar beneficial effects as the method embodiments. Please refer to the description of the method embodiments of the present invention for understanding.

[0173] Figure 3 is a hardware entity diagram of a computer system provided in an embodiment of the present invention. As shown in Figure 3, the hardware entity of the computer system 1000 includes a processor 1001 and a memory 1002, wherein the memory 1002 stores a computer program that can run on the processor 1001, and the processor 1001 executes the program to implement the steps in the method of any of the above embodiments.

[0174] The memory 1002 is configured to store instructions and applications, including the computer program, that can be executed by the processor 1001. It can also cache data that is to be processed or has already been processed by the processor 1001 and the modules of the computer system 1000 (e.g., image data, audio data, voice communication data, and video communication data). The memory 1002 can be implemented by flash memory or random access memory (RAM).

[0175] The processor 1001 executes the program to implement the steps of the intelligent diagnostic method for server motherboard faults described above. The processor 1001 typically controls the overall operation of the computer system 1000.

[0176] This invention provides a computer storage medium storing one or more programs that can be executed by one or more processors to implement the steps of the server motherboard fault intelligent diagnosis method as described in any of the above embodiments.

Claims

1. A method for intelligent diagnosis of server motherboard faults, characterized in that the method includes:
acquiring the raw operating status monitoring data stream generated by the target server motherboard in the running state, wherein the raw operating status monitoring data stream includes continuously collected motherboard voltage monitoring data stream segments, motherboard temperature monitoring data stream segments, and motherboard power consumption monitoring data stream segments with time sequence marks;
performing operating status fault symptom mapping processing on the motherboard voltage monitoring data stream segment, the motherboard temperature monitoring data stream segment, and the motherboard power consumption monitoring data stream segment in the raw operating status monitoring data stream to obtain the voltage abnormality symptom mapping map, temperature abnormality symptom mapping map, and power consumption abnormality symptom mapping map corresponding to the target server motherboard;
constructing a multi-source abnormal symptom time-series correlation network of the target server motherboard based on the voltage abnormality symptom mapping map, the temperature abnormality symptom mapping map, and the power consumption abnormality symptom mapping map;
performing fault root cause location analysis on the multi-source abnormal symptom time-series correlation network to obtain the fault root cause location analysis result of the target server motherboard, wherein the fault root cause location analysis result includes the monitoring data stream segment identifier where the fault root cause is located, the abnormal symptom category identifier corresponding to the fault root cause, and the propagation influence range characterization information of the fault root cause in the multi-source abnormal symptom time-series correlation network; and
generating a fault diagnosis and recovery instruction stream for the target server motherboard based on the fault root cause location analysis result, wherein the fault diagnosis and recovery instruction stream includes the voltage regulation recovery instruction, temperature control recovery instruction, and power consumption limit recovery instruction corresponding to the fault root cause.

2. The intelligent diagnostic method for server motherboard faults according to claim 1, characterized in that, The process of mapping operational status fault symptoms onto the motherboard voltage monitoring data stream segment, motherboard temperature monitoring data stream segment, and motherboard power consumption monitoring data stream segment in the original operational status monitoring data stream yields a voltage anomaly symptom mapping map, a temperature anomaly symptom mapping map, and a power consumption anomaly symptom mapping map corresponding to the target server motherboard, including: Extract the voltage amplitude change trajectory corresponding to adjacent time markers in the motherboard voltage monitoring data stream segment, and input the voltage amplitude change trajectory into the pre-constructed voltage anomaly symptom mapping model to obtain voltage fluctuation trajectory characterization information; Extract the temperature gradient change trend corresponding to adjacent time markers in the motherboard temperature monitoring data stream segment, and input the temperature gradient change trend into the pre-constructed temperature anomaly sign mapping model to obtain temperature change trend characterization information. 
Extract the power transient jump trajectory corresponding to adjacent timing markers in the motherboard power consumption monitoring data stream segment, and input the power transient jump trajectory into the pre-constructed power anomaly symptom mapping model to obtain power transient jump characterization information; Based on the voltage drop trajectory segment, voltage rise trajectory segment, and voltage ripple trajectory segment in the voltage fluctuation trajectory characterization information, symptom category matching processing is performed with a preset voltage fault symptom category library to obtain the power supply link abnormal symptom mapping association corresponding to voltage drop, the voltage regulation abnormal symptom mapping association corresponding to voltage rise, and the filter circuit abnormal symptom mapping association corresponding to voltage ripple. The power supply link abnormal symptom mapping association, the voltage regulation abnormal symptom mapping association, and the filter circuit abnormal symptom mapping association are then processed into a graph-based organization to generate the voltage abnormal symptom mapping graph. Based on the temperature step increase trend segment, temperature oscillation trend segment, and temperature continuous deviation trend segment in the temperature change trend characterization information, symptom category matching processing is performed with a preset temperature fault symptom category library to obtain the heat dissipation path blockage symptom mapping association corresponding to temperature step increase, the heat dissipation fan operation abnormality symptom mapping association corresponding to temperature oscillation, and the heat dissipation medium failure symptom mapping association corresponding to temperature continuous deviation. 
The heat dissipation path blockage symptom mapping association, the heat dissipation fan operation abnormality symptom mapping association, and the heat dissipation medium failure symptom mapping association are then processed into a graph-based organization to generate the temperature anomaly symptom mapping graph. Based on the power consumption peak pulse segment, the power consumption trough segment, and the power consumption periodic fluctuation segment in the power consumption transient jump characterization information, symptom category matching processing is performed with a preset power consumption fault symptom category library to obtain the power supply link short circuit symptom mapping association corresponding to the power consumption peak pulse, the power supply link open circuit symptom mapping association corresponding to the power consumption trough, and the load link abnormal symptom mapping association corresponding to the power consumption periodic fluctuation. The power supply link short circuit symptom mapping association, the power supply link open circuit symptom mapping association, and the load link abnormal symptom mapping association are then processed into a graph-based organization to generate the power consumption abnormal symptom mapping graph.

3. The intelligent diagnostic method for server motherboard faults according to claim 2, characterized in that, The step of matching the voltage drop trajectory segment, voltage rise trajectory segment, and voltage ripple trajectory segment in the voltage fluctuation trajectory characterization information with a preset voltage fault symptom category library to obtain the power supply link abnormal symptom mapping association corresponding to voltage drop, the voltage regulation abnormal symptom mapping association corresponding to voltage rise, and the filter circuit abnormal symptom mapping association corresponding to voltage ripple, including: The voltage drop trajectory segment is subjected to correlation feature extraction processing of the drop amplitude and drop duration to obtain voltage drop correlation feature representation information; The voltage drop correlation feature characterization information is input into the power supply link abnormality symptom matching branch in the voltage fault symptom category library. The symptom feature similarity distribution between the voltage drop correlation feature characterization information and each candidate power supply link abnormality symptom template in the power supply link abnormality symptom matching branch is calculated. The power supply link abnormality symptom mapping association corresponding to the voltage drop trajectory segment is determined according to the similarity peak position in the symptom feature similarity distribution. The voltage surge trajectory segment is subjected to correlation feature extraction processing of surge amplitude and surge recovery time to obtain voltage surge correlation feature representation information; The voltage surge correlation feature characterization information is input into the voltage regulation abnormality symptom matching branch in the voltage fault symptom category library. 
The symptom feature similarity distribution between the voltage surge correlation feature characterization information and each candidate voltage regulation abnormality symptom template in the voltage regulation abnormality symptom matching branch is calculated. The voltage regulation abnormality symptom mapping association corresponding to the voltage surge trajectory segment is determined according to the similarity peak position in the symptom feature similarity distribution. The voltage ripple trajectory segment is subjected to correlation feature extraction processing of ripple frequency and ripple amplitude to obtain voltage ripple correlation feature characterization information; The voltage ripple correlation feature characterization information is input into the filter circuit abnormality symptom matching branch in the voltage fault symptom category library. The symptom feature similarity distribution between the voltage ripple correlation feature characterization information and each candidate filter circuit abnormality symptom template in the filter circuit abnormality symptom matching branch is calculated. The filter circuit abnormality symptom mapping association corresponding to the voltage ripple trajectory segment is determined according to the similarity peak position in the symptom feature similarity distribution.

4. The intelligent diagnostic method for server motherboard faults according to claim 2, characterized in that, The step-up temperature trend segment, the temperature oscillation trend segment, and the temperature continuous deviation trend segment in the temperature change trend characterization information are respectively matched with a preset temperature fault symptom category library to obtain the symptom mapping associations for the heat dissipation path blockage corresponding to the temperature step-up, the symptom mapping associations for the abnormal operation of the cooling fan corresponding to the temperature oscillation, and the symptom mapping associations for the heat dissipation medium failure corresponding to the temperature continuous deviation, including: The correlation feature extraction process between the step start time and the step slope is performed on the temperature step upward trend segment to obtain temperature step correlation feature characterization information. The temperature step correlation feature characterization information is input into the heat dissipation path blockage symptom matching branch in the temperature fault symptom category library. The symptom feature similarity distribution between the temperature step correlation feature characterization information and each candidate heat dissipation path blockage symptom template in the heat dissipation path blockage symptom matching branch is calculated. The heat dissipation path blockage symptom mapping association corresponding to the temperature step upward trend segment is determined according to the similarity peak position in the symptom feature similarity distribution. 
The correlation features of oscillation frequency and oscillation amplitude are extracted from the temperature oscillation trend segment to obtain temperature oscillation correlation feature characterization information; The temperature oscillation correlation feature characterization information is input into the cooling fan operation abnormality symptom matching branch in the temperature fault symptom category library. The symptom feature similarity distribution between the temperature oscillation correlation feature characterization information and each candidate cooling fan operation abnormality symptom template in the cooling fan operation abnormality symptom matching branch is calculated. The cooling fan operation abnormality symptom mapping association corresponding to the temperature oscillation fluctuation trend segment is determined according to the similarity peak position in the symptom feature similarity distribution. The correlation feature extraction process of deviation from the baseline and the duration of deviation is performed on the temperature deviation trend segment to obtain the temperature deviation correlation feature representation information. The temperature deviation association feature characterization information is input into the heat dissipation medium failure symptom matching branch in the temperature fault symptom category library. The symptom feature similarity distribution between the temperature deviation association feature characterization information and each candidate heat dissipation medium failure symptom template in the heat dissipation medium failure symptom matching branch is calculated. The heat dissipation medium failure symptom mapping association corresponding to the temperature deviation trend segment is determined according to the similarity peak position in the symptom feature similarity distribution.

5. The intelligent diagnostic method for server motherboard faults according to claim 1, characterized in that the construction of the multi-source anomaly symptom time-series correlation network for the target server motherboard based on the voltage anomaly symptom mapping map, the temperature anomaly symptom mapping map, and the power consumption anomaly symptom mapping map includes: The voltage anomaly symptom occurrence time marker corresponding to each voltage anomaly symptom mapping association in the voltage anomaly symptom mapping map is extracted, and the voltage anomaly symptom occurrence time markers are paired with the voltage anomaly symptom mapping associations to obtain a time-indexed voltage anomaly symptom time series. The temperature anomaly symptom occurrence time marker corresponding to each temperature anomaly symptom mapping association in the temperature anomaly symptom mapping map is extracted, and the temperature anomaly symptom occurrence time markers are paired with the temperature anomaly symptom mapping associations to obtain a time-indexed temperature anomaly symptom time series. The power consumption anomaly symptom occurrence time marker corresponding to each power consumption anomaly symptom mapping association in the power consumption anomaly symptom mapping map is extracted, and the power consumption anomaly symptom occurrence time markers are paired with the power consumption anomaly symptom mapping associations to obtain a time-indexed power consumption anomaly symptom time series. The voltage anomaly symptom time series, the temperature anomaly symptom time series, and the power consumption anomaly symptom time series are input into a pre-constructed time-series correlation analysis model, and time-series alignment processing is performed on the voltage anomaly symptom occurrence time markers, the temperature anomaly symptom occurrence time markers, and the power consumption anomaly symptom occurrence time markers to obtain temporal dependency correlation characterization information between different anomaly symptoms.
Based on the voltage anomaly symptom mapping associations, the temperature anomaly symptom mapping associations, and the power consumption anomaly symptom mapping associations, combined with the temporal dependency correlation characterization information, the multi-source anomaly symptom time-series correlation network for the target server motherboard is constructed. The multi-source anomaly symptom time-series correlation network uses the voltage anomaly symptom mapping associations, the temperature anomaly symptom mapping associations, and the power consumption anomaly symptom mapping associations as network nodes, and the temporal dependency correlation characterization information as network edges. Network topology analysis is performed on the multi-source anomaly symptom time-series correlation network to extract the key convergence nodes and key divergence nodes in the multi-source anomaly symptom time-series correlation network, and an anomaly symptom propagation path map of the multi-source anomaly symptom time-series correlation network is generated based on the key convergence nodes and the key divergence nodes.
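
As a rough illustration of the node and edge structure in claim 5, the sketch below pairs each symptom with its time marker and links symptoms from different monitoring sources. The claim obtains the edges from a pre-constructed time-series correlation analysis model; the fixed `max_lag` window used here is only an assumed stand-in for that model, and the `Symptom` type and all labels are hypothetical.

```python
from collections import namedtuple

# A symptom mapping association paired with its occurrence time marker.
Symptom = namedtuple("Symptom", "source label timestamp")

def build_network(symptoms, max_lag):
    """Return (nodes, edges); edges map (earlier_label, later_label) -> time lag.

    Assumption: two symptoms from different sources are linked when the later
    one follows the earlier one within max_lag time units.
    """
    ordered = sorted(symptoms, key=lambda s: s.timestamp)
    edges = {}
    for i, earlier in enumerate(ordered):
        for later in ordered[i + 1:]:
            lag = later.timestamp - earlier.timestamp
            if lag > max_lag:
                break  # remaining symptoms are even later
            if lag > 0 and earlier.source != later.source:
                edges[(earlier.label, later.label)] = lag
    return [s.label for s in ordered], edges

symptoms = [
    Symptom("temperature", "fan_fault", 3),
    Symptom("voltage", "vrm_droop", 0),
    Symptom("power", "power_spike", 5),
]
nodes, edges = build_network(symptoms, max_lag=4)
```

In this toy input, `vrm_droop` is not linked directly to `power_spike` because the lag of 5 exceeds the window.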

6. The intelligent diagnostic method for server motherboard faults according to claim 5, characterized in that inputting the voltage anomaly symptom time series, the temperature anomaly symptom time series, and the power consumption anomaly symptom time series into the pre-constructed time-series correlation analysis model, and performing time-series alignment processing on the voltage anomaly symptom occurrence time markers, the temperature anomaly symptom occurrence time markers, and the power consumption anomaly symptom occurrence time markers to obtain the temporal dependency correlation characterization information between different anomaly symptoms, includes: The voltage anomaly symptom time series and the temperature anomaly symptom time series are input into the first time-series comparison branch of the time-series correlation analysis model. Cross-time-series matching processing is performed on the voltage anomaly symptom occurrence time markers and the temperature anomaly symptom occurrence time markers. The first time offset distribution between the voltage anomaly symptom occurrence time markers and the temperature anomaly symptom occurrence time markers is calculated. The first time-series interval distribution, in which the voltage anomaly symptom occurs before the temperature anomaly symptom, is determined based on the peak interval of the first time offset distribution. The temperature anomaly symptom time series and the power consumption anomaly symptom time series are input into the second time-series comparison branch of the time-series correlation analysis model. Cross-time-series matching processing is performed on the temperature anomaly symptom occurrence time markers and the power consumption anomaly symptom occurrence time markers. The second time offset distribution between the temperature anomaly symptom occurrence time markers and the power consumption anomaly symptom occurrence time markers is calculated.
The second time-series interval distribution, in which the temperature anomaly symptom occurs before the power consumption anomaly symptom, is determined based on the peak interval of the second time offset distribution. The power consumption anomaly symptom time series and the voltage anomaly symptom time series are input into the third time-series comparison branch of the time-series correlation analysis model. Cross-time-series matching processing is performed on the power consumption anomaly symptom occurrence time markers and the voltage anomaly symptom occurrence time markers. The third time offset distribution between the power consumption anomaly symptom occurrence time markers and the voltage anomaly symptom occurrence time markers is calculated. The third time-series interval distribution, in which the power consumption anomaly symptom occurs before the voltage anomaly symptom, is determined based on the peak interval of the third time offset distribution. Time-series dependency consistency verification processing is performed on the first, second, and third time-series interval distributions to detect time-series transitive consistency conflicts among them. Time-series constraint adjustment processing is performed on the time-series interval distributions that have time-series transitive consistency conflicts to obtain the adjusted first time-series interval distribution, the adjusted second time-series interval distribution, and the adjusted third time-series interval distribution.
Based on the adjusted first time-series interval distribution, the adjusted second time-series interval distribution, and the adjusted third time-series interval distribution, a temporal dependency array is constructed among the voltage anomaly symptom time series, the temperature anomaly symptom time series, and the power consumption anomaly symptom time series. The temporal dependency array is converted into the temporal dependency correlation characterization information, which has a directed acyclic graph structure.
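
The offset-distribution and consistency steps of claim 6 can be sketched as follows. Two points are assumptions rather than anything the claim states: the offset distribution is taken to be a simple histogram of pairwise time differences, and a transitive conflict (voltage before temperature before power consumption before voltage) is resolved by dropping the least supported direction. The function names and the `bin_width` parameter are hypothetical.

```python
from collections import Counter

def offset_peak(earlier_markers, later_markers, bin_width=1.0):
    """Histogram the positive offsets later - earlier and return (peak_offset, count)."""
    offsets = [b - a for a in earlier_markers for b in later_markers if b > a]
    if not offsets:
        return None
    histogram = Counter(int(o // bin_width) for o in offsets)
    # Highest count wins; ties go to the smaller offset.
    peak_bin, count = max(histogram.items(), key=lambda kv: (kv[1], -kv[0]))
    return peak_bin * bin_width, count

CYCLE = [("voltage", "temperature"), ("temperature", "power"), ("power", "voltage")]

def resolve_cycle(edges):
    """If all three directions are present they form a cycle, which a directed
    acyclic graph cannot contain. Assumed adjustment: drop the weakest edge."""
    if all(e in edges for e in CYCLE):
        weakest = min(CYCLE, key=lambda e: edges[e])
        return {e: w for e, w in edges.items() if e != weakest}
    return dict(edges)

peak = offset_peak([0, 10, 20], [2, 12, 22])
dag_edges = resolve_cycle({CYCLE[0]: 3, CYCLE[1]: 2, CYCLE[2]: 1})
```

Here the voltage markers lead the temperature markers by a dominant offset of 2 time units, and the weakly supported power-to-voltage direction is removed so that the remaining edges form a directed acyclic graph.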

7. The intelligent diagnostic method for server motherboard faults according to claim 5, characterized in that performing network topology analysis on the multi-source anomaly symptom time-series correlation network to extract the key convergence nodes and key divergence nodes in the multi-source anomaly symptom time-series correlation network includes: The number of incoming edges (in-degree) and the number of outgoing edges (out-degree) are calculated for each network node in the multi-source anomaly symptom time-series correlation network, and degree distribution representation information of the network nodes is constructed based on the in-degree and the out-degree. The degree distribution representation information is sorted by node importance. Network nodes whose in-degree exceeds the in-degree threshold are marked as the candidate convergence node set, and network nodes whose out-degree exceeds the out-degree threshold are marked as the candidate divergence node set. For each candidate convergence node in the candidate convergence node set, convergence path backtracking analysis is performed to extract all predecessor network nodes connected to the candidate convergence node and the edge weight distribution between the predecessor network nodes and the candidate convergence node, and the convergence intensity metric of the candidate convergence node is calculated based on the edge weight distribution. For each candidate divergence node in the candidate divergence node set, divergence path forward tracing analysis is performed to extract all successor network nodes connected to the candidate divergence node and the edge weight distribution between the candidate divergence node and the successor network nodes, and the divergence intensity metric of the candidate divergence node is calculated based on the edge weight distribution.
The candidate convergence node set is filtered according to the convergence intensity metric, and the candidate convergence nodes whose convergence intensity metric exceeds the convergence intensity threshold are determined as the key convergence nodes in the multi-source anomaly symptom time-series correlation network. The candidate divergence node set is filtered according to the divergence intensity metric, and the candidate divergence nodes whose divergence intensity metric exceeds the divergence intensity threshold are determined as the key divergence nodes in the multi-source anomaly symptom time-series correlation network.
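
A minimal sketch of the degree-then-intensity filtering in claim 7 follows. The claim does not define how the intensity metric is computed from the edge weight distribution, so the sum of edge weights is assumed; a single degree threshold is used for both directions to keep the example short. All names and values are hypothetical.

```python
def key_nodes(edges, degree_threshold, intensity_threshold):
    """Return (key_convergence_nodes, key_divergence_nodes) as sets.

    edges maps (source, destination) -> edge weight.
    Assumption: the intensity metric is the sum of the relevant edge weights.
    """
    incoming, outgoing = {}, {}
    for (src, dst), weight in edges.items():
        outgoing.setdefault(src, []).append(weight)
        incoming.setdefault(dst, []).append(weight)

    def select(weights_by_node):
        return {
            node
            for node, weights in weights_by_node.items()
            if len(weights) > degree_threshold       # candidate by degree
            and sum(weights) > intensity_threshold   # kept by intensity
        }

    return select(incoming), select(outgoing)

edges = {("a", "c"): 0.6, ("b", "c"): 0.5, ("c", "d"): 0.9, ("c", "e"): 0.4}
convergence, divergence = key_nodes(edges, degree_threshold=1, intensity_threshold=1.0)
```

In this toy graph, node `c` is both a key convergence node (two incoming edges) and a key divergence node (two outgoing edges), which is the pattern of a symptom that collects upstream effects and fans them out.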

8. The intelligent diagnostic method for server motherboard faults according to claim 1, characterized in that performing fault root cause localization analysis on the multi-source anomaly symptom time-series correlation network to obtain the fault root cause localization analysis result of the target server motherboard includes: The edge weight distribution between each key divergence node and the successor network nodes connected to the key divergence node in the multi-source anomaly symptom time-series correlation network is extracted, and the fault impact propagation intensity of the key divergence node on the successor network nodes is determined based on the edge weight distribution. The edge weight distribution between each key convergence node and the predecessor network nodes connected to the key convergence node in the multi-source anomaly symptom time-series correlation network is extracted, and the fault impact convergence intensity of the predecessor network nodes on the key convergence node is determined based on the edge weight distribution. The key divergence nodes, the key convergence nodes, the fault impact propagation intensity, and the fault impact convergence intensity are input into a pre-constructed fault root cause localization analysis model, and reverse tracing of the fault propagation path is performed on the multi-source anomaly symptom time-series correlation network to locate the fault root cause node of the target server motherboard. The node identifier of the fault root cause node in the multi-source anomaly symptom time-series correlation network is extracted, and the monitoring data stream segment identifier where the fault root cause is located is determined based on the node identifier. The anomaly symptom category identifier corresponding to the fault root cause node is extracted.
Based on the fault root cause node and all successor network nodes connected to the fault root cause node, the fault propagation subtree starting from the fault root cause node is extracted from the multi-source anomaly symptom time-series correlation network. Propagation impact range characterization information of the fault root cause in the multi-source anomaly symptom time-series correlation network is generated based on the number of nodes and the edge level depth of the fault propagation subtree. The monitoring data stream segment identifier, the anomaly symptom category identifier, and the propagation impact range characterization information are encapsulated to generate the fault root cause localization analysis result of the target server motherboard.
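
The propagation impact range in claim 8 is derived from the node count and depth of the fault propagation subtree. One way to compute both, assuming the network is stored as a successor adjacency mapping and that depth means the number of edge levels reached in a breadth-first traversal, is sketched below. The function name and node labels are hypothetical.

```python
from collections import deque

def propagation_scope(successors, root):
    """Breadth-first walk from the root cause node.

    successors maps node -> list of successor nodes.
    Returns the number of reachable nodes (root included) and the deepest
    edge level reached.
    """
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, ()):
            if nxt not in depth:  # first visit gives the shallowest level
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return {"node_count": len(depth), "max_depth": max(depth.values())}

successors = {"vrm_droop": ["fan_fault", "power_spike"], "fan_fault": ["thermal_trip"]}
scope = propagation_scope(successors, "vrm_droop")
```

The resulting dictionary could be packaged together with the monitoring data stream segment identifier and the anomaly symptom category identifier to form the localization result.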

9. The intelligent diagnostic method for server motherboard faults according to claim 8, characterized in that inputting the key divergence nodes, the key convergence nodes, the fault impact propagation intensity, and the fault impact convergence intensity into the pre-constructed fault root cause localization analysis model, and performing reverse tracing of the fault propagation path on the multi-source anomaly symptom time-series correlation network to locate the fault root cause node of the target server motherboard, includes: All network nodes in the multi-source anomaly symptom time-series correlation network are initialized as the candidate fault root cause node set. For each candidate fault root cause node in the candidate fault root cause node set, predecessor node existence detection is performed. If the candidate fault root cause node has a predecessor network node in the multi-source anomaly symptom time-series correlation network, the candidate fault root cause node is removed from the candidate fault root cause node set, to obtain the candidate fault root cause node set after the first screening. For each candidate fault root cause node in the candidate fault root cause node set after the first screening, the fault impact propagation intensity is verified. If the fault impact propagation intensity of the candidate fault root cause node is lower than the propagation intensity threshold, the candidate fault root cause node is removed from the candidate fault root cause node set after the first screening, to obtain the candidate fault root cause node set after the second screening. For each candidate fault root cause node in the candidate fault root cause node set after the second screening, fault impact propagation path length evaluation is performed, in which the propagation path level depth from the candidate fault root cause node to its farthest successor network node in the multi-source anomaly symptom time-series correlation network is calculated.
If the propagation path level depth is lower than the path depth threshold, the candidate fault root cause node is removed from the candidate fault root cause node set after the second screening, to obtain the candidate fault root cause node set after the third screening. If the candidate fault root cause node set after the third screening contains only one candidate fault root cause node, that candidate fault root cause node is determined as the fault root cause node of the target server motherboard. If the candidate fault root cause node set after the third screening contains multiple candidate fault root cause nodes, the anomaly symptom occurrence time marker of each candidate fault root cause node is extracted, and the candidate fault root cause node with the earliest anomaly symptom occurrence time marker is determined as the fault root cause node of the target server motherboard.
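
The three-stage screening in claim 9 can be sketched as below. Several details are assumptions: the fault impact propagation intensity is taken as the sum of outgoing edge weights, the path level depth as the longest path in a directed acyclic graph, and ties among surviving candidates are broken by the earliest time marker. The function name and all data are hypothetical.

```python
from functools import lru_cache

def locate_root_cause(nodes, edges, timestamps, intensity_threshold, depth_threshold):
    """Screen candidates in three stages and return the root cause node or None.

    edges maps (source, destination) -> weight; the graph is assumed acyclic.
    """
    successors, has_predecessor = {}, set()
    for (src, dst), weight in edges.items():
        successors.setdefault(src, []).append((dst, weight))
        has_predecessor.add(dst)

    @lru_cache(maxsize=None)
    def depth(node):
        # Longest path, in edge levels, from node to its farthest successor.
        return max((1 + depth(dst) for dst, _ in successors.get(node, [])), default=0)

    # First screening: a root cause has no predecessor.
    stage1 = [n for n in nodes if n not in has_predecessor]
    # Second screening: drop candidates with weak outgoing influence.
    stage2 = [n for n in stage1
              if sum(w for _, w in successors.get(n, [])) >= intensity_threshold]
    # Third screening: drop candidates whose influence does not reach far enough.
    stage3 = [n for n in stage2 if depth(n) >= depth_threshold]
    if not stage3:
        return None
    # Assumed tie-break: earliest anomaly symptom occurrence time marker.
    return min(stage3, key=lambda n: timestamps[n])

nodes = ["a", "b", "c", "d"]
edges = {("a", "c"): 0.8, ("c", "d"): 0.7, ("b", "d"): 0.1}
timestamps = {"a": 0, "b": 1, "c": 2, "d": 3}
root = locate_root_cause(nodes, edges, timestamps, 0.5, 2)
```

In this example `b` also has no predecessor, but it is removed at the second screening because its only outgoing edge is weak.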

10. A computer system comprising a memory and a processor, the memory storing a computer program executable on the processor, characterized in that, when the processor executes the computer program, the steps of the method according to any one of claims 1 to 9 are implemented.

11. A computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the computer program implements the steps of the method according to any one of claims 1 to 9.