Anomaly detection method for supercomputing cluster

By combining a topology-constrained graph structure generation network with a time series prediction network, together with an adaptive data augmentation strategy, the method addresses model proliferation and inaccurate graph structure learning in supercomputing cluster anomaly detection, achieving more efficient and accurate anomaly detection.

CN122220184A · Pending · Publication Date: 2026-06-16 · 国家超级计算天津中心 +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Application (China)
Current Assignee / Owner
国家超级计算天津中心
Filing Date
2026-05-18
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing methods for detecting anomalies in supercomputing clusters are ineffective at detecting cross-node link anomalies. The number of models increases, parameter maintenance becomes complex, graph structure learning is inaccurate, data augmentation strategies disrupt logical dependencies, and multi-task optimization causes models to get stuck in local optima, affecting the stability and generalization ability of online detection.

Method used

By combining graph structure generator networks with supercomputing cluster topology constraints, dynamic dependencies are determined from spatial and temporal causal perspectives. An adaptive data augmentation strategy is employed to jointly train graph structure and temporal prediction networks, enabling comprehensive and accurate anomaly detection of supercomputing clusters.

Benefits of technology

It improves the comprehensiveness and accuracy of anomaly detection, enhances the robustness of the model in noisy environments, solves the problem of false negatives in homogeneous nodes, and achieves cluster-level anomaly detection that is more efficient and easier to deploy and maintain.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides an anomaly detection method for a supercomputer cluster, comprising: performing feature extraction and fusion processing on index data and log data of the same node to obtain a feature tensor at the current time; inputting the feature tensor into a graph structure generation network to obtain a graph structure adjacency matrix through a learnable structure inference calculation; inputting the graph structure adjacency matrix and the feature tensor in a set historical time window into a trained graph encoding network to obtain a spatiotemporal graph embedding sequence; inputting the spatiotemporal graph embedding sequence into a trained time series prediction network to obtain feature prediction data of each node at the next time; and determining whether the supercomputer cluster has an abnormal condition according to the error between the feature prediction data and the feature observation data of each node at the next time. Through the technical scheme of the application, the cluster-level anomaly detection purpose of being more comprehensive, more accurate, more efficient, and easier to deploy and maintain can be achieved.

Description

Technical Field

[0001] This application relates to the field of anomaly detection technology, specifically to an anomaly detection method for supercomputing clusters.

Background Technology

[0002] With the rapid development of High Performance Computing (HPC) technology, supercomputer clusters are becoming increasingly large, typically containing tens of thousands of computing nodes, storage nodes, and complex interconnection network devices. To ensure the efficient and stable operation of supercomputing clusters, real-time monitoring and anomaly detection of key performance indicators (such as CPU utilization, memory bandwidth, I/O throughput, and network latency) of each node are necessary. These indicators constitute typical massive, high-dimensional, multivariate time series data.

[0003] In a supercomputing cluster environment, nodes are not isolated but tightly connected through a high-speed interconnect network, running complex distributed parallel jobs. This means that performance fluctuations in one node often propagate to other nodes through network topology or task scheduling logic, forming complex spatial topology dependencies. Furthermore, this propagation typically has a time lag (for example, I/O congestion on a storage node may only cause CPU waiting on a compute node several seconds later).

[0004] In ensuring high reliability of clusters, existing anomaly detection methods face severe challenges, mainly in the following four aspects:

[0005] Modeling efficiency and generalization challenges in large-scale clusters. Traditional methods often employ a "one node, one model" or "one type of node, one model" strategy, ignoring dependencies and propagation effects between nodes, making it difficult to detect anomalies across node links. In large-scale clusters, this approach leads to problems such as model proliferation, complex parameter maintenance, and high overhead from repetitive training at the engineering deployment level, making it difficult to meet the requirements of high efficiency, scalability, and sustainable operation and maintenance in production environments. Therefore, an anomaly detection method is needed that can uniformly represent the global state through a single model and provide unified modeling.

[0006] Existing graph methods generally rely on static graphs or single-perspective graph construction, making it difficult to obtain dynamic graph structures consistent with actual operational patterns. Although supercomputing systems have relatively fixed logical interconnections and access paths, dependencies caused by factors such as job scheduling, resource contention, and I/O congestion exhibit significant time-varying characteristics. Using only static topology or empirical edge connections can easily lead to inconsistencies between the learned graph structure and actual operational dependencies, thereby weakening the graph model's ability to represent anomaly propagation chains. On the other hand, many graph construction methods infer edges from only a single perspective and easily overlook causal transmission relationships with time lag effects, so that chain-like anomalies, in which the cause occurs first and its effect arrives later, are insufficiently identified. The core contradiction arises from this: strictly adhering to hierarchical logical topological constraints while simultaneously constructing a dynamic dependency graph consistent with operational patterns under data-driven conditions places higher demands on the accuracy and stability of graph structure learning.

[0007] The indiscriminate nature of data augmentation leads to semantic corruption: Existing methods widely employ common random edge deletion, random edge addition, and random feature masking to generate augmented views. In supercomputing architectures with strict hierarchies, such random perturbations easily disrupt backbone edges, causing the augmented view to deviate from the true logical dependencies. Furthermore, indiscriminate random feature discarding can easily mask highly discriminative anomalous signals. Therefore, an adaptive data augmentation mechanism is needed, which performs importance sampling based on confidence and centrality on the structure side and adaptive masking based on statistical variance on the feature side, thereby improving the quality of representation learning while maintaining the consistency of logical topological semantics.

[0008] Independent, step-by-step optimization of multiple tasks leads to models getting trapped in local optima, making it difficult to simultaneously guarantee the rationality of graph structure, robustness of representation, and consistency of prediction, thus affecting the stability and generalization ability of online detection. Existing methods typically decompose graph construction, node representation learning, and temporal prediction into independent sub-tasks for phased training. This step-by-step optimization results in a lack of coordination between the training objectives of each stage, easily leading to local optima. On the one hand, the lack of feedback from the prediction task means that the learned graph structure may only cater to the statistical characteristics of the data, deviating from the logical topological constraints and statistical priors of the supercomputing cluster. On the other hand, the lack of structure-aware constraints makes node representations overly sensitive to data noise and augmentation perturbations, leading to decreased robustness. In addition, without the support of high-quality graph structures and robust features, the prediction error of the temporal prediction module is highly susceptible to drift with cluster load fluctuations. This inconsistency of multiple objectives ultimately leads to oscillations in the residual score during online detection, easily causing false positives and false negatives, failing to meet the stability requirements of long-term online detection in supercomputing scenarios.

[0009] In view of the above, this application is hereby submitted.

Summary of the Invention

[0010] To address at least one of the aforementioned problems, this application provides an anomaly detection method for supercomputing clusters, which can achieve cluster-level anomaly detection that is more comprehensive, more accurate, more efficient, and easier to deploy and maintain.

[0011] Firstly, embodiments of this application provide an anomaly detection method for supercomputing clusters, including:

[0012] Obtain the indicator data and log data of each node in the supercomputing cluster, and perform feature extraction and fusion processing on the indicator data and log data of the same node to obtain the feature tensor at the current moment;

[0013] The feature tensor is input into the trained graph structure generation network. The graph structure generation network determines the dynamic dependencies between nodes from both spatial and temporal causal perspectives, and obtains the graph structure adjacency matrix representing the dynamic dependencies. During the training phase, the graph structure generation network incorporates the inherent node topology constraints of the supercomputing cluster.

[0014] The graph structure adjacency matrix and the feature tensors within the set historical time window are input into the trained graph coding network to obtain the spatiotemporal graph embedding sequence;

[0015] The spatiotemporal graph embedding sequence is input into a trained temporal prediction network to obtain feature prediction data for each node at the next time step.

[0016] The presence of anomalies in the supercomputing cluster is determined by the error between the feature prediction data and the feature observation data of each node at the next moment.

[0017] Secondly, embodiments of this application also provide an electronic device, the electronic device comprising:

[0018] Processor and memory;

[0019] The processor executes the steps of the anomaly detection method for supercomputing clusters, as described in any embodiment, by calling programs or instructions stored in memory.

[0020] Thirdly, embodiments of this application also provide a computer-readable storage medium storing a program or instructions that cause a computer to perform the steps of the anomaly detection method for supercomputing clusters as described in any embodiment.

[0021] In summary, this application proposes an anomaly detection method for supercomputing clusters. It acquires indicator data and log data from each node in the supercomputing cluster, performs feature extraction and fusion processing on the indicator data and log data of the same node, and obtains a feature tensor for the current time step. This feature tensor is then input into a trained graph structure generation network. The graph structure generation network determines the dynamic dependencies between nodes from both spatial and temporal causal perspectives, obtaining a graph structure adjacency matrix representing these dynamic dependencies. During the training phase, the graph structure generation network incorporates the inherent node topological constraints of the supercomputing cluster, yielding a dynamic graph learning method that integrates statistical structure priors under logical-level hard constraints. This fundamentally solves the problems of information loss caused by relying solely on static topology and of erroneous associations introduced by purely data-driven graph construction in traditional methods. Through the fusion of spatial co-occurrence and temporal causality perspectives, the model can adaptively capture the logical dependencies generated by job scheduling in the supercomputing cluster, achieving accurate characterization of complex time-varying relationships, thereby improving the comprehensiveness and accuracy of anomaly detection. The adaptive augmentation strategy combining topology and features significantly improves the model's robustness in noisy environments. The structure-aware contrastive learning mechanism effectively solves the problem of false negatives in homogeneous nodes.

Attached Figure Description

[0022] Figure 1 is a flowchart of an anomaly detection method for supercomputing clusters provided in an embodiment of this application;

[0023] Figure 2 is a schematic diagram illustrating the process of feature extraction and fusion processing of indicator data and log data provided in an embodiment of this application;

[0024] Figure 3 is a schematic diagram of a graph structure learning process provided in an embodiment of this application;

[0025] Figure 4 is a schematic diagram illustrating a data augmentation and contrastive learning process provided in an embodiment of this application;

[0026] Figure 5 is a schematic diagram of joint training and online anomaly detection provided in an embodiment of this application;

[0027] Figure 6 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0028] The present application will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and not intended to limit it. Furthermore, it should be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings.

[0029] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.

[0030] Example 1

[0031] Figure 1 is a flowchart illustrating an anomaly detection method for supercomputing clusters provided in an embodiment of this application. Referring to Figure 1, the anomaly detection method for supercomputing clusters specifically includes the following steps:

[0032] S110. Obtain the indicator data and log data of each node in the supercomputing cluster, perform feature extraction and fusion processing on the indicator data and log data of the same node, and obtain the feature tensor at the current moment.

[0033] Metric data includes, but is not limited to, characteristic channels such as CPU utilization, memory bandwidth, I/O throughput, and network latency.

[0034] The purpose of feature extraction and fusion processing for indicator data and log data of the same node is to solve the problem of unified representation of two types of heterogeneous data, indicator data and log data.

[0035] S120. Input the feature tensor into the trained graph structure generation network. The graph structure generation network determines the dynamic dependencies between nodes from both spatial and temporal causal perspectives, and obtains the graph structure adjacency matrix representing the dynamic dependencies.

[0036] In the training phase, the graph structure generation network incorporates the inherent node topology constraints of the supercomputing cluster. The goal of the graph structure generation network is to learn the "true adjacency relationships" between nodes from both spatial (instantaneous) and temporal (lagging) perspectives, considering both feature similarity and the effects of time lag and topological constraints. The training process of the graph structure generation network will be explained in detail later.

[0037] S130. Input the graph structure adjacency matrix and the feature tensors within the set historical time window into the trained graph coding network to obtain the spatiotemporal graph embedding sequence.

[0038] S140. Input the spatiotemporal graph embedding sequence into the trained temporal prediction network to obtain the feature prediction data of each node at the next time step.

[0039] S150. Determine whether there is an abnormal situation in the supercomputing cluster based on the error between the feature prediction data and the feature observation data of each node at the next moment.

[0040] Specifically, calculate the error between the predicted feature data and the observed feature data at the node dimension:

[0041] where the quantities in the formula denote, respectively, the error between the feature prediction data and the feature observation data of node i at time t, the feature prediction data of node i at time t, and the feature observation data of node i at time t.

[0042] The errors are standardized and aggregated into a system-level anomaly score. If the anomaly score exceeds a judgment threshold, an anomaly is determined to exist in the supercomputing cluster; the judgment threshold is the experimental value that maximizes the F1 score.

[0043] To eliminate the differences in the magnitude and volatility of the baseline errors across different nodes under varying load modes, the errors are standardized and aggregated into a system-level anomaly score. Specifically, online statistical methods are used to update the historical mean and standard deviation of the error for each node in real time. Based on these dynamic statistics, a standardized anomaly score is calculated for each node, and the highest score among all nodes is selected as the system anomaly score for the current timestamp. The formula is defined as follows:

[0044]

[0045] where the two statistics represent the mean and the standard deviation of the historical errors of the i-th node. This maximization strategy ensures that the model can sensitively capture drastic anomalies even when they occur on only a single node in the cluster. Finally, a grid search is used to find the optimal anomaly detection threshold. On the calibration data, the search interval for the threshold is defined as the range between the minimum and maximum values of the anomaly score; all candidate thresholds are traversed with a fixed step size, and the value that maximizes the F1 score is selected as the final decision threshold. During online operation, if the system anomaly score exceeds the threshold, the system is deemed to be abnormal and an alarm is triggered.
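The standardization and maximization described in [0043] to [0045] can be sketched as follows. This is a minimal illustration under stated assumptions, not the formula of the application: the per-node score is assumed to be the z-score (error minus historical mean, divided by historical standard deviation), Welford's algorithm is assumed as the online update, and the names `RunningStats`, `system_score`, and `eps` are hypothetical.

```python
import math


class RunningStats:
    """Welford's online mean and standard deviation for one node's error stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Population standard deviation; 0.0 until at least two samples are seen.
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0


def system_score(errors, stats, eps=1e-8):
    """Score each node against its own history, fold in the new errors, return the max.

    Assumption: the score is (error - mean) / (std + eps); eps only guards
    against division by zero and is not taken from the source.
    """
    scores = []
    for e, s in zip(errors, stats):
        # Nodes without enough history contribute a neutral score of 0.0.
        scores.append((e - s.mean) / (s.std() + eps) if s.n > 1 else 0.0)
        s.update(e)
    return max(scores)
```

Taking the maximum rather than the mean means that one strongly deviating node is enough to raise the system score, which matches the stated intent of catching single-node anomalies.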

[0046] The F1 score is a comprehensive metric for measuring model performance, particularly suitable for imbalanced classification scenarios (such as anomaly detection where normal samples far outnumber anomaly samples). Its calculation combines precision and recall, and the formula is: F1 = 2 × (Precision × Recall) / (Precision + Recall). Precision = number of samples predicted as anomaly and which are indeed anomalies / total number of samples predicted as anomalies, measuring the model's accuracy in predicting anomalies. Recall = number of samples predicted as anomalies and which are indeed anomalies / total number of actual anomalies, measuring the model's ability to identify all true anomalies. The F1 score ranges from 0 to 1; a higher F1 score indicates a better balance between precision and recall. In anomaly detection, simply pursuing high precision may lead to many missed anomalies (low recall); conversely, simply pursuing high recall may result in many false positives (low precision). The F1 score, considering both factors, allows for better selection of thresholds that yield better results in practical applications.
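The threshold search can be illustrated with a short sketch that uses the F1 definition given above. The function names, the label convention (1 for anomaly, 0 for normal), and the default of 100 steps are assumptions made for the example; the source only states that a fixed step size is used between the minimum and maximum score.

```python
def f1_score(scores, labels, threshold):
    """F1 of the rule 'score > threshold means anomaly' against 0/1 labels."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def search_threshold(scores, labels, steps=100):
    """Scan thresholds from min(scores) to max(scores) in equal steps."""
    lo, hi = min(scores), max(scores)
    best_thr, best_f1 = lo, -1.0
    for k in range(steps + 1):
        thr = lo + (hi - lo) * k / steps
        f1 = f1_score(scores, labels, thr)
        if f1 > best_f1:  # keep the first threshold that reaches the best F1
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1
```

On calibration data where the anomalous scores are well separated from the normal ones, any threshold in the gap gives F1 = 1, and this sketch returns the smallest such grid value.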

[0047] The anomaly detection method for supercomputing clusters provided in this application acquires indicator data and log data of each node in the supercomputing cluster. Feature extraction and fusion processing are performed on the indicator data and log data of the same node to obtain a feature tensor at the current time. This feature tensor is input into a trained graph structure generation network. The graph structure generation network determines the dynamic dependencies between nodes from both spatial and temporal causal perspectives, obtaining a graph structure adjacency matrix representing these dynamic dependencies. During the training phase, the graph structure generation network incorporates the inherent node topological constraints of the supercomputing cluster, proposing a dynamic graph learning method that integrates statistical structure priors under logical-level hard constraints. This fundamentally solves the problems of information loss caused by relying solely on static topology or erroneous associations introduced by purely data-driven graph construction in traditional methods. Through the fusion of spatial co-occurrence and temporal causality perspectives, the model can adaptively capture the logical dependencies generated by job scheduling in the supercomputing cluster, achieving accurate characterization of complex time-varying relationships, thereby improving the comprehensiveness and accuracy of anomaly detection. The adaptive enhancement strategy combining topology and features significantly improves the model's robustness in noisy environments. The structure-aware contrastive learning mechanism effectively solves the problem of false negatives in homogeneous nodes. The graph structure adjacency matrix and the feature tensors within a set historical time window are input into a trained graph coding network to obtain a spatiotemporal graph embedding sequence. 
This sequence is then input into a trained temporal prediction network to obtain feature prediction data for each node at the next time step. Based on the error between the feature prediction data and the feature observation data of each node at the next time step, the system determines whether the supercomputing cluster exhibits any anomalies. This enables more comprehensive, accurate, efficient, and easily deployable and maintainable cluster-level anomaly detection.

[0048] Example 2

[0049] Based on the above embodiments, this embodiment provides optional specific implementation methods for step S110.

[0050] For example, feature extraction and fusion processing are performed on indicator data and log data of the same node to obtain the feature tensor at the current moment, including:

[0051] In the indicator channel, a Temporal Convolutional Network (TCN) is used to encode the indicator data to obtain feature codes. The feature codes are then input into a linear layer, which maps the feature codes to the target space to obtain the indicator codes. The Temporal Convolutional Network includes causal convolutional structures and dilated convolutional structures.

[0052] In the log channel, a text encoding network maps individual log events in the log data into log semantic vectors. Using the current time as a reference, a first time window and a second time window preceding the current time are constructed, with the second time window encompassing the first time window. A weighted average is applied to the log semantic vectors falling within the first time window to obtain a first intermediate result, where log semantic vectors closer to the current time have a larger weight. Similarly, a weighted average is applied to the log semantic vectors falling within the second time window to obtain a second intermediate result, where log semantic vectors farther from the current time have a smaller weight. The first and second intermediate results are then weighted and summed to obtain a third intermediate result. This third intermediate result is input into a linear layer, which maps it to the target space to obtain the log encoding. The indicator channel is considered the main channel, and the log channel is considered a sparse semantic modulation channel. A gating-based feature fusion mechanism is used to fuse the indicator encoding and log encoding to obtain a feature tensor.

[0053] In this specific implementation, the monitoring metrics (i.e., metric data) and logs of the supercomputing system are time-rasterized at fixed 15-second time intervals. For each node, the metric has a value at each timestamp, while the logs only generate text events around certain timestamps. For ease of explanation, the following description uses a single node as an example; the remaining nodes follow the same process and are stacked along the node dimension.
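As a small illustration of the 15-second time rasterization, the sketch below groups log events into slots. The event format (a timestamp in seconds paired with a text) and the function name `rasterize_logs` are assumptions for the example.

```python
from collections import defaultdict

STEP_SECONDS = 15  # rasterization interval stated in the text


def rasterize_logs(events, step=STEP_SECONDS):
    """Group (timestamp_seconds, text) log events by the index of their time slot."""
    slots = defaultdict(list)
    for ts, text in events:
        slots[int(ts // step)].append(text)
    return dict(slots)
```

Slots with no events are simply absent from the result, which reflects the sparsity of logs compared with metrics that have a value at every timestamp.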

[0054] On the indicator channel, for the i-th node, the raw indicator data can be represented as a matrix whose shape is given by the number of time steps and the number of metrics collected for this node. In the time dimension, a Temporal Convolutional Network (TCN) is used to encode the metric sequence of each node. The convolutional layers stack causal convolution and dilated convolution so as to cover longer historical dependencies without leaking future information. For any node and any time step, the indicator time series is mapped, after multiple layers of causal and dilated convolution, to a hidden feature sequence of fixed length, each time step of which has a fixed hidden dimension. A linear layer is then used to uniformly map the indicator features to a space with the same dimension as the log channel, yielding the indicator-side encoding for the t-th time step of the i-th node.
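The property that matters here, namely that causal and dilated convolutions widen the history covered without looking at future samples, can be shown with a single-channel sketch. It is a toy illustration, not the network of the application: the kernel weights are hypothetical, the ReLU between layers is an assumption, and a real TCN would operate on multi-channel tensors with learned weights.

```python
def causal_dilated_conv1d(x, weights, dilation=1):
    """y[t] = sum_k weights[k] * x[t - k * dilation], treating x as zero before t = 0."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # only current and past samples are read
                acc += w * x[idx]
        out.append(acc)
    return out


def tcn_stack(x, layers):
    """Apply (weights, dilation) layers in order, with a ReLU after each layer."""
    h = list(x)
    for weights, dilation in layers:
        h = [max(0.0, v) for v in causal_dilated_conv1d(h, weights, dilation)]
    return h


def receptive_field(layers):
    """Number of input steps that can influence one output step."""
    return 1 + sum((len(weights) - 1) * dilation for weights, dilation in layers)
```

With kernel size 2 and dilations 1, 2, and 4, three layers already cover 8 input steps, which at a 15-second interval corresponds to 2 minutes of history.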

[0055] On the log channel, the raw logs of each node are first time-aligned and semantically encoded. For the i-th node, the original log consists of several timestamped text events, each comprising a timestamp and a log text. Each log message is mapped to a semantic vector using the pre-trained text encoder BERT. Considering the high sparsity of log events at 15-second intervals, this embodiment does not force the construction of an independent log vector for each timestamp. Instead, taking each timestamp as a reference, time windows covering both the short-term and the long-term historical neighborhood are constructed, and the log events falling within these windows are weighted and aggregated to obtain two types of log semantic features reflecting "proximate causes" and "remote causes". The short window is typically set to 2 minutes; the logs within it are weighted and averaged according to the principle "the closer to the current time, the greater the weight", thus emphasizing sudden and recent anomalies. The long window is usually set to approximately 12 minutes; decay weights are applied to the logs within it to weaken long-term noise while ensuring that important events with longer intervals are still detected. The aggregation results of the two windows are linearly combined according to learnable weights to obtain the final log semantic vector of node i at time t. These vectors are stacked along the time axis and the node dimension to form the log semantic tensor, which is then uniformly projected through a linear layer to the same dimension φ as the metric channel, resulting in the log-side encoding tensor, whose entry for the t-th timestamp and the i-th node is the log encoding vector of that node at that time.
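The two-window aggregation can be sketched as below. Several details are assumptions, since the text does not fix them: exponential decay `exp(-age / tau)` is used as the weighting in both windows, the decay constants are set to half the window length, and the learnable combination weights are replaced by a fixed mixing coefficient `alpha`. The log semantic vectors are taken as given (for example, produced by a text encoder beforehand).

```python
import math


def window_average(events, now, window, tau, dim):
    """Weighted mean of vectors whose age (now - timestamp) lies in [0, window).

    Weights are exp(-age / tau), so events closer to `now` count more.
    Returns a zero vector when no event falls inside the window.
    """
    total = 0.0
    acc = [0.0] * dim
    for ts, vec in events:
        age = now - ts
        if 0 <= age < window:
            w = math.exp(-age / tau)
            total += w
            acc = [a + w * v for a, v in zip(acc, vec)]
    return [a / total for a in acc] if total > 0.0 else [0.0] * dim


def log_feature(events, now, dim, short=120.0, long=720.0, alpha=0.5):
    """Mix a short-window (2 min) and a long-window (12 min) aggregate.

    alpha stands in for the learnable combination weight (assumption).
    """
    near = window_average(events, now, short, tau=short / 2, dim=dim)
    far = window_average(events, now, long, tau=long / 2, dim=dim)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(near, far)]
```

An event 20 seconds old dominates both windows, while an event roughly 10 minutes old only enters through the long window with a small weight.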

[0056] In the cross-modal fusion stage, this embodiment employs a gating-based feature fusion mechanism, treating the indicator channel as the main channel and the log channel as a sparse semantic modulation channel. For each timestamp t and each node i, the indicator code and log code are first concatenated along the feature dimension and input into the gating network to obtain a channel-level gating coefficient vector:

[0057]

[0058] where the two parameters of the gating network are learnable, the activation is the element-wise Sigmoid function, and the bracket operator denotes vector concatenation. The final fused feature vector is combined using a channel-wise weighted approach as follows:

[0059]

[0060] where the operator in the formula denotes element-wise multiplication. When a node has no log events within a certain time period, its log encoding degenerates into a zero vector after window aggregation and linear mapping; during training, the gating network automatically learns to push the gating coefficient close to 1, so that the fused feature degenerates into a pure indicator representation that relies solely on the indicator encoding. When critical logs appear within the short or long window, the gating coefficient is adjusted to enhance or suppress the indicator encoding under this timestamp, and sparse log semantics are explicitly injected into the fused representation of the node. The fused features of all timestamps and all nodes are stacked to obtain the final node fused feature tensor. During the training phase, the output of this module serves as input for subsequent graph structure learning, adaptive data augmentation, graph coding, and temporal prediction. During the online anomaly detection phase, the output of this module serves as input for the trained graph structure generation network.
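A minimal sketch of a gated fusion of this kind is given below. The exact fusion formula is not reproduced in this text, so the form used here is an assumption: the gate is a Sigmoid of a linear map over the concatenated codes, and the fused vector is `gate * metric + (1 - gate) * log`, which is one common choice that reduces to the indicator encoding when the gate is close to 1. The names `W`, `b`, and `gated_fusion` are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_fusion(metric_code, log_code, W, b):
    """Fuse two equally sized code vectors with a channel-level gate.

    W has one row per output channel and len(metric_code) + len(log_code)
    columns; b has one bias per output channel.
    Assumed fusion rule: fused = gate * metric + (1 - gate) * log.
    """
    concat = list(metric_code) + list(log_code)
    gate = [
        sigmoid(sum(w * c for w, c in zip(row, concat)) + bias)
        for row, bias in zip(W, b)
    ]
    fused = [g * m + (1.0 - g) * l for g, m, l in zip(gate, metric_code, log_code)]
    return gate, fused
```

With a zero log code and a gate near 1, the output is approximately the indicator code alone, which is the degenerate case described above.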

[0061] In summary, the process of feature extraction and fusion of the aforementioned indicator data and log data is shown in Figure 2. The input is multi-source data from the supercomputing cluster. Indicator data is preprocessed, encoded by a temporal convolutional network, and linearly mapped before entering the gated fusion network. Log data is preprocessed, semantically encoded by BERT, aggregated over multi-scale windows, and linearly mapped before entering the same gated fusion network. The final output is the node fusion feature tensor X, which is fed to downstream modules.

[0062] Example 3

[0063] Based on the above embodiments, this embodiment provides optional specific implementation methods for the training process of graph structure generation networks, graph coding networks, and temporal prediction networks.

[0064] The training-related processing operations of the graph structure generation network are integrated into the graph structure learning module. This module uses historical features within the sliding window to infer the dynamic dependencies between nodes under the inherent topological constraints of the supercomputing cluster and generates an adjacency matrix.

[0065] Let the current timestamp and the length of the historical window be given. The corresponding sequence of fused feature segments (i.e., the feature tensors obtained by fusing indicator data and log data) consists of one node feature matrix per timestamp in the window, whose dimensions are the number of nodes and the feature dimension. Given the hierarchical interconnect architecture of supercomputing clusters, in which computation and storage are separated (e.g., compute nodes must access MDS and OSS via I/O nodes), this embodiment pre-constructs a binary topology constraint matrix to exclude invalid connections. An element of this matrix is set to 1 only for node pairs that satisfy the hierarchical interconnection rules, and the matrix serves as a hard constraint throughout the process. That is, in the binary topology constraint matrix, the element corresponding to two nodes that are allowed to interact with data is 1, while the element corresponding to two nodes that are not allowed to interact with data is 0.
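The binary topology constraint matrix can be illustrated as follows. The role names and the set of allowed role pairs are hypothetical and chosen only to mirror the example in the text (compute nodes reach MDS and OSS through I/O nodes); the actual interconnection rules of a given cluster would replace them.

```python
# Hypothetical rule set: which pairs of node roles may exchange data directly.
ALLOWED_ROLE_PAIRS = {("compute", "io"), ("io", "mds"), ("io", "oss")}


def topology_mask(roles, allowed=ALLOWED_ROLE_PAIRS):
    """Return an n x n 0/1 matrix with 1 where two nodes are allowed to interact."""
    n = len(roles)
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            pair_ok = (roles[i], roles[j]) in allowed or (roles[j], roles[i]) in allowed
            if i != j and pair_ok:
                mask[i][j] = 1
    return mask
```

For roles `["compute", "io", "mds", "oss"]`, the compute node is linked only to the I/O node, so any learned edge from the compute node directly to MDS or OSS would be zeroed by the mask.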

[0066] Under this constraint, a prior graph is first constructed as a pseudo-label signal to guide learning. Specifically, the cosine similarity between nodes is calculated from the features at the current time step, and the k1 nearest neighbors in each row are retained to construct a basic binary K-nearest-neighbor (KNN) graph. To further suppress noise using second-order neighborhood information, this embodiment calculates the Jaccard similarity between nodes on the KNN graph, using the intersection-over-union ratio of the neighbor sets of two nodes as a measure of structural overlap. Each row then retains the k2 entries with the largest Jaccard values, the remaining positions are set to zero, and the binary topological constraint matrix is superimposed. This yields a sparse and robust statistical prior graph.

[0067] In general, based on the feature tensor at the reference time, the cosine similarity between nodes is calculated. For each current node, only the top k1 neighboring nodes with the highest similarity are retained, and the element representing the relationship between the current node and its neighboring nodes is set to 1, thus obtaining the representation matrix of the binarized K-nearest neighbor graph. Based on the representation matrix of the binarized K-nearest neighbor graph, the Jaccard similarity C between two nodes is determined.

[0068] $$C = \frac{|A \cap B|}{|A \cup B|}$$

[0069] Where A represents the set of adjacent nodes of one of the two nodes, B represents the set of adjacent nodes of the other of the two nodes, and $|\cdot|$ denotes the number of elements in a set.

[0070] For each current node, only the top k2 nodes with the highest Jaccard similarity are retained. The element values corresponding to these retained nodes are set to 1, and the element values corresponding to the remaining nodes are set to 0 to obtain the intermediate representation matrix. The intermediate representation matrix is then multiplied element-wise with the binary topological constraint matrix to obtain the representation matrix of the prior graph.
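The prior-graph construction in paragraphs [0066] to [0070] can be sketched in NumPy as follows. This is a minimal illustration, not the applicant's implementation: the function name `build_prior_graph` and its argument names are hypothetical, and excluding a node from its own neighbor list is an assumption that the text does not state.

```python
import numpy as np

def build_prior_graph(x, topo_mask, k1, k2):
    """x: (N, D) node features; topo_mask: (N, N) binary constraint matrix."""
    n = x.shape[0]
    # Cosine similarity between all node pairs.
    unit = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)
    cos = unit @ unit.T
    np.fill_diagonal(cos, -np.inf)  # assumption: a node is not its own neighbor
    # Binarized KNN graph: keep the k1 most similar nodes in every row.
    knn = np.zeros((n, n), dtype=int)
    top1 = np.argsort(-cos, axis=1)[:, :k1]
    np.put_along_axis(knn, top1, 1, axis=1)
    # Jaccard similarity of the neighbor sets: |A ∩ B| / |A ∪ B|.
    inter = knn @ knn.T
    deg = knn.sum(axis=1)
    union = deg[:, None] + deg[None, :] - inter
    jac = np.where(union > 0, inter / np.maximum(union, 1), 0.0)
    np.fill_diagonal(jac, -np.inf)
    # Keep the k2 largest Jaccard entries per row (intermediate matrix),
    # then apply the topology constraint element-wise.
    intermediate = np.zeros((n, n), dtype=int)
    top2 = np.argsort(-jac, axis=1)[:, :k2]
    np.put_along_axis(intermediate, top2, 1, axis=1)
    return intermediate * topo_mask
```

Because the constraint matrix is applied last, a row of the result may contain fewer than k2 ones when some of its top Jaccard neighbors are not allowed by the topology.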

[0071] Building upon the prior graph, this module learns data-driven adjacency relationships from two complementary perspectives. The first perspective is spatial self-attention, which uses only the features $X_t$ at the current time step to infer the immediate influence between nodes. The query matrix and key matrix are obtained through linear transformations: $Q_t = X_t W_Q$, $K_t = X_t W_K$, where $W_Q$ and $W_K$ are learnable weights. The spatial attention map is defined as:

[0072] $$A_t^{sp} = \sigma\!\left(\frac{Q_t K_t^{T}}{\sqrt{D}}\right) \odot M$$

[0073] The diagonal is forced to zero to avoid self-loops. This perspective gives the interaction strength between nodes within the allowed topology at the current time. Here $D$ represents the number of feature dimensions of the feature tensor, $X_t$ represents the feature tensor at reference time $t$, $M$ represents the binary topological constraint matrix, $T$ represents matrix transposition, $\odot$ represents element-wise multiplication, and $\sigma$ represents the Sigmoid activation function.
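A minimal NumPy sketch of the spatial self-attention perspective is given below. It assumes the common scaled dot-product form (division by the square root of the feature dimension) followed by the Sigmoid and the topology mask. The exact formula is not recoverable from the text, so the scaling and all names here (`spatial_attention`, `w_q`, `w_k`, `topo_mask`) are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(x_t, w_q, w_k, topo_mask):
    """x_t: (N, D) features at the current step; w_q, w_k: (D, D') weights."""
    d = x_t.shape[1]
    q = x_t @ w_q                               # query matrix
    k = x_t @ w_k                               # key matrix
    scores = sigmoid(q @ k.T / np.sqrt(d))      # assumed scaled dot-product form
    a = scores * topo_mask                      # keep only topology-allowed pairs
    np.fill_diagonal(a, 0.0)                    # no self-loops
    return a
```

In training, `w_q` and `w_k` would be learned by backpropagation; here they are plain arrays so that the masking and self-loop removal can be inspected directly.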

[0074] The second perspective is temporal causality, designed to capture the time lag in the propagation of anomalies. This perspective calculates the cross-attention between a historical lagged frame and the current frame, and introduces a learnable temporal bias term to correct the baseline correlation at different lag steps. By traversing a predefined set of lag steps, the temporal causal graph is calculated as:

[0075]

[0076] In this formula, the left-hand side represents the temporal causal graph at reference time $t$; $l$ indicates the time lag step, which ranges over the predefined set of time lag steps; $X_{t-l}$ is the feature tensor at time $t-l$; the two weight matrices and the temporal bias term are parameters to be learned; and max() represents the function that takes the maximum value.

[0077] Since the two perspectives focus on different types of relationships, this module uses the prior graph to adaptively weight and fuse them. Let the set of allowed edge positions be denoted as $\Omega$. Within $\Omega$, the cosine similarity between the spatial attention map $A_t^{sp}$ and the prior graph $A_t^{prior}$, and between the temporal causal graph $A_t^{tc}$ and the prior graph $A_t^{prior}$, is calculated over the masked region, yielding two scalars $s^{sp}$ and $s^{tc}$. The two scalars are stacked, scaled by the temperature coefficient $\tau$, and normalized by Softmax to obtain the fusion weights:

[0078] $$[\alpha^{sp}, \alpha^{tc}] = \mathrm{Softmax}\!\left(\frac{[s^{sp}, s^{tc}]}{\tau}\right)$$

[0079] The final learned adjacency matrix is defined as:

[0080] $$A_t = \mathrm{Norm}\!\left(\alpha^{sp} A_t^{sp} + \alpha^{tc} A_t^{tc}\right)$$

[0081] Here $A_t$ represents the adjacency matrix at reference time $t$; $\mathrm{Norm}(\cdot)$ represents the normalization function; $\tau$, which is greater than zero, indicates the temperature coefficient and is a parameter to be learned; $s^{sp}$ represents the cosine similarity between the spatial attention map $A_t^{sp}$ and the prior graph $A_t^{prior}$ on the set $\Omega$; $s^{tc}$ represents the cosine similarity between the temporal causal graph $A_t^{tc}$ and the prior graph $A_t^{prior}$ on the set $\Omega$; and $\Omega$ is the set of allowed edges, i.e., the positions at which the binary topological constraint matrix equals 1.
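The prior-guided fusion can be illustrated with the sketch below: each view is compared with the prior graph by cosine similarity restricted to the allowed positions, and the two similarities are turned into weights by a temperature-scaled Softmax. The names `masked_cosine` and `fuse_views` are hypothetical. The temperature is a fixed argument here rather than a learned parameter, and the final normalization step is deliberately omitted because the text does not specify it.

```python
import numpy as np

def masked_cosine(a, b, allowed):
    """Cosine similarity of two matrices restricted to the allowed positions."""
    u, v = a[allowed].astype(float), b[allowed].astype(float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def fuse_views(a_sp, a_tc, prior, topo_mask, tau=0.5):
    allowed = topo_mask.astype(bool)
    s = np.array([masked_cosine(a_sp, prior, allowed),
                  masked_cosine(a_tc, prior, allowed)])
    z = s / tau
    w = np.exp(z - z.max())          # numerically stable Softmax
    w = w / w.sum()
    fused = w[0] * a_sp + w[1] * a_tc  # weighted sum, before any normalization
    return w, fused
```

The view that agrees more closely with the prior graph on the allowed edges receives the larger weight, and a smaller `tau` makes that preference sharper.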

[0082] Each element of the adjacency matrix represents the predicted edge strength from node i to node j at timestamp t. In an optional implementation, Top-K truncation is also performed on each row of the adjacency matrix to obtain the sparse adjacency structure used in subsequent GraphSAGE message passing.
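The optional row-wise Top-K truncation can be sketched as follows. The function name `top_k_rows` is hypothetical, and ties are broken by whatever order `argsort` returns.

```python
import numpy as np

def top_k_rows(a, k):
    """Keep the k largest entries in every row of a and zero out the rest."""
    idx = np.argsort(-a, axis=1)[:, :k]
    out = np.zeros_like(a)
    np.put_along_axis(out, idx, np.take_along_axis(a, idx, axis=1), axis=1)
    return out
```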

[0083] To ensure that the learned graph structure conforms to the logical topology of the supercomputing system and statistically approximates the prior graph constructed by KNN and Jaccard, this module defines a structural prior loss on the set $\Omega$ of allowed edges. Positions where the prior graph equals 1 are treated as positive edges, which gives the binary labels $y_{ij}$. The structural prior loss is expressed in binary cross-entropy form:

[0084] $$\mathcal{L}_{struct} = -\frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} \left[ y_{ij} \log a_{ij}^{t} + (1 - y_{ij}) \log\left(1 - a_{ij}^{t}\right) \right]$$

[0085] Here $\mathcal{L}_{struct}$ indicates the structural prior loss; $|\Omega|$ represents the total number of elements in the set $\Omega$; $a_{ij}^{t}$ is the element of the adjacency matrix representing the predicted edge strength from node $i$ to node $j$ at timestamp $t$; and $y_{ij}$ represents a binary label: if the element of the prior graph representing the relationship between node i and node j is 1, then $y_{ij} = 1$; if that element is 0, then $y_{ij} = 0$.
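As a sketch, the binary cross-entropy structural prior loss restricted to the allowed edges might be computed as below. The name `structural_prior_loss` and the clipping constant `eps` are assumptions added for numerical safety.

```python
import numpy as np

def structural_prior_loss(a, prior, topo_mask, eps=1e-7):
    """Mean binary cross-entropy between predicted edge strengths and the
    prior-graph labels, taken only over topology-allowed positions."""
    allowed = topo_mask.astype(bool)
    p = np.clip(a[allowed], eps, 1.0 - eps)   # predicted edge strengths in (0, 1)
    y = prior[allowed].astype(float)          # binary labels from the prior graph
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
```

A prediction that matches the prior on every allowed edge drives the loss toward zero, while positions forbidden by the topology do not contribute at all.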

[0086] By jointly minimizing this loss along with the contrastive loss of the adaptive enhancement module and the predictive loss of the temporal prediction module during training, this embodiment adaptively learns a dynamic graph structure consistent with the statistical laws of supercomputing operation while strictly adhering to the inherent relationships, providing a reliable graph prior for subsequent graph coding and anomaly detection.

[0087] The graph structure learning process described above is shown in Figure 3. The inputs are the output of module one in Figure 2 (the node fusion feature tensor X) and the historical window features. They are processed through perspective one (spatial self-attention) and perspective two (temporal causal attention), respectively, and the two results are then fused, through prior consistency, with the structural prior graph (which is obtained from the logical topological constraints and the statistical structural soft prior). The output is the dynamic adjacency matrix A, and the structural prior loss is calculated for joint training.

[0088] During the training of the graph coding network, adaptive augmentation is added. Specifically, data augmentation is performed based on the graph structure adjacency matrix and the feature tensor to generate augmented views of the graph structure and of the features. The goal is to adaptively perturb the structure between nodes (adding or deleting edges) and the node features (channel masking plus noise) to generate two "slightly different but semantically consistent" augmented views. Through subsequent contrastive learning, the graph coding network then learns robust node embedding representations (that is, the model is insensitive to perturbations of structure and features yet can still distinguish different nodes), laying the foundation for subsequent anomaly detection.

[0089] Specifically, data augmentation includes structure-side adaptive augmentation (which involves intelligently adding or deleting edges in the graph, adding homogeneous edges, simulating "structural perturbations" in the graph, and enhancing robustness) and feature-side adaptive augmentation (which guides the graph encoding network to focus on "high-discrimination" feature channels, thereby enhancing the robustness of features).

[0090] In structure-side adaptive enhancement, the graph structure adjacency matrix is first used to calculate the confidence and centrality of each candidate edge, which are then used to adjust the strength of the edge deletion perturbation.

[0091] In general, it includes the following sub-steps:

[0092] 131. Determine confidence and centrality based on the adjacency matrix of a graph structure.

[0093] For any candidate edge between node i and node j, the confidence is defined as:

[0094]

[0095] This formula involves the Sigmoid function, a scaling factor, and the element of the graph structure adjacency matrix output by the graph structure learning module at time t that characterizes the strength of the dependency from node i to node j. This element typically takes continuous values; the larger the value, the stronger the dependency. The closer the confidence is to 1, the higher the model's confidence in the existence of that edge.

[0096] The centrality of an edge is approximated by the weighted degrees of its end nodes. Given the degree of node i (i.e., the number of edges connected to node i) and the maximum degree over the entire graph, the edge centrality is:

[0097]

[0098] This indicator characterizes the relative importance of an edge within the overall graph structure: edges located in the core region have a larger approximate centrality, while weakly connected edges have a smaller approximate centrality.

[0099] 132. Determine the probability of deleting and adding edges based on confidence and centrality.

[0100] An adaptive edge deletion probability matrix is constructed from the two indicators, confidence and centrality. Its element for node i and node j, i.e., the probability of deleting the edge between node i and node j, is:

[0101]

[0102] This formula involves the base edge deletion probability, which is a set value used to control the overall edge deletion intensity (for example, 0.1), and the binary topological constraint matrix. Under this mechanism, only edges allowed by the topological constraints are candidates for deletion or retention, and edges with low confidence or low centrality are more likely to be deleted. This ensures that weakly connected, peripheral edges with low confidence and low centrality are preferentially removed, while strong backbone edges are retained.

[0103] Regarding the edge-adding strategy, sampling is performed only on node pairs that are allowed by the topological constraints and are currently not connected. The edge-adding probability is proportional to the cosine similarity of the nodes' neighborhoods. Furthermore, to prevent excessive perturbation from damaging the graph semantics, a maximum perturbation rate threshold is introduced (e.g., 5%); when the edge change rate caused by sampling exceeds this threshold, the edge-adding probability is automatically scaled and truncated. Finally, after the graph structure is adjusted according to the edge deletion and addition probabilities, two binary adjacency matrices are generated. Each binary adjacency matrix is multiplied element-wise by the graph structure adjacency matrix to yield the structural inputs of the two views (i.e., the first enhanced view of the graph structure and the second enhanced view of the graph structure).

[0104] 133. Based on the edge deletion probability, perform edge deletion on the already connected edges in the graph structure to obtain the first binary matrix; add edges to unconnected positions in the graph structure based on the edge addition probability to obtain the second binary matrix; perform element-wise multiplication between the first binary matrix and the adjacency matrix of the graph structure to obtain the first enhanced view of the graph structure; and perform element-wise multiplication between the second binary matrix and the adjacency matrix of the graph structure to obtain the second enhanced view of the graph structure.

[0105] The edge deletion process is as follows: a pseudo-random number generator is used to generate a random number for the edge between node i and node j (assuming an edge already exists between them). If the random number is less than the deletion probability of the edge between node i and node j, the edge is deleted, and the element representing the edge strength between node i and node j in the corresponding matrix is set to 0. Otherwise, the edge between node i and node j is retained, and the element representing the edge strength between node i and node j in the corresponding matrix is set to 1. This yields the first binary adjacency matrix.

[0106] The edge-adding process is as follows: a pseudo-random number generator is used to generate a random number for the pair of node i and node j (assuming there is no original edge between node i and node j). If the random number is less than the probability of adding an edge between node i and node j, an edge is added between node i and node j, and the element representing the edge strength between node i and node j in the corresponding matrix is set to 1. Otherwise, no edge is added between node i and node j, and the element representing the edge strength between node i and node j in the corresponding matrix is set to 0. This yields the second binary adjacency matrix.
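The two sampling procedures can be sketched as below, taking the deletion and addition probability matrices as given inputs, since their formulas are not reproduced in the text. The function name `sample_edge_masks` is hypothetical, and treating an entry as "connected" when its adjacency value is greater than zero is an assumption. The perturbation-rate cap and the later multiplication with the adjacency matrix are left out.

```python
import numpy as np

def sample_edge_masks(adj, p_del, p_add, topo_mask, rng):
    """Return (m_del, m_add): the binary matrices produced by edge deletion
    and by edge addition, both restricted to topology-allowed positions."""
    allowed = topo_mask > 0
    connected = (adj > 0) & allowed        # assumption: nonzero entry = existing edge
    unconnected = (adj == 0) & allowed
    # Deletion: one uniform draw per existing edge; the edge is deleted when the
    # draw is below its deletion probability, otherwise it is kept (entry 1).
    u_del = rng.random(adj.shape)
    m_del = (connected & (u_del >= p_del)).astype(int)
    # Addition: one uniform draw per unconnected pair; an edge is added
    # (entry 1) when the draw is below its addition probability.
    u_add = rng.random(adj.shape)
    m_add = (unconnected & (u_add < p_add)).astype(int)
    return m_del, m_add
```

Passing a seeded `np.random.default_rng(seed)` as `rng` makes the sampled masks reproducible, which helps when comparing augmentation settings.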

[0107] 134. Calculate the standard deviation vector of the feature tensor in the node dimension, use the normalized result of the standard deviation vector as the importance score of the feature channel, determine the masking probability based on the importance score, generate a mask vector based on the masking probability, apply it to the feature tensor, and superimpose independent Gaussian noise to obtain the first enhanced view and the second enhanced view of the feature.

[0108] In feature-side adaptive enhancement, to force the model to focus on feature channels with high discriminative power, an adaptive masking strategy based on channel variance is proposed. The standard deviation vector of the feature tensor along the node dimension is computed and, after normalization, serves as the importance score of each feature channel. A masking probability is then defined for each channel from its importance score and a set base probability value (for example, 0.5), such that common channels with smaller variance and less information content are more likely to be masked. A mask vector is generated based on the masking probability and applied to the feature tensor, and a small amount of independent Gaussian noise is then added to obtain two enhanced feature views (i.e., the first enhanced view of the feature and the second enhanced view of the feature).

[0109] Specifically, to generate the first feature view, the program performs Bernoulli sampling on each feature channel of the feature tensor based on the masking probability: for each feature channel, the program calls a random number generator to independently generate a random number uniformly distributed in the interval from 0 to 1. If this random number is less than the masking probability, the sampling result is "masked", and the corresponding position of that channel in the mask vector is 0; otherwise, it is 1. Finally, the mask vector is multiplied element-wise with the feature tensor to obtain an intermediate result, on which random Gaussian noise can further be superimposed to obtain the first feature view. To generate the second feature view, the above operation is performed independently once more. Since the two views result from two independent random samplings, their sets of masked channels may partially overlap but will generally not be identical. This ensures that the two feature views are enhanced views of the same origin that are similar but differ randomly.
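A sketch of the variance-guided channel masking follows. The exact masking-probability formula is not recoverable from the text, so the form used here (base probability times one minus the min-max normalized standard deviation) is an assumption chosen only to reproduce the stated behavior that low-variance channels are masked more often. The names `masked_feature_view`, `p_base` and `noise_std` are likewise hypothetical.

```python
import numpy as np

def masked_feature_view(x, p_base, noise_std, rng):
    """x: (N, D) feature tensor. Returns (view, mask) for one enhanced view."""
    std = x.std(axis=0)                       # per-channel std over the node dimension
    span = std.max() - std.min()
    # Importance score in [0, 1]; min-max normalization is an assumption.
    score = (std - std.min()) / span if span > 0 else np.ones_like(std)
    p_mask = p_base * (1.0 - score)           # assumed form of the masking probability
    # Bernoulli sampling per channel: a uniform draw below p_mask means "masked" (0).
    mask = (rng.random(x.shape[1]) >= p_mask).astype(x.dtype)
    noise = rng.normal(0.0, noise_std, size=x.shape)
    return x * mask + noise, mask
```

Calling the function twice with the same generator produces the two independently sampled views described above.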

[0110] Finally, the two generated enhanced views are input into the GraphSAGE network to obtain the node embedding representations $z_i^{(1)}$ and $z_i^{(2)}$. To maximize the consistency of the representation of the same node across different views while maintaining the distinguishability of different node representations, this module constructs a node-level InfoNCE contrastive loss function. To eliminate the "false negative" problem introduced by the graph structure itself (i.e., directly adjacent nodes on the graph should be semantically similar and should not be forced apart), this embodiment designs a structure-aware negative sampling mechanism based on the graph structure adjacency matrix. For each node $i$, the negative sample set is defined as:

[0111] $$\mathcal{N}_i = \{\, k \mid k \neq i,\ a_{ik}^{t} = 0 \,\}$$

[0112] This means explicitly excluding one-hop neighbors and treating only non-neighbor nodes as negative samples. Here $a_{ik}^{t}$ is the element of the graph structure adjacency matrix representing the predicted edge strength from node $i$ to node $k$ at timestamp $t$.

[0113] For any node $i$, its one-way contrastive loss is defined as:

[0114] $$\ell_i = -\log \frac{\exp\left(\mathrm{sim}(z_i^{(1)}, z_i^{(2)})/\tau\right)}{\exp\left(\mathrm{sim}(z_i^{(1)}, z_i^{(2)})/\tau\right) + \sum_{k \in \mathcal{N}_i} \exp\left(\mathrm{sim}(z_i^{(1)}, z_k^{(2)})/\tau\right)}$$

[0115] Here $\ell_i$ represents the one-way contrastive loss of node $i$; sim() represents the similarity calculation function; exp() represents the natural exponential function; $\tau$ represents the temperature coefficient; $\mathcal{N}_i$ represents the negative sample set of node $i$; $z_i^{(1)}$ is the embedding representation of node i obtained from the first enhanced view of the graph structure and the first enhanced view of the features; $z_i^{(2)}$ is the embedding representation of node i obtained from the second enhanced view of the graph structure and the second enhanced view of the features; and $z_k^{(2)}$ is the corresponding embedding representation of node k obtained from the second enhanced views. The contrastive loss for the entire graph is the mean of the losses of all nodes. By minimizing this loss, the model can learn, under unsupervised conditions, robust node representations that are invariant to both structural perturbations and feature-side noise, providing a more stable and generalizable input representation for subsequent graph encoding and online anomaly detection.
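The structure-aware contrastive loss can be sketched as follows, assuming cosine similarity for sim() and the common InfoNCE form in which the positive pair also appears in the denominator. Both are assumptions, as are the names `structure_aware_infonce`, `z1`, `z2` and `adj`. Negatives for node i are the nodes k other than i that have no edge from i in the adjacency matrix.

```python
import numpy as np

def structure_aware_infonce(z1, z2, adj, tau=0.5):
    """z1, z2: (N, H) node embeddings of the two views; adj: (N, N) adjacency."""
    u1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    u2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = np.exp(u1 @ u2.T / tau)           # sim[i, k] = exp(cos(z1_i, z2_k) / tau)
    neg = adj == 0                          # non-neighbors only
    np.fill_diagonal(neg, False)            # a node is never its own negative
    pos = np.diag(sim)                      # same node across the two views
    neg_sum = (sim * neg).sum(axis=1)
    return float(np.mean(-np.log(pos / (pos + neg_sum))))
```

Because one-hop neighbors are dropped from the negative set, densely connected groups of similar nodes are not pushed apart by the loss.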

[0116] In summary, the data augmentation and contrastive learning processes described above are shown in Figure 4. The inputs are the node fusion feature tensor X output by module one in Figure 2 and the adjacency matrix A output by module two in Figure 3, on which adaptive enhancement is performed. Structural enhancement calculates edge confidence and centrality, preserves the backbone, removes weak edges, and adds a limited number of edges to obtain an enhanced structural view. Feature enhancement calculates the channel variance and applies low-variance masking plus Gaussian noise to obtain an enhanced feature view. The views are then encoded by an encoder, structure-aware negative sampling is applied, and the contrastive loss is calculated. The contrastive loss is passed to the joint training and online anomaly detection module.

[0117] The adaptive enhancement strategy combining topology and features significantly improves the model's robustness in noisy environments. Addressing the high-noise characteristics of supercomputing data, this implementation abandons the traditional random perturbation model and proposes a structure sampling approach that preserves backbone connections while eliminating weakly correlated edges, along with a variability-based feature masking mechanism. This method adaptively adjusts the enhancement intensity using edge confidence and channel variance, strictly maintaining the topological semantic consistency of the enhanced view while introducing contrastive perturbations, effectively preventing critical failure modes from being masked by random noise.

[0118] Furthermore, this embodiment also includes a joint training and anomaly detection module. This module aims to achieve end-to-end training of the model through multi-task joint optimization, and uses the learned dynamic graph structure to perform deep spatiotemporal modeling of historical feature sequences and predict the node state at the next moment, thereby enabling online anomaly detection.

[0119] In order to capture the evolution of the system state over time, this module first treats the window of historical fused features of the set length as a sequence of feature snapshots arranged chronologically. In the spatial encoding stage, encoding is performed time step by time step. For each time step within the historical window, the node feature matrix at that time step is extracted and input into the GraphSAGE encoder, which shares parameters with the adaptive data augmentation module. Using the topology defined by the learned graph at that time step, GraphSAGE performs weighted neighbor aggregation for each node and outputs node embeddings containing the spatial topology information at that time step. By repeating this operation for all time steps within the window, the original feature sequence is transformed into a set of graph embedding vectors, which are then stacked in chronological order to form a spatiotemporal graph embedding sequence.

[0120] In the temporal evolution prediction stage, this module introduces a gated recurrent unit (GRU) as a temporal predictor to process the generated graph embedding sequence. The GRU network reads the embedding vectors from the sequence step by step and uses its internal reset and update gate mechanisms to effectively retain long- and short-term historical dependencies. The hidden state of the GRU after it has processed the last time step is extracted and mapped back to the original feature space through a fully connected regression layer, thus obtaining the predicted node feature values for the next time step.

[0121] To drive the model to learn normal spatiotemporal evolution patterns, a time series prediction loss is defined. The loss is the mean absolute error (MAE) between the predicted feature matrix $\hat{X}_{t+1}$ and the true observation matrix $X_{t+1}$, defined by the formula:

[0122] $$\mathcal{L}_{pred} = \frac{1}{N D} \sum_{i=1}^{N} \sum_{d=1}^{D} \left| \hat{x}_{i,d} - x_{i,d} \right|$$

[0123] Here $\mathcal{L}_{pred}$ represents the time series prediction loss; $N$ is the total number of nodes; $D$ is the number of feature dimensions; and $\hat{x}_{i,d}$ and $x_{i,d}$ represent the predicted and true values of the $i$-th node on the $d$-th feature dimension. This loss function enables the model to learn the spatiotemporal dependencies between nodes within the system and their evolutionary patterns.

[0124] In some implementations, a multi-task joint loss function is constructed to balance the logical interpretability of the graph structure, the robustness of the feature representation, and the accuracy of temporal prediction. During offline training, the graph generation network, graph encoding network, and temporal prediction network are collaboratively optimized through joint backpropagation, with the total loss function defined as:

[0125] $$\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{struct} + \lambda_2 \mathcal{L}_{con} + \lambda_3 \mathcal{L}_{pred}$$

[0126] Here $\lambda_1$, $\lambda_2$, and $\lambda_3$ are hyperparameters that balance the task weights, and $\mathcal{L}_{struct}$, $\mathcal{L}_{con}$, and $\mathcal{L}_{pred}$ are the structural prior loss, the contrastive loss, and the time series prediction loss, respectively. Under this joint optimization strategy, the graph structure learning module learns dynamic adjacency relationships that satisfy the supercomputing logic topology constraints and are consistent with operational statistics. The contrastive learning module obtains node embedding representations that are robust to structural perturbations and feature noise. The time series prediction module further captures long-term evolution patterns on this basis, thereby providing high-quality prior prediction information for the online anomaly detection module.
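As a small worked illustration, the MAE prediction loss and the weighted multi-task objective can be written as below. The function names and the default weights are placeholders, and the structural prior loss and contrastive loss are taken as already-computed scalars.

```python
import numpy as np

def mae_loss(x_pred, x_true):
    """Mean absolute error over all nodes and feature dimensions."""
    return float(np.mean(np.abs(x_pred - x_true)))

def joint_loss(l_struct, l_con, l_pred, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of structural prior, contrastive, and prediction losses."""
    w_struct, w_con, w_pred = weights
    return w_struct * l_struct + w_con * l_con + w_pred * l_pred
```

For example, with a predicted matrix `[[1, 2], [3, 4]]` and an observed matrix of all ones, the MAE is (0 + 1 + 2 + 3) / 4 = 1.5.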

[0127] Using a trained model, unsupervised anomaly detection is performed on real-time incoming monitoring data. The core objective of anomaly detection is to identify abnormal situations that deviate from normal behavioral patterns based on the deviation between actual observations and model predictions.

[0128] Figure 5 illustrates the joint training and online anomaly detection. The joint optimization objective is built from the structural prior loss calculated in module two shown in Figure 3, the contrastive loss calculated in module three shown in Figure 4, and the prediction loss, and the parameters are updated by backpropagation. During online anomaly detection, module one fuses the indicator data and log data to obtain a feature tensor, which is input into the trained graph structure generation network (i.e., the spatial self-attention and temporal causal attention shown in Figure 3) to obtain the graph structure adjacency matrix. The graph structure adjacency matrix and the feature tensors within the set historical time window are then input into the trained graph encoding network (i.e., the encoder shown in Figure 4) to obtain the spatiotemporal graph embedding sequence. The spatiotemporal graph embedding sequence is input into the trained temporal prediction network (i.e., the GRU network shown in Figure 5) to obtain the feature prediction data of each node at the next time step. Whether the supercomputing cluster is abnormal is then determined from the error between the feature prediction data and the feature observation data of each node at the next time step, which includes calculating the prediction residual, standardizing it, obtaining the anomaly score, and comparing the score with a threshold.
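The online scoring step (residual, standardization, anomaly score, threshold) might look like the sketch below. The text does not say how residuals are standardized or aggregated, so the choices here are assumptions: per-channel residual statistics estimated on normal data (`res_mean`, `res_std`) and the maximum standardized residual over channels as the node score.

```python
import numpy as np

def anomaly_scores(x_pred, x_obs, res_mean, res_std, eps=1e-8):
    """x_pred, x_obs: (N, D). res_mean, res_std: (D,) residual statistics
    collected on normal data (assumed to be available from training)."""
    residual = np.abs(x_obs - x_pred)              # prediction residual
    z = (residual - res_mean) / (res_std + eps)    # standardized residual
    return z.max(axis=1)                           # node score: worst channel (assumption)

def detect(scores, threshold):
    """Flag nodes whose anomaly score exceeds the threshold."""
    return scores > threshold
```

A node is reported as abnormal when any one of its feature channels deviates from the prediction far more than it typically does under normal operation.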

[0129] This embodiment proposes a priori-guided dynamic graph construction mechanism, effectively overcoming the limitations of single-view representation. It proposes a dynamic graph learning method that integrates statistical structural priors under hard constraints at the logical level, fundamentally solving the information loss problems caused by relying solely on static topology or the erroneous associations introduced by purely data-driven graph construction in traditional methods. Through the fusion of spatial co-occurrence and temporal causality perspectives, the model can adaptively capture the logical dependencies generated by job scheduling in supercomputing clusters, achieving accurate characterization of complex time-varying relationships.

[0130] An adaptive enhancement strategy combining topology and features is proposed, significantly improving the model's robustness in noisy environments. Addressing the high-noise characteristics of supercomputing data, this embodiment abandons the traditional random perturbation mode and proposes a structure sampling method that preserves backbone connections while eliminating weakly correlated edges, along with a variability-based feature masking mechanism. This method adaptively adjusts the enhancement intensity using edge confidence and channel variance, strictly maintaining the topological semantic consistency of the enhanced view while introducing contrastive perturbations, effectively preventing critical failure modes from being masked by random noise.

[0131] A structure-aware contrastive learning mechanism is proposed to effectively address the problem of false negatives caused by homogeneous nodes. A structure-aware negative sampling strategy is constructed to explicitly remove potential first-order neighbors and homogeneous nodes from the negative sample set, avoiding intra-class conflicts caused by false negatives in traditional contrastive learning. This mechanism ensures that the model can learn more compact and discriminative normal pattern representations, significantly reducing the false alarm rate caused by false negatives of homogeneous nodes.

[0132] An end-to-end framework for multi-task collaborative optimization is proposed, achieving mutual enhancement between graph generation and anomaly detection. A unified computational framework integrating graph structure learning, representation alignment, and temporal prediction is established. By jointly optimizing the structural prior loss, contrastive loss, and predictive regression loss, gradient collaboration among multiple tasks is achieved. This framework not only ensures the interpretability of the learned graph but also, through end-to-end closed-loop feedback, simultaneously improves the quality of graph structure generation and the accuracy of downstream anomaly detection.

[0133] Figure 6 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 6, the electronic device 500 includes one or more processors 501 and memory 502.

[0134] The processor 501 may be a central processing unit (CPU) or other form of processing unit with data processing and / or instruction execution capabilities, and may control other components in the electronic device 500 to perform desired functions.

[0135] The memory 502 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and / or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and / or cache memory. Non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 501 may execute the program instructions to implement the anomaly detection method for supercomputing clusters and / or other desired functions described above in any embodiment of this application. Various contents such as initial extrinsic parameters and thresholds may also be stored in the computer-readable storage medium.

[0136] In one example, the electronic device 500 may further include an input device 503 and an output device 504, these components being interconnected via a bus system and / or other forms of connection mechanisms (not shown). The input device 503 may include, for example, a keyboard, a mouse, etc. The output device 504 may output various information to the outside, including warning messages, braking force, etc. The output device 504 may include, for example, a display, a speaker, a printer, and a communication network and its connected remote output devices, etc.

[0137] Of course, for the sake of simplicity, Figure 6 shows only some of the components of the electronic device 500 that are relevant to this application; components such as buses and input / output interfaces are omitted. In addition, the electronic device 500 may include any other suitable components depending on the specific application.

[0138] In addition to the methods and devices described above, embodiments of this application may also be computer program products, which include computer program instructions that, when executed by a processor, cause the processor to perform the steps of the anomaly detection method for supercomputing clusters provided in any embodiment of this application.

[0139] This document uses specific examples to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the methods and core ideas of this application. For those skilled in the art, several improvements, modifications, or variations can be made without departing from the principles of this invention, and the above technical features can also be combined in an appropriate manner; these improvements, modifications, variations, or combinations, or the direct application of the inventive concept and technical solution to other situations without modification, should all be considered within the scope of protection of this application.

Claims

1. An anomaly detection method for supercomputing clusters, characterized in that the method includes: obtaining the indicator data and log data of each node in the supercomputing cluster, and performing feature extraction and fusion processing on the indicator data and log data of the same node to obtain the feature tensor at the current moment; inputting the feature tensor into a trained graph structure generation network, wherein the graph structure generation network determines the dynamic dependencies between nodes from both spatial and temporal causal perspectives, thereby obtaining a graph structure adjacency matrix representing the dynamic dependencies, and wherein, during the training phase, the graph structure generation network incorporates the inherent node topology constraints of the supercomputing cluster; inputting the graph structure adjacency matrix and the feature tensors within the set historical time window into the trained graph coding network to obtain the spatiotemporal graph embedding sequence; inputting the spatiotemporal graph embedding sequence into a trained temporal prediction network to obtain feature prediction data for each node at the next time step; and determining whether there is an abnormality in the supercomputing cluster based on the error between the feature prediction data and the feature observation data of each node at the next time step.

2. The anomaly detection method for supercomputing clusters according to claim 1, characterized in that performing feature extraction and fusion on the indicator data and log data of the same node to obtain the feature tensor at the current moment includes:
in the indicator channel, encoding the indicator data with a temporal convolutional network to obtain feature codes, and inputting the feature codes into a linear layer that maps them to the target space to obtain the indicator code, wherein the temporal convolutional network includes causal convolution structures and dilated convolution structures;
in the log channel, mapping each log event in the log data to a log semantic vector with a text encoding network; taking the current time as a reference, constructing a first time window and a second time window preceding the current time, the second time window encompassing the first time window; applying a weighted average to the log semantic vectors falling within the first time window to obtain a first intermediate result, where log semantic vectors closer to the current time have larger weights; applying a weighted average to the log semantic vectors falling within the second time window to obtain a second intermediate result, where log semantic vectors farther from the current time have smaller weights; computing a weighted sum of the first and second intermediate results to obtain a third intermediate result; and inputting the third intermediate result into a linear layer that maps it to the target space to obtain the log code;
treating the indicator channel as the main channel and the log channel as a sparse semantic modulation channel, and fusing the indicator code and the log code with a gating-based feature fusion mechanism to obtain the feature tensor.
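
As an illustrative, non-limiting sketch of the log channel and the gated fusion described in claim 2, the following Python code computes the two window averages, combines them, and fuses the result with an indicator code. The exponential recency weighting, the mixing factor `alpha`, the `decay` rate, and the additive form of the gate are assumptions made for illustration; the claim does not fix any of them. The linear mapping layers and the text encoding network are omitted.

```python
import numpy as np

def recency_weighted_mean(vectors, timestamps, now, window, decay):
    """Weighted mean of the log vectors inside [now - window, now]; newer events weigh more."""
    vectors = np.asarray(vectors, dtype=float)
    age = now - np.asarray(timestamps, dtype=float)
    inside = (age >= 0) & (age <= window)
    if not inside.any():
        return np.zeros(vectors.shape[1])      # no log events in the window
    weights = np.exp(-decay * age[inside])     # assumed exponential recency weighting
    return weights @ vectors[inside] / weights.sum()

def log_summary(vectors, timestamps, now, short_window, long_window, alpha=0.7, decay=0.1):
    """Weighted sum of the first (short) and second (long) window averages."""
    first = recency_weighted_mean(vectors, timestamps, now, short_window, decay)
    second = recency_weighted_mean(vectors, timestamps, now, long_window, decay)
    return alpha * first + (1 - alpha) * second

def gated_fusion(indicator_code, log_code, gate_weight, gate_bias):
    """Indicator code is the main signal; a sigmoid gate scales the log modulation (assumed form)."""
    joint = np.concatenate([indicator_code, log_code])
    gate = 1.0 / (1.0 + np.exp(-(joint @ gate_weight + gate_bias)))
    return indicator_code + gate * log_code
```

With this additive gate, a node that has no recent log events contributes a zero log summary, so the fused feature reduces to the indicator code, which is one way to realize a sparse modulation channel.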

3. The anomaly detection method for supercomputing clusters according to claim 1, characterized in that the graph structure generation network is trained in the following manner:
generating a binary topology constraint matrix based on the inherent node topology constraints of the supercomputing cluster, wherein the element corresponding to two nodes that are allowed to exchange data has a value of 1, and the element corresponding to two nodes that are not allowed to exchange data has a value of 0;
constructing a prior graph based on the binary topology constraint matrix and the feature tensor at the reference time;
based on the prior graph and the feature tensor at the reference time, inferring the influence features between nodes from a spatial perspective through a spatial self-attention mechanism to obtain a spatial attention graph;
based on the prior graph and the feature tensor at the reference time, inferring the time-lagged influence features of anomaly propagation between nodes from a temporal causal perspective through a cross-attention mechanism to obtain a temporal causal graph;
using the prior graph to adaptively weight and fuse the spatial attention graph and the temporal causal graph to obtain an adjacency matrix representing the dynamic dependencies;
constructing a structural prior loss based on the adjacency matrix and the prior graph; and
during training, adjusting the learnable parameters with the goal of minimizing the structural prior loss.

4. The anomaly detection method for supercomputing clusters according to claim 3, characterized in that constructing the prior graph based on the binary topology constraint matrix and the feature tensor at the reference time includes:
calculating the cosine similarity between nodes based on the feature tensor at the reference time; for each current node, retaining only the k1 neighboring nodes with the highest similarity, and setting the element value representing the relationship between the current node and each retained neighboring node to 1, to obtain the representation matrix of a binarized K-nearest-neighbor graph;
based on the representation matrix of the binarized K-nearest-neighbor graph, determining the Jaccard similarity C between two nodes:
C = |A ∩ B| / |A ∪ B|
where A is the set of adjacent nodes of one of the two nodes, and B is the set of adjacent nodes of the other;
for each current node, retaining only the k2 nodes with the highest Jaccard similarity, setting the element values corresponding to the retained nodes to 1 and the element values corresponding to the remaining nodes to 0, to obtain an intermediate representation matrix; and
multiplying the intermediate representation matrix and the binary topology constraint matrix element-wise to obtain the representation matrix of the prior graph.
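
The steps of claim 4 can be sketched in Python as follows. This is a minimal illustration rather than the applicant's implementation: the function names are hypothetical, node features are assumed to be non-zero vectors, a node is assumed not to count as its own neighbor, and `k1` and `k2` are assumed to be smaller than the number of nodes.

```python
import numpy as np

def top_k_mask(scores, k):
    """Binary matrix keeping, in each row, the k highest off-diagonal scores."""
    scores = scores.astype(float).copy()
    np.fill_diagonal(scores, -np.inf)                 # assumption: no self-neighbors
    keep = np.argsort(-scores, axis=1)[:, :k]
    mask = np.zeros(scores.shape, dtype=int)
    np.put_along_axis(mask, keep, 1, axis=1)
    return mask

def build_prior_graph(features, topology, k1, k2):
    """features: (nodes, dims); topology: binary (nodes, nodes) constraint matrix."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    knn = top_k_mask(unit @ unit.T, k1)               # binarized K-nearest-neighbor graph
    shared = knn @ knn.T                              # |A ∩ B| for every node pair
    degree = knn.sum(axis=1)
    union = degree[:, None] + degree[None, :] - shared    # |A ∪ B|
    jaccard = shared / np.maximum(union, 1)
    intermediate = top_k_mask(jaccard, k2)            # keep the k2 most similar neighborhoods
    return intermediate * topology                    # element-wise topology constraint
```

Because the last step is an element-wise product with the topology matrix, an edge between two nodes that are not allowed to exchange data can never appear in the prior graph, however similar their neighborhoods are.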

5. The anomaly detection method for supercomputing clusters according to claim 3, characterized in that:
obtaining the spatial attention graph by inferring the influence features between nodes from a spatial perspective through the spatial self-attention mechanism, based on the prior graph and the feature tensor, includes determining the spatial attention graph by the following formula:
[formula not reproduced in the source text]
where the symbols of the formula denote, respectively: the weight matrices to be learned; the spatial attention graph at reference time t; D, the number of feature dimensions of the feature tensor; the feature tensor at reference time t; the binary topology constraint matrix; the superscript T, matrix transposition; the element-wise multiplication operator; and the Sigmoid activation function;
obtaining the temporal causal graph by inferring the time-lagged influence features of anomaly propagation between nodes from a temporal causal perspective through the cross-attention mechanism, based on the prior graph and the feature tensor, includes determining the temporal causal graph by the following formula:
[formula not reproduced in the source text]
where the symbols of the formula denote, respectively: the temporal causal graph at reference time t; the time lag step; the set of lagged time steps; the feature tensor at the lagged time; the weight matrices to be learned; the temporal bias term to be learned; and max(), the function that takes the maximum value;
adaptively weighting and fusing the spatial attention graph and the temporal causal graph using the prior graph to obtain the adjacency matrix representing the dynamic dependencies includes determining the adjacency matrix by the following formula:
[formula not reproduced in the source text]
where the symbols of the formula denote, respectively: the adjacency matrix at reference time t; the normalization function; the temperature coefficient, which is greater than zero; the parameters to be learned; the cosine similarity between the spatial attention graph and the prior graph on the set Ω; the cosine similarity between the temporal causal graph and the prior graph on the set Ω; and Ω, the set of allowed edges, defined by a condition that is not reproduced in the source text;
the structural prior loss is:
[formula not reproduced in the source text]
where the symbols of the formula denote, respectively: the structural prior loss; the total number of elements in the set Ω; the element of the adjacency matrix that gives the predicted edge strength of node i for node j at time t; and a binary label, which is 1 if the element of the prior graph representing the relationship between node i and node j is 1, and 0 if that element is 0.
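
The formula for the structural prior loss is not legible in this text. One loss that is consistent with the listed ingredients (an average over the allowed-edge set Ω, a predicted edge strength, and a binary label taken from the prior graph) is binary cross-entropy. The Python sketch below uses that form purely as an assumption, and further assumes that Ω is the set of positions where the binary topology constraint matrix equals 1 and that edge strengths lie in (0, 1).

```python
import numpy as np

def structural_prior_loss(adjacency, prior, topology, eps=1e-7):
    """Assumed form: binary cross-entropy between edge strengths and prior labels over Ω."""
    allowed = topology == 1                          # assumed definition of the set Ω
    predicted = np.clip(adjacency[allowed], eps, 1 - eps)   # guard the logarithms
    label = prior[allowed]                           # binary label from the prior graph
    bce = -(label * np.log(predicted) + (1 - label) * np.log(1 - predicted))
    return bce.mean()                                # average over the elements of Ω
```

Under this assumed form, the loss is small when strong predicted edges coincide with edges of the prior graph, and positions outside Ω do not contribute at all.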

6. The anomaly detection method for supercomputing clusters according to claim 1, characterized in that the graph coding network is trained in the following way:
constructing the following node-level contrastive loss function, and adjusting the parameters of the graph coding network with the goal of minimizing the contrastive loss function, wherein for any node i the one-way contrastive loss function is defined as:
[formula not reproduced in the source text]
where the symbols of the formula denote, respectively: the one-way contrastive loss of node i; sim(), the similarity calculation function; exp(), the natural exponential function; τ, the temperature coefficient; the embedded representation of node i under the first augmented view of the graph structure and the first augmented view of the features; the embedded representation of node i under the second augmented view of the graph structure and the second augmented view of the features; the embedded representation of node k under the second augmented view of the graph structure and the second augmented view of the features; the negative sample set of node i; and an element of the graph structure adjacency matrix, representing the predicted edge strength of one node for another node at time t.

7. The anomaly detection method for supercomputing clusters according to claim 6, characterized in that the embedded representations of the first augmented view of the graph structure and the first augmented view of the features, as well as the embedded representations of the second augmented view of the graph structure and the second augmented view of the features, are determined in the following manner:
performing data augmentation based on the graph structure adjacency matrix and the feature tensor to generate a first augmented view and a second augmented view of the graph structure, as well as a first augmented view and a second augmented view of the features;
inputting the first augmented view of the graph structure and the first augmented view of the features into the graph coding network to be trained to obtain the embedded representations of the first augmented view of the graph structure and the first augmented view of the features; and
inputting the second augmented view of the graph structure and the second augmented view of the features into the graph coding network to be trained to obtain the embedded representations of the second augmented view of the graph structure and the second augmented view of the features.

8. The anomaly detection method for supercomputing clusters according to claim 7, characterized in that performing data augmentation based on the graph structure adjacency matrix and the feature tensor to generate the first augmented view and the second augmented view of the graph structure, as well as the first augmented view and the second augmented view of the features, includes:
determining the confidence and centrality of candidate edges based on the graph structure adjacency matrix;
determining an edge deletion probability and an edge addition probability based on the confidence and the centrality; deleting edges from the connected positions in the graph structure according to the edge deletion probability to obtain a first binary matrix; and adding edges at unconnected positions in the graph structure according to the edge addition probability to obtain a second binary matrix;
multiplying the first binary matrix element-wise with the graph structure adjacency matrix to obtain the first augmented view of the graph structure, and multiplying the second binary matrix element-wise with the graph structure adjacency matrix to obtain the second augmented view of the graph structure;
calculating the standard deviation vector of the feature tensor along the node dimension, and using the normalized standard deviation vector as the importance score of the feature channels; and
determining a masking probability according to the importance score, generating a mask vector according to the masking probability and applying it to the feature tensor, and superimposing independent Gaussian noise, to obtain the first augmented view and the second augmented view of the features.
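
The feature-side augmentation of claim 8 can be illustrated with the following Python sketch. The claim states only that the masking probability is determined from the importance score; the specific mapping used here (less important channels are masked more often), the `base_rate`, and the `noise_scale` are assumptions, and the function name is hypothetical. The edge deletion and edge addition steps are not shown.

```python
import numpy as np

def augment_features(features, rng, base_rate=0.3, noise_scale=0.01):
    """features: (nodes, channels). Returns one augmented view and the masking probabilities."""
    std = features.std(axis=0)                        # spread of each channel across nodes
    importance = std / (std.max() + 1e-12)            # normalized importance score in [0, 1]
    mask_prob = np.clip(base_rate * (1.0 - importance), 0.0, 1.0)   # assumed mapping
    keep = (rng.random(features.shape[1]) >= mask_prob).astype(float)  # channel mask vector
    noise = rng.normal(0.0, noise_scale, size=features.shape)       # independent Gaussian noise
    return features * keep + noise, mask_prob
```

Calling the function twice with the same random generator, for example `rng = np.random.default_rng(0)`, draws two independent masks and noise samples and therefore yields the first and second augmented views of the features.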

9. The anomaly detection method for supercomputing clusters according to claim 1, characterized in that the temporal prediction network is trained in the following manner:
constructing the following temporal prediction loss function, and adjusting the parameters of the temporal prediction network with the goal of minimizing the temporal prediction loss function:
[formula not reproduced in the source text]
where the symbols of the formula denote, respectively: the temporal prediction loss; the total number of nodes; the number of feature dimensions; and the predicted value and the actual value of each node in each feature dimension.

10. The anomaly detection method for supercomputing clusters according to claim 1, characterized in that determining whether the supercomputing cluster is abnormal based on the error between the feature prediction data and the feature observation data of each node at the next moment includes:
calculating, for each node, the error between the feature prediction data and the feature observation data;
standardizing the errors and aggregating them into a system-level anomaly score; and
if the anomaly score is greater than a decision threshold, determining that the supercomputing cluster is abnormal, wherein the decision threshold is the experimentally determined value that maximizes the F1 score.
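
As a non-limiting illustration of claim 10, the Python sketch below computes a per-node error, standardizes it, aggregates it into one score per time step, and selects the threshold that maximizes F1 on labeled validation data. The mean absolute error, the z-score standardization against baseline statistics, the maximum over nodes as the aggregation, and the search over observed score values are all assumptions; the claim does not specify them.

```python
import numpy as np

def system_scores(predicted, observed, baseline_mean, baseline_std):
    """predicted, observed: (time, nodes, features). Returns one anomaly score per time step."""
    node_error = np.abs(predicted - observed).mean(axis=2)     # error per node
    standardized = (node_error - baseline_mean) / (baseline_std + 1e-12)
    return standardized.max(axis=1)                            # assumed aggregation: worst node

def best_f1_threshold(scores, labels):
    """Search the observed score values for the threshold with the highest F1."""
    best_threshold, best_f1 = None, -1.0
    for threshold in np.unique(scores):
        flagged = scores > threshold                           # abnormal if score exceeds threshold
        tp = np.sum(flagged & (labels == 1))
        fp = np.sum(flagged & (labels == 0))
        fn = np.sum(~flagged & (labels == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1
```

At detection time, a time step would be reported as abnormal when its score from `system_scores` exceeds the threshold returned by `best_f1_threshold`.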