Supercomputing center emergency coordination and control method and system based on Internet of Things linkage
By combining an IoT monitoring node array with a supercomputing center resource scheduling platform, a spatial correlation network of fluctuation source nodes is constructed, enabling accurate identification, adaptive matching, and efficient execution of emergency coordination and control in the supercomputing center. This addresses the shortcomings of existing emergency coordination and control technologies and improves the timeliness of emergency response and resource utilization.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- POWERCHINA RAILWAY CONSTR
- Filing Date
- 2026-05-14
- Publication Date
- 2026-06-19
AI Technical Summary
The existing emergency coordination and control mechanisms of supercomputing centers cannot fully correlate the spatial distribution characteristics of abnormal events, the matching accuracy between emergency tasks and computing resources is insufficient, the collaborative efficiency of multi-node parallel response is low, and it is difficult to meet the emergency response requirements for high-reliability operation.
By collecting environmental status perception data through an array of IoT monitoring nodes, a spatial correlation network of fluctuation source nodes is constructed. Combined with the supercomputing center resource scheduling platform, emergency response task request instructions are generated to achieve precise triggering of emergency tasks and adaptive matching of resources. Based on the node resource occupancy status and task execution progress, emergency coordination and control instructions are generated to achieve nearby scheduling and reasonable division of labor.
The method improves the timeliness and accuracy of emergency response, reduces cross-node data transmission overhead and coordination complexity, and helps ensure the operational stability and security of the supercomputing center.
Figure: abstract drawing (CN122248032A_ABST)
Description
Technical Field
[0001] This application relates to the field of computer control technology, specifically to an emergency coordination and control method and system for supercomputing centers based on Internet of Things (IoT) linkage.
Background Technology
[0002] As a critical infrastructure supporting large-scale scientific computing, engineering simulation, and big data processing, the operational stability of supercomputing centers directly affects the delivery efficiency and business continuity of various computing tasks. With the continuous expansion of supercomputing centers, the number of computing nodes, power supply equipment, cooling equipment, and network equipment deployed within the data center is growing exponentially. The complexity of dynamic changes in equipment operating status and data center environmental parameters continues to increase. If any environmental anomaly or equipment failure in any area is not addressed promptly, it can trigger widespread task interruptions or even hardware damage, causing incalculable losses.
[0003] Currently, supercomputing centers generally deploy IoT monitoring systems. These systems collect environmental data such as temperature, humidity, power supply voltage, and equipment vibration through monitoring nodes distributed throughout the data center. Combined with preset threshold alarm rules, these systems identify abnormal events and trigger corresponding emergency response procedures. At the resource scheduling level, supercomputing centers typically use a centralized resource management platform to allocate computing resources according to task requirements, achieving load balancing and efficient utilization of the computing node cluster. For emergency scenarios, existing scheduling mechanisms often employ a reserved fixed emergency resource pool. Upon receiving an anomaly alarm, nodes are allocated from this pool to execute emergency response tasks.
[0004] With the development trend of multi-regional collaboration and cross-center scheduling in supercomputing centers, higher demands are placed on the global perception capability of abnormal events, the dynamic adaptation capability of resource scheduling, and the execution efficiency of multi-node collaboration during emergency response. Existing emergency coordination and control mechanisms cannot fully correlate the spatial distribution characteristics of abnormal events, the matching accuracy between emergency tasks and computing resources is insufficient, and the collaborative efficiency of multi-node parallel response is low, making it difficult to meet the emergency response needs of high-reliability operation of supercomputing centers.
Summary of the Invention
[0005] This application provides an emergency coordination and control method and system for supercomputing centers based on Internet of Things (IoT) linkage.
[0006] An embodiment of this application provides an emergency coordination and control method for supercomputing centers based on Internet of Things (IoT) linkage, including:
Collecting the raw environmental state perception data stream transmitted back by the IoT monitoring node array, and performing data stream state change point detection on that stream to obtain a set of abnormal fluctuation feature descriptors corresponding to the environmental state perception data units.
Constructing a spatial association network of fluctuation source nodes based on the fluctuation source node identifier contained in each data fluctuation anomaly feature descriptor in the set of abnormal fluctuation feature descriptors. The spatial association network of fluctuation source nodes includes multiple fluctuation source nodes and spatial association edges connecting fluctuation source nodes that have spatial adjacency relationships.
Inputting the spatial association network of fluctuation source nodes into the supercomputing center resource scheduling platform for emergency task trigger condition parsing, and generating an emergency response task request instruction matching the network topology characteristics of the spatial association network of fluctuation source nodes.
Querying, based on the emergency response task request instruction, the real-time operation status database of the supercomputing center computing node cluster to obtain the node resource occupancy status vector and node task execution progress vector of each supercomputing node in the cluster, and screening the available nodes of the cluster based on these two vectors to obtain a candidate supercomputing node group set that is qualified to take over the emergency task.
Performing spatial proximity mapping between each fluctuation source node in the spatial association network of fluctuation source nodes and the candidate supercomputing nodes in the candidate supercomputing node group set to generate an emergency coordination and control instruction set containing a node mapping table and a task assignment sequence, and sending the emergency coordination and control instruction set to the candidate supercomputing node group set to activate parallel execution of the emergency response task.
[0007] This application also provides an emergency coordination and control system for supercomputing centers, including: a processor; and a storage device on which a computer program is stored. When the computer program is executed by the processor, it causes the processor to implement any of the aforementioned IoT linkage-based emergency coordination and control methods for supercomputing centers.
[0008] An embodiment of this application also provides a readable storage medium storing a program or instructions which, when executed by a processor, implement the steps of the supercomputing center emergency coordination and control method based on Internet of Things linkage.
[0009] The embodiments of this application start from the full-process collaborative logic of emergency response in supercomputing centers. Through the omni-channel perception of the IoT monitoring node array and the detection of data stream state change points, they achieve accurate identification and characterization of environmental anomalies. The spatial association network of fluctuation source nodes, constructed from fluctuation source node identifiers, reproduces the spatial distribution characteristics of abnormal events and the relationships among nodes, providing a topology-level decision-making basis for the accurate triggering of emergency tasks. By analyzing the topological characteristics of this network, the supercomputing center resource scheduling platform generates emergency response task request instructions that match the degree of anomaly impact, achieving adaptive matching between emergency response levels and resource requirements and avoiding over-scheduling or insufficient response. Multi-dimensional filtering that combines the node resource occupancy status vectors and the node task execution progress vectors ensures the resource availability and task-undertaking capacity of the candidate supercomputing node groups, providing a resource foundation for the efficient execution of emergency tasks. Spatial proximity mapping between fluctuation source nodes and candidate supercomputing nodes enables nearby scheduling and a reasonable division of labor for anomaly handling tasks. The generated emergency coordination and control instruction set drives the candidate supercomputing node groups to execute emergency response tasks in parallel, shortening the response and handling cycle of abnormal events while reducing cross-node data transmission overhead and coordination complexity. Across perception, decision-making, scheduling, and execution, this improves the timeliness, accuracy, and resource utilization of emergency coordination and control in the supercomputing center and helps ensure the stability and security of its operation.
Attached Figure Description
[0010] Figure 1 is a flowchart illustrating an emergency coordination and control method for a supercomputing center based on Internet of Things (IoT) linkage, as provided in an embodiment of this application.
[0011] Figure 2 is a schematic diagram of the basic structure of a supercomputing center emergency coordination and control system provided in an embodiment of this application.
[0012] Figure 3 is a functional block diagram of an emergency coordination and control device for a supercomputing center, provided as an embodiment of this application.
Detailed Implementation
[0013] To make the above-mentioned objectives, features and advantages of this application more apparent and understandable, the embodiments of this application will be further described in detail below with reference to the accompanying drawings and specific implementation methods.
[0014] Figure 1 is a flowchart of an emergency coordination and control method for a supercomputing center based on Internet of Things linkage, provided in an embodiment of this application. The method is executed by the emergency coordination and control system of the supercomputing center. As shown in Figure 1, the method includes steps 110-150.
[0015] Step 110: Collect the original environmental state perception data stream transmitted back by the IoT monitoring node array, perform data stream state change point detection processing on the original environmental state perception data stream, and obtain the set of abnormal fluctuation feature descriptors corresponding to the environmental state perception data unit.
[0016] The IoT monitoring node array of the supercomputing center is deployed in various areas of the computer room, covering key locations such as power supply modules, cooling modules, computing node cabinets, and network switching equipment. Each monitoring node continuously collects environmental status parameters such as temperature, humidity, power supply voltage, equipment vibration, and smoke concentration in the corresponding area according to a preset sampling period. The raw environmental status perception data stream transmitted back consists of multiple environmental status perception data units carrying node identifiers and timestamps. Each data unit corresponds to all monitoring parameters of a monitoring node at a certain sampling time.
[0017] Step 111: Collect the raw environmental state perception data stream transmitted back by the monitoring node array deployed in the IoT perception layer during the continuous sampling period. The raw environmental state perception data stream contains environmental state perception data units generated by multiple monitoring nodes at different sampling times, carrying node identifiers and timestamps.
[0018] The monitoring node array of the IoT sensing layer is deployed in a grid pattern according to the physical layout of the data center. Each monitoring node is pre-assigned a unique node identifier, which includes information such as the node's region and the type of monitoring parameter, facilitating rapid location of anomalies. The sampling period of the monitoring nodes is set differently based on the dynamic characteristics of the monitoring parameters. Longer sampling periods are used for parameters with slower change rates, such as temperature and humidity, while shorter sampling periods are used for parameters with faster change rates, such as power supply voltage and equipment vibration. The clocks of all monitoring nodes are kept consistent through the global clock synchronization service of the supercomputing center, ensuring that the timestamp accuracy of the transmitted data meets the requirements of time series analysis.
[0019] The supercomputing center's emergency coordination and control system receives data from all monitoring nodes through a dedicated IoT data access gateway. The access gateway performs preliminary legality verification on the received data, discarding invalid data with invalid node identifiers or timestamps outside the reasonable range. The verified environmental status perception data units are stored in a temporary data stream buffer in the order of receipt, forming a complete original environmental status perception data stream. This data stream will be directly input into the sorting and processing stage.
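The gateway check described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the `DataUnit` type, the `validate_units` function, the use of epoch seconds, and the `max_skew_s` tolerance are all assumptions introduced here, since the text does not define what counts as a "reasonable" timestamp range.

```python
from dataclasses import dataclass


@dataclass
class DataUnit:
    """Hypothetical environmental state perception data unit."""
    node_id: str
    timestamp: float   # assumed: seconds since epoch
    values: dict       # parameter name -> reading


def validate_units(units, known_node_ids, now, max_skew_s=300.0):
    """Keep units with a registered node id and a plausible timestamp.

    Units are appended in arrival order, mirroring the temporary
    data stream buffer described in the text.
    """
    buffer = []
    for unit in units:
        if unit.node_id not in known_node_ids:
            continue  # invalid node identifier
        if not (0 < unit.timestamp <= now + max_skew_s):
            continue  # timestamp outside the assumed reasonable range
        buffer.append(unit)
    return buffer
```

A real gateway would likely also bound how far in the past a timestamp may lie; that rule is left out because the source does not state it.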
[0020] Step 112: Sort the environmental state perception data units corresponding to each monitoring node in the original environmental state perception data stream according to the sampling time order to generate the node perception data time sequence corresponding to each monitoring node.
[0021] The supercomputing center's emergency coordination and control system first groups the raw environmental state sensing data stream based on the node identifiers carried by the environmental state sensing data units. Data units with the same node identifier are grouped into the same group, and each group corresponds to all the collected data from one monitoring node. For the environmental state sensing data units within each group, the system sorts them in ascending order according to the timestamps carried by the data units. If two data units have the same timestamp, they are sorted according to the order in which they were received to avoid timing disorder.
[0022] After sorting, the system verifies the continuity of timestamps for each group, identifies adjacent data units with time intervals exceeding a preset threshold, and marks them as potential sampling interruption points. Data segments before and after the sampling interruption point are considered as independent time-series subsequences. The final generated time-series sequence of sensing data for each node is a set of continuous sampling data for the corresponding monitoring node. Each element in the sequence is an environmental state sensing data unit, and elements maintain a strict temporal order.
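A possible shape for the grouping, sorting, and interruption handling of step 112 is shown below. It is a sketch under stated assumptions: data units are represented as `(node_id, timestamp, payload)` tuples, and `build_node_sequences` and `max_gap_s` are hypothetical names for the routine and the preset interval threshold.

```python
from collections import defaultdict


def build_node_sequences(stream, max_gap_s):
    """Group a raw stream by node and split it at sampling interruptions.

    stream: list of (node_id, timestamp, payload) tuples in arrival order.
    Returns {node_id: [subsequence, ...]}, where each subsequence is a
    time-ordered list of units with no gap larger than max_gap_s.
    """
    groups = defaultdict(list)
    for unit in stream:
        groups[unit[0]].append(unit)

    result = {}
    for node_id, units in groups.items():
        # sorted() is stable, so units with equal timestamps keep
        # their order of receipt.
        ordered = sorted(units, key=lambda u: u[1])
        subsequences = [[ordered[0]]]
        for prev, cur in zip(ordered, ordered[1:]):
            if cur[1] - prev[1] > max_gap_s:
                subsequences.append([])  # sampling interruption point
            subsequences[-1].append(cur)
        result[node_id] = subsequences
    return result
```

Relying on the stability of Python's sort is what implements the tie-breaking rule for identical timestamps without an explicit secondary key.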
[0023] Step 113: Perform sliding window segmentation on the time series of the sensing data of each node, dividing the time series of the sensing data of each node into multiple continuous data segmentation units with fixed time window lengths.
[0024] The fixed time window length of the sliding window is preset based on the dynamic characteristics of the monitoring parameters and the sensitivity requirements for anomaly detection. Different time window lengths are configured for different monitoring parameter types, ensuring that each window contains sufficient sampling points to support distribution feature calculations while avoiding excessively long windows that could decrease the accuracy of abrupt change point location. The system uses a non-overlapping sliding window approach to segment the time series of node-sensing data. The end time of the current window becomes the start time of the next window, and the window sliding step is equal to the time window length. For remaining data at the end of the time series that is less than a complete window length, if the number of sampling points reaches a preset minimum sampling point threshold, it is treated as a separate data segment unit; otherwise, it is merged into the previous data segment unit. Each generated data segment unit contains four attributes: a start timestamp, an end timestamp, the node identifier, and all environmental state sensing data units within that window. All data segment units maintain the same chronological order as the original time series.
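The segmentation rule, including the handling of the short tail, might look like the sketch below. It assumes a uniform sampling period, so that a fixed time window corresponds to a fixed number of samples (`points_per_window`); that simplification, the function name, and the dictionary layout are assumptions made for illustration, not details given in the text.

```python
def segment_windows(node_id, units, points_per_window, min_points):
    """Split a time-ordered sequence into non-overlapping segments.

    units: list of (timestamp, payload) tuples for one node.
    A trailing partial segment is kept if it has at least min_points
    samples; otherwise it is merged into the previous segment.
    """
    segments = [
        units[i:i + points_per_window]
        for i in range(0, len(units), points_per_window)
    ]
    if (
        len(segments) > 1
        and len(segments[-1]) < points_per_window
        and len(segments[-1]) < min_points
    ):
        tail = segments.pop()
        segments[-1].extend(tail)

    # Four attributes per segment: start, end, node id, contained units.
    return [
        {
            "start": seg[0][0],
            "end": seg[-1][0],
            "node_id": node_id,
            "units": seg,
        }
        for seg in segments
    ]
```

Here the start and end attributes are taken from the first and last sample in each segment. With non-uniform sampling, the windows would instead need to be cut on time boundaries.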
[0025] Step 114: Extract data distribution features for each data segment unit, and calculate the mean and standard deviation parameters of all environmental state perception data units within each data segment unit.
[0026] For each data segment unit, the system first extracts the monitoring parameter values from all environmental state sensing data units within that unit. For each type of monitoring parameter, the distribution characteristics are calculated. The data mean parameter is the arithmetic mean of all values of the same type of monitoring parameter within the data segment unit, reflecting the overall level of the monitoring parameters within that window. The data standard deviation parameter is a statistical measure of the dispersion of all values of the same type of monitoring parameter within the data segment unit, reflecting the degree of fluctuation of the monitoring parameters within that window. During the calculation process, the system first standardizes all monitoring parameter values to eliminate dimensional differences between different monitoring parameters and ensure the comparability of the distribution characteristics of different parameters. The standardization method is to subtract the historical mean of the parameter and then divide by the historical standard deviation. The historical mean and historical standard deviation are obtained from the statistical analysis of the normal operation data of the monitoring node over a preset period of time. Each data segment unit corresponds to a set of data mean parameters and data standard deviation parameters, with the number of parameters matching the number of monitoring parameter types for that monitoring node.
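As a rough illustration of step 114, the following sketch standardizes the readings with the historical mean and standard deviation and then computes the per-parameter mean and standard deviation of a segment. The text does not say whether the population or the sample standard deviation is intended; the population form (`statistics.pstdev`) is used here as an assumption, and the function and argument names are hypothetical.

```python
from statistics import fmean, pstdev


def segment_features(values_by_param, history):
    """Compute standardized mean and std for one data segment unit.

    values_by_param: {param: [raw readings in the segment]}
    history: {param: (historical_mean, historical_std)} taken from
             normal operation data; historical_std is assumed > 0.
    """
    features = {}
    for param, raw in values_by_param.items():
        h_mean, h_std = history[param]
        z = [(v - h_mean) / h_std for v in raw]  # z-score standardization
        features[param] = {"mean": fmean(z), "std": pstdev(z)}
    return features
```

For example, readings of 22, 24, and 26 with a historical mean of 20 and a historical standard deviation of 2 standardize to 1, 2, and 3, giving a segment mean of 2.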
[0027] Step 115: Perform a difference calculation between the mean parameter of each data segment unit and the mean parameter of the adjacent preceding data segment unit to obtain the gradient parameter of the mean change between adjacent data segment units.
[0028] The system calculates the mean change gradient for each data segment unit (excluding the first one) according to their chronological order. For each type of monitoring parameter, the mean parameter of the preceding adjacent data segment unit is subtracted from the mean parameter of the current data segment unit to obtain the mean change gradient parameter for that type of monitoring parameter. The sign of the parameter reflects the direction of change, and the absolute value reflects the magnitude of change. For the first data segment unit, since there is no preceding adjacent data segment unit, its mean change gradient parameter is set to 0 by default and is not included in the abrupt change point detection. The mean change gradient parameters of all monitoring parameters constitute the set of mean change gradient parameters for the current data segment unit. This set, together with the data dispersion change ratio parameter set, serves as the basis for determining suspected abrupt change points.
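The gradient calculation can be sketched as below. The code assumes the difference is taken as the current segment mean minus the preceding segment mean, so that a positive value means an increase; `mean_change_gradients` is a hypothetical name.

```python
def mean_change_gradients(segment_means):
    """Per-parameter mean change gradient between adjacent segments.

    segment_means: chronological list of {param: mean} dictionaries.
    The first segment has no predecessor, so its gradient defaults to 0.
    """
    gradients = []
    for i, cur in enumerate(segment_means):
        if i == 0:
            gradients.append({p: 0.0 for p in cur})
        else:
            prev = segment_means[i - 1]
            # Assumed sign convention: current minus preceding.
            gradients.append({p: cur[p] - prev[p] for p in cur})
    return gradients
```

Callers would skip index 0 when locating change points, since the default value there carries no information.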
[0029] Step 116: Calculate the ratio of the standard deviation parameter of each data segment unit to the standard deviation parameter of the adjacent preceding data segment unit to obtain the data dispersion change ratio parameter between adjacent data segment units.
[0030] The system calculates the data dispersion change ratio for each data segment unit except the first one, according to the time sequence of the data segment units. For each type of monitoring parameter, the data standard deviation parameter corresponding to the current data segment unit is divided by the data standard deviation parameter corresponding to the adjacent previous data segment unit to obtain the data dispersion change ratio parameter of that type of monitoring parameter. A parameter value greater than 1 indicates that the dispersion of the monitoring parameter increases, a parameter value less than 1 indicates that the dispersion of the monitoring parameter decreases, and a parameter value close to 1 indicates that there is no significant change in the dispersion.
[0031] To avoid division by zero errors caused by the standard deviation parameter of the previous data segment being 0, the system pre-sets a minimum standard deviation threshold. If the standard deviation parameter of the adjacent previous data segment is less than this minimum standard deviation threshold, the minimum standard deviation threshold is used as the denominator in the calculation. The data dispersion change ratio parameters corresponding to all monitored parameters constitute the data dispersion change ratio parameter set of the current data segment unit. This set, together with the mean change gradient parameter set, is input into the suspected abrupt change point location stage.
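A minimal sketch of the ratio calculation with the denominator floor follows. The default `min_std` value and the use of `None` for the first segment are illustrative assumptions; the text only states that a minimum standard deviation threshold is preset.

```python
def dispersion_ratios(segment_stds, min_std=1e-6):
    """Per-parameter dispersion change ratio between adjacent segments.

    segment_stds: chronological list of {param: std} dictionaries.
    The first entry is None because the first segment has no predecessor.
    """
    ratios = []
    for i, cur in enumerate(segment_stds):
        if i == 0:
            ratios.append(None)
            continue
        prev = segment_stds[i - 1]
        # Floor the denominator so a (near-)zero previous std cannot
        # cause a division by zero.
        ratios.append({p: cur[p] / max(prev[p], min_std) for p in cur})
    return ratios
```

Flooring the denominator also caps how large the ratio can become after a perfectly flat segment, which keeps the later threshold comparison meaningful.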
[0032] Step 117: Perform suspected mutation point localization processing on the time series of node-sensing data based on the mean change gradient parameter and the data dispersion change ratio parameter.
[0033] The system comprehensively judges the set of mean change gradient parameters and the set of data dispersion change ratio parameters for each data segment unit. For each type of monitoring parameter, it judges whether the absolute value of its mean change gradient parameter exceeds the preset gradient threshold and whether the data dispersion change ratio parameter exceeds the preset ratio threshold. If any type of monitoring parameter meets both of the above conditions at the same time, it is determined that there is abnormal fluctuation in the corresponding time period of the data segment unit and it is marked as a suspected mutation data segment unit.
[0034] To reduce the probability of false positives, the system also performs joint verification on multiple consecutive adjacent data segment units. If multiple consecutive data segment units are marked as suspected mutation data segment units, the time period is confirmed as a genuine period of abnormal fluctuation. If only a single isolated suspected mutation data segment unit exists, it is determined to be noise interference, and its suspected mutation label is removed. All the suspected mutation data segment units located are input into the abnormal fluctuation feature descriptor generation stage, ultimately forming a complete set of abnormal fluctuation feature descriptors.
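The joint verification of adjacent segments could be expressed as in the sketch below. It interprets "multiple consecutive" as a run of at least two flagged segments, which is an assumption; the text does not give a minimum run length, and `drop_isolated_flags` is a hypothetical name.

```python
def drop_isolated_flags(flags):
    """Remove suspected-mutation flags that have no flagged neighbor.

    flags: chronological list of booleans, one per data segment unit.
    A flag survives only if the preceding or following segment is
    also flagged; isolated flags are treated as noise.
    """
    confirmed = []
    for i, flagged in enumerate(flags):
        left = i > 0 and flags[i - 1]
        right = i + 1 < len(flags) and flags[i + 1]
        confirmed.append(flagged and (left or right))
    return confirmed
```

The comparison always reads the original `flags` list, so clearing one isolated flag cannot cascade into its neighbors.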
[0035] Step 1171: Perform mutation point localization on the node-sensing data time series according to the mean change gradient parameter and the data dispersion change ratio parameter. A data segment unit in the node-sensing data time series is identified as a suspected mutation data segment unit when the absolute value of its mean change gradient parameter exceeds a preset gradient threshold and, at the same time, its data dispersion change ratio parameter exceeds a preset ratio threshold.
[0036] The preset gradient threshold and preset ratio threshold are set in advance based on the type of monitoring parameter and the statistical characteristics of historical normal operation data. Different types of monitoring parameters correspond to different thresholds, each set to the upper limit of the historical normal fluctuation range of the parameter, so that real anomalies can be identified while normal random fluctuations are filtered out. The system judges all monitoring parameters in each data segment unit one by one. If the absolute value of the mean change gradient parameter of at least one monitoring parameter exceeds the corresponding preset gradient threshold, and the data dispersion change ratio parameter of the same monitoring parameter exceeds the corresponding preset ratio threshold, the data segment unit is marked as a suspected mutation data segment unit. If multiple monitoring parameters in a data segment unit simultaneously meet the above conditions, all monitoring parameter types that meet the conditions are recorded as a reference for assessing the scope of the anomaly's impact.
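The per-parameter dual-threshold test can be sketched as follows. The function name and the `(gradient_threshold, ratio_threshold)` tuple layout are assumptions for illustration. A segment counts as a suspected mutation data segment unit when the returned list is non-empty.

```python
def triggering_parameters(gradients, ratios, thresholds):
    """Return the monitoring parameters that meet both conditions.

    gradients:  {param: mean change gradient} for one segment
    ratios:     {param: dispersion change ratio} for the same segment
    thresholds: {param: (gradient_threshold, ratio_threshold)}
    """
    triggered = []
    for param, (grad_thr, ratio_thr) in thresholds.items():
        if param not in gradients or param not in ratios:
            continue  # parameter not monitored by this node
        if abs(gradients[param]) > grad_thr and ratios[param] > ratio_thr:
            triggered.append(param)
    return triggered
```

Returning every triggering parameter, rather than a single boolean, preserves the record of affected parameter types that the text uses for assessing the scope of impact.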
[0037] Step 1172: Obtain the start sampling time point and end sampling time point corresponding to the suspected mutation data segmentation unit, and take the time period between the start sampling time point and the end sampling time point as the period when the data fluctuation anomaly occurs.
[0038] The starting sampling time point of a suspected mutation data segment unit is the timestamp of the first environmental state sensing data unit within that unit, and the ending sampling time point is the timestamp of the last environmental state sensing data unit within that unit. If there are multiple consecutive adjacent suspected mutation data segment units, the starting sampling time point of the first suspected mutation data segment unit is taken as the starting time point of the overall anomaly occurrence period, and the ending sampling time point of the last suspected mutation data segment unit is taken as the ending time point of the overall anomaly occurrence period, merging them into a continuous data fluctuation anomaly occurrence period to avoid splitting continuous anomalies into multiple independent periods. For non-contiguous suspected mutation data segment units, each corresponds to an independent data fluctuation anomaly occurrence period. Each data fluctuation anomaly occurrence period corresponds to a unique monitoring node identifier, clearly defining the location and time range of the anomaly. This period will serve as the basis for extracting the time range of fluctuation amplitude parameters.
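One way to merge consecutive suspected segments into anomaly occurrence periods is sketched below. Segments are represented as `(start, end, suspected)` tuples for a single monitoring node; that representation and the function name are assumptions made for the example.

```python
def merge_anomaly_periods(segments):
    """Merge runs of adjacent suspected segments into (start, end) periods.

    segments: chronological list of (start_ts, end_ts, suspected) tuples
              for one monitoring node.
    """
    periods = []
    open_period = None
    for start, end, suspected in segments:
        if suspected:
            if open_period is None:
                open_period = [start, end]   # first unit of a new run
            else:
                open_period[1] = end         # extend the current run
        elif open_period is not None:
            periods.append(tuple(open_period))
            open_period = None
    if open_period is not None:
        periods.append(tuple(open_period))   # run reaching the end
    return periods
```

Non-contiguous suspected segments naturally produce separate periods, because any unflagged segment closes the open run.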
[0039] Step 1173: Extract the maximum and minimum values of all environmental state sensing data units during the period when the data fluctuation anomaly occurred, and calculate the absolute value of the difference between the maximum and minimum values as the data fluctuation amplitude parameter.
[0040] For each period of data fluctuation anomaly, the system first extracts the values of the monitoring parameters that triggered the anomaly from all environmental state perception data units within that period. It then filters out the maximum and minimum values of these parameters within that period, calculates the absolute value of the difference between them to obtain the data fluctuation amplitude parameter. This parameter reflects the severity of the anomaly; a larger parameter value indicates a higher intensity of the anomaly. If multiple monitoring parameters trigger the anomaly within that period, the fluctuation amplitude parameter for each parameter is calculated separately, and the maximum value is selected as the final data fluctuation amplitude parameter for that period, ensuring that the most severe anomaly is reflected.
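The amplitude calculation might be sketched like this. Taking the maximum across different parameters only makes sense if their values are on a comparable scale, so the sketch assumes the readings passed in are already standardized; the text does not state this explicitly for this step. The function name is hypothetical.

```python
def fluctuation_amplitude(readings_by_param):
    """Return (overall amplitude, per-parameter amplitudes) for one period.

    readings_by_param: {param: [values within the anomaly period]} for
    the parameters that triggered the anomaly (assumed non-empty and
    assumed to be on a comparable, e.g. standardized, scale).
    """
    per_param = {
        param: abs(max(values) - min(values))
        for param, values in readings_by_param.items()
        if values
    }
    return max(per_param.values()), per_param
```

For readings `{"temp": [1.0, 4.5, 2.0], "volt": [-1.0, 0.5]}` the per-parameter amplitudes are 3.5 and 1.5, so the period's amplitude is 3.5.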
[0041] Step 1174: Extract the node identifier of the monitoring node corresponding to the period when the data fluctuation anomaly occurred as the fluctuation source node identifier, and perform association and combination processing on the data fluctuation amplitude parameter and the fluctuation source node identifier to generate an initial data fluctuation anomaly feature descriptor.
[0042] The fluctuation source node identifier directly adopts the unique node identifier of the monitoring node corresponding to the period when the data fluctuation anomaly occurred. This identifier contains information such as the node's region code and equipment type code, and can be directly used to query attributes such as the node's deployment location and monitoring parameter type. The system binds the fluctuation source node identifier with the corresponding data fluctuation amplitude parameter to generate an initial data fluctuation anomaly feature descriptor. This descriptor only contains two core attributes: the location of the anomaly and the fluctuation intensity. To ensure the uniqueness of the descriptor, the system assigns a unique feature descriptor identifier to each initial data fluctuation anomaly feature descriptor for easy indexing and management.
[0043] Step 1175: Add the starting sampling time point of the data fluctuation anomaly occurrence period to the initial data fluctuation anomaly feature descriptor as the fluctuation start time identifier, and add the ending sampling time point of that period as the fluctuation end time identifier. This generates a data fluctuation anomaly feature descriptor containing a fluctuation source node identifier, a fluctuation start time identifier, a fluctuation end time identifier, and a data fluctuation amplitude parameter. The set of abnormal fluctuation feature descriptors is composed of all data fluctuation anomaly feature descriptors.
[0044] The start and end timestamps of fluctuations use a globally unified timestamp format to ensure the comparability of abnormal timestamps from different nodes. The system adds these two timestamps to the initial data fluctuation anomaly feature descriptor to form a complete data fluctuation anomaly feature descriptor. Each descriptor fully represents the location, time, and intensity information of an abnormal fluctuation occurring at a monitoring node during a specific time period. All complete data fluctuation anomaly feature descriptors corresponding to all abnormal fluctuations are aggregated to form an anomaly fluctuation feature descriptor set. Each anomaly fluctuation feature descriptor in the set carries a unique feature descriptor identifier, which does not appear repeatedly in the entire set.
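The descriptor built in steps 1174 and 1175 can be pictured as a small record type. The sketch below is illustrative only: the class name, field names, and the counter-based identifier scheme are assumptions, since the text specifies the attributes but not their representation.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional

_descriptor_ids = count(1)  # hypothetical source of unique identifiers


@dataclass
class FluctuationDescriptor:
    source_node_id: str                 # where the anomaly occurred
    amplitude: float                    # fluctuation intensity
    start_ts: Optional[float] = None    # fluctuation start time identifier
    end_ts: Optional[float] = None      # fluctuation end time identifier
    descriptor_id: int = field(default_factory=lambda: next(_descriptor_ids))


def build_descriptor_set(periods):
    """periods: list of (node_id, start_ts, end_ts, amplitude) tuples."""
    descriptors = []
    for node_id, start, end, amplitude in periods:
        d = FluctuationDescriptor(node_id, amplitude)  # initial descriptor
        d.start_ts, d.end_ts = start, end              # completed descriptor
        descriptors.append(d)
    return descriptors
```

A process-local counter is enough to keep identifiers unique within one set; a distributed deployment would need a different scheme.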
[0045] Step 120: Construct a spatial association network of fluctuation source nodes based on the fluctuation source node identifier contained in each data fluctuation anomaly feature descriptor in the set of abnormal fluctuation feature descriptors. The spatial association network of fluctuation source nodes contains multiple fluctuation source nodes and spatial association edges connecting fluctuation source nodes that have spatial adjacency relationships.
[0046] The supercomputing center's emergency coordination and control system first parses all descriptors in the abnormal fluctuation feature descriptor set, extracts the fluctuation source node identifier from each descriptor, queries the pre-stored monitoring node deployment location database, and obtains the three-dimensional spatial coordinate parameters of each fluctuation source node. The reference coordinate system for the coordinate parameters is the global unified coordinate system of the supercomputing center's computer room. All nodes' coordinates are in the same coordinate system, ensuring the accuracy of spatial distance calculation.
[0047] The system traverses all fluctuation source nodes in pairs, calculates the spatial distance between each pair of nodes, and determines whether there is a spatial adjacency relationship between the nodes. If there is, it generates the corresponding spatial association edge. Finally, it constructs a spatial association network of fluctuation source nodes by taking all fluctuation source nodes as network nodes and all spatial association edges as network edges. This network can intuitively reflect the spatial distribution characteristics of abnormal fluctuations and the association relationship between nodes.
[0048] Step 121: Parse the fluctuation source node identifier contained in each data fluctuation anomaly feature descriptor in the abnormal fluctuation feature descriptor set, and query the node deployment location database of the monitoring node array deployed in the IoT sensing layer according to the fluctuation source node identifier to obtain the node spatial coordinate parameters corresponding to each fluctuation source node.
[0049] The node deployment location database pre-stores static attribute information for all IoT monitoring nodes, including node identifiers, node spatial coordinate parameters, region, monitoring parameter type, and deployment time. The node spatial coordinate parameters are three-dimensional coordinates, including x, y, and height coordinates, accurate to the centimeter level, accurately reflecting the actual location of the node within the data center. The system parses each descriptor in the abnormal fluctuation feature descriptor set, extracts the fluctuation source node identifier, and uses this identifier as the query key to access the node deployment location database to retrieve the corresponding node spatial coordinate parameters. If an identifier cannot be matched, the fluctuation source node is marked as an unknown location node and will not participate in the spatial association network construction.
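The lookup and the handling of unmatched identifiers described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the in-memory dictionary standing in for the node deployment location database, the function name resolve_coordinates, and the sample identifiers and coordinates are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]  # (x, y, height); meters assumed

def resolve_coordinates(
    node_ids: List[str],
    location_db: Dict[str, Coord],
) -> Tuple[Dict[str, Coord], List[str]]:
    """Split identifiers into located nodes and unknown-location nodes."""
    located: Dict[str, Coord] = {}
    unknown: List[str] = []
    for node_id in node_ids:
        coord = location_db.get(node_id)
        if coord is None:
            # No record found: mark as unknown location, exclude from the network.
            if node_id not in unknown:
                unknown.append(node_id)
        else:
            located[node_id] = coord
    return located, unknown

# Hypothetical sample data (coordinates given to centimeter precision).
location_db = {"N01": (1.20, 3.45, 0.80), "N02": (1.80, 3.45, 0.80)}
located, unknown = resolve_coordinates(["N01", "N02", "N99"], location_db)
```

Only the entries in located would go on to the pairwise distance calculation; the identifiers in unknown are set aside.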
[0050] Step 122: Randomly select two different data fluctuation anomaly feature descriptors from the set of abnormal fluctuation feature descriptors as the first candidate feature descriptor and the second candidate feature descriptor.
[0051] The system employs a combinatorial traversal approach to select feature descriptor pairs. First, all descriptors in the abnormal fluctuation feature descriptor set are sorted lexicographically by their identifiers. Then, following the sorted order, each descriptor in turn is selected as the first candidate feature descriptor, and each descriptor that follows it in the sorted order is selected as the second candidate feature descriptor. This ensures that every combination of two distinct feature descriptors is selected exactly once, avoiding both duplicate calculations and omissions. Each selected pair of candidate feature descriptors corresponds to two different fluctuation source nodes, ensuring that the calculation represents the spatial distance between two different nodes. The selected first and second candidate feature descriptors are then input into the node identifier extraction stage to obtain the corresponding node coordinate parameters.
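One way to realize this sorted pairwise traversal is with the standard library's itertools.combinations, which yields each unordered pair exactly once. The sketch below is illustrative only; the function name descriptor_pairs and the sample identifiers are assumptions.

```python
from itertools import combinations
from typing import List, Tuple

def descriptor_pairs(descriptor_ids: List[str]) -> List[Tuple[str, str]]:
    """Return every unordered pair of distinct descriptor identifiers once."""
    ordered = sorted(descriptor_ids)  # lexicographic order by identifier
    return list(combinations(ordered, 2))

# Hypothetical descriptor identifiers.
pairs = descriptor_pairs(["D3", "D1", "D2"])
```

For N identifiers the list holds N×(N-1)/2 pairs, which is 3 pairs for the three sample identifiers.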
[0052] Step 123: Obtain the fluctuation source node identifier corresponding to the first candidate feature descriptor as the first candidate node identifier, and obtain the fluctuation source node identifier corresponding to the second candidate feature descriptor as the second candidate node identifier.
[0053] The system parses the structures of the first and second candidate feature descriptors, extracting the fluctuation source node identifier field, which serves as the first and second candidate node identifiers, respectively. These two identifiers correspond to two different fluctuation source nodes. The system performs a validity check on the extracted identifiers to confirm that a corresponding record exists in the node deployment location database. If either identifier does not have a corresponding record, the system skips that pair of candidate feature descriptors and continues to select the next pair. The valid first and second candidate node identifiers will be used as the query keys for querying the node's spatial coordinate parameters.
[0054] Step 124: Query the node deployment location database based on the first candidate node identifier and the second candidate node identifier to obtain the spatial coordinate parameters of the first candidate node and the spatial coordinate parameters of the second candidate node.
[0055] The system uses the identifiers of the first and second candidate nodes as query keys to access the node deployment location database. It retrieves the 3D spatial coordinates of the first candidate node and the second candidate node, respectively. Both coordinate parameters belong to the same global coordinate system and can be directly used for distance calculation without coordinate transformation. The system verifies the completeness of the two coordinate parameters, confirming that they contain valid values for the x-coordinate, y-coordinate, and height coordinates. If any coordinate parameters are missing, the system skips that pair of candidate nodes and proceeds to the next pair of candidate feature descriptors. The two verified node spatial coordinates are then input into the spatial distance calculation process.
[0056] Step 125: Calculate the straight-line distance between the spatial coordinate parameters of the first candidate node and the spatial coordinate parameters of the second candidate node as the spatial distance parameter between nodes.
[0057] The system calculates the straight-line distance between two nodes from their 3D spatial coordinate parameters. The differences between the two nodes' x-coordinates, y-coordinates, and height coordinates are each squared, the three squares are summed, and the square root of the sum is taken as the spatial distance parameter between the nodes, which reflects the actual distance between the two fluctuation source nodes in physical space. All coordinate parameters are expressed in meters, ensuring consistent units. To avoid distance calculation errors caused by coordinate inconsistencies, the system pre-sets a minimum distance threshold. If the calculated spatial distance parameter is less than this minimum threshold, the two nodes are considered to be deployed in overlapping locations, and the minimum distance threshold is used in the subsequent comparison instead.
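The distance calculation and the minimum-distance clamp can be sketched as below. The value 0.01 m for the minimum distance threshold and the function name spatial_distance are assumptions chosen for illustration; the source does not state a value.

```python
import math
from typing import Tuple

Coord = Tuple[float, float, float]  # (x, y, height) in meters

MIN_DISTANCE_M = 0.01  # hypothetical minimum distance threshold

def spatial_distance(a: Coord, b: Coord, min_distance: float = MIN_DISTANCE_M) -> float:
    """Euclidean distance between two nodes, clamped to the minimum threshold."""
    d = math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 + (a[2] - b[2]) ** 2)
    # Nodes closer than the minimum are treated as overlapping deployments.
    return max(d, min_distance)

d_example = spatial_distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))   # 5.0
d_overlap = spatial_distance((1.0, 1.0, 1.0), (1.0, 1.0, 1.0))   # clamped to 0.01
```

Clamping also guarantees a strictly positive distance, which matters if an edge weight is later derived by dividing by the distance.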
[0058] Step 126: Based on the comparison result between the spatial distance parameter between the nodes and the preset threshold, determine the fluctuation source node pairs with spatial adjacency and generate the corresponding spatial association edges.
[0059] The preset spatial association distance threshold is set in advance based on the physical layout of the supercomputing center's computer room, the coverage of monitoring nodes, and the spatial propagation characteristics of abnormal events. The threshold value corresponds to the maximum spatial distance at which an abnormal event may have an associated effect. If the distance between two nodes is less than this threshold, the abnormal fluctuations of the two nodes are considered potentially spatially associated. The system compares the calculated spatial distance parameter between nodes with the preset spatial association distance threshold. Based on the comparison result, it determines whether the two nodes have a spatial adjacency relationship. If so, it generates a corresponding spatial association edge. After all node pairs are processed, all the fluctuation source nodes and spatial association edges obtained jointly constitute the spatial association network of fluctuation source nodes. This network serves as the input for parsing the emergency task triggering conditions.
[0060] Step 1261: Determine whether the spatial distance parameter between the nodes is less than the preset spatial association distance threshold. If the determination result is yes, then determine the fluctuation source node corresponding to the first candidate node identifier and the fluctuation source node corresponding to the second candidate node identifier as a fluctuation source node pair with spatial adjacency.
[0061] The preset spatial association distance threshold is set differently based on the equipment density of different areas. For areas with dense computing node racks, the threshold is set smaller, while for areas with large equipment such as power supply and cooling systems, the threshold is set larger, ensuring that the determination of spatial adjacency matches the actual situation of different areas. The system compares the spatial distance parameters between nodes in each candidate node pair. If the parameter value is less than the spatial association distance threshold for the corresponding area, the two nodes are determined to be spatially adjacent and marked as a fluctuation source node pair. If the parameter value is greater than or equal to the threshold, the two nodes are determined not to be spatially adjacent, and no association edge is generated. For fluctuation source node pairs determined to have a spatial adjacency, the system records the identifiers of the two nodes as the basis for generating spatial association edges.
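A minimal sketch of the region-dependent adjacency test follows. The region names, threshold values, and the rule for node pairs that span two regions (use the smaller of the two thresholds) are assumptions; the source only states that dense rack areas use a smaller threshold and large-equipment areas a larger one.

```python
from typing import Dict

# Hypothetical per-region spatial association distance thresholds, in meters.
REGION_THRESHOLDS_M: Dict[str, float] = {
    "compute_rack": 2.0,    # dense racks: smaller threshold
    "power_cooling": 8.0,   # large equipment: larger threshold
}

def is_adjacent(
    distance_m: float,
    region_a: str,
    region_b: str,
    thresholds: Dict[str, float] = REGION_THRESHOLDS_M,
) -> bool:
    """True if the distance is strictly below the applicable threshold."""
    # Assumption: for a cross-region pair, apply the stricter (smaller) threshold.
    threshold = min(thresholds[region_a], thresholds[region_b])
    return distance_m < threshold

same_rack = is_adjacent(1.5, "compute_rack", "compute_rack")
at_limit = is_adjacent(2.0, "compute_rack", "compute_rack")
```

The strict comparison mirrors the text: a distance equal to the threshold does not produce an association edge.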
[0062] Step 1262: Generate a spatial association edge connecting the fluctuation source node corresponding to the first candidate node identifier and the fluctuation source node corresponding to the second candidate node identifier for the fluctuation source node pair with spatial adjacency.
[0063] The generated spatial association edges are undirected, meaning the association between two nodes has no directional difference. Each spatial association edge contains three core attributes: edge identifier, source node identifier, and target node identifier. The source node identifier and target node identifier are the identifiers of the two nodes in the fluctuation source node pair, respectively. The edge identifier is a unique identifier assigned by the system to distinguish different spatial association edges. To facilitate network topology feature analysis, the system also adds a weight attribute to each spatial association edge. The weight value is inversely proportional to the spatial distance parameter between nodes; that is, the smaller the distance between two nodes, the greater the weight of the association edge, reflecting a higher degree of spatial association between the two nodes. The generated spatial association edges are temporarily stored in an association edge set and used to construct the complete association network after all node pairs have been processed.
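An edge record with the attributes listed above might look like the following sketch. The class name SpatialEdge, the proportionality constant k = 1.0, and the choice to store the two identifiers in sorted order (so an undirected edge has one canonical form) are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpatialEdge:
    edge_id: str
    source_id: str
    target_id: str
    weight: float

def make_edge(edge_id: str, node_a: str, node_b: str,
              distance_m: float, k: float = 1.0) -> SpatialEdge:
    """Build an undirected edge whose weight is inversely proportional to distance."""
    if distance_m <= 0:
        raise ValueError("distance must be positive (clamp it to a minimum first)")
    # Undirected edge: store endpoints in a canonical (sorted) order.
    source_id, target_id = sorted((node_a, node_b))
    return SpatialEdge(edge_id, source_id, target_id, k / distance_m)

near = make_edge("E1", "N02", "N01", 2.0)
far = make_edge("E2", "N01", "N03", 4.0)
```

Here near.weight is 0.5 and far.weight is 0.25, so the closer pair carries the larger weight.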
[0064] Step 1263: Continue to select two other different data fluctuation anomaly feature descriptors from the set of abnormal fluctuation feature descriptors as new first candidate feature descriptors and second candidate feature descriptors, until all different feature descriptor combinations in the set of abnormal fluctuation feature descriptors have been processed.
[0065] The system processes all pairwise combinations sequentially according to the pre-sorted feature descriptors. After processing each candidate feature descriptor pair, it records the processing status to avoid duplicate processing. If the set of abnormal fluctuation feature descriptors contains N data fluctuation abnormal feature descriptors, a total of N×(N-1) / 2 sets of feature descriptor combinations need to be processed (i.e., the number of combinations of arbitrarily selecting 2 elements from N elements), ensuring that all possible node pairs are covered. If duplicate node pair combinations occur during processing, the system will automatically skip them to avoid generating duplicate spatial association edges. After all combinations are processed, the system stops selecting candidate feature descriptor pairs and enters the association network construction stage.
[0066] Step 1264: Combine all the fluctuation source nodes and all the generated spatial association edges to construct a fluctuation source node spatial association network with fluctuation source nodes as network nodes and spatial association edges as network connections.
[0067] The system treats all fluctuation source nodes as the set of network nodes. Each node retains attributes such as its fluctuation source node identifier, node spatial coordinate parameters, data fluctuation amplitude parameter, fluctuation start time identifier, and fluctuation end time identifier. All generated spatial association edges are treated as the set of network edges, with each edge retaining its edge identifier, source node identifier, target node identifier, and weight attribute. The constructed spatial association network of fluctuation source nodes is stored as an adjacency list. Each node corresponds to a list of adjacent nodes, which stores the identifiers of all nodes adjacent to it along with the corresponding edge weights. This facilitates quick lookup of a node's neighbors and calculation of network topology features. The spatial association network of fluctuation source nodes is then used for parsing and processing emergency task triggering conditions.
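The adjacency-list storage can be sketched as a dictionary from node identifier to a list of (neighbor identifier, edge weight) entries. The function name build_adjacency and the tuple layout of the edge input are assumptions for this example.

```python
from typing import Dict, Iterable, List, Tuple

Adjacency = Dict[str, List[Tuple[str, float]]]

def build_adjacency(
    node_ids: Iterable[str],
    edges: Iterable[Tuple[str, str, float]],  # (source_id, target_id, weight)
) -> Adjacency:
    """Build an undirected adjacency list; isolated nodes keep an empty list."""
    adjacency: Adjacency = {node_id: [] for node_id in node_ids}
    for source_id, target_id, weight in edges:
        # Undirected edge: record the neighbor on both endpoints.
        adjacency[source_id].append((target_id, weight))
        adjacency[target_id].append((source_id, weight))
    return adjacency

# Hypothetical network: N01-N02 and N02-N03 are adjacent, N04 is isolated.
adjacency = build_adjacency(
    ["N01", "N02", "N03", "N04"],
    [("N01", "N02", 0.5), ("N02", "N03", 0.25)],
)
```

Looking up adjacency["N02"] returns both of its neighbors with their weights in one step, which is the quick neighbor lookup the paragraph refers to.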
[0068] Step 130: Input the spatial association network of the fluctuation source node into the supercomputing center resource scheduling platform for emergency task trigger condition parsing processing, and generate an emergency response task request instruction that matches the network topology characteristics of the spatial association network of the fluctuation source node.
[0069] The supercomputing center resource scheduling platform is the core platform used by the supercomputing center for unified management of computing resources and scheduling of task execution. It has functions such as task priority management, dynamic resource allocation, and multi-task collaborative scheduling. The supercomputing center emergency coordination and control system sends the completed spatial association network of fluctuation source nodes to the supercomputing center resource scheduling platform. The platform first extracts and analyzes the node attributes and topology characteristics of the network and, in conjunction with preset emergency response level rules, assesses the comprehensive impact of the abnormal event, determines the corresponding emergency response task level and task requirements, and finally generates an emergency response task request instruction containing the emergency task requirements and the association network data. This instruction triggers the available computing node screening process, which allocates appropriate computing resources for the emergency response task.
[0070] Step 131: Parse all the fluctuation source node identifiers contained in the spatial association network of fluctuation source nodes, query the preset node importance classification database according to the fluctuation source node identifiers, and obtain the node importance level parameter corresponding to each fluctuation source node.
[0071] The node importance classification database pre-stores the importance level information of all IoT monitoring nodes. Node importance levels are assigned according to the importance of the objects monitored. Nodes monitoring core power supply equipment, core cooling equipment, and core computing clusters have higher importance levels, while nodes monitoring ordinary auxiliary areas and non-critical equipment have lower importance levels. The level parameter is a continuous numerical value, with higher values indicating higher node importance. The system accesses the node importance classification database using the fluctuation source node identifier as the query key to retrieve the node importance level parameter corresponding to each fluctuation source node. If an identifier cannot be matched, the node's importance level defaults to the lowest level.
[0072] Step 132: Perform network topology feature extraction processing on the spatial association network of the fluctuation source nodes, and calculate the network diameter parameter and network connectivity parameter of the spatial association network of the fluctuation source nodes.
[0073] The network diameter parameter is the largest shortest-path length between any two connected nodes in the spatial association network of fluctuation source nodes, reflecting the spatial range affected by the abnormal event. A larger network diameter indicates a wider impact range of the abnormal event. The network connectivity parameter is the number of connected components in the network, reflecting the degree of concentration of the abnormal event's distribution. A smaller number of connected components indicates a more concentrated distribution of abnormal nodes and a higher degree of association. The system uses a breadth-first search algorithm to calculate the shortest path between all node pairs and selects the maximum value as the network diameter parameter. A disjoint-set (union-find) data structure is used to partition the nodes in the network into connected components, and the number of connected components is counted as the network connectivity parameter.
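The two topology features can be sketched as follows. This sketch makes two assumptions the source does not state: shortest paths are measured in hops (edge weights are ignored, which is what plain breadth-first search computes), and when the network has several connected components the diameter is the largest value found within any single component. Function names are illustrative.

```python
from collections import deque
from typing import Dict, List

Graph = Dict[str, List[str]]  # node id -> neighbor ids (undirected)

def bfs_hops(graph: Graph, start: str) -> Dict[str, int]:
    """Hop distance from start to every node reachable from it."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def network_diameter(graph: Graph) -> int:
    """Largest shortest-path length (in hops) over all reachable node pairs."""
    return max((max(bfs_hops(graph, n).values()) for n in graph), default=0)

def count_components(graph: Graph) -> int:
    """Number of connected components, using a disjoint-set (union-find)."""
    parent = {n: n for n in graph}

    def find(n: str) -> str:
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a in graph:
        for b in graph[a]:
            parent[find(a)] = find(b)  # union the two endpoints' sets
    return len({find(n) for n in graph})

# Hypothetical network: a chain A-B-C plus an isolated node D.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}
diameter = network_diameter(graph)
components = count_components(graph)
```

For the sample chain, the longest shortest path is A to C (2 hops), and the isolated node D forms a second component.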
[0074] Step 133: Count the number of all fluctuation source nodes in the spatial association network of the fluctuation source nodes as the parameter of the total number of nodes affected by the abnormal event.
[0075] The total number of nodes affected by the abnormal event is obtained by directly counting the elements in the node set of the spatial association network of fluctuation source nodes. This parameter reflects the total number of monitoring nodes affected by the abnormal event. The larger the number, the greater the scope of the abnormal event's impact.
[0076] During counting, nodes previously marked as unknown location nodes are excluded, and only fluctuation source nodes with valid spatial coordinate parameters are counted. If the spatial association network of fluctuation source nodes contains multiple connected components, the sum of the number of nodes in all connected components is used as the total number of nodes affected by the abnormal event. This parameter serves as one of the basic parameters for calculating the comprehensive impact coefficient of the abnormal event.
[0077] Step 134: Perform a weighted summation operation on the total number of nodes affected by the abnormal event based on the node importance level parameter corresponding to each fluctuation source node to generate the comprehensive impact coefficient of the abnormal event, and count the number of all spatial association edges in the spatial association network of the fluctuation source node as the total number of association relationships between nodes.
[0078] The comprehensive impact coefficient of an abnormal event is calculated by summing the node importance level parameters of each fluctuation source node. Nodes with higher importance levels contribute more to the comprehensive impact. This coefficient comprehensively reflects the severity of the abnormal event's impact; a higher coefficient value indicates greater harm. The total number of inter-node relationships directly counts the number of elements in the edge set of the spatial association network of the fluctuation source nodes. This parameter reflects the tightness of the association between abnormal nodes; a larger number indicates stronger spatial association between abnormal nodes and a higher probability of the abnormal event spreading.
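Read literally, the coefficient is the sum of the importance level parameters over all fluctuation source nodes, and the relationship total is the size of the edge set. The sketch below follows that reading; the numeric value used for the lowest importance level (applied to unmatched identifiers) and the function names are assumptions.

```python
from typing import Dict, Iterable, List, Tuple

LOWEST_LEVEL = 1.0  # hypothetical value of the lowest importance level

def impact_coefficient(
    node_ids: Iterable[str],
    importance_db: Dict[str, float],
    default_level: float = LOWEST_LEVEL,
) -> float:
    """Sum of importance levels; unmatched nodes count at the lowest level."""
    return sum(importance_db.get(node_id, default_level) for node_id in node_ids)

def relationship_count(edges: List[Tuple[str, str]]) -> int:
    """Total number of spatial association edges in the network."""
    return len(edges)

# Hypothetical data: N03 has no record in the importance database.
importance_db = {"N01": 3.0, "N02": 1.5}
coefficient = impact_coefficient(["N01", "N02", "N03"], importance_db)
relations = relationship_count([("N01", "N02"), ("N02", "N03")])
```

With these sample values the coefficient is 3.0 + 1.5 + 1.0 = 5.5 and the relationship total is 2.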
[0079] Step 135: Determine the corresponding emergency response task level and template based on the comprehensive impact coefficient of the abnormal event, the total number of inter-node relationships, and the network diameter parameter, and generate an emergency response task request instruction containing the encoded data of the spatial association network.
[0080] The system takes the combination of these three core parameters, queries the preset emergency response level mapping rules, and determines the corresponding emergency response task level. Different levels correspond to different task priorities, resource requirements, and response processes. The higher the level, the more computing resources need to be mobilized and the higher the task execution priority.
[0081] After determining the emergency response task level, the system matches the corresponding emergency task template. The template predefines the types and quantities of computing resources required for the emergency task at that level, the task execution process, data processing requirements, etc. The spatial association network of the fluctuation source node is serialized and encoded and added to the additional data field of the template. Finally, a complete emergency response task request instruction is generated, which will serve as a trigger signal for querying the running status of the supercomputing node.
[0082] Step 1351: Based on the comprehensive impact coefficient of the abnormal event, the total number of inter-node relationships, and the network diameter parameter, query the preset emergency response level mapping table to obtain the emergency response task level parameter that matches the combination of the comprehensive impact coefficient of the abnormal event, the total number of inter-node relationships, and the network diameter parameter.
[0083] The emergency response level mapping table is defined in advance based on the analysis of historical anomalies and expert assessments. The table stores the emergency response task level parameters corresponding to different parameter combinations. These parameters are divided into multiple levels, each corresponding to a specific parameter range. The system matches the calculated comprehensive impact coefficient of the abnormal event, the total number of inter-node relationships, and the network diameter parameter against the parameter ranges in the mapping table and looks up the emergency response task level parameter of the exactly matching parameter combination. If no exactly matching combination exists, the higher-level parameter with the closest parameter range is selected as the matching result, ensuring that the emergency response level covers the actual impact of the abnormal event.
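One possible reading of this lookup is sketched below. It is an interpretation, not the patent's table format: each parameter is mapped to a level through its own ascending range boundaries, an exact match is the case where all three parameters fall in the same level, and otherwise the highest of the three levels is taken so the response is never rated below any single parameter. All boundary values and names are hypothetical.

```python
from bisect import bisect_right
from typing import List

# Hypothetical lower bounds for levels 2 and 3 (values below the first bound are level 1).
IMPACT_BOUNDS: List[float] = [5.0, 15.0]
RELATION_BOUNDS: List[float] = [3, 10]
DIAMETER_BOUNDS: List[float] = [2, 4]

def level_for(value: float, bounds: List[float]) -> int:
    """Level whose range contains value, given ascending lower bounds."""
    return 1 + bisect_right(bounds, value)

def response_level(impact: float, relations: int, diameter: int) -> int:
    """Exact match if all three levels agree; otherwise fall back to the higher level."""
    levels = (
        level_for(impact, IMPACT_BOUNDS),
        level_for(relations, RELATION_BOUNDS),
        level_for(diameter, DIAMETER_BOUNDS),
    )
    return max(levels)

low = response_level(4.0, 2, 1)       # all three parameters in level 1
mixed = response_level(4.0, 12, 1)    # relations alone reach level 3
```

Taking the maximum is a conservative choice: it keeps the selected level at or above what each individual parameter would require.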
[0084] Step 1352: Query the preset emergency task template database according to the emergency response task level parameter, and obtain the emergency task template corresponding to the emergency response task level parameter. The emergency task template includes a required computing resource type field, a required computing resource quantity field, and a task execution priority field.
[0085] The emergency task template database pre-stores standardized task templates corresponding to different emergency response levels. Each template contains fixed fields such as the required computing resource type, the required quantity of computing resources, the task execution priority, data processing logic, and output requirements. The required computing resource type specifies the configuration requirements for resources such as CPU, memory, storage, and network; the required quantity of computing resources specifies the number of supercomputing nodes needed; and the task execution priority specifies the scheduling priority of this emergency task relative to other regular tasks, with higher-priority tasks receiving priority in resource allocation. The system accesses the emergency task template database using the emergency response task level parameter as the query key to retrieve the corresponding emergency task template. All fields in the template serve as the basis for generating emergency response task request instructions.
[0086] Step 1353: After serializing and encoding the spatial association network of the fluctuation source node, add it to the additional data field of the emergency task template to generate an emergency response task request instruction containing the encoded data of the spatial association network of the fluctuation source node and the content of all fields in the emergency task template.
[0087] The system uses a common structured data encoding format to serialize and encode the spatial association network of the fluctuation source nodes. The encoding process preserves all node attributes, edge attributes and topological information of the network. The encoded data can be completely restored to the original association network structure through decoding.
[0088] After encoding is completed, the system writes the encoded data into the additional data field of the emergency task template, and at the same time fills in all the fixed field contents of the template to generate a complete emergency response task request instruction. The instruction contains all necessary information such as the task unique identifier, task type, emergency response level, resource requirements, priority, and associated network encoded data.
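The encoding and instruction assembly can be sketched with JSON standing in for the unspecified structured data encoding format. The choice of JSON, the field names (task_id, additional_data, and the template keys), and the sample values are all assumptions for illustration.

```python
import json
from typing import Any, Dict

def encode_network(network: Dict[str, Any]) -> str:
    """Serialize nodes and edges to a JSON string."""
    return json.dumps(network, sort_keys=True)

def decode_network(encoded: str) -> Dict[str, Any]:
    """Restore the network structure from its encoded form."""
    return json.loads(encoded)

def build_request(template: Dict[str, Any], encoded: str, task_id: str) -> Dict[str, Any]:
    """Copy the template fields and attach the identifier and encoded network."""
    request = dict(template)  # leave the stored template unmodified
    request["task_id"] = task_id
    request["additional_data"] = encoded
    return request

# Hypothetical network and template contents.
network = {
    "nodes": [{"id": "N01", "coord": [1.2, 3.45, 0.8], "amplitude": 4.1}],
    "edges": [{"id": "E1", "source": "N01", "target": "N02", "weight": 0.5}],
}
template = {"resource_type": "cpu", "resource_quantity": 4, "priority": 2}
request = build_request(template, encode_network(network), "TASK-0001")
```

Decoding request["additional_data"] returns a structure equal to the original network, which is the lossless round trip the text requires.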
[0089] Step 140: Based on the emergency response task request instruction, query the real-time operating status database of the supercomputing center computing node cluster, obtain the node resource occupancy status vector and node task execution progress vector of each supercomputing node in the supercomputing center computing node cluster, and perform available node screening processing on the supercomputing center computing node cluster according to the node resource occupancy status vector and node task execution progress vector to obtain a set of candidate supercomputing nodes that are qualified to take over the emergency task.
[0090] The supercomputing center's computing node cluster consists of a large number of heterogeneous computing nodes. Each node has different hardware configurations and real-time operating status. The real-time operating status database continuously collects real-time operating status data from all computing nodes, including resource usage, task execution status, and health status. The data update frequency meets the real-time scheduling requirements.
[0091] Based on the resource requirements in the emergency response task request instruction, the supercomputing center emergency coordination and control system first queries the real-time operation status database of all target supercomputing centers to obtain the node resource occupancy status vector and the node task execution progress vector for each supercomputing node. The node resource occupancy status vector reflects the current hardware resource idleness of the node, and the node task execution progress vector reflects the current task load of the node. Combining the parameters of the two vectors, the system filters out supercomputing nodes that meet the requirements for emergency task operation through multi-level screening conditions, forming a candidate supercomputing node group set. The nodes in this set have sufficient idle resources and task undertaking capacity to undertake the execution of emergency response tasks.
[0092] Step 141: Parse the emergency response task level parameter in the emergency response task request instruction, determine the list of target supercomputing center identifiers that need to participate in the emergency response based on the emergency response task level parameter, and access the real-time operation status database interface of the corresponding supercomputing center resource scheduling platform for each target supercomputing center identifier in the target supercomputing center identifier list.
[0093] Different emergency response task levels correspond to different resource scheduling ranges. Lower-level emergency tasks only need to mobilize resources from the local supercomputing center, while higher-level emergency tasks need to mobilize resources from multiple associated supercomputing centers. The system first parses the emergency response task level parameter in the emergency response task request instruction, determines the supercomputing centers that need to participate in the emergency response according to the preset scheduling range rules, and generates a list of target supercomputing center identifiers. Each identifier in the list corresponds to an independent supercomputing center resource scheduling platform.
[0094] Based on the identifier of each target supercomputing center, the system establishes a secure connection with the real-time operational status database interface of the corresponding supercomputing center's resource scheduling platform through a pre-configured interface access address and authentication information. The connection process employs two-way authentication to ensure data access security. After the connection is established, the system will query the operational status data of the computing nodes of the corresponding supercomputing center through this interface.
[0095] Step 142: Send a node status query request instruction to the real-time running status database interface corresponding to each target supercomputing center identifier. The node status query request instruction includes the node status data type identifier to be obtained.
[0096] The node status query request instruction includes fields such as a unique request identifier, a request timestamp, a target supercomputing center identifier, and the node status data type identifier to be obtained. The node status data type identifier specifies the types of node status data to be queried, including CPU utilization, memory utilization, storage input/output status, network bandwidth utilization, current task list, and remaining task execution time. These correspond one-to-one with the data types required to generate the node resource occupancy status vector and the node task execution progress vector.
[0097] The system generates a separate node status query request instruction for the real-time operating status database interface of each target supercomputing center. The unique request identifier in each instruction distinguishes different query requests and makes it possible to match response data to the request. After sending the request instruction, the system enters a waiting state until the corresponding interface returns the node status data.
[0098] Step 143: Receive the node status response data stream returned by the real-time running status database interface corresponding to each target supercomputing center identifier. The node status response data stream contains the node status data packet of each supercomputing node in the supercomputing center computing node cluster corresponding to the target supercomputing center identifier.
[0099] After receiving a node status query request, the real-time operating status database interface verifies the validity of the request. If the verification succeeds, it extracts the latest status data of all computing nodes from the real-time operating status database and packages the status data of each computing node into an independent node status data packet. Each data packet contains a node identifier, a data collection timestamp, and all requested status data fields. All node status data packets are aggregated to form a node status response data stream, which is returned to the requester.
[0100] After receiving the node status response data stream, the system discards data packets that were damaged in transmission or are otherwise invalid and stores the remaining data packets by node identifier, ensuring that the node status data of each supercomputing center is associated with the corresponding supercomputing center identifier.
[0101] Step 144: Parse the node status data packet of each supercomputing node, extract the instantaneous values of CPU core utilization, memory space utilization, storage device input / output queue depth, and network interface bandwidth utilization from the node status data packet of each supercomputing node, and arrange and combine the instantaneous values of CPU core utilization, memory space utilization, storage device input / output queue depth, and network interface bandwidth utilization according to the preset feature vector dimension order to generate the node resource utilization status vector of each supercomputing node.
[0102] The preset feature vector dimensions are fixed in the following order: instantaneous CPU core utilization, instantaneous memory space utilization, instantaneous storage device I / O queue depth, and instantaneous network interface bandwidth utilization. The parameter values for all four dimensions are normalized values between 0 and 1, reflecting the node's computing, memory, storage, and network resource usage, respectively. Higher values indicate higher resource utilization and fewer idle resources. The system parses the node status data packet of each supercomputing node, extracts the parameter values of the four dimensions sequentially, and arranges them in the fixed order to generate a four-dimensional node resource utilization status vector. Each vector corresponds to a unique supercomputing node identifier.
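A short sketch of the vector construction may help. The dimension order follows the paragraph above; the field names inside the packet are hypothetical, and the range check simply enforces the stated 0 to 1 normalization.

```python
# Fixed dimension order from the text; the packet field names are assumed.
RESOURCE_DIMS = ("cpu_util", "mem_util", "io_queue_depth", "net_bw_util")


def resource_vector(packet):
    """Return the four-dimensional utilization vector for one node packet."""
    vector = tuple(float(packet["fields"][dim]) for dim in RESOURCE_DIMS)
    if not all(0.0 <= value <= 1.0 for value in vector):
        raise ValueError("expected values normalized to [0, 1]")
    return vector


packet = {
    "node_id": "n1",
    # Fields arrive in arbitrary order; the vector order is always fixed.
    "fields": {"net_bw_util": 0.10, "cpu_util": 0.35,
               "io_queue_depth": 0.05, "mem_util": 0.50},
}
vectors = {packet["node_id"]: resource_vector(packet)}
```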
[0103] Step 145: Parse the node status data packet of each supercomputing node and extract from it the list of currently executing tasks, which contains the task identifier and the estimated remaining execution time of each executing task. Sum the estimated remaining execution times of all tasks in the task list of each supercomputing node to obtain the total remaining execution time parameter of the node tasks of each supercomputing node.
[0104] The list of currently executing tasks in the node status data packet records information on all incomplete regular tasks of the node. Each task entry includes fields such as task identifier, task priority, and estimated remaining execution time. The estimated remaining execution time is calculated by the supercomputing center's resource scheduling platform from historical execution data and current execution progress, and is expressed in seconds. The system sums the estimated remaining execution times of all tasks in the task list of each supercomputing node. The resulting total remaining execution time parameter reflects the total time required for the node to complete all current regular tasks. A larger parameter value indicates a heavier current task load and a weaker ability to handle new tasks. This parameter, together with the node task concurrency parameter, constitutes the node task execution progress vector.
[0105] Step 146: Count the number of tasks in the task list of each supercomputing node as the node task concurrency parameter of each supercomputing node. Arrange and combine the total remaining execution time parameter of the node tasks and the node task concurrency parameter according to the preset progress vector dimension order to generate the node task execution progress vector of each supercomputing node.
[0106] The preset progress vector dimensions are fixed as follows: the total remaining execution time of node tasks and the number of concurrent node tasks. These two parameters reflect the total task load and the number of concurrent tasks on the node, respectively. The number of concurrent node tasks represents the total number of tasks currently being executed on the node; a larger value indicates higher task scheduling overhead and a weaker ability to handle new parallel tasks. The system generates a two-dimensional node task execution progress vector by arranging these two parameters in a fixed order, with each vector corresponding to a unique supercomputing node identifier.
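As an illustration of Steps 145 and 146, the two-dimensional progress vector can be derived from a task list as shown below. The `remaining_s` field name is an assumption; the text only states that each entry carries an estimated remaining execution time in seconds.

```python
def progress_vector(task_list):
    """Return (total remaining seconds, number of concurrent tasks)."""
    total_remaining = sum(task["remaining_s"] for task in task_list)
    return (total_remaining, len(task_list))


tasks = [
    {"task_id": "t1", "priority": 2, "remaining_s": 120},
    {"task_id": "t2", "priority": 1, "remaining_s": 45},
]
vec = progress_vector(tasks)  # first dimension: total time, second: concurrency
```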
[0107] Step 147: Analyze the node resource utilization status vector of each supercomputing node and extract the instantaneous value of the CPU core utilization of each supercomputing node as the first screening index parameter.
[0108] The first screening index parameter is the first dimension of the node resource utilization status vector, reflecting the current utilization of the node's CPU resources. It is the core indicator for determining whether a node has sufficient computing resources to handle emergency tasks. Emergency tasks typically require a large amount of computing resources; if the CPU core utilization is too high, the execution performance of the emergency task cannot be guaranteed. The system analyzes the node resource utilization status vectors of all supercomputing nodes and extracts the instantaneous value of the CPU core utilization of each node as the first screening index parameter. This parameter is used for comparison and judgment under the first screening condition.
[0109] Step 148: Compare the first screening index parameter with the preset CPU idle threshold, and select supercomputing nodes whose first screening index parameter is lower than the CPU idle threshold as a set of supercomputing nodes that pass the first screening condition. Analyze the node resource utilization status vector of each supercomputing node that passes the first screening condition, and extract the instantaneous value of the memory space utilization of each supercomputing node as the second screening index parameter.
[0110] The preset CPU idle threshold is set based on the computing resource requirements of emergency tasks. The threshold value corresponds to the minimum CPU idle ratio required by the emergency task, expressed as the highest CPU core utilization at which that idle ratio is still available. Only nodes with CPU core utilization below this threshold have sufficient computing resources to run the emergency task. The system compares the first screening index parameter of each supercomputing node with this threshold, retaining nodes with parameter values below the threshold to form a set of supercomputing nodes that pass the first screening condition and filtering out nodes with insufficient CPU resources. Subsequently, the system analyzes the node resource utilization status vector of the nodes that pass the first screening condition, extracting the instantaneous value of memory space utilization as the second screening index parameter. This parameter reflects the node's memory resource utilization and is used for the judgment of the second screening condition.
[0111] Step 149: Compare the second screening index parameter with the preset memory space free threshold, and select the supercomputing nodes whose second screening index parameter is lower than the memory space free threshold from the set of supercomputing nodes that pass the first screening condition as the set of supercomputing nodes that pass the second screening condition. Analyze the node task execution progress vector of each supercomputing node that passes the second screening condition, and extract the node task concurrency parameter of each supercomputing node as the third screening index parameter.
[0112] The preset memory space free threshold is set based on the memory resource requirements of emergency tasks. The threshold value corresponds to the minimum memory free ratio required by the emergency task, expressed as the highest memory space utilization at which that free ratio is still available. Only nodes with memory space utilization below this threshold have sufficient memory resources to run the emergency task. The system compares the second screening index parameter of each node that passes the first screening condition with this threshold, retaining nodes with parameter values below the threshold to form a set of supercomputing nodes that pass the second screening condition and filtering out nodes with insufficient memory resources. Subsequently, the system analyzes the node task execution progress vector of the nodes that pass the second screening condition, extracting the node task concurrency parameter, which is the second dimension of the vector, as the third screening index parameter. This parameter reflects the current concurrent task load of the node and is used for the judgment of the third screening condition.
[0113] Step 1491: Compare the third screening index parameter with the preset upper limit threshold for the number of concurrent tasks. Select supercomputing nodes whose third screening index parameter is lower than the upper limit threshold for the number of concurrent tasks from the set of supercomputing nodes that have passed the second screening condition as the set of supercomputing nodes that have passed the third screening condition. Analyze the node task execution progress vector of each supercomputing node that has passed the third screening condition and extract the total remaining execution time parameter of the node task of each supercomputing node as the fourth screening index parameter.
[0114] The preset upper limit threshold for the number of concurrent tasks is set based on the hardware configuration of the supercomputing nodes and the performance requirements of emergency tasks. The threshold value corresponds to the maximum number of tasks a node can handle simultaneously. Only nodes with a number of concurrent tasks below this threshold can guarantee the scheduling efficiency and execution performance of emergency tasks. The system compares the third screening index parameter of each node that passes the second screening condition with this threshold, retaining nodes with parameter values below the threshold to form a set of supercomputing nodes that pass the third screening condition and filtering out nodes with excessive concurrent task load. Subsequently, the system parses the node task execution progress vector of the nodes that pass the third screening condition, extracting the total remaining execution time parameter of the node tasks, which is the first dimension of the vector, as the fourth screening index parameter. This parameter reflects the time required for the node to complete all current regular tasks and is used for the judgment of the fourth screening condition.
[0115] Step 1492: Compare the fourth screening index parameter with the preset task remaining time tolerance threshold, select the supercomputing nodes whose fourth screening index parameter is lower than the task remaining time tolerance threshold from the set of supercomputing nodes that have passed the third screening condition as the set of supercomputing nodes that have passed the fourth screening condition, mark all supercomputing nodes in the set of supercomputing nodes that have passed the fourth screening condition as candidate supercomputing nodes that are qualified to take over emergency tasks, and form the candidate supercomputing node group set by all candidate supercomputing nodes.
[0116] The preset task remaining time tolerance threshold is set based on the response time requirements of the emergency task. The threshold value corresponds to the maximum tolerable waiting time of the emergency task. Only nodes whose total remaining execution time is below this threshold can begin executing the emergency task within the required time, ensuring the timeliness of the emergency response. The system compares the fourth screening index parameter of each node that passes the third screening condition with this threshold, retaining nodes with parameter values below the threshold to form a set of supercomputing nodes that pass the fourth screening condition. Nodes in this set all meet the resource requirements, load requirements, and response time requirements of the emergency task and are marked as candidate supercomputing nodes qualified to take over the emergency task. All candidate supercomputing nodes are aggregated to form a candidate supercomputing node group set, which is used for the mapping and allocation of fluctuation source nodes and supercomputing nodes.
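The four screening conditions of Steps 147 to 1492 form a cascade in which each stage only examines the survivors of the previous one. The sketch below illustrates that structure; the `NodeState` and `Thresholds` types and all threshold values are illustrative assumptions, not values given in the text.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    node_id: str
    resource: tuple   # (cpu, mem, io_queue, net), each in [0, 1]
    progress: tuple   # (total_remaining_s, concurrency)


@dataclass
class Thresholds:     # illustrative values only
    cpu: float = 0.6
    mem: float = 0.7
    concurrency: int = 8
    remaining_s: float = 600.0


def screen(nodes, th):
    """Apply the four screening conditions in order; return the candidates."""
    stage1 = [n for n in nodes if n.resource[0] < th.cpu]
    stage2 = [n for n in stage1 if n.resource[1] < th.mem]
    stage3 = [n for n in stage2 if n.progress[1] < th.concurrency]
    stage4 = [n for n in stage3 if n.progress[0] < th.remaining_s]
    return stage4


nodes = [
    NodeState("n1", (0.30, 0.40, 0.1, 0.2), (120.0, 2)),
    NodeState("n2", (0.90, 0.20, 0.1, 0.2), (60.0, 1)),    # fails CPU
    NodeState("n3", (0.20, 0.95, 0.1, 0.2), (60.0, 1)),    # fails memory
    NodeState("n4", (0.20, 0.20, 0.1, 0.2), (60.0, 12)),   # fails concurrency
    NodeState("n5", (0.20, 0.20, 0.1, 0.2), (900.0, 3)),   # fails remaining time
]
candidates = [n.node_id for n in screen(nodes, Thresholds())]
```

Because every condition is a strict "lower than" comparison, a node sitting exactly on a threshold is rejected, which matches the wording of the steps.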
[0117] Step 150: Perform spatial proximity mapping and allocation processing on each fluctuation source node in the fluctuation source node spatial association network and the candidate supercomputing nodes in the candidate supercomputing node group set to generate an emergency coordination and control instruction set containing a node mapping relationship table and a task division sequence, and send the emergency coordination and control instruction set to the candidate supercomputing node group set to activate the parallel emergency response task execution operation.
[0118] The principle of spatial proximity mapping and allocation is to assign each fluctuation source node to the candidate supercomputing node closest to its physical location. This reduces data transmission latency, improves the processing efficiency of emergency tasks, and assigns adjacent fluctuation source nodes to the same supercomputing node wherever possible, reducing the collaboration overhead between nodes. The supercomputing center's emergency coordination and control system first obtains the spatial coordinates of each fluctuation source node and the rack position coordinates of each candidate supercomputing node, calculates the spatial distance between each pair, completes node mapping based on the principle of minimum distance, and generates a node mapping relationship table. Then, according to the topology of the spatial association network of fluctuation source nodes and the node mapping relationship, it divides the tasks that each supercomputing node needs to handle and generates a task division sequence. It combines the node mapping relationship table and the task division sequence to generate dedicated control instructions for each candidate supercomputing node, and compiles these into an emergency coordination and control instruction set. Finally, the instruction set is sent to all candidate supercomputing nodes to activate their parallel emergency response task execution, realizing multi-node collaborative processing of abnormal events.
[0119] Step 151: Parse the spatial association network of the fluctuation source nodes, obtain the node spatial coordinate parameters corresponding to each fluctuation source node in the spatial association network of the fluctuation source nodes, parse the node identifier of each candidate supercomputing node in the candidate supercomputing node group set, query the supercomputing center network topology database according to the node identifier, and obtain the rack position coordinate parameters of each candidate supercomputing node.
[0120] The supercomputing center's network topology database pre-stores static attribute information for all supercomputing nodes, including node identifier, rack, rack position coordinates, network access location, and hardware configuration. The rack position coordinates are three-dimensional coordinates in the same globally unified coordinate system as the IoT monitoring nodes, ensuring consistency in spatial distance calculations. The system first parses the node set of the spatial association network of the fluctuation source nodes, extracting the node spatial coordinate parameters for each fluctuation source node. Then, it parses the node identifier of each candidate supercomputing node in the candidate supercomputing node group set, using this identifier as the query key to access the supercomputing center's network topology database and retrieve the rack position coordinate parameters corresponding to each candidate supercomputing node.
[0121] Step 152: Calculate the spatial distance metric between each fluctuation source node and each candidate supercomputing node based on the node spatial coordinate parameters of each fluctuation source node and the rack position coordinate parameters of each candidate supercomputing node.
[0122] The spatial distance metric is calculated in the same way as the spatial distance between fluctuation source nodes: it is the straight-line distance between two points in three-dimensional coordinates. The calculation squares the differences in the horizontal, vertical, and height coordinates, sums the squares, and takes the square root. The result reflects the physical distance between the fluctuation source node and the candidate supercomputing node. Shorter distances indicate shorter data transmission paths and lower latency. All coordinate parameters are expressed in meters, ensuring dimensional consistency. For each fluctuation source node, the system calculates the spatial distance metric to every candidate supercomputing node, forming a set of distance metrics for that node. This set is used for target supercomputing node selection.
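The distance calculation is the ordinary three-dimensional Euclidean distance. A minimal sketch follows; the coordinate values and node identifiers are made up for illustration.

```python
import math


def spatial_distance(source_xyz, rack_xyz):
    """Straight-line distance in meters between two 3-D points."""
    dx, dy, dz = (a - b for a, b in zip(source_xyz, rack_xyz))
    return math.sqrt(dx * dx + dy * dy + dz * dz)


source = (1.0, 2.0, 0.5)                       # fluctuation source node
racks = {"n1": (4.0, 6.0, 0.5), "n2": (1.0, 2.0, 2.5)}
# Distance metric set for this source: one entry per candidate node.
distances = {node_id: spatial_distance(source, xyz)
             for node_id, xyz in racks.items()}
```

In Python 3.8 and later, `math.dist` computes the same quantity and could replace the hand-written function.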
[0123] Step 153: For each fluctuation source node, select the candidate supercomputing node with the smallest spatial distance metric value from the candidate supercomputing node group set as the mapping target supercomputing node of the fluctuation source node. Generate a node mapping relationship pair between each candidate supercomputing node and one or more fluctuation source nodes according to the mapping target supercomputing node corresponding to each fluctuation source node. The node mapping relationship table is composed of all node mapping relationship pairs.
[0124] The system sorts the distance metric set corresponding to each fluctuation source node in ascending order and selects the candidate supercomputing node that ranks first in the sorted list as the mapping target supercomputing node for that fluctuation source node. This ensures that each fluctuation source node is assigned to the nearest candidate supercomputing node. If multiple candidate supercomputing nodes share the same minimum spatial distance metric value for a fluctuation source node, the one with the lowest node resource utilization is selected as the mapping target to achieve load balancing. Each fluctuation source node and its corresponding mapping target supercomputing node form a node mapping relationship pair, which contains two fields: a fluctuation source node identifier and a supercomputing node identifier. All node mapping relationship pairs are aggregated to form a node mapping relationship table, which records all fluctuation source nodes that each candidate supercomputing node needs to process.
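The nearest-node selection with its tie-break can be sketched as below. The text does not say how a single "node resource utilization" figure is derived from the four-dimensional vector, so the scalar `load` per candidate is an assumption; the exact floating-point tie comparison is also a simplification, and a practical system might compare distances within a tolerance.

```python
import math
from collections import defaultdict


def map_sources(sources, candidates, load):
    """Map each source to its nearest candidate; break ties by lower load."""
    pairs = []
    for source_id, s_xyz in sources.items():
        target = min(
            candidates,
            key=lambda c: (math.dist(s_xyz, candidates[c]), load[c]),
        )
        pairs.append((source_id, target))
    return pairs


def by_candidate(pairs):
    """Invert the mapping pairs: candidate -> list of sources it handles."""
    table = defaultdict(list)
    for source_id, node_id in pairs:
        table[node_id].append(source_id)
    return dict(table)


sources = {"f1": (0.0, 0.0, 0.0), "f2": (10.0, 0.0, 0.0), "f3": (5.0, 0.0, 0.0)}
candidates = {"n1": (1.0, 0.0, 0.0), "n2": (9.0, 0.0, 0.0)}
load = {"n1": 0.5, "n2": 0.2}   # assumed scalar utilization per candidate
pairs = map_sources(sources, candidates, load)
table = by_candidate(pairs)     # f3 is equidistant, so it goes to n2
```

Taking the minimum over a `(distance, load)` tuple gives the same result as sorting the distance set in ascending order and then resolving ties by utilization.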
[0125] Step 154: parse each spatial association edge in the spatial association network of the fluctuation source node, obtain the first fluctuation source node identifier and the second fluctuation source node identifier connected by each spatial association edge, query the first mapping target supercomputing node identifier corresponding to the first fluctuation source node identifier according to the node mapping relationship table, and query the second mapping target supercomputing node identifier corresponding to the second fluctuation source node identifier.
[0126] The system traverses all spatial association edges in the spatial association network of the fluctuation source nodes, sequentially parsing the source node identifier and target node identifier of each edge, which serve as the first and second fluctuation source node identifiers respectively. Then, using these two identifiers as query keys, the system searches the node mapping relationship table to obtain the mapping target supercomputing node identifiers of the two fluctuation source nodes, namely the first and second mapping target supercomputing node identifiers. The system records the two supercomputing node identifiers obtained from the query in order to determine whether the processing task of the spatial association edge requires cross-node collaboration.
[0127] Step 155: Based on the comparison result between the first mapping target supercomputing node identifier and the second mapping target supercomputing node identifier, generate an internal processing task indication or collaborative subtask and its data content format indication for the spatial association edge, so as to construct a complete task division sequence.
[0128] The system compares the two mapping target supercomputing node identifiers. If the two identifiers are the same, the two fluctuation source nodes connected by the spatial association edge are assigned to the same supercomputing node, and the corresponding association processing task can be completed within that node without cross-node collaboration. If the two identifiers are different, the two fluctuation source nodes are assigned to different supercomputing nodes, and the corresponding association processing task requires collaboration between the two nodes. In that case the task is split into two subtasks, one assigned to each node, and the content and format of the data to be exchanged between the two nodes are specified. Based on the comparison results, the system generates the corresponding task instructions. The task instructions for all spatial association edges are then aggregated to form a complete task division sequence, which identifies all the subtasks that each candidate supercomputing node needs to execute.
[0129] Step 1551: When the identifier of the first mapping target supercomputing node is the same as the identifier of the second mapping target supercomputing node, generate an internal processing task instruction to assign the association processing task corresponding to the spatial association edge to the supercomputing node corresponding to the same identifier.
[0130] The internal processing task indication includes fields such as task identifier, associated edge identifier, identifiers of the two fluctuation source nodes, and task processing requirements. It clearly states that the processing task for this association is completed independently by the corresponding supercomputing node, without requiring data interaction with other nodes. The system associates the generated internal processing task indication with the corresponding supercomputing node identifier, marking it as the task that node needs to execute. For multiple internal processing task indications corresponding to the same supercomputing node, the system sorts them according to the weight of the associated edge; tasks with higher weights are ranked higher, ensuring that important associations are processed first.
[0131] Step 1552: When the identifier of the first mapping target supercomputing node is different from the identifier of the second mapping target supercomputing node, split the association processing task corresponding to the spatial association edge into a first collaborative subtask executed by the supercomputing node corresponding to the identifier of the first mapping target supercomputing node and a second collaborative subtask executed by the supercomputing node corresponding to the identifier of the second mapping target supercomputing node, and generate a data content format indication for the data that needs to be exchanged between the first collaborative subtask and the second collaborative subtask.
[0132] The first and second collaborative subtasks each contain fields such as task identifier, associated edge identifier, corresponding fluctuation source node identifier, collaborative node identifier, and task processing requirements, clearly defining the subtask content that each node needs to complete. The data content format indication specifies the data type, data structure, encoding format, and transmission protocol of the data that the two nodes need to exchange, ensuring that each node can correctly parse the collaborative data sent by the other and complete the collaborative processing of the association. The system associates each of the two collaborative subtasks with its corresponding supercomputing node identifier and sends the data content format indication to both nodes to guide the collaborative interaction between them.
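The branch between Step 1551 and Step 1552 can be sketched as a single function over one edge. The dictionary layout of edges and tasks, and the contents of the `exchange` entry that stands in for the data content format indication, are hypothetical; the text only lists the kinds of fields involved.

```python
def divide_edge(edge, mapping):
    """Return the task entries produced by one spatial association edge."""
    first, second = edge["first"], edge["second"]
    node_a, node_b = mapping[first], mapping[second]
    if node_a == node_b:
        # Same target node: one internal processing task, no data exchange.
        return [{"type": "internal", "node": node_a, "edge": edge["id"],
                 "sources": (first, second)}]
    # Different target nodes: two collaborative subtasks sharing one format.
    exchange = {"encoding": "json", "schema": "edge-state-v1"}  # assumed format
    return [
        {"type": "collab", "node": node_a, "peer": node_b, "edge": edge["id"],
         "source": first, "exchange": exchange},
        {"type": "collab", "node": node_b, "peer": node_a, "edge": edge["id"],
         "source": second, "exchange": exchange},
    ]


mapping = {"f1": "n1", "f2": "n2", "f3": "n2"}
edges = [
    {"id": "e1", "first": "f2", "second": "f3", "weight": 0.9},
    {"id": "e2", "first": "f1", "second": "f3", "weight": 0.4},
]
tasks = [task for edge in edges for task in divide_edge(edge, mapping)]
```

Both collaborative subtasks reference the same `exchange` object, mirroring the requirement that the two nodes receive an identical format indication.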
[0133] Step 1553: Construct a complete task division sequence based on all internal processing task instructions and all collaborative subtasks and their data content format indications.
[0134] The system groups all internal processing task instructions and collaborative subtasks according to the candidate supercomputing node identifiers. Each group corresponds to all tasks that a candidate supercomputing node needs to execute. Tasks within each group are sorted by priority, with higher-priority tasks listed first. Tasks with the same priority are sorted by the weight of their associated edges, ultimately forming a complete task division sequence. Each task entry in the sequence contains information such as the supercomputing node identifier, task type, task content, and processing requirements, which can be directly used to generate dedicated execution instructions for each supercomputing node.
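The grouping and two-level ordering described above amounts to a sort on a compound key. In the sketch below, the convention that larger numbers mean higher priority and heavier edges is an assumption, as are the field names.

```python
from collections import defaultdict


def build_sequence(tasks):
    """Group tasks by node; order each group by priority, then edge weight."""
    groups = defaultdict(list)
    for task in tasks:
        groups[task["node"]].append(task)
    for node_tasks in groups.values():
        # Assumption: larger numbers mean higher priority and heavier edges.
        node_tasks.sort(key=lambda t: (-t["priority"], -t["weight"]))
    return dict(groups)


tasks = [
    {"node": "n2", "edge": "e1", "priority": 1, "weight": 0.9},
    {"node": "n1", "edge": "e2", "priority": 1, "weight": 0.4},
    {"node": "n2", "edge": "e2", "priority": 2, "weight": 0.4},
    {"node": "n2", "edge": "e3", "priority": 1, "weight": 0.95},
]
sequence = build_sequence(tasks)
order_n2 = [t["edge"] for t in sequence["n2"]]
```

For node `n2`, the priority-2 task comes first, and the two priority-1 tasks follow in descending edge weight.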
[0135] Step 1554: After constructing a complete task division sequence based on all internal processing task instructions and all collaborative subtasks and their data content format indications, an emergency coordination and control instruction set containing multiple node-specific instruction data blocks is generated. Each node-specific instruction data block corresponds to a candidate supercomputing node and contains a list of fluctuation source nodes that the candidate supercomputing node needs to process and a list of spatial association edge processing subtasks that need to be executed.
[0136] The system independently generates a corresponding node-specific instruction data block for each candidate supercomputing node. This block contains information about all fluctuation source nodes that the node needs to handle and information about all subtasks that need to be executed. The instruction data blocks for different nodes are independent of each other, containing only the necessary information for the node to execute its tasks, avoiding redundant data transmission. All node-specific instruction data blocks are aggregated to form an emergency coordination and control instruction set. Each block in the set corresponds to a unique candidate supercomputing node identifier, facilitating instruction distribution.
[0137] Step 15541: Obtain all fluctuation source node identifiers corresponding to the first candidate supercomputing node identifier in the node mapping relationship table, sort all fluctuation source node identifiers according to the lexicographical order of the fluctuation source node identifiers, and generate the first fluctuation source node processing list corresponding to the first candidate supercomputing node identifier.
[0138] The system first filters all node mapping pairs whose target is the first candidate supercomputing node from the node mapping table. It then extracts the identifiers of the fluctuation source nodes from these pairs and sorts them in ascending order according to lexicographical order. The sorted list of identifiers constitutes the first fluctuation source node processing list, clearly identifying all fluctuation source nodes that the supercomputing node needs to handle. The purpose of this sorting is to ensure a consistent order in which nodes process fluctuation source nodes, facilitating task scheduling and result verification.
[0139] Step 15542: Obtain all internal processing task indications and all collaborative subtasks corresponding to the first candidate supercomputing node identifier in the task division sequence, sort them according to the order in which their spatial association edges appear in the spatial association network of the fluctuation source nodes, and generate the first spatial association edge processing subtask list corresponding to the first candidate supercomputing node identifier.
[0140] The system filters all internal processing task instructions and collaborative subtasks belonging to the first candidate supercomputing node from the task division sequence. These are then sorted in ascending order of the edge identifiers in the spatial association network of the fluctuation source nodes. The sorted task list is the first spatial association edge processing subtask list, which specifies the order and content of all association edge processing tasks that the supercomputing node needs to execute. The purpose of sorting is to keep the order of association edge processing consistent with the traversal order of the network topology, avoiding logical conflicts.
[0141] Step 15543: Create a first node-specific instruction data block for the first candidate supercomputing node identifier in the emergency coordination and control instruction set, and write the first candidate supercomputing node identifier field into the header of the first node-specific instruction data block.
[0142] The first node-specific instruction data block adopts a standardized structured data format, divided into a header and a body. The header stores the supercomputing node identifier corresponding to the block, used for target node matching during instruction distribution. The body stores the specific task content that the node needs to execute. The system first creates a blank first node-specific instruction data block and writes the first candidate supercomputing node identifier into the header, ensuring that the block will only be sent to the corresponding target supercomputing node and avoiding incorrect instruction delivery. After writing, the block header content is fixed, and the task content is written into the block body.
[0143] Step 15544: Write the first fluctuation source node processing list and the first spatial association edge processing subtask list sequentially into the body of the first node-specific instruction data block.
[0144] The system sequentially writes the generated first fluctuation source node processing list and first spatial association edge processing subtask list into the body of the first node-specific instruction data block. It also adds supplementary information such as emergency task identifiers, task priorities, task execution time requirements, and data return addresses, so that the supercomputing node can clearly understand all task requirements upon receiving the instruction. After writing is complete, the system performs an integrity check on the entire first node-specific instruction data block to ensure that all content is complete and correctly formatted. Once the check passes, the block is added to the emergency coordination and control instruction set.
[0145] Step 15545: Continue to obtain all fluctuation source node identifiers corresponding to the second candidate supercomputing node identifier in the node mapping relationship table, and generate the second fluctuation source node processing list corresponding to the second candidate supercomputing node identifier in the same processing method.
[0146] Following the exact same process as generating the first fluctuation source node processing list, the system selects the corresponding fluctuation source node identifiers from the node mapping relationship table for the second candidate supercomputing node, sorts them, and generates the second fluctuation source node processing list. The generation method for the fluctuation source node processing lists for all candidate supercomputing nodes is completely consistent, ensuring the uniformity of the instruction format.
[0147] Step 15546: Continue to obtain all internal processing task indications and all collaborative subtasks corresponding to the second candidate supercomputing node identifier in the task division sequence, and generate a list of second spatial association edge processing subtasks corresponding to the second candidate supercomputing node identifier in the same processing method.
[0148] Following the exact same process as generating the first spatial association edge processing subtask list, the system selects the corresponding task instructions and subtasks from the task division sequence for the second candidate supercomputing node, sorts them, and generates the second spatial association edge processing subtask list. The generation method for the spatial association edge processing subtask lists for all candidate supercomputing nodes is completely consistent, ensuring the uniformity of task scheduling logic.
[0149] Step 15547: Create a second node-specific instruction data block for the second candidate supercomputing node identifier in the emergency coordination and control instruction set. Write the second candidate supercomputing node identifier field in the header of the second node-specific instruction data block. Write the second fluctuation source node processing list and the second spatial association edge processing subtask list in the main body of the second node-specific instruction data block in sequence.
[0150] Following exactly the same process used to construct the first node-specific instruction data block, the system creates the second node-specific instruction data block. It writes the second candidate supercomputing node identifier in the header, then writes the second fluctuation source node processing list followed by the second spatial association edge processing subtask list in the body. After supplementing additional information and completing verification, it adds the block to the emergency coordination and control instruction set. The node-specific instruction data blocks for all candidate supercomputing nodes are constructed in the same way, which standardizes the instruction format and allows the supercomputing nodes to parse it uniformly.
[0151] Step 15548: Repeat all the above steps until each candidate supercomputing node identifier in the node mapping relationship table has been processed, generating an emergency coordination and control instruction set containing multiple node-specific instruction data blocks. Each node-specific instruction data block corresponds to one candidate supercomputing node and contains the list of fluctuation source nodes that the candidate supercomputing node needs to process and the list of spatial association edge processing subtasks that it needs to execute.
[0152] The system processes all candidate supercomputing nodes in the candidate supercomputing node group set sequentially, generating a corresponding node-specific instruction data block for each node. After all blocks pass verification, they are added to the emergency coordination and control instruction set. Upon completion, the emergency coordination and control instruction set contains the same number of node-specific instruction data blocks as the number of candidate supercomputing nodes. Each block corresponds to a unique candidate supercomputing node and contains all the information required for that node to execute emergency tasks.
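The per-node assembly in Steps 15545 to 15548 can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the class name `NodeInstructionBlock`, the function `build_instruction_set`, the dictionary and tuple layouts, and the use of plain lexicographic sorting are all assumptions introduced here, and the supplementary information and verification fields are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class NodeInstructionBlock:
    """Hypothetical layout of one node-specific instruction data block."""
    node_id: str                                        # header: candidate supercomputing node identifier
    source_nodes: list = field(default_factory=list)    # body, part 1: fluctuation source node processing list
    edge_subtasks: list = field(default_factory=list)   # body, part 2: spatial association edge subtask list


def build_instruction_set(node_mapping, task_sequence):
    """Build one block per candidate supercomputing node.

    node_mapping:  {fluctuation_source_id: supercomputing_node_id} (assumed shape)
    task_sequence: list of (supercomputing_node_id, subtask_label)  (assumed shape)
    """
    blocks = {}
    for source_id, sc_id in node_mapping.items():
        blocks.setdefault(sc_id, NodeInstructionBlock(sc_id)).source_nodes.append(source_id)
    for sc_id, subtask in task_sequence:
        blocks.setdefault(sc_id, NodeInstructionBlock(sc_id)).edge_subtasks.append(subtask)
    for block in blocks.values():
        # The sort key is an assumption; the description only says the lists are sorted.
        block.source_nodes.sort()
        block.edge_subtasks.sort()
    return [blocks[sc_id] for sc_id in sorted(blocks)]


if __name__ == "__main__":
    mapping = {"F3": "SC-B", "F1": "SC-A", "F2": "SC-A"}
    tasks = [("SC-A", "internal:F1-F2"), ("SC-A", "collab:F2-F3"), ("SC-B", "collab:F2-F3")]
    for blk in build_instruction_set(mapping, tasks):
        print(blk)
```

Because every block is produced by the same routine, the resulting set contains exactly one block per candidate supercomputing node, each with the same field order.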
[0153] Step 156: Parse the candidate supercomputing node identifier contained in the header of each node-specific instruction data block in the emergency coordination and control instruction set, and establish a control instruction transmission channel with each candidate supercomputing node based on the candidate supercomputing node identifier.
[0154] The system first traverses all node-specific instruction data blocks in the emergency coordination and control instruction set, extracting the candidate supercomputing node identifier from the header of each block. Based on the identifier, it queries the supercomputing center's node management database to obtain the control plane network address and access authentication information for each candidate supercomputing node. Based on the control plane network address, the system establishes an independent encrypted control instruction transmission channel with each candidate supercomputing node using a dedicated task scheduling protocol. Two-way authentication is performed during channel establishment to ensure the security of the transmission channel and prevent instruction tampering or forgery. Each candidate supercomputing node corresponds to an independent transmission channel, facilitating the independent sending of control instructions and status monitoring.
[0155] Step 157: For each node-specific instruction data block, send the complete node-specific instruction data block through the corresponding control instruction transmission channel to the target candidate supercomputing node indicated by the candidate supercomputing node identifier. Monitor the instruction reception confirmation message returned by each target candidate supercomputing node after it receives the corresponding node-specific instruction data block. The instruction reception confirmation message includes the target candidate supercomputing node identifier and the reception timestamp.
[0156] The system sends node-specific instruction data blocks sequentially through the corresponding control instruction transmission channels according to the priority order of the candidate supercomputing nodes. During transmission, a block-based transmission and verification retransmission mechanism is employed to ensure that the data blocks are transmitted completely and correctly to the target candidate supercomputing nodes. Upon receiving a complete and correctly verified node-specific instruction data block, each target candidate supercomputing node immediately returns an instruction reception confirmation message. This message includes its own node identifier and reception timestamp, notifying the system that the instruction has been successfully received. After sending each data block, the system starts a timer, waiting for the corresponding confirmation message to return.
[0157] Step 158: For target candidate supercomputing nodes that do not return an instruction reception confirmation message within the preset confirmation waiting time threshold, re-establish the control instruction transmission channel with the target candidate supercomputing node and resend the corresponding node-specific instruction data block.
[0158] The preset confirmation waiting time threshold is pre-set based on the network latency characteristics of the supercomputing center's control plane. If the system does not receive an instruction reception confirmation message from the corresponding target candidate supercomputing node after the timer exceeds this threshold, it is determined that the transmission has failed or the node is unresponsive. At this time, the system will retry establishing a control instruction transmission channel with the target candidate supercomputing node. If the channel is successfully established, the corresponding node-specific instruction data block will be resent. If the number of channel establishment failures exceeds the preset retry count threshold, the node is determined to be faulty, removed from the candidate supercomputing node group set, and the mapping and allocation of the fluctuation source node and instruction generation will be re-performed to ensure that all emergency tasks have corresponding nodes to undertake them.
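The send, confirm, and retry behaviour of Steps 157 and 158 can be sketched as below. This is only an illustration under stated assumptions: `dispatch_blocks` and the `send` callable are hypothetical names, `send` stands in for the encrypted transmission channel and returns `None` to model a confirmation that did not arrive within the waiting time threshold, and real channel re-establishment, authentication, and block-level verification are not modelled.

```python
def dispatch_blocks(blocks, send, max_retries=3):
    """Send each node-specific block and collect reception confirmations.

    blocks: {node_id: payload}
    send:   hypothetical transport, send(node_id, payload) -> confirmation dict,
            or None when no confirmation arrives before the timeout
    Returns (acks, failed_nodes). Nodes in failed_nodes exceeded the retry
    threshold and should be removed from the candidate set and re-mapped.
    """
    acks, failed = {}, []
    for node_id, payload in blocks.items():
        for _attempt in range(1 + max_retries):
            ack = send(node_id, payload)
            # A valid confirmation must carry the identifier of the target node.
            if ack is not None and ack.get("node_id") == node_id:
                acks[node_id] = ack
                break
        else:
            failed.append(node_id)
    return acks, failed


if __name__ == "__main__":
    attempts = {}

    def fake_send(node_id, payload):
        # Simulated transport: SC-B answers on the second try, SC-C never answers.
        attempts[node_id] = attempts.get(node_id, 0) + 1
        if node_id == "SC-C" or (node_id == "SC-B" and attempts[node_id] < 2):
            return None
        return {"node_id": node_id, "received_at": 0.0}

    print(dispatch_blocks({"SC-A": b"a", "SC-B": b"b", "SC-C": b"c"}, fake_send, max_retries=2))
```

Returning the failed nodes instead of raising leaves the decision to re-run mapping and instruction generation with the caller, which matches the fallback described above.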
[0159] Step 159: After all target candidate supercomputing nodes have successfully returned instruction reception confirmation messages, a global emergency task start instruction is generated, which includes a unified start time point parameter.
[0160] After confirming that all target candidate supercomputing nodes have successfully received the corresponding node-specific instruction data blocks, the system sets a unified start time parameter based on the current global system time and the emergency task response time requirements. This start time is later than the current system time, allowing sufficient time for all nodes to complete pre-task preparations, including resource pre-allocation, environment initialization, and establishment of collaboration channels. The global emergency task start instruction includes fields such as task identifier, start time parameter, and task validity period. All nodes will synchronously start task execution according to the unified time in this instruction, ensuring the coordination of the entire emergency response process.
[0161] Step 1591: Broadcast the global emergency task start instruction simultaneously to all target candidate supercomputing nodes that have successfully returned instruction reception confirmation messages. After receiving the global emergency task start instruction, each target candidate supercomputing node parses the start time point parameter in the instruction and, when the system time reaches the time indicated by the start time point parameter, begins to execute in parallel the tasks in the fluctuation source node processing list and the spatial association edge processing subtask list contained in the node-specific instruction data block it received.
[0162] The system sends the global emergency task start instruction to all target candidate supercomputing nodes at the same time via the supercomputing center's control plane broadcast channel, so that all nodes receive the instruction almost simultaneously and start time deviations caused by transmission delays are avoided. Upon receiving the start instruction, each target candidate supercomputing node parses the start time point parameter and completes all preparatory work for task execution while waiting for the system time to reach that point. This includes reserving the necessary computing, memory, storage, and network bandwidth resources, loading the emergency task processing program, and establishing data transmission channels with collaborating nodes. When the system time reaches the start time point, all nodes begin executing their respective tasks simultaneously, completing all processing work in the order given by the fluctuation source node processing list and the spatial association edge processing subtask list. This achieves multi-node parallel collaborative emergency response and efficient analysis and handling of abnormal events.
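The unified start time mechanism of Steps 159 and 1591 can be illustrated with a small sketch. The names `StartInstruction`, `make_start_instruction`, `seconds_until_start`, and `is_expired` are hypothetical, times are plain seconds on a clock assumed to be shared by all nodes, and measuring the validity period from the start time point is an assumption, since the description does not say where it is counted from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StartInstruction:
    """Hypothetical fields of the global emergency task start instruction."""
    task_id: str
    start_at: float   # unified start time point (seconds on the shared clock)
    valid_for: float  # task validity period in seconds (assumed to count from start_at)


def make_start_instruction(task_id, now, prep_seconds, valid_for):
    """Set the start time later than the current time to leave room for preparation."""
    if prep_seconds <= 0:
        raise ValueError("the start time point must be later than the current system time")
    return StartInstruction(task_id, now + prep_seconds, valid_for)


def seconds_until_start(instr, local_now):
    """Time a node still has for preparation; 0.0 once the start time point is reached."""
    return max(0.0, instr.start_at - local_now)


def is_expired(instr, local_now):
    """True once the validity period after the start time point has elapsed."""
    return local_now > instr.start_at + instr.valid_for


if __name__ == "__main__":
    instr = make_start_instruction("emergency-task-1", now=1000.0, prep_seconds=5.0, valid_for=60.0)
    # Two nodes that receive the broadcast at different moments still share one start time.
    for node, received_at in (("SC-A", 1000.5), ("SC-B", 1002.0)):
        print(node, "prepares for", seconds_until_start(instr, received_at), "seconds")
```

Carrying an absolute start time in the instruction, rather than "start on receipt", is what makes the start independent of per-node delivery delay, provided node clocks are synchronized.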
[0163] The embodiments of this application start from the full-process collaborative logic of emergency response in supercomputing centers. Through the omni-channel perception of the IoT monitoring node array and the detection of data stream state change points, accurate identification and feature characterization of environmental anomalies are achieved. The spatial association network of fluctuation source nodes, constructed based on the fluctuation source node identifiers, fully restores the spatial distribution characteristics and node relationships of abnormal events, providing a topology-level decision-making basis for the accurate triggering of emergency tasks. The supercomputing center resource scheduling platform analyzes the topological characteristics of the spatial association network of fluctuation source nodes and generates emergency response task request instructions that match the degree of anomaly impact, achieving adaptive matching between emergency response levels and resource requirements and avoiding over-scheduling or insufficient response. Multi-dimensional filtering that combines the node resource occupancy status vectors and the node task execution progress vectors ensures the resource availability and task undertaking capacity of the candidate supercomputing node group, providing a resource foundation for the efficient execution of emergency tasks. Through spatial proximity mapping between fluctuation source nodes and candidate supercomputing nodes, nearby scheduling and reasonable division of labor for anomaly handling tasks are realized. The generated emergency coordination and control instruction set can drive the candidate supercomputing node group to execute emergency response tasks in parallel, significantly shortening the response and handling cycle of abnormal events while reducing cross-node data transmission overhead and coordination complexity. Across the four stages of perception, decision-making, scheduling, and execution, this improves the timeliness, accuracy, and resource utilization of emergency coordination and control in the supercomputing center, effectively ensuring the stability and security of the supercomputing center's operation.
[0164] The embodiments of this application involve real-time processing of the multi-source heterogeneous data streams transmitted back by the IoT monitoring nodes and identification of abnormal fluctuations through change point detection. The core of this process lies in sliding window segmentation, distribution feature extraction, and joint threshold judgment on the mean change gradient and the dispersion change ratio. These operations have mature applications in the fields of statistical process control and time series anomaly detection. Real-time calculation of the mean and standard deviation, and calculation of the difference and ratio between adjacent windows, are standard data processing procedures. The threshold parameters can be calibrated from the distribution characteristics of historical normal data.
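A minimal sketch of the windowed change point check is given below. The function name `detect_change_points`, the non-overlapping windows, the small `eps` guard against division by zero, and the rule that a window is flagged when either threshold is exceeded are assumptions for illustration; the description leaves the exact form of the joint judgment and the threshold values open.

```python
from statistics import fmean, pstdev


def detect_change_points(series, window, grad_threshold, ratio_threshold, eps=1e-9):
    """Return the start indices of windows flagged as suspected change points.

    series: standardized readings of one monitoring node, ordered by sampling time
    window: number of samples per fixed-length data segment unit
    """
    # Split the time series into consecutive fixed-length segments.
    segments = [series[i:i + window] for i in range(0, len(series) - window + 1, window)]
    # Distribution features per segment: mean and (population) standard deviation.
    stats = [(fmean(seg), pstdev(seg)) for seg in segments]

    flagged = []
    for k in range(1, len(stats)):
        prev_mean, prev_std = stats[k - 1]
        mean, std = stats[k]
        gradient = mean - prev_mean               # mean change gradient
        ratio = (std + eps) / (prev_std + eps)    # dispersion change ratio
        # Assumed joint rule: flag when either indicator exceeds its threshold.
        if abs(gradient) > grad_threshold or ratio > ratio_threshold:
            flagged.append(k * window)
    return flagged


if __name__ == "__main__":
    readings = [20.0] * 8 + [30.0] * 4   # a level shift starting at index 8
    print(detect_change_points(readings, window=4, grad_threshold=5.0, ratio_threshold=3.0))
```

In this sketch a sudden level shift is caught by the gradient test, while a burst of oscillation around an unchanged mean is caught by the ratio test.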
[0165] Constructing the spatial association network based on the fluctuation source node identifiers requires calculating the spatial distances between nodes and determining their adjacency. This relies on a pre-established unified coordinate system and a node deployment location database. The spatial distance is calculated with the three-dimensional Euclidean distance formula, which is a basic geometric operation. The spatial adjacency threshold can be set based on the equipment layout of the data center and experience with anomaly propagation. Network topology features such as the network diameter and connectivity can be calculated using classic graph techniques such as breadth-first search and the disjoint-set (union-find) data structure.
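The network construction and topology features described above can be sketched as follows. The function names, the dictionary-based adjacency representation, the use of "less than or equal to" for the adjacency threshold, and the definition of diameter as the longest shortest path in hops within any connected component are assumptions made for this illustration.

```python
from collections import deque
from itertools import combinations
from math import dist


def build_network(coords, adjacency_threshold):
    """coords: {node_id: (x, y, z)} in meters. Returns {node_id: set of adjacent node_ids}."""
    adj = {node: set() for node in coords}
    for a, b in combinations(coords, 2):
        # Three-dimensional Euclidean distance between the two deployment positions.
        if dist(coords[a], coords[b]) <= adjacency_threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def count_components(adj):
    """Number of connected components, computed with a disjoint-set (union-find)."""
    parent = {node: node for node in adj}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a in adj:
        for b in adj[a]:
            parent[find(a)] = find(b)
    return len({find(node) for node in adj})


def diameter(adj):
    """Longest shortest path in hops within any component, via breadth-first search."""
    best = 0
    for start in adj:
        hops = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in hops:
                    hops[v] = hops[u] + 1
                    queue.append(v)
        best = max(best, max(hops.values()))
    return best


if __name__ == "__main__":
    positions = {"F1": (0, 0, 0), "F2": (3, 0, 0), "F3": (6, 0, 0), "F4": (50, 0, 0)}
    net = build_network(positions, adjacency_threshold=4.0)
    print(net, count_components(net), diameter(net))
```

The all-pairs distance check is quadratic in the number of fluctuation source nodes, which is acceptable for the small node sets an abnormal event typically produces; a spatial index would be a natural substitute for larger sets.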
[0166] The emergency response level is determined based on a combination of the comprehensive impact coefficient, the number of associated edges, and the network diameter, through a pre-defined mapping table. The mapping relationship can be established based on historical emergency cases and expert experience. The available node selection process involves parsing multi-dimensional resource state vectors and comparing tiered thresholds. Indicators such as CPU utilization and memory utilization are standard operating system performance parameters, and thresholds are set according to the resource requirements of the emergency task, which is a standard practice in resource scheduling. The spatial proximity mapping between the fluctuation source node and the supercomputing node adopts the minimum distance principle, a typical nearest neighbor allocation algorithm.
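The available node filtering and minimum-distance mapping described above can be sketched as follows. The function names, the field names `cpu`, `mem`, and `remaining_s`, and the concrete threshold values in the usage example are assumptions; the description only states that thresholds follow the resource requirements of the emergency task.

```python
from math import dist


def filter_available(nodes, max_cpu, max_mem, max_remaining_s):
    """Keep nodes whose resource and progress indicators are all within the thresholds.

    nodes: {node_id: {"cpu": 0..1, "mem": 0..1, "remaining_s": seconds}} (assumed fields)
    """
    return [
        node_id
        for node_id, state in nodes.items()
        if state["cpu"] <= max_cpu
        and state["mem"] <= max_mem
        and state["remaining_s"] <= max_remaining_s
    ]


def map_nearest(source_coords, candidate_coords):
    """Assign each fluctuation source node to the closest candidate supercomputing node."""
    if not candidate_coords:
        raise ValueError("no candidate supercomputing nodes available")
    return {
        source_id: min(candidate_coords, key=lambda c: dist(position, candidate_coords[c]))
        for source_id, position in source_coords.items()
    }


if __name__ == "__main__":
    cluster = {
        "SC-A": {"cpu": 0.2, "mem": 0.3, "remaining_s": 100.0},
        "SC-B": {"cpu": 0.9, "mem": 0.4, "remaining_s": 50.0},   # CPU too busy
        "SC-C": {"cpu": 0.4, "mem": 0.5, "remaining_s": 50.0},
    }
    candidates = filter_available(cluster, max_cpu=0.6, max_mem=0.7, max_remaining_s=600.0)
    sc_positions = {"SC-A": (0, 0, 0), "SC-B": (5, 0, 0), "SC-C": (10, 0, 0)}
    candidate_positions = {c: sc_positions[c] for c in candidates}
    print(map_nearest({"F1": (1, 0, 0), "F2": (9, 0, 0)}, candidate_positions))
```

Filtering before mapping means a fluctuation source node is never assigned to a node that cannot take the task, even if that node happens to be the closest one.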
[0167] The generation of the task division sequence is based on the mapping results and the affiliation relationship between the nodes at both ends of the associated edge. The splitting of collaborative subtasks and the data format indication are clear, conforming to the general paradigm of distributed task decomposition. Therefore, although the embodiments of this application do not exhaustively list the mathematical derivations of all algorithms or the specific values of thresholds, relying on mature existing technical systems and engineering practice standards, those skilled in the art are fully capable of transforming these concepts into implementable technical solutions to achieve accurate identification of abnormal events, reasonable classification of emergency tasks, and full-process collaborative control of multi-node parallel response.
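One way to read the edge-based task division is sketched below: an association edge whose two fluctuation source nodes are mapped to the same supercomputing node becomes an internal processing task, and an edge that spans two supercomputing nodes is split into one collaborative subtask per side. The function name `build_task_division` and the tuple layouts are hypothetical, and the data format indication mentioned in the description is not modelled.

```python
def build_task_division(edges, node_mapping):
    """Derive a per-node task division from association edges and the node mapping.

    edges:        list of (source_a, source_b) spatial association edges
    node_mapping: {fluctuation_source_id: supercomputing_node_id}
    Returns {supercomputing_node_id: [task tuples]}.
    """
    sequence = {}
    for a, b in edges:
        owner_a, owner_b = node_mapping[a], node_mapping[b]
        if owner_a == owner_b:
            # Both endpoints belong to one node: it handles the edge on its own.
            sequence.setdefault(owner_a, []).append(("internal", a, b))
        else:
            # Endpoints belong to different nodes: each side gets a collaborative
            # subtask that names its peer node.
            sequence.setdefault(owner_a, []).append(("collab", a, b, owner_b))
            sequence.setdefault(owner_b, []).append(("collab", a, b, owner_a))
    return sequence


if __name__ == "__main__":
    mapping = {"F1": "SC-A", "F2": "SC-A", "F3": "SC-B"}
    print(build_task_division([("F1", "F2"), ("F2", "F3")], mapping))
```

Under nearest-neighbour mapping, spatially adjacent fluctuation source nodes tend to share an owner, so most edges become internal tasks and cross-node traffic is limited to the edges that cross a mapping boundary.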
[0168] The original sensing data involved in the embodiments of this application includes various physical quantities such as temperature, humidity, voltage, and vibration, as well as heterogeneous parameters derived therefrom such as spatial coordinates, timestamps, and resource occupancy rates. The units and orders of magnitude of different physical quantities are significantly different.
[0169] In practical engineering implementation, all input features must be normalized or standardized before entering the processing flow; this is a standard step in data preprocessing. The description specifies that, when the data distribution features are calculated, the monitoring parameter values are standardized by subtracting the historical mean and dividing by the historical standard deviation. This eliminates the dimensional differences between parameters, making dimensionless indicators such as the mean change gradient and the dispersion change ratio comparable. Node spatial coordinate parameters uniformly adopt a global coordinate system, with all coordinates in meters. Spatial distances are calculated directly with the Euclidean distance formula, ensuring consistent distance units. Parameters such as CPU utilization and memory utilization in the node resource occupancy status vector are themselves dimensionless ratios between 0 and 1, so they can be compared with thresholds directly without additional conversion. The estimated remaining execution time parameter is uniformly expressed in seconds, ensuring that remaining times can be summed and compared correctly. The spatial distance between a fluctuation source node and a supercomputing node is likewise calculated in the unified coordinate system with consistent distance units, ensuring the validity of the minimum distance mapping principle. The comprehensive impact coefficient in the emergency response level mapping table is obtained by summing the importance level parameters. Since the importance level parameters are dimensionless values, the summation result is also a dimensionless coefficient. Through standardized processing, a unified coordinate system, and dimensionless transformation throughout the technical solution, those skilled in the art can eliminate the dimensional differences caused by different physical quantities, ensuring the rigor of the mathematical calculations and the validity of the decision-making logic at each stage, thereby achieving the objective of the invention and improving the timeliness and accuracy of emergency coordination and control in supercomputing centers.
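The standardization step (subtract the historical mean, divide by the historical standard deviation) can be shown in a few lines. The function name `standardize` and the sample values are illustrative assumptions only.

```python
def standardize(values, hist_mean, hist_std):
    """Convert raw readings to dimensionless z-scores using historical statistics."""
    if hist_std <= 0:
        raise ValueError("historical standard deviation must be positive")
    return [(v - hist_mean) / hist_std for v in values]


if __name__ == "__main__":
    # Temperature in degrees Celsius and voltage in volts end up on the same scale.
    print(standardize([22.0, 26.0], hist_mean=24.0, hist_std=2.0))
    print(standardize([218.0, 222.0], hist_mean=220.0, hist_std=2.0))
```

After this step a deviation of one historical standard deviation reads as 1.0 for every physical quantity, so a single pair of gradient and ratio thresholds can be applied across heterogeneous sensors.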
[0170] Referring to Figure 2, which is a schematic diagram of the basic structure of a supercomputing center emergency coordination and control system 20 provided in an embodiment of this application, the supercomputing center emergency coordination and control system 20 includes: a processor 201; and a storage device 202 on which a computer program 2020 is stored. When the computer program 2020 is executed by the processor 201, the processor 201 implements any of the aforementioned IoT linkage-based emergency coordination and control methods for supercomputing centers.
[0171] Based on the above, a readable storage medium is provided, on which a program or instructions are stored, and when the program or instructions are executed by a processor, the steps of the above method are implemented.
[0172] Referring to Figure 3, which is a functional block diagram of a supercomputing center emergency coordination and control device provided in an embodiment of this application, the supercomputing center emergency coordination and control device includes the following modules. The state change detection module is used to collect the raw environmental state perception data stream transmitted back by the IoT monitoring node array, perform data stream state change point detection processing on the raw environmental state perception data stream, and obtain the set of abnormal fluctuation feature descriptors corresponding to the environmental state perception data units. The association network construction module is used to construct a spatial association network of fluctuation source nodes based on the fluctuation source node identifier contained in each data fluctuation anomaly feature descriptor in the set of abnormal fluctuation feature descriptors. The spatial association network of fluctuation source nodes includes multiple fluctuation source nodes and spatial association edges connecting fluctuation source nodes that have spatial adjacency relationships. The trigger condition parsing module is used to input the spatial association network of fluctuation source nodes into the supercomputing center resource scheduling platform for emergency task trigger condition parsing processing, and to generate an emergency response task request instruction that matches the network topology characteristics of the spatial association network of fluctuation source nodes.
The available node filtering module is used to query the real-time operating status database of the supercomputing center computing node cluster based on the emergency response task request instruction, obtain the node resource occupancy status vector and node task execution progress vector of each supercomputing node in the supercomputing center computing node cluster, and perform available node filtering processing on the supercomputing center computing node cluster according to the node resource occupancy status vector and node task execution progress vector to obtain a set of candidate supercomputing nodes that are qualified to take over the emergency task. The emergency command generation module is used to perform spatial proximity mapping and allocation processing on each fluctuation source node in the fluctuation source node spatial association network and the candidate supercomputing nodes in the candidate supercomputing node group set, generate an emergency coordination and control command set containing a node mapping relationship table and a task division sequence, and send the emergency coordination and control command set to the candidate supercomputing node group set to activate the parallel emergency response task execution operation.
[0173] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces, or indirect coupling or communication connection between apparatuses or units, and may be electrical, mechanical, or other forms.
[0174] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0175] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0176] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory, random access memory, magnetic disks, or optical disks.
[0177] The above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.
Claims
1. A method for emergency coordination and control of a supercomputing center based on Internet of Things (IoT) linkage, characterized in that, The method includes: The raw environmental state perception data stream transmitted back by the IoT monitoring node array is collected, and the raw environmental state perception data stream is processed to detect and process data stream state change points to obtain a set of abnormal fluctuation feature descriptors corresponding to the environmental state perception data unit. Based on the fluctuation source node identifier contained in each data fluctuation anomaly feature descriptor in the set of abnormal fluctuation feature descriptors, a spatial association network of fluctuation source nodes is constructed. The spatial association network of fluctuation source nodes includes multiple fluctuation source nodes and spatial association edges connecting fluctuation source nodes that have spatial adjacency relationships. The spatial association network of the fluctuation source node is input into the supercomputing center resource scheduling platform for emergency task trigger condition parsing and processing, and an emergency response task request instruction matching the network topology characteristics of the spatial association network of the fluctuation source node is generated. Based on the emergency response task request instruction, the real-time operation status database of the supercomputing center computing node cluster is queried to obtain the node resource occupancy status vector and node task execution progress vector of each supercomputing node in the supercomputing center computing node cluster. Based on the node resource occupancy status vector and node task execution progress vector, the available nodes of the supercomputing center computing node cluster are screened to obtain a set of candidate supercomputing nodes that are qualified to take over the emergency task. 
Spatial proximity mapping and allocation processing is performed on each fluctuation source node in the spatial association network of fluctuation source nodes and the candidate supercomputing nodes in the candidate supercomputing node group set, generating an emergency coordination and control instruction set containing a node mapping relationship table and a task division sequence. The emergency coordination and control instruction set is then sent to the candidate supercomputing node group set to activate the parallel emergency response task execution operation.
2. The method according to claim 1, characterized in that, The raw environmental state perception data stream transmitted back by the IoT monitoring node array is collected. The raw environmental state perception data stream undergoes data stream state change point detection processing to obtain a set of abnormal fluctuation feature descriptors corresponding to the environmental state perception data units, including: The raw environmental state perception data stream transmitted back by the monitoring node array deployed in the IoT perception layer during a continuous sampling period is collected. The raw environmental state perception data stream contains environmental state perception data units generated by multiple monitoring nodes at different sampling times, carrying node identifiers and timestamps. The environmental state perception data units corresponding to each monitoring node in the original environmental state perception data stream are sorted according to the sampling time sequence to generate a node perception data time sequence corresponding to each monitoring node. The time series of the sensing data of each node is processed by sliding window segmentation, dividing the time series of the sensing data of each node into multiple continuous data segmentation units with fixed time window lengths; For each data segment unit, data distribution features are extracted, and the mean and standard deviation parameters of all environmental state perception data units within each data segment unit are calculated. The difference between the mean parameter of each data segment unit and the mean parameter of the adjacent preceding data segment unit is calculated to obtain the gradient parameter of the mean change between adjacent data segment units. 
The ratio of the standard deviation parameter of each data segment unit to the standard deviation parameter of the adjacent preceding data segment unit is calculated to obtain the data dispersion change ratio parameter between adjacent data segment units. The suspected mutation point location process is performed on the time series of node-sensing data based on the mean change gradient parameter and the data dispersion change ratio parameter.
3. The method according to claim 1, characterized in that, The step of constructing a spatial association network of fluctuation source nodes based on the fluctuation source node identifier contained in each data fluctuation anomaly feature descriptor in the set of abnormal fluctuation feature descriptors includes: The fluctuation source node identifier contained in each data fluctuation anomaly feature descriptor in the abnormal fluctuation feature descriptor set is parsed, and the node deployment location database of the monitoring node array deployed in the IoT sensing layer is queried according to the fluctuation source node identifier to obtain the node spatial coordinate parameters corresponding to each fluctuation source node. Two different data fluctuation anomaly feature descriptors are randomly selected from the set of abnormal fluctuation feature descriptors as the first candidate feature descriptor and the second candidate feature descriptor; Obtain the fluctuation source node identifier corresponding to the first candidate feature descriptor as the first candidate node identifier, and obtain the fluctuation source node identifier corresponding to the second candidate feature descriptor as the second candidate node identifier. Query the node deployment location database based on the first candidate node identifier and the second candidate node identifier to obtain the spatial coordinate parameters of the first candidate node and the spatial coordinate parameters of the second candidate node. 
The spatial straight-line distance between the spatial coordinate parameters of the first candidate node and the spatial coordinate parameters of the second candidate node is calculated as the spatial distance parameter between nodes; Based on the comparison results between the spatial distance parameters between the nodes and the preset threshold, fluctuation source node pairs with spatial adjacency are determined and corresponding spatial association edges are generated.
4. The method according to claim 1, characterized in that, The step of inputting the spatial association network of fluctuation source nodes into the supercomputing center resource scheduling platform for emergency task trigger condition parsing processing, and generating an emergency response task request instruction that matches the network topology characteristics of the spatial association network of fluctuation source nodes, includes: The identifiers of all fluctuation source nodes contained in the spatial association network of fluctuation source nodes are parsed, and the node importance level parameter corresponding to each fluctuation source node is obtained by querying the preset node importance classification database based on the fluctuation source node identifiers. The network topology features of the spatial association network of fluctuation source nodes are extracted, and the network diameter parameter and network connectivity parameter of the spatial association network of fluctuation source nodes are calculated. The total number of fluctuation source nodes in the spatial association network of fluctuation source nodes is used as the parameter for the total number of nodes affected by the abnormal event. Based on the node importance level parameter corresponding to each fluctuation source node, the total number of nodes affected by the abnormal event is weighted and summed to generate the comprehensive impact coefficient of the abnormal event. The number of spatial association edges in the spatial association network of fluctuation source nodes is counted as the total number of association relationships between nodes. Based on the comprehensive impact coefficient of the abnormal event, the total number of association relationships between nodes, and the network diameter parameter, the corresponding emergency response task level and template are determined, and an emergency response task request instruction containing network coding data is generated.
5. The method according to claim 1, characterized in that the step of querying the real-time operation status database of the supercomputing center's computing node cluster based on the emergency response task request instruction, and obtaining the node resource occupancy status vector and the node task execution progress vector of each supercomputing node in the supercomputing center's computing node cluster, includes: The emergency response task level parameter in the emergency response task request instruction is parsed, and a list of target supercomputing center identifiers that need to participate in the emergency response is determined based on the emergency response task level parameter. For each target supercomputing center identifier in the list, the real-time operation status database interface of the corresponding supercomputing center resource scheduling platform is accessed. A node status query request instruction, which includes the identifier of the node status data type to be obtained, is sent to the real-time operation status database interface corresponding to each target supercomputing center identifier. The node status response data stream returned by the real-time operation status database interface corresponding to each target supercomputing center identifier is received. The node status response data stream contains the node status data packet of each supercomputing node in the supercomputing center computing node cluster corresponding to the target supercomputing center identifier. The node status data packet of each supercomputing node is parsed, and the instantaneous values of CPU core utilization, memory space utilization, storage device input/output queue depth, and network interface bandwidth utilization are extracted from it.
The instantaneous values of CPU core utilization, memory space utilization, storage device input/output queue depth, and network interface bandwidth utilization are arranged according to the preset feature vector dimension order to generate the node resource occupancy status vector of each supercomputing node. The node status data packet of each supercomputing node is parsed to extract the list of currently executing tasks, which contains the task identifier and the estimated remaining execution time of each executing task. The estimated remaining execution times of all tasks in the task list of each supercomputing node are summed to obtain the total remaining execution time parameter of the node tasks of each supercomputing node. The number of tasks in the task list of each supercomputing node is counted as the node task concurrency parameter of each supercomputing node. The total remaining execution time parameter of the node tasks and the node task concurrency parameter are arranged according to the preset progress vector dimension order to generate the node task execution progress vector of each supercomputing node.
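As a rough illustration of how the two vectors in claim 5 could be assembled from one node status data packet, consider the sketch below. The packet layout, the field names (`cpu_util`, `mem_util`, `io_queue_depth`, `net_util`, `tasks`, `remaining_s`), and the chosen dimension orders are hypothetical; the claim only says that the orders are preset.

```python
# Assumed preset dimension order for the resource occupancy status vector.
RESOURCE_ORDER = ("cpu_util", "mem_util", "io_queue_depth", "net_util")


def resource_vector(packet):
    """Arrange the four instantaneous values in the preset dimension order."""
    return [packet[key] for key in RESOURCE_ORDER]


def progress_vector(packet):
    """Return [total remaining execution time, task concurrency] for one node."""
    tasks = packet["tasks"]
    total_remaining = sum(task["remaining_s"] for task in tasks)
    return [total_remaining, len(tasks)]


# Hypothetical status packet for a single supercomputing node.
packet = {
    "cpu_util": 0.42,
    "mem_util": 0.63,
    "io_queue_depth": 7,
    "net_util": 0.18,
    "tasks": [
        {"id": "t1", "remaining_s": 120.0},
        {"id": "t2", "remaining_s": 30.5},
    ],
}

occupancy = resource_vector(packet)
progress = progress_vector(packet)
```

Keeping the dimension order in a single constant means that any consumer of the vector can index it by position without re-parsing the packet.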
6. The method according to claim 1, characterized in that the step of filtering available nodes in the supercomputing center's computing node cluster based on the node resource occupancy status vector and the node task execution progress vector, to obtain a candidate supercomputing node group set qualified for emergency task takeover, includes: The node resource occupancy status vector of each supercomputing node is parsed, and the instantaneous value of the CPU core occupancy rate of each supercomputing node is extracted as the first screening index parameter. The first screening index parameter is compared with a preset CPU idle threshold. Supercomputing nodes whose first screening index parameter is lower than the CPU idle threshold are selected as the set of supercomputing nodes that pass the first screening condition. The node resource occupancy status vector of each supercomputing node that passes the first screening condition is parsed to extract the instantaneous value of the memory space occupancy rate of each supercomputing node as the second screening index parameter. The second screening index parameter is compared with a preset memory space free threshold. From the set of supercomputing nodes that pass the first screening condition, the supercomputing nodes whose second screening index parameter is lower than the memory space free threshold are selected as the set of supercomputing nodes that pass the second screening condition. The node task execution progress vector of each supercomputing node that passes the second screening condition is parsed to extract the node task concurrency parameter of each supercomputing node as the third screening index parameter. The third screening index parameter is compared with the preset upper limit threshold for the number of concurrent tasks.
From the set of supercomputing nodes that have passed the second screening condition, the supercomputing nodes whose third screening index parameter is lower than the upper limit threshold for the number of concurrent tasks are selected as the set of supercomputing nodes that have passed the third screening condition. The node task execution progress vector of each supercomputing node that has passed the third screening condition is parsed and processed, and the total remaining execution time parameter of the node task of each supercomputing node is extracted as the fourth screening index parameter. The fourth screening index parameter is compared with a preset task remaining time tolerance threshold. From the set of supercomputing nodes that have passed the third screening condition, supercomputing nodes whose fourth screening index parameter is lower than the task remaining time tolerance threshold are selected as the set of supercomputing nodes that have passed the fourth screening condition. All supercomputing nodes in the set of supercomputing nodes that have passed the fourth screening condition are marked as candidate supercomputing nodes that are qualified to take over emergency tasks. All candidate supercomputing nodes constitute the candidate supercomputing node group set.
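The four screening conditions of claim 6 form a sequential filter in which each stage only examines the nodes that survived the previous one. A minimal sketch follows; the vector index positions, the threshold values, and the node data are assumptions chosen for illustration, not values taken from the claims.

```python
def screen_candidates(nodes, cpu_idle_max, mem_free_max, concurrency_max, remaining_max):
    """Apply the four screening conditions in order and return the candidate node IDs.

    `nodes` maps a node ID to (occupancy_vector, progress_vector), where
    occupancy_vector = [cpu, mem, io_queue_depth, net] and
    progress_vector = [total_remaining_time, concurrency] (assumed orders).
    """
    passed = list(nodes)
    # First condition: CPU core occupancy rate below the CPU idle threshold.
    passed = [n for n in passed if nodes[n][0][0] < cpu_idle_max]
    # Second condition: memory space occupancy rate below the memory free threshold.
    passed = [n for n in passed if nodes[n][0][1] < mem_free_max]
    # Third condition: task concurrency below the concurrent-task upper limit.
    passed = [n for n in passed if nodes[n][1][1] < concurrency_max]
    # Fourth condition: total remaining execution time below the tolerance threshold.
    passed = [n for n in passed if nodes[n][1][0] < remaining_max]
    return passed


# Hypothetical cluster: only n1 satisfies all four conditions.
cluster = {
    "n1": ([0.20, 0.30, 2, 0.10], [120.0, 2]),
    "n2": ([0.90, 0.30, 2, 0.10], [60.0, 1]),    # fails the CPU condition
    "n3": ([0.10, 0.95, 1, 0.10], [10.0, 1]),    # fails the memory condition
    "n4": ([0.10, 0.20, 1, 0.10], [50.0, 9]),    # fails the concurrency condition
    "n5": ([0.10, 0.20, 1, 0.10], [5000.0, 3]),  # fails the remaining-time condition
}

candidates = screen_candidates(
    cluster, cpu_idle_max=0.5, mem_free_max=0.8, concurrency_max=4, remaining_max=600.0
)
```

Because all four comparisons are strict "lower than" tests, a node sitting exactly at a threshold is excluded, which matches the wording of the claim.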
7. The method according to any one of claims 1-6, characterized in that the step of performing spatial proximity mapping and allocation processing on each fluctuation source node in the spatial association network of the fluctuation source nodes and the candidate supercomputing nodes in the candidate supercomputing node group set, to generate an emergency coordination and control instruction set containing a node mapping relationship table and a task division sequence, includes: The spatial association network of the fluctuation source nodes is parsed to obtain the node spatial coordinate parameters corresponding to each fluctuation source node in the network. The node identifier of each candidate supercomputing node in the candidate supercomputing node group set is parsed. The supercomputing center network topology database is queried according to the node identifier to obtain the rack position coordinate parameters of each candidate supercomputing node. The spatial distance metric between each fluctuation source node and each candidate supercomputing node is calculated based on the node spatial coordinate parameters of each fluctuation source node and the rack position coordinate parameters of each candidate supercomputing node. For each fluctuation source node, the candidate supercomputing node with the smallest spatial distance metric to that fluctuation source node is selected from the candidate supercomputing node group set as the mapping target supercomputing node of the fluctuation source node. Based on the mapping target supercomputing node corresponding to each fluctuation source node, a node mapping relationship pair between each candidate supercomputing node and one or more fluctuation source nodes is generated. All node mapping relationship pairs constitute a node mapping relationship table.
Each spatial association edge in the spatial association network of the fluctuation source nodes is parsed to obtain the first fluctuation source node identifier and the second fluctuation source node identifier connected by that edge. The node mapping relationship table is queried to obtain the first mapping target supercomputing node identifier corresponding to the first fluctuation source node identifier and the second mapping target supercomputing node identifier corresponding to the second fluctuation source node identifier. Based on the comparison result between the first mapping target supercomputing node identifier and the second mapping target supercomputing node identifier, either an internal processing task indication, or collaborative subtasks together with their data content format indication, is generated for each spatial association edge, so as to construct a complete task division sequence.
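The nearest-node mapping in claim 7 can be sketched as follows. The claim does not name a distance metric, so Euclidean distance is assumed here, and the coordinates and identifiers are invented for illustration. Ties are broken by the iteration order of the candidate dictionary, which is also an assumption.

```python
import math


def map_sources_to_nodes(source_coords, rack_coords):
    """Map each fluctuation source node to its nearest candidate supercomputing node.

    Returns (mapping, table): `mapping` is source ID -> target node ID, and
    `table` is target node ID -> list of source IDs (the mapping relationship table).
    """
    mapping = {}
    for source, position in source_coords.items():
        # Assumed metric: Euclidean distance between source and rack coordinates.
        mapping[source] = min(
            rack_coords, key=lambda node: math.dist(position, rack_coords[node])
        )
    table = {}
    for source, node in mapping.items():
        table.setdefault(node, []).append(source)
    return mapping, table


# Hypothetical 2-D coordinates for three source nodes and two candidate nodes.
sources = {"s1": (0.0, 0.0), "s2": (1.0, 0.0), "s3": (9.0, 9.0)}
racks = {"n1": (0.0, 1.0), "n2": (10.0, 10.0)}

mapping, table = map_sources_to_nodes(sources, racks)
# mapping == {"s1": "n1", "s2": "n1", "s3": "n2"}
```

Each candidate node may end up with several source nodes or with none; a candidate that is never the nearest node simply does not appear in the table.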
8. The method according to claim 7, characterized in that the step of generating, based on the comparison result between the first mapping target supercomputing node identifier and the second mapping target supercomputing node identifier, an internal processing task indication or collaborative subtasks and their data content format indication for each spatial association edge, in order to construct a complete task division sequence, includes: When the first mapping target supercomputing node identifier is the same as the second mapping target supercomputing node identifier, an internal processing task indication is generated to assign the association relationship processing task corresponding to the spatial association edge to the supercomputing node corresponding to that shared identifier. When the first mapping target supercomputing node identifier is different from the second mapping target supercomputing node identifier, the association relationship processing task corresponding to the spatial association edge is split into a first collaborative subtask executed by the supercomputing node corresponding to the first mapping target supercomputing node identifier and a second collaborative subtask executed by the supercomputing node corresponding to the second mapping target supercomputing node identifier, and an indication of the data content format to be exchanged between the first collaborative subtask and the second collaborative subtask is generated. A complete task division sequence is constructed from all internal processing task indications and all collaborative subtasks and their data content format indications. After the complete task division sequence has been constructed, an emergency coordination and control instruction set containing multiple node-specific instruction data blocks is generated.
Each node-specific instruction data block corresponds to one candidate supercomputing node and contains a list of the fluctuation source nodes that the candidate supercomputing node needs to process and a list of the spatial association edge processing subtasks that it needs to execute.
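The edge-by-edge division in claim 8 and the resulting node-specific instruction data blocks can be sketched as below. The record layout, the `kind` labels, and the `"json"` exchange format are hypothetical placeholders; the claim requires a data content format indication but does not say what that format is.

```python
def divide_tasks(mapping, edges, exchange_format="json"):
    """Build the task division sequence from the source -> target node mapping."""
    sequence = []
    for first, second in edges:
        node_a, node_b = mapping[first], mapping[second]
        if node_a == node_b:
            # Both endpoints map to one node: the edge is handled internally.
            sequence.append({"edge": (first, second), "kind": "internal", "node": node_a})
        else:
            # Endpoints map to different nodes: split into two collaborative subtasks
            # that share the same (assumed) data exchange format.
            for node, peer in ((node_a, node_b), (node_b, node_a)):
                sequence.append({
                    "edge": (first, second),
                    "kind": "collaborative",
                    "node": node,
                    "peer": peer,
                    "format": exchange_format,
                })
    return sequence


def build_instruction_blocks(mapping, sequence):
    """Group source nodes and edge subtasks into one data block per target node."""
    blocks = {}
    for source, node in mapping.items():
        blocks.setdefault(node, {"sources": [], "edge_subtasks": []})["sources"].append(source)
    for task in sequence:
        blocks[task["node"]]["edge_subtasks"].append(task)
    return blocks


# Hypothetical mapping and edges: s1-s2 stays on n1, s2-s3 spans n1 and n2.
mapping = {"s1": "n1", "s2": "n1", "s3": "n2"}
edges = [("s1", "s2"), ("s2", "s3")]

sequence = divide_tasks(mapping, edges)
blocks = build_instruction_blocks(mapping, sequence)
```

In this example only the edge between `s2` and `s3` requires data exchange between nodes, which reflects the stated aim of keeping associated processing on one node where possible.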
9. An emergency coordination and control system for a supercomputing center, characterized in that it includes: a processor; and a storage device storing a computer program which, when executed by the processor, causes the processor to implement the Internet of Things-based emergency coordination and control method for supercomputing centers as described in any one of claims 1-8.
10. A readable storage medium, characterized in that the readable storage medium stores a program or instructions which, when executed by a processor, implement the Internet of Things-based emergency coordination and control method for supercomputing centers as described in any one of claims 1-8.