Dynamic load-based computing power resource elastic scheduling method
Phase labeling and pattern recognition are used to optimize resource scheduling, which addresses the insufficient load prediction accuracy of existing technologies, enables efficient use of computing resources and energy management, and adapts to the changing load characteristics of different training tasks.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHONGLIAN YUNGANG DATA TECH CO LTD
- Filing Date
- 2026-03-20
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies cannot effectively sense the alternation rhythm of computing and communication, resulting in insufficient load prediction accuracy, misalignment of control actions with business rhythm, and risks of overheating and energy waste.
By collecting load data from the computing cluster, labeling the computing phase and communication phase, mining iterative synchronization patterns, generating an iterative pattern codebook, and adjusting based on the load prediction model, resource scheduling is optimized by combining a self-evolution mechanism.
The method improves computing resource utilization by 25%-35%, reduces cooling system energy consumption by 15%-20%, lowers the temperature exceedance rate to below 0.5%, reduces performance throttling events by 90%, and decreases the PUE value by 0.1-0.15, while adapting to the changing load characteristics of different training tasks.
Figure CN122240314A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computing resource scheduling technology, and in particular to a method for elastic scheduling of computing resources based on dynamic load.
Background Technology
[0002] Elastic scheduling of computing resources is a core supporting technology for the operation of computing power in data centers and intelligent computing clusters. It refers to relying on a pooled architecture of computing resources to dynamically allocate computing resources on demand, intelligently route tasks, balance node load, and elastically scale resources according to preset business rules and optimization goals, based on real-time perception of business computing power needs and cluster resource load status. Its core objective is to maximize the utilization efficiency of computing resources, adapt to dynamic fluctuations in business load, and reduce the overall cost of computing power operation while strictly adhering to service level agreements (SLAs).
[0003] The core of computing resource scheduling is based on real-time perception and accurate prediction of computing load, linking computing power allocation and supporting system control to achieve collaborative optimization. Existing technologies have achieved node-level and rack-level computing load data collection within the cluster. Based on time-series prediction models, it can complete the prediction of load trends for single nodes and single racks, optimize computing task allocation strategies by combining real-time load data, and coordinate with supporting infrastructure to adjust operating parameters. This achieves passive collaborative optimization without interfering with tenant business, and is suitable for multi-tenant data center hosting operation scenarios.
[0004] Existing technologies of this kind share the following shortcomings. They all employ independent load time-series prediction and adaptation control logic for single racks and single nodes. Their underlying technical approach treats the load as a purely statistical numerical sequence to be fitted, without understanding or modeling the business-driven logic behind the load waveform. For the strongly synchronous cross-rack iterative pattern exhibited by large-scale distributed training workloads, existing technologies cannot perceive the alternating rhythm of computation and communication phases, so cluster-level coordinated load prediction is severely inaccurate. Control actions are misaligned with the business rhythm: they fail to avoid the overheating risk caused by synchronous load peaks, and they waste energy and computing resources during load troughs.
Summary of the Invention
[0005] The technical problem to be solved by this invention is that existing technologies predict each node or rack independently and merely fit numerical values, so they cannot perceive the alternating rhythm of computation and communication, which leads to misalignment between prediction and control. To address this, an elastic scheduling method for computing resources based on dynamic load is proposed.
[0006] To achieve the above objectives, this application adopts the following technical solution: a dynamic load-based elastic scheduling method for computing resources, comprising: collecting load data and collective communication synchronization events of nodes in a computing cluster; labeling the computing phase and communication phase of the load data based on the collective communication synchronization events; mining and encoding the iterative synchronization pattern of the cluster based on the historical load data after phase labeling, to generate an iterative pattern codebook; predicting future load patterns and load curves through a load prediction model based on the iterative pattern codebook and historical pattern sequences; performing coordinated regulation of computing power, cooling, and energy storage resources with business phases according to the predicted load patterns and load curves; and updating the iterative pattern codebook and the load prediction model according to the deviation between the prediction and the actual situation. The step of mining and encoding the iterative synchronization pattern of the cluster based on the historical load data after phase labeling includes: verifying the cross-node synchronization of the cluster; extracting load waveform feature vectors from multiple consecutive complete iteration cycles; measuring the morphological similarity of load waveforms in different cycles based on dynamic time warping distance; clustering multiple iteration cycles according to the morphological similarity, with each cluster center representing a typical iterative pattern and stored in the iterative pattern codebook.
[0007] Furthermore, the method of measuring morphological similarity based on dynamic time warping distance includes: calculating the dynamic time warping distance for load time series of two iteration cycles; the dynamic time warping distance is defined as the square root of the minimum sum of squares of the differences between corresponding data points of the two sequences under a warping path that satisfies monotonicity and continuity constraints.
[0008] Furthermore, the load waveform feature vector includes at least three of the following features: computation phase duration, computation phase peak power consumption, computation phase average power consumption, communication phase duration, total network traffic of the communication phase, and idle time between iterations.
[0009] Furthermore, the load prediction model includes a pattern predictor and a waveform generator; the pattern predictor is used to predict the type, duration and peak power consumption of future iteration patterns based on historical iteration pattern sequences; the waveform generator is used to generate fine-grained future load prediction curves based on the predicted iteration pattern type and duration.
[0010] Furthermore, the sequence model used by the pattern predictor is a Transformer encoder or a state-space model; the generation model used by the waveform generator is a variational autoencoder or a conditional diffusion model.
[0011] Furthermore, the regulation of computing power, cooling, and energy storage resources in coordination with the business phase includes: performing memory consolidation on the computing node within the current communication phase window when it is predicted that the computing phase is about to begin; determining the pre-cooling start time of the cooling system and performing pre-cooling based on the predicted computing phase start time and pre-cooling advance; and planning the charging and discharging strategy of the energy storage system within the communication phase window based on the predicted computing phase load demand and the energy storage system status.
[0012] Furthermore, the regulation of the coordination between computing resources and service phases specifically involves: when the next stage is predicted to be a computing phase and its start time is within a first preset time range in the future, triggering the memory consolidation operation and ensuring that the operation is completed before the end of the current communication phase; the regulation of the coordination between cooling resources and service phases specifically involves: subtracting the pre-cooling advance from the predicted start time of the computing phase to obtain the pre-cooling start time.
[0013] Furthermore, the collected load data also includes network congestion indicators within the communication phase; the pattern features in the iterative pattern codebook include the network congestion indicators; and the load prediction model is configured to learn the correlation between the network congestion indicators and communication phase changes.
[0014] Furthermore, a dynamic load-based elastic scheduling system for computing resources is provided to implement the dynamic load-based elastic scheduling method for computing resources described above. The system comprises: a data acquisition and phase labeling module for acquiring cluster load data and collective communication synchronization events, and labeling computation phases and communication phases based on those events; a synchronization pattern mining module for mining iterative synchronization patterns based on historical data to generate an iterative pattern codebook; a load prediction module for predicting future load patterns and load curves based on the codebook and historical pattern sequences; a collaborative control module for issuing control commands coordinated with business phases to computing power, cooling, and energy storage resources according to the prediction results; and a self-evolution module for updating the codebook and the load prediction module based on the deviation between the prediction and the actual situation.
[0015] The technical effects and advantages of this invention are as follows: This invention phase-labels the load time series by collecting communication library synchronization events, explicitly encodes the computation-communication cycle of distributed training workloads into an iterative pattern codebook, and performs resource maintenance operations such as node consolidation within the communication phase window based on the pattern prediction results. This avoids resource contention during the computation phase and improves the cluster's computing resource utilization by approximately 25%-35% compared to traditional reactive scheduling schemes, effectively solving the resource idleness problem caused by control lag.
[0016] In this invention, based on the start time and duration predicted for each pattern, countdown-style pre-cooling is performed before the computation phase starts, and the cooling power is dynamically reduced during the communication phase. This aligns the cooling system with the load waveform and avoids the over-cooling or under-cooling caused by control lag in traditional solutions. At the same time, the energy storage charging and discharging strategy is optimized based on the prediction results, improving the absorption of photovoltaic green electricity. Actual measurements show that in typical intelligent computing business scenarios, this method can reduce cooling system energy consumption by 15%-20%, and the overall PUE value decreases by approximately 0.1-0.15.
[0017] In this invention, phase labeling and pattern recognition enable the system to understand the business logic behind the load waveform and predict the start time and peak power consumption of the next computation phase. The cooling system pre-cools accordingly, avoiding the localized hotspots and equipment frequency throttling caused by delayed cooling. Statistics show that after adopting this method, the temperature exceedance rate of the cluster during peak synchronous load periods is reduced to below 0.5%, and performance throttling events due to overheating are reduced by more than 90%, safeguarding the SLA compliance of training tasks.
[0018] In this invention, a self-evolving closed-loop mechanism is used to re-cluster and iterate the pattern codebook daily based on newly added data, and incrementally train the prediction model, dynamically adjusting strategy parameters such as pre-cooling lead time. This mechanism enables the system to adapt to changes in load characteristics caused by different training tasks and different parallel strategies, maintaining high-precision prediction and optimal control effects over the long term, and avoiding performance degradation caused by model aging.
Attached Figure Description
[0019] The disclosure of this invention is illustrated with reference to the accompanying drawings. It should be understood that the drawings are for illustrative purposes only and are not intended to limit the scope of protection of this invention. In the drawings, the same reference numerals are used to refer to the same parts:
[0020] Figure 1 is a flowchart of the method of the present invention; Figure 2 is the logic diagram for iterative synchronization pattern recognition in this invention; Figure 3 is a timing diagram of the construction of the phase-aware dataset of this invention; Figure 4 is an overall structural diagram of the present invention.
Detailed Implementation
[0021] It is readily understood that, based on the technical solution of this invention, those skilled in the art can propose various interchangeable structural methods and implementations without altering the essential spirit of the invention. Therefore, the following detailed embodiments and accompanying drawings are merely illustrative examples of the technical solution of this invention and should not be considered as the entirety of the invention or as limitations or restrictions on the technical solution of this invention.
[0022] As shown in Figures 1-4, this embodiment provides a method for elastic scheduling of computing resources based on dynamic load, applied to a cluster of 500 AI servers deployed in an intelligent computing center. This cluster is used to train a large language model with hundreds of billions of parameters. The training task adopts a hybrid parallel strategy of data parallelism and pipeline parallelism. The cluster load exhibits a typical strongly synchronous iterative pattern: in each training iteration, all nodes in the cluster first synchronously enter a high-load computation phase of about 180 seconds, with the power consumption of a single node jumping from 4kW to 11kW; they then synchronously enter a communication phase of about 30 seconds for gradient synchronization, with the node power consumption dropping to about 5kW.
[0023] Example 1: This example provides a method for elastic scheduling of computing resources based on dynamic load, including the following steps:
[0024] Step 1: Construct a phase-aware global time-series dataset.
[0025] Deploy data acquisition agents on the computing nodes of the cluster, and read the node’s total power consumption, AI processor core temperature and utilization of each core in 1-second cycles through the intelligent platform management interface IPMI; obtain NPU utilization, memory bandwidth usage, core clock frequency and real-time temperature of each AI core through the processor’s dedicated driver interface.
[0026] The total power consumption and inlet / outlet air temperature of the rack-level system are collected at a 1-second cycle using the Modbus TCP protocol on the intelligent PDU in the rack. The compressor frequency, fan speed, and supply air temperature setpoint and actual values are collected at a 1-second cycle using the BACnet protocol on the in-row air conditioning controller. The state of charge (SOC), charging / discharging power, and real-time photovoltaic output are collected at a 1-second cycle using the Modbus TCP protocol on the energy storage converter and photovoltaic inverter.
[0027] Communication phase characteristic data is collected in the following ways: An eBPF program is deployed on each computing node to capture RoCEv2 network traffic from the network card driver layer, and the inbound and outbound traffic and retransmission rate per second are counted. Simultaneously, the number of retransmitted data packets and the number of explicit congestion notification (ECN) flagged messages are collected through the hardware counter of the RoCE network card. The Huawei Collective Communication Library (HCCL) is mounted via hook functions to intercept the start and completion events of synchronization primitives such as AllReduce in real time, recording the event timestamps and the list of nodes participating in the communication. This introduces the business layer information of communication library synchronization events, enabling the system to directly perceive the alternating computation and communication rhythm of distributed training. This assigns clear business semantics to the subsequent load waveform, fundamentally solving the limitation of existing technologies that only understand the load from a numerical perspective.
[0028] All collected data is bound to a Unix timestamp synchronized with NTP and aggregated into the central time-series database InfluxDB via a Kafka message queue.
[0029] Before data is stored in the database, the HCCL event log is parsed in real time: when a collective communication primitive start event is detected, all sampling points from that time until the next collective communication primitive completion event are marked as communication phase; all sampling points after the collective communication primitive completion event until the next collective communication primitive start event are marked as computation phase. Phase-labeling the load time series from the communication library events explicitly encodes the business semantics of the load waveform. This turns an otherwise unstructured numerical load sequence into structured data with clear business phase divisions, providing the data foundation for identifying cluster iteration patterns. The result is an enhanced time-series dataset containing the original physical quantity data, phase labels, and timestamps.
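The labeling rule in this paragraph can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `label_phases`, the event kinds `"start"` and `"complete"`, and the `"unknown"` label for samples that precede the first event are assumptions made for illustration, as is the choice to count a sample taken exactly at an event timestamp as belonging to the phase that the event opens.

```python
from bisect import bisect_right

def label_phases(sample_times, events):
    """Assign a phase label to every sampling timestamp.

    sample_times: sampling timestamps in seconds.
    events: (timestamp, kind) tuples sorted by time, where kind is
            "start" or "complete" (hypothetical names for the collective
            communication start / completion events).
    """
    event_times = [t for t, _ in events]
    labels = []
    for ts in sample_times:
        # Index of the most recent event at or before this sample.
        idx = bisect_right(event_times, ts) - 1
        if idx < 0:
            labels.append("unknown")        # no event seen yet (assumption)
        elif events[idx][1] == "start":
            labels.append("communication")  # between start and completion
        else:
            labels.append("computation")    # between completion and next start
    return labels
```

Because the lookup is a binary search over the event timestamps, each sample is labeled in logarithmic time, which suits labeling a stream of 1-second samples before they are written to storage.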
[0030] Step 2: Discover and encode cluster iterative synchronization patterns.
[0031] Read the most recent 72 hours of continuous time-series data from InfluxDB, group it by node, and then perform the following analysis workflow:
[0032] 2.1 Iteration Boundary Detection: A first-order difference is computed on the network interface card (NIC) outbound traffic sequence and the NPU utilization sequence of each node to identify abrupt drops in traffic or utilization from their peak values. The collective communication start timestamp recorded in the HCCL event log is used as the hard label for the start of the communication phase, and the collective communication completion timestamp is used as the hard label for the start of the computation phase. The abrupt drops are then compared with the hard labels: if more than 95% of the abrupt drops fall within ±2 seconds of a hard label, the data quality is considered acceptable; otherwise, the difference threshold is adjusted and the detection is repeated.
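As a rough illustration of this check, the sketch below detects abrupt drops with a first-order difference and measures how many of them land near a hard label. The function names, the `threshold` parameter, and the assumption that the sample index equals the time in seconds (1-second sampling) are illustrative and not taken from the source.

```python
def detect_drops(series, threshold):
    """Return sample indices where the first-order difference falls
    by more than `threshold` (an abrupt drop from the previous sample)."""
    return [i for i in range(1, len(series))
            if series[i] - series[i - 1] < -threshold]

def drop_match_ratio(drop_times, hard_labels, tolerance=2.0):
    """Fraction of detected drops lying within `tolerance` seconds
    of at least one hard-label timestamp."""
    if not drop_times:
        return 0.0
    matched = sum(
        1 for d in drop_times
        if any(abs(d - h) <= tolerance for h in hard_labels)
    )
    return matched / len(drop_times)

def quality_ok(drop_times, hard_labels, tolerance=2.0, min_ratio=0.95):
    """Data quality passes when more than 95% of drops match a hard label."""
    return drop_match_ratio(drop_times, hard_labels, tolerance) > min_ratio
```

If `quality_ok` returns False, the caller would adjust `threshold` and rerun `detect_drops`, mirroring the retry described above.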
[0033] 2.2 Cross-rack synchronization verification: Taking the start of each computation phase (i.e., the time when collective communication completes) as the reference, the relative time differences with which all nodes in the cluster enter the computation phase are calculated. If the average time difference is less than 10 milliseconds and the standard deviation is less than 5 milliseconds, the cluster is determined to be a strongly synchronous iterative cluster; step 2.3 is performed only on such clusters.
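A minimal sketch of this verification follows. The source does not state what the time differences are measured relative to; the sketch assumes they are offsets from the earliest node, and it uses the population standard deviation. The function name and parameter names are hypothetical.

```python
from statistics import mean, pstdev

def is_strongly_synchronous(entry_times_ms, max_mean_ms=10.0, max_std_ms=5.0):
    """Check whether all nodes enter the computation phase nearly together.

    entry_times_ms: per-node timestamps (milliseconds) at which each node
                    completed collective communication for one iteration.
    Offsets are taken relative to the earliest node (an assumption).
    """
    earliest = min(entry_times_ms)
    offsets = [t - earliest for t in entry_times_ms]
    return mean(offsets) < max_mean_ms and pstdev(offsets) < max_std_ms
```

In practice the check would likely be repeated over many iterations before a cluster is classified, but the source does not specify how per-iteration results are combined.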
[0034] 2.3 Constructing the Iterative Pattern Codebook: Feature extraction is performed on 1000 consecutive complete training iterations. Each complete iteration contains one computation phase and one communication phase. For the $i$-th iteration cycle, the feature vector is defined as:
$$F_i = \left(T_i^{\mathrm{comp}},\ P_i^{\mathrm{peak}},\ P_i^{\mathrm{avg}},\ T_i^{\mathrm{comm}},\ V_i^{\mathrm{net}},\ T_i^{\mathrm{idle}}\right)$$
where $T_i^{\mathrm{comp}}$ is the computation phase duration of the $i$-th cycle (seconds); $P_i^{\mathrm{peak}}$ is the peak power consumption of the computation phase within this cycle (kW); $P_i^{\mathrm{avg}}$ is the average power consumption of the computation phase (kW); $T_i^{\mathrm{comm}}$ is the duration of the communication phase (seconds); $V_i^{\mathrm{net}}$ is the total network traffic during the communication phase (GB); and $T_i^{\mathrm{idle}}$ is the idle time between iterations, i.e., the interval (seconds) from the end of the $i$-th cycle (end of the communication phase) to the start of the $(i+1)$-th cycle (start of the next computation phase).
[0035] This feature vector quantifies the load characteristics of each training iteration cycle, transforming continuous load waveforms into discrete pattern identifiers, and realizing digital modeling of distributed training business beats; enabling the system to understand load change patterns as easily as understanding business syntax, laying the foundation for subsequent pattern-driven prediction.
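To make the six features concrete, the following sketch computes them from the phase-labeled samples of one iteration. The `Sample` record, its field names, and the convention that each sample covers one sampling interval `dt` are assumptions introduced for illustration.

```python
from typing import List, NamedTuple, Tuple

class Sample(NamedTuple):
    t: float           # sample timestamp, seconds
    power_kw: float    # node power consumption, kW
    phase: str         # "computation" or "communication"
    traffic_gb: float  # network traffic moved during this sample interval, GB

def iteration_features(samples: List[Sample], next_compute_start: float,
                       dt: float = 1.0) -> Tuple[float, ...]:
    """Six-dimensional feature vector for one complete iteration.

    samples: time-ordered samples covering one computation phase followed
             by one communication phase.
    next_compute_start: timestamp at which the next computation phase begins.
    """
    comp = [s for s in samples if s.phase == "computation"]
    comm = [s for s in samples if s.phase == "communication"]
    comp_duration = len(comp) * dt
    peak_power = max(s.power_kw for s in comp)
    avg_power = sum(s.power_kw for s in comp) / len(comp)
    comm_duration = len(comm) * dt
    total_traffic = sum(s.traffic_gb for s in comm)
    cycle_end = comm[-1].t + dt           # end of the last communication sample
    idle_time = next_compute_start - cycle_end
    return (comp_duration, peak_power, avg_power,
            comm_duration, total_traffic, idle_time)
```

The tuple order follows the order in which the features are listed in the description: computation duration, peak power, average power, communication duration, total traffic, idle time.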
[0036] To measure the morphological similarity of load waveforms across different cycles, the Dynamic Time Warping (DTW) distance is used as the similarity metric. For the load power consumption time series of two different cycles, $X = (x_1, \dots, x_m)$ and $Y = (y_1, \dots, y_n)$, the DTW distance is defined as:
$$\mathrm{DTW}(X, Y) = \sqrt{\min_{W} \sum_{(i,j) \in W} (x_i - y_j)^2}$$
where $W$ is a warping path from $(1,1)$ to $(m,n)$ that satisfies the monotonicity and continuity constraints (i.e., each step of the path can only move right, down, or diagonally down-right, and the path must include both the start and end points). In practice, this minimum is solved by dynamic programming.
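The dynamic programming solution mentioned here can be sketched in a few lines of Python. This is the textbook DTW recurrence with squared point differences and a final square root, matching the definition in paragraph [0007]; the function name `dtw_distance` is an assumption, and no windowing constraint is applied because the source does not mention one.

```python
import math

def dtw_distance(x, y):
    """DTW distance between two sequences: the square root of the minimum
    sum of squared differences along a monotonic, continuous warping path."""
    m, n = len(x), len(y)
    # cost[i][j] = minimum accumulated squared difference aligning
    # the first i points of x with the first j points of y.
    cost = [[math.inf] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # step down
                                 cost[i][j - 1],      # step right
                                 cost[i - 1][j - 1])  # step diagonally
    return math.sqrt(cost[m][n])
```

The table has (m+1) x (n+1) cells, so time and memory grow with the product of the two sequence lengths. A pairwise distance matrix over 1000 cycles needs roughly half a million such computations, which is feasible offline but may call for a faster implementation in production.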
[0037] Based on the DTW distance matrix of the 1000 cycles, the K-Means clustering algorithm is used for clustering. The number of clusters K is chosen by the silhouette coefficient: the silhouette coefficient is calculated for different values of K, and the K with the largest silhouette coefficient is selected. In this embodiment, K=5. Each cluster center is a 6-dimensional vector representing a typical iterative pattern. These cluster center vectors and their corresponding labels make up the iterative pattern codebook of the cluster. Iteration cycles with similar shapes are classified into the same pattern, forming a complete description of the cluster load behavior; this lets the system identify the inherent regularity of the load waveform and provides structured input for the prediction model.
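The selection of K by silhouette coefficient can be sketched as follows, working directly from a precomputed distance matrix. The sketch assumes the candidate cluster assignments for each K have already been produced by some clustering step; the functions `silhouette_score` and `pick_k` and the convention that a singleton cluster contributes a silhouette of 0 are illustrative choices, not details from the source.

```python
def silhouette_score(dist, labels):
    """Mean silhouette coefficient from a pairwise distance matrix.

    dist: square matrix, dist[i][j] = distance between cycles i and j.
    labels: cluster label of each cycle; at least two clusters are required.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton cluster: silhouette taken as 0
        # a: mean distance to the other members of the same cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)

def pick_k(dist, labelings):
    """labelings maps each candidate K to its cluster labels;
    return the K with the largest mean silhouette coefficient."""
    return max(labelings, key=lambda k: silhouette_score(dist, labelings[k]))
```

With DTW distances as `dist`, each candidate labeling can be scored without recomputing any waveform comparison, so trying several values of K adds little cost on top of the distance matrix itself.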
[0038] Step 3: Build a pattern-driven generative load forecasting model.
[0039] Deploy a TensorFlow-based prediction service on the cluster control node, which includes two sub-models: a pattern predictor and a waveform generator.
[0040] 3.1 Pattern Predictor: The input features are the pattern label sequence of the past 10 complete iteration cycles and the actual duration of each pattern. The iterative pattern sequence is used as the prediction input, rather than the original numerical sequence; this allows the model to learn the evolution patterns of business patterns, rather than simply fitting numerical fluctuations. The model employs a Transformer encoder, containing two encoding layers, each with four attention heads. The output layer is a fully connected layer followed by a Softmax activation function to obtain the probability distribution of the next pattern type; simultaneously, another fully connected layer outputs the predicted duration and predicted peak power consumption of the next pattern. The Transformer encoder can capture the long-range dependencies of pattern sequences and model the temporal correlation between different iterative patterns; achieving accurate prediction of future business patterns, rather than generalized load trend prediction. Training method: Using historical data, the pattern, duration, and peak power consumption of the 11th cycle are predicted from 10 consecutive cycles. The loss function is a weighted sum of cross-entropy loss and mean squared error loss. The Adam optimizer is used for training for 50 epochs.
[0041] 3.2 Waveform Generator: From historical data, all waveform segments belonging to the same pattern category are extracted and normalized to 200 sampling points. A variational autoencoder (VAE) is trained for each pattern category. The VAE encoder consists of a convolutional network and fully connected layers and outputs a mean vector and a log-variance vector. The decoder consists of fully connected layers and a deconvolutional network and recovers a 200-dimensional waveform. The loss function is the sum of the reconstruction error and the KL divergence. Because the VAE encodes waveform morphology in its latent space, it can generate new waveforms that follow the statistical characteristics of the historical waveforms of the same pattern category, decoupling the business pattern from the specific physical waveform. This provides fine-grained input for regulation, so that pre-cooling and charge / discharge strategies can be precisely aligned with the expected waveform. During prediction, the pattern predictor outputs the next pattern category, the expected duration, and the peak power consumption. The waveform generator randomly samples a latent code from the latent space of the VAE for that category, generates a normalized reference waveform through the decoder, then stretches the time axis according to the expected duration and scales the amplitude according to the expected peak power consumption to obtain the final predicted power consumption curve. The final prediction results include: next phase type, expected start time, expected duration, expected peak power consumption, prediction confidence level, and a fine-grained power consumption curve for the next 60 seconds.
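The final stretching and scaling step can be illustrated independently of the generative model. The sketch below takes a normalized reference waveform (standing in for a decoder output), resamples it onto the predicted duration by linear interpolation, and rescales it so that its maximum equals the predicted peak power. The function name, the linear interpolation, and scaling from zero rather than from an idle baseline are assumptions; the source does not specify the normalization scheme.

```python
def render_waveform(reference, duration_s, peak_kw, dt=1.0):
    """Stretch a normalized reference waveform to `duration_s` seconds
    and scale it so its maximum equals `peak_kw`.

    reference: waveform samples with a positive maximum (for example the
               200-point output of a decoder).
    Returns one predicted power value per `dt` seconds, end points included.
    """
    n_out = max(2, int(round(duration_s / dt)) + 1)
    ref_peak = max(reference)  # assumed to be > 0
    last = len(reference) - 1
    curve = []
    for k in range(n_out):
        pos = k * last / (n_out - 1)   # fractional position in the reference
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        value = reference[lo] * (1 - frac) + reference[hi] * frac
        curve.append(value / ref_peak * peak_kw)
    return curve
```

For example, a three-point triangular reference `[0.0, 1.0, 0.0]` rendered with `duration_s=4` and `peak_kw=11` yields `[0.0, 5.5, 11.0, 5.5, 0.0]`.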
[0042] Step 4: Coordinated regulation based on mode and phase.
[0043] The control node sends control commands to the computing power scheduler, cooling controller, and energy storage controller through the gRPC interface.
[0044] 4.1 Micro-scheduling of computing resources: When the pattern predictor predicts with high confidence (95%) that the next stage is a computation phase starting 30-60 seconds later, the computing resource scheduler performs node consolidation within the current communication phase window. Carrying out this maintenance during the communication interval avoids resource contention during the computation phase and keeps scheduling synchronized with the business cycle. Specifically, the scheduler queries the memory fragmentation rate of each node, initiates memory consolidation tasks for nodes with fragmentation rates exceeding 20%, and ensures that the consolidation tasks are completed before the end of the current communication phase so that the next computation phase is not affected.
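The trigger condition described here reduces to a small decision function. The sketch below is illustrative only: the dictionary keys `next_phase`, `confidence`, and `starts_in_s`, the function name, and the reading of 95% as a minimum confidence are assumptions, and the sketch selects nodes without modeling whether consolidation can finish inside the window.

```python
def nodes_to_consolidate(prediction, fragmentation,
                         min_confidence=0.95,
                         window_s=(30.0, 60.0),
                         frag_threshold=0.20):
    """Pick the nodes whose memory should be consolidated during the
    current communication window.

    prediction: dict with hypothetical keys
        "next_phase"  - predicted phase type,
        "confidence"  - prediction confidence in [0, 1],
        "starts_in_s" - seconds until the predicted phase starts.
    fragmentation: dict mapping node name to memory fragmentation rate.
    """
    if prediction["next_phase"] != "computation":
        return []
    if prediction["confidence"] < min_confidence:
        return []
    earliest, latest = window_s
    if not (earliest <= prediction["starts_in_s"] <= latest):
        return []
    # Only nodes above the fragmentation threshold are worth the cost.
    return sorted(node for node, rate in fragmentation.items()
                  if rate > frag_threshold)
```

A production scheduler would also need an estimate of consolidation time per node so that tasks unlikely to finish before the communication phase ends are skipped.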
[0045] 4.2 Coordination of Electricity, Heat, and Storage Phases: Based on the prediction results, the cooling controller subtracts the pre-cooling lead time from the predicted start time of the computation phase to obtain the pre-cooling start time, binding the cooling control command to the predicted phase start time. This keeps the pre-cooling action time-aligned with the load peak and addresses the lag of traditional reactive control. From the pre-cooling start time, the supply air temperature setpoint of the in-row air conditioner is gradually lowered and the fan speed is increased via the BACnet protocol until the computation phase starts. After the computation phase ends, the setpoint is gradually restored based on the duration of the next communication phase. The energy storage controller calculates the required additional power based on the predicted peak power consumption and duration of the next computation phase. Combining this with the current energy storage SOC and photovoltaic output, it decides whether to charge or discharge during the current communication phase. The specific charging and discharging power commands are sent to the energy storage converter via Modbus TCP.
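The timing arithmetic and the energy estimate can be sketched as below. The pre-cooling calculation follows the text directly. The storage part is a loose illustration: the source does not give the decision rule, so the sketch only estimates how much stored energy would be missing for the next computation phase, and the `grid_budget_kw` parameter (power available without storage) is a hypothetical quantity introduced for the example.

```python
def precool_start_time(predicted_compute_start_s, lead_time_s):
    """Pre-cooling start = predicted computation phase start - lead time."""
    return predicted_compute_start_s - lead_time_s

def storage_shortfall_kwh(peak_kw, duration_s, grid_budget_kw,
                          soc, capacity_kwh):
    """Estimate the stored energy missing for the next computation phase.

    peak_kw, duration_s: predicted peak power and duration of the phase.
    grid_budget_kw: hypothetical power limit that can be drawn without
                    storage support (not defined in the source).
    soc: current state of charge in [0, 1]; capacity_kwh: usable capacity.
    A positive result suggests charging during the communication window.
    """
    extra_kw = max(0.0, peak_kw - grid_budget_kw)
    needed_kwh = extra_kw * duration_s / 3600.0
    available_kwh = soc * capacity_kwh
    return max(0.0, needed_kwh - available_kwh)
```

Sizing the need from peak power over the whole duration is deliberately conservative; integrating the predicted power curve instead would give a tighter estimate, and photovoltaic output would further reduce the amount that has to come from storage.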
[0046] Step 5: Self-evolutionary closed loop of the pattern codebook and prediction model. The system executes an offline update task every 24 hours. It retrieves new data from InfluxDB over the past 24 hours and performs the following updates.
[0047] 5.1 Deviation Calculation: For each prediction, record the prediction deviation (whether the predicted category is consistent with the actual category), phase timing deviation (start time error and duration error), waveform generation deviation (root mean square error between the actual curve and the predicted curve), and control effect deviation (the difference between the actual cabinet maximum temperature and the target temperature, and the difference between the actual change and the planned change in energy storage SOC).
[0048] 5.2 Model and Strategy Updates: Feature vectors from complete iteration cycles in the new data are extracted and re-clustered with the original codebook features. If the distance between the new cluster center and any existing center exceeds a threshold, a new pattern category is added, the codebook is updated, and changes in business patterns are dynamically identified. This ensures the pattern codebook always matches the features of the current training task, preventing model failure due to business changes. Incremental training of the pattern predictor is performed using prediction deviation data from the past 30 days, fine-tuning the Transformer model weights. The deviation in control effect is analyzed, and the temperature exceedance rate under different pre-cooling lead times is statistically analyzed. The pre-cooling lead time with the lowest temperature exceedance rate is selected as the new default value, and the strategy library is updated. Through daily closed-loop updates, the system can adapt to changes in business patterns, ensuring the pattern codebook and prediction model always match the features of the current training task, establishing a complete closed loop of perception-prediction-control-evolution; guaranteeing long-term prediction accuracy and control effect, and adapting to complex operational scenarios involving multiple tenants and multiple services.
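Two of the update rules in this step are simple enough to sketch. The first checks whether a newly found cluster center is a new pattern; the source says the distance to "any" existing center must exceed a threshold, which the sketch reads as the distance to every existing center (the nearest one included), and it uses Euclidean distance as an assumption. The second picks the pre-cooling lead time with the lowest observed temperature exceedance rate. Function names and record layouts are hypothetical.

```python
import math

def is_new_pattern(center, codebook, threshold):
    """True when `center` is farther than `threshold` from every
    existing codebook center (Euclidean distance, an assumption)."""
    return all(math.dist(center, existing) > threshold
               for existing in codebook)

def best_precool_lead(records):
    """Choose the lead time with the lowest temperature exceedance rate.

    records: (lead_time_s, exceeded) pairs, one per controlled computation
             phase, where `exceeded` says whether the temperature limit
             was violated.
    """
    stats = {}  # lead time -> (number of phases, number of exceedances)
    for lead, exceeded in records:
        total, bad = stats.get(lead, (0, 0))
        stats[lead] = (total + 1, bad + int(exceeded))
    return min(stats, key=lambda lead: stats[lead][1] / stats[lead][0])
```

Ties between lead times are broken by whichever was recorded first; preferring the shorter lead time in a tie would save cooling energy, but the source does not say how ties are handled.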
[0049] Example 2: This example is basically the same as Example 1, except that different models are used in step 3 to adapt to a larger cluster (2000 nodes) and a shorter iteration cycle (60 seconds for the computation phase and 10 seconds for the communication phase).
[0050] In the pattern predictor in step 3.1, due to the shortened iteration cycle and the need for faster prediction speed, the Mamba state-space model is used instead of the Transformer. The Mamba model is a model based on structured state-space sequences, which has linear complexity when processing long sequences and can complete a forward inference in milliseconds, meeting real-time requirements.
[0051] In the waveform generator in step 3.2, this embodiment uses a conditional diffusion model instead of a VAE. The diffusion model gradually adds noise to the data through a forward process and learns to denoise through a reverse process, generating waveforms conditioned on pattern category and duration. The conditional diffusion model can generate finer waveform details and performs especially well on the power spikes at the start of the computation phase.
[0052] The remaining steps are the same as in Example 1.
[0053] Example 3: This example is basically the same as Example 1, except that finer-grained communication phase feature acquisition is introduced in step 1 to enhance the ability to predict phase drift caused by network congestion.
[0054] In addition to capturing HCCL synchronization events, this embodiment uses the hardware counters on each computing node's RoCE network card to collect, in real time and with a sampling period of 1 second, the number of retransmitted data packets and the number of packets marked with explicit congestion notification (ECN) within the communication phase. These data are added as two additional dimensions to the feature vector in step 2, expanding it into an eight-dimensional vector that contains the retransmission rate and the number of ECN marks.
[0055] In the pattern predictor in step 3.1, the input features are increased by two dimensions so that the model can learn the correlation between the retransmission rate and the lengthening of the communication phase. Introducing network congestion features into pattern prediction allows phase drift caused by network problems to be predicted in advance and avoids interruption of training tasks.
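As a rough illustration of this feature expansion, the sketch below appends the two congestion dimensions to a six-dimensional per-iteration feature vector. The six base fields follow the waveform features named elsewhere in this document; the field names, the `extend_with_congestion` helper, and the use of a total sent-packet count to turn the retransmission count into a rate are assumptions of the sketch, not details given in the embodiment.

```python
from dataclasses import dataclass


@dataclass
class IterationFeatures:
    """Six base features of one complete iteration cycle (names are hypothetical)."""
    compute_duration_s: float
    compute_peak_power_w: float
    compute_avg_power_w: float
    comm_duration_s: float
    comm_total_traffic_bytes: float
    idle_time_s: float


def extend_with_congestion(base, retransmitted_pkts, sent_pkts, ecn_marked_pkts):
    """Return the eight-dimensional feature vector.

    The retransmission rate is computed as retransmitted / sent, assuming the
    NIC also exposes a total sent-packet counter for the communication phase.
    """
    retrans_rate = retransmitted_pkts / sent_pkts if sent_pkts > 0 else 0.0
    return [
        base.compute_duration_s,
        base.compute_peak_power_w,
        base.compute_avg_power_w,
        base.comm_duration_s,
        base.comm_total_traffic_bytes,
        base.idle_time_s,
        retrans_rate,            # dimension 7: retransmission rate
        float(ecn_marked_pkts),  # dimension 8: number of ECN marks
    ]


if __name__ == "__main__":
    base = IterationFeatures(60.0, 6500.0, 5200.0, 10.0, 4.0e9, 0.5)
    print(extend_with_congestion(base, retransmitted_pkts=20, sent_pkts=1000, ecn_marked_pkts=7))
```

The numeric values in the usage example are placeholders chosen only to exercise the function.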
[0056] When a high retransmission rate is predicted for a certain communication phase, the model will output a longer communication phase duration, thereby notifying the computing scheduler in advance to postpone the checkpoint saving operation and avoid training interruption due to phase drift.
[0057] The remaining steps are the same as in Example 1.
[0058] This invention also provides a dynamic load-based elastic scheduling system for computing resources. Deployed in an intelligent computing center, the system comprises the following modules: a data acquisition module, deployed on computing nodes, racks, and cooling and energy storage equipment, which collects fine-grained time-series data such as power consumption, temperature, network traffic, and synchronization events in real time via IPMI, Modbus TCP, BACnet, and eBPF / HCCL hooks, and adds phase tags; a synchronization pattern mining module, which performs iteration boundary detection and cross-node synchronization verification on historical data and generates an iterative pattern codebook by clustering on DTW distance, the codebook being stored in Redis; a load prediction module, which includes a pattern predictor based on Transformer / Mamba and a waveform generator based on a VAE / diffusion model, and outputs the next phase type, its timing, the power consumption curve, and a confidence level; a collaborative control module, which links the computing scheduler, cooling controller, and energy storage controller via a gRPC interface to achieve collaborative optimization of computing, power, heat, and storage; and a self-evolving closed-loop module, which runs an offline update daily and continuously improves prediction accuracy and control effectiveness through deviation calculation and incremental training. The system forms a complete closed loop of perception-prediction-control-evolution.
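The DTW distance used by the synchronization pattern mining module can be sketched in Python as follows. The sketch takes the distance to be the square root of the minimum cumulative squared difference along a monotonic, continuous warping path, and uses the textbook dynamic-programming recurrence; the function name `dtw_distance` is hypothetical, and no windowing or other speed-up is applied.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two load time series.

    cost[i][j] holds the minimum sum of squared differences over all
    warping paths that align a[:i] with b[:j]. Each step moves by one
    index in a, in b, or in both, which enforces monotonicity and
    continuity of the path.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return math.sqrt(cost[n][m])


if __name__ == "__main__":
    # Same shape, one sample stretched: the warping absorbs the difference.
    print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))
```

This quadratic-time version is adequate for short per-iteration waveforms; for long series a constrained warping window would normally be added.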
[0059] The technical scope of this invention is not limited to the content described above. Those skilled in the art can make various modifications and variations to the above embodiments without departing from the technical concept of this invention, and all such modifications and variations should fall within the protection scope of this invention.
Claims
1. A method for elastic scheduling of computing resources based on dynamic load, characterized in that it comprises: collecting load data and collective communication synchronization events of nodes in the computing cluster; annotating the load data with computing phases and communication phases based on the collective communication synchronization events; mining and encoding the iterative synchronization pattern of the cluster based on the phase-annotated historical load data to generate an iterative pattern codebook; predicting future load patterns and load curves through a load prediction model based on the iterative pattern codebook and historical pattern sequences; regulating computing power, cooling, and energy storage resources in coordination with the business phase according to the predicted load patterns and load curves; and updating the iterative pattern codebook and the load prediction model based on the deviation between the predicted and actual load.
2. The method for elastic scheduling of computing resources based on dynamic load according to claim 1, characterized in that the step of mining and encoding the iterative synchronization pattern of the cluster based on the phase-annotated historical load data includes: verifying the cross-node synchronization of the cluster; extracting load waveform feature vectors from multiple consecutive complete iteration cycles; measuring the morphological similarity of load waveforms in different cycles based on dynamic time warping distance; and clustering the multiple iteration cycles according to the morphological similarity, with each cluster center representing a typical iteration pattern and being stored in the iterative pattern codebook.
3. The method for elastic scheduling of computing resources based on dynamic load according to claim 2, characterized in that measuring the morphological similarity based on dynamic time warping distance includes: calculating the dynamic time warping distance between the load time series of two iteration cycles, the dynamic time warping distance being defined as the square root of the minimum, over all warping paths that satisfy monotonicity and continuity constraints, of the sum of squared differences between the data points of the two sequences aligned by the path.
4. The method for elastic scheduling of computing resources based on dynamic load according to claim 2 or 3, characterized in that the load waveform feature vector includes at least three of the following features: computing phase duration, computing phase peak power consumption, computing phase average power consumption, communication phase duration, total network traffic of the communication phase, and idle time between iterations.
5. The method for elastic scheduling of computing resources based on dynamic load according to claim 4, characterized in that the load prediction model includes a pattern predictor and a waveform generator; the pattern predictor is used to predict the type, duration, and peak power consumption of future iteration patterns based on historical iteration pattern sequences; and the waveform generator is used to generate fine-grained future load prediction curves based on the predicted iteration pattern type and duration.
6. The method for elastic scheduling of computing resources based on dynamic load according to claim 5, characterized in that the pattern predictor uses a Transformer encoder or a state-space model as its sequence model, and the waveform generator uses a variational autoencoder or a conditional diffusion model as its generation model.
7. The method for elastic scheduling of computing resources based on dynamic load according to claim 1, characterized in that regulating computing power, cooling, and energy storage resources in coordination with the business phase includes: performing memory consolidation operations on computing nodes within the current communication phase window when it is predicted that a computing phase is about to begin; determining the pre-cooling start time of the cooling system based on the predicted computing phase start time and the pre-cooling advance, and performing pre-cooling; and planning the charging and discharging strategy of the energy storage system within the communication phase window based on the predicted computing phase load demand and the energy storage system status.
8. The method for elastic scheduling of computing resources based on dynamic load according to claim 7, characterized in that regulating computing power resources in coordination with the business phase is specifically as follows: when the next phase is predicted to be a computing phase and its start time falls within a first preset time range in the future, the memory consolidation operation is triggered, and the operation is ensured to be completed before the end of the current communication phase; and regulating cooling resources in coordination with the business phase is specifically as follows: the pre-cooling advance is subtracted from the predicted start time of the computing phase to obtain the pre-cooling start time.
9. The method for elastic scheduling of computing resources based on dynamic load according to claim 1, characterized in that the collected load data also includes network congestion indicators within the communication phase; the pattern features in the iterative pattern codebook contain the network congestion indicators; and the load prediction model is configured to learn the correlation between the network congestion indicators and communication phase changes.
10. A dynamic load-based elastic scheduling system for computing resources, used to implement the dynamic load-based elastic scheduling method for computing resources according to any one of claims 1-9, characterized in that it comprises: a data acquisition and phase labeling module, used to acquire cluster load data and collective communication synchronization events and to label computing phases and communication phases based on the events; a synchronization pattern mining module, used to mine iterative synchronization patterns based on historical data and generate an iterative pattern codebook; a load prediction module, used to predict future load patterns and load curves based on the codebook and historical pattern sequences; a coordinated control module, used to issue control commands to computing power, cooling, and energy storage resources in coordination with the business phase based on the prediction results; and a self-evolution module, used to update the codebook and the load prediction module based on the deviation between the prediction and the actual situation.
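For illustration only, the timing rules recited in claims 7 and 8 can be sketched in Python as below. All names are hypothetical; in particular, `consolidation_cost_s` (an estimate of how long memory consolidation takes) is an assumption introduced here so that the "completed before the end of the current communication phase" condition can be checked, and times are taken as seconds on a common clock.

```python
def precool_start_time(compute_start_s, precool_advance_s):
    """Pre-cooling start = predicted computing-phase start minus the advance."""
    return compute_start_s - precool_advance_s


def should_consolidate_memory(now_s, next_phase, compute_start_s,
                              trigger_window_s, consolidation_cost_s, comm_end_s):
    """Decide whether to trigger memory consolidation now.

    Triggers only if the next phase is a computing phase, its predicted start
    lies within the first preset time range (`trigger_window_s`) from now, and
    the operation would finish before the current communication phase ends.
    `consolidation_cost_s` is an assumed duration estimate.
    """
    if next_phase != "compute":
        return False
    within_window = 0 <= compute_start_s - now_s <= trigger_window_s
    finishes_in_time = now_s + consolidation_cost_s <= comm_end_s
    return within_window and finishes_in_time


if __name__ == "__main__":
    # Computing phase predicted at t=1000 s, pre-cooling advance of 120 s.
    print(precool_start_time(1000.0, 120.0))
    print(should_consolidate_memory(
        now_s=992.0, next_phase="compute", compute_start_s=1000.0,
        trigger_window_s=10.0, consolidation_cost_s=5.0, comm_end_s=1000.0))
```

The numbers in the usage example are arbitrary and do not correspond to parameter values disclosed in the embodiments.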