Heterogeneous computing resource elastic scheduling method
By combining task awareness, resource status monitoring, and intelligent scheduling decisions, the problems of insufficient adaptability and delayed response in heterogeneous computing resource scheduling are solved, and efficient utilization of heterogeneous resources and dynamic assurance of service quality are achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- 创优数字科技(广东)有限公司
- Filing Date
- 2026-03-18
- Publication Date
- 2026-06-16
Figure CN122220060A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of resource allocation, and in particular to a method for elastic scheduling of heterogeneous computing resources.
Background Technology
[0002] Generative AI services (such as large language model inference, image generation, and multimodal content creation) have been widely applied in fields such as office collaboration, digital content production, and intelligent interaction. These services are characterized by heterogeneous task types (text / image / video generation), large fluctuations in resource requirements (peak concurrency can be more than 10 times that of troughs), and strict network service quality constraints (latency requirements are usually less than 1 second, and throughput needs to meet high concurrency requirements).
[0003] Heterogeneous computing resources (including general-purpose CPUs, high-performance GPUs, dedicated FPGAs, AI accelerators, etc.) have become the core infrastructure supporting generative AI services. Different architectures exhibit significant differences in computing power, energy consumption, and adaptability. For example, GPUs excel at massively parallel floating-point operations and are suitable for large model inference; FPGAs possess low latency and high energy efficiency, making them suitable for lightweight generation tasks; while CPUs are suitable for auxiliary scenarios such as task scheduling and data preprocessing. How to achieve dynamic and elastic scheduling of heterogeneous resources to maximize resource utilization while meeting network service quality requirements has become a key technical challenge in deploying generative AI services.
[0004] In some feasible implementations, general container orchestration and scheduling schemes, static resource allocation schemes for heterogeneous computing resources, or resource utilization-based scheduling schemes can be used to address dynamic elastic scheduling of heterogeneous resources. However, these implementations do not fully exploit the characteristics and differences of heterogeneous resources, resulting in insufficient adaptability to heterogeneous resources and relatively slow response. They also rely on a single scheduling decision dimension and lack a dynamic feedback mechanism. Therefore, an elastic scheduling method for heterogeneous computing resources is needed to improve the decision-making efficiency and effectiveness of resource scheduling.
Summary of the Invention
[0005] The purpose of this application is to at least address one of the aforementioned technical deficiencies, particularly the technical deficiencies of insufficient decision-making efficiency and scheduling effectiveness in the prior art for resource scheduling.
[0006] In a first aspect, this application provides a method for elastic scheduling of heterogeneous computing resources, the method comprising: obtaining, through a task awareness module, the task parameters of a generative artificial intelligence task, and determining the computing resource requirements corresponding to the generative artificial intelligence task based on the task parameters; acquiring, through a heterogeneous resource status monitoring module, resource status data for each computing node in a heterogeneous computing cluster; generating, through an intelligent scheduling decision module, a computing resource scheduling scheme based on the task parameters and the resource status data according to a multi-objective optimization model; and executing, through a resource elastic adaptation module, the computing resource scheduling scheme during processing of the generative artificial intelligence task to perform resource scheduling and task migration for each computing node.
[0007] As an optional implementation, the task parameters include task type, input parameter characteristics, and network response constraints; the computational resource requirements include floating-point operations, memory requirements, and parallelism limits; and determining the computational resource requirements corresponding to the generative artificial intelligence task based on the task parameters includes: Based on the task parameters, the computational resource requirements are output using a pre-trained feature mapping model. The task types include text generation tasks, image generation tasks, video generation tasks, or multimodal generation tasks. The input parameter features corresponding to the text generation task include the number of input text units and the upper limit of the number of generated text units. The input parameter features corresponding to the image generation task include resolution, number of channels, and generation style complexity parameters. The features corresponding to the video generation task include frame rate, duration, and resolution. The multimodal generation task includes at least two of the text generation task, the image generation task, and the video generation task. The network response constraints include custom constraint parameters and / or preset constraint parameters.
[0008] As an optional implementation, the resource status data includes computing power data, resource reserve data, health status data, and energy consumption data. The step of obtaining the resource status data of each computing node in the heterogeneous computing cluster includes: According to a preset cycle, the resource status data is obtained in real time through the monitoring agent system of each computing node; In addition, according to the data acquisition instructions, the resource status data of the target computing node is acquired in real time through the monitoring agent system; The resource status data is processed for outliers and normalized, and then stored in a time-series database.
[0009] As an optional implementation, the multi-objective optimization model includes a deep deterministic policy gradient (DDPG) model. The step of generating a computing resource scheduling scheme based on the task parameters and the resource status data according to the multi-objective optimization model includes: performing data preprocessing and vector concatenation on the task parameters and the resource status data to generate a state input vector; processing the state input vector with the generator (actor) network of the deep deterministic policy gradient model to generate an initial scheduling scheme, and determining the reward value of the initial scheduling scheme with the evaluation (critic) network of the deep deterministic policy gradient model; if the reward value of the initial scheduling scheme exceeds a preset threshold, taking the initial scheduling scheme as the target scheduling scheme; if the reward value does not exceed the preset threshold, cyclically adjusting the initial scheduling scheme until the reward value of the adjusted scheme exceeds the preset threshold, and taking the adjusted scheme at that point as the target scheduling scheme; and performing constraint verification on the target scheduling scheme and, after the verification passes, using the target scheduling scheme as the computing resource scheduling scheme and outputting the target node, resource type, resource configuration method, concurrency allocation method, and task start parameters. The constraint verification includes resource margin constraints, compatibility constraints, and network response feasibility constraints.
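The generate, evaluate, adjust, and verify loop described in this paragraph can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: `Scheme`, `decide`, the toy `actor`, `critic`, and `adjust` callables, and the `max_rounds` safety limit are all hypothetical names introduced here, and plain functions stand in for the trained networks. Only two of the three constraint checks (resource margin and compatibility) are shown.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Scheme:
    """Hypothetical container for a candidate scheduling scheme."""
    node: str
    resource_type: str
    memory_gb: float
    concurrency: int


def decide(state: dict,
           actor: Callable[[dict], Scheme],
           critic: Callable[[dict, Scheme], float],
           adjust: Callable[[dict, Scheme], Scheme],
           checks: Sequence[Callable[[dict, Scheme], bool]],
           threshold: float,
           max_rounds: int = 50) -> Optional[Scheme]:
    """Return a verified scheme, or None if no acceptable scheme is found."""
    scheme = actor(state)                      # initial scheduling scheme
    rounds = 0
    while critic(state, scheme) <= threshold:  # reward does not exceed threshold
        if rounds >= max_rounds:               # assumed guard against endless loops
            return None
        scheme = adjust(state, scheme)         # cyclic adjustment
        rounds += 1
    # Constraint verification on the scheme that passed the reward threshold.
    if all(check(state, scheme) for check in checks):
        return scheme
    return None


# Toy stand-ins so the sketch runs end to end.
state = {"free_gb": 16.0, "supported": {"gpu"}}
actor = lambda s: Scheme("node-1", "gpu", 12.0, 1)
critic = lambda s, p: p.concurrency / 8
adjust = lambda s, p: replace(p, concurrency=p.concurrency * 2)
checks = [
    lambda s, p: p.memory_gb <= s["free_gb"],        # resource margin constraint
    lambda s, p: p.resource_type in s["supported"],  # compatibility constraint
]
result = decide(state, actor, critic, adjust, checks, threshold=0.4)
```

Returning `None` when verification fails leaves the fallback policy to the caller; the source does not state what happens in that case.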
[0010] As an optional implementation, the reward value is calculated as a weighted average of the resource utilization improvement rate, the network response compliance rate, and energy consumption parameters. The pre-training method of the deep deterministic policy gradient model includes: acquiring generative task training data within a preset time period and generating a dataset, where the generative task training data includes task characteristics, resource status, scheduling results, and execution effects; and, with the goal of maximizing the reward value, iteratively training the generator network and the evaluation network until a preset convergence condition is met. The preset convergence condition includes the number of iterations reaching a preset threshold, or the resource utilization rate of the deep deterministic policy gradient model on the test set corresponding to the dataset exceeding a first proportion threshold and the network response compliance rate exceeding a second proportion threshold. Furthermore, incremental data collected in a first cycle is added to the dataset, and the current deep deterministic policy gradient model is iteratively trained in real time on the updated dataset.
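As an illustration of the weighted-average reward, the sketch below combines the three terms named in this paragraph. The weight values, the function name `reward`, and the convention that the energy term is pre-normalized so that a higher value means lower consumption are assumptions made here; the source does not specify them.

```python
def reward(util_gain: float,
           qos_rate: float,
           energy_score: float,
           weights: tuple = (0.4, 0.4, 0.2)) -> float:
    """Weighted average of the three reward terms.

    util_gain    -- resource utilization improvement rate
    qos_rate     -- network response compliance rate
    energy_score -- energy term, assumed normalized so higher is better
    weights      -- hypothetical weights; they need not sum to 1
    """
    w_util, w_qos, w_energy = weights
    total = w_util + w_qos + w_energy
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return (w_util * util_gain + w_qos * qos_rate + w_energy * energy_score) / total


example = reward(0.5, 1.0, 0.5, weights=(1, 1, 2))
```

Dividing by the weight sum keeps the result on the same scale as the inputs, which makes a fixed reward threshold easier to interpret.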
[0011] As an optional implementation, the resource scheduling is executed in the following ways: When the total number of allocatable concurrent requests is lower than the number of concurrent task requests, the resource expansion process is triggered: the cloud platform application interface is called to create heterogeneous computing resource instances, a startup time priority is determined according to the resource type of each node to be expanded, resource expansion is performed on the nodes to be expanded in sequence according to that priority, and the expanded computing resources are then registered in the resource cluster list. When the resource utilization rate remains below a preset utilization threshold throughout a preset time period, the resource scale-down process is triggered: the task execution status on the node to be scaled down is checked; if there are running tasks, task migration is triggered; if the node is idle, its computing resources are released and the resource cluster list is updated. Additionally, custom resources of a container orchestration tool are used to bind task units to target nodes and resource types and to set resource isolation parameters. The task migration includes scale-down migration, disaster recovery migration, or load balancing migration. The priority of task migration is sorted based on the network response constraints of each generative artificial intelligence task: latency-sensitive tasks are migrated first, and throughput-sensitive tasks are migrated later. The task migration uses a state savepoint mechanism that saves state savepoints to distributed storage at predetermined time intervals. During migration, after the new node has loaded the model state of the original node, the task data corresponding to the state savepoint is transmitted incrementally and synchronously. The state savepoints indicate the intermediate state of model inference and the progress of input data sharding.
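The scaling triggers and the migration ordering described in this paragraph can be sketched as below. All names are hypothetical, the startup times in `STARTUP_SECONDS` are placeholder values, and expanding the fastest-starting resource type first is an assumption: the source only says that a startup time priority is derived from the resource type.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence

# Placeholder startup times per resource type, in seconds (not from the source).
STARTUP_SECONDS = {"cpu": 5, "fpga": 15, "gpu": 40}


def needs_scale_up(allocatable_requests: int, pending_requests: int) -> bool:
    """Expansion triggers when allocatable concurrency is below demand."""
    return allocatable_requests < pending_requests


def expansion_order(resource_types: Iterable[str]) -> List[str]:
    """Order nodes to expand by startup time (assumed: shortest first)."""
    return sorted(resource_types, key=lambda t: STARTUP_SECONDS[t])


def needs_scale_down(utilization_window: Sequence[float], threshold: float) -> bool:
    """Scale-down triggers only if every sample in the window is below threshold."""
    return bool(utilization_window) and all(u < threshold for u in utilization_window)


@dataclass
class Task:
    name: str
    latency_sensitive: bool
    latency_limit_ms: float


def migration_order(tasks: Iterable[Task]) -> List[Task]:
    """Latency-sensitive tasks first; tighter latency limits earlier within a group."""
    return sorted(tasks, key=lambda t: (not t.latency_sensitive, t.latency_limit_ms))


tasks = [
    Task("batch-video", False, 5000),
    Task("chat", True, 800),
    Task("image", True, 1500),
]
ordered_names = [t.name for t in migration_order(tasks)]
```

Requiring every sample in the window to be below the threshold mirrors the "continuously lower" wording and avoids releasing a node on a single quiet reading.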
[0012] As an optional implementation, the method further includes: By executing the feedback module, the task execution parameters of the generative artificial intelligence task and the resource consumption parameters of each computing node during the processing of the generative artificial intelligence task are obtained. Based on the task execution parameters and the resource consumption parameters, determine the scheduling effect parameters of the computing resource scheduling scheme; Based on the scheduling effect parameters, the multi-objective optimization model is adjusted, and a preset alarm process is executed.
[0013] In a second aspect, this application provides a heterogeneous computing resource elastic scheduling device, the device comprising: an acquisition module, configured to acquire the task parameters of the generative artificial intelligence task through the task awareness module, and determine the computing resource requirements corresponding to the generative artificial intelligence task based on the task parameters; the acquisition module being further configured to acquire resource status data of each computing node in the heterogeneous computing cluster through the heterogeneous resource status monitoring module; a processing module, configured to generate a computing resource scheduling scheme based on the task parameters and the resource status data, according to a multi-objective optimization model, through the intelligent scheduling decision module; the processing module being further configured to execute the computing resource scheduling scheme during the generative artificial intelligence task processing through the resource elastic adaptation module to perform resource scheduling and task migration for each computing node.
[0014] In a third aspect, this application provides a computer device including one or more processors and a memory storing computer-readable instructions that, when executed by the one or more processors, perform the steps of the method described in the first aspect.
[0015] In a fourth aspect, this application provides a storage medium storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the method described in the first aspect.
[0016] As can be seen from the above technical solutions, the embodiments of this application have the following advantages: Based on any of the above embodiments, this application first analyzes the multi-dimensional features of generative AI tasks (such as task type, input parameter scale, and network response constraints) through a task awareness module, and outputs accurate computing resource requirement parameters using a pre-trained feature mapping model, solving the problem of insufficient task characteristic mining in traditional solutions. A heterogeneous resource status monitoring module collects real-time computing power, remaining capacity, health status, and energy consumption data of each computing node at millisecond intervals. After anomaly filtering and normalization, it provides highly reliable input for scheduling. The intelligent scheduling decision module adopts a deep deterministic policy gradient model, vectorizing and concatenating task requirements and resource status to generate a scheduling scheme. It integrates multi-objective optimization of resource utilization, QoS compliance rate, and energy consumption through a reward function, and performs constraint checks such as resource remaining capacity and compatibility, achieving dynamic trade-offs rather than single-dimensional decision-making. The resource elastic adaptation module triggers predictive scaling up and down based on startup time priority, binds resources through container orchestration, and achieves low-latency, seamless task migration based on a state savepoint mechanism, significantly shortening the latency of traditional threshold-triggered scaling up. The execution feedback module collects actual execution parameters and evaluates scheduling effectiveness. Through dynamic updates of model parameters and an alarm mechanism, it forms a closed-loop optimization, continuously adapting to task changes. 
The overall solution achieves breakthrough improvements in heterogeneous resource utilization, elastic response speed, and multi-objective scheduling efficiency.
Attached Figure Description
[0017] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0018] Figure 1 is a flowchart of a method for elastic scheduling of heterogeneous computing resources provided in one embodiment of this application;
Figure 2 is a schematic diagram of the overall process architecture of a heterogeneous computing resource elastic scheduling method provided in one embodiment of this application;
Figure 3 is an internal structural diagram of a computer device provided in an embodiment of this application.
Detailed Implementation
[0019] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0020] Generative AI services (such as large language model inference, image generation, and multimodal content creation) have been widely applied in fields such as office collaboration, digital content production, and intelligent interaction. These services are characterized by heterogeneous task types (text / image / video generation), large fluctuations in resource requirements (peak concurrency can be more than 10 times that of troughs), and strict QoS constraints (latency requirements are usually less than 1 second, and throughput needs to meet high concurrency requirements).
[0021] Heterogeneous computing resources (including general-purpose CPUs, high-performance GPUs, dedicated FPGAs, AI accelerators, etc.) have become the core infrastructure supporting generative AI services. Different architectures exhibit significant differences in computing power, energy consumption, and adaptability: GPUs excel at massively parallel floating-point operations and are suitable for large model inference; FPGAs possess low latency and high energy efficiency, making them suitable for lightweight generation tasks; while CPUs are suitable for auxiliary scenarios such as task scheduling and data preprocessing. How to achieve dynamic and elastic scheduling of heterogeneous resources to maximize resource utilization while meeting service QoS requirements has become a key technical challenge in the deployment of generative AI services.
[0022] One feasible implementation is a general container orchestration and scheduling scheme based on Kubernetes. Kubernetes (K8s) is a mainstream container orchestration platform that uses a scheduler (kube-scheduler) to allocate resources. Its core logic schedules tasks to nodes that meet the conditions, based on Pod resource requests (CPU / memory) and rules such as node affinity and taints and tolerations. To adapt to AI scenarios, enhanced schedulers such as Volcano have been developed to support basic scheduling and queue management of GPU resources. The process is as follows: tasks are submitted as Pods → the scheduler matches node resource availability → a target node is selected according to preset rules (such as the node with the lowest load) → resources are bound and the task is started.
[0023] Another feasible implementation is cloud vendor AI elastic scaling solutions. These solutions provide elastic scaling capabilities based on resource utilization for AI training / inference scenarios. By monitoring metrics such as GPU utilization and task queue length, scaling up is triggered when the metrics exceed preset thresholds (e.g., GPU utilization > 80%), and scaling down is triggered when they fall below the thresholds (e.g., GPU utilization < 30%). The core implementation is: preset resource thresholds → real-time metric monitoring → threshold-triggered scaling up / down → calling cloud resource APIs to create / release GPU instances → distributing tasks to the newly added resources.
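For reference, the fixed-threshold rule described in this paragraph reduces to a few lines. The 80% and 30% values are the examples given in the text; the function name `threshold_action` and the returned labels are hypothetical.

```python
def threshold_action(gpu_utilization: float,
                     scale_up_at: float = 0.80,
                     scale_down_at: float = 0.30) -> str:
    """Purely reactive rule: decide from the current utilization reading only."""
    if gpu_utilization > scale_up_at:
        return "scale_up"
    if gpu_utilization < scale_down_at:
        return "scale_down"
    return "hold"


actions = [threshold_action(u) for u in (0.90, 0.50, 0.20)]
```

Because the rule acts only after a metric has already crossed its threshold, new instances begin starting once the load is already high.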
[0024] Another feasible implementation is a dedicated heterogeneous computing resource scheduling tool (such as TensorFlow Serving + NVIDIA TensorRT). These tools focus on heterogeneous acceleration of AI model inference, adapting to different GPU models through model optimization (such as TensorRT quantization), and supporting collaborative scheduling of CPU and GPU (CPU preprocessing + GPU inference). The scheduling logic is mainly based on the compatibility of model type and hardware, and adopts a static resource allocation strategy (such as allocating one fixed GPU to a specific model).
[0025] These feasible heterogeneous resource scheduling schemes based on generative AI tasks have the following limitations: 1. Insufficient adaptability to heterogeneous resources: Existing solutions do not fully explore the differences in characteristics of heterogeneous resources, and only make simple matching based on "whether it is supported" (e.g., GPU for inference, CPU for assistance), without considering the optimal combination of task type and heterogeneous resources (e.g., lightweight text generation tasks can achieve low latency and low power consumption through FPGA+CPU combination).
[0026] 2. Lagging response of elastic scheduling: The scaling up and down mechanism based on fixed thresholds cannot predict sudden traffic surges in generative AI services (such as a surge in image generation requests caused by hot events). Scaling up usually takes 30 to 60 seconds, causing peak task latency to exceed the standard; untimely scaling down will result in waste of resources.
[0027] 3. Single-dimensional scheduling decision: Existing solutions only focus on resource utilization or task queue length, without integrating multi-dimensional objectives such as QoS requirements (latency / throughput) and energy consumption costs of generative AI tasks. This leads to the contradiction of "over-allocating resources to meet QoS" or "sacrificing service quality to save resources".
[0028] 4. Lack of dynamic feedback optimization: Once the scheduling strategy is determined, it is executed in a fixed manner and cannot be dynamically adjusted according to the performance of the task (such as the deviation between the actual latency and the expectation). It is difficult to adapt to the dynamic characteristics of generative AI tasks (such as the difference in resource requirements for different input sizes of the same model).
[0029] Therefore, the main purpose of this application is: 1. Achieve precise matching between generative AI tasks and heterogeneous computing resources, and fully leverage the characteristics and advantages of resources such as CPU / GPU / FPGA; 2. Improve the response speed of elastic scheduling, anticipate traffic fluctuations in advance, and achieve "predictive scaling up and down" to avoid a drop in QoS compliance rate during peak periods; 3. Construct a multi-objective optimization scheduling decision mechanism to balance resource utilization and energy consumption costs while meeting QoS constraints; 4. Establish a closed-loop feedback optimization system to dynamically iterate scheduling strategies based on task execution results and adapt to the dynamic changes of generative AI services.
[0030] In summary, the technical concept of this application lies in the following: First, the task-aware module analyzes the multi-dimensional characteristics of generative AI tasks (such as task type, input parameter scale, and network response constraints), and uses a pre-trained feature mapping model to output accurate computing resource requirement parameters, solving the problem of insufficient task characteristic mining in traditional solutions. The heterogeneous resource status monitoring module collects real-time computing power, remaining capacity, health status, and energy consumption data of each computing node at millisecond intervals. After anomaly filtering and normalization, it provides highly reliable input for scheduling. The intelligent scheduling decision module adopts a deep deterministic policy gradient model, vectorizing and concatenating task requirements and resource status to generate a scheduling scheme. It integrates multi-objective optimization of resource utilization, QoS compliance rate, and energy consumption through a reward function, and performs constraint checks such as resource remaining capacity and compatibility, achieving dynamic trade-offs rather than single-dimensional decision-making. The resource elastic adaptation module triggers predictive scaling up and down based on startup time priority, combines container orchestration to bind resources, and achieves low-latency, seamless task migration based on a state savepoint mechanism, significantly shortening the latency of traditional threshold-triggered scaling up. The execution feedback module collects actual execution parameters and evaluates scheduling effectiveness. Through dynamic updates of model parameters and an alarm mechanism, it forms a closed-loop optimization, continuously adapting to task changes. The overall solution achieves breakthrough improvements in heterogeneous resource utilization, elastic response speed, and multi-objective scheduling efficiency.
[0031] The methods provided in this application will be described in detail below based on the corresponding implementation methods in some practical application scenarios.
[0032] Figure 1 is a flowchart of a method for elastic scheduling of heterogeneous computing resources according to an embodiment of this application, and Figure 2 is a schematic diagram of the overall process architecture of the method provided in one embodiment of this application. As shown in Figure 1, this application provides a method for elastic scheduling of heterogeneous computing resources, which is described in detail below with reference to the architecture shown in Figure 2:
S101. Obtain the task parameters of the generative artificial intelligence task through the task awareness module, and determine the computing resource requirements corresponding to the generative artificial intelligence task based on the task parameters. As an optional implementation, the task parameters include task type, input parameter characteristics, and network response constraints; the computing resource requirements include floating-point operations, memory requirements, and parallelism limits; and determining the computing resource requirements corresponding to the generative artificial intelligence task based on the task parameters includes: based on the task parameters, outputting the computing resource requirements using a pre-trained feature mapping model. The task types include text generation tasks, image generation tasks, video generation tasks, or multimodal generation tasks. The input parameter features corresponding to the text generation task include the number of input text units and the upper limit of the number of generated text units. The input parameter features corresponding to the image generation task include resolution, number of channels, and generation style complexity parameters. The input parameter features corresponding to the video generation task include frame rate, duration, and resolution. The multimodal generation task includes at least two of the text generation task, the image generation task, and the video generation task. The network response constraints include custom constraint parameters and / or preset constraint parameters.
[0033] This implementation combines specific input features (such as text length, resolution, frame rate, etc.) of text generation, image generation, video generation, and multimodal tasks with user-defined or system-preset QoS constraints (latency, throughput) through a pre-trained feature mapping model. The output parameters include core computing resource requirements such as floating-point operations, memory requirements, and parallelism limits. This design fully leverages the unique requirements of different generative tasks for heterogeneous resources, avoiding the shortcomings of traditional solutions that simply match hardware types. This achieves precise alignment between task requirements and resource computing power characteristics, providing highly accurate input for subsequent intelligent scheduling and improving the rationality of resource allocation.
[0034] For the task awareness module, the core function is to analyze the type and input characteristics of generative AI tasks, extract QoS constraints, and provide task-side data support for scheduling decisions.
[0035] Implementation details include: 1. Task type classification: Based on request protocol fields (such as "task-type" in HTTP header) or input data format, tasks are divided into four categories: text generation, image generation, video generation, and multimodal generation. Each task category has a predefined feature template.
[0036] 2. Input Feature Extraction: The preprocessing module parses the key parameters of the input data: for text tasks, extracts "text length (number of tokens) and maximum number of characters generated"; for image tasks, extracts "resolution, number of channels, and generation style complexity (predefined high / medium / low levels)"; for video tasks, extracts "frame rate, duration, and resolution".
[0037] 3. QoS Constraint Extraction: Supports two QoS configuration methods - user-defined (passing "latency-limit" and "throughput-target" through request parameters) or system default (based on task type presets, such as default latency ≤800ms for text generation and default throughput ≥5 concurrent requests / second for image generation).
[0038] 4. Computational Requirements Modeling: Based on task type + input features + QoS constraints, the core computational requirements of the task (floating-point operations (FLOPs), memory requirements, and parallelism limit) are output through a pre-trained feature mapping model.
[0039] The feature mapping model can employ a structurally appropriate statistical machine learning or neural network architecture, or it can be a predefined fixed-parameter calculation model.
[0040] For example, in a practical application scenario: The user submits an image generation request with the input data being "1024×1024 resolution, landscape style (medium complexity), generate 2 images". The user-defined QoS constraints are "latency ≤ 1500ms, throughput ≥ 3 concurrent requests / second".
[0041] Task awareness module processing flow: 1. The task type is identified as "image generation"; 2. Input feature extraction: resolution 1024×1024, number of channels 3, medium complexity, number of outputs 2; 3. Extract QoS constraints: latency ≤ 1500ms, throughput ≥ 3 concurrent requests / second; 4. Computational requirements modeling: Output FLOPs = 8.2 × 10¹², memory requirement = 12 GB, maximum parallelism = 8.
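The first three steps of this flow (type classification, feature extraction, and QoS extraction with system defaults) can be sketched as follows for the image request above. The header key "task-type" and the parameter names "latency-limit" and "throughput-target" come from the description; the function names, the short type labels, the payload layout, and the default of 3 channels are assumptions for illustration. The feature mapping model of step 4 is not reproduced, since its form is not specified.

```python
TASK_TYPES = {"text", "image", "video", "multimodal"}  # assumed label spelling
QOS_KEYS = ("latency-limit", "throughput-target")

# System presets named in the description; values not stated there are omitted.
SYSTEM_DEFAULTS = {
    "text": {"latency-limit": 800},     # milliseconds
    "image": {"throughput-target": 5},  # concurrent requests per second
}


def classify(headers: dict) -> str:
    """Read the task type from the request header."""
    task_type = headers.get("task-type")
    if task_type not in TASK_TYPES:
        raise ValueError(f"unknown task type: {task_type!r}")
    return task_type


def extract_image_features(payload: dict) -> dict:
    """Pull the image-task features listed in the description."""
    width, height = payload["resolution"]
    return {
        "width": width,
        "height": height,
        "channels": payload.get("channels", 3),  # assumed default
        "complexity": payload["complexity"],     # "high" / "medium" / "low"
        "count": payload.get("count", 1),
    }


def extract_qos(task_type: str, request_params: dict) -> dict:
    """User-defined values win; otherwise fall back to the system preset."""
    defaults = SYSTEM_DEFAULTS.get(task_type, {})
    return {key: request_params.get(key, defaults.get(key)) for key in QOS_KEYS}


headers = {"task-type": "image"}
payload = {"resolution": (1024, 1024), "complexity": "medium", "count": 2}
params = {"latency-limit": 1500, "throughput-target": 3}

task_type = classify(headers)
features = extract_image_features(payload)
qos = extract_qos(task_type, params)
```

A constraint with neither a user value nor a preset comes back as `None`, which leaves it to the caller to decide whether that dimension is unconstrained.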
[0042] S102. Obtain resource status data of each computing node in the heterogeneous computing cluster through the heterogeneous resource status monitoring module; As an optional implementation, the resource status data includes computing power data, resource reserve data, health status data, and energy consumption data. The step of obtaining the resource status data of each computing node in the heterogeneous computing cluster includes: According to a preset cycle, the resource status data is obtained in real time through the monitoring agent system of each computing node; In addition, according to the data acquisition instructions, the resource status data of the target computing node is acquired in real time through the monitoring agent system; The resource status data is processed for outliers and normalized, and then stored in a time-series database.
[0043] This implementation uses lightweight monitoring agents deployed on each node to collect computing power, resource reserves, health status, and energy consumption data in real time at preset intervals (e.g., 100ms), and supports on-demand data supplementation for target nodes. Furthermore, it performs outlier filtering (e.g., replacing short-term mutation values) and normalization on the data, unifying the metrics of different resources before storing them in a time-series database. This mechanism ensures that resource status monitoring is both real-time and interference-resistant, providing a highly reliable data foundation for scheduling decisions and avoiding scheduling deviations caused by data lag or anomalies.
[0044] The core function of the heterogeneous resource status monitoring module is to collect the status data of heterogeneous computing resource clusters (CPU / GPU / FPGA / AI accelerator) in real time, provide resource-side data support for scheduling decisions, and ensure that scheduling is based on the latest resource status.
[0045] Implementation details include: 1. Monitoring Node Deployment: Deploy a lightweight monitoring agent on each computing node to support the adaptation of heterogeneous resources such as CPU (x86 / ARM architecture), GPU (NVIDIA A100 / A800 / 3090, AMD MI250), FPGA (Xilinx Alveo U280), and AI accelerator (Huawei Ascend 910B).
[0046] 2. Definition of monitoring indicators: Computing power related metrics: GPU SM utilization, FPGA computing throughput, CPU core utilization, and AI accelerator computing unit utilization. Resource reserves: Remaining video memory capacity, remaining system memory capacity, and allocable concurrent users; Health status: resource fault markers (such as GPU memory errors), network bandwidth utilization (during cross-node scheduling); Energy consumption data: GPU power consumption, total node energy consumption (for energy optimization goals).
[0047] 3. Data acquisition mechanism: The strategy of "real-time acquisition + on-demand acquisition" is adopted. Basic indicators (utilization rate, margin) are collected at 100ms intervals, and energy consumption data is collected at 1s intervals. When scheduling decisions require it (such as determining whether a node supports high-concurrency tasks), on-demand acquisition is triggered (such as collecting the network port queue length of the node).
[0048] 4. Data storage and preprocessing: Collected data is stored in a time-series database (InfluxDB). The preprocessing module performs outlier filtering (e.g., a sudden jump in GPU utilization to 100% for 10ms is considered an anomaly, and the value from the previous period is used instead) and normalization (mapping the utilization of different resources to the range of 0 to 1).
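The outlier filtering and normalization described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: the function names, the `max_jump` threshold, and the rule that a sample is a spike only if the next sample returns close to the previous value are all hypothetical. The description itself only states that a short-lived jump is replaced with the value from the previous period and that utilization is mapped to the range 0 to 1.

```python
from typing import List


def filter_spikes(samples: List[float], max_jump: float = 40.0) -> List[float]:
    """Replace one-sample spikes with the previous accepted value.

    Assumed rule: a sample is a spike when it differs from the previous
    accepted value by more than max_jump AND the next sample is back
    within max_jump of that previous value.
    """
    cleaned: List[float] = []
    for i, value in enumerate(samples):
        if not cleaned:
            cleaned.append(value)  # the first sample has nothing to compare with
            continue
        prev = cleaned[-1]
        has_next = i + 1 < len(samples)
        is_spike = (
            abs(value - prev) > max_jump
            and has_next
            and abs(samples[i + 1] - prev) <= max_jump
        )
        cleaned.append(prev if is_spike else value)
    return cleaned


def normalize(value: float, lower: float, upper: float) -> float:
    """Map a raw metric onto [0, 1], clamping values outside the range."""
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    return min(1.0, max(0.0, (value - lower) / (upper - lower)))


# GPU utilization in percent, sampled once per period; 100.0 is a one-period spike.
gpu_util_percent = [45.0, 46.0, 100.0, 47.0, 48.0]
cleaned = filter_spikes(gpu_util_percent)
normalized = [normalize(v, 0.0, 100.0) for v in cleaned]
```

A sustained rise to 100% is kept by this rule, because the following sample does not return to the earlier level; only isolated jumps are replaced.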
[0049] For example, in a real-world application scenario: The heterogeneous resource cluster consists of 3 nodes, and the real-time status collected by the monitoring module is as follows: Node 1 (GPU: NVIDIA A100): SM utilization = 45%, remaining video memory = 24GB, allocable concurrency = 6, power consumption = 320W, health status = normal; Node 2 (FPGA: Xilinx U280 + CPU: Intel Xeon 8375C): FPGA throughput = 7.5 × 10¹² FLOPs / s, CPU core utilization = 30%, remaining memory = 32GB, allocable concurrency = 4, power consumption = 180W, health status = normal; Node 3 (GPU: NVIDIA 3090): SM utilization = 85%, remaining video memory = 4GB, allocable concurrency = 1, power consumption = 280W, health status = normal.
[0050] S103. Through the intelligent scheduling decision module, a computing resource scheduling scheme is generated from the task parameters and the resource status data according to a multi-objective optimization model. As an optional implementation, the multi-objective optimization model includes a deep deterministic policy gradient model. The step of generating the computing resource scheduling scheme from the task parameters and the resource status data according to the multi-objective optimization model includes: performing data preprocessing and vector concatenation on the task parameters and the resource status data to generate a state input vector; processing the state input vector with the generator network of the deep deterministic policy gradient model to generate an initial scheduling scheme, and determining the reward value of the initial scheduling scheme with the evaluation network of the deep deterministic policy gradient model; if the reward value of the initial scheduling scheme exceeds a preset threshold, taking the initial scheduling scheme as the target scheduling scheme; if the reward value does not exceed the preset threshold, cyclically adjusting the initial scheduling scheme until the reward value of the adjusted scheme exceeds the preset threshold, and taking the adjusted scheme at that point as the target scheduling scheme; and performing constraint verification on the target scheduling scheme and, once the verification passes, using the target scheduling scheme as the computing resource scheduling scheme and outputting the target node, resource type, resource configuration method, concurrency allocation method, and task start parameters. The constraint verification includes resource margin constraints, compatibility constraints, and network response feasibility constraints.
[0051] This implementation concatenates the task parameters and resource status data into a state input vector. The generator network (Actor) of the deep deterministic policy gradient model then produces an initial scheduling scheme, and the evaluation network (Critic) computes its reward value as a weighted average of resource utilization improvement rate, QoS compliance rate, and energy consumption. The scheme is refined through reward-threshold checks and iterative adjustment, and is then validated against three constraints: resource margin, compatibility, and network response feasibility. Finally, parameters such as the target node, resource type, and concurrency allocation are output. By using reinforcement learning to trade off multiple objectives dynamically, this design overcomes the limitations of traditional single-dimensional scheduling, improving resource utilization and reducing energy consumption while maintaining service quality.
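The vector concatenation step could be sketched as follows. The description gives the state input as a 64-dimensional vector but does not say how that dimension is reached, so the zero-padding, the feature order, and the name `build_state_vector` are assumptions made for illustration only. The input values are placeholders that are assumed to be normalized already.

```python
from typing import List, Sequence

STATE_DIM = 64  # state dimension given in the description


def build_state_vector(
    task_features: Sequence[float],
    node_states: Sequence[Sequence[float]],
    dim: int = STATE_DIM,
) -> List[float]:
    """Concatenate task features and per-node features, zero-padded to dim.

    Zero-padding is an assumption; the source does not specify how unused
    positions of the state vector are filled.
    """
    flat: List[float] = list(task_features)
    for node in node_states:
        flat.extend(node)
    if len(flat) > dim:
        raise ValueError(f"{len(flat)} features exceed state dimension {dim}")
    return flat + [0.0] * (dim - len(flat))


# Placeholder values, assumed to be normalized to [0, 1] beforehand.
task = [0.82, 0.5, 0.8, 0.75, 0.3]
nodes = [[0.45, 0.75, 0.6, 0.8], [0.3, 1.0, 0.4, 0.45]]
state = build_state_vector(task, nodes)
print(len(state))  # 64
```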
[0052] As an optional implementation, the reward value is calculated as a weighted average of the resource utilization improvement rate, the network response compliance rate, and an energy consumption parameter. The pre-training method of the deep deterministic policy gradient model includes: acquiring generative task training data within a preset time period and generating a dataset, where the generative task training data includes task characteristics, resource status, scheduling results, and execution effects; and, with the goal of maximizing the reward value, iteratively training the generator network and the evaluation network until a preset convergence condition is met. The preset convergence condition is that the number of iterations reaches a preset threshold, or that, on the test set corresponding to the dataset, the resource utilization rate of the deep deterministic policy gradient model exceeds a first proportion threshold and its network response compliance rate exceeds a second proportion threshold. Furthermore, incremental data collected at intervals of a first cycle is added to the dataset, and the current deep deterministic policy gradient model is iteratively trained on it in real time.
[0053] This implementation constructs a training dataset using historical task characteristics, resource status, and execution results. It iteratively trains a deep deterministic policy gradient model by maximizing the reward value until the convergence conditions for resource utilization and QoS compliance are met. Furthermore, incremental data is collected periodically, and model parameters are updated iteratively in real time. This mechanism enables the scheduling model to continuously adapt to the dynamic changes in generative AI tasks (such as fluctuations in input size or model upgrades), avoiding the lag of static policies and improving the long-term applicability and robustness of the scheduling scheme.
[0054] The core function of the intelligent scheduling decision module is to generate the optimal scheduling scheme through a multi-objective optimization model based on the task feature data of the task perception module and the resource status data of the heterogeneous resource status monitoring module, so as to achieve accurate matching of "task-resource".
[0055] Implementation details include: 1. Decision Model Architecture: The model adopts a reinforcement learning (RL) model of "offline training + online inference", specifically the deep deterministic policy gradient (DDPG) model, which includes three core components: state space, action space, and reward function.
[0056] State space: Input features = task computation requirements (FLOPs / memory / parallelism) + QoS constraints (latency / throughput) + resource status (utilization of each node / margin / energy consumption), a 64-dimensional vector; Action space: The output is a resource allocation scheme, including "target node selection", "resource combination configuration" and "concurrency allocation", such as "select node 1 (A100 GPU) + allocate 12GB video memory + concurrency = 3"; One way to implement the reward function is: R = 0.4×U + 0.5×Q - 0.1×E (U is the resource utilization improvement rate, Q is the QoS compliance rate, and E is the energy consumption normalization value), ensuring a balance between multiple objectives.
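The example reward function given above can be written directly as code. This is a small sketch: the function and parameter names are illustrative, and the default weights are the ones quoted in the description as one possible implementation.

```python
def reward(
    u: float,
    q: float,
    e: float,
    w_u: float = 0.4,
    w_q: float = 0.5,
    w_e: float = 0.1,
) -> float:
    """Compute R = w_u * U + w_q * Q - w_e * E.

    u: resource utilization improvement rate
    q: QoS compliance rate
    e: normalized energy consumption
    """
    return w_u * u + w_q * q - w_e * e


# Higher energy consumption lowers the reward when U and Q are held fixed.
low_energy = reward(u=0.5, q=1.0, e=0.2)
high_energy = reward(u=0.5, q=1.0, e=0.9)
print(low_energy > high_energy)  # True
```

Keeping the weights as parameters makes it straightforward for a feedback step to adjust them later without changing the function itself.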
[0057] 2. Offline training process: Dataset construction: Collect 1 million+ data entries of generative AI tasks from the past 6 months, including task characteristics, resource status, scheduling results, and execution effects (latency / throughput / energy consumption). Model training: With the goal of "maximizing cumulative reward R", iteratively train the Actor network (to generate scheduling schemes) and Critic network (to evaluate the merits of the schemes) of the DDPG model until the model achieves a QoS compliance rate of ≥98% and a resource utilization rate of ≥75% on the test set. Model updates: Incremental training is performed weekly based on new data to ensure that the model adapts to changes in task characteristics.
[0058] 3. Online decision-making process: Input data concatenation: The task feature vector output by the task awareness module is concatenated with the resource status vector output by the resource status monitoring module to form a 64-dimensional status input; Scheduling scheme generation: The Actor network outputs the initial scheduling scheme, and the Critic network evaluates the reward value of the scheme. If the reward value is lower than the threshold (e.g., R < 0.6), the scheme is fine-tuned (e.g., changing the target node or adjusting the concurrency). Constraint verification: The generated scheduling scheme is subjected to constraint verification, including resource reserve constraints (such as allocated video memory ≤ node remaining video memory), compatibility constraints (such as video generation tasks only support GPU / AI accelerators), and QoS feasibility constraints (based on historical data to predict whether the scheme can meet latency requirements). Scheme output: After verification, the final scheduling scheme is output, including the target node ID, resource type and allocation amount, and task startup parameters (such as GPU memory partition size).
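The "generate, evaluate, fine-tune" part of the online decision process could be sketched as the loop below. The Actor and Critic networks are not modelled here: `score` and `adjust` are hypothetical callables standing in for the Critic evaluation and the fine-tuning step, and `max_rounds` is an assumed safety bound that does not appear in the description. Only the threshold value 0.6 comes from the example above.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Scheme:
    node_id: int
    memory_gb: float
    concurrency: int


def decide(
    initial: Scheme,
    score: Callable[[Scheme], float],    # stand-in for the Critic network
    adjust: Callable[[Scheme], Scheme],  # stand-in for the fine-tuning step
    threshold: float = 0.6,
    max_rounds: int = 10,                # assumed bound, not from the source
) -> Optional[Scheme]:
    """Return the first scheme whose reward is not below the threshold.

    Evaluates the initial scheme and at most max_rounds adjusted schemes;
    returns None if none of them reaches the threshold.
    """
    scheme = initial
    for _ in range(max_rounds + 1):
        if score(scheme) >= threshold:
            return scheme
        scheme = adjust(scheme)
    return None


# Toy stand-ins: the reward is acceptable only at concurrency <= 3,
# and each adjustment lowers the concurrency by one.
chosen = decide(
    Scheme(node_id=1, memory_gb=12.0, concurrency=5),
    score=lambda s: 0.7 if s.concurrency <= 3 else 0.5,
    adjust=lambda s: replace(s, concurrency=s.concurrency - 1),
)
print(chosen)
```

The scheme returned by this loop would still have to pass constraint verification before being output.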
[0059] An example based on a real-world application scenario is provided below. Based on the image generation task (FLOPs = 8.2×10¹², memory requirement = 12GB, maximum parallelism = 8; QoS: latency ≤ 1500ms, throughput ≥ 3 concurrent requests / second) and the resource status data in the aforementioned application scenario, the processing procedure of the intelligent scheduling decision module may include: 1. Input state vector: Task features (8.2e12, 12GB, 8, 1500ms, 3) + Resource state (Node 1: 45%, 24GB, 6, 320W; Node 2: 7.5e12, 32GB, 4, 180W; Node 3: 85%, 4GB, 1, 280W); 2. Actor network initial output: Select Node 1 (A100 GPU) + allocate 12GB of video memory + concurrency = 3; 3. Critic network assessment: Reward value R = 0.4 × 0.72 (resource utilization increased to 72%) + 0.5 × 0.99 (QoS compliance rate predicted to be 99%) - 0.1 × 0.8 (energy consumption normalized value 0.8) = 0.288 + 0.495 - 0.08 = 0.703, which is above the example threshold of 0.6; 4. Constraint verification: Node 1 has 24GB of remaining video memory (≥ 12GB), the image generation task is compatible with the A100 GPU, and the latency predicted from historical data is 1200ms ≤ 1500ms, so verification passes; 5. Final output scheme: Target node 1, resource type NVIDIA A100 GPU, video memory allocation 12GB, concurrency 3, GPU SM frequency set to 1410MHz.
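The constraint verification in step 4 of this example could be sketched as follows. The compatibility table, the type and function names, and the choice to pass the predicted latency in as a plain number are assumptions: the description states only that video generation is limited to GPUs and AI accelerators and that latency is predicted from historical data, without specifying the predictor.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class NodeState:
    resource_type: str
    free_memory_gb: float
    free_concurrency: int


# Hypothetical compatibility table. Only the video entry mirrors the text;
# the image entry reflects the examples, which run image tasks on GPU and FPGA.
COMPATIBLE: Dict[str, Set[str]] = {
    "video_generation": {"gpu", "ai_accelerator"},
    "image_generation": {"gpu", "fpga", "ai_accelerator"},
}


def verify(
    task_type: str,
    memory_gb: float,
    concurrency: int,
    node: NodeState,
    predicted_latency_ms: float,
    latency_limit_ms: float,
) -> List[str]:
    """Return the names of violated constraints; an empty list means pass."""
    violations: List[str] = []
    if memory_gb > node.free_memory_gb or concurrency > node.free_concurrency:
        violations.append("resource_margin")
    if node.resource_type not in COMPATIBLE.get(task_type, set()):
        violations.append("compatibility")
    if predicted_latency_ms > latency_limit_ms:
        violations.append("qos_feasibility")
    return violations


# Node 1 from the example: A100 GPU, 24GB free video memory, 6 free slots.
node1 = NodeState(resource_type="gpu", free_memory_gb=24.0, free_concurrency=6)
print(verify("image_generation", 12.0, 3, node1, 1200.0, 1500.0))  # []
```

Returning the list of violated constraints rather than a single boolean lets the caller decide which part of the scheme to fine-tune.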
[0060] S104. Through the resource elastic adaptation module, during the generative artificial intelligence task processing, the computing resource scheduling scheme is executed to perform resource scheduling and task migration for each computing node.
[0061] As an optional implementation, the resource scheduling is executed as follows: when the total number of allocatable concurrent requests is lower than the number of concurrent task requests, the resource expansion process is triggered; the cloud platform application interface is called to create heterogeneous computing resource instances; a startup time priority is determined according to the resource type of each node to be expanded, and the nodes are expanded in sequence according to that priority; the expanded computing resources are then registered in the resource cluster list. When the resource utilization rate remains below the preset utilization threshold throughout a preset time period, the resource scale-down process is triggered: the task execution status on the node to be scaled down is checked; if there are running tasks, task migration is triggered; if the node is idle, its computing resources are released and the resource cluster list is updated. In addition, through custom resources of a container orchestration tool, task units are bound to target nodes and resource types, and resource isolation parameters are set. The task migration includes scale-down migration, disaster recovery migration, or load balancing migration. Migration priority is ordered according to the network response constraints of each generative artificial intelligence task: latency-sensitive tasks are migrated first, and throughput-sensitive tasks are migrated later. The task migration uses a state savepoint mechanism that saves state savepoints to distributed storage at predetermined time intervals. During migration, the model on the new node is first loaded to the same model state as on the original node, and the task data produced after the state savepoint is then transmitted by incremental synchronization. The state savepoints indicate the intermediate state of model inference and the progress of input data sharding.
[0062] This implementation method, when expanding resources, calls the cloud platform interface to create resources according to the startup time priority of heterogeneous resource instances (e.g., FPGA is faster than GPU). When scaling down, idle nodes are released in conjunction with task migration. Container orchestration tools are used to isolate and bind tasks and resources. Furthermore, a state savepoint mechanism can be used to save the model inference state to distributed storage at preset intervals, with incremental data synchronization during migration. Combined with a migration strategy prioritized by QoS (latency-sensitive tasks are given priority), low-latency, seamless migration is achieved. This significantly shortens the scaling response time, ensures task continuity, and solves the resource waste and response latency problems of traditional threshold-triggered mechanisms.
[0063] The core function of the resource elastic adaptation module is to execute the scheduling scheme output by the intelligent scheduling decision module, realize the dynamic scaling up and down of heterogeneous resources, task binding and migration, and ensure the real-time performance and task continuity of elastic scheduling.
[0064] Implementation details include: 1. Resource Scheduling Executor: Resource expansion: When the intelligent scheduling decision module determines that existing resources cannot meet task requirements (e.g., the sum of allocatable concurrency across all nodes < the number of concurrent task requests), expansion is triggered. Heterogeneous resource instances (e.g., an additional A100 GPU node) are created by calling cloud platform APIs (such as the OpenStack Nova API or the Alibaba Cloud ECS API). Expansion priority is based on resource startup time (for example, an FPGA node starts in 20s and a GPU node in 30s, so resources with shorter startup times are created first). After expansion, the new resources are automatically registered to the resource cluster and synchronized to the status monitoring module.
[0065] Resource scaling down: When the execution feedback module detects that resource utilization is consistently below a threshold (e.g., GPU utilization <30% for 5 minutes), scaling down is triggered. Before scaling down, the task status on the node is checked. If there are running tasks, they are transferred to other nodes through the task migration mechanism; if there are no running tasks, the API is called to release resources and update the cluster resource list.
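The expansion and scale-down triggers described above could be sketched as follows. The function names and the sample format are assumptions for illustration. The startup times (20s and 30s), the 30% utilization threshold, and the 5-minute window are the example values from the description. Calls to a cloud platform API are deliberately left out.

```python
from typing import Dict, List, Sequence, Tuple

# Example startup times from the description, in seconds.
STARTUP_SECONDS: Dict[str, int] = {"fpga": 20, "gpu": 30}


def needs_expansion(allocatable: Sequence[int], requested: int) -> bool:
    """Expand when the total allocatable concurrency is below the demand."""
    return sum(allocatable) < requested


def expansion_order(resource_types: Sequence[str]) -> List[str]:
    """Order resource types so that the fastest-starting ones come first."""
    return sorted(resource_types, key=lambda t: STARTUP_SECONDS[t])


def should_scale_down(
    samples: Sequence[Tuple[float, float]],
    now: float,
    threshold: float = 0.30,
    window_s: float = 300.0,
) -> bool:
    """Check whether utilization stayed below threshold for the whole window.

    samples holds (timestamp_s, utilization in [0, 1]) pairs ordered by time.
    The history must reach back at least window_s seconds (assumed rule).
    """
    covers_window = bool(samples) and samples[0][0] <= now - window_s
    recent = [u for t, u in samples if now - window_s <= t <= now]
    return covers_window and bool(recent) and all(u < threshold for u in recent)


# Three nodes with allocatable concurrency 6, 4 and 1; demand of 8 requests.
print(needs_expansion([6, 4, 1], requested=8))  # False
print(expansion_order(["gpu", "fpga"]))         # ['fpga', 'gpu']
```

A node that passes `should_scale_down` would still need its running tasks checked and migrated before its resources are released.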
[0066] Resource binding: Based on the scheduling scheme, the task Pod is bound to the target node and resource type through the custom resource (CRD) of the container orchestration tool (K8s), and resource isolation parameters (such as GPU memory partition isolation, CPU core exclusivity) are set to avoid resource contention.
[0067] 2. Task Migration Manager: Migration trigger conditions: task migration during scaling down, disaster recovery migration during node failure, and resource load balancing migration (e.g., a node's GPU utilization is > 90% and there are idle nodes).
[0068] Migration Implementation Mechanism: The "Checkpoint + Incremental Synchronization" strategy is adopted. During task execution, a Checkpoint (containing the intermediate state of model inference and the progress of input data sharding) is generated every 200ms and stored in distributed storage (such as MinIO). During migration, the model of the target node is first loaded to the same state as the original node, and then the task data after the Checkpoint is transmitted through incremental synchronization. The migration time is controlled within 50ms to ensure that the task is unaware of the migration.
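The "Checkpoint + Incremental Synchronization" idea could be illustrated with the toy model below. It is a sketch under strong simplifications: the intermediate inference state is reduced to a step counter, the distributed storage is replaced by an in-memory dictionary, and all class and function names are hypothetical. It shows only the ordering of the two phases, restoring from the latest checkpoint and then transferring the data that comes after it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class Checkpoint:
    step: int         # intermediate inference state, reduced to a counter here
    shards_done: int  # number of input data shards already processed


class CheckpointStore:
    """In-memory stand-in for distributed storage such as MinIO."""

    def __init__(self) -> None:
        self._latest: Dict[str, Checkpoint] = {}

    def save(self, task_id: str, checkpoint: Checkpoint) -> None:
        self._latest[task_id] = checkpoint

    def latest(self, task_id: str) -> Optional[Checkpoint]:
        return self._latest.get(task_id)


def migrate(
    task_id: str,
    store: CheckpointStore,
    shards: Sequence[str],
) -> Tuple[Checkpoint, List[str]]:
    """Return the state to restore on the target node and the shards to sync.

    Phase 1: the target node restores the latest checkpoint (or starts from
    scratch if none exists). Phase 2: only shards after the checkpoint are
    transferred, which is the incremental part.
    """
    checkpoint = store.latest(task_id) or Checkpoint(step=0, shards_done=0)
    pending = list(shards[checkpoint.shards_done:])
    return checkpoint, pending


store = CheckpointStore()
store.save("task-1", Checkpoint(step=40, shards_done=2))
restored, pending = migrate("task-1", store, ["s0", "s1", "s2", "s3"])
print(restored.step, pending)  # 40 ['s2', 's3']
```

Because only the data after the checkpoint is transferred, the amount of data moved during migration depends on the checkpoint interval rather than on the total task size.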
[0069] Migration priority: Based on task QoS constraints, latency-sensitive tasks (such as real-time dialogue generation) are migrated first, while throughput-intensive tasks (such as batch image generation) are migrated later.
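The migration ordering could be sketched as a simple sort. The `Task` fields and the tie-break by tighter latency limit are assumptions; the description specifies only that latency-sensitive tasks are migrated before throughput-oriented ones.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Task:
    task_id: str
    latency_sensitive: bool
    latency_limit_ms: float


def migration_order(tasks: Sequence[Task]) -> List[Task]:
    """Latency-sensitive tasks first; tighter latency limits break ties (assumed)."""
    return sorted(tasks, key=lambda t: (not t.latency_sensitive, t.latency_limit_ms))


tasks = [
    Task("batch-images", latency_sensitive=False, latency_limit_ms=60000.0),
    Task("dialogue-a", latency_sensitive=True, latency_limit_ms=1000.0),
    Task("dialogue-b", latency_sensitive=True, latency_limit_ms=500.0),
]
print([t.task_id for t in migration_order(tasks)])
# ['dialogue-b', 'dialogue-a', 'batch-images']
```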
[0070] An example based on a real-world application scenario is provided below. Following the scheduling scheme generated by the aforementioned intelligent scheduling decision module, 5 more identical image generation tasks are added (total concurrency requirement = 3 + 5 = 8), and the resource elastic adaptation module proceeds as follows: 1. Resource verification: The existing node 1 has an allocable concurrency of 6, node 2 has 4, and node 3 has 1, so the total available concurrency is 11 ≥ 8 and no expansion is needed; 2. Resource binding: Create 5 Pods using a K8s CRD, binding them to Node 1 (2 Pods) and Node 2 (3 Pods). Each Pod on Node 1 is allocated 12GB of GPU memory, and each Pod on Node 2 is allocated 2.5×10¹² FLOPs / s of FPGA throughput + 8 CPU cores; 3. Task startup: Synchronously load the image generation model onto the GPU of node 1 and the FPGA of node 2, set the inference parameters (e.g., batch size = 3), and start task execution; 4. Load balancing migration: After 1 minute of running, monitoring shows that the GPU utilization of node 1 has risen to 88% while the FPGA utilization of node 2 is 60%, which triggers a migration. One Pod is migrated from node 1 to node 2, with the intermediate state synchronized through checkpointing. The migration takes 42ms, after which the utilization of node 1 is 72% and that of node 2 is 75%, achieving load balancing.
[0071] As an optional implementation, the method further includes: By executing the feedback module, the task execution parameters of the generative artificial intelligence task and the resource consumption parameters of each computing node during the processing of the generative artificial intelligence task are obtained. Based on the task execution parameters and the resource consumption parameters, determine the scheduling effect parameters of the computing resource scheduling scheme; Based on the scheduling effect parameters, the multi-objective optimization model is adjusted, and a preset alarm process is executed.
[0072] This implementation method collects parameters such as actual task latency, throughput, resource utilization, and energy consumption through an execution feedback module to evaluate the QoS compliance rate, resource utilization, and energy efficiency of the scheduling scheme. Based on the evaluation results, it dynamically adjusts the model parameters (such as reward function weights) of the intelligent scheduling decision module and triggers alarms for abnormal indicators. This forms an "execution-feedback-optimization" closed loop, correcting scheduling deviations in real time and adapting to dynamic changes in tasks, thereby improving the long-term stability and efficiency of the system.
[0073] The core function of the execution feedback module is to collect key data during task execution, evaluate the effectiveness of the scheduling scheme, form a closed-loop feedback, and use it for model optimization of the intelligent scheduling decision module.
[0074] Implementation details include: 1. Feedback Data Collection: Task execution metrics: actual latency (time elapsed from task initiation to output), throughput (number of tasks completed per unit time), and task success rate (percentage of tasks completed without failure). Resource utilization metrics: actual resource utilization rate (e.g., average utilization rate of GPU SM), resource waste rate (the percentage difference between allocated resources and actual usage). Energy consumption metrics: total energy consumption of nodes during task execution, and energy consumption per unit of task (total energy consumption / number of tasks).
[0075] 2. Evaluation of scheduling effectiveness: Construct an evaluation index system: QoS compliance rate (the proportion of tasks with latency ≤ constraint value and throughput ≥ constraint value), resource utilization rate (actual resource usage / allocated resource usage), and energy efficiency (number of tasks / total energy consumption). Evaluation cycle: Each task is evaluated immediately upon completion, and a summary evaluation report is generated every hour.
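The three evaluation indicators defined above could be computed as in the sketch below. The `TaskResult` fields and function names are illustrative assumptions; the formulas follow the definitions in the text (share of tasks meeting both constraints, actual usage over allocated usage, and tasks per unit of energy).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TaskResult:
    latency_ms: float
    latency_limit_ms: float
    throughput: float
    throughput_floor: float


def qos_compliance_rate(results: Sequence[TaskResult]) -> float:
    """Share of tasks with latency <= limit and throughput >= floor."""
    if not results:
        raise ValueError("at least one task result is required")
    ok = sum(
        1
        for r in results
        if r.latency_ms <= r.latency_limit_ms and r.throughput >= r.throughput_floor
    )
    return ok / len(results)


def resource_utilization(actual_usage: float, allocated_usage: float) -> float:
    """Actual resource usage divided by allocated resource usage."""
    if allocated_usage <= 0:
        raise ValueError("allocated_usage must be positive")
    return actual_usage / allocated_usage


def energy_efficiency(num_tasks: int, total_energy_kwh: float) -> float:
    """Number of completed tasks per kWh of energy consumed."""
    if total_energy_kwh <= 0:
        raise ValueError("total_energy_kwh must be positive")
    return num_tasks / total_energy_kwh


results = [
    TaskResult(1150.0, 1500.0, 3.2, 3.0),
    TaskResult(1600.0, 1500.0, 3.2, 3.0),  # misses the latency constraint
]
print(qos_compliance_rate(results))  # 0.5
```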
[0076] 3. Closed-loop feedback optimization: Model parameter update: The evaluation results are used as feedback data and input into the Critic network of the intelligent scheduling decision module to update the model parameters once an hour (such as adjusting the weight of the reward function). Fine-tuning of scheduling strategy: If the QoS compliance rate of a certain type of task is consistently below 95% (such as video generation task), then increase the feature weight of that type of task in the model and optimize the scheduling scheme; Anomaly Alarm: When resource utilization is consistently below 50% or QoS compliance rate is below 90%, an alarm is triggered, prompting the administrator to check task configuration or resource cluster status.
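The threshold checks in the closed-loop feedback step could be sketched as follows. The function name and the returned action labels are hypothetical. The thresholds (95%, 90%, and 50%) are the example values from the description. The description speaks of indicators being consistently below a threshold; this sketch evaluates a single summary value and leaves the persistence check out, which is a simplification.

```python
from typing import List


def feedback_actions(qos_rate: float, utilization: float) -> List[str]:
    """Map summary indicators (both in [0, 1]) to follow-up actions.

    "fine_tune": QoS compliance below 95% for the evaluated task type.
    "alarm": resource utilization below 50% or QoS compliance below 90%.
    """
    actions: List[str] = []
    if qos_rate < 0.95:
        actions.append("fine_tune")
    if utilization < 0.50 or qos_rate < 0.90:
        actions.append("alarm")
    return actions


print(feedback_actions(qos_rate=0.93, utilization=0.60))  # ['fine_tune']
print(feedback_actions(qos_rate=0.85, utilization=0.40))  # ['fine_tune', 'alarm']
```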
[0077] An example based on a real-world application scenario: After the aforementioned eight image generation tasks are completed, the execution feedback module proceeds as follows: 1. Data acquisition: Actual latency = 1150ms (all ≤ 1500ms), throughput = 3.2 concurrent requests / second (≥ 3), task success rate = 100%; Node 1 GPU average utilization = 72%, Node 2 FPGA average utilization = 75%, resource waste rate = 8%; total energy consumption = 12.6kWh, energy consumption per task = 12.6kWh / 8 tasks = 1.575kWh / task; 2. Performance evaluation: QoS compliance rate = 100%, resource utilization rate = 73.5% (the average of 72% and 75%), energy efficiency = 8 tasks / 12.6kWh ≈ 0.63 tasks / kWh; 3. Feedback optimization: The evaluation data is input into the intelligent scheduling decision module to update the Critic network parameters while keeping the current reward function weights. Since all indicators meet the standards, no strategy fine-tuning is required and no alarms are triggered.
[0078] This application uses a task awareness module to accurately analyze the multidimensional characteristics (task type, input parameters, QoS constraints) of generative AI tasks and, combined with a pre-trained feature mapping model, outputs computational resource requirements to achieve intelligent prediction of task computing power needs. A heterogeneous resource status monitoring module collects real-time data on the computing power, remaining capacity, health status, and energy consumption of each computing node at millisecond intervals, and performs outlier filtering and normalization to ensure the real-time nature and reliability of resource status data. Based on the aforementioned task and resource data, an intelligent scheduling decision module uses a deep deterministic policy gradient model to generate a multi-objective optimized scheduling scheme that balances resource utilization, QoS compliance rate, and energy consumption, and ensures the feasibility of the scheme through constraint verification. A resource elastic adaptation module binds resources through container orchestration and achieves seamless task migration based on a state savepoint mechanism, while triggering predictive scaling up and down according to QoS priority. This forms a closed loop of "precise perception - real-time monitoring - intelligent decision-making - elastic execution," significantly improving heterogeneous resource adaptability, scheduling response speed, and dynamic optimization capabilities.
[0079] The key points of this application are: first, a precise perception mechanism based on the multi-dimensional characteristics (type / input scale / QoS constraints) of generative AI tasks and the characteristics of heterogeneous resources, especially the task computational demand modeling method (outputting FLOPs, memory requirements, and parallelism through a pre-trained feature mapping model); second, a multi-objective intelligent scheduling decision model integrating reinforcement learning, including the definition of the state space and action space, the design of a reward function that balances resource utilization, QoS compliance rate, and energy consumption, and a model operation mechanism of "offline training + online inference + incremental update"; third, a heterogeneous resource elastic adaptation scheme that supports migration imperceptible to the running task, including a migration mechanism based on checkpoint + incremental synchronization, a migration strategy based on task QoS priority, and a resource elastic scheduling logic of "predictive scaling" (based on task traffic prediction rather than fixed thresholds); and fourth, a closed-loop feedback optimization system spanning the entire scheduling process, which evaluates the scheduling effect in real time from task execution data, dynamically updates scheduling model parameters and strategies, and achieves continuous iteration of scheduling performance.
[0080] Test results in real-world application scenarios demonstrate that the method provided in this application exhibits superior adaptability to heterogeneous resources. Compared to the existing "simple compatibility matching" approach, this application achieves optimal "task-resource" combinations (e.g., allocating FPGA+CPU for lightweight tasks and A100 GPU for heavy-load tasks) through task computation requirement modeling and resource characteristic awareness, resulting in a resource utilization improvement of over 30%. Elastic scheduling response is faster; while existing technologies' threshold-triggered expansion takes 30-60 seconds, this application, through intelligent prediction and rapid resource binding, reduces expansion response time to 5-10 seconds, increasing peak-period QoS compliance from 85% to over 98%. Scheduling decisions are more comprehensive; compared to the single-dimensional decision-making of existing technologies, this application's multi-objective optimization model balances QoS, resource utilization, and energy consumption. Under the premise of meeting service quality requirements, unit task energy consumption is reduced by over 25%, and resource waste is controlled within 10%. With enhanced dynamic adaptability, this application can automatically adapt to dynamic changes in generative AI tasks (such as model upgrades and input scale adjustments) through a closed-loop feedback optimization mechanism, without the need for manual modification of scheduling rules, thus improving adaptation efficiency by 80%.
[0081] This application also provides a heterogeneous computing resource elastic scheduling device, the device comprising: The acquisition module is used to acquire the task parameters of the generative artificial intelligence task through the task awareness module, and determine the computing resource requirements corresponding to the generative artificial intelligence task based on the task parameters. The acquisition module is also used to acquire resource status data of each computing node in the heterogeneous computing cluster through the heterogeneous resource status monitoring module; The processing module is used to generate a computing resource scheduling scheme based on the task parameters and the resource status data, according to a multi-objective optimization model, through the intelligent scheduling decision module; The processing module is further configured to execute the computing resource scheduling scheme during the generative artificial intelligence task processing through the resource elastic adaptation module to perform resource scheduling and task migration for each computing node.
[0082] This implementation uses a task awareness module to accurately analyze the multi-dimensional characteristics (task type, input parameters, QoS constraints) of generative AI tasks and, combined with a pre-trained feature mapping model, outputs computational resource requirements to achieve intelligent prediction of task computing power needs. A heterogeneous resource status monitoring module collects real-time data on the computing power, remaining capacity, health status, and energy consumption of each computing node at millisecond intervals, and performs outlier filtering and normalization to ensure the real-time nature and reliability of resource status data. Based on the above task and resource data, an intelligent scheduling decision module uses a deep deterministic policy gradient model to generate a multi-objective optimized scheduling scheme that considers resource utilization, QoS compliance rate, and energy consumption, and ensures the feasibility of the scheme through constraint verification. A resource elastic adaptation module binds resources through container orchestration and achieves seamless task migration based on a state savepoint mechanism, while triggering predictive scaling up and down according to QoS priority. This forms a closed loop of "accurate perception - real-time monitoring - intelligent decision-making - elastic execution," significantly improving heterogeneous resource adaptability, scheduling response speed, and dynamic optimization capabilities.
[0083] As an optional implementation, the task parameters include task type, input parameter features, and network response constraints; the computing resource requirements include floating-point operations, memory requirements, and parallelism limits; and the specific manner in which the acquisition module determines the computing resource requirements corresponding to the generative artificial intelligence task based on the task parameters includes: based on the task parameters, outputting the computing resource requirements using a pre-trained feature mapping model. The task types include text generation tasks, image generation tasks, video generation tasks, or multimodal generation tasks. The input parameter features corresponding to the text generation task include the number of input text units and the upper limit of the number of generated text units. The input parameter features corresponding to the image generation task include resolution, number of channels, and generation style complexity parameters. The input parameter features corresponding to the video generation task include frame rate, duration, and resolution. The multimodal generation task includes at least two of the text generation task, the image generation task, and the video generation task. The network response constraints include custom constraint parameters and / or preset constraint parameters.
[0084] This implementation combines specific input features (such as text length, resolution, frame rate, etc.) of text generation, image generation, video generation, and multimodal tasks with user-defined or system-preset QoS constraints (latency, throughput) through a pre-trained feature mapping model. The output parameters include core computing resource requirements such as floating-point operations, memory requirements, and parallelism limits. This design fully leverages the unique requirements of different generative tasks for heterogeneous resources, avoiding the shortcomings of traditional solutions that simply match hardware types. This achieves precise alignment between task requirements and resource computing power characteristics, providing highly accurate input for subsequent intelligent scheduling and improving the rationality of resource allocation.
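By way of non-limiting illustration, the mapping from task parameters to computing resource requirements described above might be sketched as follows. The sketch replaces the pre-trained feature mapping model with a simple linear stand-in; the names `ResourceRequirement`, `COEFFICIENTS`, `workload_units`, and `estimate_requirement`, the feature keys, and all coefficient values are hypothetical and are not taken from this application. QoS constraints and multimodal tasks are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ResourceRequirement:
    flops: float          # estimated floating-point operations
    memory_mb: float      # estimated memory requirement
    max_parallelism: int  # parallelism limit


# Hypothetical coefficients standing in for the pre-trained feature mapping
# model. The numbers are placeholders for illustration, not from the source.
COEFFICIENTS: Dict[str, Dict[str, float]] = {
    "text": {"flops_per_unit": 2.0e9, "mb_per_unit": 0.5, "base_mb": 4096.0, "parallelism": 16},
    "image": {"flops_per_unit": 1.0e6, "mb_per_unit": 0.001, "base_mb": 8192.0, "parallelism": 4},
    "video": {"flops_per_unit": 1.0e6, "mb_per_unit": 0.001, "base_mb": 16384.0, "parallelism": 2},
}


def workload_units(task_type: str, features: Dict[str, float]) -> float:
    """Collapse the per-type input features into one workload size."""
    if task_type == "text":
        return features["input_units"] + features["max_output_units"]
    if task_type == "image":
        return (features["width"] * features["height"] * features["channels"]
                * features["style_complexity"])
    if task_type == "video":
        return (features["width"] * features["height"]
                * features["frame_rate"] * features["duration_s"])
    raise ValueError(f"unsupported task type: {task_type}")


def estimate_requirement(task_type: str, features: Dict[str, float]) -> ResourceRequirement:
    """Map task parameters to a resource requirement (linear stand-in)."""
    units = workload_units(task_type, features)
    c = COEFFICIENTS[task_type]
    return ResourceRequirement(
        flops=units * c["flops_per_unit"],
        memory_mb=c["base_mb"] + units * c["mb_per_unit"],
        max_parallelism=int(c["parallelism"]),
    )
```

In an actual deployment the linear stand-in would presumably be replaced by the trained model, with the same inputs (task type and input parameter features) and the same three outputs.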
[0085] As an optional implementation, the resource status data includes computing power data, resource reserve data, health status data, and energy consumption data. The specific manner in which the acquisition module acquires the resource status data of each computing node in the heterogeneous computing cluster includes: acquiring the resource status data in real time, according to a preset cycle, through the monitoring agent system of each computing node; additionally, acquiring the resource status data of a target computing node in real time through the monitoring agent system in response to a data acquisition instruction; and performing outlier processing and normalization on the resource status data before storing it in a time-series database.
[0086] This implementation uses lightweight monitoring agents deployed on each node to collect computing power, resource reserves, health status, and energy consumption data in real time at preset intervals (e.g., 100ms), and supports on-demand data supplementation for target nodes. Furthermore, it performs outlier filtering (e.g., replacing short-term mutation values) and normalization on the data, unifying the metrics of different resources before storing them in a time-series database. This mechanism ensures that resource status monitoring is both real-time and interference-resistant, providing a highly reliable data foundation for scheduling decisions and avoiding scheduling deviations caused by data lag or anomalies.
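As a hedged illustration of the outlier filtering and normalization described above, the following sketch replaces a short-term spike with the median of the preceding raw samples and then rescales a metric to the range 0 to 1. The median-window rule, the function names `replace_spikes` and `min_max_normalize`, and the default parameters are assumptions chosen for illustration; this application does not specify which filtering or normalization technique is used.

```python
from statistics import median
from typing import List, Sequence


def replace_spikes(samples: Sequence[float], window: int = 5,
                   max_rel_jump: float = 0.5) -> List[float]:
    """Replace a sample by the median of the previous `window` raw samples
    when it deviates from that median by more than `max_rel_jump`.

    Because the reference is computed from raw samples, a sustained level
    change is accepted once it dominates the window; only short-lived
    spikes are replaced.
    """
    cleaned: List[float] = []
    for i, value in enumerate(samples):
        if i >= window:
            reference = median(samples[i - window:i])
            if reference != 0 and abs(value - reference) / abs(reference) > max_rel_jump:
                value = reference
        cleaned.append(value)
    return cleaned


def min_max_normalize(value: float, lower: float, upper: float) -> float:
    """Scale a metric into [0, 1] so different resource metrics are comparable."""
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    return min(1.0, max(0.0, (value - lower) / (upper - lower)))
```

For example, a utilization series sampled every 100 ms could be passed through `replace_spikes` and each cleaned value through `min_max_normalize` before being written to the time-series database.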
[0087] As an optional implementation, the multi-objective optimization model includes a deep deterministic policy gradient model. The specific manner in which the processing module generates a computing resource scheduling scheme based on the task parameters and the resource status data, according to the multi-objective optimization model, includes: performing data preprocessing and vector concatenation on the task parameters and the resource status data to generate a state input vector; processing the state input vector through the generator network of the deep deterministic policy gradient model to generate an initial scheduling scheme, and determining the reward value of the initial scheduling scheme through the evaluation network of the deep deterministic policy gradient model. If the reward value of the initial scheduling scheme exceeds a preset threshold, the initial scheduling scheme is taken as the target scheduling scheme. If the reward value of the initial scheduling scheme does not exceed the preset threshold, the initial scheduling scheme is cyclically adjusted until the reward value of the adjusted scheme exceeds the preset threshold, and the adjusted scheme at that point is taken as the target scheduling scheme. Constraint verification is then performed on the target scheduling scheme. After the constraint verification is passed, the target scheduling scheme is used as the computing resource scheduling scheme, and the target node, resource type, resource configuration method, concurrency allocation method, and task start parameters are output. The constraint verification includes resource margin constraints, compatibility constraints, and network response feasibility constraints.
[0088] This implementation concatenates task parameters and resource status data into a state input vector. An initial scheduling scheme is then generated by the generator network of the deep deterministic policy gradient model, and the evaluation network calculates the reward value based on a weighted average of resource utilization improvement rate, QoS compliance rate, and energy consumption. The scheme is optimized through reward threshold judgment and iterative adjustment, and is further validated against three constraints: resource margin, compatibility, and network response feasibility. Finally, parameters such as target node, resource type, and concurrency allocation are output. This design integrates reinforcement learning to achieve multi-objective dynamic trade-offs, overcoming the limitations of traditional single-dimensional scheduling, improving resource utilization and reducing energy consumption while ensuring service quality.
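The control flow described above (reward evaluation, threshold judgment, iterative adjustment, and constraint verification) might be sketched as follows. This is only a sketch under stated assumptions: the generator and evaluation networks are replaced by plain callables, the weights are placeholders, energy consumption is assumed to enter the reward as a penalty, and the bound `max_rounds` is added so that the loop always terminates. None of the names (`Plan`, `Metrics`, `reward`, `select_plan`) come from this application.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Plan:
    node: str
    resource_type: str
    concurrency: int


@dataclass(frozen=True)
class Metrics:
    utilization_gain: float  # resource utilization improvement rate, 0..1
    qos_rate: float          # QoS compliance rate, 0..1
    energy: float            # normalized energy consumption, 0..1


def reward(m: Metrics, w_util: float = 0.4, w_qos: float = 0.4,
           w_energy: float = 0.2) -> float:
    """Weighted reward; treating energy as a penalty is an assumption."""
    return w_util * m.utilization_gain + w_qos * m.qos_rate - w_energy * m.energy


def select_plan(initial: Plan,
                evaluate: Callable[[Plan], Metrics],
                adjust: Callable[[Plan], Plan],
                verify: Callable[[Plan], bool],
                threshold: float,
                max_rounds: int = 10) -> Optional[Plan]:
    """Adjust the plan until its reward exceeds the threshold, then verify it.

    Returns the verified plan, or None when no plan clears the threshold
    within `max_rounds` adjustments or the plan fails constraint verification.
    """
    plan = initial
    for _ in range(max_rounds + 1):
        if reward(evaluate(plan)) > threshold:
            return plan if verify(plan) else None
        plan = adjust(plan)
    return None
```

Here `verify` stands for the combined resource margin, compatibility, and network response feasibility checks, and `adjust` for whatever adjustment step the scheduler applies between evaluations.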
[0089] As an optional implementation, the reward value is calculated by weighting the resource utilization improvement rate, the network response compliance rate, and the energy consumption parameters. The specific manner in which the processing module pre-trains the deep deterministic policy gradient model includes: acquiring generative task training data within a preset time period and generating a dataset, the generative task training data including task characteristics, resource status, scheduling results, and execution effects; and, with the goal of maximizing the reward value, iteratively training the generator network and the evaluation network until a preset convergence condition is met. The preset convergence condition includes the number of iterations reaching a preset threshold, or the resource utilization rate of the deep deterministic policy gradient model on the test set corresponding to the dataset exceeding a first proportion threshold and the network response compliance rate exceeding a second proportion threshold. Furthermore, incremental data collected according to a first cycle is added to the dataset, and the current deep deterministic policy gradient model is iteratively trained on it in real time.
[0090] This implementation constructs a training dataset using historical task characteristics, resource status, and execution results. It iteratively trains a deep deterministic policy gradient model by maximizing the reward value until the convergence conditions for resource utilization and QoS compliance are met. Furthermore, incremental data is collected periodically, and model parameters are updated iteratively in real time. This mechanism enables the scheduling model to continuously adapt to the dynamic changes in generative AI tasks (such as fluctuations in input size or model upgrades), avoiding the lag of static policies and improving the long-term applicability and robustness of the scheduling scheme.
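The convergence condition described above can be read as "iteration budget reached, or both test-set thresholds exceeded". A minimal sketch of that reading follows; the grouping of the "or" and "and" clauses is an interpretation, and `has_converged`, `train_until_converged`, and their parameters are hypothetical names. The actual training step of the model is left as an opaque callable.

```python
from typing import Callable, Tuple


def has_converged(iteration: int, utilization: float, qos_rate: float, *,
                  max_iterations: int, utilization_threshold: float,
                  qos_threshold: float) -> bool:
    """True when the iteration budget is used up, or when both the
    test-set utilization and the QoS compliance rate exceed their thresholds."""
    return iteration >= max_iterations or (
        utilization > utilization_threshold and qos_rate > qos_threshold
    )


def train_until_converged(train_step: Callable[[], None],
                          evaluate: Callable[[], Tuple[float, float]], *,
                          max_iterations: int,
                          utilization_threshold: float,
                          qos_threshold: float) -> int:
    """Run training steps until the convergence condition holds.

    `evaluate` returns (utilization, qos_rate) measured on the test set.
    Returns the number of training steps performed.
    """
    iteration = 0
    while True:
        train_step()
        iteration += 1
        utilization, qos_rate = evaluate()
        if has_converged(iteration, utilization, qos_rate,
                         max_iterations=max_iterations,
                         utilization_threshold=utilization_threshold,
                         qos_threshold=qos_threshold):
            return iteration
```

The same loop could be re-entered after each batch of incremental data is appended to the dataset, which corresponds to the periodic update described above.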
[0091] As an optional implementation, the specific manner in which the processing module performs resource scheduling includes: when the total number of allocatable concurrent requests is lower than the number of concurrent task requests, the resource expansion process is triggered. The cloud platform application programming interface is called to create heterogeneous computing resource instances. The startup time priority is determined according to the resource type of each node to be expanded, and resource expansion is performed on each node to be expanded in sequence according to the startup time priority. The expanded computing resources are then registered in the resource cluster list. When the resource utilization rate remains below the preset utilization threshold throughout a preset time period, the resource scale-down process is triggered. The task execution status on the node to be scaled down is checked. If there are running tasks, task migration is triggered. If the node is idle, its computing resources are released and the resource cluster list is updated. Additionally, through custom resources of a container orchestration tool, task units are bound to target nodes and resource types, and resource isolation parameters are set. Furthermore, the task migration includes scale-down migration, disaster recovery migration, or load balancing migration. The priority of task migration is ordered based on the network response constraints of each generative artificial intelligence task, so that latency-sensitive tasks are migrated first and throughput-sensitive tasks are migrated later. The task migration uses a state savepoint mechanism, which saves state savepoints to distributed storage at predetermined time intervals. During migration, after the model on the new node has loaded the model state of the original node, the task data corresponding to the state savepoint is transmitted through incremental synchronization. The state savepoints indicate the intermediate state of model inference and the sharding progress of the input data.
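For illustration only, the scaling triggers and the two orderings described above (expansion by startup time priority, migration by latency sensitivity) might be sketched as follows. The rank table `STARTUP_RANK` is hypothetical, and resource types not listed in it are simply placed last. The cloud platform interface and the container orchestration calls are deliberately left out, since this application does not name them.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical startup-time ranks; a lower rank is expanded first.
STARTUP_RANK = {"fpga": 0, "gpu": 1}


def needs_scale_up(allocatable_concurrency: int, requested_concurrency: int) -> bool:
    """Expansion is triggered when allocatable concurrency falls short."""
    return allocatable_concurrency < requested_concurrency


def needs_scale_down(utilization_window: Sequence[float], threshold: float) -> bool:
    """Scale-down is triggered only if every sample in the window is below the threshold."""
    return len(utilization_window) > 0 and all(u < threshold for u in utilization_window)


def expansion_order(resource_types: Sequence[str]) -> List[str]:
    """Order nodes to be expanded by startup time priority; unknown types go last."""
    return sorted(resource_types, key=lambda t: STARTUP_RANK.get(t, len(STARTUP_RANK)))


@dataclass(frozen=True)
class Task:
    task_id: str
    latency_sensitive: bool  # False means throughput-sensitive


def migration_order(tasks: Sequence[Task]) -> List[Task]:
    """Latency-sensitive tasks first; the stable sort keeps the original
    order within each group."""
    return sorted(tasks, key=lambda t: not t.latency_sensitive)
```

Requiring every sample in the window to fall below the threshold is one way to express "continuously lower"; a real system might instead tolerate a small fraction of samples above it.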
[0092] This implementation method, when expanding resources, calls the cloud platform interface to create resources according to the startup time priority of heterogeneous resource instances (e.g., FPGA is faster than GPU). When scaling down, idle nodes are released in conjunction with task migration. Container orchestration tools are used to isolate and bind tasks and resources. Furthermore, a state savepoint mechanism can be used to save the model inference state to distributed storage at preset intervals, with incremental data synchronization during migration. Combined with a migration strategy prioritized by QoS (latency-sensitive tasks are given priority), low-latency, seamless migration is achieved. This significantly shortens the scaling response time, ensures task continuity, and solves the resource waste and response latency problems of traditional threshold-triggered mechanisms.
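The state savepoint mechanism described above might be sketched as follows, assuming that a savepoint holds an opaque inference state plus the index of the next unprocessed input shard. The in-memory `SavepointStore` is a hypothetical stand-in for distributed storage, and `Savepoint` and `remaining_shards` are illustrative names that do not appear in this application.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Savepoint:
    task_id: str
    inference_state: bytes  # opaque intermediate state of model inference
    next_shard: int         # index of the first input shard not yet processed


class SavepointStore:
    """In-memory stand-in for distributed storage (hypothetical)."""

    def __init__(self) -> None:
        self._latest: Dict[str, Savepoint] = {}

    def save(self, savepoint: Savepoint) -> None:
        """Keep only the most recent savepoint per task."""
        self._latest[savepoint.task_id] = savepoint

    def latest(self, task_id: str) -> Optional[Savepoint]:
        return self._latest.get(task_id)


def remaining_shards(shards: List[str], savepoint: Optional[Savepoint]) -> List[str]:
    """Shards the new node still has to process after restoring the savepoint.

    Without a savepoint the task restarts from the first shard.
    """
    start = savepoint.next_shard if savepoint is not None else 0
    return shards[start:]
```

On migration, the new node would restore `inference_state` and then receive only the shards returned by `remaining_shards`, which is one possible reading of the incremental synchronization mentioned above.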
[0093] As an optional implementation, the processing module is further configured to: obtain, through the execution feedback module, the task execution parameters of the generative artificial intelligence task and the resource consumption parameters of each computing node during processing of the generative artificial intelligence task; determine the scheduling effect parameters of the computing resource scheduling scheme based on the task execution parameters and the resource consumption parameters; and adjust the multi-objective optimization model based on the scheduling effect parameters, and execute a preset alarm process.
[0094] This implementation method collects parameters such as actual task latency, throughput, resource utilization, and energy consumption through an execution feedback module to evaluate the QoS compliance rate, resource utilization, and energy efficiency of the scheduling scheme. Based on the evaluation results, it dynamically adjusts the model parameters (such as reward function weights) of the intelligent scheduling decision module and triggers alarms for abnormal indicators. This forms an "execution-feedback-optimization" closed loop, correcting scheduling deviations in real time and adapting to dynamic changes in tasks, thereby improving the long-term stability and efficiency of the system.
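As a hedged sketch of the "execution-feedback-optimization" loop described above, the following code derives a QoS compliance rate from measured latencies, shifts reward weight toward QoS when the rate falls short of a target, and lists indicators that would raise an alarm. The update rule, the thresholds, and the names `qos_compliance_rate`, `adjust_weights`, and `abnormal_indicators` are assumptions for illustration; this application does not specify how the weights are adjusted.

```python
from typing import Dict, List, Sequence


def qos_compliance_rate(latencies_ms: Sequence[float], limit_ms: float) -> float:
    """Fraction of completed tasks whose latency met the limit."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    return sum(1 for x in latencies_ms if x <= limit_ms) / len(latencies_ms)


def adjust_weights(weights: Dict[str, float], qos_rate: float,
                   target: float = 0.95, step: float = 0.1) -> Dict[str, float]:
    """Raise the QoS weight when compliance is below target, then renormalize
    so the weights sum to 1. The rule itself is a placeholder."""
    updated = dict(weights)
    if qos_rate < target:
        updated["qos"] += step
    total = sum(updated.values())
    return {name: value / total for name, value in updated.items()}


def abnormal_indicators(qos_rate: float, utilization: float, *,
                        min_qos: float = 0.9,
                        min_utilization: float = 0.3) -> List[str]:
    """Names of indicators that fall below their alarm thresholds."""
    names: List[str] = []
    if qos_rate < min_qos:
        names.append("qos_compliance")
    if utilization < min_utilization:
        names.append("resource_utilization")
    return names
```

The returned weights could then be fed back into the reward calculation of the scheduling model, and a non-empty indicator list passed to the preset alarm process.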
[0095] It should be noted that the division of the various modules in the above device is merely a logical functional division. In actual implementation, they can be fully or partially integrated into a single physical entity, or they can be physically separated. Furthermore, these modules can be implemented entirely in software via processing element calls; they can be fully implemented in hardware; or some modules can be implemented by processing element calls to software, while others are implemented in hardware. For example, a processing module can be a separate processing element, or it can be integrated into a chip within the device. Alternatively, it can be stored as program code in the device's memory, and its functions can be called and executed by a processing element. The implementation of other modules is similar. Moreover, these modules can be fully or partially integrated together, or they can be implemented independently. The processing element here can be an integrated circuit with signal processing capabilities. During implementation, each step of the above method or each of the above modules can be completed through integrated logic circuits in the hardware of the processor element or through software instructions.
[0096] Illustratively, Figure 3 is a schematic diagram of the internal structure of a computer device 300 provided in an embodiment of this application. The computer device 300 can be provided as a server. Referring to Figure 3, the computer device 300 includes a processing component 302, which further includes one or more processors, and memory resources represented by memory 301 for storing instructions, such as application programs, that can be executed by the processing component 302. The application programs stored in memory 301 may include one or more modules, each corresponding to a set of instructions. Furthermore, the processing component 302 is configured to execute instructions to perform the methods of any of the embodiments described above.
[0097] The computer device 300 may also include a power supply component 303 configured to perform power management of the computer device 300, a wired or wireless network interface 304 configured to connect the computer device 300 to a network, and an input / output (I / O) interface 305. The computer device 300 may operate on an operating system stored in memory 301, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or similar.
[0098] Those skilled in the art will understand that the structure shown in Figure 3 is merely a block diagram of the portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.
[0099] This application provides a storage medium storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the method provided in any embodiment.
[0100] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0101] The various embodiments in this specification are described in a progressive manner. Each embodiment focuses on the differences from other embodiments. The various embodiments can be combined as needed, and the same or similar parts can be referred to each other.
[0102] The above description of the disclosed embodiments enables those skilled in the art to make or use this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims
1. A method for elastic scheduling of heterogeneous computing resources, characterized in that, the method includes: The task awareness module obtains the task parameters of the generative artificial intelligence task and determines the computing resource requirements corresponding to the generative artificial intelligence task based on the task parameters. The heterogeneous resource status monitoring module acquires resource status data for each computing node in the heterogeneous computing cluster. The intelligent scheduling decision module generates a computing resource scheduling scheme based on the task parameters and resource status data, according to a multi-objective optimization model. The resource elastic adaptation module executes the computing resource scheduling scheme during the generative artificial intelligence task processing to perform resource scheduling and task migration for each computing node.
2. The method according to claim 1, characterized in that, The task parameters include task type, input parameter characteristics, and network response constraints. The computational resource requirements include floating-point operations, memory requirements, and parallelism limits. Determining the computational resource requirements corresponding to the generative artificial intelligence task based on the task parameters includes: Based on the task parameters, the computational resource requirements are output using a pre-trained feature mapping model. The task types include text generation tasks, image generation tasks, video generation tasks, or multimodal generation tasks. The input parameter features corresponding to the text generation task include the number of input text units and the upper limit of the number of generated text units. The input parameter features corresponding to the image generation task include resolution, number of channels, and generation style complexity parameters. The features corresponding to the video generation task include frame rate, duration, and resolution. The multimodal generation task includes at least two of the text generation task, the image generation task, and the video generation task. The network response constraints include custom constraint parameters and / or preset constraint parameters.
3. The method according to claim 1, characterized in that, The resource status data includes computing power data, resource reserve data, health status data, and energy consumption data. Obtaining the resource status data of each computing node in the heterogeneous computing cluster includes: According to a preset cycle, the resource status data is obtained in real time through the monitoring agent system of each computing node; In addition, according to the data acquisition instructions, the resource status data of the target computing node is acquired in real time through the monitoring agent system; The resource status data is processed for outliers and normalized, and then stored in a time-series database.
4. The method according to claim 1, characterized in that, The multi-objective optimization model includes a deep deterministic policy gradient model. The step of generating a computing resource scheduling scheme based on the task parameters and the resource status data according to the multi-objective optimization model includes: Perform data preprocessing and vector concatenation on the task parameters and resource status data to generate a state input vector; The state input vector is processed by the generator network of the deep deterministic policy gradient model to generate an initial scheduling scheme, and the reward value of the initial scheduling scheme is determined by the evaluation network of the deep deterministic policy gradient model. If the reward value of the initial scheduling scheme exceeds a preset threshold, the initial scheduling scheme is determined as the target scheduling scheme. If the reward value of the initial scheduling scheme does not exceed the preset threshold, the initial scheduling scheme is cyclically adjusted until the reward value of the adjusted scheme exceeds the preset threshold, and the adjusted scheme at that point is taken as the target scheduling scheme. Constraint verification is performed on the target scheduling scheme. After the constraint verification is passed, the target scheduling scheme is used as the computing resource scheduling scheme, and the target node, resource type, resource configuration method, concurrency allocation method, and task start parameters are output. The constraint verification includes resource margin constraints, compatibility constraints, and network response feasibility constraints.
5. The method according to claim 4, characterized in that, The reward value is calculated by weighting resource utilization improvement rate, network response compliance rate, and energy consumption parameters. The pre-training method of the deep deterministic policy gradient model includes: Acquire generative task training data within a preset time period and generate a dataset; The generative task training data includes task characteristics, resource status, scheduling results, and execution effects. With the goal of maximizing the reward value, the generator network and the evaluation network are trained iteratively until a preset convergence condition is met; The preset convergence conditions include the number of iterations reaching a preset threshold, or the resource utilization rate of the deep deterministic policy gradient model on the test set corresponding to the dataset exceeding a first proportion threshold, and the network response compliance rate exceeding a second proportion threshold. Furthermore, based on the incremental data collected in the first cycle, the incremental data is added to the dataset and the current deep deterministic policy gradient model is iteratively trained in real time.
6. The method according to claim 1, characterized in that, The execution methods of resource scheduling include: When the total number of allocatable concurrent requests is lower than the number of concurrent task requests, the resource expansion process is triggered. The cloud platform application programming interface is called to create heterogeneous computing resource instances. The startup time priority is determined according to the resource type of each node to be expanded, and the resource expansion is performed on each node to be expanded in sequence according to the startup time priority. The expanded computing resources are then registered in the resource cluster list. When the resource utilization rate remains below the preset utilization threshold throughout a preset time period, the resource scale-down process is triggered. The task execution status on the node to be scaled down is checked. If there are running tasks, task migration is triggered. If the node is idle, computing resources are released and the resource cluster list is updated. Additionally, through custom resources of a container orchestration tool, task units are bound to target nodes and resource types, and resource isolation parameters are set. Furthermore, the task migration includes scale-down migration, disaster recovery migration, or load balancing migration. The priority of the task migration is ordered based on the network response constraints of each generative artificial intelligence task, and latency-sensitive tasks are migrated first, while throughput-sensitive tasks are migrated later. The task migration uses a state savepoint mechanism to save state savepoints to distributed storage at predetermined time intervals. During migration, after the model on the new node has loaded the model state of the original node, the task data corresponding to the state savepoint is transmitted through incremental synchronization. The state savepoints are used to indicate the intermediate state of model inference and the sharding progress of the input data.
7. The method according to any one of claims 1-6, characterized in that, The method further includes: Through the execution feedback module, the task execution parameters of the generative artificial intelligence task and the resource consumption parameters of each computing node during the processing of the generative artificial intelligence task are obtained. Based on the task execution parameters and the resource consumption parameters, the scheduling effect parameters of the computing resource scheduling scheme are determined; Based on the scheduling effect parameters, the multi-objective optimization model is adjusted, and a preset alarm process is executed.
8. A heterogeneous computing resource elastic scheduling device, characterized in that, The device includes: The acquisition module is used to acquire the task parameters of the generative artificial intelligence task through the task awareness module, and determine the computing resource requirements corresponding to the generative artificial intelligence task based on the task parameters. The acquisition module is also used to acquire resource status data of each computing node in the heterogeneous computing cluster through the heterogeneous resource status monitoring module; The processing module is used to generate a computing resource scheduling scheme based on the task parameters and the resource status data, according to a multi-objective optimization model, through the intelligent scheduling decision module; The processing module is further configured to execute the computing resource scheduling scheme during the generative artificial intelligence task processing through the resource elastic adaptation module to perform resource scheduling and task migration for each computing node.
9. A computer device, characterized in that, The computer device includes one or more processors and a memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the steps of the method as described in any one of claims 1-7.
10. A storage medium, characterized in that, The storage medium stores computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the method as described in any one of claims 1-7.