Dynamic computing resource elastic scheduling method for heterogeneous server cluster

By using an improved deep reinforcement learning algorithm and a multi-objective constraint optimization model, the reward function is dynamically adjusted and task execution progress information is collected in real time. This solves the problem of low task-resource matching in heterogeneous server cluster resource scheduling, and achieves stable and dynamic load distribution.

CN122240273A · Pending · Publication Date: 2026-06-19 · GUOLIAN ZHONGYUAN (BEIJING) TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications (China)
Current Assignee / Owner
GUOLIAN ZHONGYUAN (BEIJING) TECHNOLOGY CO LTD
Filing Date
2026-04-23
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing heterogeneous server cluster resource scheduling technologies, deep reinforcement learning algorithms cannot dynamically adapt to changes in task requirements and cluster resources, resulting in a disconnect between node pre-selection results and actual needs, and a reduced degree of task-resource matching.

Method used

An improved deep reinforcement learning algorithm is adopted to dynamically adjust the reward function based on the resource demand characteristics of the computing task and the resource status change trend of the heterogeneous server cluster. Combined with a multi-objective constraint optimization model, the algorithm parameters are dynamically adjusted by collecting task execution progress information in real time, thereby optimizing the load distribution of server nodes.

Benefits of technology

It improves the alignment between server node pre-selection results and computing task resource requirements, maintains the synchronous correlation between cluster resource scheduling and task execution, stabilizes the load distribution status, and enhances the dynamic adaptability of the scheduling process.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122240273A_ABST

Patent Text Reader

Abstract

This invention relates to the field of cluster resource scheduling technology, specifically a dynamic and elastic scheduling method for heterogeneous server clusters. The method includes: receiving a description of a computing task from an upper-layer application and parsing its resource requirement characteristics and dependency constraints; pre-selecting initial candidate server nodes in the cluster's global resource state space using an improved deep reinforcement learning algorithm with a dynamically adjustable reward function; constructing a multi-objective constraint optimization model by combining real-time node load and task requirements; calculating a suitability score and scheduling the task to the optimal node for execution; and dynamically adjusting the reinforcement learning algorithm parameters by collecting real-time data on task execution progress and comparing it with resource requirements. This method makes node pre-selection more aligned with task requirements, ensures scheduling decisions are consistent with the real-time state of the cluster and tasks, optimizes cluster load distribution, and enhances the dynamic adaptability of the scheduling process.

Description

Technical Field

[0001] This invention relates to the field of cluster resource scheduling technology, and in particular to a dynamic computing resource elastic scheduling method for heterogeneous server clusters.

Background Technology

[0002] Existing heterogeneous server cluster resource scheduling technologies mostly rely on fixed resource thresholds to select nodes, employ conventional deep reinforcement learning algorithms for task node pre-selection, and build simple optimization models based on initial load and task requirements to complete task deployment. Algorithm parameters remain constant after task scheduling and do not change based on task execution status. Some scheduling schemes rely solely on initial data for decision-making, failing to establish a collection and feedback loop for task execution information, and the scheduling process largely follows statically set rules.

[0003] Conventional deep reinforcement learning algorithms use fixed reward functions, which cannot adapt to changes in computational task resource requirements and cluster resource status. Node pre-selection results are prone to becoming disconnected from actual needs. Multi-objective constrained optimization models rely solely on initial data to calculate fit, lacking a dynamic feedback mechanism during task execution. Algorithm parameters cannot be updated in real time, scheduling decisions gradually deviate from the actual cluster operating status and task execution progress, server node load distribution becomes uneven, and the matching degree between tasks and resources continuously decreases.

[0004] Because the reward function of conventional deep reinforcement learning algorithms cannot dynamically adapt to changing trends in task requirements and cluster resources, and because task execution progress information is not fed back to adjust algorithm parameters, the dynamic decision-making and real-time optimization mechanism for cluster resource scheduling needs to be improved.

Summary of the Invention

[0005] The purpose of this invention is to address the shortcomings of existing technologies by proposing a dynamic computing resource elastic scheduling method for heterogeneous server clusters.

[0006] To achieve the above objectives, the present invention adopts the following technical solution: a dynamic computing resource elastic scheduling method for heterogeneous server clusters, comprising:

[0007] Receive the computing task description from the upper layer application, and parse out the resource requirements and dependency constraints of the computing task based on the task description.

[0008] Based on resource demand characteristics, an improved deep reinforcement learning algorithm is used to pre-select a set of initial candidate server nodes for computing tasks in the global resource state space of a heterogeneous server cluster. The improved deep reinforcement learning algorithm dynamically adjusts its reward function based on the actual resource demand characteristics of the computing task and the resource state change trend of the heterogeneous server cluster.

[0009] Based on the real-time load data of the initial candidate server nodes and the actual resource requirements of the computing tasks, a multi-objective constrained optimization model is constructed.

[0010] Solve the multi-objective constrained optimization model, calculate the task deployment suitability score of each initial candidate server node, and schedule the computing task to be executed on the server node with the highest suitability score.

[0011] The execution progress information of the computing task on the server node is collected in real time and compared with the resource requirement characteristics. The parameters of the improved deep reinforcement learning algorithm are dynamically adjusted according to the comparison results.

[0012] As a further aspect of the present invention, the improved deep reinforcement learning algorithm is used to pre-select a set of initial candidate server nodes for the computing task in the global resource state space of the heterogeneous server cluster, specifically including:

[0013] A computing task description is received from the upper-layer application, and the resource requirement features of the computing task are extracted from it. The resource requirement features include at least the number of CPU cores, memory capacity, storage space, and the type and number of accelerator cards required by the computing task.

[0014] Obtain a global resource status snapshot of the heterogeneous server cluster. The global resource status snapshot includes the static configuration attributes and dynamic load attributes of each server node in the cluster. The static configuration attributes include CPU architecture, total memory, storage type and capacity, and accelerator card model. The dynamic load attributes include CPU utilization, memory usage, storage read / write speed, and network bandwidth utilization.

[0015] The resource requirement characteristics of the computing task and the global resource state snapshot of the heterogeneous server cluster are encoded into a multi-dimensional feature vector, which is used as the environment state input for the improved deep reinforcement learning algorithm.

[0016] The improved deep reinforcement learning algorithm is based on the global resource state space of a heterogeneous server cluster. According to the preset exploration strategy, it starts from the current environment state and evaluates the expected long-term benefits of scheduling computing tasks to different server nodes in the action space.

[0017] The improved deep reinforcement learning algorithm selects the top few scheduling actions with the highest expected long-term returns, adds the server node pointed to by each scheduling action to a server node list, and outputs the server node list as a set of initial candidate server nodes pre-selected for the computing task.
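
The top-k selection in the last step can be sketched as follows. This is a minimal illustration, assuming the policy or value network has already produced one expected long-term return per server node; the function name `preselect_candidates`, the node identifiers, and the value of `k` are hypothetical and are not taken from the patent text.

```python
import heapq


def preselect_candidates(expected_returns, k=3):
    """Return the k node ids with the highest expected long-term return.

    expected_returns: dict mapping node id -> estimated long-term return
    (hypothetical output of the learned value or policy network).
    """
    # heapq.nlargest iterates over the dict keys and ranks them by their value.
    return heapq.nlargest(k, expected_returns, key=expected_returns.get)


if __name__ == "__main__":
    returns = {"node-a": 0.42, "node-b": 0.91, "node-c": 0.17, "node-d": 0.66}
    print(preselect_candidates(returns, k=2))  # ['node-b', 'node-d']
```

The resulting list plays the role of the set of initial candidate server nodes passed on to the optimization stage.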

[0018] As a further aspect of the present invention, the improved deep reinforcement learning algorithm dynamically adjusts its reward function based on the actual resource requirements of the computing task and the resource status change trend of the heterogeneous server cluster, specifically as follows:

[0019] The reward function of the improved deep reinforcement learning algorithm is initialized, and the reward function includes multiple basic reward items, including resource utilization improvement reward, task scheduling success reward, and service level agreement breach penalty.

[0020] In each training cycle of the improved deep reinforcement learning algorithm interacting with the environment, the actual resource consumption data of the computing task on the target server node in the current training cycle, as well as the change in the resource state of the target server node before and after the computing task is run, are obtained.

[0021] The actual resource consumption data is compared with the resource demand characteristics of the computing task parsed from the computing task description, and the resource demand prediction deviation value is calculated.

[0022] Obtain the resource status of the target server node at the start of the next training cycle, and predict the trend of resource status change of the target server node based on the resource status data of multiple historical training cycles.

[0023] Based on the resource demand prediction deviation and the resource status change trend, the weight adjustment coefficient of each basic reward item in the reward function is calculated. The weight adjustment coefficient is used to amplify the reward items that are more relevant to the recent scheduling goal and reduce the weight of the reward items that are less relevant to the recent scheduling goal in the next training cycle.

[0024] In the next training cycle, the reward function adjusted by the weight adjustment coefficient is used to calculate the immediate reward for different scheduling actions under the current environmental state, guiding the policy network of the improved deep reinforcement learning algorithm to optimize towards the direction of dynamically changing scheduling objectives.
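
One way to picture a reward function built from weighted basic reward items is sketched below. The patent text does not give a formula, so the linear form, the item names, and the helper `relative_deviation` are assumptions made only for illustration.

```python
def relative_deviation(declared, actual):
    """Signed deviation of actual consumption from the declared demand (assumed form)."""
    return (actual - declared) / declared


def immediate_reward(weights, utilization_gain, scheduled_ok, sla_breached):
    """Weighted sum of the three basic reward items named in the text.

    weights: dict with keys 'utilization', 'success', 'sla_penalty' (hypothetical names).
    """
    return (
        weights["utilization"] * utilization_gain
        + weights["success"] * (1.0 if scheduled_ok else 0.0)
        - weights["sla_penalty"] * (1.0 if sla_breached else 0.0)
    )


if __name__ == "__main__":
    weights = {"utilization": 1.0, "success": 0.5, "sla_penalty": 2.0}
    print(relative_deviation(declared=4.0, actual=5.0))  # 0.25
    print(immediate_reward(weights, 0.2, scheduled_ok=True, sla_breached=False))
    # Raising the SLA weight for the next cycle makes the same breach cost more.
    weights["sla_penalty"] = 3.0
    print(immediate_reward(weights, 0.2, scheduled_ok=True, sla_breached=True))
```

Under this form, adjusting a weight between training cycles changes how strongly the corresponding item steers the policy without changing the items themselves.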

[0025] As a further aspect of the present invention, the step of constructing a multi-objective constrained optimization model based on the real-time load data of the initial candidate server nodes and the actual resource requirements of the computing task specifically includes:

[0026] Obtain the real-time load data for each server node in the initial candidate server node list. The real-time load data includes real-time CPU utilization, real-time memory usage, real-time storage input / output operation rate, and real-time network latency.

[0027] The actual resource requirement characteristics of the computing task are parsed from the computing task description. These actual resource requirement characteristics include peak CPU core requirement, peak memory requirement, storage read / write bandwidth requirement, network bandwidth requirement, and maximum tolerable latency of the task.

[0028] The primary optimization objective is to maximize the overall resource utilization of the target server node, and the secondary optimization objective is to minimize the total migration cost of computing tasks throughout the cluster. A multi-objective optimization problem objective function is constructed.

[0029] The actual resource requirements of the computing task and the remaining available resources of each initial candidate server node are used as constraints. The constraints include that the peak CPU core requirement shall not exceed the number of remaining available CPU cores of the server node, the peak memory requirement shall not exceed the amount of remaining available memory of the server node, and the maximum tolerable latency of the task must be greater than the estimated network latency after the task is deployed to the corresponding server node.

[0030] By combining the objective function with the constraints, a multi-objective constrained optimization model is constructed to select the optimal target node from the initial candidate server nodes.
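
The three constraints listed above can be expressed as a simple feasibility check. The sketch below is illustrative only; the class and field names are hypothetical, and the units (cores, GB, ms) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class TaskDemand:
    peak_cpu_cores: float
    peak_memory_gb: float
    max_latency_ms: float  # maximum latency the task can tolerate


@dataclass
class NodeState:
    free_cpu_cores: float
    free_memory_gb: float
    est_latency_ms: float  # estimated network latency after deployment


def satisfies_hard_constraints(task, node):
    """True only if the node meets every hard constraint for the task."""
    return (
        task.peak_cpu_cores <= node.free_cpu_cores
        and task.peak_memory_gb <= node.free_memory_gb
        # The tolerable latency must be strictly greater than the estimate.
        and task.max_latency_ms > node.est_latency_ms
    )


if __name__ == "__main__":
    task = TaskDemand(peak_cpu_cores=8, peak_memory_gb=32, max_latency_ms=20)
    print(satisfies_hard_constraints(task, NodeState(16, 64, 5)))   # True
    print(satisfies_hard_constraints(task, NodeState(16, 64, 25)))  # False
```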

[0031] As a further aspect of the present invention, solving the multi-objective constrained optimization model and calculating the task deployment suitability score for each initial candidate server node specifically includes:

[0032] Obtain the completed multi-objective constraint optimization model, obtain the latest real-time load data of the initial candidate server nodes from the monitoring system of the heterogeneous server cluster, and substitute it into the constraint conditions of the multi-objective constraint optimization model to verify whether each initial candidate server node meets all hard constraint conditions.

[0033] For initial candidate server nodes that satisfy all hard constraints, a linear weighted sum method is used to integrate multiple optimization objective functions in the multi-objective constrained optimization model into a single-objective fitness evaluation function, and a corresponding weight coefficient is assigned to each optimization objective.

[0034] The real-time load data of each initial candidate server node and the actual resource requirements of the computing task are substituted into the fitness evaluation function for calculation.

[0035] The suitability evaluation function integrates the estimated resource utilization efficiency of the computing task to be scheduled on the corresponding initial candidate server node, the estimated task execution completion time, and the estimated cluster load balancing cost caused by this scheduling, and outputs a quantitative suitability score.

[0036] For all nodes in the initial candidate server node list that meet the constraints, a fitness score is calculated, and the calculation task is scheduled to be executed on the server node with the highest fitness score.
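
The filter-then-rank procedure described above can be sketched as follows. The feasibility predicate and the scoring function are passed in as parameters because their concrete form depends on the model; the names `select_best_node`, `is_feasible`, and `score` and the sample data are hypothetical.

```python
def select_best_node(nodes, is_feasible, score):
    """Return the feasible node with the highest score, or None if none is feasible."""
    feasible = [node for node in nodes if is_feasible(node)]
    if not feasible:
        return None  # no candidate satisfies the hard constraints
    return max(feasible, key=score)


if __name__ == "__main__":
    nodes = [
        {"name": "node-a", "free_cores": 2, "score": 0.9},
        {"name": "node-b", "free_cores": 16, "score": 0.6},
        {"name": "node-c", "free_cores": 32, "score": 0.7},
    ]
    best = select_best_node(
        nodes,
        is_feasible=lambda n: n["free_cores"] >= 8,
        score=lambda n: n["score"],
    )
    print(best["name"])  # node-c
```

Note that node-a has the highest raw score but is excluded first because it fails the hard constraint, which mirrors the order of the steps in the text.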

[0037] As a further aspect of the present invention, the suitability evaluation function comprehensively considers the estimated resource utilization efficiency of the scheduled computation task on the corresponding initial candidate server node, the estimated task execution completion time, and the estimated cluster load balancing cost caused by this scheduling, specifically including:

[0038] Based on the actual resource requirements of the computing task and the current resource utilization of the target server node, calculate the expected utilization of the four main resources of the target server node (CPU, memory, storage, and network) after scheduling the current computing task to the target server node.

[0039] The expected utilization rates of four main resources—CPU, memory, storage, and network—are weighted geometrically averaged to obtain the estimated node-level resource utilization efficiency of the current computing task on the corresponding candidate server node. The weighting coefficients are pre-configured based on the degree of dependence of the current computing task on each type of resource.

[0040] Based on the historical execution records of the computation task or the estimated execution cycle in the task description, combined with the current load of the target server node and the queuing status of other tasks on the same node, the estimated execution completion time of the task on the target server node is calculated.

[0041] Simulate migrating the computing task from its current location to the target server node, calculate the amount of data transfer and the service interruption time required during the migration, and use these as the quantified value of the cluster load balancing cost caused by this scheduling.

[0042] The node-level resource utilization efficiency, the reciprocal of the execution completion time, and the quantified value of the cluster load balancing cost are linearly combined according to preset weights to calculate the final adaptability score.
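
The scoring steps above can be sketched as a weighted geometric mean followed by a linear combination. The text does not state the weights or the sign applied to the load balancing cost; the sketch assumes the cost is subtracted, and all weight values and names are hypothetical.

```python
import math


def weighted_geometric_mean(utilization, weights):
    """Weighted geometric mean of per-resource expected utilization rates.

    Both dicts share the same keys; utilization values must be greater than 0.
    """
    total_weight = sum(weights.values())
    log_sum = sum(w * math.log(utilization[r]) for r, w in weights.items())
    return math.exp(log_sum / total_weight)


def suitability_score(utilization, resource_weights, completion_time_s,
                      balancing_cost, w_eff=0.5, w_time=0.3, w_cost=0.2):
    """Linear combination of efficiency, 1/completion time, and balancing cost."""
    efficiency = weighted_geometric_mean(utilization, resource_weights)
    # Assumption: a higher balancing cost lowers the score.
    return w_eff * efficiency + w_time / completion_time_s - w_cost * balancing_cost


if __name__ == "__main__":
    util = {"cpu": 0.8, "memory": 0.6, "storage": 0.4, "network": 0.5}
    weights = {"cpu": 0.4, "memory": 0.3, "storage": 0.2, "network": 0.1}
    print(suitability_score(util, weights, completion_time_s=120.0, balancing_cost=0.1))
```

In practice the three terms would need to be brought onto comparable scales before combining them; the sketch leaves that normalization out.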

[0043] As a further aspect of the present invention, the real-time acquisition of computational task execution progress information on the server node, comparing it with resource requirement characteristics, and dynamically adjusting the parameters of the improved deep reinforcement learning algorithm based on the comparison results, specifically includes:

[0044] During the execution of the computing task on the target server node, its execution progress information is continuously collected. The execution progress information includes real-time CPU utilization curve, real-time memory usage curve, number of completed input / output operations, and amount of data processed.

[0045] Obtain the declared resource requirement characteristics from the computing task description, including the expected CPU utilization range, the expected memory usage limit, the expected total input / output operations, and the expected total amount of data to be processed.

[0046] The execution progress information is compared with the resource requirement features obtained from the computing task description to calculate the deviation between real-time resource consumption and expected resource requirements. The deviation includes CPU utilization deviation, memory usage deviation, and input / output operation rate deviation.

[0047] Based on the calculated deviation, the strength of the empirical feedback signal that the improved deep reinforcement learning algorithm should obtain in this scheduling decision is calculated, and the signal strength is positively correlated with the deviation.

[0048] The complete state and action of this scheduling decision, together with the calculated empirical feedback signal strength, are stored as an experience tuple in the experience replay buffer of the improved deep reinforcement learning algorithm for subsequent updates to the algorithm's policy network parameters.

[0049] As a further aspect of the present invention, based on the calculated deviation, the strength of the empirical feedback signal that the improved deep reinforcement learning algorithm should obtain in this scheduling decision is calculated, specifically including:

[0050] Set a tolerance threshold for resource deviation. For each resource type, when the calculated deviation exceeds its corresponding tolerance threshold, it is determined that the corresponding resource type has deviated.

[0051] Count the number of resource types that deviated during the current execution cycle of the computing task, and record the extent to which the deviation of each resource type exceeded its tolerance threshold.

[0052] The base penalty coefficient is determined based on the number of resource types that deviate; the more types of resources that deviate, the larger the base penalty coefficient.

[0053] The weighted penalty is calculated based on the extent to which the deviation of each type of resource exceeds its tolerance threshold. The greater the deviation, the greater the weighted penalty.

[0054] Multiplying the basic penalty coefficient by the weighted penalty amount yields a negative empirical feedback signal strength value, the absolute value of which is the penalty strength of this scheduling decision. This penalty strength will be used to adjust the update gradient of the policy network parameters in the improved deep reinforcement learning algorithm.
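
The penalty calculation above can be sketched as follows. Only the monotonic relationships come from the text (more deviating resource types give a larger base coefficient, larger excess gives a larger weighted penalty); the specific formulas, the `base_step` parameter, and the per-resource severity weights are assumptions.

```python
def feedback_signal(deviations, thresholds, severity_weights=None, base_step=0.5):
    """Return a non-positive feedback value; its absolute value is the penalty strength."""
    # Keep only resource types whose deviation exceeds the tolerance threshold.
    excess = {r: d - thresholds[r] for r, d in deviations.items() if d > thresholds[r]}
    if not excess:
        return 0.0  # nothing deviated, so no penalty
    # Base coefficient grows with the number of deviating resource types.
    base_coefficient = 1.0 + base_step * (len(excess) - 1)
    weights = severity_weights or {}
    # Weighted penalty grows with how far each deviation exceeds its threshold.
    weighted_penalty = sum(weights.get(r, 1.0) * e for r, e in excess.items())
    return -(base_coefficient * weighted_penalty)


if __name__ == "__main__":
    thresholds = {"cpu": 0.1, "memory": 0.1, "io": 0.1}
    print(feedback_signal({"cpu": 0.05, "memory": 0.02, "io": 0.0}, thresholds))  # 0.0
    print(feedback_signal({"cpu": 0.30, "memory": 0.20, "io": 0.0}, thresholds))
```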

[0055] As a further aspect of the present invention, the resource requirement characteristics of the computing task and the global resource state snapshot of the heterogeneous server cluster are encoded into a multi-dimensional feature vector, which is used as the environment state input for the improved deep reinforcement learning algorithm, specifically including:

[0056] The resource requirements of the computing task are normalized, and the absolute values of CPU core requirements, memory capacity requirements, storage space requirements, and accelerator card quantity requirements are mapped to relative values between zero and one.

[0057] One-hot encoding is performed on the static configuration attributes in the global resource status snapshot of the heterogeneous server cluster, converting categorical attributes such as CPU architecture and accelerator card model into high-dimensional sparse binary vectors;

[0058] The dynamic load attributes in the global resource status snapshot are standardized to eliminate the influence of load indicator dimensions caused by configuration differences between different server nodes, and percentage values such as CPU utilization and memory utilization are converted into scores under the standard normal distribution.

[0059] The normalized computational task resource requirement feature vector and the server node state feature vector after one-hot encoding and standardization are concatenated along the feature dimension to form a unified high-dimensional multidimensional feature vector.

[0060] The unified high-dimensional multidimensional feature vector, which serves as the state representation describing the "task to be scheduled - cluster environment", is input into the state feature extraction network of the improved deep reinforcement learning algorithm.
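
The encoding steps above (normalization to the zero-to-one range, one-hot encoding, standardization, and concatenation) can be sketched with the standard library alone. The value ranges, category lists, and the function name `encode_state` are hypothetical, and the sketch encodes a single node rather than the whole cluster.

```python
from statistics import mean, pstdev


def min_max(value, lo, hi):
    """Map an absolute value into [0, 1] given its assumed range."""
    return (value - lo) / (hi - lo)


def one_hot(value, categories):
    """Sparse binary vector with a 1 at the position of the matching category."""
    return [1.0 if value == c else 0.0 for c in categories]


def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]


def encode_state(task_demand, demand_ranges, node_arch, arch_categories, node_load_z):
    """Concatenate task features and node features along the feature dimension."""
    task_part = [min_max(task_demand[k], *demand_ranges[k]) for k in sorted(task_demand)]
    return task_part + one_hot(node_arch, arch_categories) + list(node_load_z)


if __name__ == "__main__":
    cluster_cpu_util = [0.2, 0.5, 0.8]      # CPU utilization of three nodes
    z = z_scores(cluster_cpu_util)
    state = encode_state(
        {"cpu_cores": 8, "memory_gb": 16},
        {"cpu_cores": (0, 64), "memory_gb": (0, 128)},
        "arm64", ["x86_64", "arm64"], [z[1]],
    )
    print(state)
```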

[0061] As a further aspect of the present invention, based on the resource demand prediction deviation and the resource status change trend, the weight adjustment coefficient of each basic reward item in the reward function is calculated, specifically as follows:

[0062] The resource demand prediction deviation value is decomposed according to resource type to obtain prediction deviation components for various resources such as CPU, memory, storage, and network.

[0063] The resource status change trend of the target server node is decomposed according to resource type to predict the direction and magnitude of the change in the utilization rate of each type of resource in the next period.

[0064] Establish a weight adjustment strategy table, which defines the mapping relationship between the prediction deviation components and resource status change trends of different resource types and the weight adjustment amounts of each basic reward item in the reward function;

[0065] Query the weight adjustment strategy table, and based on the various resource prediction deviation components calculated in the current period and the prediction results of resource status change trends in the next period, obtain the weight adjustment amounts corresponding to the basic reward items such as resource utilization improvement reward, task scheduling success reward, and service level agreement breach penalty in the reward function;

[0066] The obtained weight adjustment amount is added to the weight of the basic reward item in the previous period to obtain the new weight adjustment coefficient of each basic reward item in the reward function of the next period, thereby dynamically adjusting the optimization focus of the improved deep reinforcement learning algorithm.
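
A table-driven weight update of this kind might look like the sketch below. The text does not specify how the table is keyed, so the discretization into deviation levels and trend directions, the tolerance value, and every table entry are assumptions made purely for illustration.

```python
# (resource, deviation level, trend direction) -> weight deltas per basic reward item
STRATEGY_TABLE = {
    ("cpu", "over", "rising"): {"sla_penalty": 0.10, "utilization": -0.05},
    ("cpu", "under", "falling"): {"utilization": 0.05},
    ("memory", "over", "rising"): {"sla_penalty": 0.05},
}


def discretize(deviation, trend, tol=0.1):
    """Turn a signed deviation and a trend slope into table lookup labels."""
    level = "over" if deviation > tol else "under" if deviation < -tol else "ok"
    direction = "rising" if trend > 0 else "falling" if trend < 0 else "flat"
    return level, direction


def adjust_weights(weights, deviations, trends, table=STRATEGY_TABLE):
    """Add the looked-up adjustment amounts to the previous period's weights."""
    new_weights = dict(weights)  # leave the previous weights untouched
    for resource, deviation in deviations.items():
        level, direction = discretize(deviation, trends[resource])
        for item, delta in table.get((resource, level, direction), {}).items():
            new_weights[item] += delta
    return new_weights


if __name__ == "__main__":
    weights = {"utilization": 1.0, "success": 1.0, "sla_penalty": 1.0}
    print(adjust_weights(weights, {"cpu": 0.3, "memory": 0.0}, {"cpu": 0.02, "memory": 0.01}))
```

Combinations with no table entry leave the weights unchanged, which keeps the update conservative when the observed situation was not anticipated.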

[0067] Compared with the prior art, the advantages and positive effects of the present invention are as follows:

[0068] An improved deep reinforcement learning algorithm, based on the actual resource requirements of computing tasks and the changing trends of resource status in heterogeneous server clusters, dynamically adjusts the reward function. This algorithm pre-selects initial candidate server nodes within the cluster's global resource state space. The reward function's setting logic is linked to task requirements and cluster resource changes, ensuring the algorithm's decision-making aligns with the real-time status of the cluster and tasks. The candidate node selection process avoids decision biases caused by fixed reward functions, improving the fit between pre-selected nodes and computing task resource requirements. The reference value of the cluster's global resource state is fully integrated into the node pre-selection process, effectively reducing unreasonable candidate node selection results. The rationality of node pre-selection is continuously optimized, and the correlation between algorithm decisions and cluster resource distribution is closer.

[0069] The method collects real-time information on the execution progress of computing tasks on server nodes and compares it with the resource requirement characteristics. Based on the comparison results, it dynamically adjusts the parameters of the improved deep reinforcement learning algorithm. State changes during task execution are thereby converted into a basis for adjusting algorithm parameters, so the algorithm's operating logic iterates continuously with the actual execution state of the task, and subsequent resource scheduling decisions adapt to the real-time situation of task execution. As a result, scheduling decisions remain matched to task execution and cluster running status over the long term, the load distribution of server nodes stays stable, and the dynamic adaptability of the scheduling process extends through the entire task execution process.

Attached Figure Description

[0070] Figure 1 is a flowchart of the dynamic computing resource elastic scheduling method for heterogeneous server clusters described in this invention;

[0071] Figure 2 is a flowchart illustrating the process of pre-selecting initial candidate server nodes using an improved deep reinforcement learning algorithm;

[0072] Figure 3 is a flowchart for constructing a multi-objective constrained optimization model.

Detailed Implementation

[0073] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.

[0074] In the description of this invention, it should be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," and "outer," etc., indicating orientation or positional relationships, are based on the orientation or positional relationships shown in the accompanying drawings and are only for the convenience of describing the invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation of the invention. Furthermore, in the description of this invention, "a plurality of" means two or more, unless otherwise explicitly specified.

[0075] Referring to Figure 1, this invention provides a dynamic computing resource elastic scheduling method for heterogeneous server clusters. The overall implementation scheme of the method is as follows:

[0076] The system receives a computation task description from the upper-layer application and parses the resource requirements and dependency constraints of the computation task based on this description. Based on the parsed resource requirements, an improved deep reinforcement learning algorithm is used to pre-select a set of initial candidate server nodes for the current computation task in the global resource state space of the heterogeneous server cluster. The reward function of this improved deep reinforcement learning algorithm can be dynamically adjusted based on the actual resource requirements of the computation task and the resource state change trend of the heterogeneous server cluster. A multi-objective constraint optimization model is constructed based on the real-time load data of the initial candidate server nodes and the actual resource requirements of the computation task. This multi-objective constraint optimization model is solved to calculate the task deployment suitability score for each initial candidate server node, and the computation task is scheduled to be executed on the server node with the highest suitability score. During the execution of the computation task, its execution progress information on the selected server nodes is collected in real time and compared with the resource requirements in the task description. Based on the comparison results, the internal parameters of the improved deep reinforcement learning algorithm are dynamically adjusted to achieve continuous optimization of the strategy.

[0077] In one embodiment of the present invention, referring to Figure 2, the system receives a computing task description from the upper-layer application and extracts the resource requirement features of the computing task from the description. These resource requirement features include at least the number of CPU cores, memory capacity, storage space, and the type and number of accelerator cards required by the computing task. Simultaneously, it acquires a global resource status snapshot of the heterogeneous server cluster. This snapshot includes static configuration attributes and dynamic load attributes for each server node in the cluster. The static configuration attributes include CPU architecture, total memory, storage type and capacity, and accelerator card model. The dynamic load attributes include CPU utilization, memory usage, storage read/write speed, and network bandwidth utilization. The resource requirement features of the computing task are normalized, mapping absolute values such as CPU core requirements, memory capacity requirements, storage space requirements, and accelerator card quantity requirements to relative values between zero and one. The static configuration attributes in the global resource status snapshot of the heterogeneous server cluster are one-hot encoded, converting categorical attributes such as CPU architecture and accelerator card model into high-dimensional sparse binary vectors. The dynamic load attributes in the global resource state snapshot are standardized to eliminate the influence of load metric dimensions caused by configuration differences between different server nodes, converting percentage values such as CPU utilization and memory utilization into scores under a standard normal distribution.
The normalized computational task resource requirement feature vector is concatenated with the server node state feature vector after one-hot encoding and standardization along the feature dimension to form a unified high-dimensional multidimensional feature vector. This unified high-dimensional multidimensional feature vector serves as the state representation describing the "task to be scheduled - cluster environment" and is input into the state feature extraction network of the improved deep reinforcement learning algorithm. Based on the global resource state space of the heterogeneous server cluster, the improved deep reinforcement learning algorithm, according to a preset exploration strategy, starts from the current environment state and evaluates the expected long-term benefits of scheduling computational tasks to different server nodes in the action space. The improved deep reinforcement learning algorithm selects the top few scheduling actions with the highest expected long-term benefits, adds the server node pointed to by each scheduling action to a server node list, and outputs the server node list as a set of initial candidate server nodes pre-selected for the computational task.
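As a non-limiting illustration, the construction of the state representation described above can be sketched in Python as follows. The field names, the demand bounds used for min-max scaling, and the category lists are hypothetical and are not specified by the embodiment; only two dynamic load attributes are standardized to keep the sketch short.

```python
from statistics import mean, pstdev

# Hypothetical (min, max) bounds used for min-max scaling of task demands.
DEMAND_BOUNDS = {
    "cpu_cores": (0, 64),
    "memory_gb": (0, 512),
    "storage_gb": (0, 4096),
    "accelerators": (0, 8),
}
# Hypothetical category vocabularies for one-hot encoding.
CPU_ARCHS = ["x86_64", "aarch64"]
ACCEL_MODELS = ["none", "gpu_a", "gpu_b"]


def min_max(value, lo, hi):
    """Map an absolute demand onto [0, 1]; a degenerate range maps to 0."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)


def one_hot(value, categories):
    """One binary position per possible category value."""
    return [1.0 if value == c else 0.0 for c in categories]


def z_scores(values):
    """Standardize one load attribute across all nodes in the cluster."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def build_state(task, nodes):
    """Concatenate task features and per-node features along the feature axis."""
    state = [min_max(task[k], *DEMAND_BOUNDS[k]) for k in DEMAND_BOUNDS]
    cpu_z = z_scores([n["cpu_util"] for n in nodes])
    mem_z = z_scores([n["mem_util"] for n in nodes])
    for node, cz, mz in zip(nodes, cpu_z, mem_z):
        state += one_hot(node["arch"], CPU_ARCHS)
        state += one_hot(node["accel"], ACCEL_MODELS)
        state += [cz, mz]
    return state


task = {"cpu_cores": 8, "memory_gb": 32, "storage_gb": 1024, "accelerators": 2}
nodes = [
    {"arch": "x86_64", "accel": "gpu_a", "cpu_util": 0.40, "mem_util": 0.50},
    {"arch": "aarch64", "accel": "none", "cpu_util": 0.80, "mem_util": 0.70},
]
state = build_state(task, nodes)
```

In this sketch the resulting vector has 4 task features plus 7 features per node; a real deployment would fix the node ordering and vector length expected by the state feature extraction network.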

[0078] In practical implementation, the system receives a computing task description from the upper-layer application and extracts the resource requirement characteristics of the computing task from it. These characteristics include at least the number of CPU cores, memory capacity, storage space, and the type and number of accelerator cards required by the computing task. A global resource status snapshot of the heterogeneous server cluster is obtained. This snapshot includes the static configuration attributes and dynamic load attributes of each server node in the cluster. Static configuration attributes include CPU architecture, total memory, storage type and capacity, and accelerator card model. Dynamic load attributes include CPU utilization, memory usage, storage read/write speed, and network bandwidth utilization. Furthermore, the resource requirement characteristics of the computing task are normalized by mapping absolute values such as CPU core requirements, memory capacity requirements, storage space requirements, and accelerator card quantity requirements to relative values between zero and one. The normalization process uses a min-max scaling method to convert resource requirement values of different dimensions to a uniform scale. In some embodiments, static configuration attributes in the global resource status snapshot of a heterogeneous server cluster are one-hot encoded, converting categorical attributes such as CPU architecture and accelerator card model into high-dimensional sparse binary vectors. Each possible value of each categorical attribute corresponds to a binary bit; when a server node has that value, the corresponding bit is set to 1, otherwise it is set to 0. Optionally, dynamic load attributes in the global resource status snapshot are standardized to eliminate the differences in load metric scale caused by configuration differences between server nodes. 
Percentage values such as CPU utilization and memory utilization are converted into scores under a standard normal distribution. The standardization process uses the following formula:

[0079] $$z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}$$

[0080] where $z_{ij}$ represents the standardized score of the j-th dynamic load attribute on the i-th server node; $x_{ij}$ represents the original value of the j-th dynamic load attribute on the i-th server node; $\mu_j$ represents the average value of the j-th dynamic load attribute across all server nodes in the cluster; and $\sigma_j$ represents the standard deviation of the j-th dynamic load attribute across all server nodes in the cluster. In a specific implementation, the normalized computational task resource requirement feature vector and the server node state feature vector, after one-hot encoding and standardization, are concatenated along the feature dimension to form a unified high-dimensional multidimensional feature vector. In some embodiments, this unified high-dimensional multidimensional feature vector is used as a state representation describing the "task to be scheduled - cluster environment" and input into the state feature extraction network of the improved deep reinforcement learning algorithm. The improved deep reinforcement learning algorithm, based on the global resource state space of the heterogeneous server cluster, starts from the current environment state according to a preset exploration strategy and evaluates the expected long-term benefits of scheduling computational tasks to different server nodes in the action space. Optionally, the improved deep reinforcement learning algorithm selects the top few scheduling actions with the highest expected long-term benefits, adds the server node pointed to by each scheduling action to a server node list, and outputs the server node list as a set of initial candidate server nodes pre-selected for the computational task. 
It can be understood that the state feature extraction network consists of multiple fully connected neural networks, used to extract abstract feature representations from the high-dimensional multidimensional feature vector for use by the policy network of the improved deep reinforcement learning algorithm. In practice, the expected long-term return is calculated using an action-value function based on a deep Q-network architecture, and the network parameters are iteratively updated to approximate the optimal scheduling strategy. The preset exploration strategy can be understood as an ε-greedy strategy: during training, a scheduling action is selected at random with probability ε to achieve exploration, and the scheduling action with the highest estimated long-term return is selected with probability 1-ε to achieve exploitation.
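As a non-limiting illustration, the ε-greedy selection and the pre-selection of the top few scheduling actions can be sketched in Python as follows. The node identifiers and the action-value estimates are hypothetical; in the embodiment these estimates would come from the deep Q-network rather than from a fixed dictionary.

```python
import random


def top_k_candidates(q_values, k):
    """Return the k node ids with the highest estimated long-term return."""
    ranked = sorted(q_values, key=q_values.get, reverse=True)
    return ranked[:k]


def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)


# Hypothetical action-value estimates, one per candidate scheduling action.
q_values = {"node_a": 0.42, "node_b": 0.77, "node_c": 0.15, "node_d": 0.63}

candidates = top_k_candidates(q_values, k=3)
greedy_action = epsilon_greedy(q_values, epsilon=0.0)
```

With ε set to zero the selection is purely greedy, which corresponds to using the trained policy for pre-selection; a non-zero ε would only be used during training.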

[0081] In one embodiment of the present invention, see Figure 3. The reward function of the improved deep reinforcement learning algorithm is initialized. The reward function comprises multiple basic reward items, including resource utilization improvement rewards, task scheduling success rewards, and service level agreement (SLA) breach penalties. In each training cycle of the improved deep reinforcement learning algorithm's interaction with the environment, the actual resource consumption data of the computational task on the target server node within the current training cycle, as well as the resource state changes of the target server node before and after the computational task execution, are obtained. The actual resource consumption data is compared with the resource requirement characteristics of the computational task parsed from the computational task description, and a resource requirement prediction deviation is calculated. The resource state of the target server node at the start of the next training cycle is obtained, and the resource state change trend of the target server node is predicted based on historical resource state data from multiple training cycles. Based on the resource requirement prediction deviation and the resource state change trend, a weight adjustment coefficient for each basic reward item in the reward function is calculated. This weight adjustment coefficient is used to amplify reward items more relevant to the recent scheduling target and reduce the weight of reward items less relevant to the recent scheduling target in the next training cycle. Specifically, the resource demand prediction deviation is decomposed by resource type to obtain prediction deviation components for various resources such as CPU, memory, storage, and network. The resource status change trend of the target server node is decomposed by resource type to predict the direction and magnitude of changes in the utilization rate of various resources in the next period. 
A weight adjustment strategy table is established, defining the mapping relationship between the prediction deviation components and resource status change trends of different resource types and the weight adjustment amounts of each basic reward item in the reward function. The weight adjustment strategy table is queried, and based on the prediction deviation components of various resources calculated in the current period and the prediction results of the resource status change trend in the next period, the weight adjustment amounts corresponding to the basic reward items in the reward function, such as resource utilization improvement rewards, task scheduling success rewards, and service level agreement breach penalties, are obtained. The obtained weight adjustment amounts are added to the weights of the basic reward items in the previous period to obtain the new weight adjustment coefficients for each basic reward item in the reward function in the next period, thereby dynamically adjusting the optimization focus of the improved deep reinforcement learning algorithm. In the next training cycle, the reward function adjusted by the weight adjustment coefficient is used to calculate the immediate reward for different scheduling actions under the current environmental state, guiding the policy network of the improved deep reinforcement learning algorithm to optimize towards the direction of dynamically changing scheduling objectives.

[0082] In specific implementations, the reward function of the improved deep reinforcement learning algorithm is initialized. This reward function includes several basic reward items, such as resource utilization improvement rewards, task scheduling success rewards, and service level agreement (SLA) breach penalties. During each training cycle of the improved deep reinforcement learning algorithm's interaction with the environment, the actual resource consumption data of the computational task on the target server node within the current training cycle, as well as the changes in the target server node's resource state before and after the computational task execution, are obtained. In specific implementations, the actual resource consumption data is compared with the resource requirement characteristics of the computational task parsed from the task description, and a resource requirement prediction deviation value is calculated. This deviation value reflects the degree of difference between the actual resource usage and the expected declaration. In some embodiments, the resource state of the target server node at the start of the next training cycle is obtained, and the resource state change trend of the target server node is predicted based on resource state data from multiple historical training cycles. This trend indicates the direction and rate of increase or decrease in node load within a short period in the future. Optionally, a weight adjustment coefficient for each basic reward item in the reward function is calculated based on the resource demand prediction deviation and resource status change trend. This weight adjustment coefficient is used to amplify reward items more relevant to the recent scheduling goal and reduce the weight of reward items less relevant to the recent scheduling goal in the next training cycle. 
In practice, the resource demand prediction deviation is decomposed by resource type to obtain prediction deviation components for various resources such as CPU, memory, storage, and network. Each prediction deviation component is the ratio of the absolute difference between actual consumption and declared demand to the declared demand. In practice, the resource status change trend of the target server node is decomposed by resource type to predict the direction and magnitude of the utilization rate change of various resources in the next cycle. This prediction can use a simple moving average method or a prediction model based on historical sequences. Essentially, a weight adjustment strategy table is established, defining the mapping relationship between the prediction deviation components and resource status change trends of different resource types and the weight adjustment amounts of each basic reward item in the reward function. In some embodiments, a weight adjustment strategy table is queried, and based on the various resource prediction deviation components calculated in the current period and the predicted resource status change trend for the next period, the weight adjustment amounts corresponding to the basic reward items in the reward function, such as resource utilization improvement reward, task scheduling success reward, and service level agreement breach penalty, are obtained. Optionally, the obtained weight adjustment amounts are added to the weights of the basic reward items in the previous period to obtain a new weight adjustment coefficient for each basic reward item in the reward function for the next period, thereby dynamically adjusting the optimization focus of the improved deep reinforcement learning algorithm. The calculation of dynamic weight adjustment can be expressed as:

[0083] $$w_k^{(t+1)} = w_k^{(t)} + \Delta w_k^{(t)}$$

[0084] where $w_k^{(t+1)}$ represents the weight adjustment coefficient of the k-th basic reward item in the reward function during the next training cycle (time t+1); $w_k^{(t)}$ represents the weight adjustment coefficient of the k-th basic reward item in the current training cycle (time t); and $\Delta w_k^{(t)}$ represents the weight adjustment amount obtained by querying the weight adjustment strategy table for the k-th basic reward item. In practice, during the next training cycle, the reward function adjusted by the weight adjustment coefficient is used to calculate the immediate reward for different scheduling actions under the current environment state, guiding the policy network of the improved deep reinforcement learning algorithm to optimize towards a dynamically changing scheduling objective. For example, in a specific scenario, assuming a computing task declares a GPU memory requirement of 16GB, while the peak memory consumption monitored during the actual running cycle is 20GB, the GPU memory resource requirement prediction deviation component is calculated as (20-16)/16 = 0.25. Simultaneously, it is predicted that the target GPU server node's memory utilization will increase by 10% in the next cycle. According to the mapping definition of the weight adjustment strategy table, this may correspond to increasing the weight adjustment amount for resource utilization improvement rewards and decreasing the weight adjustment amount for task scheduling success rewards.
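As a non-limiting illustration, the deviation component, the trend prediction, the strategy table query, and the additive weight update can be sketched in Python as follows. The strategy table contents, the deviation threshold of 0.2, the initial weights, and the utilization history are hypothetical; the embodiment states only that such a table maps deviation components and trends to weight adjustment amounts.

```python
# Hypothetical threshold above which a deviation component counts as "high".
DEVIATION_THRESHOLD = 0.2

# Hypothetical strategy table: (deviation is high, utilization is rising)
# -> weight adjustment amount for each basic reward item.
STRATEGY_TABLE = {
    (True, True): {"utilization": 0.10, "success": -0.05, "sla_penalty": 0.0},
    (True, False): {"utilization": 0.05, "success": 0.0, "sla_penalty": 0.0},
    (False, True): {"utilization": 0.05, "success": 0.0, "sla_penalty": 0.0},
    (False, False): {"utilization": 0.0, "success": 0.0, "sla_penalty": 0.0},
}


def deviation_component(actual, declared):
    """Absolute difference between actual and declared demand, relative to declared."""
    return abs(actual - declared) / declared


def utilization_trend(history, window=3):
    """Simple moving average of the most recent cycle-to-cycle changes."""
    diffs = [b - a for a, b in zip(history, history[1:])]
    recent = diffs[-window:]
    return sum(recent) / len(recent) if recent else 0.0


def query_adjustment(deviation, trend):
    """Look up the weight adjustment amounts in the strategy table."""
    return STRATEGY_TABLE[(deviation > DEVIATION_THRESHOLD, trend > 0.0)]


def update_weights(weights, adjustment):
    """Next-cycle weight = current weight + adjustment amount, per reward item."""
    return {k: w + adjustment.get(k, 0.0) for k, w in weights.items()}


deviation = deviation_component(actual=20, declared=16)  # GPU memory, in GB
trend = utilization_trend([0.50, 0.60, 0.70])            # hypothetical history
weights = {"utilization": 0.40, "success": 0.40, "sla_penalty": 0.20}
new_weights = update_weights(weights, query_adjustment(deviation, trend))
```

The table is keyed on two booleans only to keep the sketch short; a finer-grained table could bucket the deviation and the trend per resource type.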

[0085] In one embodiment of the present invention, real-time load data of each server node in the pre-selected initial candidate server node list is obtained. The real-time load data includes real-time CPU utilization, real-time memory usage, real-time storage input / output operation rate, and real-time network latency. The actual resource requirement characteristics of the computing task are parsed from the computing task description. These actual resource requirement characteristics include peak CPU core demand, peak memory demand, storage read / write bandwidth demand, network bandwidth demand, and the maximum tolerable latency of the task. Maximizing the overall resource utilization of the target server node is the primary optimization objective, while minimizing the total migration cost of the computing task across the entire cluster is a secondary optimization objective. A multi-objective optimization problem objective function is constructed. The actual resource requirement characteristics of the computing task and the remaining available resources of each initial candidate server node are used as constraints. These constraints include that the peak CPU core demand must not exceed the number of remaining available CPU cores on the server node, the peak memory demand must not exceed the amount of remaining available memory on the server node, and the maximum tolerable latency of the task must be greater than the estimated network latency after the task is deployed to the corresponding server node. The objective function and the constraints are combined to construct a multi-objective constrained optimization model for selecting the optimal target node from the initial candidate server nodes.

[0086] In the specific implementation, real-time load data for each server node in the pre-selected initial candidate server node list is obtained. This real-time load data includes real-time CPU utilization, real-time memory usage, real-time storage I / O operation rate, and real-time network latency. In the specific implementation, the actual resource requirement characteristics of the computing task are parsed from the task description. These characteristics include peak CPU core requirements, peak memory requirements, storage read / write bandwidth requirements, network bandwidth requirements, and the task's maximum tolerable latency. In a specific example scenario, assuming a computing task "Training_Job_A" has the following actual resource requirements parsed from its description: it requires a peak of 8 CPU cores, a peak of 32GB of memory, a storage read / write bandwidth requirement of 200MB / s, a network bandwidth requirement of 50MB / s, and a maximum tolerable latency of 150 milliseconds. Meanwhile, the initial candidate server node list contains three nodes; their partial real-time load data and remaining resource information are shown in Table 1.

[0087] Table 1: Real-time Load and Resource Data of Initial Candidate Server Nodes

[0088]

[0089] In practical implementation, maximizing the overall resource utilization of the target server node is the primary optimization objective. Overall resource utilization is defined as the weighted harmonic mean of the utilization rates of four resources: CPU, memory, storage I / O, and network bandwidth. In some embodiments, minimizing the total migration cost of computing tasks across the entire cluster is a secondary optimization objective. The total migration cost of computing tasks includes data transfer costs and service interruption penalty costs. Optionally, the primary optimization objective can be expressed by the following formula:

[0090] $$\max \; U_j = \frac{w_{cpu} + w_{mem} + w_{io} + w_{net}}{\dfrac{w_{cpu}}{u_{cpu,j}} + \dfrac{w_{mem}}{u_{mem,j}} + \dfrac{w_{io}}{u_{io,j}} + \dfrac{w_{net}}{u_{net,j}}}$$

[0091] where $U_j$ represents the estimated overall resource utilization rate after the computation task is deployed to candidate node j; $w_{cpu}$, $w_{mem}$, $w_{io}$, and $w_{net}$ represent the preset weights of the four resources (CPU, memory, storage input/output, and network bandwidth) when calculating overall utilization; and $u_{cpu,j}$, $u_{mem,j}$, $u_{io,j}$, and $u_{net,j}$ represent the estimated utilization rates of CPU, memory, storage I/O, and network bandwidth on node j after the computing task is deployed. In practice, the actual resource requirements of the computing task and the remaining available resources of each initial candidate server node are used as constraints. This means that the peak CPU core requirement must not exceed the number of remaining available CPU cores on the server node. For example, for the 8-core peak requirement of "Training_Job_A", nodes Node_101 (14 cores remaining), Node_205 (8 cores remaining), and Node_308 (24 cores remaining) all meet this constraint. In practice, the peak memory requirement must not exceed the amount of remaining available memory on the server node. In the example, "Training_Job_A" has a 32GB peak memory requirement, which is met by nodes Node_101 (128GB remaining), Node_205 (32GB remaining), and Node_308 (192GB remaining). In some embodiments, the constraints include that the maximum tolerable latency of the task must be greater than the estimated network latency after the task is deployed to the corresponding server node. In the example, the maximum tolerable latency of "Training_Job_A" is 150 milliseconds, and the real-time network latency of each node is less than this value, thus satisfying the constraint. Optionally, the storage read/write bandwidth requirement must also be less than or equal to the remaining available storage bandwidth of the server node. 
In a specific implementation, the objective function is combined with all constraints to construct a multi-objective constrained optimization model for selecting the optimal target node from the initial candidate server nodes. It is understood that the constructed model will be used in the solution and scoring process of subsequent embodiments.
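As a non-limiting illustration, the hard constraint check and the weighted harmonic mean of the primary objective can be sketched in Python as follows. The remaining cores and memory follow the "Training_Job_A" example; the latency figures are hypothetical placeholders (the example states only that each is below 150 milliseconds), and the field names are assumptions.

```python
def satisfies_hard_constraints(task, node):
    """Peak CPU and memory must fit; tolerable latency must exceed the estimate."""
    return (
        task["peak_cores"] <= node["free_cores"]
        and task["peak_mem_gb"] <= node["free_mem_gb"]
        and task["max_latency_ms"] > node["est_latency_ms"]
    )


def weighted_harmonic_mean(utils, weights):
    """sum(w) / sum(w / u) over the four resources; every u must be positive."""
    if any(u <= 0 for u in utils.values()):
        raise ValueError("utilization rates must be positive")
    return sum(weights.values()) / sum(weights[r] / utils[r] for r in utils)


task = {"peak_cores": 8, "peak_mem_gb": 32, "max_latency_ms": 150}
nodes = {
    # est_latency_ms values are hypothetical placeholders below 150 ms.
    "Node_101": {"free_cores": 14, "free_mem_gb": 128, "est_latency_ms": 20},
    "Node_205": {"free_cores": 8, "free_mem_gb": 32, "est_latency_ms": 35},
    "Node_308": {"free_cores": 24, "free_mem_gb": 192, "est_latency_ms": 50},
}
feasible = [name for name, n in nodes.items() if satisfies_hard_constraints(task, n)]

# Hypothetical equal weights and post-deployment utilization estimates.
weights = {"cpu": 1.0, "mem": 1.0, "io": 1.0, "net": 1.0}
utils = {"cpu": 0.5, "mem": 0.5, "io": 0.5, "net": 0.5}
overall = weighted_harmonic_mean(utils, weights)
```

A harmonic mean is pulled toward the smallest utilization rate, so a node that leaves one resource nearly idle scores lower than a node with balanced usage.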

[0092] In one embodiment of the present invention, a completed multi-objective constraint optimization model is obtained. The latest real-time load data of the initial candidate server nodes is acquired from the monitoring system of the heterogeneous server cluster and substituted into the constraints of the multi-objective constraint optimization model to verify whether each initial candidate server node satisfies all hard constraints. For initial candidate server nodes that satisfy all hard constraints, a linear weighted sum method is used to integrate multiple optimization objective functions in the multi-objective constraint optimization model into a single-objective fitness evaluation function, and a corresponding weight coefficient is assigned to each optimization objective. The real-time load data of each initial candidate server node and the actual resource requirement characteristics of the computing task are substituted into the fitness evaluation function for calculation. The fitness evaluation function comprehensively considers the estimated resource utilization efficiency of the scheduled computing task on the corresponding initial candidate server node, the estimated task execution completion time, and the estimated cluster load balancing cost caused by this scheduling, and outputs a quantitative fitness score. Specifically, based on the actual resource requirements of the computing task and the current resource utilization of the target server node, the expected utilization of the four main resources (CPU, memory, storage, and network) of the target server node after scheduling the current computing task to the target server node is calculated. A weighted geometric average is applied to the expected utilization of these four main resources to obtain the estimated node-level resource utilization efficiency of the current computing task on the corresponding candidate server node. 
The weighting coefficients are pre-configured according to the degree of dependence of the current computing task on each type of resource. Based on the estimated execution cycle in the historical execution records or task description of the computing task, combined with the current load of the target server node and the queuing status of other tasks on the same node, the execution completion time of the computing task on the target server node is estimated. The migration of the computing task from its current location to the target server node is simulated, and the data transfer volume and service interruption time required during the migration process are calculated. This is used as the quantified value of the cluster load balancing cost caused by this scheduling. The node-level resource utilization efficiency, the reciprocal of the execution completion time, and the quantified value of the cluster load balancing cost are linearly combined according to preset weights to calculate the final suitability score. For all nodes in the initial candidate server node list that meet the constraints, a fitness score is calculated, and the calculation task is scheduled to be executed on the server node with the highest fitness score.
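As a non-limiting illustration, the weighted geometric average and the linear combination into a suitability score can be sketched in Python as follows. All numeric values, node names, and weights are hypothetical and are not the values of the embodiment's tables; subtracting the load balancing cost term is an assumption consistent with treating cost as a penalty.

```python
import math


def weighted_geometric_mean(utils, weights):
    """exp(sum(w * ln(u)) / sum(w)); every utilization must be positive."""
    total = sum(weights.values())
    return math.exp(sum(weights[r] * math.log(utils[r]) for r in utils) / total)


def fitness(efficiency, completion_time_s, balancing_cost, alpha, beta, gamma):
    """Linear combination: efficiency and 1/time are rewarded, cost is penalized."""
    return alpha * efficiency + beta / completion_time_s - gamma * balancing_cost


# Hypothetical per-candidate estimates.
candidates = {
    "n1": {"efficiency": 0.6, "completion_time_s": 100.0, "balancing_cost": 2.0},
    "n2": {"efficiency": 0.8, "completion_time_s": 200.0, "balancing_cost": 1.0},
}
# Hypothetical preset weights, all greater than zero.
alpha, beta, gamma = 1.0, 10.0, 0.1

scores = {
    name: fitness(alpha=alpha, beta=beta, gamma=gamma, **est)
    for name, est in candidates.items()
}
best = max(scores, key=scores.get)
```

Because the three terms have different units, the weights also act as scale factors; in this sketch `beta` converts seconds into a term comparable with the efficiency.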

[0093] In specific implementation, the completed multi-objective constraint optimization model is obtained. The latest real-time load data of the initial candidate server nodes is acquired from the monitoring system of the heterogeneous server cluster and substituted into the constraints of the multi-objective constraint optimization model to verify whether each initial candidate server node satisfies all hard constraints. In specific implementation, a linear weighted sum method is used for the initial candidate server nodes that satisfy all hard constraints to integrate multiple optimization objective functions in the multi-objective constraint optimization model into a single-objective fitness evaluation function. A corresponding weight coefficient is assigned to each optimization objective, which is pre-set by the system administrator according to the cluster scheduling strategy. In some embodiments, the real-time load data of each initial candidate server node and the actual resource requirements of the computation task are substituted into the fitness evaluation function for calculation. For the example task "Training_Job_A" described in the above embodiment and its three candidate nodes Node_101, Node_205, and Node_308, assuming that all nodes satisfy the hard constraints, the estimated data required for the calculation is shown in Table 2.

[0094] Table 2: Data Table for Calculating Candidate Node Fit Score

[0095]

[0096] In practical implementation, the suitability evaluation function comprehensively considers the estimated resource utilization efficiency of the scheduled computing task on the corresponding initial candidate server node, the estimated task execution completion time, and the estimated cluster load balancing cost caused by this scheduling, outputting a quantitative suitability score. This can be understood as calculating the expected utilization of the four main resources (CPU, memory, storage, and network) of the target server node after scheduling the current computing task to the target server node, based on the actual resource requirements of the computing task and the current resource utilization of the target server node. In some embodiments, a weighted geometric average is applied to the expected utilization of the four main resources (CPU, memory, storage, and network) to obtain the estimated node-level resource utilization efficiency of the current computing task on the corresponding candidate server node. The weighting coefficients are pre-configured according to the current computing task's dependence on each type of resource. In practical implementation, the execution completion time of the task on the target server node is estimated based on the historical execution records of the computing task or the estimated execution cycle in the task description, combined with the current load of the target server node and the queuing status of other tasks on the same node. Optionally, the simulation migrates the computation task from its current location to the target server node, calculating the data transfer volume and service interruption time required for the migration process, and using these as the quantified value of the cluster load balancing cost caused by this scheduling. 
In specific implementation, the node-level resource utilization efficiency, the reciprocal of the execution completion time, and the quantified value of the cluster load balancing cost are linearly combined according to preset weights to obtain the final fitness score. The fitness score calculation formula can be expressed as:

[0097] $$F_j = \alpha \cdot E_j + \beta \cdot \frac{1}{T_j} - \gamma \cdot C_j$$

[0098] where $F_j$ represents the final fit score for deploying the computation task to candidate node j; $E_j$ represents the node-level resource utilization efficiency; $T_j$ represents the estimated execution completion time; $C_j$ represents the estimated cost of load balancing; and α, β, and γ are the preset weighting coefficients, which are greater than zero and are used to balance the relative importance of different optimization objectives. It can be understood that by substituting the example data in the table into the formula and setting the weighting coefficients α=0.5, β=100, and γ=0.001 for calculation, the score F for Node_101 is approximately 0.361 (0.5 + 0.72 + 100 / 3600 - 0.001150), the score F for Node_205 is approximately 0.314 (0.5 + 0.65 + 100 / 4200 - 0.00110), and the score F for Node_308 is approximately 0.393 (0.5 + 0.80 + 100 / 3000 - 0.001300). In practice, a fitness score is calculated for all nodes in the initial candidate server node list that meet the constraints, and the computation task is scheduled to be executed on the server node with the highest fitness score. According to the calculation results in the example above, Node_308 has the highest fitness score, so the computation task "Training_Job_A" is scheduled to be executed on the Node_308 server node.

[0099] In one embodiment of the present invention, during the execution of a computing task on a target server node, its execution progress information is continuously collected. This progress information includes real-time CPU utilization curves, real-time memory usage curves, the number of completed input / output operations, and the amount of data processed. The declared resource requirement characteristics are obtained from the computing task description, including the expected CPU utilization range, the expected memory usage limit, the expected total number of input / output operations, and the expected total amount of data processed. The execution progress information is compared with the resource requirement characteristics obtained from the computing task description to calculate the deviation between real-time resource consumption and expected resource requirements. This deviation includes CPU utilization deviation, memory usage deviation, and input / output operation rate deviation. Based on the calculated deviation, the empirical feedback signal strength that the improved deep reinforcement learning algorithm should obtain in this scheduling decision is calculated. The signal strength is positively correlated with the deviation. Specifically, a resource deviation tolerance threshold is set. For each resource type, when the calculated deviation exceeds its corresponding resource deviation tolerance threshold, it is determined that the corresponding resource type has deviated. The number of resource types that deviated during the current execution cycle of the computation task is counted, along with the extent to which the deviation of each resource exceeded its tolerance threshold. A base penalty coefficient is determined based on the number of resource types that deviated; the more resource types that deviated, the larger the base penalty coefficient. 
A weighted penalty is calculated based on the extent to which the deviation of each resource exceeded its tolerance threshold; the larger the deviation, the larger the weighted penalty. The base penalty coefficient and the weighted penalty are multiplied to obtain a negative empirical feedback signal strength value. The absolute value of this negative value is the penalty strength of the current scheduling decision. This penalty strength will be used to adjust the update gradient of the policy network parameters in the improved deep reinforcement learning algorithm. The complete state, action, and calculated empirical feedback signal strength of this scheduling decision are stored as empirical data in the empirical replay buffer of the improved deep reinforcement learning algorithm for subsequent updates to the algorithm's policy network parameters.
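As a non-limiting illustration, the deviation check against tolerance thresholds and the resulting negative feedback signal can be sketched in Python as follows. The rule that maps the number of deviating resource types to a base penalty coefficient (1.0 plus 0.5 per additional type), the per-resource weights, and the input figures are hypothetical; the embodiment requires only that the coefficient grows with the number of deviating types and that the weighted penalty grows with the excess over the threshold.

```python
def upper_bound_deviation(observed, expected_upper):
    """Relative excess of the observed value over the declared upper bound."""
    return (observed - expected_upper) / expected_upper


def feedback_signal(deviations, thresholds, resource_weights=None, base_step=0.5):
    """Return a non-positive signal; its absolute value is the penalty strength."""
    resource_weights = resource_weights or {}
    # Keep only resource types whose deviation exceeds the tolerance threshold.
    excess = {
        r: d - thresholds[r] for r, d in deviations.items() if d > thresholds[r]
    }
    if not excess:
        return 0.0
    # Hypothetical rule: the base coefficient grows with the number of deviating types.
    base_coefficient = 1.0 + base_step * (len(excess) - 1)
    # The weighted penalty grows with the amount by which each threshold is exceeded.
    weighted_penalty = sum(resource_weights.get(r, 1.0) * e for r, e in excess.items())
    return -(base_coefficient * weighted_penalty)


# Illustrative figures: CPU observed at 92% against an 85% upper bound,
# memory observed at 28 GB against a 32 GB limit.
deviations = {
    "cpu": upper_bound_deviation(92, 85),
    "memory": upper_bound_deviation(28, 32),
}
thresholds = {"cpu": 0.05, "memory": 0.05}
signal = feedback_signal(deviations, thresholds)
penalty_strength = abs(signal)
```

In this sketch only the CPU exceeds its tolerance, so the base coefficient stays at 1.0 and the penalty strength equals the CPU excess over its threshold.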

[0100] In practice, the execution progress information of the computation task is continuously collected during its execution on the target server node. This progress information includes real-time CPU utilization curves, real-time memory usage curves, the number of completed input / output operations, and the amount of data processed. The declared resource requirements are obtained from the computation task description, including the expected CPU utilization range, the expected memory usage limit, the expected total number of input / output operations, and the expected total amount of data processed. In a specific example scenario, a deep learning model training task scheduled to node Node_308 declares the following resource requirements from its description: expected CPU utilization range of 70% to 85%, expected memory usage limit of 32GB, and expected total network data transfer volume of 500GB. In practice, the execution progress information is compared with the resource requirements obtained from the computation task description to calculate the deviation between real-time resource consumption and expected resource requirements. For example, real-time data collected at a certain monitoring moment during task execution might show: CPU utilization of 92%, memory usage of 28GB, and network data transfer volume of 300GB. It can be understood that deviation includes CPU utilization deviation, memory usage deviation, and input / output operation rate deviation, calculated as a standardized measure of the difference between the actual observed value and the expected declared value. In some embodiments, CPU utilization deviation can be calculated as the ratio of (real-time CPU utilization - expected upper limit of CPU utilization range) to the upper limit of the expected range. In the example, the expected upper limit is 85%, and the real-time value is 92%, so the CPU utilization deviation is (92-85) / 85≈0.082. 
In specific implementations, the strength of the empirical feedback signal that the improved deep reinforcement learning algorithm should obtain for this scheduling decision is calculated from the deviation. The strength of the empirical feedback signal is positively correlated with the deviation. Optionally, a resource deviation tolerance threshold is set for each resource type; when the calculated deviation exceeds the corresponding threshold, that resource type is determined to have deviated. In the example, assuming the preset CPU utilization deviation tolerance threshold is 5% (i.e., 0.05), the calculated CPU utilization deviation of 0.082 is greater than 0.05, so the CPU resource is determined to have deviated. In practice, the number of resource types that deviate during the current execution cycle of the computation task is counted, along with the extent to which the deviation of each resource exceeds its tolerance threshold. In the example scenario, if only CPU utilization deviates, the number of deviated resource types is 1, and the extent to which the CPU utilization deviation exceeds its tolerance threshold (0.05) is 0.082 - 0.05 = 0.032. In some embodiments, a base penalty coefficient is determined based on the number of resource types that deviate; the more resource types that deviate, the larger the base penalty coefficient. A weighted penalty is calculated based on the extent to which the deviation of each resource exceeds its tolerance threshold; the greater the excess, the greater the weighted penalty. In practice, the base penalty coefficient is multiplied by the weighted penalty and the product is negated to obtain a negative empirical feedback signal strength value. The absolute value of this value is the penalty strength for this scheduling decision. 
This penalty strength will be used to adjust the update gradient of the policy network parameters in the improved deep reinforcement learning algorithm. The formula for calculating the empirical feedback signal strength value can be expressed as:

[0101] $$P = -B \cdot \sum_{r \in R} \left( d_r - \theta_r \right)$$

[0102] where: $P$ represents the calculated strength value of the empirical feedback signal (a negative value indicates a penalty); $B$ represents the base penalty coefficient determined by the number of resource types that have deviated; $R$ represents the set of all resource types that have deviated; $d_r$ represents the calculated deviation value for resource type r; and $\theta_r$ represents the preset deviation tolerance threshold for resource type r. In the example scenario, if the base penalty coefficient B is set to 2.0, then P is calculated as -(2.0 × 0.032) = -0.064. Optionally, the complete state, action, and calculated empirical feedback signal strength of this scheduling decision are stored as empirical data in the empirical replay buffer of the improved deep reinforcement learning algorithm for subsequent updates to the algorithm's policy network parameters. It can be understood that during subsequent policy network training, empirical data containing this empirical feedback signal strength will be sampled from the empirical replay buffer, thereby using the actual resource matching effect of the scheduling decision as a feedback signal to update and improve the scheduling policy.
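The threshold check, penalty calculation, and replay-buffer storage can be sketched as below. This is a hedged illustration only. It assumes the signal is the negated product of the base penalty coefficient and the sum of threshold excesses, which reproduces the worked example of -(2.0 × 0.032). The mapping in `base_coefficient`, the zero signal when nothing deviates, and the `state` and `action` placeholders are hypothetical choices that the text does not specify.

```python
from collections import deque


def base_coefficient(num_deviated: int) -> float:
    # Hypothetical mapping: grows with the number of deviated resource types
    # and yields 2.0 for a single deviated type, matching the example.
    return 2.0 * num_deviated


def feedback_signal(deviations: dict, thresholds: dict):
    """Return (signal strength, sorted list of deviated resource types)."""
    # Keep only resource types whose deviation exceeds its tolerance threshold.
    excess = {r: d - thresholds[r] for r, d in deviations.items() if d > thresholds[r]}
    if not excess:
        return 0.0, []  # assumption: no penalty when nothing deviates
    strength = -base_coefficient(len(excess)) * sum(excess.values())
    return strength, sorted(excess)


replay_buffer = deque(maxlen=10_000)  # bounded experience replay buffer

deviations = {"cpu": 0.082, "memory": -0.125}
thresholds = {"cpu": 0.05, "memory": 0.05}
signal, deviated = feedback_signal(deviations, thresholds)

# Placeholder state and action; a real system would store encoded tensors.
state = {"node": "Node_308", "cpu_util": 0.92, "mem_gb": 28}
action = "schedule_to_Node_308"
replay_buffer.append((state, action, signal))

print(deviated, round(signal, 3))  # ['cpu'] -0.064
```

Storing the signal alongside the state and action lets a later training step sample the tuple and treat the signal as the reward term for that decision.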

[0103] The above are merely preferred embodiments of the present invention and are not intended to limit the present invention in any other way. Any person skilled in the art may make changes or modifications to the above-disclosed technical content to create equivalent embodiments that can be applied to other fields. However, any simple modifications, equivalent changes, and modifications made to the above embodiments based on the technical essence of the present invention without departing from the scope of the present invention shall still fall within the protection scope of the present invention.

Claims

1. A method for dynamic elastic scheduling of computing resources for heterogeneous server clusters, characterized in that it includes the following steps: receiving a computing task description from an upper-layer application, and parsing the resource requirement characteristics and dependency constraints of the computing task from the task description; based on the resource requirement characteristics, using an improved deep reinforcement learning algorithm to pre-select a set of initial candidate server nodes for the computing task in the global resource state space of a heterogeneous server cluster, wherein the improved deep reinforcement learning algorithm dynamically adjusts its reward function based on the actual resource requirement characteristics of the computing task and the resource state change trend of the heterogeneous server cluster; constructing a multi-objective constrained optimization model based on the real-time load data of the initial candidate server nodes and the actual resource requirements of the computing task; solving the multi-objective constrained optimization model, calculating the task deployment adaptability score of each initial candidate server node, and scheduling the computing task to be executed on the server node with the highest adaptability score; and collecting, in real time, the execution progress information of the computing task on the server node, comparing it with the resource requirement characteristics, and dynamically adjusting the parameters of the improved deep reinforcement learning algorithm according to the comparison results.

2. The dynamic computing resource elastic scheduling method for heterogeneous server clusters according to claim 1, characterized in that the improved deep reinforcement learning algorithm pre-selecting a set of initial candidate server nodes for the computing task in the global resource state space of the heterogeneous server cluster specifically includes: receiving the computing task description from the upper-layer application and extracting the resource requirement characteristics of the computing task from it, the resource requirement characteristics including at least the number of CPU cores, memory capacity, storage space, and type and number of accelerator cards required by the computing task; obtaining a global resource status snapshot of the heterogeneous server cluster, which includes the static configuration attributes and dynamic load attributes of each server node in the cluster, the static configuration attributes including CPU architecture, total memory, storage type and capacity, and accelerator card model, and the dynamic load attributes including CPU utilization, memory usage, storage read / write speed, and network bandwidth utilization; encoding the resource requirement characteristics of the computing task and the global resource status snapshot into a multi-dimensional feature vector, which is used as the environment state input for the improved deep reinforcement learning algorithm; the improved deep reinforcement learning algorithm, based on the global resource state space of the heterogeneous server cluster and according to a preset exploration strategy, starting from the current environment state and evaluating the expected long-term return of scheduling the computing task to different server nodes in the action space; and the improved deep reinforcement learning algorithm selecting the several scheduling actions with the highest expected long-term return, adding the server node pointed to by each selected scheduling action to a server node list, and outputting the server node list as the set of initial candidate server nodes pre-selected for the computing task.

3. The method for dynamic elastic scheduling of computing resources for heterogeneous server clusters according to claim 1, characterized in that the improved deep reinforcement learning algorithm dynamically adjusting its reward function based on the actual resource requirement characteristics of the computing task and the resource state change trend of the heterogeneous server cluster specifically includes: initializing the reward function of the improved deep reinforcement learning algorithm, the reward function including multiple basic reward items, including a resource utilization improvement reward, a task scheduling success reward, and a service level agreement breach penalty; in each training cycle in which the improved deep reinforcement learning algorithm interacts with the environment, obtaining the actual resource consumption data of the computing task on the target server node in the current training cycle, as well as the change in the resource state of the target server node before and after the computing task is run; comparing the actual resource consumption data with the resource requirement characteristics parsed from the computing task description, and calculating the resource demand prediction deviation value; obtaining the resource status of the target server node at the start of the next training cycle, and predicting the resource status change trend of the target server node based on the resource status data of multiple historical training cycles; based on the resource demand prediction deviation value and the resource status change trend, calculating the weight adjustment coefficient of each basic reward item in the reward function, the weight adjustment coefficient being used in the next training cycle to increase the weight of reward items that are more relevant to the recent scheduling goal and to reduce the weight of reward items that are less relevant to it; and in the next training cycle, using the reward function adjusted by the weight adjustment coefficient to calculate the immediate reward for different scheduling actions under the current environment state, guiding the policy network of the improved deep reinforcement learning algorithm to optimize toward the dynamically changing scheduling objectives.

4. The method for dynamic elastic scheduling of computing resources for heterogeneous server clusters according to claim 1, characterized in that constructing the multi-objective constrained optimization model based on the real-time load data of the initial candidate server nodes and the actual resource requirements of the computing task specifically includes: obtaining the real-time load data of each server node in the initial candidate server node list, the real-time load data including real-time CPU utilization, real-time memory usage, real-time storage input / output operation rate, and real-time network latency; parsing the actual resource requirement characteristics of the computing task from the computing task description, the actual resource requirement characteristics including peak CPU core requirement, peak memory requirement, storage read / write bandwidth requirement, network bandwidth requirement, and maximum tolerable latency of the task; constructing the objective function of a multi-objective optimization problem, in which the primary optimization objective is to maximize the overall resource utilization of the target server node and the secondary optimization objective is to minimize the total migration cost of computing tasks throughout the cluster; using the actual resource requirements of the computing task and the remaining available resources of each initial candidate server node as constraints, the constraints including that the peak CPU core requirement shall not exceed the number of remaining available CPU cores of the server node, that the peak memory requirement shall not exceed the amount of remaining available memory of the server node, and that the maximum tolerable latency of the task must be greater than the estimated network latency after the task is deployed to the corresponding server node; and combining the objective function with the constraints to construct the multi-objective constrained optimization model, which is used to select the optimal target node from the initial candidate server nodes.

5. The dynamic computing resource elastic scheduling method for heterogeneous server clusters according to claim 1, characterized in that solving the multi-objective constrained optimization model and calculating the task deployment adaptability score of each initial candidate server node specifically includes: obtaining the constructed multi-objective constrained optimization model, obtaining the latest real-time load data of the initial candidate server nodes from the monitoring system of the heterogeneous server cluster, and substituting it into the constraints of the multi-objective constrained optimization model to verify whether each initial candidate server node satisfies all hard constraints; for the initial candidate server nodes that satisfy all hard constraints, using a linear weighted sum method to integrate the multiple optimization objective functions in the multi-objective constrained optimization model into a single-objective adaptability evaluation function, and assigning a corresponding weight coefficient to each optimization objective; substituting the real-time load data of each initial candidate server node and the actual resource requirements of the computing task into the adaptability evaluation function, which integrates the estimated resource utilization efficiency of the computing task to be scheduled on the corresponding initial candidate server node, the estimated task execution completion time, and the estimated cluster load balancing cost caused by this scheduling, and outputs a quantitative adaptability score; and calculating an adaptability score for every node in the initial candidate server node list that satisfies the constraints, and scheduling the computing task to be executed on the server node with the highest adaptability score.
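The filter-then-score procedure of claim 5 can be illustrated with a short sketch. It is not the claimed implementation. The `Node` fields, the task dictionary keys, the objective names, and the weight values are all hypothetical, and each objective value is assumed to be pre-normalized so that larger is better.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cores: int
    free_mem_gb: float
    est_latency_ms: float
    objectives: dict  # objective name -> normalized value, larger is better


def feasible(node: Node, task: dict) -> bool:
    """Hard constraints: CPU cores, memory, and tolerable latency."""
    return (task["peak_cores"] <= node.free_cores
            and task["peak_mem_gb"] <= node.free_mem_gb
            and task["max_latency_ms"] > node.est_latency_ms)


def pick_node(nodes, task, weights):
    """Score feasible nodes by a linear weighted sum and return the best one."""
    scored = {
        n.name: sum(weights[k] * v for k, v in n.objectives.items())
        for n in nodes if feasible(n, task)
    }
    best = max(scored, key=scored.get) if scored else None
    return best, scored


task = {"peak_cores": 8, "peak_mem_gb": 32, "max_latency_ms": 20}
weights = {"utilization": 0.6, "speed": 0.4}  # hypothetical weight coefficients
nodes = [
    Node("A", 16, 64, 5, {"utilization": 0.8, "speed": 0.5}),
    Node("B", 4, 64, 5, {"utilization": 0.9, "speed": 0.9}),   # too few free cores
    Node("C", 12, 48, 10, {"utilization": 0.7, "speed": 0.9}),
]
best, scored = pick_node(nodes, task, weights)
print(best)  # C
```

Node B has the best raw objectives but is excluded before scoring, which shows why the hard constraints are verified first.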

6. The dynamic computing resource elastic scheduling method for heterogeneous server clusters according to claim 5, characterized in that the adaptability evaluation function integrating the estimated resource utilization efficiency of the computing task to be scheduled on the corresponding initial candidate server node, the estimated task execution completion time, and the estimated cluster load balancing cost caused by this scheduling specifically includes: based on the actual resource requirements of the computing task and the current resource utilization of the target server node, calculating the expected utilization of the four main resources of the target server node (CPU, memory, storage, and network) after the current computing task is scheduled to the target server node; calculating the weighted geometric mean of the expected utilization rates of the four main resources (CPU, memory, storage, and network) to obtain the estimated node-level resource utilization efficiency of the current computing task on the corresponding candidate server node, the weighting coefficients being pre-configured based on the degree of dependence of the current computing task on each type of resource; based on the historical execution records of the computing task or the estimated execution cycle in the task description, combined with the current load of the target server node and the queuing status of other tasks on the same node, calculating the estimated execution completion time of the task on the target server node; simulating the migration of the computing task from its current location to the target server node, calculating the amount of data transfer and the service interruption time required during the migration, and using these as the quantified value of the cluster load balancing cost caused by this scheduling; and linearly combining the node-level resource utilization efficiency, the reciprocal of the execution completion time, and the quantified value of the cluster load balancing cost according to preset weights to calculate the final adaptability score.
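The weighted geometric mean and the final linear combination in claim 6 can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names, the coefficient values, and the negative sign on the cost coefficient are hypothetical, and the inputs are assumed to be normalized to comparable scales.

```python
import math


def weighted_geometric_mean(values: dict, weights: dict) -> float:
    """exp(sum(w_r * ln(u_r)) / sum(w_r)) over the resource types in `values`."""
    if any(v <= 0 for v in values.values()):
        raise ValueError("utilization values must be positive")
    total = sum(weights[r] for r in values)
    return math.exp(sum(weights[r] * math.log(values[r]) for r in values) / total)


def adaptability_score(efficiency, completion_time_s, balance_cost, coeffs):
    # Linear combination of efficiency, reciprocal completion time, and cost.
    return (coeffs["efficiency"] * efficiency
            + coeffs["speed"] * (1.0 / completion_time_s)
            + coeffs["cost"] * balance_cost)


expected_util = {"cpu": 0.8, "memory": 0.5, "storage": 0.4, "network": 0.25}
resource_weights = {"cpu": 1.0, "memory": 1.0, "storage": 1.0, "network": 1.0}
efficiency = weighted_geometric_mean(expected_util, resource_weights)

# Assumption: the cost term carries a negative coefficient so that a higher
# load balancing cost lowers the score.
coeffs = {"efficiency": 0.5, "speed": 0.3, "cost": -0.2}
score = adaptability_score(efficiency, completion_time_s=10.0, balance_cost=0.5, coeffs=coeffs)
print(round(efficiency, 4), round(score, 4))
```

A geometric mean penalizes a node whose utilization is unbalanced across resources more strongly than an arithmetic mean would, because one low factor pulls the whole product down.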

7. The dynamic computing resource elastic scheduling method for heterogeneous server clusters according to claim 1, characterized in that collecting, in real time, the execution progress information of the computing task on the server node, comparing it with the resource requirement characteristics, and dynamically adjusting the parameters of the improved deep reinforcement learning algorithm according to the comparison results specifically includes: continuously collecting the execution progress information of the computing task during its execution on the target server node, the execution progress information including a real-time CPU utilization curve, a real-time memory usage curve, the number of completed input / output operations, and the amount of data processed; obtaining the declared resource requirement characteristics from the computing task description, including the expected CPU utilization range, the expected memory usage limit, the expected total number of input / output operations, and the expected total amount of data to be processed; comparing the execution progress information with the resource requirement characteristics obtained from the computing task description to calculate the deviation between real-time resource consumption and expected resource requirements, the deviation including CPU utilization deviation, memory usage deviation, and input / output operation rate deviation; based on the calculated deviation, calculating the strength of the empirical feedback signal that the improved deep reinforcement learning algorithm should obtain for this scheduling decision, the signal strength being positively correlated with the deviation; and storing the complete state, action, and calculated empirical feedback signal strength of this scheduling decision as empirical data in the empirical replay buffer of the improved deep reinforcement learning algorithm for subsequent updates to the algorithm's policy network parameters.

8. The dynamic computing resource elastic scheduling method for heterogeneous server clusters according to claim 7, characterized in that calculating, based on the calculated deviation, the strength of the empirical feedback signal that the improved deep reinforcement learning algorithm should obtain for this scheduling decision specifically includes: setting a resource deviation tolerance threshold for each resource type, and determining that a resource type has deviated when its calculated deviation exceeds the corresponding tolerance threshold; counting the number of resource types that deviated during the current execution cycle of the computing task, as well as the extent to which the deviation of each type of resource exceeded its tolerance threshold; determining the base penalty coefficient based on the number of resource types that deviated, wherein the more resource types that deviate, the larger the base penalty coefficient; calculating the weighted penalty based on the extent to which the deviation of each type of resource exceeds its tolerance threshold, wherein the greater the excess, the greater the weighted penalty; and multiplying the base penalty coefficient by the weighted penalty to obtain a negative empirical feedback signal strength value, the absolute value of which is the penalty strength of this scheduling decision, the penalty strength being used to adjust the update gradient of the policy network parameters in the improved deep reinforcement learning algorithm.

9. The dynamic computing resource elastic scheduling method for heterogeneous server clusters according to claim 2, characterized in that encoding the resource requirement characteristics of the computing task and the global resource status snapshot of the heterogeneous server cluster into a multi-dimensional feature vector used as the environment state input for the improved deep reinforcement learning algorithm specifically includes: normalizing the resource requirement characteristics of the computing task, mapping the absolute values of the CPU core requirement, memory capacity requirement, storage space requirement, and accelerator card quantity requirement to relative values between zero and one; performing one-hot encoding on the static configuration attributes in the global resource status snapshot of the heterogeneous server cluster, converting categorical attributes such as CPU architecture and accelerator card model into high-dimensional sparse binary vectors; standardizing the dynamic load attributes in the global resource status snapshot to eliminate the scale differences in load indicators caused by configuration differences between server nodes, converting percentage values such as CPU utilization and memory utilization into scores under the standard normal distribution; concatenating the normalized resource requirement feature vector of the computing task with the one-hot encoded and standardized server node state feature vector along the feature dimension to form a unified high-dimensional feature vector; and inputting the unified high-dimensional feature vector, which serves as the state representation describing the "task to be scheduled - cluster environment", into the state feature extraction network of the improved deep reinforcement learning algorithm.
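The encoding pipeline of claim 9 can be sketched with the standard library alone. This is a reduced illustration under assumptions: only two task features, one categorical attribute, and one load attribute are encoded, and the vocabulary `CPU_ARCHS` and the normalization bounds (64 cores, 256 GB) are hypothetical.

```python
from statistics import mean, pstdev

CPU_ARCHS = ["x86_64", "arm64"]  # hypothetical vocabulary for one-hot encoding


def min_max(value, lo, hi):
    """Map an absolute value to a relative value in [0, 1]."""
    return (value - lo) / (hi - lo)


def one_hot(value, vocabulary):
    return [1.0 if value == v else 0.0 for v in vocabulary]


def z_scores(values):
    """Standardize a load metric across nodes (zero mean, unit variance)."""
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]


def encode_state(task, nodes):
    # Assumed upper bounds for normalization: 64 cores and 256 GB of memory.
    task_vec = [min_max(task["cores"], 0, 64), min_max(task["mem_gb"], 0, 256)]
    cpu_z = z_scores([n["cpu_util"] for n in nodes])
    node_vec = []
    for n, z in zip(nodes, cpu_z):
        node_vec += one_hot(n["arch"], CPU_ARCHS) + [z]
    return task_vec + node_vec  # concatenation along the feature dimension


task = {"cores": 8, "mem_gb": 32}
nodes = [{"arch": "x86_64", "cpu_util": 0.2}, {"arch": "arm64", "cpu_util": 0.8}]
state = encode_state(task, nodes)
print(len(state))  # 8
```

The resulting flat vector is what a state feature extraction network would consume; a real encoder would cover all attributes listed in claim 2.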

10. The dynamic computing resource elastic scheduling method for heterogeneous server clusters according to claim 3, characterized in that calculating the weight adjustment coefficient of each basic reward item in the reward function based on the resource demand prediction deviation value and the resource status change trend specifically includes: decomposing the resource demand prediction deviation value by resource type to obtain prediction deviation components for resources such as CPU, memory, storage, and network; decomposing the resource status change trend of the target server node by resource type to predict the direction and magnitude of the change in the utilization rate of each type of resource in the next period; establishing a weight adjustment strategy table, which defines the mapping from the prediction deviation components and resource status change trends of the different resource types to the weight adjustment amounts of each basic reward item in the reward function; querying the weight adjustment strategy table with the prediction deviation components calculated in the current period and the predicted resource status change trends for the next period to obtain the weight adjustment amounts corresponding to the basic reward items in the reward function, such as the resource utilization improvement reward, the task scheduling success reward, and the service level agreement breach penalty; and adding each obtained weight adjustment amount to the weight of the corresponding basic reward item in the previous period to obtain the new weight adjustment coefficient of each basic reward item in the reward function for the next period, thereby dynamically adjusting the optimization focus of the improved deep reinforcement learning algorithm.
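The table lookup of claim 10 can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the bucketing of deviations and trends into "up", "flat", and "down", the 0.05 cutoff, the table entries, and the reward item names are hypothetical and do not come from the claims.

```python
def bucket(value: float, eps: float = 0.05) -> str:
    """Discretize a signed deviation or trend into a table key."""
    if value > eps:
        return "up"
    if value < -eps:
        return "down"
    return "flat"


# Hypothetical weight adjustment strategy table:
# (resource type, deviation bucket, trend bucket) -> per-item adjustment amounts.
STRATEGY_TABLE = {
    ("cpu", "up", "up"): {"utilization_gain": -0.05, "sla_penalty": 0.10},
    ("memory", "up", "up"): {"utilization_gain": -0.05, "sla_penalty": 0.05},
    ("cpu", "down", "down"): {"utilization_gain": 0.10},
}


def adjust_weights(prev_weights, deviation, trend, table=STRATEGY_TABLE):
    """Add the looked-up adjustment amounts to the previous period's weights."""
    new_weights = dict(prev_weights)
    for resource in deviation:
        key = (resource, bucket(deviation[resource]), bucket(trend[resource]))
        for item, delta in table.get(key, {}).items():
            new_weights[item] += delta
    return new_weights


prev = {"utilization_gain": 0.4, "schedule_success": 0.3, "sla_penalty": 0.3}
deviation = {"cpu": 0.082, "memory": -0.01}   # per-resource prediction deviation
trend = {"cpu": 0.10, "memory": 0.0}          # predicted utilization change
new = adjust_weights(prev, deviation, trend)
print({k: round(v, 2) for k, v in new.items()})
```

In this run only the CPU entry matches, so the weight of the breach penalty rises while the weight of the utilization reward falls, shifting the optimization focus toward avoiding overload.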