A computing power node topology scheduling method and system of a heterogeneous GPU cluster
By constructing a hierarchical association graph and employing multi-agent negotiation, the method resolves the complexity and consistency issues of dynamic topology reconstruction in heterogeneous GPU clusters. This enables efficient, low-latency topology scheduling for thousand-GPU-scale clusters, ensuring online optimization and consistency of cluster performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- LIANCHENG TECHNOLOGY (SHENZHEN) CO LTD
- Filing Date
- 2026-03-24
- Publication Date
- 2026-06-19
AI Technical Summary
Existing heterogeneous GPU cluster scheduling technologies suffer from high decision complexity, high overhead, excessive decision latency, and poor distributed scheduling consistency when facing dynamic topology reconstruction in large-scale clusters, so that performance improvements are offset or link conflicts occur.
By constructing a hierarchical association graph and combining multi-agent revenue functions and distributed iterative negotiation, cross-rack link screening and strong connectivity constraint propagation at the node level are achieved. Atomic operation sequences are generated for topology reconstruction, reducing the state space and optimizing the decision-making process.
It achieves millisecond-level online decision-making capability in a thousand-GPU-scale cluster, ensuring consistency and performance improvement in topology reconstruction while controlling decision and execution overhead.
Figure CN122247864A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of computer science and distributed systems technology, and in particular relates to a method and system for scheduling computing node topology in a heterogeneous GPU cluster.
Background Technology
[0002] In the current field of heterogeneous GPU cluster scheduling technology, with the continuous growth in the scale of deep learning models, thousand-GPU and even ten-thousand-GPU clusters have become the mainstream infrastructure for large-scale model training and inference. The efficiency of distributed training is highly dependent on the degree of matching between the logical topology of the computing tasks and the underlying physical interconnect topology. However, in the actual operating environment, factors such as dynamic task arrival and departure, link congestion fluctuations, and heterogeneous node performance mean that static topology deployment strategies cannot continuously meet performance requirements. Therefore, it is necessary to introduce a dynamic logical topology reconstruction mechanism to adaptively adjust the placement location and communication mode of tasks.
[0003] Existing dynamic topology reconfiguration schemes face three major technical bottlenecks when dealing with large-scale clusters. First, logical topology reconfiguration is essentially a large-scale combinatorial optimization problem. In a cluster containing thousands of GPU nodes and hundreds of parallel tasks, it is necessary to simultaneously determine the placement of tasks on physical nodes and the selection of communication modes between nodes. Its search space can reach the order of 10^10, and traditional heuristic algorithms and integer programming methods cannot complete an effective solution within a millisecond-level online decision window. Second, the overhead of dynamic reconfiguration includes not only the deployment cost of the new topology but also the time loss of the decision computation itself. When the decision latency exceeds a certain percentage of the communication cycle, the performance improvement brought by reconfiguration will be completely offset by the decision overhead, causing the reconfiguration behavior to lose its positive benefits. Third, in a distributed scheduler architecture, different scheduling nodes may make independent topology reconfiguration decisions based on outdated or conflicting link state information, resulting in multiple tasks simultaneously selecting the same set of physical nodes to build a communication loop but with opposite data flow directions, causing link-level conflicts and communication deadlocks.
[0004] To address the aforementioned issues, existing technical solutions mostly employ centralized global optimization or simple distributed protocols. While centralized schedulers can guarantee global optimality, their solution time increases exponentially with cluster size, failing to meet online real-time requirements. Distributed allocation strategies based on consistent hashing or random sampling, while offering fast decision-making speeds, completely ignore physical topology connectivity and real-time link status, resulting in severely suboptimal topology performance after reconstruction. A few studies have addressed reconstruction overhead, primarily focusing on hardware-level configuration caching acceleration, failing to resolve the fundamental contradiction between state space explosion and distributed decision consistency at the level of decision algorithm complexity. Therefore, achieving low-latency, coordinated dynamic topology reconstruction in large-scale heterogeneous GPU clusters has become a critical technical challenge urgently needing breakthroughs in the field of distributed training scheduling.
Summary of the Invention
[0005] Therefore, it is necessary to provide a method and system for scheduling computing node topology in heterogeneous GPU clusters to address the aforementioned technical problems.
[0006] Firstly, this application provides a method for scheduling the computing node topology of a heterogeneous GPU cluster, including:
[0007] S1. By collecting static physical connection relationships and dynamic runtime state data of GPU clusters, a hierarchical association map of physical topology and runtime state is constructed, including cluster layer, rack layer and node layer. Among them, the cluster layer records the interconnection link attributes between different racks, the rack layer records the interconnection link attributes between different nodes within the same rack, and the node layer records the interconnection link attributes between different GPUs within a single node and the runtime state data of each GPU.
[0008] S2. Based on the communication mode characteristics of the task to be scheduled, and combined with the interconnection link attributes of each layer in the hierarchical association graph, cross-rack link screening is performed at the cluster layer to obtain a set of feasible rack combinations; based on the communication mode characteristics of the task to be scheduled, and combined with the node-level interconnection link attributes of each rack in the set of feasible rack combinations, strong connectivity constraint propagation at the node level is performed to obtain a set of candidate node clusters within each rack; wherein, strong connectivity constraint propagation at the node level is used to select node combinations that meet the preset strong connectivity conditions within the direct link domain between nodes, based on the constraint requirements of the task to be scheduled on the communication delay between nodes.
[0009] S3. Treat each task to be scheduled as an independent agent, construct a multi-agent reward function, and use the candidate node cluster set as the policy space of each agent; based on the multi-agent reward function and policy space, solve the Nash equilibrium policy profile through distributed iterative negotiation to obtain the logical topology deployment scheme of each task to be scheduled; wherein, the multi-agent reward function includes communication time overhead, topology synchronization overhead and decision delay penalty.
[0010] S4. Based on the difference comparison results between the logical topology deployment scheme of each task to be scheduled and the current running topology scheme, generate an atomic operation sequence; by evaluating the execution cost and expected benefit of each atomic operation in the atomic operation sequence, select the subset of atomic operations with positive net benefit for application to complete the topology reconstruction of the GPU cluster.
[0011] Secondly, this application also provides a computing node topology scheduling system for heterogeneous GPU clusters, used to implement the method described in the first aspect, the system comprising:
[0012] The hierarchical topology modeling module is used to construct a hierarchical association map of physical topology and runtime status, including cluster layer, rack layer and node layer, by collecting static physical connection relationship and dynamic runtime status data of GPU cluster. Among them, the cluster layer records the interconnection link attributes between different racks, the rack layer records the interconnection link attributes between different nodes within the same rack, and the node layer records the interconnection link attributes between different GPUs within a single node and the runtime status data of each GPU.
[0013] The multi-level topology filtering module is used to perform cross-rack link filtering at the cluster layer based on the communication mode characteristics of the task to be scheduled and the interconnection link attributes of each layer in the hierarchical association graph, to obtain a set of feasible rack combinations. Based on the communication mode characteristics of the task to be scheduled and the node-level interconnection link attributes of each rack in the set of feasible rack combinations, it performs node-level strong connectivity constraint propagation to obtain a set of candidate node clusters within each rack. Among them, the node-level strong connectivity constraint propagation is used to filter out node combinations that meet the preset strong connectivity conditions in the direct link domain between nodes within each rack in the set of feasible rack combinations based on the constraint requirements of the task to be scheduled on the communication delay between nodes.
[0014] The multi-agent policy optimization module is used to treat each task to be scheduled as an independent agent, construct a multi-agent reward function, and use the candidate node cluster set as the policy space of each agent. Based on the multi-agent reward function and policy space, the Nash equilibrium policy profile is solved through distributed iterative negotiation to obtain the logical topology deployment scheme of each task to be scheduled. The multi-agent reward function includes communication time overhead, topology synchronization overhead, and decision delay penalty.
[0015] The topology dynamic reconstruction module is used to generate an atomic operation sequence based on the difference comparison results between the logical topology deployment scheme of each task to be scheduled and the current running topology scheme. By evaluating the execution cost and expected benefit of each atomic operation in the atomic operation sequence, a subset of atomic operations with positive net benefit is selected for application to complete the topology reconstruction of the GPU cluster.
[0016] Thirdly, this application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement a computing node topology scheduling method for a heterogeneous GPU cluster as described in the first aspect.
[0017] Fourthly, this application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements a computing node topology scheduling method for a heterogeneous GPU cluster as described in the first aspect.
[0018] The aforementioned method and system for scheduling computing node topology in a heterogeneous GPU cluster construct a hierarchical association graph comprising a cluster layer, a rack layer, and a node layer, which transforms the original flat node space into a search space with physical hierarchical constraints. Based on the communication pattern characteristics of the tasks to be scheduled, cross-rack link screening is first performed at the cluster layer to eliminate rack combinations that do not meet bandwidth and reliability requirements. Then, through strong connectivity constraint propagation at the node layer, candidate node clusters that satisfy the strong connectivity conditions within the direct link domain between nodes are selected within each rack. This reduces the state space from an exponential combinatorial explosion to a polynomial-level feasible region. Furthermore, each task is treated as an independent agent with the set of candidate node clusters as its policy space, and a multi-agent reward function is constructed that includes communication time overhead, topology synchronization overhead, and a decision delay penalty. The Nash equilibrium policy profile is solved through distributed iterative negotiation, enabling agents to converge to a globally coordinated logical topology deployment scheme by relying only on local information exchange. This avoids the latency bottleneck of centralized optimization and the conflict trap of distributed decision-making. Finally, atomic operation sequences are generated through difference comparison, and the subset of atomic operations with positive net benefit is selected for incremental application. This achieves smooth topology reconstruction while keeping decision and execution overhead within the range justified by the expected benefit, thus realizing millisecond-level online decision-making in a thousand-GPU cluster and ensuring consistency of multi-task topology reconstruction in a distributed environment.
Attached Figure Description
[0019] To more clearly illustrate the technical solutions in the embodiments or related technologies of this application, the accompanying drawings used in the description of the embodiments or related technologies will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0020] Figure 1 is a flowchart illustrating a method for scheduling computing node topology in a heterogeneous GPU cluster, provided by the present invention;
[0021] Figure 2 is a schematic diagram illustrating the process of obtaining a set of feasible rack combinations in one optional embodiment of the present invention;
[0022] Figure 3 is a schematic diagram of the structure of a computing node topology scheduling system for a heterogeneous GPU cluster provided by the present invention.
Detailed Implementation
[0023] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0024] Referring to Figure 1, a flowchart of a method for scheduling computing node topology in a heterogeneous GPU cluster provided in this application is shown. The method includes the following steps:
[0025] S1. By collecting static physical connection relationships and dynamic runtime state data of GPU clusters, a hierarchical association map of physical topology and runtime state is constructed, including cluster layer, rack layer and node layer. Among them, the cluster layer records the interconnection link attributes between different racks, the rack layer records the interconnection link attributes between different nodes within the same rack, and the node layer records the interconnection link attributes between different GPUs within a single node and the runtime state data of each GPU.
[0026] Specifically, the purpose of this step is to construct a hierarchical relationship map that accurately reflects the physical structure and real-time operating status of the GPU cluster. This provides comprehensive and accurate basic data support for subsequent scheduling decisions, addressing the problem of fragmented and static topology information in existing technologies that leads to a disconnect between scheduling decisions and the actual cluster status. In practice, the collection of static physical connections uses a combination of offline pre-collection and periodic updates. During the offline pre-collection phase, the cluster management tool traverses the entire GPU cluster's hardware topology to obtain the physical connection relationships between racks, between racks and nodes, and between nodes and GPUs. This clarifies the hardware specifications of interconnect links at each level, covering the type, number, physical port number, and topology of interconnect links between different racks in the cluster layer; the interconnection method, maximum bandwidth, and distance between nodes within the same rack in the rack layer; the number, transmission rate, and latency parameters of interconnect links between different GPUs within a single server node in the node layer; and the connection channel attributes between GPUs and the motherboard CPU and memory. Periodic updates are performed at a preset cycle to synchronize changes in the cluster hardware topology and ensure the accuracy of the static data.
[0027] The collection of dynamic runtime status data adopts a real-time sampling mode. The sampling frequency is reasonably set according to the timeliness requirements of the data type. The specific collection content covers the runtime status data of each GPU in the node layer, including GPU core utilization, memory utilization, memory bandwidth utilization, GPU core temperature, power consumption, and real-time communication latency, bandwidth utilization, and packet loss rate of interconnect links between GPUs; real-time communication traffic, latency fluctuation, and congestion level of interconnect links between nodes within the same rack in the rack layer; and real-time transmission rate, latency jitter, and link load rate of cross-rack links in the cluster layer. Data collection adopts a distributed collection architecture, deploying a collection agent on each GPU node. CUDA Toolkit and nvidia-smi tools are used to obtain local GPU status data, and tcpdump and iftop tools are used to collect link communication data. The collection agent transmits real-time data to the data processing node through a message queue. The data processing node performs noise reduction, normalization, and abnormal data removal on the collected data, and associates the processed data with static physical connection relationship data.
[0028] The hierarchical association graph is constructed using a graph database. Vertices in the graph are categorized into four types: cluster vertices, rack vertices, server node vertices, and GPU vertices. Edges correspond to the interconnection links at each level, and edge attributes include static link attributes and dynamic runtime attributes. Hierarchical relationships between levels are expressed through parent-child associations, forming a four-level hierarchical structure of "cluster-rack-node-GPU". In addition, a dynamic attribute indexing mechanism is established in the graph for quickly querying real-time status data for a specific level, link, or GPU. Index keys include rack ID, node ID, GPU ID, and link ID, ensuring that the required topology and status information can be obtained quickly during subsequent scheduling decisions, thus resolving the problem of excessively high latency in topology information querying in existing technologies.
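As an illustration of the four-level structure and ID-keyed index described above, the following is a minimal in-memory sketch. It does not use a graph database; the class names `Vertex`, `Link`, and `HierarchicalGraph`, and attribute keys such as `max_bandwidth_gbps` and `latency_us`, are hypothetical names chosen for this example only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Link:
    link_id: str
    src: str
    dst: str
    static: dict                                  # static link attributes
    dynamic: dict = field(default_factory=dict)   # runtime attributes


@dataclass
class Vertex:
    vertex_id: str
    level: str                     # "cluster", "rack", "node" or "gpu"
    parent: Optional[str] = None   # parent-child association between levels
    children: list = field(default_factory=list)
    state: dict = field(default_factory=dict)     # runtime state (GPU level)


class HierarchicalGraph:
    """In-memory stand-in for the graph database (an assumption)."""

    def __init__(self):
        self.vertices = {}    # index by rack ID / node ID / GPU ID
        self.links = {}       # index by link ID
        self.adjacency = {}   # vertex ID -> set of link IDs

    def add_vertex(self, vertex_id, level, parent=None):
        self.vertices[vertex_id] = Vertex(vertex_id, level, parent)
        self.adjacency.setdefault(vertex_id, set())
        if parent is not None:
            self.vertices[parent].children.append(vertex_id)

    def add_link(self, link_id, src, dst, **static):
        self.links[link_id] = Link(link_id, src, dst, static)
        self.adjacency[src].add(link_id)
        self.adjacency[dst].add(link_id)

    def update_link(self, link_id, **dynamic):
        # Merge freshly sampled runtime data into the edge attributes.
        self.links[link_id].dynamic.update(dynamic)

    def links_of(self, vertex_id):
        # Constant-time lookup of all links attached to one vertex.
        return [self.links[l] for l in sorted(self.adjacency[vertex_id])]
```

Keeping static and dynamic attributes in separate dictionaries mirrors the split between offline-collected hardware data and real-time sampled data, so that periodic sampling only touches the `dynamic` part.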
[0029] S2. Based on the communication mode characteristics of the task to be scheduled and the interconnection link attributes of each layer in the hierarchical association graph, cross-rack link screening is performed at the cluster layer to obtain a feasible rack combination set. Based on the communication mode characteristics of the task to be scheduled and the node-level interconnection link attributes of each rack in the feasible rack combination set, node-level strong connectivity constraint propagation is performed to obtain a candidate node cluster set within each rack. Among them, node-level strong connectivity constraint propagation is used to select node combinations that meet the preset strong connectivity conditions in the direct link domain between nodes within each rack in the feasible rack combination set, based on the constraint requirements of the task to be scheduled on the communication delay between nodes.
[0030] Specifically, this step, based on the communication requirements of the task to be scheduled, progressively selects suitable hardware resource combinations from the cluster layer to the node layer. This addresses the problem in existing technologies where scheduling decisions neglect the matching degree between task communication patterns and physical topology, leading to low communication efficiency. Simultaneously, strong connectivity constraint propagation ensures that the selected node combinations meet the task's communication latency requirements. First, the communication pattern characteristics of the task to be scheduled need to be extracted through the task parsing module. The core extracted content includes the task's communication topology type, communication strength, communication latency threshold, communication frequency, and the task's basic requirements for GPU resources. The extraction of communication pattern characteristics can be achieved by parsing the task's configuration file.
[0031] The core of cross-rack link filtering at the cluster layer is to select rack combinations that can meet the cross-rack communication requirements of tasks. The implementation process is as follows: First, query the interconnection link attributes between all racks from the hierarchical association graph, including link bandwidth, real-time latency, and load rate. Then, set filtering conditions based on the communication mode characteristics of the tasks to be scheduled. The filtering process is implemented using a greedy algorithm, and the specific steps are as follows:
[0032] The first step is to initialize the candidate rack pool and the feasible rack combination set. The candidate rack pool is initialized to all racks in the cluster, and the feasible rack combination set is initialized to an empty set. At the same time, the iteration termination condition of the greedy algorithm is set: iteration stops when either the candidate rack pool is empty or a rack combination that meets the number of GPUs required for the task has been generated.
[0033] The second step is to individually screen each rack in the candidate rack pool and determine whether its own load and link connection foundation meet the preset conditions, namely, the number of currently idle GPUs in the rack is not less than the minimum allocation value of the number of GPUs required by the task, and the rack has an interconnection link with at least one other rack in the cluster that meets the task link requirements. All racks that meet the conditions are retained in the candidate rack pool, and racks that do not meet the conditions are removed.
[0034] The third step is to select the rack with the best overall link performance from the candidate rack pool as the initial rack. The overall link performance is quantified by the link performance evaluation function, with priority given to racks with high link bandwidth, low real-time latency, and low load rate. This initial rack is then included as the first element in the temporary rack combination.
[0035] The fourth step is to select, from the candidate rack pool, the racks that have a direct interconnect link to the initial rack and meet the task link attribute requirements, and to treat them as candidate expansion racks. A link connectivity score is calculated for each combination of a candidate expansion rack and the temporary rack combination; a higher score indicates higher communication efficiency between the racks in the combination. The score is a weighted sum of three core indicators (link bandwidth, real-time latency, and load rate) over all racks in the candidate expansion rack and temporary rack combination:

$$S = \alpha S_B + \beta S_D + \gamma S_L$$

where $S$ is the link connectivity score, with a value range of (0,1); the closer the score is to 1, the better the connectivity and the higher the communication efficiency. $\alpha$, $\beta$, and $\gamma$ are the weighting coefficients for link bandwidth, real-time latency, and load rate, respectively, satisfying $\alpha + \beta + \gamma = 1$. They can be adjusted dynamically according to the task communication mode: $\alpha$ (the link bandwidth weight) is raised for tasks with high communication intensity, and $\beta$ (the real-time latency weight) is raised for tasks that require low latency.

The standardized score for link bandwidth is $S_B = B_{act} / B_{max}$, where $B_{act}$ is the actual bandwidth of each interconnect link in the candidate expansion rack and temporary rack combination, and $B_{max}$ is the maximum bandwidth of the cross-rack links in the cluster. This standardizes the bandwidth index to the (0,1) range; the closer the actual bandwidth is to the maximum bandwidth, the higher the score.

The standardized score for real-time latency is $S_D = 1 - D_{act} / D_{th}$, where $D_{act}$ is the actual real-time latency of each interconnect link in the candidate expansion rack and temporary rack combination, and $D_{th}$ is the maximum allowable communication latency threshold for the task. The lower the actual latency, the higher the score; when the actual latency exceeds the threshold, $S_D = 0$.

The standardized score for link load rate is $S_L = 1 - L_{act} / L_{th}$, where $L_{act}$ is the actual load rate of each interconnect link in the candidate expansion rack and temporary rack combination, and $L_{th}$ is the maximum allowable load rate threshold for the link. The lower the actual load rate, the higher the score; when the actual load rate exceeds the threshold, $S_L = 0$.

After the link connectivity score has been calculated for each candidate expansion rack, the candidate expansion rack with the highest score is added to the temporary rack combination.
[0036] The fifth step is to determine whether the total number of idle GPUs in the current temporary rack combination meets the number of GPUs required by the task. If it does, the temporary rack combination is included in the set of feasible rack combinations, and the racks in the combination are removed from the candidate rack pool. The temporary rack combination is then re-initialized, and steps three and four are repeated. If it does not meet the requirements, step four is repeated until the total number of idle GPUs in the temporary rack combination meets the task requirements or there are no candidate expansion racks that meet the conditions. If the temporary rack combination meets the requirements at this point, it is included in the set of feasible rack combinations; otherwise, the temporary rack combination is discarded.
[0037] The sixth step is to deduplicate and optimize the set of feasible rack combinations, eliminating combinations with link conflicts and excessive rack load. Link conflicts refer to the interconnection links between racks within a combination being occupied by multiple rack combinations simultaneously and the link load rate exceeding a preset threshold. Excessive rack load refers to the total GPU resources required by the tasks already running in the rack and the tasks allocated to the current temporary combination exceeding the total GPU capacity of the rack. Finally, the set of feasible rack combinations is obtained.
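The greedy screening in steps one to five can be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: the weighted score assumes weights that sum to 1 and latency and load scores of the form 1 - actual/threshold clipped at 0; per-link scores are averaged over the links; a rack's "overall link performance" is taken as the mean score of its qualifying links; and an initial rack whose combination fails is dropped from the pool so the loop terminates. The function names and dictionary keys are hypothetical, and step six (conflict and overload elimination) is omitted.

```python
def connectivity_score(links, b_max, d_th, l_th, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of standardized bandwidth, latency and load scores.

    links: list of dicts with keys "bandwidth", "latency", "load".
    The default weights are illustrative and assumed to sum to 1.
    """
    w_b, w_d, w_l = weights
    n = len(links)
    s_b = sum(min(l["bandwidth"] / b_max, 1.0) for l in links) / n
    s_d = sum(max(0.0, 1.0 - l["latency"] / d_th) for l in links) / n
    s_l = sum(max(0.0, 1.0 - l["load"] / l_th) for l in links) / n
    return w_b * s_b + w_d * s_d + w_l * s_l


def greedy_rack_combinations(idle_gpus, link_score, gpus_needed, min_per_rack=1):
    """idle_gpus: {rack: idle GPU count}.
    link_score: {frozenset({a, b}): score} for cross-rack links that
    already meet the task link requirements.
    """
    def score(a, b):
        return link_score.get(frozenset((a, b)))

    def overall(rack):
        scores = [s for pair, s in link_score.items() if rack in pair]
        return sum(scores) / len(scores)

    # Step 2: keep racks with enough idle GPUs and a qualifying link.
    pool = {
        r for r, idle in idle_gpus.items()
        if idle >= min_per_rack and any(r in pair for pair in link_score)
    }
    feasible = []
    while pool:                                   # Step 1: stop when empty
        initial = max(sorted(pool), key=overall)  # Step 3: initial rack
        combo = [initial]
        while sum(idle_gpus[r] for r in combo) < gpus_needed:
            # Step 4: expansion racks must link directly to the initial rack.
            candidates = [r for r in sorted(pool)
                          if r not in combo and score(initial, r) is not None]
            if not candidates:
                break

            def combo_score(cand):
                vals = [score(cand, m) for m in combo
                        if score(cand, m) is not None]
                return sum(vals) / len(vals)

            combo.append(max(candidates, key=combo_score))
        # Step 5: accept the combination or discard it.
        if sum(idle_gpus[r] for r in combo) >= gpus_needed:
            feasible.append(combo)
            pool -= set(combo)
        else:
            pool.discard(initial)   # assumption: ensures termination
    return feasible
```

In use, `link_score` would be filled by calling `connectivity_score` on the live attributes of each cross-rack link read from the hierarchical association graph.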
[0038] Strong connectivity constraint propagation at the node layer involves selecting candidate node clusters that meet the task communication latency constraints within each rack in the feasible rack combination set. Its core principle is to ensure, through a constraint propagation algorithm, that all nodes within a node cluster can achieve low-latency communication via direct links, forming a strongly connected subgraph. In practice, the strong connectivity conditions are first set: there must be at least one direct link between any two nodes within a node cluster, and the real-time communication latency of this direct link must not exceed a preset proportion of the task communication latency threshold. The link bandwidth must not be less than the single-node communication bandwidth requirement of the task, and the GPU status of all nodes within the cluster must meet the basic task requirements. The constraint propagation process uses a breadth-first search algorithm. Starting with any node within the rack that meets the basic GPU requirements, it queries the node's direct connections and determines whether these direct connections meet the strong connectivity conditions. If they do, the node is added to a temporary node cluster, and the search continues, repeating this process until no new nodes meeting the conditions are found. If the number of nodes in a temporary node cluster meets the task's requirement for GPU nodes, then the temporary node cluster is included in the candidate node cluster set; otherwise, the temporary node cluster is discarded, and the initial node is replaced and the search is restarted. Meanwhile, to avoid resource conflicts between candidate node clusters, each node can only be included in one candidate node cluster. If multiple search processes select the same node simultaneously, the candidate node cluster with a larger size and better constraint conditions is prioritized.
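A minimal sketch of the breadth-first growth described above is given below. It assumes link data is supplied as dictionaries keyed by unordered node pairs, that `d_limit` is the preset proportion of the task latency threshold already multiplied out, and that nodes passed in have already passed the basic GPU check. Conflicts between clusters are resolved here on a first-come basis, which is a simplification of the size-and-quality priority rule in the text. All function and parameter names are hypothetical.

```python
from collections import deque


def grow_cluster(start, free, latency, bandwidth, d_limit, b_min, size):
    """Grow one cluster from `start`; return it, or None if too small."""
    def ok(a, b):
        # Direct link exists and meets the latency and bandwidth conditions.
        key = frozenset((a, b))
        return (key in latency and latency[key] <= d_limit
                and bandwidth.get(key, 0.0) >= b_min)

    cluster, seen, queue = [start], {start}, deque([start])
    while queue and len(cluster) < size:
        current = queue.popleft()
        for cand in sorted(free):
            if len(cluster) == size:
                break
            if cand in seen or not ok(current, cand):
                continue
            seen.add(cand)
            # Strong connectivity: a qualifying direct link to every member.
            if all(ok(cand, member) for member in cluster):
                cluster.append(cand)
                queue.append(cand)
    return cluster if len(cluster) == size else None


def candidate_clusters(eligible, latency, bandwidth, d_limit, b_min, size):
    """Try each eligible node as a start; each node joins at most one cluster."""
    free = set(eligible)
    clusters = []
    for start in sorted(eligible):
        if start not in free:
            continue
        found = grow_cluster(start, free, latency, bandwidth,
                             d_limit, b_min, size)
        if found:
            clusters.append(found)
            free -= set(found)
    return clusters
```

Checking each candidate against every current member, rather than only against the node being expanded, is what enforces the pairwise direct-link condition; plain breadth-first reachability alone would only guarantee a connected subgraph.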
[0039] S3. Treat each task to be scheduled as an independent agent, construct a multi-agent reward function, and use the candidate node cluster set as the policy space of each agent; based on the multi-agent reward function and policy space, solve the Nash equilibrium policy profile through distributed iterative negotiation to obtain the logical topology deployment scheme of each task to be scheduled; wherein, the multi-agent reward function includes communication time overhead, topology synchronization overhead and decision delay penalty.
[0040] Specifically, this step addresses the conflict problem in multi-task scheduling within large-scale clusters. Through multi-agent modeling and Nash equilibrium solving, it achieves global coordination and optimal performance of logical topology deployment schemes for each task, while controlling decision latency and topology synchronization overhead. This overcomes the bottlenecks of poor real-time performance in centralized scheduling and suboptimal performance in distributed scheduling in existing technologies. First, in the multi-agent modeling process, each task to be scheduled corresponds to an independent agent. Each agent has autonomous decision-making capabilities, and its decision objective is to select a node cluster from the candidate node cluster set as its logical topology deployment carrier to maximize its own benefits while avoiding resource conflicts with other agents. The policy space of each agent is the set of candidate node clusters obtained in step S2. Each policy corresponds to the selection of a candidate node cluster. The feasibility of a policy is determined by the matching degree between the resource capacity of the candidate node cluster and the task requirements. If the resource capacity of the candidate node cluster matches the resource requirements of the task, the policy is feasible; otherwise, it is infeasible. Infeasible policies are eliminated in advance, reducing the policy space size and improving decision-making efficiency.
[0041] The construction of the multi-agent reward function is the core of decision optimization. Its design principle is to maximize the task's operational efficiency while minimizing scheduling overhead. The specific expression is as follows: $R_i = \alpha E_i - \beta C_i - \gamma P_i$, where $R_i$ is the reward value of the i-th agent; $\alpha$, $\beta$ and $\gamma$ are weighting coefficients satisfying $\alpha + \beta + \gamma = 1$, which can be dynamically adjusted according to task priority (for high-priority tasks, the relevant coefficient takes a relatively larger value); $E_i$ is the inverse indicator of communication time overhead, namely communication efficiency, used to quantify how much the communication performance of the candidate node cluster improves task running efficiency; $C_i$ is the topology synchronization overhead term, used to quantify the various overheads incurred during topology deployment; and $P_i$ is the decision delay penalty term, used to constrain the time consumed by the decision-making process and prevent excessive decision delay from offsetting the benefits of topology reconfiguration.
[0042] The specific calculation process for each term is as follows. The communication efficiency term $E_i$ is calculated from the link attributes of the candidate node cluster and the task communication mode. For a task with a ring communication topology, $E_i$ is calculated from $\bar{d}$, the average communication latency of the direct links between all nodes within the node cluster, and $d_{\max}$, the maximum communication latency of those direct links. Latency data is obtained in real time from the hierarchical association graph. This calculation rewards lower communication latency and a more uniform latency distribution: the larger the value of $E_i$, the higher the corresponding benefit. For a task with a fully connected communication topology, the calculation formula is $E_i = \frac{n(n-1)}{\sum_{p \neq q} d_{pq}}$, where $d_{pq}$ is the communication delay between node p and node q, and $n$ is the number of nodes in the node cluster. The formula quantifies the communication efficiency in a fully connected communication mode as the reciprocal of the average communication latency between all nodes.
[0043] The topology synchronization overhead term $C_i$ mainly includes the node configuration overhead, data migration overhead, and link switching overhead incurred when deploying a new logical topology. The specific calculation formula is $C_i = T_{cfg} + T_{mig} + T_{sw}$. Here $T_{cfg}$ is the node configuration time, which is positively correlated with the number of nodes in the node cluster and is calculated as $T_{cfg} = n \cdot t_{node}$, where $n$ is the number of nodes within the node cluster and $t_{node}$ is the configuration time of a single node, determined by the node's hardware specifications and configuration complexity. $T_{mig}$ is the data migration time, calculated as $T_{mig} = D / \bar{B}$, where $D$ is the initial data volume of the task and $\bar{B}$ is the average bandwidth of the links within the node cluster. $T_{sw}$ is the link switching time, which is related to the link type and determined by the link hardware characteristics.
[0044] The decision delay penalty term $P_i$ is calculated as follows:
[0045]
[0046] where $T_{dec}$ is the time taken by the agent from acquiring the set of candidate node clusters to making a decision, measured in real time by a timing module; $T_{th}$ is the preset decision delay threshold, set according to the real-time requirements of cluster scheduling; and $\lambda$ is the penalty coefficient, used to adjust the penalty intensity when the decision delay exceeds the threshold, ensuring that excessive decision delay does not offset the positive benefits of topology reconstruction.
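The three reward terms described above can be illustrated with a minimal Python sketch. It is not the patented formula. The hinge form of the delay penalty, the weighted-sum combination, and all function and parameter names are assumptions introduced for illustration, and the terms are assumed to be normalized to comparable scales before weighting.

```python
def full_mesh_efficiency(latency):
    """Reciprocal of the mean pairwise latency for a fully connected pattern.

    `latency` maps unordered node pairs to a delay in seconds (hypothetical layout).
    """
    delays = list(latency.values())
    return len(delays) / sum(delays)


def sync_overhead(n_nodes, t_node, data_volume, avg_bandwidth, t_switch):
    """Node configuration time + data migration time + link switching time."""
    t_cfg = n_nodes * t_node               # grows with the number of nodes
    t_mig = data_volume / avg_bandwidth    # initial data volume / average link bandwidth
    return t_cfg + t_mig + t_switch


def delay_penalty(t_dec, t_th, lam):
    """ASSUMPTION: penalty is zero below the threshold and linear above it."""
    return lam * max(0.0, t_dec - t_th)


def reward(efficiency, overhead, penalty, alpha, beta, gamma):
    """ASSUMPTION: weighted combination with alpha + beta + gamma = 1."""
    return alpha * efficiency - beta * overhead - gamma * penalty


# Hypothetical three-node cluster with pairwise delays in seconds.
lat = {("g0", "g1"): 0.002, ("g0", "g2"): 0.004, ("g1", "g2"): 0.006}
r = reward(full_mesh_efficiency(lat),
           sync_overhead(4, 0.5, 8.0, 4.0, 0.1),
           delay_penalty(0.012, 0.010, 100.0),
           alpha=0.6, beta=0.3, gamma=0.1)
```

Keeping each term in its own function makes it easy to swap in a different penalty shape or a ring-topology efficiency term without touching the combination step.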
[0047] The distributed iterative negotiation process for solving the Nash equilibrium policy profile is implemented using a federated reinforcement learning framework. No centralized scheduling node is required; agents achieve global coordination through local information exchange. The core principle is to optimize decisions through policy iteration and reward feedback. First, the policy update function and the reward/loss calculation model of each agent are defined. The policy update formula for agent i is $\pi_i^{t+1}(a) = \pi_i^{t}(a) + \eta \, \nabla_{\pi_i^{t}} J_i^{t}$, where $\pi_i^{t+1}(a)$ is the probability that agent i selects action $a$ (that is, a particular candidate node cluster) in the (t+1)-th iteration; $\pi_i^{t}(a)$ is its policy probability in the t-th iteration; $\eta$ is the learning rate, which controls the step size of policy updates to avoid convergence oscillations caused by excessively fast updates, takes values in (0,1), and is dynamically adjusted according to the real-time requirements of cluster scheduling; and $\nabla_{\pi_i^{t}} J_i^{t}$ is the gradient of the return function of agent i in the t-th iteration with respect to the policy $\pi_i^{t}$, which guides the policy update in the direction of maximizing returns. $J_i^{t}$ is the expected return of agent i in the t-th iteration, calculated as $J_i^{t} = \mathbb{E}_{\pi^{t}}[R_i]$, where $\mathbb{E}_{\pi^{t}}$ is the expectation operator under the policies of the t-th iteration and $R_i$ is the reward value of agent i, calculated from the multi-agent reward function in step S3. The specific steps are as follows:
[0048] In the first step, each agent initializes its own policy by randomly selecting a feasible candidate node cluster as its initial policy $\pi_i^{0}$. The initialization assigns equal probability to every feasible policy and probability 0 to every infeasible policy. Each agent simultaneously calculates its own initial payoff value $R_i^{0}$ and broadcasts its policy $\pi_i^{0}$ and payoff $R_i^{0}$ to the other agents via a distributed messaging protocol.
[0049] In the second step, each agent receives policy information from the other agents and determines whether a resource conflict exists (i.e., two or more agents select the same candidate node cluster). If a conflict exists, the loss of revenue in the conflict scenario is calculated as $L_i^{t} = R_i^{t} - \tilde{R}_i^{t}$, where $L_i^{t}$ is the revenue loss of agent i in the t-th iteration, $R_i^{t}$ is the reward value of agent i in the absence of conflict, and $\tilde{R}_i^{t}$ is the actual reward of agent i under conflict. $\tilde{R}_i^{t}$ is obtained by correcting $R_i^{t}$ according to the link congestion level of the conflicting node cluster. The correction uses the conflict impact coefficient $\kappa$, which has a value range of (0,1) and is larger the more severe the conflict, and the degree of conflict $c_i = m_i / N$, where $m_i$ is the number of agents that conflict with agent i and $N$ is the total number of agents participating in the scheduling. After calculating the loss, each agent adjusts its own policy on the principle of maximizing reward and updates the policy using the policy update formula. This reduces the probability of selecting conflicting policies and increases the probability of selecting candidate node clusters that are unoccupied and have the next-best reward value.
[0050] In the third step, after updating its policy, each agent recalculates its expected return $J_i^{t+1}$, compares it with the actual payoff value $R_i^{t+1}$, rebroadcasts its policy and payoff information, and repeats the conflict assessment, loss calculation, and policy adjustment of the second step.
[0051] The fourth step is to determine whether a Nash equilibrium has been reached. The condition for Nash equilibrium is $J_i(\pi_i^{t}, \pi_{-i}^{t}) \ge J_i(\pi_i', \pi_{-i}^{t})$ for every $\pi_i' \in \Pi_i$ and every agent i, where $\Pi_i$ is the set of all feasible policies of agent i, $\pi_{-i}^{t}$ is the combination of policies of all agents except agent i at the t-th iteration, and $(\pi_i', \pi_{-i}^{t})$ is the policy combination obtained when agent i replaces $\pi_i^{t}$ with $\pi_i'$. When this condition holds, the policy choices of all agents no longer change, and no agent can improve its expected return by adjusting its policy alone; that is, a Nash equilibrium is reached. At this point, the policy combination of all agents, $\pi^{*} = (\pi_1^{t}, \dots, \pi_N^{t})$, is the Nash equilibrium policy profile, and the corresponding candidate node cluster selection is the logical topology deployment scheme for each task to be scheduled.
[0052] The entire iteration process terminates when the number of iterations reaches a preset upper limit or the expected return fluctuation of each agent is less than a preset threshold. The expected return fluctuation $\Delta J_i^{t}$ is the change in the expected return of agent i between two successive iterations. When $\Delta J_i^{t} < \varepsilon$ ($\varepsilon$ being the preset fluctuation threshold), the returns are considered stable and the iteration can be terminated early, ensuring that the decision-making process meets the requirements of online real-time scheduling.
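The negotiation loop above can be illustrated with a deliberately simplified Python sketch. It swaps the probabilistic policy-gradient update for plain best-response dynamics over deterministic choices, starts each agent on its individually best cluster rather than a random one, and assumes a multiplicative conflict correction $R \cdot (1 - \kappa \cdot m / N)$. All names, the correction form, and the sample rewards are assumptions for illustration only.

```python
def conflict_adjusted(base, cluster, others, kappa, n_agents):
    """ASSUMPTION: reward shrinks with the share of agents on the same cluster."""
    m = sum(1 for c in others if c == cluster)   # agents conflicting on `cluster`
    return base * (1.0 - kappa * m / n_agents)


def payoffs(base_reward, choice, i, kappa):
    """Payoff of every candidate cluster for agent i, given the others' choices."""
    others = choice[:i] + choice[i + 1:]
    return [conflict_adjusted(b, a, others, kappa, len(choice))
            for a, b in enumerate(base_reward[i])]


def negotiate(base_reward, kappa=0.9, max_iter=50):
    """Best-response iteration; stops when no agent changes or at max_iter."""
    choice = [max(range(len(r)), key=r.__getitem__) for r in base_reward]
    for _ in range(max_iter):
        changed = False
        for i in range(len(choice)):
            p = payoffs(base_reward, choice, i, kappa)
            best = max(range(len(p)), key=p.__getitem__)
            if p[best] > p[choice[i]]:           # strictly better response found
                choice[i] = best
                changed = True
        if not changed:
            break
    return choice


def is_nash(base_reward, choice, kappa=0.9):
    """No agent can gain by unilaterally switching clusters."""
    return all(payoffs(base_reward, choice, i, kappa)[choice[i]]
               >= max(payoffs(base_reward, choice, i, kappa))
               for i in range(len(choice)))


# Two agents that both prefer cluster 0 (hypothetical conflict-free rewards).
base = [[1.0, 0.8], [1.0, 0.5]]
profile = negotiate(base)   # agent 0 yields to cluster 1, agent 1 keeps cluster 0
```

Best-response dynamics are not guaranteed to converge for arbitrary reward tables, which is why the sketch keeps the iteration cap that the text also prescribes.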
[0053] S4. Based on the difference comparison results between the logical topology deployment scheme of each task to be scheduled and the current running topology scheme, generate an atomic operation sequence; by evaluating the execution cost and expected benefit of each atomic operation in the atomic operation sequence, select the subset of atomic operations with positive net benefit for application to complete the topology reconstruction of the GPU cluster.
[0054] Specifically, this step achieves efficient and low-overhead execution of topology reconfiguration. It filters out the minimum atomic operation sequence through differential comparison and avoids invalid reconfiguration operations through benefit evaluation, solving the problems of excessive overhead and link conflicts in existing technologies and ensuring that the reconfiguration behavior brings positive performance benefits. First, a differential comparison is performed between the logical topology deployment scheme and the currently running topology scheme, implemented using a topology graph matching algorithm. Both topology schemes are transformed into graph structures, with nodes representing GPU nodes and edges representing communication links. The differences between the two graph structures are compared using a graph isomorphism algorithm, the core of which is to quantify the degree of topology difference through graph similarity calculation. The key formulas and specific processes are as follows. Define the currently running topology graph as $G_c = (V_c, E_c)$ and the topology graph corresponding to the logical topology deployment scheme as $G_n = (V_n, E_n)$, where $V_c$ and $V_n$ are the node sets (GPU nodes) of the two graphs, and $E_c$ and $E_n$ are their edge sets (communication links). The similarity between the two graphs is calculated as $S = w_V S_V + w_E S_E$, where $S$ is the similarity between the two topology graphs, with a value range of (0,1): the closer $S$ is to 0, the greater the difference between the two graphs, and the closer $S$ is to 1, the more similar they are. $w_V$ and $w_E$ are the weight coefficients for node similarity and edge similarity, respectively, satisfying $w_V + w_E = 1$. They can be dynamically adjusted according to the topology scheduling priorities: when priority is given to node allocation, $w_V$ takes a larger value; when priority is given to communication links, $w_E$ takes a larger value. The node similarity is calculated as $S_V = N_{shared} / N_{total}$, where $N_{shared}$ is the number of nodes shared by the two graphs and $N_{total}$ is the total number of nodes in both graphs; node similarity measures the degree of overlap between the node sets of the two graphs. The edge similarity is calculated as $S_E = M_{shared} / M_{total}$, where $M_{shared}$ is the number of edges shared by the two graphs and $M_{total}$ is the total number of edges in both graphs; edge similarity quantifies the degree of overlap in communication links between the two graphs. When $S < S_{th}$ ($S_{th}$ being the preset similarity threshold), it is determined that there are significant differences between the two graphs, and the specific types of differences are further extracted, including newly added node clusters (node clusters present in $G_n$ but not in $G_c$), deleted node clusters (node clusters present in $G_c$ but not in $G_n$), node cluster updates (changes in the composition or number of nodes within a cluster), and communication link adjustments (differences between the edges in $E_c$ and $E_n$). During the differential comparison, the focus is on the resource usage of node clusters and the connection status of links, so that key differences are not missed while irrelevant differences are filtered out.
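The similarity calculation and difference extraction can be sketched in Python as follows. The sketch assumes that the "total number" of nodes or edges in both graphs means the size of the union of the two sets (a Jaccard-style ratio), and that links are undirected. The function names and the sample topology are hypothetical.

```python
def overlap_ratio(a, b):
    """Shared elements divided by all distinct elements (ASSUMPTION: union size)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def undirected(edges):
    """Normalize edges so that (u, v) and (v, u) compare equal."""
    return {frozenset(e) for e in edges}


def topology_similarity(cur_nodes, cur_edges, new_nodes, new_edges,
                        w_node=0.5, w_edge=0.5):
    """Weighted sum of node similarity and edge similarity, w_node + w_edge = 1."""
    s_v = overlap_ratio(set(cur_nodes), set(new_nodes))
    s_e = overlap_ratio(undirected(cur_edges), undirected(new_edges))
    return w_node * s_v + w_edge * s_e


def topology_diff(cur_nodes, cur_edges, new_nodes, new_edges):
    """Elements to add and to remove when moving from the current to the new graph."""
    cur_e, new_e = undirected(cur_edges), undirected(new_edges)
    return {
        "added_nodes": set(new_nodes) - set(cur_nodes),
        "removed_nodes": set(cur_nodes) - set(new_nodes),
        "added_edges": new_e - cur_e,
        "removed_edges": cur_e - new_e,
    }


# Hypothetical topologies: g0 leaves, g3 joins.
cur_v, cur_e = {"g0", "g1", "g2"}, [("g0", "g1"), ("g1", "g2")]
new_v, new_e = {"g1", "g2", "g3"}, [("g2", "g1"), ("g2", "g3")]
s = topology_similarity(cur_v, cur_e, new_v, new_e)
d = topology_diff(cur_v, cur_e, new_v, new_e)
```

The set differences returned by `topology_diff` map directly onto the difference types listed above and are the natural input for generating atomic operations.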
[0055] The generation of atomic operation sequences involves transforming the differences obtained from differential comparisons into a series of indivisible minimum operation units. Each atomic operation corresponds to a topology adjustment behavior, ensuring the atomicity of operation execution and avoiding issues such as topology inconsistencies and link conflicts during execution. Types of atomic operations include node cluster creation, node cluster deletion, node addition, node removal, and link switching. The order of atomic operation sequences follows the principle of "create before delete, prepare before execute," and dependency constraints are added to the atomic operation sequences to ensure executability.
[0056] The evaluation of execution cost and expected benefit is the core of selecting effective atomic operations. The net benefit of each atomic operation is calculated as $N_k = B_k - C_k$, where $N_k$ is the net benefit of the k-th atomic operation, used to determine whether the operation has positive value: when $N_k > 0$, the atomic operation yields a positive benefit and can be included in the execution subset; when $N_k \le 0$, the atomic operation does not yield a positive benefit and is eliminated. $B_k$ is the expected benefit of the atomic operation, the quantified value of the performance improvement brought about by topology reconstruction, covering reduced communication latency, improved computing efficiency, and improved resource utilization. $C_k$ is the execution cost of the atomic operation, the quantified value of the losses generated during execution, including time loss, resource consumption loss, and interference loss to existing tasks. Only atomic operations with a positive net benefit are included in the execution subset, ensuring that topology reconstruction is beneficial overall.
[0057] The evaluation of execution cost includes time cost, resource cost, and interference cost. Time cost refers to the execution time of an atomic operation. The time cost of a node cluster creation operation is calculated as $T_{create} = T_{node} + T_{link}$, where $T_{node}$ is the node configuration time and $T_{link}$ is the link configuration time. The time cost of a node joining operation is calculated with a formula that includes the link synchronization time $T_{sync}$. Resource cost refers to the cluster resources consumed during operation execution, quantified by the resource consumption ratio. Interference cost is the impact of operation execution on currently running tasks, quantified by the task performance fluctuation value. For example, suppose that before a node removal atomic operation is executed, the average communication latency of a currently running task is $d_0$, and that during the operation the communication latency of this task briefly increases to $d_1$ due to link re-adaptation. The task performance fluctuation value is then quantified from the increase of $d_1$ over $d_0$. The larger the fluctuation value, the greater the interference of the operation with the current task, and the higher the corresponding interference cost. This quantification directly reflects the degree of impact of atomic operations on existing tasks.
[0058] The assessment of expected benefit is primarily based on the performance improvements after topology reconfiguration, including reductions in communication latency, improvements in computational efficiency, and improvements in resource utilization. The expected benefit of a node cluster creation operation is calculated from $\Delta d$, the reduction in communication latency of the task after reconstruction; $f$, the task communication frequency; and $w$, the task priority weight. The improvement in resource utilization is calculated from $\Delta U$, the improvement in resource utilization of the node cluster after reconstruction; $U_{before}$, the resource utilization of the node cluster before reconstruction; and $R_{total}$, the total resources of the node cluster.
[0059] In the specific evaluation process of expected returns, firstly, an evaluation metric threshold is set for each atomic operation. Then, the execution cost and expected return of each atomic operation are calculated. Atomic operations that meet the threshold requirements and have a positive net return are selected to form a subset of atomic operations. For atomic operations with dependencies, if the net return of the parent operation is negative, the child operation will also be removed to avoid the execution of invalid operations. The execution of the subset of atomic operations adopts a combination of parallel and serial execution. For atomic operations without dependencies, parallel execution is used to improve the efficiency of topology reconstruction; for atomic operations with dependencies, serial execution is used to strictly follow the dependency order and ensure the safety of operation execution. After execution, the topology and runtime state data of the cluster are updated in real time through a hierarchical association graph to complete the entire topology reconstruction process. At the same time, the execution results of each atomic operation are recorded for subsequent optimization and adjustment of the weight coefficients of the return function, further improving the accuracy and efficiency of scheduling decisions.
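The selection and scheduling rules above (keep only operations with positive net benefit, drop children of rejected parents, run independent operations in parallel and dependent ones in order) can be sketched with Python's standard `graphlib` module. The operation names, benefit and cost values, and the dictionary layout are hypothetical, and every dependency is assumed to refer to an operation that is itself listed in `ops`.

```python
from graphlib import TopologicalSorter


def select_operations(ops, deps):
    """Keep operations whose net benefit is positive and whose parents were kept.

    ops:  name -> (expected_benefit, execution_cost)
    deps: name -> set of parent operation names (must all be keys of ops)
    """
    graph = {name: deps.get(name, set()) for name in ops}
    kept = set()
    # static_order() yields parents before children.
    for name in TopologicalSorter(graph).static_order():
        benefit, cost = ops[name]
        if benefit - cost > 0 and deps.get(name, set()) <= kept:
            kept.add(name)
    return kept


def execution_batches(kept, deps):
    """Group kept operations into batches; each batch can run in parallel."""
    sorter = TopologicalSorter({n: deps.get(n, set()) & kept for n in kept})
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # all dependencies already done
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical atomic operations: (expected benefit, execution cost).
ops = {
    "create_cluster_c2": (5.0, 2.0),
    "join_node_g7": (3.0, 1.0),
    "switch_link_l3": (1.0, 4.0),    # negative net benefit
    "delete_cluster_c1": (2.0, 0.5),
}
deps = {
    "join_node_g7": {"create_cluster_c2"},
    "delete_cluster_c1": {"switch_link_l3"},   # parent is rejected
}
kept = select_operations(ops, deps)
batches = execution_batches(kept, deps)
```

In this example `delete_cluster_c1` is dropped even though its own net benefit is positive, because its parent `switch_link_l3` is rejected.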
[0060] The aforementioned method for scheduling computing node topology in a heterogeneous GPU cluster constructs a hierarchical association graph comprising a cluster layer, a rack layer, and a node layer. This transforms the original flat node space into a search space with physical hierarchical constraints. Based on the communication pattern characteristics of the tasks to be scheduled, cross-rack link screening is first performed at the cluster layer to eliminate rack combinations that do not meet bandwidth and reliability requirements. Then, through strong connectivity constraint propagation at the node layer, candidate node clusters that satisfy the strong connectivity conditions within the direct link domain between nodes are selected within each rack. This reduces the state space from an exponential combinatorial explosion to a polynomial-level feasible region. Furthermore, each task is treated as an independent agent, with the candidate node cluster set serving as the policy space. A multi-agent reward function is constructed, incorporating communication time overhead, topology synchronization overhead, and decision delay penalties. The Nash equilibrium policy profile is solved through distributed iterative negotiation, enabling agents to converge to a globally coordinated logical topology deployment scheme relying solely on local information exchange. This avoids the latency bottleneck of centralized optimization and the conflict trap of distributed decision-making. Finally, atomic operation sequences are generated through differential comparison, and a subset of atomic operations with positive net benefits is selected for incremental application. This achieves smooth topology reconstruction while keeping decision and execution overhead within acceptable benefit limits, thus realizing millisecond-level online decision-making in a thousand-GPU cluster and ensuring consistency in multi-task topology reconstruction in a distributed environment.
[0061] Referring to Figure 2, in one optional embodiment, based on the communication mode characteristics of the task to be scheduled and combined with the interconnection link attributes of each layer in the hierarchical association graph, cross-rack link filtering is performed at the cluster layer to obtain a feasible rack combination set, including the following steps:
[0062] S11. Based on the synchronization mode characteristics and communication granularity characteristics of the task to be scheduled, query the corresponding bandwidth requirement threshold from the preset mapping table; based on the bandwidth requirement threshold, combined with the real-time available bandwidth parameters between racks recorded in the cluster layer of the hierarchical association graph, remove rack pairs whose real-time available bandwidth parameters are lower than the bandwidth requirement threshold, and retain rack pairs whose real-time available bandwidth parameters are higher than or equal to the bandwidth requirement threshold, to obtain a preliminary set of feasible rack pairs.
[0063] Specifically, the synchronization mode characteristics and communication granularity characteristics of the tasks to be scheduled are obtained by parsing the task configuration file. The synchronization mode characteristic characterizes the data synchronization mechanism between the nodes of the task and is either synchronous or asynchronous. Synchronous mode requires all nodes to complete their data calculations before exchanging data, placing higher demands on link bandwidth and timeliness. Asynchronous mode allows nodes to calculate independently and report back asynchronously, with relatively lower bandwidth requirements. The communication granularity characteristic is obtained by quantifying the data block size of a single communication session, determined by parameters such as the task's model batch size and feature dimensions. The larger the data block, the higher the demand for cross-rack link bandwidth. A preset mapping table, built from a large amount of task test data, establishes a one-to-one correspondence between these two characteristics and bandwidth requirement thresholds, allowing direct lookup of the minimum link bandwidth required for normal task communication.
[0064] The real-time available bandwidth between racks recorded in the cluster layer of the hierarchical association graph refers to the bandwidth currently available for transmitting task data on the interconnection links between rack pairs. It is obtained by subtracting the bandwidth already occupied by other tasks from the total link bandwidth and can be retrieved through real-time indexing of the graph. During filtering, only rack pairs whose real-time available bandwidth is not lower than the queried bandwidth requirement threshold are retained, while rack pairs with bandwidth below the threshold are removed. This results in a preliminary set of feasible rack pairs, denoted as $P_1 = \{(r_a, r_b) \mid B_{ab} \ge B_{th}\}$, where $(r_a, r_b)$ denotes the rack pair formed by rack $r_a$ and rack $r_b$, $B_{ab}$ is the real-time available bandwidth parameter between the rack pair, and $B_{th}$ is the bandwidth requirement threshold obtained from the query, that is, the minimum link bandwidth required for normal communication of the task.
[0065] S12. Based on the real-time link retransmission rate parameter of each rack pair in the preliminary feasible rack pair set, remove from that set the rack pairs whose real-time link retransmission rate parameter exceeds the preset retransmission rate threshold, and update the preliminary feasible rack pair set.
[0066] Specifically, the real-time link retransmission rate parameter reflects the communication stability of cross-rack links. An excessively high retransmission rate leads to increased task communication latency and decreased data transmission reliability. Therefore, a secondary screening of the preliminary feasible rack pair set is necessary. The real-time link retransmission rate parameter, denoted as $\rho_{ab}$, is defined as the ratio of the number of data retransmissions between the rack pair per unit time to the total number of transmissions, and is calculated as $\rho_{ab} = N_{re} / N_{tx}$, where $N_{re}$ is the number of data retransmissions per unit time and $N_{tx}$ is the total number of data transmissions per unit time. Both are obtained in real time from the cluster layer of the hierarchical association graph.
[0067] The preset retransmission rate threshold, denoted as $\rho_{th}$, is set based on the task's requirements for communication reliability and serves as the minimum standard for link communication stability. The filtering logic is as follows: if $\rho_{ab} > \rho_{th}$ for a rack pair, the stability of its link communication cannot meet the task requirements, and the pair is removed from the preliminary feasible rack pair set $P_1$; if $\rho_{ab} \le \rho_{th}$, the rack pair is retained. After the screening is complete, an updated preliminary feasible rack pair set is obtained, denoted as $P_2 = \{(r_a, r_b) \in P_1 \mid \rho_{ab} \le \rho_{th}\}$, where $(r_a, r_b)$ denotes the rack pair formed by rack $r_a$ and rack $r_b$, and $\rho_{ab}$ is the real-time link retransmission rate parameter of that rack pair.
[0068] S13. Based on the number of GPUs required for the task to be scheduled, and combined with the number of GPUs in the same node direct link domain recorded in the rack layer of the hierarchical association graph, traverse each rack pair in the updated preliminary feasible rack pair set, determine whether the sum of the number of GPUs in the node direct link domain of the two racks in the rack pair is greater than or equal to the number of GPUs required for the task to be scheduled, retain the rack pairs that meet the quantity condition, and generate a feasible rack combination set.
[0069] Specifically, the number of GPUs required for the task to be scheduled is denoted as $G_{req}$. It is determined by the task's model size and parallel computing requirements, and is the core quantity constraint for rack combination selection. A direct link domain refers to a set of GPU nodes within a rack that achieve low-latency communication via direct links. GPU nodes within the same link domain have low communication latency and high communication efficiency, meeting the task's communication latency requirements. The rack layer of the hierarchical association graph records the number of GPUs in this link domain within each rack, denoted as $G_k$, where $r_k$ represents the k-th rack and $G_k$ is the number of GPUs within the same node direct link domain in that rack.
[0070] Traverse each rack pair $(r_a, r_b)$ in the updated preliminary feasible rack pair set $P_2$ and calculate the total number of GPUs in the link domains of the rack pair as $G_{ab} = G_a + G_b$. The filtering logic is as follows: if $G_{ab} \ge G_{req}$, the rack pair can provide a sufficient number of low-latency communication GPU nodes to meet the hardware resource requirements of the task, and is therefore retained; if $G_{ab} < G_{req}$, the rack pair cannot provide the number of GPUs required for the task, and is removed.
[0071] All rack pairs that meet the quantity requirement together form the feasible rack combination set, denoted as $P_f$. Each rack pair in this set meets the task's bandwidth, communication stability, and GPU quantity requirements, providing a feasible range of racks for selecting candidate node clusters at the subsequent node layer and ensuring that the selected node clusters can meet the task's communication and resource requirements.
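Steps S11 to S13 amount to three successive filters over rack pairs. A minimal Python sketch, with a hypothetical `RackLink` record standing in for the data held in the cluster and rack layers of the hierarchical association graph, and with thresholds supplied by the caller:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RackLink:
    """Hypothetical record for one cross-rack link."""
    rack_a: str
    rack_b: str
    available_bw: float      # real-time available bandwidth, e.g. Gbit/s
    retransmissions: int     # retransmissions per unit time
    transmissions: int       # total transmissions per unit time


def feasible_rack_pairs(links, domain_gpus, bw_threshold,
                        retrans_threshold, gpus_required):
    """Apply the bandwidth, retransmission-rate, and GPU-count filters in order."""
    feasible = []
    for link in links:
        # S11: available bandwidth must not be below the requirement threshold.
        if link.available_bw < bw_threshold:
            continue
        # S12: retransmission rate = retransmissions / total transmissions.
        rate = (link.retransmissions / link.transmissions
                if link.transmissions else 0.0)
        if rate > retrans_threshold:
            continue
        # S13: direct-link-domain GPUs of both racks must cover the demand.
        if domain_gpus[link.rack_a] + domain_gpus[link.rack_b] < gpus_required:
            continue
        feasible.append((link.rack_a, link.rack_b))
    return feasible


# Hypothetical cluster state.
links = [
    RackLink("r1", "r2", 100.0, 5, 1000),
    RackLink("r1", "r3", 40.0, 1, 1000),     # too little bandwidth
    RackLink("r2", "r3", 120.0, 50, 1000),   # retransmission rate too high
    RackLink("r3", "r4", 200.0, 1, 1000),    # too few GPUs
]
domain_gpus = {"r1": 8, "r2": 8, "r3": 4, "r4": 4}
pairs = feasible_rack_pairs(links, domain_gpus, bw_threshold=80.0,
                            retrans_threshold=0.01, gpus_required=16)
```

Running the cheapest check first and short-circuiting keeps the per-pair cost low, which matters when the number of rack pairs grows quadratically with the number of racks.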
[0072] In one optional embodiment, based on the communication mode characteristics of the task to be scheduled and combined with the node-level interconnection link attributes of each rack in the feasible rack combination set, strong connectivity constraint propagation at the node level is performed to obtain a set of candidate node clusters within each rack, including the following steps:
[0073] S21. Based on the communication frequency characteristics of the task to be scheduled, query the corresponding maximum allowable communication delay threshold from the preset frequency-delay mapping table.
[0074] Specifically, the communication frequency characteristics of the tasks to be scheduled are obtained by parsing the task configuration file, representing the number of communications between nodes per unit time. The higher the communication frequency, the stricter the requirements for inter-node communication latency. A preset frequency-latency mapping table is built based on cluster link performance test data. Using communication frequency as an index, it stores the maximum allowable communication latency threshold corresponding to different communication frequencies. This threshold represents the upper limit of inter-node communication latency that the task can withstand during normal operation, ensuring both inter-node communication efficiency and task parallel execution performance. By matching the obtained task communication frequency characteristics with the mapping table index, the corresponding maximum allowable communication latency threshold can be directly queried.
[0075] S22. Traverse each rack in the feasible rack combination set. Based on the static communication delay baseline between nodes recorded in the rack layer of the hierarchical association graph, select from the node set of the current rack a subset of nodes in which the static communication delay between any two nodes is lower than the maximum allowable communication delay threshold. Use the node subset as the preliminary candidate node set for the corresponding rack.
[0076] Specifically, the static communication delay baseline recorded in the rack layer of the hierarchical association graph refers to the basic communication delay between nodes within the rack when there is no link congestion. It is determined by the node hardware specifications, link type, and physical distance. It is pre-calibrated through offline testing, stored in the graph, and can be retrieved in real time through indexing. Each rack in the feasible rack combination set is traversed, and all nodes within that rack are extracted to form a node set. The static communication delay baseline between any two nodes in the set is then checked against the maximum allowable communication delay threshold obtained in step S21. If the static communication delay baseline between a node and every other node in the set meets the requirement, the node is included in the node subset; if the baseline between the node and any other node exceeds the threshold, that node is removed. Finally, the nodes within each rack that meet the conditions form the preliminary candidate node set for that rack.
[0077] S23. Based on the real-time utilization and memory usage data of each GPU recorded in the node layer of the hierarchical association graph, remove GPUs with real-time utilization exceeding the preset utilization threshold from the preliminary candidate node set, and remove GPUs with memory usage exceeding the preset memory threshold, and update the preliminary candidate node set.
[0078] Specifically, the hierarchical association graph node layer records the real-time operating status data of each GPU. GPU real-time utilization represents the current computational load of the GPU, and memory usage data represents the currently used memory resources of the GPU. Both are collected in real-time and synchronized to the graph through the acquisition agent deployed on the nodes. Preset utilization and memory thresholds are set based on GPU hardware performance and task computation requirements to ensure that the selected GPUs have sufficient computing power and memory space to meet the running requirements of the scheduled tasks. During the selection process, the real-time utilization and memory usage data of each GPU in the initial candidate node set are checked one by one. GPUs with either metric exceeding the corresponding threshold are eliminated, and the remaining GPUs form an updated initial candidate node set, ensuring that all GPUs in the set are in an available state and have sufficient resources.
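Steps S22 and S23 can be sketched in Python as two small filters. Note one interpretive choice: removing every node that has any over-threshold link would discard both endpoints of a slow link, so this sketch instead greedily removes the node with the most violating links until none remain. That greedy rule, the function names, and the sample data are assumptions for illustration.

```python
def preliminary_candidates(nodes, delay, max_delay):
    """Return nodes whose pairwise static delays are all below max_delay.

    delay maps frozenset({u, v}) to the static delay baseline of that pair.
    ASSUMPTION: greedy removal of the node with the most violating links.
    """
    kept = list(nodes)
    while kept:
        violations = {
            u: sum(1 for v in kept
                   if v != u and delay[frozenset((u, v))] >= max_delay)
            for u in kept
        }
        worst = max(kept, key=violations.get)
        if violations[worst] == 0:
            break                      # every remaining pair is within the limit
        kept.remove(worst)
    return kept


def filter_by_load(gpus, utilization, memory_used, util_threshold, mem_threshold):
    """Drop GPUs whose utilization or memory usage exceeds its threshold."""
    return [g for g in gpus
            if utilization[g] <= util_threshold
            and memory_used[g] <= mem_threshold]


# Hypothetical rack: n3 is far from everyone else (delays in microseconds).
nodes = ["n0", "n1", "n2", "n3"]
delay = {frozenset(p): 2.0 for p in [("n0", "n1"), ("n0", "n2"), ("n1", "n2")]}
delay.update({frozenset(("n3", v)): 9.0 for v in ["n0", "n1", "n2"]})
close = preliminary_candidates(nodes, delay, max_delay=5.0)

utilization = {"n0": 0.30, "n1": 0.95, "n2": 0.40}
memory_used = {"n0": 10.0, "n1": 20.0, "n2": 70.0}   # GiB
available = filter_by_load(close, utilization, memory_used, 0.80, 64.0)
```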
[0079] S24. Based on the direct link connection relationship between GPUs recorded in the node layer of the hierarchical association graph, perform connectivity component partitioning on the GPUs in the updated preliminary candidate node set, and divide the GPUs that are connected by direct link paths into the same node cluster to generate a candidate node cluster set.
[0080] Specifically, the node layer of the hierarchical association graph records the direct connection relationships between GPUs, that is, whether a direct communication link exists between GPUs within a single rack, including the existence status and type of each direct link; this information can be obtained quickly through the graph index. Connected-component partitioning uses a depth-first search algorithm. Let the set of all GPUs in the updated preliminary candidate node set be $G = \{g_1, g_2, \dots, g_n\}$, where $n$ is the total number of GPUs in the set. Define the adjacency matrix $A = [a_{ij}]_{n \times n}$ to characterize the direct connection relationship between GPUs, where $a_{ij} = 1$ indicates that GPU $g_i$ and GPU $g_j$ have a direct link and $a_{ij} = 0$ indicates that they do not. Define the visit flag array $V = [v_1, v_2, \dots, v_n]$, where $v_i = 1$ indicates that GPU $g_i$ has been visited and $v_i = 0$ indicates that it has not; initially $v_i = 0$ for all $i$. Define the set of connected components $C = \emptyset$, which stores all node clusters after partitioning. The steps of the algorithm are as follows. Step one: initialize the index $i = 1$ and traverse all GPU nodes; when $v_i = 0$, start a depth-first search with the current GPU $g_i$ as the starting node, initialize the temporary node cluster $T = \{g_i\}$, and set $v_i = 1$. Step two: use the adjacency matrix $A$ to query all directly connected, unvisited nodes of the current node $g_i$, that is, nodes $g_j$ satisfying $a_{ij} = 1$ and $v_j = 0$; add $g_j$ to the temporary node cluster $T$, set $v_j = 1$, and repeat the search of step two with $g_j$ as the new current node. Step three: when the current node has no unvisited directly connected nodes, the search backtracks; once no node in $T$ has an unvisited directly connected node, the search terminates and the temporary node cluster $T$ is added to the connected component set $C$. Step four: update the index $i = i + 1$ and repeat steps one through three until all GPU nodes have been visited (i.e., $v_i = 1$ for every $g_i \in G$). At this point, the connected component set $C$ is the candidate node cluster set.
Within each node cluster, every pair of GPUs is reachable over a path of direct links, enabling low-latency communication and meeting the strong connectivity constraints of the task.
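The partitioning steps above can be sketched in Python as follows. This is an illustrative version only: it uses an explicit stack instead of recursion, zero-based GPU indices, and a hand-written adjacency matrix, all of which are choices made for the example.

```python
from typing import List


def connected_components(adj: List[List[int]]) -> List[List[int]]:
    """Partition GPU indices into clusters connected by direct-link paths (DFS)."""
    n = len(adj)
    visited = [False] * n          # visit flag array
    clusters: List[List[int]] = []  # set of connected components
    for start in range(n):
        if visited[start]:
            continue
        visited[start] = True
        stack = [start]
        cluster: List[int] = []    # temporary node cluster
        while stack:
            cur = stack.pop()
            cluster.append(cur)
            for nxt in range(n):
                # Follow every direct link to a GPU not yet visited.
                if adj[cur][nxt] == 1 and not visited[nxt]:
                    visited[nxt] = True
                    stack.append(nxt)
        clusters.append(sorted(cluster))
    return clusters


# Hypothetical direct-link matrix: GPUs 0-1-2 form a chain, GPUs 3-4 form a pair.
adj = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
print(connected_components(adj))  # [[0, 1, 2], [3, 4]]
```

GPU 0 and GPU 2 share no direct link in this example but still land in the same cluster, because they are joined by a path of direct links through GPU 1.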
[0081] In one optional embodiment, each task to be scheduled is treated as an independent agent, a multi-agent reward function is constructed, and the candidate node cluster set is used as the policy space for each agent; based on the multi-agent reward function and policy space, the Nash equilibrium policy profile is solved through distributed iterative negotiation to obtain the logical topology deployment scheme for each task to be scheduled, including the following steps:
[0082] S31. Define each task to be scheduled as an agent, and define each candidate node cluster in the candidate node cluster set as a policy of the corresponding agent.
[0083] Specifically, each task to be scheduled corresponds to an independent agent with autonomous decision-making capability. Its decision objective is to select the strategy that maximizes its own reward while avoiding resource conflicts with other agents. The candidate node cluster set constitutes the policy space of each agent, and each cluster is one optional strategy. The feasibility of a strategy is determined by how well the resource capacity and link performance of the candidate node cluster match the needs of the corresponding agent (task to be scheduled). The matching degree is computed quantitatively as $M_i(s_i) = \omega_1 R_i(s_i) + \omega_2 L_i(s_i)$, where $M_i(s_i)$ is the demand matching degree of the strategy $s_i$ (candidate node cluster) selected by agent $i$, with range (0,1); a strategy is considered feasible if $M_i(s_i)$ is greater than or equal to a preset threshold. $\omega_1$ and $\omega_2$ are the resource capacity matching weight and the link performance matching weight, respectively, satisfying $\omega_1 + \omega_2 = 1$. $R_i(s_i)$ is the resource capacity matching factor, determined by $N(s_i)$, the number of GPUs in the candidate node cluster corresponding to strategy $s_i$, and $N_i^{req}$, the number of GPUs required by the task corresponding to agent $i$. $L_i(s_i)$ is the link performance matching factor, determined by $\bar{d}(s_i)$, the average communication latency within the candidate node cluster corresponding to strategy $s_i$, and $D_i^{max}$, the maximum allowable communication latency threshold of the task corresponding to agent $i$. Only candidate node clusters whose demand matching degree is not lower than the preset threshold are retained as effective strategies; invalid strategies are eliminated in advance to reduce the policy space and improve decision-making efficiency.
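A minimal Python sketch of the matching-degree screening is given below for illustration. The weighted-sum structure follows the description above, but the concrete forms of the two factors are assumptions: the capacity factor is taken as the GPU ratio capped at 1, and the link factor as one minus the ratio of average latency to the allowed maximum, floored at 0. The function names, default weights, and sample values are likewise hypothetical.

```python
from typing import Dict, List, Tuple


def matching_degree(
    cluster_gpus: int,
    required_gpus: int,
    avg_latency: float,
    max_latency: float,
    w_capacity: float = 0.5,
    w_link: float = 0.5,
) -> float:
    """Weighted sum of a capacity factor and a link factor (factor forms assumed)."""
    if abs(w_capacity + w_link - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    capacity = min(1.0, cluster_gpus / required_gpus)   # assumed form
    link = max(0.0, 1.0 - avg_latency / max_latency)    # assumed form
    return w_capacity * capacity + w_link * link


def effective_strategies(
    clusters: Dict[str, Tuple[int, float]],
    required_gpus: int,
    max_latency: float,
    threshold: float,
) -> List[str]:
    """Keep clusters whose matching degree is not lower than the threshold."""
    return [
        name
        for name, (gpus, latency) in clusters.items()
        if matching_degree(gpus, required_gpus, latency, max_latency) >= threshold
    ]


# Hypothetical clusters: name -> (GPU count, average latency in microseconds).
clusters = {"c0": (8, 2.0), "c1": (4, 2.0)}
print(effective_strategies(clusters, required_gpus=8, max_latency=10.0, threshold=0.7))
```

With these sample values, cluster `c1` has only half the required GPUs and falls below the threshold, so only `c0` remains as an effective strategy.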
[0084] S32. Based on the real-time bandwidth parameters of the physical link corresponding to the strategy selected by each agent and the conflict situation of sharing physical links between different agents' selected strategies, construct a communication time overhead item; wherein, the communication time overhead item includes the maximum message transmission time within the selected strategy and the additional conflict delay caused by sharing links between different strategies.
[0085] Specifically, the communication time overhead term quantifies the communication time consumed during task execution after the agent selects a given strategy. It consists of two parts and can be expressed as $T_i^{comm} = T_i^{max} + T_i^{conf}$. Here $T_i^{max}$ is the maximum message transmission time within the selected strategy, calculated as $T_i^{max} = \max_k (m_k / b_k)$, where $m_k$ is the size of the $k$-th message within the candidate node cluster and $b_k$ is the real-time bandwidth of the link carrying that message, obtained from the real-time index of the node layer and rack layer of the hierarchical association graph. $T_i^{conf}$ is the additional conflict delay caused by different agents sharing a link; it is computed over $E_i^{sh}$, the set of physical links that agent $i$ shares with other agents, from $\gamma_e$, the congestion coefficient of link $e$ (positively correlated with link load, with range (0,1)), and $n_e$, the number of agents sharing link $e$. The sum of the two parts is the communication time overhead.
[0086] S33. Based on the communication direction conflicts that may occur in the logical topology corresponding to the strategies selected by each agent, construct a topology synchronization overhead term; wherein, the topology synchronization overhead term is the total time required to resolve the conflict on all links where direction conflicts occur.
[0087] Specifically, the topology synchronization overhead term captures the synchronization time caused by direction conflicts on communication links in the logical topologies corresponding to the agents' strategies. It can be expressed as $T_i^{sync} = \sum_{e \in E^{dir}} \tau_e$. Here $E^{dir}$ is the set of physical links on which communication direction conflicts occur: when communication links in the logical topologies of multiple agents carry data flows in opposite directions on the same physical link, that link is included in $E^{dir}$. $\tau_e$ is the time required to resolve the direction conflict on link $e$; it is positively correlated with the severity of the conflict and is calculated as $\tau_e = \kappa_e \cdot \tau_0$, where $\kappa_e$ is the conflict severity coefficient of link $e$ (the more conflicting data flows, the larger the coefficient; each range of conflicting-flow counts maps to one coefficient value, so $\kappa_e$ is obtained by mapping the number of conflicting data flows on the link, and its range can be [1,5]) and $\tau_0$ is the baseline conflict resolution time of a single link without severe conflict. Summing $\tau_e$ over all conflicting links gives the topology synchronization overhead.
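For illustration only, the topology synchronization overhead can be sketched in Python as a sum of per-link resolution times. The description states that ranges of conflicting-flow counts map to severity coefficients in [1,5] but does not give the ranges, so the bucket boundaries in `_UPPER_BOUNDS`, the function names, and the millisecond unit are hypothetical.

```python
import bisect
from typing import Dict

# Hypothetical bucket upper bounds: up to 2 flows -> 1, 3-4 -> 2, 5-8 -> 3,
# 9-16 -> 4, more than 16 -> 5.
_UPPER_BOUNDS = [2, 4, 8, 16]


def severity_coefficient(conflicting_flows: int) -> int:
    """Map the number of conflicting data flows on a link to a coefficient in [1, 5]."""
    return 1 + bisect.bisect_left(_UPPER_BOUNDS, conflicting_flows)


def sync_overhead(conflicts: Dict[str, int], baseline_ms: float) -> float:
    """Sum severity * baseline resolution time over all links with direction conflicts."""
    return sum(severity_coefficient(n) * baseline_ms for n in conflicts.values())


# Hypothetical conflicts: link id -> number of conflicting data flows.
conflicts = {"link-a": 2, "link-b": 5}
print(sync_overhead(conflicts, baseline_ms=0.5))  # (1 + 3) * 0.5 = 2.0
```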
[0088] S34. Based on the computation time consumed by each agent in this decision, construct a decision delay penalty term.
[0089] Specifically, the decision delay penalty term is used to constrain the time consumption of the agent's decision-making process, preventing excessive decision delay from offsetting the performance gains brought by topology reconstruction. The specific expression can be:
[0090]
[0091] where $t_i$ is the decision computation time of agent $i$, measured in real time by a timing module as the total time from the agent obtaining the candidate node cluster set and starting policy evaluation to finally determining the selected policy; $T_{th}$ is the preset decision delay threshold, set according to the real-time requirements of cluster scheduling; and $\mu$ is the penalty coefficient (range [0.1, 1]), used to adjust the penalty intensity. The decision delay penalty term $P_i(t_i)$ is positively correlated with $t_i$: the longer $t_i$, the greater the penalty. When $t_i$ exceeds $T_{th}$, the penalty increases significantly, forcing the agent to improve its decision-making efficiency.
[0092] S35. By weighted summing of the communication time overhead, topology synchronization overhead, and decision delay penalty, the multi-agent reward function is obtained; the formula for calculating the multi-agent reward function is as follows:
[0093] $U_i(s_i, s_{-i}) = -\left[\lambda_1 T_i^{comm}(s) + \lambda_2 T_i^{sync}(s) + \lambda_3 P_i(t_i)\right]$
[0094] where $U_i(s_i, s_{-i})$ is the reward value of agent $i$ under the policy profile $s$; $s = (s_i, s_{-i})$ is the policy profile formed by the joint policies of all agents; $s_i$ is the strategy chosen by agent $i$; $s_{-i}$ denotes the policies of all agents except agent $i$; $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weighting coefficients satisfying $\lambda_1 + \lambda_2 + \lambda_3 = 1$; $T_i^{comm}(s)$ is the communication time overhead term of agent $i$ under the policy profile $s$; $T_i^{sync}(s)$ is the topology synchronization overhead term of agent $i$ under the policy profile $s$; and $P_i(t_i)$ is the decision delay penalty term corresponding to the computation time $t_i$ consumed by agent $i$ in this decision.
[0095] Specifically, $U_i(s_i, s_{-i})$ is the reward value of agent $i$ under the policy profile $s$. It is negatively correlated with the overhead terms: the smaller the overheads, the larger the reward value and the better the agent's decision. $s$ is the policy profile composed of the policies of all agents, covering the current policy selections of all agents; $s_i$ is the specific strategy selected by agent $i$, i.e., one candidate node cluster from the candidate node cluster set; $s_{-i}$ is the combination of the policies of all agents except agent $i$, reflecting the influence of other agents' decisions on the current agent.
[0096] $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weighting coefficients, satisfying $\lambda_1 + \lambda_2 + \lambda_3 = 1$. They can be adjusted dynamically according to task priority and scheduling requirements: for tasks with high communication intensity, $\lambda_1$ is increased to emphasize the communication time overhead; for tasks with strict real-time requirements, $\lambda_3$ is increased to emphasize the decision delay penalty. $T_i^{comm}(s)$ is the communication time overhead term of agent $i$ under the policy profile $s$, i.e., the sum of the two time components constructed in S32; $T_i^{sync}(s)$ is the topology synchronization overhead term of agent $i$ under the policy profile $s$, i.e., the sum of the resolution times of all conflicting links as computed in S33; $P_i(t_i)$ is the decision delay penalty term corresponding to the computation time consumed by agent $i$ in this decision, i.e., the penalty value constructed in S34 from the decision computation time. Through this reward function, the merits of different policy choices for each agent can be quantified, providing the core basis for solving the Nash equilibrium in the subsequent distributed iterative negotiation.
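The weighted combination in S35 can be illustrated with a short Python sketch. It assumes that the reward is the negated weighted sum of the three terms and that the weights sum to 1; the function name `reward`, the default weights, and the sample inputs are hypothetical, and the three terms are passed in as already-computed numbers.

```python
from typing import Tuple


def reward(
    comm_overhead: float,
    sync_overhead: float,
    delay_penalty: float,
    weights: Tuple[float, float, float] = (0.5, 0.3, 0.2),
) -> float:
    """Negated weighted sum of the three overhead terms (assumed form)."""
    w_comm, w_sync, w_delay = weights
    if abs(w_comm + w_sync + w_delay - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return -(w_comm * comm_overhead + w_sync * sync_overhead + w_delay * delay_penalty)


# A strategy with lower overheads yields a higher (less negative) reward.
cheap = reward(comm_overhead=2.0, sync_overhead=1.0, delay_penalty=0.5)
costly = reward(comm_overhead=4.0, sync_overhead=2.0, delay_penalty=1.0)
print(cheap > costly)

# A communication-heavy task could shift weight toward the communication term.
comm_focused = reward(4.0, 2.0, 1.0, weights=(0.7, 0.2, 0.1))
print(comm_focused < costly)
```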
[0097] In one optional embodiment, the formula for calculating the communication time overhead is:
[0098] $T_i^{comm}(s) = \max_{(u,v) \in E(s_i)} \frac{M_i}{B_{uv}} + \sum_{j \neq i} \mathbb{I}(s_i, s_j) \cdot \delta_{ij}$
[0099] where $T_i^{comm}(s)$ is the communication time overhead term of agent $i$ under the policy profile $s$; $E(s_i)$ is the set of physical links contained in the logical topology corresponding to the policy $s_i$ chosen by agent $i$; $(u,v)$ is a physical link in $E(s_i)$ connecting physical node $u$ and physical node $v$; $M_i$ is the message size parameter of the task to be scheduled; $B_{uv}$ is the real-time available bandwidth parameter of physical link $(u,v)$ obtained from the hierarchical association graph; $\mathbb{I}(s_i, s_j)$ is the indicator function, which is 1 when the policy $s_i$ of agent $i$ and the policy $s_j$ of agent $j$ share a physical link and 0 otherwise; and $\delta_{ij}$ is the additional conflict delay factor caused by agents $i$ and $j$ sharing a physical link.
[0100] Specifically, in the formula for the communication time overhead, $T_i^{comm}(s)$ is the communication time overhead term of agent $i$ under the policy profile $s$. It consists of two parts: the maximum message transmission time within the policy, and the additional conflict delay on links shared among multiple agents. The sum of the two parts gives the complete communication time overhead.
[0101] $E(s_i)$ is the set of physical links contained in the logical topology corresponding to the policy $s_i$ chosen by agent $i$. It is determined by the candidate node cluster corresponding to $s_i$ and consists of the direct links between all GPU nodes, the links between nodes within a rack, and the cross-rack links. These links can be obtained through the node layer and rack layer indexes of the hierarchical association graph, which identifies the full range of physical links occupied by the current strategy. $(u,v)$ is a physical link in $E(s_i)$ connecting two physical nodes, where $u$ and $v$ can be GPU nodes, server nodes, or rack nodes depending on the node types at either end of the link. The specific type is determined by the layer to which the link belongs (node layer or rack layer), and the pair $(u,v)$ uniquely identifies each physical link in $E(s_i)$.
[0102] $M_i$ is the message size parameter of the task to be scheduled and represents the size of the data block transmitted in a single communication of the task. It is obtained by parsing the configuration file of the task to be scheduled and is determined by parameters such as the model batch size, feature dimension, and data precision of the task. It is a fixed value and is used directly as the core input for calculating the message transmission time. $B_{uv}$ is the real-time available bandwidth parameter of physical link $(u,v)$, obtained from the hierarchical association graph in real time. It is the effective bandwidth that the link can currently devote to this task, obtained by subtracting the bandwidth occupied by other tasks from the total link bandwidth. It changes dynamically with the link load of the cluster and directly affects the message transmission time. The first part of the formula, $\max_{(u,v) \in E(s_i)} M_i / B_{uv}$, calculates the maximum message transmission time within the policy selected by agent $i$. For each link $(u,v)$ in the physical link set $E(s_i)$, the message transmission time of that link is calculated as $M_i / B_{uv}$, the ratio of the message size to the real-time available bandwidth of the link; the lower the bandwidth, the longer the transmission time. All links are traversed and the largest transmission time is taken as the maximum message transmission time within the strategy, so that the worst-case link transmission is covered.
[0103] $\mathbb{I}(s_i, s_j)$ is the indicator function, used to determine whether the policy $s_i$ of agent $i$ and the policy $s_j$ of agent $j$ ($j \neq i$) share physical links. The test is whether the physical link sets of the two strategies intersect: when $E(s_i) \cap E(s_j) \neq \emptyset$ (at least one physical link is shared), $\mathbb{I}(s_i, s_j) = 1$; when the two do not share a physical link, $\mathbb{I}(s_i, s_j) = 0$. This function quickly marks link-sharing conflicts.
[0104] $\delta_{ij}$ is the additional conflict delay factor caused by agent $i$ and agent $j$ sharing physical links. It quantifies the increase in latency caused by one pair of agents sharing links. Its value is positively correlated with the number of shared links and the link load rate: the more shared links and the higher the load rate, the greater the latency increase and the larger $\delta_{ij}$. The specific calculation formula can be $\delta_{ij} = \delta_0 \cdot n_{ij} \cdot \bar{\rho}_{ij}$. In the formula, $\delta_0$ is the base latency factor, a fixed value determined by the hardware specifications of the cluster links that represents the base latency increment caused by a single fully loaded shared link. $n_{ij}$ is the number of physical links shared by agents $i$ and $j$, obtained by intersecting the physical link sets $E(s_i)$ and $E(s_j)$ of their respective policies, that is, $n_{ij} = |E(s_i) \cap E(s_j)|$. $\bar{\rho}_{ij}$ is the average load rate of the shared links, calculated as $\bar{\rho}_{ij} = \frac{1}{n_{ij}} \sum_{e \in E(s_i) \cap E(s_j)} \rho_e$, where $\rho_e$ is the real-time load rate of a single shared link $e$, obtained from the hierarchical association graph in real time. This formula quantifies the conflict delay caused by the links shared between agents $i$ and $j$.
[0105] The second part of the formula for the communication time overhead term, $\sum_{j \neq i} \mathbb{I}(s_i, s_j) \cdot \delta_{ij}$, calculates the total additional conflict delay caused by all other agents that share links with agent $i$. It iterates over all agents $j$ other than agent $i$, uses the indicator function $\mathbb{I}(s_i, s_j)$ to select the agents that share links with agent $i$, and sums their additional conflict delay factors $\delta_{ij}$ to obtain the total additional conflict delay caused by link sharing.
[0106] By summing the above two parts, the communication time cost of agent i under the current policy profile can be accurately quantified. This takes into account both the transmission efficiency of the internal links of the agent's own policy and the conflict impact of link sharing among multiple agents, providing accurate and reliable cost input for the construction of the multi-agent reward function and supporting the subsequent solution of the Nash equilibrium policy profile.
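The two-part calculation described above can be sketched in Python as follows, purely as an illustration. The product form used for the pairwise delay factor (base factor times number of shared links times their average load rate) is an assumption consistent with the description; the names `shared_delay_factor` and `comm_overhead`, the link identifiers, and all numeric values are hypothetical.

```python
from typing import Dict, Set, Tuple

Link = Tuple[str, str]  # a physical link identified by its two endpoints


def shared_delay_factor(
    links_i: Set[Link], links_j: Set[Link], load: Dict[Link, float], base_delay: float
) -> float:
    """Delay factor for one pair of agents; 0.0 when they share no link (indicator = 0)."""
    shared = links_i & links_j
    if not shared:
        return 0.0
    avg_load = sum(load[e] for e in shared) / len(shared)
    return base_delay * len(shared) * avg_load  # assumed product form


def comm_overhead(
    agent: str,
    strategies: Dict[str, Set[Link]],
    msg_size: float,
    bandwidth: Dict[Link, float],
    load: Dict[Link, float],
    base_delay: float,
) -> float:
    """Worst-case link transfer time plus conflict delay from every other agent."""
    own = strategies[agent]
    max_transfer = max(msg_size / bandwidth[e] for e in own)
    conflict = sum(
        shared_delay_factor(own, other, load, base_delay)
        for name, other in strategies.items()
        if name != agent
    )
    return max_transfer + conflict


# Hypothetical setup: agents A and B both use the uplink ("g1", "s0").
strategies = {
    "A": {("g0", "g1"), ("g1", "s0")},
    "B": {("g1", "s0"), ("g2", "g3")},
}
bandwidth = {("g0", "g1"): 100.0, ("g1", "s0"): 25.0, ("g2", "g3"): 100.0}  # GB/s
load = {("g0", "g1"): 0.1, ("g1", "s0"): 0.5, ("g2", "g3"): 0.2}
print(comm_overhead("A", strategies, msg_size=1.0, bandwidth=bandwidth,
                    load=load, base_delay=0.02))
```

For agent A the slowest link is the shared uplink, and the single shared link at 50% load adds the conflict component on top of the worst-case transfer time.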
[0107] The aforementioned method for scheduling computing nodes in a heterogeneous GPU cluster constructs a hierarchical association graph comprising a cluster layer, a rack layer, and a node layer. This transforms the original flat node space into a search space with physical hierarchical constraints. Based on the communication pattern characteristics of the tasks to be scheduled, cross-rack link screening is first performed at the cluster layer to eliminate rack combinations that do not meet bandwidth and reliability requirements. Then, through strong connectivity constraint propagation at the node layer, candidate node clusters that satisfy the strong connectivity conditions within the direct link domain between nodes are selected within each rack. This reduces the state space from an exponential combinatorial explosion to a polynomial-level feasible region. Furthermore, each task is treated as an independent agent, with the candidate node cluster set serving as its policy space. A multi-agent reward function is constructed that incorporates the communication time overhead, the topology synchronization overhead, and the decision delay penalty. The Nash equilibrium policy profile is solved through distributed iterative negotiation, enabling agents to converge to a globally coordinated logical topology deployment scheme while relying solely on local information exchange. This avoids both the latency bottleneck of centralized optimization and the conflict trap of uncoordinated distributed decision-making. Finally, atomic operation sequences are generated through differential comparison, and a subset of atomic operations with positive net benefit is selected for incremental application. This achieves smooth topology reconstruction while keeping decision and execution overhead within the achievable benefit, thus realizing millisecond-level online decision-making in a thousand-GPU cluster and ensuring consistency of multi-task topology reconstruction in a distributed environment.
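The description does not detail the negotiation protocol, so the following Python sketch swaps in a standard sequential best-response iteration purely to illustrate how a policy profile can settle at a Nash equilibrium. It runs in a single process rather than in a distributed manner, and the function name, the round limit, and the toy cost function (a per-cluster base cost plus a fixed charge for each other agent on the same cluster) are all assumptions. Best-response iteration is guaranteed to converge for potential games such as this toy example, but not for arbitrary reward functions.

```python
from typing import Callable, Dict, List

Profile = Dict[str, str]  # agent -> chosen candidate node cluster


def best_response_negotiation(
    agents: List[str],
    strategies: Dict[str, List[str]],
    cost: Callable[[str, Profile], float],
    max_rounds: int = 50,
) -> Profile:
    """Let agents take turns switching to their cheapest strategy until none wants to move."""
    profile = {a: strategies[a][0] for a in agents}
    for _ in range(max_rounds):
        changed = False
        for a in agents:
            best, best_cost = profile[a], cost(a, profile)
            for s in strategies[a]:
                trial = dict(profile)
                trial[a] = s
                c = cost(a, trial)
                if c < best_cost:  # strict improvement only, to avoid cycling on ties
                    best, best_cost = s, c
            if best != profile[a]:
                profile[a] = best
                changed = True
        if not changed:
            break  # no agent can improve unilaterally: a Nash equilibrium
    return profile


# Toy cost: base cost of the cluster plus 5.0 per other agent on the same cluster.
base_cost = {"c0": 1.0, "c1": 2.0}


def toy_cost(agent: str, profile: Profile) -> float:
    sharing = sum(1 for b in profile if b != agent and profile[b] == profile[agent])
    return base_cost[profile[agent]] + 5.0 * sharing


result = best_response_negotiation(
    ["t0", "t1"], {"t0": ["c0", "c1"], "t1": ["c0", "c1"]}, toy_cost
)
print(result)  # {'t0': 'c1', 't1': 'c0'}
```

Both tasks start on the cheaper cluster `c0`; the first task then moves to `c1` to avoid the sharing charge, after which neither task can lower its cost by moving alone.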
[0108] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.
[0109] Based on the same inventive concept, this application also provides a system for implementing the above-mentioned method for scheduling the topology of computing nodes in a heterogeneous GPU cluster. The solution provided by this system is similar to the implementation described in the above method. Therefore, the specific limitations of one or more embodiments of the computing node topology scheduling system for heterogeneous GPU clusters provided below can be found in the limitations of the computing node topology scheduling method for heterogeneous GPU clusters described above, and will not be repeated here.
[0110] In one exemplary embodiment, as shown in Figure 3, a computing node topology scheduling system 30 for heterogeneous GPU clusters is provided to implement the methods in the above-described method embodiments. The system includes:
[0111] The hierarchical topology modeling module 31 is used to construct a hierarchical association map of physical topology and runtime status, including cluster layer, rack layer and node layer, by collecting static physical connection relationship and dynamic runtime status data of GPU cluster. Among them, the cluster layer records the interconnection link attributes between different racks, the rack layer records the interconnection link attributes between different nodes within the same rack, and the node layer records the interconnection link attributes between different GPUs within a single node and the runtime status data of each GPU.
[0112] The multi-level topology filtering module 32 is used to perform cross-rack link filtering at the cluster layer based on the communication mode characteristics of the task to be scheduled and the interconnection link attributes of each layer in the hierarchical association graph, to obtain a set of feasible rack combinations; based on the communication mode characteristics of the task to be scheduled and the node layer interconnection link attributes of each rack in the set of feasible rack combinations, it performs node layer strong connectivity constraint propagation to obtain a set of candidate node clusters in each rack; wherein, the node layer strong connectivity constraint propagation is used to filter out node combinations that meet the preset strong connectivity conditions in the direct link domain between nodes within each rack in the set of feasible rack combinations based on the constraint requirements of the task to be scheduled on the communication delay between nodes.
[0113] The multi-agent policy optimization module 33 is used to treat each task to be scheduled as an independent agent, construct a multi-agent reward function, and use the candidate node cluster set as the policy space of each agent. Based on the multi-agent reward function and the policy space, the Nash equilibrium policy profile is solved through distributed iterative negotiation to obtain the logical topology deployment scheme of each task to be scheduled. The multi-agent reward function includes communication time overhead, topology synchronization overhead, and decision delay penalty.
[0114] The topology dynamic reconstruction module 34 is used to generate an atomic operation sequence based on the difference comparison results between the logical topology deployment scheme of each task to be scheduled and the current running topology scheme. By evaluating the execution cost and expected benefit of each atomic operation in the atomic operation sequence, a subset of atomic operations with positive net benefit is selected for application to complete the topology reconstruction of the GPU cluster.
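As a non-limiting sketch of the function of module 34, the following Python example compares the current and target placements, emits one atomic operation per task whose placement differs, and keeps only operations with positive net benefit. Treating each atomic operation as a single task migration, and supplying cost and benefit as precomputed estimates, are assumptions of the example; the names `AtomicOp`, `diff_to_ops`, and `select_positive` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AtomicOp:
    task: str
    src: str        # cluster the task currently runs on
    dst: str        # cluster chosen by the new deployment scheme
    cost: float     # estimated execution cost of the migration
    benefit: float  # estimated performance gain after the migration

    @property
    def net_benefit(self) -> float:
        return self.benefit - self.cost


def diff_to_ops(
    current: Dict[str, str],
    target: Dict[str, str],
    estimates: Dict[str, Tuple[float, float]],
) -> List[AtomicOp]:
    """One migration per running task whose target cluster differs from its current one."""
    ops = []
    for task, dst in target.items():
        src = current.get(task)
        # Tasks not yet running (src is None) need placement, not migration.
        if src is not None and src != dst:
            cost, benefit = estimates[task]
            ops.append(AtomicOp(task, src, dst, cost, benefit))
    return ops


def select_positive(ops: List[AtomicOp]) -> List[AtomicOp]:
    """Keep only the operations whose expected benefit exceeds their cost."""
    return [op for op in ops if op.net_benefit > 0]


current = {"t0": "c0", "t1": "c1", "t2": "c2"}
target = {"t0": "c3", "t1": "c1", "t2": "c4"}
estimates = {"t0": (1.0, 4.0), "t2": (3.0, 2.0)}  # task -> (cost, benefit)
ops = diff_to_ops(current, target, estimates)
print([op.task for op in select_positive(ops)])  # ['t0']
```

Task `t1` is unchanged and produces no operation, and the migration of `t2` is skipped because its estimated cost exceeds its benefit.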
[0115] Embodiments of this application also provide a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the aforementioned method embodiments.
[0116] Embodiments of this application also provide a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps in the above-described method embodiments.
[0117] For the device embodiments, since they basically correspond to the method embodiments, the relevant parts can be referred to in the description of the method embodiments. The device embodiments described above are merely illustrative. The components described as separate parts may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this disclosure according to actual needs. Those skilled in the art can understand and implement this without creative effort.
[0118] The above-described embodiments are merely illustrative of several implementation methods of the embodiments of this application, and their descriptions are relatively specific and detailed. However, they should not be construed as limiting the scope of the patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the embodiments of this application, and these modifications and improvements all fall within the protection scope of the embodiments of this application.
Claims
1. A method for scheduling computing node topology in a heterogeneous GPU cluster, characterized in that the method includes:
S1. By collecting static physical connection relationships and dynamic runtime state data of the GPU cluster, a hierarchical association graph of physical topology and runtime state is constructed, including the cluster layer, rack layer and node layer; wherein, the cluster layer records the interconnection link attributes between different racks, the rack layer records the interconnection link attributes between different nodes within the same rack, and the node layer records the interconnection link attributes between different GPUs within a single node and the runtime state data of each GPU;
S2. Based on the communication mode characteristics of the task to be scheduled, and combined with the interconnection link attributes of each layer in the hierarchical association graph, cross-rack link screening is performed at the cluster layer to obtain a feasible rack combination set; based on the communication mode characteristics of the task to be scheduled, and combined with the node-level interconnection link attributes of each rack in the feasible rack combination set, strong connectivity constraint propagation at the node level is performed to obtain a candidate node cluster set within each rack; wherein, the strong connectivity constraint propagation at the node level is used to select node combinations that meet the preset strong connectivity conditions within the direct link domain between nodes, based on the constraint requirements of the task to be scheduled on the communication delay between nodes;
S3. Treat each task to be scheduled as an independent agent, construct a multi-agent reward function, and use the candidate node cluster set as the policy space of each agent; based on the multi-agent reward function and the policy space, solve the Nash equilibrium policy profile through distributed iterative negotiation to obtain the logical topology deployment scheme of each task to be scheduled; wherein, the multi-agent reward function includes a communication time overhead term, a topology synchronization overhead term, and a decision delay penalty term;
S4. Based on the difference comparison results between the logical topology deployment scheme of each task to be scheduled and the current running topology scheme, generate an atomic operation sequence; by evaluating the execution cost and expected benefit of each atomic operation in the atomic operation sequence, select a subset of atomic operations with positive net benefit for application to complete the topology reconstruction of the GPU cluster.
2. The method according to claim 1, characterized in that performing cross-rack link screening at the cluster layer, based on the communication mode characteristics of the task to be scheduled and combined with the interconnection link attributes of each layer in the hierarchical association graph, to obtain a feasible rack combination set includes:
S11. Based on the synchronization mode characteristics and communication granularity characteristics of the task to be scheduled, query the corresponding bandwidth requirement threshold from the preset mapping table; based on the bandwidth requirement threshold, combined with the real-time available bandwidth parameters between racks recorded in the cluster layer of the hierarchical association graph, remove rack pairs whose real-time available bandwidth parameters are lower than the bandwidth requirement threshold, and retain rack pairs whose real-time available bandwidth parameters are higher than or equal to the bandwidth requirement threshold, to obtain a preliminary feasible rack pair set;
S12. Based on the real-time link retransmission rate parameter of each rack pair in the preliminary feasible rack pair set, remove rack pairs whose real-time link retransmission rate parameter exceeds the preset retransmission rate threshold from the preliminary feasible rack pair set, and update the preliminary feasible rack pair set;
S13. Based on the number of GPUs required for the task to be scheduled, and combined with the number of GPUs in the same node direct link domain recorded in the rack layer of the hierarchical association graph, traverse each rack pair in the updated preliminary feasible rack pair set, determine whether the sum of the number of GPUs in the node direct link domain of the two racks in the rack pair is greater than or equal to the number of GPUs required for the task to be scheduled, retain the rack pairs that meet the quantity condition, and generate the feasible rack combination set.
3. The method according to claim 1, characterized in that performing strong connectivity constraint propagation at the node level, based on the communication mode characteristics of the task to be scheduled and combined with the node-level interconnection link attributes of each rack in the feasible rack combination set, to obtain a candidate node cluster set within each rack includes:
S21. Based on the communication frequency characteristics of the task to be scheduled, query the corresponding maximum allowable communication delay threshold from the preset frequency-delay mapping table;
S22. Traverse each rack in the feasible rack combination set, and based on the static communication delay baseline between nodes recorded in the rack layer of the hierarchical association graph, select from the node set of the current rack a subset of nodes in which the static communication delay between any two nodes is lower than the maximum allowable communication delay threshold, and use the subset of nodes as the preliminary candidate node set for the corresponding rack;
S23. Based on the real-time utilization and memory usage data of each GPU recorded in the node layer of the hierarchical association graph, remove GPUs with real-time utilization higher than a preset utilization threshold from the preliminary candidate node set, remove GPUs with memory usage higher than a preset memory threshold, and update the preliminary candidate node set;
S24. Based on the direct link connection relationship between GPUs recorded in the node layer of the hierarchical association graph, perform connected-component partitioning on the GPUs in the updated preliminary candidate node set, and place the GPUs that are connected by direct link paths into the same node cluster to generate the candidate node cluster set.
4. The method according to any one of claims 1 to 3, characterized in that treating each task to be scheduled as an independent agent, constructing a multi-agent reward function, using the candidate node cluster set as the policy space of each agent, and solving the Nash equilibrium policy profile through distributed iterative negotiation based on the multi-agent reward function and the policy space, to obtain the logical topology deployment scheme for each task to be scheduled, includes:
S31. Define each task to be scheduled as an agent, and define each candidate node cluster in the candidate node cluster set as a strategy of the corresponding agent.
S32. Construct the communication time overhead term based on the real-time bandwidth parameters of the physical links corresponding to the strategy selected by each agent and on the conflicts that arise when the strategies selected by different agents share a physical link; wherein the communication time overhead term includes the maximum message transmission time within the selected strategy and the additional conflict delay caused by different strategies sharing a link.
S33. Construct the topology synchronization overhead term based on the communication direction conflicts that may occur in the logical topology corresponding to the strategy selected by each agent; wherein the topology synchronization overhead term is the total time required to resolve the conflicts on all links where direction conflicts occur.
S34. Construct the decision delay penalty term based on the computation time consumed by each agent in the current decision-making process.
S35. Obtain the multi-agent reward function as a weighted sum of the communication time overhead term, the topology synchronization overhead term, and the decision delay penalty term; wherein the multi-agent reward function is calculated as:
$$U_i(s) = \alpha\, T_i^{\mathrm{comm}}(s) + \beta\, T_i^{\mathrm{sync}}(s) + \gamma\, T_i^{\mathrm{dec}}$$
where $U_i(s)$ is the reward value of agent $i$ under the policy profile $s$; $s = (s_i, s_{-i})$ is the policy profile formed by the joint policies of all agents; $s_i$ is the strategy chosen by agent $i$; $s_{-i}$ is the policies of all agents except agent $i$; $\alpha$, $\beta$, and $\gamma$ are the weighting coefficients and satisfy the following condition: […]; $T_i^{\mathrm{comm}}(s)$ is the communication time overhead term of agent $i$ under the policy profile $s$; $T_i^{\mathrm{sync}}(s)$ is the topology synchronization overhead term of agent $i$ under the policy profile $s$; and $T_i^{\mathrm{dec}}$ is the decision delay penalty term, representing the computation time consumed by agent $i$ in the current decision.
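One way to picture the distributed iterative negotiation is sequential best-response dynamics, sketched below. Several points are assumptions made for illustration: the claim does not name the negotiation procedure, so best-response iteration is a stand-in; because all three terms are overheads, the sketch treats their weighted sum as a cost that each agent minimizes; and the function names, weights, and toy overhead functions are hypothetical. Best-response iteration is not guaranteed to converge for every game, hence the round limit.

```python
from typing import Callable, Dict, Hashable, List

Profile = Dict[str, Hashable]


def make_weighted_cost(alpha, beta, gamma, comm, sync, decision):
    """Weighted sum of the three overhead terms (sign convention assumed: lower is better)."""
    def cost(agent: str, profile: Profile) -> float:
        return (alpha * comm(agent, profile)
                + beta * sync(agent, profile)
                + gamma * decision(agent))
    return cost


def best_response_negotiation(
    strategies: Dict[str, List[Hashable]],
    cost: Callable[[str, Profile], float],
    max_rounds: int = 100,
) -> Profile:
    # Start from each agent's first candidate node cluster.
    profile: Profile = {agent: options[0] for agent, options in strategies.items()}
    for _ in range(max_rounds):
        changed = False
        for agent, options in strategies.items():
            best, best_cost = profile[agent], cost(agent, profile)
            for option in options:
                trial = dict(profile)
                trial[agent] = option
                trial_cost = cost(agent, trial)
                if trial_cost < best_cost:
                    best, best_cost = option, trial_cost
            if best != profile[agent]:
                profile[agent] = best
                changed = True
        if not changed:
            # No agent can improve unilaterally: a pure Nash equilibrium.
            break
    return profile


# Toy example: two tasks, two clusters; sharing a cluster adds a conflict delay.
def toy_comm(agent: str, profile: Profile) -> float:
    shared = any(o != agent and profile[o] == profile[agent] for o in profile)
    return 1.0 + (5.0 if shared else 0.0)


cost_fn = make_weighted_cost(0.5, 0.3, 0.2, toy_comm,
                             lambda agent, profile: 0.0, lambda agent: 0.1)
result = best_response_negotiation({"t1": ["A", "B"], "t2": ["A", "B"]}, cost_fn)
```

In the toy example both tasks start on cluster A; the first task moves to B to avoid the shared-link penalty, after which neither task can lower its cost by deviating alone.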
5. The method according to claim 4, characterized in that the communication time overhead term is calculated as:
$$T_i^{\mathrm{comm}}(s) = \max_{(u,v)\in E(s_i)} \frac{M_i}{B_{uv}} + \sum_{j \neq i} \mathbb{1}\big(E(s_i)\cap E(s_j)\neq\varnothing\big)\,\delta_{ij}$$
where $T_i^{\mathrm{comm}}(s)$ is the communication time overhead term of agent $i$ under the policy profile $s$; $E(s_i)$ is the set of physical links contained in the logical topology corresponding to the strategy $s_i$ chosen by agent $i$; $(u,v) \in E(s_i)$ is a physical link in that set, connecting physical node $u$ and physical node $v$; $M_i$ is the message size parameter of the task to be scheduled; $B_{uv}$ is the real-time available bandwidth parameter of the physical link $(u,v)$, obtained from the hierarchical association graph; $\mathbb{1}(\cdot)$ is the indicator function, which takes the value 1 when the strategy $s_i$ of agent $i$ and the strategy $s_j$ of agent $j$ share a physical link, and 0 otherwise; and $\delta_{ij}$ is the additional conflict delay factor caused by agents $i$ and $j$ sharing a physical link.
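The communication time overhead described here (the slowest per-link transmission time plus a conflict delay for each other agent whose strategy shares a link) can be sketched as below. The data layout is an assumption: `links_of` maps each agent to the physical links of its chosen strategy, `bandwidth` stands in for values read from the hierarchical association graph, and applying the conflict delay once per conflicting agent rather than once per shared link is an interpretation.

```python
from typing import Dict, FrozenSet, Tuple

Link = Tuple[str, str]


def communication_overhead(
    agent: str,
    links_of: Dict[str, FrozenSet[Link]],
    message_size: float,
    bandwidth: Dict[Link, float],
    conflict_delay: Dict[Tuple[str, str], float],
) -> float:
    own_links = links_of[agent]
    # Maximum message transmission time over the links of the chosen strategy.
    transfer = max(message_size / bandwidth[link] for link in own_links)
    # Add a conflict delay for every other agent sharing at least one link.
    conflict = 0.0
    for other, other_links in links_of.items():
        if other != agent and own_links & other_links:
            conflict += conflict_delay[(agent, other)]
    return transfer + conflict


# Illustrative data only: t1 and t2 share the link (n2, n3); t3 is isolated.
links_of = {
    "t1": frozenset({("n1", "n2"), ("n2", "n3")}),
    "t2": frozenset({("n2", "n3")}),
    "t3": frozenset({("n4", "n5")}),
}
bandwidth = {("n1", "n2"): 100.0, ("n2", "n3"): 50.0, ("n4", "n5"): 100.0}
delays = {("t1", "t2"): 0.5, ("t2", "t1"): 0.5}
overhead_t1 = communication_overhead("t1", links_of, 200.0, bandwidth, delays)
overhead_t3 = communication_overhead("t3", links_of, 200.0, bandwidth, delays)
```

For t1 the slowest link takes 200 / 50 = 4.0 time units and the shared link with t2 adds 0.5, giving 4.5; t3 shares nothing and pays only its transfer time of 2.0.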
6. A computing node topology scheduling system for a heterogeneous GPU cluster, used to implement the method according to any one of claims 1 to 5, characterized in that the system includes:
a hierarchical topology modeling module, used to construct a hierarchical association graph of physical topology and runtime status, comprising a cluster layer, a rack layer, and a node layer, by collecting the static physical connection relationships and dynamic runtime status data of the GPU cluster; wherein the cluster layer records the interconnection link attributes between different racks, the rack layer records the interconnection link attributes between different nodes within the same rack, and the node layer records the interconnection link attributes between different GPUs within a single node and the runtime status data of each GPU;
a multi-level topology filtering module, used to perform cross-rack link filtering at the cluster layer, based on the communication mode characteristics of the task to be scheduled and the interconnection link attributes of each layer in the hierarchical association graph, to obtain a set of feasible rack combinations; and to perform node-level strong connectivity constraint propagation, based on the communication mode characteristics of the task to be scheduled and the node-level interconnection link attributes of each rack in the set of feasible rack combinations, to obtain a set of candidate node clusters within each rack; wherein the node-level strong connectivity constraint propagation filters out, within each rack in the set of feasible rack combinations, the node combinations that meet the preset strong connectivity conditions within the direct link domain between nodes, according to the constraints that the task to be scheduled places on the communication delay between nodes;
a multi-agent policy optimization module, used to treat each task to be scheduled as an independent agent, construct a multi-agent reward function, and use the candidate node cluster set as the policy space of each agent; and, based on the multi-agent reward function and the policy space, to solve the Nash equilibrium policy profile through distributed iterative negotiation to obtain the logical topology deployment scheme of each task to be scheduled; wherein the multi-agent reward function includes a communication time overhead term, a topology synchronization overhead term, and a decision delay penalty term;
a topology dynamic reconstruction module, used to generate an atomic operation sequence based on the result of comparing the logical topology deployment scheme of each task to be scheduled with the currently running topology scheme; and, by evaluating the execution cost and expected benefit of each atomic operation in the atomic operation sequence, to select the subset of atomic operations with positive net benefit and apply it to complete the topology reconstruction of the GPU cluster.
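The selection rule of the topology dynamic reconstruction module can be sketched as follows. This is a minimal illustration: the `AtomicOp` class, its field names, and the sample operations are hypothetical, and net benefit is assumed to be expected benefit minus execution cost, which the claim implies but does not define.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AtomicOp:
    name: str
    execution_cost: float
    expected_benefit: float

    @property
    def net_benefit(self) -> float:
        # Assumed definition: benefit gained minus cost of executing the operation.
        return self.expected_benefit - self.execution_cost


def select_positive_ops(ops: List[AtomicOp]) -> List[AtomicOp]:
    """Keep, in their original order, only the operations with positive net benefit."""
    return [op for op in ops if op.net_benefit > 0]


# Illustrative operation sequence only.
sequence = [
    AtomicOp("migrate_task", execution_cost=3.0, expected_benefit=5.0),
    AtomicOp("reroute_link", execution_cost=2.0, expected_benefit=1.5),
    AtomicOp("swap_nodes", execution_cost=1.0, expected_benefit=1.0),
]
selected = select_positive_ops(sequence)
```

Only `migrate_task` is applied in this example; an operation that merely breaks even is skipped because its net benefit is not strictly positive.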
7. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the method of any one of claims 1 to 5.
8. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the method of any one of claims 1 to 5.