Method for allocating GPUs based on GPU cluster topology
Local topology connectivity graphs are generated in a large-scale distributed computing cluster and combined with device health information management, dynamic weight adjustment, and a subgraph diversity generation method. This addresses the accuracy and real-time performance problems of resource allocation in dynamic environments and achieves efficient resource allocation and system stability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- VIRTAI TECH BEIJING CO LTD
- Filing Date
- 2026-04-08
- Publication Date
- 2026-06-12

Figure CN122204697A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer technology, and more specifically to a GPU allocation method based on GPU cluster topology.
Background Technology
[0002] When constructing a global connectivity view for a large-scale distributed computing cluster, a unique and complex technical challenge arises: how to accurately and efficiently integrate the topology information and health status of multiple nodes in a dynamically changing cluster environment, while ensuring the real-time nature and optimality of resource allocation strategies. The specific scenario focuses on a heterogeneous cluster consisting of hundreds of computing nodes, each containing multiple computing units, interconnected through different types of high-speed connection media (such as fiber optic cables and copper cables), with significant differences in media performance.
[0003] The challenge lies in how to quickly integrate local information into a consistent global view after a node's topology information management module collects direct connection data between local computing units, amid frequent dynamic changes between nodes (such as the addition of new nodes or media failures), without view distortion caused by data synchronization delays. At the same time, the device health information management module needs to periodically check the status of each computing unit and connection medium. Because the cluster is large, the tension between detection frequency and data volume makes real-time updates to the global view difficult, especially during fault marking and removal, where faulty elements must be neither misjudged nor missed. Furthermore, when users submit large-scale computing tasks, the optimal subgraph search module must quickly filter out subgraphs that meet the resource quantity requirements from the global view and calculate weights based on the performance differences of the connection media, so that the allocation strategy prioritizes high-performance paths while also balancing the remaining resources, without delaying task allocation through excessive computational complexity. These issues are particularly prominent in high-load, dynamically changing cluster environments and directly affect overall system performance and reliability.
Summary of the Invention
[0004] The purpose of this invention is to provide a GPU allocation method based on GPU cluster topology, thereby solving the problems existing in the prior art.
[0005] To achieve the above objectives, the present invention provides the following technical solution: a GPU allocation method based on GPU cluster topology, comprising the following steps:
S1. Obtain topology information from each computing node in the cluster. Use the topology information management module to parse the direct and indirect connection information between GPU computing units in the computing nodes. Use the user demand quantity parsing logic to extract the number of GPU computing units required for the task. Combine this with connection integrity verification to generate a local topology connectivity graph of the current node and obtain a preliminary node connection dataset.
S2. Based on the local topology connectivity graph and the principle of uniform node distribution, report the data to the global connectivity view system. Update the communication medium connection relationships and path delay information in the view, in combination with cross-node constraints and a media type filtering mechanism, to determine the global connectivity distribution status.
S3. Periodically query the operating status of each GPU computing unit and its communication medium through the device health information management module. If a connection failure is detected, adjust the weight value of the corresponding medium according to the dynamic weight update rules, and remove the faulty unit from the global connectivity view to obtain the updated healthy connection dataset.
S4. For the updated healthy connection dataset, use a subgraph diversity generation method to extract multiple candidate subgraphs from the global connectivity view. Calculate the cumulative weight of the communication paths in each subgraph by combining the subgraph size limit and path length evaluation logic, to determine the initial connection cost of each subgraph.
S5. Summarize the connection costs of each candidate subgraph according to the total weight summary logic, compare the total weight values one by one through the weight comparison mechanism, and sort the unselected subgraphs in combination with the remaining subgraph optimization rules to obtain the optimal subgraph label with the smallest total weight.
[0006] As can be seen from the above technical solution, the present invention has the following beneficial effects: This invention effectively solves several key problems in the construction of a global connectivity view and resource allocation in large-scale heterogeneous distributed computing clusters. By collecting local topology information of each computing node and generating a preliminary connection dataset between nodes, and then combining the communication medium type and path delay information across nodes to construct a unified global connectivity view, the accuracy and real-time performance of topology integration are effectively improved. Periodic status checks are performed using a device health information management module, combined with dynamic weight adjustment and fault unit removal mechanisms, ensuring that only valid connections are retained in the global view, avoiding scheduling errors caused by misjudgment or omission. During resource allocation, a subgraph diversity extraction and cumulative weight evaluation method is adopted, which prioritizes low-cost, high-performance paths while ensuring resource requirements are met, improving the scheduling efficiency of computing tasks and system throughput. Simultaneously, through candidate subgraph sorting and a backup resource pool mechanism, resource redundancy backup and dynamic adjustment are achieved, further enhancing the system's fault tolerance and stability. Especially in scenarios with frequent node changes and high load operation, this method significantly reduces the risk of resource allocation delay and topology distortion, improving the overall availability, scalability, and reliability of the system.
Attached Figure Description
[0007] Figure 1 is a flowchart of the GPU allocation method based on GPU cluster topology according to the present invention; Figure 2 is a diagram showing the topology of the GPU server.
Detailed Implementation
[0008] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0009] As shown in Figure 1, the present invention provides a technical solution: a GPU allocation method based on a GPU cluster topology, comprising the following steps:
S1. Obtain topology information from each computing node in the cluster. Use the topology information management module to parse the direct and indirect connection information between GPU computing units in the computing nodes. Use the user demand quantity parsing logic to extract the number of GPU computing units required for the task. Combine this with connection integrity verification to generate a local topology connectivity graph of the current node and obtain a preliminary node connection dataset.
S2. Based on the local topology connectivity graph and the principle of uniform node distribution, report the data to the global connectivity view system. Update the communication medium connection relationships and path delay information in the view, in combination with cross-node constraints and a media type filtering mechanism, to determine the global connectivity distribution status.
S3. Periodically query the operating status of each GPU computing unit and its communication medium through the device health information management module. If a connection failure is detected, adjust the weight value of the corresponding medium according to the dynamic weight update rules, and remove the faulty unit from the global connectivity view to obtain the updated healthy connection dataset.
S4. For the updated healthy connection dataset, use a subgraph diversity generation method to extract multiple candidate subgraphs from the global connectivity view. Calculate the cumulative weight of the communication paths in each subgraph by combining the subgraph size limit and path length evaluation logic, to determine the initial connection cost of each subgraph.
S5. Summarize the connection costs of each candidate subgraph according to the total weight summary logic, compare the total weight values one by one through the weight comparison mechanism, and sort the unselected subgraphs in combination with the remaining subgraph optimization rules to obtain the optimal subgraph label with the smallest total weight.
S6. Using the optimal subgraph marking and weight recording storage mechanism, store the connection paths and communication medium types of the optimal subgraph as the basis for allocation. If the optimal subgraph involves cross-node paths, select the NVLink-based high-speed interconnect medium path first to determine the final GPU computing unit allocation scheme.
S7. Update the resource status in the global connectivity view according to the final allocation scheme, use the candidate subgraph storage method to save unused subgraph data as a backup resource pool, and use the healthy connection dataset to monitor the status of the allocated GPU computing units in real time, obtaining a dynamically adjusted cluster topology view.
[0010] This implementation relies on the complex topology within the GPU cluster. First, the topology information management module parses the direct or indirect connections between GPUs within each computing node, constructing a local connectivity graph containing bandwidth, latency, and physical topology information. Based on the user-submitted task requirements, the target number of GPUs is calculated. Then, by integrating the local connectivity relationships of multiple nodes, a unified cross-node topology model is constructed in the global connectivity view system. This model, after incorporating communication medium types (such as NVLink) and cross-node constraints, generates a global connectivity distribution graph with complete structure and path latency. The system periodically checks the operating status of GPU devices and their connection media through the device health information management module. Upon detecting anomalies, the weight values of the corresponding paths are immediately adjusted to ensure the global connectivity view reflects the real-time reliability of the operation. The updated healthy connectivity dataset is used to generate multiple candidate subgraphs. A subgraph diversity generation method ensures that the selected candidate schemes cover different topology characteristics, and their communication costs are calculated using path length, medium type, and cumulative weights. The system further selects the subgraph with the lowest communication cost through a total weight aggregation mechanism and optimizes and marks the remaining subgraphs to ensure overall scheduling efficiency. The GPU paths and media information contained in the final selected optimal subgraph will be used as the basis for actual allocation. When cross-node communication needs arise, high-bandwidth and low-latency NVLink links will be prioritized. 
The system further updates the global view resource status and stores unselected subgraphs in the reserve pool, realizing dynamic monitoring and topology adaptive adjustment after allocation.
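The selection of the lowest-cost subgraph and the reserve pool described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `CandidateSubgraph` type, its field names, and the label tie-break are assumptions introduced here, and the per-path weights are taken as already computed.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateSubgraph:
    label: str                       # hypothetical subgraph label
    edge_weights: tuple[float, ...]  # per-path weights inside the subgraph


def total_weight(subgraph: CandidateSubgraph) -> float:
    """Sum the path weights of one candidate subgraph."""
    return sum(subgraph.edge_weights)


def select_optimal(candidates):
    """Return (optimal, backup_pool).

    The optimal subgraph has the smallest total weight. The remaining
    candidates are kept, sorted by ascending total weight, as a backup pool.
    Ties are broken by label, which is an assumption of this sketch.
    """
    if not candidates:
        raise ValueError("no candidate subgraphs")
    ranked = sorted(candidates, key=lambda s: (total_weight(s), s.label))
    return ranked[0], ranked[1:]
```

Keeping the unselected candidates in sorted order means the next-best subgraph is available immediately if the selected one becomes unhealthy.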
[0011] S1 includes obtaining topology information from each compute node in the cluster, parsing the direct and indirect high-speed connection information between GPU computing units in the compute nodes through the topology information management module; extracting the number of GPU computing units required for the task from the topology information using demand quantity parsing logic, and generating a local topology connectivity graph for the current node by combining connection integrity verification; obtaining a preliminary node connection dataset for the local topology connectivity graph, and verifying the connection consistency in the dataset by comparing it with the overall cluster connection data through global topology integration; if the connection consistency is lower than a preset threshold, obtaining additional connection data from the cluster node parsing and updating the node connection dataset; and determining the preliminary node connection dataset based on the updated node connection dataset.
[0012] In this embodiment, when the cluster is running, the topology information management module obtains the topology information of each computing node in a fixed order. The topology information is stored as multiple records, each corresponding to a physical or logical connection between two GPU computing units in the node. Each record includes at least the node identifier of the computing node, the identifier of the local GPU computing unit, the identifier of the peer GPU computing unit, the connection medium type, the number of intermediate devices traversed by the connection path, the theoretical bandwidth value of the connection, the theoretical latency value of the connection, and a flag indicating whether it is a direct high-speed connection. The node identifier uniquely identifies a computing node within the cluster. It is assigned to each node during cluster deployment according to its physical sequence number or management number, and remains unchanged during system operation. A GPU computing unit identifier uniquely identifies a GPU computing unit within the same node. This identifier is assigned during node startup based on the hardware slot order or driver return order, and remains unchanged unless the node hardware configuration changes. The connection medium type distinguishes different physical connection buses and switching channels. During cluster deployment, administrators compile a list of all supported connection media in the current cluster, marking high-bandwidth media types as high-speed media types in the configuration file and marking the remaining media types as non-high-speed. This classification is stored as fixed configuration in the cluster configuration data.
The number of intermediate devices indicates the number of switching or forwarding devices traversed on the physical path between the local and remote GPU computing units. This number is directly given by the driver interface based on the actual routing path or obtained by the probe program through hop-by-hop detection during topology probing. The theoretical bandwidth value indicates the data transmission rate that the connection can achieve under non-congestion conditions. This value is written into the topology information by the device driver according to the nominal bandwidth in the hardware specifications. The theoretical latency value indicates the time required to complete one data transmission and reception. This value is determined during the cluster initialization phase by checking each connection. The system sends a fixed amount of test data and measures the round-trip time. Then, it takes the average of multiple measurements and writes it into the topology information. The flag indicating whether a connection is a direct high-speed connection is determined by the topology information management module based on the number of intermediate devices and the type of connection medium. When the number of intermediate devices in a connection record is 0 and the connection medium type is marked as high-speed medium type in the configuration file, and the theoretical bandwidth value of the record is greater than or equal to the bandwidth lower limit value pre-recorded in the configuration file, the topology information management module sets the direct high-speed connection flag of the record to yes. Otherwise, it is set to no. The bandwidth lower limit value is determined by the administrator during the cluster deployment phase based on the nominal bandwidth of all connection media.
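The flag rule above can be sketched as a small predicate. This is an illustration under assumptions: the `ConnectionRecord` field names, the units, and the sample medium names used when calling it are hypothetical and are not specified by the source.

```python
from dataclasses import dataclass


@dataclass
class ConnectionRecord:
    node_id: str
    local_gpu: int
    peer_gpu: int
    medium_type: str
    intermediate_devices: int
    bandwidth: float  # theoretical bandwidth (unit assumed, e.g. GB/s)
    latency: float    # theoretical latency (unit assumed, e.g. microseconds)
    direct_high_speed: bool = False


def mark_direct_high_speed(record, high_speed_media, bandwidth_lower_limit):
    """Set the flag to True only when all three conditions hold:
    no intermediate devices, a medium type configured as high-speed,
    and a theoretical bandwidth at or above the configured lower limit."""
    record.direct_high_speed = (
        record.intermediate_devices == 0
        and record.medium_type in high_speed_media
        and record.bandwidth >= bandwidth_lower_limit
    )
    return record
```

For example, `mark_direct_high_speed(ConnectionRecord("n1", 0, 1, "NVLink", 0, 300.0, 2.0), {"NVLink"}, 100.0)` sets the flag, while the same record with one intermediate device does not.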
[0013] In this embodiment, the bandwidth lower limit is determined as follows: the nominal bandwidths of all media types are collected and sorted from largest to smallest, and the smallest bandwidth value among the top three media types is taken as the lower limit and written into the configuration file. This ensures that connections marked as direct high-speed connections belong to connection types with relatively high overall bandwidth. After topology information collection is complete, when a task needs to be executed, the system generates task description data that includes the number of GPU computing units required by the task. The user enters this number explicitly as a positive integer through the task submission interface when submitting the task. After receiving the task description data, the requirement quantity parsing logic reads the number of GPU computing units required by the task and uses this value as the target number when selecting the GPU computing unit set. This value remains unchanged throughout the allocation process. When processing a computing node, the requirement quantity parsing logic first counts the number of GPU computing units in the available state in the node's topology information. If the number of available GPU computing units is less than the number required for the task, the node is skipped and does not participate in local topology construction. If the number of available GPU computing units is greater than or equal to the number required for the task, the requirement quantity parsing logic selects a candidate set of GPU computing units within the node according to fixed rules.
The selection rules are as follows: prioritize GPU computing units with a large number of direct high-speed connections with other GPU computing units, sort these GPU computing units from high to low according to the number of direct high-speed connections, and add them to the candidate set from the top few of the sorted results until the number of GPU computing units in the candidate set equals the number required for the task. When there are GPU computing units with the same number of direct high-speed connections, sort them according to the value of the GPU computing unit identifier from small to large to determine the order of addition. After obtaining the candidate set, a connection integrity check is performed, which is completed by the topology information management module based on the topology information of the node.
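The bandwidth lower limit and the candidate selection rule can be sketched as below. The function names and input shapes are assumptions made for illustration. Counting only links whose two endpoints are both available is also an assumption, since the source does not say how links to unavailable units are treated.

```python
def bandwidth_lower_limit(nominal_bandwidths):
    """Smallest value among the three largest nominal bandwidths."""
    top_three = sorted(nominal_bandwidths, reverse=True)[:3]
    if not top_three:
        raise ValueError("no nominal bandwidths configured")
    return min(top_three)


def select_candidates(available_gpus, direct_links, required):
    """Pick `required` GPUs, preferring those with the most direct
    high-speed links; ties are broken by ascending GPU identifier.

    available_gpus: iterable of GPU identifiers in the available state.
    direct_links: iterable of (gpu_a, gpu_b) pairs flagged as direct
        high-speed connections.
    Returns None when the node has too few available GPUs and is skipped.
    """
    gpus = sorted(set(available_gpus))
    if len(gpus) < required:
        return None
    degree = {g: 0 for g in gpus}
    # Normalise pairs so that (a, b) and (b, a) count as one link.
    pairs = {(min(a, b), max(a, b)) for a, b in direct_links if a != b}
    for a, b in pairs:
        if a in degree and b in degree:
            degree[a] += 1
            degree[b] += 1
    ranked = sorted(gpus, key=lambda g: (-degree[g], g))
    return ranked[:required]
```

Sorting on the key `(-degree, identifier)` applies both ordering rules in a single pass.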
[0014] Specifically, the check proceeds as follows. The topology information management module first establishes a connection table for the current node in memory. Each record in the connection table corresponds to a connection between two GPU computing units in the candidate set. If the topology information contains a record whose local and peer identifiers exactly match the identifiers of these two GPU computing units and which is marked as a direct high-speed connection, a record marked as a direct high-speed connection is generated in the connection table. If there is no direct high-speed connection record but there is a connection record with a number of intermediate devices greater than 0, a record marked as an indirect connection is generated in the connection table. To ensure no omissions, the topology information management module performs this check on every distinct pair of GPU computing units in the candidate set. After establishing the connection table, the topology information management module performs a reachability check on each GPU computing unit in the candidate set. The reachability check proceeds as follows: taking the GPU computing unit as the starting point, create a to-visit list and add the starting point to it, and at the same time create a visited set and add the starting point to it. Then take one GPU computing unit at a time from the to-visit list and retrieve all records in the connection table that have that GPU computing unit as the local end or the peer end. For the GPU computing unit at the other end of each record, if it is not yet in the visited set, add it to the visited set and to the to-visit list. Repeat the retrieval and expansion steps until the to-visit list is empty.
When the to-visit list is empty, the size of the visited set is the number of GPU computing units reachable from the starting point via direct high-speed or indirect connections. The topology information management module compares this number with the total number of GPU computing units in the candidate set. If they are equal, all GPU computing units in the candidate set can be reached from that starting point. If the reachability check is performed with every GPU computing unit in the candidate set as the starting point and the reachable number equals the size of the candidate set in every case, the connection integrity check passes. Based on this, the topology information management module generates a local topology connectivity graph for the current node. The local topology connectivity graph uses each GPU computing unit in the candidate set as a node and each connection record in the connection table as an edge. On each edge, the connection medium type, number of intermediate devices, theoretical bandwidth value, and theoretical latency value are recorded. If there are multiple indirect paths, the path with the smallest theoretical latency value is selected first as the path corresponding to the edge, and the theoretical latency value of the path is set as the sum of the theoretical latency values of each connection segment.
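The connection table and the reachability check amount to a breadth-first traversal, which can be sketched as follows. The record keys (`local`, `peer`, `direct`, `hops`) are hypothetical, and matching a record in either orientation is an assumption of this sketch.

```python
from collections import deque
from itertools import combinations


def build_connection_table(candidates, records):
    """Map each candidate GPU pair to "direct" or "indirect".

    records: iterable of dicts with keys local, peer, direct (bool)
        and hops (number of intermediate devices).
    Pairs with neither kind of record get no entry.
    """
    by_pair = {}
    for r in records:
        by_pair.setdefault(frozenset((r["local"], r["peer"])), []).append(r)
    table = {}
    for a, b in combinations(sorted(candidates), 2):
        recs = by_pair.get(frozenset((a, b)), [])
        if any(r["direct"] for r in recs):
            table[(a, b)] = "direct"
        elif any(r["hops"] > 0 for r in recs):
            table[(a, b)] = "indirect"
    return table


def reachable_from(start, table):
    """Breadth-first expansion with a to-visit queue and a visited set."""
    adjacency = {}
    for a, b in table:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    visited = {start}
    to_visit = deque([start])
    while to_visit:
        current = to_visit.popleft()
        for neighbor in adjacency.get(current, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                to_visit.append(neighbor)
    return visited


def integrity_check(candidates, table):
    """Pass only if every candidate reaches the whole candidate set."""
    members = set(candidates)
    return all(len(reachable_from(g, table)) == len(members) for g in members)
```

Because the connection table is undirected, one traversal from any single starting point would give the same verdict. The sketch repeats it per starting point only to mirror the described procedure.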
[0015] The accumulation process involves summing the theoretical latency values for each physical connection segment along the path to obtain a total latency value. After the local topology connectivity graph is generated, the topology information management module traverses each edge in the graph, extracting the local GPU computing unit identifier, remote GPU computing unit identifier, connection medium type, number of intermediate devices, theoretical bandwidth value, and theoretical latency value for each edge. These are then written into the preliminary node connection dataset as records, thus obtaining the preliminary node connection dataset for that node. After the preliminary node connection dataset is generated, global topology integration performs a connection consistency comparison on this dataset. Global topology integration relies on the overall cluster connection data established during cluster deployment and historical operation. The overall cluster connection data records the confirmed valid connection information between all computing nodes within the cluster. Each record includes at least the local node identifier, local GPU computing unit identifier, remote node identifier, remote GPU computing unit identifier, connection medium type, theoretical bandwidth value, and theoretical latency value. Global topology integration first reads all cluster-wide connection data records related to the current node from the storage medium. Then, it processes each record in the preliminary node connection dataset one by one. When processing a preliminary node connection record, global topology integration searches the cluster-wide connection data for a record with the same local node identifier, peer node identifier, local GPU computing unit identifier, peer GPU computing unit identifier, and connection medium type. If such a record exists, it is counted as a consistent record; otherwise, it is counted as an inconsistent record. 
After all records have been processed, global topology integration counts the number of consistent record entries and the total number of records in the preliminary node connection dataset. The number of consistent record entries is divided by the total number of entries to obtain a decimal between 0 and 1 as the connection consistency value. The closer this value is to 1, the more consistent the preliminary node connection dataset is with the cluster-wide connection data. The preset threshold is determined during the cluster trial operation phase.
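The consistency value is the fraction of preliminary records that have a matching cluster-wide record. A minimal sketch, with hypothetical dictionary keys standing in for the five compared fields:

```python
KEY_FIELDS = ("local_node", "local_gpu", "peer_node", "peer_gpu", "medium")


def consistency_value(preliminary, cluster_wide):
    """Return consistent_records / total_records as a float in [0, 1].

    A preliminary record is consistent when the cluster-wide data holds a
    record with the same node identifiers, GPU identifiers and medium type.
    """
    if not preliminary:
        raise ValueError("preliminary dataset is empty")
    known = {tuple(r[f] for f in KEY_FIELDS) for r in cluster_wide}
    consistent = sum(
        1 for r in preliminary if tuple(r[f] for f in KEY_FIELDS) in known
    )
    return consistent / len(preliminary)
```

Building the set of known keys once keeps each lookup constant-time, which matters when the cluster-wide data is large.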
[0016] In this embodiment, the preset threshold is determined as follows: After the cluster deployment is completed, several representative computing nodes are selected, and the topology information of these nodes is manually checked to confirm their actual connection relationships. Without modifying the actual connection relationships, the above preliminary node connection dataset generation and connection consistency calculation process is run for each node, and the range of connection consistency values obtained when the topology information is correct and complete is recorded. For example, the detection results show that the consistency values of these nodes are all between 0.95 and 1. Then, on the same batch of nodes, topology information with missing or incorrect entries is constructed by deliberately deleting some connection records or modifying some connection parameters, and the consistency calculation process is run again. The consistency values obtained in this case are mostly lower than 0.85. Based on these two sets of results, a fixed value between 0.85 and 0.95 is selected as the preset threshold. To ensure a safety margin, the preset threshold is specifically set to 0.9, and this value is written into the configuration file as a unified standard for determining the consistency of all nodes. During actual operation, global topology integration compares the connection consistency value calculated for each node with the preset threshold of 0.9. If the connection consistency value is greater than or equal to 0.9, the preliminary node connection dataset of that node is determined to meet the consistency requirement against the cluster-wide connection data, and no additional data is required.
If the connection consistency value is less than 0.9, it is determined that the initial node connection dataset of that node is missing or incorrect, and additional connection data needs to be obtained through cluster node parsing. At this time, cluster node parsing performs a probe on all GPU computing units on that node one by one.
[0017] Further, the process of probing one by one is as follows: For each different pair of GPU computing units, a fixed amount of test data is sent through the underlying communication interface. If the test data can successfully travel back and forth and the return time is within a reasonable range, a new connection record is generated. The local GPU computing unit identifier and the remote GPU computing unit identifier correspond to the current pair. The connection medium type is returned by the underlying interface. The theoretical bandwidth and theoretical latency values are calculated based on the test results. The number of intermediate devices is determined based on the path information reported by the underlying interface. The set of all probed records is used as the additional connection data. After obtaining the additional connection data, the topology information management module updates the original preliminary node connection dataset. The update rules are as follows: If a GPU pair not present in the original dataset appears in the additional connection data, these new records are directly added to the node connection dataset. If a record with the same local and remote GPU computing unit identifiers as the original dataset exists in the additional connection data, the connection medium type, number of intermediate devices, theoretical bandwidth, and theoretical latency values of the two records are compared. When the contents of the two records are not completely identical, the record in the additional connection data replaces the original record. If no record with the same GPU combination as an existing record is found, the existing record is retained, and the updated node connection dataset is obtained through the above merging process. 
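The three update rules reduce to a keyed merge in which probed records take precedence. A minimal sketch, assuming records are dictionaries keyed by the hypothetical fields `local_gpu` and `peer_gpu`:

```python
def merge_connection_data(original, additional):
    """Merge probed records into the original dataset.

    - A GPU pair seen only in `additional` is added.
    - A pair present in both takes the probed record. If the two records
      are identical this changes nothing, so it matches the rule of
      replacing only when contents differ.
    - A pair seen only in `original` is retained.
    """
    merged = {(r["local_gpu"], r["peer_gpu"]): r for r in original}
    for r in additional:
        merged[(r["local_gpu"], r["peer_gpu"])] = r
    return list(merged.values())
```

Since Python dictionaries preserve insertion order, retained records keep their original positions and new pairs are appended at the end.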
After the node connection dataset is updated, the global topology integration repeats the aforementioned connection integrity verification and connection consistency calculation process based on the updated node connection dataset. If the connection integrity verification passes and the new connection consistency value is greater than or equal to the preset threshold of 0.9, the updated node connection dataset is determined as the final preliminary node connection dataset for that node and used in subsequent steps. If the connection consistency value is still lower than 0.9, the cluster node resolution performs additional connection data collection and dataset update processes again, repeating the above comparison and calculation process. The system sets a fixed maximum number of retries in the configuration, for example, a maximum of 3 retries. If the connection consistency value is still lower than 0.9 after the above update and comparison process has been repeated 3 times, it is determined that the topology information of that node cannot be consistent with the overall cluster connection data within an acceptable range. This node will no longer be included in the candidate node set in this GPU allocation process. Thus, the process is executed sequentially through the topology information management module, the demand quantity parsing logic, the connection integrity verification, the global topology integration, and the cluster node resolution.
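The bounded retry procedure can be sketched as a loop over injected callables. All parameter names here are hypothetical. The defaults of 0.9 and 3 are the example values from the text, and the helpers for probing, merging, integrity checking and consistency calculation are assumed to be supplied by the caller.

```python
def resolve_node_dataset(
    dataset,
    compute_consistency,
    passes_integrity,
    probe_additional,
    merge,
    threshold=0.9,
    max_retries=3,
):
    """Return the accepted dataset, or None if the node is excluded.

    The initial dataset is accepted when its consistency value meets the
    threshold. Otherwise the node is re-probed up to `max_retries` times.
    After each merge the dataset must pass the integrity check and meet
    the threshold to be accepted.
    """
    if compute_consistency(dataset) >= threshold:
        return dataset
    for _ in range(max_retries):
        dataset = merge(dataset, probe_additional())
        if passes_integrity(dataset) and compute_consistency(dataset) >= threshold:
            return dataset
    return None  # excluded from the candidate node set for this allocation
```

Returning `None` instead of raising lets the caller skip the node and continue with the remaining candidates.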
[0018] S2 includes obtaining node distribution uniformity data from the local topology connectivity graph, filtering preliminary communication medium connection relationships by comparing distances and load differences between nodes through cross-node constraints, and obtaining a path delay information set; for the path delay information set, using a medium type filtering mechanism to extract compatible medium types from the path delay information set and comparing them with existing data in the global connectivity view system to determine the updated communication medium connection relationships; based on the updated communication medium connection relationships, reporting node distribution uniformity data to the global connectivity view system to determine the preliminary consistency of the connection distribution status; if the preliminary consistency meets a preset threshold, integrating the results extracted by the cross-node constraints and the medium type filtering mechanism to obtain extended connection path information in the global scope; and verifying the medium compatibility in the connection distribution status through the extended connection path information to determine the connection distribution status in the global scope.
[0019] In this embodiment, after generating the local topology connectivity graph in step S1, the system holds in memory local topology connectivity graph data containing multiple computing nodes, the GPU computing units within each node, and the connections between nodes. Each record in the local topology connectivity graph includes at least a node identifier, a GPU computing unit identifier within the node, an identifier of the peer node connected to that GPU computing unit, an identifier of the peer GPU computing unit, a connection medium type, the number of intermediate devices traversed on the path, the path delay value of the connection, and the bandwidth information of the connection calculated in the previous stage. The system first obtains node distribution uniformity data from this local topology connectivity graph. The node distribution uniformity data includes, for each node in the current local topology connectivity graph, at least the number of GPU computing units participating in connections, the total number of available GPU computing units for that node, and the ratio between the two. The total number of available GPU computing units for each node is entered into a configuration file by operations personnel during cluster deployment, based on the actual number of installed GPUs, and loaded into memory at system startup. The number of GPU computing units participating in the connection is obtained by traversing the local topology graph. During traversal, the system processes each connection record in the graph. For each record processed, the GPU computing units on the two nodes corresponding to the local and peer node identifiers involved in that record are counted once. If a GPU computing unit appears multiple times in different records, it is only counted as participating in the connection once. The counting process continues until every connection record has been processed.
Next, the system calculates the GPU usage ratio for each node by dividing the number of GPU computing units involved in the connection by the total number of available GPU computing units in that node. The node identifier, the number of GPU computing units involved in the connection, the total number of available GPU computing units, and the GPU usage ratio are then written into the node distribution uniformity data in node order. Simultaneously, the number of nodes involved in the connection and the number of nodes not involved in the connection are counted for subsequent uniformity assessment. Subsequently, the system filters each cross-node connection in the local topology connectivity graph based on cross-node constraints pre-configured during the cluster deployment phase. These cross-node constraints are fixedly configured by operations personnel during the deployment phase and include at least the maximum allowed distance between nodes, the maximum allowable load difference between any two nodes, the maximum allowable path hop count, and the maximum allowable path delay during service execution. The node distance value represents the proximity of two nodes in terms of physical rack location or network topology. During the deployment phase, operations personnel assign an integer distance value to each pair of nodes based on the rack location and switch hierarchy in the data center. This distance value is equal to the number of switching device layers that need to be traversed in the network from one node to another. The maximum distance value is an upper limit selected by operations personnel from all node distance values based on the network structure and the service's tolerance for latency.
For example, after calculating the distances between all nodes, if it is found that the distance value of most well-performing connections is no greater than 3, then the maximum distance value is fixed at 3. The load difference value is calculated based on the aforementioned GPU usage ratio. In step S2, the system calculates the current GPU usage ratio for each node. The load difference is defined as the absolute value of the difference between the GPU usage ratios of any two nodes. The maximum allowable load difference value is determined by operations personnel during the trial operation phase through multiple rounds of task scheduling results. Specifically, during the trial operation phase, the distribution of GPU usage ratios of each node is recorded during multiple rounds of task scheduling. When the load difference is too large, the task execution time increases significantly or some nodes remain under high load for an extended period. Based on these results, operations personnel select a value not exceeding 1 as the maximum load difference upper limit. In this implementation, the maximum load difference value is fixed at 0.3. The maximum allowable path hop count is determined by the network topology and the performance of intermediate devices. During the deployment phase, operations and maintenance personnel count the number of intermediate devices required for a path from one node to another. An excessive number of intermediate devices introduces significant additional latency. Therefore, a maximum hop count that ensures performance for most tasks is selected from the statistical results; in this implementation, this is fixed at 4. The maximum allowable path latency is determined by the application scenario's end-to-end latency requirements. During the cluster trial operation phase, round-trip times are measured by sending fixed-size data along different paths, and the average value is calculated.
All measured path latency values are sorted from smallest to largest. An upper limit is selected as the maximum allowable path latency, ensuring tasks can complete normally; in this implementation, this is fixed at a specific value in milliseconds and recorded in the configuration.
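The computation of the node distribution uniformity data described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Link` record and the `node_uniformity` function are hypothetical names, and every node in `total_gpus` is assumed to have at least one installed GPU.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple


@dataclass(frozen=True)
class Link:
    """One connection record (hypothetical field names)."""
    local_node: str
    local_gpu: str
    peer_node: str
    peer_gpu: str


def node_uniformity(
    links: Iterable[Link], total_gpus: Dict[str, int]
) -> Dict[str, Tuple[int, int, float]]:
    """Map node -> (connected GPUs, total GPUs, GPU usage ratio)."""
    connected: Dict[str, Set[str]] = {node: set() for node in total_gpus}
    for link in links:
        # A set counts each GPU once even if it appears in many records.
        connected.setdefault(link.local_node, set()).add(link.local_gpu)
        connected.setdefault(link.peer_node, set()).add(link.peer_gpu)
    rows = {}
    for node in sorted(total_gpus):  # written in node order
        used = len(connected[node])
        rows[node] = (used, total_gpus[node], used / total_gpus[node])
    return rows
```

Nodes with a ratio of 0 are the ones not involved in any connection, which gives the two node counts used in the later uniformity assessment.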
[0020] After obtaining the above parameters, the system performs a comparison operation on each cross-node connection record in the local topology connectivity graph. A cross-node connection is one where the identifier of the local node and the identifier of the peer node in the record are different. For each cross-node connection, the system first looks up the GPU usage ratio of the local node and the peer node in the node distribution uniformity data, calculates the absolute value of the difference between the two as the load difference of the connection, and then reads the distance between the two nodes from the node distance table generated during the deployment phase. At the same time, it reads the number of intermediate devices along the path and the path delay value from the connection record. The system checks four conditions in sequence: the node distance value is not greater than the maximum distance value, which is 3 in this embodiment; the load difference is not greater than the maximum load difference value of 0.3; the number of intermediate devices is not greater than the maximum allowed path hops of 4; the path delay value is not greater than the maximum allowed path delay value recorded in the configuration. If all four conditions are met, the system extracts the local node identifier, peer node identifier, local GPU computing unit identifier, peer GPU computing unit identifier, connection medium type, and path delay value for that connection, writes them into the preliminary communication medium connection relationship set, and simultaneously writes the path delay value into the path delay information set to form a record. Each record in the path delay information set contains the starting node identifier, ending node identifier, starting GPU computing unit identifier, ending GPU computing unit identifier, connection medium type, and the path delay value corresponding to that connection.
If any of the four conditions are not met, the system discards that connection, does not write it into the preliminary communication medium connection relationship set, and does not write its corresponding path delay value into the path delay information set. After all cross-node connections have been processed, a path delay information set that satisfies the cross-node constraints is obtained.
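The four-condition check above can be sketched as follows. The record type `CrossLink`, the function `filter_cross_node`, and the value of `MAX_DELAY_MS` are assumptions made for illustration; the text fixes the delay limit in the configuration without stating it.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List

MAX_DISTANCE = 3      # maximum node distance in this embodiment
MAX_LOAD_DIFF = 0.3   # maximum load difference in this embodiment
MAX_HOPS = 4          # maximum intermediate devices in this embodiment
MAX_DELAY_MS = 2.0    # hypothetical: the text does not give this value


@dataclass(frozen=True)
class CrossLink:
    local_node: str
    peer_node: str
    medium: str
    hops: int
    delay_ms: float


def filter_cross_node(
    links: Iterable[CrossLink],
    usage_ratio: Dict[str, float],
    distance: Dict[FrozenSet[str], int],
) -> List[CrossLink]:
    """Keep only cross-node links that satisfy all four constraints."""
    kept = []
    for link in links:
        if link.local_node == link.peer_node:
            continue  # same node on both ends: not a cross-node connection
        load_diff = abs(usage_ratio[link.local_node] - usage_ratio[link.peer_node])
        pair = frozenset((link.local_node, link.peer_node))
        if (
            distance[pair] <= MAX_DISTANCE
            and load_diff <= MAX_LOAD_DIFF
            and link.hops <= MAX_HOPS
            and link.delay_ms <= MAX_DELAY_MS
        ):
            kept.append(link)  # becomes a path delay information record
    return kept
```

The node distance table is keyed by an unordered node pair here so that one entry serves both directions, which is a design choice of the sketch rather than something the text specifies.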
[0021] Subsequently, the system executes a media type filtering mechanism on the path delay information set. This mechanism is configured by operations personnel during the cluster deployment phase based on the actual deployed communication media types and their combination rules. Specifically, operations personnel list all existing communication media types in the cluster, such as high-speed interconnect media, general-purpose bus media, and Ethernet media, and categorize these media types according to whether they can appear simultaneously on the same cross-node path, resulting in a compatible media type combination list and a list of disallowed media combinations. These two lists are then written into the configuration data. Each record in the path delay information set contains only one specific connection media type; therefore, the media type filtering mechanism directly compares the connection media type in the record with the list of compatible media types. If the media type exists in the list, the path delay information record is retained; otherwise, the record is deleted and will not participate in subsequent global data comparisons. After obtaining a set of path delay information containing only compatible media types, the system compares each record in this set with existing data in the global connectivity view system. The global connectivity view system stores a global connectivity data table, and each record in this table contains at least the start node identifier, end node identifier, start GPU computing unit identifier, and end GPU computing unit identifier. When processing a record in the path delay information set, the system searches for a matching record in the global connectivity data table based on the start node identifier, end node identifier, start GPU compute unit identifier, end GPU compute unit identifier, and connectivity media type.
If a matching record is found, the system compares the path delay value in this new record with the path delay value of the existing record in the global connectivity data table. If the difference is less than a pre-fixed delay update threshold, it indicates that the delay change is within the normal fluctuation range. The system keeps the path delay value in the global connectivity data table unchanged and only adds the most recent path delay value to the record. When the difference between the two path delay values is greater than or equal to the delay update threshold, it indicates that the connection performance has changed significantly. The system updates the path delay value of the record in the global connectivity data table to the path delay value in the current record and marks the record as updated. The delay update threshold is determined by the operation and maintenance personnel during the cluster trial operation phase by repeatedly measuring the delay of the same path. The method is to continuously measure multiple sets of delay data for the same connection at different times, count the maximum difference of the natural fluctuation of the delay, multiply this difference by a safety factor greater than 1, and use the result as the delay update threshold. This value is written to the configuration file. In this embodiment, the delay update threshold is fixed to a specific millisecond value. If no identical record is found in the global connectivity data table, the system adds the path delay information as a new communication medium connection relationship to the global connectivity data table. Thus, after each comparison and update process, the communication medium connection relationships in the global connectivity view system form an updated set of communication medium connection relationships. This updated set represents the currently known global connections that satisfy cross-node constraints and pass the media type filtering mechanism.
Next, based on this updated set of communication medium connection relationships, the system reports the aforementioned node distribution uniformity data to the global connectivity view system and performs a preliminary consistency judgment on the connection distribution status.
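The comparison against the global connectivity data table can be sketched as follows. The function `merge_delay_record`, the dictionary layout, and the field names `delay_ms`, `latest_ms`, and `updated` are hypothetical; the threshold is passed in because the text fixes it in configuration without giving a number.

```python
from typing import Dict, Tuple

# Key: (start node, end node, start GPU, end GPU, medium type).
Key = Tuple[str, str, str, str, str]


def merge_delay_record(
    table: Dict[Key, dict], key: Key, new_delay_ms: float, threshold_ms: float
) -> str:
    """Merge one filtered path delay record into the global table."""
    entry = table.get(key)
    if entry is None:
        # No matching record: add it as a new connection relationship.
        table[key] = {"delay_ms": new_delay_ms, "latest_ms": new_delay_ms, "updated": False}
        return "added"
    if abs(new_delay_ms - entry["delay_ms"]) < threshold_ms:
        # Normal fluctuation: keep the stored delay, note the latest sample.
        entry["latest_ms"] = new_delay_ms
        return "kept"
    # Significant change: overwrite the stored delay and mark the record.
    entry["delay_ms"] = new_delay_ms
    entry["latest_ms"] = new_delay_ms
    entry["updated"] = True
    return "updated"
```

Records would reach this function only after the media type filter has retained them, so the medium type in the key is always one from the compatible list.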
[0022] Specifically, the process is as follows: The global connection view system first counts the number of cross-node connection entries currently participated in by each node based on the updated communication medium connection relationship set, recording this number as the node's current connection count. Simultaneously, it reads the node's average connection count over a past period and the pre-set upper and lower limits of the target connection count from historical operational data. The average connection count is obtained by accumulating the connection counts of each node and calculating the average after each scheduling cycle during system operation. The upper and lower limits of the target connection count are set by operations personnel during cluster deployment, taking into account node hardware capabilities, energy consumption limitations, and expected load distribution, and written into the configuration file. The global connection view system compares the current connection count of each node with the target connection count upper and lower limits. When the current connection count falls between the target connection count upper and lower limits, the node is marked as a node with a normal connection status. When the current connection count is less than the lower limit or greater than the upper limit, the node is marked as a node with an abnormal connection status. The system then counts the number of nodes with normal connection status and the total number of nodes in the cluster. Dividing the number of nodes with normal connection status by the total number of nodes yields a value between 0 and 1, which serves as the initial consistency value for the connection distribution status. The closer the value is to 1, the more closely the current connection distribution matches the expected target.
The preset threshold is determined by the operation and maintenance personnel during the trial operation phase by observing the connection distribution results over multiple scheduling cycles. Specifically, during the trial operation, the ratio of the number of nodes with normal connection status to the total number of nodes is recorded in multiple rounds of scheduling. When the task execution effect is good and the hardware utilization of each node is reasonable, these ratios are usually close to 1. The operation and maintenance personnel select a value slightly lower than the minimum value from these ratios as the initial consistency preset threshold. In this embodiment, the preset threshold is fixed at 0.9 and written into the system configuration. During formal operation, the global connection view system compares the initial consistency value calculated each time with 0.9. When the initial consistency value is greater than or equal to 0.9, the system determines that the connection distribution status under the current updated communication medium connection relationship meets the requirements. Then, based on the determination result, the extended connection path information acquisition process is entered. When the initial consistency value is less than 0.9, the system considers that the current connection distribution deviates from the expectation and needs to be optimized by adjusting the connection selection strategy in subsequent steps. In this embodiment, the extended path search is only executed when the initial consistency value is greater than or equal to the preset threshold.
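The initial consistency value and its comparison with the 0.9 threshold can be sketched as follows. The function names are hypothetical, `connection_counts` is assumed to contain every node in the cluster (including nodes with zero connections), and the target limits are treated as inclusive.

```python
from typing import Dict

CONSISTENCY_THRESHOLD = 0.9  # preset threshold in this embodiment


def initial_consistency(connection_counts: Dict[str, int], lower: int, upper: int) -> float:
    """Share of nodes whose current connection count lies within the target limits."""
    normal = sum(1 for count in connection_counts.values() if lower <= count <= upper)
    return normal / len(connection_counts)


def may_search_extended_paths(connection_counts: Dict[str, int], lower: int, upper: int) -> bool:
    """The extended path search runs only when the threshold is met."""
    return initial_consistency(connection_counts, lower, upper) >= CONSISTENCY_THRESHOLD
```

For example, with ten nodes of which nine fall inside the limits, the value is 0.9 and the extended path search proceeds.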
[0023] The process of acquiring extended connection path information is based on the updated set of communication medium connection relationships. The system performs a path search once for each pair of different start and end nodes in the global connection view system. The path search is constrained by using only connection edges from the updated set of communication medium connection relationships, using only media types allowed by the media type filtering mechanism, and ensuring that each connection segment on the path satisfies the maximum distance, maximum load difference, maximum path hop count, and maximum path delay values in the cross-node constraints. During the search, the system starts from the start node, using all connection edges directly connected to the start node that satisfy the constraints as first-level candidate paths. It then continues to expand outwards from the endpoints of these candidate paths. During expansion, the same conditional judgment is performed on each newly added connection edge, while simultaneously accumulating the number of connection segments already included in the current path and the path delay values for each segment. When the number of connection segments exceeds the maximum path hop count of 4, the system immediately terminates the expansion of that path and discards it. Similarly, when the accumulated path delay value exceeds the maximum allowed path delay value, the path expansion is also terminated and the path is discarded. When a path reaches the target termination node without violating any constraints, the system records this path as a valid extended connection path. The system writes the identifiers of the nodes traversed sequentially along the path, the connection medium type of each connection segment, and the cumulative path delay value into the extended connection path information set. 
After performing path searches for all combinations of start and end nodes, a global extended connection path information set is obtained. The system then uses this extended connection path information set to verify the media compatibility in the connection distribution state. The verification process involves checking each extended connection path to see if the media type combinations of all connection segments on the path completely fall into the compatible media type combination list configured during the deployment phase. If the media types of all connection segments on a path belong to the same group of compatible media type combinations or belong to allowed combinations in the compatible list, the path is marked as a media-compatible path. If any two connection segments in a path have media type combinations that appear in the list of disallowed media combinations, the path is marked as a media-incompatible path and removed from the extended connection path information set. After deleting all media-incompatible paths, the system categorizes the remaining media-compatible paths according to their start and end nodes. For each node, it counts the number of media-compatible path entries passing through it, the average path latency of these paths, and the distribution of media types used. These statistical results are written into the connection distribution status data structure of the global connection view system. Each record in this data structure corresponds to a node or a pair of nodes, including the number of connections for that node globally, the proportion of media types used in these connections, and the average latency of these connections. Multiple records are combined to form the global connection distribution status.
Subsequent steps, when allocating GPU computing units, directly read the connection quality and distribution between nodes from this connection distribution status, thereby making allocation decisions based on the extended connection path information that has been verified for media compatibility and meets cross-node constraints.
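The bounded path search and the media compatibility check can be sketched as follows. This is an illustration under assumptions, not the claimed procedure: a depth-first expansion is used, nodes are not revisited within a path, `MAX_DELAY_MS` is a hypothetical value, and the edge list is assumed to contain only edges that already passed the cross-node constraints and the media type filter.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

MAX_HOPS = 4        # maximum path hop count in this embodiment
MAX_DELAY_MS = 5.0  # hypothetical: the text does not give this value

Edge = Tuple[str, str, float]             # (neighbour, medium type, delay in ms)
Path = Tuple[List[str], List[str], float]  # (nodes, media per segment, total delay)


def find_paths(edges: Dict[str, List[Edge]], start: str, goal: str) -> List[Path]:
    """Enumerate paths from start to goal within the hop and delay limits."""
    found: List[Path] = []

    def extend(node: str, nodes: List[str], media: List[str], delay: float) -> None:
        if node == goal:
            found.append((nodes, media, delay))
            return
        if len(media) == MAX_HOPS:
            return  # one more segment would exceed the hop limit
        for nxt, medium, d in edges.get(node, []):
            if nxt in nodes or delay + d > MAX_DELAY_MS:
                continue  # assumption: no revisits; delay budget enforced
            extend(nxt, nodes + [nxt], media + [medium], delay + d)

    extend(start, [start], [], 0.0)
    return found


def media_compatible(media: List[str], disallowed: Set[FrozenSet[str]]) -> bool:
    """A path is compatible if no two of its media types form a disallowed pair."""
    return all(frozenset(pair) not in disallowed for pair in combinations(set(media), 2))
```

Paths that fail `media_compatible` would be removed before the per-node statistics (path count, average latency, media type distribution) are computed.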
[0024] S3 includes obtaining operational status data from the GPU computing unit through the device health information management module, checking abnormal indicators for communication media to obtain a preliminary fault identifier set; based on the preliminary fault identifier set, adjusting the corresponding weight values by comparing media load differences and historical fault records using dynamic weight update rules to determine the weight-adjusted media list; for the weight-adjusted media list, if a connection fault is detected, the faulty unit is removed from the global connection view, and the view data after removal is obtained; using the view data after removal, the backup media switching mechanism is integrated to update the health connection dataset via backup path allocation to obtain the updated health connection dataset.
[0025] In this embodiment, during cluster operation, the device health information management module obtains operational status data from each GPU computing unit according to a fixed monitoring cycle. The monitoring cycle length is determined by maintenance personnel during the cluster deployment phase based on the business requirements for real-time health monitoring and system load, after multiple tests during the trial operation phase. In this embodiment, the monitoring cycle length is fixed at 60 seconds, and this value is written into the configuration file. At the beginning of each monitoring cycle, the device health information management module sequentially polls all GPU computing units, reading the current node identifier, GPU computing unit identifier, and the identifier of each communication medium physically connected to that GPU computing unit for each GPU computing unit. The system obtains the current core utilization percentage, memory usage percentage, current temperature value, number of accumulated error events in the current monitoring period, total amount of data transmitted through each communication medium in the current monitoring period, number of retransmissions and packet loss counts for each medium in the current monitoring period, and average round-trip latency value measured by probe messages from the underlying driver. It also obtains the current link status flag of each communication medium from the communication medium status register or driver interface. The link status flag value is uniformly defined as two states during the deployment phase: normal and interrupted. Normal means that the link has successfully established a connection and completed data transmission and reception in the most recent probes. 
Interrupted means that the most recent probes have failed or the hardware has reported that the link is disconnected. During the cluster deployment and trial operation phase, maintenance personnel set abnormal thresholds for each operational status indicator based on the nominal values in the hardware manual and monitoring data from multiple rounds of trial operation. The method for determining the temperature abnormal threshold is to run near-full load tasks for an extended period during the trial operation phase, recording the highest temperature value of each GPU computing unit during stable operation. This highest value is then added to a fixed safety margin to obtain the upper temperature limit. In this implementation, the upper temperature limit is fixed at 85 degrees Celsius. The method for determining the upper limits for error event counts, retransmission counts, and packet loss counts is to record the maximum values of these counts over multiple monitoring periods during the trial operation phase, provided that they do not affect task correctness and the link is in a normal state. This maximum value is then multiplied by a safety factor greater than 1 to obtain the final adopted upper limit values. In this implementation, the upper limit for error event counts is fixed at 10, the upper limit for retransmission counts is fixed at 20, and the upper limit for packet loss counts is fixed at 20. The upper limit for the number of measurements is fixed at 5. The upper limit for the average round-trip delay is determined by continuously measuring multiple sets of round-trip delay data for each communication medium under normal conditions during the trial operation phase, statistically analyzing the maximum and average values, and selecting a fixed value slightly larger than these values as the upper limit. In this embodiment, the upper limit for the average round-trip delay is fixed at 5 milliseconds.
The lower limit for the minimum acceptable bandwidth is determined by statistically analyzing the data transmission volume per unit time for each communication medium under normal working load during the trial operation phase, converting this data transmission volume into a bandwidth value, taking the minimum value among multiple measurements, and then subtracting a certain safety margin to obtain the final lower limit for the bandwidth. In this embodiment, the lower acceptable bandwidth value is converted into a fixed value for the amount of data transmitted per second. The device health information management module reads all the above thresholds from the configuration file into memory once when the system starts up and keeps them unchanged throughout the entire operation.
[0026] At the end of each monitoring cycle, the device health information management module performs anomaly checks on each communication medium individually. Specifically, it compares the temperature, number of error events, number of retransmissions, number of packet losses, average round-trip time (RTT), and bandwidth values collected for that medium during the cycle with the aforementioned upper limits for temperature, number of error events, number of retransmissions, number of packet losses, and average RTT, and with the lower limit for minimum acceptable bandwidth. If the temperature value exceeds the upper limit, a temperature anomaly is determined; if the number of error events exceeds the upper limit, an error event anomaly is determined; if the number of retransmissions exceeds the upper limit, a retransmission anomaly is determined; if the number of packet losses exceeds the upper limit, a packet loss anomaly is determined; and if the average RTT value exceeds the upper limit, a delay anomaly is determined. If the bandwidth value calculated by dividing the total data volume by the monitoring period is less than the minimum acceptable bandwidth value, a bandwidth insufficiency anomaly is determined. If the link status is marked as interrupted, a link interruption anomaly is determined. As long as any of the above anomalies are established, the device health information management module will consider the communication medium as having a health problem within the current monitoring period and generate a preliminary fault identification record to write into the preliminary fault identification set. This record includes at least the node identifier, the identifiers of the two GPU computing units connected to the medium, the medium identifier, the medium type, the specific indicator type that triggered the anomaly, the actual value of the indicator within the current period, and the end time of the current monitoring period.
The preliminary fault identification set is cleared and refilled within each monitoring period, thus completely recording the list of all communication media that were detected as having anomalies within the current period.
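The per-medium anomaly check can be sketched as follows. The sample dictionary keys and the function `check_medium` are hypothetical, and `MIN_BANDWIDTH_BYTES_PER_S` is an assumed value because the text fixes the bandwidth lower limit without stating it; the other limits are the ones given for this embodiment.

```python
from typing import Dict, List

PERIOD_S = 60  # monitoring cycle length in this embodiment

# Upper limits as stated in the text for this embodiment.
UPPER_LIMITS = {
    "temperature_c": 85,
    "error_events": 10,
    "retransmissions": 20,
    "packet_losses": 20,
    "avg_rtt_ms": 5,
}
MIN_BANDWIDTH_BYTES_PER_S = 1_000_000  # hypothetical lower limit


def check_medium(sample: Dict[str, object]) -> List[str]:
    """Return the anomaly types triggered by one medium in one cycle."""
    anomalies = [name for name, limit in UPPER_LIMITS.items() if sample[name] > limit]
    # Bandwidth is derived from the total data volume over the cycle.
    if sample["bytes_sent"] / PERIOD_S < MIN_BANDWIDTH_BYTES_PER_S:
        anomalies.append("bandwidth")
    if sample["link_state"] == "interrupted":
        anomalies.append("link_interrupted")
    return anomalies
```

A non-empty result corresponds to writing one or more preliminary fault identification records for that medium in the current cycle.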
[0027] After obtaining the preliminary fault identifier set, the device health information management module dynamically updates the weight value of each communication medium based on this set. The medium weight value quantifies the risk level of the medium in subsequent path calculations; a higher value indicates a higher risk. The weight values are stored in the data maintained internally by the device health information management module. Each record corresponds to one communication medium and includes at least the current weight value, historical fault count, most recent fault time, cumulative fault duration period, and current load ratio. The historical fault count is used to count the total number of times the medium has been recorded in the preliminary fault identifier set from system startup to the current moment. The most recent fault time is the time of the most recent fault identifier record. The cumulative fault duration period is used to count how many monitoring periods the most recent continuous fault has lasted. The current load ratio is the ratio obtained by dividing the bandwidth value of the medium in the current period by the rated bandwidth value of the medium. The rated bandwidth value is written into the configuration file according to the hardware specifications during the deployment phase and read in when the system starts. The weight value initialization rule is as follows: when the system starts or a new communication medium is added to the management, its current weight value is set to 0, the historical fault count is set to 0, the cumulative fault duration period is set to 0, and the most recent fault time is empty. During the deployment phase, the device health information management module also configures a target load ratio and a load difference threshold for each type of media.
The target load ratio is obtained by statistically analyzing the average load level of similar media under long-term stable operation during the trial operation phase. For example, if a certain type of high-speed interconnect media maintains a load level of about 60% in most task scenarios, the target load ratio is set to 0.6. The load difference threshold is determined by comparing the deviation between the target load ratio and the actual load ratio in multiple rounds of operation with the task execution success rate. In this embodiment, the load difference threshold is fixed at 0.2, which means that when the deviation of the current load ratio of a certain medium from the target load ratio exceeds 0.2, the load is considered to be abnormally high or low.
[0028] The execution process of the dynamic weight update rule is as follows: The device health information management module traverses each fault identifier record in the preliminary fault identifier set. For each record, it retrieves the current weight value, historical fault count, most recent fault time, cumulative fault duration period, and current load ratio of the corresponding communication medium from the internal data. It increments the historical fault count by 1 to obtain the new historical fault count, writes the end time of the current monitoring period into the most recent fault time field, and then determines whether the medium also appeared in the preliminary fault identifier set in the previous monitoring period. If so, it increments the cumulative fault duration period by 1; otherwise, it resets the cumulative fault duration period to 1. Then, based on the type of the medium, it reads the corresponding target load ratio and load difference threshold from the configuration, subtracts the target load ratio from the current load ratio, and takes the absolute value to obtain the current load difference value of the medium. When the load difference value is greater than the load difference threshold of 0.2, the medium is marked as a load abnormal medium. The device health information management module then calculates the weight increment value for the communication medium. The weight increment value consists of three parts: basic weight increment, historical fault additional increment, and load additional increment. The basic weight increment is fixed at 1, meaning that any medium detected as abnormal will have its weight increased by at least 1. The historical fault additional increment is 1 when the number of historical faults of the medium reaches or exceeds 3 in the last 24 hours, and 0 otherwise. 
The time window of the last 24 hours is determined by comparing the current time with the time of the most recent fault and by tracing back the historical fault records. The load additional increment is 1 when the current load difference value of the medium is greater than the load difference threshold of 0.2, and 0 otherwise. The final weight increment value is equal to the sum of the three items. Therefore, in this embodiment, the weight increment value is at least 1 and at most 3. The device health information management module adds the current weight value to the weight increment to obtain a new weight value. To prevent the weight from growing indefinitely or becoming negative, a lower limit of 0 and an upper limit of 10 are defined for the weight value during the deployment phase. When the calculated new weight value is greater than 10, it is truncated to 10; when the new weight value is less than 0, it is truncated to 0. The truncated weight value is then written back to the record of the communication medium. After the preliminary fault identifier set is traversed, the weight values of all media that experienced anomalies in this period are updated. For media that do not appear in the preliminary fault identifier set, the device health information management module can reset its cumulative fault duration period count to zero at the end of each monitoring period and slowly decay its weight value in small fixed steps as needed. For example, the weight value is reduced by 1 but not lower than 0 for each fault-free period to reflect that long-term healthy operation will gradually reduce the risk. After the weight update is completed, the device health information management module filters out media with a current weight value greater than 0 from all communication media records. These media are then sorted by weight value from largest to smallest to form a weight-adjusted media list. This list retains media that have recently experienced anomalies or have load deviations.
Each record in the list includes at least the media identifier, the identifier of the node to which it belongs, the identifiers of the two connected GPU computing units, the media type, the current weight value, the number of historical failures, the cumulative number of failure durations, and the current load ratio.
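The weight update, truncation, decay, and list construction described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names, the dictionary layout, and the idea of passing the already-summed increment as an input are assumptions made for the example; only the bounds 0 and 10, the decay step of 1, and the descending sort come from the description.

```python
W_MIN, W_MAX = 0, 10  # lower and upper limits stated in this embodiment


def clamp_weight(value):
    """Truncate a weight value to the range [W_MIN, W_MAX]."""
    return max(W_MIN, min(W_MAX, value))


def update_weights(weights, increments, decay_step=1):
    """Return new weights for one monitoring period.

    weights:    medium id -> current weight value (hypothetical layout)
    increments: medium id -> weight increment, only for media that appear
                in the fault identifier set of this period
    Media without an increment are treated as fault-free and decay.
    """
    updated = {}
    for medium_id, weight in weights.items():
        if medium_id in increments:
            updated[medium_id] = clamp_weight(weight + increments[medium_id])
        else:
            updated[medium_id] = clamp_weight(weight - decay_step)
    return updated


def weight_adjusted_list(weights):
    """Keep media with weight > 0, sorted from largest to smallest weight."""
    kept = [(m, w) for m, w in weights.items() if w > 0]
    return sorted(kept, key=lambda item: item[1], reverse=True)


weights = {"m1": 9, "m2": 1, "m3": 0, "m4": 4}
increments = {"m1": 3, "m4": 2}
new_weights = update_weights(weights, increments)
media_list = weight_adjusted_list(new_weights)
```

In this example, medium m1 would be truncated to the upper limit and media m2 and m3 would drop out of the list because their weights reach 0.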
[0029] After obtaining the media list with adjusted weights, the system further determines connection failures based on the list and updates the global connection view. The connection failure determination parameters are uniformly set during deployment and trial operation, including the failure duration period threshold and the weighted failure threshold. The failure duration period threshold is used to require a certain type of abnormality to occur continuously for a certain number of periods before it is determined to be a real connection failure, in order to avoid false judgments caused by short-term fluctuations. In this embodiment, the failure duration period threshold is fixed at 3 monitoring periods. The weighted failure threshold is used to confirm the overall health status of the media through the weight value. In this embodiment, the weighted failure threshold is fixed at 10. An additional duration threshold is configured to determine whether high weights persist. In this embodiment, the duration threshold is fixed at 30 minutes. 
At the end of each monitoring period, the device health information management module iterates through the weight-adjusted media list and executes the connection failure detection logic for each media record. If the cumulative number of failure duration periods for the medium is greater than or equal to the failure duration period threshold of 3, and each of the last 3 monitoring periods has at least one of the following abnormalities: abnormal temperature, abnormal error event, abnormal retransmission, abnormal packet loss, abnormal latency, or insufficient bandwidth, then a connection failure caused by abnormal indicators continuously exceeding their limits is determined. If the current weight value of the medium is equal to the weighted failure threshold of 10 and the duration calculated from the last failure time and the current time is greater than or equal to 30 minutes, then a connection failure caused by long-term high risk is determined. If the link status mark of the medium has been continuously interrupted in the last 3 monitoring periods, then a physical link interruption failure is determined. As long as any one of the above three conditions is met, the device health information management module generates a faulty unit record, which includes the media identifier, the identifier of the node to which it belongs, the identifiers of the two connected GPU computing units, the media type, the failure type, and the failure confirmation time, and sends the record to the component that maintains the global connection view to trigger a view update. The global connection view stores all connection records that the system currently considers usable for path computation. Each record contains at least the medium identifier corresponding to the connection between nodes, the node to which it belongs, the identifiers of the connected GPU computing units, and the path attribute information of the connection. 
After receiving a faulty unit record, the global connection view first searches for connection records with the same medium identifier in its own data and deletes the record from the global connection view. Then, it checks whether the two GPU computing units of the faulty medium connection still maintain at least one healthy connection with any node in the cluster through other media. If a GPU computing unit no longer appears in the view, all records related to that GPU computing unit are deleted or marked as unavailable, and that GPU computing unit is excluded from subsequent path computation. When there are multiple faulty unit records in a monitoring period, the global connection view performs the above deletion operation on each record in turn. After completion, the view data after removal is obtained. The view data after removal fully reflects the remaining global connection relationships after removing all faulty connections and unavailable GPU computing units in this period.
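The three failure conditions and the removal step can be sketched as below. This is only an illustration under stated assumptions: the `MediumRecord` fields, the failure type strings, and the simplified view layout (medium identifier mapped to a pair of GPU identifiers) are hypothetical; the thresholds 3, 10, and 30 minutes are the values given in this embodiment.

```python
from dataclasses import dataclass
from typing import List

FAULT_PERIOD_THRESHOLD = 3    # failure duration period threshold
WEIGHT_FAULT_THRESHOLD = 10   # weighted failure threshold
DURATION_THRESHOLD_MIN = 30   # duration threshold in minutes


@dataclass
class MediumRecord:              # hypothetical record layout
    medium_id: str
    weight: int
    fault_periods: int           # cumulative failure duration periods
    abnormal_history: List[bool]   # one flag per period, newest last
    link_down_history: List[bool]  # one flag per period, newest last
    minutes_since_last_fault: float


def _last_n_all(history, n):
    """True when the last n periods exist and are all flagged."""
    recent = history[-n:]
    return len(recent) == n and all(recent)


def classify_failure(rec):
    """Return a failure type string, or None if no condition is met."""
    if (rec.fault_periods >= FAULT_PERIOD_THRESHOLD
            and _last_n_all(rec.abnormal_history, FAULT_PERIOD_THRESHOLD)):
        return "persistent_abnormal_indicators"
    if (rec.weight == WEIGHT_FAULT_THRESHOLD
            and rec.minutes_since_last_fault >= DURATION_THRESHOLD_MIN):
        return "long_term_high_risk"
    if _last_n_all(rec.link_down_history, 3):
        return "physical_link_interruption"
    return None


def remove_faulty(view, faulty_ids):
    """Delete faulty media from the view.

    view: medium id -> (gpu_a, gpu_b). Returns the pruned view and the
    set of GPU computing units that no longer appear in it.
    """
    faulty = set(faulty_ids)
    before = {g for pair in view.values() for g in pair}
    pruned = {m: pair for m, pair in view.items() if m not in faulty}
    after = {g for pair in pruned.values() for g in pair}
    return pruned, before - after
```

GPU computing units returned in the second element would be the ones to exclude from subsequent path computation.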
[0030] After obtaining the view data after removal, the system uses a backup media switching mechanism to restore node pairs whose paths were interrupted by a failure. Based on the restored data, the system updates the healthy connection dataset. The backup media switching mechanism is a concrete implementation of backup path allocation. Internally, it pre-stores a backup path list and a set of backup media information. The backup path list is generated offline during the system deployment and early operation phases. The generation process enumerates possible multi-hop paths between nodes based on the known physical topology and historical connection data. For each pair of distinct nodes, feasible backup paths composed of combinations of healthy media are listed, excluding media on the current primary path. For each backup path, the cumulative delay value of the path is calculated, and the starting node identifier, the ending node identifier, the order of intermediate nodes traversed by the path, the media identifier used by each connection in the path, the media type, and the cumulative delay value are written into the backup path list. The backup media information is updated in real time by the device health information management module based on the current operating status data and weight data. Each record includes the media identifier, the node to which it belongs, the media type, the current weight value, and the current health status flag. When the current weight value of a medium is 0 and no abnormality occurs within a monitoring period, the current health status flag is set to healthy. A medium that is in a healthy state but has not yet been used on a primary path is regarded as a backup medium. After each monitoring period ends and the view data after removal is obtained, the backup media switching mechanism first checks whether each pair of nodes still has at least one connection path in the view. 
If a pair of nodes that previously had a connection no longer has any direct or indirect connection in the current view, the node pair is marked as a path interruption node pair. Subsequently, for each path interruption node pair, the backup path list is searched for all backup path records with the same start and end nodes as the node pair. These records are sorted in ascending order of cumulative delay value. Then, each segment of media contained in a path is checked to see whether it appears in the current backup media information and is marked as healthy, and whether its current weight value is less than the backup medium weight threshold. The backup medium weight threshold is set to 3 in this embodiment after repeated testing of backup path stability during the trial operation phase. When all media in a backup path are healthy and have a weight value less than 3, this backup path is used as a candidate backup path for the node pair. The candidate with the smallest cumulative delay value in the sorted list is selected as the final activated backup path. The media and connection relationship corresponding to each connection segment in this path are written into the view data after removal to complete the actual access of the backup path. The health status of these media in the backup media information is updated to occupied to prevent the same media from being reused by multiple backup paths. After all path interruption node pairs are processed, the view data after removal, supplemented with the new backup paths, forms the post-repair view data.
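A minimal sketch of the backup path selection for one path interruption node pair is given below. The record layouts (`media`, `delay`, `weight`, `status`) and the function name are assumptions chosen for illustration; the threshold of 3, the ascending sort by cumulative delay, and the occupied marking follow the description above.

```python
BACKUP_WEIGHT_THRESHOLD = 3  # backup medium weight threshold in this embodiment


def select_backup_path(candidates, media_info):
    """Pick the lowest-delay backup path whose media are all usable.

    candidates: list of {"media": [medium ids], "delay": cumulative delay}
                for one node pair (hypothetical layout)
    media_info: medium id -> {"weight": int, "status": str}
    The media of the chosen path are marked "occupied" in place.
    Returns the chosen path record, or None when no path qualifies.
    """
    for path in sorted(candidates, key=lambda p: p["delay"]):
        usable = all(
            m in media_info
            and media_info[m]["status"] == "healthy"
            and media_info[m]["weight"] < BACKUP_WEIGHT_THRESHOLD
            for m in path["media"]
        )
        if usable:
            for m in path["media"]:
                media_info[m]["status"] = "occupied"
            return path
    return None


media_info = {
    "a": {"weight": 0, "status": "healthy"},
    "b": {"weight": 1, "status": "healthy"},
    "c": {"weight": 3, "status": "healthy"},
    "d": {"weight": 0, "status": "healthy"},
}
first = select_backup_path(
    [{"media": ["c"], "delay": 1.0},
     {"media": ["a", "b"], "delay": 2.0},
     {"media": ["d"], "delay": 5.0}],
    media_info,
)
second = select_backup_path(
    [{"media": ["a"], "delay": 1.0}, {"media": ["d"], "delay": 3.0}],
    media_info,
)
```

Because medium a is already occupied by the first node pair, the second node pair falls back to the path over medium d even though its delay is higher.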
[0031] Based on this, the device health information management module generates an updated healthy connection dataset according to the latest post-repair view data. Each record in the healthy connection dataset corresponds to a connection unit currently recognized as healthy by the system, and includes at least the node identifier, the identifiers of the two connected GPU computing units, the communication medium identifier, the medium type, the current weight value, and the number of error events, the number of retransmissions, the number of packet losses, the average round-trip latency, and the bandwidth usage ratio for that connection in the most recent monitoring period. The generation process is as follows: the device health information management module iterates through each connection record in the post-repair view data. For each connection record, it uses the media identifier to retrieve the current weight value of the medium from the weight data and checks whether the medium appears in the preliminary fault identifier set for the current monitoring period. If the current weight value of the medium is less than the backup medium weight threshold of 3 and the medium is not recorded in the preliminary fault identifier set for the current period, the connection is identified as a healthy connection. The system reads the number of error events, retransmissions, and packet losses, the average round-trip latency, and the bandwidth usage ratio corresponding to the connection from the operational status data, and combines them with the basic connection information to form a healthy connection record, which is then added to the updated healthy connection dataset. If the current weight value of the medium is greater than or equal to 3, or there are abnormal records in the current period, the connection is not added to the healthy connection dataset. After the traversal is complete, the updated healthy connection dataset retains only the connection units that meet the health conditions. 
This dataset reflects both the connection status after fault detection and weight adjustment to remove faulty mediums and the effective connection status after the critical path is repaired through the backup medium switching mechanism. It provides complete and health-screened basic data for subsequent candidate subgraph generation and GPU allocation calculation, thereby reducing the risk of using potentially problematic mediums while ensuring availability.
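The screening rule for the healthy connection dataset can be illustrated with a short sketch. The field names and the dictionary-based inputs are assumptions for the example; the rule itself (weight below 3 and not in the fault identifier set of the current period) is the one stated above.

```python
HEALTHY_WEIGHT_LIMIT = 3  # same value as the backup medium weight threshold


def build_healthy_dataset(view_records, weights, fault_ids, stats):
    """Filter post-repair view records down to healthy connections.

    view_records: list of dicts with at least "medium_id" (hypothetical)
    weights:      medium id -> current weight value
    fault_ids:    set of medium ids flagged in the current period
    stats:        medium id -> dict of operating statistics
    """
    healthy = []
    for rec in view_records:
        medium_id = rec["medium_id"]
        weight = weights[medium_id]
        if weight < HEALTHY_WEIGHT_LIMIT and medium_id not in fault_ids:
            merged = dict(rec)                     # basic connection info
            merged["weight"] = weight
            merged.update(stats.get(medium_id, {}))  # operating statistics
            healthy.append(merged)
    return healthy


records = [{"medium_id": m} for m in ("m1", "m2", "m3", "m4")]
weights = {"m1": 0, "m2": 3, "m3": 2, "m4": 2}
dataset = build_healthy_dataset(
    records, weights, fault_ids={"m3"}, stats={"m1": {"packet_loss": 0}}
)
```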
[0032] S4 includes obtaining global connection view data from the updated healthy connection dataset, extracting multiple candidate subgraphs by dividing view nodes using a graph segmentation method, and obtaining an extracted subgraph set; for the extracted subgraph set, obtaining the internal path list of each subgraph by path counting in combination with subgraph size constraints, and determining the path list data; based on the path list data, calculating the length value of the communication path by accumulating the distance between nodes using path length evaluation logic, and obtaining the length evaluation result; based on the length evaluation result, integrating the backup path integration mechanism by replacing the weight parameters of each path with backup links to obtain a cumulative weight set; for the cumulative weight set, comparing the differences between subgraphs and determining the initial connection cost by the weight summation, and obtaining the cost distribution of each subgraph.
[0033] In this embodiment, after completing step S3, an updated healthy connection dataset is obtained. Each record in this dataset corresponds to a communication connection that is currently in a healthy state. It includes at least the node identifier of the computing node to which it belongs, the starting GPU computing unit identifier, the ending GPU computing unit identifier, the communication medium type, the current weight value of the communication medium determined in step S3, the distance between nodes in the physical topology, and a flag indicating whether the connection is a backup link. The node identifier is numbered sequentially according to the physical node order and written into the configuration during cluster deployment. The GPU computing unit identifier is numbered sequentially according to the slot order within each node and remains unchanged. The communication medium type is set with several fixed enumeration values based on the actual interconnect bus and network type used during deployment. The current weight value is obtained in step S3 through a dynamic weight update rule and is an integer not less than 0; a larger value indicates a higher risk level for the connection. The distance between nodes is determined by maintenance personnel during cluster deployment based on the physical location of the rack. The switching device hierarchy is pre-calculated. For connections between GPUs within the same computing node, the distance between nodes is fixed at 1. For cross-node connections, the number of switching device hierarchy levels traversed from the starting node to the ending node is used as the distance between nodes for that connection and is fixedly written into the configuration. The backup link marker is determined in step S3 when a backup path is allocated through the backup medium switching mechanism. 
Newly added connections through the backup path are marked as backup links, while existing primary connections are marked as non-backup links. When executing step S4, a global connection view data is first constructed based on the updated healthy connection dataset. Each connection record in the dataset is mapped to an edge in the graph structure, and each GPU computing unit appearing in the dataset is mapped to a view node in the graph. The edge retains the node identifier, GPU computing unit identifier, communication medium type, current weight value, distance between nodes, and backup link marker, thus forming an undirected weighted graph in memory, which is the global connection view data.
[0034] Subsequently, in order to extract multiple candidate subgraphs using the graph segmentation method, it is necessary to first determine the subgraph size limit parameter. The subgraph size limit parameter is a positive integer representing the upper limit of the number of GPU computing units allowed in each subgraph. This parameter is determined based on task statistics during the cluster trial operation phase. Specifically, during the trial operation, the maximum number of GPU computing units required by any executed task is recorded, and this maximum value is multiplied by 2 to obtain the value of the subgraph size limit parameter. In this embodiment, if the maximum GPU requirement of a task observed during the trial operation phase is 8, the subgraph size limit parameter is fixed at 16 and written into the configuration file for subsequent segmentation. At the same time, for the task currently being processed, the user has already entered the number of GPU computing units required by the task during the task submission phase. This parameter is an integer greater than 0, which has been parsed and saved in step S1 and is directly read and used in step S4. The graph segmentation method is implemented as follows: First, extract the GPU computing unit identifiers of all view nodes from the global connection view data. Sort these identifiers by node identifier and GPU computing unit identifier to generate an unassigned node list. Then, starting from the first node in the unassigned node list, create a new candidate subgraph for that node, add the node to the node set of that subgraph, and simultaneously create a queue of nodes to be expanded and add the node to that queue. Then, execute a loop expansion process. In each loop, take the head node from the queue to be expanded and query the global connection view data for adjacent nodes that are directly connected to that node via a healthy connection and whose corresponding edge records still exist. 
For each adjacent node, if the adjacent node does not appear in the node set of any subgraph and the number of nodes in the current subgraph node set is less than the subgraph size limit parameter 16, add the adjacent node to the node set of the current subgraph and add it to the queue of nodes to be expanded, thus gradually expanding the subgraph while maintaining connectivity. If, during the expansion process, the number of nodes in the current subgraph node set equals the subgraph size limit parameter 16, the expansion of the subgraph stops; even if there are still unvisited adjacent nodes, they are not added to the subgraph. When the queue to be expanded is empty, all connected regions that can be covered within the size limit using the starting node as the seed have been added to the current subgraph. At this time, the node set of the current subgraph is determined. The system scans the unassigned node list, deletes all node identifiers belonging to the current subgraph node set from the unassigned node list, and packages the node set of the subgraph and all edge records inside the subgraph into a subgraph record, which is added to the extracted subgraph set. Then, the unassigned node list is checked. If there are still nodes in the list, the next node in the list is used as the starting node of a new subgraph and the above process is repeated until the unassigned node list is empty or the number of remaining nodes is less than the number of GPU computing units required for the current task. At this time, the extracted subgraph set is constructed. Each subgraph in the set contains several GPU computing units and all healthy connections between them, and the number of nodes in each subgraph does not exceed the subgraph size limit parameter 16.
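The seed-and-expand segmentation described above is essentially a size-capped breadth-first search. The sketch below assumes, for illustration only, that each view node is a `(node_id, gpu_id)` tuple and that edges are undirected pairs of such tuples; the function name and the returned record layout are hypothetical.

```python
from collections import deque


def segment_view(nodes, edges, size_limit=16, required=1):
    """Split the view into connected subgraphs of at most size_limit nodes.

    nodes: iterable of (node_id, gpu_id) tuples
    edges: list of (u, v) pairs of healthy connections
    required: GPU count requested by the current task; segmentation stops
              once fewer than this many nodes remain unassigned.
    """
    adjacency = {n: [] for n in nodes}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    unassigned = sorted(adjacency)  # by node id, then GPU id
    assigned = set()
    subgraphs = []
    while unassigned and len(unassigned) >= required:
        seed = unassigned[0]
        members = [seed]
        assigned.add(seed)
        queue = deque([seed])
        # Expand until the queue is empty or the size limit is reached.
        while queue and len(members) < size_limit:
            current = queue.popleft()
            for neighbor in sorted(adjacency[current]):
                if neighbor not in assigned and len(members) < size_limit:
                    members.append(neighbor)
                    assigned.add(neighbor)
                    queue.append(neighbor)
        member_set = set(members)
        inner_edges = [(u, v) for u, v in edges
                       if u in member_set and v in member_set]
        subgraphs.append({"nodes": members, "edges": inner_edges})
        unassigned = [n for n in unassigned if n not in member_set]
    return subgraphs


chain = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)]
links = list(zip(chain, chain[1:]))
parts = segment_view(chain, links, size_limit=3)
```

With a size limit of 3, the five-node chain in this example would be split into one subgraph of three GPUs and one of two.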
[0035] After extracting candidate subgraphs, for the extracted subgraph set, it is necessary to obtain the internal path list of each subgraph by path counting, taking into account the subgraph size limit, and determine the path list data. The purpose of path counting is to enumerate all simple communication paths in each subgraph within the hop count limit. To control the complexity of path enumeration, the maximum path hop count parameter also needs to be determined in advance during the deployment phase. The maximum path hop count parameter is a positive integer representing the maximum number of edges allowed in a path when counting internal paths. This parameter is determined during the trial operation phase by testing the communication latency and task execution time under different hop counts. For example, during the trial operation, it was found that when a path contains more than 4 connections, the latency increases significantly and the task performance decreases significantly. Therefore, in this embodiment, the maximum path hop count parameter is fixed at 4 and this value is written into the configuration. During the path counting process, the system processes each subgraph in the extracted subgraph set sequentially. First, it determines whether the number of nodes in the subgraph is greater than or equal to the number of GPU computing units required by the current task. If the number of nodes in a subgraph is less than the number of GPU computing units required by the task, the subgraph cannot carry the task on its own and does not participate in the path counting. Instead, it is kept in the set for use by other tasks. 
For subgraphs with a number of nodes greater than or equal to the number of GPU computing units required by the task, the system sorts all GPU computing units in the subgraph by their identifiers in ascending order and uses each sorted GPU computing unit as the starting point for path search to perform path enumeration with a limited number of hops.
[0036] The specific enumeration algorithm is as follows: Before the first starting node of a subgraph is processed, the internal path list of the subgraph is initialized to empty. When processing a starting node within the subgraph, first create a path record containing that starting node and add it to the current processing queue. In each loop, retrieve the first path from the current processing queue, read the last node of that path as the current expansion node, and search for all adjacent nodes directly connected to the current expansion node within the subgraph. For each adjacent node, check whether it already appears in the node sequence of the current path. If it does, skip it to avoid forming a cycle. If it does not, further check whether the number of edges in the current path has reached the maximum path hop count parameter of 4. If the number of edges is less than 4, copy the current path to generate a new path, add the adjacent node to the end of the node sequence of the new path, and add an edge identifier from the current expansion node to the adjacent node to the end of the edge sequence of the new path. Add the new path to the current processing queue and write it to the internal path list of the subgraph. If the current path already has 4 edges, it is not expanded further, but it remains in the internal path list as a complete communication path. When the current processing queue is empty, all simple paths with a hop count of no more than 4 originating from the current starting node have been recorded in the internal path list. After processing the current starting node, the system continues with the next node until all nodes in the subgraph have been processed as origins. Finally, the internal path list of the subgraph is obtained. Each record in the internal path list includes at least the subgraph identifier, the origin GPU computing unit identifier, the destination GPU computing unit identifier, and the sequence of nodes and edges in the path. 
To avoid the same path being counted repeatedly, the system compares the node sequence of the path before writing a new path to the list. For paths with the same node sequence and the same origin and destination, only one record is kept, thus forming a deduplicated path list. The path list data also needs to record the path number of each path for subsequent reference by path.
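The hop-limited enumeration with deduplication can be sketched as a queue-based search. In this illustration the subgraph is given as an adjacency mapping and paths are stored as tuples of node identifiers; both choices, as well as the function name, are assumptions. Edge sequences and path numbers are omitted for brevity. As in the description, a path and its reverse have different node sequences and are therefore both kept.

```python
from collections import deque


def enumerate_paths(adjacency, max_hops=4):
    """List all simple paths with 1..max_hops edges inside one subgraph.

    adjacency: node -> list of directly connected nodes; a neighbor may
               appear twice when a primary and a backup link are parallel.
    """
    paths = []
    seen = set()  # node sequences already recorded (deduplication)
    for start in sorted(adjacency):
        queue = deque([(start,)])
        while queue:
            path = queue.popleft()
            if len(path) - 1 >= max_hops:
                continue  # hop limit reached, do not expand further
            for neighbor in adjacency[path[-1]]:
                if neighbor in path:
                    continue  # would form a cycle
                new_path = path + (neighbor,)
                if new_path in seen:
                    continue  # same node sequence already recorded
                seen.add(new_path)
                paths.append(new_path)
                queue.append(new_path)
    return paths


# Triangle with a parallel (duplicate) link between a and b.
triangle = {"a": ["b", "b", "c"], "b": ["a", "a", "c"], "c": ["a", "b"]}
triangle_paths = enumerate_paths(triangle, max_hops=2)

line = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}
line_paths = enumerate_paths(line, max_hops=4)
```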
[0037] After obtaining the path list data, the system uses path length evaluation logic to calculate the length of each communication path by accumulating the distances between nodes, thus obtaining the length evaluation result. The key parameter in the path length evaluation logic is the distance between nodes. This parameter is configured according to the physical topology during the cluster deployment phase. For connections between GPUs within the same physical node, this distance is fixed at 1. For cross-node connections, this distance is the number of network switching layers that must be traversed from one node to another, and is a positive integer greater than or equal to 1. These values are uniformly written into the node distance table during deployment and loaded into memory at system runtime. The specific process of length evaluation is as follows: the system traverses each path record in the path list data. For each path, it reads the node sequence of the path, taking the first and second nodes as the first node pair, the second and third nodes as the second node pair, and so on until the second-to-last node and the last node form the last node pair. For each node pair, the system looks up the corresponding node distance value in the node distance table. If the two nodes are on the same physical node, the node distance value is 1. If the two nodes are on different physical nodes, the node distance value is the number of switching layers recorded during deployment. The system first uses the node distance value of the first node pair as the current path length accumulation value. Then, it adds the node distance values of each subsequent node pair to this accumulation value. 
For example, when a path contains 3 node pairs, it first records the distance value of the first node pair, then adds the distance value of the second node pair to get a new accumulation value, and then adds the distance value of the third node pair to get the total length value of the path. The total length value is a positive integer. The larger the value, the more node distances the path spans. The system writes this total length value, along with the subgraph identifier, start identifier, and end identifier of the path, into the length evaluation result data to form a length evaluation result set.
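The length evaluation reduces to summing one distance per consecutive node pair. The sketch below assumes GPU computing units are identified as `(node_id, gpu_id)` tuples and that the node distance table is a dictionary keyed by an ordered pair of physical node identifiers; both are illustrative choices, not part of the description.

```python
def node_distance(u, v, switch_layers):
    """Distance between two GPU computing units.

    u, v: (node_id, gpu_id) tuples
    switch_layers: (smaller node id, larger node id) -> number of switching
                   layers between the two physical nodes (hypothetical table)
    """
    if u[0] == v[0]:
        return 1  # same physical node
    key = (min(u[0], v[0]), max(u[0], v[0]))
    return switch_layers[key]


def path_length(path, switch_layers):
    """Accumulate the distance of every consecutive node pair on the path."""
    return sum(node_distance(u, v, switch_layers)
               for u, v in zip(path, path[1:]))


table = {(0, 1): 2, (1, 2): 3}
example_path = [(0, 0), (0, 1), (1, 0), (2, 0)]  # 3 node pairs
total = path_length(example_path, table)  # 1 + 2 + 3
```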
[0038] After obtaining the length evaluation result, the system applies the backup path integration mechanism based on this result. It applies backup link replacement to the connections of each path and accumulates the weight parameters of each path to obtain a cumulative weight set. The parameters used in the backup path integration mechanism include the path length, the distance between nodes, the current connection weight, the length influence coefficient, and the backup replacement threshold. The length influence coefficient is a positive integer representing the degree of influence of path length on path weight. This coefficient is determined during the trial operation phase by comparing the differences in task execution time and error rate between paths of different lengths. For example, during the trial operation, the lengths of multiple paths and the corresponding task execution times are measured, and it is found that when the path length increases by 1 unit, the task execution time increases on average by an acceptable fixed percentage. Based on this relationship, the maintenance personnel select an integer as the length influence coefficient. In this embodiment, the length influence coefficient is fixed at 1, meaning that for every 1 unit increase in the distance between nodes, a fixed length contribution is added to the weight parameter of that connection segment. The backup replacement threshold is a positive integer used to determine when the current main link needs to be replaced by a backup link. This threshold is determined during the trial operation phase based on the relationship between the current weight value of a connection and its actual failure rate. Specifically, long-term monitoring of a large number of connections shows that when the current weight value exceeds a certain value, the probability of the connection experiencing a serious failure in the subsequent period increases significantly. The integer near this dividing point is used as the backup replacement threshold. 
In this embodiment, the backup replacement threshold is fixed at 5.
[0039] The cumulative weight calculation process is as follows: The system traverses each path in the path list data. When processing a path, it first finds the corresponding record in the length evaluation result set and reads the total length value of the path. It then processes each connection on the path segment by segment according to the edge sequence of the path. For each connection segment on the path, the system obtains the current weight value and backup link marker of the connection from the updated healthy connection dataset, and obtains the corresponding node distance value from the node distance table. Then, it calculates the weight parameter for the connection segment. Specifically, it multiplies the node distance value by the length influence coefficient to obtain the length contribution value of the connection segment, and then adds the length contribution value to the current weight value to obtain the initial weight parameter of the connection segment. This initial weight parameter is an integer not less than the current weight value. Subsequently, it determines whether to apply the backup link replacement logic to the connection segment based on the backup replacement threshold. If the current weight value of the connection segment is less than the backup replacement threshold of 5, the risk level of the connection is considered to be still within an acceptable range; the system does not perform a replacement operation and directly uses the initial weight parameter as the final weight parameter for this connection segment. If the current weight value of the connection segment is greater than or equal to 5, the connection is considered to be of high risk. 
The system searches in the updated healthy connection dataset for other connection records that have the same starting and ending GPU computing units as this connection segment and whose backup links are marked as backups. If no backup links that meet the conditions are found, replacement is not possible, and the system still uses the initial weight parameters of the main connection as the final weight parameters for this connection segment. If several backup links that meet the conditions are found, the system calculates the sum of the length contribution value and the current weight value of each backup link according to the aforementioned method to obtain their respective candidate weight parameters. Then, the smallest value among these candidate weight parameters is selected as the replacement weight parameter for this connection segment, and this replacement weight parameter is used as the final weight parameter for this connection segment. The initial weight parameters of the original main connection are no longer used in the weight accumulation of this path. In this way, the high-risk main link is replaced by a backup link.
[0040] The system sequentially performs the aforementioned weight parameter calculation and alternative replacement judgment for each connection segment on the path, and accumulates the final weight parameters of each connection segment to obtain the cumulative weight value of the path. The accumulation method is to use the final weight parameter of the first connection segment as the initial cumulative value, and then add the final weight parameters of the second connection segment, the third connection segment, and so on until the last connection segment to the cumulative value. The integer obtained after the calculation is the cumulative weight value of the path. This value reflects both the path length and the comprehensive cost of all connection risks on the path. The system writes the cumulative weight value of each path, along with the corresponding subgraph identifier, start identifier, and end identifier, into the cumulative weight set to form a record in the cumulative weight set.
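The per-segment weight parameter, the backup replacement rule, and the accumulation can be sketched as follows. The dictionary fields and function names are assumptions for illustration; the length influence coefficient of 1 and the backup replacement threshold of 5 are the values of this embodiment. Note that, following the description, a high-risk segment takes the smallest backup candidate whenever a backup link exists, without comparing it against the primary link.

```python
LENGTH_COEFFICIENT = 1          # length influence coefficient
BACKUP_REPLACE_THRESHOLD = 5    # backup replacement threshold


def weight_parameter(distance, weight):
    """Length contribution plus current weight value of one connection."""
    return distance * LENGTH_COEFFICIENT + weight


def final_segment_weight(primary, backups):
    """Final weight parameter of one connection segment.

    primary: {"distance": int, "weight": int} for the main connection
    backups: list of the same layout for backup links that join the same
             two GPU computing units (may be empty)
    """
    initial = weight_parameter(primary["distance"], primary["weight"])
    if primary["weight"] < BACKUP_REPLACE_THRESHOLD or not backups:
        return initial  # acceptable risk, or no backup link available
    return min(weight_parameter(b["distance"], b["weight"]) for b in backups)


def cumulative_weight(segments):
    """Sum the final weight parameters of all segments on a path."""
    return sum(final_segment_weight(p, b) for p, b in segments)


path_segments = [
    ({"distance": 1, "weight": 2}, []),                      # kept: 1 + 2
    ({"distance": 2, "weight": 6},
     [{"distance": 2, "weight": 1},
      {"distance": 3, "weight": 0}]),                        # replaced: 3
    ({"distance": 1, "weight": 7}, []),                      # no backup: 8
]
path_weight = cumulative_weight(path_segments)
```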
[0041] After obtaining the cumulative weight set, the system needs to compare the differences between subgraphs based on the cumulative weight set, determine the initial connection cost through the weight sum, and obtain the cost distribution of each subgraph, providing a basis for subsequent steps to select subgraphs with lower costs. The specific calculation process is as follows: The system first establishes a cost statistics structure indexed by subgraph identifier. This structure creates a statistical record for each subgraph in the extracted subgraph set, setting an initial connection cost field in each record with an initial value of 0. Then, it iterates through each path record in the cumulative weight set. For each record, it reads its subgraph identifier and the cumulative weight value of the path. Then, in the cost statistics structure, it finds the corresponding subgraph's statistical record based on the subgraph identifier, adds the cumulative weight value of that path to the subgraph's initial connection cost field, and so on. After processing all path records in the cumulative weight set, the initial connection cost of each subgraph is obtained. The value stored in this field is the sum of the weights of all communication paths within the subgraph. This sum of weights is a positive integer. The smaller the value, the shorter the overall length of all paths within the subgraph and the fewer high-weight connections it contains, resulting in lower overall communication costs and risk levels. The system then sorts the initial connection cost values of all subgraphs in ascending order to form a cost distribution list. Each item records the subgraph identifier and the corresponding initial connection cost value. This cost distribution list is the cost distribution result for each subgraph. 
In subsequent steps, subgraphs with smaller initial connection cost values in the cost distribution list will be selected as preferred candidate subgraphs to participate in the actual allocation of GPU computing units.
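The cost statistics procedure above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the record layout and the field names `subgraph_id` and `cumulative_weight` are assumptions introduced for the example.

```python
def cost_distribution(subgraph_ids, cumulative_weight_set):
    """Sum the cumulative path weights per subgraph and sort ascending by cost."""
    # One statistical record per subgraph, initial connection cost set to 0.
    costs = {subgraph_id: 0 for subgraph_id in subgraph_ids}
    # Add each path's cumulative weight to the cost of its subgraph.
    for record in cumulative_weight_set:
        costs[record["subgraph_id"]] += record["cumulative_weight"]
    # Ascending order of initial connection cost (stable for equal costs).
    return sorted(costs.items(), key=lambda item: item[1])


# Hypothetical cumulative weight set with two subgraphs.
cumulative_weight_set = [
    {"subgraph_id": 1, "start": "g1", "end": "g2", "cumulative_weight": 3},
    {"subgraph_id": 1, "start": "g2", "end": "g1", "cumulative_weight": 4},
    {"subgraph_id": 2, "start": "g3", "end": "g4", "cumulative_weight": 2},
    {"subgraph_id": 2, "start": "g4", "end": "g3", "cumulative_weight": 2},
]
distribution = cost_distribution([1, 2], cumulative_weight_set)
print(distribution)
```

In this example subgraph 2 has the lower initial connection cost and therefore appears first in the cost distribution list.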
[0042] S5 includes obtaining candidate subgraph extraction results from the global connection view, generating an internal path list by statistically analyzing inter-node links using a path counting method, and determining the communication path length value; for the internal path list, replacing the accumulated weight parameters by integrating backup paths, where the weight parameters are added segment by segment based on node distances to obtain an accumulated weight set; based on the accumulated weight set, applying the weight sum judgment to calculate the preliminary connection cost by summing the elements in the set, and obtaining cost distribution ranking data; using the cost distribution ranking data, executing a cost comparison mechanism to compare the total weight value and determine the optimization order of the remaining subgraphs; if the total weight value is the smallest, then marking the optimal subgraph, obtaining the optimal subgraph mark with the smallest total weight.
[0043] In this embodiment, after completing step S4, several candidate subgraphs and their internal healthy connection information have been saved in the global connection view. Step S5 first reads the candidate subgraph extraction results from the global connection view. Each record in the candidate subgraph extraction results contains at least the subgraph identifier, a list of identifiers of all GPU computing units within the subgraph, and connection details of all healthy connections within the subgraph. The connection details include the starting GPU computing unit identifier, the ending GPU computing unit identifier, the identifier of the computing node to which it belongs, the communication medium type, the distance value between nodes, the current weight value determined in step S3, and a mark indicating whether it is a backup link. The distance value between nodes is determined in advance by the operation and maintenance personnel based on the cluster physical topology during the cluster deployment and trial operation phases. In this embodiment, for connections between GPUs located within the same computing node, the distance between nodes is fixed at 1. For connections across computing nodes, the distance between nodes is fixed at the number of switching layers that need to be traversed in the network from the starting node to the ending node. This value is an integer greater than or equal to 1 and is written into the node distance configuration table. The current weight value is obtained in step S3 through the dynamic weight update rules of the device health information management module and is an integer not less than 0 and not greater than 10. The backup link mark is determined in step S3 through the backup medium switching mechanism. Connections accessed by backup paths are marked as backup links, and original primary connections are marked as non-backup links. 
After obtaining the candidate subgraph extraction results, step S5 sequentially executes the path counting method for each subgraph, generates an internal path list through inter-node link statistics, and determines the communication path length value in the same process.
[0044] Specifically, the key parameter used in the path counting method is the maximum path hop count parameter. This parameter is determined by the operations and maintenance personnel during the cluster deployment and trial operation phases by testing the relationship between the path hop count and communication performance. Specifically, during the trial operation phase, test paths with different hop counts are constructed, representative tasks are run on these paths, and end-to-end latency and task completion time are recorded. When the number of connection segments in the path exceeds a certain specific value, the latency and task time deteriorate significantly. The operations and maintenance personnel use this specific value as the maximum allowed path hop count. In this embodiment, the maximum path hop count parameter is fixed at 4 and written into the configuration file for the path counting method to call.
[0045] Furthermore, for any candidate subgraph, the path counting method first extracts all GPU computing unit identifiers from the GPU computing unit list of the subgraph, sorts these identifiers in ascending order by their respective node identifiers and GPU identifiers to generate an ordered node list, and then performs a simple path search with each GPU computing unit in the ordered node list as a starting point. The search scope is limited to the subgraph and the path length does not exceed 4 connections. When processing a specific starting point, the system creates an initial path containing only that starting point, adds the path to the current path processing queue, and simultaneously creates an internal path list for the subgraph, writing path records to it in subsequent processes. In each loop, the first path is taken from the path processing queue, and the last GPU computing unit of that path is taken as the current expansion node. Based on the internal connection details of the subgraph, all adjacent GPU computing units directly connected to the current expansion node through healthy connections are searched, and these adjacent GPU computing units are processed one by one as expansion candidate nodes. For each expansion candidate node, it is first checked whether it already appears in the node sequence of the current path. If the expansion candidate node already exists in the current path, it is not expanded, to avoid generating loops. If the expansion candidate node does not appear in the current path, the system then checks whether the number of connected segments already contained in the current path is less than the maximum path hop count parameter 4. 
When the number of connected segments is less than 4, the current path is copied into a new path, the expansion candidate node is appended to the end of the node sequence of the new path, and the connection identifier from the current expansion node to the expansion candidate node is appended to the end of the edge sequence of the new path according to the connection details. The new path is added to the path processing queue and written to the internal path list at the same time. When it is found that the number of connected segments contained in the current path is equal to 4, no new nodes are expanded on the path. Instead, the path is directly written to the internal path list as a complete path but is not added to the queue for expansion. When the path processing queue is empty, it means that all simple paths without duplicate nodes have been generated with the starting point as the source and without more than 4 connected segments. The path counting method then repeats the above process, starting from the next GPU computing unit in the sorted list within the subgraph, until every GPU computing unit in the subgraph has completed a path search as a starting point. This results in an internal path list for the subgraph. Each record in the internal path list contains at least the subgraph identifier, the starting GPU computing unit identifier, the ending GPU computing unit identifier, the sequence of GPU computing units along the path, and the corresponding connection identifier sequence.
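The queue-based search described above can be sketched as follows. This is an illustrative sketch under stated assumptions: healthy connections are treated as undirected, GPUs are represented as hypothetical `(node_id, gpu_id)` tuples, and each path is recorded once when it is created instead of being written a second time when it reaches the hop limit.

```python
from collections import deque

MAX_HOPS = 4  # maximum path hop count parameter of the embodiment


def enumerate_paths(gpus, edges, max_hops=MAX_HOPS):
    """Enumerate simple paths of 1..max_hops segments inside one subgraph.

    gpus  : iterable of (node_id, gpu_id) tuples
    edges : iterable of (conn_id, gpu_a, gpu_b) healthy connections,
            assumed undirected for this sketch
    """
    adjacency = {gpu: [] for gpu in gpus}
    for conn_id, a, b in edges:
        adjacency[a].append((b, conn_id))
        adjacency[b].append((a, conn_id))

    paths = []
    seen = set()  # node sequences already recorded
    # Starting points in ascending (node identifier, GPU identifier) order.
    for start in sorted(adjacency):
        queue = deque([([start], [])])
        while queue:
            nodes, conns = queue.popleft()
            if len(conns) == max_hops:
                continue  # hop limit reached: keep the path, do not expand it
            for neighbor, conn_id in adjacency[nodes[-1]]:
                if neighbor in nodes:
                    continue  # would create a loop
                new_nodes = nodes + [neighbor]
                key = tuple(new_nodes)
                if key in seen:
                    continue  # same node sequence already in the list
                seen.add(key)
                new_path = (new_nodes, conns + [conn_id])
                paths.append(new_path)
                queue.append(new_path)
    return paths


# Hypothetical subgraph: three fully connected GPUs on node 1.
gpus = [(1, 1), (1, 2), (1, 3)]
edges = [("c1", (1, 1), (1, 2)), ("c2", (1, 2), (1, 3)), ("c3", (1, 1), (1, 3))]
paths = enumerate_paths(gpus, edges)
print(len(paths))
```

Because every GPU serves as a starting point in turn, each route appears once per direction, which matches the start and end identifiers kept in each path record.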
[0046] It should be noted that, in order to eliminate duplicate paths, before writing a path into the internal path list, the system compares the node sequence of the new path with the existing paths in the list one by one. When it is found that the start point, end point, and intermediate node sequence of a certain path are completely consistent with the existing paths, the path will not be written again, thus ensuring that each path appears only once. While generating the internal path list, the system calculates the communication path length value according to the node distance configuration table. Specifically, when generating or writing each path record, the system immediately reads the node sequence of the path, forms the first node pair with the second node, looks up the corresponding node distance value in the node distance configuration table, and writes this distance value into the path length accumulation variable as the initial path length. Then, the system sequentially takes the second and third nodes, the third and fourth nodes, until the last pair of adjacent nodes forms subsequent node pairs. For each node pair taken, the system reads the corresponding node distance value from the node distance configuration table and adds this value to the path length accumulation variable. When all node pairs have been processed, the integer value in the path length accumulation variable is the communication path length value of the path. The system writes this length value, along with the path identifier, into the path length result data, thus completing the determination of the internal path list and the communication path length value. Next, step S5 performs a backup path integration process for the internal path list. During the backup path integration process, weight parameters are added to each path segment by segment to form a cumulative weight set. The weight parameters of each connected segment are added segment by segment based on the node distance.
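The path length accumulation described above amounts to summing the configured distance of every adjacent node pair. The sketch below assumes a hypothetical `distance_table` keyed by unordered GPU pairs; the actual layout of the node distance configuration table is not specified in the text.

```python
def path_length(node_sequence, distance_table):
    """Sum the configured node distance over each adjacent pair on the path."""
    total = 0
    for a, b in zip(node_sequence, node_sequence[1:]):
        total += distance_table[frozenset((a, b))]
    return total


# Hypothetical table: one intra-node connection (distance 1) and one
# cross-node connection that traverses two switching layers (distance 2).
distance_table = {
    frozenset({"n1g1", "n1g2"}): 1,
    frozenset({"n1g2", "n2g1"}): 2,
}
length = path_length(["n1g1", "n1g2", "n2g1"], distance_table)
print(length)
```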
[0047] Specifically, the backup path integration process involves parameters such as the current weight value, the distance between nodes, the length influence coefficient, and the backup replacement threshold. The current weight value and the distance between nodes have been determined in the updated healthy connection dataset. The length influence coefficient and the backup replacement threshold are determined through statistical analysis during the cluster trial operation phase. The length influence coefficient is used to represent the contribution of the distance between nodes to the connection risk assessment. In this embodiment, the length influence coefficient is fixed at 1, meaning that for every unit increase in the distance between nodes, 1 length contribution is added to the weight parameter of that connection segment. The backup replacement threshold is used to determine when to use a backup link to replace the current primary link. During the trial operation phase, a large number of connections are monitored over a long period of time. When it is found that the current weight value of a connection reaches or exceeds a certain integer, the probability of it experiencing a serious failure within a certain period of time increases significantly. Based on this, the operation and maintenance personnel select this integer as the backup replacement threshold. In this embodiment, the backup replacement threshold is fixed at 5 and written into the configuration. During cumulative weight calculation, the system iterates through each path record in the internal path list. For a given path, it first reads the node sequence and edge sequence, creates a path cumulative weight variable, and sets its initial value to 0. Then, it processes each connection segment on the path sequentially according to the edge sequence. 
For a given connection segment, the system reads the current weight value, inter-node distance, and whether it is a backup link from the updated healthy connection dataset. When calculating the initial weight parameter for this connection segment, it first multiplies the inter-node distance by the length influence coefficient 1 to obtain the length contribution, and then adds the length contribution to the current weight value to obtain the initial weight parameter for this connection segment. This initial weight parameter is an integer not less than the current weight value. Afterward, the system compares the current weight value with the backup replacement threshold 5. If the current weight value is less than 5, the system considers the risk level of this connection segment to be within an acceptable range. The system directly uses the initial weight parameters as the final weight parameters for the connection segment without performing any backup replacement. If the current weight value is greater than or equal to 5, the connection segment is considered to have a high risk. The system searches for all backup links marked as backups in the updated healthy connection dataset by starting and ending GPU computing unit identifiers. If no backup link that meets the conditions is found, the initial weight parameters of the main connection are kept unchanged and used as the final weight parameters for the connection segment. If one or more backup connections are found, the system calculates the sum of the distance between nodes and the current weight value for each backup connection in the same way to obtain candidate weight parameters. 
The smallest value among all candidate weight parameters is selected as the replacement weight parameter for the connection segment, and this replacement weight parameter replaces the initial weight parameter of the main connection as the final weight parameter for the connection segment, thereby achieving backup link replacement for the connection segment.
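The per-segment weight calculation and backup replacement can be sketched as follows. The dictionary fields `distance` and `weight` are assumptions made for the example; the length influence coefficient 1 and the backup replacement threshold 5 are the values of this embodiment.

```python
LENGTH_COEFF = 1       # length influence coefficient
BACKUP_THRESHOLD = 5   # backup replacement threshold


def segment_weight(conn):
    """Length contribution plus current weight value of one connection."""
    return conn["distance"] * LENGTH_COEFF + conn["weight"]


def final_segment_weight(primary, backups):
    """Final weight parameter of a segment after the backup replacement check."""
    initial = segment_weight(primary)
    # Low-risk segment, or no backup link available: keep the initial weight.
    if primary["weight"] < BACKUP_THRESHOLD or not backups:
        return initial
    # High-risk segment: as described above, the smallest candidate weight
    # among the backup links replaces the initial weight.
    return min(segment_weight(backup) for backup in backups)


def path_cumulative_weight(segments):
    """Accumulate the final weights of all (primary, backups) segments."""
    return sum(final_segment_weight(primary, backups) for primary, backups in segments)


low_risk = {"distance": 1, "weight": 2}
high_risk = {"distance": 1, "weight": 6}
backups = [{"distance": 2, "weight": 1}, {"distance": 1, "weight": 4}]
total = path_cumulative_weight([(low_risk, []), (high_risk, backups)])
print(total)
```

Here the first segment keeps its initial weight 3, while the second segment (current weight 6, above the threshold) takes the smaller backup candidate 3 in place of its initial weight 7.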
[0048] Once the final weight parameter of the connection segment is determined, the system adds the value to the path cumulative weight variable and then continues to process the next connection segment until the last connection segment on the path is processed. At this point, the integer value in the path cumulative weight variable is the cumulative weight value of the path. This cumulative weight value comprehensively reflects the cumulative result of the path length and the risk factors of all connection segments in the path. The system writes the cumulative weight value, along with the subgraph identifier, start identifier, and end identifier of the path, into the cumulative weight set. Each record in the cumulative weight set corresponds to a path and its total weight. After calculating the cumulative weights of all paths, step S5 applies the weight sum judgment based on the cumulative weight set. It calculates the initial connection cost of each subgraph by summing the elements in the set and obtains the cost distribution ranking data. Specifically, the system creates a subgraph cost statistics table, creates a cost record for each subgraph in the candidate subgraph extraction results, sets a subgraph identifier field and an initial connection cost field in each record, and sets the initial value of the initial connection cost field to 0. Then, it iterates through each path record in the cumulative weight set. For each path record, it reads the subgraph identifier and the cumulative weight value of the path, finds the cost record of the corresponding subgraph in the subgraph cost statistics table, and adds the cumulative weight value of the path to the initial connection cost field of the subgraph. This process continues until all path records in the cumulative weight set have been processed. 
At this point, the integer value of the initial connection cost field in each subgraph cost record is the sum of the cumulative weight values of all paths within the subgraph. This sum is the initial connection cost of the subgraph.
[0049] Furthermore, the system then organizes the initial connection costs and subgraph identifiers of all subgraphs into a cost distribution list, and sorts the cost distribution list in ascending order of initial connection costs. The sorted list is the cost distribution ranking data, which clearly shows the total weight value of each subgraph and its relative position among all candidate subgraphs. Next, step S5 executes a cost comparison mechanism using the cost distribution ranking data to compare the total weight values among subgraphs and determine the optimization order of the remaining subgraphs. Specifically, the system reads the first record from the sorted cost distribution ranking data, takes the initial connection cost of the subgraph corresponding to that record as the current minimum total weight value, and takes the subgraph identifier in that record as the current optimal subgraph candidate identifier. Then, it reads the second, third, and so on, up to the last record, in the sorted order. For each record, it compares the initial connection cost in that record with the current minimum total weight value. If the initial connection cost of that record is less than the current minimum total weight value, the current minimum total weight value is updated to the initial connection cost of that record, the current optimal subgraph candidate identifier is updated to the subgraph identifier of that record, and the previous optimal subgraph candidate is added to the remaining subgraph optimization order list. If the initial connection cost of a record is equal to the current minimum total weight value, the tie is resolved according to a preset deterministic rule. For example, subgraphs with fewer nodes are selected first. When the number of nodes is the same, the subgraph identifier values are compared and the one with the smaller value is selected as the better subgraph. 
The unselected subgraphs are added to the remaining subgraph optimization order list in the order of comparison, thus giving a clear order when the costs are equal. If the initial connection cost of a record is greater than the current minimum total weight value, the current optimal subgraph candidate identifier is not changed, but the subgraph is added to the remaining subgraph optimization order list in the sorting order. After all records are processed, the subgraph corresponding to the current optimal subgraph candidate identifier is the subgraph with the smallest total weight value among all candidate subgraphs. Step S5 performs the optimal subgraph marking operation accordingly, adds an optimal subgraph mark field to the subgraph cost statistics table for the subgraph and sets the field to a valid state, and outputs the identifier of the subgraph as the optimal subgraph mark with the smallest total weight.
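The selection with deterministic tie-breaking can be sketched compactly. Note that this sketch swaps the record-by-record comparison for a single sort on the key (cost, number of nodes, subgraph identifier), so the remaining subgraphs come out in sorted order rather than in comparison order; the input dictionaries are assumptions for the example.

```python
def rank_subgraphs(costs, node_counts):
    """Return (optimal subgraph id, remaining subgraph ids in optimization order).

    costs       : {subgraph_id: initial connection cost}
    node_counts : {subgraph_id: number of GPU computing units}
    """
    # Lower cost first; on equal cost fewer nodes first; then smaller identifier.
    ordered = sorted(costs, key=lambda sg: (costs[sg], node_counts[sg], sg))
    return ordered[0], ordered[1:]


# Hypothetical candidates: subgraphs 2 and 3 tie on cost, 3 has fewer nodes.
costs = {1: 7, 2: 4, 3: 4}
node_counts = {1: 8, 2: 9, 3: 8}
best, remaining = rank_subgraphs(costs, node_counts)
print(best, remaining)
```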
[0050] S6 includes obtaining the connection path and communication medium type through optimal subgraph marking, storing them as the allocation basis to obtain a path storage set; for the path storage set, determining whether cross-node paths are involved, if so, prioritizing the path of NVLink interconnection link to determine the medium path sequence; based on the medium path sequence, integrating the weight recording mechanism to obtain the allocation basis parameters; using the allocation basis parameters, generating a GPU computing unit allocation scheme, and determining the final GPU computing unit allocation scheme.
[0051] In this embodiment, after completing step S5, the system has obtained a unique optimal subgraph marker in the global connection view. This optimal subgraph marker is an integer number determined in step S5 by comparing the total weight values of all candidate subgraphs. In this embodiment, the optimal subgraph marker is numbered sequentially from 1 and remains fixed once the subgraph is created. When executing step S6, the system first reads the optimal subgraph marker and searches the data structure of the global connection view for records whose subgraph identifier field is equal to the optimal subgraph marker. The retrieved subgraph records contain all GPU computing unit identifiers within the subgraph and connection records of all healthy connections within the subgraph. Each connection record contains at least the starting GPU computing unit identifier, the ending GPU computing unit identifier, the computing node identifier of the starting GPU, the computing node identifier of the ending GPU, the communication medium type, the distance between nodes, the current medium weight value calculated in the aforementioned steps, and a marker indicating whether it is a backup link. The GPU computing unit identifier is numbered sequentially from 1 within each node according to the physical slot order during the cluster deployment phase and remains unchanged during system operation. The computing node identifier is numbered sequentially from 1 according to the rack or management number during the deployment phase. Several fixed values are predefined in the configuration file for the communication medium type. In this embodiment, these include at least the NVLink type representing high-bandwidth dedicated interconnection, the type representing general bus interconnection, and the type representing Ethernet interconnection. 
The distance value between nodes is determined according to the physical topology and network switching level during the deployment phase. For connections within the same computing node, the value is fixed at 1. For connections across computing nodes, the value is the positive integer number of switching device levels that need to be passed on the path from the starting node to the ending node. The current medium weight value is calculated by the device health information management module in step S3 according to indicators such as temperature, error events, retransmission, packet loss, and latency through explicit weighting rules and is limited to between 0 and 10. The larger the value, the higher the failure risk of the connection. Whether a connection is a backup link is determined during the backup medium switching process in step S3. Connections added to the healthy connection dataset through the backup path are marked as backup links. Connections that existed in the initial configuration phase and were not replaced by the backup mechanism are marked as non-backup links. After obtaining all connection records corresponding to the optimal subgraph, the system first creates an empty path list structure in memory to construct the path storage set. This structure reserves fields for path identifier, starting GPU identifier, ending GPU identifier, path node sequence, path media type sequence, node identifier sequence for each connection segment, and path cumulative weight value. The path identifier counter is initialized to 1. Then, the system calls the internal path list data that has been generated and saved in the global connection view in steps S4 and S5. In these internal path list data, each record has its own subgraph identifier, starting GPU identifier, ending GPU identifier, sequence of nodes along the way, and corresponding cumulative weight value. The system performs a judgment on each record in the internal path list. 
When the subgraph identifier is equal to the optimal subgraph marker, the record is selected into the path storage set construction process.
[0052] Specifically, the current path identifier counter value is written to the path identifier field, the starting GPU identifier and ending GPU identifier in the record are written to their respective fields, the path node sequence in the record is copied one by one and written to the path node sequence field, the communication medium type of each connection segment on the path, as recorded in the healthy connection dataset, is written sequentially to the path medium type sequence field, the identifier of the computing node to which each segment belongs is written to the node sequence field in the path order, the cumulative weight value corresponding to the record in the cumulative weight set is written to the path cumulative weight value field, and after writing is completed, the path identifier counter is incremented by 1, and the next internal path record is processed until all path records whose subgraph identifier is equal to the optimal subgraph marker have been processed. At this time, the path list in memory is the path storage set, and each record in the path storage set corresponds to a feasible communication path inside the optimal subgraph. After constructing the path storage set, the system performs a cross-node path determination process for each path in the set. This process does not introduce new modules; it is entirely implemented by comparing existing fields. Specifically, it sequentially reads the path node sequence and its corresponding node sequence for each record in the path storage set. For a given path, the system starts from the first node in the path node sequence and forms the first connection segment together with the second node. The corresponding starting node identifier is the identifier of the compute node to which the first node belongs, and the ending node identifier is the identifier of the compute node to which the second node belongs. 
When the starting and ending compute node identifiers are not equal, the system sets the cross-node flag field in the path record to "cross-node" and marks the path as a cross-node path. After processing all connection segments, if no connection segment has different starting and ending compute node identifiers, the cross-node flag field of the path record is set to "local path", indicating that the path is transmitted only within a single compute node. The system performs the above steps to determine the cross-node flag for each path in the path storage set. After the determination, each path record has a definite cross-node flag, and no path is left unmarked.
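The cross-node determination reduces to comparing the compute node identifiers of adjacent GPUs on the path. A minimal sketch, assuming each GPU on the path is given as a hypothetical `(node_id, gpu_id)` tuple:

```python
def cross_node_flag(path_nodes):
    """Return "cross-node" if any segment joins two different compute nodes."""
    for (node_a, _), (node_b, _) in zip(path_nodes, path_nodes[1:]):
        if node_a != node_b:
            return "cross-node"
    return "local path"


print(cross_node_flag([(1, 1), (1, 2)]))          # both GPUs on node 1
print(cross_node_flag([(1, 1), (1, 2), (2, 1)]))  # last segment leaves node 1
```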
[0053] After the cross-node flag is clearly defined, the system begins to execute the NVLink-based media priority selection logic to generate a media path sequence. In this implementation, NVLink priority selection does not depend on numerical thresholds, but on a defined sorting rule. The sorting rule is fixed by the operation and maintenance personnel during system deployment based on the performance test results of different media combinations during the trial operation phase, and is not dynamically adjusted during operation. In practice, the system first establishes an empty media path sequence structure. Each record reserves a starting GPU identifier field, an ending GPU identifier field, a path node sequence field, a path media type sequence field, a path cross-node flag field, and a path cumulative weight value field. Then, the system scans all records in the path storage set according to the path identifier order. For records with a cross-node flag of "local path", the system directly writes the record into the media path sequence as is, keeping the corresponding starting GPU identifier, ending GPU identifier, path node sequence, media type sequence, and path cumulative weight value unchanged, and sets the cross-node flag to "local path". For records with a cross-node flag of "cross-node", the system first constructs a path grouping table in memory according to the starting GPU identifier and the ending GPU identifier, and groups all paths with the same starting and ending points and a cross-node flag of "cross-node" into the same group. After grouping, NVLink priority filtering is performed separately for each starting and ending point combination. The filtering process involves the system performing the following calculations for each path in the group: traversing the path media types. For each connection segment, if the starting and ending computation node identifiers are not equal, it is considered a cross-node connection. 
In such cross-node connections, if the communication medium type is equal to the NVLink type defined in the configuration, the number of cross-node NVLink segments of the path is incremented by 1; otherwise, the number of cross-node non-NVLink segments of the path is incremented by 1. After traversal, the number of cross-node NVLink segments and the number of cross-node non-NVLink segments of the path are obtained. The total number of cross-node connection segments is equal to the sum of the two. After the above statistics are completed for all paths in a group, the system first checks whether there is a path in the group with a number of cross-node non-NVLink segments equal to 0, that is, a path with a number of cross-node NVLink segments equal to the total number of cross-node connection segments. If so, the system forms a subset of these paths and sorts them in ascending order of cumulative path weight value. During sorting, the integer cumulative weight values are strictly compared without introducing fuzzy judgment. The first path after sorting is regarded as the preferred media path for the start-end combination. If no path in the group has zero cross-node non-NVLink segments, the system sorts the entire group by the number of cross-node NVLink segments from largest to smallest. Among paths with the same number of cross-node NVLink segments, it further sorts them by the number of cross-node non-NVLink segments from smallest to largest. If multiple paths are still tied after the first two sorting steps, they are sorted by their cumulative weight values from smallest to largest. After sorting, the first path in the sorted results is taken as the preferred media path for the start-end combination, and this path record is written to the media path sequence with its cross-node flag set to "cross-node". 
During the writing process, the start-node GPU identifier, end-node GPU identifier, path node sequence, path media type sequence, and path cumulative weight value from the original path record are directly copied to the corresponding fields in the media path sequence. This ensures that each pair of GPUs needing to communicate in the media path sequence has only one explicitly selected preferred path, and NVLink media priority is implemented for cross-node paths.
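The NVLink priority filtering for cross-node paths can be sketched as follows. The record layout (`start`, `end`, `weight`, `segments` with `src_node`, `dst_node`, `medium`) and the string `"NVLink"` are assumptions for the example; ties that survive all sort keys are resolved here by input order, which the text does not specify.

```python
from collections import defaultdict

NVLINK = "NVLink"  # assumed name of the NVLink medium type in the configuration


def count_cross_node_segments(path):
    """Return (cross-node NVLink segments, cross-node non-NVLink segments)."""
    nvlink = non_nvlink = 0
    for seg in path["segments"]:
        if seg["src_node"] != seg["dst_node"]:
            if seg["medium"] == NVLINK:
                nvlink += 1
            else:
                non_nvlink += 1
    return nvlink, non_nvlink


def preferred_path(group):
    """Pick the preferred media path among paths sharing start and end GPUs."""
    counted = [(count_cross_node_segments(p), p) for p in group]
    all_nvlink = [p for (_, non_nvlink), p in counted if non_nvlink == 0]
    if all_nvlink:
        # Every cross-node segment is NVLink: lowest cumulative weight wins.
        return min(all_nvlink, key=lambda p: p["weight"])
    # Otherwise: most NVLink segments, then fewest non-NVLink, then weight.
    return min(counted, key=lambda c: (-c[0][0], c[0][1], c[1]["weight"]))[1]


def select_media_paths(cross_node_paths):
    """Group cross-node paths by (start, end) and keep one path per group."""
    groups = defaultdict(list)
    for p in cross_node_paths:
        groups[(p["start"], p["end"])].append(p)
    return {key: preferred_path(group) for key, group in groups.items()}


def seg(src, dst, medium):
    return {"src_node": src, "dst_node": dst, "medium": medium}


paths = [
    {"id": 1, "start": "n1g1", "end": "n2g1", "weight": 3,
     "segments": [seg(1, 2, "Ethernet")]},
    {"id": 2, "start": "n1g1", "end": "n2g1", "weight": 5,
     "segments": [seg(1, 2, NVLINK)]},
    {"id": 3, "start": "n1g1", "end": "n3g1", "weight": 9,
     "segments": [seg(1, 2, NVLINK), seg(2, 3, "Ethernet")]},
    {"id": 4, "start": "n1g1", "end": "n3g1", "weight": 4,
     "segments": [seg(1, 2, "Ethernet"), seg(2, 3, "Ethernet")]},
]
selected = select_media_paths(paths)
print({key: p["id"] for key, p in selected.items()})
```

In the first group the all-NVLink path 2 is preferred even though path 1 has a lower cumulative weight. In the second group no path is all-NVLink, so path 3 wins because it has more cross-node NVLink segments.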
[0054] After the media path sequence is generated, the system integrates the weight recording mechanism according to the media path sequence to obtain the allocation basis parameters. In this embodiment, the weight recording mechanism does not introduce a new structure, but applies a fixed calculation rule to the existing path cumulative weight value and path structure data to generate a set of score data. This set of score data serves as the direct basis for the subsequent allocation algorithm to select the GPU combination. The weight recording mechanism first creates a set of allocation parameters in memory. Each record contains at least the path number, the starting GPU identifier, the ending GPU identifier, the cumulative weight value of the path, the number of cross-node connection segments in the path, the number of NVLink connection segments in the path, the number of GPU computing units in the path, and the path score value. The path number is directly taken from the sequence number recorded in the media path sequence. The cumulative weight value of the path comes from the aforementioned cumulative weight set and is a positive integer. The number of cross-node connection segments in the path is obtained by counting the number of connection segments in the path whose starting and ending computing node identifiers are not equal and is an integer greater than or equal to 0. The number of NVLink connection segments in the path is obtained by counting the number of connection segments in the path media type sequence whose media type is equal to NVLink type and is an integer greater than or equal to 0. The number of GPU computing units in the path is obtained by counting all GPU identifiers in the path node sequence after deduplication and is an integer greater than or equal to 2. The weight recording mechanism involves three weighting factors: cross-node penalty factor, NVLink reward factor, and coverage weight factor. 
These three factors are given fixed values in the configuration file before system deployment and are not adjusted during operation. In this implementation, the cross-node penalty factor is fixed at 1, the NVLink reward factor is fixed at 1, and the coverage weight factor is fixed at 1. The method for determining these three values is to record the operation of a large number of tasks during the system trial operation phase, and to statistically analyze different combinations of cross-node ratios, different NVLink usage ratios, and different numbers of GPUs covered by the path. The impact of these combinations on task execution time and failure rate is observed. Under the premise of ensuring simple calculation and clear indicators, the integer 1 is selected as the unified value of the three weighting factors and used in a fixed manner. The weight recording mechanism performs the following steps when calculating the path score for each path: First, the cumulative weight value of the path is taken as the base score. Then, the cross-node penalty score is calculated by multiplying the number of cross-node connection segments in the path by the cross-node penalty factor 1 to obtain an integer. Next, the NVLink reward score is calculated by multiplying the number of NVLink connection segments in the path by the NVLink reward factor 1 to obtain an integer. Then, the coverage reward score is calculated by multiplying the number of GPU computing units in the path by the coverage weight factor 1 to obtain an integer. Subsequently, the base score is added to the cross-node penalty score to obtain an intermediate score. Then, the NVLink reward score and the coverage reward score are subtracted from the intermediate score. The result is recorded as the path score of the path. 
This value is an integer and may be less than the cumulative weight value, but it will not be a decimal. The system performs the above steps for each path in the media path sequence, and writes the calculated path score value, along with the corresponding path number, starting GPU identifier, ending GPU identifier, and other data, into the allocation basis parameter set. At this point, each record in the allocation basis parameter set contains a quantitative description of the path's impact on the allocation decision.
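The scoring rule of this paragraph can be illustrated with a minimal Python sketch. The `Segment` type, its field names, and the function name `path_score` are hypothetical and chosen only for illustration; the three factor values of 1 and the order of the arithmetic follow the embodiment.

```python
from dataclasses import dataclass

# Factor values fixed at 1 in this embodiment; the constant names are illustrative.
CROSS_NODE_PENALTY = 1
NVLINK_REWARD = 1
COVERAGE_WEIGHT = 1


@dataclass
class Segment:
    """One connection segment of a path (hypothetical record layout)."""
    src_gpu: str
    src_node: str
    dst_gpu: str
    dst_node: str
    media: str


def path_score(cumulative_weight, segments):
    """Base score + cross-node penalty - NVLink reward - coverage reward."""
    # Segments whose two endpoints sit on different computing nodes.
    cross_node = sum(1 for s in segments if s.src_node != s.dst_node)
    # Segments whose media type is NVLink.
    nvlink = sum(1 for s in segments if s.media == "NVLink")
    # Distinct GPU identifiers on the path (deduplicated).
    gpus = {s.src_gpu for s in segments} | {s.dst_gpu for s in segments}
    intermediate = cumulative_weight + cross_node * CROSS_NODE_PENALTY
    return intermediate - nvlink * NVLINK_REWARD - len(gpus) * COVERAGE_WEIGHT


example = [
    Segment("g0", "n0", "g1", "n0", "NVLink"),
    Segment("g1", "n0", "g4", "n1", "RDMA"),
]
print(path_score(6, example))  # 6 + 1 - 1 - 3 = 3
```

Because the two rewards are subtracted, a lower score indicates a more attractive path, which matches the ascending sort applied to the scores later in the allocation step.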
[0055] After obtaining the allocation basis parameter set, the system generates a GPU computing unit allocation scheme based on this set, determining the final GPU computing unit allocation scheme. The generation process strictly depends on the parameter of the number of GPU computing units required by the task and the path score value in the allocation basis parameter set. The parameter of the number of GPU computing units required by the task is entered by the user as a positive integer in the interface when the task is submitted. In this embodiment, it is assumed that this value is 8. The requirement quantity parsing logic has already parsed this value and stored it in the task description data structure in step S1. When performing allocation in step S6, this value is directly read as the target number of GPUs. The system first checks the total number of GPU computing units in the optimal subgraph. If the total number is less than 8, the optimal subgraph cannot provide 8 GPUs at the same time. In the complete process of this embodiment, the system only discusses the case where the total number is greater than or equal to 8. When the total number is greater than or equal to 8, the system enters the path-driven GPU selection process.
[0056] The specific selection process is as follows: The system sorts the set of allocation criteria parameters from smallest to largest path score. When the path scores are the same, they are sorted from smallest to largest cumulative weight value. If they are still the same, they are sorted from smallest to largest path number. After sorting, a path priority list is obtained. The path that appears earlier in the list is more suitable as the base path for the allocation combination. The system then initializes an empty current candidate GPU set, using GPU identifiers as keys. Initially, it does not contain any GPUs. Next, it retrieves the first record from the path priority list and adds the starting and ending GPU identifiers from this record to the current candidate GPU set. If the path node sequence also contains intermediate GPU nodes, these intermediate GPU identifiers are also added to the set. After adding, the number of different GPU identifiers in the set is counted. If the number is less than 8, the second record is retrieved from the path priority list, and the same addition operation is repeated, adding all GPU identifiers involved in the second path to the current candidate GPU set. If a GPU identifier already exists in the set, it remains unchanged and is not counted again. The number of GPUs in the set is counted again. If it is still less than 8, the third path, the fourth path, and so on are retrieved and processed in sequence. After each path is added, the number of GPUs in the current candidate GPU set is immediately counted. When a certain count result reaches or exceeds 8 for the first time, the system immediately stops retrieving new paths from the path priority list. 
At this time, the current candidate GPU set stores the GPUs covered by the better-scoring paths, and the number of GPUs in this set is greater than or equal to 8. The system then performs a truncation operation based on the relationship between the number of GPUs in the current candidate GPU set and the target number of GPUs. If the number of GPUs in the current candidate GPU set is exactly 8, the system directly uses all GPUs in that set as the final GPU list for allocation. If the number of GPUs in the current candidate GPU set is greater than 8, the system selects 8 GPUs from that set according to a specific truncation rule. The truncation rule is as follows: First, count the number of times each GPU identifier appears in all paths added to the current candidate GPU set during the path selection process, and create a statistical set of each GPU identifier and its frequency. Then, sort these GPU identifiers from highest to lowest frequency; the higher the frequency, the more likely the GPU is to be selected. GPUs that appear more frequently occupy key positions and contribute more to the overall communication quality, so the system prioritizes retaining these GPUs. When the cumulative number of GPUs retained after sorting reaches or exceeds 8 for the first time, the system checks the number of GPUs in the current frequency tier. If the cumulative number of retained GPUs before adding the current tier was less than 8, and the cumulative number of retained GPUs after adding all the GPUs in the current tier is greater than 8, it means that the number of GPUs in the current tier is more than the remaining slots. At this time, the system sorts the GPUs in the current tier by their GPU identifier values from smallest to largest, and only selects the top-ranked GPUs that can fill the remaining slots. The GPUs in the remaining tiers are no longer included in the final allocation list. 
Through the above steps, the system obtains a final GPU set containing 8 GPU computing unit identifiers. The system then filters all path records whose start and end points are within the final GPU set based on the media path sequence. These path records are used as the communication paths that the task will use first during execution, forming a subset of communication paths. The final GPU computing unit allocation scheme is composed of the task identifier, the final GPU set, and this subset of communication paths. The system writes the above information into the allocation scheme record structure and uses it as the output of step S6 for subsequent updates to the global connection view and healthy connection dataset based on the scheme.
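The path-driven selection and truncation described in [0056] can be sketched in Python as follows. The record layout (dictionaries with `path_no`, `score`, `cumulative_weight`, and `gpus` keys) and the function name `select_gpus` are assumptions made for illustration. Sorting the candidates by descending frequency and then ascending identifier and keeping the first entries is used here as an equivalent of the tier-by-tier rule in the text.

```python
from collections import Counter


def select_gpus(records, target):
    """Pick `target` GPUs from scored paths (illustrative sketch).

    records: list of dicts with keys path_no, score, cumulative_weight,
             and gpus (all GPU identifiers on the path).
    Returns a sorted list of GPU identifiers, or None if the paths do not
    cover enough GPUs (a case the embodiment does not discuss).
    """
    # Ascending path score, then cumulative weight, then path number.
    ordered = sorted(
        records,
        key=lambda r: (r["score"], r["cumulative_weight"], r["path_no"]),
    )
    candidates = set()
    used = []
    for rec in ordered:
        used.append(rec)
        candidates.update(rec["gpus"])
        if len(candidates) >= target:
            break  # stop as soon as the target count is first reached
    if len(candidates) < target:
        return None
    if len(candidates) == target:
        return sorted(candidates)
    # Truncation: count how often each GPU appears in the paths used so far.
    freq = Counter(g for rec in used for g in set(rec["gpus"]))
    # Higher frequency first; within a tier, smaller identifier first.
    ranked = sorted(candidates, key=lambda g: (-freq[g], g))
    return sorted(ranked[:target])


paths = [
    {"path_no": 1, "score": 2, "cumulative_weight": 5, "gpus": [1, 0, 5]},
    {"path_no": 2, "score": 2, "cumulative_weight": 4, "gpus": [1, 2, 3]},
    {"path_no": 3, "score": 5, "cumulative_weight": 9, "gpus": [4, 5]},
]
print(select_gpus(paths, 4))  # [0, 1, 2, 3]
```

In the example, path 2 is taken first because of its smaller cumulative weight, path 1 brings the candidate set to five GPUs, and GPU 5 is dropped because it ties on frequency with GPUs 0, 2, and 3 but has the largest identifier.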
[0057] S7 includes using a healthy connection dataset to obtain connection metrics of the allocated GPU computing units and obtain a status monitoring sequence; for the status monitoring sequence, it is determined whether there are abnormal fluctuations; if so, unused subgraph data in the backup resource pool is integrated to determine the reuse adjustment path; through the reuse adjustment path, the resource distribution in the global connection view is updated to obtain the intermediate topology; based on the intermediate topology and combined with real-time status monitoring, a dynamically adjusted cluster topology view is generated.
[0058] In this embodiment, after completing step S6 and generating the final GPU computing unit allocation scheme, the system first uses the healthy connection dataset obtained in step S3 as the basic data source. Each record in the healthy connection dataset includes at least the identifier of the computing node, the identifier of the starting GPU computing unit, the identifier of the ending GPU computing unit, the communication medium type, the distance between nodes, the current medium weight value, the number of error events, the number of retransmissions, the number of packet losses, the average round-trip latency, and the bandwidth usage ratio in the most recent monitoring period. The distance between nodes has been determined according to the physical topology during the cluster deployment phase. For connections within the same computing node, it is fixed at 1. For cross-node connections, it is set to an integer not less than 1 based on the number of switching device layers actually traversed from the starting node to the ending node. The current medium weight value has been calculated in step S3 according to the weight update rule, based on indicators such as temperature, error events, retransmissions, packet losses, round-trip latency, and bandwidth, and is limited to between 0 and 10. When executing step S7, the system first reads the list of all GPU computing unit identifiers allocated to this task from the final GPU computing unit allocation scheme obtained in step S6, and selects the records in the healthy connection dataset whose starting GPU and ending GPU both belong to the list. These records are regarded as "connections of GPU computing units after allocation". Then, a data table is built in memory to record the connection status of this task. 
Each row of the data table corresponds to a selected record in the healthy connection dataset, and its computing node identifier, starting GPU identifier, ending GPU identifier, communication medium type, and distance between nodes are fixed and saved. In order to track the running status of these connections in the time dimension, the system reads various indicators from the device health information collection logic at a fixed monitoring cycle. The length of the monitoring cycle is determined by the operation and maintenance personnel through multiple experiments during the cluster trial operation phase.
[0059] Specifically, during the trial operation phase, stress test tasks with different cycle lengths of 30 seconds, 60 seconds, and 120 seconds were run to compare the relationship between alarm detection time and monitoring overhead under different cycle lengths. Ultimately, a fixed monitoring cycle length of 60 seconds was selected as a compromise between alarm response time and system overhead, and this value was written to the configuration file and remained unchanged during system operation. At the end of each monitoring cycle, for each connection in the data table, the system calls the underlying driver interface to read the number of error events, retransmissions, packet losses, average round-trip latency, and bandwidth usage ratio for that connection during the cycle. These values, along with the connection's current media weight value at the end of this cycle, are written into a time-ordered record sequence. This record sequence is numbered sequentially from 1 according to the monitoring cycle number. A time-ordered status sequence is maintained for each connection. The status sequences of all connections are merged to form the status monitoring sequence for this task. Each item in the status monitoring sequence clearly corresponds to a specific indicator value for a certain monitoring cycle and a certain connection, with no missing fields. To determine whether there are abnormal fluctuations in the status monitoring sequence, the system needs several threshold parameters, determined in advance during the cluster deployment phase, to describe the "fluctuation amplitude". During the trial operation phase, maintenance personnel select a time interval free of actual faults and with stable service operation. Within this interval, they record the number of error events, retransmissions, packet losses, average round-trip latency, and bandwidth usage for each connection across multiple consecutive monitoring periods. 
For each connection and each metric, they calculate the absolute value of the difference between the current period's value and the previous period's value, and then statistically analyze the maximum and average values of these absolute differences under normal conditions. For example, statistical analysis shows that the maximum absolute value of the difference in the number of error events between consecutive periods under normal conditions is 2, and the average is approximately 1; the maximum absolute value of the difference in the number of retransmissions is 4, and the average is around 2; the maximum absolute value of the difference in packet loss counts is 1, and the average is close to 0; the maximum absolute value of the change in average round-trip latency is 1 millisecond; and the maximum absolute value of the change in bandwidth utilization is 0.15. Based on these statistical results, to avoid normal fluctuations being falsely reported as abnormal, maintenance personnel set the abnormal fluctuation threshold for each type of indicator to a specific value at or slightly above the maximum normal difference. For example, the error event fluctuation threshold is set to 3, the retransmission count fluctuation threshold is set to 5, the packet loss count fluctuation threshold is set to 2, the average round-trip latency fluctuation threshold is set to 1 millisecond, and the bandwidth utilization fluctuation threshold is set to 0.2. These fixed values are written into the configuration file as abnormal fluctuation threshold parameters. Simultaneously, to avoid occasional spikes in a single monitoring cycle directly triggering topology adjustments, the system also sets a specific cycle number threshold for "fluctuation duration" based on observations of the impact of multiple abnormal samples on task operation during the trial run phase. 
Only when the same connection exceeds the abnormal fluctuation threshold on the same indicator in several consecutive monitoring cycles is it considered a true abnormal fluctuation. In this implementation, the fluctuation duration cycle threshold is fixed at 3, meaning that the fluctuation of this indicator must exceed the corresponding threshold for at least 3 consecutive monitoring cycles before it is considered that the connection has an abnormal fluctuation on that indicator that requires processing. This value is also written into the configuration file and remains unchanged during operation. During formal operation, after each new monitoring cycle, the system performs an abnormal fluctuation check on the status monitoring sequence. Specifically, for each connection and each indicator in the status monitoring sequence, the system takes the values of the current and previous periods and calculates the absolute difference between them. If this absolute difference exceeds the corresponding abnormal fluctuation threshold, the system marks the current period of that connection on that indicator as an abnormal fluctuation event and increments the continuous abnormal count for that connection on that indicator by 1. If the absolute difference is less than or equal to the threshold, the continuous abnormal count for that connection on that indicator is reset to zero. When the continuous abnormal count for any connection on any indicator reaches or exceeds 3, the system adds this connection to the abnormal connection list, marks its overall status as having abnormal fluctuations, and records the starting period number and current period number of the abnormality in the status monitoring sequence. 
When the continuous abnormal count for all connections and all indicators is less than 3, the abnormal connection list is empty, and the system considers there to be no abnormal fluctuations requiring processing in the status monitoring sequence within the current monitoring period. It only appends the current period's data collection results without triggering any topology adjustment operations.
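The abnormal fluctuation check can be sketched in Python as below. The metric key names, the layout of the status monitoring sequence (a dictionary mapping a connection identifier to its per-cycle samples, oldest first), and the function name `find_abnormal` are hypothetical; the threshold values and the duration threshold of 3 are the example values given in this embodiment. For brevity the sketch replays the whole sequence and reports the connections whose consecutive count stands at 3 or more after the latest cycle.

```python
# Example threshold values from the embodiment; key names are illustrative.
THRESHOLDS = {
    "errors": 3,
    "retransmissions": 5,
    "packet_loss": 2,
    "rtt_ms": 1.0,
    "bandwidth_ratio": 0.2,
}
DURATION_THRESHOLD = 3  # consecutive cycles required before a connection is flagged


def find_abnormal(sequence):
    """Return the sorted identifiers of connections with abnormal fluctuation.

    sequence: dict mapping connection id -> list of per-cycle metric dicts,
              ordered from the oldest to the most recent monitoring cycle.
    """
    abnormal = []
    for conn_id, samples in sequence.items():
        # One consecutive-exceedance counter per indicator.
        streak = dict.fromkeys(THRESHOLDS, 0)
        for prev, cur in zip(samples, samples[1:]):
            for metric, limit in THRESHOLDS.items():
                if abs(cur[metric] - prev[metric]) > limit:
                    streak[metric] += 1
                else:
                    streak[metric] = 0  # reset when within the threshold
        if any(count >= DURATION_THRESHOLD for count in streak.values()):
            abnormal.append(conn_id)
    return sorted(abnormal)


base = {"errors": 0, "retransmissions": 0, "packet_loss": 0,
        "rtt_ms": 0.0, "bandwidth_ratio": 0.0}
demo = {
    "a": [dict(base, errors=e) for e in (0, 4, 0, 4)],  # three large swings in a row
    "b": [dict(base, errors=e) for e in (0, 4, 0, 1)],  # last swing is small
}
print(find_abnormal(demo))  # ['a']
```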
[0060] When the abnormal connection list is not empty and at least one connection is determined to be abnormally fluctuating, the unused subgraph data in the backup resource pool is integrated to determine the reuse adjustment path for the abnormal connection. The construction of the backup resource pool relies on the storage of candidate subgraph data during the initial execution of step S7. Specifically, after completing the subgraph cost evaluation in step S5 and the optimal subgraph selection in step S6, the system saves the relevant data of candidate subgraphs that were not selected as the optimal subgraph and are not currently participating in task GPU allocation to the backup resource pool. Each subgraph data item includes at least a subgraph identifier, a list of GPU computing unit identifiers within the subgraph, and a list of healthy connection records within the subgraph. Each record in the healthy connection record list corresponds to a record in the healthy connection dataset and includes the starting GPU identifier, ending GPU identifier, associated compute node identifier, communication medium type, inter-node distance, and current medium weight value. When determining the reuse adjustment path, the system processes each abnormal connection in the abnormal connection list one by one. For a given abnormal connection, it first reads the starting GPU identifier, ending GPU identifier, and associated compute node identifier from the healthy connection dataset. Then, it uses a path search method with a limited hop count in the current global connection view to find an alternative path from the starting GPU to the ending GPU. The hop count limit parameter of the path search method can use the maximum path hop count parameter determined in steps S4 and S5; in this embodiment, this parameter is fixed at 4. During path search, all connections must belong to the healthy connection dataset. 
The system also needs to limit the connection medium weights in candidate paths. Therefore, during the trial operation phase, maintenance personnel determine a connection replacement weight threshold based on the relationship between medium weight values and actual failure probabilities. Specifically, during the trial operation, the medium weight changes of a large number of connections and their subsequent failures are recorded over a long period. When the medium weight value of a connection exceeds a certain integer value multiple times consecutively, the subsequent actual failure probability increases significantly. Maintenance personnel use this integer value as the demarcation point beyond which a connection is not suitable for reuse in new tasks. In this implementation, the connection replacement weight threshold is fixed at 7 and written into the configuration file for unified use. Therefore, during path search, when checking each candidate path, the system requires that each connection segment on the path belongs to the healthy connection dataset, and that the current medium weight value of each connection segment is less than 7. When the weight value of any connection segment in a path is greater than or equal to 7, the path is removed from the candidate set. If at least one alternative path that meets the above conditions can be found within the current global connection view, the system calculates the cumulative weight value of each of these paths according to the cumulative weight calculation method used in step S5. 
That is, the distance value between nodes of each connection segment in the path is added to the current medium weight value according to a fixed rule to obtain a positive integer, and the path with the smallest cumulative weight value is selected as the reuse adjustment path corresponding to the abnormal connection without accessing the backup resource pool. If no suitable alternative path is found within the current global connection view, the system then searches for alternative subgraphs in the backup resource pool. Specifically, it checks the GPU compute unit identifier list of each subgraph in the backup resource pool to see if it contains both the starting and ending GPUs of the abnormal connection. When a backup subgraph is found to contain both GPUs, it is added to the candidate backup subgraph set. For each subgraph in the candidate backup subgraph set, the system performs a path search with a limited number of hops only once within that subgraph. The search rules are the same as described above, namely, a maximum hop count of 4 and a media weight value of less than 7 for all connections on the path. Furthermore, the system calculates the cumulative weight value for each candidate path and selects the path with the smallest cumulative weight value from all candidate backup paths as the reuse adjustment path for the abnormal connection. If no path meets the conditions in any of the backup subgraphs, it indicates that the backup resource pool cannot currently provide an adjustment path for the abnormal connection. In this case, the system retains the existing path of the abnormal connection, only maintaining its abnormal marker, and continues to observe it in subsequent monitoring cycles. After the above processing, the system generates a set of reuse adjustment paths for all connections in the abnormal connection list for which alternative paths can be found. 
The system writes the starting GPU identifier, ending GPU identifier, path node sequence, and communication medium type sequence of these paths into the reuse adjustment path data set.
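The hop-limited search for a reuse adjustment path can be sketched in Python as follows. The edge dictionary layout and the function name `find_reuse_path` are hypothetical. The text only says that distance and medium weight are combined "according to a fixed rule", so the per-segment cost `distance + weight` used here is an assumption; the hop limit of 4 and the replacement weight threshold of 7 are the values of this embodiment. The caller is assumed to pass only healthy connections and to leave out the abnormal connection itself.

```python
MAX_HOPS = 4                  # maximum path hop count in this embodiment
REPLACE_WEIGHT_THRESHOLD = 7  # segments at or above this weight are excluded


def find_reuse_path(edges, src, dst):
    """Return (cumulative_weight, gpu_path) with the smallest cost, or None.

    edges: dict mapping (gpu_a, gpu_b) -> (distance, medium_weight);
           connections are treated as bidirectional.
    """
    adjacency = {}
    for (a, b), (distance, weight) in edges.items():
        if weight >= REPLACE_WEIGHT_THRESHOLD:
            continue  # not suitable for reuse
        cost = distance + weight  # assumed accumulation rule
        adjacency.setdefault(a, []).append((b, cost))
        adjacency.setdefault(b, []).append((a, cost))

    best = None
    stack = [(src, [src], 0)]
    while stack:
        node, path, total = stack.pop()
        if node == dst:
            if best is None or total < best[0]:
                best = (total, path)
            continue
        if len(path) - 1 == MAX_HOPS:
            continue  # hop limit reached without arriving at dst
        for nxt, cost in adjacency.get(node, []):
            if nxt not in path:  # simple paths only
                stack.append((nxt, path + [nxt], total + cost))
    return best


view = {
    ("g0", "g1"): (1, 2),
    ("g1", "g3"): (1, 1),
    ("g0", "g2"): (1, 8),  # weight 8 >= 7, filtered out
    ("g2", "g3"): (1, 0),
}
print(find_reuse_path(view, "g0", "g3"))  # (5, ['g0', 'g1', 'g3'])
```

The same function could be applied first to the global connection view and, only if it returns `None`, to each candidate backup subgraph in turn, keeping the cheapest result.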
[0061] Subsequently, the system updates the resource distribution in the global connection view through the reuse adjustment paths to obtain the intermediate topology. Specifically, the system first performs a "freeze" operation in the global connection view on each connection in the abnormal connection list that has been successfully matched to a reuse adjustment path. That is, the usage status mark of the corresponding record is updated to "pending removal", but the record is not immediately deleted from the data structure, so that the original connection can be restored if verification of the reuse adjustment path fails later. Then, the system processes each path in the reuse adjustment path data set one by one. For each connection segment in a reuse adjustment path, the system checks whether a corresponding connection record already exists in the global connection view. If it exists and the record is still in a healthy state, the resource occupancy count field in the record is incremented by 1. This resource occupancy count field is used in the system design to count how many tasks are currently using the connection simultaneously. In this embodiment, the resource occupancy count is initially 0 when the connection is not used by any task, is incremented by 1 each time a new task reuses the connection, and is decremented by 1 when the task ends. If the connection does not exist in the global connection view, the system copies the static information of the connection from the healthy connection dataset to create a new connection record, sets its health status to healthy, sets the resource occupancy count to 1, and adds the connection to the global connection view. After all connection segments of the reuse adjustment paths have been processed, the system uniformly processes the abnormal connection records previously marked as "pending removal". 
If it is confirmed that the GPUs at both ends of these abnormal connections have maintained connectivity through the reuse adjustment path, these abnormal connection records are completely deleted from the global connection view, thus obtaining a new connection relationship graph. This new connection relationship graph is called the intermediate topology in this embodiment. The intermediate topology contains the global connection state after the abnormal connections are removed and the reuse adjustment paths are introduced.
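The freeze, occupancy update, and removal sequence of [0061] might be sketched in Python as follows. The record fields (`health`, `occupancy`, `usage`), the status string used when a connection is restored, and the function name `apply_reuse_path` are hypothetical. The connectivity verification is reduced here to checking that every segment of the reuse adjustment path is present and healthy, which is a simplifying assumption.

```python
def apply_reuse_path(view, healthy, abnormal_key, path_segments):
    """Update the global connection view in place and return it.

    view, healthy: dicts mapping (gpu_a, gpu_b) -> record dict.
    abnormal_key:  key of the abnormal connection in `view`.
    path_segments: list of (gpu_a, gpu_b) keys forming the reuse adjustment path.
    """
    # Freeze: mark the abnormal record but keep it so it can be restored.
    view[abnormal_key]["usage"] = "pending removal"

    for seg in path_segments:
        record = view.get(seg)
        if record is None:
            # Copy static information and start with one user.
            view[seg] = dict(healthy[seg], health="healthy", occupancy=1)
        elif record["health"] == "healthy":
            record["occupancy"] += 1

    # Simplified verification of connectivity through the new path.
    connected = all(
        seg in view and view[seg]["health"] == "healthy"
        for seg in path_segments
    )
    if connected:
        del view[abnormal_key]  # the result is the intermediate topology
    else:
        view[abnormal_key]["usage"] = "in use"  # restore (assumed label)
    return view


demo_view = {
    ("g0", "g3"): {"health": "healthy", "occupancy": 1, "usage": "in use"},
    ("g0", "g1"): {"health": "healthy", "occupancy": 0, "usage": "idle"},
}
demo_healthy = {("g1", "g3"): {"media": "NVLink", "distance": 1}}
result = apply_reuse_path(
    demo_view, demo_healthy, ("g0", "g3"), [("g0", "g1"), ("g1", "g3")]
)
print(sorted(result))  # [('g0', 'g1'), ('g1', 'g3')]
```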
[0062] After obtaining the intermediate topology, the system generates a dynamically adjusted cluster topology view based on the intermediate topology and real-time status monitoring. Real-time status monitoring continues to be executed by the same health information collection logic at fixed intervals, with the monitoring cycle length remaining the previously set 60 seconds. After each new monitoring cycle, the system re-traverses all connections and all GPU computing units participating in the task within the intermediate topology, reading the latest error event count, retransmission count, packet loss count, average round-trip latency, bandwidth usage ratio, and corresponding current media weight value for each connection according to the field format of the healthy connection dataset. The system then updates the media weight value using the same rules as in step S3. To identify which reuse adjustment paths are stable and which are still under observation in the dynamic topology, the system additionally sets a health stability threshold parameter during the trial operation phase. This parameter is used to determine whether a connection remains in a low-risk state over a relatively long period. It is determined as follows: during the trial operation phase, multiple long-term stable connections without failures are selected, and the maximum and average values of the media weight values of these connections are calculated over multiple consecutive monitoring periods. Based on this, the operation and maintenance personnel select an integer slightly higher than the upper limit of these normal weight values as the health stability threshold. In this embodiment, the health stability threshold is fixed at 3 and written into the configuration file. 
During operation, when the media weight value of a connection on a reuse adjustment path remains less than or equal to 3 for at least 3 consecutive monitoring periods, and the number of error events, retransmissions, and packet losses do not exceed the absolute abnormal upper limits defined in step S3, the system updates the status marker of the connection in the topology view from "new reuse connection" to "stable and healthy connection". Conversely, if the media weight value of a newly introduced connection continues to rise or frequent errors occur in a short period of time, the system marks its status as "observation connection" in the topology view and prioritizes its inclusion in the abnormal fluctuation detection of subsequent status monitoring sequences, thereby forming a feedback loop. On the other hand, to reflect the resource pressure of each computing node in the dynamic topology view, the system calculates the connectivity and average load level of each computing node based on the intermediate topology at the end of each monitoring cycle. Connectivity is defined as the number of connection records in which the node participates as a start or end node, and average load level is defined as the average current utilization of all GPU computing units on the node, obtained by sampling the GPU utilization on the node and calculating the arithmetic mean. The system then compares these statistical results with pre-set node connectivity and load thresholds. These two thresholds are determined during the trial operation phase by observing the characteristics of nodes before and after task execution. For example, the node connectivity threshold can be set to 80% of the maximum number of connections physically supported by the node, and the node load threshold can be set to an average utilization of 90%. When a node's connectivity exceeds the connectivity threshold or its average load exceeds the load threshold, the system marks the node as a high-pressure node in the dynamic topology view and reduces the reuse priority of the connections on that node during the subsequent reuse adjustment path selection process. Finally, the dynamically adjusted cluster topology view is based on the intermediate topology structure, overlaying the latest health status flag, current medium weight value, whether it belongs to a reuse adjustment path, and whether it is in the observation state for each connection, as well as the resource pressure flag of each node. This unified data structure presents the topology relationship and operational health status of GPU computing units and communication connections globally at the current moment. The scheduling logic will directly use this dynamically adjusted cluster topology view as the topology input when executing steps S1 to S6 in the next round. This allows the GPU allocation method of the present invention to achieve real-time dynamic adjustment of the cluster topology throughout the entire task lifecycle through the cooperation of status monitoring sequences, backup resource pools, reuse adjustment paths, and intermediate topology structures.
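The connection status marking and the node pressure check of [0062] can be illustrated with the following Python sketch. The function names, the short status labels `"new"`, `"stable"`, and `"observation"`, and the reading of "continues to rise" as strictly increasing weights over the recent window are assumptions; the health stability threshold of 3, the 3-cycle window, and the 80% and 90% example thresholds come from this embodiment.

```python
HEALTH_STABILITY_THRESHOLD = 3  # weight upper bound for a stable connection
STABLE_CYCLES = 3               # consecutive monitoring cycles required


def connection_status(weights, within_limits):
    """Classify a connection on a reuse adjustment path.

    weights:       media weight values per monitoring cycle, oldest first.
    within_limits: True if error, retransmission, and packet loss counts
                   stay within the absolute upper limits of step S3.
    """
    recent = weights[-STABLE_CYCLES:]
    if (len(recent) == STABLE_CYCLES and within_limits
            and all(w <= HEALTH_STABILITY_THRESHOLD for w in recent)):
        return "stable"
    # Assumed interpretation of "continues to rise": strictly increasing.
    rising = len(recent) >= 2 and all(
        later > earlier for earlier, later in zip(recent, recent[1:])
    )
    if rising or not within_limits:
        return "observation"
    return "new"


def is_high_pressure(connectivity, max_connections, avg_load,
                     connectivity_ratio=0.8, load_threshold=0.9):
    """Flag a node whose connectivity or average GPU load exceeds its threshold."""
    return (connectivity > connectivity_ratio * max_connections
            or avg_load > load_threshold)


print(connection_status([2, 1, 2], True))   # stable
print(connection_status([2, 4, 6], True))   # observation
print(is_high_pressure(9, 10, 0.5))         # True
```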
[0063] As shown in Figure 2, which describes the GPU server's topology, communication within a node primarily occurs via NVLink and PCIe, while communication between nodes is via RDMA or TCP. Intra-node communication is more efficient than inter-node communication. Therefore, the main card allocation strategy is to prioritize allocating cards within a single node if the number of cards within the node meets customer demand. When allocating cards within a node, priority is given to card combinations whose internal GPU connections are all NVLink-based, as NVLink has higher bandwidth and speed than PCIe. If not all cards are NVLink-connected, the GPU communication library (NCCL) will fall back to PCIe communication, resulting in wasted cards.
[0064] 2) When allocating GPU cards, priority will be given to assigning graphics card combinations that all have NVLink, and only then will PCIe cards be selected for allocation. In addition, the connection path of the current card combination will be the shortest, and the connection paths of the remaining card combinations will also be the shortest.
[0065] 3) Construction of the topology graph and search for the optimal subgraph: Based on the above principles, each GPU is abstracted as a graph vertex V, assuming a total of k GPU devices. The communication medium between GPUs is abstracted as an edge with weight W. Since NVLink communication is significantly faster than PCIe, the NVLink weight is assumed to be 1, and the PCIe weight is the ratio of the NVLink speed to the PCIe speed, which is V. The goal is to find, in this graph, a subgraph with a specific number of vertices (corresponding to the number of cards needed by the user).
[0066] Based on the above objectives, we first construct a subgraph based on the NVLink connections of the GPUs at each node, and then construct a subgraph based on PCIe connections (NVLink has a lower weight than PCIe). During the construction process, node topology information and GPU health information need to be combined (CPU and memory information may also be included later).
[0067] If the number of GPUs requested by the user is less than the maximum number of GPUs a single node can have, then the optimal GPU combination is found based on all subgraphs within the node, and efforts are made to ensure that all connections between the selected GPUs are NVLinks.
[0068] If the number of GPUs requested by the user exceeds the number of GPUs on any single node, then cross-node subgraph combinations are found based on the global topology graph. In the node combination search, RDMA has higher priority than TCP, and TCP nodes will only be allocated after all RDMA nodes have been allocated.
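The intra-node case of the weighted-graph search in [0065] to [0067] can be illustrated with a brute-force Python sketch. The function name `best_combination`, the media dictionary layout, and the PCIe weight of 4 are hypothetical (the text defines the PCIe weight only as the NVLink-to-PCIe speed ratio), and any GPU pair not listed as NVLink is assumed to communicate over PCIe. Exhaustive enumeration is used only because a single node holds few GPUs; it is not presented as the search method of the invention.

```python
from itertools import combinations

NVLINK_WEIGHT = 1
PCIE_WEIGHT = 4  # hypothetical NVLink-to-PCIe speed ratio


def best_combination(gpus, media, k):
    """Return the k-GPU combination with the smallest total edge weight.

    gpus:  iterable of GPU identifiers on one node.
    media: dict mapping frozenset({a, b}) -> "NVLink" or "PCIe";
           missing pairs are treated as PCIe (assumption).
    """
    def total_weight(combo):
        return sum(
            NVLINK_WEIGHT if media.get(frozenset(pair)) == "NVLink" else PCIE_WEIGHT
            for pair in combinations(combo, 2)
        )

    # min() keeps the first of several equally cheap combinations.
    return min(combinations(sorted(gpus), k), key=total_weight)


links = {
    frozenset({0, 1}): "NVLink",
    frozenset({1, 2}): "NVLink",
    frozenset({1, 3}): "NVLink",
    frozenset({2, 3}): "NVLink",
}
print(best_combination([0, 1, 2, 3], links, 3))  # (1, 2, 3)
```

In the example, GPUs 1, 2, and 3 are fully NVLink-connected (total weight 3), so they are preferred over any combination that includes a PCIe pair.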
[0069] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
Claims
1. A GPU allocation method based on GPU cluster topology, characterized in that it comprises:
S1. Obtaining topology information from each computing node in the cluster; parsing, through a topology information management module, the direct and indirect connection information between GPU computing units in the computing node; extracting, through user demand quantity parsing logic, the number of GPU computing units required for the task; and, in combination with connection integrity verification, generating a local topology connectivity graph of the current node to obtain a preliminary node connection dataset.
S2. Based on the local topology connectivity graph and the principle of uniform node distribution, reporting the data to the global connection view system, and updating the communication medium connection relationships and path delay information in the view in combination with cross-node constraints and a media type filtering mechanism, to determine the global connection distribution state.
S3. Periodically querying the operating status of each GPU computing unit and its communication medium through a device health information management module; if a connection failure is detected, adjusting the weight value of the corresponding medium according to dynamic weight update rules and removing the faulty unit from the global connection view, to obtain an updated healthy connection dataset.
S4. For the updated healthy connection dataset, extracting multiple candidate subgraphs from the global connection view using a subgraph diversity generation method, and calculating the cumulative weight of the communication paths in each subgraph in combination with a subgraph size limit and path length evaluation logic, to determine the initial connection cost of each subgraph.
S5. Summarizing the connection costs of the candidate subgraphs according to total weight summary logic, comparing the total weight values one by one through a weight comparison mechanism, and sorting the unselected subgraphs in combination with remaining subgraph optimization rules, to obtain the optimal subgraph label with the smallest total weight.
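As an illustration only, steps S1 to S5 can be sketched in a few lines of Python. The claim does not fix a weight table, a cost formula, or a search strategy, so everything concrete here is an assumption: the `MEDIUM_WEIGHT` values, the use of the sum of pairwise cheapest-path weights as the subgraph cost, the exhaustive enumeration of candidates (practical only for small views), and the function names.

```python
from itertools import combinations

# Hypothetical per-medium weights: a lower value means a faster link.
MEDIUM_WEIGHT = {"nvlink": 1.0, "pcie": 4.0, "network": 10.0}

def build_view(links, faulty=()):
    """Build the view as {gpu: {neighbor: weight}}, skipping faulty GPUs."""
    view = {}
    for a, b, medium in links:
        if a in faulty or b in faulty:
            continue
        weight = MEDIUM_WEIGHT[medium]
        view.setdefault(a, {})[b] = weight
        view.setdefault(b, {})[a] = weight
    return view

def cheapest_paths(view):
    """All-pairs cheapest path weights (Floyd-Warshall)."""
    nodes = sorted(view)
    inf = float("inf")
    dist = {u: {v: 0.0 if u == v else view[u].get(v, inf) for v in nodes}
            for u in nodes}
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def best_subgraph(links, required, faulty=()):
    """Return (gpus, total_weight) for the cheapest group of `required` GPUs."""
    view = build_view(links, faulty)
    dist = cheapest_paths(view)
    best, best_cost = None, float("inf")
    for candidate in combinations(sorted(view), required):
        cost = sum(dist[a][b] for a, b in combinations(candidate, 2))
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost

links = [("g0", "g1", "nvlink"), ("g1", "g2", "nvlink"),
         ("g2", "g3", "pcie"), ("g0", "g3", "network")]
print(best_subgraph(links, 2))                 # (('g0', 'g1'), 1.0)
print(best_subgraph(links, 2, faulty={"g1"}))  # (('g2', 'g3'), 4.0)
```

In this sketch a GPU with no healthy link never enters the view, and a request for more GPUs than the view contains returns `(None, inf)`.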
2. The GPU allocation method based on GPU cluster topology according to claim 1, characterized in that S1 includes: obtaining the topology information from each computing node in the cluster, and parsing, through the topology information management module, the direct high-speed connection information and the indirect connection information between GPU computing units in the computing node; extracting the number of GPU computing units required for the task from the topology information using the demand quantity parsing logic, and generating the local topology connectivity graph of the current node in combination with connection integrity verification; obtaining a preliminary node connection dataset for the local topology connectivity graph, and verifying the consistency of the connections in the dataset by comparing it, through global topology integration, with the overall cluster connectivity data; if the connection consistency is lower than a preset threshold, obtaining additional connection data by parsing the cluster nodes and updating the node connection dataset; and determining the preliminary node connection dataset based on the updated node connection dataset.
3. The GPU allocation method based on GPU cluster topology according to claim 1, characterized in that S2 includes: obtaining node distribution uniformity data from the local topology connectivity graph, and filtering the preliminary communication medium connection relationships by comparing the distance and load difference between nodes under the cross-node constraints, to obtain a path delay information set; for the path delay information set, using the media type filtering mechanism to extract compatible media types and comparing them with the existing data in the global connection view system, to determine the updated communication medium connection relationships; based on the updated communication medium connection relationships, reporting the node distribution uniformity data to the global connection view system to determine the preliminary consistency of the connection distribution state; if the preliminary consistency meets a preset threshold, integrating the results extracted under the cross-node constraints and by the media type filtering mechanism to obtain extended connection path information on a global scale; and using the extended connection path information to verify media compatibility in the connection distribution state and determine the global connection distribution state.
4. The GPU allocation method based on GPU cluster topology according to claim 1, characterized in that S3 includes: obtaining, through the device health information management module, operating status data from the GPU computing units and checking the communication media for abnormal indicators, to obtain a preliminary fault identifier set; based on the preliminary fault identifier set, adjusting the corresponding weight values under the dynamic weight update rules by comparing media load differences and historical fault records, to determine the weight-adjusted media list; for the weight-adjusted media list, if a connection failure is detected, removing the faulty unit from the global connection view to obtain the post-removal view data; and integrating the post-removal view data with a backup media switching mechanism, allocating backup paths and updating the healthy connection dataset accordingly, to obtain the updated healthy connection dataset.
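A minimal sketch of the weight adjustment and fault removal described in S3 follows. The claim names load differences and historical fault records as inputs but gives no formula, so the multiplicative rule below, the `0.5` factor, and the report fields (`up`, `load`, `past_faults`) are hypothetical. The backup media switching step is not modeled.

```python
def update_health(view, gpu_ok, link_report):
    """Return a new link table with weights adjusted and faulty elements removed.

    view:        {(gpu_a, gpu_b): weight}
    gpu_ok:      {gpu: bool}; a GPU missing from the map is treated as healthy
    link_report: {(gpu_a, gpu_b): {"up": bool, "load": float, "past_faults": int}}
    """
    healthy = {}
    for (a, b), weight in view.items():
        # Drop every link that touches a faulty GPU.
        if not gpu_ok.get(a, True) or not gpu_ok.get(b, True):
            continue
        report = link_report.get((a, b), {})
        # Drop links whose medium has failed outright.
        if not report.get("up", True):
            continue
        load = report.get("load", 0.0)
        faults = report.get("past_faults", 0)
        # Assumed rule: busier and historically flakier media cost more.
        healthy[(a, b)] = weight * (1.0 + load) * (1.0 + 0.5 * faults)
    return healthy

view = {("g0", "g1"): 1.0, ("g1", "g2"): 1.0, ("g2", "g3"): 4.0}
gpu_ok = {"g3": False}
link_report = {
    ("g0", "g1"): {"up": True, "load": 0.5, "past_faults": 2},
    ("g1", "g2"): {"up": False},
}
print(update_health(view, gpu_ok, link_report))  # {('g0', 'g1'): 3.0}
```

Returning a new table rather than mutating the view in place keeps the previous view available if a health report later turns out to be a false alarm.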
5. The GPU allocation method based on GPU cluster topology according to claim 1, characterized in that S4 includes: obtaining the global connection view data from the updated healthy connection dataset, and using a graph segmentation method to extract multiple candidate subgraphs by dividing the view nodes, to obtain an extracted subgraph set; for the extracted subgraph set, obtaining the internal path list of each subgraph by path counting in combination with the subgraph size limit, to determine the path list data; based on the path list data, using the path length evaluation logic to calculate the length of each communication path by accumulating the distances between nodes, to obtain the length evaluation result; based on the length evaluation result, replacing the weight parameters of each path with backup links through a backup path integration mechanism, to obtain the cumulative weight set; and, for the cumulative weight set, comparing the differences between subgraphs and determining the initial connection cost from the sum of the weights, to obtain the cost distribution of each subgraph.
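One way to picture candidate extraction and cost accumulation in S4 is the sketch below. The claim does not name a specific graph segmentation method; as a stand-in, this example grows one candidate from each seed GPU by repeatedly attaching the GPU reachable over the cheapest link (Prim-style growth) and takes the sum of the attached link weights as the initial connection cost. The function names and the adjacency format are assumptions, and backup link substitution is not modeled.

```python
import heapq

def grow_candidate(adj, seed, size):
    """Grow a subgraph from `seed`, always attaching over the cheapest link.

    Returns (members, cost), or None if fewer than `size` GPUs are reachable.
    """
    members, cost = {seed}, 0.0
    frontier = [(weight, nbr) for nbr, weight in adj[seed].items()]
    heapq.heapify(frontier)
    while frontier and len(members) < size:
        weight, node = heapq.heappop(frontier)
        if node in members:
            continue
        members.add(node)
        cost += weight
        for nbr, nbr_weight in adj[node].items():
            if nbr not in members:
                heapq.heappush(frontier, (nbr_weight, nbr))
    return (frozenset(members), cost) if len(members) == size else None

def candidate_costs(adj, size):
    """Map each distinct candidate subgraph to its lowest accumulated cost."""
    costs = {}
    for seed in sorted(adj):
        grown = grow_candidate(adj, seed, size)
        if grown is not None:
            members, cost = grown
            costs[members] = min(cost, costs.get(members, float("inf")))
    return costs

adj = {
    "g0": {"g1": 1.0, "g3": 10.0},
    "g1": {"g0": 1.0, "g2": 1.0},
    "g2": {"g1": 1.0, "g3": 4.0},
    "g3": {"g2": 4.0, "g0": 10.0},
}
for members, cost in candidate_costs(adj, 3).items():
    print(sorted(members), cost)
```

Seeding from every GPU yields several distinct candidates without enumerating all subsets, which is the property the subgraph size limit is meant to protect at cluster scale.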
6. The GPU allocation method based on GPU cluster topology according to claim 1, characterized in that S5 includes: obtaining the candidate subgraph extraction results from the global connection view, and generating an internal path list by counting the inter-node links using the path counting method, to determine the communication path length values; for the internal path list, replacing the cumulative weight parameters by integrating alternative paths, the weight parameters being obtained by adding the node distances segment by segment, to obtain the cumulative weight set; based on the cumulative weight set, applying the weight sum judgment to calculate the initial connection cost by summing the elements in the set, to obtain the cost distribution sorting data; executing a cost comparison mechanism on the cost distribution sorting data to compare the total weight values and determine the optimization order of the remaining subgraphs; and marking the subgraph whose total weight value is the smallest as the optimal subgraph, to obtain the optimal subgraph label with the smallest total weight.
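The comparison and ordering in S5 amount to ranking candidates by total weight. A minimal sketch, assuming the costs are already available as a mapping from candidate subgraph to total weight; the tie-break on member names is an assumption added only to make the ordering deterministic.

```python
def rank_subgraphs(costs):
    """Return (best, rest): the cheapest candidate and the remaining ones in order.

    costs: {frozenset_of_gpus: total_weight}
    """
    ranked = sorted(costs.items(), key=lambda item: (item[1], sorted(item[0])))
    if not ranked:
        return None, []
    return ranked[0], ranked[1:]

costs = {
    frozenset({"g2", "g3"}): 4.0,
    frozenset({"g1", "g2"}): 1.0,
    frozenset({"g0", "g1"}): 1.0,
}
best, rest = rank_subgraphs(costs)
print(sorted(best[0]), best[1])          # ['g0', 'g1'] 1.0
print([sorted(m) for m, _ in rest])      # [['g1', 'g2'], ['g2', 'g3']]
```

Keeping the remaining candidates in sorted order means the next-best subgraph is immediately available if the optimal one cannot be allocated.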
7. The GPU allocation method based on GPU cluster topology according to claim 1, characterized in that it further comprises S6: using an optimal subgraph marking and weight recording storage mechanism to store the connection paths and communication medium types of the optimal subgraph as the basis for allocation, and, if the optimal subgraph involves cross-node paths, preferentially selecting the NVLink-based high-speed interconnect medium paths, to determine the final GPU computing unit allocation scheme, specifically including: obtaining, through the optimal subgraph marking, the connection paths and communication medium types and storing them as the basis for allocation, to obtain a path storage set; and, for the path storage set, determining whether it involves cross-node paths and, if so, prioritizing the paths over NVLink interconnect links to determine the media path sequence.
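The NVLink preference in S6 can be illustrated as an ordering rule over the stored paths. This is a sketch under assumptions: the `StoredPath` record, the `node/gpu` endpoint naming, and the secondary ordering by weight are hypothetical and not taken from the claim.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredPath:
    src: str       # endpoint written as "node/gpu" (hypothetical naming)
    dst: str
    medium: str    # e.g. "nvlink", "pcie", "network"
    weight: float

def node_of(endpoint):
    """Extract the node name from a "node/gpu" endpoint."""
    return endpoint.split("/", 1)[0]

def is_cross_node(path):
    return node_of(path.src) != node_of(path.dst)

def media_path_sequence(paths):
    """Order the stored paths, putting NVLink first when any path crosses nodes."""
    if not any(is_cross_node(p) for p in paths):
        return list(paths)  # single-node subgraph: keep the stored order
    # False sorts before True, so NVLink paths lead; ties fall back to weight.
    return sorted(paths, key=lambda p: (p.medium != "nvlink", p.weight))

paths = [
    StoredPath("n0/g0", "n0/g1", "pcie", 4.0),
    StoredPath("n0/g1", "n1/g0", "network", 10.0),
    StoredPath("n0/g0", "n1/g0", "nvlink", 1.0),
]
print([p.medium for p in media_path_sequence(paths)])  # ['nvlink', 'pcie', 'network']
```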
8. The GPU allocation method based on GPU cluster topology according to claim 7, characterized in that S6 further includes: based on the media path sequence, integrating the weight recording mechanism to obtain the allocation basis parameters; and using the allocation basis parameters to generate a GPU computing unit allocation scheme and determine the final GPU computing unit allocation scheme.
9. The GPU allocation method based on GPU cluster topology according to claim 7, characterized in that it further comprises S7: updating the resource status in the global connection view according to the final allocation scheme, using a candidate subgraph storage method to save the unused subgraph data as a backup resource pool, and monitoring the status of the allocated GPU computing units in real time in combination with the healthy connection dataset, to obtain a dynamically adjusted cluster topology view, specifically including: using the healthy connection dataset to obtain connection metrics of the allocated GPU computing units, to obtain a status monitoring sequence; and, for the status monitoring sequence, determining whether there are any abnormal fluctuations and, if so, integrating the unused subgraph data in the backup resource pool to determine the reuse and adjustment path.
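The monitoring and fallback logic of S7 might look like the sketch below. The claim does not define "abnormal fluctuation", so the rule used here (the latest sample deviating from the mean of the earlier samples by more than a relative `tolerance`) is an assumption, as are the function names and the layout of the backup pool.

```python
from statistics import mean

def is_abnormal(samples, tolerance=0.5):
    """Flag a series whose latest sample strays from the earlier mean."""
    if len(samples) < 2:
        return False
    baseline = mean(samples[:-1])
    if baseline == 0:
        return samples[-1] != 0
    return abs(samples[-1] - baseline) / abs(baseline) > tolerance

def pick_replacement(monitoring, backup_pool):
    """Return the cheapest backup subgraph that avoids every abnormal GPU.

    monitoring:  {gpu: [metric samples, oldest first]}
    backup_pool: {frozenset_of_gpus: total_weight}
    Returns None when nothing is abnormal or no usable backup exists.
    """
    bad = {gpu for gpu, samples in monitoring.items() if is_abnormal(samples)}
    if not bad:
        return None
    usable = [(cost, sorted(members))
              for members, cost in backup_pool.items()
              if not (members & bad)]
    return min(usable)[1] if usable else None

monitoring = {"g0": [10.0, 10.0, 10.0], "g1": [10.0, 10.0, 30.0]}
backup_pool = {
    frozenset({"g1", "g2"}): 1.0,
    frozenset({"g2", "g3"}): 4.0,
    frozenset({"g4", "g5"}): 6.0,
}
print(pick_replacement(monitoring, backup_pool))  # ['g2', 'g3']
```

Here the cheapest backup subgraph is skipped because it contains the GPU whose metric spiked, so the next cheapest one is reused instead.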
10. The GPU allocation method based on GPU cluster topology according to claim 9, characterized in that S7 further includes: updating the resource distribution in the global connection view through the reuse and adjustment path, to obtain an intermediate topology; and generating the dynamically adjusted cluster topology view based on the intermediate topology in combination with real-time status monitoring.