A graphics card cluster computing power dynamic scheduling method and system
By acquiring sampling data from the graphics card and network, the computing power and bandwidth requirements of the graphics card cluster are calculated, and task priority scores and dynamic scheduling queues are generated. This solves the problem of low resource utilization of the graphics card cluster and achieves efficient resource allocation and task scheduling.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Application (China)
- Current Assignee / Owner
- TIANJIN HONGXIN WENHUA TECHNOLOGY CO LTD
- Filing Date
- 2026-04-13
- Publication Date
- 2026-06-30
AI Technical Summary
Existing methods for scheduling computing power in graphics card clusters cannot adapt to the dynamic changes in the task phase, resulting in some graphics card chips being idle or overloaded for a long time. It is difficult to balance computing power and bandwidth requirements, resulting in low resource utilization. Furthermore, there is a lack of real-time awareness of the collaborative operation status of graphics card chips, making it difficult to achieve dynamic optimization of resource allocation.
By acquiring the utilization rate and network traffic sampling sequence of each graphics card in the graphics card cluster, the operation phase label is determined, the peak computing power demand and peak network bandwidth demand are calculated, task priority scores are generated, the resource scheduling queue is reordered, and network path pre-allocation is performed in combination with the phase switching time window to generate a reserved bandwidth dynamic channel, thereby achieving precise resource configuration.
It improves the resource utilization and computing efficiency of graphics card clusters, solves the problem of resource allocation being out of sync with task requirements, breaks through the limitations of traditional scheduling methods, and achieves targeted task scheduling and avoidance of communication blockage under complex working conditions.
Figure CN122309170A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of graphics card chip technology, and in particular to a method and system for dynamic scheduling of computing power in a graphics card cluster.
Background Technology
[0002] Currently, in the field of graphics card chip technology, with the continuous explosion of demand for artificial intelligence training and large-scale scientific computing, the rationality of the computing power scheduling of graphics card clusters, as the core infrastructure of high-performance computing, is directly related to the execution efficiency and resource utilization value of computing tasks.
[0003] Existing GPU cluster computing power scheduling methods mainly rely on fixed partition allocation or single resource metric scheduling. For example, GPU computing power is allocated according to a preset ratio, or resources are allocated only based on memory usage, or the different needs of GPU chips during the computing and communication phases are ignored. However, this approach is clearly insufficient in complex operating environments. Fixed partitions cannot adapt to dynamic changes in task phases, resulting in some GPU chips being idle or overloaded for extended periods; single metric scheduling struggles to balance computing power and bandwidth requirements, easily leading to communication congestion; and there is a lack of real-time awareness of the collaborative operating status of GPU chips, especially in scenarios with multiple concurrent tasks and frequent phase switching, making it difficult to achieve dynamic optimization of resource allocation and resulting in low resource utilization.
[0004] In summary, existing technologies are insufficient to achieve precise dynamic scheduling of the computing power of graphics card clusters, and cannot meet the dual demands of high-performance computing efficiency and resource utilization in the field of graphics card chip technology.
Summary of the Invention
[0005] This invention provides a method and system for dynamic scheduling of computing power in a graphics card cluster, so as to achieve precise dynamic scheduling of computing power in the graphics card cluster and meet the dual requirements of high-performance computing efficiency and resource utilization in the field of graphics card chip technology.
[0006] In a first aspect, to solve the above technical problems, the present invention provides a method for dynamic scheduling of computing power in a graphics card cluster, comprising:
obtaining the utilization sampling sequence of each graphics card in the graphics card cluster and the network traffic sampling sequence;
determining operation phase labels based on the utilization sampling sequence and the traffic sampling sequence, and concatenating the operation phase labels to obtain a label sequence, the operation phase labels including a computationally intensive phase and a communication-intensive phase;
calculating, based on the label sequence, the peak computing power demand during the computationally intensive phase and the peak network bandwidth demand during the communication-intensive phase, respectively;
calculating a task priority score based on the peak computing power demand and the peak network bandwidth demand, and reordering the preset resource scheduling queue based on the task priority score to obtain the adjusted task queue;
filtering out a target graphics card group according to the adjusted task queue, and binding the target graphics card group to the tasks in the adjusted task queue to obtain the position allocation result;
pre-allocating network paths based on the position allocation result and the specific time window in which the label sequence switches phase, in combination with the pre-acquired gradient synchronization data volume, to obtain a reserved-bandwidth dynamic channel.
[0007] In a second aspect, the present invention provides a dynamic scheduling system for the computing power of a graphics card cluster, comprising:
a data acquisition module, used to obtain the utilization sampling sequence of each graphics card in the graphics card cluster and the network traffic sampling sequence;
a phase analysis module, used to determine operation phase labels based on the utilization sampling sequence and the traffic sampling sequence, and to concatenate the operation phase labels to obtain a label sequence, the operation phase labels including a computationally intensive phase and a communication-intensive phase;
a peak calculation module, used to calculate, based on the label sequence, the peak computing power demand during the computationally intensive phase and the peak network bandwidth demand during the communication-intensive phase, respectively;
a priority sorting module, used to calculate the task priority score based on the peak computing power demand and the peak network bandwidth demand, and to reorder the preset resource scheduling queue based on the task priority score to obtain the adjusted task queue;
a position allocation module, used to filter out a target graphics card group according to the adjusted task queue, and to bind the target graphics card group to the tasks in the adjusted task queue to obtain the position allocation result;
a path pre-allocation module, used to pre-allocate network paths based on the position allocation result and the specific time window in which the label sequence switches phase, in combination with the pre-acquired gradient synchronization data volume, to obtain a reserved-bandwidth dynamic channel.
[0008] Compared with the prior art, the present invention has the following beneficial effects: (1) This invention collects the utilization sampling sequence and network traffic sampling sequence of the graphics card cluster, and through timestamp alignment, outlier removal and operation stage label analysis, accurately distinguishes the computation-intensive and communication-intensive stages, and obtains the stage classification label sequence. It breaks through the limitation of traditional fixed partitions that cannot adapt to the dynamic changes of tasks, explores the correlation characteristics between graphics card operation and network transmission, eliminates equipment jitter and data noise interference, provides high-precision basic data support for computing power scheduling, effectively improves the accuracy of stage identification, and solves the problem of resource allocation and task requirements being out of sync.
[0009] (2) The present invention calculates the peak computing power demand and the peak network bandwidth demand according to the tag sequence, and generates a task priority score by weighted fusion. It reorders the resource scheduling queue, breaks through the limitation of traditional single index scheduling that makes it difficult to balance computing power and bandwidth, accurately captures the resource demand differences of different tasks, provides multi-dimensional basis for scheduling and sorting, significantly improves the targeting of task scheduling under complex working conditions, and makes up for the defects of existing technologies that are prone to communication blockage.
[0010] (3) This invention matches low-interference candidate graphics card groups according to the scheduling order, generates reserved bandwidth channels by pre-allocating network paths in combination with the stage switching time window, and iteratively updates the running status sequence to form a closed loop. It breaks through the limitations of traditional lack of dynamic adjustment and pre-allocation mechanism, provides a precise resource configuration basis for graphics card clusters, solves the resource contention problem caused by stage switching, takes into account both computing efficiency and resource utilization, and meets the stringent requirements of high-performance computing for the accuracy of computing power scheduling.
Attached Figure Description
[0011] Figure 1 is a schematic diagram of a method for dynamic scheduling of computing power in a graphics card cluster provided in the first embodiment of the present invention; Figure 2 is a schematic diagram of a dynamic scheduling system for computing power of a graphics card cluster provided in the second embodiment of the present invention.
Detailed Implementation
[0012] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0013] Referring to Figure 1, the first embodiment of the present invention provides a method for dynamic scheduling of computing power in a graphics card cluster, comprising the following steps:
S101, obtain the utilization sampling sequence of each graphics card in the graphics card cluster and the network traffic sampling sequence;
S102, determine operation phase labels based on the utilization sampling sequence and the traffic sampling sequence, and concatenate the operation phase labels to obtain a label sequence; the operation phase labels include a computationally intensive phase and a communication-intensive phase;
S103, calculate, based on the label sequence, the peak computing power demand during the computationally intensive phase and the peak network bandwidth demand during the communication-intensive phase;
S104, calculate the task priority score based on the peak computing power demand and the peak network bandwidth demand, and reorder the preset resource scheduling queue based on the task priority score to obtain the adjusted task queue;
S105, filter out a target graphics card group according to the adjusted task queue, and bind the target graphics card group to the tasks in the adjusted task queue to obtain the position allocation result;
S106, pre-allocate network paths based on the position allocation result and the specific time window in which the label sequence switches phase, in combination with the pre-acquired gradient synchronization data volume, to obtain a reserved-bandwidth dynamic channel.
[0014] In step S101, obtaining the utilization sampling sequence of each graphics card in the graphics card cluster and the network traffic sampling sequence includes:
collecting real-time utilization data of each graphics card in the graphics card cluster at a preset period to form an original utilization sequence;
synchronously collecting real-time transmission traffic data from network devices to form an original traffic sequence;
performing outlier removal on the original utilization sequence and the original traffic sequence to obtain the processed utilization sequence and the processed traffic sequence;
normalizing the data format of the processed utilization sequence and the processed traffic sequence, respectively, to obtain the graphics card utilization sampling sequence and the network traffic sampling sequence.
[0015] It should be noted that, firstly, real-time utilization data of each graphics card in the graphics card cluster is collected at a preset period to form the raw utilization sequence. This data is collected through hardware monitoring interfaces provided by the graphics card manufacturers (such as NVIDIA's NVML interface and AMD's ADL interface), covering key indicators such as graphics card computing core utilization and memory bandwidth utilization. The preset period is set based on historical task phase switching patterns. Statistics from the past three months show that over 90% of the phase switching intervals are between 0.3 and 1 second. Therefore, the basic period is set to 0.5 seconds, which captures phase changes without excessively consuming system resources. For complex tasks (such as ultra-large-scale model training), the period can be reduced to 0.2 seconds to improve data timeliness; for simple computational tasks, it can be increased to 1 second to reduce collection overhead. For example, an AI training cluster uses the NVML interface to collect the core utilization of 8 graphics cards at a 0.5-second interval, resulting in the original utilization sequence [98, 99, 97, 10, 96, 98, 99, 95], where 10 is an outlier caused by device jitter.
[0016] Next, real-time transmission traffic data from network devices is collected synchronously to form the raw traffic sequence. This is done through the network interface card's built-in monitoring module and the switch's traffic statistics interface. Collected metrics include uplink transmission rate, downlink transmission rate, and packet forwarding volume. The collection cycle is consistent with the GPU utilization collection cycle to ensure timestamp alignment and avoid data timing misalignment. The collection scope covers all core switches, access switches, and direct links between GPUs within the GPU cluster, comprehensively reflecting the network transmission status. For example, synchronously collecting the uplink transmission rate of the cluster's core switches at 0.5-second intervals yields a raw traffic sequence of [120Mbps, 130Mbps, 125Mbps, 500Mbps, 122Mbps, 128Mbps], where 500Mbps represents burst traffic data.
[0017] Outlier removal is performed on the original utilization and traffic sequences to obtain the processed utilization and traffic sequences, using the 3-standard-deviation rule combined with business logic verification. First, the mean and standard deviation of the sequence are calculated, and data more than 3 standard deviations from the mean are marked as suspected outliers. Business logic verification then confirms the anomaly (e.g., GPU utilization cannot exceed 100%, and network traffic does not drop to 0 instantly unless there is a service interruption). This threshold is based on statistics from more than 100,000 normal operation data points: over 99.7% of normal data falls within 3 standard deviations, while outliers caused by equipment jitter usually fall outside this range. In industrial scenarios, the threshold can be relaxed to 4 standard deviations to balance anomaly removal accuracy and data integrity. For example, the mean of a GPU's original utilization sequence is 85 and the standard deviation is 8, so the mean ± 3 standard deviations spans 61 to 109. The value 10 in the sequence falls below the lower limit, and there is no service interruption record, so it is determined to be an outlier caused by equipment jitter and is removed.
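The check described in [0017] can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, its parameters, and the choice to pass the mean and standard deviation in as a precomputed baseline (85 and 8, as in the example) are assumptions made for the sketch.

```python
def remove_outliers(samples, baseline_mean, baseline_std, k=3.0,
                    lower_bound=0.0, upper_bound=100.0,
                    interruption_recorded=False):
    """Drop samples outside baseline_mean +/- k * baseline_std.

    A statistically suspect sample is kept only when a recorded service
    interruption explains it (the "business logic" check). Samples that
    violate the physical bounds are always dropped.
    """
    low = baseline_mean - k * baseline_std
    high = baseline_mean + k * baseline_std
    kept = []
    for value in samples:
        if not (lower_bound <= value <= upper_bound):
            continue  # physically impossible reading, e.g. utilization > 100%
        suspect = not (low <= value <= high)
        if suspect and not interruption_recorded:
            continue  # suspected outlier with no business explanation
        kept.append(value)
    return kept


raw_utilization = [98, 99, 97, 10, 96, 98, 99, 95]
cleaned = remove_outliers(raw_utilization, baseline_mean=85, baseline_std=8)
print(cleaned)  # [98, 99, 97, 96, 98, 99, 95]
```

The sketch simply drops rejected samples; a production collector would more likely mark them as missing so that the later interpolation step can fill the gap at the right timestamp.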
[0018] Subsequently, the processed sequences are formatted to standardize the time granularity and numerical range. When obtaining the GPU utilization sampling sequence and network traffic sampling sequence, the time granularity is standardized to the sampling period (e.g., 0.5 seconds). Missing data is supplemented using linear interpolation to ensure sequence continuity. The numerical range is mapped to the [0,1] interval through minimum-maximum normalization, eliminating magnitude differences between different indicators and different devices. For example, the processed GPU utilization sequence is [98, 99, 97, 96, 98, 99, 95], with a minimum of 95 and a maximum of 99. After normalization, it becomes [0.75, 1.0, 0.5, 0.25, 0.75, 1.0, 0.0], with a standardized time granularity of 0.5 seconds, forming a regular utilization sampling sequence.
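The two operations in [0018], gap filling by linear interpolation and min-max normalization, can be sketched in plain Python. The function names and the use of `None` to mark a missing sample are assumptions of this sketch.

```python
def fill_missing_linear(values):
    """Fill None gaps by linear interpolation between the nearest known samples.

    Gaps at either end are filled with the nearest known value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("sequence contains no known samples")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def min_max_normalize(values):
    """Map values linearly onto [0, 1]; a constant sequence maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


processed = [98, 99, 97, 96, 98, 99, 95]
print(min_max_normalize(processed))
# [0.75, 1.0, 0.5, 0.25, 0.75, 1.0, 0.0]
```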
[0019] In step S102, determining the operation phase labels based on the utilization sampling sequence and the traffic sampling sequence, and concatenating the operation phase labels to obtain a label sequence, includes:
extracting the timestamps of the utilization sampling sequence and the traffic sampling sequence, and associating and binding the utilization data and traffic data corresponding to the same timestamp to obtain the associated dataset;
cutting the associated dataset according to a fixed time window to obtain multiple consecutive data blocks to be analyzed;
performing feature extraction on each of the data blocks to be analyzed to determine the operation phase label corresponding to the data block;
concatenating all operation phase labels in time window order to form a label sequence for phase classification.
[0020] It should be noted that, firstly, when extracting the timestamps from the utilization rate sampling sequence and the traffic sampling sequence, and associating and binding utilization rate data and traffic data corresponding to the same timestamp, the timestamp is marked with the system time at the time of collection (accurate to milliseconds), ensuring that the time base of the two types of data is consistent. The association and binding uses a timestamp difference matching rule, setting a maximum allowable difference of 10 milliseconds. That is, when the timestamp difference between utilization rate data and traffic data is within 10 milliseconds, they are considered as matching data from the same time point and are associated and stored. This difference threshold is based on the statistical setting of the synchronization accuracy of the collection devices. The time difference of most synchronously collected devices does not exceed 5 milliseconds, and the 10-millisecond threshold can cover minor synchronization deviations. Industrial-grade high-precision clusters can lower the threshold to 5 milliseconds, while ordinary clusters can relax it to 15 milliseconds. For example, the value corresponding to timestamp 1699999999000 in the utilization rate data is 0.98, and the value corresponding to timestamp 1699999999008 in the traffic data is 0.25. The difference of 8 milliseconds is less than 10 milliseconds, so the two are associated and bound as a set of matching data.
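A minimal sketch of the timestamp difference matching rule in [0020] is given below. It assumes each sample is a `(timestamp_ms, value)` tuple and pairs each utilization sample with the first unused traffic sample inside the tolerance; the patent does not specify how ties between several candidates are resolved, so that greedy choice is an assumption.

```python
def associate_by_timestamp(util_samples, traffic_samples, max_diff_ms=10):
    """Pair (timestamp_ms, value) samples whose timestamps differ by <= max_diff_ms.

    Returns a list of (util_timestamp_ms, util_value, traffic_value) tuples.
    Each traffic sample is used at most once.
    """
    traffic_sorted = sorted(traffic_samples)
    pairs = []
    j = 0
    for ts_u, util in sorted(util_samples):
        # Skip traffic samples that are too old to match this or any later sample.
        while j < len(traffic_sorted) and traffic_sorted[j][0] < ts_u - max_diff_ms:
            j += 1
        if j < len(traffic_sorted) and abs(traffic_sorted[j][0] - ts_u) <= max_diff_ms:
            pairs.append((ts_u, util, traffic_sorted[j][1]))
            j += 1
    return pairs


util = [(1699999999000, 0.98)]
traffic = [(1699999999008, 0.25)]
print(associate_by_timestamp(util, traffic))
# [(1699999999000, 0.98, 0.25)]
```

Tightening `max_diff_ms` to 5 or relaxing it to 15 corresponds to the high-precision and ordinary cluster settings mentioned above.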
[0021] Next, the associated dataset is truncated according to a fixed time window. When multiple consecutive data blocks to be analyzed are obtained, the size of the fixed time window is set based on the continuous pattern of historical task stages. Statistics on the stage switching intervals of various GPU cluster tasks over the past three months show that more than 90% of the running stages last between 10 and 60 seconds. Combined with a 0.5-second acquisition cycle, the basic time window is set to 20 sampling points (corresponding to 10 seconds), which can fully capture stage features without causing stage confusion due to an excessively large window. For tasks with long stages such as ultra-large-scale model training, the time window can be adjusted to 30 sampling points (15 seconds), and for short-cycle computation tasks, it can be adjusted to 10 sampling points (5 seconds). Those skilled in the art can flexibly adjust this according to the task type. During truncation, the data blocks are continuously slid in chronological order, with the sliding step size consistent with the acquisition cycle (0.5 seconds) to ensure continuous and complete data block coverage.
[0022] For example, the associated dataset contains 100 sets of matching data. Sliding a time window of 20 sampling points over it with a step of one sampling point yields 81 data blocks to be analyzed, each covering the running data within 10 seconds; adjacent data blocks overlap by 19 sampling points, which ensures that no stage transition is lost. Cutting the same dataset into non-overlapping windows of 20 sampling points would instead yield 5 data blocks.
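The number of data blocks depends on the sliding step, which the short sketch below makes concrete. The helper name `sliding_windows` is hypothetical; the window of 20 sampling points and the 100 matched records come from the text.

```python
def sliding_windows(records, window_size=20, step=1):
    """Return every full window of window_size records, advancing by step."""
    last_start = len(records) - window_size
    return [records[i:i + window_size] for i in range(0, last_start + 1, step)]


records = list(range(100))  # stand-in for 100 matched (utilization, traffic) records
overlapping = sliding_windows(records, window_size=20, step=1)
disjoint = sliding_windows(records, window_size=20, step=20)
print(len(overlapping), len(disjoint))  # 81 5
```

With a step of one sampling point, consecutive windows share 19 records; with a step equal to the window size, they share none.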
[0023] Subsequently, feature extraction is performed on each data block to be analyzed to determine its operation stage label. The extracted features cover utilization and traffic. Utilization features are the mean, the variance, and the peak percentage (the share of sampling points with values greater than 0.8). Traffic features are the mean, the growth rate, and the peak percentage (the share of sampling points with values greater than 0.7). All features are mapped to the [0,1] interval using min-max normalization. Stage determination uses an LSTM+Attention model. The labeled dataset contains over 50,000 data blocks annotated with operation stage labels (computationally intensive and communication-intensive blocks each account for 50%) and is divided into training and validation sets in a 7:3 ratio. The model has 3 LSTM hidden layers with 128 nodes per layer and 4 attention heads, and is trained with the AdamW optimizer. The initial learning rate is 0.001 and is multiplied by 0.1 every 20 rounds until it reaches 0.0001. The loss function is cross-entropy loss. Validation accuracy is monitored in real time during training, and iteration stops when it fluctuates by less than 0.002 for 12 consecutive rounds. The judgment rule is as follows: if the probability of the computationally intensive stage output by the model is greater than 0.6, the block is marked as a computationally intensive stage; if the probability of the communication-intensive stage is greater than 0.6, it is marked as a communication-intensive stage; if both are less than 0.6, the judgment falls back on business logic (for example, if the previous stage was computationally intensive, that label is carried over).
[0024] For example, the average utilization rate of a certain data block to be analyzed is 0.92, the peak percentage is 0.85, the average traffic is 0.2, and the peak percentage is 0.1. The model calculates and outputs a probability of 0.92 for the computationally intensive stage, thus identifying it as a computationally intensive stage.
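The judgment rule applied to the model output can be sketched independently of the LSTM+Attention model itself. In the sketch below the two probabilities are taken as given inputs; the label strings, function names, and the behaviour of returning the previous label unchanged in the ambiguous case are assumptions.

```python
def peak_share(block, level):
    """Fraction of sampling points in the block whose value exceeds level."""
    return sum(value > level for value in block) / len(block)


def decide_phase(p_compute, p_communication, previous_label=None, threshold=0.6):
    """Apply the 0.6 probability rule, carrying the previous label when ambiguous."""
    if p_compute > threshold:
        return "compute-intensive"
    if p_communication > threshold:
        return "communication-intensive"
    # Neither class is confident enough: fall back to the previous block's label.
    return previous_label


print(decide_phase(0.92, 0.08))
# compute-intensive
print(decide_phase(0.55, 0.45, previous_label="compute-intensive"))
# compute-intensive
```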
[0025] Finally, when concatenating all runtime stage labels in time window order to form a stage-classified label sequence, consecutive identical runtime stage labels are merged during the concatenation process, retaining only stage switching nodes to reduce sequence redundancy. For example, if three consecutive data blocks are all identified as computationally intensive stages, they are merged into a single "computationally intensive" label segment, marked with start and end timestamps. When the runtime stage label of a data block changes, a new corresponding label segment is added and concatenated sequentially in time. For example, the runtime stage labels of the five data blocks to be analyzed are computationally intensive, computationally intensive, communication intensive, computationally intensive, and computationally intensive, respectively. The resulting label sequence after concatenation is [computationally intensive (10-30 seconds), communication intensive (30-40 seconds), computationally intensive (40-60 seconds)], clearly showing the stage switching pattern.
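Merging consecutive identical labels is a run-length grouping. The sketch below reproduces the five-block example, assuming non-overlapping 10-second blocks with the first block starting at second 10; both values are inferred from the example intervals, not stated as parameters in the text.

```python
def merge_labels(block_labels, block_seconds=10, start_second=10):
    """Collapse consecutive identical labels into (label, start_s, end_s) segments."""
    segments = []
    for index, label in enumerate(block_labels):
        begin = start_second + index * block_seconds
        end = begin + block_seconds
        if segments and segments[-1][0] == label:
            # Same phase as the previous block: extend the open segment.
            segments[-1] = (label, segments[-1][1], end)
        else:
            segments.append((label, begin, end))
    return segments


labels = ["compute", "compute", "communication", "compute", "compute"]
print(merge_labels(labels))
# [('compute', 10, 30), ('communication', 30, 40), ('compute', 40, 60)]
```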
[0026] In step S103, calculating the peak computing power demand during the computationally intensive phase and the peak network bandwidth demand during the communication-intensive phase based on the label sequence includes:
traversing the label sequence, filtering out the time intervals marked as computationally intensive stages, and extracting the training batch size and model parameter count within the corresponding time intervals;
calculating the peak computing power demand during the computationally intensive phase by multiplying the training batch size by the model parameter count and combining the result with the peak GPU utilization data;
filtering out the time intervals marked as communication-intensive phases in the label sequence, and extracting the gradient synchronization data volume and communication duration within the corresponding intervals;
calculating the peak network bandwidth demand during the communication-intensive phase by dividing the gradient synchronization data volume by the communication duration and combining the result with the peak network traffic data.
[0027] It should be noted that, firstly, when filtering the time intervals marked as computationally intensive stages for the label sequence, the label types are matched segment by segment in chronological order to extract the start and end timestamps of each computationally intensive stage, thus clarifying the corresponding time interval. The training batch size is obtained from the configuration file or runtime logs of the distributed training framework and represents the number of samples in the global training batch; the model parameter count is extracted through the parameter statistics interface provided by the framework, including trainable parameters and fixed parameters (such as embedding layer parameters), and is uniformly counted in units of "number". Both must accurately correspond to the time intervals of the computationally intensive stages to ensure that the extracted parameters belong to the actual runtime configuration of that stage.
[0028] In this implementation case, the peak computing power requirement during the computationally intensive phase is calculated by multiplying the training batch size by the number of model parameters and combining this with peak GPU utilization data. First, the product of the training batch size and the number of model parameters is calculated to obtain the theoretical computational load per batch. The peak GPU utilization data is extracted from the utilization sampling sequence of this computationally intensive phase, taking the maximum value in the sequence (normalized and needing to be restored to the original percentage). A computing power conversion factor is introduced, based on the GPU's FP32 / FP16 / FP8 computing power specifications. The factor is set to 0.5 for FP16 precision (0.5 floating-point operations per parameter per batch), 1.0 for FP32 precision, and 0.25 for FP8 precision, and can be flexibly adjusted according to the training precision. Peak computing power requirement = (Training batch size × Number of model parameters × Computing power conversion factor) × (Peak GPU utilization / 100), in FLOPS.
[0029] For example, with a training batch size of 2048, a model parameter count of 175 billion, a computing power conversion factor of 0.5 at FP16 precision, and a peak GPU utilization of 98%, the calculated peak computing power requirement is 2048 × 1.75e11 × 0.5 × 0.98 ≈ 1.76e14 FLOPS (about 0.176 PFLOPS).
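The formula from [0028] can be evaluated directly. The sketch below uses the conversion factors stated there; the dictionary and function names are illustrative only.

```python
# Conversion factors per training precision, as given in paragraph [0028].
PRECISION_FACTOR = {"fp32": 1.0, "fp16": 0.5, "fp8": 0.25}


def peak_compute_demand(batch_size, param_count, precision, peak_util_percent):
    """batch size x parameter count x conversion factor x (peak utilization / 100)."""
    factor = PRECISION_FACTOR[precision]
    return batch_size * param_count * factor * (peak_util_percent / 100.0)


demand = peak_compute_demand(2048, 175e9, "fp16", 98)
print(f"{demand:.3e}")  # 1.756e+14
```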
[0030] Subsequently, the time intervals marked as communication-intensive stages are filtered out of the label sequence, and the gradient synchronization data volume and communication duration within those intervals are extracted. Communication-intensive stages are selected by label type to determine their time intervals. The gradient synchronization data volume is extracted from the communication logs of the distributed training framework and represents the total data volume transmitted by communication operations such as All-Reduce and Broadcast within that stage, with the unit uniformly set to bytes. The communication duration is the difference between the end timestamp and the start timestamp of that stage, in seconds, so that the duration matches the communication process that produced the data volume.
[0031] For example, in a certain communication-intensive phase of the tag sequence, the time interval is 30-40 seconds. The amount of gradient synchronization data extracted from the communication log is 500GB (500×1024^3 bytes), and the communication duration is 10 seconds, which accurately reflects the scale and time of communication data transmission in this phase.
[0032] Finally, by dividing the gradient synchronization data volume by the communication duration and combining this with the peak network traffic data, the peak network bandwidth demand during the communication-intensive phase is calculated. First, the theoretical bandwidth demand (in bytes / second) is obtained by dividing the gradient synchronization data volume by the communication duration. The peak network traffic data is extracted from the traffic sampling sequence of this communication-intensive phase, taking the maximum value in the sequence (normalized and needing to be restored to the original rate). A peak traffic ratio coefficient is introduced, which is statistically set based on the ratio of actual traffic to theoretical bandwidth during historical communication phases. The base value is 0.9; for clusters with complex network topologies, this can be increased to 0.95, and for simple topology clusters, it can be decreased to 0.85. Peak network bandwidth demand = (gradient synchronization data volume / communication duration) × peak traffic ratio coefficient, with the unit uniformly set to GB / s (1GB / s = 1024^3 bytes / second).
[0033] For example, with a gradient synchronization data volume of 500GB, a communication duration of 10 seconds, a theoretical bandwidth requirement of 50GB / s, and a network traffic peak ratio of 0.95, the calculated peak network bandwidth requirement is 50 × 0.95 = 47.5GB / s.
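The bandwidth formula from [0032] reduces to one division and one multiplication. The sketch below keeps the document's convention that 1 GB = 1024^3 bytes; the function and constant names are assumptions.

```python
BYTES_PER_GB = 1024 ** 3  # the document defines 1 GB/s as 1024^3 bytes per second


def peak_bandwidth_demand(sync_bytes, duration_s, peak_ratio=0.9):
    """(gradient synchronization volume / communication duration) x ratio, in GB/s."""
    if duration_s <= 0:
        raise ValueError("communication duration must be positive")
    return sync_bytes / duration_s * peak_ratio / BYTES_PER_GB


bandwidth = peak_bandwidth_demand(500 * BYTES_PER_GB, 10, peak_ratio=0.95)
print(f"{bandwidth:.1f} GB/s")  # 47.5 GB/s
```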
[0034] In step S104, calculating a task priority score based on the peak computing power demand and the peak network bandwidth demand, and reordering a preset resource scheduling queue based on the task priority score to obtain an adjusted task queue, includes:
invoking the weight coefficients pre-configured for the peak computing power demand and the peak network bandwidth demand, respectively;
computing the weighted sum of the peak computing power demand and the peak network bandwidth demand with their corresponding weight coefficients to obtain the priority score of each task to be assigned;
rearranging the tasks in the preset resource scheduling queue in descending order of priority score to obtain the adjusted task queue.
[0035] It should be noted that, firstly, when configuring preset weight coefficients for peak computing power demand and peak network bandwidth demand, the weight combination is set based on feedback from historical scheduling data. Statistics on the scheduling success rate and resource utilization of various tasks over the past year show that the peak computing power demand has a slightly greater impact on task execution efficiency than the peak network bandwidth demand. Therefore, the weight of the peak computing power demand is set to 0.55, and the weight of the peak network bandwidth demand is set to 0.45. This combination has been verified in multiple scenarios and can ensure efficient execution of computationally intensive tasks while also considering the bandwidth requirements of communication-intensive tasks. In computationally intensive scenarios such as supercomputing centers, the computing power weight can be increased to 0.6, and in scenarios with frequent communication such as cloud service clusters, the bandwidth weight can be increased to 0.5. Those skilled in the art can flexibly adjust this according to the cluster's positioning. For example, for a very large-scale model training task with extremely high computing power demand, increasing the computing power weight to 0.6 and setting the network bandwidth weight to 0.4 better aligns with the core resource requirements of the task.
[0036] Next, the peak computing power demand and the peak network bandwidth demand are weighted by their corresponding coefficients and summed to obtain the priority score of each task to be assigned. The two peak values are first min-max normalized, mapping them to the [0,1] interval to eliminate the bias caused by their different magnitudes. The two normalized indicators are then multiplied by the preset weights and summed to obtain the priority score, which ranges from 0 to 1; a higher value indicates that the task's resource needs are more urgent and important. For example, if a task has a normalized peak computing power demand of 0.9 and a normalized peak network bandwidth demand of 0.7, then with weights of 0.55 and 0.45 the priority score is 0.9 × 0.55 + 0.7 × 0.45 = 0.495 + 0.315 = 0.81.
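By way of illustration only, the normalization and weighted summation described above can be sketched as follows. The function names and the handling of a degenerate range (minimum equal to maximum) are assumptions made for this sketch and are not part of the described method.

```python
def min_max(value, lo, hi):
    """Map a raw value into [0, 1] by min-max normalization."""
    if hi == lo:
        # Assumption: a degenerate range maps to 0.0 to avoid division by zero.
        return 0.0
    return (value - lo) / (hi - lo)


def priority_score(compute_norm, bandwidth_norm, w_compute=0.55, w_bandwidth=0.45):
    """Weighted sum of the two normalized peak demands (weights sum to 1)."""
    return w_compute * compute_norm + w_bandwidth * bandwidth_norm


# Worked example from the text: 0.9 * 0.55 + 0.7 * 0.45
score = priority_score(0.9, 0.7)
print(round(score, 2))
```

Because both inputs lie in [0,1] and the weights sum to 1, the resulting score also lies in [0,1].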
[0037] Subsequently, tasks in the preset resource scheduling queue are rearranged by priority score from highest to lowest. If multiple tasks have the same score, they are sorted by submission time, with earlier submissions taking precedence. Sorting produces a new scheduling queue in which high-priority tasks are concentrated at the front and receive priority access to the computing power and bandwidth resources of the GPU cluster, while ordinary-priority tasks follow in order of their scores. For example, suppose the preset scheduling queue contains four tasks with scores of 0.81, 0.78, 0.68, and 0.55. The first two exceed the basic threshold of 0.7, and the third (0.68) clears it only in a scenario where the threshold is lowered. After reordering, the adjusted scheduling order is 0.81→0.78→0.68→0.55, with the higher-priority tasks entering the resource allocation process first.
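The reordering rule (score descending, earlier submission first on ties) can be sketched as below. The `Task` record and its field names are hypothetical and serve only to make the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str         # hypothetical identifier
    score: float         # priority score in [0, 1]
    submitted_at: float  # submission time in seconds (assumed unit)


def reorder_queue(queue):
    """Sort by score from high to low; ties go to the earlier submission."""
    return sorted(queue, key=lambda t: (-t.score, t.submitted_at))


queue = [
    Task("T3", 0.68, 3.0),
    Task("T1", 0.81, 5.0),
    Task("T4", 0.55, 1.0),
    Task("T2", 0.78, 2.0),
]
adjusted = reorder_queue(queue)
print([t.score for t in adjusted])
```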
[0038] In step S105, the step of filtering out the target graphics card group according to the adjusted task queue and binding the target graphics card group to the tasks in the adjusted task queue to obtain the position allocation result includes: determining, based on the adjusted task queue, the tasks to be assigned and their resource requirement types; selecting, based on the resource requirement type, a suitable candidate graphics card group, which includes an idle graphics card group and a graphics card group whose network topology distance is within a preset distance threshold; collecting real-time network traffic fluctuation data and task conflict records of the candidate graphics card group, and calculating the quantified value of the interference level; if the quantified value of the interference level is lower than the preset traffic interference threshold, determining that the candidate graphics card group meets the task operation requirements and using it as the target graphics card group; and binding the target graphics card group to the task to be assigned and recording the correspondence between the graphics card group identifier and the task identifier to obtain the position allocation result.
[0039] It should be noted that, firstly, when the resource requirement type of a task to be assigned is determined from the adjusted task queue, the judgment uses the respective shares of the peak computing power demand and the peak network bandwidth demand in the priority score: a computing power share higher than 60% indicates a compute-intensive task, a bandwidth share higher than 60% indicates a communication-intensive task, and a task in which neither share exceeds 60% is treated as a mixed task. When suitable candidate graphics card groups are selected, the criteria for an idle graphics card group are an average graphics card utilization rate lower than 30% (this threshold is based on graphics card rated load statistics; below 30%, high computing power demands can be responded to quickly) and a graphics card core frequency that is stable above 90% of its rated value. Graphics card groups whose network topology distance is within the preset distance threshold are selected through cluster network topology calculations, requiring a hop count ≤ 2 from the switch of the task source node; the fewer the hops, the lower the communication latency. For example, if a task has a computing power share of 75%, it is determined to be a compute-intensive task, and 3 groups of idle graphics cards with an average utilization rate of 25% and stable core frequencies are selected as candidates.
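A minimal sketch of the classification and candidate screening follows. It assumes that the "share" of each demand is its weighted contribution divided by the sum of both contributions, and that a candidate group must satisfy the idle criteria and the hop limit together; both are interpretations made for this sketch, and the function names are hypothetical.

```python
def classify_task(compute_term, bandwidth_term, cutoff=0.60):
    """Classify a task from the two weighted contributions to its score."""
    total = compute_term + bandwidth_term
    if total <= 0:
        raise ValueError("at least one contribution must be positive")
    compute_share = compute_term / total
    if compute_share > cutoff:
        return "compute-intensive"
    if 1.0 - compute_share > cutoff:
        return "communication-intensive"
    return "mixed"


def is_candidate(avg_utilization, freq_ratio, hops):
    """Idle criteria (utilization < 30%, core frequency at least 90% of
    rated) combined with the topology limit of at most 2 hops (assumed)."""
    return avg_utilization < 0.30 and freq_ratio >= 0.90 and hops <= 2


print(classify_task(0.75, 0.25))      # compute share of 75%
print(is_candidate(0.25, 0.95, 2))
```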
[0040] Next, real-time network traffic fluctuation data and task conflict records of candidate graphics card groups are collected. When calculating the quantification value of interference level, real-time network traffic fluctuation data is collected at 0.5-second intervals through the switch monitoring interface, and the variance of the traffic data in the past minute is calculated. The larger the variance, the more severe the fluctuation. The task conflict records count the number of times the graphics card group competes for resources with other tasks in the past 5 minutes, such as bandwidth occupation conflicts and computing power scheduling conflicts. The quantification value of interference level is obtained by weighted summation of the normalized value of traffic fluctuation variance and the normalized value of task conflict count. The weight of traffic fluctuation is set to 0.6 (network fluctuation has a more direct impact on communication), and the weight of task conflict is set to 0.4. Both are mapped to the [0,1] interval through minimum-maximum normalization. The final quantification value takes the value of 0-1, and the larger the value, the more severe the interference. For example, if the normalized value of traffic variance of a candidate graphics card group in the past minute is 0.2 and the normalized value of task conflict count in the past 5 minutes is 0.1, the quantification value of interference level is 0.2×0.6 + 0.1×0.4=0.16, which is a low level of interference.
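For illustration, the interference calculation can be sketched as follows. The population variance is used here as one reasonable reading of "variance of the traffic data", and the function names are assumptions; normalization bounds would come from the cluster's own statistics and are not shown.

```python
from statistics import pvariance


def traffic_variance(samples):
    """Variance of traffic samples taken at 0.5-second intervals
    (population variance is an assumption of this sketch)."""
    return pvariance(samples)


def interference_score(variance_norm, conflict_norm, w_traffic=0.6, w_conflict=0.4):
    """Weighted sum of the normalized traffic variance and conflict count."""
    return w_traffic * variance_norm + w_conflict * conflict_norm


# Worked example from the text: 0.2 * 0.6 + 0.1 * 0.4
print(round(interference_score(0.2, 0.1), 2))
```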
[0041] If the interference level quantification value is lower than the preset traffic interference threshold, the candidate graphics card group is deemed to meet the task operation requirements. The traffic interference threshold is set based on historical task operation stability statistics. Analysis of data from the past year shows that when the interference level quantification value is below 0.3, the task interruption rate is below 2%, and communication latency fluctuation is less than 10 milliseconds. Therefore, the base threshold is set to 0.3. Computationally intensive tasks are less sensitive to interference and can be adjusted upwards to 0.35; communication-intensive tasks are more sensitive to interference and can be adjusted downwards to 0.25. Those skilled in the art can flexibly adjust this according to the task type. For example, if the interference level quantification value of a candidate graphics card group for a communication-intensive task is 0.23, which is lower than the downward adjustment threshold of 0.25, the graphics card group is deemed to meet the operation requirements.
[0042] Subsequently, candidate graphics card groups that meet the requirements are bound to the tasks to be assigned, and the correspondence between graphics card group identifiers and task identifiers is recorded. When obtaining the location allocation results, the binding priority is sorted by the interference level quantization value from smallest to largest, with the candidate group with the lowest interference being bound first. If multiple candidate groups have the same interference level, they are bound according to the idle computing power of the graphics card group from highest to lowest (for compute-intensive tasks) or the network bandwidth from highest to lowest (for communication-intensive tasks). After binding, a location allocation table is generated, specifying the task identifier, graphics card group identifier, binding time, and resource allocation quota to ensure accurate correspondence between tasks and graphics card groups. For example, if the interference level quantization values of three candidate graphics card groups are 0.16, 0.23, and 0.28 respectively, sorted from smallest to largest, the first group with the lowest interference is bound to the task, and the correspondence between task identifier T001 and graphics card group identifier G003 is recorded to form the location allocation results.
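The threshold check and binding order described above can be sketched as follows. The dictionary layout, the identifiers other than T001 and G003, and the use of idle computing power as the tie-breaker for mixed tasks are assumptions of this sketch; the text specifies the tie-breaker only for compute-intensive and communication-intensive tasks.

```python
# Interference thresholds per task type, as given in the description.
THRESHOLDS = {
    "compute-intensive": 0.35,
    "communication-intensive": 0.25,
    "mixed": 0.30,
}


def pick_group(candidates, task_type):
    """Return the eligible candidate with the lowest interference, or None."""
    limit = THRESHOLDS[task_type]
    eligible = [c for c in candidates if c["interference"] < limit]
    if not eligible:
        return None
    # Tie-breaker: bandwidth for communication-intensive tasks,
    # idle compute otherwise (assumed for mixed tasks).
    tie_key = "bandwidth" if task_type == "communication-intensive" else "idle_compute"
    eligible.sort(key=lambda c: (c["interference"], -c[tie_key]))
    return eligible[0]


candidates = [
    {"group_id": "G004", "interference": 0.23, "idle_compute": 0.7, "bandwidth": 60},
    {"group_id": "G003", "interference": 0.16, "idle_compute": 0.8, "bandwidth": 80},
    {"group_id": "G007", "interference": 0.28, "idle_compute": 0.9, "bandwidth": 40},
]
chosen = pick_group(candidates, "communication-intensive")
allocation = {"task_id": "T001", "group_id": chosen["group_id"]}
print(allocation)
```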
[0043] In step S106, the step of pre-allocating network paths based on the position allocation result and the specific time window of phase switching calculated from the label sequence, in combination with the pre-acquired gradient synchronization data volume, to obtain a reserved bandwidth dynamic channel includes: determining, based on the position allocation result, the communication link of the graphics card group corresponding to the task to be assigned; analyzing the temporal pattern of phase switching in the label sequence, calculating the critical time point of phase switching, and delineating the specific time window immediately before the communication-intensive phase begins; extracting the amount of gradient synchronization data to be transmitted within the specific time window, and calculating the instantaneous bandwidth requirement that satisfies the transmission; and selecting, in the communication link of the graphics card group, an idle physical path with a carrying capacity not lower than the instantaneous bandwidth requirement, and configuring bandwidth reservation rules on the idle physical path to generate the reserved bandwidth dynamic channel.
[0044] It should be noted that, when determining the communication link for the graphics card group corresponding to the task to be assigned based on the location allocation result, the location allocation result clearly defines the source graphics card group and the target graphics card group bound to the task (such as the local graphics card group and the remote synchronization graphics card group in a multi-card collaborative task). The communication link is generated based on the cluster network topology map. A depth-first search algorithm is used to traverse the physical connection paths between graphics card groups to select the main path with the fewest hops and the lowest link loss, while reserving two backup paths to cope with sudden failures. The link information includes the switch port identifier, the link bandwidth limit, and the transmission delay baseline value to ensure that the link is traceable and monitorable.
[0045] For example, a task is bound to source graphics card group G003 and target graphics card group G005. Topology search finds the main communication link to be G003 → core switch S1 → access switch S3 → G005, with a link bandwidth limit of 80 GB/s and a baseline latency of 1.2 milliseconds.
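The path search can be illustrated with a depth-first enumeration of loop-free paths, ranked by hop count. The adjacency map below is a hypothetical topology (switch S2 and its links are invented for the example), and link loss is ignored for brevity, so this is only a sketch of the traversal, not of the full selection criteria.

```python
def all_simple_paths(graph, src, dst, path=None):
    """Depth-first enumeration of loop-free paths from src to dst."""
    path = (path or []) + [src]
    if src == dst:
        yield path
        return
    for nxt in graph.get(src, []):
        if nxt not in path:  # never revisit a node on the current path
            yield from all_simple_paths(graph, nxt, dst, path)


def rank_paths(graph, src, dst, backups=2):
    """Return (main path, backup paths), preferring fewer hops."""
    paths = sorted(all_simple_paths(graph, src, dst), key=len)
    if not paths:
        return None, []
    return paths[0], paths[1:1 + backups]


# Hypothetical topology; only G003, S1, S3, and G005 appear in the text.
topology = {
    "G003": ["S1", "S2"],
    "S1": ["G003", "S3", "S2"],
    "S2": ["G003", "S1", "S3"],
    "S3": ["S1", "S2", "G005"],
    "G005": ["S3"],
}
main, backup = rank_paths(topology, "G003", "G005")
print(main)
print(backup)
```

Exhaustive enumeration is exponential in the worst case, so a production scheduler would likely bound the search depth or use a shortest-path algorithm instead.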
[0046] Subsequently, the temporal pattern of phase switching in the label sequence is analyzed, the critical time point of phase switching is calculated, and the specific time window immediately before the communication-intensive phase begins is delineated. An LSTM time series prediction model is used to analyze the label sequence. The model's dataset contains over 30,000 historical phase switching data points covering the switching patterns of different task types, and is divided into training and validation sets in a 7:3 ratio. The model has three hidden layers of 128 nodes each. The optimizer is AdamW with an initial learning rate of 0.001, which is multiplied by 0.1 every 25 epochs until it reaches 0.0001, and the loss function is mean squared error. During training, the prediction error on the validation set is monitored in real time, and iteration stops when its fluctuation stays below 0.002 for 15 consecutive epochs. The critical time point is a preset warning point before the end of the computationally intensive phase, that is, before the communication-intensive phase begins. Based on historical switching interval statistics, the basic warning lead time is set to 5 seconds, which can be extended to 8 seconds for long-cycle tasks and shortened to 3 seconds for short-cycle tasks. The specific time window is the period from the critical time point to the official start of the communication-intensive phase, ensuring sufficient time for pre-allocation. For example, if the label sequence shows that a computationally intensive phase is expected to end in 60 seconds and the model's predicted switching error is ±0.3 seconds, the critical time point is set at 55 seconds and the specific time window is 55-60 seconds, covering the warning and preparation period before the switch.
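Given a predicted switching time, the window itself follows from simple arithmetic, sketched below. The cycle labels and the function name are hypothetical, and the prediction step (the LSTM model) is outside the scope of this sketch.

```python
# Warning lead times from the description, in seconds.
LEAD_TIME_S = {"short": 3.0, "normal": 5.0, "long": 8.0}


def warning_window(predicted_switch_s, cycle="normal"):
    """Return (critical time point, window end) for a predicted switch."""
    lead = LEAD_TIME_S[cycle]
    # Clamp at zero in case the switch is closer than the lead time.
    critical = max(0.0, predicted_switch_s - lead)
    return critical, predicted_switch_s


print(warning_window(60.0))
```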
[0047] Next, the amount of gradient synchronization data to be transmitted within a specific time window is extracted. When calculating the instantaneous bandwidth requirement to meet the transmission requirements, the gradient synchronization data is extracted from the communication configuration file of the distributed training framework, including synchronization data such as parameter gradients and optimizer states, with the unit uniformly in bytes. Instantaneous bandwidth requirement = (gradient synchronization data amount × redundancy coefficient) ÷ time window duration. The redundancy coefficient is set based on link transmission loss, with a base value of 1.1. For clusters with complex network topologies, it can be increased to 1.2, and for clusters with simple topologies, it can be decreased to 1.05, to offset the overhead of packet loss and retransmission during transmission.
[0048] For example, if the gradient synchronization data volume within a certain time window is 500 GB, the time window duration is 10 seconds, and the redundancy coefficient is 1.1, the instantaneous bandwidth requirement is calculated to be (500 × 1.1) ÷ 10 = 55 GB/s, ensuring that the data is transmitted within the window.
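The formula can be checked with a short sketch. The function name is an assumption; the units follow the worked example (GB and seconds, giving GB/s).

```python
def instantaneous_bandwidth(data_volume_gb, window_s, redundancy=1.1):
    """(data volume x redundancy coefficient) / window duration, in GB/s."""
    if window_s <= 0:
        raise ValueError("window duration must be positive")
    return data_volume_gb * redundancy / window_s


# Worked example from the text: (500 * 1.1) / 10
print(round(instantaneous_bandwidth(500, 10), 2))
```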
[0049] In this embodiment, idle physical paths whose carrying capacity is not less than the instantaneous bandwidth requirement are selected from the communication links of the graphics card group, and bandwidth reservation rules are configured on them to generate the reserved bandwidth dynamic channel. Path load data is collected in real time through the switch monitoring interface, and paths with a current load below 30% and a carrying capacity ≥ the instantaneous bandwidth requirement are retained. The bandwidth reservation rules are implemented through the QoS configuration of the network devices: the minimum guaranteed bandwidth of the path is set to the instantaneous bandwidth requirement, and the maximum limited bandwidth is set to 90% of the link bandwidth limit to avoid bandwidth waste. The dynamic channel is bound to the path identifier, the reserved bandwidth value, and the effective time window, and a unique channel identifier is generated for subsequent parameter synchronization. For example, an idle path with a carrying capacity of 60 GB/s and a current load of 25% is selected from the communication links; a rule is configured with a minimum guaranteed bandwidth of 55 GB/s and a maximum limited bandwidth of 72 GB/s (90% of the 80 GB/s link bandwidth limit); and an effective time window of 55-60 seconds is bound, generating reserved bandwidth dynamic channel CH007 to ensure high-speed transmission of the gradient synchronization data.
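The path filter and reservation rule can be sketched as plain data handling. The record layout, the path identifiers, and the choice of the least-loaded eligible path are assumptions of this sketch; applying the rule to real network devices would go through vendor-specific QoS configuration, which is not shown.

```python
def select_path(paths, demand_gbps, max_load=0.30):
    """Keep paths with load < 30% and capacity >= demand, then pick the
    least loaded one (the final choice is an assumption of this sketch)."""
    eligible = [p for p in paths
                if p["load"] < max_load and p["capacity"] >= demand_gbps]
    return min(eligible, key=lambda p: p["load"]) if eligible else None


def reservation_rule(path, demand_gbps, link_limit_gbps, window, channel_id):
    """Build the reservation record bound to a path and a time window."""
    return {
        "channel_id": channel_id,
        "path_id": path["id"],
        "min_guaranteed": demand_gbps,          # instantaneous requirement
        "max_limited": 0.9 * link_limit_gbps,   # 90% of the link limit
        "window": window,
    }


paths = [
    {"id": "P1", "capacity": 60, "load": 0.25},
    {"id": "P2", "capacity": 50, "load": 0.10},  # capacity too low
    {"id": "P3", "capacity": 70, "load": 0.45},  # load too high
]
path = select_path(paths, demand_gbps=55)
rule = reservation_rule(path, 55, 80, (55, 60), "CH007")
print(rule)
```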
[0050] In another implementation, after the reserved bandwidth dynamic channel is obtained, the method further includes: monitoring the phase changes of the label sequence in real time, and generating a parameter synchronization trigger command when a communication-intensive phase start marker is detected; obtaining the gradient data to be synchronized, importing the gradient data into the reserved bandwidth dynamic channel, and transmitting it to the target graphics card group; aggregating the gradient data transmitted by all target graphics card groups to obtain the aggregated result of the global gradient; calculating the model parameter adjustment amount based on the aggregated result and the preset parameter update rules; and writing the parameter adjustment amount into the model storage unit to complete the parameter update and obtain the parameter update result, which is used for the next round of model calling.
[0051] It should be noted that, firstly, the phase changes of the label sequence are monitored in real time, and a parameter synchronization trigger command is generated when a communication-intensive phase start marker is detected. A 1-second sliding window is used to monitor the label sequence. The judgment rule is that if three consecutive sampling points within the window are marked as "communication-intensive", the phase is determined to have started. This consecutive-sample threshold is set based on statistics of historical phase switching data: in the switching scenarios of the past year, more than 98% of real phase switches lasted for more than three sampling points, while false markings, which account for only about 2% of cases, were isolated single points. The threshold therefore effectively avoids false triggers. The monitoring frequency matches the label sequence sampling frequency (one sample every 0.5 seconds) to ensure a rapid response to phase changes. The generated trigger command includes the task identifier, the channel identifier, and the start timestamp to ensure that the command is accurately associated with the target resource.
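The consecutive-sample rule can be sketched with a fixed-length buffer. The label strings and the function name are assumptions; the sketch returns the index of the sample at which the trigger would fire.

```python
from collections import deque

COMM = "communication-intensive"


def first_trigger_index(labels, needed=3):
    """Index of the sample completing `needed` consecutive
    communication-intensive labels, or None if that never happens."""
    recent = deque(maxlen=needed)
    for i, label in enumerate(labels):
        recent.append(label)
        if len(recent) == needed and all(l == COMM for l in recent):
            return i
    return None


# One sample every 0.5 s; the isolated marking at index 1 is ignored.
labels = ["compute", COMM, "compute", COMM, COMM, COMM]
print(first_trigger_index(labels))
```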
[0052] Next, the gradient data to be synchronized is imported into the reserved bandwidth dynamic channel and transmitted along the channel to the target graphics card group. The gradient data is extracted from the gradient cache in the graphics card's video memory and then compressed using the LZ4 compression algorithm, with a target compression ratio of about 1.1:1, original size to compressed size (based on historical gradient data redundancy statistics, this ratio balances compression efficiency and decompression time). During import, the effective time window and bandwidth reservation status of the channel are verified to ensure that the channel is available. A data packet fragmentation strategy is used during transmission, with a fragment size of 1MB. Each fragment carries a sequence number and a CRC32 checksum. After receiving the data, the target graphics card group reassembles the fragments by sequence number and verifies them. If verification fails, retransmission is triggered to ensure data integrity. For example, 500GB of gradient data becomes approximately 455GB after LZ4 compression and is divided into about 455,000 fragments of 1MB each, which are imported into reserved channel CH007 for transmission. After receiving the data, the target graphics card group reassembles and verifies it, finding no fragment loss or errors.
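The fragmentation and verification steps can be sketched with the standard library. Compression is omitted because LZ4 is not part of the Python standard library; the tuple layout and function names are assumptions. A fragment size of 1,000,000 bytes is assumed for 1MB, which matches the fragment count in the worked example.

```python
import zlib


def fragment(payload: bytes, size: int = 1_000_000):
    """Split payload into (sequence number, chunk, CRC32) tuples."""
    frags = []
    for seq, off in enumerate(range(0, len(payload), size)):
        chunk = payload[off:off + size]
        frags.append((seq, chunk, zlib.crc32(chunk)))
    return frags


def reassemble(fragments):
    """Reorder by sequence number, verify each CRC32, and join the chunks."""
    out = bytearray()
    for seq, chunk, crc in sorted(fragments, key=lambda f: f[0]):
        if zlib.crc32(chunk) != crc:
            # The description triggers retransmission at this point.
            raise ValueError(f"fragment {seq} failed CRC32 verification")
        out += chunk
    return bytes(out)


payload = bytes(range(256)) * 10            # 2,560 bytes of sample data
frags = fragment(payload, size=1000)        # small size for demonstration
restored = reassemble(list(reversed(frags)))  # arrival order does not matter
print(len(frags), restored == payload)
```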
[0053] Subsequently, the gradient data transmitted from all target GPU groups is aggregated to obtain the aggregated global gradient. The Ring-AllReduce algorithm is used for distributed aggregation, which reduces communication overhead and improves aggregation efficiency. The training set contains over 20,000 distributed gradient aggregation data points, covering different GPU group sizes (2-32 groups). The block size is set to 1MB, the communication step size is adapted to the number of GPU groups, and the number of iterations is 50 to ensure load balancing during the aggregation process. The aggregation process transmits gradient fragments in a ring topology. Each GPU group receives a fragment from the previous node, adds it to its local gradient, and then transmits it to the next node until all fragments are accumulated. Ultimately, all target GPU groups obtain the same global gradient aggregation result. For example, with 8 target GPU groups transmitting gradient fragments in a ring topology, each fragment being 1MB, after 7 rounds of communication, the gradient accumulation results of all GPU groups are consistent, and the global gradient aggregation result is 0.045, with precise and consistent values.
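The ring accumulation described above can be illustrated with a simplified simulation in which each group forwards the value it most recently received, so that N groups finish in N − 1 rounds (7 rounds for 8 groups, as in the example). This is a sketch on scalar values, not the described implementation; bandwidth-optimized Ring-AllReduce implementations commonly split each gradient into N chunks and run a reduce-scatter phase followed by an all-gather phase, which is not shown here.

```python
def ring_accumulate(local_grads):
    """Simulate a ring pass: in each round, node i receives the value that
    node i-1 sent, adds it to its running total, and forwards it onward."""
    n = len(local_grads)
    totals = list(local_grads)
    in_flight = list(local_grads)  # what each node sends in the next round
    rounds = 0
    for _ in range(n - 1):
        received = [in_flight[(i - 1) % n] for i in range(n)]
        totals = [t + r for t, r in zip(totals, received)]
        in_flight = received
        rounds += 1
    return totals, rounds


# Eight groups with illustrative integer gradients.
totals, rounds = ring_accumulate([1, 2, 3, 4, 5, 6, 7, 8])
print(totals, rounds)
```

Integer inputs are used so that every group ends with an identical total; with floating-point gradients the groups add values in different orders, so results may differ by rounding.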
[0054] Subsequently, based on the aggregation results and the preset parameter update rules, the model parameter adjustment amount is calculated. The preset parameter update rules use the Adam optimizer, which can adaptively adjust the learning rate to adapt to the parameter update needs at different stages. The optimizer's training set contains over 30,000 model parameter update data points. The learning rate is initially set to 0.001, decaying by 0.1 every 100 rounds until it reaches 0.0001. The momentum parameters β1=0.9, β2=0.999, and the numerical stability parameter epsilon=1e-8. The adjustment amount calculation combines the global gradient aggregation results with the optimizer's historical state (the first and second moments of the preceding gradients). First, the exponential moving averages of the first and second moments are updated, and then the adaptive learning rate is calculated based on the moving average results. The final adjustment amount = global gradient aggregation result × adaptive learning rate. For example, if the global gradient aggregation result is 0.045, the current first-order moment moving average of the optimizer is 0.02, and the second-order moment moving average is 0.001, the adaptive learning rate is calculated to be 0.00095, and the parameter adjustment is approximately 0.045 × 0.00095 ≈ 0.00004275.
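For reference, the textbook form of one Adam update with the stated hyperparameters is sketched below, together with the stepwise learning-rate decay (interpreted here as multiplication by 0.1 every 100 rounds with a floor of 0.0001). This is the standard formulation, not necessarily the exact computation behind the worked example; in particular, it does not reproduce the adaptive learning rate of 0.00095 quoted above. The function names are assumptions.

```python
import math


def decayed_lr(round_idx, base=0.001, factor=0.1, every=100, floor=0.0001):
    """Step decay: multiply by `factor` every `every` rounds, down to `floor`."""
    return max(floor, base * factor ** (round_idx // every))


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment moving average
    v = beta2 * v + (1 - beta2) * grad * grad    # second-moment moving average
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    step = lr * m_hat / (math.sqrt(v_hat) + eps)
    return param - step, m, v, step


new_param, m, v, step = adam_step(param=1.23, grad=0.045, m=0.0, v=0.0, t=1)
print(new_param, step)
```

On the first step the bias-corrected moments make the adjustment roughly equal to the learning rate regardless of the gradient's magnitude, which is a well-known property of Adam.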
[0055] Finally, the parameter adjustments are written to the model storage unit to complete the parameter update. When obtaining the parameter update results, the model storage unit is the globally shared memory pool of the graphics card cluster. A synchronous write mechanism is used to ensure that all target graphics card groups complete the parameter update simultaneously, avoiding data inconsistency. During the write process, the parameter values before and after the update, the adjustments, and the update timestamp are recorded to form an update log. The parameter update results include the updated model weight matrix, optimizer state (first moment, second moment), and update log, providing baseline data for the next iteration. For example, writing an adjustment of 0.00004275 to the shared memory pool updates the corresponding model parameters from 1.2300 to 1.22995725, records the update timestamp as 62.3 seconds, and synchronously updates the optimizer state, completing the parameter update and generating a result log.
[0056] In summary, this invention discloses a dynamic scheduling method for GPU cluster computing power, including obtaining GPU utilization and network traffic sampling sequences, aligning with timestamps and performing stage analysis to obtain computationally and communication-intensive stage label sequences; calculating the peak computing power and bandwidth requirements of the two types of stages, and prioritizing tasks; matching low-interference candidate GPU groups, and generating reserved bandwidth channels by pre-allocating network paths in conjunction with stage switching windows; and synchronizing execution parameters and updating the running status sequence in a closed loop. This achieves precise dynamic scheduling of GPU cluster computing power, meeting the dual requirements of high-performance computing efficiency and resource utilization.
[0057] Referring to Figure 2, the second embodiment of the present invention provides a dynamic scheduling system for computing power of a graphics card cluster, comprising: a data acquisition module, used to obtain the utilization sampling sequence of each graphics card in the graphics card cluster and the network traffic sampling sequence; a phase analysis module, used to determine the operation phase labels based on the utilization sampling sequence and the traffic sampling sequence, and to concatenate the operation phase labels to obtain a label sequence, the operation phase labels including computationally intensive phases and communication-intensive phases; a peak calculation module, used to calculate, based on the label sequence, the peak computing power demand during the computationally intensive phase and the peak network bandwidth demand during the communication-intensive phase, respectively; a priority sorting module, used to calculate the task priority score based on the peak computing power demand and the peak network bandwidth demand, and to reorder the preset resource scheduling queue based on the task priority score to obtain the adjusted task queue; a location allocation module, used to filter out target graphics card groups according to the adjusted task queue, and to bind the target graphics card groups to tasks in the adjusted task queue to obtain the location allocation result; and a path pre-allocation module, used to pre-allocate network paths based on the location allocation result and the specific time window of phase switching calculated from the label sequence, in combination with the pre-acquired gradient synchronization data volume, to obtain a reserved bandwidth dynamic channel.
[0058] It should be noted that the graphics card cluster computing power dynamic scheduling system provided in this embodiment of the invention is used to execute all the process steps of the graphics card cluster computing power dynamic scheduling method in the above embodiment. The working principles and beneficial effects of the two correspond one-to-one, so they will not be described again.
[0059] It should be noted that the system embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Furthermore, in the accompanying drawings of the system device embodiments provided by this invention, the connection relationships between modules indicate that they have communication connections, which can be specifically implemented as one or more communication buses or signal lines. Those skilled in the art can understand and implement this without any creative effort.
[0060] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above descriptions are merely specific embodiments of the present invention and are not intended to limit its scope of protection. In particular, it should be noted that any modifications, equivalent substitutions, improvements, etc., made by those skilled in the art within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A method for dynamic scheduling of computing power in a graphics card cluster, characterized in that it includes: obtaining the utilization sampling sequence of each graphics card in the graphics card cluster and the network traffic sampling sequence; determining the operation phase labels based on the utilization sampling sequence and the traffic sampling sequence, and concatenating the operation phase labels to obtain a label sequence, the operation phase labels including computationally intensive phases and communication-intensive phases; calculating, based on the label sequence, the peak computing power demand during the computationally intensive phase and the peak network bandwidth demand during the communication-intensive phase, respectively; calculating the task priority score based on the peak computing power demand and the peak network bandwidth demand, and reordering the preset resource scheduling queue based on the task priority score to obtain the adjusted task queue; filtering out the target graphics card group according to the adjusted task queue, and binding the target graphics card group to the tasks in the adjusted task queue to obtain the position allocation result; and pre-allocating network paths based on the position allocation result and the specific time window of phase switching calculated from the label sequence, in combination with the pre-acquired gradient synchronization data volume, to obtain a reserved bandwidth dynamic channel.
2. The method for dynamic scheduling of computing power in a graphics card cluster according to claim 1, characterized in that, The acquisition of the utilization sampling sequence of each graphics card in the graphics card cluster and the network traffic sampling sequence includes: Real-time utilization data of each graphics card in the graphics card cluster is collected at a preset period to form an original utilization sequence; Synchronously collect real-time transmission traffic data from network devices to form raw traffic sequences; Outlier removal is performed on the original utilization rate sequence and the original flow rate sequence to obtain the processed utilization rate sequence and the processed flow rate sequence. The processed utilization sequence and the processed traffic sequence are respectively processed by data format normalization to obtain the graphics card utilization sampling sequence and the network traffic sampling sequence.
3. The method for dynamic scheduling of computing power in a graphics card cluster according to claim 1, characterized in that, The step of determining the operation phase label based on the utilization rate sampling sequence and the traffic flow sampling sequence, and concatenating the operation phase labels to obtain a label sequence, includes: Extract the timestamps of the utilization sampling sequence and the traffic sampling sequence, and associate and bind the utilization data and traffic data corresponding to the same timestamp to obtain the associated dataset; The associated dataset is extracted according to a fixed time window to obtain multiple consecutive data blocks to be analyzed; Feature extraction is performed on each of the data blocks to be analyzed to determine the corresponding operational stage label for the data block; All operation stage labels are assembled in time window order to form a label sequence for stage classification.
4. The method for dynamic scheduling of computing power in a graphics card cluster according to claim 1, characterized in that, The step of calculating the peak computing power demand during the computationally intensive phase and the peak network bandwidth demand during the communication-intensive phase based on the tag sequence includes: Traverse the label sequence, filter out the time intervals marked as computationally intensive stages, and extract the training batch size and model parameter count within the corresponding time intervals; The peak computing power requirement during the computationally intensive phase is calculated by multiplying the training batch size by the number of model parameters and combining the peak data of GPU utilization. Filter the time intervals marked as communication-intensive phases in the label sequence, and extract the gradient synchronization data volume and communication duration within the corresponding intervals; The peak network bandwidth demand during the communication-intensive phase is calculated by dividing the gradient synchronization data volume by the communication duration and combining it with the peak network traffic data.
5. The method for dynamic scheduling of computing power in a graphics card cluster according to claim 1, characterized in that, The step of calculating a task priority score based on the peak computing power demand and the peak network bandwidth demand, and then reordering a preset resource scheduling queue based on the task priority score to obtain an adjusted task queue, includes: Invoke the weighting coefficients pre-configured for the peak computing power demand and the peak network bandwidth demand, respectively; The peak computing power requirement, the peak network bandwidth requirement, and the corresponding weight coefficient are weighted and summed to obtain the priority score of each task to be assigned. The tasks in the preset resource scheduling queue are rearranged according to the priority scores from high to low to obtain the adjusted task queue.
6. The method for dynamic scheduling of computing power in a graphics card cluster according to claim 1, characterized in that the step of selecting a target graphics card group according to the adjusted task queue and binding the target graphics card group to the tasks in the adjusted task queue to obtain a position allocation result comprises: determining, based on the adjusted task queue, the task to be assigned and its resource requirement type; selecting, based on the resource requirement type, a matching candidate graphics card group, the candidate graphics card group including an idle graphics card group and a graphics card group whose network topology distance is within a preset distance threshold; collecting real-time network traffic fluctuation data and task conflict records of the candidate graphics card group, and calculating a quantified interference value; if the quantified interference value is lower than a preset traffic interference threshold, determining that the candidate graphics card group meets the task operation requirements and using the candidate graphics card group as the target graphics card group; and binding the target graphics card group to the task to be assigned, and recording the correspondence between the graphics card group identifier and the task identifier to obtain the position allocation result.
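By way of a non-limiting illustration, the selection and binding in claim 6 can be sketched as follows. The claim does not define how the quantified interference value is computed, so this sketch assumes it is the coefficient of variation of the traffic samples plus a weighted conflict count. It also assumes a candidate must be both idle and within the distance threshold. All field names and the `conflict_weight` parameter are hypothetical.

```python
from statistics import mean, pstdev


def interference_value(traffic_samples, conflict_count, conflict_weight=0.1):
    """Assumed metric: traffic coefficient of variation + weighted conflict count."""
    if traffic_samples:
        avg = mean(traffic_samples)
        fluctuation = pstdev(traffic_samples) / avg if avg > 0 else 0.0
    else:
        fluctuation = 0.0
    return fluctuation + conflict_weight * conflict_count


def bind_task(task_id, groups, distance_threshold, interference_threshold):
    """Bind the task to the first acceptable group, or return None if none qualifies."""
    for group in groups:
        # Candidate filter: idle and topologically close (assumed to be a conjunction).
        if not group["idle"] or group["topology_distance"] > distance_threshold:
            continue
        value = interference_value(group["traffic"], group["conflicts"])
        if value < interference_threshold:
            # Record the group-to-task correspondence as the position allocation result.
            return {"task_id": task_id, "group_id": group["id"]}
    return None
```

Returning the first qualifying group is a simplification; a scheduler could equally rank all qualifying groups by interference value and choose the lowest.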
7. The method for dynamic scheduling of computing power in a graphics card cluster according to claim 1, characterized in that the step of calculating a specific time window of phase switching based on the position allocation result and the label sequence, and pre-allocating network paths in combination with the pre-acquired gradient synchronization data volume to obtain a reserved bandwidth dynamic channel, comprises: determining, based on the position allocation result, the communication link of the graphics card group corresponding to the task to be assigned; analyzing the temporal pattern of phase switching in the label sequence, calculating the critical time point of phase switching, and delineating the specific time window in which the communication-intensive phase is about to begin; extracting the gradient synchronization data volume to be transmitted within the specific time window, and calculating the instantaneous bandwidth requirement needed to complete the transmission; and selecting, in the communication link of the graphics card group, an idle physical path with a carrying capacity not lower than the instantaneous bandwidth requirement, and configuring bandwidth reservation rules on the idle physical path to generate the reserved bandwidth dynamic channel.
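By way of a non-limiting illustration, the window calculation and path pre-allocation in claim 7 can be sketched as follows. The sketch assumes that the critical time point is the timestamp at which the label changes from `"compute"` to `"comm"`, that the instantaneous bandwidth requirement is the data volume divided by the window length, and that the tightest-fitting idle path is preferred. The path layout and all names are hypothetical.

```python
def switch_points(labels, timestamps):
    """Timestamps at which the label changes from compute to communication."""
    return [
        timestamps[i]
        for i in range(1, len(labels))
        if labels[i - 1] == "compute" and labels[i] == "comm"
    ]


def reserve_channel(paths, sync_bytes, window_s):
    """Pick an idle path that can carry sync_bytes within window_s seconds.

    Returns a reservation record, or None if no idle path has enough capacity.
    """
    required = sync_bytes / window_s  # instantaneous bandwidth, bytes/second
    eligible = [p for p in paths if p["idle"] and p["capacity"] >= required]
    if not eligible:
        return None
    # Assumed policy: tightest fit, leaving larger paths free for other tasks.
    best = min(eligible, key=lambda p: p["capacity"])
    return {"path_id": best["id"], "reserved_bandwidth": required}
```

The reservation record stands in for the bandwidth reservation rules; how such rules are installed on real network hardware is outside the scope of this sketch.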
8. The method for dynamic scheduling of computing power in a graphics card cluster according to claim 4, characterized in that, after the reserved bandwidth dynamic channel is obtained, the method further comprises: monitoring phase changes in the label sequence in real time, and generating a parameter synchronization trigger command when a start label of a communication-intensive phase is detected; obtaining the gradient data to be synchronized, importing the gradient data into the reserved bandwidth dynamic channel, and transmitting it to the target graphics card group; aggregating the gradient data transmitted by all target graphics card groups to obtain a global gradient aggregation result; calculating a model parameter adjustment amount based on the aggregation result and a preset parameter update rule; and writing the parameter adjustment amount into the model storage unit to complete the parameter update and obtain a parameter update result, which is used for the next round of model invocation.
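By way of a non-limiting illustration, the aggregation and update steps in claim 8 can be sketched as follows. The claim names neither the aggregation operation nor the parameter update rule, so this sketch assumes element-wise averaging and a plain gradient descent step with a hypothetical learning rate `lr`.

```python
def aggregate_gradients(group_gradients):
    """Element-wise mean of the gradient vectors from all groups (assumed rule)."""
    count = len(group_gradients)
    return [sum(values) / count for values in zip(*group_gradients)]


def parameter_adjustment(global_gradient, lr=0.1):
    """Adjustment amount under an assumed gradient descent rule: -lr * gradient."""
    return [-lr * g for g in global_gradient]


def apply_update(params, adjustment):
    """Write the adjustment into the stored parameters and return the new values."""
    return [p + d for p, d in zip(params, adjustment)]
```

For example, two groups reporting gradients `[1.0, 2.0]` and `[3.0, 4.0]` give a global gradient of `[2.0, 3.0]`; with `lr=0.5` the adjustment is `[-1.0, -1.5]`.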
9. A dynamic scheduling system for computing power of a graphics card cluster, characterized in that it comprises: a data acquisition module, used to obtain the utilization sampling sequence and the network traffic sampling sequence of each graphics card in the graphics card cluster; a phase analysis module, used to determine operation phase labels based on the utilization sampling sequence and the traffic sampling sequence, and to concatenate the operation phase labels to obtain a label sequence, the operation phase labels including a computationally intensive phase and a communication-intensive phase; a peak calculation module, used to calculate, based on the label sequence, the peak computing power demand during the computationally intensive phase and the peak network bandwidth demand during the communication-intensive phase, respectively; a priority sorting module, used to calculate a task priority score based on the peak computing power demand and the peak network bandwidth demand, and to reorder a preset resource scheduling queue based on the task priority score to obtain an adjusted task queue; a position allocation module, used to select a target graphics card group according to the adjusted task queue, and to bind the target graphics card group to the tasks in the adjusted task queue to obtain a position allocation result; and a path pre-allocation module, used to calculate a specific time window of phase switching based on the position allocation result and the label sequence, and to pre-allocate network paths in combination with the pre-acquired gradient synchronization data volume to obtain a reserved bandwidth dynamic channel.
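By way of a non-limiting illustration, the phase analysis module of claim 9 can be sketched as follows. The claim does not state the labeling criterion, so this sketch assumes that each sample is labeled by comparing GPU utilization with network traffic normalized by a hypothetical `link_capacity`; whichever is larger determines the phase.

```python
def label_sequence(utilization, traffic, link_capacity):
    """Label each sample "compute" or "comm" and return the concatenated sequence.

    utilization: GPU utilization samples in the range 0.0 to 1.0
    traffic: network traffic samples, in the same unit as link_capacity
    """
    labels = []
    for util, bytes_per_s in zip(utilization, traffic):
        network_load = bytes_per_s / link_capacity  # normalize to 0.0 to 1.0
        # Assumed criterion: the dominant resource decides the phase label.
        labels.append("comm" if network_load > util else "compute")
    return labels
```

With utilization samples `[0.9, 0.8, 0.2, 0.1]`, traffic samples `[10, 20, 80, 90]`, and a link capacity of `100`, the first two samples are labeled as the computationally intensive phase and the last two as the communication-intensive phase.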