Network parameter configuration method, device, system and storage medium

By deploying intelligent network agents and model performance monitoring modules in the computing system, training status information can be obtained in real time, and network parameters can be dynamically adjusted. This solves the problem that network parameter configuration cannot adapt to the dynamic traffic patterns of model training, and improves model training efficiency.

CN121907685B · Active · Publication Date: 2026-06-23 · SHANGHAI BIREN TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SHANGHAI BIREN TECH CO LTD
Filing Date: 2026-03-20
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

In existing technologies, network parameter configurations cannot adapt to the dynamic and bursty traffic patterns during the training process of artificial intelligence models, resulting in low model training efficiency.

Method used

By deploying intelligent network agents and model performance monitoring modules in the computing system, training status information can be obtained in real time, and network parameters can be dynamically adjusted to meet the model training requirements, thus forming a closed-loop optimization.

Benefits of technology

It effectively reduces training latency caused by network congestion and improves the overall efficiency of distributed training.

✦ Generated by Eureka AI based on patent content.

Figure CN121907685B_ABST

Abstract

The embodiment of the present disclosure provides a network parameter configuration method, device, system and storage medium, and relates to the technical field of artificial intelligence. The network parameter configuration method comprises the following steps: each network device of a system obtains training state information in the process of a computing device executing a target training task, and network state information of the network device; at least according to the training state information, a parameter configuration target for configuring network parameters of the network device is determined; according to the parameter configuration target, the training state information and the network state information, the network parameters are configured; wherein the network parameters can comprise parameters for configuring a congestion notification mechanism in the network device, and the parameter configuration target can comprise a target set based on reducing the execution delay of the target training task caused by network congestion.

Description

Technical Field

[0001] This disclosure relates to the field of artificial intelligence technology, and more specifically, to a network parameter configuration method, device, system, and storage medium.

Background Technology

[0002] With the development of Artificial Intelligence (AI) technology, large-scale models trained on massive amounts of data have become a key driving force for technological progress. AI models often require training to achieve good inference capabilities, and the training process typically demands significant computing and storage resources. Various parallel strategies can be combined for training, such as data parallelism (DP), model parallelism (MP), pipeline parallelism (PP), and tensor parallelism (TP). These parallel strategies allow for the partitioning of models and data, distributing them across multiple computing devices (such as GPUs) for parallel processing, thereby shortening training time and supporting larger-scale models. This necessitates extensive data interaction between multiple computing devices. In other words, when the target model is trained in a computing system composed of multiple devices, the system network demands high throughput, low latency, and synchronous communication. Therefore, configuring network parameters to meet the needs of model training and improve training efficiency has become a pressing technical problem.

Summary of the Invention

[0003] In view of this, the present disclosure proposes a new technical solution for network parameter configuration.

[0004] According to a first aspect of the present disclosure, a network parameter configuration method is provided, applied to any one of a plurality of network devices in a computing system, the computing system further comprising a plurality of computing devices connected through the plurality of network devices, for performing a target training task of training a target model; the method includes:

[0005] Acquire training status information of the computing device during the execution of the target training task, and network status information of the network device;

[0006] At least based on the training state information, a parameter configuration target for configuring the network parameters of the network device is determined; wherein, the network parameters include parameters for configuring the congestion notification mechanism in the network device, and the parameter configuration target includes a target set based on reducing the execution delay of the target training task due to network congestion;

[0007] The network parameters are configured according to the parameter configuration target, the training state information, and the network state information.

[0008] Optionally, the training status information is training status information collected by a model performance monitoring module deployed in the computing device; the model performance monitoring module is implemented by adding event-driven callback functions to the training nodes of the target model, and the training nodes include at least one of the following: iteration start node, forward propagation phase start node, back propagation phase start node, parameter update phase start node, or iteration end node.

[0009] Optionally, the computing system further includes a server connected to the network devices and computing devices, the server being used to collect training status information collected by each of the computing devices during the execution of the target training task and send it to each of the network devices;

[0010] Obtaining training state information during the execution of the target training task by the computing device includes:

[0011] Receive the training status information sent by the server.

[0012] Optionally, the training status information includes the current training stage of the target training task;

[0013] Determine, at least based on the training state information, the parameter configuration target for configuring the network parameters of the network device, including:

[0014] When the current training phase is the forward propagation phase, the parameter configuration objective includes a low latency objective; or, when the current training phase is the backpropagation phase or the parameter update phase, the parameter configuration objective includes a high throughput objective.

[0015] Optionally, at least based on the training state information, a parameter configuration target for configuring the network parameters of the network device is determined, including:

[0016] The parameter configuration target is determined based on the training state information and the network state information;

[0017] The network status information includes at least one of the following: the bandwidth information of the network device, the data latency information of the network device, the congestion notification information of the network device, or the flow control information of the network device.

[0018] Optionally, the training status information includes the iteration time of the target training task and the device utilization rate of the computing device executing the target training task; the network status information includes the flow control information of the network device;

[0019] The parameter configuration objectives include a first objective, a second objective, and a third objective; wherein, the weight of the first objective is greater than that of the second objective, the weight of the second objective is greater than that of the third objective, the first objective represents increasing the device utilization rate, the second objective represents reducing the iteration time, and the third objective represents reducing the number of flow control operations.
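The weighted objectives above can be illustrated with a small scoring function. This is a minimal sketch under stated assumptions, not the method of the disclosure: the weighted-sum form, the default weights, the baseline normalization, and the name `objective_score` are all hypothetical. The source only states that the first objective outweighs the second and the second outweighs the third.

```python
def objective_score(utilization, iteration_time_s, pfc_count,
                    baseline_iteration_time_s, baseline_pfc_count,
                    weights=(0.6, 0.3, 0.1)):
    """Combine the three objectives into one score (higher is better).

    The weights are illustrative; only their ordering w1 > w2 > w3
    comes from the text.
    """
    w1, w2, w3 = weights
    if not (w1 > w2 > w3 > 0):
        raise ValueError("weights must satisfy w1 > w2 > w3 > 0")
    if baseline_iteration_time_s <= 0:
        raise ValueError("baseline iteration time must be positive")
    # First objective: higher device utilization (a fraction in [0, 1]).
    # Second objective: relative reduction of the iteration time.
    time_gain = (baseline_iteration_time_s - iteration_time_s) / baseline_iteration_time_s
    # Third objective: relative reduction of flow control (PFC) operations.
    pfc_gain = (baseline_pfc_count - pfc_count) / max(baseline_pfc_count, 1)
    return w1 * utilization + w2 * time_gain + w3 * pfc_gain
```

With this form, a candidate configuration that raises utilization scores higher than one that only trims the flow control count by a similar relative amount, which reflects the stated weight ordering.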

[0020] Optionally, the network status information includes the current traffic type of the network device; there are multiple data streams in the network device, and the network status information includes the current traffic type of each data stream, with different data streams corresponding to different parameter configuration targets;

[0021] When the current traffic type of the data stream is a first traffic type, the parameter configuration target corresponding to the data stream is a low latency target, and the first traffic type indicates that the current traffic of the data stream is less than a preset traffic threshold; or, when the current traffic type of the data stream is a second traffic type, the parameter configuration target corresponding to the data stream is a high throughput target, and the second traffic type indicates that the current traffic of the data stream is greater than or equal to a preset traffic threshold.
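The per-stream rule above amounts to a threshold comparison. The following sketch assumes traffic is expressed as a rate in bytes per second; the function name, the stream names, and the threshold value are hypothetical.

```python
def target_for_flow(current_traffic, threshold):
    """Return the parameter configuration target for one data stream."""
    # First traffic type: traffic below the preset threshold -> low latency.
    # Second traffic type: traffic at or above the threshold -> high throughput.
    return "low_latency" if current_traffic < threshold else "high_throughput"


# Hypothetical streams and threshold, in bytes per second.
flows = {"flow_a": 2e6, "flow_b": 8e9}
THRESHOLD = 1e9
targets = {name: target_for_flow(rate, THRESHOLD) for name, rate in flows.items()}
```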

[0022] Optionally, the training status information includes at least one of the following: the current training stage of the target training task, the iteration time of the target training task, the target data amount that needs to be transferred between multiple computing devices in the current iteration of the target training task, the device utilization rate of the computing device executing the target training task, and the communication waiting time of the computing device executing the target training task.

[0023] According to a second aspect of the present disclosure, a network device is provided, including a memory and a processor, the memory being configured to store computer instructions, and the processor being configured to invoke the computer instructions from the memory to perform the method described in the first aspect.

[0024] According to a third aspect of the present disclosure, a computing system is provided, the computing system including a plurality of network devices and a plurality of computing devices, the plurality of computing devices being connected through the plurality of network devices; wherein: the computing devices are used to perform a target training task of training a target model; and the network devices are used to perform the method described in the first aspect.

[0025] According to a fourth aspect of the present disclosure, a computer-readable storage medium is provided having a computer program stored thereon, the computer program implementing the method described in the first aspect when executed by a processor.

[0026] Based on the network parameter configuration method provided in this disclosure, network devices no longer blindly configure and adjust parameters based solely on their own underlying metrics such as queue size and packet loss. Instead, they intelligently adjust network parameters according to the training status information of the model training task, dynamically optimize network performance, and form a closed-loop optimization from application performance feedback to network parameter adjustment. This enables data transmission in the network to meet the needs of model training, effectively reduces training latency caused by congestion, and improves the overall efficiency of distributed training.

[0027] Other features and advantages of this disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

Brief Description of the Drawings

[0028] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments of the present disclosure and, together with the description, serve to explain the principles of the disclosure.

[0029] Figure 1 is a schematic diagram of a computing system provided in an embodiment of this disclosure.

[0030] Figure 2 is a flowchart of a network parameter configuration method provided in an embodiment of this disclosure.

[0031] Figure 3 is a schematic diagram of the structure of a network device provided in an embodiment of this disclosure.

Detailed Description

[0032] Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specifically stated, the relative arrangement, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of the present disclosure.

[0033] The following description of at least one exemplary embodiment is merely illustrative and is not intended to limit the scope of this disclosure or its application or use.

[0034] Techniques, methods, and equipment known to those skilled in the art may not be discussed in detail, but where appropriate, such techniques, methods, and equipment should be considered part of the specification.

[0035] In all the examples shown and discussed herein, any specific values should be interpreted as merely exemplary and not as limitations. Therefore, other examples of exemplary embodiments may have different values.

[0036] It should be noted that similar labels and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be discussed further in subsequent figures.

[0037] Computing devices, represented by Graphics Processing Units (GPUs), are increasingly used in the training and inference of artificial intelligence models. During AI model training, massive datasets can be used to repeatedly adjust billions or even hundreds of billions of parameters within the model through complex mathematical optimizations (such as gradient descent). This allows the model to learn patterns, features, and knowledge from the data, ultimately acquiring the intelligence to perform specific tasks (such as language understanding, image generation, and scientific computing). Such training typically requires substantial computing and storage resources, and can be achieved using a combination of parallel strategies, such as data parallelism (DP), model parallelism (MP), pipeline parallelism (PP), and tensor parallelism (TP). Based on these parallel strategies, the model and data can be partitioned and distributed across multiple computing devices (such as GPUs) for parallel processing, thereby shortening training time and supporting larger-scale models. This necessitates extensive data interaction between multiple computing devices. For example, model training may involve various communication operations, such as AllReduce, AllGather, Broadcast, and one or more peer-to-peer (P2P) communication operations, where:

[0038] In AllReduce (global reduction), each participating process (GPU) sends its data (such as gradients) to all other processes and then performs a reduction operation (such as summation) on all received data, so that every process ends up with the same result. This keeps model parameters synchronized across all GPUs. Note that each process can run on a single computing device.

[0039] In AllGather (full collection), each process sends its own data block to all other processes; each process eventually collects the data blocks from all processes and concatenates them in order into a complete set of data.

[0040] In Broadcast, one process (e.g., rank 0) sends its own data to all other processes.

[0041] Point-to-point (P2P) communication can be used in pipeline parallelism, where GPUs in adjacent stages send and receive data through asynchronous operations such as isend and irecv to carry out the forward and backward propagation of the pipeline.
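The result each process holds after the three collective operations can be illustrated with plain Python lists. This sketch models only the semantics (what every rank ends up with), not how the data moves over the network; the function names and the example data are illustrative.

```python
def all_reduce(per_rank):
    """Every rank ends with the element-wise sum of all ranks' vectors."""
    total = [sum(values) for values in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def all_gather(per_rank):
    """Every rank ends with all ranks' blocks concatenated in rank order."""
    gathered = [x for block in per_rank for x in block]
    return [list(gathered) for _ in per_rank]


def broadcast(per_rank, root=0):
    """Every rank ends with a copy of the root rank's data."""
    return [list(per_rank[root]) for _ in per_rank]


# One gradient vector per rank (three ranks in this example).
grads = [[1, 2], [3, 4], [5, 6]]
reduced = all_reduce(grads)      # each rank holds [9, 12]
gathered = all_gather(grads)     # each rank holds [1, 2, 3, 4, 5, 6]
```

In all three cases every rank receives data that originates on other ranks, which is why these operations load the network between computing devices.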

[0042] The efficiency of these communication operations directly determines the performance of the entire training system. Especially in critical steps such as gradient synchronization, numerous AllReduce operations can place enormous pressure on the network. These communication operations require extensive data exchange between multiple computing devices, demanding high throughput, low latency, and synchronous communication from the system network. Therefore, configuring network parameters to meet the needs of model training and improve its overall efficiency has become a pressing technical problem.

[0043] To address the problems in related technologies, embodiments of this disclosure provide a computing system. As shown in Figure 1, the computing system 1000 may include a computing device 1100 and a network device 1200. The computing device may include a processor such as a Graphics Processing Unit (GPU), a General-Purpose Graphics Processing Unit (GPGPU), a Neural-network Processing Unit (NPU), or a Tensor Processing Unit (TPU). The network device may be a switch, such as a smart switch.

[0044] For example, there can be multiple computing devices and network devices in a computing system. For instance, computing devices may include computing device 1, computing device 2, computing device 3, ..., computing device n; network devices may include network device 1, network device 2, network device 3, ..., network device m; n computing devices can be interconnected through m network devices to form a computing cluster for training artificial intelligence models. For example, multiple computing devices can be used to perform a target training task to train a target model. The target model can be any artificial intelligence model, such as a large language model, a multimodal model, an image processing model, a text processing model, an audio processing model, or a specialized model for various fields such as scientific computing and reinforcement learning.

[0045] As shown in Figure 1, data interaction between computing devices can be carried out through multiple network devices (such as switches). The efficiency of model training is highly dependent on the communication efficiency of network devices, and the network parameters of network devices have a significant impact on communication efficiency. In related technologies, the network parameters of network devices often need to be manually configured by users and cannot be adaptively configured according to the actual needs of model training, resulting in low model training efficiency.

[0046] Still referring to Figure 1, in the computing system provided in this embodiment, a Smart Network Agent (SNA) can be deployed in each network device, and a model performance monitoring module can be deployed in each computing device. This model performance monitoring module can be integrated into the model training framework. The computing system may also include a server 1300, which can be connected to the network devices and computing devices to collect relevant information from each computing device and/or network device, perform status monitoring, and control or configure the computing devices and/or network devices. For example, this server may be a Central Policy Server (CPS).

[0047] The model performance monitoring module can be inserted into the model training loop using the relevant mechanisms of the model training framework, so as to capture training status information in real time while the computing device executes the target training task. The training status information can include key indicators such as training stage information, iteration time, and gradient synchronization size, and it is sent to the network devices or to the server.
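A monitoring module of this kind can be sketched as a set of callbacks fired at the training nodes named earlier in this disclosure (iteration start, forward start, backward start, parameter update start, iteration end). The sketch below is framework-agnostic and hypothetical: the class `TrainingMonitor`, the `sink` callable that stands in for sending a record to a network device or server, and the record fields are assumptions, not an API of any real training framework.

```python
import time


class TrainingMonitor:
    """Collects training status through callbacks fired at training nodes."""

    def __init__(self, sink, clock=time.monotonic):
        self.sink = sink            # callable that receives each status record
        self.clock = clock          # injectable clock, useful for testing
        self.iter_start = None

    def on_event(self, node, **extra):
        now = self.clock()
        if node == "iteration_start":
            self.iter_start = now
        record = {"node": node, "time": now, **extra}
        # The iteration time is the span from iteration start to iteration end.
        if node == "iteration_end" and self.iter_start is not None:
            record["iteration_time"] = now - self.iter_start
        self.sink(record)


def train_one_iteration(monitor, grad_bytes):
    """Skeleton of one iteration; the real compute steps are elided."""
    monitor.on_event("iteration_start")
    monitor.on_event("forward_start")
    # ... forward propagation would run here ...
    monitor.on_event("backward_start", gradient_sync_bytes=grad_bytes)
    # ... back propagation and gradient synchronization would run here ...
    monitor.on_event("update_start")
    # ... parameter update would run here ...
    monitor.on_event("iteration_end")
```

Passing `records.append` as the sink collects the records in a list; a deployment would instead forward each record over the network.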

[0048] This server (e.g., CPS) can collect training status information from various computing devices and send it to various network devices; it can also collect status information of intelligent network agents on various network devices, coordinate policy updates, and distribute the latest policy model to various network devices. This server can employ a fault-tolerant design, such as deploying two servers (primary and backup), or multiple servers forming a server cluster, to avoid single points of failure.

[0049] The intelligent network agent can be a lightweight process deployed on various network devices (such as switches) for local state collection, policy enforcement, and real-time decision-making. Each intelligent network agent operates as an independent intelligent agent, running the network parameter configuration method provided in this disclosure embodiment to dynamically configure the network parameters of the network devices.

[0050] In some examples, intelligent network agents on various network devices can obtain training status information during the execution of a target training task by the computing device, and configure the network parameters of the network device based on the training status information to make the network parameters conform to the parameter configuration objectives. The parameter configuration objectives include reducing the execution delay of the target training task due to network congestion, and the network parameters include parameters used to configure the congestion notification mechanism in the network device.

[0051] In this way, through the cooperation of the computing devices, servers, and network devices in the computing system, the network devices can obtain the training status information of the target training task in a timely manner (such as training stage, iteration time, device utilization, or communication wait time). The real-time requirements and traffic characteristics of model training are determined from this training status information, and the network parameters are dynamically adjusted accordingly. This solves the problem in related technologies where users must manually configure and adjust network parameters, so that the configuration cannot adapt to the dynamic and bursty traffic patterns of model training. Network parameters can thus be matched accurately and promptly to the differing network requirements of the different states of model training, thereby improving the overall efficiency of model training.

[0052] It should be noted that the structure of the computing system shown in Figure 1 is illustrative. The computing system in this embodiment is not limited to the above structure and may include more or fewer devices as needed, or the devices may be combined or split. For example, the computing system may not include a server. As another example, the computing system may also include control devices (e.g., CPU), storage devices (e.g., RAM), etc. The storage devices can be used to store instructions and/or data, which can be retrieved and used by the control devices or computing devices. For example, the storage devices can store program instructions executed by the control devices or computing devices, or they can store data such as text, images, audio, and configuration parameters. The control devices can control the computing devices to execute related processes or tasks to achieve related system functions, such as training, inference, scientific computing, or image processing of artificial intelligence models based on user needs.

[0053] Figure 2 is a flowchart of a network parameter configuration method provided in an embodiment of this disclosure. This network parameter configuration method can be executed in the computing system shown in Figure 1, for example, by a network device, or by a combination of a network device and a computing device, or by a combination of a network device, a computing device, and a server. As shown in Figure 2, the network parameter configuration method of this embodiment may include the following steps S210 to S230.

[0054] Step S210: Obtain training status information of the computing device during the execution of the target training task, and network status information of the network device.

[0055] The training status information can be information collected by the computing device and sent to the network devices. The computing device can directly send the training status information to each network device, or it can first send the training status information to the server, and then the server sends it to each network device.

[0056] In some examples, the training status information may include at least one of the following: the current training stage of the target training task, the iteration time of the target training task, the amount of target data that needs to be transferred between multiple computing devices in the current iteration of the target training task, the device utilization of the computing device executing the target training task, and the communication wait time of the computing device executing the target training task.

[0057] The current training phase can characterize the specific computation and communication period in which the target training task is currently located. For example, the current training phase may include the forward propagation phase, the back propagation phase, or the parameter update phase. The traffic patterns and network requirements of different training phases can be different. For example, forward propagation is computationally intensive and has less communication, and is more sensitive to data latency; the back propagation phase and the parameter update phase require a large amount of data communication such as gradient synchronization and parameter synchronization, and have high data throughput requirements.

[0058] The iteration time characterizes the total time required to complete one training iteration. For example, a training iteration may include iteration start, forward propagation, back propagation, parameter update, and iteration end. The time from the start to the end of the iteration can be used as the iteration time. Optionally, if the current iteration is not yet complete, the time from the start of the current iteration to the current moment can be used as the iteration time for the current iteration. The overall training efficiency can be evaluated using the iteration time.

[0059] This target data volume characterizes the total amount of data that needs to be synchronized across all computing devices (e.g., GPUs) via collective communication, such as the total amount of gradient data or parameter data. This target data volume helps determine the upcoming network communication load. This information allows the network system to anticipate traffic volume and prepare sufficient buffer resources in advance, serving as a key input for proactive congestion control. Each computing device can tally its own data volume, and the network devices or the server can then aggregate these values to obtain the overall target data volume.
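The tally-then-aggregate step can be sketched in a few lines. The example assumes each device reports the byte size of the gradients it must synchronize, computed as parameter count times bytes per parameter; the function names, the device names, and the 2-byte (16-bit) parameter width are illustrative assumptions.

```python
def gradient_bytes(num_params, bytes_per_param=2):
    """Per-device tally: bytes of gradient data to synchronize."""
    return num_params * bytes_per_param


def aggregate_target_volume(per_device_bytes):
    """Aggregation at a network device or server: sum over all devices."""
    return sum(per_device_bytes.values())


# Four hypothetical devices, each holding one million 16-bit gradients.
per_device = {f"gpu{i}": gradient_bytes(1_000_000) for i in range(4)}
total_bytes = aggregate_target_volume(per_device)
```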

[0060] Device utilization can be used to determine the percentage of time a computing device spends performing effective computations (rather than idle waiting), such as GPU utilization. If network congestion causes communication delays, device utilization will decrease; therefore, improving device utilization can be one of the goals of network parameter optimization. For multiple computing devices, the overall device utilization can be obtained by averaging the device utilization of each device.
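As a small illustration of the two quantities described above, the sketch below computes one device's utilization as busy time over elapsed time and then averages across devices. The function names and the sample values are hypothetical.

```python
def device_utilization(busy_s, elapsed_s):
    """Fraction of elapsed time spent on effective computation."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return busy_s / elapsed_s


def overall_utilization(per_device):
    """Overall utilization as the mean of the per-device values."""
    if not per_device:
        raise ValueError("at least one device is required")
    return sum(per_device) / len(per_device)


# Two hypothetical devices observed over the same 10-second interval.
utils = [device_utilization(5.0, 10.0), device_utilization(10.0, 10.0)]
mean_util = overall_utilization(utils)
```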

[0061] The communication wait time characterizes the time a computing device must sit idle during model training while waiting for data because of network latency, such as the wait time of collective communication. Communication waiting reduces GPU utilization and hurts model training efficiency; therefore, reducing the communication wait time can be one of the goals of network parameter optimization.

[0062] In this way, based on the above training state information, the data communication requirements of the model training task in the network can be determined, so as to achieve deep collaboration between the network and the model training task.

[0063] The aforementioned network status information may be the network device's own status information. For example, the network status information may include at least one of the following: the network device's bandwidth information (e.g., link utilization), the network device's data latency information (e.g., queue length information), the network device's congestion notification information (e.g., the explicit congestion notification (ECN) marking ratio), or the network device's flow control information (e.g., the number of times priority flow control is triggered).

[0064] The network device may include multiple queues, each with a different transmission priority. The aforementioned data latency information may include the current length of each queue and the average length over a preset duration (e.g., the last 5 seconds).
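Tracking the current length and the average over a recent window for one queue can be sketched with a deque of timestamped samples. The class name, the sampling interface, and the use of a simple arithmetic mean over the samples in the window are assumptions; the 5-second default follows the example in the text.

```python
from collections import deque


class QueueLengthTracker:
    """Keeps the current length and a sliding-window average for one queue."""

    def __init__(self, window_s=5.0):
        self.window_s = window_s
        self.samples = deque()      # (timestamp, length) pairs, oldest first

    def record(self, timestamp, length):
        self.samples.append((timestamp, length))
        # Drop samples that fall outside the window ending at `timestamp`.
        while self.samples[0][0] < timestamp - self.window_s:
            self.samples.popleft()

    @property
    def current(self):
        return self.samples[-1][1]

    @property
    def average(self):
        return sum(length for _, length in self.samples) / len(self.samples)
```

A device with several priority queues would hold one tracker per queue. Both properties require at least one recorded sample.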

[0065] The explicit congestion notification (ECN) marking ratio mentioned above represents the proportion of data packets marked with the explicit congestion notification flag among the data packets sent from the network device. When a network device detects queue congestion (such as a queue length exceeding a threshold), it proactively marks the IP header of the data packets. The ECN marking ratio quantifies the strength and frequency of the congestion signals actively sent by the network device to the data sender, and can therefore serve as information for determining whether network congestion has occurred and how severe it is.
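The ratio described above can be computed from two packet counters sampled at the start and end of an interval. This is a minimal sketch; the counter names and the choice to return 0.0 when no packets were sent are assumptions.

```python
def ecn_marking_ratio(marked_prev, sent_prev, marked_now, sent_now):
    """Share of sent packets that carried the ECN mark during an interval."""
    sent = sent_now - sent_prev
    if sent <= 0:
        return 0.0                  # nothing sent in this interval
    return (marked_now - marked_prev) / sent


# Hypothetical counters: 100 packets sent in the interval, 25 of them marked.
ratio = ecn_marking_ratio(marked_prev=10, sent_prev=100, marked_now=35, sent_now=200)
```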

[0066] The aforementioned flow control information can be the number of times a network device triggers Priority Flow Control (PFC), such as the number of times a network device sends a "pause frame" to the upstream device to prevent packet loss. When a network device (such as a switch) detects congestion in its receive queue, it sends a "pause frame" to the upstream device. This pause frame requires the upstream device to stop sending traffic of a specific priority group for a specified time. The PFC mechanism can effectively prevent packet loss caused by insufficient processing capacity of downstream devices. However, PFC also has some drawbacks, such as potentially causing "head-of-line blocking," where congestion in one queue may cause traffic in other non-congested queues on the same port to be paused, thereby reducing overall link utilization. For this reason, PFC can be used in conjunction with ECN: ECN is used for end-to-end rate control, while PFC acts as a last resort to prevent packet loss when ECN fails to effectively control congestion. This flow control information is therefore one of the important pieces of information for monitoring network device performance.

[0067] Step S220: Determine the parameter configuration target for configuring the network parameters of the network device based at least on the training state information.

[0068] The network parameters can include parameters for configuring the congestion notification mechanism in the network device. For example, this congestion notification mechanism can be Explicit Congestion Notification (ECN). When the queue length of a network device exceeds a preset threshold (e.g., Kmin), a congestion flag (ECN Congestion Experienced, CE) can be added to the IP header of passing data packets instead of directly discarding them. Upon receiving a data packet with this congestion flag, the receiving end notifies the sending end of network congestion via a transport layer protocol (e.g., TCP). Upon receiving this notification, the sending end proactively reduces its transmission rate, thereby alleviating network congestion and preventing queue overflow and packet loss. The ECN mechanism achieves lossless transmission through early warning, avoiding packet loss during model training. The parameters of this explicit congestion notification can include a low marking threshold (Kmin), a high marking threshold (Kmax), and a maximum marking probability (Pmax). It should be noted that the specific meanings of the low marking threshold, high marking threshold, and maximum marking probability can be found in descriptions in related technologies, and will not be repeated in this embodiment.
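The disclosure defers the meaning of Kmin, Kmax, and Pmax to related technologies. One commonly used convention, assumed here and not confirmed by this source, is a RED-style curve: no marking at or below Kmin, a marking probability that rises linearly to Pmax at Kmax, and marking of every packet above Kmax. Exact boundary behavior varies between devices. A sketch under that assumption:

```python
def ecn_mark_probability(queue_len, kmin, kmax, pmax):
    """Marking probability for a given queue length (assumed RED-style curve)."""
    if not (0 <= kmin < kmax) or not (0.0 <= pmax <= 1.0):
        raise ValueError("require 0 <= kmin < kmax and 0 <= pmax <= 1")
    if queue_len <= kmin:
        return 0.0                  # queue is short: never mark
    if queue_len > kmax:
        return 1.0                  # queue is very long: mark every packet
    # Between the thresholds the probability grows linearly up to pmax.
    return pmax * (queue_len - kmin) / (kmax - kmin)
```

Under this curve, raising Kmin and Kmax delays marking and lets queues grow, which favors throughput, while lowering them signals congestion earlier and keeps queues short, which favors latency.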

[0069] The parameter configuration objective can include goals set to reduce the execution latency of the target training task due to network congestion. This objective can be varied; for example, it can include high throughput or low latency goals. It can also include reducing the number of flow control operations, or improving indicators related to the model training process, such as device utilization and iteration time. This objective can be used to constrain the configuration or update strategy of network parameters; for example, it can be an ECN strategy.

[0070] Different training state information can correspond to different parameter configuration objectives. For example, the training state information may include the current training stage of the target training task; when the current training stage is the forward propagation stage, the parameter configuration objective may include a low latency objective; or, when the current training stage is the backpropagation stage or the parameter update stage, the parameter configuration objective may include a high throughput objective.

[0071] It should be noted that the distributed training process of a large model can be performed in multiple rounds of iteration in a fixed order. Each iteration can be executed in the order of forward propagation -> back propagation -> parameter update. The communication patterns and traffic characteristics between nodes (computing devices) differ at different stages. In this embodiment, through collaboration with the large model training framework, the current training stage can be perceived, and parameter configuration targets (such as ECN strategies) can be dynamically adjusted. For example, during the back propagation stage, gradient synchronization generates a large amount of high-throughput data streams (such as elephant flows). At this time, the network device can set the parameter configuration target to a high-throughput target, i.e., it can automatically switch to a high-throughput mode, increase the ECN threshold, and prioritize bandwidth utilization. During the forward propagation stage, computationally intensive tasks dominate, and network traffic is relatively low. At this time, the network device can set the parameter configuration target to a low-latency target, i.e., it can automatically switch to a low-latency mode, decrease the ECN threshold, and ensure that control signaling and other data streams can pass through quickly.

[0072] In this way, network devices can intelligently switch parameter configuration targets based on the training phase, so that network resources are used most rationally at different stages, thereby improving the overall model training efficiency.
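
The stage-to-target switching described above can be sketched as a simple lookup. The mapping itself follows the text; the enum names, the function names, and the "raise"/"lower" labels are hypothetical and serve only to illustrate the idea.

```python
from enum import Enum


class Stage(Enum):
    FORWARD = "forward_propagation"
    BACKWARD = "back_propagation"
    UPDATE = "parameter_update"


class Target(Enum):
    LOW_LATENCY = "low_latency"
    HIGH_THROUGHPUT = "high_throughput"


# Mapping taken from the description: forward propagation -> low latency,
# back propagation and parameter update -> high throughput.
STAGE_TO_TARGET = {
    Stage.FORWARD: Target.LOW_LATENCY,
    Stage.BACKWARD: Target.HIGH_THROUGHPUT,
    Stage.UPDATE: Target.HIGH_THROUGHPUT,
}


def select_target(stage: Stage) -> Target:
    """Return the parameter configuration target for a training stage."""
    return STAGE_TO_TARGET[stage]


def ecn_threshold_direction(target: Target) -> str:
    """Direction in which the ECN thresholds are moved for a target."""
    # High throughput tolerates deeper queues, so thresholds are raised;
    # low latency wants short queues, so thresholds are lowered.
    return "raise" if target is Target.HIGH_THROUGHPUT else "lower"
```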

[0073] Step S230: Configure the network parameters according to the parameter configuration target, training state information, and network state information.

[0074] For example, network devices can configure network parameters based on Proximal Policy Optimization (PPO) to ensure that the network parameters meet the parameter configuration objectives. This PPO can be a policy-based reinforcement learning algorithm that directly parameterizes and optimizes the policy function. Unlike value-based algorithms, PPO can handle continuous action spaces more smoothly. By limiting the magnitude of each policy update, PPO can prevent drastic policy changes during training, thus ensuring learning stability and improving sample efficiency. This allows for the simple and efficient output of continuous network parameters, which may include congestion notification parameters. Thus, network parameters configured in this way can meet the parameter configuration objectives.
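
The way PPO limits the magnitude of each policy update can be illustrated with its standard clipped surrogate objective for a single sample. This is a minimal sketch of the general PPO technique, not of the disclosed agent; the function name and the default clipping range of 0.2 are assumptions.

```python
def ppo_clipped_objective(ratio: float, advantage: float,
                          epsilon: float = 0.2) -> float:
    """Clipped surrogate objective of PPO for one sample.

    ratio     : pi_new(a|s) / pi_old(a|s), the probability ratio
    advantage : estimated advantage of the action taken
    epsilon   : clipping range (0.2 is an assumed, commonly used value)
    """
    # Restrict the ratio to [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes any incentive to move the policy
    # further than the clipping range allows.
    return min(ratio * advantage, clipped_ratio * advantage)
```

For a positive advantage, the objective stops growing once the ratio exceeds 1 + epsilon, so a single update cannot push the policy arbitrarily far from the previous one.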

[0075] Using the methods described in steps S210 to S230, each network device in the computing system acquires training status information during the execution of the target training task, as well as network status information of the network device. At least based on the training status information, a parameter configuration target for configuring the network parameters of the network device is determined. The network parameters are then configured according to the parameter configuration target, the training status information, and the network status information. The network parameters may include parameters for configuring the congestion notification mechanism in the network device, and the parameter configuration target may include a target set to reduce the execution delay of the target training task caused by network congestion. In this way, the network devices no longer blindly configure and adjust parameters based solely on their own low-level metrics such as queue size and packet loss. Instead, they intelligently adjust network parameters based on the training status information of the model training task, dynamically optimizing network performance and forming a closed-loop optimization from application performance feedback to network parameter adjustment. This ensures that data transmission in the network meets the needs of model training, effectively reducing training latency caused by congestion and improving the overall efficiency of distributed training.

[0076] In some embodiments of this disclosure, since the computing system employs a large-scale distributed framework to execute training tasks, that is, multiple computing devices and network devices are involved in the target training task, the network parameter configuration method can be executed in a distributed manner across the network devices. For example, a multi-agent reinforcement learning (MARL) framework can be used, in which an independent agent (e.g., an SNA) is deployed on each network device (e.g., a switch) so that the network parameter configuration method provided in this embodiment is executed in a distributed manner.

[0077] By adopting this distributed decision-making mechanism, the computational load of each agent is fixed and independent of the network size, which greatly improves the scalability of the system. Furthermore, since the analysis and decision-making are carried out locally, there is no need to wait for instructions from the central controller (such as a server), which can quickly respond to microsecond-level network congestion changes and improve the timeliness of network parameter adjustment. This enables the system to cope with changes in data transmission requirements during model training and improve model training efficiency.

[0078] In some embodiments of this disclosure, the training status information mentioned above is training status information collected by a model performance monitoring module deployed in a computing device; the model performance monitoring module can be implemented by adding an event-driven callback function to the training node of the target model, and the training node may include at least one of the following: an iteration start node, a forward propagation phase start node, a back propagation phase start node, a parameter update phase start node, or an iteration end node.

[0079] This callback function can be implemented using a lightweight hook mechanism. For example, a registered callback function can be created within the main training loop of the target model.

[0080] In this way, the model performance monitoring module can be inserted at the training nodes of the target model through a lightweight hook mechanism, achieving non-intrusive integration without modifying the core code of the training framework, thus improving ease of deployment and compatibility. Furthermore, when the training task of the target model reaches such a node, the module can acquire training status information and send it to network devices or servers, enabling network devices to obtain training status information promptly and reliably and to achieve microsecond-level parameter adjustments.
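
A minimal sketch of such an event-driven hook mechanism is given below. The class `TrainingHooks`, the node names, and the stand-in training loop `run_training` are hypothetical; a real training framework would expose its own hook interface, and the callback would send the status to a network device or server instead of appending to a list.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Training nodes named in the description (identifiers are assumptions).
NODES = ("iteration_start", "forward_start", "backward_start",
         "update_start", "iteration_end")

Callback = Callable[[str, int], None]


class TrainingHooks:
    """Registry of callbacks fired when training reaches a node."""

    def __init__(self) -> None:
        self._callbacks: Dict[str, List[Callback]] = defaultdict(list)

    def register(self, node: str, callback: Callback) -> None:
        if node not in NODES:
            raise ValueError(f"unknown training node: {node}")
        self._callbacks[node].append(callback)

    def fire(self, node: str, iteration: int) -> None:
        for callback in self._callbacks[node]:
            callback(node, iteration)


def run_training(hooks: TrainingHooks, iterations: int) -> None:
    """Stand-in for the main training loop; only the hook points are real."""
    for it in range(iterations):
        hooks.fire("iteration_start", it)
        hooks.fire("forward_start", it)    # forward pass would run here
        hooks.fire("backward_start", it)   # backward pass would run here
        hooks.fire("update_start", it)     # optimizer step would run here
        hooks.fire("iteration_end", it)


# Usage: the monitoring module records each stage change it observes.
events: List[Tuple[str, int]] = []
hooks = TrainingHooks()
for node in NODES:
    hooks.register(node, lambda n, i: events.append((n, i)))
run_training(hooks, iterations=1)
```

Because the monitoring logic lives entirely in the registered callbacks, the training loop itself only needs the hook points and no other modification.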

[0081] In some examples, the model performance monitoring module in the computing device can directly send the collected training status information to the network device; alternatively, it can send the training status information to a server, which then forwards it to each network device. For instance, the server can collect the training status information from each computing device in the computing system, synthesize it into the overall training status information of the entire computing system, and then send this training status information to each network device. In this way, the network devices can receive the training status information sent by the server.

[0082] In this way, by centrally collecting and distributing training state information from various computing devices through the server, global information aggregation and collaboration are achieved. This enables each network device to make decisions not only based on local information but also indirectly obtain a global perspective. This is conducive to forming collaborative optimization strategies among multiple network devices, avoiding potential strategy conflicts or local suboptimal results in distributed decision-making, and improving the overall integrity and consistency of the entire computing system optimization.
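
The server-side synthesis of per-device reports into overall training status information can be sketched as follows. The disclosure does not specify how the reports are combined, so the record fields and the aggregation rules (majority stage, slowest iteration time, mean utilization) are assumptions chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class DeviceReport:
    """Training status reported by one computing device (assumed fields)."""
    device_id: str
    stage: str
    iteration_time_s: float
    utilization: float  # fraction in [0, 1]


@dataclass(frozen=True)
class GlobalTrainingState:
    stage: str
    iteration_time_s: float
    mean_utilization: float


def aggregate(reports: Sequence[DeviceReport]) -> GlobalTrainingState:
    """Combine per-device reports into one system-wide view."""
    if not reports:
        raise ValueError("at least one report is required")
    # Assumed rules: the stage most devices are in, the slowest device's
    # iteration time (it gates synchronous training), and mean utilization.
    majority_stage = Counter(r.stage for r in reports).most_common(1)[0][0]
    return GlobalTrainingState(
        stage=majority_stage,
        iteration_time_s=max(r.iteration_time_s for r in reports),
        mean_utilization=mean(r.utilization for r in reports),
    )
```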

[0083] In some embodiments of this disclosure, the specific implementation of step S220, which determines the parameter configuration target for configuring network parameters of the network device based on training state information, may include: determining the parameter configuration target based on training state information and network state information.

[0084] In this way, by combining the training state and the network state, and comprehensively considering the application layer requirements and network layer state, a deep integration of network and model training is achieved, ensuring the stability and health of the network itself while meeting application performance goals.

[0085] In some examples, the aforementioned training state information includes the iteration time of the target training task and the device utilization rate of the computing device executing the target training task, and the aforementioned network state information includes the flow control information of the network device. The parameter configuration objective can include multiple sub-objectives, and a weight can be assigned to each sub-objective to obtain a comprehensive parameter configuration objective. For example, the parameter configuration objective can include training layer objectives and network layer objectives, with the weight of the training layer objectives being greater than the weight of the network layer objectives, i.e., the training layer objectives are prioritized. This ensures that the optimization of the network parameters primarily serves the business objective of improving model training efficiency, such as prioritizing key indicators like device utilization and iteration time, rather than merely optimizing the intermediate states of the network itself. This prevents the network device from adopting overly conservative strategies in pursuit of local metrics (such as extremely low PFC counts), thereby achieving global optimization of training performance.

[0086] For example, the training layer objective in this parameter configuration objective may include a first objective and a second objective, and the network layer objective may include a third objective, wherein:

[0087] The first objective can be characterized by increasing device utilization. Optimizing network parameter configuration based on this objective can reduce GPU idle waiting caused by network congestion, thereby improving the effective utilization of computing resources.

[0088] The second objective can be characterized by reducing iteration time. Optimizing network parameter configuration based on this objective can shorten the single training cycle and accelerate the overall training process.

[0089] The third objective can be characterized by reducing the number of flow control operations. Optimizing network parameter configuration based on this objective can avoid network instability or throughput degradation caused by frequent priority flow control (such as PFC).

[0090] The first, second, and third objectives can each be assigned independent weights. For example, the weight of the first objective can be greater than that of the second objective, and the weight of the second objective can be greater than that of the third objective.

[0091] Thus, this embodiment establishes a closed-loop feedback control from training performance metrics such as device utilization and iteration time to network ECN parameters. Compared to traditional network parameter optimization that only focuses on network-level metrics (such as bandwidth and latency), this embodiment can directly optimize the ultimate goal of training efficiency, enabling the agent's decisions to directly serve the overall performance of the training task. This end-to-end optimization breaks down the barrier between the application layer and the network layer, realizing a paradigm shift in application-driven networking.

[0092] It should be noted that the above parameter configuration objectives can be used as indicators in the reward function of parameter configuration, and the network parameter configuration can be constrained and guided to meet the above parameter configuration objectives through the reward function.

[0093] In some examples, the network status information mentioned above includes the current traffic type of the network device; if there are multiple data streams in the network device, the network status information includes the current traffic type of each data stream, and different data streams correspond to different parameter configuration targets.

[0094] When the current traffic type of the data stream is the first traffic type, the parameter configuration target corresponding to the data stream is the low latency target. The first traffic type indicates that the current traffic of the data stream is less than the preset traffic threshold. This first traffic type can be called a mouse flow.

[0095] When the current traffic type of the data stream is the second traffic type, the parameter configuration target corresponding to the data stream is the high throughput target. The second traffic type indicates that the current traffic of the data stream is greater than or equal to the preset traffic threshold. This second traffic type can be called an elephant flow.

[0096] In this example, the system not only perceives the training phase but also identifies traffic types, achieving flow-aware networking. By analyzing characteristics such as packet size, transmission frequency, source and destination IP addresses, and port numbers, the system can distinguish traffic types online (e.g., elephant flows and mouse flows). For elephant flows (such as gradient data transmission flows), an aggressive ECN strategy can be adopted, setting a larger threshold and allowing deeper queue buffers to maximize throughput; for mouse flows (such as control signaling), a conservative ECN strategy can be adopted, setting a near-zero threshold to ensure minimal latency. This differentiated service can be achieved through the multi-queue mechanism of network devices and DSCP tagging, ensuring that the two types of traffic are physically isolated and do not interfere with each other. This provides finer-grained network parameter tuning at the data flow level, improves the reliability of network data transmission, and thus improves model training efficiency.
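
The threshold-based distinction between the two traffic types can be sketched as below. The comparison rule (less than the threshold versus greater than or equal to it) follows the text; the function names, the string labels, and the use of bytes as the traffic unit are assumptions.

```python
def classify_flow(traffic_bytes: int, threshold_bytes: int) -> str:
    """Classify a data stream by its current traffic volume.

    Traffic below the preset threshold is the first type (mouse flow);
    traffic at or above it is the second type (elephant flow).
    """
    if threshold_bytes <= 0:
        raise ValueError("threshold must be positive")
    return "elephant" if traffic_bytes >= threshold_bytes else "mouse"


def target_for_flow(flow_type: str) -> str:
    """Map a traffic type to its parameter configuration target."""
    targets = {"mouse": "low_latency", "elephant": "high_throughput"}
    return targets[flow_type]
```

The threshold value itself would be a deployment-specific setting; no particular value is given in the text.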

[0097] In some embodiments of this disclosure, steps S220 and S230 can employ algorithms such as Markov Decision Process (MDP) or decision trees to configure and adjust network parameters based on training state information and network state information. For example, this Markov Decision Process can provide a mathematical framework for the agent based on Reinforcement Learning (RL), enabling the agent deployed in the network device to learn to obtain the maximum long-term cumulative reward by trying different actions, thereby reducing the delay in the execution of the target training task caused by network congestion.

[0098] Traditional methods that employ fixed network parameter configurations or manual parameter adjustments fall short when dealing with the rapidly changing network traffic during large-scale model training. Reinforcement learning based on Markov decision processes offers a novel paradigm: an agent learns optimal decision-making strategies through trial and error in continuous interaction with the network environment. The agent no longer needs to know all network models and traffic patterns in advance; instead, it gradually optimizes its behavior by observing environmental states, executing actions, and receiving feedback rewards, ultimately aiming to maximize long-term cumulative rewards. This adaptive and self-learning capability is key to solving the dynamic network optimization problem in the embodiments of this disclosure.

[0099] In this embodiment, the entire computing system performing the target training task can be viewed as an MDP. Specifically, each network device (or port or queue within the network device) requiring ECN tuning can be modeled as an independent agent executing a Markov decision process. This Markov decision process can include the following decision elements: state space, action space, transition probability, reward function, and discount factor. By precisely defining these elements, we can formalize the network parameter configuration tuning problem and apply reinforcement learning algorithms to solve for the optimal policy. At each discrete time step, the agent deployed on the network device can choose to execute a network parameter adjustment action based on the current network state information and training state information. After execution, the environment will transition to a new state, and the agent will be rewarded. This process is repeated, and the agent's goal is to learn an optimal policy that maximizes the long-term cumulative reward.
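
The long-term cumulative reward that the agent maximizes is conventionally the discounted return, which weights each later reward by a further power of the discount factor. The following minimal sketch computes it for a finite reward sequence; the function name is an assumption.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Compute G = r_0 + gamma * r_1 + gamma^2 * r_2 + ...

    rewards : rewards received at successive time steps
    gamma   : discount factor in [0, 1]
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must be in [0, 1]")
    total = 0.0
    # Work backwards so each step is one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

A discount factor close to 1 makes the agent value the effect of an ECN adjustment on later iterations almost as much as its immediate effect, while a smaller factor emphasizes immediate rewards.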

[0100] In some examples, the parameter configuration target determined in step S220 above can be the reward function of the Markov decision process, which is a decision element. In step S230, other decision elements of the Markov decision process can be determined based on the training state information and network state information, and the network parameters can be configured and adjusted based on the Markov decision process.

[0101] For example, within the Markov decision process framework, accurately defining the state space, action space, and reward function is crucial for the successful application of reinforcement learning. Specifically:

[0102] The state space defines the environmental information that an agent (such as a network device) can perceive. In this embodiment, the state is a multi-dimensional vector containing state information that comprehensively reflects the network's operational status. This state information can include the aforementioned training state information and network state information. For example, it can include network state information such as the network device's queue length (instantaneous and average), link utilization, ECN tagging rate, and PFC backpressure counts, as well as training state information obtained from the training framework, such as training state, training data volume, iteration time, and device utilization. The training state information and network state information provide a foundation for the agent to make accurate decisions.

[0103] The action space defines the actions an agent (such as a network device) can take in each state. In this embodiment, an action corresponds to adjusting the parameters of the switch's ECN mechanism. The parameters of the action space may include a low marking threshold (Kmin), a high marking threshold (Kmax), and a maximum marking probability (Pmax). To reduce decision-making complexity, these continuous or high-dimensional parameters can be discretized; for example, Kmin and Kmax can take a series of preset values, and Pmax can also be discretized with a certain step size. The agent's task is then to select an optimal action from these discrete combinations.

[0104] For example, the action space (Kmin, Kmax, Pmax) can include any one or more of the following first to fifth combinations of parameters.

[0105] The first combination is an aggressive mode combination, prioritizing low latency, for example (Kmin, Kmax, Pmax) = (10, 50, 10).

[0106] The second combination is a balanced mode combination of throughput and latency, for example, (Kmin, Kmax, Pmax) = (20, 100, 20).

[0107] The third combination is a second balanced mode combination of throughput and latency, with larger thresholds than the second combination, for example (Kmin, Kmax, Pmax) = (40, 200, 40).

[0108] The fourth combination is a high-throughput mode combination, for example (Kmin, Kmax, Pmax) = (80, 400, 80).

[0109] The fifth combination is an elephant flow mode combination, for example (Kmin, Kmax, Pmax) = (160, 800, 100).

[0110] It should be noted that the numerical units of Kmin and Kmax mentioned above can be kilobytes (KB), and the numerical unit of Pmax mentioned above can be percentage.
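
The five example combinations and their units can be written down as a discrete action table. The numeric values come from the examples above; the type name `EcnAction`, the field names, and the short mode labels are hypothetical.

```python
from typing import NamedTuple, Tuple


class EcnAction(NamedTuple):
    mode: str          # descriptive label (assumed naming)
    kmin_kb: int       # low threshold Kmin, in KB
    kmax_kb: int       # high threshold Kmax, in KB
    pmax_percent: int  # maximum probability Pmax, in percent


# Example values from the first to fifth combinations in the description.
ACTION_SPACE: Tuple[EcnAction, ...] = (
    EcnAction("low_latency", 10, 50, 10),
    EcnAction("balanced_1", 20, 100, 20),
    EcnAction("balanced_2", 40, 200, 40),
    EcnAction("high_throughput", 80, 400, 80),
    EcnAction("elephant_flow", 160, 800, 100),
)


def pmax_as_probability(action: EcnAction) -> float:
    """Convert Pmax from percent to a probability in [0, 1]."""
    return action.pmax_percent / 100.0
```

With this representation the agent's decision reduces to choosing an index from 0 to 4, and the thresholds grow monotonically from the low-latency end to the elephant flow end.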

[0111] The reward function acts as a compass in reinforcement learning, quantifying the merits of network parameter configurations and guiding the learning process to achieve the desired configuration. It guides the network agent to achieve the optimal balance between maximizing network throughput and minimizing network latency. A weighted reward function can be set, for example, r1 = w11 × T(R) + w12 × D(L), where T(R) is the reward component related to throughput, D(L) is the penalty component related to latency, and w11 and w12 are their respective weight coefficients. By adjusting the weights, the optimization objective can be flexibly adapted to different traffic types (such as elephant flows and mouse flows).
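
A minimal sketch of the weighted reward r1 is shown below. The text does not define T(R) and D(L), so normalizing throughput by link capacity and latency by a latency budget is an assumption, as are the per-traffic-type weights in `FLOW_WEIGHTS` and all function and parameter names.

```python
from typing import Dict, Tuple

# Hypothetical (w11, w12) weights per traffic type; the values are
# illustrative and are not taken from the disclosure.
FLOW_WEIGHTS: Dict[str, Tuple[float, float]] = {
    "elephant": (0.8, 0.2),  # emphasize throughput
    "mouse": (0.2, 0.8),     # emphasize latency
}


def throughput_reward(rate_gbps: float, capacity_gbps: float) -> float:
    """T(R): assumed to be the fraction of link capacity in use."""
    return min(rate_gbps / capacity_gbps, 1.0)


def latency_penalty(latency_us: float, budget_us: float) -> float:
    """D(L): assumed to be the negative latency relative to a budget."""
    return -(latency_us / budget_us)


def reward_r1(flow_type: str, rate_gbps: float, capacity_gbps: float,
              latency_us: float, budget_us: float) -> float:
    """r1 = w11 * T(R) + w12 * D(L) with weights chosen per traffic type."""
    w11, w12 = FLOW_WEIGHTS[flow_type]
    return (w11 * throughput_reward(rate_gbps, capacity_gbps)
            + w12 * latency_penalty(latency_us, budget_us))
```

With the same measurements, the elephant flow weights reward high link usage, whereas the mouse flow weights penalize queuing delay more heavily.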

[0112] The reward function in this embodiment can also guide the parameter configuration of the network device to achieve a balance between the training layer and the network layer, and prioritize the needs of the training layer. For example, the reward function can use the following formula: r2 = w21 × r_th + w22 × r_delay + w23 × r_pfc; where r2 is the final comprehensive reward value, r_th is the first reward component (i.e., the first objective) determined based on device utilization, r_delay is the second reward component (i.e., the second objective) determined based on iteration time, r_pfc is the third reward component (i.e., the third objective) determined based on flow control information (e.g., the number of times priority flow control is triggered), w21 is the weight of the first reward component, w22 is the weight of the second reward component, and w23 is the weight of the third reward component. We can set w21 > w22 > w23, for example, w21 is 0.6, w22 is 0.3, and w23 is 0.1.

[0113] In this way, by employing a multi-objective weighted sum in the reward function, improving training efficiency (with device utilization and iteration time as the core) is set as the highest-weighted optimization objective, while ensuring network stability (with reducing PFC triggering as a constraint) is set as a secondary objective. This design ensures that the agent's learning direction always prioritizes accelerating model training and achieves an automated optimal balance between increasing throughput, reducing latency, and maintaining network health, thereby achieving the core objective of improving training efficiency.
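
The comprehensive reward r2 with the example weights 0.6, 0.3, and 0.1 can be sketched as follows. The weights and the three inputs (device utilization, iteration time, PFC trigger count) come from the text; how each component is normalized is not specified, so the forms of `r_th`, `r_delay`, and `r_pfc` below, together with the baseline and cap parameters, are assumptions.

```python
W21, W22, W23 = 0.6, 0.3, 0.1  # example weights from the description


def r_th(utilization: float) -> float:
    """First component: device utilization as a fraction in [0, 1]."""
    return max(0.0, min(utilization, 1.0))


def r_delay(iteration_time_s: float, baseline_s: float) -> float:
    """Second component: relative reduction of the iteration time
    against an assumed baseline (positive when iterations get faster)."""
    return (baseline_s - iteration_time_s) / baseline_s


def r_pfc(pfc_count: int, pfc_cap: int) -> float:
    """Third component: penalty that grows with the PFC trigger count,
    saturating at an assumed cap."""
    return -min(pfc_count / pfc_cap, 1.0)


def reward_r2(utilization: float, iteration_time_s: float,
              baseline_s: float, pfc_count: int, pfc_cap: int) -> float:
    """r2 = w21 * r_th + w22 * r_delay + w23 * r_pfc."""
    return (W21 * r_th(utilization)
            + W22 * r_delay(iteration_time_s, baseline_s)
            + W23 * r_pfc(pfc_count, pfc_cap))
```

Because w21 > w22 > w23, a change that raises device utilization outweighs an equally sized change in the PFC penalty, which matches the stated priority of the training layer over the network layer.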

[0114] In some examples, when configuring network parameters using Proximal Policy Optimization (PPO) in step S230 above, the state space, action space, preset time step, preset experience buffer size, PPO policy network parameters, training epochs, etc., can be used as inputs to obtain network parameters that meet the parameter configuration objectives based on the reward function.

[0115] In some embodiments of this disclosure, the above-described network parameter configuration method can be executed by an agent in a network device. This agent can adopt a structure that separates training and inference. The agent is generated in advance through offline training and deployed on various network devices of the computing system. During the online model training process, the agents deployed on each network device can perform online inference.

[0116] During the offline training phase, the policy network can be pre-trained on simulators or historical data to avoid performance fluctuations in the early stages of online learning. For example, historical training log data can first be collected to generate a simulated network environment. The agent can then be trained in this simulated environment, for example using a multi-agent reinforcement learning framework. After training, a policy model file for the agent is generated, ready for online use.

[0117] During the online inference phase, the system starts and loads the offline-trained policy model file. Each network device collects the training state and network state of the target model in real time, and performs lightweight local computation (e.g., forward inference) based on the agent, followed immediately by executing the corresponding network configuration. By deploying the agent on each network device, the decision latency can be reduced to less than 1 millisecond.
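
The lightweight local computation at inference time can be illustrated with a toy policy that maps a state vector to one of the discrete actions. The linear form of the policy, the function names, and any weights passed in are stand-ins; the disclosed agent would instead load the offline-trained policy model file, whose structure is not described in the text.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over action scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_logits(state: Sequence[float],
                  weights: Sequence[Sequence[float]],
                  bias: Sequence[float]) -> List[float]:
    """One score per action: a single linear layer (assumed policy form)."""
    return [sum(w * s for w, s in zip(row, state)) + b
            for row, b in zip(weights, bias)]


def select_action(state: Sequence[float],
                  weights: Sequence[Sequence[float]],
                  bias: Sequence[float]) -> int:
    """Greedy inference: index of the most probable action."""
    probs = softmax(policy_logits(state, weights, bias))
    return max(range(len(probs)), key=probs.__getitem__)
```

A forward pass of this kind involves only a few multiply-add operations per action, which is the sort of computation that can run locally on a network device.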

[0118] Furthermore, during the online inference process, the system collects execution results as feedback and periodically uses this real data to fine-tune the model in order to achieve continuous optimization and adapt to environmental drift.

[0119] Using this architecture, the agent can converge to a stable policy in a simulated environment in about 2,000 training episodes, which corresponds to about 4 hours of real training; online fine-tuning only requires 50 to 100 iterations to adapt to new tasks.

[0120] It should be noted that, taking the training of a 175B (175 billion) parameter model using a cluster of 512 GPUs as an example, in actual model training tests, the network parameter configuration method in the above embodiments of this disclosure, compared with the static ECN configuration (Kmin=50KB, Kmax=200KB, Pmax=100%), significantly improved model training efficiency. GPU utilization increased from 72% to 89%, and the effective computation time increased by 17 percentage points. Gradient synchronization time was reduced by an average of 23%, and the overall iteration time was shortened by 15%. It is estimated that several weeks of training time can be saved in the scenario of training a model with hundreds of billions of parameters. In the AllReduce communication-intensive stage, the overall network throughput increased by 31%, approaching the theoretical bandwidth limit (98% link utilization).

[0121] Furthermore, by employing the network parameter configuration method described in the above embodiments of this disclosure, network parameters can be automatically configured according to the training state, adapting to complex traffic models and reducing the complexity and cost of operation and maintenance management. When network link failures or topology changes occur, this method can automatically identify and adjust network parameters, achieving policy adjustments and performance recovery within seconds, while manual troubleshooting and adjustment typically takes more than 30 minutes. Moreover, since the intelligent agents are deployed on various network devices for lightweight online inference, the system can support more than 1000 intelligent agents running simultaneously, with control plane overhead accounting for less than 0.1% of network bandwidth, reducing the impact of this parameter configuration method on training data bandwidth and ensuring that decision latency has no impact on training performance.

[0122] Figure 3 is a schematic diagram of the structure of a network device provided in an embodiment of this disclosure. As shown in Figure 3, the network device 1200 may include a memory 1201 and a processor 1202. The memory can be used to store computer instructions, and the processor can be used to retrieve computer instructions from the memory to execute all or part of the steps of any of the methods in the foregoing embodiments of this disclosure. There may be one or more processors, and these processors may execute instructions individually or jointly. Similarly, there may be one or more memories, and these memories may store the aforementioned computer instructions individually or jointly. The network device may be a switch, such as a smart switch; this embodiment does not limit the network device to a specific type.

[0123] This disclosure also provides an embodiment of a computing system. As shown in Figure 1, the computing system may include multiple network devices and multiple computing devices, with the multiple computing devices connected through the multiple network devices; wherein: the computing devices can be used to perform a target training task to train a target model; and the network devices can be used to perform the network parameter configuration method of any of the foregoing embodiments of this disclosure.

[0124] This disclosure also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements all or part of the steps of any of the methods in the foregoing embodiments of this disclosure. Optionally, the computer-readable storage medium may be a non-transitory storage medium, but is not limited thereto, and may also be a transitory storage medium.

[0125] This disclosure also provides a computer program product that may include a computer program that, when executed by a processor, can implement all or part of the steps of any of the methods in the foregoing embodiments of this disclosure.

[0126] This disclosure may be a system, method, and / or computer program product. A computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for causing a processor to implement any of the methods in the foregoing embodiments of this disclosure.

[0127] Computer-readable storage media can be tangible devices capable of holding and storing instructions for use by an instruction execution device. Computer-readable storage media may include, for example, electrical storage devices, magnetic storage devices, optical storage devices, electromagnetic storage devices, semiconductor storage devices, or any suitable combination thereof. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any combination thereof. The computer-readable storage medium used herein is not to be interpreted as a transitory signal per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through wires.

[0128] The computer-readable program instructions described herein can be downloaded from computer-readable storage media to various computing / processing devices, or downloaded via a network, such as the Internet, local area network, wide area network, and / or wireless network, to an external computer or external storage device. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, network devices, gateway computers, and / or edge servers. A network adapter card or network interface in each computing / processing device receives the computer-readable program instructions from the network and forwards them to the computer-readable storage media in the respective computing / processing device.

[0129] Computer program instructions used to perform the operations of this disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages (e.g., Smalltalk, C++, etc.) and conventional procedural programming languages (e.g., the "C" language or similar programming languages). The computer-readable program instructions may execute entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer via any type of network (e.g., a local area network or a wide area network), or it may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays, or programmable logic arrays, may execute computer-readable program instructions to implement various aspects of the embodiments of this disclosure by utilizing state information from the computer-readable program instructions.

[0130] Various aspects of this disclosure are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this disclosure. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer-readable program instructions.

[0131] These computer-readable program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that, when executed by the processor of the computer or other programmable data processing apparatus, they create means for implementing the functions / actions specified in one or more blocks of the flowchart and / or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and / or other device to operate in a particular manner; thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions / actions specified in one or more blocks of the flowchart and / or block diagram.

[0132] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions that execute on the computer, other programmable data processing apparatus, or other device implement the functions / actions specified in one or more blocks of the flowchart and / or block diagram.

[0133] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. Each block in a flowchart or block diagram may represent a module, segment, or portion of instructions that contains one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than that marked in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. Embodiments of the present disclosure may include some or all of the functions marked in the multiple blocks in the drawings, and may also include other functions not shown in the blocks in the drawings. Each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions. Implementation in hardware, implementation in software, and implementation using a combination of software and hardware are all equivalent.

[0134] The various embodiments of this disclosure have been described above. These descriptions are exemplary and not exhaustive, and are not limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles, practical application, or improvement of the technology in the market, or to enable others skilled in the art to understand the embodiments disclosed herein. The scope of this disclosure is defined by the appended claims.

Claims

1. A method for configuring network parameters, characterized in that the method is applied to any one of a plurality of network devices of a computing system, the computing system further comprising a plurality of computing devices that are connected through the plurality of network devices and are used to perform a target training task of training a target model; the method includes: acquiring training state information during the execution of the target training task by the computing device, as well as network state information of the network device, wherein the training state information includes the current training phase of the target training task; determining, at least based on the training state information, a parameter configuration target for configuring the network parameters of the network device, wherein the network parameters include parameters for configuring the congestion notification mechanism in the network device, and the parameter configuration target includes a target set to reduce the execution delay of the target training task caused by network congestion; and configuring the network parameters according to the parameter configuration target, the training state information, and the network state information; wherein determining the parameter configuration target for configuring the network parameters of the network device based at least on the training state information includes: when the current training phase is the forward propagation phase, the parameter configuration target includes a low-latency target; or, when the current training phase is the backpropagation phase or the parameter update phase, the parameter configuration target includes a high-throughput target.
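
Illustrative note (not part of the claims): the phase-dependent choice recited in claim 1 can be sketched as a small lookup. This is a minimal sketch under stated assumptions, not the patented implementation. The names `TrainingPhase`, `ConfigTarget`, `select_target`, `ecn_profile_for`, and the ECN threshold values in `ECN_PROFILES` are hypothetical and do not appear in the source. The sketch covers only the mapping from training phase to target; the claim additionally uses the training state information and network state information when configuring the parameters.

```python
from enum import Enum


class TrainingPhase(Enum):
    FORWARD = "forward_propagation"
    BACKWARD = "backpropagation"
    PARAM_UPDATE = "parameter_update"


class ConfigTarget(Enum):
    LOW_LATENCY = "low_latency"
    HIGH_THROUGHPUT = "high_throughput"


# Hypothetical congestion-notification (ECN) marking profiles, expressed as
# queue-depth thresholds in KB. The numbers are illustrative assumptions only:
# a lower marking threshold keeps queues short (favoring latency), a higher
# one tolerates deeper queues (favoring throughput).
ECN_PROFILES = {
    ConfigTarget.LOW_LATENCY: {"ecn_min_kb": 20, "ecn_max_kb": 100},
    ConfigTarget.HIGH_THROUGHPUT: {"ecn_min_kb": 400, "ecn_max_kb": 1600},
}


def select_target(phase: TrainingPhase) -> ConfigTarget:
    """Map the current training phase to a parameter configuration target."""
    if phase is TrainingPhase.FORWARD:
        return ConfigTarget.LOW_LATENCY
    if phase in (TrainingPhase.BACKWARD, TrainingPhase.PARAM_UPDATE):
        return ConfigTarget.HIGH_THROUGHPUT
    raise ValueError(f"unsupported training phase: {phase!r}")


def ecn_profile_for(phase: TrainingPhase) -> dict:
    """Return a copy of the (hypothetical) ECN profile for the given phase."""
    return dict(ECN_PROFILES[select_target(phase)])
```

A network device agent would call `select_target` each time a new training phase is reported and then apply the resulting profile through whatever configuration interface the device exposes.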

2. The method according to claim 1, characterized in that the training state information is collected by a model performance monitoring module deployed in the computing device; the model performance monitoring module is implemented by adding event-driven callback functions at training nodes of the target model; and the training nodes include at least one of the following: an iteration start node, a forward propagation phase start node, a backpropagation phase start node, a parameter update phase start node, or an iteration end node.
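
Illustrative note (not part of the claims): one way to picture the event-driven callbacks of claim 2 is a small monitor that fires registered functions at each training node. This is a hedged sketch in plain Python; `PerformanceMonitor`, `TRAINING_NODES`, `register`, `emit`, and `run_iteration` are hypothetical names, and a real training framework would supply its own hook mechanism.

```python
import time
from typing import Callable, Dict, List

# The five training nodes listed in claim 2 (identifier spellings are assumptions).
TRAINING_NODES = (
    "iteration_start",
    "forward_start",
    "backward_start",
    "update_start",
    "iteration_end",
)


class PerformanceMonitor:
    """Collects training state events and forwards them to callbacks."""

    def __init__(self) -> None:
        self._callbacks: Dict[str, List[Callable[[dict], None]]] = {
            node: [] for node in TRAINING_NODES
        }
        self.events: List[dict] = []

    def register(self, node: str, callback: Callable[[dict], None]) -> None:
        if node not in self._callbacks:
            raise KeyError(f"unknown training node: {node}")
        self._callbacks[node].append(callback)

    def emit(self, node: str, iteration: int) -> None:
        # Record the event, then notify every callback registered for this node.
        event = {"node": node, "iteration": iteration, "timestamp": time.monotonic()}
        self.events.append(event)
        for callback in self._callbacks[node]:
            callback(event)


def run_iteration(monitor: PerformanceMonitor, iteration: int) -> None:
    """Skeleton training iteration; the actual compute steps are omitted."""
    monitor.emit("iteration_start", iteration)
    monitor.emit("forward_start", iteration)   # forward pass would run here
    monitor.emit("backward_start", iteration)  # gradient computation would run here
    monitor.emit("update_start", iteration)    # optimizer step would run here
    monitor.emit("iteration_end", iteration)
```

Differences between consecutive event timestamps give per-phase durations, from which quantities such as iteration time can be derived.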

3. The method according to claim 2, characterized in that the computing system further includes a server connected to the network devices and the computing devices; the server is used to gather the training state information collected by each of the computing devices during the execution of the target training task and send it to each of the network devices; and acquiring the training state information during the execution of the target training task by the computing device includes: receiving the training state information sent by the server.

4. The method according to claim 1, characterized in that determining the parameter configuration target for configuring the network parameters of the network device based at least on the training state information further includes: determining the parameter configuration target based on the training state information and the network state information; wherein the network state information includes at least one of the following: bandwidth information of the network device, data latency information of the network device, congestion notification information of the network device, or flow control information of the network device.

5. The method according to claim 4, characterized in that the training state information further includes the iteration time of the target training task and the device utilization rate of the computing device executing the target training task; the network state information includes the flow control information of the network device; and the parameter configuration target includes a first target, a second target, and a third target, wherein the weight of the first target is greater than that of the second target, the weight of the second target is greater than that of the third target, the first target represents increasing the device utilization rate, the second target represents reducing the iteration time, and the third target represents reducing the number of flow control operations.
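
Illustrative note (not part of the claims): the weighted ordering in claim 5 could be realized as a weighted sum over normalized improvements. The claim only requires that the first weight exceed the second and the second exceed the third; the weighted-sum form, the weight values, the normalization, and the names `Metrics`, `WEIGHTS`, and `score` below are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    device_utilization: float  # fraction in [0, 1]; higher is better
    iteration_time_s: float    # seconds per iteration; lower is better
    flow_control_ops: int      # count of flow control operations; lower is better


# Hypothetical weights satisfying w1 > w2 > w3 as required by claim 5.
WEIGHTS = (0.5, 0.3, 0.2)


def score(baseline: Metrics, candidate: Metrics, weights=WEIGHTS) -> float:
    """Weighted improvement of a candidate configuration over a baseline."""
    w1, w2, w3 = weights
    if not (w1 > w2 > w3 > 0):
        raise ValueError("weights must satisfy w1 > w2 > w3 > 0")
    # First target: increase device utilization.
    util_gain = candidate.device_utilization - baseline.device_utilization
    # Second target: reduce iteration time (relative reduction).
    time_gain = (
        baseline.iteration_time_s - candidate.iteration_time_s
    ) / baseline.iteration_time_s
    # Third target: reduce flow control operations (relative reduction).
    fc_gain = (
        baseline.flow_control_ops - candidate.flow_control_ops
    ) / max(baseline.flow_control_ops, 1)
    return w1 * util_gain + w2 * time_gain + w3 * fc_gain
```

With these weights, an equal-sized improvement in device utilization outranks the same improvement in iteration time, which in turn outranks the same reduction in flow control operations.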

6. The method according to claim 4, characterized in that the network state information includes the current traffic type of the network device; there are multiple data streams in the network device, the network state information includes the current traffic type of each data stream, and different data streams correspond to different parameter configuration targets; when the current traffic type of a data stream is a first traffic type, the parameter configuration target corresponding to the data stream is a low-latency target, wherein the first traffic type indicates that the current traffic of the data stream is less than a preset traffic threshold; or, when the current traffic type of a data stream is a second traffic type, the parameter configuration target corresponding to the data stream is a high-throughput target, wherein the second traffic type indicates that the current traffic of the data stream is greater than or equal to the preset traffic threshold.
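
Illustrative note (not part of the claims): the per-stream rule of claim 6 reduces to a threshold comparison. The sketch below is a minimal illustration; the function names, the stream identifiers, and the choice of bits per second as the traffic unit are assumptions not stated in the source.

```python
from typing import Dict

LOW_LATENCY = "low_latency"
HIGH_THROUGHPUT = "high_throughput"


def target_for_stream(current_traffic_bps: float, threshold_bps: float) -> str:
    """First traffic type (below threshold) maps to low latency;
    second traffic type (at or above threshold) maps to high throughput."""
    if current_traffic_bps < threshold_bps:
        return LOW_LATENCY
    return HIGH_THROUGHPUT


def targets_for_device(
    streams: Dict[str, float], threshold_bps: float
) -> Dict[str, str]:
    """Assign a parameter configuration target to every data stream."""
    return {
        stream_id: target_for_stream(rate, threshold_bps)
        for stream_id, rate in streams.items()
    }
```

Note that a stream exactly at the threshold falls into the second traffic type, matching the "greater than or equal to" wording of the claim.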

7. The method according to any one of claims 1 to 6, characterized in that the training state information further includes at least one of the following: the iteration time of the target training task, the amount of target data that needs to be transferred between multiple computing devices in the current iteration of the target training task, the device utilization rate of the computing device executing the target training task, or the communication waiting time of the computing device executing the target training task.

8. A network device, characterized in that the network device includes a memory and a processor, the memory being used to store computer instructions, and the processor being used to retrieve the computer instructions from the memory to perform the method according to any one of claims 1 to 7.

9. A computing system, characterized in that the computing system includes multiple network devices and multiple computing devices, wherein the multiple computing devices are connected through the multiple network devices; the computing device is used to perform the target training task of training the target model; and the network device is used to perform the method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the storage medium stores a computer program that, when executed by a processor, implements the method according to any one of claims 1 to 7.