Physically-aware dynamic topology reconstruction method for hybrid parallel training of large models
By constructing a physically-aware topology state index and applying dynamic weight polarization decisions, the method addresses the insufficient topology adaptability of large-scale hybrid parallel training. It aims to achieve stable and efficient network topology reconstruction, improve training efficiency and stability, and keep the reconstruction scheme deployable.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NANJING UNIV OF POSTS & TELECOMM
- Filing Date
- 2026-04-17
- Publication Date
- 2026-06-26
AI Technical Summary
In existing hybrid parallel training of large models, the network topology is difficult to adjust adaptively, and adjustment decisions lack an interpretable quantitative basis. Long-tail latency is hard to suppress, and dynamic reconstruction is prone to oscillation and hard to deploy in practice, which limits training efficiency and stability.
A physically-aware topology state index is constructed and combined with dynamic weight polarization decisions that depend on the training phase. Using in-band network telemetry and sampled-flow telemetry data, this yields stable and deployable dynamic network topology reconstruction that reduces end-to-end latency and suppresses long-tail latency while satisfying budget wall and physical reachability constraints.
The method reduces end-to-end latency, improves the stability and efficiency of the training process, makes network behavior more interpretable for operations and maintenance personnel, avoids control plane oscillations caused by frequent reconfiguration, and keeps the topology reconfiguration scheme feasible to engineer.
Figure CN122053472B_ABST
Abstract
Description
Technical Field
[0001] This application belongs to the field of data center network and distributed computing system optimization, and specifically relates to a physically-aware dynamic topology reconstruction method for hybrid parallel training of large models.
Background Technology
[0002] As the training scale of large language models continues to expand, hybrid parallel training has become a common method to improve training efficiency. In actual training, the network communication requirements of different parallel stages, such as data parallelism and pipeline parallelism, vary significantly and change dynamically with the training process. Therefore, network performance has become an important factor affecting the training efficiency of large models.
[0003] Existing large-scale model training networks typically employ a pre-defined, fixed topology, maintaining largely unchanged connectivity throughout the training process. This approach struggles to adapt to the varying communication needs of different training stages when facing hybrid parallel training. Since different parallel stages prioritize network performance differently, a fixed topology cannot be adjusted accordingly, easily leading to a mismatch between network resources and training communication requirements, thus impacting overall training efficiency. Furthermore, current technologies generally lack quantitative data on the relationship between network physical state and training communication performance. While network physical factors such as logical hop count and available link bandwidth directly affect training communication latency, the associated relationships often lack clear and interpretable quantitative descriptions, causing network adjustment strategies to rely heavily on experience and hindering fine-grained control.
[0004] Furthermore, existing solutions are insufficient in suppressing long-tail latency issues during training. Large model training has strong synchronization requirements, and the completion time of training steps is often limited by the slowest communication process. Fixed topologies struggle to optimize critical communication paths in a timely manner when facing local hotspots, sudden traffic spikes, or phase transitions, easily leading to amplified tail latency and reduced overall cluster training efficiency.
[0005] In dynamic adjustment scenarios, without effective benefit assessment and stability constraints, network reconfiguration may be frequently triggered due to short-term fluctuations, not only increasing overhead but also affecting the continuity of the training process and system stability. Meanwhile, existing solutions do not adequately consider practical deployment constraints. Network topology adjustments not only need to meet performance optimization goals but are also limited by port resources, link capabilities, and overall resource conditions. Without corresponding constraints, the resulting adjustment scheme may be difficult to deploy and implement in actual training clusters.
[0006] Therefore, for hybrid parallel training of large models, existing technologies still suffer from difficulty in adaptively adjusting the network topology during training, the lack of an interpretable quantitative basis, difficulty in suppressing long-tail latency, insufficient stability of dynamic control, and weak practical deployability. Further improvement is needed.
Summary of the Invention
[0007] Hybrid parallel training faces three problems: the dominant bottleneck differs between training phases; there is no interpretable quantitative relationship between topological physical variables and training communication performance; and dynamic reconstruction tends to oscillate while the generated topologies are hard to deploy at the physical layer. To address these problems, this application presents a physically-aware dynamic topology reconstruction method for hybrid parallel training of large models. The method constructs a physically-aware topology state index, combines it with dynamic weight polarization decisions that depend on the training phase, and achieves stable, deployable, low-latency dynamic network topology orchestration under engineering constraints such as budget walls and physical reachability domains. The physical state variables are supplied by in-band network telemetry (INT) and sampled-flow telemetry (sFlow). The method reduces end-to-end latency and suppresses long-tail latency while ensuring physical deployability on the hardware and control stability, thereby improving the training efficiency and stability of large models.
[0008] To achieve the above objectives, this application employs the following technical solution:
[0009] This application discloses a physically-aware dynamic topology reconstruction method for large-scale model hybrid parallel training. The method is implemented through a physically-aware dynamic topology reconstruction system and specifically includes the following steps:
[0010] Step 1: During large model training, collect online the current training phase signal and the network topology physical state variables, which include the logical hop count and the link available bandwidth metric, and construct the comprehensive effective bandwidth for the task of the current training phase.
[0011] Step 2: Based on historical operational data, establish a quantitative mapping relationship between network topology physical state variables and end-to-end latency to obtain logical hop count sensitivity parameters and bandwidth sensitivity parameters, providing a basis for subsequent topology state index calculation;
[0012] Step 3: Based on the comprehensive effective bandwidth constructed in Step 1, the logical hop count sensitivity parameter and bandwidth sensitivity parameter obtained in Step 2, construct a unified, online-calculated topology state index to determine whether the current network topology matches the current training phase, and serve as the direct basis for triggering reconstruction.
[0013] Step 4: Combine the topology state index constructed in Step 3 to perform a prospective evaluation of the candidate actions, and under the premise of meeting the preset engineering constraints, select the action with the highest comprehensive score for network topology reconstruction.
[0014] Step 5: Record the measured latency and network status after network topology reconstruction, and update the logical hop count sensitivity parameter, bandwidth sensitivity parameter, and stability control parameter.
[0015] A further improvement of this application is that step 1 specifically includes the following steps:
[0016] Step 1.1: Acquire the current training phase signal online from the large model training framework. The signal comprises two indicators: $s_{\mathrm{DP}}(t)$, which indicates whether the system is in the data parallel dominant phase at time $t$, and $s_{\mathrm{PP}}(t)$, which indicates whether it is in the pipeline parallel dominant phase at time $t$. The data parallel dominant phase indicator and the pipeline parallel dominant phase indicator must satisfy:
[0017] $$s_{\mathrm{DP}}(t),\ s_{\mathrm{PP}}(t) \in \{0, 1\}, \qquad s_{\mathrm{DP}}(t) + s_{\mathrm{PP}}(t) = 1$$
[0018] where $t$ denotes the current sampling time. At any sampling time, the physically-aware dynamic topology reconstruction system corresponds to exactly one dominant training phase.
[0019] Step 1.2: After identifying the training phase, the physically-aware dynamic topology reconstruction system uses in-band network telemetry (INT) or sampled-flow telemetry (sFlow) to collect the network topology physical state variables of training-related traffic across links, switching nodes, and end-to-end paths. This builds the data foundation and allows the network topology physical state variables to be monitored continuously.
[0020] Step 1.3: Perform windowed aggregation on the collected network topology physical state variables to suppress instantaneous jitter. The window-averaged logical hop count is defined as:
[0021] $$\bar{h}(t) = \frac{1}{W} \sum_{k=0}^{W-1} h(t-k)$$
[0022] where $h(t)$ is the logical hop count at time $t$ and $W$ is the length of the time window;
[0023] Similarly, the window-averaged link available bandwidth metric is defined as:
[0024] $$\bar{B}_{ij}(t) = \frac{1}{W} \sum_{k=0}^{W-1} B_{ij}(t-k)$$
[0025] where $B_{ij}(t)$ is the link available bandwidth metric, representing the available bandwidth of link $(i,j)$ at time $t$. It is calculated from the link load status, queue occupancy status, and sampled traffic statistics reported by in-band network telemetry (INT) and sampled-flow telemetry (sFlow).
[0026] If the task of the current training phase involves a set of critical links $\mathcal{L}_c$, the comprehensive effective bandwidth for the task is constructed as:
[0027] $$B_{\mathrm{eff}}(t) = \sum_{(i,j) \in \mathcal{L}_c} \omega_{ij}\, \bar{B}_{ij}(t)$$
[0028] where $\bar{B}_{ij}(t)$ is the window-averaged available bandwidth of link $(i,j)$ and $\omega_{ij}$ is the importance weight of link $(i,j)$ for the task of the current training phase; the higher the importance weight, the more critical the link.
[0029] A further improvement in this application is that step 2 specifically includes the following steps:
[0030] Step 2.1: Before the physically-aware dynamic topology reconstruction system runs, extract samples from the historical operation logs, telemetry records, and topology snapshots saved by the controller to form a training sample set. For any historical sample $i$, record the following fields:
[0031] $$\left(s_{\mathrm{DP},i},\ s_{\mathrm{PP},i},\ h_i,\ B_i,\ D_i\right);$$
[0032] where $s_{\mathrm{DP},i}$ indicates whether the $i$-th sample belongs to the data parallel dominant phase, $s_{\mathrm{PP},i}$ indicates whether the $i$-th sample belongs to the pipeline parallel dominant phase, $h_i$ is the logical hop count of the $i$-th sample, $B_i$ is the link available bandwidth metric of the $i$-th sample, and $D_i$ is the end-to-end latency of the $i$-th sample;
[0033] Step 2.2: Using the logical hop count $h_i$ and the reciprocal of the link available bandwidth metric $B_i$ of each sample, establish a model of the end-to-end latency $D_i$ and solve for the sensitivity parameters:
[0034] $$D_i = \alpha + \beta\, h_i + \gamma \cdot \frac{1}{B_i}$$
[0035] where $\alpha$ is the base latency term, $\beta$ is the logical hop count sensitivity parameter, and $\gamma$ is the bandwidth sensitivity parameter.
[0036] A further improvement in this application is that, in step 3, determining whether the current network topology matches the current training phase specifically involves identifying whether the current training phase is the data parallel dominant phase or the pipeline parallel dominant phase. In the data parallel dominant phase, the weight $w_B(t)$ of the comprehensive effective bandwidth is increased and the logical hop count weight $w_h(t)$ is reduced. In the pipeline parallel dominant phase, the logical hop count weight $w_h(t)$ is increased and the weight $w_B(t)$ of the comprehensive effective bandwidth is reduced.
[0037] A further improvement in this application is that step 3 specifically includes the following steps:
[0038] Step 3.1: Construct the topology state index $S(t)$:
[0039] $$S(t) = w_h(t)\, \bar{h}(t) + w_B(t) \cdot \frac{1}{B_{\mathrm{eff}}(t)}$$
[0040] where $\bar{h}(t)$ is the window-averaged logical hop count and $B_{\mathrm{eff}}(t)$ is the comprehensive effective bandwidth. The larger $S(t)$ is, the less favorable the current topology is to the current training phase; the smaller $S(t)$ is, the better the current topology matches the current training requirements. $w_h(t)$ is the logical hop count weight at time $t$ and $w_B(t)$ is the weight of the comprehensive effective bandwidth at time $t$. Both weights are determined dynamically from the logical hop count sensitivity parameters and the bandwidth sensitivity parameters:
[0041] When the data parallel dominant phase is active, i.e. $s_{\mathrm{DP}}(t) = 1$, $w_h(t)$ and $w_B(t)$ are defined from $\beta_{\mathrm{DP}}$ and $\gamma_{\mathrm{DP}}$.
[0042] When the pipeline parallel dominant phase is active, i.e. $s_{\mathrm{PP}}(t) = 1$, $w_h(t)$ and $w_B(t)$ are defined from $\beta_{\mathrm{PP}}$ and $\gamma_{\mathrm{PP}}$.
[0043] where $\beta_{\mathrm{DP}}$ is the logical hop count sensitivity in the data parallel dominant phase, $\gamma_{\mathrm{DP}}$ is the bandwidth sensitivity in the data parallel dominant phase, $\beta_{\mathrm{PP}}$ is the logical hop count sensitivity in the pipeline parallel dominant phase, and $\gamma_{\mathrm{PP}}$ is the bandwidth sensitivity in the pipeline parallel dominant phase.
[0044] Step 3.2: Set the topology status trigger threshold $\theta$ and determine whether to enter the reconstruction process. When the topology state index satisfies $S(t) > \theta$, the controller determines that the current network topology does not match the current training phase and proceeds to step 4. When $S(t) \le \theta$, the existing topology remains unchanged.
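As an illustration of Steps 3.1 and 3.2, the following minimal Python sketch computes a topology state index and applies the trigger threshold. It rests on several assumptions that are not stated in the source: the index has the form `w_h * h_bar + w_b / b_eff`, the two weights are the active phase's sensitivities normalised to sum to 1, and the effective bandwidth is expressed as a fraction of link capacity. All function names and numeric values are hypothetical.

```python
def polarized_weights(phase, sensitivities):
    """Return (w_h, w_b) for the active phase.

    Assumption: each weight is the phase's fitted sensitivity divided by
    the sum of both sensitivities. The source only states that the
    weights are derived from the sensitivity parameters.
    """
    beta, gamma = sensitivities[phase]
    total = beta + gamma
    return beta / total, gamma / total


def topology_state_index(h_bar, b_eff, w_h, w_b):
    """Index that grows with hop count and shrinks as bandwidth grows."""
    return w_h * h_bar + w_b / b_eff


def should_reconstruct(index, threshold):
    """Trigger reconstruction only when the index exceeds the threshold."""
    return index > threshold


# Hypothetical sensitivities (beta, gamma): bandwidth dominates in the
# data parallel phase, hop count dominates in the pipeline parallel phase.
sens = {"DP": (1.0, 3.0), "PP": (3.0, 1.0)}
h_bar, b_eff = 3.2, 0.25  # 3.2 hops; 25% of link capacity available

s_dp = topology_state_index(h_bar, b_eff, *polarized_weights("DP", sens))
s_pp = topology_state_index(h_bar, b_eff, *polarized_weights("PP", sens))
print(round(s_dp, 3), round(s_pp, 3))  # about 3.8 and 3.4
print(should_reconstruct(s_dp, 3.5), should_reconstruct(s_pp, 3.5))
```

With the same network state, the scarce bandwidth pushes the index over the threshold only in the data parallel dominant phase, which is the effect the weight polarization is meant to produce.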
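[0045] A further improvement of this application is that, in step 4, the preset engineering constraints include at least a combination of the following constraints: the budget wall constraint $C_{\mathrm{bud}}(a) \le C_{\max}$, the action execution threshold constraint $J(a) \ge \theta_{\mathrm{act}}$, and the physical reachability constraint $\Phi(a) = 1$. Here $C_{\mathrm{bud}}(a)$ is the budget consumption of candidate action $a$, $C_{\max}$ is the budget wall threshold, $J(a)$ is the comprehensive net benefit score, $\theta_{\mathrm{act}}$ is the action execution threshold, and $\Phi$ is the physical reachability domain determination function.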
[0046] A further improvement of this application is that, in step 4, the prospective evaluation calculates the net benefit $U(a, t)$ of a candidate action $a$ at time $t$:
[0047] $$U(a, t) = \Delta S(a, t) - C(a, t)$$
[0048] where $\Delta S(a, t)$ is the predicted improvement of the topology state index brought by candidate action $a$, and $C(a, t)$ is the reconstruction overhead penalty term of candidate action $a$.
[0049] A further improvement in this application is that step 4 specifically includes the following steps:
[0050] Step 4.1: Let the network graph of the current topology at time $t$ be $G_t = (V, E_t)$, where $G_t$ is the network topology graph at time $t$, $V$ is the set of nodes, and $E_t$ is the set of links at time $t$. When a reconstruction is triggered, the controller generates a set of candidate actions:
[0051] $$\mathcal{A}_t = \{a_1, a_2, \ldots, a_K\}$$
[0052] where $\mathcal{A}_t$ is the set of candidate actions at time $t$, $a_k$ is the $k$-th candidate reconstruction action, and $K$ is the number of candidate actions;
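[0053] Step 4.2: For each candidate action $a_k$, construct the virtual topology that would result from executing the action:
[0054] $$G'_{t,k} = (V, E'_{t,k}),$$
[0055] In the virtual topology $G'_{t,k}$, recalculate the logical hop count $\bar{h}'_k(t)$ and the comprehensive effective bandwidth $B'_{\mathrm{eff},k}(t)$. Then, consistent with the topology state index used with the topology status trigger threshold, calculate the predicted topology state index after the action is executed:
[0056] $$S'_k(t) = w_h(t)\, \bar{h}'_k(t) + w_B(t) \cdot \frac{1}{B'_{\mathrm{eff},k}(t)}$$
[0057] Define the state improvement brought by candidate action $a_k$ as $\Delta S_k$:
[0058] $$\Delta S_k = S(t) - S'_k(t)$$
[0059] If $\Delta S_k > 0$, candidate action $a_k$ improves the current topology; if $\Delta S_k \le 0$, candidate action $a_k$ brings no benefit and may worsen the state;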
[0060] Step 4.3: Construct the comprehensive net benefit scoring function for each action:
[0061] $$J(a_k) = \lambda_1\, \Delta S_k - \lambda_2\, C_{\mathrm{sw},k} - \lambda_3\, C_{\mathrm{res},k}$$
[0062] where $J(a_k)$ is the comprehensive net benefit score of action $a_k$, $\lambda_1$ is the improvement benefit weight, $\lambda_2$ is the switching cost weight, and $\lambda_3$ is the resource cost weight; $\Delta S_k$ is the state improvement, $C_{\mathrm{sw},k}$ is the reconstruction overhead of the $k$-th candidate action, and $C_{\mathrm{res},k}$ is the resource cost of the $k$-th candidate action;
[0063] Step 4.4: Apply the budget wall constraint $C_{\mathrm{bud}}(a_k) \le C_{\max}$, the physical reachability constraint $\Phi(a_k) = 1$, and the action execution threshold constraint $J(a_k) \ge \theta_{\mathrm{act}}$. Among the actions that satisfy all three constraints, the action with the highest comprehensive score is selected as the final action $a^{*}$ to be executed. After execution, the current topology is updated to $G_{t+1}$, the topology that results from applying $a^{*}$ to $G_t$. Here $C_{\mathrm{bud}}(a_k)$ is the budget consumption of action $a_k$, $C_{\max}$ is the budget wall threshold, $\Phi$ is the physical reachability domain determination function, $J$ is the comprehensive net benefit scoring function, and $\theta_{\mathrm{act}}$ is the action execution threshold.
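The selection logic of Steps 4.3 and 4.4 can be sketched in Python as below. This is an illustrative sketch only: the `Candidate` fields, the default weights, and the example numbers are hypothetical, and the score is assumed to be the weighted improvement minus weighted switching and resource costs.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    predicted_index: float  # predicted topology state index after the action
    switch_cost: float      # reconstruction (switching) overhead
    resource_cost: float    # resource cost
    budget_use: float       # budget consumed if the action is executed
    reachable: bool         # outcome of the physical reachability check


def score(c: Candidate, current_index: float,
          lam_gain: float = 1.0, lam_switch: float = 0.5,
          lam_res: float = 0.5) -> float:
    """Weighted improvement minus weighted switching and resource costs."""
    improvement = current_index - c.predicted_index
    return (lam_gain * improvement
            - lam_switch * c.switch_cost
            - lam_res * c.resource_cost)


def select_action(cands: Sequence[Candidate], current_index: float,
                  budget_wall: float, exec_threshold: float) -> Optional[Candidate]:
    """Return the best-scoring feasible candidate, or None to keep the topology."""
    best, best_score = None, None
    for c in cands:
        if c.budget_use > budget_wall or not c.reachable:
            continue  # blocked by the budget wall or not physically reachable
        s = score(c, current_index)
        if s < exec_threshold:
            continue  # benefit too small to justify executing the action
        if best_score is None or s > best_score:
            best, best_score = c, s
    return best


candidates = [
    Candidate("A", 2.0, 1.0, 0.5, 2.0, True),   # feasible, score 1.05
    Candidate("B", 1.0, 1.0, 0.5, 10.0, True),  # exceeds the budget wall
    Candidate("C", 1.5, 0.5, 0.5, 1.0, False),  # not physically reachable
    Candidate("D", 3.6, 0.2, 0.2, 1.0, True),   # score near 0, below threshold
]
chosen = select_action(candidates, current_index=3.8,
                       budget_wall=5.0, exec_threshold=0.1)
print(chosen.name if chosen else None)  # A
```

Returning `None` when no candidate passes all three checks corresponds to leaving the existing topology unchanged.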
[0064] A further improvement in this application is that step 5 specifically includes the following steps:
[0065] Step 5.1: After the final action $a^{*}$ has been executed, the prediction error is defined as:
[0066] $$e = D_{\mathrm{meas}}(a^{*}) - D_{\mathrm{pred}}(a^{*})$$
[0067] where $D_{\mathrm{meas}}(a^{*})$ is the measured end-to-end latency after the final action $a^{*}$ is executed, and
[0068] $D_{\mathrm{pred}}(a^{*})$ is the latency predicted by the current model;
[0069] If $e > 0$, the actual latency is higher than expected, which means the current model overestimates the benefit of the action; if $e < 0$, the actual latency is lower than expected, which means the current model underestimates the benefit of the action;
[0070] Step 5.2: Based on the prediction error $e$, update the logical hop count sensitivity parameter and the bandwidth sensitivity parameter of the corresponding phase. Let the logical hop count sensitivity update step size be $\eta_{\beta}$ and the bandwidth sensitivity update step size be $\eta_{\gamma}$. If the current moment belongs to the data parallel dominant phase, update:
[0071] $$\beta_{\mathrm{DP}} \leftarrow \beta_{\mathrm{DP}} + \eta_{\beta}\, e\, \bar{h}(t)$$
[0072] $$\gamma_{\mathrm{DP}} \leftarrow \gamma_{\mathrm{DP}} + \eta_{\gamma}\, e \cdot \frac{1}{B_{\mathrm{eff}}(t)}$$
[0073] where $\bar{h}(t)$ is the currently observed logical hop count and $B_{\mathrm{eff}}(t)$ is the currently observed comprehensive effective bandwidth; if the current moment belongs to the pipeline parallel dominant phase, only $\beta_{\mathrm{PP}}$ and $\gamma_{\mathrm{PP}}$ are updated;
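A possible reading of the Step 5.2 update is a gradient-style correction in which each sensitivity moves in proportion to the prediction error and to its own input feature. The Python sketch below assumes that form; the function name, step sizes, and numbers are hypothetical, and only the parameters of the active phase are touched.

```python
def update_phase_sensitivities(params, phase, h_obs, b_eff_obs,
                               measured, predicted,
                               eta_beta=0.01, eta_gamma=0.01):
    """Return a copy of `params` with the active phase's (beta, gamma) updated.

    Assumption: an LMS-style rule, beta += eta * e * h and
    gamma += eta * e / B, where e = measured - predicted latency.
    """
    error = measured - predicted
    beta, gamma = params[phase]
    updated = dict(params)  # leave the caller's dictionary untouched
    updated[phase] = (
        beta + eta_beta * error * h_obs,
        gamma + eta_gamma * error / b_eff_obs,
    )
    return updated, error


# Hypothetical fitted sensitivities per phase: (beta, gamma).
params = {"DP": (2.0, 400.0), "PP": (6.0, 100.0)}

# Measured latency is 1 unit above the prediction, so the benefit of the
# action was overestimated and both DP sensitivities are nudged upward.
new_params, err = update_phase_sensitivities(
    params, "DP", h_obs=3.0, b_eff_obs=50.0, measured=25.0, predicted=24.0
)
print(err, new_params["DP"], new_params["PP"])
```

Keeping separate parameter pairs per phase means an error observed in one phase never distorts the sensitivities fitted for the other.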
[0074] Step 5.3: To prevent the physically-aware dynamic topology reconstruction system from executing multiple topology actions consecutively within a short period, a cooldown time threshold $T_{\mathrm{cool}}$ is introduced. Let the time at which the last reconstruction was completed be $t_{\mathrm{last}}$. The system is allowed to re-enter the execution phase of network topology reconstruction only when $t - t_{\mathrm{last}} \ge T_{\mathrm{cool}}$.
[0075] Step 5.4: On top of the cooldown time threshold in Step 5.3, introduce hysteresis control: define a reconstruction trigger high threshold $\theta_{\mathrm{high}}$ and a reconstruction exit low threshold $\theta_{\mathrm{low}}$ that satisfy $\theta_{\mathrm{high}} > \theta_{\mathrm{low}}$,
[0076] When $S(t) \ge \theta_{\mathrm{high}}$, the physically-aware dynamic topology reconstruction system enters the reconstruction determination state. Once the system has entered the reconstruction determination state, it exits that state only when $S(t) \le \theta_{\mathrm{low}}$;
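The cooldown of Step 5.3 and the hysteresis of Step 5.4 can be combined into one small gate, sketched below in Python. The class and method names are hypothetical, and the comparison directions (enter at or above the high threshold, exit at or below the low threshold) are an assumption consistent with standard hysteresis control.

```python
class ReconstructionGate:
    """Hysteresis on the topology state index plus a cooldown timer."""

    def __init__(self, high, low, cooldown):
        if not high > low:
            raise ValueError("high threshold must exceed low threshold")
        self.high, self.low, self.cooldown = high, low, cooldown
        self.active = False    # True while in the reconstruction determination state
        self.last_done = None  # completion time of the last reconstruction

    def observe(self, index):
        """Update the hysteresis state from the latest index value."""
        if not self.active and index >= self.high:
            self.active = True
        elif self.active and index <= self.low:
            self.active = False
        return self.active

    def may_execute(self, now):
        """Allow execution only in the active state and after the cooldown."""
        cooled = self.last_done is None or now - self.last_done >= self.cooldown
        return self.active and cooled

    def mark_done(self, now):
        """Record the completion time of a reconstruction."""
        self.last_done = now


gate = ReconstructionGate(high=3.0, low=2.0, cooldown=10)
states = [gate.observe(s) for s in (2.5, 3.2, 2.5)]
print(states)                # [False, True, True]: 2.5 does not exit the state
gate.mark_done(now=0)
print(gate.may_execute(5))   # False: still cooling down
print(gate.may_execute(10))  # True: cooldown elapsed and state still active
```

The gap between the two thresholds is what keeps an index hovering around a single threshold from toggling the state on every sample.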
[0077] Step 5.5: The physical sensing dynamic topology reconstruction system writes the reconstruction results of this round into the log database and returns to step 1 to continue the next round of online monitoring and decision-making, that is, repeats steps 1-5.
[0078] A further improvement of this application is that the physical sensing dynamic topology reconstruction system includes:
[0079] The state awareness module is used to acquire training phase signals online and to acquire network topology physical state variables based on in-band network telemetry or sampling stream telemetry technology.
[0080] The parameter fitting module is used to fit sensitivity parameters between end-to-end latency and network topology physical state variables based on historical data.
[0081] The decision control module is used to construct the topology state index and perform dynamic polarization of the phase weights, and select the optimal reconstruction action under the constraints of budget wall and physical reachability.
[0082] The topology execution module is used to issue flow table or optical path switching commands based on the decision results to perform topology reconstruction.
[0083] The beneficial effects of this application are:
[0084] (1) Performance improvement: Through physical perception optimization, the average end-to-end latency has been significantly reduced.
[0085] (2) Enhanced stability: The long-tail delay is effectively reduced and the high-quantile delay is significantly improved, which reduces the variance of training step time and makes the training process smoother and more stable.
[0086] (3) High interpretability: Through the sensitivity parameters obtained by fitting, the operation and maintenance personnel can intuitively quantify the specific contribution of bandwidth and hop count to performance.
[0087] (4) Control robustness: Budget walls and cooling mechanisms effectively suppress frequent refactoring caused by small network fluctuations, avoiding control plane oscillations.
[0088] (5) Engineering feasibility: The introduction of physical reachability domain constraints ensures that all generated topology reconstruction instructions can be executed under existing hardware conditions, which closes the gap between theoretical optimization and engineering deployment.
Attached Figure Description
[0089] Figure 1 is a schematic diagram of the time-series tracking and dynamic reconstruction process of end-to-end latency in an embodiment of this application.
[0090] Figure 2 is a schematic diagram of the dynamic polarization of the physically-aware weights in an embodiment of this application.
[0091] Figure 3 is a schematic diagram of the budget consumption of topology reconstruction actions and the budget wall interception mechanism in an embodiment of this application.
[0092] Figure 4 is a schematic diagram comparing the cumulative distribution functions of end-to-end latency in an embodiment of this application.
[0093] Figure 5 is a schematic diagram comparing box plots of the latency distribution in different training phases in an embodiment of this application.
[0094] Figure 6 is a schematic diagram comparing the average performance gains of typical reconstruction actions in an embodiment of this application.
[0095] Figure 7 is a schematic diagram illustrating the dynamic trade-off between logical hop count and allocated bandwidth in an embodiment of this application.
[0096] Figure 8 is a schematic diagram of the physically-aware dynamic topology reconstruction system of this application.
[0097] Figure 9 is a flowchart of the physically-aware dynamic topology reconstruction method of this application.
Detailed Implementation
[0098] The embodiments of the present invention will be disclosed below with reference to the drawings. For clarity, many practical details will be described in the following description. However, it should be understood that these practical details are not intended to limit the present invention. That is, in some embodiments of the present invention, these practical details are not essential. In addition, for the sake of simplicity, some conventional structures and components will be shown in the drawings in a simple schematic manner.
[0099] As shown in Figure 9, this application presents a physically-aware dynamic topology reconstruction method for hybrid parallel training of large models. The method is implemented by a physically-aware dynamic topology reconstruction system. As shown in Figure 8, the system is deployed in a data center cluster with a hybrid optoelectronic switching architecture, in which compute nodes are interconnected via a switching network. The control plane includes a state awareness module, a parameter fitting module, a decision control module, and an SDN controller. The system performs online monitoring and decision-making on a preset control cycle, which can be aligned with the training time step or set to an integer multiple of the training time step. The system includes: a state awareness module for acquiring training phase signals online and acquiring network topology physical state variables based on in-band network telemetry or sampled-flow telemetry; a parameter fitting module for fitting, from historical data, the sensitivity parameters that relate end-to-end latency to the network topology physical state variables; a decision control module for constructing the topology state index, performing dynamic polarization of the phase weights, and selecting the optimal reconstruction action under the budget wall constraint, the physical reachability constraint, and the action execution threshold constraint; and a topology execution module for issuing flow table or optical path switching commands that execute the topology reconstruction according to the decision results.
[0100] The physical sensing dynamic topology reconstruction method specifically includes the following steps:
[0101] Step 1: During large model training, collect online the current training phase signal and the network topology physical state variables, which include at least the logical hop count and the link available bandwidth, and construct the comprehensive effective bandwidth for the task of the current training phase. Specifically, this includes the following steps:
[0102] Step 1.1: Acquire the current training phase signal online from the large model training framework. The signal comprises two indicators: $s_{\mathrm{DP}}(t)$, which indicates whether the system is in the data parallel dominant phase at time $t$, and $s_{\mathrm{PP}}(t)$, which indicates whether it is in the pipeline parallel dominant phase at time $t$. The data parallel dominant phase indicator and the pipeline parallel dominant phase indicator must satisfy:
[0103] $$s_{\mathrm{DP}}(t),\ s_{\mathrm{PP}}(t) \in \{0, 1\}, \qquad s_{\mathrm{DP}}(t) + s_{\mathrm{PP}}(t) = 1$$
[0104] where $t$ denotes the current sampling time. At any sampling time, the physically-aware dynamic topology reconstruction system corresponds to exactly one dominant training phase; that is, when a phase is active its indicator is 1 and the other indicator is 0.
[0105] For example, in a 64-card training cluster, when the system detects that the current stage mainly involves cross-replica gradient synchronization, a significant increase in AllReduce traffic, and communication time concentrated at the synchronization tail, it can be determined that $s_{\mathrm{DP}}(t) = 1$. When micro-batch transfers are detected between different pipeline stages and point-to-point communication between adjacent stages is dominant, it can be determined that $s_{\mathrm{PP}}(t) = 1$.
[0106] Step 1.2: After identifying the training phase, the physical sensing dynamic topology reconstruction system uses in-band network telemetry (INT) or sampled stream telemetry (sFlow) to collect network topology physical state variables of training-related traffic through links, switching nodes and end-to-end paths, builds a data foundation and continuously monitors the network topology physical state variables.
[0107] This application focuses on collecting at least two types of physical state variables. The first is the logical hop count, denoted $h(t)$, which is the logical hop count at time $t$ of the current critical training communication flow from source to destination; it reflects the hierarchical length of the communication path under the current topology. The second is the link available bandwidth, denoted $B_{ij}(t)$, where $i$ is the starting node number of the link, $j$ is the terminating node number of the link, and $B_{ij}(t)$ is the available bandwidth at time $t$ of the link between node $i$ and node $j$. Available bandwidth refers to the bandwidth remaining for training traffic after the current background bandwidth usage is deducted.
[0108] In practical systems, the logical hop count $h(t)$ is preferably taken as the logical hop count of the critical training communication flow on the current actual forwarding path. If this value cannot be obtained directly, the route mapping path maintained by the controller can be used as an approximation; when routing information is lacking, the shortest feasible path can be used as an approximate estimate. The link available bandwidth metric $B_{ij}(t)$ can be calculated from the link load status and queue occupancy status reported by INT together with sFlow sampled traffic statistics.
[0109] For example, for a link with a physical rated bandwidth of 100 Gbps whose background occupancy is currently estimated at 28 Gbps using INT / sFlow, $B_{ij}(t) = 100 - 28 = 72$. The unit here is Gbps, indicating that the link still has 72 Gbps available for training communication at the current moment.
[0110] Step 1.3: Because the network state fluctuates over short time scales during training, using instantaneous measurements from a single moment can make the controller overly sensitive to transient anomalies and cause unnecessary topology reconfiguration. The collected network topology physical state variables are therefore aggregated over a window to suppress transient jitter. The window-averaged logical hop count is defined as:
[0111] $$\bar{h}(t) = \frac{1}{W} \sum_{k=0}^{W-1} h(t-k)$$
[0112] where $h(t)$ is the logical hop count at time $t$ and $W$ is the length of the time window;
[0113] Similarly, the window-averaged link available bandwidth metric is defined as:
[0114] $$\bar{B}_{ij}(t) = \frac{1}{W} \sum_{k=0}^{W-1} B_{ij}(t-k)$$
[0115] where $B_{ij}(t)$ is the link available bandwidth metric, representing the available bandwidth or equivalent available bandwidth of link $(i,j)$ at time $t$. It is calculated from the link load status and queue occupancy status reported by in-band network telemetry (INT) together with sampled-flow traffic statistics.
[0116] If the task of the current training phase involves a set of critical links $\mathcal{L}_c$, the comprehensive effective bandwidth for the task is further constructed as:
[0117] $$B_{\mathrm{eff}}(t) = \sum_{(i,j) \in \mathcal{L}_c} \omega_{ij}\, \bar{B}_{ij}(t)$$
[0118] where $\bar{B}_{ij}(t)$ is the window-averaged available bandwidth of link $(i,j)$ and $\omega_{ij}$ is the importance weight of link $(i,j)$ for the task of the current training phase; the higher the importance weight, the more critical the link.
[0119] The window length $W$ can reasonably be set to between 3 and 20 sampling periods. If $W$ is too small, it cannot smooth out instantaneous jitter; if $W$ is too large, it slows the system's response to real changes in state. In this embodiment, taking into account the relationship between the training step and the network sampling period, a value of $W$ is chosen that maintains good online responsiveness while smoothing out short-term fluctuations.
[0120] For example, if the logical hop count of a critical communication flow over five consecutive sampling periods is 3, 3, 4, 3, 3, then $\bar{h}(t) = (3 + 3 + 4 + 3 + 3) / 5 = 3.2$. This value is more stable than the value at a single moment and is better suited as input to the subsequent scoring model.
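The windowed aggregation of Step 1.3 can be illustrated with a short Python sketch. The class and function names are hypothetical, the link identifiers and bandwidth values are made up for illustration, and the effective bandwidth is assumed to be a plain weighted sum over the critical links with weights that sum to 1.

```python
from collections import deque


class WindowAverage:
    """Sliding-window mean over the most recent `window` samples."""

    def __init__(self, window: int):
        if window < 1:
            raise ValueError("window must be at least 1")
        self._samples = deque(maxlen=window)  # old samples drop out automatically

    def update(self, value: float) -> float:
        """Add one sample and return the mean of the samples currently held."""
        self._samples.append(value)
        return sum(self._samples) / len(self._samples)


def effective_bandwidth(avg_bw: dict, weights: dict) -> float:
    """Weighted sum of window-averaged bandwidths over the critical links.

    Assumption: a plain weighted sum; the exact aggregate used by the
    method is not recoverable from the text.
    """
    return sum(weights[link] * avg_bw[link] for link in weights)


# Hop counts from the five-sample example in the text.
hops = WindowAverage(window=5)
for h in (3, 3, 4, 3, 3):
    h_bar = hops.update(h)
print(h_bar)  # 3.2

# Hypothetical critical links with window-averaged bandwidth in Gbps.
b_eff = effective_bandwidth(
    {("n0", "n1"): 72.0, ("n1", "n2"): 40.0},
    {("n0", "n1"): 0.75, ("n1", "n2"): 0.25},
)
print(b_eff)  # 64.0
```

Until the window has filled, the mean is taken over the samples seen so far, which avoids a cold-start gap at the cost of less smoothing in the first few periods.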
[0121] Step 2: Based on historical operational data, establish a quantitative mapping relationship between network topology physical state variables and end-to-end latency to obtain logical hop count sensitivity parameters and bandwidth sensitivity parameters, providing a basis for subsequent topology state index calculation; specifically including the following steps:
[0122] Step 2.1: Before the physical sensing dynamic topology reconstruction system runs, samples are extracted to form a training sample set based on historical operation logs, telemetry records, and topology snapshots saved by the controller. For any historical sample Record the following fields:
[0123] ;
[0124] in, Indicates the first Does each sample belong to the data parallel dominant phase? , Indicates the first Does each sample belong to the pipeline parallel dominant phase? , Indicates the first The logical hop count corresponding to each sample Indicates the first The available bandwidth metric for each sample link Indicates the first The end-to-end latency corresponding to each sample can be obtained by statistically analyzing the training communication completion time, the critical communication flow completion time, or the synchronization blocking duration.
[0125] Step 2.2: Based on the logical hop count $h_i$ and the reciprocal of the link available bandwidth metric $b_i$ of each sample, establish a model of the end-to-end latency $d_i$ and solve for the sensitivity parameters:
[0126] $d_i = \alpha + \beta\, h_i + \gamma \cdot \dfrac{1}{b_i}$
[0127] where $\alpha$ is the basic delay term, $\beta$ is the logical hop count sensitivity parameter, and $\gamma$ is the bandwidth sensitivity parameter.
[0128] The reciprocal $1/b_i$ is used because, all other things being equal, a larger bandwidth results in a smaller communication latency, so the two are inversely correlated and the inverse bandwidth term matches physical intuition. Because the dominant bottleneck differs between training phases, this application fits the data separately for each phase.
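[0129] The fitting model for the data-parallel-dominant phase is: $d_i = \alpha_{\mathrm{DP}} + \beta_{\mathrm{DP}}\, h_i + \gamma_{\mathrm{DP}} \cdot \dfrac{1}{b_i}$
[0130] The fitting model for the pipeline-parallel-dominant phase is: $d_i = \alpha_{\mathrm{PP}} + \beta_{\mathrm{PP}}\, h_i + \gamma_{\mathrm{PP}} \cdot \dfrac{1}{b_i}$
[0131] where $\alpha_{\mathrm{DP}}$ and $\alpha_{\mathrm{PP}}$ are the basic delay terms of the two training phases, $\beta_{\mathrm{DP}}$ and $\beta_{\mathrm{PP}}$ are the logical hop count sensitivities of the two training phases, and $\gamma_{\mathrm{DP}}$ and $\gamma_{\mathrm{PP}}$ are the bandwidth sensitivities of the two training phases.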
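[0132] The parameters are solved by least-squares fitting. For the data-parallel-dominant phase, the objective function is:
[0133] $\min_{\alpha_{\mathrm{DP}},\,\beta_{\mathrm{DP}},\,\gamma_{\mathrm{DP}}} \sum_{i=1}^{N_{\mathrm{DP}}} \left( d_i - \alpha_{\mathrm{DP}} - \beta_{\mathrm{DP}}\, h_i - \gamma_{\mathrm{DP}} \cdot \dfrac{1}{b_i} \right)^2$
[0134] where $N_{\mathrm{DP}}$ is the number of historical samples belonging to the data-parallel-dominant phase. For the pipeline-parallel-dominant phase, the objective function is:
[0135] $\min_{\alpha_{\mathrm{PP}},\,\beta_{\mathrm{PP}},\,\gamma_{\mathrm{PP}}} \sum_{i=1}^{N_{\mathrm{PP}}} \left( d_i - \alpha_{\mathrm{PP}} - \beta_{\mathrm{PP}}\, h_i - \gamma_{\mathrm{PP}} \cdot \dfrac{1}{b_i} \right)^2$
[0136] where $N_{\mathrm{PP}}$ is the number of historical samples belonging to the pipeline-parallel-dominant phase.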
[0137] After fitting, every parameter has a clear physical meaning. For example, a larger $\gamma_{\mathrm{DP}}$ indicates that, in the data-parallel-dominant phase, bandwidth variations have a more significant impact on latency; a larger $\beta_{\mathrm{PP}}$ indicates that, in the pipeline-parallel-dominant phase, changes in hop count have a more significant impact on latency.
[0138] Step 2.3: Example of the fitting process. Taking the data-parallel-dominant phase as an example, assume that 200 samples are selected from the historical logs and that each sample contains $h_i$, $b_i$, and $d_i$. After these data are input into the fitting program, the parameters $\alpha_{\mathrm{DP}}$, $\beta_{\mathrm{DP}}$, and $\gamma_{\mathrm{DP}}$ are obtained by least-squares solution. If the fitting results show that $\gamma_{\mathrm{DP}}$ is much larger than $\beta_{\mathrm{DP}}$, then in this phase prioritizing a bandwidth increase usually yields greater benefit than simply reducing the number of hops.
[0139] Taking the pipeline-parallel-dominant phase as a second example, if the fitting yields a $\beta_{\mathrm{PP}}$ higher than $\gamma_{\mathrm{PP}}$, then in scenarios with frequent communication between adjacent stages, shortening the logical path and reducing intermediate forwarding layers can often reduce the overall tail latency.
[0140] In this embodiment, a reasonable range for the number of fitted samples $N$ is 100 to 5000. Too small a sample size reduces parameter stability, while too large a sample size increases online update overhead. Considering the experimental cluster size and training time, this experiment uses... .
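The per-phase least-squares fit of Step 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `fit_latency_model`, the `(h, b, d)` tuple layout, and the synthetic data are all hypothetical, and the solver (normal equations with Gauss-Jordan elimination) is just one simple way to obtain the least-squares solution of a latency model of the form $d = \alpha + \beta h + \gamma / b$.

```python
def fit_latency_model(samples):
    """Least-squares fit of d = alpha + beta*h + gamma/b.

    `samples` is a list of (h, b, d) tuples taken from ONE training phase.
    Returns (alpha, beta, gamma).
    """
    # Feature row x = [1, h, 1/b]; accumulate the normal equations (X^T X) p = X^T d.
    xtx = [[0.0] * 3 for _ in range(3)]
    xtd = [0.0] * 3
    for h, b, d in samples:
        x = (1.0, float(h), 1.0 / b)
        for r in range(3):
            xtd[r] += x[r] * d
            for c in range(3):
                xtx[r][c] += x[r] * x[c]

    # Solve the 3x3 system by Gauss-Jordan elimination with partial pivoting.
    aug = [xtx[r] + [xtd[r]] for r in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("samples do not determine the parameters")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(3):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return tuple(aug[i][3] / aug[i][i] for i in range(3))


# Synthetic, noise-free samples generated from alpha=2.0, beta=1.5, gamma=100.0.
synthetic = [(h, b, 2.0 + 1.5 * h + 100.0 / b)
             for h in (2, 3, 4, 5)
             for b in (40.0, 50.0, 80.0, 100.0)]
alpha, beta, gamma = fit_latency_model(synthetic)
```

In practice the samples would first be split by the phase flags of each record, and the function would be called once per phase so that each phase keeps its own parameter set.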
[0141] Step 3: Based on the comprehensive effective bandwidth constructed in Step 1, and the logical hop count sensitivity parameter and bandwidth sensitivity parameter obtained in Step 2, construct a unified, online-calculated topology state index. This index is used to determine whether the current network topology matches the current training phase and serves as the direct basis for triggering reconstruction. Specifically, this includes the following steps:
[0142] Step 3.1: Construct the topology state index $\mathrm{TSI}(t)$ as a weighted combination of the logical hop count term and the bandwidth term:
[0143] $\mathrm{TSI}(t) = w_h(t)\,\bar{h}(t) + w_b(t) \cdot \dfrac{1}{B_{\mathrm{eff}}(t)}$
[0144] where $\bar{h}(t)$ is the window-averaged logical hop count and $B_{\mathrm{eff}}(t)$ is the comprehensive effective bandwidth. The larger $\mathrm{TSI}(t)$ is, the more unfavorable the current topology is to the current training phase; the smaller it is, the better the current topology matches the current training requirements. $w_h(t)$ is the logical hop count weight at time $t$ and $w_b(t)$ is the weight of the comprehensive effective bandwidth at time $t$; both are determined dynamically from the logical hop count sensitivity parameters and bandwidth sensitivity parameters:
[0145] When the data-parallel-dominant phase is active, $w_h(t)$ and $w_b(t)$ are defined from $\beta_{\mathrm{DP}}$ and $\gamma_{\mathrm{DP}}$.
[0146] When the pipeline-parallel-dominant phase is active, $w_h(t)$ and $w_b(t)$ are defined from $\beta_{\mathrm{PP}}$ and $\gamma_{\mathrm{PP}}$.
[0147] Here $\beta_{\mathrm{DP}}$ is the logical hop count sensitivity in the data-parallel-dominant phase, $\gamma_{\mathrm{DP}}$ is the bandwidth sensitivity in the data-parallel-dominant phase, $\beta_{\mathrm{PP}}$ is the logical hop count sensitivity in the pipeline-parallel-dominant phase, and $\gamma_{\mathrm{PP}}$ is the bandwidth sensitivity in the pipeline-parallel-dominant phase.
[0148] In this method, the weights are derived from fitting historical data rather than being assigned subjectively, which makes them more interpretable. For example, if a fitting result satisfies $\gamma_{\mathrm{DP}} > \beta_{\mathrm{DP}}$, then in the data-parallel-dominant phase $w_b(t) > w_h(t)$, which indicates that the system is more concerned about insufficient bandwidth in this phase. Conversely, if the pipeline-parallel-dominant phase satisfies $\beta_{\mathrm{PP}} > \gamma_{\mathrm{PP}}$, the system is more focused on reducing the number of logical hops in this phase.
[0149] Step 3.2: To avoid frequent responses to minor fluctuations, set a topology state trigger threshold $\theta$ and use it to decide whether to enter the reconstruction process. When the topology state index satisfies $\mathrm{TSI}(t) > \theta$, the controller determines that there is a significant mismatch between the current network topology and the current training phase and proceeds to Step 4, which covers the generation, evaluation, filtering, and execution of network topology reconstruction actions. When $\mathrm{TSI}(t) \le \theta$, the existing topology remains unchanged.
[0150] The weights $w_h(t)$ and $w_b(t)$ are dimensionally normalized so that the hop count term and the inverse bandwidth term can be weighted on a comparable scale. A reasonable range for the threshold $\theta$ is 0.5 to 5.0. If the threshold is too low, the system is too sensitive and prone to frequent reconstruction; if it is too high, the system misses valuable reconstruction opportunities. In this embodiment, based on the distribution of the index in the simulation environment, a threshold of 0.5 is selected. This value effectively distinguishes acceptable topology states from topology states that need to be reconstructed.
[0151] For example, in a pipeline-parallel-dominant phase, if the topology state index calculated with the current weights exceeds the experimental threshold of 1.5, the system enters the topology reconstruction evaluation process; if the index obtained during the data-parallel-dominant phase does not exceed the threshold, the current topology still meets the training requirements and no reconstruction is needed.
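The index computation and trigger test of Step 3 can be sketched as follows. This is a hedged illustration: the text says the weights are derived from the fitted sensitivities and are dimensionally normalized, but the exact weight formula is not recoverable here, so `phase_weights` ASSUMES a simple sum-to-one normalization. The function names, the sample sensitivities, and the use of a bandwidth value normalized to the line rate are likewise hypothetical.

```python
def phase_weights(beta, gamma):
    """ASSUMPTION: normalize the phase's fitted sensitivities so the weights sum to 1."""
    total = beta + gamma
    if total <= 0:
        raise ValueError("sensitivities must have a positive sum")
    return beta / total, gamma / total


def topology_state_index(avg_hops, eff_bandwidth, w_hop, w_bw):
    """Weighted combination of the hop count term and the inverse bandwidth term."""
    return w_hop * avg_hops + w_bw / eff_bandwidth


def should_reconstruct(tsi, threshold):
    """Enter the reconstruction process only when the index exceeds the trigger threshold."""
    return tsi > threshold


# Hypothetical pipeline-parallel-dominant phase: hop sensitivity dominates.
w_hop, w_bw = phase_weights(beta=0.8, gamma=0.2)
# eff_bandwidth is assumed to be normalized to the line rate (0.5 = half of line rate).
tsi = topology_state_index(avg_hops=3.2, eff_bandwidth=0.5, w_hop=w_hop, w_bw=w_bw)
trigger = should_reconstruct(tsi, threshold=1.5)
```

With these illustrative inputs the hop term contributes most of the index, which matches the described behavior of the pipeline-parallel-dominant phase.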
[0152] Step 4: Using the topology state index constructed in Step 3, perform a prospective (look-ahead) evaluation of the candidate actions and, provided that the preset engineering constraints are satisfied, select the action with the highest comprehensive score for network topology reconstruction. In this step, the preset engineering constraints include at least a combination of the following: the budget wall constraint, the action execution threshold constraint, and the physical reachability constraint.
[0153] Budget wall constraint: the budget consumed by a reconstruction must not exceed the preset budget wall threshold. The budget consumption includes one, two, or all three of the following: control plane overhead, link switching latency, and additional resource consumption. A reasonable range for the budget wall threshold is 1 to 20 units. If the threshold is too small, the system is overly conservative and misses effective reconstruction opportunities; if it is too large, too many reconstructions may occur in a short period. In this embodiment, the threshold is set to 5, meaning that a single candidate reconstruction action may consume at most 5 units of budget.
[0154] Action execution threshold constraint: the time interval between two consecutive reconstruction actions must meet the cooldown time threshold, and the predicted net benefit gain of the target action must exceed the preset action execution threshold. An action is eligible for execution only when both conditions are satisfied. A reasonable range for the action execution threshold is 0 to 1. If the threshold is too low, many actions with very small marginal benefit are admitted; if it is too high, actions with moderate but worthwhile benefit may be missed. In this embodiment, we take... .
[0155] Physical reachability constraint: the generated topology must meet the following requirements: the switching equipment has enough remaining ports; the link capacity after reconstruction does not exceed the hardware limit; the target connection relationship lies within the range of physical connections that are allowed to be reconfigured; and basic network connectivity is maintained after the action is executed.
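A physical reachability check along these lines can be sketched as below. This is an illustrative assumption, not the patent's procedure: the data layout (links as unordered node pairs, a `free_ports` map, an `allowed_links` whitelist) and both function names are hypothetical, and the sketch covers the port, allowed-connection, and connectivity requirements while omitting the link capacity check for brevity.

```python
from collections import deque


def is_connected(nodes, links):
    """Breadth-first search over an undirected set of two-node links."""
    nodes = list(nodes)
    if not nodes:
        return True
    adjacency = {n: set() for n in nodes}
    for link in links:
        u, v = tuple(link)
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, queue = {nodes[0]}, deque([nodes[0]])
    while queue:
        for nxt in adjacency[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(nodes)


def physically_reachable(nodes, links, added, removed, free_ports, allowed_links):
    """Return True if the candidate action passes the three checks in this sketch."""
    added = {frozenset(l) for l in added}
    removed = {frozenset(l) for l in removed}
    # 1. Every new link must be inside the physically reconfigurable range.
    if any(l not in allowed_links for l in added):
        return False
    # 2. Each endpoint needs a free port per new link; removals free one port each.
    need = {}
    for l in added:
        for n in l:
            need[n] = need.get(n, 0) + 1
    for l in removed:
        for n in l:
            need[n] = need.get(n, 0) - 1
    if any(count > free_ports.get(n, 0) for n, count in need.items()):
        return False
    # 3. The network must stay connected after the action.
    new_links = ({frozenset(l) for l in links} - removed) | added
    return is_connected(nodes, new_links)


nodes = ["A", "B", "C"]
links = [("A", "B"), ("B", "C")]
allowed = {frozenset(("A", "C"))}
ok = physically_reachable(nodes, links, [("A", "C")], [], {"A": 1, "C": 1}, allowed)
```

A capacity check would follow the same pattern, comparing the post-action capacity of each touched link against a per-device hardware limit.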
[0156] Step 4 specifically includes the following steps:
[0157] Step 4.1: Let the network graph corresponding to the current topology at time $t$ be $G(t) = (V, E(t))$, where $G(t)$ is the network topology graph at time $t$, $V$ is the node set, and $E(t)$ is the link set at time $t$. In the unified graph model, the node set $V$ may include training nodes, switching nodes, or both; the link set $E(t)$ represents the physical or logical connection relationships that have been established.
[0158] When reconstruction is triggered, the controller considers the current critical communication flows, idle ports, reconfigurable link capabilities, and existing connection relationships, and first determines the communication node pairs or communication paths that should be optimized with priority. It then generates a set of candidate actions by constructing one or more reconstruction options around the critical communication flows, such as adding connections, deleting connections, and adjusting paths:
[0159] $\mathcal{A}(t) = \{a_1, a_2, \dots, a_K\}$
[0160] where $\mathcal{A}(t)$ is the set of candidate actions at time $t$, $a_k$ is the $k$-th candidate reconstruction action, and $K$ is the number of candidate actions;
[0161] Each candidate action may include one or more of the following types: adding a direct connection between key node pairs within the allowed switching plane; deleting a low-yield connection to free up port resources for a high-yield connection; adjusting the mapping path of key communication flows to traverse paths with fewer hops or higher bandwidth; or changing the logical adjacency relationship between some training nodes to shorten the topological distance between key communication pairs in a specific phase.
[0162] For example, if the critical communication in the current pipeline stage is the activation transmission between stage1 and stage2, and its existing path takes 4 hops, the controller can generate the following candidate actions: adjust the connection between the node where stage1 is located and the intermediate exchange node to a shorter path, or directly reconstruct a new adjacency relationship, so that the communication is reduced to 3 hops.
[0163] Step 4.2: Perform a look-ahead evaluation of the candidate actions and calculate the state after reconstruction. For each candidate action $a_k$, the controller does not execute it immediately; instead, it constructs in the control plane the virtual topology that would result from executing the action:
[0164] $G'_k(t) = (V, E'_k(t))$, where $E'_k(t)$ is the link set obtained by applying $a_k$ to $E(t)$.
[0165] In the virtual topology $G'_k(t)$, recalculate the logical hop count $\hat{h}_k(t)$ of the critical communication flow and the comprehensive effective bandwidth $\hat{B}_k(t)$, and, using the topology state index of Step 3.1, calculate the predicted topology state index after the action is executed:
[0166] $\widehat{\mathrm{TSI}}_k(t) = w_h(t)\,\hat{h}_k(t) + w_b(t) \cdot \dfrac{1}{\hat{B}_k(t)}$
[0167] Define the state improvement $\Delta_k$ brought by candidate action $a_k$:
[0168] $\Delta_k = \mathrm{TSI}(t) - \widehat{\mathrm{TSI}}_k(t)$
[0169] If $\Delta_k > 0$, candidate action $a_k$ improves the current topology; if $\Delta_k \le 0$, candidate action $a_k$ brings no obvious benefit and may worsen the situation;
[0170] For example, if the current topology state index is 1.92 and a candidate action is predicted to yield an index of 1.41, then $\Delta_k = 1.92 - 1.41 = 0.51$, which indicates that the action has a positive benefit.
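The look-ahead evaluation can be sketched by applying a candidate action to a copy of the link set and recomputing the hop count of the critical flow, without touching the live topology. This is a minimal sketch; the node names, the helper names `virtual_topology`, `hop_count`, and `state_improvement`, and the four-hop example path are illustrative assumptions.

```python
from collections import deque


def virtual_topology(links, add=(), remove=()):
    """Return a new link set with the action applied; the input set is left unchanged."""
    return (set(links) - set(remove)) | set(add)


def hop_count(links, src, dst):
    """Shortest hop count between src and dst over undirected links (BFS), or None."""
    adjacency = {}
    for u, v in links:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def state_improvement(current_index, predicted_index):
    """Positive values mean the candidate action is predicted to improve the topology."""
    return current_index - predicted_index


# Critical flow between two pipeline stages currently takes 4 hops.
current = {("s1", "sw1"), ("sw1", "sw2"), ("sw2", "sw3"), ("sw3", "s2")}
candidate = virtual_topology(current, add={("sw1", "sw3")})
print(hop_count(current, "s1", "s2"), hop_count(candidate, "s1", "s2"))  # 4 3
print(round(state_improvement(1.92, 1.41), 2))  # 0.51
```

Because `virtual_topology` returns a new set, several candidate actions can be evaluated side by side against the same current topology.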
[0171] Step 4.3: State improvement alone is insufficient to decide whether to execute an action, because topology reconstruction consumes control and hardware resources and introduces additional switching costs. Therefore, a comprehensive net benefit scoring function is constructed for each action:
[0172] $S(a_k) = \lambda_1\,\Delta_k - \lambda_2\,c_k^{\mathrm{sw}} - \lambda_3\,c_k^{\mathrm{res}}$
[0173] where $S(a_k)$ is the comprehensive net benefit score of action $a_k$; $\lambda_1$ is the improvement benefit weight; $\lambda_2$ is the switching cost weight; $\lambda_3$ is the resource cost weight; $\Delta_k$ is the state improvement; $c_k^{\mathrm{sw}}$ is the reconstruction (switching) overhead of the $k$-th candidate action; and $c_k^{\mathrm{res}}$ is the resource cost of the $k$-th candidate action.
[0174] The improvement benefit weight $\lambda_1$, the switching cost weight $\lambda_2$, and the resource cost weight $\lambda_3$ balance benefits against costs. A reasonable range for each is 0.1 to 10. If a weight is too small, the impact of the corresponding factor is underestimated; if it is too large, it suppresses the other factors. In this embodiment, we take: ... .
[0175] This set of values indicates that this embodiment prioritizes the improvement benefit while explicitly penalizing switching costs and resource consumption, so that the system is not reconfigured frequently for minor benefits.
[0176] For example, if the improvement benefit weight of a candidate action is 0.60, the switching cost weight is 0.40, and the resource cost weight is 0.25, its comprehensive net benefit score is computed with the scoring function of Step 4.3. If another action provides a greater improvement but at a significantly higher cost, its comprehensive net benefit score is not necessarily better.
[0177] Step 4.4: Apply the budget wall constraint $C(a_k) \le C_{\max}$, the physical reachability constraint $\mathcal{R}(a_k) = 1$, and the action execution threshold constraint $S(a_k) > \theta_a$. Among the actions that satisfy all three constraints, the action with the highest comprehensive score is selected as the final action to be executed, and after execution the current topology is updated to the topology that results from that action. Here $C(a_k)$ is the budget consumption of action $a_k$, $C_{\max}$ is the budget wall threshold, $\mathcal{R}(\cdot)$ is the function that determines membership in the physically reachable region, $S(\cdot)$ is the comprehensive net benefit scoring function, and $\theta_a$ is the action execution threshold.
[0178] For example, during a pipeline-parallel phase, the system generates three candidate actions. Action 1 reduces the critical path hop count from 4 to 3 but requires 2 additional high-value ports; Action 2 increases bandwidth from 70 Gbps to 85 Gbps with the hop count unchanged; Action 3 improves both hop count and bandwidth, but its switching overhead is too high. After look-ahead evaluation and scoring, if Action 1 has the highest comprehensive score and meets the budget wall and physical reachability requirements, Action 1 is executed; if Action 1 does not meet the budget constraint, Action 2, which has the second highest score and is compliant, is selected instead.
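The scoring and constrained selection of Steps 4.3 and 4.4 can be sketched as follows. All numeric values, the `Candidate` fields, and the function names are hypothetical and chosen only to mirror the three-action example; the cooldown part of the action execution constraint is left out of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    improvement: float    # state improvement (current index minus predicted index)
    switch_cost: float    # reconstruction / switching overhead
    resource_cost: float  # resource cost
    budget: float         # budget consumption, compared against the budget wall
    reachable: bool       # result of the physical reachability check


def score(c, w_gain, w_switch, w_resource):
    """Net benefit: weighted improvement minus weighted switching and resource costs."""
    return w_gain * c.improvement - w_switch * c.switch_cost - w_resource * c.resource_cost


def select_action(candidates, budget_wall, exec_threshold, weights):
    """Return the highest-scoring candidate that passes all constraints, or None."""
    feasible = [
        c for c in candidates
        if c.budget <= budget_wall
        and c.reachable
        and score(c, *weights) > exec_threshold
    ]
    return max(feasible, key=lambda c: score(c, *weights), default=None)


# Illustrative values only: action1 scores highest but breaks the budget wall,
# action3 has too much switching overhead, so action2 is selected.
candidates = [
    Candidate("action1", 0.6, 0.2, 0.2, budget=7, reachable=True),
    Candidate("action2", 0.4, 0.1, 0.1, budget=3, reachable=True),
    Candidate("action3", 0.8, 1.5, 0.2, budget=4, reachable=True),
]
weights = (1.0, 0.5, 0.5)
chosen = select_action(candidates, budget_wall=5, exec_threshold=0.1, weights=weights)
print(chosen.name)  # action2
```

Filtering before ranking keeps an infeasible action from ever being selected, even when its raw score is the highest.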
[0179] Step 5: Record the measured latency and network status after network topology reconstruction, and update the logical hop count sensitivity parameters, the bandwidth sensitivity parameters, and the stability control parameters. The stability control parameters include the cooldown time threshold $T_{\mathrm{cool}}$, the reconstruction trigger high threshold $\theta_H$, and the reconstruction exit low threshold $\theta_L$. Specifically, Step 5 includes the following steps:
[0180] Step 5.1: After the final action $a^*$ has been executed, the prediction error is defined as:
[0181] $e = d^{\mathrm{meas}} - d^{\mathrm{pred}}$
[0182] where $d^{\mathrm{meas}}$ is the end-to-end latency measured after the final action $a^*$ is executed, and $d^{\mathrm{pred}}$ is the latency predicted by the current model;
[0183] If $e > 0$, the actual latency is higher than expected, meaning that the current model overestimates the benefit of the action; if $e < 0$, the actual latency is lower than expected, meaning that the current model underestimates the benefit of the action;
[0184] The system also records a reconstruction log, which includes: the current training phase; the topologies before and after reconstruction; the type of the executed action $a^*$; the reconstruction overhead, resource cost, and budget consumption of the corresponding candidate action; and the actual performance changes after execution, namely the measured change in end-to-end communication latency, the effect on long-tail latency suppression, and the improvement in training communication efficiency after the topology reconstruction action is actually executed;
[0185] Step 5.2: Based on the prediction error $e$, update the logical hop count sensitivity parameter and the bandwidth sensitivity parameter of the corresponding phase. Let the logical hop count sensitivity update step size be $\eta_h$ and the bandwidth sensitivity update step size be $\eta_b$. If the current moment belongs to the data-parallel-dominant phase, update:
[0186]
[0187]
[0188] where $\bar{h}(t)$ is the currently observed logical hop count and $B_{\mathrm{eff}}(t)$ is the currently observed comprehensive effective bandwidth. If the current moment belongs to the pipeline-parallel-dominant phase, only $\beta_{\mathrm{PP}}$ and $\gamma_{\mathrm{PP}}$ are updated. This ensures that different training phases keep independent physical sensitivity parameters.
[0189] A reasonable range for the update step size is 0.001 to 0.1. If the step size is too small, the model updates too slowly; if it is too large, the parameters oscillate easily. In this embodiment, we take... .
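One possible form of the per-phase online update can be sketched as follows. The exact update equations are not recoverable from the text, so this sketch ASSUMES a gradient-style (least-mean-squares) correction in which each sensitivity moves in proportion to the prediction error and its own input term; the function name, dictionary layout, and numeric values are hypothetical.

```python
def update_sensitivities(params, phase, error, avg_hops, eff_bandwidth,
                         step_hop=0.01, step_bw=0.01):
    """Return a new parameter dict in which only `phase` has been corrected.

    ASSUMED rule (LMS-style): beta += step_hop * error * hops,
                              gamma += step_bw * error * (1 / bandwidth).
    `error` is measured latency minus predicted latency.
    """
    updated = {name: dict(values) for name, values in params.items()}
    updated[phase]["beta"] += step_hop * error * avg_hops
    updated[phase]["gamma"] += step_bw * error / eff_bandwidth
    return updated


params = {
    "DP": {"beta": 1.0, "gamma": 2.0},
    "PP": {"beta": 3.0, "gamma": 0.5},
}
# Measured latency exceeded the prediction by 0.5 during a data-parallel-dominant phase.
new_params = update_sensitivities(params, "DP", error=0.5, avg_hops=3.2, eff_bandwidth=0.5)
```

Returning a fresh dictionary keeps the previous parameter set available for logging, and the untouched phase retains its values, which matches the requirement that each phase has independent sensitivity parameters.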
[0190] Step 5.3: To prevent the physical sensing dynamic topology reconstruction system from executing multiple topology actions consecutively within a short period, a cooldown time threshold $T_{\mathrm{cool}}$ is introduced. Let the time at which the last reconstruction was completed be $t_{\mathrm{last}}$. The system is allowed to re-enter the execution phase of network topology reconstruction only when $t - t_{\mathrm{last}} \ge T_{\mathrm{cool}}$.
[0191] A reasonable range for the cooldown time threshold is 1 to 20 training steps or 1 to 20 control cycles. If the threshold is too small, frequent switching is barely suppressed; if it is too large, the system may respond more slowly to sudden bottlenecks. In this embodiment, the threshold is set to $T_{\mathrm{cool}} = 3$, meaning that at least 3 control cycles must elapse between two consecutive reconstructions.
[0192] Step 5.4: On top of the cooldown time threshold of Step 5.3, introduce hysteresis control: define a reconstruction trigger high threshold $\theta_H$ and a reconstruction exit low threshold $\theta_L$ that satisfy $\theta_H > \theta_L$.
[0193] When $\mathrm{TSI}(t) > \theta_H$, the physical sensing dynamic topology reconstruction system enters the reconstruction decision state. Once the system has entered this state, it exits only when $\mathrm{TSI}(t) < \theta_L$. This prevents the system from repeatedly entering and exiting the reconstruction decision state when $\mathrm{TSI}(t)$ fluctuates slightly near the boundary of the trigger interval.
[0194] Reasonable ranges for the two thresholds are as follows: $\theta_H$ between 1.2 and 3.0; $\theta_L$ between 0.8 and 2.5, and less than $\theta_H$. In this embodiment, we take... . This setting effectively suppresses control oscillations caused by boundary jitter.
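The cooldown of Step 5.3 and the hysteresis of Step 5.4 can be combined in a small gate object, sketched below. The class name, its methods, and the thresholds (high 1.5, low 1.2, cooldown of 3 control cycles) are illustrative assumptions chosen within the stated ranges, not the embodiment's actual settings.

```python
class ReconstructionGate:
    """Two-threshold hysteresis plus a cooldown between executed reconstructions."""

    def __init__(self, high, low, cooldown):
        if low >= high:
            raise ValueError("low threshold must be below high threshold")
        self.high, self.low, self.cooldown = high, low, cooldown
        self.active = False    # True while in the reconstruction decision state
        self.last_done = None  # control cycle of the last completed reconstruction

    def step(self, cycle, tsi):
        """Return True when a reconstruction may be executed in this control cycle."""
        if not self.active and tsi > self.high:
            self.active = True   # enter only above the high threshold
        elif self.active and tsi < self.low:
            self.active = False  # exit only below the low threshold
        cooled = self.last_done is None or cycle - self.last_done >= self.cooldown
        return self.active and cooled

    def mark_done(self, cycle):
        """Record that a reconstruction completed in this cycle (starts the cooldown)."""
        self.last_done = cycle


gate = ReconstructionGate(high=1.5, low=1.2, cooldown=3)
decisions = []
for cycle, tsi in enumerate([1.4, 1.6, 1.45, 1.55, 1.55, 1.1]):
    allowed = gate.step(cycle, tsi)
    if allowed:
        gate.mark_done(cycle)
    decisions.append(allowed)
print(decisions)  # [False, True, False, False, True, False]
```

In the trace, the index dipping to 1.45 does not leave the decision state because it stays above the low threshold, while the cooldown alone blocks execution in cycles 2 and 3.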
[0195] Step 5.5: The physical perception dynamic topology reconstruction system writes the reconstruction results of this round into the log database and returns to step 1 to continue the next round of online monitoring and decision-making, i.e., repeating steps 1-5. This forms a complete closed loop: perceiving the current training phase and network physical state; quantifying the impact of the current topology on training communication; triggering reconstruction when the topology state deteriorates; verifying the reconstruction results online; and using the real results to correct subsequent decision-making basis.
[0196] For example, during a pipeline-parallel-dominant phase, the system predicts that a hop reduction action can reduce end-to-end latency by 10%, but actual measurements show only a 4% reduction. This indicates that the current estimate of the hop reduction benefit is overly optimistic. In this case, updating the logical hop count sensitivity parameter of the pipeline-parallel-dominant phase reduces the scoring deviation of subsequent similar actions. At the same time, if there have been multiple consecutive reconstructions recently, the cooldown mechanism prevents the system from switching again immediately in the next control cycle, thereby ensuring overall control stability.
[0197] This application realizes online dynamic reconstruction of the network topology for large-scale hybrid parallel training. Figures 1 to 7 indicate that the method can adaptively adjust its optimization focus as the training phase changes, reduce end-to-end communication latency and long-tail latency, and maintain relatively stable control behavior while taking reconstruction overhead and engineering deployability into account.
[0198] Simulation experiments based on real LLM training trajectories show the following. As shown in Figure 1, after the method of this application is introduced, the end-to-end delay curve responds rapidly and decreases at the phase switching point, and is significantly lower than the static baseline. As shown in Figure 4, the cumulative distribution function curve of the delay shifts significantly to the left in the high-delay range (350 ms to 500 ms), showing that long-tail delay is effectively suppressed. As shown in Figure 7, the topology configuration point selected by the system always lies below the hardware physical constraint boundary and dynamically resides in the low-hop-count region or the high-bandwidth region under different phases, verifying the algorithm's physical perception capability and phase adaptation capability.
[0199] As shown in Figure 2, the latency weight and the bandwidth weight polarize dynamically as the training phase changes. In the pipeline-parallel-dominant phase, the system focuses more on low hop count optimization; in the data-parallel-dominant phase, the system focuses more on high bandwidth guarantee, thereby achieving differentiated optimization for different communication modes.
[0200] As shown in Figure 3, the budget consumption of the reconstruction actions never exceeds the budget wall constraint threshold throughout the simulation, indicating that this application can effectively control the reconstruction cost and meet engineering cost constraints during dynamic reconstruction.
[0201] As shown in Figure 5, in both the pipeline-parallel and data-parallel scenarios, the end-to-end delay distribution of the TSI dynamic topology scheme is generally better than the static baseline, indicating that the proposed method has a stable delay reduction effect under different dominant phases. Taken together, Figure 1 and Figure 5 (see the drawings appended to the specification) show that both the end-to-end delay curve and the delay distribution are generally better than those of the static baseline scheme.
[0202] As shown in Figure 6, the average delay reductions of the candidate reconstruction actions are clearly distinguishable, and the budget wall mechanism can intercept the low-return actions that correspond to most small fluctuations, thereby avoiding control oscillations and improving the quality of reconstruction decisions.
[0203] This method addresses the differentiated communication bottlenecks in the data parallel and pipeline parallel phases of large language model training. It acquires training phase signals online and constructs a data foundation using in-band network telemetry (INT) and / or sampled stream telemetry (sFlow) techniques. Network topology physical state variables are collected and aggregated in a windowed manner, including at least logical hop count and available link bandwidth. Based on historical samples, the quantitative relationship between end-to-end latency and topology physical variables is fitted to obtain physically meaningful sensitivity parameters. A topology state index is constructed, and the weights of the hop count and bandwidth terms are dynamically polarized according to the training phase. Candidate reconstruction actions are prospectively evaluated, incorporating reconstruction overhead and resource costs. Deployable actions are selected and executed by combining budget walls, physical reachability constraints, and action execution threshold constraints. Finally, the sensitivity parameters and stability control parameters are updated in a closed loop based on measured latency and reconstruction logs. This approach significantly reduces end-to-end communication latency and suppresses long-tail latency while ensuring hardware physical deployability and control stability, thereby improving the training efficiency and stability of large models.
[0204] The above description is merely an embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principle of the present invention should be included within the scope of the claims of the present invention.
Claims
1. A method for physical perception dynamic topology reconstruction for large-scale model hybrid parallel training, characterized in that: The physical sensing dynamic topology reconstruction method is implemented through a physical sensing dynamic topology reconstruction system, and the physical sensing dynamic topology reconstruction method specifically includes the following steps: Step 1: Collect the signal and network topology physical state variables of the current training phase during the large model training process online. The network topology physical state variables include logical hop count and link available bandwidth indicators. Construct a comprehensive effective bandwidth for the current training phase signal task. Step 2: Based on historical operational data, establish a quantitative mapping relationship between network topology physical state variables and end-to-end latency to obtain logical hop count sensitivity parameters and bandwidth sensitivity parameters, providing a basis for subsequent topology state index calculation; Step 3: Based on the comprehensive effective bandwidth constructed in Step 1, the logical hop count sensitivity parameter and bandwidth sensitivity parameter obtained in Step 2, construct a unified, online-calculated topology state index to determine whether the current network topology matches the current training phase, and serve as the direct basis for triggering reconstruction. Step 4: Combine the topology state index constructed in Step 3 to perform a prospective evaluation of the candidate actions, and under the premise of meeting the preset engineering constraints, select the action with the highest comprehensive score for network topology reconstruction. Step 5: Record the measured latency and network status after network topology reconstruction, and update the logical hop count sensitivity parameter, bandwidth sensitivity parameter, and stability control parameter.
2. The method for physical perception dynamic topology reconstruction for large-scale model hybrid parallel training as described in claim 1, characterized in that: Step 1 specifically includes the following steps: Step 1.1: Acquire the signal of the current training phase online from the large model training framework. The training phase signal includes $p^{\mathrm{DP}}(t)$, indicating whether time $t$ is in the data-parallel-dominant phase, and $p^{\mathrm{PP}}(t)$, indicating whether time $t$ is in the pipeline-parallel-dominant phase, and the two indicators must satisfy: $p^{\mathrm{DP}}(t) + p^{\mathrm{PP}}(t) = 1$ with $p^{\mathrm{DP}}(t), p^{\mathrm{PP}}(t) \in \{0, 1\}$, where $t$ is the current sampling time; that is, at any given sampling time, the physical perception dynamic topology reconstruction system corresponds to only one current dominant training phase. Step 1.2: After acquiring the training phase, the physical sensing dynamic topology reconstruction system uses in-band network telemetry (INT) or sampled stream telemetry (sFlow) to collect the network topology physical state variables of training-related traffic across links, switching nodes, and end-to-end paths, builds a data foundation, and continuously monitors the network topology physical state variables. Step 1.3: Perform windowed aggregation on the collected network topology physical state variables to suppress instantaneous jitter, and obtain the window-averaged logical hop count, defined as: $\bar{h}(t) = \frac{1}{W}\sum_{\tau = t-W+1}^{t} h(\tau)$, where $h(\tau)$ is the logical hop count and $W$ is the length of the time window; similarly, the window-averaged link available bandwidth metric is defined as: $\bar{b}_l(t) = \frac{1}{W}\sum_{\tau = t-W+1}^{t} b_l(\tau)$, where $b_l(\tau)$ is the available bandwidth metric of link $l$ at time $\tau$, calculated from the link load status, queue occupancy status, and sampled flow traffic statistics reported by in-band network telemetry (INT) and sampled stream telemetry (sFlow).
If the current training phase task involves a set of critical links $\mathcal{L}_c$, then the comprehensive effective bandwidth $B_{\mathrm{eff}}(t)$ for the current training phase task is constructed from the window-averaged metrics $\bar{b}_l(t)$ of the links $l \in \mathcal{L}_c$, where $\omega_l$ is the importance weight of link $l$ for the current training phase task; the higher the importance weight, the more critical the link.
3. The method for physical perception dynamic topology reconstruction for large-scale model hybrid parallel training as described in claim 1, characterized in that: Step 2 specifically includes the following steps: Step 2.1: Before the physical sensing dynamic topology reconstruction system runs, samples are extracted from the historical operation logs, telemetry records, and topology snapshots saved by the controller to form a training sample set. For any historical sample $i$, the following fields are recorded: $(p_i^{\mathrm{DP}},\ p_i^{\mathrm{PP}},\ h_i,\ b_i,\ d_i)$; where $p_i^{\mathrm{DP}}$ indicates whether the $i$-th sample belongs to the data-parallel-dominant phase, $p_i^{\mathrm{PP}}$ indicates whether the $i$-th sample belongs to the pipeline-parallel-dominant phase, $h_i$ is the logical hop count of the $i$-th sample, $b_i$ is the link available bandwidth metric of the $i$-th sample, and $d_i$ is the end-to-end latency of the $i$-th sample; Step 2.2: Based on the logical hop count $h_i$ and the reciprocal of the link available bandwidth metric $b_i$ of each sample, establish a model of the end-to-end latency $d_i$ and solve for the sensitivity parameters: $d_i = \alpha + \beta\, h_i + \gamma \cdot \frac{1}{b_i}$, where $\alpha$ is the basic delay term, $\beta$ is the logical hop count sensitivity parameter, and $\gamma$ is the bandwidth sensitivity parameter.
4. The method for physical perception dynamic topology reconstruction for large-scale model hybrid parallel training according to claim 1, characterized in that: In step 3, determining whether the current network topology matches the current training phase specifically involves identifying whether the current training phase is the data-parallel-dominant phase or the pipeline-parallel-dominant phase. If it is the data-parallel-dominant phase, the weight $w_b(t)$ of the comprehensive effective bandwidth is increased and the logical hop count weight $w_h(t)$ is reduced. If it is the pipeline-parallel-dominant phase, the logical hop count weight $w_h(t)$ is increased and the weight $w_b(t)$ of the comprehensive effective bandwidth is reduced.
5. The method for physical perception dynamic topology reconstruction for large-scale model hybrid parallel training according to claim 4, characterized in that: Step 3 specifically includes the following steps: Step 3.1: Construct the topology state index $\mathrm{TSI}(t)$: $\mathrm{TSI}(t) = w_h(t)\,\bar{h}(t) + w_b(t) \cdot \frac{1}{B_{\mathrm{eff}}(t)}$, where $\bar{h}(t)$ is the window-averaged logical hop count and $B_{\mathrm{eff}}(t)$ is the comprehensive effective bandwidth. The larger $\mathrm{TSI}(t)$ is, the more unfavorable the current topology is to the current training phase; the smaller it is, the better the current topology matches the current training requirements. $w_h(t)$ is the logical hop count weight at time $t$ and $w_b(t)$ is the weight of the comprehensive effective bandwidth at time $t$, and both are determined dynamically from the logical hop count sensitivity parameters and bandwidth sensitivity parameters: when the data-parallel-dominant phase is active, $w_h(t)$ and $w_b(t)$ are defined from $\beta_{\mathrm{DP}}$ and $\gamma_{\mathrm{DP}}$; when the pipeline-parallel-dominant phase is active, $w_h(t)$ and $w_b(t)$ are defined from $\beta_{\mathrm{PP}}$ and $\gamma_{\mathrm{PP}}$; where $\beta_{\mathrm{DP}}$ is the logical hop count sensitivity in the data-parallel-dominant phase, $\gamma_{\mathrm{DP}}$ is the bandwidth sensitivity in the data-parallel-dominant phase, $\beta_{\mathrm{PP}}$ is the logical hop count sensitivity in the pipeline-parallel-dominant phase, and $\gamma_{\mathrm{PP}}$ is the bandwidth sensitivity in the pipeline-parallel-dominant phase. Step 3.2: Set the topology state trigger threshold $\theta$ and decide whether to enter the reconstruction process. When the topology state index satisfies $\mathrm{TSI}(t) > \theta$, the controller determines that there is a mismatch between the current network topology and the current training phase and proceeds to step 4. When $\mathrm{TSI}(t) \le \theta$, the existing topology remains unchanged.
6. The method for physical perception dynamic topology reconstruction for large-scale model hybrid parallel training according to claim 1, characterized in that: in step 4, the preset engineering constraints include at least a combination of the following constraints: a budget wall constraint, an action execution threshold constraint, and a physical reachability constraint.
7. The method for physical perception dynamic topology reconstruction for large-scale model hybrid parallel training according to claim 6, characterized in that: in step 4, the prospective evaluation comprises calculating the net benefit of each candidate action at a given time, where the net benefit is determined from the predicted improvement of the topology state index brought by the candidate action and the reconstruction overhead penalty term of the candidate action.
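As a reading aid, the net benefit of claim 7 can be written as a difference between the two named terms. The subtractive form and the symbols $G$, $\Delta I$, and $C$ are assumptions introduced here for illustration; the claim's own formula is not reproduced in this text.

```latex
% Assumed form: net benefit of candidate action a at time t
G(a, t) = \Delta I(a, t) - C(a)
% \Delta I(a, t): predicted improvement of the topology state index
% C(a):           reconstruction overhead penalty term
% Example: \Delta I = 0.8,\ C = 0.3 \;\Rightarrow\; G = 0.5
```

Under this reading, a candidate action is worth considering only when its predicted improvement outweighs its reconstruction overhead.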
8. The method for physical perception dynamic topology reconstruction for large-scale model hybrid parallel training according to claim 6, characterized in that step 4 specifically includes the following steps:
Step 4.1: At the current time, the current topology is represented by a network graph consisting of a set of nodes and the set of links at that time; when reconstruction is triggered, the controller generates the set of candidate actions for that time, which contains a finite number of candidate reconstruction actions.
Step 4.2: For each candidate action, construct the virtual topology that would result from executing the action; in the virtual topology, recalculate the logical hop count and the overall effective bandwidth; based on the topology state trigger threshold, calculate the predicted topology state index after the action is executed; and define the state improvement brought by the candidate action relative to the current topology state index. If the state improvement is positive, the candidate action improves the current topology; otherwise, the candidate action brings no benefit and may worsen the situation.
Step 4.3: Construct the overall net benefit scoring function of the action, which gives the overall net benefit score of each action from three weighted terms: the state improvement, weighted by the improvement benefit weight; the reconstruction overhead of the candidate action, weighted by the switching cost weight; and the resource cost of the candidate action, weighted by the resource cost weight.
Step 4.4: Apply the budget wall constraint, the physical reachability constraint, and the action execution threshold constraint. The budget wall constraint compares the budget consumption of an action with the budget wall threshold; the physical reachability constraint is evaluated by the physical reachability region determination function; and the action execution threshold constraint compares the overall net benefit score of an action with the action execution threshold. Among the actions that satisfy all three constraints, the action with the highest overall net benefit score is selected as the final action to be executed. After execution, the current topology is updated to the topology produced by that action.
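The candidate evaluation and constrained selection of steps 4.2 to 4.4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed formulas: the score is assumed to be the weighted improvement minus the weighted switching and resource costs, the improvement is assumed to be the current index minus the predicted index, and the names `Candidate`, `net_score`, and `select_action` as well as all weight and threshold values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One candidate reconstruction action (hypothetical fields)."""
    name: str
    predicted_index: float  # index recomputed on the virtual topology
    switch_cost: float      # reconstruction overhead of the action
    resource_cost: float    # resource cost of the action
    budget: float           # budget the action would consume
    reachable: bool         # result of the physical reachability check


def net_score(current_index, c, w_gain=1.0, w_switch=0.5, w_resource=0.5):
    """Assumed score: weighted improvement minus weighted costs."""
    improvement = current_index - c.predicted_index
    return w_gain * improvement - w_switch * c.switch_cost - w_resource * c.resource_cost


def select_action(current_index: float,
                  candidates: Sequence[Candidate],
                  budget_wall: float,
                  score_threshold: float) -> Optional[Candidate]:
    """Return the feasible candidate with the highest score, or None."""
    best, best_score = None, None
    for c in candidates:
        # Budget wall and physical reachability constraints.
        if c.budget > budget_wall or not c.reachable:
            continue
        score = net_score(current_index, c)
        # Action execution threshold constraint.
        if score < score_threshold:
            continue
        if best is None or score > best_score:
            best, best_score = c, score
    return best


candidates = [
    Candidate("add_shortcut", 1.5, 0.5, 0.5, budget=3.0, reachable=True),
    Candidate("full_rewire", 1.0, 0.5, 0.5, budget=10.0, reachable=True),
    Candidate("cross_rack", 0.5, 0.5, 0.5, budget=1.0, reachable=False),
    Candidate("minor_tweak", 2.0, 0.5, 0.5, budget=1.0, reachable=True),
]
chosen = select_action(2.5, candidates, budget_wall=5.0, score_threshold=0.25)
print(chosen.name if chosen else "keep current topology")
```

Returning `None` when no candidate passes all three constraints keeps the existing topology in place, which avoids executing an action whose expected benefit does not justify its cost.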
9. The method for physical perception dynamic topology reconstruction for large-scale model hybrid parallel training according to claim 8, characterized in that step 5 specifically includes the following steps:
Step 5.1: After the final action has been executed, define the prediction error from the measured end-to-end latency after execution of the final action and the latency predicted by the current model. If the measured latency is higher than predicted, the current model overestimates the benefit of the action; if the measured latency is lower than predicted, the current model underestimates the benefit of the action.
Step 5.2: Based on the prediction error, update the logical hop count sensitivity parameter and the bandwidth sensitivity parameter of the corresponding phase, using a logical hop count sensitivity update step size and a bandwidth sensitivity update step size. If the current time belongs to the data parallel-dominated phase, update the two sensitivity parameters of that phase using the currently observed logical hop count and the currently observed overall effective bandwidth; if the current time belongs to the pipeline parallel-dominated phase, update only the two sensitivity parameters of the pipeline parallel-dominated phase.
Step 5.3: To prevent the physical sensing dynamic topology reconstruction system from executing multiple topology actions consecutively within a short period, introduce a cooldown time threshold. Taking the time at which the last reconstruction was completed as the reference, the system is allowed to re-enter the execution phase of network topology reconstruction only when the time elapsed since that reference reaches the cooldown time threshold.
Step 5.4: On the basis of the cooldown time threshold of step 5.3, introduce hysteresis control: define a reconstruction trigger high threshold and a reconstruction exit low threshold, with the high threshold greater than the low threshold. When the trigger condition on the high threshold is met, the system enters the reconstruction determination state; once the system has entered the reconstruction determination state, it exits that state only when the exit condition on the low threshold is met.
Step 5.5: The physical sensing dynamic topology reconstruction system writes the reconstruction results of this round into the log database and returns to step 1 to continue the next round of online monitoring and decision-making, that is, it repeats steps 1 to 5.
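The feedback and anti-oscillation logic of steps 5.2 to 5.4 can be sketched in Python. The update rule and the compared quantity are assumptions made for illustration, since the claim's formulas are not reproduced in this text: each sensitivity is moved by step size times prediction error times the observed quantity, the hysteresis thresholds are assumed to be compared against the topology state index, and the names `PhaseSensitivity`, `update_sensitivity`, and `ReconstructionGate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PhaseSensitivity:
    """Sensitivity parameters of one training phase (hypothetical container)."""
    hop: float
    bandwidth: float


def update_sensitivity(s, error, hops, bandwidth,
                       step_hop=0.01, step_bw=0.01, floor=1e-6):
    """Assumed update: move each sensitivity by step * error * observation.

    `error` is taken as measured latency minus predicted latency. The result
    is clamped to a small positive floor so the weights stay well defined.
    """
    return PhaseSensitivity(
        hop=max(floor, s.hop + step_hop * error * hops),
        bandwidth=max(floor, s.bandwidth + step_bw * error * bandwidth),
    )


class ReconstructionGate:
    """Combines hysteresis (high/low thresholds) with a cooldown timer."""

    def __init__(self, high, low, cooldown):
        if not high > low:
            raise ValueError("high threshold must exceed low threshold")
        self.high, self.low, self.cooldown = high, low, cooldown
        self.in_determination = False  # reconstruction determination state
        self.last_done = None          # completion time of last reconstruction

    def allow(self, index, now):
        """Return True if a reconstruction may be executed at time `now`."""
        if not self.in_determination and index > self.high:
            self.in_determination = True
        elif self.in_determination and index < self.low:
            self.in_determination = False
        cooled = self.last_done is None or now - self.last_done >= self.cooldown
        return self.in_determination and cooled

    def mark_done(self, now):
        """Record the completion time of a reconstruction."""
        self.last_done = now


# Only the parameters of the currently dominant phase are updated.
params = {"data_parallel": PhaseSensitivity(1.0, 1.0),
          "pipeline_parallel": PhaseSensitivity(1.0, 1.0)}
params["data_parallel"] = update_sensitivity(
    params["data_parallel"], error=2.0, hops=4.0, bandwidth=0.5,
    step_hop=0.125, step_bw=0.25)
print(params["data_parallel"])
```

An index that hovers between the two thresholds neither enters nor leaves the determination state, and the cooldown blocks back-to-back executions even while the state is active; together these limit the control plane oscillation that the method is meant to avoid.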
10. The method for physical perception dynamic topology reconstruction for large-scale model hybrid parallel training according to claim 1, characterized in that the physical sensing dynamic topology reconstruction system includes: a state awareness module, used to acquire training phase signals online and to acquire the physical state variables of the network topology based on in-band network telemetry or sampled stream telemetry; a parameter fitting module, used to fit, from historical data, the sensitivity parameters relating end-to-end latency to the physical state variables of the network topology; a decision control module, used to construct the topology state index, perform dynamic polarization of the phase weights, and select the optimal reconstruction action under the budget wall and physical reachability constraints; and a topology execution module, used to issue flow table or optical path switching commands according to the decision results in order to perform topology reconstruction.