MOE expert collaboration modeling and load-aware routing adjustment system and training method

An MOE expert collaborative modeling and load-aware routing adjustment system addresses routing instability and load imbalance when MoE models are trained in a distributed environment, improving training throughput and stability, reducing communication overhead, and enhancing resource utilization and system robustness.

CN122247904A · Pending · Publication Date: 2026-06-19 · SIPPR ENG GROUP +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: SIPPR ENG GROUP
Filing Date: 2026-03-21
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

When training the MoE model in a distributed environment, issues such as unstable routing, uneven division of labor among experts, and unbalanced load arise, leading to increased communication and synchronization waiting overhead, which affects training throughput and convergence stability, especially in cross-domain data scenarios.

Method used

The MOE expert collaborative modeling and load-aware routing adjustment system is adopted. Through the offline expert collaborative modeling module, the load time evolution feedback module, the joint routing decision module, and the hotspot expert replica enhancement module, an expert collaboration matrix is constructed to perform load penalty and collaboration bias adjustment, dynamically create expert replicas to divert hotspots, and reduce communication and synchronization waiting overhead.

Benefits of technology

It improves the throughput and stability of MoE model training, reduces the communication overhead of All-to-All dispatch, improves resource utilization and system robustness, and ensures model convergence and performance.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses an MOE (Mixture-of-Experts) expert collaboration modeling and load-aware routing adjustment system and training method, including an offline expert collaboration modeling module, a load time evolution feedback module, a joint routing decision module, and a hotspot expert replica enhancement module. Through expert collaboration structure modeling and collaboration bias injection, when candidate expert scores are similar, the system tends to select expert combinations with strong collaboration relationships or those in the same cluster, improving the stability of expert division of labor and suppressing routing drift. By modeling the time evolution of expert load statistical signals and constructing a load penalty term, the system suppresses further congestion of persistently hot experts, alleviates tail waiting, and improves training throughput stability. For persistently hot experts, a replica enhancement, synchronization, and recycling mechanism is introduced to split traffic when resources allow, thereby reducing tail iteration time. When resources are limited, replicas can be left uncreated or their number limited, and routing adjustment can still be completed using collaboration bias and load penalty.

Description

Technical Field

[0001] This invention relates to the field of distributed machine learning and large model training optimization, specifically to distributed parallel training of Mixture-of-Experts (MoE) sparse expert models, and in particular to an MOE expert collaborative modeling and load-aware routing adjustment system and training method. When multiple data sources are involved in training, the system and method improve the stability of expert division of labor, alleviate load imbalance, and reduce the communication and synchronization waiting overhead caused by All-to-All dispatch in the MoE layer, thereby improving training throughput and resource utilization.

Background Technology

[0002] Mixture-of-Experts (MoE) models, as a sparse activation structure, introduce multiple expert subnetworks and a gating network to perform conditional computation selection on input tokens. This ensures that each token activates only a small number of experts to participate in the computation, thus significantly improving model capacity and expressive power at similar computational costs. However, when training MoE models in a distributed environment, the expert subnetworks are deployed across different GPUs, server nodes, and even cloud-edge devices in different regions. During training, for each token in a batch, the gating network outputs the Top-K expert selection results for each token. Tokens need to be distributed across devices according to the selected experts, and the results are collected and aggregated after the experts complete their computations. That is, each device needs to both send tokens to and receive tokens from other devices. When the model size is large, the number of experts is large, the number of parallel devices increases, or the cross-node ratio increases, the cost of this all-to-all communication mode increases rapidly, causing significant synchronization waits and bandwidth contention, thus becoming the main bottleneck for training throughput and even offsetting the computational savings brought by MoE's sparse activation.

[0003] Meanwhile, in cloud-edge-device collaborative scenarios with diverse training data sources, varying sampling frequencies, and different business stages, gating networks face a dual challenge during training if there is a lack of effective adjustment and constraint on distribution changes: On the one hand, token distribution drift leads to fluctuations in routing strategies, making it difficult for expert division of labor to converge stably; on the other hand, some experts may become congested due to continuous high-frequency selection, while less popular experts remain idle for extended periods, resulting in decreased resource utilization and wasted model capacity, thus affecting training throughput and convergence stability. This problem is particularly prominent in cross-domain data scenarios. Significant differences in data distribution across different domains exacerbate the biased distribution of tokens among experts, further leading to load imbalance and tail waiting phenomena (i.e., individual congested experts slow down the overall iteration time). It also makes the process of expert function differentiation slower, and the boundaries of division of labor more easily blurred or even converge, ultimately weakening the model's expressive efficiency and resource utilization level.

[0004] Existing optimization techniques for MoE model training mainly include: fusing and overlapping communication operators, achieving coarse-grained load balancing through capacity factors or auxiliary losses, and scaling model size through sharded parallelism, tensor parallelism, and pipeline parallelism. However, these solutions still have the following shortcomings when training with cross-domain data from cloud, edge, and device sources: (1) they lack explicit modeling of, and constraints on, expert collaboration relationships and routing stability, making it difficult to suppress routing drift and unstable expert division of labor during training; (2) they lack a closed-loop adjustment mechanism for the time evolution of expert load, making it difficult to respond in time to hotspot congestion and tail slowdown caused by changes in domain proportions; (3) they lack a replica enhancement and consistency synchronization / recycling mechanism for persistently hot experts, making it difficult to balance throughput and stability under resource constraints.

[0005] Therefore, there is an urgent need for an optimization method for MoE training that can model the expert collaboration structure and expert preferences, and introduce load-aware adjustment and load time evolution feedback mechanisms into routing decisions. This would suppress routing drift, alleviate hotspot congestion, reduce communication and synchronization waiting overhead caused by All-to-All dispatch in the MoE layer, and improve training throughput and system robustness while ensuring model convergence and performance.

Summary of the Invention

[0006] The purpose of this invention is to provide an MOE expert collaborative modeling and load-aware routing adjustment system and training method to solve problems such as routing instability, expert division of labor drift and load imbalance caused by multi-source data participating in training. It also reduces the communication and synchronization waiting overhead caused by all-to-all dispatch in the MoE layer during multi-GPU / multi-node training, and improves training throughput, scalability and system robustness while ensuring model convergence and performance.

[0007] To achieve the above objectives, the MOE expert collaborative modeling and load-aware routing adjustment system of the present invention includes an offline expert collaborative modeling module, a load time evolution feedback module, a joint routing decision module, and a hotspot expert replica enhancement module. The offline expert collaboration modeling module collects routing logs and training statistics signals during the warm-up training phase or a historical training phase, records the Top-K expert selection results of each token, establishes the co-occurrence relationships and collaboration strengths among experts, and forms an expert collaboration matrix. The load time evolution feedback module collects expert load statistics during the online training phase, constructs a comprehensive load metric from those statistics, and uses a moving average or exponential moving average to obtain the online load time evolution state, which characterizes expert congestion trends and hotspot changes. The joint routing decision module constructs a load penalty term from the online load time evolution state, constructs an expert collaboration bias term from the expert collaboration matrix, combines both with the original scores that the gating network assigns to each expert to calculate a joint routing score per expert, and performs Top-K selection and token distribution based on the joint routing scores. The hotspot expert replica enhancement module identifies the set of persistently congested hot experts from the online load time evolution state, triggers a replica creation strategy for them, and deploys expert replicas on devices that meet resource constraints to offload traffic.

[0008] Furthermore, the joint routing score also includes a communication cost bias; the communication cost bias measures the communication overhead from the token to the expert through communication time.

[0009] Furthermore, the expert load statistics signal includes at least one of the following: the number of tokens received by the expert, the computation time, the queuing time, and the capacity overflow situation.

[0010] Furthermore, the hot experts and their replicas employ a periodic consistency synchronization or threshold-triggered synchronization strategy to maintain parameter consistency.

[0011] Furthermore, the joint routing decision module also includes capacity constraints, overflow control, and a fallback mechanism. A capacity threshold is set for each expert, and when the number of tokens received by an expert exceeds the capacity threshold, overflow control is triggered. Overflow tokens are handled by selecting substitute experts from the candidate experts that have not yet reached capacity, by local rerouting, by deferring the tokens to the next micro-batch, or by discarding them; combined with auxiliary loss constraints, this ensures that the training process can continue to execute.

[0012] Furthermore, the hotspot expert replica enhancement module uses a time window average comprehensive load metric threshold to identify the set of hot experts that are continuously congested.

[0013] Furthermore, the joint routing decision module also includes gated entropy constraints or selection inertia constraints to reduce frequent route switching in the early stages of training.

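[0014] Furthermore, the expert collaboration bias term of expert $i$ for token $x$ is computed from the entries $W_{ij}$ of the expert collaboration matrix over the experts $j$ in $\mathcal{C}(x)$, where $\mathcal{C}(x)$ is the current set of candidate experts of token $x$, and $i$ and $j$ are expert IDs. The matrix entry is $W_{ij} = n_{ij} / n$, where $n$ is the total number of tokens counted within the window and $n_{ij}$ is the number of times expert $i$ and expert $j$ appear together in the Top-K set of the same token.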

[0015] This invention provides an MOE expert collaborative modeling and load-aware routing training method which, based on the MOE expert collaborative modeling and load-aware routing adjustment system, performs the following steps:
S1: Initialize the training device parameters, expert deployment mapping, gated routing parameters, collaboration bias parameters, load penalty parameters, capacity control parameters, hot expert identification threshold, and hot expert replica parameter synchronization strategy.
S2: Enter the warm-up training phase, collect routing logs and training statistical signals, construct the expert collaboration matrix, and determine the expert collaboration bias term.
S3: Enter the online training phase; in each iteration, the gating network calculates the original score of each expert, the instantaneous expert load is counted, sliding statistics or an exponential moving average are used to obtain the online load time evolution state, and the load penalty term is determined.
S4: Calculate the joint routing score of each expert from the load penalty term, the expert collaboration bias term, and the original score of each expert; perform Top-K selection and token distribution based on the joint routing scores.
S5: Collect and aggregate expert outputs, complete forward and backward propagation, and update model parameters.
S6: Continuously monitor capacity overflow and expert load status, trigger overflow control, identify hot experts, and dynamically create / synchronize / reclaim replicas of hot experts.
S7: Adjust the batch size under video memory constraints, and select the execution plan and corresponding batch size that maximize training throughput or minimize tail iteration time.

[0016] The beneficial effects of this invention include: (1) By modeling the expert collaboration structure based on historical routing behavior and injecting collaboration bias, when the scores of candidate experts are similar, routing tends to choose expert combinations with strong collaboration relationships or in the same cluster, thereby improving the stability of expert division of labor and suppressing routing drift. (2) By modeling the time evolution of the expert load statistical signals and constructing a load penalty term, a load feedback closed loop is introduced into the routing decision, which suppresses further congestion of persistently hot experts, alleviates the straggler phenomenon, and improves training throughput stability. (3) For persistently hot experts, a replica enhancement, synchronization, and recycling mechanism is introduced to split traffic and reduce tail iteration time when resources allow; when resources are limited (e.g., a single machine with 8×RTX 4090 has no extra video memory), replicas may not be created or their number may be limited, and routing adjustment can still be completed by relying on collaboration bias and load penalty. (4) A communication cost bias is introduced to reduce the risk of synchronization waiting caused by dispatch and to improve overall resource utilization.

Attached Figure Description

[0017] Figure 1 is a schematic diagram of the MoE model architecture.

[0018] Figure 2 is a schematic diagram of a distributed MoE training architecture.

[0019] Figure 3 is a schematic diagram of the expert collaborative modeling process and expert replica settings described in this invention.

[0020] Figure 4 is a schematic diagram of the load-aware routing adjustment described in this invention.

[0021] Figure 5 is a schematic diagram comparing the effects of Embodiment 2 of the present invention.

[0022] Figure 6 is a schematic diagram comparing the effects of three implementations of the present invention.

Detailed Implementation

[0023] The technical solutions in the embodiments of the present invention will be clearly and completely described below. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of the present invention.

[0024] The technical terms used in this invention are explained as follows: Large Model Distributed Training: With the development of deep learning technology, the size of models continues to increase, significantly raising the demand for computing, memory, and communication resources. Distributed training achieves consistent model updates by distributing data and computing tasks across multiple GPUs or nodes for parallel execution and synchronizing gradients, parameters, or intermediate states between devices through communication networks. This improves training throughput and reduces the load on a single device.

[0025] Mixture of Experts (MoE): As shown in Figure 1, the mixture-of-experts model is a sparse activation structure composed of multiple expert networks (experts 1, 2, and 3 in the figure) and a gating network (the router in the figure). The gating network generates routing weights based on the features of the input token and selects the Top-K experts to participate in the computation; the output of the MoE layer is obtained by fusing the outputs of the activated experts according to the weights, thereby expanding the model parameter capacity and expressive power while controlling the computational cost.

[0026] Cloud-Edge-Device Collaboration Scenario: This refers to a training system whose data sources and computing resources may involve multiple sources, such as cloud data centers, edge nodes, and terminal devices.

[0027] Expert Collaboration Modeling: This refers to constructing a collaboration matrix or collaboration graph based on the co-occurrence relationship of different experts being selected by the same token according to the statistics of historical routing logs, and further clustering the experts to obtain expert clusters, which are used to stabilize the division of labor among experts and give priority to experts with strong collaboration relationships or in the same cluster when routing.

[0028] Load-aware routing adjustment refers to introducing penalty / bias terms related to expert load and its temporal evolution on the basis of the original scoring of the gating network, and jointly adjusting the routing with expert collaboration bias. This allows the routing to suppress persistent hot experts, alleviate tail waiting, and improve training throughput stability while ensuring model training performance. In distributed training, communication cost bias can also be optionally configured to further reduce all-to-all communication overhead.

[0029] Hot Expert Replica: When some experts are under high load or congestion for a long time during training, a replica of the expert is created on another device to distribute the load. Periodic synchronization or threshold-triggered synchronization keeps the replica parameters consistent with those of the primary expert, so as to reduce tail waiting and improve training throughput stability.

[0030] The MOE mixture-of-experts model is denoted as $\mathcal{E} = \{E_1, E_2, \dots, E_N\}$, where $N$ is the number of experts. During training in a distributed environment, as shown in Figure 2, the expert subnetworks are deployed across different GPUs, server nodes, and even cloud-edge devices in different regions. Let the set of device resources be denoted as $\mathcal{D} = \{d_1, d_2, \dots, d_M\}$, where $M$ is the number of devices. When the MOE experts are deployed on a single-machine multi-GPU server, $M$ is the number of GPUs. The deployment mapping between experts and devices is denoted as $\pi: \mathcal{E} \to \mathcal{D}$, which indicates the device where each expert is located. The optimization objective of an MOE training iteration is to minimize the training iteration time (including tail iteration time) while satisfying the memory constraint or, in an equivalent formulation, to maximize the training throughput. Let the memory cost be $\mathrm{Mem}(P, B)$, the time cost be $\mathrm{Time}(P, B)$, the maximum video memory be $\mathrm{Mem}_{\max}$, and the batch size be $B$. Then:

$$(P^*, B^*) = \arg\min_{P,\; B \in \mathbb{Z}^+} \mathrm{Time}(P, B) \quad \text{s.t.} \quad \mathrm{Mem}(P, B) \le \mathrm{Mem}_{\max}$$

where $P$ is the training execution plan, that is, the training execution scheme adopted under a given device resource set, expert deployment mapping, and parallel configuration; $P^*$ and $B^*$ are the optimal training execution plan and the optimal batch size that minimize the training time cost while satisfying the memory constraint; and $\mathbb{Z}^+$ is the set of positive integers. The memory cost $\mathrm{Mem}(P, B)$ includes at least the following three types of overhead: (1) model state overhead (including base model parameters, MoE expert parameters, gradient buffers, optimizer state, etc.); (2) intermediate activation overhead (including activations saved in the forward pass for the backward pass, MoE routing indices, dispatch cache and eviction cache, etc.); (3) additional overhead (including operator temporary workspace, communication buffers, replica synchronization temporary storage, etc.).
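The constrained selection of an execution plan and batch size described above can be illustrated with a small exhaustive search. This is only a sketch under stated assumptions: the function name `select_plan_and_batch`, the string plan labels, and the two cost callbacks are hypothetical and do not come from the patent text, which does not specify how the memory and time costs are estimated. The time cost is passed in as a per-sample cost so that minimizing it corresponds to maximizing throughput.

```python
from typing import Callable, Iterable, Optional, Tuple


def select_plan_and_batch(
    plans: Iterable[str],
    batch_sizes: Iterable[int],
    mem_cost: Callable[[str, int], float],
    time_cost: Callable[[str, int], float],
    mem_max: float,
) -> Optional[Tuple[str, int]]:
    """Return the feasible (plan, batch size) pair with the lowest time cost.

    A pair is feasible when its estimated memory cost does not exceed
    mem_max. Returns None when no pair is feasible.
    """
    candidates = [b for b in batch_sizes if b > 0]  # batch size must be a positive integer
    best: Optional[Tuple[str, int]] = None
    best_cost = float("inf")
    for plan in plans:
        for b in candidates:
            if mem_cost(plan, b) > mem_max:
                continue  # violates the video memory constraint
            cost = time_cost(plan, b)
            if cost < best_cost:
                best, best_cost = (plan, b), cost
    return best


# Toy cost models (hypothetical numbers, for illustration only).
def toy_mem(plan: str, b: int) -> float:
    base = {"ep8": 10.0, "ep4_replica": 14.0}[plan]
    return base + 0.5 * b


def toy_time_per_sample(plan: str, b: int) -> float:
    fixed, per_sample = {"ep8": (5.0, 1.0), "ep4_replica": (4.0, 0.8)}[plan]
    return (fixed + per_sample * b) / b
```

With these toy models and a 24-unit memory limit, the search picks the replica-based plan at the largest batch size that still fits in memory. A real system would replace the callbacks with profiled or modeled costs.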

[0031] The MOE expert collaborative modeling and load-aware routing adjustment system of the present invention includes an offline expert collaborative modeling module, a load time evolution feedback module, a joint routing decision module, and a hotspot expert replica enhancement module.

[0032] As shown in Figure 3, the offline expert collaboration modeling module collects routing logs and training statistical signals during the warm-up training phase or a historical training phase, records the Top-K expert selection results for each token, establishes co-occurrence relationships and collaboration strengths among experts, and forms an expert collaboration matrix $W$. The elements of the expert collaboration matrix are:

$$W_{ij} = \frac{n_{ij}}{n}$$

[0033] where $n_{ij}$ is the number of times expert $i$ and expert $j$ appear together in the Top-K set of the same token, and $n$ is the total number of tokens counted within the window.

[0034] Clustering or community discovery is performed on the expert collaboration matrix $W$ to obtain the expert cluster set. The expert cluster set represents the complementary functions and collaborative relationships among experts. To reduce logging overhead, routing logs can be collected by sampling (saving only the Top-K indices and counts), or $W$ can be updated with a sliding window or exponential smoothing to adapt to changes across training phases.
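A minimal sketch of the co-occurrence counting described above, assuming the routing log is available as one list of selected expert indices per token and that each matrix entry is the co-occurrence count divided by the number of logged tokens. The function name `build_collaboration_matrix` is hypothetical; clustering or community discovery on the resulting matrix is left out.

```python
from itertools import combinations
from typing import List, Sequence


def build_collaboration_matrix(
    topk_log: Sequence[Sequence[int]], num_experts: int
) -> List[List[float]]:
    """Build a symmetric expert collaboration matrix from Top-K routing logs.

    topk_log[t] holds the expert indices selected for token t. Entry [i][j]
    is the fraction of logged tokens whose Top-K set contains both i and j.
    """
    counts = [[0] * num_experts for _ in range(num_experts)]
    for selected in topk_log:
        # Deduplicate so that a repeated index is not counted as a pair.
        for i, j in combinations(sorted(set(selected)), 2):
            counts[i][j] += 1
            counts[j][i] += 1
    total = len(topk_log)
    if total == 0:
        return [[0.0] * num_experts for _ in range(num_experts)]
    return [[c / total for c in row] for row in counts]


# Four logged tokens, Top-2 routing over three experts.
example_log = [[0, 1], [0, 1], [1, 2], [0, 2]]
example_W = build_collaboration_matrix(example_log, num_experts=3)
```

In this example experts 0 and 1 are selected together for two of the four tokens, so their entry is 0.5, while the other pairs each get 0.25. For long runs the counts could be kept in a sliding window or decayed exponentially, as the paragraph above suggests.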

[0035] The load time evolution feedback module collects expert load statistics during the online training phase; constructs a comprehensive load metric based on the load statistics, and uses a sliding statistics or exponential moving average method to obtain the online load time evolution state, which is used to characterize expert congestion trends and hotspot changes.

[0036] Expert load statistics include at least one of the following: the number of tokens received by the expert, computation time, queuing time, and capacity overflow. These are weighted to form a comprehensive load metric, as shown in the following formula:

$$L_i(t) = \alpha\,\mathrm{tok}_i(t) + \beta\,\mathrm{comp}_i(t) + \gamma\,\mathrm{wait}_i(t) + \delta\,\mathrm{ovf}_i(t)$$

where $L_i(t)$ is the comprehensive load metric of the $i$-th expert at time $t$; $\mathrm{tok}_i(t)$ is the number of tokens assigned to the $i$-th expert at time $t$; $\mathrm{comp}_i(t)$ is the computation time the $i$-th expert spends processing its tokens at time $t$; $\mathrm{wait}_i(t)$ is the queuing (waiting) time of the $i$-th expert at time $t$; $\mathrm{ovf}_i(t)$ is the number of overflow tokens exceeding the capacity limit of the $i$-th expert at time $t$; and $\alpha$, $\beta$, $\gamma$, and $\delta$ are the weighting coefficients of the corresponding statistical signals.

[0037] The online load time evolution state obtained by the exponential moving average method is expressed as follows:

$$\tilde{L}_i(t) = \mu\,\tilde{L}_i(t-1) + (1-\mu)\,L_i(t)$$

where $L_i(t)$ is the comprehensive load metric of the $i$-th expert at time $t$ and $\mu \in [0, 1)$ is the smoothing coefficient.

As shown in Figure 4, the joint routing decision module constructs a load penalty term based on the online load time evolution state, constructs an expert collaboration bias term based on the expert collaboration matrix, calculates the joint routing score of each expert by combining these with the original scores that the gating network assigns to each expert, and performs Top-K selection and token distribution based on the joint routing scores.
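The weighted load metric and its exponential moving average can be sketched as follows. The class and field names (`LoadSample`, `LoadTracker`, `decay`) are hypothetical, the default weights are placeholders, and the convention that `decay` multiplies the previous state is an assumption; the source only states that a moving average or exponential moving average is used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LoadSample:
    """Per-expert load statistics observed in one training iteration."""
    tokens: float        # number of tokens routed to the expert
    compute_time: float  # time spent processing those tokens
    queue_time: float    # time the tokens spent waiting
    overflow: float      # tokens beyond the expert's capacity


def load_metric(s: LoadSample, alpha: float = 1.0, beta: float = 1.0,
                gamma: float = 1.0, delta: float = 1.0) -> float:
    """Weighted sum of the four load signals (weights are tunable assumptions)."""
    return (alpha * s.tokens + beta * s.compute_time
            + gamma * s.queue_time + delta * s.overflow)


class LoadTracker:
    """Keeps an exponential moving average of each expert's load metric."""

    def __init__(self, num_experts: int, decay: float = 0.9) -> None:
        if not 0.0 <= decay < 1.0:
            raise ValueError("decay must be in [0, 1)")
        self.decay = decay
        self.state = [0.0] * num_experts

    def update(self, samples: List[LoadSample], **weights: float) -> List[float]:
        """Fold one iteration of samples into the smoothed state and return a copy."""
        for i, s in enumerate(samples):
            current = load_metric(s, **weights)
            self.state[i] = self.decay * self.state[i] + (1.0 - self.decay) * current
        return list(self.state)
```

A larger `decay` makes the state react more slowly, which favors detecting sustained congestion over short bursts. Because the state starts at zero, the first few iterations underestimate the load; a bias correction could be added if that matters.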

[0038] In the MoE model, a given MoE layer contains $N$ expert networks $\{f_i\}_{i=1}^{N}$ and a gating network $g$, where $N$ is the number of expert networks in this MoE layer; $f_i$ is the mapping function of the $i$-th expert network; $\{f_i\}_{i=1}^{N}$ is the set of all expert networks in this MoE layer; and $g$ is the mapping function of the gating network, used for expert selection and routing of the input token. The input token is denoted as $x$ and its hidden state as $h$. The gating network outputs a raw score for each expert. The linear scoring formula is as follows:

$$s_i(x) = \langle w_i, h \rangle$$

where $s_i(x)$ is the original score of the $i$-th expert for input $x$; $x$ is the input token or input feature; $h$ is the hidden representation of $x$ obtained after feature transformation; $w_i$ is the gating weight vector of the $i$-th expert; and $\langle w_i, h \rangle$ is the inner product of the vectors $w_i$ and $h$.

[0039] The gating network employs a Top-K sparse selection mechanism. It first selects the $K$ highest-scoring experts from all expert scores to form the candidate set:

$$\mathcal{C}(x) = \operatorname{TopK}\big(\{s_i(x)\}_{i=1}^{N},\, K\big)$$

Then the gate weights are obtained by computing Softmax within the candidate set:

$$p_i(x) = \frac{\exp\big(s_i(x)\big)}{\sum_{j \in \mathcal{C}(x)} \exp\big(s_j(x)\big)}, \quad i \in \mathcal{C}(x)$$

Finally, the output of the MoE layer is a weighted sum of the outputs of the activated experts:

$$y(x) = \sum_{i \in \mathcal{C}(x)} p_i(x)\, f_i(h)$$

Here $s_i(x)$ is the original score of the $i$-th expert, $f_i$ is the mapping function of the $i$-th expert network, and $h$ is the hidden representation of token $x$.

Token distribution during distributed training follows a "bucket by expert, then communicate" approach: tokens routed to the same expert are bundled and sent to the device where that expert resides, and the outputs are collected and aggregated after each device completes its expert-parallel computation. Dispatch and collection can be implemented using All-to-All or equivalent communication operators.
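The Top-K gating path described above (linear scoring, Top-K candidate selection, Softmax within the candidate set, weighted sum of expert outputs) can be sketched in plain Python for a single token. All function names are hypothetical, experts are modeled as ordinary callables on the hidden vector, and the All-to-All dispatch step is omitted.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def raw_scores(h: Sequence[float], gate_w: Sequence[Sequence[float]]) -> List[float]:
    """Linear gate: one inner product per expert."""
    return [sum(w * v for w, v in zip(w_i, h)) for w_i in gate_w]


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest scores; ties keep the lower index first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def candidate_softmax(scores: Sequence[float], candidates: Sequence[int]) -> Dict[int, float]:
    """Softmax restricted to the candidate set (max-shifted for stability)."""
    m = max(scores[i] for i in candidates)
    exps = {i: math.exp(scores[i] - m) for i in candidates}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


def moe_layer(h: Sequence[float], gate_w: Sequence[Sequence[float]],
              experts: Sequence[Callable[[Sequence[float]], Vector]], k: int) -> Vector:
    """Weighted sum of the outputs of the Top-K experts for one token."""
    scores = raw_scores(h, gate_w)
    candidates = top_k(scores, k)
    weights = candidate_softmax(scores, candidates)
    outputs = {i: experts[i](h) for i in candidates}  # only selected experts run
    dim = len(next(iter(outputs.values())))
    result = [0.0] * dim
    for i in candidates:
        result = [r + weights[i] * o for r, o in zip(result, outputs[i])]
    return result
```

Only the selected experts are evaluated, which is the source of the sparse-activation savings. In a distributed setting the per-expert calls would be replaced by bucketing tokens per expert and exchanging them between devices.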

[0040] In this invention, the joint routing decision module constructs a load penalty term for each expert as a function of that expert's online load time evolution state. The purpose of the load penalty is to prevent experts with consistently high loads from being further selected in routing decisions.

[0041] The expert collaboration bias term is constructed from the expert collaboration matrix. For expert $i$, it aggregates the collaboration matrix entries $W_{ij}$ over the experts $j$ in $\mathcal{C}(x)$, where $\mathcal{C}(x)$ is the current set of candidate experts of token $x$, and $i$ and $j$ are expert numbers. The role of the expert collaboration bias is to prioritize retaining experts with a higher degree of collaboration with the overall candidate set when the original scores of multiple experts are close, thereby reducing drastic changes in routing combinations.
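The two adjustment terms can be sketched as below. The concrete functional forms are assumptions chosen for illustration, since the exact expressions are not recoverable from this text: the penalty grows linearly with how far an expert's smoothed load exceeds the mean smoothed load, and the bias is the mean collaboration strength between an expert and the other candidates. The names `load_penalty`, `collaboration_bias`, and `scale` are hypothetical.

```python
from typing import Dict, List, Sequence


def load_penalty(smoothed_load: Sequence[float], scale: float = 1.0) -> List[float]:
    """Penalty per expert, zero for experts at or below the mean load.

    Assumed form: scale * max(0, load / mean_load - 1).
    """
    mean = sum(smoothed_load) / len(smoothed_load)
    if mean <= 0.0:
        return [0.0] * len(smoothed_load)
    return [scale * max(0.0, load / mean - 1.0) for load in smoothed_load]


def collaboration_bias(W: Sequence[Sequence[float]], candidates: Sequence[int],
                       scale: float = 1.0) -> Dict[int, float]:
    """Bias per candidate expert.

    Assumed form: mean of W[i][j] over the other candidates j, times scale.
    """
    bias: Dict[int, float] = {}
    for i in candidates:
        others = [j for j in candidates if j != i]
        if not others:
            bias[i] = 0.0
            continue
        bias[i] = scale * sum(W[i][j] for j in others) / len(others)
    return bias
```

Using a mean rather than a sum keeps the bias on the same scale when the candidate set changes size, which makes the scale parameter easier to tune against the raw gate scores.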

[0042] The collaboration bias and load penalty are injected into the original scores that the gating network assigns to each expert, yielding a joint routing score for each expert: the collaboration bias raises the score of experts that collaborate strongly with the candidate set, and the load penalty lowers the score of experts under sustained high load.

The joint routing decision module also includes capacity constraints, overflow control, and a fallback mechanism. Specifically, a capacity threshold is set for each expert. When the number of tokens received by an expert exceeds the threshold, overflow control is triggered. Overflow tokens are handled by selecting substitute experts from the candidate experts that have not yet reached capacity, by local rerouting, by deferring the tokens to the next micro-batch, or by discarding them, all in conjunction with auxiliary loss constraints to ensure that the training process can continue to execute. When the capacity constraint cannot be met or the load surges, a "collaboration priority + low load priority" replacement strategy is preferred: substitute experts are chosen among those that have not reached capacity, favoring experts that have a higher degree of collaboration with the overall candidate set and a lower load. This reduces the training oscillations and tail waiting caused by rerouting.
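A minimal sketch of joint scoring with a capacity-aware fallback, for one token. It assumes an additive combination (raw score plus weighted bias minus weighted penalty), which the text suggests but does not spell out, and it implements only the "pick the next-best candidate with spare capacity" fallback; deferring or dropping overflow tokens is left to the caller. The function name and the parameters `w_bias` and `w_load` are hypothetical.

```python
from typing import Dict, List, Sequence


def route_token(raw: Sequence[float], candidates: Sequence[int],
                bias: Dict[int, float], penalty: Sequence[float],
                load_count: List[int], capacity: Sequence[int], k: int,
                w_bias: float = 1.0, w_load: float = 1.0) -> List[int]:
    """Choose up to k experts for one token by joint score, skipping full experts.

    load_count is updated in place for every expert that accepts the token.
    The result may hold fewer than k experts when too many candidates are
    full; the caller then decides whether to defer or drop the remainder.
    """
    joint = {i: raw[i] + w_bias * bias[i] - w_load * penalty[i] for i in candidates}
    ranked = sorted(candidates, key=lambda i: joint[i], reverse=True)
    chosen: List[int] = []
    for i in ranked:
        if len(chosen) == k:
            break
        if load_count[i] >= capacity[i]:
            continue  # expert is at capacity, fall back to the next candidate
        chosen.append(i)
        load_count[i] += 1
    return chosen
```

Because the fallback walks the same joint-score ranking, a substitute expert is automatically one with relatively high collaboration and low load, in the spirit of the "collaboration priority + low load priority" strategy.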

[0043] When measurement conditions are available, this invention can further add a communication cost bias, which measures the communication overhead from a token to an expert by the communication time and guides tokens to be preferentially distributed to experts on devices with lower communication costs. The communication cost bias, along with the collaboration bias and load penalty, is injected into the original scores that the gating network assigns to each expert.

[0044] The joint routing decision module also includes stability constraints, such as gated entropy constraints or selection inertia constraints, to reduce frequent route switching in the early stages of training.

[0045] The hotspot expert replica enhancement module identifies a set of continuously congested hot experts based on the online load time evolution status, triggers a replica creation strategy for hot experts, and deploys expert replicas on devices that meet resource constraints for traffic offloading.

[0046] The Hotspot Expert Replica Enhancement Module addresses the tail-waiting issue caused by a few experts being under high load for extended periods. When a popular expert is detected, replicas are created to distribute the load, provided resources allow, and a synchronization strategy maintains consistency between the primary and secondary replicas, thereby improving throughput stability.

[0047] This invention uses a threshold on the time-window average of the comprehensive load metric to identify the set of persistently congested hot experts, as expressed below:

$$\mathcal{H}(t) = \Big\{\, i \;:\; \frac{1}{T_w} \sum_{\tau = t - T_w + 1}^{t} L_i(\tau) > \theta \,\Big\}$$

where $L_i(\tau)$ is the comprehensive load metric of the $i$-th expert at time $\tau$, $T_w$ is the window length, and $\theta$ is the hot expert threshold. For each hot expert $i \in \mathcal{H}(t)$, a set of devices $\mathcal{D}_i$ is selected for replica deployment, and the mapping table is updated so that the router can distribute tokens to either the primary expert or its replicas. GPUs with a low current total load and sufficient memory are prioritized, to reduce tail waiting and avoid memory overflow. Periodic or threshold-triggered synchronization is used between a hot expert and its replicas to maintain parameter consistency. Replicas are reclaimed when congestion eases or memory pressure increases, to reduce the additional memory and synchronization overhead. Synchronization can cover only the expert weight parameters or the complete training state, including the optimizer state, depending on system resources and training phase requirements.
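Hot expert detection and replica placement can be sketched as follows. This assumes a fixed-length window of per-iteration load metrics and a greedy placement rule (the least-loaded device with enough free memory); the class `HotExpertDetector`, the function `pick_replica_device`, and the device labels are hypothetical, and parameter synchronization and replica reclamation are not shown.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List, Optional, Sequence


class HotExpertDetector:
    """Flags experts whose window-averaged load exceeds a threshold."""

    def __init__(self, num_experts: int, window: int, threshold: float) -> None:
        self.window = window
        self.threshold = threshold
        self.history: List[Deque[float]] = [deque(maxlen=window) for _ in range(num_experts)]

    def observe(self, loads: Sequence[float]) -> None:
        """Record one iteration of per-expert load metrics."""
        for hist, load in zip(self.history, loads):
            hist.append(load)

    def hot_experts(self) -> List[int]:
        """Experts with a full window whose average load is above the threshold."""
        return [
            i for i, hist in enumerate(self.history)
            if len(hist) == self.window and sum(hist) / self.window > self.threshold
        ]


def pick_replica_device(device_load: Dict[str, float], free_memory: Dict[str, float],
                        replica_size: float, exclude: Iterable[str] = ()) -> Optional[str]:
    """Least-loaded device with enough free memory, or None if none qualifies."""
    excluded = set(exclude)
    feasible = [d for d in device_load
                if d not in excluded and free_memory[d] >= replica_size]
    if not feasible:
        return None  # resource-limited case: no replica is created
    return min(feasible, key=lambda d: device_load[d])
```

Requiring a full window before flagging an expert avoids creating replicas in response to a single busy iteration. Returning `None` when no device qualifies matches the resource-limited case, in which routing relies on the collaboration bias and load penalty alone.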

[0048] Based on the above-mentioned MOE expert collaborative modeling and load-aware routing adjustment system, this invention also provides an MOE expert collaborative modeling and load-aware routing training method, which specifically includes the following steps:
S1: Initialize the training device parameters, expert deployment mapping, gated routing parameters, collaboration bias parameters, load penalty parameters, capacity control parameters, hot expert identification threshold, and hot expert replica parameter synchronization strategy.
S2: Enter the warm-up training phase, collect routing logs and training statistical signals, construct the expert collaboration matrix, and determine the expert collaboration bias term.
S3: Enter the online training phase; in each iteration, the gating network calculates the original score of each expert, the instantaneous expert load is counted, sliding statistics or an exponential moving average are used to obtain the online load time evolution state, and the load penalty term is determined.
S4: Calculate the joint routing score of each expert from the load penalty term, the expert collaboration bias term, and the original score of each expert; perform Top-K selection and token distribution based on the joint routing scores.
S5: Collect and aggregate expert outputs, complete forward and backward propagation, and update model parameters.
S6: Continuously monitor capacity overflow and expert load status, trigger overflow control, identify hot experts, and dynamically create / synchronize / reclaim replicas of hot experts.
S7: Adjust the batch size under video memory constraints, and select the execution plan and corresponding batch size that maximize training throughput or minimize tail iteration time.

[0049] Example 2. This embodiment was conducted on a single server equipped with eight NVIDIA GeForce RTX 4090 GPUs. The training environment was fixed as single-machine multi-GPU distributed training, with no cross-node networking involved. The aim was to compare the training performance of the method of this invention with that of several common MoE routing and load control baseline schemes. The comparison schemes were: the original MoE routing scheme (gated Top-K routing only), an MoE scheme that introduces only capacity constraints and overflow control, and an MoE scheme that introduces only load balancing adjustment. For a fair comparison, this embodiment used a uniform training configuration for every scheme, tuned each scheme within its adjustable parameter range, and reports the best result obtained for each. GPT-like models of several sizes (e.g., 0.3B, 1.5B, and 2.7B) were used as the experimental models to verify the performance differences across model sizes.

[0050] As shown in Figure 5, the training throughput of the method of this invention is higher than that of the comparison schemes at every model scale. The original MoE routing scheme lacks load feedback adjustment and is therefore prone to long-term congestion at hotspot experts, which increases iteration time fluctuation and tail waiting time. The MoE scheme that introduces only capacity constraints and overflow control can alleviate extreme overflow, but its ability to suppress persistent hotspots is limited. The MoE scheme that introduces only load balancing adjustment lacks guidance from an expert collaboration structure, which limits its gains in routing stability and throughput. The method of this invention introduces a collaboration bias and combines it with load time evolution feedback to adjust routing jointly, which alleviates expert congestion and uneven resource utilization, improves training throughput, and enhances iteration stability.

[0051] Example 3. This embodiment compares the scalability of the method of the present invention with that of a comparative scheme that does not enable the key routing adjustment mechanisms of the present invention (such as the original MoE routing scheme), using different numbers of GPUs on a single server equipped with eight NVIDIA GeForce RTX 4090 GPUs.

[0052] As shown in Figure 6, the method of this invention shows a better throughput scaling trend: its throughput grows faster as the number of GPUs increases, indicating that it uses multi-GPU resources more effectively. The throughput growth of the comparative scheme gradually slows down, suggesting that it is more susceptible to uneven load distribution and tail latency as the number of GPUs increases. Through load-aware routing adjustment, and hotspot traffic offloading when needed, the method of this invention improves the expert load distribution and reduces overload or idleness on individual GPUs, thereby achieving higher training throughput and better scalability.

Claims

1. An MOE expert collaborative modeling and load-aware routing adjustment system, characterized in that it includes an offline expert collaborative modeling module, a load time evolution feedback module, a joint routing decision module, and a hotspot expert replica enhancement module; the offline expert collaborative modeling module collects routing logs and training statistics signals during the warm-up training phase or a historical training phase, tallies the Top-K expert selection results of each token, establishes the co-occurrence relationships and collaboration strengths among experts, and forms an expert collaboration matrix; the load time evolution feedback module collects expert load statistics signals during the online training phase, constructs a comprehensive load metric from the load statistics signals, and obtains the online load time evolution state using sliding statistics or an exponential moving average, to characterize expert congestion trends and hotspot changes; the joint routing decision module constructs a load penalty term based on the online load time evolution state, constructs an expert collaboration bias term based on the expert collaboration matrix, calculates the joint routing score of each expert by combining these terms with the original score of each expert from the gating network, and performs Top-K selection and token distribution based on the joint routing scores; the hotspot expert replica enhancement module identifies the set of persistently congested hotspot experts based on the online load time evolution state, triggers a replica creation strategy for the hotspot experts, and deploys expert replicas on devices that satisfy the resource constraints for traffic offloading.

2. The MOE expert collaborative modeling and load-aware routing adjustment system according to claim 1, characterized in that: the joint routing score also includes a communication cost bias; the communication cost bias measures, in terms of communication time, the communication overhead of sending a token to an expert.

3. The MOE expert collaborative modeling and load-aware routing adjustment system according to claim 1, characterized in that: the expert load statistics signal includes at least one of the following: the number of tokens received by the expert, computation time, queuing time, and capacity overflow.

4. The MOE expert collaborative modeling and load-aware routing adjustment system according to claim 1, characterized in that: the hotspot experts and their replicas employ a periodic synchronization strategy or a threshold-triggered synchronization strategy to maintain parameter consistency.

5. The MOE expert collaborative modeling and load-aware routing adjustment system according to claim 1, characterized in that: the joint routing decision module also includes capacity constraints, overflow control, and a fallback mechanism. Specifically, a capacity threshold is set for each expert, and overflow control is triggered when the number of tokens received by an expert exceeds its capacity threshold. The module ensures that the training process continues without interruption by selecting, from the candidate experts, an expert that has not reached its capacity threshold, by locally rerouting overflow tokens, by deferring overflow tokens to the next micro-batch, or by discarding them, in conjunction with auxiliary loss constraints.

6. The MOE expert collaborative modeling and load-aware routing adjustment system according to claim 1, characterized in that: the hotspot expert replica enhancement module applies a threshold to the time-window average of the comprehensive load metric to identify the set of persistently congested hotspot experts.

7. The MOE expert collaborative modeling and load-aware routing adjustment system according to claim 1, characterized in that: the joint routing decision module also includes gating entropy constraints or selection inertia constraints to reduce frequent route switching in the early stages of training.

8. The MOE expert collaborative modeling and load-aware routing adjustment system according to claim 1, characterized in that: the expert collaboration bias is given by a formula [formula not reproduced in the source text] whose terms are: the current set of candidate experts of token x; the expert IDs i and j; the total number of tokens counted within the window; and the number of times expert i and expert j appear together in the Top-K set of the same token.

9. An MOE expert collaborative modeling and load-aware routing training method, characterized in that, using the MOE expert collaborative modeling and load-aware routing adjustment system according to any one of claims 1-8, the following steps are performed:
S1: Initialize the training device parameters, the expert deployment mapping, the gated routing parameters, the collaboration bias parameters, the load penalty parameters, the capacity control parameters, the hotspot expert identification threshold, and the hotspot expert replica parameter synchronization strategy;
S2: Enter the warm-up training phase, collect routing logs and training statistics signals, construct the expert collaboration matrix, and determine the expert collaboration bias term;
S3: Enter the online training phase; in each iteration, compute the original score of each expert with the gating network, count the instantaneous expert load, obtain the online load time evolution state using sliding statistics or an exponential moving average, and determine the load penalty term;
S4: Calculate the joint routing score of each expert from the load penalty term, the expert collaboration bias term, and the original score of each expert; perform Top-K selection and token distribution based on the joint routing score;
S5: Collect and aggregate the expert outputs, complete forward and backward propagation, and update the model parameters;
S6: Continuously monitor capacity overflow and expert load status, trigger overflow control, identify hotspot experts, and dynamically create, synchronize, and reclaim replicas of hotspot experts;
S7: Adjust the batch size under GPU memory constraints, and select the execution plan and corresponding batch size that maximize training throughput or minimize tail iteration time.