Method for optimizing model training and electronic device
By statistically analyzing and pruning the expert network load of each processing layer during the training of the hybrid expert model, the problem of unbalanced expert network load is solved, thereby improving the utilization of computing resources and the efficiency of model training.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- INSPUR SUZHOU INTELLIGENT TECH CO LTD
- Filing Date
- 2026-05-29
- Publication Date
- 2026-06-26
AI Technical Summary
Hybrid expert models suffer from an imbalance in expert network load, leading to low utilization of computing resources and impacting model training effectiveness and performance.
By acquiring the training data, statistically analyzing the expert routing information of each processing layer in the hybrid expert model, analyzing the load, determining the expert pruning index, and pruning the expert network of each processing layer, the training process of the hybrid expert model is optimized.
It improves the utilization of computational resources in hybrid expert models, enhances model training efficiency and performance, and reduces computational and memory overhead.
Figure CN122287752A_ABST
Description
Technical Field
[0001] This application relates to the field of model training technology, and in particular to an optimization method and electronic device for model training.
Background Technology
[0002] Currently, various processors (such as Graphics Processing Units, GPUs) mostly employ Mixture-of-Experts (MoE) models when processing tasks. This model, by deploying multiple parallel expert networks at each layer and using routers to select a small number of experts to participate in computation, significantly reduces actual computational overhead while maintaining a large parameter scale. Therefore, it is widely used in fields such as natural language processing, computer vision, and multimodal tasks. However, in practical applications, these processors suffer from low computational resource utilization. This is because the Mixture-of-Experts model used by these processors exhibits an imbalance in expert network load; that is, the routers in the Mixture-of-Experts model tend to concentrate most of the tokens on a small number of experts, leading to some experts being overloaded while others remain idle for extended periods.
Summary of the Invention
[0003] This application provides an optimization method and electronic device for model training. Its main purpose is to solve the problem of low processor computing resource utilization caused by the unbalanced load of expert networks in hybrid expert models in related technologies.
[0004] According to a first aspect of this application, an optimization method for model training is provided, comprising: obtaining training data; inputting the training data into the hybrid expert model and collecting statistics on the expert routing information of each processing layer in at least one processing layer of the hybrid expert model; analyzing, based on the expert routing information, the load of at least one expert network in each processing layer to determine the expert pruning index of each processing layer, where the load reflects the amount of data processed by the at least one expert network when processing the training data; and pruning, based on the expert pruning index, the expert networks contained in each processing layer, and training the hybrid expert model based on the pruned processing layers to obtain the trained hybrid expert model.
[0005] According to a second aspect of this application, an electronic device is provided, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions enable the at least one processor to perform the optimization method for model training described in the first aspect.
[0006] This application provides an optimization method and electronic device for model training, relating to the field of model training technology. Compared with related technologies, this application obtains training data; inputs the training data into a hybrid expert model and statistically analyzes the expert routing information of each processing layer in at least one processing layer of the hybrid expert model; based on the expert routing information, analyzes the load of at least one expert network in each processing layer to determine the expert pruning index of each processing layer, where the load reflects the amount of data processed by at least one expert network when processing the training data; and prunes the expert networks contained in each processing layer according to the expert pruning index, so as to train the hybrid expert model based on the pruned processing layers, thereby obtaining the trained hybrid expert model. This achieves targeted pruning of the expert networks in each processing layer during the training process of the hybrid expert model, effectively solving the problem of unbalanced expert network load in the hybrid expert model, improving model training efficiency, ensuring the model performance of the hybrid expert model, and ultimately improving the resource utilization of the processor using the hybrid expert model.
[0007] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this application, nor is it intended to limit the scope of this application. Other features of this application will become readily apparent from the following description.
Brief Description of the Drawings
[0008] The accompanying drawings are provided to better understand this solution and do not constitute a limitation of this application. Wherein:
Figure 1 is a flowchart of the first model training optimization method provided in an embodiment of this application;
Figure 2 is a flowchart of the second model training optimization method provided in an embodiment of this application;
Figure 3 is a flowchart of the third model training optimization method provided in an embodiment of this application;
Figure 4 is a schematic diagram of a specific optimization of model training provided in an embodiment of this application;
Figure 5 is a schematic diagram of the structure of an optimization device for model training provided in an embodiment of this application.
Detailed Implementation
[0009] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the protection scope of this application.
[0010] It should be noted that, in the description of this application, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. The terms "first," "second," etc., in this application are used to distinguish similar objects and are not used to describe a specific order or sequence.
[0011] Hybrid expert models significantly reduce actual computational overhead while maintaining a large parameter scale by introducing multiple parallel expert networks at each layer and using routers to select a few experts to participate in the computation. Therefore, they are widely used in natural language processing, vision, multimodal computing and other fields.
[0012] However, hybrid expert models commonly face the problem of unbalanced expert load during actual training. This means that the routers in the model tend to concentrate most of the input tokens on a few expert networks, causing those expert networks to be continuously overloaded while the rest remain idle for extended periods. This problem not only leads to low utilization of the corresponding processor computing resources but also triggers a series of chain reactions, severely impacting the model's training effectiveness and actual operational performance. Therefore, an effective load balancing mechanism is urgently needed to solve this problem.
[0013] Specifically, the hybrid expert model consists of multiple parallel expert networks and routers. The routers assign weights to each expert network based on the input features, typically activating only the top K expert networks with the highest scores. The final output of the model is the weighted sum of the calculation results of all activated expert networks. However, this load imbalance problem has multiple negative impacts: on the one hand, overloaded expert networks, due to frequent calls, experience overly concentrated parameter updates, leading to significant gradient noise, overfitting, and even gradient explosion, directly causing instability in the model training process; on the other hand, idle expert networks, rarely selected for training, have near-zero load, and their parameters cannot fully learn effective features, effectively rendering them ineffective expert networks and resulting in a serious waste of the model's overall capacity.
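The top-K routing described at the start of this paragraph can be illustrated with a minimal Python sketch. The function names `top_k_route` and `moe_output` are hypothetical, and the use of a softmax over router scores followed by renormalization of the kept weights is an assumption; the text only states that the router assigns weights, activates the top K expert networks, and forms a weighted sum of their outputs.

```python
import math


def top_k_route(scores, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Assumption: router scores are turned into weights with a softmax and the
    kept weights are renormalized to sum to 1.
    """
    m = max(scores)  # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k experts with the highest routing probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return {i: probs[i] / kept for i in top}


def moe_output(x, experts, scores, k):
    """Weighted sum of the outputs of the activated experts for one token."""
    weights = top_k_route(scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())
```

Here each expert is modeled as a plain callable on a number purely for illustration; only the K selected experts are evaluated, which is where the computational saving comes from.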
[0014] Furthermore, uneven load on expert networks can directly trigger routing collapse, creating a dual bottleneck in both hardware and training. From a hardware perspective, the expert networks in a hybrid expert model are typically distributed across multiple processor (e.g., GPU) cores. Routing collapse leads to significant bottlenecks in the utilization of computing resources on each processor core, preventing the model from taking full advantage of the hardware. From a training perspective, routing collapse easily triggers gradient conflicts. Overloaded expert networks accumulate larger gradients and therefore learn much faster than idle expert networks, which makes model training difficult to converge and exacerbates training instability. At the same time, uneven load also directly leads to poor model performance and weak generalization ability: because idle expert networks do not receive enough training tokens, they cannot learn meaningful features, making it difficult for the model to adapt to diverse real-world application scenarios.
[0015] To address the problems existing in related solutions, this application proposes a Layer-wise Adaptive Expert Pruning (LAEP) method for hybrid expert models. Based on the load distribution of each processing layer during the training of the hybrid expert model, the expert network of each processing layer is adaptively pruned. The hybrid expert model is then trained again based on each pruned processing layer, thereby fundamentally solving the problem of unbalanced expert network load. Furthermore, while improving model training efficiency, maintaining model accuracy, and even improving accuracy, the method reduces the model parameter size and computational overhead, thereby increasing the resource utilization of the corresponding processor.
[0016] To enable those skilled in the art to better understand the present application, the present application will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0017] Figure 1 is a flowchart of the first model training optimization method provided in the embodiments of this application.
[0018] As shown in Figure 1, the method includes the following steps:
Step 101: Obtain the training data.
[0019] In some embodiments, the training data refers to the original task sample data that is input to the hybrid expert model for iterative model training and for collecting routing behavior statistics. It is the basic data on which the model performs tokenization and layer-by-layer forward computation. This application does not limit how the training data is acquired: it can be retrieved from local storage, loaded from a remote model library, or generated by adapting existing models.
[0020] It should be understood that the training data is standardized and preprocessed into a uniform format that the model can recognize, so that the subsequent tokenization and layer-by-layer propagation can proceed normally.
[0021] Step 102: Input the training data into the hybrid expert model and collect the expert routing information of each processing layer in at least one processing layer of the hybrid expert model.
[0022] In some embodiments, expert routing information refers to the set of expert assignment decisions made by routers in each processing layer for each input token during the forward propagation of the hybrid expert model. The core content of expert routing information is the number of tokens received and processed by each expert network within a preset statistical period. The statistics of expert routing information are performed in real time during the model's forward computation.
[0023] Specifically, when the training data is tokenized to form a sequence of tokens and then input into the hybrid expert model layer by layer, the router within each processing layer calculates the probability distribution of each token being assigned to each expert network based on the input features, and selects the expert network that actually participates in the calculation. During model training, a preset statistical period (e.g., the first 10% of training steps or a fixed number of steps, such as 1000 steps) is used as a time window to record the total number of tokens received and processed by each expert network in each processing layer within that window, forming the basis data for subsequent load analysis.
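The windowed token counting described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the class name `RoutingStats`, its methods, and the representation of routing decisions as per-token tuples of expert indices are all assumptions made for the example.

```python
class RoutingStats:
    """Counts tokens routed to each expert, per layer, inside a step window."""

    def __init__(self, num_layers, num_experts, window_steps):
        self.window_steps = window_steps  # e.g. a fixed number of steps such as 1000
        self.steps_seen = 0
        # counts[layer][expert] = tokens received by that expert in the window
        self.counts = [[0] * num_experts for _ in range(num_layers)]

    def record_step(self, assignments):
        """Record one training step.

        assignments[layer] is a list with one entry per token; each entry is
        the tuple of expert indices the router selected for that token.
        """
        if self.window_complete():
            return  # the statistical window is closed; ignore later steps
        for layer, per_token in enumerate(assignments):
            for chosen_experts in per_token:
                for expert in chosen_experts:
                    self.counts[layer][expert] += 1
        self.steps_seen += 1

    def window_complete(self):
        return self.steps_seen >= self.window_steps
```

Once `window_complete()` returns true, `counts` holds the per-layer token totals that serve as the input to the load analysis in the following steps.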
[0024] For example, this application does not limit the specific method of obtaining the hybrid expert model, as long as the hybrid expert model that needs to be optimized for load can be obtained. Optionally, obtaining the hybrid expert model in this application includes, but is not limited to: obtaining a pre-stored hybrid expert model from a local storage medium; and / or, loading the hybrid expert model from a remote server or model library; and / or, constructing a hybrid expert model based on an existing pre-trained model; and / or, designing and training a hybrid expert model from scratch.
[0025] Step 103: Based on expert routing information, analyze the load of at least one expert network in each processing layer to determine the expert pruning index for each processing layer.
[0026] In this application, each processing layer is any one of at least one processing layer in the hybrid expert model. That is, each processing layer is any processing layer in the hybrid expert model that contains a parallel expert network, and each layer independently performs load analysis and pruning to adapt to load differences between layers.
[0027] The processing layer refers to a functional layer in the hybrid expert model that can deploy independent parallel expert networks, receive input tokens, and perform specific computational processing. It is the basic unit of the core architecture of the hybrid expert model and covers the various functional layers on the model's path from input to output.
[0028] In some embodiments, given the significant differences in expert load distribution across different layers of the hybrid expert model, this application can first obtain the hybrid expert model and train it. During the training phase of the hybrid expert model, the load is statistically analyzed and distribution characteristics are examined layer by layer. Specific expert pruning indicators are selected for each layer to ensure that pruning accurately removes invalid experts without compromising model performance, while also meeting the high-efficiency requirements of the pre-training phase.
[0029] Hybrid expert models are large model structures composed of multi-layer parallel expert networks and routers. Routers select a small number of experts to participate in the computation, balancing parameter scale and computational efficiency. They are widely used in fields such as natural language processing, vision, and multimodal computing.
[0030] Load conditions are used to reflect the amount of data processed (i.e., the number of tokens) when at least one expert network processes the training data.
[0031] The expert pruning index is a pruning criterion adapted to each processing layer. In this application, the way the expert pruning index is determined can itself be adapted to the inter-layer load differences of the hybrid expert model. That is, for hybrid expert models with small inter-layer load differences, this application can fix a single global expert pruning index for the entire model; for hybrid expert models with large inter-layer load differences, this application can determine the index layer by layer, based on the load of at least one expert network in each layer, so that it accurately matches the inter-layer load differences. This application mainly focuses on this layer-wise adaptive determination.
[0032] Step 104: Prune the expert networks contained in each processing layer according to the expert pruning index, so as to train the hybrid expert model based on the pruned processing layers and obtain the trained hybrid expert model.
[0033] In some embodiments, this application can accurately screen out invalid expert networks to be pruned in a hybrid expert model based on expert pruning metrics adapted to each processing layer. Expert network pruning refers to directly removing the invalid expert network from the hybrid expert model to eliminate invalid and redundant experts.
[0034] After pruning, this application can form pruned processing layers based on the remaining effective expert networks in each processing layer, and continue training the hybrid expert model based on these pruned processing layers until the model reaches the preset training objective, thus obtaining the trained hybrid expert model and ensuring training effectiveness and performance stability. The preset training objective may include, but is not limited to, at least one of the following: accuracy or precision reaching the target, loss function convergence, inference latency meeting requirements, throughput reaching a threshold, load balancing meeting expectations, and resource utilization reaching the target; in this application embodiment, the objective can be set according to actual conditions, and no specific limitation is made.
[0035] It is understood that this application performs the above-mentioned expert network pruning process for each processing layer of the hybrid expert model, and the expert network pruning operation of each processing layer is implemented in parallel, rather than being processed sequentially for each processing layer.
[0036] In summary, this application performs hierarchical statistical analysis on the expert network load of each processing layer of the hybrid expert model during training, thereby accurately determining the expert pruning index that matches the load characteristics of each layer and accurately deleting invalid expert networks. Based on the pruned processing layers, the hybrid expert model can continue to be trained. This reduces the computational overhead and memory consumption in the pre-training stage while fully ensuring the accuracy and generalization ability of the model, thereby further improving the utilization rate of the processor's computing resources.
[0037] Figure 2 is a flowchart of the second model training optimization method proposed in this application. The embodiment illustrated in Figure 2 further explains step 103, that is, how the expert pruning index of each processing layer is determined, and may include the following steps.
[0038] Step 201: Based on expert routing information, obtain at least one workload of at least one expert network in each processing layer.
[0039] In some embodiments, to avoid statistical data distortion caused by load fluctuations in the early stages of training and to ensure that the acquired workload truly reflects the actual operating status of each expert network, this application may determine whether the hybrid expert model is in a stable training phase before acquiring the workload of at least one expert network in each processing layer, i.e., whether the hybrid expert model meets the load statistics conditions. When the training time of the hybrid expert model reaches a preset time or the number of training steps reaches a preset number, it is determined that it meets the load statistics conditions. At this time, performing load statistics on the hybrid expert model can effectively ensure the validity of the statistical data.
[0040] In this application, the workload of each expert network in each processing layer is determined by the number of tokens. This is because the core task of each expert network in the hybrid expert model is to perform feature processing and computation on the tokens assigned by the routing. The amount of token processing directly corresponds to the task capacity and computational cost of the expert network, making it the most direct and quantifiable indicator of the expert network's workload. Based on this, this application can pre-set a statistical period, and separately count the number of tokens assigned to each expert network in each processing layer according to the preset statistical period, and directly use this number of tokens as the workload of the corresponding expert network. The aforementioned preset statistical period can be flexibly selected according to the model training scale and training rhythm, specifically, it can be any one of a training round, a preset number of training steps, or a preset percentage of training steps.
[0041] In this application, tokens are the basic data processing units generated when the processor of the hybrid expert model tokenizes the input task data. The tokens of each processing layer are those tokens that the router of the hybrid expert model distributes to that layer and that are processed by the expert networks within it.
[0042] Specifically, taking the l-th layer of a hybrid expert model as an example, this application counts, for the set of expert networks in the l-th layer, the number of tokens that the router assigns to each expert network. The preset statistical period used for this count can be flexibly selected based on the model training scale and training rhythm, including a full training epoch, a fixed number of training steps (e.g., 1000 steps), or a fixed percentage of training steps (e.g., the first 10% of the total steps). All statistical periods are selected according to the same standard: the token processing volume of each expert network should be stable, without significant fluctuations, within the statistical period, so that the statistical results are objective and representative.
[0043] After counting the routed tokens for each expert network in layer l, this application can form a layer-wise load matrix for layer l. The layer-wise load matrix contains at least one workload of at least one expert network in the l-th layer.
[0044] It is understandable that the model load optimization strategy proposed in this application is mainly applied to the pre-training stage of hybrid expert models. Compared with related technologies, this application, by selectively pruning the expert network for each processing layer during the pre-training stage, not only possesses the flexibility to adapt to various task scenarios, but also achieves precise control of model resources and efficient optimization of network structure, reducing the computational and memory overhead of the pre-training stage from the source, while ensuring the stability of model performance.
[0045] Step 202: Based on at least one workload, analyze the load of at least one expert network in each processing layer to determine the expert pruning index that is suitable for each processing layer.
[0046] In some embodiments, this application may determine load indicators to characterize the load of each processing layer based on the obtained workload of at least one expert network, and characterize the load distribution characteristics of the layer through hierarchical load statistical feature analysis.
[0047] Specifically, for each layer of the hybrid expert model, multiple load metrics are calculated. These load metrics include average load, maximum load, minimum load, load range, load variance, load coefficient of variation, and load quantile characteristics, to comprehensively quantify the load distribution of each processing layer.
[0048] The average load reflects the overall load level of the expert networks in the layer. The formula for calculating the average load is:
$$\mu_l = \frac{1}{N_l} \sum_{i=1}^{N_l} T_{l,i}$$
where $\mu_l$ is the average load of processing layer $l$, $N_l$ is the number of expert networks in processing layer $l$, and $T_{l,i}$ is the workload of the $i$-th expert network in processing layer $l$.
The maximum load, minimum load, and load range reflect the fluctuation range and extreme differences of the load in the layer. The formula for calculating the load range is:
$$R_l = T_l^{\max} - T_l^{\min}$$
where $R_l$ is the load range of processing layer $l$, $T_l^{\max}$ is the maximum load in processing layer $l$, and $T_l^{\min}$ is the minimum load in processing layer $l$.
The load variance, or the load coefficient of variation, is used to determine whether the load of the expert networks in the layer is balanced. The formula for calculating the load coefficient of variation is:
$$CV_l = \frac{\sigma_l}{\mu_l} = \frac{\sqrt{\sigma_l^2}}{\mu_l}$$
where $\sigma_l$ is the standard deviation of the load in processing layer $l$, $\sigma_l^2$ is the variance of the load in processing layer $l$, and $CV_l$ is the load coefficient of variation of processing layer $l$.
Quantile features are used to identify whether there are long-tail experts (abnormally high load) or experts with abnormally low load. The formula for calculating quantile features is:
$$Q_l(p) = \inf\{\, x : F_l(x) \ge p \,\}$$
where $Q_l(p)$ is the $p$-quantile of the load distribution in processing layer $l$ ($p$ takes values in the range (0,1)), $\inf\{\cdot\}$ is the infimum (i.e., the smallest $x$ that satisfies the condition, which makes the quantile value unique), $F_l(x)$ is the proportion of expert networks in processing layer $l$ whose workload is less than or equal to $x$, and the condition $F_l(x) \ge p$ means that this proportion is at least $p$.
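As a worked illustration of the load metrics listed above, the following Python sketch computes them for the token counts of one processing layer. The function name `load_metrics` and the default `p=0.1` are hypothetical, and the use of the population variance (dividing by the number of experts) is an assumption, since the text does not say which variance estimator is intended.

```python
import math


def load_metrics(loads, p=0.1):
    """Compute per-layer load metrics from a list of per-expert token counts."""
    n = len(loads)
    mean = sum(loads) / n
    max_load, min_load = max(loads), min(loads)
    # Assumption: population variance (divide by n, not n - 1).
    variance = sum((t - mean) ** 2 for t in loads) / n
    std = math.sqrt(variance)
    cv = std / mean if mean > 0 else 0.0

    # p-quantile as inf{x : F(x) >= p}, where F(x) is the share of experts
    # whose load is <= x; scanning the sorted loads finds that smallest x.
    quantile = None
    for rank, x in enumerate(sorted(loads), start=1):
        if rank / n >= p:
            quantile = x
            break

    return {
        "mean": mean,
        "max": max_load,
        "min": min_load,
        "range": max_load - min_load,
        "variance": variance,
        "cv": cv,
        "quantile": quantile,
    }
```

For example, for loads `[100, 50, 10, 0, 40]` the average load is 40, the load range is 100, and the variance is 1240, which gives a coefficient of variation of roughly 0.88 and signals a strongly unbalanced layer.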
[0049] After obtaining the load index, this application can determine the expert pruning index for each processing layer in two ways: one is to directly calculate it based on the load index and the workload of at least one expert network in each processing layer; the other is to first determine the target processing type corresponding to each processing layer according to the load index and the hierarchical position of each processing layer in the hybrid expert model, and then match the expert pruning index corresponding to the target processing type according to the preset mapping relationship between the processing type and the expert pruning index, and different processing types correspond to different expert pruning indices in the mapping relationship.
[0050] Because this application employs a two-stage screening strategy for expert networks, the expert pruning metrics can be divided into a first expert pruning metric and a second expert pruning metric. These two metrics work together to achieve accurate screening and pruning of expert networks. The first expert pruning metric is used for absolute screening of low-load expert networks, aiming to accurately locate and remove expert networks with low usage rates. The second expert pruning metric, on the other hand, filters based on cumulative contribution to ensure that the removed expert networks do not significantly impact the overall model capacity, thereby accurately identifying low-contribution expert networks.
[0051] The first method for determining the expert pruning index is real-time calculation, which directly calculates the first expert pruning index and the second expert pruning index based on the load index and the workload of at least one expert network in each processing layer.
[0052] To calculate the first expert pruning index, this application first determines the ratio between the load range and the average load in the load index, and then takes the product of this ratio and the first global coefficient as the first expert pruning index. The specific formula for the first expert pruning index is:
$$\alpha_l = \alpha_0 \cdot \frac{R_l}{\mu_l}$$
where $\alpha_l$ is the first expert pruning index; $\alpha_0$ is the first global coefficient (e.g., 0.1 to 0.3), used to control the pruning intensity; $R_l$ is the load range, and the larger its value, the more unbalanced the load; and $\mu_l$ is the average load. The larger $\alpha_l$ is, the more aggressive the pruning scheme (the larger the pruning ratio).
[0053] To calculate the second expert pruning index, this application defines the second expert pruning index as the product of the second global coefficient and the load coefficient of variation in the load index. Specifically, the formula for the second expert pruning index is:
$$\beta_l = \beta_0 \cdot CV_l$$
where $\beta_l$ is the second expert pruning index, $\beta_0$ is the second global coefficient, and $CV_l$ is the load coefficient of variation. In this application, $\beta_l$ is the threshold for cumulative contribution, which is generally taken as 5% to 20%.
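The real-time calculation of the two pruning indices can be sketched in Python as follows. The function names and default coefficient values are hypothetical. One point is an explicit assumption: the ratio in the first index is taken as load range divided by average load, which matches the statement that a larger load range means a more unbalanced load and a more aggressive pruning scheme, but the direction of the ratio is not certain from the text.

```python
def first_pruning_index(mean_load, load_range, alpha0=0.2):
    """First expert pruning index: global coefficient times a load ratio.

    Assumption: the ratio is load_range / mean_load. alpha0 is the first
    global coefficient (the text suggests values such as 0.1 to 0.3).
    """
    if mean_load <= 0:
        raise ValueError("mean_load must be positive")
    return alpha0 * load_range / mean_load


def second_pruning_index(cv, beta0=0.1):
    """Second expert pruning index: global coefficient times the load
    coefficient of variation. beta0 = 0.1 is a placeholder value only.
    """
    return beta0 * cv
```

With an average load of 40, a load range of 100 and `alpha0 = 0.2`, the first index evaluates to 0.5; with a coefficient of variation of 0.88 and `beta0 = 0.1`, the second index is 0.088, inside the 5% to 20% band mentioned for the cumulative contribution threshold.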
[0054] Optionally, to more accurately characterize the local load distribution characteristics of expert networks within each processing layer (such as the concentration range of low-load expert networks and the threshold value of high-load expert networks), and to improve the adaptability of the expert pruning index to the actual load distribution, this application can also incorporate quantile characteristics into the calculation process of the second expert pruning index. For example, the effective load range can be screened by quantiles to correct the load variation coefficient, or the quantile skewness coefficient and the load variation coefficient can be weighted and fused before being multiplied by the second global coefficient, or the quantile can be used as an additional judgment condition to constrain the pruning range, further improving the accuracy of expert network pruning. The quantile skewness coefficient is a statistical measure calculated based on the quantiles of the load of expert networks within each processing layer, used to characterize the skewness direction and degree of load distribution.
[0055] The second method for determining the expert pruning metric is type mapping matching. First, based on the load metric and the hierarchical position of each processing layer in the hybrid expert model, the target processing type corresponding to each processing layer is determined. Then, according to a preset mapping relationship between processing types and expert pruning metrics, the expert pruning metric corresponding to the target processing type is matched. Different processing types in this mapping relationship correspond to different first and second expert pruning metrics. Processing types can include input layers, intermediate processing layers, and output layers.
[0056] For example, this application can set the first expert pruning index $\alpha_l = 0.2$ for the input and output layers, and $\alpha_l = 0.4$ for the intermediate processing layers. The first expert pruning index is mainly used to filter out extremely weak experts whose load is very low and that are clearly ineffective. Setting it by layer type matches the functional characteristics of the different levels and allows the ineffective expert networks that need to be pruned to be identified accurately. The second expert pruning index $\beta_l$ can be set by following the same layer-type matching logic as the first expert pruning index, which is not elaborated here.
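The type mapping approach can be illustrated with a small lookup table in Python. The values 0.2 and 0.4 come from the example above; everything else is a simplifying assumption made for the sketch, in particular the names `FIRST_INDEX_BY_TYPE`, `layer_type` and `first_index_for_layer`, and the choice to classify a layer by its position alone (the text also takes the load index into account when determining the processing type).

```python
# Preset mapping from processing type to the first expert pruning index,
# using the example values given in the text.
FIRST_INDEX_BY_TYPE = {
    "input": 0.2,
    "intermediate": 0.4,
    "output": 0.2,
}


def layer_type(layer_index, num_layers):
    """Classify a layer by position only (a simplification of the text,
    which also considers the load index)."""
    if layer_index == 0:
        return "input"
    if layer_index == num_layers - 1:
        return "output"
    return "intermediate"


def first_index_for_layer(layer_index, num_layers):
    """Look up the first expert pruning index for a layer via its type."""
    return FIRST_INDEX_BY_TYPE[layer_type(layer_index, num_layers)]
```

A second table of the same shape could hold the second expert pruning index per processing type.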
[0057] In summary, this application first determines the stage at which the training load is stable, and collects hierarchical load statistics with the token count as the core indicator, which avoids data distortion and yields a hierarchical load matrix that reflects the actual load of the experts. Then, based on the collected workloads, it calculates multi-dimensional load indicators and, in combination with a two-stage screening strategy, determines expert pruning indicators suited to each processing layer in either of two ways. This matches the expert pruning indicators to the load distribution characteristics of each layer, avoids the blindness of a single global indicator, provides a reliable basis for the subsequent screening of low-load, ineffective expert networks, and allows the pruning intensity to be adjusted flexibly for different model training scenarios.
[0058] Figure 3 is a flowchart of the third model training optimization method proposed in this application. The embodiment illustrated in Figure 3 further explains step 102 and may include the following steps.
[0059] Step 301: Based on the expert pruning index and at least one workload, perform load contribution analysis on at least one expert network in each processing layer to identify invalid expert networks in at least one expert network.
[0060] In some embodiments, this application can combine predetermined expert pruning metrics with the workload of each expert network to perform load contribution analysis on all expert networks in each processing layer. Through multi-dimensional quantitative judgment, invalid expert networks that do not need to be retained within the layer can be accurately identified, providing clear target objects for subsequent model pruning.
[0061] In some embodiments, this application employs a two-stage screening strategy, combining the first and second expert pruning metrics of each processing layer to determine invalid expert networks.
[0062] This application cross-validates the screening results from two dimensions to avoid the one-sidedness of single-dimensional screening and improve the accuracy of invalid expert identification. Specifically, this application can perform low-load screening on each expert network based on the first expert pruning index and the workload of each expert network, identifying low-load expert networks within a layer and including them in the first candidate sequence; at the same time, this application can also perform low-contribution screening on each expert network based on the second expert pruning index and the workload of each expert network, identifying low-contribution expert networks within a layer and including them in the second candidate sequence; finally, expert networks that appear in both the first and second candidate sequences are determined to be invalid expert networks to be pruned, that is, only expert networks that simultaneously meet the two characteristics of low load and low contribution will be identified as invalid experts, ensuring that the pruning operation does not mistakenly delete expert networks that have a real effect on the model.
[0063] It is important to note that, to avoid misidentifying invalid expert networks and to improve overall screening efficiency and accuracy, this application designs the low-load screening stage and the low-contribution screening stage to be executed in parallel. The two screening stages perform their analysis independently and output their results synchronously. By cross-matching the two sets of results, the invalid expert networks to be pruned in each processing layer are identified accurately. This avoids the efficiency loss of serial screening and, through the dual judgment criteria, reduces the probability of missed detections and false positives.
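The parallel execution and cross-matching of the two screening stages can be illustrated with a minimal sketch. The two screens are passed in as callables so the sketch stays independent of how each stage is computed; the function name and the use of a thread pool are assumptions made for illustration, and in CPython threads mainly demonstrate that the stages are independent rather than guaranteeing a speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def find_invalid_experts(workloads, low_load_screen, low_contribution_screen):
    """Run both screens independently and keep only experts flagged by both.

    Each screen takes the per-expert workloads of one processing layer and
    returns the indices of the experts it flags.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(low_load_screen, workloads)
        second = pool.submit(low_contribution_screen, workloads)
        first_candidates = set(first.result())    # first candidate sequence
        second_candidates = set(second.result())  # second candidate sequence
    # An expert is invalid only if it appears in both candidate sequences.
    return sorted(first_candidates & second_candidates)
```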
[0064] Specifically, for the low-load screening stage, this application uses any expert network (i.e., the first expert network) in each processing layer as an independent analysis unit, and determines whether it is a low-load expert network by quantitatively calculating the load ratio.
[0065] Specifically, this application can extract the average load from the load metrics of each processing layer, obtain the workload of the first expert network (i.e., the first workload, the quantized number of tokens actually processed by the first expert network), and calculate the load ratio of the first expert network from these two values. The load ratio is the ratio of the first workload to the average load of the processing layer, and is the core quantitative indicator of the expert network's load level relative to the overall level within the layer. The calculated load ratio is then compared numerically with the first expert pruning index. If the load ratio is less than the first expert pruning index, the first expert network is determined to be a low-load expert network and is included in the first candidate sequence. If the load ratio is greater than or equal to the first expert pruning index, the load level of the expert network is determined to meet the requirements and it is not included in the first candidate sequence.
[0066] The core of this screening stage is to identify expert networks whose load levels are significantly lower than the overall level of each processing layer through relative load determination. These expert networks have low utilization rates in each processing layer due to their low actual task load, and are an important criterion for determining invalid expert networks.
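The low-load screening stage can be sketched as follows, assuming the workloads of one processing layer are given as a list of token counts indexed by expert. The function name and the handling of an all-zero layer are illustrative assumptions.

```python
def low_load_candidates(workloads, alpha):
    """Return indices of experts whose load ratio is below the first pruning index.

    load ratio = expert workload / average workload of the layer
    """
    avg = sum(workloads) / len(workloads)
    if avg == 0:
        return []  # assumption: with no traffic at all, flag nothing
    return [i for i, w in enumerate(workloads) if w / avg < alpha]
```

For example, with workloads `[100, 5, 95, 0]` the average is 50, so with α = 0.2 the experts at indices 1 and 3 (ratios 0.1 and 0) enter the first candidate sequence.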
[0067] Specifically, for the low-contribution screening stage, this application determines the actual contribution of the expert network to each processing layer by quantitatively calculating the contribution ratio, and identifies the low-contribution expert network within the layer.
[0068] Specifically, this application can sort all expert networks within each processing layer in ascending order of their actual workload to obtain an ascending sorting result, in which the expert network with the lowest workload is placed first. A target position range is then defined according to a preset ratio, and the top t expert networks in the ascending order are selected as the expert networks to be processed. The preset ratio can be set flexibly based on the model training requirements and the workloads of the expert networks within the layer; its purpose is to single out the group of expert networks with the lowest workloads as the object of the contribution analysis. The workloads of all expert networks to be processed are summed to obtain the partial contribution value, which reflects the overall contribution of the lowest-load group to the processing layer. The workloads of all expert networks in the processing layer are summed to obtain the total contribution value, which reflects the overall load of the processing layer and serves as the benchmark for judging contribution. The ratio of the partial contribution value to the total contribution value is the contribution ratio of the expert networks to be processed, and is the core quantitative indicator of the actual contribution of the lowest-load group to the processing layer. The calculated contribution ratio is compared with the second expert pruning index. If the contribution ratio is less than the second expert pruning index, the expert networks to be processed are determined to be low-contribution expert networks and are all included in the second candidate sequence. If the contribution ratio is greater than or equal to the second expert pruning index, the overall contribution of the group is determined to meet the requirements and none of them is included in the second candidate sequence.
[0069] The core of this screening stage is to identify expert networks with extremely low load levels and negligible overall contribution to each processing layer through group contribution determination. The existence of such expert networks has no substantial positive effect on model performance and only occupies model computing and memory resources, which is another core criterion for invalid expert networks.
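The low-contribution screening stage can be sketched as follows. How the group size t is derived from the preset ratio is not specified in the text; the sketch assumes t = floor(n × ratio) with a minimum of 1, and the function and parameter names are hypothetical.

```python
import math

def low_contribution_candidates(workloads, beta, preset_ratio=0.25):
    """Return the lowest-load group if its share of the layer's load is below beta."""
    total = sum(workloads)  # total contribution value of the layer
    if total == 0:
        return []  # assumption: with no traffic at all, flag nothing
    # Ascending sort of expert indices by workload.
    order = sorted(range(len(workloads)), key=lambda i: workloads[i])
    t = max(1, math.floor(len(workloads) * preset_ratio))  # assumed rule for t
    group = order[:t]                             # expert networks to be processed
    partial = sum(workloads[i] for i in group)    # partial contribution value
    if partial / total < beta:
        return sorted(group)  # the whole group joins the second candidate sequence
    return []
```

Note that the decision is made for the group as a whole: either every expert in the lowest-load group is flagged, or none is.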
[0070] Step 302: Remove invalid expert networks from at least one expert network to obtain pruned processing layers, and train the hybrid expert model based on the pruned processing layers to obtain the trained hybrid expert model.
[0071] In some embodiments, this application can remove invalid expert networks from all expert networks in each processing layer to obtain each processing layer that has been pruned. Then, the hybrid expert model can continue to be trained based on the pruned processing layers to optimize the overall load of the hybrid expert model and reduce the resource consumption of the model operation from the underlying structure.
[0072] In some embodiments, after removing invalid expert networks from each processing layer, this application does not directly use the remaining valid expert networks as the pruned processing layers. Instead, it performs targeted optimization on the remaining valid expert networks to form the final pruned processing layers. Based on this, the hybrid expert model is further trained, thereby optimizing the overall load of the hybrid expert model. This directly reduces the computational overhead during model operation and simultaneously reduces GPU memory usage, achieving a dual optimization of model resource consumption.
[0073] Specifically, the redistribution process of the remaining effective expert networks in this application involves accurately identifying all effective expert networks remaining after pruning each processing layer (i.e., all expert networks remaining after removing invalid experts); then, redistributing and adjusting all the aforementioned effective expert networks to form structurally optimized pruned processing layers through a load-balanced configuration, making the deployment of effective expert networks more compatible with the hardware operating logic; finally, based on the structurally optimized processing layers, the hybrid expert model continues to be trained until the model's various performance indicators reach the preset training targets, ensuring that the model's performance does not degrade after pruning and maintaining overall stability from the training stage.
[0074] In this application, the core design logic for redistributing effective expert networks is to combine the load balancing objective with adaptation to hardware resources; that is, all redistribution operations aim at a uniform distribution of expert network load at the hardware level. After the invalid experts have been pruned, this application distributes the remaining effective expert networks among the expert parallel groups so as to minimize the load differences between the groups. This improves the routing efficiency of tokens in the hybrid expert model and increases the actual utilization of hardware resources. It avoids the waste that arises during hardware operation when some expert parallel groups are overloaded with their computing power fully used while others are underloaded with their hardware resources idle. Here, an expert parallel group is a collection of expert networks divided according to the core resource carrying capacity of the hardware, such as memory and computing power. The number of groups is determined by the specifications and performance of the actual hardware resources. To ensure stable hardware operation, each expert parallel group has a fixed capacity limit; that is, the number of expert networks that each parallel group can accommodate is a fixed value and cannot be exceeded.
[0075] Specifically, the effective expert network reallocation process is as follows. The effective workload of each effective expert network (i.e., the quantized number of tokens actually processed by that expert network) is extracted, and all effective expert networks are sorted in descending order of effective workload, from highest to lowest load. Then, based on a preset load balancing strategy and this descending order, the effective expert networks are allocated one by one to their corresponding expert parallel groups. This allocation is implemented with a greedy algorithm, whose core principle is to select the locally optimal choice at each step so as to approach the global optimum, which meets the load-balanced allocation requirements of this application.
[0076] Furthermore, the specific steps for allocating effective expert networks with the greedy algorithm are as follows: based on the descending sorting result, select the first effective expert network, i.e., the one in the first position with the heaviest load; traverse all expert parallel groups and select, as the allocation target, the first expert parallel group, i.e., the group that has the smallest current total workload and has not reached its capacity limit; allocate the first effective expert network to the first expert parallel group, and update that group's current total workload and number of accommodated experts; if the number of experts in the first expert parallel group reaches the preset capacity limit after receiving the first effective expert network, stop allocating expert networks to that group and repeat the greedy allocation process over the remaining expert parallel groups that have not reached the capacity limit; iterate this process until all effective expert networks in each processing layer have been allocated, thereby achieving load-balanced deployment of all effective expert networks among the parallel groups.
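The greedy allocation steps above can be sketched as follows, assuming the effective workloads are given as a mapping from a (hypothetical) expert identifier to its token count, and that every expert parallel group has the same capacity. The data layout and names are illustrative only.

```python
def allocate_experts(effective_workloads, num_groups, capacity):
    """Greedily assign experts, heaviest first, to the least-loaded group with room."""
    if num_groups * capacity < len(effective_workloads):
        raise ValueError("total group capacity is smaller than the number of experts")
    groups = [{"experts": [], "load": 0} for _ in range(num_groups)]
    # Descending sort by effective workload.
    order = sorted(effective_workloads, key=effective_workloads.get, reverse=True)
    for expert_id in order:
        # Groups that reached the capacity limit are excluded from further allocation.
        open_groups = [g for g in groups if len(g["experts"]) < capacity]
        target = min(open_groups, key=lambda g: g["load"])
        target["experts"].append(expert_id)
        target["load"] += effective_workloads[expert_id]
    return groups
```

With six experts of loads 90, 70, 40, 30, 20 and 10 split over two groups of capacity 3, this sketch ends with both groups carrying a total load of 130.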
[0077] Understandably, after implementing targeted pruning of each processing layer in this application, the structure of the hybrid expert model is optimized, which can not only significantly reduce the number of model parameters and computational load, greatly reduce the overhead of the model training stage, but also ensure that the test loss of large models remains basically stable. At the same time, the pruning structure shows good stability in hybrid expert models of different sizes and has good adaptability to various attention structures, which fully demonstrates that this application has excellent universality and transferability.
[0078] In summary, this application determines the dual expert pruning indices suited to each processing layer through hierarchical load statistics and multi-dimensional feature analysis. It adopts a parallel two-stage screening strategy to accurately identify and remove invalid expert networks, and then uses a greedy algorithm to perform load-balanced redistribution and parallel group optimization on the remaining valid experts. This addresses the core problems of large load differences between layers in hybrid expert models, blind pruning under a unified strategy, and resource consumption by invalid experts, while avoiding expert misidentification and performance loss. It reduces the computational overhead and memory consumption in the pre-training stage and improves the efficiency of token routing and hardware utilization. Ultimately, it achieves a balance between model load optimization, training efficiency improvement, and task performance stability, and has strong universality and feasibility for implementation.
[0079] For ease of understanding, this application further provides, as shown in Figure 4, a schematic diagram of a specific model training optimization.
[0080] Referring to Figure 4, the figure illustrates the overall model optimization process of hierarchical adaptive expert pruning and rearrangement. The core is to accurately filter out and remove invalid experts in the hybrid expert model while optimizing the parallel deployment of the remaining valid experts, thereby improving model training efficiency and hardware utilization.
[0081] Specifically, this application first requires hierarchical expert load collection. After the hybrid expert model enters the load stabilization phase of training, the workload of the expert networks in each processing layer is independently statistically analyzed using the number of tokens as the core indicator, forming a hierarchical load matrix, i.e., at least one workload of at least one expert network. Next, hierarchical load analysis is performed, calculating multi-dimensional load indicators such as average load, load range, and coefficient of variation to characterize the load of at least one expert network in each processing layer, providing data support for subsequent pruning. Based on the analysis results, an absolute load pruning criterion (i.e., the first expert pruning criterion, α) and a cumulative contribution pruning criterion (i.e., the second expert pruning criterion, β) suitable for that layer are generated. Through parallel two-stage screening, invalid experts that simultaneously meet the criteria of low load and low contribution are identified and pruned. After pruning, a hierarchical adaptive expert reordering is performed, using a greedy algorithm to distribute the remaining effective experts to hardware-adapted parallel expert groups in descending order of load, minimizing the load difference between groups and improving hardware utilization. Finally, the model is trained again based on the pruned and reordered model until the preset accuracy is reached, achieving a synergistic unity of load optimization, efficiency improvement, and performance stability.
[0082] Understandably, for the specific implementation of each step in Figure 4, please refer to the embodiments shown in Figures 1 to 3, which will not be described in detail here.
[0083] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method.
[0084] Embodiments of this application also provide an optimization apparatus 500 for model training. Figure 5 is a schematic diagram of the structure of the optimization apparatus for model training provided in an embodiment of this application. As shown in Figure 5, the apparatus includes: an acquisition unit 510, used to acquire the data to be trained; a statistics unit 520, used to input the data to be trained into the hybrid expert model and to collect the expert routing information of each processing layer in at least one processing layer of the hybrid expert model; a determination unit 530, used to analyze the load of at least one expert network in each processing layer based on the expert routing information, so as to determine the expert pruning index of each processing layer, where the load reflects the amount of data processed when the at least one expert network processes the training data; and a pruning unit 540, used to prune the expert networks contained in each processing layer according to the expert pruning index, so as to train the hybrid expert model based on the pruned processing layers and obtain the trained hybrid expert model.
[0085] Furthermore, in one possible implementation of this application embodiment, the determining unit 530 is configured to: obtain at least one workload of at least one expert network in each processing layer based on expert routing information; and analyze the load of at least one expert network in each processing layer based on at least one workload to determine expert pruning metrics adapted to each processing layer.
[0086] Furthermore, in one possible implementation of the embodiments of this application, the determining unit 530 is configured to: determine a load index for characterizing the load status of at least one expert network in each processing layer based on at least one workload; and determine an expert pruning index for each processing layer based on the load index and at least one workload.
[0087] Furthermore, in one possible implementation of this application embodiment, the determining unit 530 is configured to: determine a load index for characterizing the load status of at least one expert network in each processing layer based on at least one workload; determine the target processing type of each processing layer based on the load index and the hierarchical position of each processing layer in the hybrid expert model; and determine the expert pruning index corresponding to the target processing type according to a preset mapping relationship between processing type and expert pruning index, wherein different processing types correspond to different expert pruning indices in the mapping relationship.
[0088] Furthermore, in one possible implementation of this application embodiment, the expert pruning index includes a first expert pruning index. The determining unit 530 is used to: determine the load index ratio between the average load and the load range in the load index; and determine the product of the load index ratio and the first global coefficient as the first expert pruning index.
[0089] Furthermore, in one possible implementation of this application embodiment, the expert pruning index includes a second expert pruning index, and the determining unit 530 is used to: determine the second expert pruning index by multiplying the second global coefficient and the load variation coefficient in the load index.
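The two index calculations just described can be sketched as follows. The sketch reads "ratio between the average load and the load range" as average divided by range, and takes the load variation coefficient to be the population standard deviation divided by the mean; both readings, the zero-handling, and the parameter names are assumptions.

```python
import statistics

def pruning_indices(workloads, first_global_coeff, second_global_coeff):
    """Compute (alpha, beta) for one processing layer from its per-expert workloads."""
    mean = statistics.fmean(workloads)
    load_range = max(workloads) - min(workloads)
    std = statistics.pstdev(workloads)
    # alpha: first global coefficient times (average load / load range).
    alpha = first_global_coeff * (mean / load_range) if load_range > 0 else 0.0
    # beta: second global coefficient times the load variation coefficient.
    beta = second_global_coeff * (std / mean) if mean > 0 else 0.0
    return alpha, beta
```

Under these assumptions a perfectly balanced layer yields α = β = 0, so no expert can fall below either threshold and nothing is pruned in that layer.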
[0090] Furthermore, in one possible implementation of this application embodiment, the pruning unit 540 is configured to: perform load contribution analysis on at least one expert network of each processing layer according to expert pruning indicators and at least one workload, determine invalid expert networks in at least one expert network, and remove invalid expert networks from at least one expert network.
[0091] Furthermore, in one possible implementation of this application embodiment, the expert pruning index includes a first expert pruning index and a second expert pruning index. The pruning unit 540 is configured to: perform low-load analysis on at least one expert network based on the first expert pruning index and at least one workload, identify low-load expert networks among the at least one expert network, and include the low-load expert networks in a first candidate sequence; perform contribution analysis on at least one expert network based on the second expert pruning index and at least one workload, identify low-contribution expert networks among the at least one expert network, and include the low-contribution expert networks in a second candidate sequence; and identify expert networks among the at least one expert network that exist simultaneously in both the first and second candidate sequences as invalid expert networks.
[0092] Furthermore, in one possible implementation of this application embodiment, the pruning unit 540 is configured to: determine the load ratio of the first expert network based on the average load in the load index and the first workload of the first expert network, wherein the first expert network is any one of at least one expert network; and determine the first expert network as a low-load expert network if the load ratio is less than the first expert pruning index.
[0093] Further, in one possible implementation of this application embodiment, the pruning unit 540 is configured to: sort at least one expert network in ascending order based on at least one workload to obtain an ascending order sorting result; determine a partial contribution value of at least one expert network to be processed based on at least one workload of at least one expert network to be processed located within a target position range in the ascending order sorting result, wherein the target position range is determined according to a preset ratio; determine the total contribution value of at least one expert network based on at least one workload; determine the contribution ratio of at least one expert network to be processed based on the partial contribution value and the total contribution value; and determine at least one expert network to be processed as at least one low contribution expert network if the contribution ratio is less than a second expert pruning index.
[0094] Furthermore, in one possible implementation of the embodiments of this application, the determining unit 530 is used to: determine whether the hybrid expert model meets the load statistics conditions during the training process of the hybrid expert model; and if the hybrid expert model meets the load statistics conditions, obtain at least one workload of at least one expert network in each processing layer based on a preset statistical period.
[0095] Furthermore, in one possible implementation of the embodiments of this application, the load statistics conditions include at least one of the following: the training duration of the hybrid expert model is greater than or equal to a preset duration; the number of training steps of the hybrid expert model is greater than or equal to a preset number of steps.
[0096] Furthermore, in one possible implementation of this application embodiment, the determining unit 530 is used to: count, based on the expert routing information and a preset statistical period, the number of tokens routed to each of the at least one expert network; and determine these token counts as the at least one workload of the at least one expert network.
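The load statistics condition and token counting can be sketched together as follows for a single processing layer. The representation of the expert routing information as `(step, expert_ids)` records, where `expert_ids` lists the expert selected for each routed token, is a hypothetical format chosen for illustration; the step-count condition is one of the two conditions named in the text.

```python
from collections import Counter

def collect_workloads(routing_records, num_experts, warmup_steps, period):
    """Count tokens per expert over one statistical period after the warm-up.

    routing_records: iterable of (step, expert_ids) for one processing layer.
    Counting starts once step >= warmup_steps (the load statistics condition)
    and covers `period` training steps.
    """
    counts = Counter()
    for step, expert_ids in routing_records:
        if warmup_steps <= step < warmup_steps + period:
            counts.update(expert_ids)
    # Workload of each expert is its token count; experts never selected get 0.
    return [counts.get(e, 0) for e in range(num_experts)]
```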
[0097] Furthermore, in one possible implementation of this application embodiment, the pruning unit 540 is used to: determine at least one effective expert network remaining after pruning each processing layer; redistribute the at least one effective expert network to obtain each pruned processing layer; and train the hybrid expert model based on each pruned processing layer to obtain the trained hybrid expert model.
[0098] Furthermore, in one possible implementation of this application embodiment, the determining unit 530 is configured to: sort at least one effective expert network in descending order according to at least one effective workload of at least one effective expert network to obtain a descending order sorting result; and, based on the load balancing strategy and the descending order sorting result, sequentially allocate the at least one effective expert network after descending order sorting to at least one expert parallel group, wherein the at least one expert parallel group is at least one expert set adapted to the hardware resource partitioning.
[0099] Furthermore, in one possible implementation of this application embodiment, the determining unit 530 is configured to: determine a first effective expert network located in a first position after descending sorting based on the descending sorting result; determine a first expert parallel group with the smallest current total workload and not yet reaching the capacity limit among at least one expert parallel group; allocate the first effective expert network to the first expert parallel group; and, if the first expert parallel group reaches the capacity limit, suspend the allocation of effective expert networks in the first expert parallel group until at least one effective expert network has been allocated.
[0100] For a description of the features in the embodiment corresponding to the optimization device for model training, please refer to the relevant description in the embodiment corresponding to the optimization method for model training, which will not be repeated here.
[0101] Embodiments of this application also provide an electronic device, including at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the steps in any of the above-described model training optimization method embodiments.
[0102] Embodiments of this application also provide a computer-readable storage medium storing a computer program, wherein the computer program is configured to execute the steps in any of the above-described optimization method embodiments for model training at runtime.
[0103] In one exemplary embodiment, the aforementioned computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard disk, magnetic disk, or optical disk.
[0104] Embodiments of this application also provide a computer program product, which includes a computer program that, when executed by a processor, implements the steps in any of the above-described optimization method embodiments for model training.
[0105] Embodiments of this application also provide another computer program product, including a non-volatile computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps in any of the above-described optimization method embodiments for model training.
[0106] Any of the components, modules, units, parts, methods, and operations described herein can be implemented using software, firmware, hardware (e.g., fixed logic circuitry), manual processing, or any combination thereof. Alternatively or additionally, any functionality described herein can be executed at least in part by one or more hardware logic components, such as, but not limited to, a central processing unit (CPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), an application-specific standard product (ASSP), a system-on-a-chip (SoC), a complex programmable logic device (CPLD), a microprocessor (MCU), etc. The terms "system," "computing device," or "apparatus" as used herein encompass various means, devices, and machines for processing data, including, for example, one or more programmable processors, computers, SoCs, or combinations thereof. The apparatus may also include code that creates an execution environment for the computer program in question, such as code constituting processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or one or more combinations thereof. The aforementioned computer program (also known as a program, software, software application, app, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and can be deployed in any form, including as a standalone program or as a module, component, subroutine, object, or other unit suitable for a computing environment.
[0107] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0108] The above provides a detailed description of the optimization method, electronic device, storage medium, and product for model training provided in this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the embodiments above are merely for the purpose of helping to understand the method and core ideas of this application. It should be noted that those skilled in the art can make various improvements and modifications to this application without departing from its principles, and these improvements and modifications also fall within the protection scope of the claims of this application.
Claims
1. An optimization method for model training, characterized in that it comprises: obtaining training data; inputting the training data into a hybrid expert model, and collecting statistics on the expert routing information of each processing layer among at least one processing layer of the hybrid expert model; analyzing, based on the expert routing information, the load of at least one expert network in each processing layer to determine an expert pruning index for each processing layer, wherein the load reflects the amount of data processed when the at least one expert network processes the training data; and pruning, according to the expert pruning index, the expert networks contained in each processing layer, so as to train the hybrid expert model based on the pruned processing layers and obtain a trained hybrid expert model.
2. The method according to claim 1, characterized in that the step of analyzing, based on the expert routing information, the load of at least one expert network in each processing layer to determine the expert pruning index for each processing layer comprises: obtaining, based on the expert routing information, at least one workload of the at least one expert network in each processing layer; and analyzing, based on the at least one workload, the load of the at least one expert network in each processing layer to determine the expert pruning index adapted to each processing layer.
3. The method according to claim 2, characterized in that the step of analyzing, based on the at least one workload, the load of at least one expert network in each processing layer to determine the expert pruning index adapted to each processing layer comprises: determining, based on the at least one workload, a load index characterizing the load of the at least one expert network in each processing layer; and determining, based on the load index and the at least one workload, the expert pruning index for each processing layer.
4. The method according to claim 2, characterized in that the step of analyzing, based on the at least one workload, the load of at least one expert network in each processing layer to determine the expert pruning index adapted to each processing layer comprises: determining, based on the at least one workload, a load index characterizing the load of the at least one expert network in each processing layer; determining, based on the load index and the hierarchical position of each processing layer in the hybrid expert model, a target processing type of each processing layer; and determining, based on a preset mapping relationship between processing types and expert pruning indexes, the expert pruning index corresponding to the target processing type, wherein different processing types correspond to different expert pruning indexes in the mapping relationship.
5. The method according to claim 3, characterized in that the expert pruning index includes a first expert pruning index, and the step of determining, based on the load index and the at least one workload, the expert pruning index for each processing layer comprises: determining a load index ratio between the average load and the load range in the load index; and determining the product of the load index ratio and a first global coefficient as the first expert pruning index.
6. The method according to claim 3, characterized in that the expert pruning index includes a second expert pruning index, and the step of determining, based on the load index and the at least one workload, the expert pruning index for each processing layer comprises: determining the product of a second global coefficient and the load coefficient of variation in the load index as the second expert pruning index.
7. The method according to claim 3 or 4, characterized in that the step of pruning, according to the expert pruning index, the expert networks contained in each processing layer comprises: performing, based on the expert pruning index and the at least one workload, load contribution analysis on the at least one expert network of each processing layer to identify an invalid expert network among the at least one expert network; and removing the invalid expert network from the at least one expert network.
8. The method according to claim 7, characterized in that the expert pruning index includes a first expert pruning index and a second expert pruning index, and the step of performing, based on the expert pruning index and the at least one workload, load contribution analysis on the at least one expert network of each processing layer to identify the invalid expert network among the at least one expert network comprises: performing, based on the first expert pruning index and the at least one workload, low-load analysis on the at least one expert network to determine a low-load expert network among the at least one expert network, and adding the low-load expert network to a first candidate sequence; performing, based on the second expert pruning index and the at least one workload, contribution analysis on the at least one expert network to determine a low-contribution expert network among the at least one expert network, and adding the low-contribution expert network to a second candidate sequence; and identifying an expert network that is present in both the first candidate sequence and the second candidate sequence as the invalid expert network.
9. The method according to claim 8, characterized in that the step of performing, based on the first expert pruning index and the at least one workload, low-load analysis on the at least one expert network to determine the low-load expert network among the at least one expert network comprises: determining, based on the average load in the load index and a first workload of a first expert network, a load ratio of the first expert network, wherein the first expert network is any one of the at least one expert network; and if the load ratio is less than the first expert pruning index, determining the first expert network to be the low-load expert network.
10. The method according to claim 8, characterized in that the step of performing, based on the second expert pruning index and the at least one workload, contribution analysis on the at least one expert network to determine the low-contribution expert network among the at least one expert network comprises: sorting the at least one expert network in ascending order based on the at least one workload to obtain an ascending sorting result; determining, based on at least one to-be-processed workload of at least one to-be-processed expert network located within a target position range in the ascending sorting result, a partial contribution value of the at least one to-be-processed expert network, wherein the target position range is determined according to a preset ratio; determining a total contribution value of the at least one expert network based on the at least one workload; determining, based on the partial contribution value and the total contribution value, a contribution ratio of the at least one to-be-processed expert network; and if the contribution ratio is less than the second expert pruning index, identifying the at least one to-be-processed expert network as at least one low-contribution expert network.
11. The method according to claim 2, characterized in that the step of obtaining, based on the expert routing information, at least one workload of at least one expert network in each processing layer comprises: determining, during training of the hybrid expert model, whether the hybrid expert model satisfies a load statistics condition, wherein the load statistics condition includes at least one of the following: the training duration of the hybrid expert model is greater than or equal to a preset duration, and the number of training steps of the hybrid expert model is greater than or equal to a preset number of steps; and when the hybrid expert model satisfies the load statistics condition, obtaining, based on a preset statistical period, the at least one workload of the at least one expert network in each processing layer.
12. The method according to claim 2, characterized in that the step of obtaining, based on the expert routing information, at least one workload of at least one expert network in each processing layer comprises: counting, based on the expert routing information and a preset statistical period, at least one token count of the at least one expert network; and determining the at least one token count as the at least one workload of the at least one expert network.
13. The method according to claim 1, characterized in that the step of training the hybrid expert model based on the pruned processing layers to obtain the trained hybrid expert model comprises: determining at least one effective expert network remaining in each processing layer after pruning; reallocating the at least one effective expert network to obtain the pruned processing layers; and training the hybrid expert model based on the pruned processing layers to obtain the trained hybrid expert model.
14. The method according to claim 13, characterized in that the reallocating of the at least one effective expert network comprises: sorting the at least one effective expert network in descending order based on at least one effective workload of the at least one effective expert network to obtain a descending sorting result; and sequentially assigning, based on a load balancing strategy and the descending sorting result, the at least one effective expert network after descending sorting to at least one expert parallel group, wherein the at least one expert parallel group is at least one expert set adapted to the hardware resource partitioning.
15. The method according to claim 14, characterized in that the step of sequentially assigning, based on the load balancing strategy and the descending sorting result, the at least one effective expert network after descending sorting to the at least one expert parallel group comprises: determining, based on the descending sorting result, a first effective expert network located in the first position after descending sorting; determining, among the at least one expert parallel group, a first expert parallel group that has the smallest current total workload and has not reached its capacity limit; assigning the first effective expert network to the first expert parallel group; and if the first expert parallel group reaches its capacity limit, suspending the allocation of effective expert networks to the first expert parallel group, the allocation continuing until all of the at least one effective expert network has been allocated.
16. An electronic device, characterized in that it comprises: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the optimization method for model training according to any one of claims 1 to 15.
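By way of illustration only, and without forming part of or limiting the claims, the pruning steps recited in claims 5 to 10 and the reallocation steps recited in claims 14 and 15 can be sketched in Python as follows. The sketch rests on several assumptions that the claims leave open: the function names and default values (`alpha`, `beta`, `bottom_ratio`) are hypothetical; the load index ratio is read as average load divided by load range; the load ratio of an expert is read as its workload divided by the average load; the coefficient of variation uses the population standard deviation; workloads are non-negative token counts; and the capacity limit of an expert parallel group is read as a maximum number of experts per group.

```python
from statistics import mean, pstdev


def pruning_indices(loads, alpha=0.5, beta=0.5):
    """Return (first_index, second_index) for one processing layer.

    alpha and beta stand in for the first and second global coefficients;
    their default values are illustrative assumptions.
    """
    avg = mean(loads)
    load_range = max(loads) - min(loads)
    if load_range == 0:
        # Perfectly balanced layer: zero thresholds, so nothing is pruned.
        return 0.0, 0.0
    cv = pstdev(loads) / avg  # coefficient of variation (population form)
    # Assumption: the load index ratio is average load / load range.
    return alpha * (avg / load_range), beta * cv


def find_invalid_experts(loads, alpha=0.5, beta=0.5, bottom_ratio=0.25):
    """Return the sorted ids of experts that are both low-load and low-contribution."""
    first_idx, second_idx = pruning_indices(loads, alpha, beta)
    avg = mean(loads)

    # First candidate sequence: load ratio (workload / average) below the first index.
    low_load = {e for e, w in enumerate(loads) if avg > 0 and w / avg < first_idx}

    # Second candidate sequence: the bottom `bottom_ratio` share of experts in
    # ascending order, kept only if their joint share of all tokens is below
    # the second index.
    order = sorted(range(len(loads)), key=lambda e: loads[e])
    bottom = order[: int(len(loads) * bottom_ratio)]
    total = sum(loads)
    partial = sum(loads[e] for e in bottom)
    low_contrib = set(bottom) if total > 0 and partial / total < second_idx else set()

    # An expert is invalid only if it appears in both candidate sequences.
    return sorted(low_load & low_contrib)


def assign_to_groups(effective_loads, num_groups, capacity):
    """Greedily place experts, heaviest first, into the lightest non-full group.

    effective_loads maps expert id -> workload; capacity is the assumed
    maximum number of experts per expert parallel group.
    """
    groups = [[] for _ in range(num_groups)]
    totals = [0] * num_groups
    for expert in sorted(effective_loads, key=effective_loads.get, reverse=True):
        # Groups that reached capacity are skipped for the rest of the pass.
        open_groups = [g for g in range(num_groups) if len(groups[g]) < capacity]
        if not open_groups:
            raise ValueError("not enough group capacity for the remaining experts")
        target = min(open_groups, key=lambda g: totals[g])
        groups[target].append(expert)
        totals[target] += effective_loads[expert]
    return groups, totals


# Hypothetical token counts for the eight experts of one processing layer.
loads = [1000, 900, 800, 700, 20, 10, 5, 5]
invalid = find_invalid_experts(loads)
effective = {e: w for e, w in enumerate(loads) if e not in invalid}
groups, totals = assign_to_groups(effective, num_groups=2, capacity=3)
print(invalid, groups, totals)
```

With these hypothetical token counts, experts 6 and 7 fall in both candidate sequences and are removed, and the six remaining experts are split into two groups whose total workloads are 1720 and 1710 tokens. Requiring membership in both candidate sequences, as in claim 8, keeps an expert that is lightly loaded only relative to a layer-wide threshold from being pruned unless its share of the total workload is also small.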