A compression method, device, and medium for a sparse activation hybrid expert model

By acquiring the expert activation information of the sparse activation hybrid expert model, calculating the importance evaluation value, and removing redundant expert subnetworks, the problems of high memory consumption and low computational resource utilization of the SMoE model on resource-constrained devices are solved, achieving efficient compression and performance preservation in multimodal scenarios.

CN122242613A · Pending · Publication Date: 2026-06-19 · ZHEJIANG MEIRI HUDONG NETWORK TECH CO LTD +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: ZHEJIANG MEIRI HUDONG NETWORK TECH CO LTD
Filing Date: 2026-04-15
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing Sparse Activation Hybrid Expert Models (SMoE) face challenges such as high memory consumption, low computational resource utilization, and insufficient adaptability to multimodal scenarios when deployed on resource-constrained devices. Existing compression schemes struggle to avoid functional subspace collapse and inaccurate identification of redundant modules in generative tasks, failing to achieve a balance between high compression ratio and performance preservation.

Method used

By acquiring expert activation information, calculating expert importance evaluation values, identifying and removing low-importance expert subnetworks, and differentiating modality configuration weights in multimodal scenarios, experts are reorganized to form a compressed model.

Benefits of technology

It accurately identifies redundant experts, maintains the core performance of the model, adapts to multimodal heterogeneous scenarios, simplifies the compression process, reduces deployment costs, maintains the flexibility of the model architecture, and adapts to the deployment needs of resource-constrained devices.


Figure CN122242613A_ABST

Abstract

This invention discloses a compression method, device, and medium for sparse activation hybrid expert models. The method includes: acquiring expert activation information of the model to be compressed by embedding a monitoring hook mechanism during forward inference, while distinguishing tokens of different modalities or input roles and recording their associated information; calculating an expert importance evaluation value reflecting the strength of routing decisions and output responses for each expert subnetwork based on the activation information; identifying and removing redundant expert subnetworks based on the evaluation value; and reorganizing the retained expert subnetworks to form the compressed model. This invention can be deployed directly without fine-tuning, retains core performance at high compression ratios, and optimizes routing stability through temperature scaling, providing an efficient solution for deploying large-scale models on resource-constrained devices.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, and in particular to a compression method, apparatus and medium for a sparse activation hybrid expert model.

Background Technology

[0002] With the rapid development of artificial intelligence technology, large-scale language models and multimodal visual language models have demonstrated outstanding performance in tasks such as natural language generation, visual question answering, and image description. Among them, the Sparse Activation Hybrid Expert (SMoE) model, also known as a sparsely activated mixture-of-experts model, uses a dynamic router to activate only a small number of expert subnetworks for each input. This core mechanism decouples the model parameter scale from the inference computation overhead: the large parameter count preserves the model's expressive power, while sparse activation keeps the real-time computation cost under control. SMoE has been widely used in the architectural design of pure text large language models (LLM) and multimodal visual language models (VLM).

[0003] Although the SMoE model effectively reduces the computational overhead of the inference phase, the numerous expert subnetworks it contains still require a significant amount of memory. This severely limits its deployment and application in resource-constrained scenarios such as edge devices and embedded devices. Furthermore, due to the characteristics of the input data and the model architecture design, the usage frequency of each expert subnetwork is extremely uneven, with some subnetworks remaining in a low-activation state for extended periods. This results in low utilization of hardware accelerator computing resources, further exacerbating the deployment challenges of the SMoE model.

[0004] To address these issues, researchers have proposed two core expert subnetwork compression schemes: one is the expert subnetwork merging scheme, which achieves resource sharing by fusing the weight parameters of multiple expert subnetworks, thereby reducing the total number of parameters; the other is the expert subnetwork pruning scheme, which directly removes redundant expert subnetworks that have little impact on the model output. Existing research shows that in discriminative tasks (such as text classification and multiple-choice questions), the expert subnetwork merging scheme outperforms the pruning scheme in compression. However, in generative tasks such as code generation, mathematical reasoning, and visual question answering, the performance differences and adaptability of the two schemes lack systematic demonstration and comparison, and existing schemes all have significant technical limitations.

[0005] For expert subnetwork merging schemes, functional subspace collapse is prone to occur in generative tasks. The merging operation causes the router to lose its independent, input-dependent control over the original expert subnetwork, making it unable to dynamically adjust the activation strategy of the expert subnetwork according to different inputs. This leads to unavoidable inference errors, severely affecting the output quality of the generative task. Existing expert subnetwork pruning schemes use pruning criteria (such as activation frequency and single activation norm) that only evaluate the importance of expert subnetworks from a single dimension. They fail to jointly consider the decision strength of routing gating and the output response strength of expert subnetworks, making it impossible to accurately quantify the actual contribution of expert subnetworks to the model inference results. It is also difficult to form a scientific threshold judgment basis, which can easily lead to the accidental deletion of useful functional modules or the retention of redundant modules, making it impossible to achieve a balance between high compression ratio and performance preservation.

[0006] For multimodal visual language models (VLMs), the limitations of existing compression schemes are even more pronounced. In multimodal models, the semantic characteristics and expressions of different modalities and input roles, such as image tokens, user text tokens, and model-generated tokens, differ significantly, and the activation patterns of the expert subnetworks they trigger also exhibit obvious heterogeneity. Existing compression schemes adopt a uniform compression strategy, which lacks differentiated consideration of the activation contributions of different modalities and input roles, and also lacks corresponding weight configuration mechanisms to adapt to mixed input scenarios. Furthermore, they ignore the essential differences between visual encoders and language models in the usage patterns of expert subnetworks, resulting in compressed models that struggle to balance image understanding and text generation capabilities, and are unable to adapt to the diverse needs of multimodal tasks.

[0007] In summary, existing SMoE model compression schemes have shortcomings in terms of adaptability to generative tasks, compatibility with multimodal scenarios, and the balance between compression accuracy and performance. There is an urgent need for an efficient compression method that can accurately identify redundant modules, avoid functional subspace collapse, and adapt to multimodal heterogeneous characteristics to meet the deployment requirements of large-scale SMoE models on resource-constrained devices.

Summary of the Invention

[0008] To address the aforementioned technical problems, the technical solution adopted by this invention is as follows:

[0009] According to a first aspect of the present invention, a compression method for a sparse activation hybrid expert model is provided, comprising the following steps:

[0010] Obtain expert activation information of the model to be compressed during the inference process, including routing thresholds and the output vectors of the corresponding expert subnetworks.

[0011] Based on the expert activation information, an expert importance evaluation value is calculated for each expert subnetwork. This expert importance evaluation value is used to reflect the routing decision strength and output response strength when the expert subnetwork is activated.

[0012] Based on the expert importance evaluation value, identify and remove expert subnetworks whose expert importance evaluation value is less than a preset threshold;

[0013] The expert subnetworks retained after the removal are reorganized to form a compressed model.

[0014] When the model to be compressed is a multimodal visual language model, before calculating the expert importance evaluation value, the method further includes: distinguishing the activation modes of different modalities or input roles, configuring weights according to modality to calculate a weighted expert importance evaluation value, and performing the removal and recombination operation based on the weighted expert importance evaluation value.
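The removal and reorganization steps above can be illustrated with a minimal sketch. It assumes the importance values have already been computed, and it treats "reorganization" as renumbering the retained experts and keeping only their columns of the router weight matrix. The function name `prune_and_reorganize`, its arguments, and this interpretation of reorganization are hypothetical illustrations, not the procedure defined by the invention. The check that at least K experts remain follows the constraint on the Top-K value stated in [0034].

```python
def prune_and_reorganize(scores, router_weight, threshold, top_k):
    """Drop experts scoring below `threshold` and rebuild the router.

    scores:        importance value per expert, indexed by original expert id
    router_weight: d x E matrix (list of d rows, one column per expert)
    threshold:     preset importance threshold
    top_k:         number of experts the router activates per token
    """
    # Keep experts whose importance is NOT less than the preset threshold.
    kept = [e for e, s in enumerate(scores) if s >= threshold]
    # Assumption: the router still needs at least K experts to choose from.
    if len(kept) < top_k:
        raise ValueError("fewer experts remain than the Top-K value")
    # Map original expert ids to contiguous ids in the compressed model.
    old_to_new = {old: new for new, old in enumerate(kept)}
    # Keep only the router columns that belong to retained experts.
    new_router = [[row[e] for e in kept] for row in router_weight]
    return kept, old_to_new, new_router


kept, old_to_new, new_router = prune_and_reorganize(
    scores=[0.9, 0.05, 0.4, 0.01],
    router_weight=[[1, 2, 3, 4], [5, 6, 7, 8]],
    threshold=0.1,
    top_k=2,
)
```

Because the retained experts keep their own router columns, the router continues to score each of them independently, which is the property the removal-based approach is meant to preserve.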

[0015] According to a second aspect of the present invention, an electronic device is provided, including a processor and a memory; the processor executes the steps of the method described in the first aspect of the present invention by invoking a program or instructions stored in the memory.

[0016] According to a third aspect of the present invention, a computer-readable storage medium is provided that stores a program or instructions that cause a computer to perform the steps of the method described in the first aspect of the present invention.

[0017] The present invention has at least the following beneficial effects:

[0018] 1. Accurately locate redundant experts to ensure core model performance: By capturing the core activation information of routing thresholds and expert subnetwork output vectors, an expert importance evaluation system that reflects the strength of routing decisions and output response can be constructed. This system can accurately identify redundant experts who contribute very little to the model output. Even after removing redundant experts, the core reasoning and generation capabilities of the model can still be preserved, avoiding significant performance degradation.

[0019] 2. Adapt to multimodal heterogeneous scenarios and optimize task adaptability: For multimodal visual language models, by distinguishing the activation modes of different modalities or input roles and configuring modal weights, the expert importance assessment is more in line with the characteristics of cross-modal data. It can balance image understanding and text generation capabilities, solve the problem of insufficient adaptation of traditional compression methods to multimodal scenarios, and broaden the scope of application of the method.

[0020] 3. Simplify the compression process and reduce deployment costs: The entire compression process is driven by inference activation data and requires no additional fine-tuning steps. Through a standardized process of information acquisition, importance calculation, removal and reorganization, a compressed model can be obtained quickly, which greatly reduces the deployment and debugging costs after compression and adapts to the needs of large-scale model deployment on resource-constrained devices.

[0021] 4. Maintain model architecture flexibility: Adopt an expert removal and reorganization strategy instead of expert merging, retain the router's independent control over the remaining expert sub-networks, maintain the dynamic mixing strategy of input dependence, and ensure that the compressed model can still adaptively adjust the expert activation logic according to different inputs, avoiding the problem of reduced inference flexibility caused by architecture rigidity.

[0022] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of the present invention, nor is it intended to limit the scope of the invention. Other features of the invention will become readily apparent from the following description.

Attached Figure Description

[0023] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0024] Figure 1 is a flowchart of a compression method for a sparse activation hybrid expert model provided in an embodiment of the invention.

Detailed Implementation

[0025] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0026] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. The terminology used herein in the description of this invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and / or" as used herein includes any and all combinations of one or more of the associated listed items.

[0027] It should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although the flowcharts describe the steps as sequential processes, many of these steps can be performed in parallel, concurrently, or simultaneously. Furthermore, the order of the steps can be rearranged. A process can be terminated when its operation is complete, but it may also have additional steps not included in the figures. A process can correspond to a method, function, procedure, subroutine, subprogram, etc.

[0028] This invention provides a compression method for sparse activation hybrid expert models. As shown in Figure 1, the method includes the following steps:

[0029] S100: Obtain expert activation information of the model to be compressed during the inference process, including routing threshold and output vector of the corresponding expert subnetwork.

[0030] Specifically, a calibration dataset adapted to the type of model to be compressed is first prepared. The model to be compressed is a sparse activation hybrid expert (SMoE) model, including a pure text generative large language model (LLM) or a multimodal visual language model (VLM). If it is a pure text LLM, the calibration dataset is selected from general text corpora or domain-specific datasets that match the application scenario of the model (such as code generation corpora or mathematical reasoning text). If it is a multimodal VLM, the calibration dataset is selected from multimodal data containing image-text pairs (such as visual question answering samples or image description samples) to cover diverse input scenarios such as image tokens, user text tokens, and model-generated tokens.

[0031] Subsequently, the calibration dataset is input into the model to be compressed for forward inference. By embedding a monitoring hook mechanism during the forward inference process, tokens of different modalities (images, text) or roles (user input, model generation) flowing through the router are captured in real time and accurately distinguished. A separate record link is established for each type of token, and the corresponding routing threshold and expert subnetwork output are associated and stored separately. The monitoring hook mechanism is configured at the output of the model's router module and the output of each expert subnetwork. A non-intrusive design is used to achieve interference-free collection of intermediate inference data, ensuring both the authenticity and completeness of the acquired information, and enabling precise binding of different token types with activation information through categorized recording. Specifically, the collected information is divided into two categories:

[0032] One type is the routing threshold, that is, the gate weight (gate value) that the model's router module outputs for each expert subnetwork. It reflects the router's activation preference and decision strength for the corresponding expert subnetwork. The router determines the set of expert subnetworks to activate for the current input by performing Top-K filtering on these values. Therefore, during data acquisition, whether each value falls within the Top-K range must be recorded at the same time, to provide a basis for the subsequent expert importance evaluation. The value is calculated as follows: the router module first performs feature mapping on the input, transforming it through a linear layer into a logits vector (log odds vector) whose dimension equals the number of expert subnetworks. This feature mapping is the prerequisite for generating the gating weights, because the gating weights are derived directly from the logits. The linear layer acts as a bridge that adapts the input features to the functions of the expert subnetworks: its weight matrix has the dimension "input feature dimension × number of expert subnetworks", so the semantic / feature information of the original input is translated, through weighted summation, into a raw fitness score for each expert subnetwork. Each dimension of the logits vector corresponds to the fitness score of one expert, which maps the input from the data semantic space to the expert fitness space and provides the raw basis for the subsequent generation of gating weights.

[0033] The logits vector is then subjected to softmax normalization to obtain the gate value for each expert subnetwork. Each gate value lies in the interval (0, 1), and the gate values of all expert subnetworks sum to 1. The gate value reflects the router's activation preference and decision strength for each expert subnetwork: the larger the value, the more the router tends to select that expert subnetwork to process the input.
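The gate computation described in [0032] and [0033] (linear mapping to logits, softmax normalization, Top-K selection) can be sketched as follows. This is a minimal illustration in plain Python, not the router of any particular model; the function name `router_gates` and the toy weight matrix are assumptions introduced here.

```python
import math


def router_gates(x, weight, k):
    """Compute gate values and the Top-K expert set for one token.

    x:      input feature vector of length d
    weight: d x E matrix (list of d rows, one column per expert)
    k:      number of experts activated per token
    """
    num_experts = len(weight[0])
    # Linear layer: one logit (raw fitness score) per expert.
    logits = [sum(x[i] * weight[i][e] for i in range(len(x)))
              for e in range(num_experts)]
    # Softmax, shifted by the maximum logit for numerical stability.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    gates = [v / total for v in exps]
    # Top-K filtering: indices of the k largest gate values.
    order = sorted(range(num_experts), key=lambda e: gates[e], reverse=True)
    return gates, sorted(order[:k])


# Toy example: 2 input features, 3 experts, activate 2 of them.
gates, active = router_gates(
    [1.0, 2.0],
    [[0.5, -0.5, 0.0],
     [0.1, 0.3, -0.2]],
    k=2,
)
```

In this toy example the logits are 0.7, 0.1 and -0.4, so experts 0 and 1 fall within the Top-K range and expert 2 does not.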

[0034] In this invention, the K value in Top-K is the number of expert subnetworks activated in each round of inference, and it is preset by the model. How K is determined depends on the model architecture and application scenario: it can be taken from the original design parameters of the model to be compressed (such as a fixed K value configured during the model training stage, commonly 2, 4, 8, etc.), or it can be adjusted according to hardware resource constraints and inference speed requirements. When K is adjusted, it must be less than the total number of expert subnetworks and must match the target number of experts after pruning, so that K never exceeds the number of experts remaining after pruning.

[0035] Another type is the output vector of the expert subnetwork. The expert subnetwork mentioned here is a functional submodule in the Sparse Activation Hybrid Expert (SMoE) model that possesses independent feature processing capabilities and weight parameters. Multiple expert subnetworks are deployed in parallel, and the router module dynamically selects some subnetworks to activate based on the input data. Unactivated expert subnetworks do not participate in the current inference computation, thus decoupling the model parameter scale and computational overhead. The output vector of the expert subnetwork is a high-dimensional feature vector generated by the expert subnetwork activated by the router after performing dedicated feature extraction, transformation, and output processing on the input data. Specifically, the generation logic is as follows: after the input data is allocated to the corresponding expert subnetwork by the router, the expert uses a neural network structure (such as fully connected layers, attention layers, convolutional layers, etc.) adapted to the model architecture to refine the input features, mining deep semantic or feature information, and finally outputting a feature vector. The magnitude of this output vector can be quantified by the norm, accurately reflecting the output response strength of the expert subnetwork to the current input, providing core data support for expert importance assessment. Specifically, the L2 norm (Euclidean norm) is used. The reason for choosing the L2 norm is that it can effectively reflect the overall magnitude of the vector, accurately quantify the output response intensity of the expert subnetwork, and has moderate computational complexity, which is suitable for the calibration inference efficiency of large-scale models. 
During data collection, each output vector must be stored together with its routing threshold and indexed along three dimensions (token type, model level, and expert subnetwork number), so that the router decision results and the expert subnetwork output contributions corresponding to different tokens in the same inference process can be accurately matched later.
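The three-dimensional record keeping described above can be sketched with a small in-memory store. The class `ActivationRecorder` and its method names are hypothetical. In a deep learning framework the `record` call would typically be issued from hooks attached to the router output and to each expert output (for example, forward hooks in PyTorch), which is outside the scope of this sketch.

```python
import math
from collections import defaultdict


class ActivationRecorder:
    """Hypothetical store keyed by (token_type, layer, expert_id)."""

    def __init__(self):
        self.records = defaultdict(list)

    def record(self, token_type, layer, expert_id, gate, in_top_k, output):
        # Store the gate value, the Top-K flag, and the L2 norm of the output.
        norm = math.sqrt(sum(v * v for v in output))
        self.records[(token_type, layer, expert_id)].append(
            {"gate": gate, "in_top_k": in_top_k, "norm": norm}
        )

    def activated(self, token_type, layer, expert_id):
        # Only samples where the expert was inside the Top-K range.
        key = (token_type, layer, expert_id)
        return [r for r in self.records.get(key, []) if r["in_top_k"]]


rec = ActivationRecorder()
rec.record("image", 0, 3, gate=0.6, in_top_k=True, output=[3.0, 4.0])
rec.record("image", 0, 3, gate=0.1, in_top_k=False, output=[0.0, 0.0])
hits = rec.activated("image", layer=0, expert_id=3)
```

Storing the norm rather than the full output vector is a design choice of this sketch: the importance formulas only need the norm, and this keeps the memory footprint of calibration small.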

[0036] After data collection, the acquired expert activation information is preprocessed to remove abnormal data (such as invalid vectors generated by inference interruption and abnormal fluctuations in gating values). A structured dataset is then constructed, organized by input sample, model level, and expert subnetwork number, to provide standardized data support for the calculation of expert importance evaluation values in subsequent steps.
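A minimal sketch of the preprocessing step is given below. The source does not define what counts as abnormal data, so the two criteria used here (non-finite numbers, and gate values outside the softmax range [0, 1]) are assumptions chosen for illustration, as is the function name `clean_records`.

```python
import math


def clean_records(records):
    """Drop records that look invalid.

    records: list of dicts with 'gate' (float) and 'output' (list of floats).
    """
    cleaned = []
    for r in records:
        values = [r["gate"], *r["output"]]
        # Assumption: NaN or infinite values indicate an interrupted inference.
        if not all(math.isfinite(v) for v in values):
            continue
        # Assumption: a softmax gate value must lie within [0, 1].
        if not (0.0 <= r["gate"] <= 1.0):
            continue
        cleaned.append(r)
    return cleaned


raw = [
    {"gate": 0.4, "output": [1.0, 2.0]},
    {"gate": float("nan"), "output": [1.0]},
    {"gate": 0.3, "output": [float("inf")]},
    {"gate": 1.7, "output": [0.5]},
]
valid = clean_records(raw)
```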

[0037] S200, based on the expert activation information, calculate an expert importance evaluation value for each expert subnetwork. This expert importance evaluation value is used to reflect the routing decision strength and output response strength when the expert subnetwork is activated.

[0038] Specifically, based on the structured expert activation information collected and preprocessed by S100 (routing gates and expert subnetwork output vectors stored in association with token type, model level, and expert subnetwork number), an importance evaluation value is independently calculated for each expert subnetwork. Two illustrative calculation methods are provided, and the appropriate solution can be selected according to the model type and deployment requirements:

[0039] I. Calculation Method for Expert Importance Assessment Value

[0040] Method 1: The expected product of the routing threshold and the square of the output vector norm

[0041] For each expert subnetwork, the importance evaluation value S is calculated using the formula S = E[g(x)·‖f(x)‖²]. The parameters and their calculation logic are defined as follows:

[0042] 1. Parameter Description: g(x) is the routing threshold when the expert subnetwork is activated (i.e., the routing threshold is within the Top-K range), reflecting the router's decision strength for the expert subnetwork. A larger value indicates that the router is more inclined to choose the expert subnetwork to process input x; f(x) is the feature vector output after the expert subnetwork is activated; ‖f(x)‖² represents the squared L2 norm of the feature vector; E[·] represents the expected value over all samples in the calibration dataset that activate the expert subnetwork, that is, the average of the "product of the routing gate and the squared norm of the output vector" over all valid samples, which eliminates the random influence of a single sample.

[0043] 2. Calculation steps: First, iterate through all samples in the calibration dataset and select the subset of samples in which the expert subnetwork is activated; for each sample in this subset, calculate the product of g(x) and ‖f(x)‖²; then sum all the products and divide by the number of samples in the subset to obtain S for the expert subnetwork.
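The calculation steps of Method 1 can be sketched in a few lines. The function name `importance_method1` and the input layout are hypothetical. Returning 0.0 for an expert that was never activated is an assumption of this sketch, since the source does not state how that case is handled.

```python
def importance_method1(samples):
    """Mean of gate * squared L2 norm over the samples that activated the expert.

    samples: list of (gate, output_vector) pairs, one per token for which
             this expert was within the Top-K range.
    """
    if not samples:
        return 0.0  # assumption: a never-activated expert scores zero
    total = 0.0
    for gate, output in samples:
        squared_norm = sum(v * v for v in output)
        total += gate * squared_norm
    return total / len(samples)


# Two activating samples: 0.5 * 25 = 12.5 and 0.25 * 4 = 1.0, mean 6.75.
score = importance_method1([(0.5, [3.0, 4.0]), (0.25, [0.0, 2.0])])
```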

[0044] Method 2: Normalized routing threshold weighted output vector norm expectation

[0045] To further suppress the interference of extreme routing thresholds on the evaluation results and improve the stability of importance ranking, a normalized calculation method can be adopted. The importance evaluation value S of each expert subnetwork is calculated as follows: S = E[(g(x) / Σg(x))·‖f(x)‖], where the definitions and optimization logic of each parameter are as follows:

[0046] 1. Parameter description: The denominator Σg(x) represents the sum of the routing gate values of the Top-K activated expert subnetworks corresponding to the current input x. g(x) is normalized to convert the routing decision strength into a relative proportion, avoiding the dominance of a single high-gated sample in the evaluation result; ‖f(x)‖ is the L2 norm of the output vector (compared to the squared term, it can reduce the weight of extreme output vectors, which is suitable for scenarios with high requirements for evaluation stability); E[·] also represents the mathematical expectation over the samples that activate the expert subnetwork.

[0047] 2. Calculation steps: Traverse all samples in the calibration dataset and filter the subset of samples in which the expert subnetwork is activated; for each sample, first calculate the sum of the routing thresholds of the current Top-K activated experts, then divide the routing thresholds of the experts in the sample by the sum to obtain the normalized routing decision strength; multiply the normalized result by the L2 norm of the output vector; take the average of the product results of all samples to obtain S.
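Method 2 differs from Method 1 in two places: the gate is divided by the sum of the Top-K gates for the same input, and the norm is not squared. A minimal sketch follows. The function name `importance_method2` and the input layout are hypothetical, and returning 0.0 for a never-activated expert is again an assumption.

```python
import math


def importance_method2(samples):
    """Mean of (gate / sum of Top-K gates) * L2 norm over activating samples.

    samples: list of (gate, top_k_gates, output_vector) triples, where
             top_k_gates holds the gate values of all experts activated
             for the same token (including this expert's own gate).
    """
    if not samples:
        return 0.0  # assumption: a never-activated expert scores zero
    total = 0.0
    for gate, top_k_gates, output in samples:
        normalized_gate = gate / sum(top_k_gates)
        norm = math.sqrt(sum(v * v for v in output))
        total += normalized_gate * norm
    return total / len(samples)


# Sample 1: (0.6 / 0.8) * 5 = 3.75; sample 2: (0.1 / 0.4) * 2 = 0.5; mean 2.125.
score = importance_method2([
    (0.6, [0.6, 0.2], [3.0, 4.0]),
    (0.1, [0.3, 0.1], [0.0, 2.0]),
])
```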

[0048] Two calculation methods can be flexibly selected: Method 1 has higher calculation efficiency and is suitable for scenarios requiring rapid calibration of large-scale models; Method 2 has stronger evaluation stability and is suitable for scenarios requiring high accuracy in the identification of redundant experts.

[0049] II. Weighted Processing Flow of Multimodal Visual Language Model

[0050] When the model to be compressed is a multimodal visual language model (VLM), the expert activation information collected by S100 needs to be classified according to modality / input role first, and then the weighted expert importance evaluation value is calculated based on the classification results. The specific process is as follows:

[0051] 1. Activation Mode Differentiation: Based on the token types recorded by the monitoring hook mechanism in step S100, the expert activation information corresponding to three types of activation modes is accurately distinguished and classified as follows: activation information corresponding to image feature tokens (feature units generated after image processing by the visual encoder), activation information corresponding to user input text tokens (original text encoding units of user-initiated requests), and activation information corresponding to model-generated text tokens (response text encoding units generated by the model in response to user requests). Each type of activation information is independently associated with the corresponding routing threshold, expert subnetwork output vector, and activation sample identifier to ensure no overlap in classification and data traceability.

[0052] The activation pattern here refers to the inherent rules and feature set of tokens of different modalities (image features, text) or different input roles (user input, model generation) that trigger the activation of expert subnetworks when flowing through the router. It includes three core dimensions, which can accurately distinguish the differentiated impact of various tokens on expert activation, providing a core basis for subsequent weighted evaluation:

[0053] (1) Activation of related features: that is, which level and type of expert subnetworks a certain type of token tends to activate (e.g., image feature tokens often activate experts with strong visual feature processing capabilities, and model-generated text tokens often activate experts with text fluency optimization capabilities), which directly determines the functional compatibility between various tokens and expert subnetworks.

[0054] (2) Activation intensity distribution: The distribution pattern of routing thresholds corresponding to various tokens (such as the mean, variance, and Top-K activation ratio of gate weights) and the response amplitude characteristics of the expert subnetwork output vector are the core indicators for quantifying the influence of tokens on expert activation.

[0055] (3) Activation timing / frequency characteristics: In continuous reasoning or multi-round tasks, the frequency, interval and co-activation relationship with which various tokens activate expert subnetworks (for example, whether the model keeps activating similar experts after a user input text token has triggered expert activation) can correct the evaluation bias caused by a single activation sample.

[0056] In short, activation patterns are the corresponding rules between various tokens and the activation behaviors of expert subnetworks. Accurate differentiation and feature extraction of these patterns are the core prerequisites for weighted evaluation of expert importance and ensuring the accuracy of evaluation in multimodal scenarios.

[0057] 2. Modal weight configuration: Configure independent weights w1 (image feature token weight), w2 (user input text token weight), and w3 (model generated text token weight) for the three types of tokens respectively. The weights satisfy the constraints: w1≥w2, w1≥w3, and w1+w2+w3=1 (weight normalization processing to ensure that the magnitude of the evaluation value is controllable).

[0058] 3. Adaptive Weight Adjustment: Modal weights are dynamically adjusted based on the task characteristics of the target deployment scenario. The specific adaptation rules are as follows:

[0059] (1) Image-intensive understanding tasks (such as visual question answering, image semantic segmentation, and image-text matching): The core requirement is to ensure the accuracy of image feature processing. At this time, the weight of image feature tokens will be increased. For example, w1 will be adjusted to the range of 0.5 to 0.7, and the remaining weights will be allocated to w2 and w3 (e.g., w1=0.6, w2=0.2, w3=0.2) to strengthen the contribution of image feature tokens to expert importance assessment.

[0060] (2) Dialogue-intensive generation tasks (such as multi-turn text-image dialogue, image-based creative text generation): The core requirement is to ensure the coherence of text generation. At this time, the weight of the text token generated by the model is increased. For example, w1 is appropriately reduced to the range of 0.4 to 0.5, and w3 is increased to the range of 0.3 to 0.4 (e.g., w1=0.45, w2=0.15, w3=0.4), which is suitable for the expert activation mode of dialogue generation scenarios.
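The weight constraints in [0057] and the example configurations in [0059] and [0060] can be checked with a small helper. The function name `validate_modal_weights`, the preset names, and the non-negativity check are assumptions of this sketch; the source only states w1≥w2, w1≥w3 and w1+w2+w3=1.

```python
def validate_modal_weights(w1, w2, w3, tol=1e-9):
    """Check the modal weight constraints and return the weights as a tuple.

    w1: image feature token weight
    w2: user input text token weight
    w3: model-generated text token weight
    """
    if min(w1, w2, w3) < 0:
        # Assumption: weights are non-negative (not stated in the source).
        raise ValueError("weights must be non-negative")
    if abs(w1 + w2 + w3 - 1.0) > tol:
        raise ValueError("weights must sum to 1")
    if w1 < w2 or w1 < w3:
        raise ValueError("w1 must be at least as large as w2 and w3")
    return (w1, w2, w3)


# Example values taken from the two task types described above.
PRESETS = {
    "image_intensive": (0.6, 0.2, 0.2),
    "dialogue_intensive": (0.45, 0.15, 0.4),
}
image_weights = validate_modal_weights(*PRESETS["image_intensive"])
dialogue_weights = validate_modal_weights(*PRESETS["dialogue_intensive"])
```

The sum is compared with a small tolerance because decimal weights such as 0.45 and 0.15 are not exactly representable in binary floating point.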

[0061] 4. Weighted importance calculation:

[0062] To address the varying accuracy and efficiency requirements of multimodal scenarios, two implementations of the weighted importance calculation are provided. Both start from the independent importance evaluation values S1 (corresponding to image feature tokens), S2 (corresponding to user input text tokens), and S3 (corresponding to model-generated text tokens) under the three token activation modes and combine them with the modal weights to complete the calculation. Either scheme can then be selected according to actual needs to perform the identification and removal of expert subnetworks.

[0063] (1) Basic weighted implementation method

[0064] This method is suitable for scenarios with high requirements for evaluation efficiency and moderate scenario complexity. The core adopts linear weighted logic, which is directly related to the modal weight configuration rules mentioned above. The calculation process is simple and easy to implement in engineering.

[0065] Specifically, for each expert subnetwork, first calculate its independent importance evaluation values S1, S2, and S3 under the three token activation modes using any of the calculation methods above; then compute the weighted sum with the modal weights to obtain the final weighted expert importance evaluation value: S_weighted = w1×S1 + w2×S2 + w3×S3.
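
As an illustration of the linear weighting, the sketch below computes S_weighted for two experts. The expert names and the per-mode scores are made-up sample values, not data from this invention; only the formula itself comes from the text.

```python
def basic_weighted_importance(scores, weights):
    """Return w1*S1 + w2*S2 + w3*S3 for one expert.

    scores  -- (S1, S2, S3): per-mode importance values of the expert
    weights -- (w1, w2, w3): modal weights
    """
    if len(scores) != 3 or len(weights) != 3:
        raise ValueError("expected three scores and three weights")
    return sum(w * s for w, s in zip(weights, scores))

# Hypothetical per-mode scores for two experts (illustrative values only).
per_mode = {
    "expert_0": (0.8, 0.1, 0.3),
    "expert_1": (0.2, 0.6, 0.5),
}
weights = (0.6, 0.2, 0.2)  # image-intensive example from the text

weighted = {name: basic_weighted_importance(s, weights)
            for name, s in per_mode.items()}
```

With image-heavy weights, expert_0 (strong on image feature tokens) scores higher than expert_1 even though expert_1 is stronger on both text modes.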

[0066] (2) Improved weighted implementation method

[0067] This approach is suitable for scenarios with high requirements for evaluation accuracy and strong heterogeneity of multimodal data. Based on the basic weighting logic, it integrates the core features of the activation mode (activation frequency and stability) and the characteristics of the model hierarchy mentioned above, and introduces multi-dimensional correction factors to avoid the limitations of single weighting.

[0068] Specifically, the independent importance evaluation values S1, S2, and S3 are first obtained using the same calculation method as in the basic weighted implementation. Three dimensions are then introduced (activation frequency correction, hierarchical weight adaptation, and activation stability constraints) to construct an improved formula, which is combined with the modal weights to calculate the final weighted value and achieve a more accurate assessment of expert value. The improved formula is as follows:

[0069] .

[0070] Here, S_j^(weighted-advanced) is the final improved weighted importance evaluation value of the j-th expert subnetwork, where j ranges from 1 to n and n is the total number of expert subnetworks; w_i is the modal weight of the i-th type of token, where i takes the values 1, 2, and 3, corresponding to image feature tokens, user input text tokens, and model-generated text tokens, respectively; N_ji is the total frequency with which the j-th expert is activated by the i-th type of token; N_ri is the total frequency with which the r-th expert is activated by the i-th type of token, where r ranges from 1 to n; S_ji is the independent importance evaluation value of the j-th expert subnetwork under the i-th token activation mode (that is, S1, S2, and S3); α is a dynamic correction coefficient (ranging from 0.05 to 0.2) that balances the basic evaluation values against the activation stability characteristics so that no single dimension dominates; L_j is the importance weight of the model level to which the j-th expert subnetwork belongs (the closer the level is to the output, the greater the weight; for example, L_j = 1.0 for the level immediately before the output layer and L_j = 0.5 for the level immediately after the input layer); L_total is the sum of the importance weights of all model levels, which enables adaptive adjustment across the hierarchy; and σ_ji(g) is the standard deviation of the softmax-normalized routing threshold when the j-th expert is activated by the i-th type of token, reflecting the stability of expert activation.

[0071] To further strengthen the influence of activation stability on expert evaluation, a stability constraint mechanism is added to the improved weighted formula. When σ_ji(g) is calculated, if the standard deviation of the gating values of the tokens corresponding to an expert subnetwork exceeds a preset threshold (0.3 as an example, adjustable according to the model architecture), a penalty factor β (ranging from 0.8 to 0.95) is automatically introduced to lower that expert's evaluation score in the corresponding token mode. The correction logic is to... the entire term is multiplied by β. This mechanism selects stable and reliable expert subnetworks while suppressing the interference of unstable experts with the inference performance of the compressed model. Together with the improved formula, it forms an evaluation-constraint closed loop that further improves compression accuracy.
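
The stability constraint can be sketched in isolation from the full improved formula. The example below is an assumption-laden illustration: it uses the population standard deviation of a list of gating values, applies β to a single per-mode score, and picks β = 0.9 from within the stated 0.8 to 0.95 range. The function name and the sample gating values are hypothetical.

```python
import statistics

def apply_stability_penalty(score, gate_values, sigma_threshold=0.3, beta=0.9):
    """Multiply `score` by beta when the gating values are unstable.

    score           -- per-mode evaluation value of one expert
    gate_values     -- softmax-normalized gating values observed for that
                       expert under one token type
    sigma_threshold -- example threshold from the text (0.3)
    beta            -- penalty factor; 0.9 is an assumed value within 0.8-0.95
    """
    # Assumption: population standard deviation; the text does not say
    # whether the population or sample form is intended.
    sigma = statistics.pstdev(gate_values)
    return score * beta if sigma > sigma_threshold else score

stable = [0.50, 0.52, 0.48, 0.50]    # small spread, no penalty
unstable = [0.05, 0.90, 0.10, 0.95]  # large spread, penalty applied

stable_score = apply_stability_penalty(1.0, stable)
unstable_score = apply_stability_penalty(1.0, unstable)
```

Only the expert with widely fluctuating gating values has its score reduced; the stable expert keeps its original score.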

[0072] Compared with the basic weighted implementation, this method integrates the three core dimensions of the activation mode described above and goes beyond purely linear weighting, which helps avoid mistaken deletion of low-frequency experts, functional imbalance across levels, and interference from experts with unstable activation. It also stays consistent with the activation-mode differentiation and output-vector standardization described above, forming a technical closed loop, which makes it suitable for high-precision multimodal compression scenarios.

[0073] S300: Based on the expert importance evaluation value, identify and remove expert subnetworks whose expert importance evaluation value is less than a preset threshold.

[0074] This step takes the expert importance evaluation value output by S200 (S1 or S2 for plain text models, and S_weighted for multimodal models) and uses standardized screening logic to accurately identify and remove redundant expert subnetworks while ensuring model structural safety and compression efficiency. The specific steps are as follows:

[0075] I. Preprocessing and Ranking of Evaluation Values

[0076] First, the evaluation values of all expert subnetworks are preprocessed to remove extreme values caused by data anomalies (such as outliers that differ from the average value by more than a factor of 10), so that normal functional modules are not deleted by mistake. Then, the expert subnetworks are sorted in ascending order of evaluation value to generate a global or per-level expert ranking list. During sorting, the core associated information of each expert subnetwork must be retained, including its model level and, for multimodal models, the activation contribution ratio of each token type, to support subsequent decisions on the removal scope.

[0077] II. Determining Removal Criteria (Thresholds, Quantity, and Proportion Matching)

[0078] The preset threshold is not a fixed value, but is determined in conjunction with a predetermined quantity or a preset ratio. The core logic is to deduce the threshold from the preset removal target to ensure that the removal operation is accurate and controllable. Specifically, there are two execution methods, which can be selected according to the model architecture and compression requirements:

[0079] 1. Removal by predetermined quantity: Based on the resource constraints of the target deployment hardware (such as memory capacity and computing power limits), a target number of expert subnetworks to retain is set, which determines the number to remove (number to remove = original total number of experts - target retention quantity). That number of expert subnetworks with the lowest evaluation values is selected from the sorted list as removal candidates. The preset threshold is then the highest evaluation value among the removal candidates, so that every expert at or below this threshold is removed and the removal quantity exactly matches the preset target. For example, if the original total number of expert subnetworks is 160 and the target is to retain 80, the first 80 experts in the ascending list (those with the lowest evaluation values) are selected for removal, and the evaluation value of the 80th expert is the preset threshold for this removal.

[0080] 2. Removal by preset ratio: When there is no need to retain a fixed number of experts, the removal range is determined as a preset ratio of the original total number of experts (common ratios are 30% to 50%, adjustable according to performance requirements). For example, if the original total number of experts is 64 and the preset removal ratio is 50%, the first 32 experts in the ascending list (those with the lowest evaluation values) are selected for removal, and the evaluation value of the 32nd expert is taken as the preset threshold. This method suits scenarios with a clear compression ratio requirement and does not require precise calculation of hardware resource limits.
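
Both ways of deriving the threshold reduce to choosing how many of the lowest-ranked experts to remove. The sketch below reproduces the two worked examples (160 experts retaining 80, and 64 experts at a 50% ratio). The function name, the synthetic scores, and the choice to round the ratio-based count down are assumptions.

```python
def select_for_removal(scores, retain=None, ratio=None):
    """Return (removed_ids, threshold) for one of the two removal modes.

    scores -- dict mapping expert id to importance evaluation value
    retain -- target number of experts to keep (quantity mode)
    ratio  -- fraction of experts to remove (ratio mode)
    """
    if (retain is None) == (ratio is None):
        raise ValueError("specify exactly one of retain or ratio")
    total = len(scores)
    # Assumption: a fractional removal count is rounded down.
    n_remove = total - retain if retain is not None else int(total * ratio)
    if not 0 <= n_remove <= total:
        raise ValueError("removal count out of range")
    ranked = sorted(scores, key=scores.get)  # ascending by evaluation value
    removed = ranked[:n_remove]
    # The threshold is the highest value among the removal candidates.
    threshold = scores[removed[-1]] if removed else None
    return removed, threshold

# Synthetic scores: expert i has evaluation value i.
removed_q, thr_q = select_for_removal({i: float(i) for i in range(160)}, retain=80)
removed_r, thr_r = select_for_removal({i: float(i) for i in range(64)}, ratio=0.5)
```

In both cases the threshold is read off the last removal candidate rather than fixed in advance, which is what makes the removal count exact.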

[0081] III. Global / Layer-by-Layer Removal Execution Logic

[0082] Based on the functional correlation of the expert subnetworks at each layer of the model, a global or layer-by-layer removal strategy is selected. The execution details of the two strategies are as follows:

[0083] 1. Global removal strategy: Applicable to plain text LLMs or multimodal VLMs in which the expert subnetworks at each layer have closely related functions and significant cross-layer collaborative contributions. The evaluation values of the expert subnetworks at all levels are aggregated and sorted, and the expert subnetworks with the lowest global evaluation values are removed together according to the preset number / proportion. This strategy maximizes global resource optimization efficiency and avoids functional imbalance caused by excessive removal at a single layer. However, it must ensure that the number of experts at each layer still meets the router's Top-K activation requirement after removal (that is, the number of remaining experts at each layer is greater than the K value of Top-K).

[0084] 2. Layer-by-layer removal strategy: Suitable for models in which the expert subnetworks of each layer are functionally independent and inter-layer interference is minimal (such as the visual encoding and language generation layers of a multimodal VLM). Each model layer is ranked separately by expert evaluation value, and the expert subnetworks with the lowest evaluation values are removed according to a preset number or proportion (the proportion can be uniform across layers or differentiated by layer importance). For example, the visual encoding layer has high requirements for image feature processing, so a removal proportion of 30% can be set; the language generation layer has high redundancy, so a removal proportion of 50% can be set. This strategy specifically protects the functional integrity of core layers and avoids the over-pruning of critical layers that global removal could cause.
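
The two strategies differ only in where the ranking is done. The sketch below contrasts them on a toy two-layer model. How the Top-K requirement is enforced is not specified in the text; skipping a removal that would leave a layer with K or fewer experts, and capping the per-layer count, are assumptions, as are all names and sample scores.

```python
def global_removal(layer_scores, n_remove, top_k):
    """Remove the globally lowest-scored experts across all layers."""
    ranked = sorted(
        (score, layer, expert)
        for layer, experts in layer_scores.items()
        for expert, score in experts.items()
    )
    remaining = {layer: len(experts) for layer, experts in layer_scores.items()}
    removed = []
    for _score, layer, expert in ranked:
        if len(removed) == n_remove:
            break
        # Assumption: skip a removal that would break "remaining > K".
        if remaining[layer] - 1 > top_k:
            removed.append((layer, expert))
            remaining[layer] -= 1
    return removed

def layerwise_removal(layer_scores, ratios, top_k):
    """Rank and remove within each layer using a per-layer ratio."""
    removed = []
    for layer, experts in layer_scores.items():
        n_remove = int(len(experts) * ratios[layer])
        # Assumption: cap the count so that more than K experts remain.
        n_remove = min(n_remove, max(0, len(experts) - top_k - 1))
        ranked = sorted(experts, key=experts.get)
        removed.extend((layer, e) for e in ranked[:n_remove])
    return removed

# Toy scores: two layers with four experts each (illustrative values).
layer_scores = {
    "vision": {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.7},
    "text": {0: 0.2, 1: 0.3, 2: 0.8, 3: 0.6},
}
global_result = global_removal(layer_scores, n_remove=3, top_k=1)
layer_result = layerwise_removal(layer_scores, {"vision": 0.3, "text": 0.5}, top_k=1)
```

With a larger K the global variant removes fewer experts than requested instead of violating the per-layer constraint; a real implementation might prefer to raise an error or rebalance across layers.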

[0085] IV. Removal Verification and Execution

[0086] 1. Pre-removal verification: After the list of expert subnetworks to be removed has been selected, it must be verified in advance: pruning of the candidate list is simulated on a small amount of calibration data to check whether the remaining expert subnetworks respond normally to inference requests and whether the output shows significant degradation (such as garbled output or semantic distortion). If performance collapses, the process rolls back, adjusts the removal quantity / ratio (for example, reduces the removal ratio to below 30%), and redetermines the list of expert subnetworks to be removed.

[0087] 2. Formal removal execution: After pre-verification passes, the expert subnetworks in the candidate list are stripped out completely, removing them from the model parameters, the architecture, and the inference chain. This includes three core operations. First, parameter-level removal: all independent weight parameters of the expert subnetworks to be removed (including the feature processing layer weights, output layer weights, and biases) are deleted, releasing the corresponding memory. Unlike merely freezing parameters so that they stop updating, this achieves a real reduction in memory usage. Second, architecture-level adaptation: the gating mapping of the router module is updated and the routing links corresponding to the removed expert subnetworks are deleted, so that during subsequent inference the router outputs gating values only to the remaining expert subnetworks, avoiding inference errors caused by invalid routes. Third, associated link cleanup: the record links of the removed experts in the monitoring hook mechanism, the associated evaluation data, and the calling logic that involves those experts in the forward inference process are deleted at the same time, so that the remaining inference chain has no redundant branches. In addition, the number, level, weight parameter scale, and activation contribution information of each removed expert subnetwork are recorded in detail to provide accurate data for the expert reorganization in step S400. For a multimodal model, the underlying data associations of the weighted importance evaluation must also be updated after removal, and the submodal activation records of the removed experts deleted, to avoid interfering with subsequent reorganization and inference verification.
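
The pre-removal verification in item 1 amounts to a rollback loop over the removal ratio. The sketch below is one possible reading, not a procedure stated in the text: the `evaluate` callback, the step size, the minimum ratio, and the 3% tolerated drop are all hypothetical parameters.

```python
def find_safe_ratio(evaluate, baseline, start=0.5, step=0.1,
                    floor=0.1, max_drop=0.03):
    """Lower the removal ratio until simulated pruning is acceptable.

    evaluate -- callable taking a ratio and returning a calibration metric
                for the model with pruning simulated at that ratio
    baseline -- the same metric for the unpruned model
    Returns the first acceptable ratio, or None if none is found.
    """
    ratio = start
    while ratio >= floor - 1e-9:
        if baseline - evaluate(ratio) <= max_drop:
            return ratio
        # Roll back: try a smaller removal ratio.
        ratio = round(ratio - step, 10)
    return None

# Stand-in for simulated pruning: quality collapses above a 35% ratio.
def fake_evaluate(ratio):
    return 0.9 - (0.10 if ratio > 0.35 else 0.01)

safe = find_safe_ratio(fake_evaluate, baseline=0.9)
```

Here the loop rejects 50% and 40% and settles on 30%, mirroring the example of reducing the removal ratio after a performance collapse.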

[0088] V. Multimodal Model Adaptation Instructions

[0089] When the model to be compressed is a multimodal visual-language model, all execution logic in this step remains unchanged, except that the expert importance evaluation value is replaced with the weighted expert importance evaluation value calculated in S200 (S_weighted). Because S_weighted already integrates the contribution weights of the three types of tokens, removal based on it can accurately preserve the expert subnetworks that are central to the multimodal task, taking into account both image understanding and text generation capabilities and avoiding functional imbalance caused by a removal decision dominated by tokens of a single modality.

[0090] S400: Reorganize the expert subnetworks retained after removal to form a compressed model.

[0091] The core objective of this step is to build a structurally complete and inference-smooth compressed model based on the remaining expert subnetwork after the S300 removal operation. This is achieved through architecture reconstruction, link adaptation, and parameter calibration, eliminating the architectural breakage caused by the removal operation while maximizing the preservation of the original model's functionality and performance. The specific process is as follows:

[0092] I. Pre-restructuring preparations

[0093] First, the removal information recorded in step S300 (including the level, number, weight scale, and associated links of the removed expert subnetworks) is retrieved and used to compile the information on the retained expert subnetworks: the number of retained experts is counted per model level, and it is verified that the number of remaining experts at each level meets the router's Top-K activation requirement (that is, the number of remaining experts at each level is greater than the preset K value, so that enough experts are available to activate during inference). At the same time, the weight parameter files of the retained experts and, for multimodal models, the activation contribution records of each modality are organized into a standardized reorganization data source, providing a foundation for subsequent architecture adaptation.

[0094] II. Core Restructuring Operations

[0095] The reconfiguration operation focuses on "architectural coherence" and "inference compatibility," and is executed in three layers, covering the entire process from hardware adaptation to link calibration:

[0096] 2. Router module adaptation and adjustment: This is a crucial step in the reorganization process, focusing on adapting the router to the distribution of the remaining expert subnetworks. First, the parameters of the router's gating layer (a linear layer) are replaced: the weight dimensions corresponding to the removed experts are deleted and the gating layer output dimension is adjusted to match the number of retained experts. Then, temperature scaling is applied to the routing gating distribution of the compressed model to smooth the distribution of routing decisions. The temperature coefficient T can be calculated using the formula T=Eold / Enew, where Eold is the mean gating logit over all expert subnetworks before removal and Enew is the mean gating logit over the retained expert subnetworks. Alternatively, a custom ratio can be chosen based on actual inference performance to ensure a smooth transition in the distribution of gating values among the remaining experts, avoiding extreme activation biases caused by the reduction in the number of experts.

[0098] 3. Inference Link Closed-Loop Calibration: Traverse the entire forward inference link of the model to verify the effectiveness of the connection between the retained expert subnetwork and the input encoding module and output decoding module. For pure text models, focus on calibrating the feature transfer link between the encoded text token and the expert subnetwork. For multimodal models, additionally calibrate the feature mapping relationship between the image encoder and the retained experts to ensure that the image feature token and the user text token can accurately match the corresponding expert subnetwork with activation preference. At the same time, clean up the residual expert call logic in the link and repair the link breakpoints caused by the removal operation to avoid problems such as feature transfer failure and parameter addressing errors during inference.
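
The router adaptation in item 2 can be sketched with plain lists standing in for the gating layer. Only the formula T=Eold / Enew comes from the text. Dividing the logits by T before the softmax is the usual form of temperature scaling and is assumed here; the guards against a zero or non-positive T, the function names, and the sample numbers are likewise assumptions.

```python
import math

def prune_gate(weight_rows, bias, keep):
    """Keep only the gating-layer rows and biases of retained experts."""
    return [weight_rows[e] for e in keep], [bias[e] for e in keep]

def temperature_coefficient(mean_logits_before, keep):
    """T = E_old / E_new from per-expert mean gating logits."""
    e_old = sum(mean_logits_before) / len(mean_logits_before)
    kept = [mean_logits_before[e] for e in keep]
    e_new = sum(kept) / len(kept)
    # Assumption: the text does not cover E_new == 0 or a non-positive T.
    if e_new == 0:
        raise ValueError("E_new is zero; choose a custom ratio instead")
    t = e_old / e_new
    if t <= 0:
        raise ValueError("non-positive T; choose a custom ratio instead")
    return t

def softmax_with_temperature(logits, t):
    """Softmax over logits divided by temperature t."""
    scaled = [x / t for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [v / z for v in exps]

# Four experts, hidden size 2; experts 0 and 2 are retained.
W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
b = [0.0, 0.0, 0.0, 0.0]
keep = [0, 2]
W_new, b_new = prune_gate(W, b, keep)
T = temperature_coefficient([2.0, 1.0, 4.0, 1.0], keep)
probs = softmax_with_temperature([2.0, 4.0], T)
```

Because the formula depends on the sign and size of the mean logits, the custom ratio mentioned in the text remains a useful fallback whenever the computed T is degenerate.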

[0099] III. Dedicated Reassembly Adaptation for Multimodal Models

[0100] For multimodal visual language models, a submodal link calibration step is added to the core reorganization operation: based on the modal weight configuration of the S200 step, the submodal processing priority of the retained expert subnetwork is adjusted. For experts with a high contribution to image feature token activation, their link association with the visual encoder is strengthened; for experts with a high contribution to model-generated text token activation, their collaborative logic with the text decoding module is optimized. At the same time, the multimodal weighted statistical mechanism is updated to remove the submodal records of removed experts, ensuring that the weighted importance evaluation logic during subsequent inference is consistent with the reorganized model architecture.

[0101] IV. Post-recombination verification and fine-tuning

[0102] 1. Functional validation: Inference tests are run on a small calibration subset (approximately 10%-20% of the original calibration set) to verify that the compressed model processes input normally (text generation and reasoning tasks for plain text models; visual question answering and image description tasks for multimodal models). The tests also check for problems such as inference stalls, output distortion, and link errors to confirm the integrity of the reorganized model architecture.

[0103] 2. Performance calibration: Compare the inference performance metrics (including memory usage, inference speed, and task accuracy) of the reorganized model with those of the original model. If a slight performance degradation occurs (such as a 1%-3% decrease in accuracy), it can be calibrated by fine-tuning only the router gating layer parameters (not the expert subnetwork weights, so that the fine-tuning-free deployment property is preserved), bringing the performance of the compressed model close to that of the original model.

[0104] V. Compressed Model Packaging

[0105] After reorganization and verification, the compressed model is packaged in a standardized form: the weight parameters of the retained expert subnetworks, the adapted router module, and the calibrated inference link configuration file are integrated into an independent model file; the tokenizer configuration and, for multimodal models, the image preprocessing rules of the original model are copied over as well, so that the input and output formats of the compressed model are fully compatible with the original model and it can be deployed directly without modifying the calling code of downstream applications.

[0106] Through the above reorganization operations, the compressed model eliminates the memory footprint of redundant expert subnetworks while maintaining the core functional architecture and inference logic of the original model, achieving a balance between high compression ratio and high performance retention, and laying the foundation for subsequent model storage and deployment.

[0107] To fully verify the effectiveness, universality, and scenario adaptability of the proposed sparse activation hybrid expert model compression method, the following two typical embodiments are used for illustration: for a plain text generation large language model and a multimodal visual language model, respectively, to verify the compression effect and performance preservation capability of the method under different model architectures and different task scenarios. The plain text embodiment focuses on the adaptation to large-scale model generation tasks and the characteristics of deployment without fine-tuning, while the multimodal embodiment focuses on verifying the practicality of the submodal weighting strategy, jointly demonstrating the technical advantages and practical value of the proposed method.

[0108] I. Plain Text Model Example

[0109] This embodiment verifies the effectiveness of the compression method in various generation tasks for large-scale sparse activation hybrid expert (SMoE) plain text generation large language models with 20B~1T parameters. The specific implementation process and results are as follows:

[0110] 1. Implementation prerequisites: Select general text corpus (including code snippets, mathematical reasoning questions, creative writing materials, and tool call instructions) as calibration dataset, covering four core generation tasks: code generation, mathematical reasoning, creative writing, and tool call; the total number of original expert subnetworks in the model is 160, the K value in the router Top-K activation strategy is preset to 4, and the hardware deployment target is an edge server (memory constraint of 64GB).

[0111] 2. Compression operation process: In step S100, the routing gating values and expert subnetwork output vectors during model inference are collected, and structured data storage is completed through a monitoring hook mechanism. Step S200 uses the basic scheme to calculate the expert importance evaluation value S = E[g(x)·‖f(x)‖²]. Step S300 employs a global removal strategy with a pruning target of a 50% compression ratio, removing the 80 expert subnetworks with the lowest evaluation values (80 experts are retained after pruning, which meets the Top-K activation requirement). During the reorganization in step S400, the temperature coefficient is calculated using the formula T=Eold / Enew, and the routing gating distribution is adjusted through temperature scaling to reduce distribution sharpness and ensure a smooth transition of the remaining experts' gating logits; the gating layer parameter replacement and inference link calibration are completed at the same time.

[0112] 3. Implementation Results: The compressed model can be deployed directly without any fine-tuning, and its performance retention rate in the four generation tasks is over 95% (code generation accuracy decrease ≤2%, mathematical reasoning accuracy decrease ≤1.5%, creative writing semantic coherence score decrease ≤1 point, and tool call success rate remains above 98%). The inference memory usage is reduced from the original 120GB to 58GB, a memory saving rate of 51.7%, and the inference speed is improved by 18% compared to the original model. The accompanying implementation tools support flexible configuration of temperature scaling factor and custom scaling parameters, which can meet the comparison needs of different experimental scenarios.
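
The basic importance score used in this embodiment can be sketched as an average over the tokens routed to an expert. The example follows the wording of claim 2 (the gating value multiplied by the square of the output vector norm); the record layout, the function name, and the sample values are assumptions.

```python
def expert_importance(records):
    """Mean of g * ||f||^2 over the tokens that activated one expert.

    records -- list of (gate_value, output_vector) pairs, one per token
               routed to the expert
    """
    if not records:
        return 0.0  # assumption: an expert that never fired scores zero
    total = 0.0
    for gate_value, output_vector in records:
        squared_norm = sum(v * v for v in output_vector)
        total += gate_value * squared_norm
    return total / len(records)

# Two hypothetical activation records for one expert.
records = [
    (0.5, [3.0, 4.0]),   # squared norm 25, contribution 12.5
    (0.25, [0.0, 2.0]),  # squared norm 4, contribution 1.0
]
score = expert_importance(records)
```

An expert scores low either when the router rarely gives it a large gating value or when its outputs have small magnitude, which is why the score reflects both routing strength and output response.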

[0113] II. Multimodal Model Examples

[0114] This embodiment verifies the adaptability of the method to cross-modal tasks for a multimodal visual language model (VLM) using the Aria architecture and containing 64 expert sub-networks. The specific implementation process and results are as follows:

[0115] 1. Implementation prerequisites: Select a multimodal calibration dataset containing image-dialogue pairs (integrating COCO image set, VQAv2 visual question answering samples, and multi-turn image-text dialogue data), covering the two core tasks of visual question answering and image description; adopt the S200 multimodal weighting strategy, with the initial configuration of modal weights w1 (image feature token) = 0.5, w2 (user input text token) = 0.2, and w3 (model generated text token) = 0.3.

[0116] 2. Compression operation process: Step S100 uses a monitoring hook mechanism to capture and distinguish, in real time, the routing thresholds and expert subnetwork outputs corresponding to the three types of tokens, establishing a multimodal activation record. Step S200 calculates the independent evaluation values S1, S2, and S3 from the multimodal data and obtains the weighted evaluation value with the formula S_weighted = w1×S1 + w2×S2 + w3×S3. Step S300 adopts a layer-by-layer removal strategy, pruning both the visual encoding layer and the language generation layer at a compression ratio of 50% and removing a total of 32 expert subnetworks (32 expert subnetworks are retained after pruning). During the reorganization in step S400, additional multimodal link calibration is performed to strengthen the association between experts with high image feature activation ratios and the visual encoder, and the multimodal weighted statistical mechanism is updated to ensure link adaptability.

[0117] 3. Implementation Results and Scenario Adaptation: The compressed model performs similarly to the original model in visual question answering and image description tasks (visual question answering accuracy decreased by ≤2.3%, and image description BLEU-4 score decreased by ≤0.03). Inference memory usage was reduced from 80GB to 39GB, a saving of 51.2%. It also supports flexible adjustment of modal weights to adapt to different scenarios: For image-intensive tasks (such as image semantic segmentation-assisted question answering), w1 was increased to 0.6, and w2 and w3 were adjusted to 0.15 and 0.25 respectively, improving image feature processing accuracy maintenance to 96%; for dialogue-intensive tasks (multi-turn image-text dialogue generation), w3 was increased to 0.4, and w1 and w2 were adjusted to 0.45 and 0.15 respectively, improving text generation coherence by 12%. Furthermore, the accompanying analysis tools can automatically track the differences in activation across different modalities and generate standardized activation statistics reports, providing data support for weight adjustment and pruning strategy optimization.

[0118] In summary, the method provided by this invention has the following technical effects:

[0119] 1. Avoid merging errors and retain independent router control capabilities: A pruning strategy is adopted instead of an expert merging strategy. After pruning, the router can still independently adjust the routing threshold for the retained expert subnetworks, maintain the dynamic mixing strategy that depends on the input, avoid the functional subspace collapse caused by the merging method from the root, eliminate the inherent weight difference related error of the merging method, and ensure the output accuracy of the generation task.

[0120] 2. Accurately identify redundancy and achieve high performance retention: Construct an expert importance evaluation system based on actual inference activation data, which jointly reflects the strength of routing decisions and output response. Data-driven screening of redundant experts ensures that expert subnetworks with the least impact on model output are removed, and the core performance of the original model can still be retained even at a high compression ratio (e.g., 50%).

[0121] 3. Simplified deployment process with cross-architecture versatility: Compression and deployment can be completed without fine-tuning. With only one calibration, pruning and reorganization process, it can be adapted to 20B~1T parameter plain text SMoE models and multimodal visual language models, which greatly reduces the deployment threshold and cost of large-scale models on resource-constrained devices.

[0122] 4. Multimodal adaptive optimization to adapt to diverse tasks: By distinguishing the activation modes of image feature tokens, user input text tokens, and model-generated text tokens, and combining adaptive modal weight configuration, the evaluation priority can be dynamically adjusted according to task characteristics (image-intensive / dialogue-intensive) to achieve a balanced optimization of image understanding and text generation capabilities, adapting to the needs of multimodal heterogeneous scenarios.

[0123] 5. Optimize inference stability and engineering feasibility: Combine temperature scaling to adjust the distribution of routing gating, smooth routing decisions after pruning, and avoid extreme activation preferences; at the same time, it can be integrated with toolchains to achieve full-process automation, support functions such as visualization of modal activation and parameter configuration, and balance technological advancement with engineering practicality, significantly improving model inference efficiency and memory utilization.

[0124] This invention also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being configured to perform the method described in this invention.

[0125] This invention also provides a computer-readable storage medium storing computer-executable instructions for performing the methods described in this invention.

[0126] It should be understood that the various forms of processes shown above can be used to reorder, add, or delete steps. For example, the steps described in this invention can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution disclosed in this invention can be achieved, and this is not limited herein.

[0127] The specific embodiments described above do not constitute a limitation on the scope of protection of this invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this invention should be included within the scope of protection of this invention.

Claims

1. A compression method for a sparse activation hybrid expert model, characterized in that it includes the following steps: obtaining expert activation information of the model to be compressed during the inference process, including routing thresholds and the output vectors of the corresponding expert subnetworks; calculating, based on the expert activation information, an expert importance evaluation value for each expert subnetwork, the expert importance evaluation value reflecting the routing decision strength and output response strength when the expert subnetwork is activated; identifying and removing, based on the expert importance evaluation value, expert subnetworks whose expert importance evaluation value is less than a preset threshold; and reorganizing the expert subnetworks retained after the removal to form a compressed model; wherein, when the model to be compressed is a multimodal visual language model, before calculating the expert importance evaluation value, the method further includes: distinguishing the activation modes of different modalities or input roles, configuring weights according to modality to calculate a weighted expert importance evaluation value, and performing the removal and reorganization based on the weighted expert importance evaluation value.

2. The method according to claim 1, characterized in that the expert importance evaluation value is calculated as follows: for each expert subnetwork, the expected value, taken over the cases in which the expert subnetwork is activated, of the product of the routing threshold and the squared norm of the expert subnetwork's output vector.
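The calculation in claim 2 can be illustrated with a minimal sketch. It assumes that the expectation is estimated as an empirical mean over recorded activations, that the norm is the L2 norm, and that the "routing threshold" is the scalar gate value assigned to the expert for a token. The function name `expert_importance` and the record layout are hypothetical and do not come from the patent text.

```python
from collections import defaultdict

def expert_importance(records):
    """Estimate E[gate * ||output||^2] per expert over its activations.

    records: iterable of (expert_id, gate_value, output_vector), one entry
    per (token, expert) pair in which the expert was actually activated.
    This record layout is an assumption made for illustration.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for expert_id, gate, output in records:
        sq_norm = sum(x * x for x in output)  # squared L2 norm of the output
        totals[expert_id] += gate * sq_norm
        counts[expert_id] += 1
    # Empirical mean, conditioned on the expert being activated.
    return {e: totals[e] / counts[e] for e in totals}

records = [
    (0, 0.5, [1.0, 2.0]),   # 0.5  * (1 + 4) = 2.5
    (0, 0.25, [2.0, 0.0]),  # 0.25 * 4       = 1.0
    (1, 0.5, [0.0, 1.0]),   # 0.5  * 1       = 0.5
]
scores = expert_importance(records)
print(scores)
```

In this toy data, expert 0 averages 1.75 and expert 1 averages 0.5, so expert 1 would be the first candidate for removal.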

3. The method according to claim 1, characterized in that the step of identifying and removing expert subnetworks whose expert importance evaluation values are less than a preset threshold based on the expert importance evaluation values specifically includes: within the global or layer-by-layer scope of the model to be compressed, selecting and removing a predetermined number or proportion of expert subnetworks with the lowest expert importance evaluation values.
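The difference between global and layer-by-layer selection in claim 3 can be sketched as follows. The example assumes scores keyed by hypothetical `(layer, expert)` pairs and a removal proportion; ties are broken by key order, which is an arbitrary choice made here and not stated in the source.

```python
def select_experts_to_remove(scores, ratio, per_layer=False):
    """Return the set of (layer, expert) keys with the lowest scores.

    scores: {(layer, expert): importance}; ratio: proportion to remove.
    per_layer=False ranks all experts together; per_layer=True applies
    the same proportion inside each layer separately.
    """
    def lowest(keys, k):
        return sorted(keys, key=lambda key: (scores[key], key))[:k]

    if not per_layer:
        return set(lowest(scores, int(len(scores) * ratio)))

    by_layer = {}
    for key in scores:
        by_layer.setdefault(key[0], []).append(key)
    removed = set()
    for keys in by_layer.values():
        removed.update(lowest(keys, int(len(keys) * ratio)))
    return removed

scores = {
    (0, 0): 0.9, (0, 1): 0.1, (0, 2): 0.2, (0, 3): 0.8,
    (1, 0): 0.3, (1, 1): 0.7, (1, 2): 0.6, (1, 3): 0.5,
}
global_removed = select_experts_to_remove(scores, 0.25)
layer_removed = select_experts_to_remove(scores, 0.25, per_layer=True)
print(sorted(global_removed), sorted(layer_removed))
```

With these toy scores, global selection removes two experts from layer 0, whereas layer-by-layer selection removes one expert from each layer, which keeps the number of experts per layer uniform.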

4. The method according to claim 1, characterized in that distinguishing the activation modes of different modalities or input roles specifically includes: distinguishing tokens corresponding to image features, tokens corresponding to user input text, and tokens corresponding to model-generated text.
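One possible way to assign the three token types of claim 4 is sketched below. It assumes that a per-position image mask and the prompt length are available; both inputs, the `TokenType` enum, and the function `classify_tokens` are hypothetical names introduced for illustration, and the source does not specify how the distinction is implemented.

```python
from enum import Enum

class TokenType(Enum):
    IMAGE = "image"
    USER_TEXT = "user_text"
    GENERATED_TEXT = "generated_text"

def classify_tokens(is_image, prompt_length):
    """Label every position of a sequence with one of the three types.

    is_image: list of bool, True where the position holds an image feature.
    prompt_length: number of leading positions that belong to the input;
    all later positions are treated as model-generated text (assumption).
    """
    types = []
    for pos, img in enumerate(is_image):
        if pos >= prompt_length:
            types.append(TokenType.GENERATED_TEXT)
        elif img:
            types.append(TokenType.IMAGE)
        else:
            types.append(TokenType.USER_TEXT)
    return types

types = classify_tokens([False, True, True, False, False, False], prompt_length=4)
print([t.value for t in types])
```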

5. The method according to claim 4, characterized in that the weighted expert importance evaluation value satisfies the following condition: [formula missing in source]; where $S_j^{\text{weighted-advanced}}$ is the weighted importance evaluation value of the j-th expert subnetwork, with j ranging from 1 to n and n being the total number of expert subnetworks; $w_i$ is the modal weight of the i-th type of token, with i ranging from 1 to 3; $N_{ji}$ is the total frequency with which the j-th expert is activated by the i-th type of token; $N_{ri}$ is the total frequency with which the r-th expert is activated by the i-th type of token, with r ranging from 1 to n; $S_{ji}$ is the independent importance evaluation value of the j-th expert subnetwork under the i-th type of token activation mode; $\alpha$ is the dynamic correction coefficient; $L_j$ is the importance weight of the model level to which the j-th expert subnetwork belongs; $L_{\text{total}}$ is the sum of the importance weights of all model levels; and $\sigma_{ji}(g)$ is the standard deviation of the routing gating weight value when the j-th expert is activated by the i-th type of token.

6. The method according to claim 1, characterized in that the method further includes: applying temperature scaling to the routing gating distribution of the compressed model to smooth the distribution of routing decisions.
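The smoothing effect of temperature scaling in claim 6 can be shown with a minimal sketch. It assumes that the routing gating distribution is a softmax over router logits and that scaling means dividing the logits by a temperature T before the softmax; the patent text does not state the exact form, and the function name is hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; T > 1 flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 1.0)
smooth = softmax_with_temperature(logits, 2.0)
print(sharp, smooth)
```

At T = 2 the largest gate probability is lower than at T = 1 while the ranking of experts is unchanged, which is the sense in which extreme activation preferences are reduced.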

7. The method according to claim 4 or 5, characterized in that the modal weights are adaptively configured based on the task characteristics of the target deployment scenario: when the task is image-intensive understanding, the weight of image feature tokens is increased; when the task is dialogue-intensive generation, the weight of model-generated text tokens is increased.
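The task-dependent weight configuration of claim 7 can be sketched under strong simplifying assumptions. The preset values below are hypothetical, and the per-type scores are combined by a plain normalized weighted sum; this is a simplification that omits the frequency, level, and variance terms defined in claim 5 and should not be read as that claim's formula.

```python
# Hypothetical presets; the source gives no numeric weight values.
PRESETS = {
    "balanced": {"image": 1.0, "user_text": 1.0, "generated_text": 1.0},
    "image_intensive": {"image": 2.0, "user_text": 1.0, "generated_text": 1.0},
    "dialogue_intensive": {"image": 1.0, "user_text": 1.0, "generated_text": 2.0},
}

def modality_weights(task):
    """Return the preset for a task, normalized so the weights sum to 1."""
    raw = PRESETS[task]
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}

def weighted_importance(per_type_scores, weights):
    """Simplified combination: sum over token types of weight * score."""
    return {
        expert: sum(weights[t] * s for t, s in by_type.items())
        for expert, by_type in per_type_scores.items()
    }

per_type_scores = {
    "expert_a": {"image": 4.0, "user_text": 1.0, "generated_text": 1.0},
    "expert_b": {"image": 1.0, "user_text": 1.0, "generated_text": 4.0},
}
image_view = weighted_importance(per_type_scores, modality_weights("image_intensive"))
dialogue_view = weighted_importance(per_type_scores, modality_weights("dialogue_intensive"))
print(image_view, dialogue_view)
```

In this toy case the expert that responds mainly to image tokens ranks higher under the image-intensive preset and lower under the dialogue-intensive preset, so the same calibration data can yield different pruning decisions for different deployment scenarios.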

8. The method according to claim 1, characterized in that, when acquiring expert activation information, a monitoring hook mechanism is embedded in the forward inference process to capture and distinguish, in real time, tokens of different modalities or roles flowing through the router, and to record separately the routing threshold and expert subnetwork output corresponding to the tokens of each modality or role.
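The monitoring hook mechanism of claim 8 can be sketched with a toy layer written in plain Python. `ToyMoELayer`, `ActivationRecorder`, the top-1 routing rule, and the toy router and experts are all hypothetical stand-ins; in a deep learning framework the same idea would typically be realized with that framework's forward-hook facility.

```python
class ActivationRecorder:
    """Hook that stores (expert_id, gate, output) grouped by token type."""
    def __init__(self):
        self.records = {}

    def __call__(self, token_type, expert_id, gate, output):
        entry = (expert_id, gate, list(output))
        self.records.setdefault(token_type, []).append(entry)

class ToyMoELayer:
    """Hypothetical stand-in for one SMoE layer with top-1 routing."""
    def __init__(self, experts, router):
        self.experts = experts
        self.router = router
        self.hooks = []

    def register_hook(self, hook):
        self.hooks.append(hook)

    def forward(self, token, token_type):
        gates = self.router(token)
        expert_id = max(range(len(gates)), key=gates.__getitem__)
        output = self.experts[expert_id](token)
        for hook in self.hooks:  # observe without changing the result
            hook(token_type, expert_id, gates[expert_id], output)
        return [gates[expert_id] * y for y in output]

# Toy router and experts, chosen only to make the example deterministic.
router = lambda t: [0.75, 0.25] if t[0] > 0 else [0.25, 0.75]
experts = [lambda t: [2 * x for x in t], lambda t: [-x for x in t]]

layer = ToyMoELayer(experts, router)
recorder = ActivationRecorder()
layer.register_hook(recorder)
layer.forward([1.0, 0.0], "image")
layer.forward([-1.0, 2.0], "user_text")
print(recorder.records)
```

Because the records are grouped by token type, the recorded gate values and outputs can later be scored separately for each modality or role.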

9. An electronic device, characterized in that it includes a processor and a memory; the processor executes the steps of the method described in any one of claims 1 to 8 by invoking programs or instructions stored in the memory.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a program or instructions that cause a computer to perform the steps of the method described in any one of claims 1 to 8.