Large language model fusion method, electronic device, and storage medium

By using a component-level weighted fusion method, the FFN and MHA of a large language model are decomposed into independent units, and their specific weights are learned. This solves the problems of catastrophic forgetting and fuzzy probability distribution in existing technologies, and realizes a model that efficiently integrates multiple capabilities, thereby improving the performance and accuracy of composite tasks.

CN122242787A · Pending · Publication Date: 2026-06-19 · SHANGHAI JIAOTONG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHANGHAI JIAOTONG UNIV
Filing Date
2026-03-06
Publication Date
2026-06-19

Abstract

This application discloses a large language model fusion method, electronic device, and storage medium. One large language model fusion method includes: loading at least two homologous expert models fine-tuned based on the same architecture; establishing learnable mask variables at at least one functional unit level of the expert models, wherein the functional unit includes at least a feedforward neural network and a multi-head attention mechanism; executing differentiated fusion strategies according to the type of the at least one functional unit; synthesizing complete layer parameters based on the virtual neurons, the virtual attention heads, and the logic-dominated expert model parameters to participate in forward computation; training the model using mixed data containing samples of different capabilities; and after completing a predetermined number of training rounds, outputting the final optimized mask parameters to complete the construction of the fusion model.

Description

Technical Field

[0001] This invention belongs to the field of model merging for large language models (LLMs), and particularly relates to a large language model fusion method, an electronic device, and a storage medium.

Background Technology

[0002] In related technologies, an LLM (Large Language Model) is a deep neural network with a very large number of parameters, pre-trained on massive amounts of text, with powerful natural language understanding and generation capabilities. Large language model fusion techniques mainly include simple model averaging, task arithmetic, and more advanced fusion algorithms derived from them, of which TIES-Merging (TrIm, Elect Sign & Merge) and DARE (Drop And Rescale) are the most representative. In addition, traditional sequential fine-tuning is a commonly used baseline in the industry for multi-capability fusion.

[0003] The core objective of these technologies is to merge multiple homogeneous models (i.e., models originating from the same pre-training base) that have been fine-tuned for different tasks into a single model with comprehensive capabilities without incurring high costs for full parameter retraining.

[0004] Traditional simple model averaging is the most basic method, which directly performs a linear average of the weight matrices of multiple expert models. Task arithmetic, developed on this basis, calculates the "task vector" by measuring the parameter difference between the fine-tuned model and the pre-trained base model. Vector addition is then used to superimpose the capabilities of different tasks onto the base model.
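The two baseline operations described above can be sketched as follows. This is a minimal illustration on flat lists of floats, not an implementation from this application; the example weights and the `scale` coefficient are hypothetical and are introduced only for the sketch.

```python
# Toy illustration of simple averaging vs. task arithmetic.
# Parameters are flat lists of floats; real models hold one tensor per layer.

def average(models):
    """Element-wise mean of several parameter vectors."""
    n = len(models)
    return [sum(vals) / n for vals in zip(*models)]

def task_vector(finetuned, base):
    """Parameter difference between a fine-tuned model and its base."""
    return [f - b for f, b in zip(finetuned, base)]

def task_arithmetic(base, finetuned_models, scale=1.0):
    """Add the (scaled) task vector of every expert onto the base model.
    `scale` is a hypothetical coefficient, not taken from the source."""
    merged = list(base)
    for ft in finetuned_models:
        tv = task_vector(ft, base)
        merged = [m + scale * t for m, t in zip(merged, tv)]
    return merged

base = [1.0, 2.0, 3.0]
math_expert = [1.5, 2.0, 2.0]   # hypothetical fine-tuned weights
chat_expert = [1.0, 3.0, 3.5]   # hypothetical fine-tuned weights

avg = average([math_expert, chat_expert])
merged = task_arithmetic(base, [math_expert, chat_expert])
```

Note that averaging never consults the base model, whereas task arithmetic only moves the base along the directions in which each expert changed.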

[0005] To address the parameter interference caused by direct merging, TIES-Merging introduces pruning and sign alignment mechanisms. It first prunes redundant parameters with small absolute values from the task vector, then resolves sign conflicts between different models at the same parameter position through an election mechanism, and finally merges only the retained parameters. Another mainstream technique, DARE, is based on the assumption that fine-tuned parameter changes are highly redundant. It attempts to reduce inter-model interference while preserving original capabilities by randomly discarding the vast majority (e.g., 90%) of the fine-tuned parameter changes and rescaling the remainder.

Summary of the Invention

[0006] This invention provides a large language model fusion method, an electronic device, and a storage medium to at least solve one of the above-mentioned technical problems.

[0007] In a first aspect, embodiments of the present invention provide a large language model fusion method, comprising: loading at least two homologous expert models fine-tuned based on the same architecture; establishing learnable mask variables at at least one functional unit level of the expert models, wherein the functional unit includes at least a feedforward neural network and a multi-head attention mechanism; executing differentiated fusion strategies according to the type of the at least one functional unit; synthesizing complete layer parameters based on the virtual neurons, the virtual attention heads, and the logic-dominated expert model parameters to participate in forward computation; and training the mask parameters of the model using mixed data containing samples of different capabilities to complete the construction of the fusion model.

[0008] In a second aspect, embodiments of the present invention also provide a computer program product, the computer program product including a computer program stored on a non-volatile computer-readable storage medium, the computer program including program instructions which, when executed by a computer, cause the computer to perform the steps of the large language model fusion method of any embodiment of the present invention.

[0009] In a third aspect, embodiments of the present invention also provide an electronic device comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the steps of the method described in the first aspect.

[0010] In a fourth aspect, embodiments of the present invention also provide a storage medium storing a computer program thereon, characterized in that the computer program, when executed by a processor, implements the steps of the method described in the first aspect.

[0011] The method in this application implements a component-level weighted fusion method. Instead of performing a coarse average of the entire parameter matrix, this method works at the functional unit level of the Transformer architecture. Specifically, it decomposes the feedforward neural network (FFN) into independent neurons and the multi-head attention layer (MHA) into independent attention heads, and learns a dedicated set of fusion weights for each functional unit.

Attached Figure Description

[0012] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0013] Figure 1 is a schematic diagram of a solution architecture provided in an embodiment of the present invention;
Figure 2 is a flowchart of a large language model fusion method provided in an embodiment of the present invention;
Figure 3 is a comparison of traditional parameter-level merging and the component-wise weighted fusion (CWF) proposed in this application, provided in an embodiment of the present invention;
Figure 4 shows the cross-language mathematical reasoning capability provided in an embodiment of the present invention;
Figure 5 shows the overall capability retention (Indonesian / Thai average) provided in an embodiment of the present invention;
Figure 6 shows a non-decoder strategy ablation experiment (mathematics + Thai) provided in an embodiment of the present invention;
Figure 7 is a comparison of optimization strategies on an Indonesian composite task provided in an embodiment of the present invention;
Figure 8 is a comparison of peak memory usage for the Llama-3-8B model provided in an embodiment of the present invention;
Figure 9 shows the per-layer average fusion weights for the mathematics + Thai and mathematics + Indonesian merges provided in an embodiment of the present invention;
Figure 10 shows the feature selection distribution of FFN neurons in each layer provided in an embodiment of the present invention;
Figure 11 shows the microscopic activation dynamics provided in an embodiment of the present invention;
Figure 12 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention.

Detailed Implementation

[0014] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0015] The inventors discovered that the aforementioned technologies suffer from one or more of the following drawbacks. First, severe catastrophic forgetting and functional interference. When attempting to fuse models with drastically different abilities (e.g., a mathematical expert model skilled in logical reasoning and a low-resource-language dialogue model skilled in verbal expression), existing methods often lead to a sharp degradation of the original core capabilities. Experimental data shows that after using TIES-Merging, the model's accuracy on complex mathematical tasks drops significantly, and in some cases English proficiency also declines significantly. Second, the inability to effectively construct composite capabilities. Models generated by existing fusion techniques often struggle to use two capabilities simultaneously to solve the same problem. The models typically exhibit fragmented capabilities: they can either perform simple logical reasoning or hold a fluent casual conversation, but cannot maintain a rigorous logical chain while conversing in a non-native language. Finally, the output probability distribution becomes blurred. For tasks requiring extremely high symbolic precision, such as mathematical reasoning or code generation, the linear fusion strategy employed by existing technologies leads to a flattened or chaotic probability distribution at the output layer. This makes the model "hesitant" when generating key values or symbols, which easily results in hallucinations or incorrect answers.

[0016] The inventors discovered that the root cause of the aforementioned shortcomings is that existing techniques treat neural network parameters as unstructured numerical matrices, neglecting the inherent modular functional structure within the model. Specifically, methods such as TIES-Merging and DARE consider only the magnitude or sign of the values when processing parameters, applying a uniform pruning or averaging strategy to parameters across all layers and positions. However, in the Transformer architecture of large language models, different components play different functional roles: neurons in the feedforward neural network (FFN) layer are primarily responsible for storing facts and logical knowledge, while the multi-head attention (MHA) layer is responsible for information routing and contextual association. Existing techniques fail to differentiate these functional units at a fine granularity, so key neurons responsible for logical reasoning are diluted or overwritten by parameters from other tasks, leading to catastrophic forgetting. Furthermore, existing techniques typically include the input embeddings and the output head (LM Head) within the scope of linear fusion. These two components are the key interfaces connecting the latent space and the discrete tokens. Simple linear mixing directly destroys the model's ability to accurately index domain-specific terms (such as mathematical symbols), which is the direct cause of fuzzy probability distributions and decreased inference accuracy.

[0017] To address the forgetting and conflict issues in multi-capability fusion, practitioners in this industry typically tend to adopt sequential fine-tuning or Mixture of Experts (MoE) architectures. Sequential fine-tuning involves training existing expert models using data from new domains, but this has proven highly prone to catastrophic forgetting of old knowledge. Another common approach is to build complex MoE systems that dynamically select different models to respond during inference, but this usually requires changing the underlying model architecture or deploying multiple large models simultaneously for ensemble inference, which leads to a significant increase in inference costs and memory usage.

[0018] The approach of this application is unlikely to be considered by industry professionals because of two main cognitive barriers. First, current mainstream research (such as TIES-Merging and DARE) generally treats model parameters as unstructured numerical matrices, focusing on solving numerical interference at the mathematical level while ignoring the biologically inspired structures within neural networks (such as the functional specificity of neurons). Few in the industry would consider breaking down a massive parameter matrix into tens of thousands of independent neurons and attention heads for fine-grained management. Second, the strategy in this approach of completely excluding non-decoder components (input / output layers) from fusion is counterintuitive. Conventional thinking suggests that model fusion should include all layers, but this approach demonstrates that retaining a single expert's input / output layers is crucial for maintaining a sharp probability distribution.

[0019] This application proposes a component-wise weighted fusion method (CWF, a model fusion method that weights parameters at the neuron and attention head level). Instead of coarsely averaging the entire parameter matrix, this method works at the functional unit level of the Transformer architecture. Specifically, it decomposes the feed-forward network (FFN, an important sub-layer in the Transformer architecture, consisting of linear transformations and an activation function, often considered the main area for storing facts and logical knowledge in large models) into independent neurons, and the multi-head attention layer (MHA, a key component in the Transformer architecture that captures dependencies between different positions in a sequence through multiple parallel attention heads) into independent attention heads, and learns a set of dedicated fusion weights for each functional unit.

[0020] By training on small-scale mixed data, CWF can surgically identify and preserve key neurons responsible for logical reasoning (i.e., "logical anchors"), while flexibly invoking components of the language model when processing linguistic features. To address the issue of ambiguous probability distributions, this invention forcibly retains the embedding layer and output head of the dominant expert (such as a mathematical model), ensuring that the tokens generated by the model have a high degree of determinism. Furthermore, considering the differences in optimization inertia among different components, this invention employs a decoupled optimization strategy, successfully achieving an organic combination of heterogeneous capabilities.

[0021] Please refer to Figure 1, which illustrates the architecture of the component-wise weighted fusion (CWF) method for functional modular fusion of large language models according to this application.

[0022] As shown in Figure 1, step S1: load the homologous expert models and initialize the masks. This step is the initialization phase of the fusion process. First, the system loads at least two homologous expert models fine-tuned based on the same architecture, such as a mathematical expert model (M_math) and a language dialogue expert model (M_chat). After loading, the system immediately freezes the original backbone parameters of all models to ensure that these parameters serve only as a fixed knowledge base in subsequent processes and do not undergo gradient updates.

[0023] Next, the system establishes learnable mask variables at the functional unit level. Unlike traditional layer-based weighting, the system delves into the component granularity: for each layer of feedforward neural network (FFN), a weight vector λ_F with dimension N (number of models) is initialized for each neuron; for each layer of multi-head attention mechanism (MHA), a weight vector λ_A with dimension N is initialized for each attention head.

[0024] To ensure numerical stability and preserve the dominant capability, the system introduces the Softmax function to normalize these weights and sets a specific bias (such as 0.731) during initialization, so that the initial weights are slightly biased towards the logic-dominant model (such as the mathematical expert).
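The initialization in step S1 can be sketched as follows. This is a minimal illustration under a stated assumption: the text does not say whether the bias (0.731) is applied to the pre-Softmax logits or is the resulting weight, so the sketch treats it as a logit offset on the dominant expert. The function names and the tiny layer sizes are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def init_mask_logits(num_units, num_models, dominant=0, bias=0.731):
    """One logit vector of length N per functional unit.
    Assumption: the dominant expert's logit starts at `bias`, all others at 0."""
    return [[bias if k == dominant else 0.0 for k in range(num_models)]
            for _ in range(num_units)]

# Hypothetical tiny layer: 4 FFN neurons, 2 attention heads, N = 2 experts.
lambda_F = init_mask_logits(num_units=4, num_models=2)
lambda_A = init_mask_logits(num_units=2, num_models=2)

# Normalized fusion weights: each unit's weights are positive and sum to 1,
# with the dominant expert (index 0) starting above an even split.
weights_F = [softmax(v) for v in lambda_F]
weights_A = [softmax(v) for v in lambda_A]
```

Because the weights come from a Softmax, every virtual unit stays a convex combination of the experts' units throughout training.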

[0025] Step S2: construct a component-level weighted computation graph. This step defines the forward propagation logic of the data flow in the model. The system executes differentiated fusion strategies based on component type.

[0026] For the FFN layer, the system reads the neuron parameters (including Gate, Up, and Down projections) at the same location from each expert model, calculates a weighted sum based on the neuron mask defined in step S1, and synthesizes the current "virtual neuron" to participate in the computation. For the Attention layer, the system reads the attention head parameters (Q, K, V, O matrices) at the same location from each expert model, calculates a weighted sum based on the attention head mask, and synthesizes a "virtual attention head." For the input embedding layers and the output language model head (LM Head), the system does not perform weighted fusion, but directly forces the replication of the parameters of the dominant expert (M_math). This strategy ensures that the model always operates within a latent space suitable for logical reasoning and can output an accurate mathematical symbol probability distribution.
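The weighted sum that produces a "virtual neuron" or "virtual attention head" can be sketched as below. This is an illustration only: it assumes each functional unit corresponds to one row of a projection matrix (for projections where a unit is a column, such as a Down or O projection in a typical layout, the same routine would be applied to the transpose), and all names and values are hypothetical.

```python
def fuse_units(expert_mats, unit_weights):
    """Fuse per-unit rows from several experts.

    expert_mats[k][j] is row j (one functional unit) of expert k's matrix.
    unit_weights[j][k] is the normalized weight of expert k for unit j.
    """
    fused = []
    for j, weights in enumerate(unit_weights):
        width = len(expert_mats[0][j])
        fused.append([
            sum(w * mat[j][c] for w, mat in zip(weights, expert_mats))
            for c in range(width)
        ])
    return fused

def expand_head_weights(head_weights, head_dim):
    """Repeat each head's weight vector for the head_dim rows that head owns."""
    return [w for w in head_weights for _ in range(head_dim)]

# Hypothetical Up-projection slices: 2 neurons, hidden size 2, experts [math, chat].
math_up = [[1.0, 1.0], [1.0, 1.0]]
chat_up = [[3.0, 3.0], [3.0, 3.0]]
neuron_weights = [[1.0, 0.0],   # neuron 0 is taken entirely from the math expert
                  [0.5, 0.5]]   # neuron 1 is an even blend of both experts
virtual_up = fuse_units([math_up, chat_up], neuron_weights)

# One attention head with head_dim = 2 expands to two row-level weight vectors,
# so the same fuse_units routine can be reused for Q, K, and V rows.
row_weights = expand_head_weights([[0.7, 0.3]], head_dim=2)
```

Sharing one weight vector across the Gate, Up, and Down slices of a neuron (or the Q, K, V, O slices of a head) is what keeps the unit functionally intact rather than mixing its parts independently.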

[0027] Step S3: perform decoupled mask training. This step is the core optimization process. The system trains the model using mixed data containing samples of different abilities (such as English math problems and target language dialogues).

[0028] The mixed data is forward-propagated through the computation graph constructed in step S2 to calculate the cross-entropy loss between the predicted result and the true label. During backpropagation, gradients are only propagated back to the mask parameters initialized in step S1. To address the differences in optimization characteristics among different components, the system employs a decoupled optimization strategy: a high learning rate (e.g., 2.5 × 10^(-3)) is applied to the FFN mask to overcome the large optimization inertia of neurons, enabling them to quickly extract key knowledge from the mathematical expert; a low learning rate (e.g., 5 × 10^(-4)) is applied to the MHA mask to maintain the stability of the attention routing mechanism and prevent language features from severely disrupting the logical chain. After completing a predetermined number of training rounds, the final optimized mask parameters are output, thus completing the construction of the fusion model.
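The two properties of step S3 (only the mask receives updates, and FFN and MHA masks use different learning rates) can be illustrated with a toy update rule. This sketch is not the training procedure of this application: it replaces the cross-entropy loss with a squared error on a single scalar parameter, uses plain gradient descent, and all names, targets, and parameter values are hypothetical. Only the two learning rates are the example values quoted in the text.

```python
import math

LR = {"ffn": 2.5e-3, "mha": 5e-4}   # example rates quoted in the text

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(masks, experts, target):
    """One gradient step on the mask logits only; expert parameters stay frozen.

    masks[name]   = [logit_dominant, logit_other]
    experts[name] = (param_dominant, param_other), never modified here
    """
    for name, (a, b) in list(masks.items()):
        p_dom, p_other = experts[name]
        w = sigmoid(a - b)                      # softmax over two logits
        fused = w * p_dom + (1.0 - w) * p_other
        # Gradient of (fused - target)^2 w.r.t. logit a; for logit b it is the negative.
        grad_a = 2.0 * (fused - target[name]) * (p_dom - p_other) * w * (1.0 - w)
        lr = LR[name.split(".")[0]]             # pick the rate by component type
        masks[name] = [a - lr * grad_a, b + lr * grad_a]

masks = {"ffn.neuron0": [0.731, 0.0], "mha.head0": [0.731, 0.0]}
experts = {"ffn.neuron0": (1.0, 3.0), "mha.head0": (1.0, 3.0)}
target = {"ffn.neuron0": 1.0, "mha.head0": 1.0}   # toy target equal to the dominant value

train_step(masks, experts, target)
```

With identical gradients, the FFN logit moves five times as far as the MHA logit in one step, which is the intended effect of the decoupled rates. In a deep learning framework the same split would typically be expressed as two optimizer parameter groups.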

[0029] In implementing this application, the inventors also considered the following alternative solutions: In the early stages of development, this application focused on a "layer-based fusion" approach. This approach avoids refining the granularity down to the neuron level, instead learning a fusion weight for each layer of the Transformer. Advantages: This approach significantly reduces the number of mask parameters that need to be learned, making it simpler to implement and with extremely low computational overhead. Disadvantages: Experiments show that a single layer of the Transformer often performs multiple functions simultaneously. For example, neurons in a certain layer may simultaneously contain neurons that process syntax and neurons that process logic. If weighting is applied at the layer level, these two types of functions cannot be separated, resulting in the model either retaining logic but losing language fluency, or retaining language but degrading logical ability. In contrast, the component-level fusion proposed in this patent application achieves a more refined separation.
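To make the parameter-count trade-off concrete, the following arithmetic sketch compares the number of learnable mask values under layer-level and component-level fusion. It uses the published Llama-3-8B shape (32 layers, 14336 FFN intermediate neurons, 32 query heads) purely as an example; the assumptions that there is one mask vector per query head and one per layer in the alternative scheme are ours, not stated in the source.

```python
# Published Llama-3-8B shape, used only as an illustrative configuration.
NUM_LAYERS = 32
FFN_NEURONS = 14336   # intermediate size of each FFN
ATTN_HEADS = 32       # query heads per layer (assumption: one mask per query head)
NUM_MODELS = 2        # N, e.g. a math expert and a dialogue expert

def component_level_mask_params(layers, neurons, heads, n_models):
    """One length-N weight vector per neuron and per attention head."""
    return layers * (neurons + heads) * n_models

def layer_level_mask_params(layers, n_models):
    """One length-N weight vector per layer (the rejected alternative)."""
    return layers * n_models

component = component_level_mask_params(NUM_LAYERS, FFN_NEURONS, ATTN_HEADS, NUM_MODELS)
layer = layer_level_mask_params(NUM_LAYERS, NUM_MODELS)
print(component, layer)
```

Under these assumptions the component-level mask is far larger than the layer-level one, yet still a very small fraction of the frozen backbone's roughly eight billion parameters.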

[0030] Another alternative is "weighted output head fusion," which involves weighting the input and output layers like intermediate layers. Advantages: It maintains consistency in the overall model processing logic, eliminating the need for special handling of specific layers. Disadvantages: This leads to a significant performance degradation. Because different expert models, after fine-tuning the output heads, tend to predict the same token with varying probabilities, linear mixing results in a flat and high-entropy prediction distribution. In tasks like mathematical reasoning, which require precise output of numbers and symbols, this fuzzy probability distribution can cause the model to generate incorrect answers or meaningless characters.
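The flattening effect can be illustrated numerically. For a fixed hidden state, the logits are a linear function of the output-head matrix, so linearly blending two heads blends their logits. The sketch below uses hypothetical three-token logits in which each expert is confident about a different token; it is an illustration of the mechanism, not data from the experiments.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def blend(a, b, w=0.5):
    """Linear mix of two logit vectors, as produced by mixing two output heads."""
    return [w * x + (1.0 - w) * y for x, y in zip(a, b)]

# Hypothetical logits over a 3-token vocabulary for the same hidden state.
logits_math = [4.0, 0.0, 0.0]   # math head strongly prefers token 0
logits_chat = [0.0, 4.0, 0.0]   # chat head strongly prefers token 1

h_math = entropy(softmax(logits_math))
h_chat = entropy(softmax(logits_chat))
h_blend = entropy(softmax(blend(logits_math, logits_chat)))
```

In this toy case the blended distribution splits its mass between the two experts' preferred tokens, so its entropy exceeds that of either expert alone.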

[0031] Before finalizing the solution, this application developed a "unified optimization version." In this version, although a component-level mask structure was adopted, the same learning rate (e.g., 1 × 10^(-3)) was used for all modules (FFN and Attention) during training. Advantages: Simple hyperparameter settings, standard training process, no need to adjust optimizer settings for different modules. Disadvantages: Although the model converged, the final result was not as expected. Analysis revealed that the mask weights of the FFN layer changed very little, indicating that under a unified learning rate, the optimizer could not overcome the "inertia" of the FFN neurons, causing the model to fail to fully activate the key knowledge neurons in the mathematical expert. This resulted in a bottleneck in the performance of this version on complex reasoning tasks.

[0032] In addition, this application also tested a version that completely replicates the language head. This version attempts to retain the language model's output head, believing that this helps generate more authentic local language. Advantages: The text generated by the model performs exceptionally well in terms of fluency, with very few grammatical errors. Disadvantages: Mathematical reasoning ability nearly collapses (accuracy drops from 60% to 16%). This is because the language model's output head tends to predict high-frequency everyday words while suppressing low-frequency mathematical symbols and numbers. This "lexical mismatch" means that even if the intermediate layer reasoning is correct, the correct mathematical answer cannot be output in the end.

[0033] In terms of direct results, this approach achieves significant performance breakthroughs in challenging complex tasks such as cross-linguistic mathematical reasoning. Experimental data shows that in the Thai mathematical reasoning task, the accuracy of this approach (27.68%) far exceeds that of benchmark methods such as TIES-Merging (18.96%), while perfectly preserving the original English mathematical ability (approximately 79%), effectively solving the catastrophic forgetting problem. Furthermore, due to the adoption of a frozen backbone parameter strategy, this approach reduces the GPU memory usage during training by approximately 18GB compared to full fine-tuning (taking an 8B model as an example), significantly lowering the hardware requirements.

[0034] In terms of indirect effects, this solution fundamentally changes the data dependency pattern of domain-specific models. Previously, training an Indonesian math model required collecting expensive Indonesian math problem data. With this solution, only two unrelated datasets are needed to synthesize Indonesian math skills: English math data and Indonesian dialogue data. This means that for various less commonly spoken languages or long-tail domains, as long as corresponding skill models exist, composite applications can be quickly built through fusion, significantly reducing the data costs of AI deployment.

[0035] On a deeper scientific level, this approach offers a new perspective on model interpretability. By analyzing the trained mask, this application provides a clear view of the internal division of labor within the model. This visualized "neural routing" mechanism offers theoretical and experimental support for designing more efficient modular neural network architectures in the future.

[0036] Please refer to Figure 2, which shows a flowchart of a large language model fusion method provided in an embodiment of this application.

[0037] As shown in Figure 2:
In step 201, at least two homologous expert models fine-tuned based on the same architecture are loaded, wherein the expert models include a logic-dominated model and other models.
In step 202, learnable mask variables are established at at least one functional unit level of the expert models, wherein the functional unit includes at least a feedforward neural network and a multi-head attention mechanism.
In step 203, a differentiated fusion strategy is executed according to the type of the at least one functional unit. For the feedforward neural network layer, the neuron parameters of each expert model at the same position are read, and the weighted sum of the neuron parameters is calculated to synthesize a virtual neuron. For the multi-head attention mechanism layer, the weighted sum of the attention parameters is calculated to synthesize a virtual attention head. For non-decoder components, the parameters of the logic-dominated expert model are directly copied. Based on the virtual neurons, the virtual attention heads, and the parameters of the logic-dominated expert model, the complete layer parameters are synthesized to participate in the forward computation.
In step 204, the model is trained using mixed data containing samples with different capabilities. Forward propagation is performed according to the forward computation process to calculate the cross-entropy loss between the predicted result and the true label. In the backpropagation stage, the gradient is only propagated back to the established learnable mask variables. A decoupled optimization strategy is applied to address the differences in optimization characteristics of different functional units. After completing a predetermined number of training rounds, the final optimized mask parameters are output to complete the construction of the fusion model.
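The dispatch in step 203 (fuse decoder units with their learned weights, copy everything else from the logic-dominated expert) can be sketched as a single routine over a parameter dictionary. This is a minimal sketch under assumptions: the parameter names ("embed", "ffn.up"), the dictionary layout, and the rule "a parameter is fused if and only if it has mask weights" are all hypothetical conventions chosen for the illustration.

```python
def fuse_state(experts, weights, dominant):
    """Assemble the parameters of the fused model.

    experts: {model_name: {param_name: list of unit rows}}
    weights: {param_name: [ {model_name: weight} per unit ]} for fused params
    Parameters with no entry in `weights` (e.g. embeddings, LM head) are
    copied unchanged from the dominant expert.
    """
    fused = {}
    for name, dom_rows in experts[dominant].items():
        if name not in weights:
            fused[name] = [list(row) for row in dom_rows]
            continue
        fused[name] = [
            [sum(w[m] * experts[m][name][j][c] for m in experts)
             for c in range(len(row))]
            for j, (row, w) in enumerate(zip(dom_rows, weights[name]))
        ]
    return fused

# Hypothetical two-expert state with one embedding row and one FFN neuron.
experts = {
    "math": {"embed": [[1.0, 1.0]], "ffn.up": [[0.0, 2.0]]},
    "chat": {"embed": [[9.0, 9.0]], "ffn.up": [[4.0, 2.0]]},
}
weights = {"ffn.up": [{"math": 0.75, "chat": 0.25}]}

fused = fuse_state(experts, weights, dominant="math")
```

Here the embedding is taken verbatim from the "math" expert, while the FFN row becomes a 0.75 / 0.25 blend of the two experts.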

[0038] This application implements a component-level weighted fusion method through the above-described scheme. Instead of coarsely averaging the entire parameter matrix, this method works at the functional unit level of the Transformer architecture. Specifically, it decomposes the feedforward neural network (FFN) into independent neurons and the multi-head attention layer (MHA) into independent attention heads, and learns a dedicated set of fusion weights for each functional unit. By training on small-scale mixed data, this application's scheme can surgically identify and retain key neurons responsible for logical reasoning (i.e., "logical anchors"), while flexibly invoking language model components when processing language features. To address the problem of ambiguous probability distributions, this invention forcibly retains the embedding layer and output head of the dominant expert (such as a mathematical model), ensuring that the tokens generated by the model have a high degree of determinism. Furthermore, considering the differences in optimization inertia among different components, this invention employs a decoupled optimization strategy, successfully achieving an organic combination of heterogeneous capabilities.

[0039] In some optional embodiments, establishing learnable mask variables at at least one functional unit level of the expert model includes: for each layer of the feedforward neural network, initializing a first weight vector of dimension N for each neuron, where N is the number of models; and for each layer of the multi-head attention mechanism, initializing a second weight vector of dimension N for each attention head. This embodiment differs from traditional layer-based weighting; the system delves into the component granularity: for each layer of the feedforward neural network (FFN), initializing a weight vector λ_F of dimension N (number of models) for each neuron; and for each layer of the multi-head attention mechanism (MHA), initializing a weight vector λ_A of dimension N for each attention head. A single layer of a Transformer often performs multiple functions simultaneously; the above scheme enables a more refined separation of functions.

[0040] In some optional embodiments, after establishing the learnable mask variables, the method further includes: introducing a Softmax function to normalize the first weight vectors and the second weight vectors, and setting a specific bias during initialization so that the initial weights slightly favor the logic-dominant model. This ensures numerical stability and preserves the dominant capability.

[0041] In some optional embodiments, applying the decoupled optimization strategy includes: applying a high learning rate to the feedforward neural network mask to overcome the large optimization inertia of neurons, enabling them to quickly extract key knowledge from the logic-dominated expert model; and applying a low learning rate to the multi-head attention mechanism mask to maintain the stability of the attention routing mechanism and prevent the features of the other models from severely disrupting the logical chain.

[0042] In some optional embodiments, after loading at least two homologous expert models fine-tuned based on the same architecture, the method further includes immediately freezing the original backbone parameters of all models. This ensures that these parameters serve only as a fixed knowledge base in subsequent processes and do not undergo gradient updates. Experimental data shows that, due to the strategy of freezing backbone parameters, this scheme reduces the GPU memory usage during training by approximately 18GB compared to full fine-tuning (taking an 8B model as an example), significantly lowering the hardware threshold.

[0043] In some alternative embodiments, the non-decoder component includes an input embedding layer and an output language model head.

[0044] In a further optional embodiment, the logic-dominant model is a mathematical expert model, and the other models are language dialogue expert models. Thus, the proposed solution achieves significant performance breakthroughs in challenging complex tasks such as cross-linguistic mathematical reasoning. Experimental data shows that in the Thai mathematical reasoning task, the accuracy of this solution (27.68%) far exceeds that of benchmark methods such as TIES-Merging (18.96%), while perfectly preserving the original English mathematical ability (approximately 79%), effectively solving the catastrophic forgetting problem.

[0045] Furthermore, this solution fundamentally changes the data dependency pattern of domain-specific models. Previously, training an Indonesian math model required collecting expensive Indonesian math problem data. With this solution, only two unrelated datasets are needed to synthesize Indonesian math skills: English math data and Indonesian dialogue data. This means that for various less commonly spoken languages or long-tail domains, as long as corresponding skill models exist, composite applications can be quickly built through fusion, significantly reducing the data costs of AI deployment.

[0046] While many large language models (LLMs) excel in areas such as mathematical reasoning and can function as expert models, a single expert model often underperforms on complex tasks that require several different capabilities. This application proposes Component-Level Weighted Fusion (CWF), a parameter-efficient framework designed to combine diverse capabilities. Unlike methods that combine unstructured parameters, CWF operates on functional units, selectively fusing FFN neurons and attention heads by learning lightweight masks. In experiments combining a mathematical expert model with a low-resource language dialogue model, CWF successfully combines the different capabilities and demonstrates superior performance on datasets requiring both mathematical and linguistic skills. This shows the advantage of combining models through functional module fusion, while providing transparent interpretability for understanding the modular nature of LLMs. Furthermore, micro-level analysis of the masks shows that CWF effectively identifies and activates specific neurons responsible for mathematical logic while preserving neurons used for linguistic expression.

[0047] With the development of large-scale language models (LLMs), many general-purpose language models now possess rich capabilities to meet common user needs. However, in highly specialized fields (such as chemical molecule inference [Zhao et al., 2025], medical suggestion generation [Zhang et al., 2024]) or fields with scarce training resources (such as low-resource language [Ruder et al., 2019; Hu et al., 2020]), general-purpose models often struggle to handle specialized tasks in specific domains. To improve performance in such domains, researchers have used massive amounts of domain-specific data to deeply fine-tune general-purpose models, transforming them into expert models with specific capabilities. However, when faced with complex problems requiring the coordination of multiple capabilities, these single-capability expert models may also perform poorly. Further fine-tuning of expert models to supplement missing capabilities may lead to knowledge conflicts and catastrophic forgetting (McCloskey & Cohen, 1989; Kirkpatrick et al., 2017), resulting in severe performance degradation. How to effectively combine the capabilities of different models to solve complex tasks has become a research hotspot in academia.

[0048] A growing body of research addresses this challenge through model collaboration and cooperative reasoning. Researchers have designed systems in which multiple specialized large language models (or agents based on large language models) collaboratively solve complex tasks (Guo et al., 2024). Other studies have made significant progress in mathematical reasoning and problem-solving by exploring planning and tool use through collaborative agents (Du et al., 2024b; Chen et al., 2024). These multi-expert collaborative methods require carefully designed system architectures and prompts, as well as the deployment of multiple models for inference, which increases system complexity. Another important direction is model merging and fusion, which combines multiple expert models into a single model to unify their capabilities. Parameter-efficient methods such as model merging (Matena & Raffel, 2022) and task arithmetic (Ilharco et al., 2023) demonstrate that fine-tuned models can be combined without retraining from scratch. Recent work has proposed strategies such as weight interpolation (Wang et al., 2025) and layer swapping (Bandarkar et al., 2025) to mitigate catastrophic forgetting while achieving strong transferability. Although advanced techniques like TIES-Merging (Yadav et al., 2023) and DARE (Yu et al., 2023) introduce selection mechanisms based on weight magnitude or sign, they essentially treat parameters as unstructured matrices, ignoring the functional modularity of the model structure (such as neurons). Therefore, these methods may fail to faithfully reflect the combined capabilities of the different models and thus fail to fully exploit the original potential of the expert models. Meanwhile, recent research on the interpretability of neural networks shows that large language models possess modular functional structures, and that the model's capabilities are triggered by the combination of these functional modules.
For example, numerous studies have shown that decoder layers of different depths process different types of information (Zeng et al., 2025; Men et al., 2024), while at a more refined module level, neurons in feedforward neural network (FFN) layers are considered to carry the core knowledge of the model (Tan et al., 2024; Geva et al., 2021), and the attention head mechanism has also been found to be crucial for performing different capabilities (Zheng et al., 2024; Han et al., 2025).

[0049] Figure 3 compares traditional parameter-level fusion with the component-level weighted fusion (CWF) proposed in this application. Traditional methods (left) suffer from knowledge interference and a blurred probability landscape due to unstructured averaging, while CWF (right) employs a hierarchical fusion strategy: it learns fine-grained masks for FFN neurons and coarse-grained masks for attention heads, thus achieving selective combination of functional units. Crucially, by strictly preserving the embeddings and language model head of the primary expert, CWF maintains a rigorous probability distribution, effectively decoupling capabilities and preventing catastrophic forgetting. Figure labels, left panel: Traditional Parameter-Level Merging (e.g., TIES, DARE); Math Expert; Chat Expert; Unstructured Averaging; Interference & Noise; Output LM Head (Blurred); Blurred Probability Landscape; Catastrophic Forgetting / Conflict. Figure labels, right panel: Ours: Component-wise Weight Fusion (CWF); Math Expert; Chat Expert; Target Mixed Model; Coarse-grained Head Mask (Per-Head QKV Stack Softmax); Attention Layers; Math MHA QKV Stack; Mixed QKV Stack; Chat MHA QKV Stack; Fine-grained Neuron Mask (Per-Neuron Column Vector Softmax); FFN Layers; Math Neuron Vector; Mixed Neuron Vector; Chat Neuron Vector; Embedding & LM Head; Math Expert Embeddings; Transformer Blocks; Output LM Head; Rigorous Probability Landscape (Math Precision); Disentangled Abilities / Interpretable Routing.

[0050] Inspired by these findings, this application hypothesizes that, to effectively combine multiple capabilities, fusion should be performed at the level of the model's independent functional modules rather than treating parameters as unstructured weight matrices. To this end, this application proposes a component-level weighted fusion (CWF) method that targets the basic processing units of the Transformer model: FFN neurons and attention heads. This granularity is chosen to balance training efficiency and interpretability. Element-level methods require assigning a mask to each parameter, which can easily lead to parameter explosion and optimization difficulties. In contrast, the module-level masking scheme of this application is lightweight and controllable. Furthermore, this granularity aligns with the inherent structure of large language models, allowing the inventors to explicitly observe functional behaviors (such as the selection patterns of FFN neurons). CWF abandons heuristic weight averaging and instead learns lightweight masks on mixed task data, dynamically adjusting the contribution of each functional module. By selectively merging parameters at the same location based on functional importance, this application is able to construct a fusion model (Figure 3). Crucially, this fine granularity allows the fusion model to decouple and recombine diverse capabilities, such as mathematical reasoning and low-resource language generation, thus resolving the conflicts common in coarse-grained fusion methods. Experiments show that when a mathematical expert model is combined with a low-resource language dialogue model, CWF successfully combines these differentiated skills, outperforming the original models and existing fusion baselines on composite tasks.
Further analysis reveals that CWF operates as an interpretable router: it retains knowledge-intensive neurons by learning complex strategies while adjusting the language processing head, thus providing valuable insights into the efficient fusion capabilities of large language models.

[0051] Related work. Model fusion strategies. Model fusion is an efficient way to combine knowledge, integrating the knowledge and capabilities of different models while avoiding the full cost of joint training. Some research focuses on training-free knowledge transfer between models, including parameter averaging (Wortsman et al., 2022; Du et al., 2024a) and structured addition and subtraction (Ilharco et al., 2023; Huang et al., 2024). Learning-based methods such as multi-task learning (Crawshaw, 2020) and ensemble learning (Wang et al., 2024), on the other hand, may achieve better performance at a small training cost. Some alignment strategies have been introduced into the fusion process to reduce knowledge interference between models (Yadav et al., 2023; Yu et al., 2023), but these strategies do not address the functional modules within the model.

[0052] Such fusion methods typically require meticulous examination of weight values, magnitudes, or gradients. This study employs a simple and low-cost joint training method, achieving efficient synergy between model fusion and capability combination through a parameter averaging strategy.

[0053] Functional structure and modules of large language models. Previous research has proposed various forms of functional modules that endow large language models with corresponding capabilities. One line of research focuses on neurons, where individual or grouped neurons encode abstract linguistic knowledge, world knowledge, and task-specific features (Geva et al., 2021; Song et al., 2024). Neuron functionality can also be measured by evaluating semantic and causal contributions (Meng et al., 2022; Rai & Yao, 2024). From the perspective of the Transformer architecture (Vaswani, 2017), the attention head is also considered an interpretable unit: specific heads specialize in syntactic or semantic dependencies (Clark et al., 2019) and perform different tasks by injecting task vectors (Todd et al., 2024). Recent research suggests that functional modules also include circuits, that is, modules distributed throughout the model that work together to handle specific tasks (Yao et al., 2024; Ameisen et al., 2025). In order to fully utilize the capabilities encoded within the original models, this application selects neurons and attention heads as fusion targets.

[0054] Methodology. The component-level weighted fusion (CWF) method proposed in this application aims to fuse two or more expert LLMs with the same architecture. The goal is to create a single fused model that inherits the capabilities of all parent models while maintaining the inference cost of a single model. This section describes the notation and the method for constructing such a model.

[0055] Fine-grained modular fusion. This application focuses on the fusion of two expert LLMs (denoted $M_1$ and $M_2$). A key prerequisite for component-level fusion is that the models must be homologous, that is, derived from the same pre-trained backbone model through fine-tuning. Therefore, both have the same number of layers $l$, number of attention heads $h$, hidden dimension $d$, and FFN intermediate dimension $d_{\text{inter}}$. Although the CWF framework can be mathematically generalized to combinations of $N > 2$ models, for clarity the formulation and experiments in this application focus on the two-model scenario.

[0056] Module 1: FFN neurons (fine fusion).

[0057] This application views a feedforward neural network (FFN) as a collection of functional units called "neurons". For model $M_i$, let the gate and up-projection matrices be denoted $W_{G,i}, W_{U,i} \in \mathbb{R}^{d \times d_{\text{inter}}}$ and the down-projection matrix be denoted $W_{D,i} \in \mathbb{R}^{d_{\text{inter}} \times d}$. Based on the direct contribution of neurons to the output, this application defines the k-th neuron as a tuple of vectors: the input projection vectors $w^{(k)}_{G,i}, w^{(k)}_{U,i} \in \mathbb{R}^{d}$ (the k-th columns of $W_{G,i}$ and $W_{U,i}$) and the output projection vector $w^{(k)}_{D,i} \in \mathbb{R}^{d}$ (the k-th row of $W_{D,i}$).

[0058] This application achieves fusion by weighted averaging of neurons at the same location in multiple models.

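[0059] The fused neuron vectors $\tilde{w}^{(k)}_{*}$, with $* \in \{G, U, D\}$, are computed from the learnable mixing coefficients $\lambda^{(k)}_{F,i}$ as: $$\tilde{w}^{(k)}_{*} = \sum_{i} \lambda^{(k)}_{F,i}\, w^{(k)}_{*,i} \tag{1}$$ The output contribution matrix $Y^{(k)}_{F} \in \mathbb{R}^{s \times d}$ of the k-th fused neuron under the input $X \in \mathbb{R}^{s \times d}$ is: $$Y^{(k)}_{F} = \left( \sigma\!\left(X \tilde{w}^{(k)}_{G}\right) \odot \left(X \tilde{w}^{(k)}_{U}\right) \right) \tilde{w}^{(k)\top}_{D} \tag{2}$$ where $\sigma$ denotes the FFN activation function and $\odot$ the element-wise product. The final FFN output is the sum of the contributions from all fused neurons: $$\mathrm{FFN}(X) = \sum_{k=1}^{d_{\text{inter}}} Y^{(k)}_{F}.$$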
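The neuron-level fusion described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: the SiLU activation is assumed (as in Llama-style gated FFNs), the helper names `fuse_ffn`, `ffn`, and `neuron_contribution` are hypothetical, and the mixing coefficients are random placeholders rather than learned values.

```python
import numpy as np

def silu(x):
    # SiLU activation: an assumption (Llama-style gated FFN), not stated in the text.
    return x / (1.0 + np.exp(-x))

def fuse_ffn(W_G, W_U, W_D, lam):
    """Fuse FFN weights of N experts neuron by neuron.

    W_G, W_U: shape (N, d, d_inter); W_D: shape (N, d_inter, d).
    lam: shape (N, d_inter), one mixing coefficient per expert and neuron.
    """
    G = np.einsum("nk,ndk->dk", lam, W_G)  # weighted sum of the k-th columns
    U = np.einsum("nk,ndk->dk", lam, W_U)
    D = np.einsum("nk,nkd->kd", lam, W_D)  # weighted sum of the k-th rows
    return G, U, D

def ffn(X, G, U, D):
    # Gated FFN forward pass with the fused matrices.
    return (silu(X @ G) * (X @ U)) @ D

def neuron_contribution(X, G, U, D, k):
    # Contribution of fused neuron k to the output, shape (s, d).
    return np.outer(silu(X @ G[:, k]) * (X @ U[:, k]), D[k])

rng = np.random.default_rng(0)
N, d, d_inter, s = 2, 4, 6, 3
W_G = rng.normal(size=(N, d, d_inter))
W_U = rng.normal(size=(N, d, d_inter))
W_D = rng.normal(size=(N, d_inter, d))
X = rng.normal(size=(s, d))

lam = rng.random(size=(N, d_inter))
lam /= lam.sum(axis=0, keepdims=True)  # coefficients of each neuron sum to 1

G, U, D = fuse_ffn(W_G, W_U, W_D, lam)
Y = ffn(X, G, U, D)
Y_sum = sum(neuron_contribution(X, G, U, D, k) for k in range(d_inter))
```

In this sketch `Y` and `Y_sum` agree, which reflects the statement that the FFN output is the sum of the per-neuron contributions. Setting the coefficients of one expert to 1 and the other to 0 recovers that expert's FFN unchanged.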

[0060] Module 2: Attention Heads (Coarse-grained Fusion). While the parameter matrices of the attention layer could technically also be broken down into neurons, this application employs a different granularity. Because the multi-head attention (MHA) layer has an explicit multi-head structure, and each head acts as a module responsible for specific capabilities (Zheng et al., 2024; Elhelo & Geva, 2024), this application uses a weighted average of the head matrices. For model $M_i$, the k-th head is parameterized by the projection matrices $W^{(k)}_{Q,i}, W^{(k)}_{K,i}, W^{(k)}_{V,i} \in \mathbb{R}^{d \times d_h}$ and $W^{(k)}_{O,i} \in \mathbb{R}^{d_h \times d}$ (where $d_h = d / h$). This application uses the head-specific coefficients $\lambda^{(k)}_{A,i}$ to compute the fused head matrices $\tilde{W}^{(k)}_{*}$, with $* \in \{Q, K, V, O\}$: $$\tilde{W}^{(k)}_{*} = \sum_{i} \lambda^{(k)}_{A,i}\, W^{(k)}_{*,i} \tag{3}$$

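[0061] Subsequently, this application uses these fused weights in the standard self-attention mechanism. The contribution of the k-th fused head is: $$Y^{(k)}_{A} = \mathrm{softmax}\!\left( \frac{\left(X \tilde{W}^{(k)}_{Q}\right) \left(X \tilde{W}^{(k)}_{K}\right)^{\top}}{\sqrt{d_h}} \right) \left(X \tilde{W}^{(k)}_{V}\right) \tilde{W}^{(k)}_{O}$$ The final MHA output is the sum over all fused heads: $$\mathrm{MHA}(X) = \sum_{k=1}^{h} Y^{(k)}_{A}.$$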
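The head-level fusion can be sketched in the same way. The following is an illustrative NumPy sketch under assumptions: the function names `fuse_heads` and `fused_mha` are hypothetical, one coefficient per expert and head is shared by the Q, K, V, and O matrices of that head, the causal mask is omitted for brevity, and the coefficient values are placeholders.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_heads(W, lam):
    """W: (N, h, a, b) per-head matrices of N experts; lam: (N, h) coefficients."""
    return np.einsum("nk,nkab->kab", lam, W)

def fused_mha(X, WQ, WK, WV, WO, lam):
    # Fuse each head's Q, K, V, O matrices with the same per-head coefficients.
    Q, K, V, O = (fuse_heads(W, lam) for W in (WQ, WK, WV, WO))
    d_h = Q.shape[-1]
    out = np.zeros((X.shape[0], O.shape[-1]))
    for k in range(Q.shape[0]):
        scores = (X @ Q[k]) @ (X @ K[k]).T / np.sqrt(d_h)
        out += softmax(scores) @ (X @ V[k]) @ O[k]  # contribution of head k
    return out  # no causal mask in this sketch

rng = np.random.default_rng(0)
N, h, d, s = 2, 2, 4, 3
d_h = d // h
WQ, WK, WV = (rng.normal(size=(N, h, d, d_h)) for _ in range(3))
WO = rng.normal(size=(N, h, d_h, d))
X = rng.normal(size=(s, d))

lam = np.array([[0.8, 0.3],   # expert 0: weights for head 0 and head 1
                [0.2, 0.7]])  # expert 1
out = fused_mha(X, WQ, WK, WV, WO, lam)
```

Because the coefficients are indexed per head, head 0 can lean toward one expert while head 1 leans toward the other, which is the coarse-grained routing behavior the text attributes to the attention masks.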

[0062] Competitive selection mechanism. Finally, for the mixing coefficients $\lambda$ used in equations (1) and (3), this application adopts a competitive selection mechanism. Learnable mask tensors $M_F$ and $M_A$ are introduced, which contain the logit values $\alpha$. For any functional unit (neuron or attention head) indexed by k, the weight of model i is: $$\lambda^{(k)}_{i} = \frac{\exp\!\left(\alpha^{(k)}_{i}\right)}{\sum_{j} \exp\!\left(\alpha^{(k)}_{j}\right)}$$ This design ensures that the contributions of the different models' parameters sum to 1, thereby maintaining the activation magnitude of the original backbone network.
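As a rough illustration of how lightweight such masks are, the sketch below builds mask-logit tensors and applies a softmax over the expert axis. The tensor layout (one logit per expert for each FFN neuron and for each attention head) is an assumption, and the sizes are those of a Llama-3-8B-like configuration (32 layers, 32 attention heads, FFN intermediate dimension 14336) used only for illustration.

```python
import numpy as np

def mixing_coefficients(logits):
    """Softmax over the expert axis (last axis) of a mask-logit tensor."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Assumed sizes for illustration only.
n_layers, n_heads, d_inter, n_experts = 32, 32, 14336, 2

M_F = np.zeros((n_layers, d_inter, n_experts))  # logits per FFN neuron
M_A = np.zeros((n_layers, n_heads, n_experts))  # logits per attention head

lam_F = mixing_coefficients(M_F)
lam_A = mixing_coefficients(M_A)
n_mask_params = M_F.size + M_A.size
print(n_mask_params)  # 919552
```

Under these assumptions the masks hold fewer than one million logits, several orders of magnitude fewer than the roughly 8 billion backbone parameters, and every unit's coefficients sum to 1 by construction.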

[0063] Non-decoder components. For non-decoder structures such as the embedding layer and the language model (LM) head, this application employs a selection strategy rather than a fusion strategy. Empirical analysis (see the ablation study in the subsequent embodiments) shows that these components are the key interfaces for mapping latent representations to accurate vocabulary distributions. Linearly fusing different output heads (such as those of a math expert and a chat model) would “blur” the probability landscape required for rigorous symbolic reasoning. Therefore, in the final merged model, this application strictly preserves the non-decoder parameters of the primary expert (such as the math expert) to ensure the generation of accurate numerical and symbolic tokens, and fuses parameters from all parent models only at the MHA and FFN layers.
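The "blurring" effect can be illustrated with a toy example. The sketch below is not the empirical analysis of this application: the logits are made-up numbers, and it only relies on the fact that an LM head is a linear map, so averaging two head matrices averages the logits they produce for the same hidden state.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits from two output heads for the same hidden state (made-up values).
math_head_logits = [4.0, 0.0, 0.0]  # sharply prefers token 0
chat_head_logits = [0.0, 4.0, 0.0]  # sharply prefers token 1

# Averaging two linear heads averages their logits.
fused_logits = [0.5 * a + 0.5 * b
                for a, b in zip(math_head_logits, chat_head_logits)]

p_math = softmax(math_head_logits)
p_fused = softmax(fused_logits)
print(round(max(p_math), 2), round(max(p_fused), 2))  # 0.96 0.47
```

In this toy case the fused head no longer commits to either token: the top probability drops from about 0.96 to about 0.47. This is the kind of flattened distribution that selecting a single expert's head avoids.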

[0064] Mask training. This application employs a training-based method to learn the optimal fusion masks. A hybrid dataset $D = \{D_1, \dots, D_N\}$ is constructed, where each subset $D_i$ contains samples representative of the capability of model $M_i$. The optimization objective is to minimize the standard language modeling loss (cross-entropy) of the fused model $M_{\text{fused}}$ while keeping all original pre-trained parameters $\Theta_{\text{orig}}$ frozen. The learnable mask parameters $\Phi = \{M_A, M_F\}$ are updated as follows: $$\Phi \leftarrow \Phi - \eta\, \nabla_{\Phi}\, \mathcal{L}_{\text{CE}}\!\left(M_{\text{fused}}(\Theta_{\text{orig}}, \Phi);\, D\right)$$ where $\eta$ is the learning rate. During optimization, gradient descent updates only the mask logits.
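The principle of training only the mask logits while the expert parameters stay frozen can be shown on a deliberately tiny problem. In the sketch below, each "expert" is a single frozen scalar weight, the fused weight is their softmax-weighted average, and a squared-error loss stands in for the cross-entropy loss of the text. All names and numbers are hypothetical.

```python
import math

def softmax2(a1, a2):
    m = max(a1, a2)
    e1, e2 = math.exp(a1 - m), math.exp(a2 - m)
    return e1 / (e1 + e2), e2 / (e1 + e2)

def train_mask(w1, w2, target, alpha=(0.5, -0.5), lr=0.5, steps=200):
    """Gradient descent on the mask logits only; w1 and w2 are never updated."""
    a1, a2 = alpha
    losses = []
    for _ in range(steps):
        lam1, lam2 = softmax2(a1, a2)
        fused = lam1 * w1 + lam2 * w2
        err = fused - target
        losses.append(err * err)  # squared error as a stand-in loss
        # Chain rule: dL/da1 = 2*err * lam1*lam2*(w1 - w2), and dL/da2 = -dL/da1.
        g = 2.0 * err * lam1 * lam2 * (w1 - w2)
        a1 -= lr * g
        a2 += lr * g
    return (a1, a2), losses

final_alpha, losses = train_mask(w1=1.0, w2=3.0, target=2.5)
final_lam1, final_lam2 = softmax2(*final_alpha)
```

Starting from the 0.731:0.269 initialization, the logits move until the mixture reproduces the target (here a weight of about 0.25 on the first expert), while the expert weights themselves remain untouched.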

[0065] Experimental verification This section evaluates the performance of CWF by rigorously comparing it with baseline models on a composite inference task. First, the experimental setup is described in detail, including the homologous expert models and the decoupling optimization strategy used in training. Then, the main results from cross-lingual mathematical benchmarks are presented, demonstrating CWF's advantage in incorporating heterogeneous models. Finally, ablation studies validate the effectiveness of the non-decoder component design and learning rate scheduling scheme.

[0066] Experimental setup. Models and datasets. This application selects mathematical reasoning and low-resource language generation tasks to verify the ability of CWF to combine orthogonal capabilities (rigorous symbolic logic and linguistic competence). Experiments employ the Llama-3-8B architecture (Meta AI, 2024). This application uses a custom fine-tuned mathematical expert model (Mmath), together with Sahabat-AI (Koto et al., 2023) and Typhoon (Pipatanakul et al., 2023) as the expert models for Indonesian (Mid) and Thai (Mth), respectively. Mask training uses the GSM8K (Cobbe et al., 2021) training set and subsets extracted from Indonesian and Thai (Phatthiyaphaibun, 2024) dialogue datasets. These data are used only to obtain the masks needed for fusion rather than to inject domain knowledge into the model, so only a small number of samples is required for training. Evaluation is performed on the GSM8K and MATH (Hendrycks et al., 2021) datasets. This application also uses GPT-5 (OpenAI, 2025) to translate these test sets into Indonesian and Thai in order to evaluate the model's mathematical problem-solving capability when low-resource language ability is additionally required. General reasoning capability is evaluated using WinoGrande (Sakaguchi et al., 2020).

[0067] Implementation details. This application initializes the fusion mask with a ratio of 0.731:0.269 in favor of the mathematical expert model. This ratio is obtained by applying the softmax function to mask logits with initial values (0.5, -0.5). This setting aims to preserve the core reasoning capabilities of the mathematical expert model and is further validated in the sensitivity analysis in Appendix A.1. Based on ablation experiments (see subsequent embodiments), this application employs two key strategies: (1) Non-decoder component retention: the embedding layer and language model head of the mathematical expert (Mmath) are strictly retained. Although the experiments also compared schemes that directly copy the chat model's components or learn a weighted fusion, retaining the mathematical expert's components is crucial for maintaining the accurate probability distribution required for numerical and symbolic generation.

[0068] (2) Decoupled optimization: differentiated learning rates are used to adapt to the optimization characteristics of the different functional modules. A uniform learning rate (LR) of $1 \times 10^{-3}$ is used as a baseline for comparison, whereas this framework empirically assigns $\eta_{\text{attn}} = 5 \times 10^{-4}$ to the attention masks and $\eta_{\text{ffn}} = 2.5 \times 10^{-3}$ to the FFN neuron masks. Experiments have shown that this 5x ratio effectively overcomes the optimization inertia of knowledge-intensive FFN neurons while ensuring the stability of routing updates in the attention heads.
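The two implementation details above can be checked with a few lines of plain Python. The sketch first verifies that a softmax over the initial logits (0.5, -0.5) yields the stated 0.731:0.269 ratio, then applies one plain gradient-descent step with the two stated learning rates. The gradient values are made up, the helper `gd_step` is hypothetical, and the optimizer actually used is not specified beyond gradient descent.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(a - m) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]

INIT_LOGITS = (0.5, -0.5)  # (math expert, language expert), from the text
ETA_ATTN = 5e-4            # learning rate for attention-head masks, from the text
ETA_FFN = 2.5e-3           # learning rate for FFN-neuron masks, from the text

def gd_step(logits, grads, lr):
    """One plain gradient-descent step on a list of mask logits."""
    return [a - lr * g for a, g in zip(logits, grads)]

init_weights = softmax(INIT_LOGITS)
print([round(w, 3) for w in init_weights])  # [0.731, 0.269]

grad = [1.0, -1.0]  # made-up gradient, identical for both groups
attn_logits = gd_step(INIT_LOGITS, grad, ETA_ATTN)
ffn_logits = gd_step(INIT_LOGITS, grad, ETA_FFN)
```

For the same gradient, the FFN mask logits move five times as far per step as the attention mask logits, which is the intended effect of the decoupled learning rates.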

[0069] Baseline methods. This application compares the proposed CWF method with the following baselines: (1) the original parent models without any further fine-tuning or merging; (2) sequential fine-tuning, i.e., fine-tuning a parent model on the dataset of the other domain; and (3) other merging methods. Simple averaging is a direct baseline that creates a merged model by linearly averaging the weights of all models. TIES-Merging (Yadav et al., 2023) is an advanced fusion technique that addresses parameter interference when merging the weight increments of the two parent models. DARE (Yu et al., 2023) further employs random pruning and rescaling strategies to merge different models.

[0070] Figure 4 shows the cross-lingual mathematical reasoning results. The CWF model achieves the best balance, significantly outperforming the baselines in the target languages (Indonesian / Thai) while maintaining English reasoning capabilities. Figure labels: Indonesian Fusion; Thai Fusion; Method; GSM-En; GSM-Id; MATH-En; MATH-Id; GSM-Th; Seq-FT; Simple Avg.; TIES-Merging; DARE; CWF (Ours).

[0071] Figure 5 shows overall capability retention (Indonesian / Thai average). Scores are presented in Indonesian / Thai order. CWF maintains strong general reasoning ability, comparable to that of the language experts. Figure labels: Task; CWF (Ours); WinoGrande; HellaSwag.

[0072] Main results. This application focuses on evaluating the ability of fusion models to handle complex tasks that require both rigorous mathematical reasoning and low-resource language generation. Specifically, experiments are conducted in two fusion scenarios: combining the mathematical expert model with an Indonesian chat model (labeled Math+Indonesian), and combining it with a Thai chat model (labeled Math+Thai). Detailed evaluation results are presented in two parts: the core mathematical reasoning results are given in Figure 4, and the general capability analysis in Figure 5. The results demonstrate the effectiveness of CWF in fusing heterogeneous skills, and this application draws the following key conclusions. Outperforming baseline fusion methods on complex tasks. On the main cross-lingual mathematics benchmarks (Figure 4), CWF consistently achieved the best balance between reasoning depth and linguistic breadth. In the Indonesian fusion experiment, CWF demonstrated a comprehensive advantage: 56.40% on the GSM8K-Id task and 43.12% on the MATH-Id task. This performance significantly surpassed TIES-Merging (48.15% / 19.14%) and DARE (28.23% / 38.66%), indicating that component-level selection maintains the integrity of the logical chain better than vector-level fusion. The advantage was more subtle in the Thai fusion experiment: while TIES-Merging achieved a competitive 45.88% on the simpler GSM8K-Th task, it collapsed on the complex MATH-Th benchmark (18.96%) and suffered severe regression on the English tasks. In contrast, CWF demonstrated strong adaptability to challenging problems, achieving the highest score of 27.68% on the MATH-Th task. This indicates that CWF does not merely rely on surface pattern matching, but successfully transfers the deep reasoning structure of the expert model to the target language, effectively resolving the conflict between logic and the characteristics of Thai script.

[0073] Mitigation of catastrophic forgetting. A key drawback of standard fine-tuning and fusion methods lies in the loss of source-domain expertise. As shown in the English results in Figure 4, sequential fine-tuning (mathematics → language) led to significant degradation: the accuracy on English GSM8K dropped from 80.27% to 74.32% (Indonesian) and 73.52% (Thai). In the Thai scenario, TIES-Merging performed even worse, with accuracy plummeting to 60.52%, indicating that its interference elimination heuristic inadvertently corrupted the core reasoning parameters. CWF, on the other hand, effectively resisted this forgetting, maintaining near-perfect English performance (79.35% and 79.53%), with no statistically significant difference from the original mathematical expert model (Mmath). By strictly freezing the backbone network and only learning masks over functional units, CWF successfully isolated the "logic anchor" neurons required for mathematical reasoning, ensuring that they remain active in any linguistic context.

[0074] General capabilities are preserved. It is important to note that the mask optimization in this application utilizes only mathematical problems and everyday dialogue data. Therefore, the performance degradation observed in these out-of-domain tasks is an expected result of this highly specialized fusion. Despite the lack of direct supervision, CWF avoids the catastrophic collapse seen in baseline models, maintaining functional general reasoning capabilities (WinoGrande task score 53.3). This demonstrates that CWF successfully builds a focused expert model without completely destroying its general foundation.

[0075] Ablation study. To validate the architectural decisions, this application conducted controlled experiments on the Math+Thai task and observed a consistent trend on the Indonesian task.

[0076] Figure 6 shows the ablation of non-decoder strategies (Math + Thai). Strategy I preserves the exact numerical token IDs required for reasoning. Figure labels: LM Head Strategy; MATH (En); MATH (Thai); (I) Copy Math Head (Ours); (II) Copy Chat Head; (III) Weighted Fused Head.

[0077] The impact of output head strategies. The language model (LM) head and embedding layer serve as the key interface between the hidden representations and the output token space. As shown in Figure 6, Strategy I (copying the math head) significantly outperforms the other configurations. Strategy II (copying the chat head) causes the MATH (English) accuracy to plummet from 60.52% to 16.54%. This failure stems from vocabulary misalignment: while the decoder layers can compute the correct mathematical logic, the chat-optimized head prioritizes natural language fluency over numerical precision when mapping these representations to token IDs. Strategy III (weighted fusion) also performs poorly because mixing different linear projections "blurs" the sharp probability distribution required for accurate mathematical answers.

[0078] The impact of decoupled optimization. This application hypothesizes a significant difference in optimization inertia between feedforward network (FFN) neurons and attention heads. Preliminary experiments using a uniform learning rate (LR) show that the FFN mask weights change only slightly and fail to activate specialized reasoning paths.

[0079] Figure 7 compares optimization strategies on the Indonesian composite tasks. The decoupled LR enables the FFN knowledge modules to adapt better. Figure labels: Optimization Strategy; GSM8K (Indo); MATH (Indo); Uniform LR; Disentangled LR (Ours).

[0080] To quantify this phenomenon, Figure 7 compares the decoupled optimization (higher $\eta_{\text{ffn}}$, lower $\eta_{\text{attn}}$) with the uniform baseline. The results confirm that assigning a higher LR to the FFN masks effectively overcomes their inertia, while reducing the LR of the attention heads ensures stable language routing. This division of labor allows CWF to accurately select "knowledge-intensive" neurons without interfering with the "routing-intensive" attention mechanism.

[0081] Discussion. This section analyzes the effectiveness of CWF from three perspectives: system efficiency, structural characteristics, and functional mechanisms. First, this application verifies the training efficiency and low resource requirements of the frozen-backbone design by comparing it with standard fine-tuning. Second, by analyzing the learned masks, it reveals how the model maintains its reasoning logic when adapting to new languages. Finally, through an analysis of specific neuron activations, it clarifies how CWF functions as an interpretable router between different functional paths.

[0082] Methodological advantages of CWF. Figure 8 compares the peak memory usage for the Llama-3-8B model (batch size = 2, gradient accumulation = 4). CWF saves significant memory by eliminating the backbone optimizer states. Figure labels: Method; #Models; Opt. States; VRAM; Full FT; ~8B Params; CWF (Ours); Masks Only.

[0083] Computational efficiency. By freezing the backbone weights and updating only the lightweight masks, CWF significantly lowers the hardware threshold. Unlike standard fine-tuning, which requires maintaining large optimizer states for all parameters, this method significantly reduces memory overhead. As shown in Figure 8, even with two parent models loaded, CWF still consumes approximately 18GB less VRAM than single-model full fine-tuning. This efficiency enables high-performance model fusion on consumer-grade hardware, where full fine-tuning would otherwise run out of memory.

[0084] Data efficiency and composability. A key advantage of CWF is its ability to synthesize capabilities without target task supervision. As described in the experimental setup, this method relies only on small, mutually exclusive datasets. Crucially, CWF does not require matching data from two different skills simultaneously. For example, when building an Indonesian math problem solver, this application trains using only separate English math datasets and Indonesian chat datasets, completely eliminating the need to acquire scarce and expensive Indonesian math corpora.

[0085] Figure 9 shows the layer-wise average fusion weights for the Math + Thai and Math + Indonesian merges. The vertical axis represents the weight w assigned to the math expert (Mmath), with the corresponding weight 1-w assigned to the language expert. The weights of the FFN layers (red line, ≈0.7) are consistently higher and more stable than those of the multi-head attention (MHA) layers (green line), confirming that the FFN is the primary repository of reasoning knowledge. Figure labels: Layer-wise Model Fusion Weights; Average Weight for Math Expert; Transformer Layer Index; Indonesian SA (Self-Attention); Thai SA (Self-Attention); Thai FF (Feed-Forward); Indonesian FF (Feed-Forward).

[0086] Analysis of learned fusion masks. To understand the internal mechanisms of the fusion process, this application visualizes the masks learned in the Math + Thai and Math + Indonesian experiments. In all visualizations, "fusion weight" refers to the weight assigned to the mathematical expert model Mmath. Therefore, a math-model weight of w corresponds to a language-model weight of 1-w.

[0087] FFN inertia and attention routing. As shown in Figure 9, in both fusion scenarios the FFN layers consistently receive high fusion weights for the mathematical expert. This confirms that the FFN layers encode the factual and semantic knowledge required for mathematical reasoning. During the experiments, the inventors observed significant optimization inertia in the FFN masks: their weights change only slightly at low learning rates. To overcome this problem, the decoupled optimization scheme in this application uses a higher learning rate to activate the specialized neurons of the FFN, while setting a lower learning rate for the MHA layers to ensure the stability of cross-lingual information routing.

[0088] Strategy divergence and language adaptation. This division of labor can be further clarified by analyzing the difference between the FFN and attention fusion weights in each layer. A large difference indicates that the FFN and the attention heads in that layer are treated in completely different ways, reflecting a highly specialized fusion strategy.
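One simple way to quantify this per-layer difference is sketched below. The measure used here (the absolute difference between the layer-wise mean FFN weight and the layer-wise mean attention weight for the math expert) is an assumption for illustration, as are the function name `layer_divergence` and the toy weight values. The text does not specify the exact statistic.

```python
import numpy as np

def layer_divergence(lam_ffn, lam_attn):
    """Per-layer mean math-expert weights and their absolute difference.

    lam_ffn: (n_layers, d_inter) weights per FFN neuron.
    lam_attn: (n_layers, n_heads) weights per attention head.
    """
    ffn_mean = lam_ffn.mean(axis=1)
    attn_mean = lam_attn.mean(axis=1)
    return ffn_mean, attn_mean, np.abs(ffn_mean - attn_mean)

# Toy values for two layers (made up, not learned masks).
lam_ffn = np.array([[0.7, 0.7, 0.7],
                    [0.6, 0.8, 0.7]])
lam_attn = np.array([[0.3, 0.5],
                     [0.7, 0.7]])

ffn_mean, attn_mean, divergence = layer_divergence(lam_ffn, lam_attn)
most_divergent_layer = int(np.argmax(divergence))
```

In this toy input, layer 0 has an FFN mean of 0.7 and an attention mean of 0.4, so it is flagged as the layer where the two module types are treated most differently.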

[0089] In the fusion of mathematics and Thai, this difference is most pronounced in the initial and final layers. This application hypothesizes that the attention heads at these layers require a higher proportion of contribution from the Thai model Mth to handle language-specific grammar and formatting rules. Conversely, this pattern is weaker in the Indonesian model Mid. This observation supports the language affinity hypothesis: due to Indonesian's shared Latin alphabet and structural similarities with English, it requires less specialization at the input stage than the structurally significantly different Thai. However, both exhibit a high degree of divergence at the final layer, confirming that output generation remains a highly language-feature-dependent task.

[0090] Figure 10 shows the per-layer distribution of fusion weights (feature selection) across FFN neurons. The red bars mark the median weights, and the dashed line marks the initial value (0.731). The "spindle" shape and extended tails in all 32 layers indicate that, while most neurons remain stable knowledge anchors, some neurons undergo significant specialization to balance reasoning and language needs. Figure labels: Median; Init (0.731); Fusion Weight (vertical axis); Transformer Layer Index (horizontal axis); Layers 0-15; Layers 16-31.

[0091] Neuron-level specialization in the FFN. To further reveal the micro-mechanisms of capability combination, this application analyzes the distribution of FFN neuron weights in Figure 10. The results show that, although every neuron is initialized at 0.731, the median learned weight in each of the 32 layers (red bars) stays in a slightly lower range (approximately 0.68–0.70). This phenomenon is termed the "global fluency tax": the fusion process forces the mathematical expert to cede a small, consistent share of its weight to the language expert, resulting in a systematic adjustment across the whole model. This indicates that dialogue fluency is not confined to specific modules but requires a global, uniform concession of parameter weight across the entire model depth.
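The figures above can be related to the Softmax normalization recited in claim 3 with a small worked example. One way to obtain an initial weight of 0.731 with two experts is a bias of 1.0 on the logit of the logic-dominated model; that bias value is an assumption made here for illustration, since the text states only the resulting 0.731. The per-layer medians below are likewise hypothetical.

```python
import math
from statistics import median


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed initialization: logit 1.0 for the math expert, 0.0 for the
# language expert. softmax([1, 0])[0] = e / (1 + e), about 0.731.
init_w = softmax([1.0, 0.0])[0]

# Hypothetical learned math-expert weights for the neurons of three layers.
layers = [
    [0.95, 0.70, 0.69, 0.68, 0.25],
    [0.90, 0.71, 0.68, 0.66, 0.30],
    [0.99, 0.72, 0.70, 0.69, 0.20],
]

# "Fluency tax" per layer: how far the median has dropped below the init.
tax = [init_w - median(layer) for layer in layers]
```

A positive value in every layer corresponds to the uniform downward shift of the median that the paragraph describes.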

[0092] Significant intra-layer heterogeneity is observed. Weight values are concentrated in the 0.5 to 0.9 range, indicating that most neurons operate as collaborative units. The tails of the distribution, which extend to the extremes, are particularly important: neurons with weights close to 1.0 act as invariant "logic anchors", potentially encoding unambiguous mathematical facts. Conversely, neurons in the lower tail (w≈0.2) are specialized for processing the grammar of the target language. This stable distribution explains the CWF model's strong resistance to catastrophic forgetting: it achieves fluency through global weight allocation while strictly protecting key logical neurons.
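The three groups of neurons described above can be separated with simple thresholds on the learned weight. The sketch below is illustrative only: the cut-off values 0.95 and 0.3 and the label names are assumptions, because the text gives approximate ranges ("close to 1.0", "w≈0.2") rather than exact boundaries.

```python
def classify_neuron(w, anchor_min=0.95, language_max=0.3):
    """Label a neuron by its fusion weight w for the math expert.

    anchor_min and language_max are assumed cut-offs, not values
    stated in the application.
    """
    if w >= anchor_min:
        return "logic_anchor"
    if w <= language_max:
        return "language_specialized"
    return "collaborative"


# Hypothetical weights: one anchor, two collaborative units, one
# language-specialized neuron.
weights = [0.99, 0.70, 0.55, 0.20]
labels = [classify_neuron(w) for w in weights]
```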

[0093] Figure 11 shows microscopic activation dynamics. M-biased (mathematics-biased) and L-biased (language-biased) neurons are visualized across layers. (a) Shallow input filtering: M-biased neurons remain silent (white) when given Indonesian prompts. (b) Mid-layer specialization: under mathematical prompts, M-biased neurons are strongly activated by symbols (dark red), while L-biased neurons are inhibited. (c) Deep convergence: the same L-biased neuron is activated in both mathematical and Indonesian contexts to handle the final lexical generation. Figure labels: (a) Layer 2: Early Filtering (Input: Indo, i.e. Indonesian); (b) Layer 17: Specialization (Input: Math); (c) Layer 30: Convergence (Neuron: L-biased); L-biased (language-biased); M-biased (mathematics-biased); Context: Math; Context: Indo; Activation: Low, High.

[0094] Microscopic analysis: CWF as a hierarchical router. Visualizing the "switching" mechanism. To reveal how CWF coordinates conflicting capabilities within a single model, this application visualizes the activation patterns of individual neurons in Figure 11. The selection method is straightforward: using the learned fusion mask, the single most representative neuron for each expert is identified at a specific depth. Specifically, this application selects as the "M-biased" neuron the one with the highest learned weight for the mathematics expert, representing the core of the reasoning backbone; conversely, it selects as the "L-biased" neuron the one with the highest weight for the language expert, representing the center of language fluency. This application then inputs Indonesian news clips and mathematical equations and observes the responses of these neurons.
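For the two-expert case, this selection rule reduces to an argmax and an argmin over the learned mask, because the language expert's weight is 1-w. The following sketch assumes that setting; the function name `select_representative_neurons` and the example weights are hypothetical.

```python
def select_representative_neurons(math_weights):
    """Return (m_biased_index, l_biased_index) for one layer.

    math_weights[i] is the learned fusion weight w of neuron i for the
    math expert; the language expert's weight is 1 - w, so the neuron
    with the highest language weight is the one with the lowest w.
    """
    indices = range(len(math_weights))
    m_biased = max(indices, key=lambda i: math_weights[i])
    l_biased = min(indices, key=lambda i: math_weights[i])
    return m_biased, l_biased


# Hypothetical learned mask for a layer with four neurons.
layer_mask = [0.70, 0.98, 0.21, 0.66]
m_idx, l_idx = select_representative_neurons(layer_mask)
```

Here neuron 1 would be tracked as the M-biased neuron and neuron 2 as the L-biased neuron.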

[0095] Shallow (L2): Early domain filtering. The visualization in Figure 11(a) shows that task separation is achieved immediately at the model entry point: when processing Indonesian text, L-biased neurons are activated normally to parse grammatical structures, whereas M-biased neurons are completely silent (white), with an activation value of zero. This indicates that CWF establishes an "early filtering" mechanism: the mathematical backbone filters out irrelevant linguistic noise at the embedding stage rather than processing all inputs indiscriminately. This ensures that the specialized capabilities of the mathematical expert are preserved and used only for relevant logical symbols, and are not wasted on general chat data.

[0096] Middle layer (L17): Orthogonal specialization. The most critical routing behavior occurs in the middle layers, which act as the model's logic engine. As shown in Figure 11(b), when mathematical prompts are introduced, the model exhibits a clear functional differentiation: M-biased neurons are strongly activated (dark red) when encountering numbers and variables, confirming their role as "logic anchors", while L-biased neurons are strictly suppressed (white). This orthogonal activation pattern, with the reasoning module active and the chat module silent, is the concrete manifestation of the decoupling optimization scheme in this application. It avoids the "hallucination" phenomenon common in standard models, where the language probability distribution interferes with precise symbolic reasoning.

[0097] Deep layer (L30): Output convergence. Ultimately, the different paths must merge to generate a human-readable output. In the deep layer shown in Figure 11(c), this application observes that L-biased neurons (operating as decoders) are activated under both mathematical and Indonesian inputs. This convergence supports the "language carrier" hypothesis of this application: regardless of whether internal processing is driven by mathematics-specific circuits or dialogue-specific circuits, the final hidden state must be projected back to a shared lexical space. The L-biased neurons in the deep layers therefore act as a universal interface, transforming abstract logic into fluent tokens to generate the final answer.

[0098] Conclusion. This application proposes Component-Level Weighted Fusion (CWF), a parameter-efficient framework that combines different expert capabilities by selectively fusing functional modules (FFN neurons and attention heads). Cross-lingual mathematical reasoning experiments demonstrate that CWF consistently outperforms strong baselines such as TIES-Merging and DARE on complex composite tasks. Crucially, the micro-analysis in this application reveals that CWF functions as an interpretable, fine-grained router: it identifies and activates the specific neurons responsible for mathematical logic while preserving the core units of linguistic expression. This approach provides a scalable and transparent solution for the on-demand combination of expert knowledge in large language models.

[0099] In other embodiments, the present invention also provides a non-volatile computer storage medium storing computer-executable instructions that can execute the large language model fusion method in any of the above method embodiments for draft models and target verification models. In one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
load at least two homologous expert models fine-tuned based on the same architecture, wherein the expert models include a logic-dominated model and other models;
establish learnable mask variables at the level of at least one functional unit of the expert models, wherein the functional units include at least a feedforward neural network and a multi-head attention mechanism;
execute a differentiated fusion strategy according to the type of the at least one functional unit, wherein: for a feedforward neural network layer, the neuron parameters of each expert model at the same position are read and their weighted sum is calculated to synthesize a virtual neuron; for a multi-head attention mechanism layer, the weighted sum of the attention parameters is calculated to synthesize a virtual attention head; and for non-decoder components, the parameters of the logic-dominated expert model are copied directly;
synthesize the complete layer parameters from the virtual neurons, the virtual attention heads, and the parameters of the logic-dominated expert model, to participate in the forward computation;
train the model using mixed data containing samples with different capabilities, wherein forward propagation is performed according to the forward computation process to calculate the cross-entropy loss between the predicted results and the true labels, and, in the backpropagation stage, the gradient is propagated back only to the initialized parameters of the established learnable mask variables;
apply a decoupling optimization strategy to address the differences in optimization characteristics of the different functional units; and
after completing a predetermined number of training rounds, output the final optimized mask parameters to complete the construction of the fusion model.
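As a minimal sketch of the feedforward fusion step in this embodiment, the code below synthesizes each virtual neuron as a weighted sum of the corresponding neurons of the expert models, with one weight per neuron per expert. Several details are assumptions made for illustration: a neuron is represented as one row of weights, the per-neuron weights are obtained by a softmax over mask logits, and the names `fuse_ffn_layer` and `mask_logits` are hypothetical. The same pattern would apply per attention head rather than per neuron.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_ffn_layer(expert_layers, mask_logits):
    """Synthesize virtual neurons from N frozen expert layers.

    expert_layers: list of N layers; each layer is a list of neurons and
                   each neuron is a list of floats (assumed row layout)
    mask_logits:   one list of N logits per neuron (the learnable mask)
    """
    fused = []
    for j, logits in enumerate(mask_logits):
        weights = softmax(logits)  # one weight per expert for neuron j
        width = len(expert_layers[0][j])
        fused.append([
            sum(w * layer[j][k] for w, layer in zip(weights, expert_layers))
            for k in range(width)
        ])
    return fused


# Two hypothetical experts, each with two neurons of width 2.
math_expert = [[1.0, 2.0], [3.0, 4.0]]
lang_expert = [[0.0, 0.0], [1.0, 0.0]]

# Neuron 0 is fused evenly; neuron 1 is biased toward the math expert.
mask_logits = [[0.0, 0.0], [1.0, 0.0]]
fused = fuse_ffn_layer([math_expert, lang_expert], mask_logits)
```

Only `mask_logits` would be trained in such a setup; the expert parameters stay frozen and the fused layer is recomputed from them in each forward pass.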

[0100] The non-volatile computer-readable storage medium may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created through use of the large language model fusion method and system. Furthermore, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium may optionally include memory located remotely from the processor, and this remote memory may be connected via a network to the device that performs the large language model fusion method. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

[0101] This invention also provides a computer program product, which includes a computer program stored on a non-volatile computer-readable storage medium. The computer program includes program instructions, which, when executed by a computer, cause the computer to perform any of the above-described large language model fusion methods.

[0102] Figure 12 is a schematic diagram of the structure of the electronic device provided in an embodiment of the present invention. As shown in Figure 12, the device includes one or more processors 710 and a memory 720; Figure 12 takes one processor 710 as an example. The device for the large language model fusion method and system may further include an input device 730 and an output device 740. The processor 710, memory 720, input device 730, and output device 740 can be connected via a bus or by other means; Figure 12 takes a bus connection as an example. The memory 720 is the aforementioned non-volatile computer-readable storage medium. The processor 710 executes various server functions and data processing by running the non-volatile software programs, instructions, and modules stored in the memory 720, thereby implementing the large language model fusion method described in the above embodiments. The input device 730 can receive input numeric or character information and generate key signal inputs related to user settings and function control of the large language model routing device. The output device 740 may include a display screen or other display device.

[0103] The above-described product can execute the method provided in the embodiments of the present invention, and has the corresponding functional modules and beneficial effects for executing the method. Technical details not described in detail in this embodiment can be found in the method provided in the embodiments of the present invention.

[0104] In one implementation, the above-described electronic device is applied in a large language model routing device and comprises: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to:
load at least two homologous expert models fine-tuned based on the same architecture, wherein the expert models include a logic-dominated model and other models;
establish learnable mask variables at the level of at least one functional unit of the expert models, wherein the functional units include at least a feedforward neural network and a multi-head attention mechanism;
execute a differentiated fusion strategy according to the type of the at least one functional unit, wherein: for a feedforward neural network layer, the neuron parameters of each expert model at the same position are read and their weighted sum is calculated to synthesize a virtual neuron; for a multi-head attention mechanism layer, the weighted sum of the attention parameters is calculated to synthesize a virtual attention head; and for non-decoder components, the parameters of the logic-dominated expert model are copied directly;
synthesize the complete layer parameters from the virtual neurons, the virtual attention heads, and the parameters of the logic-dominated expert model, to participate in the forward computation;
train the model using mixed data containing samples with different capabilities, wherein forward propagation is performed according to the forward computation process to calculate the cross-entropy loss between the predicted results and the true labels, and, in the backpropagation stage, the gradient is propagated back only to the initialized parameters of the established learnable mask variables;
apply a decoupling optimization strategy to address the differences in optimization characteristics of the different functional units; and
after completing a predetermined number of training rounds, output the final optimized mask parameters to complete the construction of the fusion model.

[0105] The electronic devices described in this application exist in various forms, including but not limited to: (1) Mobile communication devices: These devices are characterized by their mobile communication capabilities and primarily aim to provide voice and data communication. These terminals include: smartphones, multimedia phones, feature phones, and low-end phones, etc.

[0106] (2) Ultra-mobile personal computer devices: These devices fall under the category of personal computers, possessing computing and processing capabilities, and generally also have mobile internet access features. These terminals include PDAs, MIDs, and UMPCs, etc.

[0107] (3) Portable entertainment devices: These devices can display and play multimedia content. This category includes: audio and video players, handheld game consoles, e-book readers, as well as smart toys and portable car navigation devices.

[0108] (4) Server: A device that provides computing services. The components of a server include a processor, hard disk, memory, system bus, etc. Servers are similar to general computer architectures, but because they need to provide highly reliable services, they have higher requirements in terms of processing power, stability, reliability, security, scalability, and manageability.

[0109] (5) Other electronic devices with data interaction functions.

[0110] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0111] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., including several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods of various embodiments or some parts of embodiments.

[0112] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for fusing large language models, comprising:
loading at least two homologous expert models fine-tuned based on the same architecture, wherein the expert models include a logic-dominated model and other models;
establishing learnable mask variables at the level of at least one functional unit of the expert models, wherein the functional units include at least a feedforward neural network and a multi-head attention mechanism;
executing a differentiated fusion strategy according to the type of the at least one functional unit, wherein: for a feedforward neural network layer, the neuron parameters of each expert model at the same position are read and their weighted sum is calculated to synthesize a virtual neuron; for a multi-head attention mechanism layer, the weighted sum of the attention parameters is calculated to synthesize a virtual attention head; and for non-decoder components, the parameters of the logic-dominated expert model are copied directly;
synthesizing the complete layer parameters from the virtual neurons, the virtual attention heads, and the parameters of the logic-dominated expert model, to participate in the forward computation;
training the model using mixed data containing samples with different capabilities, wherein forward propagation is performed according to the forward computation process to calculate the cross-entropy loss between the predicted results and the true labels, and, in the backpropagation stage, the gradient is propagated back only to the initialized parameters of the established learnable mask variables;
applying a decoupling optimization strategy to address the differences in optimization characteristics of the different functional units; and
after completing a predetermined number of training rounds, outputting the final optimized mask parameters to complete the construction of the fusion model.

2. The method according to claim 1, characterized in that establishing the learnable mask variables at the level of at least one functional unit of the expert models includes: for each feedforward neural network layer, initializing a first weight vector of dimension N for each neuron, where N is the number of models; and for each multi-head attention mechanism layer, initializing a second weight vector of dimension N for each attention head.

3. The method according to claim 2, characterized in that, after establishing the learnable mask variables, the method further includes: introducing the Softmax function to normalize the first and second weight vectors, and setting a specific bias during initialization so that the initial weights are slightly biased towards the logic-dominated model.

4. The method according to claim 1, characterized in that applying the decoupling optimization strategy includes: applying a high learning rate to the feedforward neural network masks to overcome the large optimization inertia of the neurons, enabling them to quickly extract key knowledge from the logic-dominated expert model; and applying a low learning rate to the multi-head attention mechanism masks to maintain the stability of the attention routing mechanism and prevent the features of the other models from severely disrupting the logical link.

5. The method according to claim 1, characterized in that, after loading the at least two homologous expert models fine-tuned based on the same architecture, the method further includes: immediately freezing the original backbone parameters of all models.

6. The method according to claim 1, characterized in that the non-decoder components include an input embedding layer and an output language model head.

7. The method according to any one of claims 1-6, characterized in that the logic-dominated model is a mathematical expert model, and the other models are language dialogue expert models.

8. An electronic device comprising: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method according to any one of claims 1-7.

9. A storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1-7.