A natural language processing model fine-tuning method based on granular sparse gating

CN122287754A · Pending · Publication Date: 2026-06-26 · WUHAN UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
WUHAN UNIV OF SCI & TECH
Filing Date
2026-05-08
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing techniques for fine-tuning large language models suffer from gradient entanglement, coarse gating granularity, and a lack of dynamic adaptability in subspace parameter allocation, which makes it difficult to balance computing power and accuracy when deploying models on smart terminals.

Method used

A fine-tuning method for natural language processing models based on granular sparse gating is adopted. The input and output dimensions of the model's hidden layer are divided into multiple subspaces to construct parallel adapter sub-blocks. Granular gating vectors are then used for dynamic sparsification and gradient updates to achieve fine-grained feature selection and adaptive parameter allocation.

Benefits of technology

It significantly suppresses gradient entanglement, improves natural language recognition accuracy, reduces hardware load, enables the deployment of lightweight models, and adapts to the storage and computing power limitations of edge devices.

✦ Generated by Eureka AI based on patent content.

[Figure: CN122287754A_ABST]

Abstract

This invention discloses a method for fine-tuning a natural language processing model based on granular sparse gating, relating to intelligent interaction and natural language processing technologies. This method is designed for memory-constrained intelligent interactive terminals. It acquires the natural language text data input from the terminal and converts it into word embedding features. The adapter space is divided into granular sub-blocks, and dynamic sparse gating vectors are embedded for fine-grained feature selection. During the training phase, a dual-path update mechanism is employed: conventional gradient descent is performed on the weights, and soft threshold updates based on proximal gradients are performed on the gating. This invention can improve the generalization ability and parameter efficiency of intelligent devices in tasks such as intent recognition without increasing terminal inference latency; and can significantly reduce the storage and GPU memory overhead of large models when deployed and distributed at edge devices while maintaining model generalization ability.

Description

Technical Field

[0001] This invention relates to the fields of intelligent interaction and natural language processing technology, and in particular to a method for fine-tuning a natural language processing model based on granular sparse gating.

Background Technology

[0002] With the rapid development of artificial intelligence technology, large language models have demonstrated outstanding performance in the field of natural language processing. To meet the demands of data privacy, real-time response, and operation in offline environments, deploying pre-trained large models to intelligent interactive terminals, such as smartphones, in-vehicle interactive systems, and embodied intelligent robots, has become a significant industry trend. Compared to cloud servers, intelligent interactive terminals face extremely stringent constraints on device weight, size, power consumption, and especially computing power and memory. Optimizing the fine-tuning and deployment of large models on edge devices, so that they are extremely lightweight and low-latency while retaining their natural language processing performance, is of great significance for promoting the widespread adoption of edge AI technology.

[0003] To address the aforementioned computational and storage bottlenecks, parameter-efficient fine-tuning techniques have emerged. Low-rank adaptation (LoRA), a typical architecture in this field, significantly reduces the computational and storage costs of fine-tuning by freezing the pre-trained weights and injecting only trainable low-rank matrices into the network. Furthermore, to address dynamic memory allocation for large language models deployed on memory-constrained hardware, existing technologies have introduced sparse adaptation mechanisms. These mechanisms dynamically adjust the rank of the low-rank matrices through a gating structure, thereby eliminating redundant parameters and reducing device power consumption. Deeply optimizing low-rank and sparse adaptation architectures is a feasible and crucial research direction for running large models efficiently on intelligent interactive terminals.

[0004] However, analysis of the underlying feature dynamics of natural language processing reveals that, when processing real and complex natural language text, the input feature matrix is extremely unevenly distributed and readily produces specific input channels with abnormally high activation values, i.e., outlier channels. These outlier channels propagate between network layers, turning the fine-tuning process into a highly coupled, complex feature system.

[0005] Existing technologies have significant limitations when fine-tuning on such complex feature systems. Specifically, they face the following drawbacks in practical edge applications: (1) Global gradient entanglement impairs generalization performance: Outlier channels are specific input channels with abnormally high activation values in the input features of large language models. Under the traditional low-rank adaptation architecture, these outlier channels disproportionately dominate the global gradient signal during backpropagation. The huge gradients they generate overwhelm the contributions of other, normal natural language semantic features, distorting and entangling the weight update direction of the entire adapter matrix, which seriously affects the accuracy of intelligent devices in tasks such as intent recognition and common sense reasoning.

[0006] (2) Coarse gating granularity makes it difficult to balance computing power and accuracy: To solve the problem of memory allocation and rank selection when deploying large language models on edge intelligent hardware with limited GPU memory (such as mobile devices and robots), sparse adaptation introduces a gating mechanism to dynamically adjust the rank. However, existing sparse adaptation gating mechanisms usually operate coarsely on the entire adapter layer or on global feature channels. Such coarse-grained gating cannot accurately isolate noise or redundant text features in finer feature subspaces, so the system either over-prunes and damages core inference accuracy, or under-prunes and fails to meet the extreme lightweight requirements of edge devices.

[0007] (3) Lack of dynamic adaptability in subspace parameter allocation: Although some improved schemes introduce a simple block-matrix strategy, the rank preset for each sub-block is fixed. This fixed structure cannot dynamically allocate the parameter budget according to the importance of feature subspaces in different natural language contexts, which wastes GPU memory.

[0008] Therefore, there is an urgent need for a novel fine-tuning method that can isolate gradient interference through structured block partitioning and combine it with dynamic gating to achieve fine-grained parameter allocation, so as to further unleash the application potential of large language models on edge terminal devices.

Summary of the Invention

[0009] The technical problem to be solved by this invention is to provide a method for fine-tuning a natural language processing model based on granular sparse gating, which addresses the technical defects of existing large language models when deployed on smart terminals, such as gradient entanglement and coarse gating granularity.

[0010] The technical solution adopted by this invention to solve its technical problem is as follows. This invention provides a method for fine-tuning a natural language processing model based on granular sparse gating, the method comprising the following steps:

S1. Model parameter construction and initialization: Obtain the frozen weight matrix $W_0 \in \mathbb{R}^{M \times N}$ of the pre-trained natural language processing model. Divide the input dimension $N$ and the output dimension $M$ of the model's hidden layer into $n$ and $m$ subspaces respectively, constructing $m \times n$ parallel adapter sub-blocks, where each subspace satisfies an integer-division constraint ($n$ divides $N$ and $m$ divides $M$). For each adapter sub-block $(i, j)$, initialize the model parameters, including: the down-projection matrix $A_{ij} \in \mathbb{R}^{(N/n) \times r}$, the up-projection matrix $B_{ij} \in \mathbb{R}^{(M/m) \times r}$, and the granular gating vector $g_{ij} \in \mathbb{R}^{r}$, where $r$ is the local rank of each adapter sub-block.

S2. Granular forward propagation: Acquire the natural language text data input from the intelligent interactive terminal device and transform it into word-embedding input features $X \in \mathbb{R}^{N \times T}$, where $T$ is the length of the text sequence. The input features are divided along the input dimension into $n$ sub-blocks, each sub-input being $X_j \in \mathbb{R}^{(N/n) \times T}$. Each adapter sub-block $(i, j)$ sequentially performs down projection, gated sparsification, up projection, and aggregation, ultimately outputting a feature matrix $Y$ carrying natural language context prediction information.

S3. Loss calculation: Based on the label sequence of the downstream natural language processing task, calculate the cross-entropy task loss $\mathcal{L}_{task}$ between the model's predicted output and the true label sequence.

S4. Dual-path parameter update: Calculate the gradients based on the cross-entropy task loss $\mathcal{L}_{task}$; then perform gradient-descent-based optimization updates on the down-projection matrices $A_{ij}$ and the up-projection matrices $B_{ij}$, and perform proximal gradient updates on the granular gating vectors $g_{ij}$, comprising a gradient descent step and element-wise soft thresholding. The element-wise soft thresholding operation is used to achieve structural sparsity of the gating vectors, resulting in a fine-tuned natural language processing model.

[0011] Furthermore, in step S1 of the present invention, the division into subspaces is symmetrical, that is, $n = m = k$, so that $k^2$ parallel adapter sub-blocks are constructed, and the local rank is set to $r = R/k$, where $R$ is the total rank.

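[0012] Furthermore, in step S2 of the present invention, the output feature matrix $Y$ is computed as follows. For each adapter sub-block $(i, j)$, with down-projection matrix $A_{ij}$, up-projection matrix $B_{ij}$, gating vector $g_{ij} \in \mathbb{R}^{r}$, and input sub-block $X_j$, perform: (1) Down projection:

$$H_{ij} = A_{ij}^{\top} X_j \in \mathbb{R}^{r \times T}$$

[0013] (2) Gated sparsification: The gating vector $g_{ij}$ is broadcast to the same dimensions as $H_{ij}$, and features are truncated by element-wise multiplication:

$$Z_{ij} = \left(g_{ij}\,\mathbf{1}_T^{\top}\right) \odot H_{ij}$$

[0014] (3) Up projection and aggregation:

$$\Delta Y_i = \sum_{j=1}^{n} B_{ij} Z_{ij}$$

[0015] Concatenate all output blocks to obtain the final semantically enhanced output $\Delta Y$:

$$\Delta Y = \begin{bmatrix} \Delta Y_1 \\ \vdots \\ \Delta Y_m \end{bmatrix}$$

[0016] The final output is a feature matrix $Y$ carrying natural language context prediction information:

$$Y = W_0 X + \Delta Y$$

[0017] Here, the gating vector $g_{ij}$ is broadcast along the sample and sequence dimensions so that it aligns with the intermediate feature tensor in the rank dimension, which gives independent control of each rank-wise dimension.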

[0018] Furthermore, the cross-entropy task loss $\mathcal{L}_{task}$ in step S3 of the present invention is calculated as:

$$\mathcal{L}_{task} = -\frac{1}{T}\sum_{t=1}^{T}\sum_{c=1}^{C} y_{t,c}\,\log \hat{y}_{t,c}$$

[0019] where $T$ is the length of the text sequence, $C$ is the number of word categories, $y_{t,c}$ is the one-hot encoding of the true label, and $\hat{y}_{t,c}$ is the predicted probability obtained from the output features after the activation function.
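The loss above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the patent's implementation: the function names are hypothetical, softmax is assumed as the activation function, labels are given as class indices (the position of the 1 in the one-hot vector), and averaging over the $T$ positions is an assumption of this sketch.

```python
import math

def softmax(logits):
    """Numerically stable softmax over the logits of one token."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits_per_token, labels):
    """Mean cross-entropy over T tokens.

    logits_per_token: list of T lists, each with C logits.
    labels: list of T class indices (position of the 1 in the one-hot label).
    """
    losses = []
    for logits, label in zip(logits_per_token, labels):
        probs = softmax(logits)
        # With a one-hot label only the true class contributes to the inner sum.
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)
```

With uniform logits over four classes, every token contributes `log(4)`, which is a convenient sanity check for the function.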

[0020] Furthermore, in step S4 of the present invention, the proximal gradient update of the granular gating vector $g$ specifically includes a gradient descent step:

$$\tilde{g} = g - \eta_g \, \nabla_{g}\mathcal{L}_{task}$$

[0021] and element-wise soft thresholding:

$$g \leftarrow \mathcal{S}_{\xi}(\tilde{g}) = \operatorname{sign}(\tilde{g}) \odot \max\left(|\tilde{g}| - \xi,\ 0\right), \qquad \xi = \eta_g \lambda$$

[0022] where $\eta_g$ is the gating learning rate and $\lambda$ is the sparse regularization coefficient. The soft thresholding operation $\mathcal{S}_{\xi}$ is an element-wise nonlinear shrinkage function used to achieve proximal optimization of the $\ell_1$ regularization term without explicitly computing the gradient of the regularization term.
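A minimal NumPy sketch of this proximal step follows. The function and argument names (`soft_threshold`, `proximal_gate_update`, `lr_gate`, `lam`) are hypothetical, and the threshold is assumed to be the product of the gating learning rate and the sparse regularization coefficient, as is standard for the proximal operator of an L1 penalty.

```python
import numpy as np

def soft_threshold(v, xi):
    """Element-wise shrinkage: move each entry toward zero by xi, clamping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - xi, 0.0)

def proximal_gate_update(g, grad_g, lr_gate, lam):
    """One proximal gradient step on a gating vector.

    g       : current gate values
    grad_g  : gradient of the task loss with respect to g
    lr_gate : gating learning rate (assumed symbol: eta_g)
    lam     : sparse regularization coefficient (assumed symbol: lambda)
    """
    g_tmp = g - lr_gate * grad_g                    # plain gradient descent step
    return soft_threshold(g_tmp, lr_gate * lam)     # shrink, zeroing small gates
```

Entries whose magnitude after the gradient step is below the threshold become exactly zero, which is what allows the corresponding rank dimensions to be removed later.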

[0023] Furthermore, in step S4 of the present invention, the soft threshold ξ can be set in two modes: (1) Fixed threshold mode: ξ is kept at a preset constant during fine-tuning; this mode is used for performance optimization under a given sparsity constraint. (2) Dynamic scheduling mode: a threshold increment strategy is adopted, in which the sparse soft threshold ξ is increased directly by a preset step size at a preset period during training. This allows the model to retain more parameters in the early stages of training so that it can fully learn features, and then gradually raises the truncation threshold to eliminate redundant parameters until model performance reaches a significant inflection point; this mode is used to explore the parameter compression limit of the model.
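The two modes can be illustrated with a small scheduling helper. This is a sketch under assumptions: the function name, the `mode` strings, and the linear staircase (add `increment` once every `period` steps) are illustrative choices, and the stopping rule based on a performance inflection point is not modelled here.

```python
def threshold_at_step(step, xi_init, mode="fixed", increment=0.0, period=1):
    """Return the soft threshold to use at a given training step.

    mode="fixed"   : the threshold stays at xi_init for the whole run.
    mode="dynamic" : the threshold grows by `increment` once every `period` steps.
    """
    if mode == "fixed":
        return xi_init
    if mode == "dynamic":
        return xi_init + increment * (step // period)
    raise ValueError(f"unknown mode: {mode!r}")
```

In the dynamic mode the caller would typically monitor validation loss and stop raising the threshold once the loss degrades sharply.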

[0024] Furthermore, depending on the total rank $R$, the model fine-tuning method of the present invention adopts a hybrid architecture update approach: (1) When the set total rank $R$ exceeds a preset threshold, only the granular adapter path of steps S1 to S4 is enabled. (2) When the set total rank $R$ is less than or equal to the preset threshold, a global low-rank adapter path is constructed in parallel with the granular adapter path of steps S1 to S4, and the total rank $R$ is allocated between the granular adapter path and the global low-rank adapter path according to a preset ratio or a dynamic strategy. The output of the global low-rank adapter path is weighted and summed with the aggregated output of the granular adapter path to compensate for feature integrity under low-rank conditions.

[0025] Furthermore, the method of the present invention also includes an inference reconstruction step: after fine-tuning, detect the indices of the zero elements in all gating vectors, and physically remove the corresponding column vectors of the down-projection matrices and the corresponding column vectors of the up-projection matrices, reconstructing the model into a sparse, low-rank block matrix for faster inference.
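For a single adapter sub-block, the reconstruction step can be sketched as below. The shapes are assumptions of this sketch: the down-projection `A` and the up-projection `B` both carry the rank dimension in their columns, and `g` has one entry per rank dimension. The function name is hypothetical.

```python
import numpy as np

def prune_block(A, B, g):
    """Drop the rank dimensions whose gate is exactly zero.

    A: (N/n, r) down-projection, B: (M/m, r) up-projection, g: (r,) gate.
    Returns the reduced matrices and the surviving gate values.
    """
    keep = np.flatnonzero(g != 0)   # indices of non-zero gates
    return A[:, keep], B[:, keep], g[keep]
```

Because a zero gate multiplies its rank dimension by zero, the effective update `B @ diag(g) @ A.T` is unchanged by pruning; only the stored parameter count shrinks.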

[0026] This invention provides a natural language processing model fine-tuning system based on granular sparse gating, used to implement the aforementioned natural language processing model fine-tuning method based on granular sparse gating. The system includes:
The parameter construction and initialization submodule, used to obtain the frozen weight matrix of the pre-trained natural language processing model, divide the input dimension $N$ into $n$ subspaces and the output dimension $M$ into $m$ subspaces, construct $m \times n$ parallel adapter sub-blocks, and initialize each of them with a low-rank down-projection matrix, an up-projection matrix, and a granular gating vector.
The granular forward propagation submodule, used to segment the input natural language text data into the corresponding input sub-blocks; project the input sub-blocks through the down-projection matrices to obtain intermediate features; use tensor dimension expansion to multiply the granular gating vectors element-wise with the intermediate features to obtain sparse features; and finally project all sub-blocks through the up-projection matrices and aggregate them to obtain the adapter increment.
The loss calculation submodule, used to obtain the supervision signal of the fine-tuning task and calculate the task loss between the model prediction and the true label.
The dual-path parameter update submodule, used to calculate gradients based on the task loss and perform regular gradient updates on the down-projection and up-projection matrices using the optimizer. For the granular gating vectors, a soft thresholding operation is added on top of gradient descent to solve the proximal gradient of the regularization term, thereby dynamically inducing parameter sparsity during training.
The model evaluation and visualization submodule, used to extract, during or after model training, the model's loss value on the validation set and the proportion of zero elements in the granular gating vectors, calculate the model sparsity, and generate the loss and sparsity evolution curves to verify the model's effectiveness.
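The sparsity statistic tracked by the evaluation submodule can be computed as the fraction of exactly-zero gate entries. The helper below is a hypothetical sketch that takes the gates as a list of per-block lists; curve plotting is left out.

```python
def gate_sparsity(gate_vectors):
    """Fraction of gate entries that are exactly zero across all sub-blocks.

    gate_vectors: iterable of per-block gate vectors (lists of floats).
    """
    gate_vectors = list(gate_vectors)
    total = sum(len(g) for g in gate_vectors)
    if total == 0:
        raise ValueError("no gate entries to evaluate")
    zeros = sum(1 for g in gate_vectors for value in g if value == 0.0)
    return zeros / total
```

Logging this value together with the validation loss at each evaluation step gives the two series needed for the loss and sparsity evolution curves.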

[0027] This invention provides an electronic device, comprising: a memory, used to store an executable computer program; and a processor which, when executing the executable computer program stored in the memory, implements the aforementioned method for fine-tuning a natural language processing model based on granular sparse gating.

[0028] The beneficial effects of this invention are as follows. Compared with existing low-rank adaptation methods, this invention differs substantially in how it solves the model deployment problem on intelligent interactive terminals: (1) Structural decoupling at the micro level: the adapter structure is divided into multiple feature subspaces, which structurally decouples natural language semantic features along the gradient propagation path, blocks the spread of outlier text channels across feature subspaces, and significantly suppresses global gradient entanglement. (2) Fine-grained feature selection: the gating mechanism is applied precisely to the rank dimension of the low-rank decomposition, overcoming the technical prejudice that gating applies only at the layer level or the global channel level, and achieving precise retention of fine-grained text features and removal of redundancy in complex contexts. (3) Adaptive lightweighting: the network structure is dynamically sparsified by the proximal gradient operator, so that parameter truncation and feature optimization proceed together, thereby generating, during the fine-tuning stage, lightweight model parameters that fit the memory and computing power limits of edge devices.

[0029] The method of the present invention has the following technical effects: (1) Decoupling gradient interference and improving the accuracy of natural language recognition: By granularly segmenting and isolating semantic features, the propagation of high-frequency outlier word channels in natural language text between different subspaces is effectively blocked, the global gradient entanglement problem is significantly suppressed, and the intent recognition accuracy of intelligent interactive terminals on complex common sense reasoning and multi-turn dialogue is improved.

[0030] (2) Dynamic structure optimization to reduce hardware load: By using a gating mechanism to identify and eliminate redundant feature dimensions, the model structure is adaptively simplified. Compared with the traditional low-rank adaptation method, under the same hardware conditions, this invention can significantly reduce the storage and GPU memory overhead of large models when they are distributed and deployed at the edge while maintaining the model's generalization ability, and can effectively shorten the end-to-end response latency of smart terminals when switching dynamically between multiple intent tasks. Of course, a product implementing this invention does not necessarily need to achieve all of the advantages described above at the same time.

Description of the Drawings

[0031] The present invention will be further described below with reference to the accompanying drawings and embodiments. In the accompanying drawings:
Figure 1 is a flowchart of GraSoRA according to an embodiment of the present invention;
Figure 2 is a flowchart of the dual-path parameter update process according to an embodiment of the present invention;
Figure 3 is a block diagram of the virtual module architecture of the fine-tuning system according to an embodiment of the present invention;
Figure 4 is an overall flowchart of the method according to an embodiment of the present invention.

Detailed Implementation

[0032] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.

[0033] Example 1
The natural language processing model fine-tuning method based on granular sparse gating of this embodiment (hereinafter referred to as GraSoRA or this method) is shown in Figures 1-4. The main technical solution includes the following steps:
S1. Parameter construction and initialization: Obtain the frozen weight matrix $W_0 \in \mathbb{R}^{M \times N}$ of the pre-trained natural language processing model. Divide the input dimension $N$ and the output dimension $M$ of the model's hidden layer into $n$ and $m$ subspaces respectively, constructing $m \times n$ parallel adapter sub-blocks, where each subspace satisfies an integer-division constraint. For each adapter sub-block $(i, j)$, initialize: the down-projection matrix $A_{ij} \in \mathbb{R}^{(N/n) \times r}$; the up-projection matrix $B_{ij} \in \mathbb{R}^{(M/m) \times r}$; and the granular gating vector $g_{ij} \in \mathbb{R}^{r}$, where $r$ is the local rank.

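[0034] S2. Granular forward propagation: Acquire the natural language text data input from the intelligent interactive terminal device and convert it into word-embedding input features $X \in \mathbb{R}^{N \times T}$, where $T$ is the length of the text sequence. The input features are divided along the input dimension into $n$ sub-blocks, each sub-input being $X_j \in \mathbb{R}^{(N/n) \times T}$. For each adapter sub-block $(i, j)$, with down-projection matrix $A_{ij}$, up-projection matrix $B_{ij}$, and gating vector $g_{ij}$, perform: (1) Down projection:

$$H_{ij} = A_{ij}^{\top} X_j \in \mathbb{R}^{r \times T}$$

[0035] (2) Gated sparsification: The gating vector $g_{ij}$ is broadcast to the same dimensions as $H_{ij}$, and features are truncated by element-wise multiplication:

$$Z_{ij} = \left(g_{ij}\,\mathbf{1}_T^{\top}\right) \odot H_{ij}$$

[0036] (3) Up projection and aggregation:

$$\Delta Y_i = \sum_{j=1}^{n} B_{ij} Z_{ij}$$

[0037] Concatenate all output blocks to obtain the final semantically enhanced output $\Delta Y$:

$$\Delta Y = \begin{bmatrix} \Delta Y_1 \\ \vdots \\ \Delta Y_m \end{bmatrix}$$

[0038] The final output is a feature matrix $Y$ carrying natural language context prediction information:

$$Y = W_0 X + \Delta Y$$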

[0039] S3. Loss calculation: Based on the label sequence of the downstream natural language processing task, calculate the cross-entropy task loss $\mathcal{L}_{task}$ between the model's predicted output and the true label sequence. The calculation formula is as follows:

$$\mathcal{L}_{task} = -\frac{1}{T}\sum_{t=1}^{T}\sum_{c=1}^{C} y_{t,c}\,\log \hat{y}_{t,c}$$

[0040] where $T$ is the length of the text sequence, $C$ is the number of word categories, $y_{t,c}$ is the one-hot encoding of the true label, and $\hat{y}_{t,c}$ is the predicted probability obtained from the output features after the activation function.

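[0041] S4. Dual-path parameter update: Calculate the gradients based on the aforementioned task loss $\mathcal{L}_{task}$; (1) for the down-projection and up-projection matrices, perform regular optimization updates based on gradient descent; (2) for the gating vector $g$, perform a proximal gradient update, which includes a gradient descent step:

$$\tilde{g} = g - \eta_g \, \nabla_{g}\mathcal{L}_{task}$$

[0042] and element-wise soft thresholding:

$$g \leftarrow \mathcal{S}_{\xi}(\tilde{g}) = \operatorname{sign}(\tilde{g}) \odot \max\left(|\tilde{g}| - \xi,\ 0\right), \qquad \xi = \eta_g \lambda$$

[0043] where $\eta_g$ is the gating learning rate and $\lambda$ is the sparse regularization coefficient. The soft thresholding operation is used to achieve structural sparsity of the gating vector.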

[0044] According to the above scheme, in a preferred embodiment in which the partitioning strategy is symmetrical ($n = m = k$), step S1 specifically includes the following steps: S11: Model analysis and target layer localization. Traverse the network architecture of the pre-trained natural language processing model, use regular expressions to match the preset target linear layers, and extract the frozen weight matrix $W_0$ of each.

[0045] S12: Dimension segmentation and validation. Obtain the input dimension ($N$) and output dimension ($M$) of the target layer, together with the set total low-rank dimension $R$. Based on the set number of subspaces $k$, calculate the sub-block dimension attributes: sub-input dimension $N/k$, sub-output dimension $M/k$, and sub-rank $r = R/k$, and verify the divisibility of the dimensions.

[0046] S13: Construction of adapter sub-block tensors. For the $k^2$ parallel adapter sub-blocks, instantiate high-dimensional tensors in memory: the down-projection tensor $A$, the up-projection tensor $B$, and the granular gating tensor $g$, each holding the parameters of all sub-blocks.

[0047] S14: Differentiated parameter initialization. To break the optimization symmetry and ensure initial stability, a differentiated initialization strategy is adopted for the above tensors: the down-projection tensor uses uniform-distribution initialization; the up-projection tensor uses all-zero initialization; and the granular gating tensor is initialized from a distribution with a mean of 0 and a small standard deviation.
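Steps S12 to S14 can be sketched with NumPy as follows. Several details here are assumptions rather than values stated in the text: the sub-rank is taken as the total rank divided by the number of subspaces, the stacked tensor shapes `(k, k, ...)` are an illustrative layout, the uniform bound `1/sqrt(N/k)` is a placeholder, and the gate is drawn from a normal distribution with a small standard deviation.

```python
import numpy as np

def block_dims(N, M, R, k):
    """S12: derive sub-block sizes and verify that every dimension divides evenly."""
    for name, value in (("N", N), ("M", M), ("R", R)):
        if value % k != 0:
            raise ValueError(f"{name}={value} is not divisible by k={k}")
    return N // k, M // k, R // k

def init_adapter(N, M, R, k, gate_std=0.01, seed=0):
    """S13-S14: build and initialize the stacked adapter tensors."""
    n_in, m_out, r = block_dims(N, M, R, k)
    rng = np.random.default_rng(seed)
    bound = 1.0 / np.sqrt(n_in)                               # assumed uniform bound
    A = rng.uniform(-bound, bound, size=(k, k, n_in, r))      # down-projection
    B = np.zeros((k, k, m_out, r))                            # up-projection, all zeros
    g = rng.normal(0.0, gate_std, size=(k, k, r))             # gate, mean 0, small std
    return A, B, g
```

With the up-projection at zero, the adapter increment is zero at the start of training, so the adapted layer initially reproduces the frozen layer's output.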

[0048] Furthermore, in the preferred symmetrical partitioning embodiment, step S2 specifically includes the following steps: S21: Input feature reshaping. The input data is first pre-processed and then reshaped from its original layout into a chunked layout, so that its last dimension aligns with the input dimension of the sub-blocks.

[0049] S22: Down projection and intermediate feature extraction. Using the Einstein summation convention, the intermediate projections of all input sub-blocks for all output sub-blocks are computed in a single operation: the reshaped input features are multiplied with the down-projection tensor to obtain the intermediate feature tensor.

[0050] S23: Gated feature sparsification. Using tensor dimension expansion, the granular gating tensor constructed in step S13 is multiplied element-wise with the intermediate feature tensor above, so that the feature channels of the different sub-blocks are weighted and truncated to obtain sparse features.

[0051] S24: Up projection and feature aggregation. Again using the Einstein summation convention, the sparsified features are multiplied with the up-projection tensor and mapped back to the output dimension. During this calculation, the input block dimension and the rank dimension are summed over, which aggregates the contributions of all input blocks.

[0052] S25: Residual connection and scaling. Reshape the aggregated features back to the standard output shape, multiply them by the scaling factor, and then add them as a residual to the base output of the frozen weights to obtain the final output of the current layer.
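Steps S21 to S25 can be sketched with `numpy.einsum` as below. The tensor layout is an assumption of this sketch: inputs are `(batch, seq, N)`, the stacked adapter tensors are `A[i, j]` of shape `(N/k, r)`, `B[i, j]` of shape `(M/k, r)` and `g[i, j]` of shape `(r,)`, with `i` indexing output blocks and `j` indexing input blocks. The function name and the `scale` argument are illustrative, and the pre-processing of S21 is omitted.

```python
import numpy as np

def granular_forward(x, W0, A, B, g, scale=1.0):
    """Forward pass of one adapted linear layer.

    x: (batch, seq, N), W0: (M, N), A: (k, k, N/k, r), B: (k, k, M/k, r), g: (k, k, r)
    """
    batch, seq, _ = x.shape
    _, k_in, n_in, _ = A.shape
    x_blocks = x.reshape(batch, seq, k_in, n_in)             # S21: chunk the input dim
    h = np.einsum("btjd,ijdr->btijr", x_blocks, A)           # S22: down projection
    z = h * g[None, None]                                    # S23: gate per rank dim
    delta = np.einsum("btijr,ijor->btio", z, B)              # S24: sum over j and r
    delta = delta.reshape(batch, seq, -1)                    # S25: back to (batch, seq, M)
    return x @ W0.T + scale * delta                          # S25: scaled residual
```

The result equals a dense layer whose weight is the frozen matrix plus a block matrix with block `(i, j)` equal to `B[i, j] @ diag(g[i, j]) @ A[i, j].T`, which gives a direct way to check the einsum indices.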

[0053] Furthermore, step S3 specifically includes the following steps: S31: Task data formatting and masking. Receive the fine-tuning task data and concatenate it into a sequence of instruction and answer. Generate the corresponding attention mask and labels, and set the labels of the question prompt to the ignore value so that they do not participate in the loss calculation.

[0054] S32: Forward output and prediction. The formatted sequence is input into the model as modified in step S2, and the model's predicted logits over the vocabulary space are obtained.

[0055] S33: Task loss evaluation. Extract the supervision signal of the fine-tuning task, and use the cross-entropy loss function to calculate the task loss between the model's predicted logits and the true labels.
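The masking in S31 and its effect on the loss in S33 can be sketched in plain Python. The sentinel value `-100` and the helper names are assumptions of this sketch (the text only speaks of an "ignore value"), and the usual one-position shift between inputs and next-token labels is omitted for brevity.

```python
IGNORE_INDEX = -100  # assumed sentinel for positions excluded from the loss

def build_example(prompt_ids, answer_ids):
    """S31: concatenate prompt and answer, masking the prompt in the labels."""
    input_ids = list(prompt_ids) + list(answer_ids)
    attention_mask = [1] * len(input_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, attention_mask, labels

def masked_mean(per_token_losses, labels):
    """S33: average the per-token losses over the positions that are not ignored."""
    kept = [loss for loss, label in zip(per_token_losses, labels) if label != IGNORE_INDEX]
    return sum(kept) / len(kept)
```

Only the answer tokens contribute to the averaged loss, so the model is not penalized for how it would have continued the prompt itself.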

[0056] Furthermore, step S4 specifically includes the following steps: S41: Parameter grouping and layered learning rate setting. In the optimizer, the trainable parameters are divided into two groups: the down-projection and up-projection matrices are configured with the base weight learning rate $\eta_w$, and the granular gating tensor is configured with an independent gating learning rate $\eta_g$.

[0057] S42: Gradient backpropagation. Backpropagation is performed on the task loss calculated in step S33, and the chain rule is used to compute the gradients of all trainable parameters, namely the down-projection matrices, the up-projection matrices, and the gating tensor.

[0058] S43: Regular update of the projection matrices. Using the calculated gradients and the base weight learning rate $\eta_w$, the optimizer performs regular gradient descent updates on the down-projection and up-projection matrices. For the gating tensor, a temporary update value $\tilde{g}$ is likewise calculated with the gating learning rate $\eta_g$, for use in the subsequent soft threshold operation.

[0059] S44: Proximal gradient truncation of the gating vector. After the regular optimizer update, the proximal gradient processing module is applied. The fixed threshold $\xi = \eta_g \lambda$ is calculated. With gradient tracking disabled, the soft threshold formula is applied element by element to the updated gate tensor $\tilde{g}$:

$$g = \operatorname{sign}(\tilde{g}) \odot \max\left(|\tilde{g}| - \xi,\ 0\right)$$

[0060] where $\eta_g$ is the gating learning rate and $\lambda$ is the sparsity penalty coefficient.

[0061] If the absolute value of a gate value is less than $\xi$, it is set directly to zero, thereby dynamically inducing sparsity of the gating parameters during training.

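[0062] Example 2: Mathematical Derivation Process
The mathematical derivation of the natural language processing model fine-tuning method based on granular sparse gating is as follows. Total output of forward propagation, where $W_0$ is the frozen weight matrix, $X$ the input features, and $\Delta Y$ the adapter increment:

$$Y = W_0 X + \Delta Y$$

[0063] Total loss function, where $\mathcal{L}_{task}$ is the task loss, $\lambda$ the sparsity penalty coefficient, and $g_{ij}$ the gating vector of sub-block $(i, j)$:

$$\mathcal{L} = \mathcal{L}_{task} + \lambda \sum_{i=1}^{m}\sum_{j=1}^{n} \lVert g_{ij} \rVert_1$$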

[0064] 1. Dimension Definition and Network Partitioning
Assume the frozen weight matrix of the pre-trained natural language processing model is $W_0 \in \mathbb{R}^{M \times N}$. Acquire the natural language text data input from the intelligent interactive terminal device and convert it into word-embedding input features $X \in \mathbb{R}^{N \times T}$. The input dimension N is divided into n subspaces, and the output dimension M is divided into m subspaces.

[0065] Input block partitioning: The input is partitioned into n parts, each sub-block having a size of $N/n$; Output block partitioning: The output is partitioned into m parts, each sub-block having a size of $M/m$; Local rank: Let the local rank of each adapter sub-block be $r$.

[0066] Adapter matrices: For the sub-block in the $i$-th row ($i = 1, \dots, m$) and the $j$-th column ($j = 1, \dots, n$), assign the following parameters: Down-projection matrix: $A_{ij} \in \mathbb{R}^{(N/n) \times r}$; Up-projection matrix: $B_{ij} \in \mathbb{R}^{(M/m) \times r}$; Granular gating vector block: $g_{ij} \in \mathbb{R}^{r}$.

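[0067] 2. Granular forward propagation derivation
Step 1: Input feature segmentation
The input matrix $X \in \mathbb{R}^{N \times T}$ is divided by rows into n sub-matrices:

$$X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix}$$

[0068] Each sub-input is $X_j \in \mathbb{R}^{(N/n) \times T}$.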

[0069] Step 2: Down projection and intermediate feature extraction
For any sub-block $(i, j)$, the input sub-block $X_j$ is multiplied by the transpose of the down-projection matrix $A_{ij}$ to extract the intermediate features $H_{ij}$:

$$H_{ij} = A_{ij}^{\top} X_j$$

[0070] Here, the dimension of $H_{ij}$ is $r \times T$.

[0071] Step 3: Tensor dimension expansion and sparse gating
The granular gating vector $g_{ij} \in \mathbb{R}^{r}$ is logically expanded along the sample and sequence dimensions so that its dimensions align with those of the intermediate features $H_{ij}$. A Hadamard product is then performed to obtain the gated, sparsified features $Z_{ij}$:

$$Z_{ij} = \left(g_{ij}\,\mathbf{1}_T^{\top}\right) \odot H_{ij}$$

[0072] This step enables independent weighted control and truncation of each rank dimension within each sub-block.

[0073] Step 4: Up projection and aggregation
The target output of the $i$-th block, $\Delta Y_i$, needs to receive the valid signals from all n input blocks in the same row. The sparsified features $Z_{ij}$ are therefore up-projected by $B_{ij}$ and summed:

$$\Delta Y_i = \sum_{j=1}^{n} B_{ij} Z_{ij} \in \mathbb{R}^{(M/m) \times T}$$

[0074] Step 5: Output concatenation and residual connection
The m local output blocks are concatenated row by row to obtain the final low-rank adapter increment $\Delta Y$:

$$\Delta Y = \begin{bmatrix} \Delta Y_1 \\ \vdots \\ \Delta Y_m \end{bmatrix} \in \mathbb{R}^{M \times T}$$

[0075] This is then superimposed on the output of the pre-trained weights $W_0$ to complete the forward propagation:

$$Y = W_0 X + \Delta Y$$

[0076] 3. Derivation of granular backpropagation
Suppose that, in the fine-tuning task, the task loss $\mathcal{L}_{task}$ between the model's predicted values and the true labels has been calculated. According to the chain rule, the loss first produces a gradient with respect to the target output sub-block $\Delta Y_i$, denoted:

$$G_i = \frac{\partial \mathcal{L}_{task}}{\partial \Delta Y_i} \in \mathbb{R}^{(M/m) \times T}$$

[0077] Next, this gradient is propagated layer by layer to each micro-adapter $(i, j)$.

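[0078] Step 1: Calculate the up-projection matrix gradient

[0079] The up-projection matrix $B_{ij}$ multiplies the sparsified intermediate features $Z_{ij}$ directly. Therefore, its gradient is the product of the output gradient $G_i$ and the transpose of those input features:

$$\frac{\partial \mathcal{L}_{task}}{\partial B_{ij}} = G_i \, Z_{ij}^{\top}$$

[0080] Here, the dimension of $\partial \mathcal{L}_{task} / \partial B_{ij}$ is $(M/m) \times r$.

[0081] Step 2: Calculate the gradient of the sparsified intermediate features
To continue propagating the gradient downwards, the gradient of the gated features $Z_{ij}$ is needed:

$$\frac{\partial \mathcal{L}_{task}}{\partial Z_{ij}} = B_{ij}^{\top} G_i$$

[0082] Here, the dimension of $\partial \mathcal{L}_{task} / \partial Z_{ij}$ is $r \times T$.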

[0083] Step 3: Calculate the granular gating vector gradient In the forward propagation, gating is applied to the original intermediate features through the Hadamard product: .because It is a vector whose broadcast dimension is expanded to T. When taking the backward derivative, the gradients along the sequence dimension T need to be accumulated:

[0084] at this time, Dimensions and Consistency, for .

[0085] Step 4: Calculate the projection matrix gradient Finally, the gradient passes through the gate node and continues to be fed back. .

[0086] First, calculate the original intermediate features. gradient:

[0087] according to Seeking gradient:

[0088] at this time, The dimension is .

[0089] At this point, the mission losses have been calculated. right gradient, gradient and The gradient.
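The four backward steps can be checked numerically with a small sketch. It is illustrative only and rests on assumptions not stated in the text: a row-vector convention (features of shape T by dimension), a single output block fed by n input sub-blocks, a toy loss of 0.5 times the squared norm of the output, and hypothetical names such as `X_blocks`, `r_s` and `grads`.

```python
import numpy as np

def forward(X_blocks, A, B, g):
    """Forward pass for one output block: Y = sum_j ((X_j A_j^T) * g_j) B_j^T."""
    H = [Xj @ Aj.T for Xj, Aj in zip(X_blocks, A)]        # each (T, r_s)
    H_gated = [Hj * gj for Hj, gj in zip(H, g)]           # gate broadcast over T
    Y = sum(Hg @ Bj.T for Hg, Bj in zip(H_gated, B))      # (T, d_out_sub)
    return Y, H, H_gated

def backward(G_Y, X_blocks, A, B, g, H, H_gated):
    """Analytic gradients for every sub-block, given G_Y = dL/dY."""
    grads = []
    for Xj, Aj, Bj, gj, Hj, Hgj in zip(X_blocks, A, B, g, H, H_gated):
        dB = G_Y.T @ Hgj                    # Step 1: up-projection gradient
        dH_gated = G_Y @ Bj                 # Step 2: gated-feature gradient
        dg = (dH_gated * Hj).sum(axis=0)    # Step 3: accumulate over T
        dH = dH_gated * gj                  # Step 4a: through the gate node
        dA = dH.T @ Xj                      # Step 4b: down-projection gradient
        grads.append({"A": dA, "B": dB, "g": dg})
    return grads

rng = np.random.default_rng(0)
T, n, d_in_sub, d_out_sub, r_s = 5, 3, 4, 6, 2
X_blocks = [rng.normal(size=(T, d_in_sub)) for _ in range(n)]
A = [rng.normal(size=(r_s, d_in_sub)) for _ in range(n)]
B = [rng.normal(size=(d_out_sub, r_s)) for _ in range(n)]
g = [rng.normal(size=r_s) for _ in range(n)]

# Toy loss L = 0.5 * ||Y||^2, so dL/dY equals Y itself.
Y, H, H_gated = forward(X_blocks, A, B, g)
grads = backward(Y, X_blocks, A, B, g, H, H_gated)
```

Under these assumptions each gradient has the same shape as the parameter it belongs to, which is the consistency the dimension remarks in the derivation refer to.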

[0090] 4. Dual-path parameter update. Path A: Regular update of adapter weights. The down-projection matrix and the up-projection matrix are responsible only for fitting the task and do not require sparsification. They are therefore updated by regular gradient descent with a standard optimizer, using the base weight learning rate:

[0091]

[0092] Path B: Proximal-gradient soft thresholding of the gating vectors. To induce dynamic sparsity in the model structure, the gate vector cannot be updated by ordinary gradient descent alone; a soft thresholding operation must be added to realize the proximal gradient solution of the regularization term.

[0093] As shown in Figure 2 and described in the claims, the update is performed in two steps. Step 1: Perform ordinary gradient descent with an independent gating learning rate to obtain a temporary variable:

[0094] Step 2: Apply the proximal operator to perform physical truncation:

[0095] Here, the sparse soft threshold is defined in terms of the sparsity penalty coefficient. This operation truncates to zero every gating parameter whose absolute value is less than the threshold, thereby achieving true structural sparsity. Its physical meaning is that a gating dimension is allowed to survive only if its contribution (gradient) to reducing the task loss is greater than the sparsity penalty threshold; otherwise, it is set to zero.
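The two update paths can be sketched as follows. This is a minimal illustration, not the patented implementation: plain gradient descent stands in for the "standard optimizer", the threshold is assumed to be the gating learning rate times the penalty coefficient, and the names `dual_path_step`, `lr_gate` and `lam` are hypothetical.

```python
import numpy as np

def soft_threshold(x, xi):
    """Element-wise soft thresholding: sign(x) * max(|x| - xi, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - xi, 0.0)

def dual_path_step(A, B, g, grad_A, grad_B, grad_g, lr, lr_gate, lam):
    """One update: plain gradient descent for A and B, proximal step for g."""
    A_new = A - lr * grad_A                 # Path A: regular update
    B_new = B - lr * grad_B
    g_tmp = g - lr_gate * grad_g            # Path B, step 1: temporary variable
    xi = lr_gate * lam                      # assumed form of the soft threshold
    g_new = soft_threshold(g_tmp, xi)       # Path B, step 2: proximal operator
    return A_new, B_new, g_new

A = np.ones((4, 3))
B = np.ones((2, 4))
g = np.array([0.50, -0.03, 0.02, -0.40])
grad_g = np.array([0.10, 0.10, -0.10, -0.20])

A_new, B_new, g_new = dual_path_step(
    A, B, g, np.ones_like(A), np.ones_like(B), grad_g,
    lr=0.01, lr_gate=0.1, lam=0.5,
)
# The two small gates fall below the threshold 0.05 and become exactly zero;
# the two large gates survive but are shrunk toward zero by the threshold.
```

Soft thresholding is the proximal operator of an L1 penalty, which is why the gate can reach exact zeros without the penalty's gradient ever being computed.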

[0096] 5. Computation and parameter analysis of the block architecture. Parameter statistics: the total number of parameters across all A matrices is: ; the total number of parameters across all B matrices is: .

[0097] Total number of parameters

[0098] Alignment dilemma with standard LoRA: The total number of parameters in standard LoRA is . When and , the total number of parameters in GraSoRA is exactly equal to that of standard LoRA. If asymmetric block partitioning is adopted and the number of parameters is strictly required to equal that of standard LoRA, the following must be satisfied:

[0099] In neural networks, especially in fully connected layers or the projection layers of attention mechanisms, the input and output dimensions are usually not equal. As a result, the calculated sub-block rank is a non-integer value that does not divide evenly. Once the rank cannot be taken as an integer, tensors cannot be instantiated in deep learning frameworks, leading to serious misalignment in engineering implementation.

[0100] Therefore, in a preferred embodiment, the partitioning strategy is symmetric partitioning: the input dimension and the output dimension are each divided into the same number of subspaces, thereby constructing the parallel adapter sub-blocks. At this point, the rank of each sub-block is set to... This symmetric architecture mathematically guarantees that the total number of parameters of this invention remains exactly consistent with that of the standard low-rank adapter under arbitrary rank constraints, achieving efficient fine-tuning with zero additional parameters.
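The parameter-alignment argument can be illustrated with a short calculation. The sketch assumes an m by n grid of sub-blocks, a sub-rank of r/k for the symmetric case m = n = k, and counts only the projection matrices; the layer sizes and all function names are illustrative and are not taken from the patent.

```python
def lora_params(d_in, d_out, r):
    """Parameter count of a standard low-rank adapter of rank r."""
    return r * (d_in + d_out)

def granular_params(d_in, d_out, m, n, r_sub):
    """Projection parameters of an m x n grid of sub-blocks (gates not counted)."""
    if d_in % n or d_out % m:
        raise ValueError("dimensions must divide evenly into subspaces")
    down = m * n * r_sub * (d_in // n)     # all down-projection blocks
    up = m * n * (d_out // m) * r_sub      # all up-projection blocks
    return down + up

def matched_sub_rank(d_in, d_out, m, n, r):
    """Sub-rank that would match standard LoRA exactly; may be non-integer."""
    return r * (d_in + d_out) / (m * d_in + n * d_out)

d_in, d_out, r = 1024, 4096, 32            # illustrative layer sizes

# Asymmetric grid (m=2, n=4): the matching sub-rank is not an integer.
asym = matched_sub_rank(d_in, d_out, 2, 4, r)

# Symmetric grid (m=n=k): the matching sub-rank is r/k for any d_in, d_out.
k = 4
sym = matched_sub_rank(d_in, d_out, k, k, r)
same_budget = granular_params(d_in, d_out, k, k, r // k) == lora_params(d_in, d_out, r)
```

With a symmetric grid the layer dimensions cancel out of the ratio, so the sub-rank depends only on the total rank and the number of subspaces, provided the total rank is divisible by that number.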

[0101] Example 3. Referring to Figure 1, for the frozen weight matrix of a pre-trained natural language processing model, the block parameter and the total rank are set, and the system constructs the parallel adapter paths. The sub-input dimension is , the sub-output dimension is , and the sub-rank is .

[0102] To achieve efficient computation, this example directly instantiates high-dimensional tensors in physical memory: the down-projection tensor with shape , the up-projection tensor with shape , and the granular gating tensor with shape .

[0103] The specific process of forward propagation is as follows: 1. Input reshaping: After processing, the input is rearranged from its original shape into chunks.

[0104] 2. Parallel down-projection: Using the Einstein summation convention, the projections of all input sub-blocks for every output sub-block are computed in a single operation to obtain the intermediate feature tensor.

[0105] 3. Broadcast gating: The granular gating tensor is logically expanded through the broadcasting mechanism along the sample dimension and the spatial/sequence dimension so that it matches the intermediate feature tensor, and is then multiplied element-wise with the intermediate feature tensor. While computing in parallel, this step independently controls the activation state of each rank dimension within each subspace.

[0106] 4. Parallel up-projection and aggregation: The up-projection is performed with the up-projection tensor, summing over the input-block dimension and the sub-rank dimension. The result is then reshaped, multiplied by the scaling factor, and added through the residual connection.

[0107] 5. Calculate the loss: Based on the task loss, perform regular gradient descent updates on the projection tensors of each adapter sub-block, and perform soft-threshold truncation updates on the gating tensors.

[0108] 6. Proceed to the next iteration loop until training is complete.
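Steps 1 to 4 of this example can be sketched with NumPy's `einsum`. The tensor layouts below, A as (m, n, r_s, d_in/n), B as (m, n, d_out/m, r_s) and G as (m, n, r_s), are assumptions chosen for illustration, since the original shapes are not reproduced in this text; `grasora_forward` and `scale` are hypothetical names.

```python
import numpy as np

def grasora_forward(X, W0, A, B, G, scale=1.0):
    """Batched granular forward pass (illustrative layout).

    X: (batch, T, d_in)       W0: (d_out, d_in), frozen
    A: (m, n, r_s, d_in/n)    B: (m, n, d_out/m, r_s)    G: (m, n, r_s)
    """
    m, n, r_s, d_sub_in = A.shape
    batch, T, _ = X.shape
    Xc = X.reshape(batch, T, n, d_sub_in)          # 1. input reshaping into n chunks
    H = np.einsum("btnd,mnrd->btmnr", Xc, A)       # 2. parallel down-projection
    Hg = H * G[None, None]                         # 3. gate broadcast over batch and T
    Yc = np.einsum("btmnr,mnor->btmo", Hg, B)      # 4. up-projection, summed over n and r
    delta = Yc.reshape(batch, T, -1)               # join the m output chunks
    return X @ W0.T + scale * delta                # residual connection

rng = np.random.default_rng(0)
batch, T, d_in, d_out, m, n, r_s = 2, 3, 8, 6, 3, 4, 2
X = rng.normal(size=(batch, T, d_in))
W0 = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(m, n, r_s, d_in // n))
B = rng.normal(size=(m, n, d_out // m, r_s))
G = rng.normal(size=(m, n, r_s))

Y = grasora_forward(X, W0, A, B, G, scale=0.5)
print(Y.shape)  # (2, 3, 6)
```

Keeping all sub-blocks in single stacked tensors lets one `einsum` call replace the double loop over output and input blocks, which is the efficiency point this example makes.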

[0109] To verify the effectiveness of this invention in complex-distribution and large-scale commonsense reasoning scenarios, the inventors conducted sampled evaluation and in-depth comparison experiments on several multiple-choice and commonsense reasoning datasets (including PIQA, SIQA, ARC-Easy, and OpenBookQA) based on the Qwen-2.5-1.5B model. The experiments used a unified hyperparameter configuration and set the total rank... The number of micro-blocks is set for GraLoRA and for the GraSoRA of the present invention. All gating sparsity penalty terms use the fixed threshold mode.

[0110] Commonsense reasoning tasks involve strict logical chains and highly complex contextual dependencies, so they tend to produce pronounced anomalous feature channels during the model's forward propagation. In traditional global low-rank adapters, these anomalous feature channels dominate the gradient updates, leading to representational collapse and forgetting of commonsense knowledge. Using quantitative evaluation data from the four fine-tuning tasks mentioned above, this embodiment verifies the effectiveness of GraSoRA in addressing this deficiency while achieving both parameter efficiency and generalization ability: 1. Overcoming the bottleneck of dense features and achieving excellent generalization ability. As shown in Table 1, traditional dense fine-tuning methods often become stuck in local optima and generalize poorly on complex scientific reasoning and commonsense association data. For example, on the ARC-Easy scientific commonsense task, the accuracy of traditional LoRA is 86.32%, while the existing coarse-grained sparse fitting method SoRA, although improved, still suffers from a feature representation bottleneck. In contrast, the GraSoRA proposed in this invention overcomes this feature bottleneck. This indicates that the granular block-based mechanism of this invention can effectively isolate anomalous gradients, significantly improving the model's generalization ability while approaching the optimal validation performance.

[0111] 2. Overcoming the defects of coarse-grained gating and achieving fine-grained, stable pruning. Existing global sparsity methods can cull some parameters, but at the cost of structural instability. Because commonsense reasoning places extremely high demands on feature integrity, global parameter pruning such as that of SoRA, in which a single change affects the whole, can easily prune critical logical decision channels by mistake, thereby impairing model robustness and capping the achievable accuracy.

[0112] Combining the experimental results in Tables 1 and 2, it can be seen that the GraSoRA of the present invention, through independent gating at the micro-granular level, achieved a higher average sparsity of 12.01% across the four tasks. In practical engineering deployments on smart devices, this sparsity means that the incremental weights produced by fine-tuning are highly compressed. This significantly reduces network transmission bandwidth and additional GPU memory overhead when edge devices perform hot updates of natural language models and dynamically switch between multi-intent tasks. In addition, by using reparameterization to fuse the sparse incremental matrix into the base model, the method preserves the accuracy improvement while achieving deployment with zero additional inference latency. This empirical data demonstrates the engineering value of this invention in edge terminal scenarios where computing power is limited and flexible switching between multiple tasks is required.
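The reparameterization mentioned above, fusing the sparse increment into the base weights so that inference needs no extra adapter path, can be sketched as follows. The tensor layouts and the names `merged_delta`, `scale` and `W_merged` are assumptions made for illustration, and the gate values are zeroed by hand here rather than by training.

```python
import numpy as np

def merged_delta(A, B, G, scale=1.0):
    """Dense increment of shape (d_out, d_in) equivalent to the gated sub-blocks.

    A: (m, n, r_s, d_in/n)    B: (m, n, d_out/m, r_s)    G: (m, n, r_s)
    Block (i, j) of the increment equals B_ij @ diag(g_ij) @ A_ij.
    """
    m, n, r_s, d_sub_in = A.shape
    d_sub_out = B.shape[2]
    blocks = np.einsum("mnor,mnr,mnrd->mond", B, G, A)   # (m, d_sub_out, n, d_sub_in)
    return scale * blocks.reshape(m * d_sub_out, n * d_sub_in)

rng = np.random.default_rng(1)
T, d_in, d_out, m, n, r_s, scale = 4, 8, 6, 3, 4, 2, 0.5
X = rng.normal(size=(T, d_in))
W0 = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(m, n, r_s, d_in // n))
B = rng.normal(size=(m, n, d_out // m, r_s))
G = rng.normal(size=(m, n, r_s))
G[np.abs(G) < 0.5] = 0.0          # stand-in for gates zeroed during training

# After merging, inference is a single matrix product with the fused weights.
W_merged = W0 + merged_delta(A, B, G, scale)
Y_merged = X @ W_merged.T
```

Rank dimensions whose gate is zero contribute nothing to the merged increment, so only the surviving dimensions need to be stored or transmitted before merging.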

[0113] Table 1. Accuracy comparison of different fine-tuning methods on the commonsense reasoning dataset.

[0114] Table 2. Comparison of final model sparsity at the end of training for different fine-tuning methods.

[0115] Example 4: Hybrid mode. Considering that excessive segmentation may lead to feature fragmentation in extremely low-rank scenarios, this invention also provides a hybrid architecture.

[0116] When the configured total rank is detected to be below a preset threshold, the system opens a standard global low-rank adapter path in parallel with the granular adapter path and allocates the total rank between the granular adapter path and the global low-rank adapter path according to a preset ratio or a dynamic strategy. The final output is a weighted sum of the two paths.

[0117] In the high-rank experiment of this embodiment, the system identifies and runs only the pure GraSoRA mode to obtain the best parameter efficiency.

[0118] Example 5. Referring to Figure 3, this embodiment implements the principles of the above method embodiments and constructs a natural language processing model fine-tuning system based on granular sparse gating. The system includes a parameter construction and initialization submodule, a granular forward propagation submodule, a loss calculation submodule, a dual-path parameter update submodule, and a model evaluation and visualization submodule.

[0119] The parameter construction and initialization submodule is used to obtain the frozen weight matrix of the pre-trained natural language processing model, divide the input dimension and the output dimension into subspaces, construct the parallel adapter sub-blocks, and initialize each adapter sub-block with a low-rank down-projection matrix, an up-projection matrix, and a granular gating matrix.
The granular forward propagation submodule is used to divide the input data into the corresponding input sub-blocks; calculate the projection of the input sub-blocks onto the down-projection matrix to obtain intermediate features; use the tensor dimension expansion operation to multiply the granular gating vectors element-wise with the intermediate features to obtain sparse features; and finally up-project and aggregate all sub-blocks through the up-projection matrix to obtain the adapter increment.
The loss calculation submodule is used to obtain the supervision signal of the fine-tuning task and calculate the task loss between the model prediction and the true label.
The dual-path parameter update submodule is used to calculate gradients based on the task loss and to perform regular gradient updates on the down-projection matrix and the up-projection matrix using an optimizer; for the granular gating vector, a soft thresholding operation is added on top of gradient descent to realize the proximal gradient solution of the regularization term, thereby dynamically inducing parameter sparsity during training.
The model evaluation and visualization submodule is used to extract, during or after model training, the model's loss value on the validation set and the proportion of zero elements in the granular gating vector, calculate the model sparsity, and generate the loss and sparsity evolution curves to verify the model's effectiveness.
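The sparsity statistic used by the evaluation submodule, the proportion of zero elements in the gating vectors, can be computed as in the sketch below. The gate layout (m, n, r_s), the function name `gate_sparsity` and the optional tolerance `tol` are assumptions for illustration.

```python
import numpy as np

def gate_sparsity(G, tol=0.0):
    """Fraction of gate entries that are zero (absolute value <= tol)."""
    G = np.asarray(G, dtype=float)
    return float(np.mean(np.abs(G) <= tol))

# Gates for a 2 x 2 grid of sub-blocks with sub-rank 2, i.e. shape (m, n, r_s).
G = np.array([[[0.4, 0.0], [0.0, 0.0]],
              [[-0.2, 0.1], [0.0, 0.3]]])

overall = gate_sparsity(G)             # 4 of the 8 gate entries are zero
per_block = (G == 0).mean(axis=-1)     # sparsity of each sub-block separately
```

Logging `overall` at each evaluation step alongside the validation loss gives the two curves the submodule is described as producing.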

[0120] Each submodule is mainly used to implement the various steps of the method implementation, which will not be elaborated here.

[0121] It should be noted that, depending on the implementation needs, the various steps / components described in this application can be broken down into more steps / components, or two or more steps / components or parts of the operation of steps / components can be combined into new steps / components to achieve the purpose of this invention.

[0122] This embodiment also includes a processor, a communication interface, a memory, and a communication bus; wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; the memory stores a computer program, and when the program is executed by the processor, the processor performs the steps of a natural language processing model fine-tuning method based on granular sparse gating.

[0123] This embodiment also provides a computer-readable storage medium storing executable instructions that, when executed by a processor, enable the processor to implement a natural language processing model fine-tuning method based on granular sparse gating.

[0124] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

[0125] Those skilled in the art should understand that the solutions proposed in this invention can not only be implemented as specific processing methods, but also used to construct corresponding system devices or computer software products. Therefore, the specific manifestations of this invention can be pure hardware architecture, pure software code, or a comprehensive architecture combining hardware and software. Furthermore, the program code related to this invention can be mounted on various types of computer-readable storage media (including but not limited to hard disks, optical disks, and various solid-state drives), and exist and be distributed as an independent computer program product.

[0126] This application refers to the operational flowcharts of the relevant methods and program products, as well as the structural block diagrams of the corresponding hardware systems. It should be noted that each individual module or step in the above diagrams, and any combination thereof, can be implemented using computer program code. These code instructions can be loaded onto general-purpose computing devices, customized computers, embedded chips, and other terminal processors with data processing capabilities, thereby constructing a specific computing machine at the physical level. When the processor executes these instructions, it can complete the tasks planned in the aforementioned flowcharts and block diagrams, thus realizing the fine-tuning system for natural language processing models based on granular sparse gating described in this invention.

[0127] Furthermore, the aforementioned program instructions can also be stored within a computer-readable storage medium capable of guiding a computing device to operate according to expected logic. In this case, the storage medium containing the instructions itself constitutes a physical construct, the control logic of which is specifically designed to perform the corresponding functions specified in the flowcharts and block diagrams of this invention.

[0128] Similarly, after importing this code into a computing device or other programmable terminal, a series of standardized operations can be triggered within its operating system, thereby initiating a computer-driven processing mechanism. This sequence of actions triggered by the actual execution of the code by the device constitutes the specific steps for implementing the parameter fine-tuning method proposed in this invention.

[0129] Finally, it should be emphasized that the specific embodiments listed above are only for explaining the core concepts and technical advantages of this invention, and are intended to assist relevant practitioners in gaining a deeper understanding and carrying out engineering practice accordingly. They are by no means a mechanical limitation on the scope of patent protection of this invention. Any equivalent substitutions, conventional modifications, or adaptive alterations made based on the basic principles and core ideas disclosed in this invention, without departing from the spirit of this invention, should be fully covered within the legal protection scope of this invention.

Claims

1. A method for fine-tuning a natural language processing model based on granular sparse gating, characterized in that the method includes the following steps:
S1. Model parameter construction and initialization: obtain the frozen weight matrix of the pre-trained natural language processing model; divide the input dimension and the output dimension of the model's hidden layer into multiple physical subspaces to construct a plurality of parallel adapter sub-blocks, where each subspace satisfies an integer division constraint; and, for each adapter sub-block, initialize the model parameters, including a down-projection matrix, an up-projection matrix, and a granular gating vector, together with the local rank of each adapter sub-block;
S2. Granular forward propagation: acquire natural language text data input from an intelligent interactive terminal device and transform it into word embedding input features, where T is the length of the text sequence; divide the input features into sub-blocks according to the input dimension; and, for each sub-input and each adapter sub-block, sequentially perform down-projection, gated sparsification, up-projection, and aggregation, ultimately outputting a feature matrix carrying natural language context prediction information;
S3. Loss calculation: based on the label sequence of the downstream natural language processing task, calculate the cross-entropy task loss between the model's predicted output and the true label sequence;
S4. Dual-path parameter update: calculate gradients based on the cross-entropy task loss; perform gradient-descent-based optimization updates on the down-projection matrix and the up-projection matrix among the model parameters; and perform proximal gradient updates on the granular gating vector, comprising gradient descent and element-wise soft thresholding, where the element-wise soft thresholding operation is used to achieve structural sparsity of the gating vector, resulting in a fine-tuned natural language processing model.

2. The method for fine-tuning a natural language processing model based on granular sparse gating according to claim 1, characterized in that, in step S1, the physical space is divided symmetrically, that is... , the parallel adapter sub-blocks are constructed accordingly, and the local rank of each sub-block is set in terms of the total rank.

3. The method for fine-tuning a natural language processing model based on granular sparse gating according to claim 1, characterized in that, in step S2, outputting the feature matrix specifically includes, for each adapter sub-block: (1) down-projection; (2) gated sparsification, in which the gating vector is broadcast to the same dimensions as the intermediate features and feature truncation is performed by element-wise multiplication; (3) up-projection and aggregation; all output blocks are then concatenated to obtain the final semantically enhanced output, and the final output is the feature matrix carrying natural language context prediction information; wherein the gate vector is broadcast along the sample and sequence dimensions so that it aligns with the intermediate feature tensor in the rank dimension, thereby achieving independent control of each rank dimension.

4. The method for fine-tuning a natural language processing model based on granular sparse gating according to claim 1, characterized in that the calculation formula of the cross-entropy task loss in step S3 is: , where the symbols denote, respectively, the length of the text sequence, the number of word categories, the one-hot encoding of the true label, and the predicted probability values of the output features after passing through the activation function.

5. The method for fine-tuning a natural language processing model based on granular sparse gating according to claim 4, characterized in that, in step S4, performing the proximal gradient update on the granular gating vector specifically includes a gradient descent step followed by an element-wise soft thresholding step, where the update uses the gating learning rate and the sparse regularization coefficient; the soft thresholding operation is an element-wise nonlinear shrinkage function used to achieve proximal optimization of the regularization term without explicitly computing the gradient of the regularization term.

6. The method for fine-tuning a natural language processing model based on granular sparse gating according to claim 5, characterized in that, in step S4, the setting of the soft threshold ξ includes two modes: (1) fixed threshold mode: ξ is maintained as a preset constant during fine-tuning, which is used for performance optimization under given sparsity constraints; (2) dynamic scheduling mode: a threshold increment strategy is adopted, in which the size of the sparse soft threshold is directly increased according to a preset step size or period during model training, so that the model retains more parameters in the early stages of training to fully learn features and then gradually raises the truncation threshold to eliminate redundant parameters until the model performance reaches a significant inflection point, which is used to explore the parameter compression limit of the model.

7. The method for fine-tuning a natural language processing model based on granular sparse gating according to claim 2, characterized in that the model fine-tuning method adopts a hybrid architecture update approach according to the total rank, including: (1) when the set total rank exceeds a preset threshold, only the granular adapter path of steps S1 to S4 is enabled; (2) when the set total rank is less than or equal to the preset threshold, a global low-rank adapter path is constructed in parallel alongside the granular adapter path of steps S1 to S4, and the total rank is allocated between the granular adapter path and the global low-rank adapter path according to a preset ratio or dynamic strategy; the output of the global low-rank adapter path is weighted and summed with the aggregated output of the granular adapter path to compensate for feature integrity under low-rank conditions.

8. The method for fine-tuning a natural language processing model based on granular sparse gating according to claim 1, characterized in that the method also includes an inference reconstruction step: after fine-tuning, the indices of the zero elements in all gating vectors are detected, and the corresponding column vectors of the two projection matrices are physically removed, so that the model is reconstructed into a sparse, low-rank block matrix for faster inference.

9. A natural language processing model fine-tuning system based on granular sparse gating, used to implement the natural language processing model fine-tuning method based on granular sparse gating as described in any one of claims 1-8, characterized in that the system includes:
a parameter construction and initialization submodule, used to obtain the frozen weight matrix of the pre-trained natural language processing model, divide the input dimension and the output dimension into subspaces, construct the parallel adapter sub-blocks, and initialize each adapter sub-block with a low-rank down-projection matrix, an up-projection matrix, and a granular gating matrix;
a granular forward propagation submodule, used to segment the input natural language text data into the corresponding input sub-blocks; calculate the projection of the input sub-blocks onto the down-projection matrix to obtain intermediate features; use tensor dimension expansion operations to multiply the granular gating vectors element-wise with the intermediate features to obtain sparse features; and finally up-project and aggregate all sub-blocks through the up-projection matrix to obtain the adapter increment;
a loss calculation submodule, used to obtain the supervision signal of the fine-tuning task and calculate the task loss between the model prediction and the true label;
a dual-path parameter update submodule, used to calculate gradients based on the task loss and perform regular gradient updates on the down-projection matrix and the up-projection matrix using the optimizer, and, for the granular gating vector, to add a soft thresholding operation on top of gradient descent to realize the proximal gradient solution of the regularization term, thereby dynamically inducing parameter sparsity during training; and
a model evaluation and visualization submodule, used to extract, during or after model training, the model's loss value on the validation set and the proportion of zero elements in the granular gating vector, calculate the model sparsity, and generate the loss and sparsity evolution curves to verify the model's effectiveness.

10. An electronic device, characterized in that it includes: a memory, used to store an executable computer program; and a processor which, when executing the executable computer program stored in the memory, implements the fine-tuning method for a natural language processing model based on granular sparse gating as described in any one of claims 1 to 8.