A large model lightweight adaptation method and system

By extracting semantic cores from educational data and generating pseudo-instruction-response pairs, and using singular value decomposition to initialize low-rank adapter modules, the system optimizes weight and gradient updates during training. This solves the adaptation efficiency and performance problems of large language models in low-resource scenarios, and enables efficient knowledge learning in the field of education.

CN122242512A · Pending · Publication Date: 2026-06-19 · ZHENGZHOU UNIVERSITY OF AERONAUTICS

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: ZHENGZHOU UNIVERSITY OF AERONAUTICS
Filing Date: 2026-02-03
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

In low-resource scenarios, existing fine-tuning methods for large language models suffer from high computational cost, large storage overhead, low adaptation efficiency, overfitting, and insufficient learning of domain knowledge. In particular, when data is scarce in the education field, conventional data augmentation and gradient optimization strategies struggle to improve model performance effectively.

Method used

By extracting the semantic core of educational data, pseudo-instruction-response pairs are generated. The low-rank adapter module is initialized using singular value decomposition. During model fine-tuning, weighting factors and gradient adjustment coefficients are calculated, orthogonality constraints are applied, and training loss and gradient update are optimized.

Benefits of technology

It accelerates the model's knowledge acquisition and task performance in specific educational fields, improves the model's adaptation efficiency and expressive ability under limited resources, and realizes the full utilization of training resources.

✦ Generated by Eureka AI based on patent content.

Figure CN122242512A_ABST (abstract drawing)

Abstract

This invention provides a lightweight adaptation method and system for large models. By extracting the semantic core from educational data, domain topic vector clusters are obtained, and pseudo-instruction-response pairs are generated to expand the training set based on these clusters. In terms of model structure, a low-rank adapter module is inserted into the transformer layer of the pre-trained large model, and the semantic direction of the topic vector clusters is extracted through singular value decomposition. The adapter parameters are initialized, and a weighted strategy is adopted during training: the prediction entropy and cross-entropy loss of each sample are calculated, and sample weights and gradient adjustment coefficients are generated based on these. The gradient is scaled and updated, and orthogonality constraints are applied, with the strength of the constraints adjusted by the sample weights.

Description

Technical Field

[0001] This application belongs to the field of large models, and in particular relates to a lightweight adaptation method and system for large models.

Background Technology

[0002] Large-scale language models such as GPT and LLaMA have demonstrated outstanding performance in various natural language processing tasks. However, applying these ultra-large-scale models pre-trained on general corpora to the education field faces challenges due to scarce data resources. Full fine-tuning is a method to adapt large models to a specific domain, but it requires updating all model parameters, resulting in extremely high computational and storage costs and compromising the model's original general capabilities. Low-rank adaptation (LoRA) is a highly representative method in parameter fine-tuning. LoRA simulates parameter updates by injecting trainable low-rank decomposition matrices into the Transformer layers of the pre-trained model, training only a small number of newly added parameters during fine-tuning, thus reducing computational and storage requirements. Although methods like LoRA improve fine-tuning efficiency, insufficient training data remains a bottleneck restricting model performance in resource-constrained scenarios. Conventional data augmentation methods struggle to generate highly relevant training samples, easily leading to overfitting or insufficient learning of domain knowledge. At the data level, data augmentation strategies fail to deeply discover and utilize the semantic core and inherent structure of specific domains; the generated pseudo-data may have weak relevance to domain knowledge points, making quality difficult to guarantee and limiting its effectiveness in improving the model's domain capabilities. At the model initialization level, the standard LoRA method uses a random Gaussian distribution to initialize the low-rank matrix. This initialization method is completely unrelated to the semantic characteristics of the target domain, causing the model to spend more time exploring the correct adaptation direction in the early stages of fine-tuning. 
When training data is scarce, the model may struggle to converge to the optimal solution. The fine-tuning process also often fails to distinguish and exploit the differing learning value of individual samples. At the gradient optimization and model constraint level, the standard gradient update strategy lacks a mechanism for adjusting the learning intensity according to the training difficulty of each sample. Multiple direction vectors of the low-rank matrix may become excessively correlated after training, reducing the efficiency and expressive power of the adaptation. Furthermore, existing regularization methods lack a constraint mechanism tied to the training state.

Summary of the Invention

[0003] To address the above problems, this invention proposes a lightweight adaptation method for large models, comprising the following steps: Semantic core extraction is performed on the resource education data to obtain domain topic vector clusters that serve as semantic cores; based on the semantic cores and a preset set of instruction templates, pseudo-instruction-response pairs are generated and merged with the original data to form a training dataset; A low-rank adapter module is placed in the transformer layer of the pre-trained large model; singular value decomposition is performed on the domain topic vector cluster of the semantic kernel to extract semantic direction vectors, and the decomposition matrix of the low-rank adapter module is initialized using the direction vectors. In the iterative training of model fine-tuning, for each training sample, the entropy value of the predicted probability distribution of the model response and the cross-entropy loss with the standard response are calculated; based on the entropy value and the cross-entropy loss, a weighting factor and a gradient adjustment coefficient are calculated for the training sample, wherein the value of the gradient adjustment coefficient is positively correlated with the value of the cross-entropy loss. The cross-entropy loss is weighted using the weighting factor to obtain the training loss; gradient backpropagation is performed based on the training loss, and when updating the weights of the low-rank adapter module, the gradient is scaled using the gradient adjustment coefficient, and an orthogonality constraint is applied, the strength of which is adjusted by the weighting factor.

[0004] Optionally, the step of extracting semantic cores from resource education data to obtain domain topic vector clusters as semantic cores includes: Each text sample in the educational data is converted into a sentence vector using a pre-trained sentence vector encoder; The K-means clustering algorithm is used to cluster all sentence vectors. The centroid vector of each cluster is used as the domain topic vector, and all centroid vectors together constitute the domain topic vector cluster.

[0005] Optionally, generating pseudo-instruction-response pairs based on the semantic kernel and a preset set of instruction templates includes: Randomly select a domain topic vector from the domain topic vector cluster; In the sentence vectors of the original data, the text sample that is most similar to the topic vector of the selected domain is retrieved by calculating the cosine similarity. The content of the text sample is filled into the placeholder in the preset instruction template to form a new instruction, and the original response of the text sample is used as the corresponding response to form a pseudo-instruction-response pair.

[0006] Optionally, the step of performing singular value decomposition on the domain topic vector cluster of the semantic kernel, extracting semantic direction vectors, and using the direction vectors to initialize the decomposition matrix of the low-rank adapter module includes: performing singular value decomposition on the matrix composed of the domain topic vector cluster to obtain the left singular vector matrix U; selecting the k column vectors of U that correspond to the k largest singular values as the semantic direction vectors; and setting the rank of the low-rank adapter module to k, initializing its decomposition matrix A as a random matrix drawn from a Gaussian distribution, and initializing its decomposition matrix B as the matrix whose columns are the k semantic direction vectors.

[0007] Optionally, calculating the weighting factor and gradient adjustment coefficient for the training samples based on the entropy value and cross-entropy loss includes: for a training sample, the entropy of the predicted probability distribution of the model response is E, and the cross-entropy loss between the model response and the standard response is L. The weighting factor w is calculated as $w = (1 + L)\,e^{-\beta E}$, where β is a preset hyperparameter used to control how strongly the entropy suppresses the weight. The gradient adjustment coefficient g is calculated as $g = 1 + \alpha L$, where α is a preset positive hyperparameter used to control the sensitivity of the gradient scaling to the cross-entropy loss.

[0008] Optionally, applying an orthogonality constraint, the strength of which is adjusted by the weighting factor, includes: when calculating the training loss, adding an orthogonal loss term $L_{\mathrm{orth}}$, calculated as $L_{\mathrm{orth}} = \lVert B^{\top}B - I \rVert_F$, where B is the decomposition matrix of the low-rank adapter module, I is the identity matrix, and $\lVert\cdot\rVert_F$ is the Frobenius norm. The weighting factor w is multiplied by a fixed orthogonal constraint hyperparameter $\lambda$ to yield the orthogonal constraint strength $w\lambda$, and the term $w\lambda L_{\mathrm{orth}}$ is added to the weighted cross-entropy loss.

[0009] Furthermore, this invention also relates to a lightweight adaptation system for large models, comprising the following modules: The module is used to extract the semantic core from the resource education data to obtain a domain topic vector cluster that serves as the semantic core; based on the semantic core and a preset set of instruction templates, pseudo-instruction-response pairs are generated and merged with the original data to form a training dataset. An extraction module is used to insert a low-rank adapter module into the transformer layer of a pre-trained large model; singular value decomposition is performed on the domain topic vector cluster of the semantic kernel to extract semantic direction vectors, and the direction vectors are used to initialize the decomposition matrix of the low-rank adapter module. The calculation module is used to calculate the predicted probability distribution entropy value of the model response and the cross-entropy loss with the standard response for each training sample during iterative training of model fine-tuning; based on the entropy value and cross-entropy loss, it calculates the weighting factor and gradient adjustment coefficient for the training sample, wherein the value of the gradient adjustment coefficient is positively correlated with the value of the cross-entropy loss. An adjustment module is used to weight the cross-entropy loss using the weighting factor to obtain the training loss; perform gradient backpropagation based on the training loss; when updating the weights of the low-rank adapter module, scale the gradient using the gradient adjustment coefficient and apply an orthogonality constraint, the strength of which is adjusted by the weighting factor.

[0010] Preferably, the step of extracting the semantic core from the resource education data to obtain a domain topic vector cluster as the semantic core includes: Each text sample in the educational data is converted into a sentence vector using a pre-trained sentence vector encoder; The K-means clustering algorithm is used to cluster all sentence vectors. The centroid vector of each cluster is used as the domain topic vector, and all centroid vectors together constitute the domain topic vector cluster.

[0011] Preferably, the step of generating pseudo-instruction-response pairs based on the semantic kernel and a preset set of instruction templates includes: Randomly select a domain topic vector from the domain topic vector cluster; In the sentence vectors of the original data, the text sample that is most similar to the topic vector of the selected domain is retrieved by calculating the cosine similarity. The content of the text sample is filled into the placeholder in the preset instruction template to form a new instruction, and the original response of the text sample is used as the corresponding response to form a pseudo-instruction-response pair.

[0012] Preferably, the step of performing singular value decomposition on the domain topic vector cluster of the semantic kernel, extracting semantic direction vectors, and using the direction vectors to initialize the decomposition matrix of the low-rank adapter module includes: performing singular value decomposition on the matrix composed of the domain topic vector cluster to obtain the left singular vector matrix U; selecting the k column vectors of U that correspond to the k largest singular values as the semantic direction vectors; and setting the rank of the low-rank adapter module to k, initializing its decomposition matrix A as a random matrix drawn from a Gaussian distribution, and initializing its decomposition matrix B as the matrix whose columns are the k semantic direction vectors.

[0013] Preferably, calculating the weighting factor and gradient adjustment coefficient for the training samples based on the entropy value and cross-entropy loss includes: for a training sample, the entropy of the predicted probability distribution of the model response is E, and the cross-entropy loss between the model response and the standard response is L. The weighting factor w is calculated as $w = (1 + L)\,e^{-\beta E}$, where β is a preset hyperparameter used to control how strongly the entropy suppresses the weight. The gradient adjustment coefficient g is calculated as $g = 1 + \alpha L$, where α is a preset positive hyperparameter used to control the sensitivity of the gradient scaling to the cross-entropy loss.

[0014] Preferably, applying an orthogonality constraint, the strength of which is adjusted by the weighting factor, includes: when calculating the training loss, adding an orthogonal loss term $L_{\mathrm{orth}}$, calculated as $L_{\mathrm{orth}} = \lVert B^{\top}B - I \rVert_F$, where B is the decomposition matrix of the low-rank adapter module, I is the identity matrix, and $\lVert\cdot\rVert_F$ is the Frobenius norm. The weighting factor w is multiplied by a fixed orthogonal constraint hyperparameter $\lambda$ to yield the orthogonal constraint strength $w\lambda$, and the term $w\lambda L_{\mathrm{orth}}$ is added to the weighted cross-entropy loss.

[0015] This invention extracts semantic cores from educational data and generates high-quality pseudo-instruction-response pairs accordingly. The direction vectors of these semantic cores are used to initialize a low-rank adapter, providing a good starting point with domain prior knowledge for model fine-tuning, thereby accelerating model convergence. During training, specific weights and gradient adjustment coefficients are calculated for each training sample, enabling the model to focus on learning difficult samples with high prediction uncertainty or large errors, achieving full utilization of training resources. Gradient scaling is applied during weight updates, and orthogonality constraints related to sample importance are imposed, improving the large model's knowledge acquisition and task performance in specific educational domains with limited computational resources.

Attached Figure Description

[0016] Figure 1 is a flowchart of the first embodiment; Figure 2 is a schematic diagram illustrating the process of generating pseudo-instruction-response pairs; Figure 3 is a schematic diagram of the initialization of a low-rank adapter based on singular value decomposition.

Detailed Implementation

[0017] Many specific details are set forth in the following description to provide a full understanding of this specification. However, this specification can be implemented in many other ways than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of this specification. Therefore, this specification is not limited to the specific implementations disclosed below.

[0018] The terminology used in one or more embodiments of this specification is for the purpose of describing particular embodiments only and is not intended to limit the one or more embodiments of this specification. The singular forms “a,” “said,” and “the” as used in one or more embodiments of this specification and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and/or” as used in one or more embodiments of this specification refers to and includes any or all possible combinations of one or more of the associated listed items.

[0019] It should be understood that although the terms first, second, etc., may be used to describe various information in one or more embodiments of this specification, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, first may also be referred to as second without departing from the scope of one or more embodiments of this specification, and similarly, second may also be referred to as first. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to a determination."

[0020] In the first embodiment, the present invention proposes a lightweight adaptation method for large models which, as shown in Figure 1, includes the following steps: S1, extract the semantic core from the resource education data to obtain the domain topic vector cluster as the semantic core; based on the semantic core and the preset instruction template set, generate pseudo-instruction-response pairs and merge them with the original data to form a training dataset. Using a pre-trained sentence vector model, such as SimCSE, each question-answer pair or text fragment in the educational data is converted into a high-dimensional text vector. K-means clustering is then used to cluster all text vectors, where the number of clusters, K, is a preset hyperparameter. The centroid vectors of the clusters constitute the domain topic vector cluster, which is the semantic kernel. A template is randomly selected from a preset set of instruction templates, such as "Please explain a certain knowledge point," "How to apply a certain formula," and "What are the common misconceptions about a certain concept." For each cluster center, the nearest original data samples are found, and the entities or keywords from these samples are filled into the selected instruction template to generate new pseudo-instructions, while the original response content is retained, thus forming a pseudo-instruction-response pair. All generated pseudo-instruction-response pairs are merged with the original educational dataset to form an expanded training dataset.

[0021] In an optional embodiment, the step of extracting the semantic core from the resource education data to obtain a domain topic vector cluster as the semantic core includes: Each text sample in the educational data is converted into a sentence vector using a pre-trained sentence vector encoder; The K-means clustering algorithm is used to cluster all sentence vectors. The centroid vector of each cluster is used as the domain topic vector, and all centroid vectors together constitute the domain topic vector cluster.

[0022] Suppose we have a batch of 1000 text samples containing historical knowledge questions and answers. Using a pre-trained sentence vector encoder such as Sentence-BERT, each text sample, such as "What year did Qin Shi Huang unify the six kingdoms?", is converted into a 768-dimensional sentence vector. The vector numerically represents the semantic information of the text. After processing all the samples, we will get a 1000-row, 768-column sentence vector matrix.

[0023] Further topic clustering is performed, using the 1000 768-dimensional sentence vectors generated above as input, and applying the K-means clustering algorithm. The number of clusters, K, is set to 50, representing the goal of extracting 50 core historical topics from this batch of data. Sentence vectors with similar semantics are grouped into the same cluster; for example, samples about dynastic changes are clustered together, and samples about important battles are clustered together. After clustering, the average value of all sentence vectors within each cluster is calculated to obtain the centroid vector of that cluster. These 50 centroid vectors represent the domain topic vectors of different historical themes, and together they constitute the domain topic vector clusters, which are the semantic kernels upon which subsequent steps are based.
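As a non-limiting illustration, the clustering step described above may be sketched as follows. The sketch assumes the sentence vectors are already available as a NumPy array; random vectors stand in for real encoder output, and the helper name `kmeans_centroids`, the iteration count, and the seed are hypothetical choices made only for this example.

```python
import numpy as np

def kmeans_centroids(vectors, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct samples chosen at random.
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    labels = np.zeros(len(vectors), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every vector to every centroid.
        d2 = ((vectors ** 2).sum(axis=1)[:, None]
              - 2.0 * vectors @ centroids.T
              + (centroids ** 2).sum(axis=1)[None, :])
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = vectors[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Placeholder for the 1000 x 768 sentence vector matrix (assumption: real
# vectors would come from a sentence encoder such as Sentence-BERT).
rng = np.random.default_rng(0)
sentence_vectors = rng.standard_normal((1000, 768))

# K = 50 centroids form the domain topic vector cluster (the semantic kernel).
topic_vectors, labels = kmeans_centroids(sentence_vectors, k=50)
print(topic_vectors.shape)
```

Each row of `topic_vectors` is the mean of the sentence vectors assigned to one cluster, which corresponds to the centroid vector described in this embodiment.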

[0024] In an optional embodiment, generating pseudo-instruction-response pairs based on the semantic kernel and a preset set of instruction templates includes: Randomly select a domain topic vector from the domain topic vector cluster; In the sentence vectors of the original data, the text sample that is most similar to the topic vector of the selected domain is retrieved by calculating the cosine similarity. The content of the text sample is filled into the placeholder in the preset instruction template to form a new instruction, and the original response of the text sample is used as the corresponding response to form a pseudo-instruction-response pair.

[0025] Specifically, the first step is to select a topic and retrieve samples. One topic vector is randomly selected from the 50 generated domain topic vectors; assume it represents the topic "Tang Dynasty Economy". The cosine similarity between this topic vector and the sentence vectors of all 1000 original text samples in the dataset is calculated. For example, if the sentence vector corresponding to the text sample "Explain the Kaiyuan Prosperity of the Tang Dynasty" receives the highest similarity score, 0.92, this text sample is selected as the knowledge point most relevant to the topic.

[0026] The second step is to generate instruction-response pairs. One or more instruction templates are pre-defined; for example, the template could be "Please explain the concept of a domain knowledge point in simple terms." The retrieved text sample content, i.e., "Explain the Kaiyuan Prosperity of the Tang Dynasty," is filled into the placeholder position in the template. This generates a new pseudo-instruction: "Please explain the concept of 'Explain the Kaiyuan Prosperity of the Tang Dynasty' in simple terms." The standard answer corresponding to the original sample "Explain the Kaiyuan Prosperity of the Tang Dynasty" is then looked up in the dataset, for example, "The Kaiyuan Prosperity refers to the prosperous era that occurred during the early reign of Emperor Xuanzong of Tang." This answer is used as the response to the newly generated pseudo-instruction, thus forming a complete pseudo-instruction-response pair, as shown in Figure 2, which is used to expand the training data.
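As a non-limiting illustration, the retrieval and template-filling steps above may be sketched as follows. The three-dimensional toy vectors, the `{content}` placeholder syntax, and the function names `most_similar_index` and `make_pseudo_pair` are assumptions made only for this example; in practice the topic vector would be drawn at random from the domain topic vector cluster.

```python
import numpy as np

def most_similar_index(topic_vector, sentence_vectors):
    """Index of the sentence vector with the highest cosine similarity."""
    t = topic_vector / np.linalg.norm(topic_vector)
    s = sentence_vectors / np.linalg.norm(sentence_vectors, axis=1, keepdims=True)
    return int(np.argmax(s @ t))

def make_pseudo_pair(topic_vector, sentence_vectors, samples, template):
    """Fill the retrieved sample text into the template; reuse its response."""
    idx = most_similar_index(topic_vector, sentence_vectors)
    text, response = samples[idx]
    return {"instruction": template.format(content=text), "response": response}

# Toy data: (text, original response) pairs and stand-in sentence vectors.
samples = [
    ("What year did Qin Shi Huang unify the six kingdoms?", "221 BC."),
    ("Explain the Kaiyuan Prosperity of the Tang Dynasty",
     "The Kaiyuan Prosperity refers to the prosperous era that occurred "
     "during the early reign of Emperor Xuanzong of Tang."),
]
sentence_vectors = np.array([[1.0, 0.0, 0.0],
                             [0.0, 1.0, 0.2]])
topic_vector = np.array([0.1, 0.9, 0.1])  # stands in for "Tang Dynasty Economy"

template = "Please explain the concept of '{content}' in simple terms."
pair = make_pseudo_pair(topic_vector, sentence_vectors, samples, template)
print(pair["instruction"])
```

The response is copied unchanged from the retrieved sample, so the pseudo-pair adds a new phrasing of the instruction without introducing new answer content.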

[0027] S2, insert a low-rank adapter module into the transformer layer of the pre-trained large model; perform singular value decomposition on the domain topic vector cluster of the semantic kernel, extract semantic direction vectors, and use the direction vectors to initialize the decomposition matrix of the low-rank adapter module. The query and value projection matrices in the Transformer architecture of a pre-trained large model such as LLaMA are selected as the target layers, and a low-rank adapter module is connected in parallel with the weight matrix of each target layer. This module consists of the product of two low-rank matrices A and B, and only the parameters of A and B are updated during training. All centroid vectors from the domain topic vector cluster obtained above are stacked as rows into a matrix M. Singular value decomposition is performed on M to obtain three matrices U, Σ, and V. The r right singular vectors corresponding to the r largest singular values are extracted, i.e., the first r columns of V. If the centroid vectors are instead arranged as the columns of the matrix, as in the optional embodiment below, the same directions are given by the left singular vectors in U. These r vectors represent the semantic directions of the educational domain data. When initializing the low-rank adapter module, the decomposition matrix A is randomly initialized using a standard Gaussian distribution, while the column vectors of the decomposition matrix B are set to the extracted r semantic direction vectors, thereby injecting core domain knowledge into the model as a prior.

[0028] In an optional embodiment, the step of performing singular value decomposition on the domain topic vector cluster of the semantic kernel, extracting semantic direction vectors, and initializing the decomposition matrix of the low-rank adapter module using the direction vectors includes: performing singular value decomposition on the matrix composed of the domain topic vector cluster to obtain the left singular vector matrix U; selecting the k column vectors of U that correspond to the k largest singular values as the semantic direction vectors; and setting the rank of the low-rank adapter module to k, initializing its decomposition matrix A as a random matrix drawn from a Gaussian distribution, and initializing its decomposition matrix B as the matrix whose columns are the k semantic direction vectors.

[0029] The first step is to extract semantic directions by combining 50 768-dimensional domain topic vectors into a 768-row, 50-column matrix. Singular Value Decomposition (SVD) is then performed on this matrix to obtain a left singular vector matrix U, a singular value matrix Σ, and a right singular vector matrix V. The column vectors of matrix U represent orthogonal basis directions in the data semantic space.

[0030] The second step is to initialize the low-rank adapter matrices. Assuming the rank k of the LoRA low-rank adapter is set to 16, the core semantics of the domain knowledge are represented by 16 directions. The 16 largest singular values are found in the singular value matrix Σ, and the 16 column vectors corresponding to these singular values are selected from the left singular vector matrix U. These 16 vectors, each 768-dimensional, are the semantic direction vectors. In the LoRA module, the weight update matrix is decomposed into matrices A and B. Matrix B, with 768 rows and 16 columns, is initialized with the 16 semantic direction vectors, i.e., each vector is treated as a column of B. The other matrix, A, with 16 rows and d columns, is randomly initialized using a standard Gaussian distribution, as shown in Figure 3. In this way, the LoRA module is infused with prior knowledge about the core topics of the domain at the start of training.
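As a non-limiting illustration, the initialization described in the two steps above may be sketched with NumPy. Random vectors stand in for the 50 centroid vectors, and the input width `d_in = 768` is an assumption made only for this example, since the text leaves d unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for the 50 domain topic vectors (assumption: real values
# would be the K-means centroids of the sentence vectors).
topic_vectors = rng.standard_normal((50, 768))
M = topic_vectors.T  # 768 x 50, one topic vector per column

# numpy returns singular values in descending order, so the first k
# columns of U correspond to the k largest singular values.
U, S, Vt = np.linalg.svd(M, full_matrices=False)

k = 16       # adapter rank
d_in = 768   # hypothetical input width d of the adapted weight matrix

B = U[:, :k].copy()                 # 768 x 16: semantic direction vectors as columns
A = rng.standard_normal((k, d_in))  # 16 x d: standard Gaussian initialization

delta_W = B @ A                     # low-rank update added in parallel to the frozen weight
print(B.shape, A.shape, delta_W.shape)
```

Because the columns of U are orthonormal, B starts out satisfying B^T B = I. Note also that, unlike the common LoRA initialization in which one factor starts at zero, this choice gives a non-zero initial update B A.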

[0031] S3, in the iterative training of model fine-tuning, for each training sample, calculate the entropy value of the predicted probability distribution of the model response and the cross-entropy loss with the standard response; based on the entropy value and the cross-entropy loss, calculate the weighting factor and gradient adjustment coefficient for the training sample, wherein the value of the gradient adjustment coefficient is positively correlated with the value of the cross-entropy loss. Specifically, for any instruction-response pair in the training dataset, the instruction part is input into the model. At each step of generating the response, the model outputs a probability distribution covering the entire vocabulary. The cross-entropy loss is calculated by comparing the model's predicted probability distribution at each generation position with the true token at that position in the standard response. The losses over the entire response sequence are then summed or averaged to obtain the cross-entropy loss value for that sample. The information entropy of the model's predicted probability distribution at each generation position is calculated, and the entropy values over the entire sequence are averaged to obtain the predicted probability distribution entropy value for that sample. The weighting factor is calculated using a predefined function whose inputs are the cross-entropy loss and the entropy value. This function assigns greater weights to samples with larger cross-entropy losses or higher entropy values. The gradient adjustment coefficient is calculated from the cross-entropy loss. For example, setting it to 1 plus the product of a hyperparameter and the cross-entropy loss value ensures that samples with larger losses also have larger gradient adjustment coefficients.
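As a non-limiting illustration, the per-sample entropy and cross-entropy described above may be computed from per-position logits as follows. The function name `sample_entropy_and_ce` and the use of the mean over positions for both quantities are assumptions made for this example (the text allows the losses to be either summed or averaged).

```python
import numpy as np

def sample_entropy_and_ce(logits, target_ids):
    """logits: (T, V) scores per generation position; target_ids: (T,) true tokens.

    Returns (mean entropy E, mean cross-entropy L) over the T positions.
    """
    z = logits - logits.max(axis=1, keepdims=True)  # stabilise the softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    probs = np.exp(log_probs)
    # Cross-entropy: negative log-probability of the true token at each position.
    ce = -log_probs[np.arange(len(target_ids)), target_ids].mean()
    # Entropy of the full predicted distribution at each position.
    entropy = -(probs * log_probs).sum(axis=1).mean()
    return float(entropy), float(ce)

# Toy check: a uniform distribution over 4 tokens at each of 3 positions,
# for which both quantities equal ln 4.
logits = np.zeros((3, 4))
E, L = sample_entropy_and_ce(logits, np.array([0, 1, 2]))
print(E, L)
```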

[0032] In an optional embodiment, calculating the weighting factor and gradient adjustment coefficient for the training samples based on the entropy value and cross-entropy loss includes: for a training sample, the entropy of the predicted probability distribution of the model response is E, and the cross-entropy loss between the model response and the standard response is L. The weighting factor w is calculated as $w = (1 + L)\,e^{-\beta E}$, where β is a preset hyperparameter used to control how strongly the entropy suppresses the weight. The gradient adjustment coefficient g is calculated as $g = 1 + \alpha L$, where α is a preset positive hyperparameter used to control the sensitivity of the gradient scaling to the cross-entropy loss.

[0033] Specifically, suppose the model is processing a training sample and predicting the next word. The model's prediction is incorrect and highly uncertain, resulting in a high cross-entropy loss L, for example, L=3.0. Simultaneously, the predicted probability distribution is very dispersed, leading to a high entropy value E, for example, E=4.0. Assuming the hyperparameter β is set to 0.1, the weighting factor w≈2.68. The sample receives a large base weight due to its high loss, but this weight is also somewhat suppressed due to the high uncertainty.

[0034] Now suppose the model makes an incorrect prediction but is overconfident. In this case, the cross-entropy loss L is still high (L=3.0), but the model's predicted probabilities are concentrated on a few incorrect tokens, resulting in a low entropy value E (e.g., E=0.5). The weighting factor is then w≈3.80. Such samples are considered "hard samples" because they expose gaps in the model's knowledge, and they therefore receive high weights so that the model focuses on them. If instead the model makes a correct prediction with high confidence, L is low (e.g., L=0.2) and E is also low (e.g., E=0.5), giving w≈1.14.
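As a non-limiting illustration, the three worked values above can be reproduced with a short calculation. The formula itself is not legible in this text; the form w = (1 + L)·exp(−βE) used here is an assumption, chosen because it matches all three worked values with β = 0.1, and the function name `weighting_factor` is hypothetical.

```python
import math

def weighting_factor(ce_loss, entropy, beta=0.1):
    """Assumed form: w = (1 + L) * exp(-beta * E).

    The (1 + L) factor raises the weight of high-loss samples; the
    exponential factor suppresses the weight as entropy grows.
    """
    return (1.0 + ce_loss) * math.exp(-beta * entropy)

cases = {
    "wrong and uncertain": (3.0, 4.0),
    "wrong but confident": (3.0, 0.5),
    "right and confident": (0.2, 0.5),
}
weights = {name: weighting_factor(L, E) for name, (L, E) in cases.items()}
for name, w in weights.items():
    print(f"{name}: w = {w:.2f}")
```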

[0035] For each training sample processed by the model, a cross-entropy loss value L is calculated, representing the difference between the model's prediction and the true label. Assuming a preset hyperparameter α of 0.2, this parameter determines the strength of the gradient amplification effect of the loss value. The gradient adjustment coefficient g is calculated according to the formula. If the current sample is a simple sample, the model can easily predict it correctly, and the cross-entropy loss L might only be 0.1. Therefore, its gradient adjustment coefficient g = 1.02. This indicates that during backpropagation, the gradient generated by this sample remains essentially constant, only slightly amplified by 2%. If the current sample is a difficult sample, the model's prediction differs greatly from the true value, with a loss L as high as 4.0. Therefore, its gradient adjustment coefficient g = 1.8. This indicates that the gradient generated by the difficult sample will be amplified by 80%.
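As a non-limiting illustration, the gradient adjustment coefficient in the example above may be computed and applied as follows. The form g = 1 + αL is assumed from the two worked values (g = 1.02 at L = 0.1 and g = 1.8 at L = 4.0 with α = 0.2); the function name and the toy gradient array are hypothetical.

```python
import numpy as np

def gradient_adjustment(ce_loss, alpha=0.2):
    """Assumed form: g = 1 + alpha * L, so g grows linearly with the loss."""
    return 1.0 + alpha * ce_loss

g_easy = gradient_adjustment(0.1)  # simple sample
g_hard = gradient_adjustment(4.0)  # difficult sample

# Scaling a gradient is an element-wise multiplication by g.
grad = np.array([0.5, -1.0, 2.0])  # toy gradient of one adapter parameter block
scaled_easy = g_easy * grad
scaled_hard = g_hard * grad
print(g_easy, g_hard)
```

Since α is positive and the cross-entropy loss is non-negative, g is never below 1 under this form, so the scaling only amplifies gradients and never attenuates them.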

[0036] S4, the cross-entropy loss is weighted using the weighting factor to obtain the training loss; gradient backpropagation is performed based on the training loss, and when updating the weights of the low-rank adapter module, the gradient is scaled using the gradient adjustment coefficient, and an orthogonality constraint is applied, the strength of which is adjusted by the weighting factor.

[0037] In one embodiment, the weighting factor calculated for each training sample is multiplied by the sample's cross-entropy loss value. The resulting product is the training loss representing the sample's contribution to the overall model training. For example, if a sample has a cross-entropy loss of 0.8 and a weighting factor of 1.5, the training loss is 1.2.

[0038] An orthogonality constraint loss is defined as the squared Frobenius norm of the difference between $B^{\top}B$ (the transpose of matrix B in the low-rank adapter multiplied by B itself) and an identity matrix; it encourages the column vectors of matrix B to remain mutually orthogonal. This orthogonality constraint loss is multiplied by the weighting factor and a manually set hyperparameter, and added to the training loss to form the total loss. Based on this total loss, the gradients of matrices A and B in the low-rank adapter module are calculated using the backpropagation algorithm. Before the Adam optimizer performs the weight update step, the calculated gradient is multiplied by the gradient adjustment coefficient corresponding to the sample, thereby scaling the gradient. The optimizer then uses the scaled gradient to update the weight parameters of matrices A and B.
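The gradient path for matrix B described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the patented implementation: the weighting factor w and the coefficient g are treated as constants during differentiation, `grad_ce_B` is a random stand-in for the cross-entropy gradient that backpropagation would supply, and the function names are hypothetical. The orthogonality gradient uses the identity ∇_B ‖BᵀB − I‖²_F = 4B(BᵀB − I).

```python
import numpy as np


def orth_loss_and_grad(B: np.ndarray):
    """Return ||B^T B - I||_F^2 and its gradient with respect to B."""
    M = B.T @ B - np.eye(B.shape[1])
    loss = float(np.sum(M * M))
    grad = 4.0 * B @ M  # analytic gradient of the squared Frobenius norm
    return loss, grad


def scaled_adapter_gradient(grad_ce_B, B, w, g, lam=0.01):
    """Gradient of (w * L + w * lam * L_orth) w.r.t. B, scaled by g.

    Assumption: w and g are constants for this sample, so they only
    rescale the gradients and are not differentiated themselves.
    """
    _, grad_orth = orth_loss_and_grad(B)
    grad_total = w * grad_ce_B + w * lam * grad_orth
    return g * grad_total


rng = np.random.default_rng(0)
# B with orthonormal columns (768 x 16), obtained from a QR factorization.
B_demo, _ = np.linalg.qr(rng.standard_normal((768, 16)))
# Stand-in for the cross-entropy gradient produced by backpropagation.
grad_ce_B = rng.standard_normal((768, 16))

scaled = scaled_adapter_gradient(grad_ce_B, B_demo, w=3.80, g=1.8)
print(scaled.shape)
```

When the columns of B are already orthonormal, the penalty gradient vanishes and the result reduces to g·w times the cross-entropy gradient; the scaled array is what would then be handed to the optimizer's update step.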

[0039] In an optional embodiment, applying an orthogonality constraint whose strength is adjusted by the weighting factor includes: when calculating the training loss, adding an orthogonal loss term $L_{orth}$; the orthogonal loss term is calculated as $L_{orth} = \lVert B^{\top}B - I \rVert_F^2$, where B is the decomposition matrix of the low-rank adapter module, I is the identity matrix, and $\lVert \cdot \rVert_F$ is the Frobenius norm; the weighting factor w is multiplied by a fixed orthogonal constraint hyperparameter $\lambda$ to obtain the orthogonal constraint strength $w\lambda$, and $w\lambda L_{orth}$ is added to the weighted cross-entropy loss.

[0040] The first step is to calculate the orthogonality loss. In the low-rank adapter, the column vectors of matrix B represent the learned semantic directions. Assuming B is a 768×16 matrix, multiplying the transpose of B by B yields a 16×16 matrix, which should be close to the 16×16 identity matrix I. The orthogonality loss is the square of the Frobenius norm of the difference between these two matrices. If the calculated value is 0.05, the column vectors of B are close to being orthogonal.
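The first step can be reproduced with a few lines of NumPy. The matrices below are synthetic stand-ins (a random matrix with orthonormal columns and a slightly perturbed copy), not values from the source, and the function name `orthogonality_loss` is illustrative.

```python
import numpy as np


def orthogonality_loss(B: np.ndarray) -> float:
    """Squared Frobenius norm of (B^T B - I) for a d x k matrix B."""
    k = B.shape[1]
    diff = B.T @ B - np.eye(k)  # k x k deviation from the identity
    return float(np.linalg.norm(diff, ord="fro") ** 2)


rng = np.random.default_rng(0)

# 768 x 16 matrix with orthonormal columns, from a reduced QR factorization.
Q, _ = np.linalg.qr(rng.standard_normal((768, 16)))
# The same matrix with small random noise, so columns are only nearly orthogonal.
noisy = Q + 0.005 * rng.standard_normal((768, 16))

print("orthonormal B:", orthogonality_loss(Q))
print("perturbed B:  ", orthogonality_loss(noisy))
```

The loss is numerically zero for the orthonormal matrix and becomes a small positive number once the columns drift, which is the quantity the penalty term pushes back toward zero.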

[0041] The second step is to adjust the constraint strength and incorporate it into the total loss. A fixed orthogonal constraint hyperparameter λ is set, for example, λ = 0.01, which serves as the base penalty strength. The weighting factor w for the current sample, calculated above, is used. Assuming the current sample is difficult, w is 3.80. Then the weight of the applied orthogonal constraint term is w × λ, i.e., 0.038. The total training loss equals the weighted cross-entropy loss wL, plus the adjusted orthogonal loss term. This indicates that when the model deals with difficult samples, the requirement to maintain the orthogonality of semantic directions will be correspondingly strengthened, ensuring that the structure of the semantic space is not destroyed while learning new knowledge.
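The second step is plain arithmetic and can be written out as follows. Combining L = 3.0 and w = 3.80 from paragraph [0034] with the orthogonality loss of 0.05 from paragraph [0040] is an illustrative choice; the source does not state that these values belong to the same sample. The function name `total_loss` is hypothetical.

```python
def total_loss(ce_loss: float, w: float, orth_loss: float, lam: float = 0.01):
    """Return (total loss, orthogonal constraint strength).

    total = w * L + (w * lam) * L_orth
    """
    strength = w * lam
    return w * ce_loss + strength * orth_loss, strength


total, strength = total_loss(ce_loss=3.0, w=3.80, orth_loss=0.05)
print(f"constraint strength w*lambda = {strength:.3f}")  # 0.038
print(f"total loss = {total:.4f}")                       # 11.4019
```

With these numbers the orthogonality term contributes only 0.0019 to the total, so at λ = 0.01 it acts as a light regularizer next to the weighted cross-entropy term.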

[0042] In a second embodiment, the present invention also provides a lightweight adaptation system for large models, comprising the following modules: a module for performing semantic core extraction on the resource education data to obtain a domain topic vector cluster that serves as the semantic core, and for generating pseudo-instruction-response pairs based on the semantic core and a preset set of instruction templates and merging them with the original data to form a training dataset; an extraction module for inserting a low-rank adapter module into the transformer layer of a pre-trained large model, performing singular value decomposition on the domain topic vector cluster of the semantic core to extract semantic direction vectors, and using the direction vectors to initialize the decomposition matrix of the low-rank adapter module; a calculation module for calculating, for each training sample during the iterative training of model fine-tuning, the entropy value of the predicted probability distribution of the model response and the cross-entropy loss with respect to the standard response, and for calculating, based on the entropy value and the cross-entropy loss, the weighting factor and the gradient adjustment coefficient for the training sample, wherein the value of the gradient adjustment coefficient is positively correlated with the value of the cross-entropy loss; and an adjustment module for weighting the cross-entropy loss using the weighting factor to obtain the training loss, performing gradient backpropagation based on the training loss, and, when updating the weights of the low-rank adapter module, scaling the gradient using the gradient adjustment coefficient and applying an orthogonality constraint, the strength of which is adjusted by the weighting factor.
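The initialization performed by the extraction module can be sketched with NumPy's SVD. This is a minimal sketch under several assumptions that the source does not fix: the topic vectors are arranged as columns so that the left singular vectors live in the 768-dimensional hidden space, matrix A has shape k × `d_in`, and A is drawn with standard deviation `a_std`. The function name and the random topic vectors are hypothetical stand-ins.

```python
import numpy as np


def init_adapter_from_topics(topic_vectors, k, d_in, seed=0, a_std=0.01):
    """Initialize low-rank adapter matrices from domain topic vectors.

    topic_vectors: array of shape (n_topics, d), one centroid per row.
    Returns A (k x d_in, Gaussian), B (d x k, top-k left singular
    vectors) and the k largest singular values.
    """
    M = np.asarray(topic_vectors, dtype=float).T  # (d, n_topics)
    U, S, _ = np.linalg.svd(M, full_matrices=False)  # S is sorted descending
    if k > U.shape[1]:
        raise ValueError("k cannot exceed the number of singular vectors")
    B = U[:, :k]  # columns are the k main semantic direction vectors
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, a_std, size=(k, d_in))
    return A, B, S[:k]


# Synthetic stand-in for 32 cluster centroids of dimension 768.
topics = np.random.default_rng(1).standard_normal((32, 768))
A, B, top_singular_values = init_adapter_from_topics(topics, k=16, d_in=768)
print(A.shape, B.shape)
```

Because the columns of U are orthonormal, B initialized this way starts with an orthogonality loss of numerically zero, so the constraint applied during training only has to preserve a property that already holds at initialization.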

[0043] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, compact disc read-only memory (CD-ROM), optical storage, etc.) containing computer-usable program code.

[0044] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a device for implementing the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram.

[0045] The above description is merely an embodiment of this application and is not intended to limit the scope of this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of the claims of this application.

Claims

1. A lightweight adaptation method for large models, characterized in that it includes the following steps: performing semantic core extraction on resource education data to obtain a domain topic vector cluster that serves as the semantic core; generating pseudo-instruction-response pairs based on the semantic core and a preset set of instruction templates, and merging them with the original data to form a training dataset; inserting a low-rank adapter module into the transformer layer of the pre-trained large model; performing singular value decomposition on the domain topic vector cluster of the semantic core to extract semantic direction vectors, and initializing the decomposition matrix of the low-rank adapter module using the direction vectors; in the iterative training of model fine-tuning, for each training sample, calculating the entropy value of the predicted probability distribution of the model response and the cross-entropy loss with respect to the standard response; based on the entropy value and the cross-entropy loss, calculating a weighting factor and a gradient adjustment coefficient for the training sample, wherein the value of the gradient adjustment coefficient is positively correlated with the value of the cross-entropy loss; weighting the cross-entropy loss using the weighting factor to obtain the training loss; and performing gradient backpropagation based on the training loss, and, when updating the weights of the low-rank adapter module, scaling the gradient using the gradient adjustment coefficient and applying an orthogonality constraint, the strength of which is adjusted by the weighting factor.

2. The method according to claim 1, characterized in that the process of performing semantic core extraction on resource education data to obtain a domain topic vector cluster that serves as the semantic core includes: converting each text sample in the educational data into a sentence vector using a pre-trained sentence vector encoder; and clustering all sentence vectors using the K-means clustering algorithm, wherein the centroid vector of each cluster is used as a domain topic vector, and all centroid vectors together constitute the domain topic vector cluster.

3. The method according to claim 1, characterized in that the generation of pseudo-instruction-response pairs based on the semantic core and a preset set of instruction templates includes: randomly selecting a domain topic vector from the domain topic vector cluster; retrieving, among the sentence vectors of the original data, the text sample most similar to the selected domain topic vector by calculating the cosine similarity; and filling the content of the text sample into the placeholder of a preset instruction template to form a new instruction, and using the original response of the text sample as the corresponding response to form a pseudo-instruction-response pair.

4. The method according to claim 1, characterized in that the step of performing singular value decomposition on the domain topic vector cluster of the semantic core, extracting semantic direction vectors, and using the direction vectors to initialize the decomposition matrix of the low-rank adapter module includes: performing singular value decomposition on the matrix composed of the domain topic vector cluster to obtain the left singular vector matrix U; selecting the k column vectors of the left singular vector matrix U that correspond to the k largest singular values as the semantic direction vectors; and setting the rank of the low-rank adapter module to k, initializing its decomposition matrix A as a random matrix conforming to a Gaussian distribution, and initializing its decomposition matrix B as a matrix whose column vectors are the k main semantic direction vectors.

5. The method according to claim 1, characterized in that the step of calculating the weighting factor and gradient adjustment coefficient for the training sample based on the entropy value and cross-entropy loss includes: for a training sample, the entropy of the predicted probability distribution of the model response is E, and the cross-entropy loss between the model response and the standard response is L; the formula for calculating the weighting factor w is $w = (1 + L)\,e^{-\beta E}$, where $\beta$ is a preset hyperparameter used to control the strength of the entropy's suppression of the weight; the formula for calculating the gradient adjustment coefficient g is $g = 1 + \alpha L$, where $\alpha$ is a preset positive hyperparameter used to control the sensitivity of the gradient scaling to the cross-entropy loss.

6. The method according to claim 1, characterized in that applying an orthogonality constraint, the strength of which is adjusted by the weighting factor, includes: when calculating the training loss, adding an orthogonal loss term $L_{orth}$; the orthogonal loss term is calculated as $L_{orth} = \lVert B^{\top}B - I \rVert_F^2$, where B is the decomposition matrix of the low-rank adapter module, I is the identity matrix, and $\lVert \cdot \rVert_F$ is the Frobenius norm; the weighting factor w is multiplied by a fixed orthogonal constraint hyperparameter $\lambda$ to obtain the orthogonal constraint strength $w\lambda$, and $w\lambda L_{orth}$ is added to the weighted cross-entropy loss.

7. A lightweight adaptation system for large models, characterized in that it includes the following modules: a module for performing semantic core extraction on resource education data to obtain a domain topic vector cluster that serves as the semantic core, and for generating pseudo-instruction-response pairs based on the semantic core and a preset set of instruction templates and merging them with the original data to form a training dataset; an extraction module for inserting a low-rank adapter module into the transformer layer of a pre-trained large model, performing singular value decomposition on the domain topic vector cluster of the semantic core to extract semantic direction vectors, and using the direction vectors to initialize the decomposition matrix of the low-rank adapter module; a calculation module for calculating, for each training sample during the iterative training of model fine-tuning, the entropy value of the predicted probability distribution of the model response and the cross-entropy loss with respect to the standard response, and for calculating, based on the entropy value and the cross-entropy loss, the weighting factor and the gradient adjustment coefficient for the training sample, wherein the value of the gradient adjustment coefficient is positively correlated with the value of the cross-entropy loss; and an adjustment module for weighting the cross-entropy loss using the weighting factor to obtain the training loss, performing gradient backpropagation based on the training loss, and, when updating the weights of the low-rank adapter module, scaling the gradient using the gradient adjustment coefficient and applying an orthogonality constraint, the strength of which is adjusted by the weighting factor.

8. The system according to claim 7, characterized in that the process of performing semantic core extraction on resource education data to obtain a domain topic vector cluster that serves as the semantic core includes: converting each text sample in the educational data into a sentence vector using a pre-trained sentence vector encoder; and clustering all sentence vectors using the K-means clustering algorithm, wherein the centroid vector of each cluster is used as a domain topic vector, and all centroid vectors together constitute the domain topic vector cluster.

9. The system according to claim 7, characterized in that the generation of pseudo-instruction-response pairs based on the semantic core and a preset set of instruction templates includes: randomly selecting a domain topic vector from the domain topic vector cluster; retrieving, among the sentence vectors of the original data, the text sample most similar to the selected domain topic vector by calculating the cosine similarity; and filling the content of the text sample into the placeholder of a preset instruction template to form a new instruction, and using the original response of the text sample as the corresponding response to form a pseudo-instruction-response pair.

10. The system according to claim 7, characterized in that the step of performing singular value decomposition on the domain topic vector cluster of the semantic core, extracting semantic direction vectors, and using the direction vectors to initialize the decomposition matrix of the low-rank adapter module includes: performing singular value decomposition on the matrix composed of the domain topic vector cluster to obtain the left singular vector matrix U; selecting the k column vectors of the left singular vector matrix U that correspond to the k largest singular values as the semantic direction vectors; and setting the rank of the low-rank adapter module to k, initializing its decomposition matrix A as a random matrix conforming to a Gaussian distribution, and initializing its decomposition matrix B as a matrix whose column vectors are the k main semantic direction vectors.