A differentially private zeroth-order fine-tuning method and system based on low-rank random subspaces and gradient transfer

The method restricts the perturbation direction used in large language model fine-tuning to low-rank random subspaces guided by gradient transfer, and combines sample-by-sample clipping with one-dimensional Gaussian noise added to an aggregated scalar. This addresses the high computational cost and noise dimensionality dependence of differentially private fine-tuning and yields stable, efficient fine-tuning.

CN122241760A · Pending · Publication Date: 2026-06-19 · SOUTHEAST UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SOUTHEAST UNIV
Filing Date
2026-03-24
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing fine-tuning methods for large language models suffer from high computational overhead and noise dimensionality dependence under differential privacy protection, leading to unstable model training and decreased accuracy, especially under large-scale parameter conditions.

Method used

The method is a differential privacy-preserving zero-order fine-tuning method using low-rank random subspaces and gradient transfer. It restricts the perturbation direction to low-rank random subspaces constructed by layer or by module. Guided by gradients computed on a public dataset, it performs two-point finite difference forward computation, sample-by-sample clipping, and aggregation with one-dimensional Gaussian noise, which reduces noise dimensionality dependence and improves training stability.

Benefits of technology

Achieving stable fine-tuning with low overhead and low noise dimensionality dependence under differential privacy constraints improves the training stability and accuracy of large language models, making it suitable for resource-constrained scenarios. In experiments, it demonstrates superior convergence trend and task performance compared to existing methods.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention discloses a differential privacy-preserving zero-order fine-tuning method and system based on low-rank random subspaces and gradient transfer. The method includes: dividing the trainable parameter set of a large language model into multiple parameter blocks by layer or module; constructing a low-rank random subspace for each parameter block; in each iteration, sampling training batch data from the training dataset and sampling a random seed; generating a coefficient matrix for each parameter block based on the random seed, calculating the corresponding gradient matrix using a public dataset, and constructing a perturbation direction based on the low-rank random subspace and the gradient matrix; obtaining a sample-by-sample loss difference scalar using two-point finite difference along the subspace direction, clipping it sample by sample, averaging the clipped scalars into an aggregated scalar, adding one-dimensional Gaussian noise only to the aggregated scalar, and then uniformly scaling the low-rank perturbation with the noisy scalar to complete the parameter update. This invention reduces noise dimensionality dependence while satisfying differential privacy constraints, improving the stability and accuracy of zero-order fine-tuning.

Description

Technical Field

[0001] This invention relates to a differential privacy-preserving zero-order fine-tuning technique based on low-rank random subspaces and gradient transfer, belonging to the fields of differential privacy preservation and deep learning technology.

Background Technology

[0002] With the development of Large Language Models (LLMs) and deep learning technologies, LLMs have been widely applied in scenarios such as text understanding, dialogue generation, code generation, and enhanced question answering. To adapt the model to specific domain tasks and business data, fine-tuning of the pre-trained model is usually required. However, existing first-order fine-tuning methods rely on backpropagation to calculate gradients, resulting in high memory consumption and computational overhead during training, and are difficult to implement efficiently on resource-constrained devices under large-scale parameter conditions.

[0003] Meanwhile, LLM fine-tuning often uses training data containing user behavior, business logs, or sensitive text, posing data leakage and compliance risks. Differential privacy (DP) provides a quantifiable means of privacy protection for model training, but existing DP fine-tuning schemes typically require sample-by-sample clipping of high-dimensional gradients and the addition of multi-dimensional Gaussian noise. The noise energy accumulates with the parameter dimension, which easily degrades model performance, especially in full-parameter fine-tuning of large models.

[0004] To reduce backpropagation overhead, Zero-Order Optimization (ZO) methods have been introduced into fine-tuning large models. These methods estimate gradient directions using only forward loss via two-point finite differences, thus updating parameters without calculating the backpropagation gradient. However, traditional ZO methods typically employ random perturbations across the entire parameter space, making them significantly affected by parameter dimensionality, resulting in large gradient estimation variance and unstable convergence. Furthermore, the addition of differential privacy noise further exacerbates the signal-to-noise ratio reduction, limiting the fine-tuning effect. Therefore, how to reduce noise dimensionality dependence using low-overhead privacy protection mechanisms under differential privacy constraints, thereby improving the stability and accuracy of ZO fine-tuning of large language models while ensuring privacy compliance, remains a pressing technical challenge.

Summary of the Invention

[0005] Purpose of the invention: This invention aims to provide a differential privacy-preserving zero-order fine-tuning method and system based on low-rank random subspaces and gradient transfer, which solves the problems of existing large language model fine-tuning under differential privacy and resource constraints. It achieves zero-order large language model fine-tuning under differential privacy constraints that is low in overhead, low in noise dimensionality dependence, stable, and effective.

[0006] Technical Solution: To achieve the aforementioned objectives, the first aspect of this invention provides a differential privacy-preserving zero-order fine-tuning method based on low-rank random subspaces and gradient transfer. The core idea is: within the zero-order fine-tuning framework, guided by gradients from a public dataset, the perturbation direction is restricted to a low-rank random subspace constructed by layer or by module, thereby reducing noise dimensionality dependence and improving training stability. This method includes the following steps:

[0007] S1. Obtain the large language model to be fine-tuned and the training dataset, and divide the set of trainable parameters of the large language model into multiple parameter blocks according to layers or modules; each parameter block corresponds to a weight matrix or its equivalent reshape representation; the training dataset includes text training data;

[0008] S2. Generate a column orthogonal basis for a low-rank random subspace for each parameter block; the column orthogonal basis is obtained by QR decomposition of the random matrix;

[0009] S3. In each iteration, sample training batch data from the training dataset and sample a random seed; generate a coefficient matrix for each parameter block based on the random seed, calculate the corresponding gradient matrix using the public dataset, and construct the perturbation direction based on the low-rank random subspace and the gradient matrix;

[0010] S4. Perform two-point finite difference forward computation: apply positive and negative perturbations to all parameter blocks respectively, obtain the loss value of each sample, and calculate the directional difference scalar for each sample;

[0011] S5. Perform sample-by-sample clipping on the difference scalars to obtain the clipped difference scalars;

[0012] S6. Average the clipped difference scalars to obtain an aggregated scalar, and add one-dimensional Gaussian noise only to the aggregated scalar to obtain the noisy amplitude;

[0013] S7. Use the noisy amplitude to uniformly scale the low-rank perturbation direction of each parameter block and update the large language model parameters;

[0014] S8. Repeat steps S3 to S7 until the preset number of iterations is completed, and output the fine-tuned large language model that satisfies the differential privacy constraint.

[0015] In a specific implementation, in step S1, the set of trainable parameters is either the full parameter set (full-parameter fine-tuning), a subset of trainable parameters after freezing some layers, or a set of parameters in which multiple tensors are mapped to several matrix parameter blocks through reshaping.

[0016] Preferably, in step S2, for the weight matrix of the i-th parameter block, two column-orthogonal bases of a low-rank random subspace are constructed. Their sizes are given by the row and column dimensions of the weight matrix corresponding to the parameter block and by the low-rank dimension r. The low-rank dimension r is a preset constant or a per-layer value that varies with the layer, and is chosen to limit the rank of the perturbation direction and reduce the zero-order estimation variance.

[0017] Preferably, in step S3, for the i-th parameter block, in the t-th iteration, the coefficient matrix is obtained by sampling from a Gaussian distribution and can be reproduced from the random seed, so that the positive and negative perturbations use the same perturbation direction.
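As a rough illustration of how a seed can make the perturbation direction reproducible, the sketch below regenerates the same Gaussian coefficient matrix from a seed instead of storing it between the positive and negative forward passes. The function name `coefficient_matrix`, the r × r shape, and the way the block index is mixed into the seed are assumptions made for illustration, not details taken from the patent text.

```python
import numpy as np

def coefficient_matrix(seed, block_index, r):
    """Regenerate a Gaussian r x r coefficient matrix from (seed, block_index).

    Hypothetical helper: the r x r shape and the seeding scheme are assumptions.
    """
    rng = np.random.default_rng([seed, block_index])
    return rng.standard_normal((r, r))

seed = 1234
# The same seed yields the same matrix for the positive and negative pass.
S_plus = coefficient_matrix(seed, block_index=0, r=8)
S_minus = coefficient_matrix(seed, block_index=0, r=8)
# A different block index yields a different matrix.
S_other = coefficient_matrix(seed, block_index=1, r=8)
```

Regenerating from the seed trades a small amount of recomputation for not having to keep the coefficient matrices in memory.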

[0018] Preferably, in step S3, for the i-th parameter block, the perturbation direction constructed in the t-th iteration is:

[0019]

[0020] where the weighting factor controls the fusion ratio of the public prior gradient and the zeroth-order private perturbation, the two basis matrices form the column-orthogonal bases of the low-rank random subspace, and the gradient matrix is the one corresponding to the i-th parameter block, calculated using a public dataset.

[0021] Preferably, in step S4, the step size of the two-point finite difference is a preset value and is shared by all parameter blocks in one iteration to ensure the comparability and stability of the difference scalar.

[0022] Preferably, in step S6, the variance parameter of the one-dimensional Gaussian noise is calculated based on a combination of random sampling mechanism and iteration number to meet the given differential privacy budget.

[0023] Preferably, during the iterative training process, a subspace update frequency F is set, where F is a preset positive integer; when the iteration number t satisfies t mod F = 0, the column-orthogonal bases of the low-rank random subspace are regenerated; when t mod F ≠ 0, the orthogonal bases from the previous round are reused to reduce the computational overhead of QR decomposition.

[0024] In a second aspect, this invention provides a differential privacy-preserving zero-order fine-tuning system based on low-rank random subspaces and gradient transfer, used to implement the differential privacy-preserving zero-order fine-tuning method described in the first aspect. The system includes:

[0025] The parameter block partitioning module is used to acquire the large language model to be fine-tuned and the training dataset, determine the set of trainable parameters, and divide the trainable parameters into multiple parameter blocks according to layers or modules; each parameter block corresponds to a weight matrix or its equivalent reshaped representation; the training dataset includes text training data;

[0026] A low-rank random subspace construction module is used to generate a column orthogonal basis for a low-rank random subspace for each parameter block; the column orthogonal basis is obtained by QR decomposition of the random matrix;

[0027] The low-rank perturbation generation module is used to sample training batch data from the training dataset in each iteration and sample random seeds; generate coefficient matrices for each parameter block based on the random seeds, calculate the corresponding gradient matrix using the public dataset, and construct perturbation directions based on the low-rank random subspace and the gradient matrix.

[0028] The finite-difference forward computation module is used to perform two-point finite difference forward computation, apply positive and negative perturbations to all parameter blocks respectively, obtain the loss value of each sample, and calculate the directional difference scalar for each sample.

[0029] The differential privacy clipping module is used to perform sample-by-sample clipping on the difference scalars to obtain the clipped difference scalars;

[0030] The aggregation and noise addition module is used to perform mean aggregation on the clipped difference scalars to obtain an aggregated scalar, and add one-dimensional Gaussian noise only to the aggregated scalar to obtain a noisy amplitude.

[0031] The parameter update module is used to uniformly scale the low-rank perturbation direction of each parameter block using the noisy amplitude and update the parameters of the large language model.

[0032] The training loop control module is used to cyclically control the low-rank perturbation generation module, the finite-difference forward computation module, the differential privacy clipping module, the aggregation and noise addition module, and the parameter update module until the preset number of iterations is completed, and outputs a fine-tuned large language model that satisfies the differential privacy constraints.

[0033] In a third aspect, the present invention provides a computer system including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the computer program is executed by the processor, it implements the steps of the differential privacy-preserving zero-order fine-tuning method based on low-rank random subspaces and gradient transfer described in the first aspect.

[0034] Beneficial effects: Compared with the prior art, the present invention has at least the following beneficial effects:

[0035] 1. This invention divides the trainable parameter set of a large language model into multiple parameter blocks by layer or module, and constructs a low-rank random subspace for each parameter block. By restricting the perturbation direction within the low-rank random subspace by layer or module, the variance and direction error of the zero-order gradient estimation can be reduced, thereby improving convergence stability.

[0036] 2. This invention utilizes publicly available datasets to calculate the corresponding gradient matrix, constructs perturbation directions based on low-rank random subspaces and the gradient matrix, and can integrate the real gradients of publicly available data without privacy restrictions as prior guidance. Without consuming the privacy budget of the target private data, it provides the model with a low-variance, accurate descent direction, thereby effectively overcoming the high variance problem of blind search in the early stage of zero-order optimization.

[0037] 3. This invention performs mean aggregation on the clipped difference scalars to obtain aggregated scalars, and adds one-dimensional Gaussian noise only to the aggregated scalars, avoiding noise accumulation caused by adding noise to high-dimensional gradient vectors, which is more conducive to maintaining model accuracy under the same privacy budget.

[0038] 4. This invention uses two-point finite difference forward computation, which does not require backpropagation throughout the entire process. It is suitable for resource-constrained scenarios and can further reduce the overhead of orthogonal basis generation through lazy updates.

[0039] 5. Based on the experimental results, on the OPT-1.3B model, the performance curve of the method of this invention is generally better than that of the comparative method DPZero on the SST-2 and SQuAD datasets, indicating that the method of this invention has a better convergence trend and task performance during training.

Description of the Attached Figures

[0040] Figure 1 is a flowchart of the overall process of an embodiment of the present invention.

[0041] Figure 2 is a flowchart of the construction of a single-layer low-rank random subspace and low-rank perturbation in an embodiment of the present invention.

[0042] Figure 3 is a flowchart of the differential privacy mechanism in an embodiment of the present invention.

[0043] Figure 4 shows the performance of the OPT-1.3B model as a function of training steps when fine-tuned on the SST-2 dataset using the DP-PSZO and DPZero methods, respectively.

[0044] Figure 5 shows the performance of the OPT-1.3B model as a function of training steps when fine-tuned on the SQuAD dataset using the DP-PSZO and DPZero methods, respectively.

Detailed Implementation

[0045] The technical solution of the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments. Those skilled in the art should understand that the following embodiments are only used to illustrate the technical solution of the present invention and are not intended to limit the scope of protection of the present invention.

[0046] As shown in Figure 1, the differential privacy-preserving zero-order fine-tuning method based on low-rank random subspaces and gradient transfer provided by the present invention mainly includes the following steps:

[0047] S1. Parameter block partitioning: Obtain the large language model to be fine-tuned and the training dataset, and divide the set of trainable parameters of the large language model into multiple parameter blocks according to layers or modules; each parameter block corresponds to a weight matrix or its equivalent reshape representation; the training dataset includes text training data;

[0048] S2. Construction of low-rank random subspace: Generate a column orthogonal basis of a low-rank random subspace for each parameter block; the column orthogonal basis is obtained by QR decomposition of the random matrix;

[0049] S3. Low-rank perturbation generation: In each iteration, sample training batch data from the training dataset and sample a random seed; generate a coefficient matrix for each parameter block based on the random seed, calculate the corresponding gradient matrix using the public dataset, and construct the perturbation direction based on the low-rank random subspace and the gradient matrix;

[0050] S4. Finite-difference forward computation: Perform two-point finite difference forward computation, apply positive and negative perturbations to all parameter blocks respectively, obtain the loss value of each sample, and calculate the directional difference scalar for each sample;

[0051] S5. Differential privacy clipping: Perform sample-by-sample clipping on the difference scalars to obtain the clipped difference scalars;

[0052] S6. Aggregation and noise addition: Average the clipped difference scalars to obtain an aggregated scalar, and add one-dimensional Gaussian noise only to the aggregated scalar to obtain the noisy amplitude;

[0053] S7. Parameter update: Use the noisy amplitude to uniformly scale the low-rank perturbation direction of each parameter block and update the parameters of the large language model;

[0054] S8. Training loop control: Repeat steps S3 to S7 until the preset number of iterations is completed, and output the fine-tuned large language model that satisfies the differential privacy constraint.

[0055] This embodiment addresses the problems of high backpropagation overhead, strong dimensionality dependence of differential privacy noise, and high variance of zero-order estimation in existing large language model fine-tuning. Without performing backpropagation on the private training data, it achieves efficient zero-order fine-tuning while satisfying differential privacy constraints by constructing low-rank random subspace perturbations by layer or by module and combining them with a mechanism of per-sample scalar clipping followed by aggregation and noise addition. Unlike existing technologies, this embodiment does not directly apply high-dimensional random perturbations to the full parameter space, nor does it perform per-sample clipping and multi-dimensional noise addition on high-dimensional gradient vectors. Instead, it uses the gradient on the public dataset as a prior, restricts the perturbation to a low-rank random subspace constructed by layer or by module, and adds one-dimensional Gaussian noise only to the aggregated per-sample directional difference scalars, thereby reducing noise dimensionality dependence and improving training stability and fine-tuning accuracy.

[0056] The specific implementation of each step is described below as an example.

[0057] The parameter block partitioning in step S1 is as follows: first, obtain the large language model to be fine-tuned and the training dataset D. Divide the set of trainable parameters of the model into multiple parameter blocks by layer or module. For the i-th parameter block, its parameters can be represented as a weight matrix with a given row dimension and column dimension. If the original parameters are not in matrix form, they can be converted into an equivalent matrix representation using tensor reshaping. In practical applications, the set of trainable parameters can be the full parameter set (full-parameter fine-tuning), a subset of trainable parameters after freezing some layers, or a set of parameters in which multiple tensors are mapped to several matrix parameter blocks through reshaping.
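A minimal sketch of the reshaping part of step S1, assuming the trainable parameters are available as a dictionary of NumPy arrays. The function name `to_matrix_blocks` and the reshape rule (keep the first axis as rows and flatten the rest; turn vectors into single-column matrices) are assumptions for illustration; the text above only requires some equivalent matrix representation.

```python
import numpy as np

def to_matrix_blocks(params):
    """Map each named tensor to a 2-D matrix block (hypothetical reshape rule)."""
    blocks = {}
    for name, tensor in params.items():
        if tensor.ndim < 2:
            # Scalars and vectors become single-column matrices.
            blocks[name] = tensor.reshape(-1, 1)
        else:
            # Keep the first axis as rows, flatten all remaining axes into columns.
            blocks[name] = tensor.reshape(tensor.shape[0], -1)
    return blocks

# Toy parameter set; the names are illustrative only.
params = {
    "layer0.weight": np.zeros((4, 6)),
    "layer0.bias": np.zeros(4),
    "conv.kernel": np.zeros((8, 3, 2, 2)),
}
blocks = to_matrix_blocks(params)
shapes = {name: block.shape for name, block in blocks.items()}
```

Because `reshape` does not change the number of elements, each block can be mapped back to its original tensor shape after an update.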

[0058] The construction of the low-rank random subspace in step S2 is as follows: for the i-th parameter block, two random matrices are generated and each is subjected to QR decomposition to obtain two column-orthogonal bases. Here r is the low-rank dimension, which is a preset constant or a layer-specific value that varies with the layer, chosen to limit the rank of the perturbation direction and reduce the zero-order estimation variance. The two bases together define the low-rank random subspace corresponding to the parameter block of this layer. To reduce the computational overhead of repeatedly generating orthogonal bases, a subspace update frequency F can be set: the two bases are regenerated when training step t satisfies t mod F = 0; otherwise, the subspace bases from the previous round are reused.
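The sketch below shows one way step S2 and the lazy refresh could look in NumPy: a Gaussian random matrix is passed through reduced QR to get orthonormal columns, and the pair of bases is redrawn only when t mod F = 0. The helper names, the basis shapes (rows × r and columns × r), and the Gaussian choice for the random matrices are assumptions for illustration.

```python
import numpy as np

def random_orthonormal_basis(rows, r, rng):
    """Return a rows x r matrix with orthonormal columns via reduced QR."""
    q, _ = np.linalg.qr(rng.standard_normal((rows, r)))
    return q

def maybe_refresh_bases(t, F, shape, r, rng, cached=None):
    """Redraw the pair (U, V) when t mod F == 0, otherwise reuse the cached pair."""
    if cached is None or t % F == 0:
        m, n = shape
        return (random_orthonormal_basis(m, r, rng),
                random_orthonormal_basis(n, r, rng))
    return cached

rng = np.random.default_rng(0)
bases = None
history = []
for t in range(6):
    bases = maybe_refresh_bases(t, F=3, shape=(16, 12), r=4, rng=rng, cached=bases)
    history.append(bases)
U, V = bases
```

With F = 3, the QR decomposition runs at steps 0 and 3 only, and the steps in between reuse the cached pair.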

[0059] The low-rank perturbation generation in step S3 is as follows: in each training iteration t, a mini-batch is sampled from the training dataset D, where b is the batch size. Then a random seed is sampled and, based on the random seed, a small-scale coefficient matrix is generated for each parameter block. Meanwhile, a batch of public data is sampled from a public dataset without privacy restrictions, the true gradient of the current model on that public data is calculated using standard first-order backpropagation, and the gradient matrix corresponding to the current parameter block is extracted to serve as a low-variance prior guiding direction. Finally, combining the zero-order exploration on private data with the public prior, the perturbation direction is constructed:

[0060]

[0061] where the weighting factor controls the fusion ratio of the public prior gradient and the zeroth-order private perturbation, and the two basis matrices are the column-orthogonal bases of the low-rank random subspace.
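The exact fusion formula is not legible in this text, so the sketch below uses an assumed additive form, Z = U S Vᵀ + α G, purely to illustrate how a low-rank random term and a public-gradient term could be combined. The names `U`, `V`, `S`, `G`, `alpha` and the additive form itself are assumptions, and `G` is a random stand-in rather than a real gradient.

```python
import numpy as np

def perturbation_direction(U, V, S, G, alpha):
    """Assumed fusion: low-rank random term U S V^T plus alpha times a public gradient G."""
    return U @ S @ V.T + alpha * G

rng = np.random.default_rng(0)
m, n, r = 10, 8, 3
U, _ = np.linalg.qr(rng.standard_normal((m, r)))  # column-orthogonal, m x r
V, _ = np.linalg.qr(rng.standard_normal((n, r)))  # column-orthogonal, n x r
S = rng.standard_normal((r, r))                   # small Gaussian coefficient matrix
G = rng.standard_normal((m, n))                   # stand-in for a public-data gradient

Z_random_only = perturbation_direction(U, V, S, G, alpha=0.0)
Z_mixed = perturbation_direction(U, V, S, G, alpha=0.5)
rank_random_only = np.linalg.matrix_rank(Z_random_only)
```

In this assumed form only the U S Vᵀ term is guaranteed to have rank at most r; whether the gradient term is projected or otherwise restricted in the actual method is not recoverable from the text.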

[0062] The two-point finite difference forward computation in step S4 is as follows: a positive perturbation is applied to all parameter blocks, and one forward pass is performed with the positively perturbed parameters to obtain the loss value of each sample in the mini-batch. Subsequently, a negative perturbation is applied to all parameter blocks, and a second forward pass is performed to obtain the loss value of each sample. Then, for the j-th sample, the sample-by-sample directional difference scalar is calculated:

[0063]

[0064] The step size of the two-point finite difference in step S4 is a preset value shared across all parameter blocks within a single iteration, to ensure the comparability and stability of the difference scalars. The per-sample loss can be a classification cross-entropy, a sequence-to-sequence negative log-likelihood, or a combination thereof, and is determined by the specific large language model task being performed.
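To make step S4 concrete, here is a small sketch on a toy linear model with a per-sample squared loss. It assumes the common central-difference form, (loss at W + step·Z minus loss at W − step·Z) divided by 2·step; the normalization by 2·step, the toy loss, and all names are assumptions, since the formula itself is not legible in this text.

```python
import numpy as np

def per_sample_loss(W, X, y):
    """Squared error of a toy linear model, one value per sample."""
    return 0.5 * (X @ W - y) ** 2

def directional_differences(W, Z, X, y, step):
    """Two forward passes, then a central difference per sample (assumed form)."""
    loss_plus = per_sample_loss(W + step * Z, X, y)
    loss_minus = per_sample_loss(W - step * Z, X, y)
    return (loss_plus - loss_minus) / (2.0 * step)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5))   # mini-batch of 8 samples
y = rng.standard_normal(8)
W = rng.standard_normal(5)        # a single flattened parameter block
Z = rng.standard_normal(5)        # perturbation direction
g = directional_differences(W, Z, X, y, step=1e-3)

# For this quadratic loss the central difference equals the exact
# directional derivative (residual times X @ Z), up to rounding.
exact = (X @ W - y) * (X @ Z)
```

Only forward evaluations of the loss are needed; no gradient of the private loss is ever formed.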

[0065] Step S5, differential privacy clipping, performs a sample-by-sample clipping operation on the directional difference scalar of each sample:

[0066] $\tilde{g}_j^{(t)} = \mathrm{clip}\big(g_j^{(t)}, C\big)$

[0067] where $g_j^{(t)}$ is the difference scalar corresponding to the j-th sample in the t-th iteration, C is the preset clipping threshold, and $\tilde{g}_j^{(t)}$ is the difference scalar after clipping. The clipping function clips an input scalar to the closed interval [-C, C] and is defined as follows:

[0068] $\mathrm{clip}(x, C) = \max\big(-C, \min(x, C)\big)$

[0069] That is, when the difference scalar is less than -C, it is truncated to -C; when it is greater than C, it is truncated to C; when it falls within the interval [-C, C], its original value remains unchanged.
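The clipping (pruning) rule described above maps directly onto `numpy.clip`. The sketch below applies it to a handful of made-up difference scalars with C = 8; the function name and the sample values are illustrative only.

```python
import numpy as np

def clip_differences(g, C):
    """Clip each per-sample difference scalar to the closed interval [-C, C]."""
    return np.clip(g, -C, C)

g = np.array([-12.0, -8.0, 0.5, 8.0, 20.0])  # made-up difference scalars
clipped = clip_differences(g, C=8.0)
```

After clipping, no single sample can contribute more than C in absolute value to the aggregate, which is what bounds the sensitivity of the mean.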

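[0070] In step S6, aggregation and noise addition first take the mean of the clipped difference scalars over the b samples of the mini-batch, where $\tilde{g}_j^{(t)}$ denotes the clipped difference scalar of the j-th sample in the t-th iteration:

[0071] $\bar{g}^{(t)} = \frac{1}{b} \sum_{j=1}^{b} \tilde{g}_j^{(t)}$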

[0072] One-dimensional Gaussian noise is added only to the aggregated scalar to obtain the noisy amplitude:

[0073]

[0074] where the noise standard deviation is determined from the target privacy budget (ε, δ), the sampling rate q = b / |D|, and the total number of training steps T through privacy accounting. Specifically, the accounting is usually based on the Moments Accountant method or the Rényi Differential Privacy (RDP) analysis framework. First, the privacy loss of each iteration is quantified using the privacy amplification by subsampling that results from the sampling rate q. Then, the cumulative privacy loss over T training steps is tracked according to the composition theorem. Finally, a numerical search finds the minimum noise standard deviation that satisfies the overall preset privacy constraint (ε, δ).
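A minimal sketch of the aggregation and noise step, assuming the noise standard deviation has already been produced by a privacy accountant (that calibration is not shown). The function name `noisy_aggregate`, and the choice to add the noise directly to the mean rather than to a sum that is divided afterwards, are assumptions for illustration.

```python
import numpy as np

def noisy_aggregate(clipped, noise_std, rng):
    """Average the clipped scalars and add a single Gaussian draw to the mean."""
    mean = float(np.mean(clipped))
    return mean + float(rng.normal(0.0, noise_std))

rng = np.random.default_rng(0)
clipped = np.array([-8.0, -8.0, 0.5, 8.0, 8.0])

# noise_std is a placeholder value, not one derived from a privacy budget.
g_hat = noisy_aggregate(clipped, noise_std=0.3, rng=rng)

# With zero noise the result is just the mean of the clipped scalars.
g_mean = noisy_aggregate(np.array([1.0, 2.0, 3.0]), noise_std=0.0, rng=rng)
```

Only one random number is drawn per iteration, regardless of how many parameters the model has.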

[0075] The parameter update in step S7 is as follows: the noisy amplitude is used to uniformly scale the low-rank perturbation directions of all parameter blocks, and the parameter update is performed:

[0076]

[0077] where the scalar coefficient of the update step is the learning rate.
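The update formula is not legible in this text; the sketch below assumes a plain descent step in which every block moves by learning rate × noisy amplitude × its own perturbation direction. The function name `apply_update`, the sign convention, and the toy values are assumptions.

```python
import numpy as np

def apply_update(weights, directions, g_hat, lr):
    """Return new blocks W_i - lr * g_hat * Z_i, with one scalar g_hat shared by all blocks."""
    return [W - lr * g_hat * Z for W, Z in zip(weights, directions)]

# Two toy parameter blocks and their perturbation directions.
weights = [np.ones((2, 3)), np.ones((4, 1))]
directions = [np.full((2, 3), 2.0), np.full((4, 1), -1.0)]
updated = apply_update(weights, directions, g_hat=0.5, lr=0.1)
```

Since the same scalar multiplies every direction, the update never needs a per-parameter gradient or per-parameter noise.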

[0078] The training loop control in step S8 is as follows: Repeat steps S3 to S7 until the preset number of training rounds T is completed, and finally output a fine-tuned large language model that satisfies the differential privacy constraint.
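Putting steps S2 to S8 together, the following toy loop runs the whole procedure on a small linear regression problem with a single parameter block. It is a sketch under several assumptions that are not from the patent text: the additive fusion Z = U S Vᵀ + α G, the central-difference normalization, fixed-size sampling without replacement, a hand-picked `noise_std` rather than one derived from a privacy budget, and all function and variable names.

```python
import numpy as np

def dp_zo_loop(X, y, X_pub, y_pub, T=50, b=8, r=2, C=8.0, step=1e-3,
               lr=1e-2, alpha=0.5, noise_std=0.1, F=10, seed=0):
    """Toy DP zeroth-order loop for one weight matrix W of shape (features, outputs)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape[1], y.shape[1]
    W = np.zeros((m, n))
    U = V = None
    for t in range(T):
        if t % F == 0:
            # S2: lazily refresh the column-orthogonal bases.
            U, _ = np.linalg.qr(rng.standard_normal((m, r)))
            V, _ = np.linalg.qr(rng.standard_normal((n, r)))
        # S3: private mini-batch, coefficient matrix, public gradient, direction.
        idx = rng.choice(len(X), size=b, replace=False)
        S = rng.standard_normal((r, r))
        G = X_pub.T @ (X_pub @ W - y_pub) / len(X_pub)
        Z = U @ S @ V.T + alpha * G

        def loss(Wp):
            # Per-sample squared error on the private mini-batch.
            return 0.5 * np.sum((X[idx] @ Wp - y[idx]) ** 2, axis=1)

        # S4: two forward passes and per-sample difference scalars.
        g = (loss(W + step * Z) - loss(W - step * Z)) / (2.0 * step)
        # S5 and S6: clip, average, add one scalar Gaussian draw.
        g_hat = np.mean(np.clip(g, -C, C)) + rng.normal(0.0, noise_std)
        # S7: scale the direction by the noisy amplitude and update.
        W = W - lr * g_hat * Z
    return W

data_rng = np.random.default_rng(1)
W_true = data_rng.standard_normal((5, 3))
X = data_rng.standard_normal((64, 5))
X_pub = data_rng.standard_normal((32, 5))
W_out = dp_zo_loop(X, X @ W_true, X_pub, X_pub @ W_true)
```

The private data is touched only through forward loss evaluations; backpropagation appears only in the public-gradient line, which here has a closed form because the toy model is linear.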

[0079] Figure 2 illustrates the construction process of the single-layer low-rank random subspace and low-rank perturbation used in this invention. As shown in Figure 2, this invention does not directly perform random perturbation in the full-dimensional parameter space. Instead, it constructs a low-rank random subspace spanned by column-orthogonal bases for each parameter block and generates perturbation directions within this subspace. The rank of the perturbation matrix generated by this invention is bounded above, which allows control over the complexity of the perturbation direction. Through this low-rank perturbation method, the invention restricts the perturbation, which would normally operate in a high-dimensional parameter space, to a lower-dimensional, structured random subspace. This reduces the variance of the zero-order finite-difference estimation and minimizes the impact of invalid perturbation directions on the training process, making the estimated update direction closer to the true effective gradient direction.

[0080] Figure 3 is a flowchart of the differential privacy mechanism employed in this invention. As shown in Figure 3, this invention does not perform sample-by-sample clipping and multi-dimensional Gaussian noise addition on the full-dimensional gradient vector, but rather performs clipping, aggregation, and one-dimensional noise addition on the sample-by-sample directional difference scalars.

[0081] Specifically, in each iteration, the directional difference scalar of each sample within the mini-batch is obtained through two-point finite difference. This difference scalar reflects the loss change trend of the corresponding sample along the current low-rank perturbation direction. To limit the maximum impact of a single sample on the update magnitude, each difference scalar is clipped. After clipping, the mean of all clipped scalars is computed, and one-dimensional Gaussian noise is added to the aggregated result. The resulting noisy amplitude is used to uniformly scale the low-rank perturbation directions of all parameter blocks in the current round, thereby completing the parameter update under differential privacy constraints. Compared with existing methods that add noise to high-dimensional gradient vectors, the differential privacy mechanism of this invention adds noise only to a scalar, resulting in a smaller overall noise level under the same privacy budget and preserving useful update signals more effectively.
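A back-of-the-envelope comparison of the two noise mechanisms, under the simplifying assumption that both use the same standard deviation σ (the symbols d, σ, and ξ are introduced here for illustration; in practice σ is calibrated separately for each mechanism):

```latex
\mathbb{E}\,\lVert \boldsymbol{\xi} \rVert_2^2 = d\,\sigma^2
\quad \text{for } \boldsymbol{\xi} \sim \mathcal{N}(0, \sigma^2 I_d),
\qquad
\mathbb{E}\,\xi^2 = \sigma^2
\quad \text{for } \xi \sim \mathcal{N}(0, \sigma^2).
```

The first quantity grows linearly with the number of noised coordinates d, whereas the second does not depend on the parameter count. This compares only the noise at the point of injection, not the error of the final parameter update.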

[0082] The effectiveness of the present invention will be verified by specific experiments below.

[0083] Experiment 1: This example uses the SST-2 text classification dataset as an application scenario to demonstrate the specific application effect of the Differentially Private Public-guided Subspace Zeroth-Order Optimization (DP-PSZO) method proposed in this invention, based on low-rank random subspaces and gradient transfer, in a text classification task. The base model used is OPT-1.3B, and the training data consists of sentences and their sentiment labels. The task objective is to determine the sentiment polarity of the input text.

[0084] In this embodiment, a pre-trained large language model is selected as the object to be fine-tuned. Its trainable layer parameters are divided into multiple parameter blocks, and zero-order fine-tuning is performed using the method of this invention. During training, the settings are: batch size b = 8, low-rank dimension r = 8, clipping threshold C = 8, finite-difference step size 1.2e-3, learning rate 2.5e-6, and target privacy budget (ε, δ) = (6, 1e-5). The number of training rounds, the data sampling method, and the evaluation metrics are kept the same as or comparable to those of the comparison method DPZero.

[0085] Based on the applicant's current experimental results, under the same or comparable differential privacy budget and training settings on the SST-2 dataset, the method of this invention outperforms DPZero. As shown in Figure 4, as the number of training steps increases, the performance curve of the model obtained by the DP-PSZO method is generally higher than that of the model obtained by the DPZero method, indicating that the method of the present invention has a better training effect and convergence trend in text classification tasks.

[0086] Experiment 2: This embodiment uses the SQuAD reading comprehension dataset as an application scenario to demonstrate the specific application effects of the proposed method in generative or extractive question answering tasks. The training data consists of questions, context paragraphs, and corresponding answers. The task objective is to generate or extract the correct answer based on a given question and context.

[0087] In this embodiment, the differentially private zeroth-order fine-tuning framework is the same as in Experiment 1; only the sample input format and the loss definition are adapted to the question-answering task. Comparative results show that, under the same or comparable training settings on the SQuAD dataset, the method of this invention also outperforms DPZero. As shown in Figure 5, as the number of training steps increases, the performance curve of the model obtained with the DP-PSZO method is generally better than that of the model obtained with the DPZero method, indicating that the method of the present invention also trains more effectively and converges better on question answering tasks.

[0088] Experimental results on both the SST-2 and SQuAD datasets show that the method proposed in this invention outperforms DPZero on both a text classification task and a question answering task. This indicates that the proposed techniques of low-rank random subspace perturbation, gradient transfer from public datasets, scalar clipping, and noisy aggregation can effectively improve the accuracy, stability, and generalization ability of differentially private zeroth-order fine-tuning of large language models, thus verifying the feasibility and practical value of the method.

[0089] Based on the same inventive concept, this invention also provides a differential privacy-preserving zeroth-order fine-tuning system based on low-rank random subspaces and gradient transfer, comprising:
a parameter block partitioning module, used to acquire a large language model to be fine-tuned and a training dataset, determine a set of trainable parameters, and partition the trainable parameters into multiple parameter blocks by layer or module, where each parameter block corresponds to a weight matrix or its equivalent reshaped representation, and the training dataset includes text training data;
a low-rank random subspace construction module, used to generate a column-orthogonal basis of a low-rank random subspace for each parameter block, where the column-orthogonal basis is obtained by QR decomposition of a random matrix;
a low-rank perturbation generation module, used to sample training batch data from the training dataset and sample a random seed in each iteration, generate a coefficient matrix for each parameter block based on the random seed, calculate the corresponding gradient matrix using a public dataset, and construct the perturbation direction based on the low-rank random subspace and the gradient matrix;
a difference forward computation module, used to perform two-point finite-difference forward computation, apply positive and negative perturbations to all parameter blocks to obtain the loss for each sample, and calculate the directional difference scalar for each sample;
a differential privacy clipping module, used to perform sample-by-sample clipping on the difference scalars to obtain the clipped difference scalars;
an aggregation and noise addition module, used to perform mean aggregation on the clipped difference scalars to obtain an aggregated scalar, and add only one-dimensional Gaussian noise to the aggregated scalar to obtain a noised amplitude;
a parameter update module, used to uniformly scale the low-rank perturbation direction of each parameter block by the noised amplitude and update the parameters of the large language model; and
a training loop control module, used to iteratively control the low-rank perturbation generation module, the difference forward computation module, the differential privacy clipping module, the aggregation and noise addition module, and the parameter update module until a preset number of iterations is completed, and to output a fine-tuned large language model that satisfies the differential privacy constraints.
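How the modules could interact in one iteration is sketched below on a toy linear model with a single parameter block. This is a minimal illustration under stated assumptions, not the applicant's implementation. The names per_sample_loss, orthonormal_basis and dp_zo_step, and the parameters alpha and sigma, are hypothetical. The rule for fusing the public gradient with the random coefficient matrix is not legible in the text, so the sketch projects the public gradient onto the subspace and mixes it with the random coefficients using weight alpha. That choice, and adding noise of standard deviation sigma * clip_c to the clipped sum, are assumptions.

```python
import numpy as np


def per_sample_loss(w, x, y):
    """Toy loss: squared error of a linear map, one value per sample."""
    return 0.5 * np.sum((x @ w.T - y) ** 2, axis=1)


def orthonormal_basis(dim, rank, rng):
    """Column-orthogonal basis from the reduced QR of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
    return q  # shape (dim, rank)


def dp_zo_step(w, u, v, private, public, seed, *,
               step, lr, clip_c, sigma, alpha, rng):
    """One illustrative iteration for a single parameter block w of shape (m, n)."""
    x, y = private
    xp, yp = public

    # Gradient matrix computed on the public data (analytic for the toy loss).
    g_pub = (xp @ w.T - yp).T @ xp / len(xp)

    # Coefficient matrix regenerated from the seed, so +/- perturbations match.
    a_rand = np.random.default_rng(seed).standard_normal((u.shape[1], v.shape[1]))

    # ASSUMED fusion rule: project the public gradient onto the subspace,
    # rescale it to the norm of the random coefficients, and mix with alpha.
    a_pub = u.T @ g_pub @ v
    a_pub *= np.linalg.norm(a_rand) / (np.linalg.norm(a_pub) + 1e-12)
    z = u @ (alpha * a_pub + (1.0 - alpha) * a_rand) @ v.T  # rank <= r

    # Two-point finite difference: two forward passes, one scalar per sample.
    diffs = (per_sample_loss(w + step * z, x, y)
             - per_sample_loss(w - step * z, x, y)) / (2.0 * step)

    # Per-sample clipping, mean aggregation, one-dimensional Gaussian noise.
    clipped = np.clip(diffs, -clip_c, clip_c)
    noised = (clipped.sum() + rng.normal(0.0, sigma * clip_c)) / len(x)

    # Scale the low-rank direction by the noised amplitude and update.
    return w - lr * noised * z, noised
```

In this sketch the private data influences the update only through the clipped and noised scalar, while the direction z depends on the random seed, the bases, and the public data. A training loop would call dp_zo_step repeatedly with a fresh seed per iteration and regenerate u and v whenever the subspace is refreshed.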

[0090] This invention also provides a computer system, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the computer program is executed by the processor, it implements the steps of the differential privacy-preserving zeroth-order fine-tuning method based on low-rank random subspaces and gradient transfer described in any of the foregoing embodiments.

[0091] The program code used to implement the method of the present invention can be written in any combination of one or more programming languages. This program code can be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing device, such that when executed by the processor or controller, the program code causes the steps of the method of the present invention to be performed. The program code can be executed entirely on the machine, partly on the machine, as a standalone software package partly on the machine and partly on a remote machine, or entirely on a remote machine or server. All aspects not detailed in this invention are well known to those skilled in the art.

[0092] It should be noted that the various embodiments in this specification are described in a progressive manner, and the same or similar parts between the various embodiments can be referred to mutually. For the systems and methods described in this specification, those skilled in the art can make several improvements and modifications without departing from the principles of the invention, and these improvements and modifications should also be considered within the scope of protection of this invention.

Claims

1. A differential privacy-preserving zeroth-order fine-tuning method based on low-rank random subspaces and gradient transfer, characterized in that it includes the following steps:
S1. Obtain the large language model to be fine-tuned and the training dataset, and divide the set of trainable parameters of the large language model into multiple parameter blocks according to layers or modules; each parameter block corresponds to a weight matrix or its equivalent reshaped representation; the training dataset includes text training data;
S2. Generate a column-orthogonal basis of a low-rank random subspace for each parameter block; the column-orthogonal basis is obtained by QR decomposition of a random matrix;
S3. In each iteration, sample training batch data from the training dataset and sample a random seed; generate a coefficient matrix for each parameter block based on the random seed, calculate the corresponding gradient matrix using the public dataset, and construct the perturbation direction based on the low-rank random subspace and the gradient matrix;
S4. Perform two-point finite-difference forward computation, apply positive and negative perturbations to all parameter blocks respectively, obtain the loss for each sample, and calculate the directional difference scalar for each sample;
S5. Perform sample-by-sample clipping on the difference scalars to obtain the clipped difference scalars;
S6. Average the clipped difference scalars to obtain an aggregated scalar, and add one-dimensional Gaussian noise only to the aggregated scalar to obtain the noised amplitude;
S7. Use the noised amplitude to uniformly scale the low-rank perturbation direction of each parameter block and update the parameters of the large language model;
S8. Repeat steps S3 to S7 until the preset number of iterations is completed, and output the fine-tuned large language model that satisfies the differential privacy constraint.

2. The differential privacy-preserving zeroth-order fine-tuning method based on low-rank random subspaces and gradient transfer according to claim 1, characterized in that, in step S1, the set of trainable parameters is the full parameter set used in full fine-tuning, a subset of parameters that remains trainable after some layers are frozen, or a parameter set in which multiple tensors are mapped to several matrix parameter blocks by reshaping.

3. The differential privacy-preserving zeroth-order fine-tuning method based on low-rank random subspaces and gradient transfer according to claim 1, characterized in that, in step S2, for the weight matrix ... of the i-th parameter block, column-orthogonal bases ... and ... of the low-rank random subspace are constructed, where ... and ... denote the row and column dimensions of the weight matrix corresponding to the parameter block, respectively, and r is the low-rank dimension. The low-rank dimension r is a preset constant or a layer-wise value that varies with the layer, and satisfies the condition ..., so as to limit the rank of the perturbation direction and reduce the variance of the zeroth-order estimate.

4. The differential privacy-preserving zeroth-order fine-tuning method based on low-rank random subspaces and gradient transfer according to claim 3, characterized in that, in step S3, for the i-th parameter block in the t-th iteration, the coefficient matrix ... is obtained by sampling from a Gaussian distribution and can be reproduced from the random seed, so that the positive and negative perturbations use the same perturbation direction.

5. The differential privacy-preserving zeroth-order fine-tuning method based on low-rank random subspaces and gradient transfer according to claim 4, characterized in that, in step S3, for the i-th parameter block, the perturbation direction constructed in the t-th iteration is: ...; where ... is the weighting factor used to control the fusion ratio of the public prior gradient and the zeroth-order private perturbation; ... and ... form the column-orthogonal bases of the low-rank random subspace; and ... is the gradient matrix corresponding to the i-th parameter block, calculated using a public dataset.

6. The differential privacy-preserving zeroth-order fine-tuning method based on low-rank random subspaces and gradient transfer according to claim 1, characterized in that, in step S4, the step size of the two-point finite difference is a preset value shared by all parameter blocks within one iteration, to ensure the comparability and stability of the difference scalars.

7. The differential privacy-preserving zeroth-order fine-tuning method based on low-rank random subspaces and gradient transfer according to claim 1, characterized in that, in step S6, the variance parameter of the one-dimensional Gaussian noise is calculated jointly from the random sampling mechanism and the number of iterations, so as to satisfy the given differential privacy budget.

8. The differential privacy-preserving zeroth-order fine-tuning method based on low-rank random subspaces and gradient transfer according to claim 1, characterized in that, during iterative training, a subspace update frequency F is set, where F is a preset positive integer; when the iteration number t satisfies t mod F = 0, the column-orthogonal basis of the low-rank random subspace is updated; when t mod F ≠ 0, the orthogonal basis from the previous round is reused to reduce the computational overhead of QR decomposition.

9. A differential privacy-preserving zeroth-order fine-tuning system based on low-rank random subspaces and gradient transfer, used to implement the differential privacy-preserving zeroth-order fine-tuning method according to any one of claims 1-8, characterized in that the system includes:
a parameter block partitioning module, used to acquire the large language model to be fine-tuned and the training dataset, determine the set of trainable parameters, and divide the trainable parameters into multiple parameter blocks according to layers or modules; each parameter block corresponds to a weight matrix or its equivalent reshaped representation; the training dataset includes text training data;
a low-rank random subspace construction module, used to generate a column-orthogonal basis of a low-rank random subspace for each parameter block; the column-orthogonal basis is obtained by QR decomposition of a random matrix;
a low-rank perturbation generation module, used to sample training batch data from the training dataset and sample a random seed in each iteration, generate a coefficient matrix for each parameter block based on the random seed, calculate the corresponding gradient matrix using the public dataset, and construct the perturbation direction based on the low-rank random subspace and the gradient matrix;
a difference forward computation module, used to perform two-point finite-difference forward computation, apply positive and negative perturbations to all parameter blocks respectively, obtain the loss for each sample, and calculate the directional difference scalar for each sample;
a differential privacy clipping module, used to perform sample-by-sample clipping on the difference scalars to obtain the clipped difference scalars;
an aggregation and noise addition module, used to perform mean aggregation on the clipped difference scalars to obtain an aggregated scalar, and add one-dimensional Gaussian noise only to the aggregated scalar to obtain a noised amplitude;
a parameter update module, used to uniformly scale the low-rank perturbation direction of each parameter block by the noised amplitude and update the parameters of the large language model;
a training loop control module, used to cyclically control the low-rank perturbation generation module, the difference forward computation module, the differential privacy clipping module, the aggregation and noise addition module, and the parameter update module until the preset number of iterations is completed, and to output a fine-tuned large language model that satisfies the differential privacy constraints.

10. A computer system comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the computer program is executed by the processor, it implements the steps of the differential privacy-preserving zeroth-order fine-tuning method based on low-rank random subspaces and gradient transfer as described in any one of claims 1-8.