Large Model Retrieval Enhancement Generation Method and Apparatus
By combining a two-layer differential privacy mechanism with a word-by-word generation loop in RAG, the method reconciles privacy protection with data usability. It enables secure generation in highly privacy-sensitive scenarios, defends against attacks, and preserves both the privacy and the accuracy of the generated data.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANGHAI JIAOTONG UNIV
- Filing Date
- 2026-04-22
- Publication Date
- 2026-06-30
AI Technical Summary
Existing RAG technology has significant security vulnerabilities when processing external knowledge bases containing private or confidential data. It is susceptible to prompt injection or jailbreak attacks, leading to the leakage of private content and making it difficult to deploy and apply at scale in highly privacy-sensitive scenarios.
A large-model retrieval enhancement generation method is adopted. A target index sequence is obtained, the corresponding target data set is determined from the local knowledge base, and the target data set is divided into multiple data sample subsets. A word-by-word generation loop is then executed in which two layers of utility functions are constructed and a differential privacy perturbation is performed on each layer to generate desensitized synthetic data, thus defending against vector inversion attacks and large-model jailbreak attacks.
It provides formalized differential privacy guarantees, effectively defends against attacks, and ensures that the generated de-identified synthetic data can still maintain high accuracy and usability even with a low privacy budget, thus solving the difficult problem of balancing privacy protection and data usability.
Figure CN122309679A_ABST
Description
Technical Field
[0001] This specification relates to the technical field of retrieval-augmented generation with large models, and in particular to a large model retrieval enhancement generation method and apparatus.
Background Technology
[0002] Retrieval-Augmented Generation (RAG) technology effectively enhances a model's knowledge representation capabilities in specific domains by combining external knowledge base retrieval with text output model generation capabilities. Its typical workflow includes: retrieving query-related contextual fragments from an external knowledge base during the retrieval phase; and inputting these contextual fragments as prompts into the text output model to generate an answer during the generation phase.
[0003] However, existing RAG technology presents significant security vulnerabilities when processing external knowledge bases involving private or confidential data. During the generation phase, the retrieved sensitive context is input directly into the model as a prompt, making it vulnerable to prompt injection or jailbreak attacks, which could induce the model to leak private content from the context. These privacy leakage risks in the generation process make it difficult for data holders to securely share private knowledge bases, severely restricting the large-scale deployment and application of RAG technology in highly privacy-sensitive scenarios.
Summary of the Invention
[0004] In view of this, one or more embodiments of this specification provide a large model retrieval enhancement generation method and apparatus.
[0005] To achieve the above objectives, one or more embodiments of this specification provide the following technical solutions:
According to a first aspect of one or more embodiments of this specification, a large model retrieval enhancement generation method is proposed. The method is applied to a target data holder and comprises:
obtaining a target index sequence, determining the target data set corresponding to the target index sequence from the local knowledge base of the target data holder, and dividing the target data set into multiple data sample subsets, wherein the target index sequence is obtained by similarity matching between the data vectors corresponding to the local knowledge base and the query vector corresponding to the user query text;
executing a word-by-word generation loop until a preset termination condition is met, wherein each iteration includes:
for each data sample subset, inputting the corresponding context fragment into the text output model and outputting the conditional probability distribution over the candidate words in the vocabulary; constructing a first-layer utility function based on the conditional probability distribution, and performing a first differential privacy perturbation based on the first-layer utility function to obtain the candidate word of that data sample subset;
determining the relative probability advantage of each candidate word in its corresponding data sample subset, constructing a second-layer utility function based on the relative probability advantage, and performing a second differential privacy perturbation based on the second-layer utility function to obtain the target word of this iteration;
appending the target word of this iteration to the context fragment corresponding to each data sample subset to complete the update, wherein the initial content of each context fragment is its corresponding data sample subset;
when the loop terminates, concatenating the target words determined in each iteration in order to obtain the desensitized synthetic data; and
sending the desensitized synthetic data to the data receiver, so that the data receiver can input it into the large language model as contextual reference information and generate a response to the user query text.
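The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the source does not name the perturbation mechanism, so the exponential mechanism is swapped in here; `toy_model` is a hypothetical stand-in for the text output model; the "relative probability advantage" is assumed to be the ratio of a candidate's probability to the top probability in its subset; duplicate candidates are merged by keeping the larger advantage; the sensitivity value of 1.0 and the termination rule (length cap or end word) are placeholders. No privacy accounting is performed.

```python
import math
import random

def exponential_mechanism(utilities, epsilon, sensitivity, rng):
    """Sample a key with probability proportional to exp(epsilon * u / (2 * sensitivity))."""
    keys = list(utilities)
    top = max(utilities.values())  # subtract the max for numerical stability
    weights = [math.exp(epsilon * (utilities[k] - top) / (2.0 * sensitivity)) for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]

def toy_model(context, vocab):
    """Hypothetical stand-in for the text output model: returns P(word | context)."""
    scores = [1.0 + context.count(w) for w in vocab]  # illustrative scoring only
    total = sum(scores)
    return {w: s / total for w, s in zip(vocab, scores)}

def generate(subsets, vocab, eps1, eps2, max_len, rng, end_word="<eos>"):
    contexts = [list(s) for s in subsets]  # each context fragment starts as its subset
    output = []
    for _ in range(max_len):  # assumed termination: length cap or end word
        advantages = {}
        for ctx in contexts:
            probs = toy_model(ctx, vocab)
            # Layer 1: utility is the conditional probability (assumed choice)
            cand = exponential_mechanism(probs, eps1, 1.0, rng)
            # Assumed relative probability advantage: ratio to the subset's top probability
            adv = probs[cand] / max(probs.values())
            advantages[cand] = max(adv, advantages.get(cand, 0.0))
        # Layer 2: utility is the relative advantage of each subset's candidate
        target = exponential_mechanism(advantages, eps2, 1.0, rng)
        if target == end_word:
            break
        output.append(target)
        for ctx in contexts:
            ctx.append(target)  # update every context fragment with the target word
    return " ".join(output)
```

Calling `generate(subsets, vocab, eps1, eps2, max_len, random.Random(0))` returns the concatenated target words; a real deployment would replace `toy_model` with an actual language model and calibrate the sensitivity and privacy budget formally.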
[0006] According to a second aspect of one or more embodiments of this specification, an electronic device is provided, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor implements the method described in the first aspect by running the executable instructions.
[0007] According to a third aspect of one or more embodiments of this specification, a computer-readable storage medium is provided that stores computer instructions thereon, which, when executed by a processor, implement the steps of the method as described in the first aspect.
[0008] According to a fourth aspect of one or more embodiments of this specification, a computer program product is provided, comprising: a computer program / instructions that, when executed by a processor, implement the method as described in the first aspect.
[0009] As can be seen from the above embodiments, the large model retrieval enhancement generation method and apparatus provided in one or more embodiments of this specification do not directly input the retrieved original sensitive data fragments into the large language model as context when performing retrieval-augmented generation. Instead, the method first obtains the target index sequence produced by privacy-preserving retrieval, determines the corresponding target data set from the local knowledge base, and divides it into multiple data sample subsets. It then executes a word-by-word generation loop. In each iteration, it constructs a first-layer utility function based on the conditional probability distribution output by the text output model and performs a first differential privacy perturbation to obtain candidate words; it then determines the relative probability advantage of the candidate words, constructs a second-layer utility function, and performs a second differential privacy perturbation to obtain the target word, which is appended to the context fragments to complete the update. Once the termination condition is met, the target words determined in each iteration are concatenated to obtain desensitized synthetic data, which is sent to the data receiver and used to generate, through the large language model, a response to the user query text. Because a two-layer differential privacy mechanism is employed in the generation stage, the first layer of perturbation uses the original probability distribution of the language model to initially screen candidate lexical units that meet privacy requirements, ensuring the authenticity of the statistical distribution. The second layer of perturbation introduces a relative probability advantage to perform a secondary evaluation of the candidate lexical units, suppressing extreme-value bias caused by noise and ensuring the semantic coherence of the selected lexical units in the local context. In addition, because the data is divided into multiple subsets that are processed in parallel, subsampling amplifies the privacy protection effect. Therefore, this embodiment can provide formal differential privacy guarantees, defend against vector inversion attacks and large model jailbreak attacks, make the retrieved data usable without being visible, significantly reduce the damage that noise does to semantic logic, and enable the generated desensitized synthetic data to maintain high accuracy and usability even with a low privacy budget. This solves the technical problem of balancing privacy protection and data usability in existing RAG technologies.
Attached Figure Description
[0010] Figure 1 is an architecture diagram of an application scenario for a large model retrieval enhancement generation method provided in an exemplary embodiment.
[0011] Figure 2 is a schematic flowchart of a large model retrieval enhancement generation method provided in an exemplary embodiment.
[0012] Figure 3 is a flowchart of a single iteration of the word-by-word generation loop provided in an exemplary embodiment.
[0013] Figure 4 is a flowchart of another large model retrieval enhancement generation method provided in an exemplary embodiment.
[0014] Figure 5 is a schematic structural diagram of a device provided in an exemplary embodiment.
[0015] Figure 6 is a block diagram of a large model retrieval enhancement generation apparatus provided in an exemplary embodiment.
Detailed Implementation
[0016] To enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this specification, and not all embodiments. Based on the embodiments in this specification, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of this specification.
[0017] The organizational information (including but not limited to organizational equipment information, organizational personal information, etc.) and data (including but not limited to data used for analysis, stored data, and displayed data) involved in this specification are all information and data authorized by the organization or fully authorized by all parties. Furthermore, the collection, use, and processing of such data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and corresponding operation portals are provided so that the organization can choose to grant or refuse authorization.
[0018] As described in the background section, current RAG (Retrieval-Augmented Generation) technologies present significant security vulnerabilities when processing external knowledge bases involving private or confidential data. During the generation phase, retrieved sensitive context is typically input directly into the large language model as a plaintext prompt. This exposes the raw data directly in the model's context window, making it highly vulnerable to prompt injection or jailbreak attacks: some users can use carefully crafted queries to induce the model to ignore security restrictions and directly repeat or leak sensitive information from the context (such as personally identifiable information or trade secrets). In addition, even without active attacks, the model may unintentionally output training data or private fragments from the context during generation due to the "memory effect." Because current technologies lack a mechanism for desensitizing contextual content during the generation phase while maintaining semantic usability, data holders find it difficult to share private knowledge bases while ensuring privacy compliance, which severely restricts the large-scale deployment and application of RAG technology in highly privacy-sensitive scenarios.
[0019] Therefore, to solve the above technical problems, this specification provides a large model retrieval enhancement generation method and apparatus. When performing retrieval-augmented generation, instead of directly inputting the retrieved original sensitive data fragments into the large language model as context, the method first obtains the target index sequence produced by privacy-preserving retrieval, determines the corresponding target data set from the local knowledge base, and divides it into multiple data sample subsets. A word-by-word generation loop is then executed. In each iteration, a first-layer utility function is constructed based on the conditional probability distribution output by the text output model, and a first differential privacy perturbation is performed to obtain candidate words. The relative probability advantage of the candidate words is then determined, a second-layer utility function is constructed, and a second differential privacy perturbation is performed to obtain the target word, which is appended to the context fragments to complete the update. Once the termination condition is met, the target words determined in each iteration are concatenated to obtain desensitized synthetic data, which is sent to the data receiver and used to generate, through the large language model, a response to the user query text. Because the target data set is divided into multiple data sample subsets that are processed in parallel, subsampling amplifies the privacy protection effect. A two-layer differential privacy mechanism is employed in the generation phase. The first layer of perturbation initially screens candidate lexical units that meet privacy requirements based on the original probability distribution of the language model, ensuring the basic authenticity of the statistical distribution.
The second layer of perturbation accounts for the fact that data sample subsets differ in the quantity and content distribution of the original corpus they contain, so the conditional probability distributions they output may differ in magnitude. For example, subsets containing more data or with pronounced semantic features may have higher maximum probability values, while subsets containing less data or with weaker semantic features may have lower maximum probability values even though the information they carry is equally important. If selection were based directly on absolute probability values, high-probability subsets would dominate the generation results, low-probability but crucial information would be neglected, and the noise sensitivity of different subsets would be difficult to measure on a uniform scale. Therefore, the second layer of perturbation introduces a relative probability advantage to perform a secondary evaluation and correction of the candidate lexical units. This eliminates the bias in absolute probability magnitude caused by differences in data scale and distribution between subsets, allowing each subset to compete at an equal confidence level. It not only suppresses extreme-value bias caused by the introduction of noise but also ensures the semantic coherence and representativeness of the selected lexical units in the local context. Thus, while providing strict differential privacy guarantees, it significantly improves the usability and generation quality of the desensitized synthetic data. Furthermore, this embodiment provides formal differential privacy guarantees and defends against vector inversion attacks and large model jailbreak attacks. It makes the retrieved data usable without being visible and significantly reduces the damage that noise does to semantic logic. This allows the generated desensitized synthetic data to maintain high accuracy and usability even with a low privacy budget, solving the technical problem of balancing privacy protection and data usability in existing RAG technologies.
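A small numeric illustration of why a relative measure changes the competition between subsets is given below. The source does not give a formula for the relative probability advantage, so this sketch assumes one possible definition (the candidate's probability divided by the highest probability in the same subset); the subset names, words, and probabilities are invented for illustration.

```python
def relative_advantage(probs, candidate):
    """Assumed definition: candidate probability relative to the subset's top probability."""
    return probs[candidate] / max(probs.values())

# (conditional distribution, candidate chosen by the first-layer perturbation)
subset_outputs = {
    "A": ({"aspirin": 0.80, "ibuprofen": 0.15, "table": 0.05}, "aspirin"),    # large, confident subset
    "B": ({"ibuprofen": 0.40, "aspirin": 0.35, "table": 0.25}, "ibuprofen"),  # small, flatter subset
    "C": ({"aspirin": 0.50, "ibuprofen": 0.45, "table": 0.05}, "table"),      # noise picked a poor word
}

absolute = {name: probs[cand] for name, (probs, cand) in subset_outputs.items()}
relative = {name: relative_advantage(probs, cand) for name, (probs, cand) in subset_outputs.items()}

for name in subset_outputs:
    print(name, subset_outputs[name][1], absolute[name], round(relative[name], 3))
```

On absolute probability, subset A's candidate would dominate subset B's (0.80 against 0.40). Under the assumed relative measure both score 1.0 and compete on equal footing, while subset C's noise-induced candidate scores only about 0.1 and is unlikely to be selected by the second layer.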
[0020] Figure 1 is an architecture diagram of an application scenario for a large model retrieval enhancement generation method provided in an exemplary embodiment. As shown in Figure 1, the application scenario may include a data receiver, multiple data holders, and a third-party server. Both the data receiver and the data holders can be implemented using servers or electronic devices. After obtaining the user's query text, the data receiver can send the query vector corresponding to the query text to the third-party server. The third-party server can also receive the data vectors sent respectively by data holders A to N, where the data vectors of each data holder correspond to its local knowledge base. Upon receiving the query vector corresponding to the user's query text and the data vectors sent by each data holder, the third-party server performs similarity matching between the query vector and the data vectors. Based on the matching results, it obtains a target index sequence and sends this target index sequence to each data holder. Upon receiving the target index sequence, each data holder determines the target data set corresponding to the target index sequence from its local knowledge base and, using a privacy protection mechanism, generates desensitized synthetic data corresponding to the target data set. This desensitized synthetic data is sent to the data receiver, allowing the data receiver to input it into a large language model as contextual reference information and generate a response to the user's query text.
[0021] In some embodiments, the server can be a physical server containing a single host, or it can be a virtual server hosted by a host cluster. Electronic devices can include PCs (Personal Computers), mobile phones, tablets, laptops, PDAs (Personal Digital Assistants), etc., and this specification does not limit this to any particular embodiment. During operation, the electronic device can run a client-side program of an application to implement the application's related functions. This client-side program can be a native application installed on the electronic device, or it can be a mini-program, quick app, or other similar form. Of course, when using web technologies such as HTML5, the related functions can be implemented through a browser interface. This browser can be a standalone browser application or a browser module embedded in some applications.
[0022] The network used for interaction between electronic devices and servers can be wired or wireless, chosen according to the communication methods supported by the corresponding electronic device; this specification does not impose any restrictions on this. For example, a PC can support both wired and wireless communication, so it can use either a wired or a wireless network as needed. Mobile phones, on the other hand, typically support only wireless communication, so they use wireless networks.
[0023] Refer to Figure 2, which is a flowchart of a large model retrieval enhancement generation method provided in this specification. This method is applied to the target data holder and includes the following steps:
S202: obtain the target index sequence, determine the target data set corresponding to the target index sequence from the local knowledge base of the target data holder, and divide it into multiple data sample subsets; the target index sequence is obtained by similarity matching between the data vectors corresponding to the local knowledge base and the query vector corresponding to the user query text.
[0024] The data holder is the party that provides local private knowledge data during the retrieval enhancement generation process. The target data holder can be any data holder providing local private knowledge data. In the embodiments of this specification, there can be multiple or one data holder, without limitation. The target index sequence can be a set of identifiers used to retrieve data entries related to the user's query text from the local knowledge base. This target index sequence helps the data holder quickly find the target data set matching the user's query text. It should be noted that the target data set can be a data set whose corresponding data vector and query vector have a similarity greater than a preset similarity threshold, or it can be a data set whose corresponding data vector and query vector's similarity is within a preset ranking range in all similarity rankings. In some embodiments, the determination process of the target index sequence can be completed in the target data holder, the data receiver, or a third-party server, without limitation. When the determination process of the target index sequence is completed in the target data holder, the target index sequence can be obtained directly from the local machine. When the determination process of the target index sequence is completed in the data receiver or a third-party server, the target index sequence can be obtained from the data receiver or the third-party server. It should be noted that when dividing the target dataset into multiple data sample subsets, the method of division can be selected as needed. For example, the division can be random, and the resulting data sample subsets may be non-overlapping or partially overlapping, without any restrictions.
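The two selection rules and the random division mentioned above can be sketched as follows. This is an illustrative sketch only: the function names `select_indices` and `partition`, the use of plain Python lists, and the round-robin split into non-overlapping subsets are assumptions, not details taken from the source.

```python
import random

def select_indices(similarities, threshold=None, top_k=None):
    """Return entry indices ranked by similarity, filtered by a threshold and/or a rank range."""
    ranked = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)
    if threshold is not None:
        # Rule 1: keep entries whose similarity exceeds the preset threshold
        ranked = [i for i in ranked if similarities[i] > threshold]
    if top_k is not None:
        # Rule 2: keep entries whose rank falls within the preset range
        ranked = ranked[:top_k]
    return ranked

def partition(items, num_subsets, rng):
    """Randomly divide items into non-overlapping subsets of near-equal size."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i::num_subsets] for i in range(num_subsets)]
```

For example, `select_indices([0.9, 0.1, 0.75, 0.4, 0.8], threshold=0.5)` yields the index sequence `[0, 4, 2]`, and the matching entries can then be passed to `partition` with a seeded `random.Random`. Partially overlapping subsets, which the text also permits, would need a different sampling scheme.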
[0025] To protect the privacy of data holders and the data receiver during the retrieval phase, in some embodiments of this specification, the process of similarity matching between the data vectors corresponding to the local knowledge base and the query vector corresponding to the user query text includes: obtaining the query perturbation vector, which carries the query privacy perturbation feature and corresponds to the query vector, and the data perturbation vectors, which carry data privacy perturbation features and correspond to the respective data vectors, where the data vectors belong to multiple data holders including the target data holder; and determining the vector similarity between the query perturbation vector and each of the data perturbation vectors, and determining the target index sequence based on the vector similarity.
[0026] The retrieval stage of retrieval-augmented generation requires calculating the similarity between the user's query text and the data held by the data holder (that is, generating the target index sequence), and calculating this directly on plaintext data could leak the private data of both the data holder and the data receiver. Therefore, in this embodiment, privacy perturbations are added locally to the query vector and to the data vectors, yielding a query perturbation vector and data perturbation vectors. These perturbed vectors are then obtained, the vector similarity between them is calculated, and the target index sequence is determined based on this vector similarity. Because noise is added in the original vector space, the original data cannot be directly reconstructed from the vectors, which protects privacy during the retrieval stage. It should be noted that the method of adding privacy perturbations locally to the query vector and data vectors can be chosen as needed. For example, in some embodiments, the added privacy perturbations can conform to a Laplace distribution or a Gaussian distribution; this is not limited.
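A minimal sketch of adding the perturbation locally is shown below, assuming independent zero-mean noise on each vector component. The function names and the noise scales are hypothetical; the source does not specify how the noise scale is calibrated to a privacy budget, so none is derived here.

```python
import random

def perturb_gaussian(vec, sigma, rng):
    """Add independent zero-mean Gaussian noise with standard deviation sigma to each component."""
    return [x + rng.gauss(0.0, sigma) for x in vec]

def perturb_laplace(vec, scale, rng):
    """Add independent zero-mean Laplace noise with the given scale to each component."""
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution with location 0 and that scale.
    return [
        x + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        for x in vec
    ]
```

Each party would run one of these on its own vectors before sending them out, for example `perturb_gaussian(query_vector, 0.5, random.Random())` on the data receiver side, so that only perturbed vectors leave the local machine.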
[0027] To better distinguish between query vectors and data vectors with added perturbation noise, the perturbation noise added to the query vector can be called a query privacy perturbation feature, and the result can be called a query perturbation vector. Similarly, the perturbation noise added to the data vector can be called a data privacy perturbation feature, and the result can be called a data perturbation vector. It should be noted that when there are multiple data holders, different data holders can be assigned different data privacy perturbation features. In the embodiments of this specification, the local knowledge base corresponding to each data holder may include multiple pieces of original data, each of which corresponds to a data perturbation vector. For ease of distinction, the data perturbation vectors mentioned in the embodiments of this specification generally refer to the data perturbation vectors corresponding to each data holder.
[0028] In some embodiments of this specification, the process of similarity matching between the data vectors corresponding to the local knowledge base and the query vector corresponding to the user query text runs on the target data holder, the data receiver, or a third-party server. When this process runs on the data receiver or a third-party server, the method further includes: applying a data privacy perturbation feature to the data vectors corresponding to the local knowledge base to obtain the data perturbation vector corresponding to the target data holder; and sending the data perturbation vector corresponding to the target data holder to the data receiver or the third-party server.
[0029] In the embodiments of this specification, the process of determining the target index sequence can be performed by the target data holder, the data receiver, or a third-party server. To protect the privacy of all parties, when the process of determining the target index sequence is not performed by the data holder, the data holder applies a data privacy perturbation feature to the data vector corresponding to the local knowledge base, obtaining a data perturbation vector corresponding to the target data holder. Then, the data holder sends this data perturbation vector to the party performing the target index sequence determination. Similarly, when the process of determining the target index sequence is performed by the data holder, the query perturbation vector carrying the query privacy perturbation feature corresponding to the query vector is generated by the data receiver, and the query perturbation vector is sent from the data receiver to the data holder.
[0030] To ensure both privacy protection and search accuracy, in some embodiments of this specification, the vector similarity between the query perturbation vector and each of the data perturbation vectors is determined, including: For each data holder, a similarity compensation parameter for that data holder is determined based on the query privacy perturbation feature and the corresponding data privacy perturbation feature. The similarity compensation parameter is used to correct the similarity between the query perturbation vector and the data perturbation vector corresponding to the data holder, thereby determining the vector similarity between the query perturbation vector and the data perturbation vector corresponding to the data holder.
[0031] Although adding perturbation noise to the original vectors protects privacy, the noise may reduce the accuracy of the retrieval stage. Therefore, in the embodiments of this specification, when calculating the vector similarity between the query perturbation vector and each of the data perturbation vectors, a similarity compensation parameter for the data holder is first determined based on the query privacy perturbation feature and the data privacy perturbation feature corresponding to the data holder. Then, the original similarity between the query perturbation vector and the data perturbation vector corresponding to the data holder is corrected using the similarity compensation parameter to determine the corrected vector similarity. It should be noted that when determining the similarity compensation parameter, the deviation that the two perturbation noises introduce into the similarity between the original vectors can first be determined based on the query privacy perturbation feature and the data privacy perturbation feature. The corresponding similarity compensation parameter is then determined based on this deviation, so as to reduce the impact of the perturbation noise on the original similarity.
[0032] To correct the original similarity more accurately, in some embodiments of this specification, the query privacy perturbation feature conforms to a Gaussian distribution whose variance is the query perturbation variance, and the data privacy perturbation feature conforms to a Gaussian distribution whose variance is the data perturbation variance of the corresponding data holder. Determining a similarity compensation parameter for the data holder based on the query privacy perturbation feature and the data privacy perturbation feature corresponding to the data holder includes, for each data holder: determining a first compensation parameter based on the query perturbation variance and the dimension of the query perturbation vector; determining a second compensation parameter based on the data perturbation variance corresponding to the data holder and the dimension of the data perturbation vector; and determining the similarity compensation parameter of the data holder based on the first compensation parameter and the second compensation parameter.
[0033] The addition of noise alters the magnitude and direction of a vector. When both the query privacy perturbation feature and the data privacy perturbation feature conform to a Gaussian distribution, the resulting change in the original vector can be characterized by the vector's dimension and the variance of the Gaussian distribution. Based on this principle, for each data holder, a first compensation parameter can be determined based on the query perturbation variance and the dimension of the query perturbation vector, and a second compensation parameter can be determined based on the data perturbation variance corresponding to that data holder and the dimension of the data perturbation vector. Finally, a similarity compensation parameter for that data holder is determined based on the first and second compensation parameters. It should be noted that when there are multiple data holders, the data privacy perturbation features corresponding to different data holders can conform to different Gaussian distributions; that is, each data holder can correspond to its own Gaussian distribution. For ease of distinction, the variance of the Gaussian distribution followed by the query privacy perturbation feature is called the query perturbation variance, and the variance of the Gaussian distribution followed by the data privacy perturbation feature is called the data perturbation variance.
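A short derivation shows why the dimension and the variance are enough to describe the change in squared length. It rests on an added assumption that the noise is zero-mean, independent across components, and has the same variance in every component. The symbols q, x_k, n_q, n_k, d, \sigma_q and \sigma_k are introduced here for illustration and do not appear in the source.

```latex
\[
\tilde{q} = q + n_q,\quad n_q \sim \mathcal{N}(0, \sigma_q^2 I_d),
\qquad
\tilde{x}_k = x_k + n_k,\quad n_k \sim \mathcal{N}(0, \sigma_k^2 I_d)
\]
\[
\mathbb{E}\,\|\tilde{q}\|^2
= \|q\|^2 + 2\,\mathbb{E}\langle q, n_q\rangle + \mathbb{E}\,\|n_q\|^2
= \|q\|^2 + d\,\sigma_q^2
\]
\[
\mathbb{E}\,\|\tilde{x}_k\|^2
= \|x_k\|^2 + 2\,\mathbb{E}\langle x_k, n_k\rangle + \mathbb{E}\,\|n_k\|^2
= \|x_k\|^2 + d\,\sigma_k^2
\]
```

Under these assumptions, one natural choice for the first and second compensation parameters would be d\sigma_q^2 and d\sigma_k^2 respectively; the source does not state their exact form.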
[0034] To accurately calibrate the perturbed similarity, in some embodiments of this specification, using the similarity compensation parameter to correct the similarity between the query perturbation vector and the data perturbation vector includes: compensating the modulus of the query perturbation vector based on the first compensation parameter to obtain a first compensated modulus; compensating the modulus of the data perturbation vector corresponding to the data holder based on the second compensation parameter to obtain a second compensated modulus; and, using the first compensated modulus and the second compensated modulus as the moduli in the cosine similarity calculation, correcting the cosine similarity between the query perturbation vector and the data perturbation vector corresponding to the data holder.
[0035] Considering that when calculating the similarity between the query perturbation vector and the data perturbation vector using cosine similarity, the addition of noise will change the magnitude and direction of the vectors, leading to bias in the direct calculation of cosine similarity, and when both the query privacy perturbation feature and the data privacy perturbation feature conform to a Gaussian distribution, the impact of noise on the numerator (i.e., the inner product term) and the denominator (i.e., the magnitude term) is fundamentally different when calculating the cosine similarity of the vectors after adding noise. Specifically, in the numerator of the cosine similarity formula, the inner product expansion includes the inner product of the original vectors, the interaction term between the original vector and the noise, and the product term between the noises. Since the mean of Gaussian noise is 0, and the query noise and database noise are independent, the expected value of all noise-related interaction terms and noise product terms is 0. Therefore, from a statistical average perspective, the inner product value of the numerator does not undergo a systematic change due to the introduction of noise and does not require correction. However, in the denominator of the cosine similarity formula, the calculation of the modulus involves the sum of the squares of each vector component. The square of the noise component is always non-negative, and its accumulation leads to a significant increase in the expected value of the vector length (i.e., noise energy superposition). This systematic shift caused by the squaring operation makes the denominator value larger, which in turn leads to the calculated cosine similarity score being artificially suppressed. Therefore, it is necessary to perform a special covariance correction to address the expansion of the denominator's modulus in order to restore the true similarity. 
Therefore, in the embodiments of this specification, the modulus of the query perturbation vector is first compensated using the first compensation parameter to obtain a first compensated modulus. Then, the modulus of the data perturbation vector corresponding to the data holder is compensated using the second compensation parameter to obtain a second compensated modulus. The first compensated modulus and the second compensated modulus are used as the modulus for calculating the cosine similarity to correct the cosine similarity between the query perturbation vector and the data perturbation vector corresponding to the data holder.
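The asymmetry between numerator and denominator described above can be checked with a short expectation calculation. The derivation below is an illustrative sketch only: the symbols $q$ and $x$ for the original query and data vectors, $n_q$ and $n_x$ for independent zero-mean Gaussian noise with per-coordinate variances $\sigma_q^2$ and $\sigma_x^2$, and $d$ for the vector dimension are notational assumptions and are not taken from the specification's own formulas.

```latex
\begin{aligned}
\tilde q &= q + n_q,\quad n_q \sim \mathcal{N}(0,\ \sigma_q^2 I_d),
\qquad
\tilde x = x + n_x,\quad n_x \sim \mathcal{N}(0,\ \sigma_x^2 I_d) \\[4pt]
\mathbb{E}\big[\langle \tilde q, \tilde x \rangle\big]
 &= \langle q, x \rangle
  + \langle q, \mathbb{E}[n_x] \rangle
  + \langle \mathbb{E}[n_q], x \rangle
  + \mathbb{E}\big[\langle n_q, n_x \rangle\big]
  = \langle q, x \rangle \\[4pt]
\mathbb{E}\big[\lVert \tilde q \rVert^2\big]
 &= \lVert q \rVert^2
  + 2\,\langle q, \mathbb{E}[n_q] \rangle
  + \mathbb{E}\big[\lVert n_q \rVert^2\big]
  = \lVert q \rVert^2 + d\,\sigma_q^2
\end{aligned}
```

Under these assumptions the inner product is unbiased, while each squared modulus is inflated by the dimension times the noise variance. This is consistent with the text's statement that the compensation parameters depend on the perturbation variance and the vector dimension.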
[0036] It should be noted that, because noise may inflate the modulus of the original vector, the modulus of a perturbation vector can be reduced based on its compensation parameter when calculating the first and second compensated moduli. For example, the modulus of the query perturbation vector can be divided by the first compensation parameter, or the first compensation parameter can be subtracted directly from the modulus of the query perturbation vector. In actual operation, however, the first compensation parameter may be greater than the modulus of the query perturbation vector, in which case the subtraction yields a negative number and disrupts the similarity calculation. Therefore, in some embodiments, the first compensated modulus can instead be obtained by summing the first compensation parameter and the modulus of the query perturbation vector. The modulus of the query perturbation vector appears in the denominator of the cosine similarity, so the larger the modulus, the lower the similarity. The first compensation parameter reflects the size of the deviation introduced into the query perturbation vector: the larger the first compensation parameter, the greater that deviation and the less reliable the vector. Enlarging the denominator in this way lowers the similarity score, which reduces the probability that a strongly perturbed query perturbation vector is matched successfully.
[0037] In some embodiments, the modified cosine similarity between the query perturbation vector and the data perturbation vector corresponding to the data holder can be calculated by a formula (the formula image is not reproduced in this text) defined over the following quantities: the query perturbation vector; the k-th data perturbation vector corresponding to the i-th data holder; the modified cosine similarity between the two; the first compensated modulus; the query perturbation variance; the dimension of the query perturbation vector; the second compensation parameter; and the data perturbation variance corresponding to the i-th data holder.
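One possible instantiation of a noise-compensated cosine similarity is sketched below. It is not the specification's formula, which is not available in this text. The sketch assumes that each compensation parameter equals the expected noise energy (dimension times variance) and that compensation is applied by subtracting it from the squared modulus, with clamping to avoid the negative values the text warns about. The function name `corrected_cosine` and the clamp constant are hypothetical.

```python
import math


def corrected_cosine(q_tilde, x_tilde, var_q, var_x, eps=1e-12):
    """Cosine similarity with noise-compensated moduli (illustrative sketch).

    q_tilde, x_tilde: perturbed query and data vectors (equal-length lists).
    var_q, var_x: per-coordinate Gaussian noise variances (assumed known).
    """
    if len(q_tilde) != len(x_tilde):
        raise ValueError("vectors must have the same dimension")
    d = len(q_tilde)

    # Numerator: the inner product is unbiased under zero-mean independent
    # noise, so it is used without correction.
    dot = sum(a * b for a, b in zip(q_tilde, x_tilde))

    # ASSUMPTION: compensation parameter = expected noise energy d * sigma^2.
    comp_q = d * var_q
    comp_x = d * var_x

    # Subtract the expected noise energy from each squared modulus and clamp
    # at a small positive value so the square root stays well defined.
    sq_q = sum(a * a for a in q_tilde)
    sq_x = sum(b * b for b in x_tilde)
    norm_q = math.sqrt(max(sq_q - comp_q, eps))
    norm_x = math.sqrt(max(sq_x - comp_x, eps))

    return dot / (norm_q * norm_x)
```

With zero variances the function reduces to the ordinary cosine similarity. Because the denominator is shrunk, the result can exceed 1 for individual noisy vectors, so a caller may want to clip it to the range [-1, 1] before ranking.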
[0038] To accurately determine the target index sequence, in some embodiments of this specification, determining the target index sequence based on the vector similarity includes: for each data perturbation vector, determining a scoring weight based on the data perturbation variance corresponding to the data perturbation vector, the scoring weight being negatively correlated with the data perturbation variance; obtaining a weighted similarity score for the data perturbation vector based on the scoring weight and the vector similarity corresponding to the data perturbation vector; and determining the target index sequence based on the weighted similarity scores.
[0039] A larger data perturbation variance for a data holder indicates higher uncertainty in that data holder's vectors and therefore lower reliability of their similarity scores. Accordingly, when determining the target index sequence, this embodiment first determines the scoring weight of each data perturbation vector based on the data perturbation variance corresponding to that vector, the scoring weight being negatively correlated with the data perturbation variance. The scoring weight is then multiplied by the vector similarity corresponding to the data perturbation vector to obtain the weighted similarity score of that vector. Finally, the target index sequence is determined based on the weighted similarity scores. This allows fair comparison during cross-database retrieval, prevents a vector from a high-noise database from being ranked too high merely because noise happened to produce a high similarity, and improves the robustness of the retrieval results by reducing the weight of high-noise data.
[0040] To ensure that the weight allocation is well founded and adaptive, in some embodiments of this specification, determining the scoring weight of the data perturbation vector based on the data perturbation variance corresponding to the data perturbation vector includes: calculating the variance sum of the query perturbation variance and the data perturbation variances corresponding to all the data holders; determining the maximum variance among the query perturbation variance and the data perturbation variances corresponding to all the data holders; determining a normalized compensation coefficient for the scoring weights based on the variance sum, the maximum variance, and the dimension of the data perturbation vector; and determining the scoring weight of the data perturbation vector based on the normalized compensation coefficient and the data perturbation variance corresponding to the data perturbation vector.
[0041] In a multidimensional vector space, if Gaussian noise is added independently to each dimension, the total noise variance accumulates linearly with the dimension, so the noise is amplified as the dimension grows. Furthermore, because the data perturbation variances of different data holders differ, the noise levels of the data sources differ, which may make the scoring weights of different data holders unfair: a party with lower noise might be blindly given a high score, or a party with higher noise might be ignored entirely. Therefore, in this embodiment, the scoring weight of the data perturbation vector is determined as follows. First, the variance sum of the query perturbation variance and the data perturbation variances corresponding to all data holders is calculated. Then, the maximum variance among the query perturbation variance and the data perturbation variances corresponding to all data holders is determined. Next, a normalized compensation coefficient for the scoring weight is determined based on the variance sum, the maximum variance, and the dimension of the data perturbation vector. Finally, the scoring weight of the data perturbation vector is determined based on the normalized compensation coefficient and the data perturbation variance corresponding to the data perturbation vector.
[0042] In some embodiments, the scoring weight of the data perturbation vector can be calculated by two formulas (the formula images are not reproduced in this text), one for the normalized compensation coefficient and one for the scoring weight, defined over the following quantities: the normalized compensation coefficient; the dimension of the data perturbation vector; the query perturbation variance; the data perturbation variance corresponding to the i-th data holder; M, the total number of data holders; and the scoring weight of the data perturbation vectors corresponding to the i-th data holder.
[0043] To accurately determine the target index sequence, in some embodiments of this specification, determining the target index sequence based on the weighted similarity scores includes: for each data perturbation vector, calculating the Mahalanobis distance between the query perturbation vector and the data perturbation vector, and obtaining a comprehensive score of the data perturbation vector based on the Mahalanobis distance and the weighted similarity score corresponding to the data perturbation vector; and generating the target index sequence based on the comprehensive scores.
[0044] To enhance the robustness of cross-database retrieval, this embodiment further introduces Mahalanobis distance when determining the target index sequence. Mahalanobis distance considers the covariance structure of the data and can better measure the true distance under noise distribution, making the scoring more consistent with statistical laws. In the specific implementation, for each data perturbation vector, the Mahalanobis distance between the query perturbation vector and the data perturbation vector is calculated. Then, based on the weighted sum of the Mahalanobis distance and the weighted similarity score corresponding to the data perturbation vector, a comprehensive score for the data perturbation vector is obtained. Finally, the target index sequence is generated based on each comprehensive score. In some embodiments, the target index sequence can be generated based on data perturbation vectors with a comprehensive score greater than a preset score threshold, or the comprehensive scores can be sorted in descending order, and then the comprehensive scores at a preset rank can be found, and the target index sequence can be generated based on the data perturbation vectors corresponding to these comprehensive scores at the preset rank.
[0045] In some embodiments, the comprehensive score of the data perturbation vector can be calculated by a formula (the formula image is not reproduced in this text) defined over the following quantities: the comprehensive score; the scoring weight of the data perturbation vectors corresponding to the i-th data holder; the query perturbation vector; the k-th data perturbation vector corresponding to the i-th data holder; the modified cosine similarity between the two; the Mahalanobis distance between the two; and a preset trade-off parameter. The preset trade-off parameter can be set as needed and is not limited herein.
[0046] To further improve the robustness of the target index sequence determination, in some embodiments of this specification, generating the target index sequence based on the comprehensive scores includes: performing multiple rounds of probability iteration on the data perturbation vectors corresponding to the data holders based on the comprehensive scores, where each round of iteration includes: applying an exponential function transformation to the comprehensive score of each candidate vector in the candidate vector set of the current round to obtain a weight value; normalizing the weight values to obtain the probability distribution over the candidate vectors in the current round; sampling according to the probability distribution to determine the selected vector of the current round; and removing the selected vector from the candidate vector set, the candidate vector set of the first round being determined from the data perturbation vectors corresponding to the data holders; and generating the target index sequence based on the selected vectors determined through the multiple rounds of probability iteration.
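The combination of weighted similarity and Mahalanobis distance described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the specification's formula: the noise is taken to be isotropic Gaussian, so the covariance of the difference between the two perturbed vectors is the sum of the two variances times the identity; and the distance is subtracted (scaled by the trade-off parameter) so that a larger distance lowers the score. The function names and the sign convention are hypothetical.

```python
import math


def mahalanobis_isotropic(q_tilde, x_tilde, var_q, var_x):
    """Mahalanobis distance assuming covariance (var_q + var_x) * I."""
    var = var_q + var_x
    if var <= 0:
        raise ValueError("combined variance must be positive")
    sq_diff = sum((a - b) ** 2 for a, b in zip(q_tilde, x_tilde))
    # With covariance var * I, the Mahalanobis distance reduces to the
    # Euclidean distance divided by the standard deviation.
    return math.sqrt(sq_diff / var)


def comprehensive_score(weight, similarity, distance, trade_off):
    """Weighted similarity minus a distance penalty.

    ASSUMPTION: the sign and form of the combination are illustrative;
    the text only states that the two terms are combined as a weighted sum.
    """
    return weight * similarity - trade_off * distance
```

For example, a vector with scoring weight 0.5, corrected similarity 0.8, distance 5.0, and trade-off parameter 0.1 would receive a score of about -0.1 under this convention; only the relative ordering of scores matters for ranking.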
[0047] In the embodiments of this specification, a target index sequence is generated based on various comprehensive scores. Specifically, a multi-round probabilistic iteration method based on the Plackett-Luce model is adopted. In each iteration, the comprehensive score of each candidate vector in the candidate vector set corresponding to that iteration is first calculated and transformed by an exponential function, resulting in a weight value. These weight values are then normalized to obtain the probability distribution of each candidate vector in that iteration. Sampling is then performed according to this probability distribution to determine the selected vector for that iteration, and the selected vector is removed from the candidate vector set. This process is repeated until a sufficient number of vectors are selected. Finally, the target index sequence is generated based on the selected vectors determined by the multi-round probabilistic iteration. This multi-round probabilistic iteration method can smooth out minor fluctuations in scores caused by noise and avoid drastic changes in the ranking results due to random jumps in the scores of individual vectors, thereby providing a more reliable set of retrieval results.
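The multi-round iteration described above (exponential transform, normalize, sample, remove) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `plackett_luce_sample`, the parameter `k` for the number of vectors to select, and the max-subtraction used for numerical stability are assumptions for the sketch and do not come from the specification.

```python
import math
import random


def plackett_luce_sample(scores, k, rng=None):
    """Sample an ordered list of up to k indices without replacement.

    scores: comprehensive score of each candidate vector.
    Each round draws one index with probability proportional to
    exp(score) among the remaining candidates, then removes it.
    """
    rng = rng or random.Random()
    candidates = list(range(len(scores)))
    selected = []
    while candidates and len(selected) < k:
        # Subtracting the maximum leaves the normalized probabilities
        # unchanged but avoids overflow in exp().
        top = max(scores[i] for i in candidates)
        weights = [math.exp(scores[i] - top) for i in candidates]
        total = sum(weights)
        probs = [w / total for w in weights]
        pick = rng.choices(candidates, weights=probs, k=1)[0]
        selected.append(pick)
        candidates.remove(pick)
    return selected
```

Calling `plackett_luce_sample([2.0, 0.5, 1.0, -1.0], 3, random.Random(0))` returns three distinct indices in sampled order. Higher-scoring candidates tend to appear earlier, but the ordering is random rather than a strict sort.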
[0048] To further improve the accuracy and stability of the retrieval results, in some embodiments of this specification, the process of determining the candidate vector set corresponding to the first iteration includes: calculating the mean and variance of the comprehensive scores of the data perturbation vectors corresponding to the data holders; determining the confidence interval span of each comprehensive score based on the mean and the variance; and retaining the data perturbation vectors whose confidence interval span is less than or equal to a preset span threshold to form the candidate vector set corresponding to the first round of iteration.
[0049] In the embodiments of this specification, a confidence interval filtering mechanism is introduced. For each candidate vector, a confidence interval is calculated based on the statistical characteristics of its score. If the confidence interval span is too large, it indicates that the uncertainty of the score is too high (possibly caused by excessive noise), and therefore it is directly eliminated before entering the sorting iteration. This effectively filters out unreliable candidates due to excessive noise, reduces the interference of extreme values on subsequent sorting, and further improves the accuracy and stability of the retrieval results. It should be noted that the preset span threshold can be set as needed and is not limited thereto.
[0050] To further reduce the interference of extreme values, in some embodiments of this specification, calculating the weight value obtained by exponential function transformation of the comprehensive score of each candidate vector in the candidate vector set of the current iteration includes: obtaining a preset upper score threshold and a preset lower score threshold; for each candidate vector in the current iteration, restricting the comprehensive score of the candidate vector to the range between the preset lower threshold and the preset upper threshold to obtain a clipped score; and applying the exponential function transformation to the clipped score to obtain the weight value of the candidate vector in the current iteration.
[0051] In the embodiments of this specification, an extreme value clipping mechanism forcibly limits each score to a reasonable range before the exponential transformation, so that extreme scores (very large or very small values) caused by noise cannot distort the result. Without clipping, an extreme value would dominate the probability distribution after exponential transformation, driving the selection probabilities of the vectors with normal scores toward zero. Therefore, the method for calculating weight values in the embodiments of this specification ensures the smoothness of the probability distribution and enhances the algorithm's tolerance to abnormal noise. It should be noted that the aforementioned preset upper and lower score thresholds can be set as needed and are not limited herein.
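The effect of clipping before the exponential transformation can be illustrated with a short sketch. The function names and the example thresholds are hypothetical; the specification leaves the upper and lower thresholds to be set as needed.

```python
import math


def clipped_exp_weights(scores, lower, upper):
    """Clip each score to [lower, upper], then apply exp()."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    clipped = [min(max(s, lower), upper) for s in scores]
    return [math.exp(c) for c in clipped]


def normalize(weights):
    """Turn non-negative weights into a probability distribution."""
    total = sum(weights)
    return [w / total for w in weights]


# Example: one score is an outlier, possibly produced by noise.
scores = [0.2, 0.5, 50.0]
unclipped = normalize([math.exp(s) for s in scores])
clipped = normalize(clipped_exp_weights(scores, lower=-2.0, upper=2.0))
```

In this example the outlier takes essentially all of the probability mass without clipping, whereas with thresholds of -2 and 2 the other two candidates keep a meaningful chance of being sampled.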
[0052] S204, execute the word-by-word meta-generation loop until the preset termination condition is met.
[0053] After the multiple data sample subsets corresponding to the target dataset are obtained, the word-by-word generation loop is executed until a preset termination condition is met. In some embodiments, the preset termination condition includes the target word of the current loop being a preset terminator, or the length of the concatenated target words reaching a preset maximum generation length. It should be noted that both the preset terminator and the preset maximum generation length can be set as needed and are not limited herein. In some embodiments, the preset terminator can be determined based on the last word segment in the target dataset, and the preset maximum generation length can be determined by the number of word segments included in the target dataset. In some embodiments, whether the loop meets the preset termination condition can also be judged by the large language model itself, which is not limited herein.
[0054] Referring to Figure 3, which is a flowchart of a single iteration of the word-by-word generation loop provided in an embodiment of this specification, the process includes the following steps: S2042, for each subset of data samples, input the corresponding context segment into the text output model, output the conditional probability distribution of each candidate word in the vocabulary, construct the first-layer utility function based on the conditional probability distribution, and perform the first differential privacy perturbation based on the first-layer utility function to obtain the candidate words of the subset of data samples.
[0055] The first-layer utility function is a quantitative indicator of the conditional probability of each candidate word in the first-layer screening. In the embodiments of this specification, it is constructed directly from the conditional probability distribution output by the language model. In some embodiments, the first-layer utility function can be determined by a formula (the formula image is not reproduced in this text) defined over the following quantities: the candidate word; and the conditional probability with which the text output model generates that candidate word given the current context segment.
[0056] After the first-layer utility function is constructed, a first differential privacy perturbation can be performed based on it to obtain the candidate words of the data sample subset. In some embodiments, the first differential privacy perturbation may include: first, calculating the first sensitivity of the first-layer utility function between adjacent datasets; then, based on the first sensitivity and the first preset privacy budget, sampling with noise perturbation over the conditional probabilities of the candidate words to determine the candidate words of the data sample subset. In differential privacy, adjacent datasets are two datasets that differ by one sample. In one example, the adjacent datasets can be the dataset D corresponding to the aforementioned context segment and a dataset D′ that differs from D by one word.
[0057] In some embodiments, the probability of each candidate word being selected after noise perturbation can be calculated by two formulas (the formula images are not reproduced in this text), defined over the following quantities: the first sensitivity; the dataset corresponding to the current context segment; a dataset that differs from it by one word; the probability of selecting r from the multiple candidate words after noise perturbation; and the first preset privacy budget, which can be set as needed. After these probabilities are determined, the candidate words of the data sample subset can be determined by sampling from the resulting probability distribution. For example, if the probability of a word after noise perturbation is 60%, that word has a 60% probability of being selected as a candidate word.
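The quantities listed above (a utility, its sensitivity between adjacent datasets, a privacy budget, and a selection probability) match the ingredients of the standard exponential mechanism from differential privacy. The sketch below implements that standard mechanism, with selection probability proportional to exp(ε·u / (2Δ)). Whether the specification's formula has exactly this form is an assumption, since the formula itself is not available in this text. The function names are hypothetical, and the sensitivity is passed in rather than computed.

```python
import math
import random


def exponential_mechanism_probs(utilities, epsilon, sensitivity):
    """Selection probabilities of the standard exponential mechanism.

    utilities: dict mapping each word to its utility value.
    epsilon: privacy budget (> 0); sensitivity: utility sensitivity (> 0).
    """
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    scale = epsilon / (2.0 * sensitivity)
    # Shift by the maximum utility for numerical stability; the
    # normalized probabilities are unaffected by the shift.
    top = max(utilities.values())
    weights = {w: math.exp(scale * (u - top)) for w, u in utilities.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}


def sample_word(utilities, epsilon, sensitivity, rng=None):
    """Draw one word according to the exponential-mechanism probabilities."""
    rng = rng or random.Random()
    probs = exponential_mechanism_probs(utilities, epsilon, sensitivity)
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]
```

A smaller privacy budget flattens the distribution toward uniform (more privacy, less utility), while a larger budget concentrates probability on the highest-utility word.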
[0058] S2044, determine the relative probability advantage of each candidate word in its corresponding data sample subset, construct a second-layer utility function based on the relative probability advantage, and perform a second differential privacy perturbation based on the second-layer utility function to obtain the target word for this iteration.
[0059] Relative probability advantage refers to the degree to which a candidate lexical unit leads in probability relative to all other candidate lexical units within its own subset of data samples. It is a standardized relative indicator, rather than an absolute probability value. This relative probability advantage eliminates the differences in absolute probability magnitudes between different subsets caused by variations in data volume and semantic complexity. It allows candidate lexical units from different subsets to be compared fairly on the same scale, ensuring that lexical units with a significant leading advantage in the local context are selected, even if their absolute probability values may not be high. The second-level utility function refers to the quantitative indicator constructed based on relative probability advantage for the second-level selection.
[0060] To accurately determine the relative probability advantage, in some embodiments of this specification, determining the relative probability advantage of each candidate word within its corresponding subset of data samples includes: calculating the leading degree of the conditional probability of each candidate word relative to the probability statistical features within the corresponding subset of data samples; and determining the relative probability advantage of each candidate word within its corresponding subset of data samples based on the leading degree.
[0061] In this embodiment, relative probability advantage is defined by quantifying the leading degree of candidate lexical units in the local probability distribution. This leading degree refers to the extent to which the probability value of a candidate lexical unit deviates from the group average. This design avoids the magnitude difference problem that may arise from directly using absolute probability values, making lexical unit selections under different subsets and context lengths comparable. Simultaneously, even if some low-probability lexical units accidentally score higher due to first-layer noise, if they do not have a significant advantage relative to the group mean, they will be suppressed in subsequent screening, thereby improving the fluency of the synthesized text. It should be noted that the probabilistic statistical features are features obtained by statistically calculating the conditional probabilities of candidate lexical units within the data sample subset. In some embodiments, these probabilistic statistical features may include statistical features such as the average, standard deviation, or variance of the conditional probabilities of all candidate lexical units within the data sample subset.
[0062] In some embodiments of this specification, calculating the leading degree of the conditional probability of each candidate word relative to the probability statistical features within the corresponding subset of data samples includes: for each candidate word, determining the probability metric value corresponding to the conditional probability of each candidate word in the subset of data samples corresponding to that candidate word; calculating the average of the probability metric values of all candidate words within that subset of data samples; and determining the leading degree based on the probability metric value of the candidate word and the average.
[0063] To more accurately calculate the leading degree, the probability metric corresponding to each conditional probability can be determined first. This probability metric is the value obtained after quantification based on the conditional probabilities. For example, in some embodiments, the conditional probabilities can be subjected to exponential or logarithmic operations, and the result can be used as the probability metric corresponding to the conditional probability. After obtaining the probability metric of each candidate word, the average of the probability metric of all candidate words in the corresponding subset of data samples is calculated. Then, the leading degree is determined based on the probability metric of the candidate word and the average value. In some embodiments, the leading degree can be determined directly by the ratio or difference between the probability metric of the candidate word and the average value.
[0064] Considering that probability values are typically small, multiplication can easily lead to underflow, while logarithmic probability transforms multiplication into addition, making it easier to calculate and store. Furthermore, logarithmic probability corresponds to self-information, which is more consistent with the measurement of uncertainty in information theory, and helps to more accurately measure the sensitivity of the utility function in differential privacy mechanisms. Therefore, in some embodiments of this specification, the probability metric includes the logarithmic probability corresponding to the conditional probability.
[0065] In some embodiments of this specification, determining the leading degree based on the probability metric value of the candidate word and the average includes: calculating the standard deviation of the probability metric values of all candidate words within the subset of data samples corresponding to the candidate word; determining the target difference between the probability metric value of the candidate word and the average; and determining the leading degree based on the ratio of the target difference to the standard deviation.
[0066] This embodiment further introduces standard deviation for standardization. By normalizing using standard deviation, the influence of different subsets due to varying degrees of data dispersion can be eliminated. For example, the probability distribution may be more concentrated (small standard deviation) in some contexts, while it may be more dispersed (large standard deviation) in others. By dividing by the standard deviation, the leading degree becomes a dimensionless standardized indicator, ensuring that the target lexical units selected in different contexts have a consistent confidence level, further improving the stability of the generated quality.
[0067] In some embodiments, the second-layer utility function can be determined by a formula (the formula image is not reproduced in this text) defined over the following quantities, where r is assumed to be the selected candidate word: the log probability of r obtained when calculating the first-layer utility function; the mean of the log probabilities of all tokens within the subset of data samples corresponding to r; and the standard deviation of the log probabilities of all tokens within the subset of data samples corresponding to r.
[0068] In some embodiments, performing the second differential privacy perturbation based on the second-layer utility function to obtain the target word of the current iteration includes: calculating the second sensitivity of the second-layer utility function between adjacent datasets; sampling with noise perturbation over the deviation of each candidate word based on the second sensitivity and the second preset privacy budget; and determining the target word of the current iteration from the candidate words. In differential privacy, adjacent datasets are two datasets that differ by one sample. In one example, the adjacent datasets can be the dataset D corresponding to the aforementioned context segment and a dataset D′ that differs from D by one word.
[0069] In some embodiments, the probability of each candidate word being selected after noise perturbation can be calculated by a set of formulas (the formula images are not reproduced in this text), defined over the following quantities: the second sensitivity; the dataset corresponding to the current context segment; a dataset that differs from it by one word; the probability of selecting r from the multiple candidate words after noise perturbation; and the second preset privacy budget, which can be set as needed. In some embodiments, a total preset privacy budget can be determined first, and the first and second preset privacy budgets are then set so that their sum equals the total. After the selection probabilities are determined, the target word of the current iteration can be determined by sampling from the resulting probability distribution.
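Based on the quantities listed above (a log probability, the subset mean, and the subset standard deviation) and on the ratio described in paragraph [0065], the second-layer utility plausibly takes the form of a standard score of the log probability. The sketch below computes that standard score. The exact formula in the specification is not available in this text, so the form shown, the use of the population standard deviation, and the zero fallback for a degenerate distribution are assumptions; the function name is hypothetical.

```python
import math


def second_layer_utility(probs, r, tol=1e-12):
    """Standardized log-probability lead of word r within its subset.

    probs: dict mapping each word to its conditional probability (> 0).
    Returns (log p_r - mean of log p) / std of log p.
    """
    log_p = {w: math.log(p) for w, p in probs.items()}
    n = len(log_p)
    mean = sum(log_p.values()) / n
    # ASSUMPTION: population standard deviation over the subset.
    std = math.sqrt(sum((v - mean) ** 2 for v in log_p.values()) / n)
    if std < tol:
        # All words are (numerically) equally likely: no word leads.
        return 0.0
    return (log_p[r] - mean) / std
```

Because the result is dimensionless, a word that leads strongly in a flat distribution and a word that leads strongly in a peaked distribution receive comparable utilities, which is the comparability property the text describes.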
[0070] S2046, the target lexical units of this iteration are appended to the context segments corresponding to each subset of data samples to complete the update; the initial content of each context segment is its corresponding subset of data samples.
[0071] After obtaining the target lexical unit corresponding to this iteration, the target lexical unit can be appended to the context segment corresponding to each of the data sample subsets to complete the update. The initial content of each context segment is its corresponding data sample subset.
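Steps S2042 to S2046 can be tied together in a compact sketch of the generation loop. This is an illustration under several assumptions, not the specification's implementation: both perturbation layers use the standard exponential mechanism; sensitivities are supplied as parameters rather than computed; when several subsets propose the same candidate, the larger advantage is kept; the terminator is not emitted; and `toy_model` is a hypothetical stand-in for the text output model.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "<eos>"]


def toy_model(context):
    """Hypothetical stand-in for the text output model."""
    k = len(context) % len(VOCAB)
    return {t: (0.7 if i == k else 0.1) for i, t in enumerate(VOCAB)}


def exp_mech(utilities, epsilon, sensitivity, rng):
    """Sample one key with probability proportional to exp(eps*u/(2*sens))."""
    scale = epsilon / (2.0 * sensitivity)
    top = max(utilities.values())
    items = list(utilities)
    weights = [math.exp(scale * (utilities[t] - top)) for t in items]
    return rng.choices(items, weights=weights, k=1)[0]


def advantage(probs, token):
    """Standardized log-probability lead of token within its subset."""
    logs = {t: math.log(p) for t, p in probs.items()}
    mean = sum(logs.values()) / len(logs)
    std = math.sqrt(sum((v - mean) ** 2 for v in logs.values()) / len(logs))
    return 0.0 if std < 1e-12 else (logs[token] - mean) / std


def generate(subsets, model, eps1, eps2, sens1, sens2, eos, max_len, rng):
    # The initial content of each context segment is its data sample subset.
    contexts = [list(s) for s in subsets]
    output = []
    while len(output) < max_len:
        second_layer = {}
        for ctx in contexts:
            probs = model(ctx)
            # First layer: utility is the conditional probability itself.
            cand = exp_mech(probs, eps1, sens1, rng)
            adv = advantage(probs, cand)
            # ASSUMPTION: keep the larger advantage for duplicate candidates.
            second_layer[cand] = max(adv, second_layer.get(cand, adv))
        # Second layer: utility is the relative probability advantage.
        target = exp_mech(second_layer, eps2, sens2, rng)
        if target == eos:
            break
        output.append(target)
        for ctx in contexts:
            ctx.append(target)  # update every context segment
    return output


tokens = generate([["a", "b"], ["c"]], toy_model, 1.0, 1.0, 1.0, 1.0,
                  "<eos>", 5, random.Random(0))
```

The returned list corresponds to the target words that are later concatenated in loop order. The privacy budget consumed per generated word in this sketch is the sum of `eps1` and `eps2`, in line with the budget split described in paragraph [0069].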
[0072] S206, when the loop terminates, the target words determined in each loop are concatenated in sequence to obtain the de-identified synthetic data.
[0073] After the word-by-word generation loop stops, the target words determined in each loop can be concatenated in loop order to obtain the de-identified synthetic data corresponding to the target dataset. This de-identified synthetic data is text data that does not contain the original sensitive information of the target dataset but retains its semantic logic. By using this de-identified synthetic data in place of the target dataset as the context input of the large language model, the data holder's private data can be made usable but not visible.
[0074] S208, the de-identified synthetic data is sent to the data receiver so that the data receiver can use it as contextual reference information to input into the large language model and generate a response result for the user query text.
[0075] After obtaining the de-identified synthetic data, the data holder sends it to the data receiver, so that the data receiver can input it into the large language model as contextual reference information, thereby generating a response result for the user query text.
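The loop control in steps S204 to S206 can be sketched as follows. The sketch covers only the bookkeeping: appending each target word to every context segment, stopping on a preset end symbol or a preset maximum generation length, and concatenating the target words in loop order. The callback `select_target_word` is a hypothetical stand-in for the two-layer perturbation, and the `"<eos>"` end symbol, the space separator, and the choice to leave the end symbol out of the output are assumptions made for illustration.

```python
from typing import Callable, List


def generate_deidentified_text(
    subsets: List[str],
    select_target_word: Callable[[List[str]], str],
    end_symbol: str = "<eos>",
    max_length: int = 64,
) -> str:
    """Run the word-by-word loop and return the concatenated target words."""
    # The initial content of each context segment is its data sample subset.
    contexts = list(subsets)
    target_words: List[str] = []
    while len(target_words) < max_length:
        # One iteration: pick a single target word shared by all subsets.
        word = select_target_word(contexts)
        if word == end_symbol:
            break
        target_words.append(word)
        # Append the target word to every context segment to complete the update.
        contexts = [ctx + " " + word for ctx in contexts]
    # Concatenate the target words in loop order.
    return " ".join(target_words)
```

Only the returned string leaves the data holder in this sketch; the context segments, which begin as the original data sample subsets, stay local.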
[0076] Refer to Figure 4, which is a flowchart of another large model retrieval enhancement generation method provided in this specification. This method is applied to the data receiver and includes the following steps: S402, obtain the query perturbation vector carrying query privacy perturbation features corresponding to the user query text, and the data perturbation vector carrying data privacy perturbation features corresponding to each data holder.
[0077] S404, determine the vector similarity between the query perturbation vector and each of the data perturbation vectors, and determine the target index sequence based on the vector similarities.
[0078] S406, the target index sequence is sent to each of the data holders so that each of the data holders can determine the target data corresponding to the target index sequence.
[0079] S408, acquire the target data and input it as contextual reference information into the large language model to generate a response result for the user's query text.
[0080] It should be noted that the specific implementation of steps S402 to S408 in this embodiment can refer to the specific implementation of steps S202 to S208 described above, and will not be repeated here. The target data that the data holder determines for the target index sequence can be either the original data in the data holder's local database or the de-identified synthetic data; no limitation is imposed here.
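Steps S402 to S404 can be illustrated with a short sketch. It assumes Gaussian perturbation of the vectors, plain cosine similarity, and a simple top-k ranking to form the target index sequence; the compensation, weighting, and probabilistic selection refinements described for other embodiments are omitted. All function names and parameters are hypothetical.

```python
import math
import random
from typing import List, Sequence


def perturb(vec: Sequence[float], sigma: float, rng: random.Random) -> List[float]:
    """Add independent Gaussian noise N(0, sigma^2) to each coordinate (assumed noise model)."""
    return [x + rng.gauss(0.0, sigma) for x in vec]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def target_index_sequence(
    query_vec: Sequence[float], data_vecs: Sequence[Sequence[float]], top_k: int
) -> List[int]:
    """Indices of the top_k data perturbation vectors, most similar first."""
    scores = [cosine_similarity(query_vec, v) for v in data_vecs]
    ranked = sorted(range(len(data_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]
```

The data receiver only ever handles perturbed vectors and integer indices here; the index sequence is what gets sent back to each data holder in S406.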
[0081] The large model retrieval enhancement generation method provided in this specification does not directly input the retrieved original sensitive data fragments into the large language model as context during retrieval enhancement generation. Instead, it first obtains the target index sequence through privacy-preserving retrieval, determines the corresponding target dataset from the local knowledge base, and merges and divides it into multiple data sample subsets. It then executes a word-by-word generation loop. In each loop, it first constructs a first-layer utility function based on the conditional probability distribution output by the text output model and performs a first differential privacy perturbation to obtain candidate words. It then determines the relative probability advantage of the candidate words, constructs a second-layer utility function, and performs a second differential privacy perturbation to obtain the target word, which it appends to the context fragments to complete the update. After the termination condition is met, it concatenates the target words determined in each loop to obtain de-identified synthetic data, which is sent to the data receiver to generate the response result for the user query text through the large language model. Dividing the target dataset into multiple data sample subsets for parallel processing uses subsampling to amplify the privacy protection effect. A two-layer differential privacy mechanism is employed during the generation phase. The first layer of perturbation screens candidate words that meet the privacy requirements based on the original probability distribution of the language model, ensuring the basic authenticity of the statistical distribution. 
In the second layer of perturbation, considering the differences in the quantity and content distribution of the original corpus contained in different data sample subsets, the conditional probability distributions output by each subset may differ in magnitude. For example, subsets containing more data or with significant semantic features may have higher maximum probability values, while subsets containing less data or with weaker semantic features may have lower maximum probability values, even though the information they carry is equally important. If selection were based directly on absolute probability values, high-probability subsets would dominate the generation results, low-probability but crucial information would be neglected, and the noise sensitivity of different subsets would be difficult to measure uniformly. Therefore, the second layer of perturbation introduces a relative probability advantage to perform secondary evaluation and correction of the candidate words. This effectively eliminates the bias in absolute probability magnitude caused by differences in data scale and distribution between subsets, allowing each subset to compete at an equal confidence level. This not only suppresses extreme value bias caused by the introduction of noise but also ensures the semantic coherence and representativeness of the selected words in the local context. Thus, while providing strict differential privacy guarantees, it significantly improves the usability and generation quality of the de-identified synthetic data. Furthermore, this embodiment provides formal differential privacy guarantees, effectively defending against vector inversion attacks and large model jailbreak attacks. It makes the retrieved data usable but not visible and significantly reduces the damage that noise does to semantic logic. 
This allows the generated de-identified synthetic data to maintain high accuracy and usability even with a low privacy budget, solving the technical problem of balancing privacy protection and data usability in existing RAG technologies.
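The relative probability advantage discussed above can be sketched as a per-subset standardized score. Following the form described in the claims (a log probability as the probability metric value, its difference from the subset average, divided by the subset standard deviation), this illustration computes a z-score of log probabilities within one data sample subset. The function name and the use of the population standard deviation are assumptions.

```python
import math
from typing import Dict


def relative_probability_advantage(cond_probs: Dict[str, float]) -> Dict[str, float]:
    """Standardize log probabilities within one data sample subset.

    `cond_probs` maps each candidate word to its conditional probability
    as output for this subset; every probability must be positive.
    """
    if any(p <= 0.0 for p in cond_probs.values()):
        raise ValueError("probabilities must be positive")
    log_p = {w: math.log(p) for w, p in cond_probs.items()}
    mean = sum(log_p.values()) / len(log_p)
    # Population standard deviation (assumption; a sample estimate would also work).
    var = sum((v - mean) ** 2 for v in log_p.values()) / len(log_p)
    std = math.sqrt(var)
    if std == 0.0:
        # All candidates are equally likely, so none leads the others.
        return {w: 0.0 for w in log_p}
    return {w: (v - mean) / std for w, v in log_p.items()}
```

Because the score is standardized inside each subset, a subset whose probabilities are uniformly ten times smaller yields the same advantages: `{"x": 0.4, "y": 0.1}` and `{"x": 0.04, "y": 0.01}` both give +1 for "x" and -1 for "y". This is the sense in which subsets compete on an equal footing regardless of absolute probability magnitude.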
[0082] Figure 5 is a schematic structural diagram of a device provided in an exemplary embodiment. As shown in Figure 5, device 500 mainly consists of a communication interface 502, a mechanism interface 504, a processor 506, and a data storage 508. These components are interconnected and communicate with each other via a system bus, network, or other connection mechanism 510. The communication interface 502 enables device 500 to communicate with other devices, access networks, and transmission networks via analog or digital modulation. For example, the communication interface 502 may include a chipset and antenna for wireless communication with a radio access network or access point. Furthermore, the communication interface 502 can be a wired interface such as Ethernet, Token Ring, or a USB port, or a wireless interface such as Wi-Fi, Bluetooth, Global Positioning System (GPS), or a wide-area wireless interface (e.g., WiMAX or LTE). Of course, the communication interface 502 can also support other forms of physical layer interfaces and standard or proprietary communication protocols. The communication interface 502 may also include multiple physical communication interfaces, such as Wi-Fi, Bluetooth, and wide-area wireless interfaces.
[0083] Mechanism interface 504 receives mechanism input and provides output to the mechanism. Accordingly, mechanism interface 504 may include input components such as a keypad, keyboard, touch-sensitive or presence-sensitive panel, computer mouse, trackball, joystick, microphone, still camera, and video camera, and output components such as a display screen (which may be combined with a touch-sensitive panel), CRT, LCD, LED, display using DLP technology, printer, and other similar devices known or developed in the future. Mechanism interface 504 may also generate auditory output via speakers, speaker jacks, audio output ports, audio output devices, headphones, and other similar devices known or developed in the future. In some embodiments, mechanism interface 504 may include software, circuitry, or other forms of logic capable of transmitting data to and receiving data from external mechanism input / output devices. Additionally or alternatively, device 500 may support remote access from other devices via communication interface 502 or another physical interface (not shown). Mechanism interface 504 may be configured to receive mechanism input, the position and movement of which may be indicated by an indicator or cursor described herein. Mechanism interface 504 may also be configured as a display device for rendering or displaying text fragments.
[0084] Processor 506 may contain one or more general-purpose processors and / or special-purpose processors.
[0085] Data storage 508 may include one or more volatile and / or non-volatile storage components and may be integrated wholly or partially with processor 506. Data storage 508 may include removable and non-removable components.
[0086] Processor 506 is capable of executing program instructions 518 (e.g., compiled or uncompiled program logic and / or machine code) stored in data storage 508 to perform the various functions described herein. Data storage 508 may contain a non-transitory computer-readable medium on which program instructions are stored, which, when executed by device 500, enable device 500 to perform any methods, processes, or functions disclosed in this specification and / or the accompanying drawings. Execution of program instructions 518 by processor 506 may result in processor 506 using data 512.
[0087] For example, program instructions 518 may include an operating system 522 (e.g., an operating system kernel, device drivers, and / or other modules) installed on device 500 and one or more applications 520 (e.g., a browser, social application, or game application). Similarly, data 512 may include operating system data 516 and application data 514. Operating system data 516 is primarily accessible to the operating system 522, while application data 514 is primarily accessible to one or more applications 520. Application data 514 may reside in a file system that is visible to or hidden from the device 500.
[0088] Application 520 can communicate with operating system 522 through one or more application programming interfaces (APIs). These APIs help application 520 read and / or write application data 514, transmit or receive information via communication interface 502, receive or display information on mechanism interface 504, etc.
[0089] In some terminology, application 520 may be simply referred to as "app". Furthermore, application 520 can be downloaded to device 500 through one or more online app stores or app markets. However, applications can also be installed on device 500 in other ways, such as through a web browser or a physical interface on device 500 (e.g., a USB port).
[0090] Refer to Figure 6. The large model retrieval enhancement generation apparatus can be applied to, for example, the device shown in Figure 5 to implement the technical solution described in this specification. The apparatus may include:
The acquisition module 602, which acquires the target index sequence, determines the target data set corresponding to the target index sequence from the local knowledge base of the target data holder, and divides it into multiple data sample subsets; the target index sequence is obtained by similarity matching between the data vector corresponding to the local knowledge base and the query vector corresponding to the user query text.
The loop module 604, which executes a word-by-word meta-generation loop until a preset termination condition is met; each loop includes: for each data sample subset, inputting the corresponding context fragment into the text output model and outputting the conditional probability distribution of each candidate word in the vocabulary; constructing a first-layer utility function based on the conditional probability distribution and performing a first differential privacy perturbation based on the first-layer utility function to obtain the candidate words of the data sample subset; determining the relative probability advantage of each candidate word in its corresponding data sample subset, constructing a second-layer utility function based on the relative probability advantage, and performing a second differential privacy perturbation based on the second-layer utility function to obtain the target word of this iteration; and appending the target word of this iteration to the context fragment corresponding to each data sample subset to complete the update, where the initial content of each context fragment is its corresponding data sample subset.
The desensitization module 606, which, when the loop terminates, concatenates the target words determined in each loop in order to obtain de-identified synthetic data.
The sending module 608, which sends the de-identified synthetic data to the data receiver, so that the data receiver can input it into the large language model as contextual reference information and generate a response result for the user query text.
[0091] For ease of description, the above apparatus is described by dividing it into various modules or units based on their functions. Of course, when implementing one or more embodiments of this specification, the functions of each module or unit can be implemented in the same or different software and / or hardware, or a module that performs a given function can be implemented by a combination of multiple sub-modules or sub-units, etc. The apparatus embodiments described above are merely illustrative. For example, the division of units is only a logical functional division; in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
[0092] Based on the same concept as the above method, this specification also provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor executes the executable instructions to implement the steps of the large model retrieval enhancement generation method as described in any of the above embodiments.
[0093] Based on the same concept as the methods described above, this specification also provides a computer-readable storage medium having computer instructions stored thereon that, when executed by a processor, implement the steps of the large model retrieval enhancement generation method as described in any of the above embodiments.
[0094] Based on the same concept as the methods described above, this specification also provides a computer program product, including a computer program / instructions that, when executed by a processor, implement the steps of the large model retrieval enhancement generation method as described in any of the above embodiments.
[0095] Those skilled in the art will understand the following. In this specification, the terms "comprising," "including," or any other variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, product, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, product, or apparatus. Without further limitation, the presence of additional identical or equivalent elements in a process, method, product, or apparatus that includes said elements is not excluded.
[0096] In this specification, “a,” “an,” and “the” are not limited to the singular and may also include the plural.
[0097] In this specification, ordinal numbers such as "first," "second," etc., do not necessarily indicate order; they are often used to distinguish between objects. For example, "first server" and "second server" usually refer to two servers. To differentiate between these two servers, they are described as "first server" and "second server." Of course, sometimes these two servers may be the same server.
[0098] In this specification, unless explicitly stated otherwise, "receiving and sending data" does not necessarily mean direct receiving and sending; it can also mean indirect receiving and sending. For example, A receiving data sent by B can be understood as A directly receiving the data sent by B, or it can be understood as A indirectly receiving the data sent by B through other entities such as C. Similarly, B sending data to A can be understood as B sending the data directly to A, or it can be understood as B indirectly sending the data to A through other entities such as C. Here, C can be one entity, or it can be two or more entities.
[0099] In this specification, unless explicitly stated otherwise, the relationships between structures can be direct or indirect. For example, when describing "A is connected to B," unless it is explicitly stated that A and B are directly connected, it should be understood that A can be directly connected to B or indirectly connected to B. Similarly, when describing "A is on top of B," unless it is explicitly stated that A is directly above B (A and B are adjacent and A is above B), it should be understood that A can be directly above B or indirectly above B (A and B are separated by other elements, and A is above B). And so on.
[0100] This specification uses specific terms to describe embodiments thereof. Terms such as "an embodiment," "one embodiment," and / or "some embodiments" refer to a particular feature, structure, or characteristic associated with at least one embodiment of this specification. Therefore, it should be emphasized and noted that references to "an embodiment," "one embodiment," or "an alternative embodiment" in different locations throughout this specification do not necessarily refer to the same embodiment. Furthermore, those skilled in the art can combine and integrate the different embodiments or examples described herein, as well as the features of those different embodiments or examples, without contradiction.
[0101] Although one or more embodiments of this specification provide method steps as described in the embodiments or flowcharts, it is understood that the order of steps listed in the embodiments or flowcharts is only one of many possible execution orders and does not represent the only execution order. Therefore, when the claims involve method steps, any changes or adjustments to the order of such steps, or the parallelism between steps, are also within the scope of protection of the claims.
Claims
1. A large model retrieval enhancement generation method, the method being applied to a target data holder, the method comprising: Obtain the target index sequence, determine the target data set corresponding to the target index sequence from the local knowledge base of the target data holder, and divide it into multiple data sample subsets; the target index sequence is obtained by similarity matching between the data vector corresponding to the local knowledge base and the query vector corresponding to the user query text; Execute a word-by-word meta-generation loop until a preset termination condition is met; each loop includes: For each data sample subset, the corresponding context fragment is input into the text output model, and the conditional probability distribution of each candidate word in the vocabulary is output; Based on the conditional probability distribution, a first-layer utility function is constructed, and a first differential privacy perturbation is performed based on the first-layer utility function to obtain the candidate words of the data sample subset; Determine the relative probability advantage of each candidate word in its corresponding data sample subset, construct a second-layer utility function based on the relative probability advantage, and perform a second differential privacy perturbation based on the second-layer utility function to obtain the target word for this iteration; The target word of this iteration is appended to the context fragment corresponding to each data sample subset to complete the update; the initial content of each context fragment is its corresponding data sample subset; 
When the loop terminates, the target words determined in each loop are concatenated in order to obtain the de-identified synthetic data; The de-identified synthetic data is sent to the data receiver so that the data receiver can input it into the large language model as contextual reference information and generate a response result for the user query text.
2. The method according to claim 1, wherein determining the relative probability advantage of each candidate word in its corresponding data sample subset comprises: Calculate the leading degree of the conditional probability of each candidate word relative to the probability statistical features within the corresponding data sample subset; The relative probability advantage of each candidate word within its corresponding data sample subset is determined based on the leading degree.
3. The method according to claim 2, wherein calculating the leading degree of the conditional probability of each candidate word relative to the probability statistical features within the corresponding data sample subset comprises: For each candidate word, determine the probability metric value corresponding to the conditional probability of each candidate word in the data sample subset corresponding to the candidate word; Calculate the average of the probability metric values of all candidate words within the data sample subset corresponding to the candidate word; The leading degree is determined based on the probability metric value of the candidate word and the average.
4. The method according to claim 3, wherein determining the leading degree based on the probability metric value of the candidate word and the average comprises: Calculate the standard deviation of the probability metric values of all candidate words within the data sample subset corresponding to the candidate word; The target difference between the probability metric value of the candidate word and the average is determined, and the leading degree is determined based on the ratio of the target difference to the standard deviation.
5. The method according to claim 3 or 4, wherein the probability metric value includes the log probability corresponding to the conditional probability.
6. The method according to claim 1, wherein the preset termination condition includes the target word of the current cycle being a preset end symbol or the length of the concatenated target words reaching a preset maximum generation length.
7. The method according to claim 1, wherein the process of performing similarity matching between the data vector corresponding to the local knowledge base and the query vector corresponding to the user query text includes: Obtain the query perturbation vector carrying query privacy perturbation features corresponding to the query vector, and the data perturbation vector carrying data privacy perturbation features corresponding to each data vector; each data vector belongs to multiple data holders, including the target data holder; Determine the vector similarity between the query perturbation vector and each of the data perturbation vectors, and determine the target index sequence based on the vector similarity.
8. The method according to claim 7, wherein determining the vector similarity between the query perturbation vector and each of the data perturbation vectors comprises: For each data holder, a similarity compensation parameter for that data holder is determined based on the query privacy perturbation feature and the corresponding data privacy perturbation feature. The similarity compensation parameter is used to correct the similarity between the query perturbation vector and the data perturbation vector corresponding to the data holder, thereby determining the vector similarity between the query perturbation vector and the data perturbation vector corresponding to the data holder.
9. The method according to claim 8, wherein the query privacy perturbation feature conforms to a Gaussian distribution with the query perturbation variance as the variance, and the data privacy perturbation feature conforms to a Gaussian distribution with the data perturbation variance of the corresponding data holder as the variance, and wherein determining the similarity compensation parameter of the data holder based on the query privacy perturbation feature and the data privacy perturbation feature corresponding to the data holder comprises: For each data holder, a first compensation parameter is determined based on the query perturbation variance and the dimension of the query perturbation vector; A second compensation parameter is determined based on the data perturbation variance corresponding to the data holder and the dimension of the data perturbation vector; Based on the first compensation parameter and the second compensation parameter, the similarity compensation parameter of the data holder is determined.
10. The method according to claim 9, wherein using the similarity compensation parameter to correct the similarity between the query perturbation vector and the data perturbation vector corresponding to the data holder comprises: The modulus of the query perturbation vector is compensated based on the first compensation parameter to obtain the first compensation modulus; The modulus of the data perturbation vector corresponding to the data holder is compensated based on the second compensation parameter to obtain the second compensation modulus; Using the first compensation modulus and the second compensation modulus as the moduli for calculating cosine similarity, the cosine similarity between the query perturbation vector and the data perturbation vector corresponding to the data holder is corrected.
11. The method according to claim 7, wherein determining the target index sequence based on the vector similarity comprises: For each data perturbation vector, a scoring weight is determined based on the data perturbation variance corresponding to the data perturbation vector; based on the scoring weight and the vector similarity corresponding to the data perturbation vector, a weighted similarity score is obtained for the data perturbation vector; the scoring weight is negatively correlated with the data perturbation variance; The target index sequence is determined based on the weighted similarity scores.
12. The method according to claim 11, wherein determining the scoring weight of the data perturbation vector based on the data perturbation variance corresponding to the data perturbation vector comprises: Calculate the sum of the variances of the query perturbation variance and the data perturbation variances corresponding to all the data holders; The maximum variance is determined from the query perturbation variance and the data perturbation variances corresponding to all the data holders; The normalization compensation coefficient for the scoring weights is determined based on the sum of variances, the maximum variance, and the dimension of the data perturbation vector. The scoring weight of the data perturbation vector is determined based on the normalized compensation coefficient and the data perturbation variance corresponding to the data perturbation vector.
13. The method according to claim 11, wherein determining the target index sequence based on each of the weighted similarity scores comprises: For each data perturbation vector, calculate the Mahalanobis distance between the query perturbation vector and the data perturbation vector, and obtain the comprehensive score of the data perturbation vector based on the Mahalanobis distance and the weighted similarity score corresponding to the data perturbation vector; Based on the comprehensive scores, the target index sequence is generated.
14. The method according to claim 13, wherein generating the target index sequence based on each of the comprehensive scores comprises: Based on the comprehensive score, perform multiple rounds of probability iteration on the data perturbation vector corresponding to each of the data holders; Each iteration includes: calculating the weight value of the comprehensive score of each candidate vector in the candidate vector set corresponding to the current iteration after exponential function transformation; normalizing each weight value to obtain the probability distribution of each candidate vector in the current iteration; sampling according to the probability distribution to determine the selected vector of the current iteration; and removing the selected vector of the current iteration from the candidate vector set; the candidate vector set corresponding to the first iteration is determined by the data perturbation vector corresponding to each data holder. The target index sequence is generated based on the selected vector determined through multiple rounds of probability iteration.
15. The method according to claim 14, wherein the process of determining the candidate vector set corresponding to the first round of iteration includes: Calculate the mean and variance of the comprehensive scores of the data perturbation vectors corresponding to each of the data holders; The confidence interval span of each comprehensive score is determined based on the mean and the variance; The data perturbation vectors whose confidence interval span is less than or equal to a preset span threshold are retained to form the candidate vector set corresponding to the first round of iteration.
16. The method according to claim 14, wherein calculating the weight value of the comprehensive score of each candidate vector in the candidate vector set corresponding to the current iteration after exponential function transformation comprises: Obtain the preset upper limit threshold and the preset lower limit threshold for the score; For each candidate vector in this iteration, the comprehensive score of the candidate vector is restricted to the range between the preset lower limit threshold and the preset upper limit threshold to obtain the clipped score; The clipped score is transformed by an exponential function to obtain the weight value of the candidate vector in this iteration.
17. The method according to claim 7, wherein the process of matching the data vector corresponding to the local knowledge base with the query vector corresponding to the user query text is performed in the target data holder, the data receiver, or a third-party server; When the process of matching the similarity between the data vector corresponding to the local knowledge base and the query vector corresponding to the user's query text runs on the data receiver or a third-party server, the method further includes: Apply a data privacy perturbation feature to the data vector corresponding to the local knowledge base to obtain the data perturbation vector corresponding to the target data holder; Send the data perturbation vector corresponding to the target data holder to the data receiver or the third-party server.
18. An electronic device, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor executes the executable instructions to implement the method as described in any one of claims 1-17.
19. A computer-readable storage medium having stored thereon computer instructions that, when executed by a processor, implement the steps of the method as claimed in any one of claims 1-17.
20. A computer program product comprising a computer program / instructions that, when executed by a processor, implement the method as described in any one of claims 1-17.