Code generation method and system based on privacy protection
By employing a multi-module collaboration mechanism and gradient optimization algorithm, privacy information is identified and removed, thus solving the privacy leakage problem in code generation for large language models and achieving a balance between efficient privacy protection and code generation quality.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XICHANG SATELLITE LAUNCH CENT
- Filing Date
- 2026-03-24
- Publication Date
- 2026-06-19
AI Technical Summary
Large language models pose privacy risks in code generation tasks. Existing solutions, such as data cleaning and model retraining, are costly and inefficient, and retraining may degrade model performance.
A multi-module collaborative privacy protection mechanism is adopted. Data parsing, risk assessment, and forget set and retention set management are combined with gradient descent and gradient ascent algorithms, and attention masks and dynamic KL divergence constraints are applied, to identify and remove privacy information and ensure the security of code generation.
While maintaining code generation quality, it significantly improves privacy protection, reduces the risk of user privacy data leakage, and provides stronger user trust and security.
Figure CN122240122A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, and in particular to a privacy-preserving code generation method and system.
Background Technology
[0002] In applications of large language models (LLMs), especially code generation tasks, privacy leakage is an increasingly serious problem. The problem is that the model may unintentionally disclose sensitive information encountered during training, such as personally identifiable information (PII), passwords, and access tokens, when processing input. This privacy leakage risk, known as the "memorization problem," has attracted widespread attention.
[0003] Currently, solutions to this problem mainly include data cleaning and model retraining. However, these methods have significant limitations. Data cleaning typically requires extensive manual rule-making and high engineering costs, while model retraining is not only time-consuming and labor-intensive but may also degrade model performance.
Summary of the Invention
[0004] The purpose of this invention is to provide a privacy-preserving code generation method and system that mitigates privacy leakage in code generation tasks.
[0005] To achieve the above objectives, the present invention provides the following technical solution:
[0006] A privacy-preserving code generation method includes the following steps:
[0007] S10: receive and parse the natural language query input by the user to obtain structured data;
[0008] S20: scan the structured data and evaluate whether the natural language query contains text fragments with privacy information. If so, mark the text fragments and proceed to step S30; if not, store the natural language query in the retention set;
[0009] S30: collect the original natural language query containing the marked text fragments, remove the privacy information from it, send the cleaned natural language query to the retention set, and store the original natural language query in the forget set;
[0010] S40: supplement the natural language queries in the retention set with contextual information;
[0011] S50: perform gradient descent on the retention set, using the natural language queries with supplemented contextual information, to drive the optimized model to generate code based on those queries; perform gradient ascent on the forget set, driving the optimized model to generate code based on natural language queries containing privacy information and then applying reverse optimization so that the model forgets the privacy information; and apply attention masking and dynamic KL divergence constraints.
[0012] The above method first identifies and removes privacy-sensitive information to ensure the security and reliability of the data source. After the code is generated, it checks again for privacy leaks to further ensure the security and reliability of the final generated code. In other words, this method solves the problem of privacy leakage risks that may arise in the practical application of current large-scale language models through multiple protection mechanisms.
[0013] In a further optimized solution, step S60 is included after step S50 to perform a privacy check on the generated code. If there is a privacy leak, the process returns to step S50 to regenerate the code; otherwise, the generated code is output.
[0014] If sensitive information is detected here, the reason is that the forgetting during the training phase is incomplete. Although the model has been trained on the forget set, it may still have residual memory of some frequently occurring privacy information, or when faced with new variant privacy expressions that were not seen during training, it may incorrectly generalize them into the code. Therefore, by using the output review module as the last line of defense to intercept and feed back to the training phase to strengthen forgetting, the security of the generated code can be further improved.
[0015] A privacy-preserving code generation system includes a data parsing module, a risk assessment module, a forget set module, a retention set module, a code generation module, and an output review module.
[0016] The data parsing module is used to receive and parse natural language queries input by users to obtain structured data;
[0017] The risk assessment module is used to scan the structured data and assess whether there are any text fragments containing privacy information in the natural language query. If so, the text fragments are marked; otherwise, the natural language query is stored in the retention set.
[0018] The forget set module is used to collect original natural language queries containing text fragments with privacy information, remove the privacy information from the original natural language queries, send the cleaned natural language queries to the retention set, and store the original natural language queries in the forget set;
[0019] The retention set module is used to supplement natural language queries in the retention set with contextual information;
[0020] The code generation module performs gradient descent on the retention set, using the natural language queries with supplemented contextual information, to drive the optimized model to generate code based on those queries; it performs gradient ascent on the forget set, driving the optimized model to generate code based on natural language queries containing privacy information and then applying reverse optimization so that the model forgets the privacy information; it also applies attention masking and dynamic KL divergence constraints.
[0021] A computer program product includes computer-readable instructions that, when executed by a processor, implement the steps in the privacy-preserving code generation method of the present invention.
[0022] A computer-readable storage medium comprising computer-readable instructions that, when executed by a processor, implement the steps of the privacy-preserving code generation method described in this invention.
[0023] An electronic device includes: a memory storing program instructions; and a processor connected to the memory to execute the program instructions in the memory, thereby implementing the steps in the privacy-preserving code generation method of the present invention.
[0024] Compared to existing technologies, this invention uses a multi-module collaborative privacy protection mechanism. By detecting and filtering potentially sensitive information in the input natural language query, managing two datasets (the forget set and the retention set), and combining gradient descent and gradient ascent algorithms with KL divergence constraints, it "forgets" sensitive information during code generation. These improvements significantly enhance privacy protection while maintaining code generation quality, reducing the risk of user privacy data leakage. Risk assessment and output review allow privacy risks to be detected and corrected at each stage of code generation, providing an efficient privacy review mechanism that ensures the security and accuracy of code generation. Furthermore, the optimized forgetting algorithm removes sensitive information without affecting model generation quality, thus providing stronger user trust and security. This invention is particularly suitable for automated code generation scenarios involving personal data or sensitive business information.
Attached Figure Description
[0025] Figure 1 is a flowchart of the privacy-preserving code generation method of the present invention.
[0026] Figure 2 is a structural block diagram of the privacy-preserving code generation system of the present invention.
[0027] Figure 3 is a structural block diagram of the electronic device in the embodiment.
Detailed Implementation
[0028] The technical solution of the present invention will be further described below with reference to the accompanying drawings and specific embodiments.
[0029] Referring to Figure 1 and Figure 2, this embodiment provides a privacy-preserving code generation system, including a data parsing module, a risk assessment module, a forget set module, a retention set module, a code generation module, and an output review module.
[0030] The data parsing module is used to perform semantic parsing and structured transformation on the natural language queries input by users during the training data preparation process. It identifies the core intent and key functional elements in the natural language query and generates structured data. The structured data includes an "original text" field and structured information fields. The "original text" field retains the complete natural language text input by the user, while the structured information fields contain the parsed information, such as "intent" and "core function."
[0031] The risk assessment module is used to perform privacy scanning on the parsed structured data during the training data screening process. Based on the sensitive word library and regular expressions, it identifies text fragments containing privacy information such as passwords, keys, and personal identification information contained in natural language queries and their locations. Text fragments containing privacy information are marked as high-risk elements and sent to the forget set module.
[0032] The forget set module is used to collect original natural language queries (which can also be called samples) containing text fragments with privacy information during the training data classification process. On the one hand, it removes the privacy information from the original natural language queries and sends the cleaned natural language queries to the retention set. On the other hand, it stores the original natural language queries (containing complete privacy information) into the forget set, which serves as the target dataset for performing gradient ascent operations during subsequent model fine-tuning.
[0033] The retention set module is used to collect all safe natural language queries (including original safe natural language queries and natural language queries cleaned by the forget set) during the training data augmentation process. The original safe natural language queries refer to natural language queries without privacy information identified by the risk assessment module. It also supplements the non-sensitive knowledge base with contextual information such as general programming standards and best practices to form a target dataset for performing gradient descent operations during subsequent model fine-tuning.
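The routing performed by the forget set and retention set modules can be sketched as follows. This is a minimal illustration rather than the patented implementation: the single phone-number pattern stands in for the full risk assessment module, and the `<REDACTED>` placeholder and the function name `route` are assumptions (the text only says that privacy information is removed).

```python
import re

# Hypothetical stand-in for the risk assessment module: one phone-number pattern.
PRIVACY_PATTERN = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")
PLACEHOLDER = "<REDACTED>"  # assumed placeholder; the source only says "remove"


def route(queries):
    """Split queries into a retention set and a forget set."""
    retention_set, forget_set = [], []
    for query in queries:
        if PRIVACY_PATTERN.search(query):
            # The original query (privacy intact) is the gradient ascent target.
            forget_set.append(query)
            # The cleaned copy joins the safe queries used for gradient descent.
            retention_set.append(PRIVACY_PATTERN.sub(PLACEHOLDER, query))
        else:
            retention_set.append(query)
    return retention_set, forget_set


retention, forget = route([
    "Generate a login form",
    "Send an SMS to 13812345678 after signup",
])
print(retention)
print(forget)
```

Note that a query containing privacy information contributes to both sets: its original form to the forget set and its cleaned form to the retention set.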
[0034] The code generation module is used to optimize the parameters of the base model and generate code during model training. The process specifically includes: performing gradient descent on the retention set to drive the optimized model to generate code based on safe natural language queries (with context information supplemented), and learning the correct code generation capabilities; performing gradient ascent on the forget set to drive the optimized model to generate code based on natural language queries containing privacy information (without context information supplemented), and allowing the model to forget the privacy information through reverse optimization; at the same time, attention masks and dynamic KL divergence constraints are used to ensure that the forgetting process does not affect the optimized model's understanding of the code syntax structure.
[0035] The output review module is used to perform a secondary privacy check on the code generated by the optimized model during the inference verification process. That is, it scans the code for whether it contains sensitive content such as hard-coded passwords and keys. If a leak is found, it is fed back to the training process to enhance the forgetting effect. If the review is passed, it means that the code is safe and the code is returned to the user.
[0036] The system first identifies and removes private information to ensure the security and reliability of the data source. After the code is generated, it checks again for privacy leaks to further ensure the security and reliability of the final generated code. In other words, the collaborative work of the modules in the system solves the problem of privacy leakage risks that may arise in the practical application of current large-scale language models through multiple protection mechanisms.
[0037] Moreover, the aforementioned system employs a targeted forgetting mechanism based on attention masks. During the privacy detection phase, it not only identifies the privacy information itself but also locates the node position of the privacy information in the code abstract syntax tree. During the model fine-tuning phase, by constructing an attention mask matrix, the gradient ascent update operation is precisely restricted to the neural connections corresponding to the privacy nodes and their dependent paths, while the core connections supporting the grammatical structure are protected by gradient descent. This mechanism achieves the goal of "forgetting only the privacy content and retaining the grammatical skeleton," effectively avoiding the problem that the traditional gradient ascent method, when "forgetting" privacy information, will also damage the model's ability to understand the code syntax, resulting in grammatical errors in the generated code.
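As a rough illustration of the masked update described above, the toy sketch below applies gradient ascent only to parameters flagged by a mask and gradient descent to the rest. Everything here is an assumption made for illustration: the text does not specify how the attention mask maps to individual parameters, and the flat parameter list, the function name `masked_update`, and the learning rate are hypothetical.

```python
def masked_update(params, grad_forget, grad_retain, mask, lr=0.1):
    """Apply one toy update step.

    mask[i] == 1: parameter i is treated as privacy-related, so it moves up
                  the forget-set loss gradient (gradient ascent).
    mask[i] == 0: parameter i is treated as structural, so it moves down
                  the retention-set loss gradient (gradient descent).
    """
    return [
        p + lr * gf if m else p - lr * gr
        for p, gf, gr, m in zip(params, grad_forget, grad_retain, mask)
    ]


params = [1.0, 1.0, 1.0]
updated = masked_update(
    params,
    grad_forget=[0.5, 0.5, 0.5],
    grad_retain=[0.2, 0.2, 0.2],
    mask=[0, 1, 0],  # only the middle parameter is flagged as privacy-related
)
print(updated)
```

Only the flagged parameter increases the forget-set loss; the unflagged parameters continue to be optimized for ordinary code generation.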
[0038] In the aforementioned data parsing module, received natural language queries, such as "Please generate code for a user registration function," are first formatted and converted into structured information. Specifically, natural language processing (NLP) techniques, such as the BERT model (BERT-base), are used to perform semantic parsing on the input natural language query, identifying key elements (such as "user registration function") to ensure accurate extraction of the functional requirements from the user's needs.
[0039] Based on this example, the structured data obtained after parsing the natural language query is as follows:
[0040] "Original text": "Please generate code for a user registration function",
[0041] "Intent": "Code Generation",
[0042] "Core function": "User Registration",
[0043] "Technical elements": [],
[0044] "Additional conditions": [],
[0045] "Confidence level": 0.96.
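One way to hold this record in code is a small dataclass. The class name `ParsedQuery` and the snake_case field names are illustrative assumptions; the values are those of the example above.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class ParsedQuery:
    """Structured record for one parsed natural language query.

    Field names are illustrative; the source lists the fields only in prose.
    """
    original_text: str
    intent: str
    core_function: str
    technical_elements: list[str] = field(default_factory=list)
    additional_conditions: list[str] = field(default_factory=list)
    confidence: float = 0.0


record = ParsedQuery(
    original_text="Please generate code for a user registration function",
    intent="Code Generation",
    core_function="User Registration",
    confidence=0.96,
)

# Serialize for hand-off to the downstream modules.
print(json.dumps(asdict(record), indent=2))
```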
[0046] When the structured data is used in subsequent modules, the risk assessment module performs a privacy scan on the "original text" field to identify whether it contains privacy information; the retention set module retrieves the most relevant code templates and specifications from the non-sensitive knowledge base based on the "core function" and "technical elements" fields; and the code generation module concatenates the "original text" field (after cleaning) with the supplementary context information as input for model training.
[0047] Since the content of the "original text" field in the parsed structured data is identical to the natural language query entered by the user, the natural language queries handled by the forget set module, retention set module, and code generation module are equivalent to the content of the "original text" field in the structured data.
[0048] In the aforementioned risk assessment module, the assessment of whether natural language text contains privacy information can be performed through screening using a predefined sensitive word library and regular expressions. The specific screening process is as follows: First, the natural language text in the parsed original text field is segmented into sentences and words. Then, the predefined sensitive word library (containing keywords such as "password", "key", "token", "ID card") is traversed for precise matching. Simultaneously, a series of regular expressions for specific privacy patterns are run in parallel, such as regular expression 1[3-9]\d{9} for matching mobile phone numbers, \d{17}[\dXx] for matching ID card numbers, and \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} for matching IPv4 addresses. For successfully matched text fragments, further contextual semantic analysis (such as judging whether a string resembling a password is immediately following the word "password") is used to reduce the false alarm rate. Finally, all text fragments identified as privacy information and their starting positions in the original text are output.
[0049] Regular expressions are tools for matching string patterns, widely used in text processing and data validation. The rule-based sensitive word library contains common sensitive words and data types (such as "phone number," "password," etc.) to quickly identify and label high-risk input, such as checking for sensitive keywords like "phone number," "password," or "address." Once potential privacy information is identified, it is marked as a high-risk element and then passed to the forgetting set module for further processing. If no privacy information is detected, it is directly sent to the retention set. The sensitive word library can be trained and optimized based on open-source sensitive information detection datasets to enhance the accuracy of privacy detection.
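The screening step can be sketched with Python's `re` module. The keyword list and the three patterns are the ones quoted above; the digit lookarounds, the function name `scan`, and the result layout are assumptions added for illustration, and the contextual semantic analysis used to reduce false alarms is omitted.

```python
import re

# Keywords taken from the example list in the text.
SENSITIVE_WORDS = ("password", "key", "token", "ID card")

# Patterns quoted in the text. The lookarounds are an added assumption so that
# a pattern does not fire inside a longer run of digits.
PATTERNS = {
    "mobile": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "id_card": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
    "ipv4": re.compile(r"(?<!\d)\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(?!\d)"),
}


def scan(text):
    """Return (kind, fragment, start) tuples sorted by start position."""
    hits = []
    lowered = text.lower()
    for word in SENSITIVE_WORDS:
        needle = word.lower()
        start = lowered.find(needle)
        while start != -1:
            hits.append(("keyword", text[start:start + len(word)], start))
            start = lowered.find(needle, start + 1)
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group(), match.start()))
    return sorted(hits, key=lambda hit: hit[2])


query = "Send the login password to 13812345678 from server 192.168.1.10"
for hit in scan(query):
    print(hit)
```

Plain substring matching on a word such as "key" would also fire on unrelated words that contain it, which is one reason the text follows matching with contextual analysis.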
[0050] The retention set module is responsible for integrating and providing a non-sensitive knowledge base for code generation. It extracts commonly used code snippets, function templates, and library functions from publicly available code resources to ensure sufficient knowledge reserves during code generation. This data comes from open-source projects (such as the GitHub public dataset) and has been preprocessed and filtered to remove potentially sensitive content. The retention set module is based on the CodeLlama model, fine-tuned to automatically retrieve the most relevant code snippets during code generation. The fine-tuning trains the model on an ordinary code generation dataset (natural language-code pairs) with a cross-entropy loss function, so that the model keeps its ability to write code even while sensitive information is being forgotten during training.
[0051] During training data augmentation, the retention set module performs intent recognition on the safe natural language queries it receives and retrieves the code templates, programming standards, and security best practices most relevant to the query intent from the non-sensitive knowledge base. This knowledge is then added as supplementary context to the safe natural language query, forming an augmented training query. The retention set module uses semantic similarity matching algorithms (such as vector retrieval based on CodeBERT) to determine which knowledge to add, ensuring that the supplementary content is highly relevant to the functional intent of the original request. This allows the model to learn functional implementation and security standards at the same time. The data processed by the retention set module, carrying complete functional context information, is then passed to the code generation module.
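The retrieval step can be illustrated with cosine similarity over embedding vectors. The sketch below is an assumption-laden toy: the three-dimensional vectors are hand-written stand-ins for real embeddings (for example from CodeBERT), the knowledge base entries are invented examples, and placing the retrieved context before the query is an arbitrary choice.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical non-sensitive knowledge base: text -> toy embedding.
KNOWLEDGE_BASE = {
    "Hash passwords with a salted key-derivation function before storage.": [0.9, 0.1, 0.0],
    "Validate uploaded file types and sizes.": [0.1, 0.9, 0.0],
    "Use parameterized SQL queries.": [0.3, 0.2, 0.9],
}


def augment(query, query_vec, top_k=1):
    """Prepend the top_k most similar knowledge entries to the query."""
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda text: cosine(query_vec, KNOWLEDGE_BASE[text]),
        reverse=True,
    )
    return "\n".join(ranked[:top_k] + [query])


print(augment("Please generate code for a user registration function", [1.0, 0.0, 0.0]))
```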
[0052] The code generation module is the core module of the system, responsible for generating code that meets user needs while ensuring that privacy information is not leaked. The module combines gradient descent and gradient ascent algorithms and introduces KL divergence constraints, enabling the optimized model to "forget" sensitive information when outputting code. Specifically, the module achieves privacy protection during code generation through adversarial training of gradient descent and gradient ascent: gradient descent is performed on the retention set to minimize the cross-entropy loss between the generated and target code, allowing the model to learn correct code generation capabilities; gradient ascent is performed on the forget set to maximize the model's prediction loss for privacy information (such as "123456"), allowing the model to gradually "forget" this sensitive content. Simultaneously, KL divergence is introduced as a constraint to measure the difference between the model's output distribution on the forget set and the reference distribution on the retention set. By minimizing this difference, it ensures that the overall output distribution of the model does not deviate from the normal code generation trajectory during gradient ascent. Furthermore, by using attention masks to locate text fragments containing privacy information, the gradient ascent update operation is mainly focused on privacy-related neural connections, while protecting the core connections that support the grammatical structure, thereby achieving the optimization goal of "accurately forgetting privacy content while preserving the code's grammatical skeleton".
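The three terms described above can be combined into a single scalar objective to minimize. The sketch below is one plausible reading, not the formula used by the invention: the source gives no explicit equation, so the sign convention, the KL direction, the fixed weight `kl_weight` (the "dynamic" schedule is not specified), and the toy next-token distributions are all assumptions.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the target token."""
    return -math.log(probs[target_index])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def unlearning_objective(retain_probs, retain_target,
                         forget_probs, forget_target,
                         reference_probs, kl_weight=1.0):
    """Scalar to minimize.

    Minimizing +retain_loss is gradient descent on the retention set.
    Minimizing -forget_loss is gradient ascent on the forget set.
    The KL term penalizes drift from the reference distribution.
    """
    retain_loss = cross_entropy(retain_probs, retain_target)
    forget_loss = cross_entropy(forget_probs, forget_target)
    drift = kl_divergence(forget_probs, reference_probs)
    return retain_loss - forget_loss + kl_weight * drift


# Toy distributions over a 3-token vocabulary; index 2 is the private token.
reference = [0.5, 0.3, 0.2]
before = unlearning_objective([0.7, 0.2, 0.1], 0, [0.1, 0.1, 0.8], 2, reference)
after = unlearning_objective([0.7, 0.2, 0.1], 0, [0.5, 0.3, 0.2], 2, reference)
print(before, after)
```

In this toy case the objective is lower once the model assigns the private token the same probability as the reference distribution, because both the forget term and the drift term improve.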
[0053] The code generation module employs the CodeLlama model, using its pre-trained weights as initial parameters. Through the aforementioned adversarial training mechanism of gradient descent and gradient ascent, the model's parameters are updated on both the retention set and the forget set. On the retention set, the model learns the mapping between safe natural language queries and correct code; on the forget set, the model gradually reduces the prediction probability of privacy information through gradient ascent. Simultaneously, KL divergence constraints and attention masking mechanisms are introduced to ensure that the forgetting process does not compromise the model's original code generation capabilities. After this series of targeted optimizations, the model has the dual ability to "generate high-quality code and proactively forget privacy information."
[0054] To further ensure code security, the generated code undergoes a final check in the output review module. This module inspects the generated code with a dual mechanism of regular expression matching and a BERT classifier. The regular expression layer scans the code for sensitive strings matching specific patterns, such as patterns for hard-coded passwords and patterns for IP addresses (for example, \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}). The BERT classifier, fine-tuned on a large number of sensitive code samples, identifies semantic-level privacy leaks. For example, it flags the assignment admin_pwd = "123456", which does not match the regular expressions but is clearly a hard-coded password.
[0055] If sensitive information is detected at this stage, it is because forgetting during training was incomplete. Although the model was trained on the forget set, it may still retain residual memory of frequently occurring privacy information (such as the common weak password "123456" or the default username "admin"). Alternatively, when it encounters new variant privacy expressions not seen during training, it may incorrectly generalize them into the code. Therefore, using the output review module as a last line of defense, which intercepts leaks and feeds them back to the training stage to reinforce forgetting, can further improve the security of the generated code. The output review module is thus an optional refinement of the solution, and for that reason it is represented by a dashed box in the diagram.
[0056] After the output review module detects sensitive information, it feeds back the code sample containing the sensitive information and its corresponding original natural language query to the code generation module. The code generation module then adds the natural language query to the forget set and re-executes the gradient ascent operation: it increases the intensity of gradient ascent for the privacy information (such as "123456") appearing in the natural language query, adjusts the attention mask to locate newly discovered privacy patterns more accurately, and uses dynamic KL divergence constraints to ensure that forgetting is strengthened without compromising grammatical capabilities. After this round of targeted enhanced forgetting, when the model encounters the same or similar natural language queries again, the probability of generating privacy information is significantly reduced, so that safe code is output.
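The regular expression layer of the output review can be sketched as below. The IPv4 pattern is the one quoted in the text; the hard-coded password pattern is a hypothetical example, since the source does not give its own, and the BERT classifier layer is not modeled.

```python
import re

# Hypothetical pattern: the source does not give its hard-coded password regex.
HARDCODED_PASSWORD = re.compile(r"""\bpassword\s*=\s*["'][^"']+["']""", re.IGNORECASE)
# Pattern quoted in the text.
IPV4 = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")


def regex_review(code):
    """Return (rule_name, matched_text) pairs found in generated code."""
    findings = []
    for name, pattern in (("hardcoded_password", HARDCODED_PASSWORD), ("ipv4", IPV4)):
        findings.extend((name, match.group()) for match in pattern.finditer(code))
    return findings


print(regex_review('password = "hunter2"'))
print(regex_review('host = "10.0.0.1"'))
# Not caught by this regex layer, which is why a semantic classifier is also used:
print(regex_review('admin_pwd = "123456"'))
```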
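The review-and-feedback loop can be sketched as a bounded retry. The three callables and the `max_rounds` limit are assumptions for illustration: the source does not name these interfaces or state how many rounds are allowed, and the toy functions below only simulate a model that stops leaking after one round of reinforced forgetting.

```python
def generate_with_review(query, generate, review, reinforce, max_rounds=3):
    """Generate code, review it, and reinforce forgetting until the review passes.

    generate(query) -> code string
    review(code) -> True if a privacy leak is detected
    reinforce(query, code) -> feeds the leak back to training
    """
    for _ in range(max_rounds):
        code = generate(query)
        if not review(code):
            return code
        reinforce(query, code)
    raise RuntimeError("privacy review still failing after max_rounds")


# Toy stand-ins for the code generation and output review modules.
forget_set = []
state = {"forgotten": False}


def toy_generate(query):
    return 'pwd = load_from_config()' if state["forgotten"] else 'pwd = "123456"'


def toy_review(code):
    return "123456" in code


def toy_reinforce(query, code):
    forget_set.append(query)    # the query joins the forget set
    state["forgotten"] = True   # stands in for another gradient ascent round


safe_code = generate_with_review(
    "create an admin login", toy_generate, toy_review, toy_reinforce
)
print(safe_code)
print(forget_set)
```

Bounding the number of rounds is a design choice made here so that a model that never passes review fails loudly instead of looping forever.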
[0057] After all modules have completed their respective processing, the final code generated by the system will be output to the user. This code meets the user's functional requirements and has undergone multiple privacy protection measures to ensure that it does not contain any sensitive information.
[0058] This embodiment utilizes a multi-module collaboration mechanism to achieve a complete process from inputting natural language requirements to generating secure code, ensuring both the functionality of the generated code and effectively preventing privacy leaks.
[0059] Based on the same inventive concept, this embodiment also provides a privacy-preserving code generation method, including the following steps:
[0060] S10: receive and parse the natural language query input by the user to obtain structured data;
[0061] S20: scan the structured data and evaluate whether the natural language query contains text fragments with privacy information. If so, mark the text fragments and proceed to step S30; if not, store the natural language query in the retention set;
[0062] S30: collect the original natural language query containing the marked text fragments, remove the privacy information from it, send the cleaned natural language query to the retention set, and store the original natural language query in the forget set;
[0063] S40: supplement the natural language queries in the retention set with contextual information;
[0064] S50: perform gradient descent on the retention set, using the natural language queries with supplemented contextual information, to drive the optimized model to generate code based on those queries; perform gradient ascent on the forget set, driving the optimized model to generate code based on natural language queries containing privacy information and then applying reverse optimization so that the model forgets the privacy information; and apply attention masking and dynamic KL divergence constraints.
[0065] As a further optimization, the above method includes step S60 after step S50, which performs a privacy check on the generated code. If privacy is compromised, the process returns to step S50 to regenerate the code; otherwise, the generated code is output. Because S60 can be an optional step, it is represented by a dashed box in the diagram.
[0066] For details on how to perform each of the above steps, please refer to the relevant descriptions of the corresponding modules in the aforementioned system; they will not be repeated here.
[0067] The system or method described in this embodiment can be widely applied in many fields, and it exhibits unique advantages, especially in the following key areas:
[0068] Software Development and Automated Code Generation: In software development, especially in automated code generation and code completion tasks, this system or method can provide functional code snippets and ensure the security of sensitive information. It is suitable for code generation scenarios that require security protection, such as application development and API interface construction.
[0069] Data privacy and information security: For scenarios involving the processing of sensitive information, such as finance and healthcare, this system or method can effectively protect privacy information while ensuring the quality of the generated code, preventing data leakage and improving the compliance and security of the system.
[0070] Machine learning and model debugging: In the fields of machine learning and artificial intelligence, especially in scenarios related to model debugging and data protection, this system or method can "forget" sensitive data when generating code, thereby ensuring data privacy. It is particularly suitable for collaborative research environments that require sharing code models.
[0071] Enterprise Security and Compliance: Large enterprises typically face stringent privacy compliance requirements. This system or method can help enterprises prevent the exposure of sensitive data during code generation, ensuring the security of their internal development and external code sharing, thereby meeting regulatory compliance requirements.
[0072] Potential application areas:
[0073] Cloud Services and Data Centers: Provide code generation services in private and public clouds to ensure user data is not leaked and improve the security and trustworthiness of cloud services.
[0074] Smart contracts and blockchain technology: In smart contract development and blockchain applications, this system or method can be used to automatically generate highly secure contract code, avoid unnecessary data leakage, and enhance the security of the blockchain system.
[0075] Automated Testing and Quality Assurance: Used for generating automated test code, ensuring that test scripts do not contain sensitive information, suitable for industries such as finance and healthcare that require high data protection.
[0076] Education and Training: Provide secure code generation services for programming education and technical training to avoid data security issues caused by students accidentally handling sensitive information.
[0077] Application methods
[0078] Integration with automated code generation platforms: Integrate the system into mainstream code generation platforms (such as GitHub Copilot) to provide privacy protection features and ensure that users do not disclose sensitive data when writing code.
[0079] Internal Enterprise Development Tools: As internal code generation tools, these tools help developers automatically generate code that meets privacy protection requirements, reducing the risk of sensitive data leakage.
[0080] API Interface Secure Generation Service: Integrates the system into the existing development environment via API, providing automatic code generation services, which is particularly suitable for automated interface call environments that require dynamic code generation.
[0081] Sensitive Data Cleaning Assistance System: Integrates with existing data cleaning systems to automatically filter and "forget" unnecessary sensitive information, providing developers with a clean code generation experience.
[0082] Programming teaching aids: Provide secure code generation functions for online education platforms or programming teaching software, ensuring that students' practice code does not contain sensitive data; this suits the code learning environment of beginners.
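The filtering of sensitive information referred to in the application methods above corresponds to scanning a query and routing it to the retention set or the forget set (steps S20 and S30 of the method). The following is a minimal sketch under stated assumptions, not the implementation of this invention: the regular expressions, the sensitive word list, the `<REDACTED>` placeholder, and the function names `find_private_spans` and `route_query` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical patterns and word list, for illustration only.
SENSITIVE_PATTERNS = [
    re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),  # e-mail address
    re.compile(r"\b[A-Za-z0-9_\-]{32,}\b"),                         # long token-like string
]
SENSITIVE_WORDS = ("password", "passwd", "secret")

# A sensitive word followed by "=" or ":" and a value, e.g. "password = hunter2".
KEYWORD_VALUE = re.compile(
    r"\b(?:%s)\b\s*[=:]\s*(\S+)" % "|".join(map(re.escape, SENSITIVE_WORDS)),
    re.IGNORECASE,
)


def find_private_spans(query):
    """Return merged, sorted (start, end) spans flagged as privacy information."""
    spans = []
    for pattern in SENSITIVE_PATTERNS:
        spans.extend((m.start(), m.end()) for m in pattern.finditer(query))
    spans.extend((m.start(1), m.end(1)) for m in KEYWORD_VALUE.finditer(query))

    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent span: extend the previous one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def route_query(query, retention_set, forget_set):
    """Store the query as in steps S20/S30 and return the text kept for retention."""
    spans = find_private_spans(query)
    if not spans:
        retention_set.append(query)      # no privacy information found
        return query

    cleaned = query
    for start, end in reversed(spans):   # replace from the end so offsets stay valid
        cleaned = cleaned[:start] + "<REDACTED>" + cleaned[end:]
    retention_set.append(cleaned)        # cleaned query goes to the retention set
    forget_set.append(query)             # original query goes to the forget set
    return cleaned
```

A query without flagged fragments is stored unchanged in the retention set; a query with flagged fragments is stored twice, once cleaned (retention set) and once in its original form (forget set). A real deployment would need a far more complete pattern library than the three rules shown here.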
[0083] As shown in Figure 3, this embodiment also provides an electronic device that may include a processor 41 and a memory 42, wherein the memory 42 is coupled to the processor 41. It is worth noting that this figure is exemplary, and other types of structures can be used to supplement or replace this structure to achieve data extraction, report generation, communication, or other functions.
[0084] As shown in Figure 3, the electronic device may also include an input unit 43, a display unit 44, and a power supply 45. It is worth noting that the electronic device is not necessarily required to include all the components shown in Figure 3. Furthermore, the electronic device may also include components not shown in Figure 3; for these, reference may be made to the prior art.
[0085] Processor 41, sometimes also called controller or operation control, may include a microprocessor or other processor device and / or logic device, which receives input and controls the operation of various components of the electronic device.
[0086] The memory 42 may be one or more of the following: a cache, flash memory, hard drive, removable media, volatile memory, non-volatile memory, or other suitable devices. It can store configuration information of the processor 41, instructions executed by the processor 41, and other information. The processor 41 can execute programs stored in the memory 42 to perform information storage or processing. In one embodiment, the memory 42 further includes a buffer memory, or buffer, to store intermediate information.
[0087] This invention also provides a computer program product including computer-readable instructions. When the computer-readable instructions are executed in an electronic device, the program product causes the electronic device to perform the operation steps included in the method of this invention.
[0088] This invention also provides a storage medium storing computer-readable instructions that cause an electronic device to perform the operation steps included in the method of this invention.
[0089] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.
[0090] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0091] The above specific embodiments are merely several optional embodiments of the present invention. Based on the technical solutions of the present invention and the relevant teachings of the above embodiments, those skilled in the art can make various alternative improvements and combinations to the above specific embodiments.
Claims
1. A privacy-preserving code generation method, characterized in that it includes the following steps:
S10, receive and parse the natural language query input by the user to obtain structured data;
S20, scan the structured data and evaluate whether the natural language query contains text fragments with privacy information; if so, mark the text fragments and proceed to step S30; if not, store the natural language query in the retention set;
S30, collect the original natural language query whose text fragments contain privacy information, remove the privacy information from the original natural language query, send the cleaned natural language query to the retention set, and store the original natural language query in the forget set;
S40, supplement the natural language queries in the retention set with contextual information;
S50, perform gradient descent on the retention set for the natural language queries supplemented with contextual information, driving the optimized model to generate code based on the natural language query; perform gradient ascent on the forget set to drive the optimized model to generate code based on natural language queries containing privacy information, the model forgetting the privacy information through backpropagation; attention masking and dynamic KL divergence constraints are also applied.
2. The privacy-preserving code generation method according to claim 1, characterized in that, in step S20, the structured data is scanned using a predefined sensitive word library and regular expressions to screen natural language queries for text fragments containing privacy information.
3. The privacy-preserving code generation method according to claim 1, characterized in that, in step S40, contextual information is added to the natural language queries in the retention set based on the trained CodeLlama model.
4. The privacy-preserving code generation method according to claim 1, characterized in that, in step S50, the pre-trained weights of the CodeLlama model are used as initial parameters, and the parameters of the CodeLlama model are updated on the retention set and the forget set through an adversarial training mechanism of gradient descent and gradient ascent.
5. The privacy-preserving code generation method according to any one of claims 1-4, characterized in that step S50 is followed by step S60, which performs a privacy check on the generated code; if there is a privacy breach, the process returns to step S50 and the code is regenerated; if there is no privacy breach, the generated code is output.
6. A privacy-preserving code generation system, characterized in that it includes a data parsing module, a risk assessment module, a forget set module, a retention set module, a code generation module, and an output review module;
the data parsing module is used to receive and parse natural language queries input by users to obtain structured data;
the risk assessment module is used to scan the structured data and assess whether the natural language query contains text fragments with privacy information; if so, the text fragments are marked; otherwise, the natural language query is stored in the retention set;
the forget set module is used to collect original natural language queries whose text fragments contain privacy information, remove the privacy information from the original natural language queries, send the cleaned natural language queries to the retention set, and store the original natural language queries in the forget set;
the retention set module is used to supplement the natural language queries in the retention set with contextual information;
the code generation module is used to perform gradient descent on the retention set for the natural language queries supplemented with contextual information, driving the optimized model to generate code based on the natural language queries; gradient ascent is performed on the forget set to drive the optimized model to generate code based on natural language queries containing privacy information, the model forgetting the privacy information through backpropagation; attention masking and dynamic KL divergence constraints are also applied.
7. The privacy-preserving code generation system according to claim 6, characterized in that it also includes an output review module, which performs privacy checks on the generated code; if there is a privacy breach, this is fed back to the code generation module to regenerate the code; otherwise, the generated code is output directly.
8. A computer program product comprising computer-readable instructions, characterized in that the computer-readable instructions, when executed by a processor, implement the steps of the privacy-preserving code generation method according to any one of claims 1-5.
9. A computer-readable storage medium comprising computer-readable instructions, characterized in that the computer-readable instructions, when executed by a processor, implement the steps of the privacy-preserving code generation method according to any one of claims 1-5.
10. An electronic device, characterized in that it includes: a memory, which stores program instructions; and a processor, connected to the memory, which executes the program instructions in the memory to implement the steps of the privacy-preserving code generation method according to any one of claims 1-5.
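For illustration only, and not forming part of the claims, one plausible reading of how the gradient descent term, the gradient ascent term, and the KL divergence constraint of step S50 could be combined into a single scalar objective is sketched below. The claims do not state a loss formula, so the additive form, the function names, and the parameters `kl_weight` and `forget_weight` are assumptions; the schedule that would make the KL weight "dynamic" is left to the caller, and attention masking is not modelled here.

```python
import math


def mean_nll(target_probs):
    """Mean negative log-likelihood of the probabilities assigned to the target tokens."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


def kl_divergence(reference, current):
    """KL(reference || current) for two discrete distributions over the same vocabulary."""
    return sum(r * math.log(r / c) for r, c in zip(reference, current) if r > 0)


def unlearning_loss(retain_probs, forget_probs, reference, current,
                    kl_weight, forget_weight=1.0):
    """Assumed combined objective; minimizing it with an ordinary optimizer
    performs descent on the retention term and ascent on the forget term."""
    # Retention set: lower NLL is better (gradient descent).
    retain_term = mean_nll(retain_probs)
    # Forget set: the sign is flipped, so minimizing this term raises the NLL
    # of the privacy-bearing targets (gradient ascent).
    forget_term = -forget_weight * mean_nll(forget_probs)
    # KL term: penalizes drift of the updated model from the reference
    # (pre-trained) distribution; kl_weight is supplied by the caller.
    kl_term = kl_weight * kl_divergence(reference, current)
    return retain_term + forget_term + kl_term
```

The forget term alone is unbounded below, since the loss keeps decreasing as the probability of the forgotten targets approaches zero. In this sketch the KL term is what limits how far the updated model can move from the reference distribution while that happens.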