Client-side domain-aware maskless privacy identification and rewriting system and method

By generating domain privacy prototypes and using automatic identification and rewriting technology, the problem of domain adaptation in privacy rewriting in existing technologies is solved, achieving a balance between client-side privacy protection and semantic preservation, and is applicable to multiple privacy-sensitive fields such as medicine and law.

CN122241756A · Pending · Publication Date: 2026-06-19 · RENMIN UNIVERSITY OF CHINA

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: RENMIN UNIVERSITY OF CHINA
Filing Date: 2026-03-23
Publication Date: 2026-06-19

Application Information

Patent Timeline

23 Mar 2026: Application
19 Jun 2026: Publication (CN122241756A)

IPC: G06F21/62; G06N5/04

AI Tagging

Application Domain

Digital data protection; Inference methods

Technology Topics

User input; Theoretical computer science

AI Technical Summary

Technical Problem

Existing privacy rewriting technologies rely on users to explicitly annotate privacy fragments, lacking domain-adaptive capabilities, making it difficult to balance automated privacy protection and semantic utility preservation in client deployments.

Method used

An offline training module is used to generate a domain privacy prototype. A privacy rewriting model is trained through multi-domain contrastive learning and a composite reward function. An online inference module is used to automatically identify and rewrite privacy fragments. A differential privacy sampling mechanism is integrated to achieve maskless privacy protection.

Benefits of technology

It achieves automation and intelligence in multi-domain privacy protection, balances the strength of privacy protection with the semantic utility of text, and ensures the availability and privacy security of rewriting results in downstream tasks.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention provides a client-side domain-aware, maskless privacy identification and rewriting system and method. The system includes an offline training module deployed on a server, used to generate domain privacy prototypes representing privacy semantics in different domains based on labeled privacy fragments from multiple domains. An initial rewriting model is trained using these domain privacy prototypes to obtain a privacy rewriting model. An online inference module deployed on the client side receives user input text, automatically identifies privacy fragments in the user input text using the domain privacy prototypes, and calls the privacy rewriting model to rewrite the identified privacy fragments. This invention solves the technical problem that existing privacy rewriting technologies rely on explicit user annotation of privacy fragments and lack domain adaptability, making it difficult to balance automated privacy protection and semantic utility preservation in actual client deployment scenarios.

Description

Technical Field

[0001] This invention relates to the technical field of privacy protection for prompts submitted to large language models, and in particular to a client-oriented domain-aware maskless privacy identification and rewriting system and method.

Background Technology

[0002] In privacy-sensitive fields such as healthcare, law, and finance, using cloud-hosted Large Language Models (LLMs) requires transmitting sensitive user query texts, which poses a risk of privacy breaches when the service provider is untrusted. Client-side privacy rewriting technology, which sanitizes sensitive content locally while preserving the core intent of the text, has become a key means of mitigating this problem.

[0003] Existing privacy rewriting technologies mainly fall into two categories, both of which have technical drawbacks: The first is full-text privacy rewriting, which rewrites the entire text input by the user and uniformly introduces a differential privacy mechanism. While simple and direct, this approach, due to privacy perturbations affecting the entire text, can easily cause unnecessary semantic modifications to non-privacy information, disrupting semantic consistency, domain-specific style, and expression, thus reducing the usability of the rewritten results in downstream tasks. The second is fragment-level privacy rewriting, which selectively rewrites only privacy fragments while preserving the original non-privacy context. While this solves the semantic disruption problem of full-text rewriting, it implicitly relies on the user explicitly providing a privacy mask or pre-specifying privacy-sensitive fragments, making it difficult to implement in actual client deployments. The reasons for this are twofold: firstly, user queries often involve multiple professional fields, making privacy semantics highly domain-dependent; secondly, while users understand their query intent, they generally lack the professional ability to accurately locate and label privacy text according to specific domain standards.

[0004] Therefore, there is an urgent need for a client-oriented domain-aware maskless privacy identification and rewriting system and method, which can automatically perceive the domain and accurately locate and rewrite privacy fragments without user intervention.

Summary of the Invention

[0005] The purpose of this invention is to provide a client-oriented domain-aware maskless privacy identification and rewriting system and method, which solves the technical problem that existing privacy rewriting technologies rely on users to explicitly annotate privacy fragments and lack domain adaptation capabilities, making it difficult to balance automated privacy protection and semantic utility preservation in actual client deployment scenarios.

[0006] To solve the above-mentioned technical problems, the technical solution of the present invention is as follows: This invention provides a client-oriented domain-aware maskless privacy identification and rewriting system, comprising: an offline training module for generating domain privacy prototypes representing privacy semantics in different domains based on labeled privacy fragments from multiple domains, and training an initial rewriting model based on the domain privacy prototypes to obtain a privacy rewriting model; and an online inference module deployed on the client for receiving user input text, automatically identifying privacy fragments in the user input text using the domain privacy prototypes, and rewriting the identified privacy fragments by calling the privacy rewriting model.

[0007] Furthermore, the offline training module includes: a prototype learning unit, used to train a segment encoder using a multi-domain contrastive learning method, and to cluster the domain-privacy segment representations encoded by the segment encoder to generate a domain-privacy prototype set for each domain; wherein, in the multi-domain contrastive learning method, for an anchor privacy segment, its positive examples are other privacy segments from the same domain, and its negative examples include privacy segments from other domains and non-privacy segments from all domains.

[0008] Furthermore, the offline training module further includes: a preference construction unit, used to automatically evaluate the quality of multiple candidate rewrites for the same input sample based on the domain privacy prototype through a composite reward function to construct a preference dataset; wherein the composite reward function includes a privacy reward term and a domain utility reward term; and a model alignment unit, used to train the initial rewrite model based on the preference dataset using a direct preference optimization method to obtain the privacy rewrite model.

[0009] Furthermore, the privacy reward term is used to evaluate the degree of privacy protection of the candidate rewriting result based on the semantic difference between the rewritten privacy fragment and the corresponding original privacy fragment; the domain utility reward term is used to evaluate the degree of domain style preservation of the candidate rewriting based on the semantic similarity between the rewritten privacy fragment and the privacy prototype of its domain; the preference construction unit is specifically used to perform weighted fusion of the privacy reward term and the domain utility reward term to obtain a comprehensive quality score of the candidate rewriting, and select the candidate rewriting with the highest and lowest scores to form a preference pair.

[0010] Furthermore, the online inference module includes: a text segmentation unit, used to segment the user input text into semantically coherent segments; and a localization unit, used to encode each segment into a vector using the segment encoder, infer the global domain of the user input text based on the similarity between each vector and each domain privacy prototype, and determine privacy segments within the inferred domain according to a similarity threshold.

[0011] Furthermore, the method by which the localization unit infers the global domain is as follows: calculate the maximum similarity between each segment in the input text and each domain prototype set; take the average of the maximum similarity between each segment and a certain domain prototype set as the score of the certain domain; and determine the domain with the highest score as the global domain of the user input text.
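The domain scoring rule in [0011] can be sketched in a few lines. This is only an illustration under stated assumptions: cosine similarity is assumed as the similarity measure, and the function names, toy 2-D vectors, and domain labels are hypothetical rather than taken from the patent.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def infer_global_domain(segment_vecs, prototypes_by_domain):
    """Score each domain as the mean, over segments, of the segment's
    maximum similarity to that domain's prototype set; return the argmax."""
    scores = {}
    for domain, protos in prototypes_by_domain.items():
        best_per_segment = [max(cosine(v, p) for p in protos) for v in segment_vecs]
        scores[domain] = sum(best_per_segment) / len(best_per_segment)
    return max(scores, key=scores.get), scores

# Toy data (hypothetical): three segment embeddings and two prototype sets.
segment_vecs = [[1.0, 0.1], [0.9, 0.2], [0.2, 1.0]]
prototypes_by_domain = {
    "medical": [[1.0, 0.0], [0.8, 0.3]],
    "legal": [[0.0, 1.0]],
}
domain, scores = infer_global_domain(segment_vecs, prototypes_by_domain)
```

Averaging the per-segment maxima means that a single off-domain segment lowers a domain's score without deciding the outcome on its own.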

[0012] Furthermore, the online inference module also includes a rewriting unit, used to rewrite and generate only the identified privacy fragments using the privacy rewriting model, and to deterministically preserve non-privacy fragments; wherein, during rewriting and generation, the rewriting unit integrates a differential privacy sampling mechanism based on the exponential mechanism: the logit vector output by the model is clipped to a predetermined range, and tokens are sampled via softmax with a temperature parameter τ calibrated according to a preset privacy budget, so as to provide a differential privacy guarantee for each rewritten privacy segment.
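A minimal sketch of exponential-mechanism token sampling may help make [0012] concrete. The text here does not spell out how τ is calibrated, so the sketch assumes one common choice: with logits clipped to [b_min, b_max], the sensitivity is Δ = b_max − b_min and τ = 2Δ/ε. The function name, parameter names, and toy logits are hypothetical.

```python
import math
import random

def dp_sample_token(logits, epsilon, b_min, b_max, rng=random):
    """Sample one token index with the exponential mechanism.

    Assumption: tau = 2 * (b_max - b_min) / epsilon, the standard
    exponential-mechanism calibration for a utility bounded in [b_min, b_max].
    """
    # Clip every logit into the predetermined range to bound sensitivity.
    clipped = [min(max(x, b_min), b_max) for x in logits]
    tau = 2.0 * (b_max - b_min) / epsilon
    # Numerically stable softmax over clipped logits at temperature tau.
    top = max(clipped)
    weights = [math.exp((x - top) / tau) for x in clipped]
    total = sum(weights)
    probs = [w / total for w in weights]
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs, tau

rng = random.Random(0)
index, probs, tau = dp_sample_token(
    [5.0, -3.0, 0.5, 12.0], epsilon=4.0, b_min=-2.0, b_max=2.0, rng=rng
)
```

Under this calibration the ratio between any two token probabilities is at most exp(ε/2). The bound applies to each sampled token, so a guarantee for a whole segment would additionally depend on how the budget is composed across its tokens.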

[0013] This invention also provides a client-oriented domain-aware maskless privacy identification and rewriting method, comprising the following steps: an offline training step: based on labeled privacy fragments from multiple domains, a domain privacy prototype representing the privacy semantics of different domains is generated, and an initial rewriting model is trained based on the domain privacy prototype to obtain a privacy rewriting model; an online inference step: the client receives user input text, automatically identifies privacy fragments in the user input text using the domain privacy prototype, and calls the privacy rewriting model to rewrite the identified privacy fragments.

[0014] Furthermore, the offline training step further includes: training a fragment encoder using a multi-domain contrastive learning method, and clustering the domain-privacy fragment representations encoded by the fragment encoder to generate a domain-privacy prototype set for each domain; based on the domain-privacy prototypes, automatically evaluating the quality of multiple candidate rewrites for the same input sample using a composite reward function to construct a preference dataset, wherein the composite reward function includes a privacy reward term for evaluating the degree of privacy protection and a domain utility reward term for evaluating the degree of domain style preservation; and based on the preference dataset, training the initial rewrite model using a direct preference optimization method to obtain the privacy rewrite model.

[0015] Furthermore, the online inference step further includes: segmenting the user input text into semantically coherent fragments; encoding each fragment into a vector using the fragment encoder; inferring the global domain of the user input text based on the similarity between each vector and each domain privacy prototype; determining privacy fragments within the inferred domain according to a similarity threshold; and rewriting and generating only the determined privacy fragments using the privacy rewriting model while deterministically preserving non-privacy fragments, wherein, during the generation process, the logit vector output by the model is clipped and tokens are sampled via softmax with a temperature parameter τ calibrated according to a preset privacy budget, so as to achieve differential privacy.

[0016] Compared with the prior art, the present invention has at least the following beneficial effects: This invention overcomes the limitations of existing technologies by employing a separate architecture of "offline training-online inference," combining domain privacy prototypes with maskless automatic recognition technology. The offline training module trains segment encoders through multi-domain contrastive learning and generates domain privacy prototypes through clustering. This allows the system to accurately capture privacy semantic features from different domains, autonomously adapting to multiple privacy-sensitive domains such as medicine and law without requiring manual user annotation. This solves the technical problems of strong domain dependence on privacy semantics and the lack of professional annotation capabilities among users, achieving automation and intelligence in multi-domain privacy protection.

[0017] Furthermore, this invention employs a fragment-level rewriting strategy, balancing privacy protection strength with textual semantic utility. It only processes identified privacy fragments, while non-privacy fragments are deterministically preserved, minimizing disruption to the semantic consistency and domain-specific style of the original text and ensuring the usability of the rewritten results in downstream tasks. Simultaneously, the rewriting process integrates a differential privacy sampling mechanism based on the exponential mechanism. Through logit vector clipping and calibration of the temperature parameter according to the privacy budget, it provides a provable ε-DP privacy guarantee for each privacy fragment, which not only makes privacy protection quantifiable and compliant, but also avoids semantic distortion of the rewriting results caused by excessive perturbation.

[0018] This invention possesses engineering feasibility and robustness. The lightweight online inference module is deployed on the client side, and all privacy identification and rewriting operations are completed locally without relying on cloud servers. This avoids the leakage of original sensitive text at the deployment level, thus enhancing the level of privacy protection. Furthermore, the domain inference employs a quantitative similarity calculation and scoring evaluation method, demonstrating good adaptability to multi-segment, long texts, and cross-domain mixed-expression texts. It also maintains stable performance even in scenarios with detection noise and long-tailed distribution of privacy fragments, meeting the comprehensive requirements of practicality, stability, and efficiency for actual client deployments.

Attached Figure Description

[0019] To more clearly illustrate the specific embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the specific embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0020] Figure 1 is a schematic diagram of the system architecture of the client-oriented domain-aware maskless privacy identification and rewriting system provided in this embodiment;
Figure 2 is a flowchart of the client-oriented domain-aware maskless privacy identification and rewriting method provided in this embodiment;
Figure 3 is a flowchart of the prototype learning unit of the system provided in this embodiment;
Figure 4 is a flowchart of the preference construction unit of the system provided in this embodiment;
Figure 5 is a flowchart of the model alignment unit of the system provided in this embodiment;
Figure 6 is a flowchart of the online inference module of the method provided in this embodiment;
Figure 7 is a Pri-DDXPlus performance graph of the system provided in this embodiment under different Top-% predefined privacy fragments;
Figure 8 is a performance graph of the resistance of the system provided in this embodiment to embedding inversion attacks under different privacy budgets;
Figure 9 is a performance graph of the resistance of the system provided in this embodiment to injection attacks under different privacy budgets.

Detailed Implementation

[0021] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0022] The following detailed description of some embodiments of the present invention is provided in conjunction with the accompanying drawings. Unless otherwise specified, the following embodiments and features can be combined with each other.

[0023] This embodiment provides a client-oriented domain-aware maskless privacy identification and rewriting system. As shown in Figure 1, the system consists of two main modules. The offline training module takes labeled privacy fragments from multiple domains as input and outputs domain privacy prototypes that represent the semantic features of privacy in different domains; the initial rewriting model is then trained based on the domain privacy prototypes to obtain a privacy rewriting model with domain-adaptive privacy rewriting capabilities. The online inference module, deployed on the client, receives user input text, uses the generated domain privacy prototypes to automatically identify privacy fragments in the user input text, and calls the privacy rewriting model to rewrite the identified privacy fragments. The entire process does not require the user to manually label privacy locations or add privacy masks.

[0024] This solution constructs a separated "offline training - online inference" architecture, placing model training on the server and privacy processing on the client. At the deployment level, it avoids uploading raw sensitive text to the cloud, thus removing the leakage risk of existing technologies that rely on the cloud for privacy protection. It combines domain privacy prototypes with maskless automatic recognition, breaking through the bottleneck of existing technologies that rely on single-domain adaptation and manual annotation of privacy fragments, and realizing automated multi-domain privacy protection with secure local processing on the client side.

[0025] In this embodiment, the prototype learning unit in the offline training module is refined, and the generation logic of domain privacy prototypes is clarified: a multi-domain contrastive learning method is adopted, and by defining precise positive and negative examples, the fragment encoder learns the feature that "privacy fragments within the same domain have similar semantics, while privacy fragments and non-privacy fragments across domains have different semantics," thereby achieving joint modeling of privacy semantics and domain semantics. The domain privacy fragment representations encoded by the fragment encoder are clustered, and the scattered privacy fragment semantics are abstracted into a discrete and compact "domain privacy prototype set," with each prototype corresponding to the typical features of one class of privacy semantics in a given domain.

[0026] Specifically, prototype learning maps inputs to a latent embedding space and uses prototypes to represent categories or clustering structures. In supervised scenarios (e.g., prototypical networks), prototypes are typically derived from the average embeddings of the support samples for each class. In unsupervised or semi-supervised scenarios, prototypes can be induced automatically from the data using clustering algorithms (such as k-means, GMM, or FINCH) as a structured, discrete representation of the data distribution.

[0027] Privacy standards exhibit inherent domain heterogeneity (e.g., medicine focuses on the severity of disease, while law focuses on the classification of crimes) and are specific to each domain. For each domain, the goal is to condense the labeled privacy fragments collected from the training corpus into compact prototypes by clustering their pre-trained encoder representations. However, the original representation of the pre-trained encoder is highly anisotropic and structured, which leads to confusion between privacy and non-privacy segments across domains in the embedding space, making it difficult to form an effective discrimination boundary and thus degrading the clustering result.

[0028] Because the original representations of pre-trained encoders are highly anisotropic and structured, cross-domain privacy and non-privacy segments are easily confused in the embedding space. Therefore, this invention employs a multi-domain contrastive learning method to reshape the representation space. As shown in Figure 3, the workflow of the prototype learning unit includes the following steps: a lightweight adapter layer is added on top of the backbone model to build a trainable segment encoder $f_\theta$. Each privacy fragment $s_i$ is mapped to a representation $z_i = f_\theta(s_i)$. To enhance discriminative power, the following contrastive objective is set: for an anchor privacy fragment $s_i$ from domain $d$, its positive set $P(i)$ includes the other privacy fragments within the same domain $d$, and its negative set $N(i)$ includes privacy fragments from different domains ($d' \neq d$) and non-privacy fragments from all domains. The encoder is trained by minimizing the multi-positive-example InfoNCE loss:

$$\mathcal{L}_i = -\frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(\mathrm{sim}(z_i, z_p)/\tau)}{\sum_{a \in P(i) \cup N(i)} \exp(\mathrm{sim}(z_i, z_a)/\tau)},$$

where $\mathrm{sim}(\cdot,\cdot)$ is the similarity between two representations and $\tau$ is the temperature parameter. Minimizing this loss promotes compactness within a domain and separation across domains.
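The contrastive objective of [0028] can be illustrated for a single anchor as follows. This is a hedged sketch rather than the patent's training code: it assumes cosine similarity, operates on fixed toy vectors instead of a trainable encoder, and uses hypothetical function names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def multi_positive_info_nce(anchor, positives, negatives, tau=0.1):
    """Multi-positive InfoNCE loss for one anchor.

    positives: same-domain privacy fragments.
    negatives: other-domain privacy fragments plus non-privacy fragments.
    """
    pos_logits = [cosine(anchor, p) / tau for p in positives]
    neg_logits = [cosine(anchor, n) / tau for n in negatives]
    all_logits = pos_logits + neg_logits
    # log-sum-exp of all logits (the shared softmax denominator), computed stably
    top = max(all_logits)
    log_denom = top + math.log(sum(math.exp(x - top) for x in all_logits))
    # average negative log-probability assigned to each positive
    return -sum(x - log_denom for x in pos_logits) / len(pos_logits)

anchor = [1.0, 0.0]
positives = [[0.9, 0.1]]
loss_separated = multi_positive_info_nce(anchor, positives, negatives=[[0.0, 1.0]])
loss_confused = multi_positive_info_nce(anchor, positives, negatives=[[1.0, 0.05]])
```

The loss is small when negatives lie far from the anchor and grows when a negative is as close to the anchor as the positive, which is the pressure that separates domains and privacy classes in the embedding space.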

[0029] The contrastively trained segment encoder outputs domain-aware privacy representations. Then, for each domain $d$, FINCH clustering is applied to the refined privacy embeddings to obtain a discrete privacy prototype set $\mathcal{P}_d = \{p_{d,1}, \dots, p_{d,K_d}\}$. This enables a structured abstraction of privacy across multiple domains.
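To show how prototypes can fall out of clustering, the sketch below implements only a simplified version of the first FINCH partition: each embedding is linked to its first (nearest) neighbour, connected components form clusters, and each cluster mean serves as a prototype. The full FINCH algorithm continues recursively over cluster means; that recursion, and the use of cluster means as prototypes, are assumptions not confirmed by the text. All names and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def first_neighbor_clusters(vectors):
    """Group points into connected components of the first-neighbour graph."""
    n = len(vectors)
    parent = list(range(n))

    def find(x):
        # union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        others = [j for j in range(n) if j != i]
        if not others:
            continue
        nearest = max(others, key=lambda j: cosine(vectors[i], vectors[j]))
        parent[find(i)] = find(nearest)  # link i with its first neighbour

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def cluster_prototypes(vectors):
    """Return one prototype (the mean vector) per first-neighbour cluster."""
    dim = len(vectors[0])
    return [
        [sum(vectors[i][k] for i in members) / len(members) for k in range(dim)]
        for members in first_neighbor_clusters(vectors)
    ]

embeddings = [[1.0, 0.0], [0.95, 0.05], [0.0, 1.0], [0.05, 0.95]]
clusters = first_neighbor_clusters(embeddings)
protos = cluster_prototypes(embeddings)
```

A practical appeal of this family of methods is that the number of clusters does not have to be fixed in advance, which suits domains with differing numbers of privacy classes.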

[0030] For example, a user query $x$ is regarded as a sequence of lexical units (tokens). First, the meaningful segments, denoted $S$ (i.e., phrases carrying semantic information), and the subset of privacy-sensitive segments within them, $S^{\mathrm{priv}} \subseteq S$ (containing sensitive information), are defined. Second, any tokens that do not belong to $S$ (e.g., functional connectors or general terms) are treated as part of the general context. Non-privacy fragments belong to $S$ but not to $S^{\mathrm{priv}}$, and are distinct from the general context. In the medical text "A lady...coughing, coughing up blood, fatigue...", the set $S$ contains meaningful snippets such as "cough," "coughing up blood," and "fatigue," with "coughing up blood" and "fatigue" being privacy fragments; "cough" is a non-privacy fragment, while "a lady" is general context.

[0031] By clearly defining the positive and negative example rules for multi-domain contrastive learning, this scheme enables the fragment encoder to learn the joint features of privacy semantics and domain semantics, which addresses the lack of domain specificity in existing privacy feature modeling and gives the encoded privacy fragment representations strong domain specificity. By generating a set of domain-specific privacy prototypes through clustering, the scattered privacy fragment semantics are abstracted into compact prototype features, allowing the system to adapt to the privacy protection needs of different domains. This provides a precise reference standard for subsequent maskless privacy recognition and, compared with the manually preset privacy rules in existing technologies, makes privacy semantic representation automatic and accurate.

[0032] In this embodiment, two core units of the offline training module are added to improve the training logic of the privacy rewriting model: the preference construction unit, based on the domain privacy prototype, automatically evaluates the quality of multiple candidate rewriting results of the same input sample through a composite reward function to construct a preference dataset; the Direct Preference Optimization (DPO) method is used to train the initial rewriting model based on the preference dataset, so that the model learns a rewriting strategy that "meets privacy protection requirements and fits the domain style", and finally obtains the privacy rewriting model.

[0033] Specifically, preference alignment guides the language model toward the desired behavior through relative feedback. DPO is an efficient method that does not require training a separate reward model, but instead optimizes the policy $\pi_\theta$ directly on a dataset of preference pairs $\mathcal{D} = \{(x, y_w, y_l)\}$, where $y_w$ is preferred over $y_l$. DPO uses the ratio between the policy likelihood and that of a frozen reference model $\pi_{\mathrm{ref}}$ to implicitly define the reward, leading to a closed-form objective function that minimizes the negative log-likelihood of the preference data:

$$\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right],$$

where $\sigma$ is the sigmoid function and $\beta$ controls the degree of deviation from the reference policy. This method aligns the model output with complex preference criteria in a computationally efficient and stable manner.
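For a single preference pair, the DPO objective reduces to a short computation on sequence log-probabilities. The sketch below is illustrative only: the function name is hypothetical and the log-probability values are made-up numbers, not model outputs.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one pair (y_w preferred over y_l).

    logp_* are sequence log-probabilities under the policy being trained;
    ref_logp_* are the same quantities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin)
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: zero margin, loss = log 2.
loss_neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy raises the preferred rewrite and lowers the dispreferred one.
loss_aligned = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

Because only log-ratios against the reference enter the margin, the policy is rewarded for changing relative preferences rather than for raising absolute likelihoods.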

[0034] This solution introduces domain privacy prototypes into the preference learning process, enabling automatic construction of preference data and model alignment without manual annotation. By integrating domain privacy prototypes into the training and evaluation of the rewriting model, the training of the rewriting model is always anchored to domain privacy semantics, solving the lack of domain adaptability evaluation in existing privacy rewriting model training. The preference dataset is automatically constructed through a composite reward function, eliminating the need for manual annotation of rewriting results, reducing data annotation costs, and automating the generation of training data for the rewriting model while ensuring the objectivity and domain adaptability of the preference data. Combined with direct preference optimization, the model can quickly learn rewriting strategies that satisfy privacy protection and domain style, improving training efficiency and rewriting performance.

[0035] In this embodiment, the specific composition, evaluation logic, and construction method of the composite reward function are detailed. The privacy reward term is used to evaluate the degree of privacy protection of the candidate rewriting result based on the semantic difference between the rewritten privacy fragment and the corresponding original privacy fragment. The greater the semantic difference, the higher the degree of privacy protection. The domain utility reward term is used to evaluate the degree of domain style preservation of the candidate rewriting based on the semantic similarity between the rewritten privacy fragment and the privacy prototype of its domain. The closer the semantics, the better the domain style preservation. The preference construction unit is specifically used to perform weighted fusion of the privacy reward term and the domain utility reward term to obtain the comprehensive quality score of the candidate rewriting, and select the candidate rewriting with the highest and lowest scores to form a preference pair.

[0036] For example, to simultaneously satisfy privacy protection and domain consistency, a domain-prototype-guided preference learning framework automatically constructs a preference dataset using a composite reward function and fine-tunes the rewriting model using direct preference optimization. As shown in the preference construction unit workflow of Figure 4 and the model alignment unit flowchart of Figure 5, for an input $x$ and its labeled privacy fragments, the reference model $\pi_{\mathrm{ref}}$ is first used to generate a candidate rewrite set $Y = \{y_1, \dots, y_K\}$. The score of each candidate $y_k$ is obtained by weighting two rewards. Let $e_j$ and $e'_j$ denote the embeddings of the $j$-th original fragment and its rewritten fragment, respectively. The privacy reward $R_{\mathrm{priv}}$ is based on the negative cosine similarity between the rewritten fragment embedding and the original fragment embedding, $-\cos(e'_j, e_j)$, which encourages semantic obfuscation. The utility reward $R_{\mathrm{util}}$ measures the degree of style preservation in the given domain $d$ through the cosine similarity between the rewritten fragment and its nearest domain prototype, $\max_{p \in \mathcal{P}_d} \cos(e'_j, p)$, where the $\max$ operator automatically selects the most relevant prototype for each fragment.

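[0037] The final overall score of each candidate, which balances privacy and utility, is a weighted combination of the privacy reward $R_{\mathrm{priv}}$ and the utility reward $R_{\mathrm{util}}$, where $\lambda$ is a weighting hyperparameter. Based on the scores, the highest-scoring candidate in the candidate set is selected as the preferred sample $y_w$ and the lowest-scoring candidate as the dispreferred sample $y_l$, to construct a preference dataset $\mathcal{D} = \{(x, y_w, y_l)\}$. The rewriting model $\pi_\theta$ is then fine-tuned with the DPO loss while being kept close to the reference model $\pi_{\mathrm{ref}}$:

$$\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right],$$

where $\beta$ adjusts the strength of the KL divergence penalty.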
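The scoring and pair selection of [0036] and [0037] can be sketched as below. Two details are assumptions made only for illustration, since the exact formulas are not legible here: rewards are averaged over the fragments of a candidate, and the overall score is taken as the convex combination λ·R_priv + (1 − λ)·R_util. Function names, candidate labels, and toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def privacy_reward(orig_embs, rew_embs):
    """Mean negative cosine similarity between rewritten and original fragments."""
    return sum(-cosine(r, o) for o, r in zip(orig_embs, rew_embs)) / len(orig_embs)

def utility_reward(rew_embs, domain_protos):
    """Mean similarity of each rewritten fragment to its nearest domain prototype."""
    return sum(max(cosine(r, p) for p in domain_protos) for r in rew_embs) / len(rew_embs)

def build_preference_pair(candidates, orig_embs, domain_protos, lam=0.5):
    """candidates: list of (label, rewritten_fragment_embeddings).

    Returns (chosen, rejected, scores). The convex weighting is an assumption.
    """
    scores = {
        label: lam * privacy_reward(orig_embs, embs)
        + (1.0 - lam) * utility_reward(embs, domain_protos)
        for label, embs in candidates
    }
    chosen = max(scores, key=scores.get)
    rejected = min(scores, key=scores.get)
    return chosen, rejected, scores

orig_embs = [[1.0, 0.0]]
domain_protos = [[0.6, 0.8]]
candidates = [
    ("rewrite_a", [[1.0, 0.0]]),  # copies the original fragment
    ("rewrite_b", [[0.6, 0.8]]),  # sits exactly on the prototype
    ("rewrite_c", [[0.0, 1.0]]),  # far from the original, near the prototype
]
chosen, rejected, scores = build_preference_pair(candidates, orig_embs, domain_protos)
```

In this toy setting the candidate that copies the original scores lowest, since it earns no privacy reward, which is the behaviour the preference pair is meant to teach the model to avoid.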

[0038] By focusing optimization on the privacy fragments, the model achieves a precise balance between semantic privacy and domain style. During online inference, this rewriting model is deployed on the client as a lightweight, integrable module that adaptively rewrites queries before they are transmitted to the large language model in the cloud, thus protecting privacy.

[0039] This scheme quantifies the abstract goals of privacy protection and domain style preservation into computable reward items and clarifies the fusion and screening mechanism. The privacy reward item focuses on the semantic distance between the rewritten result and the original sensitive information, while the domain utility reward item focuses on the rewritten result aligning with typical privacy expressions within the domain. The weighted fusion of the two achieves a dual quantitative evaluation of "privacy protection strength" and "domain style preservation degree," solving the problem that existing privacy rewriting technologies only focus on privacy protection or semantic fidelity and lack dual-dimensional evaluation. Through the weighted fusion mechanism, the weights of the reward items can be adjusted according to the needs of different domains, achieving a dynamic balance between privacy protection and domain adaptation, and improving the system's multi-domain adaptation flexibility. Through weighted fusion and extreme value selection, the system can stably and efficiently screen the most valuable training samples (optimal and worst) from multiple candidate results, providing a reliable data foundation for DPO training.

[0040] In this embodiment, the structure of the online inference module is refined. This module includes a text segmentation unit and a localization unit. The text segmentation unit is used to segment the user input text into semantically coherent fragments, ensuring that each fragment has independent and complete semantics. The localization unit is used to encode each fragment into a vector using the fragment encoder, and by calculating the similarity between each vector and the privacy prototype of each domain, first infers the global domain of the user input text, and then, within the inferred domain, determines the privacy fragments according to a preset similarity threshold.

[0041] As shown in the flowchart of the online inference module in Figure 6, a hybrid rule-based and syntax-based approach decomposes the user query into semantically coherent segments. Given an input text, spaCy with the en_core_web_sm model is first used for sentence segmentation and shallow syntactic analysis to obtain structural annotations, so that phrase-level inference can be performed without full semantic parsing. For each sentence, enumeration regions are identified by predefined regular-expression triggers and split at weak delimiters such as commas, semicolons, and coordinating conjunctions, which handles parallel structures robustly and avoids over-segmentation.

[0042] To improve recall on more complex or implicit sentence structures, syntax-aware phrase extraction is added. This step collects noun phrases identified by spaCy's noun chunker, gerund phrases headed by a VBG-tagged token, verb phrases (whose spans are defined by the corresponding dependency subtrees), and infinitival purpose constructions that match verb patterns. These phrase types are treated as the smallest units that preserve the semantic integrity of a segment. Segments are not allowed to cross strong punctuation marks within a sentence, and segment boundaries are normalized by removing leading conjunctions and trailing punctuation marks.
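
The rule-based part of the segmentation can be illustrated with the standard library alone. This sketch is a simplification and not the described pipeline: it does not call spaCy, it omits the syntax-aware phrase extraction, and the conjunction list (`and`, `or`, `but`) and the function name `split_fragments` are assumptions made for the example. It shows only splitting at weak delimiters and normalizing segment boundaries.

```python
import re

# Weak delimiters: commas, semicolons, and whitespace before a coordinating conjunction.
WEAK_DELIM = re.compile(r"\s*[,;]\s*|\s+(?=(?:and|or|but)\s)", re.IGNORECASE)
# A conjunction left at the start of a segment after splitting.
LEADING_CONJ = re.compile(r"^(?:and|or|but)\s+", re.IGNORECASE)

def split_fragments(sentence):
    """Split one sentence at weak delimiters and normalize segment boundaries."""
    fragments = []
    for piece in WEAK_DELIM.split(sentence):
        piece = LEADING_CONJ.sub("", piece.strip())   # drop a leading conjunction
        piece = piece.rstrip(".!?,;: ")               # drop trailing punctuation
        if piece:
            fragments.append(piece)
    return fragments

frags = split_fragments("I have a fever, chest pain and a dry cough; it started Monday.")
# frags == ["I have a fever", "chest pain", "a dry cough", "it started Monday"]
```

In the described system, sentence boundaries and phrase candidates would come from the spaCy parse rather than from regular expressions alone.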

[0043] This scheme achieves maskless automatic localization in the online stage. Due to the strong domain dependence of privacy semantics and the significant differences in privacy prototypes across different domains, the global domain of the input text is first inferred based on the similarity between fragments and domain privacy prototypes. Then, privacy fragment determination is performed within that domain to reduce the false positive rate caused by cross-domain semantic confusion. By decomposing the input text into semantic fragments, operable units are provided for subsequent matching. By calculating the similarity between fragments and prototypes in each domain, global domain inference and local privacy determination are completed simultaneously, avoiding complex multi-step processes. Threshold-based determination allows for flexible adjustment according to privacy standards in different domains. A two-step recognition logic of "global domain inference first, then domain-specific privacy determination" is adopted. First, the domain to which the text belongs is identified, and then determination is performed based on the domain-specific privacy prototype, reducing the false positive rate caused by cross-domain privacy semantic confusion and improving the accuracy of privacy recognition.

[0044] In this embodiment, the specific method for the localization unit to infer the global domain is clarified: first, the maximum similarity between each segment in the input text and each domain prototype set is calculated, that is, the maximum value among all privacy prototypes of a certain domain for each segment; then, the average of the maximum similarity between each segment and the prototype set of a certain domain is used as the final score of that domain; finally, the domain with the highest score is determined as the global domain to which the user input text belongs.

[0045] Specifically, during the inference phase the domain of the user input is unknown, so the learned privacy prototypes are used to infer the global domain context and to locate privacy fragments. First, each fragment $c_i$ is encoded as an embedding $e_i$. For each candidate domain $d$ with prototype set $\mathcal{P}_d$, the affinity between fragment $c_i$ and the domain is the maximum cosine similarity to any prototype: $s_i^{d} = \max_{p \in \mathcal{P}_d} \cos(e_i, p)$. The fragment-level affinities of all $n$ fragments are then aggregated, and the domain with the highest average score is selected as the global domain of the input: $d^{*} = \arg\max_{d} \frac{1}{n} \sum_{i=1}^{n} s_i^{d}$. After the domain $d^{*}$ is determined, fragments whose affinity $s_i^{d^{*}}$ exceeds the preset domain-specific threshold $\theta_{d^{*}}$ are classified as privacy fragments and collected into the set $\mathcal{S} = \{c_i : s_i^{d^{*}} > \theta_{d^{*}}\}$. The remaining fragments are treated as non-privacy context.
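
The inference procedure above (maximum similarity per fragment, average per domain, highest-scoring domain, then thresholding) can be sketched as follows. The embeddings and prototypes are toy three-dimensional vectors invented for the example, and the function name `infer_domain_and_locate` is hypothetical; in the described system the vectors would come from the trained fragment encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def infer_domain_and_locate(frag_embs, prototypes, thresholds):
    """prototypes: {domain: [vector, ...]}; thresholds: {domain: float}."""
    # Affinity of each fragment to each domain: max similarity to any prototype.
    affinity = {
        d: [max(cosine(e, p) for p in protos) for e in frag_embs]
        for d, protos in prototypes.items()
    }
    # Domain score: average affinity over all fragments.
    scores = {d: sum(a) / len(a) for d, a in affinity.items()}
    domain = max(scores, key=scores.get)
    # Privacy fragments: affinity to the inferred domain above its threshold.
    private_idx = [i for i, s in enumerate(affinity[domain]) if s > thresholds[domain]]
    return domain, private_idx

prototypes = {
    "medical": [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]],
    "legal": [[0.0, 0.0, 1.0]],
}
thresholds = {"medical": 0.85, "legal": 0.80}
frag_embs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.7, 0.7, 0.1]]

domain, private_idx = infer_domain_and_locate(frag_embs, prototypes, thresholds)
# domain == "medical"; fragments 0 and 2 exceed the medical threshold
```

Because the per-fragment affinities are computed once and reused for thresholding, the cost is one similarity per fragment and prototype.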

[0046] This solution proposes a global domain inference method based on prototype aggregation. By taking the "maximum similarity" instead of the "average similarity," it allows fragments to match any typical prototype within the domain, rather than requiring similarity to all prototypes, thus accommodating the diversity of privacy representations. By calculating the average score of all fragments and integrating global information, it improves the robustness and accuracy of domain inference. The method involves a series of vector dot products and mean calculations, resulting in low computational complexity and meeting the computational requirements of lightweight client deployments.

[0047] In this embodiment, a rewriting unit is added to the online inference module. This unit adopts a differentiated processing strategy: the privacy rewriting model is called only to regenerate privacy fragments, while non-privacy fragments are deterministically preserved without any modification. At the same time, the rewriting unit integrates a differential privacy sampling mechanism based on the exponential mechanism into the generation process: the logit vector output by the model is clipped to a predetermined range, and the next token is sampled from a softmax whose temperature parameter τ is calibrated from a preset privacy budget, providing a differential privacy guarantee for each rewritten privacy segment.

[0048] Specifically, differential privacy (DP) is a formal privacy protection framework designed to limit the influence of any individual's data on the output. Let $\mathcal{X}$ denote the input space and $\mathcal{Y}$ the output space. Two inputs $x, x' \in \mathcal{X}$ that differ only in the contribution of a single individual (or, in the local setting, that correspond to two possible private values for the same user) are called adjacent inputs, denoted $x \sim x'$. A randomized mechanism $M: \mathcal{X} \to \mathcal{Y}$ satisfies $\varepsilon$-DP if, for all adjacent inputs $x \sim x'$ and all measurable sets $S \subseteq \mathcal{Y}$, $\Pr[M(x) \in S] \le e^{\varepsilon} \Pr[M(x') \in S]$, where $\varepsilon$ is the privacy budget. In text rewriting scenarios, local differential privacy (LDP) is usually adopted, meaning that the sanitization is performed locally on the user's device so that sensitive information is protected before it leaves the device.

[0049] The exponential mechanism (EM) is a fundamental differential privacy technique for privately selecting a result from a discrete output domain $\mathcal{Y}$ given an input $x$. A utility function $u(x, y)$ scores each output $y \in \mathcal{Y}$, reflecting its quality or relevance. The privacy guarantee of the mechanism depends on the global sensitivity of the utility function, i.e. the maximum change in utility caused by modifying a single input value: $\Delta u = \max_{y \in \mathcal{Y}} \max_{x \sim x'} |u(x, y) - u(x', y)|$. The exponential mechanism selects output $y$ with a probability that grows exponentially with its utility, scaled by the privacy budget $\varepsilon$ and the sensitivity $\Delta u$: $\Pr[M(x) = y] \propto \exp\left(\frac{\varepsilon\, u(x, y)}{2 \Delta u}\right)$. This distribution balances utility against privacy.
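
As a generic illustration of the exponential mechanism (independent of the rewriting model), the following sketch computes the selection probabilities in the textbook form, proportional to exp(ε·u / (2·Δu)), and draws one output. The candidate outputs, utilities, and the assumption that utilities lie in [0, 1] (so the sensitivity is 1) are invented for the example.

```python
import math
import random

def exponential_mechanism_probs(utilities, epsilon, sensitivity):
    """Selection probabilities proportional to exp(epsilon * u / (2 * sensitivity))."""
    scaled = [epsilon * u / (2.0 * sensitivity) for u in utilities]
    shift = max(scaled)                       # subtract the max for numerical stability
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def sample_output(outputs, probs, rng):
    """Draw one output according to the mechanism's distribution."""
    return rng.choices(outputs, weights=probs, k=1)[0]

outputs = ["a", "b", "c"]
utilities = [1.0, 0.0, 0.0]   # assumed bounded in [0, 1], hence sensitivity 1
probs = exponential_mechanism_probs(utilities, epsilon=2.0, sensitivity=1.0)
choice = sample_output(outputs, probs, random.Random(0))
```

A larger ε concentrates probability on high-utility outputs, while a smaller ε flattens the distribution toward uniform.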

[0050] In autoregressive or masked language model generation, the logits at each decoding step can be regarded as a utility function, and the exponential mechanism can be implemented through temperature-controlled softmax sampling. However, raw logits are unbounded, so the sensitivity is infinite. The logits must therefore first be clipped to a fixed range $[b_{\min}, b_{\max}]$, which bounds the sensitivity by $\Delta u = b_{\max} - b_{\min}$ (the maximum distance between any two values in the range). Following the principle of the exponential mechanism, the temperature parameter $\tau$ is tied to the utility sensitivity $\Delta u$ and the privacy budget $\varepsilon$ by setting $\tau = 2 \Delta u / \varepsilon$, so that the sampling of each token satisfies $\varepsilon$-DP. In this way a standard language model runs as a differentially private generator under logit clipping and temperature control, achieving token-level privacy protection.

[0051] For example, the rewriting model regenerates tokens only within the set of detected privacy fragments, while the non-privacy parts are copied directly, so that privacy protection is concentrated on sensitive content. The generation process is based on the differentially private exponential mechanism: at decoding step $t$ for each privacy token, the rewriting model outputs a logit vector $z_t$ over the vocabulary $\mathcal{V}$. The logit vector is clipped along each coordinate to a fixed range $[b_{\min}, b_{\max}]$, giving $\tilde{z}_t$, in order to bound the utility sensitivity, and the next token is sampled from the temperature-softmax distribution with temperature $\tau$. This distribution instantiates the exponential mechanism with the clipped logit as the utility function: $\Pr[y_t = v] = \exp(\tilde{z}_{t,v} / \tau) \,/\, \sum_{v' \in \mathcal{V}} \exp(\tilde{z}_{t,v'} / \tau)$.

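[0052] Under clipping, every coordinate of the clipped logit vector lies in the clipping range, $\tilde{z}_{t,v} \in [b_{\min}, b_{\max}]$, so the utility sensitivity at each step is bounded by $\Delta u \le b_{\max} - b_{\min}$. With softmax temperature $\tau$, the privacy budget consumed per token is therefore $\varepsilon_{\text{tok}} = 2 \Delta u / \tau$.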
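
A single decoding step with logit clipping and temperature softmax can be sketched as below. The clipping range [-5, 5], the temperature, and the four-entry toy vocabulary are hypothetical values chosen for the example; the per-token budget uses the relation 2·Δu/τ with Δu equal to the width of the clipping range.

```python
import math
import random

def dp_token_distribution(logits, lo, hi, temperature):
    """Clip each logit to [lo, hi], then apply a temperature softmax."""
    clipped = [min(max(z, lo), hi) for z in logits]
    shift = max(clipped)                              # numerical stability
    weights = [math.exp((z - shift) / temperature) for z in clipped]
    total = sum(weights)
    return [w / total for w in weights]

def per_token_epsilon(lo, hi, temperature):
    """Budget of one sampling step: 2 * sensitivity / temperature."""
    return 2.0 * (hi - lo) / temperature

logits = [12.0, 3.0, -40.0, 0.5]        # raw, unbounded model outputs (toy values)
probs = dp_token_distribution(logits, lo=-5.0, hi=5.0, temperature=10.0)
eps_tok = per_token_epsilon(-5.0, 5.0, 10.0)
token_id = random.Random(0).choices(range(len(logits)), weights=probs, k=1)[0]
```

After clipping, no token can be more than exp((hi - lo)/τ) times as likely as another, which is what makes the step's privacy loss bounded.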

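[0053] Let $n$ denote the number of tokens regenerated within the privacy fragments, let $n_{\max}$ be a preset global upper bound on $n$, let $\Delta u$ be the per-step utility sensitivity, and let $\tau$ be the sampling temperature. By sequential composition over the regenerated tokens, the privacy budget of the randomized rewriting process satisfies $\varepsilon = n \cdot 2 \Delta u / \tau \le n_{\max} \cdot 2 \Delta u / \tau$. To achieve an overall privacy budget $\varepsilon_{\text{total}}$, the temperature is calibrated to $\tau = 2 \Delta u \, n_{\max} / \varepsilon_{\text{total}}$. The privacy loss is thus proportional to the amount of sensitive text regenerated rather than to the query length. Furthermore, because the number of regenerated tokens is at most $n_{\max}$ and usually well below it, the actual privacy loss is usually much smaller than the budget $\varepsilon_{\text{total}}$.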
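
The calibration can be checked with a small worked example. The numbers (total budget 50, sensitivity 10, upper bound of 20 regenerated tokens) are hypothetical, and the functions assume the per-token budget 2·Δu/τ with plain sequential composition across regenerated tokens.

```python
def calibrate_temperature(total_epsilon, sensitivity, max_tokens):
    """Temperature at which max_tokens sampling steps compose to total_epsilon."""
    return 2.0 * sensitivity * max_tokens / total_epsilon

def spent_epsilon(num_regenerated, sensitivity, temperature):
    """Budget actually consumed when num_regenerated tokens are sampled."""
    return num_regenerated * 2.0 * sensitivity / temperature

tau = calibrate_temperature(total_epsilon=50.0, sensitivity=10.0, max_tokens=20)
spent = spent_epsilon(num_regenerated=8, sensitivity=10.0, temperature=tau)
# tau == 8.0; regenerating 8 of the allowed 20 tokens spends 20.0 of the 50.0 budget
```

Copied tokens cost nothing in this accounting, so a long query with a few short privacy fragments consumes only a small part of the budget.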

[0054] For adjacent inputs that differ only at the token positions covered by the privacy fragments, with tokens elsewhere identical and deterministically copied, this mechanism satisfies $\varepsilon$-DP, where $\varepsilon$ is the sum of the per-token budgets over the regenerated tokens.

[0055] The differentiated processing strategy of this scheme preserves the original semantics of non-privacy segments to the greatest extent possible, solving the problems of semantic loss and broken text coherence caused by full-text rewriting in existing technologies, and achieving a balance between privacy protection and semantic fidelity. By integrating a differential privacy sampling mechanism based on the exponential mechanism, it provides provable $\varepsilon$-DP privacy protection for privacy rewriting, which solves the inability of existing technologies to quantify the strength of privacy protection and their lack of a compliance basis, enabling the system to meet the compliance requirements of privacy-sensitive fields such as medicine and law. By combining logit clipping with temperature parameter calibration, it avoids semantic distortion of the rewriting results caused by excessive perturbation while ensuring the strength of privacy protection.

[0056] This invention also provides a client-oriented, domain-aware, maskless privacy identification and rewriting method. As shown in Figure 2, the method includes an offline training step and an online inference step. The offline training step generates domain privacy prototypes representing privacy semantics in different domains from labeled privacy fragments of multiple domains, and then trains an initial rewriting model based on these domain privacy prototypes to obtain a privacy rewriting model. The online inference step is executed on the client side: it receives the user input text, automatically identifies privacy fragments in the text using the domain privacy prototypes, and then calls the privacy rewriting model to rewrite the identified privacy fragments.

[0057] This solution achieves lightweight client deployment by separating offline training and online inference; and realizes maskless automatic localization and domain-aware rewriting through a domain privacy prototype, solving the shortcomings of existing technologies in terms of reliance on manual annotation and domain incompatibility.

[0058] In this embodiment, the offline training steps are further refined, specifically including: training a fragment encoder using a multi-domain contrastive learning method, and clustering the domain-privacy fragment representations encoded by the fragment encoder to generate a domain-privacy prototype set for each domain; based on the domain-privacy prototypes, automatically evaluating the quality of multiple candidate rewrites for the same input sample using a composite reward function that includes a privacy reward term and a domain utility reward term to construct a preference dataset; and based on the preference dataset, training the initial rewrite model using a direct preference optimization method to obtain the privacy rewrite model.

[0059] This solution breaks down the offline training process into logically coherent sub-steps, clarifies the technical operation requirements of each step, and solves the problems of vague training steps and poor feasibility of existing privacy rewriting model training techniques. Each sub-step corresponds precisely to the prototype learning unit, preference construction unit, and model alignment unit of the system module, achieving a unified system architecture and methodology. At the same time, it integrates multi-domain contrastive learning, composite reward functions, and direct preference optimization techniques into the offline training process, forming a complete domain-adaptive privacy rewriting model training method.

[0060] In this embodiment, the online inference step is further refined and specifically includes: segmenting the user input text into semantically coherent fragments; encoding each fragment into a vector using the fragment encoder; inferring the global domain of the user input text based on the similarity between each vector and each domain privacy prototype, and determining privacy fragments within the inferred domain according to a similarity threshold; and regenerating only the determined privacy fragments using the privacy rewriting model, where, during generation, the logit vectors output by the model are clipped and tokens are sampled from a softmax whose temperature parameter τ is calibrated from a preset privacy budget to achieve differential privacy, while non-privacy segments are deterministically preserved.

[0061] This solution breaks down the online inference process into standardized sub-steps of "segmentation-identification-rewriting," clarifying the specific operational logic of each step. This addresses the problem of disorganized and unstandardized steps in existing client-side privacy processing methods, improving the feasibility and repeatability of the method. Each sub-step corresponds precisely to the text segmentation, localization, and rewriting units of the system modules, fully implementing the system's technical details at the method level and achieving a closed loop of "module-step-effect." Furthermore, a differential privacy mechanism is integrated into the rewriting sub-step, giving the client-side local privacy rewriting method provable privacy protection and thus achieving standardization and security of the client-side privacy processing method.

[0062] The present invention also provides the following specific embodiments: Example 1: The system described in this invention involves two modules: an offline training module and an online inference module. The offline training module includes a prototype learning unit, a preference construction unit, and a model alignment unit. Specific implementation details are as follows: The offline training module first needs to train the domain privacy prototype, and Algorithm 1 details the training process of the domain privacy prototype.

[0063] Algorithm 1: Learning Domain Privacy Prototypes
Input: annotated privacy fragments; backbone model; temperature

[0064] Output: domain privacy prototypes; fragment encoder

[0065] S1: Initialize the fragment encoder from the backbone model;
// Contrastive representation learning
S2: For each training step:
S3: Sample a batch of anchor fragments;
S4: Obtain the positive example set and the negative example set of each anchor;
S5: Update the fragment encoder by minimizing the contrastive loss;
// Prototype clustering
S6: For each domain:
S7: Extract the embeddings of the domain's annotated privacy fragments with the trained fragment encoder;
S8: Cluster the embeddings to obtain the domain's privacy prototypes;
Algorithm 1 yields the domain privacy prototypes in two stages. S2-S5 perform contrastive representation learning: each privacy fragment serves in turn as an anchor, positive and negative example sets are constructed around the anchors, and the fragment encoder is updated by minimizing the contrastive loss. After the fragment encoder is trained, S6-S8 perform prototype clustering: the annotated privacy fragments are encoded with the trained encoder and clustered to obtain the domain privacy prototypes. Based on the results of Algorithm 1, Algorithm 2 details the prototype-guided construction of the preference dataset.
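
The two stages of Algorithm 1 can be illustrated with a small standard-library sketch. Both components are assumptions: the text does not give the exact contrastive loss or the clustering algorithm, so an InfoNCE-style loss for a single anchor and plain k-means (initialized from the first k points) are used here as common stand-ins. All vectors are toy values, and no encoder is actually trained.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce(anchor, positives, negatives, temperature):
    """InfoNCE-style loss for one anchor (assumed form of the contrastive loss)."""
    pos = sum(math.exp(cosine(anchor, p) / temperature) for p in positives)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10):
    """Plain k-means; the centroids play the role of one domain's prototypes."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

# Stage 1: the loss is small when the anchor is close to its positive.
loss = info_nce([1.0, 0.0], [[1.0, 0.0]], [[0.0, 1.0]], temperature=0.5)

# Stage 2: cluster one domain's fragment embeddings into two prototypes.
domain_embs = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
prototypes = kmeans(domain_embs, k=2)
```

Running the clustering separately for each domain gives every domain its own prototype set, which is what the later domain inference compares fragments against.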

[0066] Algorithm 2: Preference Dataset Construction
Input: training corpus; reference model; prototypes; fragment encoder; hyperparameters

[0067] Output: preference dataset;
S1: Initialize the preference dataset as empty;
S2: For each input in the training corpus:
S3: Generate N candidate rewrites with the reference model;
// Composite reward function
S4: For each candidate rewrite:
S5: Calculate the privacy reward;
S6: Calculate the utility reward;
S7: Combine the two rewards into the overall score;
S8: Select the highest-scoring candidate as the preferred sample;
S9: Select the lowest-scoring candidate as the dispreferred sample;
S10: Add the preference pair to the preference dataset;
Algorithm 2 constructs the preference dataset based on the composite reward function. In S3, the input sample and its labeled privacy fragments are combined by adding special markers around the privacy-related spans in the original text, and the reference model is prompted to rewrite the marked spans, generating N different candidate rewrites. S4-S7 assign a reward score to each candidate based on the composite reward function that balances privacy and utility; S8-S9 select the preference pair based on the composite reward; and S10 builds the preference dataset from the preference pairs. Based on Algorithms 1 and 2, the complete offline training process can be constructed, which is detailed in Algorithm 3.
Algorithm 3: Model Alignment Training
Input: training corpus; annotated fragments; backbone model; reference model;
Output: optimized rewriting model;
// Prototype learning
S1: Obtain the fragment encoder and the domain privacy prototypes with Algorithm 1;
// Preference construction
S2: Obtain the preference dataset with Algorithm 2;
// Preference alignment
S3: Initialize the rewriting model from the reference model;
S4: While not converged:
S5: Sample a batch from the preference dataset;
S6: Update the rewriting model by minimizing the DPO loss;
Algorithm 3 describes the complete offline training. S1-S2 obtain the trained fragment encoder, the domain privacy prototypes, and the preference dataset through Algorithm 1 and Algorithm 2. On this basis, S3-S6 perform preference alignment with the DPO loss to obtain the rewriting model.

[0068] The online inference module is implemented in four parts: text segmentation, domain inference, privacy fragment localization, and differentially private fragment-level rewriting. The implementation details are as follows:
Algorithm 4: Online Inference Module
Input: user query; fragment encoder; prototypes; rewriting model; thresholds.

[0069] Output: rewritten query;
// Text segmentation
S1: Segment the user query into fragments;
S2: Compute the embedding of each fragment with the fragment encoder;
// Domain inference
S3: For each domain:
S4: For each fragment:
S5: Calculate the maximum cosine similarity between the fragment embedding and the prototypes of the domain;
S6: Domain score: the average of the maximum similarities over all fragments;
S7: Infer the global domain: the domain with the highest score;
// Privacy fragment localization
S8: Initialize the set of privacy fragments as empty;
S9: For each fragment:
S10: If its maximum similarity within the global domain exceeds the threshold:
S11: Add the fragment to the set of privacy fragments;
// Differentially private fragment-level rewriting
S12: Apply the rewriting model only to the detected fragments;
S13: Sample each regenerated token based on the exponential mechanism;
In Algorithm 4, S1-S2 segment the user's input and represent all fragments using the fragment encoder. S3-S7 infer the domain of the user's query: S5 calculates the maximum similarity between each fragment and the prototypes, S6 calculates the domain score as the average maximum similarity, and S7 takes the domain with the highest score as the domain of the user's query. S8-S11 determine whether a fragment is a privacy fragment by comparing its maximum similarity with the threshold. Finally, S12-S13 use the rewriting model to perform controlled rewriting only on the identified privacy fragments.

[0070] Example 2: Dataset: (1) The Pri-DDXPlus dataset focuses on the field of medical diagnosis. Each sample contains a description of a patient's symptoms and clearly marks the range of privacy-sensitive and non-sensitive information according to medical confidentiality standards. The task is designed as a multiple-choice question, with each sample containing one correct diagnosis and three randomly selected incorrect diagnoses as in-domain distractors.

[0071] (2) The Pri-SLJA dataset is analogous to Pri-DDXPlus but covers the domain of legal judgments. Each entry contains a detailed description of a legal case and distinguishes privacy-sensitive details from general background information through fragment-level annotation according to legal norms. Each sample contains one correct judgment and three random incorrect judgments.

[0072] (3) To evaluate the performance of the present invention in multi-domain scenarios, the Pri-Mixture dataset was constructed by adding cross-domain distractors to the two base datasets. Specifically, four law-related options were added to the original medical samples (Pri-DDXPlus), and four medicine-related options were added to the legal samples (Pri-SLJA). The expanded datasets were then merged into a unified multi-domain benchmark that tests the model's ability to identify the correct answer under distractors from both inside and outside the domain.

[0073] Comparison method: (1) No Rewriting: This method is the baseline method, which does not process the user input in any way and directly uploads it to the LLM for downstream inference.

[0074] (2) DP-Paraphrase: This method performs differentially private text rewriting with a fine-tuned paraphrasing model using differentially private temperature sampling.

[0075] (3) DP-Prompt: This method prompts a large language model to perform differentially private text rewriting, that is, to paraphrase the input text.

[0076] (4) DP-MLM: This method uses a masked language model to perform differential privacy per-token rewriting.

[0077] (5) PrivacyRestore: This method breaks down the process into client-side deletion and server-side restoration to enhance downstream utility.

[0078] Method variants: Two additional analysis settings were introduced to isolate the effect of localization. (1) Oracle variants (DP-MLM_oracle and DAMPER_oracle): these variants are given the ground-truth privacy fragments, so that the rewriting model can be evaluated independently of localization.

[0079] (2) DP-MLM_auto: a hybrid baseline that applies the DP-MLM rewriting model to the fragments detected by the prototype-guided locator.

[0080] Experimental metrics: A comprehensive set of metrics was used to evaluate the framework of this invention, with evaluation dimensions including three key aspects: downstream utility, semantic fidelity, and privacy protection.

[0081] Accuracy (Acc): Measures the prediction accuracy of the cloud-based large language model (LLM) on the processed query.

[0082] BERTScore (BS): Quantifies the semantic similarity between the original query and the rewritten output by using a pre-trained encoder to compute the cosine similarity of their contextual embeddings, thus measuring how much semantic content is preserved after rewriting.

[0083] LLM-Judge (LLM-J): Evaluates the overall quality and coherence of the generated text. First, based on the rewritten query, a response is generated using Qwen2.5-7B-Instruct, and then DeepSeek-V3 is used as the evaluator to assign quality scores on a scale of 1 to 10.

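[0084] Privacy F1-Score (PF1): To comprehensively evaluate the fragment localization module, the F1 score is used, balancing the trade-off between precision and recall. Precision is the proportion of predicted fragments that are real privacy fragments: $\text{Precision} = |\mathcal{S}_{\text{pred}} \cap \mathcal{S}_{\text{gt}}| \,/\, |\mathcal{S}_{\text{pred}}|$, where $\mathcal{S}_{\text{pred}}$ is the set of predicted privacy fragments and $\mathcal{S}_{\text{gt}}$ is the set of real privacy fragments.

[0085] Recall measures the proportion of real privacy fragments that the model successfully retrieves: $\text{Recall} = |\mathcal{S}_{\text{pred}} \cap \mathcal{S}_{\text{gt}}| \,/\, |\mathcal{S}_{\text{gt}}|$.

[0086] The PF1 metric combines the two as their harmonic mean: $\text{PF1} = 2 \cdot \text{Precision} \cdot \text{Recall} \,/\, (\text{Precision} + \text{Recall})$.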
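
A small worked example of the PF1 computation follows. It assumes that a predicted fragment counts as correct only on an exact string match with a real privacy fragment; the matching criterion is not stated in the text, and the fragment strings and the function name `privacy_f1` are invented for illustration.

```python
def privacy_f1(predicted, gold):
    """Precision, recall, and F1 over sets of fragments (exact match assumed)."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

pred = {"chest pain", "a dry cough", "it started Monday"}
gold = {"chest pain", "a dry cough", "a fever", "shortness of breath"}
p, r, f1 = privacy_f1(pred, gold)
# 2 of 3 predictions are correct and 2 of 4 real fragments are found:
# precision 2/3, recall 1/2, PF1 4/7
```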

[0087] ROUGE-L: This metric measures the structural similarity between a generated sequence and a reference sequence based on the longest common subsequence. In privacy-preserving scenarios, the ROUGE-L metric is calculated between the original sensitive information fragment and its rewritten version.

[0088] A lower ROUGE-L value indicates less word overlap, reflecting more effective obfuscation of sensitive information.
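
To make the metric concrete, the following sketch computes a ROUGE-L score from the longest common subsequence of two token sequences. Whitespace tokenization, lowercasing, and the balanced F-measure (equal weight on precision and recall) are assumptions of this example; published ROUGE implementations differ in tokenization and weighting. The two sentences are invented.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_f1(reference, candidate):
    """Balanced ROUGE-L F-measure on whitespace tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

original = "severe chest pain since last Monday"
rewritten = "some chest discomfort since recently"
score = rouge_l_f1(original, rewritten)
# The common subsequence is "chest since" (length 2), giving a score of 4/11.
```

An unchanged fragment scores 1.0, so lower values on privacy fragments indicate that less of the original wording survives the rewrite.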

[0089] Model and parameter settings: RoBERTa-base (Robustly Optimized BERT Pretraining Approach) is used as the backbone network of the fragment encoder, and Qwen2.5-1.5B-Instruct is used as both the reference model and the policy model for preference learning. The cloud-based LLM is instantiated with Qwen2.5-7B-Instruct. Key hyperparameters are configured as follows: ..., and the total privacy budget for all rewritten tokens in the query is set to [value]. The fragment encoder was fine-tuned on 149 medical and 142 legal privacy fragment types. Regarding dataset-specific hyperparameters, the settings were ... 5 and ...; the filtering threshold is set to 0.85 for Pri-DDXPlus and 0.80 for Pri-SLJA.

[0090] Table 1 compares the performance of the three datasets under two different privacy budgets.

[0091] Table 1 shows a performance comparison of the three datasets under two different privacy budgets. Accuracy (ACC) is expressed as a percentage, where "-" indicates that the method cannot report this metric, and DAMPER represents the present invention. The PrivacyRestore method does not report the BS metric in the original paper, so it is indicated by "-".

[0092] The results in Table 1 show that the present invention (DAMPER) exhibits excellent or highly competitive performance across all metrics.

[0093] (1) In terms of downstream utility (ACC), the present invention achieves an ACC of 79.02% on the Pri-Mixture dataset. Compared with the 88.45% of the No Rewriting baseline, this is a reduction of only 9.43 percentage points, whereas other comparable methods such as DP-MLM reach an ACC of only 28.93%. The invention thus preserves utility for downstream tasks to a large extent while protecting privacy. The advantage is particularly significant on the Pri-Mixture dataset, where other methods are affected by inter-domain interference. The invention effectively preserves domain boundaries, demonstrating that prototype-based methods can capture robust discriminative features that are unaffected by such interference.

[0094] (2) In terms of semantic consistency (BS), this invention performs best on the Pri-DDXPlus and Pri-Mixture datasets and is second only to DP-MLM_oracle on the Pri-SLJA dataset. This gap is expected because DP-MLM_oracle relies on ground-truth fragments. Although this invention performs autonomous inference without masks, it still maintains a high degree of semantic fidelity.

[0095] (3) In terms of generation quality (LLM-J), the present invention generally outperforms other methods. Although PrivacyRestore performs slightly better on the Pri-SLJA dataset under the lower privacy budget, it incurs significant overhead. In contrast, this invention achieves similar utility through efficient one-pass rewriting without the need for post-processing mechanisms.

[0096] Table 2 shows the robustness under oracle localization and automatic localization.

[0097] Table 2 evaluates the robustness of localization by benchmarking automatic inference against oracle settings that are given the ground-truth privacy fragments. This invention exhibits excellent robustness: on the Pri-DDXPlus dataset, accuracy remains stable, improving slightly from 78.12% to 78.34%; on the Pri-SLJA dataset, the degradation is limited, with accuracy going from 85.27% to 82.12%, whereas DP-MLM drops sharply on the same dataset, from 78.38% to 55.76%, indicating its strong dependence on perfect masks. On the Pri-Mixture dataset, the performance of this invention is comparable to that of DAMPER_oracle, with accuracies of 79.02% and 79.67%, respectively. In contrast, DP-MLM_auto regresses significantly relative to DP-MLM_oracle: DP-MLM achieves an accuracy of 59.60% under the oracle setting but only 49.63% under automatic localization. The results confirm that the prototype-guided alignment method enhances end-to-end performance and resists the influence of detection noise. Compared with common mask-based baseline methods, this invention provides a more robust solution.

[0098] Privacy fragments typically exhibit a long-tail distribution. To evaluate the model's generalization to unseen or rare instances, the proposed framework is trained using only the top k% most frequently occurring fragments and evaluated on the full test set. Figure 7 shows the recall of the system on the Pri-DDXPlus dataset for different top-k% subsets of the predefined privacy fragments. This invention is highly robust to lexical scarcity: even when only the 20% most frequent fragments are used during training, recall remains at 84.88%, and it decreases significantly only below this threshold. This confirms that the prototypes learned by this invention capture abstract semantic patterns rather than relying on surface-level memorization, enabling effective detection over long-tailed distributions.

[0099] Table 3 Performance Comparison in Individual Domains

[0100] Table 3 shows the performance comparison when a single dataset is used during the training phase. DAMPER_multi denotes the present invention trained jointly on the Pri-DDXPlus and Pri-SLJA datasets (with evaluation performed on the corresponding dataset), while DAMPER_single denotes the invention trained on a single dataset only. Performance on the two single-domain datasets is almost equivalent to that of multi-domain training. On the Pri-DDXPlus dataset, both rewriting quality and task accuracy decrease slightly, while on the Pri-SLJA dataset, single-domain training performs slightly better. These results demonstrate that DAMPER can effectively adapt to multi-domain rewriting tasks.

[0101] Table 4 Comparison of computational costs

[0102] Five methods, including the present invention (DAMPER), were evaluated for privacy localization and rewriting on 1013 samples from Pri-DDXPlus and 509 samples from Pri-SLJA, and their respective time costs are reported. As shown in Table 4, DP-MLM and its variants were the most time-consuming on both datasets, while DP-Paraphrase was the least time-consuming. The runtime of the present invention was almost the same as that of DP-Prompt. The runtime of DP-MLM and its variants increased significantly with the length of the input text, while the other methods were largely insensitive to text length.

[0103] To rigorously evaluate the privacy robustness of this invention, two types of adversarial evaluations were performed on the fragment-level rewriting method.

[0104] (1) Embedding inversion attack: This attack attempts to reconstruct sensitive fragments directly from the embedding of the user's input by training a dedicated inversion model. The potential leakage is quantified using ROUGE-L, a metric that measures the longest-common-subsequence overlap between the reconstructed privacy fragment and the real privacy fragment. Figure 8 shows the system's resistance to embedding inversion attacks under different privacy budgets, with the degree of privacy leakage quantified by the ROUGE-L metric; the present invention maintains a high level of protection comparable to the other defense baselines.
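
For reference, the leakage metric used in both adversarial evaluations can be sketched as follows. This is a minimal Python illustration of ROUGE-L, not the evaluation code of the invention. It assumes whitespace tokenization, lowercasing, and the balanced F1 variant of the score; the example strings are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between a reference text and a candidate text.

    Assumptions: whitespace tokens, case-insensitive, equal weight on
    precision and recall.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


original = "chest pain for three days"
print(rouge_l_f1(original, original))                    # full recovery
print(rouge_l_f1(original, "some discomfort recently"))  # nothing recovered
```

A score near 1 means that the attacker recovered the private text almost verbatim, so lower values indicate stronger protection.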

[0105] (2) Prompt injection attack: This attack aims to extract the original input from the cloud-based LLM by injecting adversarial prompts into the rewritten text. The attack success rate is measured by the ROUGE-L similarity between the recovered input and the original text. Figure 9 shows the system's resistance to prompt injection attacks under different privacy budgets, with the ROUGE-L metric measuring the attack success rate. The present invention and DP-MLM achieve similar defense capabilities under both the oracle and automatic configurations. Notably, the automatic configuration consistently provides stronger privacy protection (i.e., lower leakage) than the oracle configuration, because conservative detection typically masks additional contextual information.

[0106] In summary, the comparative experiments and analysis results on multiple privacy-sensitive domain datasets demonstrate that the client-oriented domain-aware maskless privacy identification and rewriting system proposed in this invention can automatically identify and effectively rewrite privacy information without requiring users to explicitly annotate privacy fragments. The experimental results show that it maintains good overall performance in terms of downstream task accuracy, semantic consistency, and generated text quality, indicating that it achieves an effective balance between privacy protection and text usability.

[0107] Meanwhile, in the robustness experiment comparing automatic localization with ground-truth privacy fragments, the method of the present invention maintained stable performance even in the presence of detection noise, verifying its low sensitivity to localization errors. Furthermore, in the experiments covering multi-domain mixed scenarios, long-tail privacy fragment distributions, and computational overhead, it demonstrated good generalization ability and engineering feasibility.

[0108] The above experimental results further demonstrate that the present invention can stably achieve privacy protection in actual client deployment environments and meet the comprehensive requirements of privacy-sensitive applications for security and practicality.

[0109] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A client-oriented domain-aware maskless privacy identification and rewriting system, characterized in that it comprises: an offline training module, used to generate domain privacy prototypes that represent privacy semantics in different domains based on multi-domain labeled privacy fragments, and to train an initial rewriting model based on the domain privacy prototypes to obtain a privacy rewriting model; and an online inference module, deployed on the client, used to receive user input text, automatically identify privacy fragments in the user input text using the domain privacy prototypes, and call the privacy rewriting model to rewrite the identified privacy fragments.

2. The client-oriented domain-aware maskless privacy identification and rewriting system according to claim 1, characterized in that the offline training module includes: a prototype learning unit, used to train a segment encoder through a multi-domain contrastive learning method and to cluster the domain privacy fragment representations encoded by the segment encoder to generate a domain privacy prototype set for each domain; wherein, in the multi-domain contrastive learning method, for an anchor privacy fragment, its positive examples are other privacy fragments from the same domain, and its negative examples include privacy fragments from other domains and non-privacy fragments from all domains.

3. The client-oriented domain-aware maskless privacy identification and rewriting system according to claim 2, characterized in that the offline training module also includes: a preference construction unit, used to automatically evaluate, through a composite reward function based on the domain privacy prototypes, the quality of multiple candidate rewrites for the same input sample, so as to construct a preference dataset, wherein the composite reward function includes a privacy reward term and a domain utility reward term; and a model alignment unit, used to train the initial rewriting model on the preference dataset using the direct preference optimization method to obtain the privacy rewriting model.

4. The client-oriented domain-aware maskless privacy identification and rewriting system according to claim 3, characterized in that: the privacy reward term is used to evaluate the degree of privacy protection of a candidate rewrite based on the semantic difference between the rewritten privacy fragment and the corresponding original privacy fragment; the domain utility reward term is used to evaluate the degree of domain style preservation of a candidate rewrite based on the semantic similarity between the rewritten privacy fragment and the privacy prototypes of its domain; and the preference construction unit is specifically used to perform a weighted fusion of the privacy reward term and the domain utility reward term to obtain a comprehensive quality score for each candidate rewrite, and to select the candidate rewrites with the highest and lowest scores to form a preference pair.

5. The client-oriented domain-aware maskless privacy identification and rewriting system according to claim 1, characterized in that the online inference module includes: a text segmentation unit, used to segment the user input text into semantically coherent segments; and a localization unit, used to encode each of the segments into a vector using the segment encoder, infer the global domain of the user input text based on the similarity between each of the vectors and each of the domain privacy prototypes, and determine the privacy fragments within the inferred domain according to a similarity threshold.

6. The client-oriented domain-aware maskless privacy identification and rewriting system according to claim 5, characterized in that the localization unit infers the global domain as follows: calculating the maximum similarity between each segment of the input text and each domain prototype set; taking the average, over all segments, of the maximum similarity to the prototype set of a given domain as the score of that domain; and determining the domain with the highest score as the global domain of the user input text.

7. The client-oriented domain-aware maskless privacy identification and rewriting system according to claim 5, characterized in that the online inference module also includes: a rewriting unit, used to call the privacy rewriting model to rewrite and generate only the identified privacy fragments, and to deterministically retain the non-privacy fragments; wherein the rewriting unit integrates, during rewriting generation, a differential privacy sampling mechanism based on the exponential mechanism, which clips the logit vector output by the model to a predetermined range and performs softmax sampling with a temperature parameter τ calibrated according to a preset privacy budget, so as to provide a differential privacy guarantee for each rewritten privacy fragment.

8. A client-oriented domain-aware maskless privacy identification and rewriting method, characterized in that it includes the following steps: an offline training step: based on multi-domain labeled privacy fragments, generating domain privacy prototypes that represent privacy semantics in different domains, and training an initial rewriting model based on the domain privacy prototypes to obtain a privacy rewriting model; and an online inference step: receiving, at the client, user input text, automatically identifying privacy fragments in the user input text using the domain privacy prototypes, and calling the privacy rewriting model to rewrite the identified privacy fragments.

9. The client-oriented domain-aware maskless privacy identification and rewriting method according to claim 8, characterized in that the offline training step further includes: training a fragment encoder using a multi-domain contrastive learning method, and clustering the domain privacy fragment representations encoded by the fragment encoder to generate a set of domain privacy prototypes for each domain; based on the domain privacy prototypes, automatically evaluating the quality of multiple candidate rewrites for the same input sample through a composite reward function to construct a preference dataset, wherein the composite reward function includes a privacy reward term for evaluating the degree of privacy protection and a domain utility reward term for evaluating the degree of domain style preservation; and, based on the preference dataset, training the initial rewriting model using the direct preference optimization method to obtain the privacy rewriting model.

10. The client-oriented domain-aware maskless privacy identification and rewriting method according to claim 8, characterized in that the online inference step further includes: segmenting the user input text into semantically coherent segments; encoding each segment into a vector using the fragment encoder, inferring the global domain of the user input text based on the similarity between each vector and each domain privacy prototype, and determining the privacy fragments within the inferred domain based on a similarity threshold; and invoking the privacy rewriting model to rewrite and generate only the identified privacy fragments, wherein, during generation, the logit vector output by the model is clipped and softmax sampling is performed with a temperature parameter τ calibrated according to a preset privacy budget, so as to achieve differential privacy, and the non-privacy fragments are deterministically preserved.
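
As an illustration of the online inference steps recited in claims 5 to 7 and 10, the following is a minimal Python sketch, not the claimed implementation. It assumes cosine similarity, toy three-dimensional vectors in place of encoder outputs, and the calibration τ = 2(b_max − b_min)/ε that follows from the standard exponential mechanism analysis for logits clipped to [b_min, b_max]; the claims do not fix these details. All function names, the threshold 0.9, and the clipping range are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def max_similarity(vec, prototypes):
    """Highest similarity between one segment vector and a prototype set."""
    return max(cosine(vec, p) for p in prototypes)


def infer_global_domain(segment_vecs, domain_prototypes):
    """Score each domain by the mean, over segments, of the max similarity."""
    scores = {
        domain: sum(max_similarity(v, protos) for v in segment_vecs) / len(segment_vecs)
        for domain, protos in domain_prototypes.items()
    }
    return max(scores, key=scores.get), scores


def locate_privacy_segments(segment_vecs, prototypes, threshold):
    """Indices of segments whose max similarity reaches the threshold."""
    return [i for i, v in enumerate(segment_vecs)
            if max_similarity(v, prototypes) >= threshold]


def dp_sample_token(logits, epsilon, clip_min, clip_max, rng):
    """Clip logits, then sample an index from softmax(logits / tau).

    Assumed calibration: tau = 2 * (clip_max - clip_min) / epsilon.
    """
    tau = 2.0 * (clip_max - clip_min) / epsilon
    clipped = [min(max(x, clip_min), clip_max) for x in logits]
    top = max(clipped)  # subtract the max for numerical stability
    weights = [math.exp((x - top) / tau) for x in clipped]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


# Hypothetical prototypes and segment vectors (stand-ins for encoder outputs).
domain_prototypes = {
    "medical": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "legal": [[0.0, 1.0, 0.0]],
}
segment_vecs = [[0.95, 0.05, 0.0], [0.1, 0.1, 0.9], [0.8, 0.2, 0.1]]

domain, scores = infer_global_domain(segment_vecs, domain_prototypes)
private_idx = locate_privacy_segments(segment_vecs, domain_prototypes[domain], 0.9)
print(domain, private_idx)

# One sampling step for a rewritten token over a toy three-token vocabulary.
rng = random.Random(0)
token = dp_sample_token([0.0, 10.0, 2.0], epsilon=4.0,
                        clip_min=-5.0, clip_max=5.0, rng=rng)
print(token)
```

In this sketch only the segments listed in `private_idx` would be passed to the rewriting model, while the remaining segments are copied through unchanged. A smaller ε yields a larger τ and therefore a flatter, more random token distribution.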