Data processing method
By automatically identifying and applying appropriate desensitization strategies through data processing models, the problems of fragmented desensitization strategies and omission of implicit sensitive information in medical data are solved, achieving unified privacy protection and semantic preservation, and improving data usability and security.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ALI HEALTH TECH CO LTD
- Filing Date
- 2026-01-28
- Publication Date
- 2026-06-19
Figure CN122241746A_ABST
Abstract
Description
Technical Field
[0001] The embodiments in this specification relate to the field of computer technology, and in particular to a data processing method.
Background Technology
[0002] In medical settings, patient data is needed when training medical-related models. To protect patient privacy, patient data is usually anonymized before the medical model is trained.
[0003] When data sensitivity levels are classified by human judgment or expert experience, and data at different sensitivity levels are desensitized to different degrees, the desensitization strategy becomes fragmented because different experts or departments apply inconsistent classification standards (for example, the psychology department may regard diagnostic descriptions as highly sensitive, while the surgery department may pay more attention to identity codes). Moreover, implicit sensitive information is difficult to identify by human experience alone, so contextual associations risk being missed.
Summary of the Invention
[0004] In view of the above, embodiments of this specification provide a data processing method. One or more embodiments of this specification also relate to a data processing apparatus, a computing device, a computer-readable storage medium, and a computer program product, to address the technical deficiencies existing in the prior art.
[0005] According to a first aspect of the embodiments of this specification, a data processing method is provided, comprising: obtaining the text to be processed and the prompt text corresponding to the text to be processed, wherein the prompt text contains at least one sensitive attribute category and the desensitization strategy corresponding to each sensitive attribute category; and inputting the text to be processed and the prompt text into the data processing model to obtain the target text output by the data processing model, wherein the target text is obtained by desensitizing at least one sensitive field in the text to be processed, the sensitive field is determined in the text to be processed according to the sensitive attribute category, and each sensitive field is desensitized through the desensitization strategy corresponding to its sensitive attribute category.
[0006] According to a second aspect of the embodiments of this specification, a data processing apparatus is provided, comprising: an acquisition module configured to acquire the text to be processed and the prompt text corresponding to the text to be processed, wherein the prompt text contains at least one sensitive attribute category and the desensitization strategy corresponding to each sensitive attribute category; and a desensitization module configured to input the text to be processed and the prompt text into a data processing model to obtain the target text output by the data processing model, wherein the target text is obtained by desensitizing at least one sensitive field in the text to be processed, the sensitive field is determined in the text to be processed according to the sensitive attribute category, and each sensitive field is desensitized through the desensitization strategy corresponding to its sensitive attribute category.
[0007] According to a third aspect of the embodiments of this specification, a computing device is provided, comprising a memory and a processor, wherein the memory is used to store a computer program / instructions, and the processor is used to execute the computer program / instructions, which, when executed by the processor, implement the steps of the above-described data processing method.
[0008] According to a fourth aspect of the embodiments of this specification, a computer-readable storage medium is provided that stores a computer program / instructions that, when executed by a processor, implement the steps of the data processing method described above.
[0009] According to a fifth aspect of the embodiments of this specification, a computer program product is provided, including a computer program / instructions that, when executed by a processor, implement the steps of the above-described data processing method.
[0010] The data processing method provided in one embodiment of this specification inputs preset sensitive attribute categories and their corresponding desensitization strategies, as prompt text, into the data processing model together with the text to be processed. This effectively solves the problem of strategy fragmentation caused by traditional reliance on expert experience for sensitivity classification. The data processing model can uniformly perform desensitization based on the same set of configurable desensitization strategies. Moreover, based on contextual understanding capabilities, the data processing model can accurately identify and process implicit sensitive information that may be missed by humans. Furthermore, when different sensitive attribute categories correspond to different desensitization strategies, it can dynamically achieve reasonable desensitization processing, reduce information loss caused by crude masking, achieve more comprehensive and unified desensitization processing, and obtain target text that balances semantic preservation and strong privacy protection.
Attached Figure Description
[0011] Figure 1 is a schematic diagram illustrating a data processing method provided in one embodiment of this specification; Figure 2 is a flowchart illustrating a data processing method provided in one embodiment of this specification; Figure 3 is a schematic diagram of the processing procedure of a data processing method provided in one embodiment of this specification; Figure 4 is a schematic diagram of the structure of a data processing apparatus provided in one embodiment of this specification; Figure 5 is a structural block diagram of a computing device provided in one embodiment of this specification.
Detailed Implementation
[0012] Many specific details are set forth in the following description to provide a full understanding of this specification. However, this specification can be implemented in many other ways than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of this specification. Therefore, this specification is not limited to the specific implementations disclosed below.
[0013] The terminology used in one or more embodiments of this specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of this specification. The singular forms “a,” “said,” and “the” as used in one or more embodiments of this specification and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used in one or more embodiments of this specification refers to and includes any or all possible combinations of one or more associated listed items.
[0014] It should be understood that although the terms first, second, etc., may be used to describe various information in one or more embodiments of this specification, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, first may also be referred to as second without departing from the scope of one or more embodiments of this specification, and similarly, second may also be referred to as first. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to a determination."
[0015] Furthermore, it should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in one or more embodiments of this specification are all information and data authorized by the user or fully authorized by all parties. Moreover, the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation entry points are provided for users to choose to authorize or refuse.
[0016] First, the terms and concepts used in one or more embodiments of this specification will be explained.
[0017] Reversible desensitization: Under the premise of protecting patient privacy, sensitive information is converted into an unidentifiable form through encryption or encoding technology, while retaining the reversibility of the original data.
[0018] Direct identifiers: Information that can directly identify a patient (such as name, identification code, or telephone number).
Format-preserving encryption: Encrypted data retains its original format (e.g., an encrypted medical record number is still a numerical sequence).
Context-aware replacement: Sensitive entities are dynamically replaced based on text semantics.
[0019] When a predictive model is trained on patient data, that data is obtained from real patient information. For confidentiality reasons, it cannot be used directly for training; the private information in it must first be desensitized. However, the desensitized data may suffer clinical semantic distortion, which will affect the training effect.
[0020] Therefore, there is an urgent need for a reliable desensitization method that can balance the preservation of model semantics with strong privacy protection.
[0021] This specification provides a data processing method, and also relates to a data processing apparatus, a computing device, a computer-readable storage medium, and a computer program product, which will be described in detail in the following embodiments.
[0022] Referring to Figure 1, a schematic diagram of a data processing method according to an embodiment of this specification is shown.
[0023] Specifically, the data processing method is applied to a data processing system, which includes an end-side device 102 and a server 104. The end-side device 102 is used to send the text to be processed and the prompt text corresponding to the text to be processed to the server 104. The prompt text contains at least one sensitive attribute category and a desensitization strategy corresponding to each sensitive attribute category.
[0024] A data processing model is trained in the server 104. The server 104 obtains the text to be processed and the corresponding prompt text, and inputs them into the data processing model to obtain the target text output by the data processing model. The target text is obtained by desensitizing at least one sensitive field in the text to be processed; the sensitive field is determined in the text to be processed according to the sensitive attribute category, and each sensitive field is desensitized through the desensitization strategy corresponding to its sensitive attribute category. The server 104 then returns the target text to the end-side device 102.
[0025] The end-side device 102 may include a browser, an app (application), a web application such as an H5 (HyperText Markup Language 5) application, a lightweight application (also known as a mini-program), or a cloud application. The end-side device 102 can be developed based on a software development kit (SDK) provided by the server, such as a real-time communication (RTC) SDK. The end-side device 102 can be deployed in an electronic device and relies on the device itself, or on certain apps within the device, to run. The electronic device may have a display screen and support information browsing, such as a personal mobile terminal like a mobile phone, tablet, or personal computer. Various other types of applications can also be configured in the electronic device, such as human-computer interaction applications, model training applications, data processing applications, web browser applications, shopping applications, search applications, instant messaging tools, email clients, and social media platform software.
[0026] Server 104 can be understood as a server providing various services, including physical servers and cloud servers. Examples include servers providing communication services to multiple clients, servers supporting backend training of models used on clients, and servers processing data sent by clients. It's important to note that Server 104 can be implemented as a distributed server cluster composed of multiple servers, or as a single server. Server 104 can also be a server in a distributed system, or a server integrated with blockchain. Server 104 can also be a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology.
[0027] It is worth noting that the data processing method provided in the embodiments of this specification can be executed by the server 104. In other embodiments of this specification, the data processing model can be deployed in the end-side device 102, so that the end-side device 102 can also have similar functions to the server 104, thereby executing the data processing method provided in the embodiments of this specification. In other embodiments, the data processing method provided in the embodiments of this specification can also be jointly executed by the end-side device 102 and the server 104.
[0028] Referring to Figure 2, a flowchart of a data processing method provided in one embodiment of this specification is shown, which specifically includes the following steps.
[0029] Step 202: Obtain the text to be processed and the prompt text corresponding to the text to be processed, wherein the prompt text contains at least one sensitive attribute category and the desensitization strategy corresponding to each sensitive attribute category.
[0030] The text to be processed can be understood as text containing sensitive information. This text is the original text to be desensitized. After desensitization, it can be used in scenarios such as model training and data analysis. The prompt text is used to guide the data processing model in desensitizing the text to be processed.
[0031] Sensitive attribute categories can be understood as data attribute types defined based on the functional attributes of data in a specific scenario. Data is considered sensitive data if it belongs to any of these sensitive attribute categories. Desensitization strategies can be understood as rules for desensitizing sensitive data. Desensitization strategies include, but are not limited to, the removal desensitization strategy, the semantic placeholder replacement strategy, the format-preserving replacement strategy, the generalization desensitization strategy, and the context-aware replacement strategy.
[0032] Specifically, taking the medical scenario as an example, patient data (i.e., text to be processed) is obtained from multiple data sources. The patient data contains structured fields (such as age, date of visit, medical record number, etc.) and unstructured text (such as medical history, diagnosis description, etc.). Since the patient data contains real patient identity and clinical information, it belongs to personal health information that is strictly protected by law and ethics, so it needs to be desensitized before it can be used.
[0033] Obtain the prompt text corresponding to the text to be processed. The prompt text contains at least one sensitive attribute category and the desensitization strategy corresponding to each sensitive attribute category. That is, the prompt text contains the mapping relationship between sensitive attribute categories and corresponding desensitization strategies. By configuring the structured mapping relationship in advance, the prompt text provides clear and reusable execution guidance for the desensitization process.
[0034] In practical applications, the mapping relationships contained in the prompt text can be dynamically expanded. That is, new sensitive attribute categories and corresponding desensitization strategies can be added, or new desensitization strategies can be added for existing sensitive attribute categories. There are no restrictions here.
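As an illustration of the mapping described above, the following minimal Python sketch assembles a prompt text from a category-to-strategy mapping and then extends the mapping with a new category. The category names, the wording of the rendered prompt, and the `build_prompt_text` helper are hypothetical; this specification does not prescribe a concrete prompt format.

```python
# Hypothetical mapping from sensitive attribute category to desensitization strategy.
STRATEGY_MAP = {
    "direct identifier": "format-preserving replacement",
    "name": "removal",
}


def build_prompt_text(strategy_map):
    """Render the mapping as plain-text rules for the data processing model."""
    lines = ["Desensitize the text below according to these rules:"]
    for category, strategy in strategy_map.items():
        lines.append(f"- {category}: {strategy}")
    return "\n".join(lines)


prompt_text = build_prompt_text(STRATEGY_MAP)

# The mapping can be extended without changing the rendering code.
extended_map = {**STRATEGY_MAP, "address": "semantic placeholder replacement"}
extended_prompt_text = build_prompt_text(extended_map)
```

Keeping the mapping as data, separate from the rendering step, is one way to make the prompt text reusable and extensible in the sense described above.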
[0035] Step 204: Input the text to be processed and the prompt text into the data processing model to obtain the target text output by the data processing model. The target text is obtained by desensitizing at least one sensitive field in the text to be processed. The sensitive field is determined according to the sensitive attribute category in the text to be processed. Each sensitive field is desensitized through a desensitization strategy corresponding to the sensitive attribute category.
[0036] The data processing model can be understood as a large language model built on natural language processing. This model can identify and transform sensitive information based on the input rules and text. The target text can be understood as the text generated after anonymization; it contains no sensitive information and thus meets privacy protection requirements.
[0037] Sensitive fields can be understood as specific data fragments in the text to be processed that belong to a certain sensitive attribute category, such as names, identification codes, addresses, etc. in the text to be processed; desensitization can be understood as the process of converting the original sensitive fields into non-sensitive fields by applying corresponding desensitization strategies.
[0038] Specifically, taking a medical scenario as an example, for a text to be processed containing a patient's medical record number "123456", the prompt text contains a defined "direct identifier - format-preserving replacement strategy". The data processing model will identify that "123456" is a sensitive field and replace it with content such as "654321" that conforms to the original format but contains false information, and finally generate a target text that does not reveal the real identity.
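The idea of a replacement that keeps the original format can be sketched in Python as follows. This is only an illustrative stand-in, not the model's actual behavior and not cryptographic format-preserving encryption: the hypothetical helper `format_preserving_replace` swaps each digit for a random digit and each ASCII letter for a random letter of the same case, and leaves every other character unchanged.

```python
import random
import string


def format_preserving_replace(value, rng):
    """Replace digits with random digits and ASCII letters with random letters
    of the same case; all other characters are kept unchanged."""
    out = []
    for ch in value:
        if ch in string.digits:
            out.append(rng.choice(string.digits))
        elif ch in string.ascii_uppercase:
            out.append(rng.choice(string.ascii_uppercase))
        elif ch in string.ascii_lowercase:
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)


rng = random.Random(0)  # fixed seed only so that the sketch is repeatable
fake_number = format_preserving_replace("123456", rng)
```

Because the replacement is random, it could by chance reproduce the original value; a practical implementation would need to check for and reject that case.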
[0039] In one or more embodiments of this specification, when the text to be processed and the prompt text are input into the data processing model, the data processing model can classify the text fields contained in the text to be processed to determine whether each text field belongs to any sensitive attribute category. Text fields that do are identified as sensitive fields and are then desensitized. Specifically, inputting the text to be processed and the prompt text into the data processing model to obtain the target text output by the data processing model includes: inputting the text to be processed and the prompt text into the data processing model; using the data processing model to classify the text fields in the text to be processed, and identifying the text fields belonging to any sensitive attribute category as sensitive fields; and desensitizing at least one sensitive field in the text to be processed to obtain the target text.
[0040] In this context, a text field can be understood as a data unit with independent semantics or structure within the text to be processed; it can be a single word, a phrase, or a string conforming to a specific format. Classification can be understood as using the data processing model to automatically identify and determine the category to which each text field belongs, based on the sensitive attribute categories defined in the prompt text. A sensitive field can be understood as an identified text field that belongs to a sensitive attribute category and is therefore marked as an object requiring desensitization.
[0041] Specifically, taking a medical scenario as an example, when a text to be processed containing "Patient Zhang San, age 45, blood routine re-examination" and corresponding prompt text are input into the data processing model, the data processing model will first segment the text to be processed to obtain multiple text fields (such as "Zhang San" and "45 years old"), and then classify the multiple text fields.
[0042] If the prompt text defines "name" and "age" as sensitive attribute categories, the data processing model will identify "Zhang San" and "45 years old" as sensitive fields and process them according to their corresponding desensitization strategies. However, "re-examination of blood routine" does not belong to the defined sensitive attribute categories, so it can be retained. This ensures that the final target text protects patient privacy while retaining necessary clinical semantic information for subsequent analysis.
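The selection step in this example can be sketched in Python. The classification result below is written by hand as a stand-in for what the data processing model might produce; the category label "clinical item" and the helper `select_sensitive_fields` are hypothetical.

```python
# Hand-written stand-in for the model's classification of
# "Patient Zhang San, age 45, blood routine re-examination".
classified_fields = [
    ("Zhang San", "name"),
    ("45 years old", "age"),
    ("blood routine re-examination", "clinical item"),
]

# Sensitive attribute categories defined in the prompt text.
sensitive_categories = {"name", "age"}


def select_sensitive_fields(fields, categories):
    """Keep only the fields whose category is a defined sensitive attribute category."""
    return [(text, category) for text, category in fields if category in categories]


sensitive_fields = select_sensitive_fields(classified_fields, sensitive_categories)
```

Fields whose category is not listed in the prompt text, such as the blood routine re-examination, are left out of `sensitive_fields` and therefore stay in the text unchanged.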
[0043] The data processing method provided in the embodiments of this specification efficiently identifies sensitive fields in the text to be processed by automatically recognizing and classifying its text fields, and ensures data security by desensitizing those sensitive fields.
[0044] In one or more embodiments of this specification, when sensitive fields are desensitized, the desensitization strategy corresponding to the sensitive attribute category to which each sensitive field belongs is determined based on the correspondence, given in the prompt text, between sensitive attribute categories and desensitization strategies. Each sensitive field is then desensitized using its corresponding strategy. Specifically, desensitizing at least one sensitive field in the text to be processed to obtain the target text includes: determining the desensitization strategy corresponding to each sensitive field based on the sensitive attribute category of each sensitive field in the text to be processed; desensitizing each sensitive field according to its corresponding desensitization strategy to obtain the target fields; and updating the text to be processed according to the target fields to obtain the target text.
[0045] In this context, a target field can be understood as the non-sensitive field, meeting privacy protection requirements, that results from processing a sensitive field with its corresponding desensitization strategy. The target text can be understood as safe and usable text formed by replacing the sensitive fields in the text to be processed with the target fields while maintaining the structural and semantic integrity of the rest of the text.
[0046] Specifically, taking a medical scenario as an example, when the data processing model identifies the sensitive fields "Patient ID: P123456" and "Contact Number: 12345678" in the patient's medical record text, it matches the corresponding desensitization strategy (such as format-preserving replacement) according to the sensitive attribute category (such as "direct identifier") of the sensitive fields. Thus, according to the corresponding desensitization strategy, "P123456" is converted to "P654321" and "12345678" is converted to "87654321", generating the corresponding target fields.
[0047] By replacing the corresponding sensitive fields in the text to be processed with these desensitized target fields, a target text that retains the original format but hides the real information is generated, such as "Patient ID: P654321, Contact number: 87654321".
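The three steps above (determine the strategy, desensitize the field, update the text) can be sketched in Python. The lookup tables and helper names are hypothetical, and `reverse_digits` is a toy replacement chosen only because it reproduces the illustrative values in this example; it offers no real privacy protection.

```python
def reverse_digits(value):
    """Toy format-preserving replacement: reverse the order of the digits and
    keep every non-digit character in place. Not a secure method."""
    reversed_digits = iter([ch for ch in value if ch.isdigit()][::-1])
    return "".join(next(reversed_digits) if ch.isdigit() else ch for ch in value)


# Hypothetical lookup tables: category -> strategy name -> implementation.
CATEGORY_TO_STRATEGY = {"direct identifier": "format-preserving replacement"}
STRATEGIES = {"format-preserving replacement": reverse_digits}


def desensitize(text, sensitive_fields):
    """Replace each sensitive field with the target field produced by the
    strategy that corresponds to its sensitive attribute category."""
    for field, category in sensitive_fields:
        strategy = STRATEGIES[CATEGORY_TO_STRATEGY[category]]
        text = text.replace(field, strategy(field))
    return text


source_text = "Patient ID: P123456, Contact number: 12345678"
target_text = desensitize(
    source_text,
    [("P123456", "direct identifier"), ("12345678", "direct identifier")],
)
```

Note that `str.replace` substitutes every occurrence of a field; a sketch that tracked character offsets would be needed if the same string could appear both as a sensitive and as a non-sensitive field.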
[0048] The data processing method provided in the embodiments of this specification determines the corresponding desensitization strategy by determining the sensitive attribute category to which each sensitive field belongs. This can avoid information loss caused by indiscriminate desensitization of different sensitive information using a single mask or encryption method. By dynamically selecting the desensitization strategy based on field attributes, it can ensure the complete protection of privacy data while retaining as much scene-related semantic information as possible, thereby enhancing the usability and readability of the target text.
[0049] In one or more embodiments of this specification, when the data processing model is used to desensitize the text to be processed, it can generate not only the desensitized target text but also a desensitization mapping table containing the sensitive fields and their corresponding target fields. Specifically, inputting the text to be processed and the prompt text into the data processing model to obtain the target text output by the data processing model includes: inputting the text to be processed and the prompt text into the data processing model to obtain the desensitization mapping table and the target text output by the data processing model, wherein the desensitization mapping table contains the mapping relationship between each sensitive field in the text to be processed and the corresponding target field.
[0050] The desensitization mapping table can be understood as a structured record file used to record the correspondence between sensitive fields in the text to be processed and the target fields generated after desensitization. The mapping relationship can be understood as the association information between sensitive fields and corresponding target fields. This mapping relationship is usually stored in key-value pairs or tables to ensure the traceability of the processing.
[0051] Specifically, after receiving the text to be processed and the prompt text, the data processing model first identifies the sensitive fields in the text to be processed, and then performs corresponding desensitization processing on each sensitive field according to the desensitization strategy in the prompt text, obtaining the target field corresponding to each sensitive field and generating the target text. Based on each sensitive field and its corresponding target field, a desensitization mapping table is created and output. This desensitization mapping table clearly records each original sensitive field and its corresponding desensitized target field.
[0052] For example, in a medical data processing scenario, the text to be processed contains the sensitive field "medical record number: P123456". After processing by the data processing model, the generated target field is "medical record number: P654321". The desensitization mapping table output by the data processing model therefore contains the mapping record P123456-P654321.
[0053] In practical applications, once a desensitization mapping table is obtained, it is stored in a secure database that is physically isolated from the training environment. This allows the model training platform to access the target text during subsequent model training, while maintaining physical isolation from the original text to be processed. This significantly reduces the risk of data leakage. The secure database can be accessed under authorized scenarios (such as ethical review and result verification) to ensure that the target text can be safely and reliably restored to the original text to be processed when data needs to be traced or verified.
[0054] The data processing method provided in the embodiments of this specification can not only generate directly usable target text during desensitization, but also retain the possibility of restoring or tracing the original information through the desensitization mapping table, thus taking into account both compliance and practicality.
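A desensitization mapping table stored as key-value pairs, and its use for restoration, can be sketched in Python as follows. The helper names are hypothetical, and the restoration shown assumes that each target field occurs in the target text only where it replaced a sensitive field.

```python
def build_mapping_table(pairs):
    """Record each original sensitive field against its desensitized target field."""
    return {original: target for original, target in pairs}


def restore(target_text, mapping_table):
    """Restore the original text by substituting each target field back."""
    for original, target in mapping_table.items():
        target_text = target_text.replace(target, original)
    return target_text


mapping_table = build_mapping_table([("P123456", "P654321")])
restored_text = restore("medical record number: P654321", mapping_table)
```

In line with the isolation described above, such a table would be kept apart from the training environment and consulted only in authorized scenarios.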
[0055] In one or more embodiments of this specification, the sensitive attribute category includes a first attribute category, which is an attribute category related to user identity identification. When a sensitive field belongs to the first attribute category, it is further determined whether the sensitive field contains scene semantic information. If it does not, the sensitive field is removed to ensure patient data security. If it does, a semantic placeholder replaces the sensitive field so that the semantic information is preserved. Specifically, using the data processing model to classify the text fields in the text to be processed and identifying text fields belonging to any sensitive attribute category as sensitive fields includes: using the data processing model to classify the text fields in the text to be processed, and identifying the text fields belonging to the first attribute category as first sensitive fields. Desensitizing each sensitive field according to its corresponding desensitization strategy to obtain the target field includes: for a first sensitive field that does not contain scene semantic information, removing the first sensitive field using the removal desensitization strategy corresponding to the first attribute category; and for a first sensitive field that contains scene semantic information, desensitizing the first sensitive field using the semantic placeholder replacement strategy corresponding to the first attribute category to obtain the target field. The semantic placeholder replacement strategy replaces the first sensitive field with a structured target field that is related to the scene semantic information.
[0056] The first attribute category can be understood as data types that can directly identify an individual, such as name, ID number, phone number, and detailed address. Scene semantic information can be understood as auxiliary information that, in addition to identifying an individual, carries meaning for subsequent tasks within a given scenario (such as medical diagnosis or statistical analysis). For example, the province, city, and district level information in "Address: Province A, City B, District C, Community D, Unit E" has geographical analysis value. The removal desensitization strategy can be understood as a strategy that completely deletes sensitive fields without leaving any placeholders or markers.
[0057] In practical applications, at least one sensitive attribute category must be defined before text fields can be classified. In the embodiments of this specification, based on the functional attributes of the data in a specific scenario, three categories are defined: a first attribute category related to user identity identification, a second attribute category related to structured scene identifiers, and a third attribute category related to scene semantic information.
[0058] Specifically, the data processing model classifies each text field. If a text field can directly identify an individual, it is determined to be a first sensitive field belonging to the first attribute category. The model further determines whether each first sensitive field carries scene semantic information useful for subsequent tasks and, based on the result, decides whether to process the first sensitive field with the removal desensitization strategy or the semantic placeholder replacement strategy. If the first sensitive field does not carry scene semantic information, it can be removed directly. If it does, it can be replaced by a target field related to the scene semantics.
[0059] For example, in a medical data processing scenario, for a patient data set "Patient Zhang San, male, ID number 123456789123456789, address is Unit E, Community D, District C, City B, Province A", the data processing model identifies text fields such as "Patient Zhang San", "male", "ID number 123456789123456789", and "Unit E, Community D, District C, City B, Province A", and determines that these text fields all belong to the first attribute category (direct identifier).
[0060] For fields like "Zhang San" (name) and "identity code" that do not contain semantic information meaningful for medical analysis, a removal and desensitization strategy is adopted. However, for fields like "male" (gender) and "Province A, City B, District C, Community D, Unit E", gender and geographical information may be valuable for disease distribution and disease research. Therefore, a semantic placeholder replacement strategy is adopted, and the final target text can be "Patient [male], address: [Province A][City B]". The data processing method provided in the embodiments of this specification avoids the destruction of text structure and semantic integrity by using traditional meaningless masks (such as "**" or "XXX") for the first sensitive field related to user identity identification. Instead, it dynamically selects a desensitization strategy based on whether the field contains contextual semantic information. This ensures that the semantic information of the data that is valuable to the context is preserved to a large extent while strictly protecting personal identity from being disclosed, thereby improving the semantic integrity of the data after desensitization in model training or data analysis.
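The choice between removal and semantic placeholder replacement described above can be sketched as follows. This is a minimal illustration, not the method of the embodiments: the `Field` type, the `SEMANTIC_KINDS` set, and the rule of keeping only the first two address levels are assumptions introduced for the example, and in the embodiments the decision is made by the data processing model rather than by fixed rules.

```python
from dataclasses import dataclass


@dataclass
class Field:
    kind: str  # hypothetical label, e.g. "name", "id_number", "gender", "address"
    text: str  # the field's text as it appears in the record


# Assumption: field kinds treated as carrying scene semantic information.
SEMANTIC_KINDS = {"gender", "address"}


def desensitize_first_category(field: Field) -> str:
    """Return the target field for a first-category (direct identifier) field."""
    if field.kind not in SEMANTIC_KINDS:
        # Removal strategy: delete the field and leave no placeholder.
        return ""
    if field.kind == "address":
        # Semantic placeholder: keep only the coarse levels (province, city).
        parts = [p.strip() for p in field.text.split(",")]
        return "".join(f"[{p}]" for p in parts[:2])
    # Semantic placeholder for single-valued fields such as gender.
    return f"[{field.text}]"
```

With the address from the example, the sketch yields "[Province A][City B]", while a name field yields an empty string.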
[0061] In one or more embodiments of this specification, the sensitive attribute category includes a second attribute category, which is an attribute category related to the structured scene identifier. When a sensitive field belongs to the second attribute category, since the sensitive field is related to the structured scene identifier and carries association information within a specific scene, to ensure that the temporal modeling capability is retained during subsequent training of the model for the specific scene, the content of the second sensitive field can be changed, but the format of the second sensitive field must be preserved. Specific implementation methods are as follows: The data processing model is used to classify the text fields in the text to be processed, and text fields belonging to any sensitive attribute category are identified as sensitive fields, including: The data processing model is used to classify the text fields in the text to be processed, and the text fields belonging to the second attribute category are identified as the second sensitive fields; The step of desensitizing each sensitive field according to the desensitization strategy corresponding to each sensitive field to obtain the target field includes: For the second sensitive field, the format preservation and replacement strategy corresponding to the second attribute category is used to desensitize the second sensitive field to obtain the target field. The format preservation and replacement strategy is to replace the second sensitive field with the target field that has the same format as the second sensitive field.
[0062] The second attribute category can be understood as the scenario identifier type of dynamic events or records generated under a specific process or scenario. For example, a user's identity code belongs to the first attribute category (the user's basic identity identifier), while structured medical operation identifiers such as medical record numbers, hospitalization numbers, and examination order numbers in the medical scenario correspond to an independent medical service process and are specific scenario identifiers in the medical scenario. Therefore, these structured medical operation identifiers belong to the second attribute category. Similarly, case numbers and file numbers in the legal scenario belong to the second attribute category, and transaction serial numbers and contract numbers in the financial field also belong to the second attribute category.
[0063] The format-preserving replacement strategy can be understood as a method that uses a specific encryption or mapping algorithm to generate a string that is completely identical to the original sensitive field in terms of character type (letter / number), length, and structure (such as specific prefixes and delimiters), but with different content.
[0064] Specifically, when a text field is related to a structured scene identifier, the text field is determined to belong to the second attribute category. This text field is then designated as the second sensitive field. In fact, when a second sensitive field is related to a structured scene identifier, the second sensitive field has a specific format. Therefore, for the second sensitive field, a format preservation and replacement strategy can be adopted to ensure that these identifiers in the desensitized data retain their original format while hiding the real scene identifier information, so as not to affect subsequent logical associations and statistical analyses based on these formats.
[0065] For example, for a text file containing the medical record number "EMR202405001", the data processing model identifies that "EMR202405001" belongs to the second attribute category (structured medical operation identifier). It applies a format-preserving replacement strategy (such as Format-Preserving Encryption) to desensitize the medical record number, generating the target field "EMR739284615". The desensitized target field "EMR739284615" still has the structure "letter prefix EMR + digits" and a total length of 12 characters.
[0066] In practical applications, in medical scenarios, if a patient has multiple medical records and the original medical record numbers of these records are the same, then the target fields generated after anonymization should also be the same. This ensures that the records belonging to the patient can still be correctly associated in the anonymized dataset, thus fully preserving the data integrity and temporal dependencies at the patient level.
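The two properties described above (same format, and the same output for the same input) can be illustrated with a keyed, deterministic replacement. The sketch below is a stand-in and not Format-Preserving Encryption: it uses an HMAC from the Python standard library, it cannot be inverted on its own, and distinct inputs could in principle collide. A production system would more likely use a standardized scheme such as FF1 from NIST SP 800-38G. The function name and the key are hypothetical, and the sketch assumes an identifier made of a letter prefix followed by digits or delimiters.

```python
import hashlib
import hmac


def format_preserving_replace(value: str, key: bytes) -> str:
    """Keyed, deterministic replacement keeping prefix, length, and character classes."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    # Length of the leading alphabetic prefix (e.g. "EMR"), which is kept as-is.
    prefix_len = 0
    while prefix_len < len(value) and value[prefix_len].isalpha():
        prefix_len += 1
    out = []
    for i, ch in enumerate(value):
        b = digest[i % len(digest)]
        if i < prefix_len or not ch.isascii():
            out.append(ch)  # keep prefix and non-ASCII characters unchanged
        elif ch.isdigit():
            out.append(str(b % 10))  # digit stays a digit
        elif ch.isupper():
            out.append(chr(ord("A") + b % 26))  # uppercase stays uppercase
        elif ch.islower():
            out.append(chr(ord("a") + b % 26))  # lowercase stays lowercase
        else:
            out.append(ch)  # delimiters such as "-" are preserved
    return "".join(out)
```

Because the output depends only on the key and the input, two records carrying the same medical record number receive the same target field, which is what keeps a patient's records linked after desensitization.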
[0067] The data processing method provided in this embodiment effectively protects sensitive identification information from being leaked while maintaining the availability of data at the scene logic and data structure levels to a large extent, providing a high-quality de-identified data foundation for downstream tasks that need to maintain the correlation and integrity between data.
[0068] In one or more embodiments of this specification, the sensitive attribute category includes a third attribute category, which is an attribute category related to scene semantic information. For fields in the text to be processed that are related to the semantics of a specific scene, their semantic information needs to be preserved as much as possible. Specific implementation methods are as follows: Using the data processing model to classify the text fields in the text to be processed and identifying text fields belonging to any sensitive attribute category as sensitive fields includes: using the data processing model to classify the text fields in the text to be processed and identifying the text fields belonging to the third attribute category as third sensitive fields. Desensitizing each sensitive field according to its corresponding desensitization strategy to obtain the target field includes: when the third sensitive field is a numeric field, desensitizing it using the generalized desensitization strategy corresponding to the third attribute category to obtain the target field, where the generalized desensitization strategy generalizes the numeric field while retaining its statistical distribution; and when the third sensitive field is a text entity type field, desensitizing it using the context-aware replacement strategy corresponding to the third attribute category to obtain the target field, where the context-aware replacement strategy uses context information to replace the text entity type field.
[0069] The third attribute category can be understood as a data type that carries specific semantic information of a particular scenario, is of great value to model training or data analysis, but also involves personal privacy. For example, in a medical scenario, age contains specific semantic information of a particular scenario, but also involves the user's personal identity information, so it is necessary to desensitize the specific age.
[0070] Numerical fields can be understood as data that can be directly calculated numerically, such as age and date; generalized desensitization strategies can be understood as methods that transform precise numerical values into a range by performing intervalization, grading, or generalization, thereby hiding the specific numerical values while retaining their group statistical distribution characteristics (such as age structure and time trend).
[0071] Text entity fields can be understood as entities with specific semantic types that appear in unstructured free text, such as names of people, places, and organizations. Context-aware replacement strategies can be understood as a desensitization method that combines contextual semantic understanding to replace sensitive text entity fields with fields that do not expose privacy information but maintain the logical coherence of the context.
[0072] Specifically, taking the medical scenario as an example, for texts containing both sensitive privacy information and clinical content that is crucial for medical research and model training, strategic transformations are used to preserve the statistical distribution and clinical semantics of the data as much as possible while protecting privacy.
[0073] In practice, when a text field belongs to the third attribute category, it is identified as a third sensitive field (a field that carries medical semantic information and contains private data). The third sensitive field is further divided. If the third sensitive field is a numerical field, a generalized desensitization strategy is adopted to generalize the specific and precise value to a range, thereby reducing the impact on the statistical distribution. If the third sensitive field is a text entity field (i.e., a sensitive entity), a context-aware replacement strategy is adopted to identify and replace sensitive entities through context awareness, ensuring that the context is semantically coherent after replacement.
[0074] For example, in a patient record that reads "Patient Zhang San, 45 years old, had a follow-up blood routine test on February 10, 2025", the data processing model identifies that "45 years old", "February 10, 2025" and "Zhang San" all belong to the third attribute category.
[0075] For the numerical fields "45 years old" and "February 10, 2025", a generalized desensitization strategy is used. For example, "45 years old" is generalized to the age range [40-49] years old, and "February 10, 2025" is generalized to the quarter "first quarter of 2025". For the textual entity field "Zhang San", a context-aware replacement strategy is used. Combining contextual information (e.g., the previous text may have indicated gender), "Zhang San" may be replaced with "a male patient". After desensitizing the above patient records, the processed target text can be "a male patient aged [40-49] years old, who had a follow-up blood routine examination in the first quarter of 2025".
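The numeric generalization in this example can be sketched in a few lines. The function names and the ten-year bucket width are assumptions chosen to match the example. The context-aware replacement of "Zhang San" is not shown, since in the embodiments it depends on the data processing model's reading of the context.

```python
from datetime import date


def generalize_age(age: int, width: int = 10) -> str:
    """Map an exact age to a closed interval such as [40-49]."""
    low = (age // width) * width
    return f"[{low}-{low + width - 1}]"


def generalize_date_to_quarter(d: date) -> str:
    """Map an exact date to its calendar quarter."""
    ordinals = ["first", "second", "third", "fourth"]
    quarter_index = (d.month - 1) // 3  # 0 for Jan-Mar, 1 for Apr-Jun, ...
    return f"{ordinals[quarter_index]} quarter of {d.year}"
```

Bucketing hides the exact value while keeping group-level statistics, such as the age structure and quarterly visit counts, computable from the desensitized data.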
[0076] In practical applications, this generalized desensitization strategy and context-aware replacement strategy can ensure that the replaced target text still retains the semantic information of "who performed what examination when" and maintains the topic logic of the clinical event.
[0077] The data processing method provided in the embodiments of this specification effectively meets privacy protection requirements by generalizing precise numerical values and replacing specific entities. By retaining key contextual semantic information such as age range, time quarter, and patient gender, it not only maintains the logical chain of clinical narratives but also provides a high-quality and usable data foundation for population-based statistical analysis (such as disease age distribution and quarterly visit trends) and context-dependent semantic model training (such as medical record text understanding). This avoids data value loss due to excessive desensitization, i.e., reduces information loss caused by crude masking and improves model training effectiveness.
[0078] In one or more embodiments of this specification, when new text to be processed is added, it is determined whether a scene identifier field identical to that in the new text exists. If it does, the desensitization target field corresponding to the scene identifier field can be directly reused according to the desensitization mapping table. Specific implementation methods are as follows: After inputting the text to be processed and the prompt text into the data processing model to obtain the target text output by the data processing model, the process further includes: Retrieve the newly added text to be processed and the corresponding prompt text for the newly added text to be processed; The newly added text to be processed and the prompt text are input into the data processing model to obtain the newly added target text output by the data processing model. In the case that the newly added text to be processed and the text to be processed contain the same scene identifier field, the desensitization target field corresponding to the scene identifier field is determined according to the desensitization mapping table, and the desensitization target field is determined as the target field corresponding to the scene identifier field in the newly added text to be processed.
[0079] The newly added text to be processed can be understood as newly acquired raw text data that needs to be anonymized; the newly added target text can be understood as the output text generated by the data processing model after anonymizing the newly added text to be processed. The same scenario identifier field can be understood as the identity information that appears in different texts to be processed and represents the same real entity (such as the same user or the same medical record), such as the same medical record number or transaction serial number.
[0080] Specifically, when processing newly added text to be processed, if it is found that the scene identifier field contained therein is already recorded in the historical desensitization mapping table, the desensitization result (i.e. the desensitization target field) in the desensitization mapping table is directly reused to replace the scene identifier field in the newly added text to be processed, instead of generating a new desensitization result. This ensures that in the scenario of processing data in batches, the same entity can still be associated with the same desensitized data after data desensitization.
[0081] For example, if the anonymization mapping table records the mapping relationship "Original Medical Record Number P12345 - Target Field P67890", and a newly added patient follow-up record is obtained, containing "Medical Record Number: P12345", the data processing model identifies "P12345" as a scenario identifier field (belonging to the first attribute category or the second attribute category). A query in the anonymization mapping table reveals that this field already has a historical anonymized record. Therefore, instead of generating a new target field for "P12345", the existing target field "P67890" in the anonymization mapping table is directly used for replacement.
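The reuse behaviour in this example amounts to a lookup before generation. The following sketch assumes an in-memory dictionary and a caller-supplied generator function. The class and method names are hypothetical, and the embodiments do not specify how the table is stored.

```python
from typing import Callable, Dict


class DesensitizationMappingTable:
    """Maps original scene identifier fields to their desensitized target fields."""

    def __init__(self) -> None:
        self._records: Dict[str, str] = {}

    def record(self, original: str, target: str) -> None:
        """Store a mapping produced by an earlier desensitization run."""
        self._records[original] = target

    def target_for(self, original: str, generate: Callable[[str], str]) -> str:
        """Reuse the recorded target field, generating one only for unseen fields."""
        if original not in self._records:
            self._records[original] = generate(original)
        return self._records[original]
```

With "P12345" already recorded as "P67890", a later call for "P12345" returns "P67890" and never invokes the generator, so follow-up records stay linked to the earlier ones.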
[0082] The data processing method provided in the embodiments of this specification maintains the internal logical consistency of data after desensitization through a desensitization mapping table (for example, ensuring that multiple medical records of the same patient are automatically associated because they share the same target field), avoiding data association breaks caused by inconsistencies in desensitization, and also enhancing the traceability of desensitization.
[0083] After inputting the text to be processed and the prompt text into the data processing model to obtain the target text output by the data processing model, the process further includes: The initial data processing model is trained based on the target text to obtain the trained scene data processing model, wherein the scene data processing model is verified by the text to be processed corresponding to the target text.
[0084] The initial data processing model can be understood as a pre-trained model that has not yet been optimized for specific scenario data. The scenario data processing model can be understood as the model obtained by adjusting and optimizing the parameters of the initial data processing model using the anonymized target text as training data. Validation can be understood as the process of evaluating and testing the trained scenario data processing model using the text to be processed corresponding to the target text (i.e., the real data before anonymization).
[0085] Specifically, by using the de-identified target text, a model for a specific scenario can be safely trained. In practice, a pre-trained initial data processing model is obtained, which has basic text understanding and text generation capabilities. The initial data processing model is trained using the de-identified target text. When the target text corresponds to the text to be processed in a specific scenario or domain, a scenario data processing model capable of processing data in that specific scenario is obtained.
[0086] In practical applications, to ensure the accuracy and robustness of the scene data processing model, the output of the scene data processing model can be verified using the original text to be processed corresponding to the target text (i.e., the original medical record containing real, un-de-identified information). In this embodiment, reversible de-identification is achieved in the text to be processed by constructing a de-identification mapping table, so the text to be processed can be used for verification.
[0087] The data processing method provided in the embodiments of this specification uses secure and desensitized target text to train the initial data processing model, so that the training process can be carried out under the premise of protecting privacy. Moreover, the target text retains the scene semantic information to a large extent, which can improve the model training effect. Furthermore, the trained scene data processing model can be verified using the original data to be processed, ensuring the usability and reliability of the scene data processing model in real-world scenarios.
[0088] Figure 3 shows a schematic diagram of the processing procedure of a data processing method provided in one embodiment of this specification.
[0089] Taking the desensitization of medical scenario data as an example, the direct application of general text desensitization tools often leads to irreversible loss of information, making it unable to support scenarios that require data restoration, such as scientific research retrospection. At the same time, due to the lack of understanding of medical semantics, processing methods such as replacing patient names with meaningless symbols will seriously weaken the integrity of key role and entity information in the text, thereby affecting the recognition and reasoning ability of downstream models.
[0090] Using static generalization or directly deleting sensitive fields will result in significant information loss. For example, removing medical record numbers will sever the connection between multiple medical records of the same patient, destroy the crucial foundation of time-series modeling in medical data, and make the data difficult to use for effective fine-tuning of large models and generation of real content due to a large number of gaps in the text.
[0091] Solutions relying on subjective "sensitivity stratification" (applying different levels of desensitization to data with high / medium / low sensitivity) often suffer from standards that vary with the expert or department. For example, psychiatry and surgery may judge the sensitivity of diagnostic descriptions differently, leading to fragmented desensitization strategies. Such solutions also struggle to cover implicit sensitive information in the context; for example, the fact that a patient's geographical location can be inferred from the medical institution is easily overlooked. Nor can they adapt to dynamic task requirements: the sensitivity of the same field varies across application scenarios, and a fixed stratification model lacks the ability to adjust flexibly and adaptively.
[0092] This embodiment provides a data processing method that can dynamically select desensitization strategies based on field attributes and context. The desensitization strategies include those that can preserve semantic information, thus balancing semantic preservation with strong privacy protection, rather than a single mask or single encryption. This method preserves medical features and context to the greatest extent, reduces information loss caused by crude masking, and improves model training efficiency.
[0093] Specifically, raw patient data (i.e., the text to be processed in the above embodiments) is collected, including structured fields (such as age, date of visit, and medical record number) and unstructured free text (such as medical history and diagnostic descriptions). All collected data contains real identity and clinical information and is therefore protected personal health information.
[0094] Based on the semantic attributes of the data, identification and classification are performed. Specifically, based on the functional attributes of the data in clinical semantic understanding and model training, at least one sensitive attribute is defined, and a dynamic and scalable desensitization strategy mapping table is constructed. The desensitization strategy mapping table contains the mapping relationship between sensitive attribute categories and corresponding desensitization strategies.
[0095] Specifically, patient information can be divided into three categories: direct identity identifiers that need to be directly removed or semantically replaced (i.e., the first sensitive fields in the above embodiments, such as name and address); structured medical operation identifiers that need to be format-preserved and encrypted (i.e., the second sensitive fields in the above embodiments, such as medical record number and hospitalization number); and clinical semantic variables that need to be generalized, retain statistical distribution, and context-aware replacement (i.e., the third sensitive fields in the above embodiments, such as age, date, and sensitive entities in free text).
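One way to picture the desensitization strategy mapping table and its extensibility is a dictionary from category to a strategy selector. The category labels, strategy names, and the `carries_semantics` / `is_numeric` flags below are all hypothetical. In the embodiments these judgments come from the data processing model, not from flags supplied by the caller.

```python
from typing import Callable, Dict

# Hypothetical labels for the three sensitive attribute categories.
FIRST = "direct_identifier"
SECOND = "structured_operation_identifier"
THIRD = "clinical_semantic_variable"

# A selector inspects field properties and returns a strategy name.
Selector = Callable[[dict], str]

STRATEGY_TABLE: Dict[str, Selector] = {
    FIRST: lambda f: "semantic_placeholder" if f.get("carries_semantics") else "removal",
    SECOND: lambda f: "format_preserving_replacement",
    THIRD: lambda f: "generalization" if f.get("is_numeric") else "context_aware_replacement",
}


def register_category(name: str, selector: Selector) -> None:
    """Extend the table with a new sensitive attribute category."""
    STRATEGY_TABLE[name] = selector


def select_strategy(category: str, field: dict) -> str:
    """Look up the desensitization strategy for a classified field."""
    return STRATEGY_TABLE[category](field)
```

Keeping the mapping in data rather than in branching code is what lets new categories be added later without touching the dispatch logic.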
[0096] Differentiated desensitization strategies are implemented for sensitive fields of different sensitive attribute categories to obtain the target fields. The desensitization strategies for sensitive fields corresponding to each sensitive attribute category can be found in the above embodiments and are not limited here.
[0097] It should be noted that an anonymization mapping table can be constructed based on the sensitive fields and target fields and stored in a secure database. To ensure the reversibility of anonymization, every reversible anonymization operation (such as format-preserving encryption, or the correspondence between a semantic placeholder and its original value) generates a mapping record. These records constitute the anonymization mapping table, which is stored in a physically isolated secure database independent of the training environment. Access to the table is strictly restricted to authorized scenarios (such as ethical review and result verification). This balances security and practicality: the original information remains invisible during training but can be accurately restored when necessary.
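Restoration from the mapping table under an authorization check might look like the sketch below. The `authorized` flag stands in for whatever access control the secure database enforces, and the function name is hypothetical. Fields that were removed without a placeholder leave no mapping record and are not restored by this sketch.

```python
from typing import Dict


def restore_text(target_text: str, mapping: Dict[str, str], authorized: bool) -> str:
    """Replace target fields with their original values in an authorized scenario."""
    if not authorized:
        raise PermissionError("restoration requires an authorized scenario")
    restored = target_text
    # Replace longer target fields first so a short target cannot clobber
    # part of a longer one.
    pairs = sorted(mapping.items(), key=lambda kv: len(kv[1]), reverse=True)
    for original, target in pairs:
        restored = restored.replace(target, original)
    return restored
```

The training environment would never call this function; only the isolated side that holds the mapping table would.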
[0098] The anonymized target text is input into the initial data processing model for training, enabling the trained model to acquire medical knowledge. Specifically, the target text is used as input for supervised learning training of the initial data processing model. Because the anonymization process does not disrupt the text's grammatical structure, introduces no data gaps, and preserves key association identifiers (such as encrypted medical record numbers), the model can effectively learn the clinical semantics and temporal dependency patterns inherent in the data. It does not generate logical inconsistencies or fictitious content due to text fragmentation, and its performance is significantly better than traditional masking or direct deletion-based solutions.
[0099] It should be noted that the classification system and desensitization strategy adopted in this embodiment are both dynamically scalable designs. That is, sensitive attribute categories and their corresponding desensitization strategies can be flexibly added in the future according to actual needs to cope with the ever-changing medical data scenarios and privacy protection requirements.
[0100] The data processing method provided in the embodiments of this specification is based on a unified classification of sensitive attribute categories according to the functional attributes of the data, with a desensitization strategy designed for each category. Through semantic placeholder replacement, format-preserving encryption, and context-aware replacement, it preserves the semantic integrity, temporal correlation, and statistical authenticity that model training requires of medical texts while ensuring privacy compliance, overcoming the pain point of traditional methods where "desensitization equals distortion, and usability equals leakage." It also adopts a security architecture that separates training from restoration: the data cannot be restored during the training phase, while secure, reversible restoration remains possible in authorized scenarios through an independently stored, physically isolated desensitization mapping table, thus balancing compliance and scientific research practicality.
[0101] Corresponding to the above method embodiments, this specification also provides data processing apparatus embodiments. Figure 4 shows a schematic diagram of the structure of a data processing apparatus according to one embodiment of this specification. As shown in Figure 4, the device includes: The acquisition module 402, configured to acquire the text to be processed and the prompt text corresponding to the text to be processed, wherein the prompt text contains at least one sensitive attribute category and a desensitization strategy corresponding to each sensitive attribute category; The desensitization module 404, configured to input the text to be processed and the prompt text into a data processing model to obtain the target text output by the data processing model. The target text is obtained by desensitizing at least one sensitive field in the text to be processed. The sensitive field is determined according to the sensitive attribute category in the text to be processed, and each sensitive field is desensitized through a desensitization strategy corresponding to the sensitive attribute category.
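The division of work between the two modules can be pictured as two small functions. This is only a sketch: the prompt wording, the function names, and the model interface (any callable taking the text and the prompt) are assumptions, since the specification does not fix a prompt format or a model API.

```python
from typing import Callable, Dict, Tuple


def acquire(text: str, strategies: Dict[str, str]) -> Tuple[str, str]:
    """Acquisition step: pair the text to be processed with its prompt text."""
    lines = [f"- {category}: {strategy}" for category, strategy in strategies.items()]
    prompt = (
        "Sensitive attribute categories and desensitization strategies:\n"
        + "\n".join(lines)
    )
    return text, prompt


def desensitize(text: str, prompt: str, model: Callable[[str, str], str]) -> str:
    """Desensitization step: delegate classification and replacement to the model."""
    return model(text, prompt)
```

Passing the model in as a callable keeps the sketch independent of any particular model, so a stub can replace the real data processing model when testing.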
[0102] Optionally, the desensitization module 404 is further configured to: The text to be processed and the prompt text are input into the data processing model. The data processing model is used to classify the text fields in the text to be processed, and the text fields belonging to any sensitive attribute category are identified as sensitive fields. At least one sensitive field in the text to be processed is desensitized to obtain the target text.
[0103] Optionally, the desensitization module 404 is further configured to: Based on the sensitive attribute category of each sensitive field in the text to be processed, determine the desensitization strategy corresponding to each sensitive field; based on the desensitization strategy corresponding to each sensitive field, desensitize each sensitive field to obtain the target field; update the text to be processed based on the target field to obtain the target text.
[0104] Optionally, the desensitization module 404 is further configured to: The data processing model is used to classify the text fields in the text to be processed, and the text fields belonging to the first attribute category are identified as the first sensitive fields; For the first sensitive field, if the first sensitive field does not contain scene semantic information, the first sensitive field is removed using the removal and desensitization strategy corresponding to the first attribute category; When the first sensitive field contains scene semantic information, the first sensitive field is desensitized using the semantic placeholder replacement strategy corresponding to the first attribute category to obtain the target field. The semantic placeholder replacement strategy is a strategy of replacing the first sensitive field with a structured target field that is related to the scene semantic information.
[0105] Optionally, the desensitization module 404 is further configured to: The data processing model is used to classify the text fields in the text to be processed, and the text fields belonging to the second attribute category are identified as the second sensitive fields; For the second sensitive field, the format preservation and replacement strategy corresponding to the second attribute category is used to desensitize the second sensitive field to obtain the target field. The format preservation and replacement strategy is to replace the second sensitive field with the target field that has the same format as the second sensitive field.
[0106] Optionally, the desensitization module 404 is further configured to: The data processing model is used to classify the text fields in the text to be processed, and the text fields belonging to the third attribute category are identified as the third sensitive fields; For the third sensitive field, if the third sensitive field is a numeric field, the third sensitive field is desensitized using the generalized desensitization strategy corresponding to the third attribute category to obtain the target field. The generalized desensitization strategy is a strategy that generalizes the numeric field while retaining the statistical distribution corresponding to the numeric field. When the third sensitive field is a text entity type field, the third sensitive field is desensitized using the context-aware replacement strategy corresponding to the third attribute category to obtain the target field. The context-aware replacement strategy is a strategy that uses context information to replace the text entity type field.
[0107] Optionally, the acquisition module 402 is further configured to: The text to be processed and the prompt text are input into the data processing model to obtain the desensitization mapping table and the target text output by the data processing model. The desensitization mapping table contains the mapping relationship between each sensitive field in the text to be processed and the corresponding target field.
[0108] Optionally, the acquisition module 402 is further configured to: Get the newly added text to be processed and the corresponding prompt text.
[0109] Optionally, the desensitization module 404 is further configured to: input the newly added text to be processed and the prompt text into the data processing model to obtain the newly added target text output by the data processing model. When the newly added text to be processed and the text to be processed contain the same scene identifier field, the desensitization target field corresponding to the scene identifier field is determined according to the desensitization mapping table, and that desensitization target field is used as the target field corresponding to the scene identifier field in the newly added text to be processed.
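The reuse of the desensitization mapping table for a scene identifier field that has already been seen can be sketched as a lookup with a fallback. In this Python sketch the names `resolve_target_field` and `fallback` are hypothetical, and storing newly generated targets back into the table is an assumption made here so that later texts stay consistent; the embodiment only states that an existing entry is reused.

```python
from typing import Callable, Dict


def resolve_target_field(
    scene_id: str,
    mapping: Dict[str, str],
    fallback: Callable[[str], str],
) -> str:
    """Return the target field for a scene identifier field.

    If the identifier already has an entry in the mapping table, that entry is
    reused. Otherwise a new target is produced by `fallback` and recorded
    (the recording step is an assumption of this sketch).
    """
    if scene_id in mapping:
        return mapping[scene_id]
    target = fallback(scene_id)
    mapping[scene_id] = target
    return target


if __name__ == "__main__":
    table = {"ZY-2024-00831": "QK-7315-20946"}  # from the earlier text
    make_new = lambda s: "ID-" + str(len(table) + 1)  # placeholder generator
    print(resolve_target_field("ZY-2024-00831", table, make_new))  # QK-7315-20946
    print(resolve_target_field("ZY-2024-00999", table, make_new))  # ID-2
```

With this lookup, the same record number maps to the same target field across the original text and the newly added text, so cross-document associations remain usable after desensitization.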
[0110] The apparatus further includes a training module configured to train an initial data processing model based on the target text to obtain a trained scene data processing model, wherein the scene data processing model is validated using the text to be processed corresponding to the target text.
[0111] The above is an illustrative scheme of a data processing apparatus according to this embodiment. It should be noted that the technical solution of this data processing apparatus and the technical solution of the data processing method described above belong to the same concept. For details not described in detail in the technical solution of the data processing apparatus, please refer to the description of the technical solution of the data processing method described above.
[0112] Figure 5 shows a structural block diagram of a computing device 500 according to one embodiment of this specification. The components of the computing device 500 include, but are not limited to, a memory 510 and a processor 520. The processor 520 is connected to the memory 510 via a bus 530, and a database 550 is used to store data.
[0113] The computing device 500 also includes an access device 540, which enables the computing device 500 to communicate via one or more networks 560. Examples of these networks include Public Switched Telephone Network (PSTN), Local Area Network (LAN), Wide Area Network (WAN), Personal Area Network (PAN), or combinations of communication networks such as the Internet. The access device 540 may include one or more of any type of wired or wireless network interface (e.g., a network interface card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Wi-MAX (Worldwide Interoperability for Microwave Access) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, or a Near Field Communication (NFC) interface.
[0114] In one embodiment of this specification, the above-described components of the computing device 500 and other components not shown in Figure 5 can also be connected to each other, for example, via a bus. It should be understood that the block diagram of the computing device shown in Figure 5 is for illustrative purposes only and is not intended to limit the scope of this specification. Those skilled in the art can add or replace other components as needed.
[0115] The computing device 500 can be any type of stationary or mobile computing device, including mobile computers or mobile computing devices (e.g., tablet computers, personal digital assistants, laptop computers, notebook computers, netbooks, etc.), mobile phones (e.g., smartphones), wearable computing devices (e.g., smartwatches, smart glasses, etc.) or other types of mobile devices, or stationary computing devices such as desktop computers or personal computers (PCs). The computing device 500 can also be a mobile or stationary server.
[0116] The processor 520 is used to execute a computer program / instructions which, when executed by the processor, implement the steps of the above-described data processing method.
[0117] The various embodiments in this specification are described in a progressive manner. For similar or identical parts, the embodiments may be cross-referenced, and each embodiment focuses on its differences from the other embodiments. In particular, the computing device embodiments are basically similar to the data processing method embodiments, so their description is relatively brief; for relevant parts, refer to the descriptions of the data processing method embodiments.
[0118] An embodiment of this specification also provides a computer-readable storage medium storing a computer program / instructions that, when executed by a processor, implement the steps of the above-described data processing method.
[0119] The various embodiments in this specification are described in a progressive manner. For similar or identical parts, the embodiments may be cross-referenced, and each embodiment focuses on its differences from the other embodiments. In particular, the computer-readable storage medium embodiments are basically similar to the data processing method embodiments, so their description is relatively brief; for relevant parts, refer to the descriptions of the data processing method embodiments.
[0120] An embodiment of this specification also provides a computer program product, including a computer program / instructions that, when executed by a processor, implement the steps of the above-described data processing method.
[0121] The above is an illustrative scheme of a computer program product according to this embodiment. It should be noted that the technical solution of this computer program product and the technical solution of the data processing method described above belong to the same concept. For details not described in detail in the technical solution of the computer program product, please refer to the description of the technical solution of the data processing method described above.
[0122] The foregoing has described specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are possible or may be advantageous.
[0123] The computer instructions include computer program code, which may be in the form of source code, object code, executable file, or certain intermediate forms. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording media, USB flash drive, portable hard drive, magnetic disk, optical disk, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the content included in the computer-readable medium may be appropriately added or removed according to the requirements of patent practice. For example, in some regions, according to patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.
[0124] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments in this specification are not limited to the described order of actions, because according to the embodiments in this specification, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in this specification are all preferred embodiments, and the actions and modules involved are not necessarily essential to the embodiments in this specification.
[0125] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0126] The preferred embodiments disclosed above are merely illustrative of this specification. The optional embodiments do not exhaustively describe all details, nor do they limit the invention to the specific implementations described. Clearly, many modifications and variations can be made based on the embodiments described herein. These embodiments are selected and specifically described in this specification to better explain the principles and practical applications of the embodiments, thereby enabling those skilled in the art to better understand and utilize this specification. This specification is limited only by the claims and their full scope and equivalents.
Claims
1. A data processing method, comprising: obtaining a text to be processed and a prompt text corresponding to the text to be processed, wherein the prompt text contains at least one sensitive attribute category and the desensitization strategy corresponding to each sensitive attribute category; and inputting the text to be processed and the prompt text into a data processing model to obtain a target text output by the data processing model, wherein the target text is obtained by desensitizing at least one sensitive field in the text to be processed, each sensitive field is determined in the text to be processed according to the sensitive attribute category, and each sensitive field is desensitized through the desensitization strategy corresponding to its sensitive attribute category.
2. The method as described in claim 1, wherein inputting the text to be processed and the prompt text into the data processing model to obtain the target text output by the data processing model comprises: inputting the text to be processed and the prompt text into the data processing model; classifying, by the data processing model, the text fields in the text to be processed, and identifying the text fields belonging to any sensitive attribute category as sensitive fields; and desensitizing at least one sensitive field in the text to be processed to obtain the target text.
3. The method as described in claim 2, wherein desensitizing at least one sensitive field in the text to be processed to obtain the target text comprises: determining, based on the sensitive attribute category of each sensitive field in the text to be processed, the desensitization strategy corresponding to each sensitive field; desensitizing each sensitive field based on its corresponding desensitization strategy to obtain a target field; and updating the text to be processed according to the target field to obtain the target text.
4. The method as described in claim 3, wherein the sensitive attribute category includes a first attribute category, the first attribute category being an attribute category related to user identity; wherein classifying the text fields in the text to be processed using the data processing model and identifying the text fields belonging to any sensitive attribute category as sensitive fields comprises: classifying the text fields in the text to be processed using the data processing model, and identifying the text fields belonging to the first attribute category as first sensitive fields; and wherein desensitizing each sensitive field according to its corresponding desensitization strategy to obtain the target field comprises: for the first sensitive field, when the first sensitive field does not contain scene semantic information, removing the first sensitive field using the removal and desensitization strategy corresponding to the first attribute category; and when the first sensitive field contains scene semantic information, desensitizing the first sensitive field using the semantic placeholder replacement strategy corresponding to the first attribute category to obtain the target field, wherein the semantic placeholder replacement strategy is a strategy of replacing the first sensitive field with a structured target field that is related to the scene semantic information.
5. The method as described in claim 3, wherein the sensitive attribute category includes a second attribute category, the second attribute category being an attribute category related to the structured scene identifier; wherein classifying the text fields in the text to be processed using the data processing model and identifying the text fields belonging to any sensitive attribute category as sensitive fields comprises: classifying the text fields in the text to be processed using the data processing model, and identifying the text fields belonging to the second attribute category as second sensitive fields; and wherein desensitizing each sensitive field according to its corresponding desensitization strategy to obtain the target field comprises: for the second sensitive field, desensitizing the second sensitive field using the format preservation and replacement strategy corresponding to the second attribute category to obtain the target field, wherein the format preservation and replacement strategy replaces the second sensitive field with a target field that has the same format as the second sensitive field.
6. The method as described in claim 3, wherein the sensitive attribute category includes a third attribute category, the third attribute category being an attribute category related to scene semantic information; wherein classifying the text fields in the text to be processed using the data processing model and identifying the text fields belonging to any sensitive attribute category as sensitive fields comprises: classifying the text fields in the text to be processed using the data processing model, and identifying the text fields belonging to the third attribute category as third sensitive fields; and wherein desensitizing each sensitive field according to its corresponding desensitization strategy to obtain the target field comprises: for the third sensitive field, when the third sensitive field is a numeric field, desensitizing the third sensitive field using the generalized desensitization strategy corresponding to the third attribute category to obtain the target field, wherein the generalized desensitization strategy generalizes the numeric field while retaining the statistical distribution corresponding to the numeric field; and when the third sensitive field is a text entity type field, desensitizing the third sensitive field using the context-aware replacement strategy corresponding to the third attribute category to obtain the target field, wherein the context-aware replacement strategy uses context information to replace the text entity type field.
7. The method according to any one of claims 3-6, wherein inputting the text to be processed and the prompt text into the data processing model to obtain the target text output by the data processing model comprises: inputting the text to be processed and the prompt text into the data processing model to obtain a desensitization mapping table and the target text output by the data processing model, wherein the desensitization mapping table contains the mapping relationship between each sensitive field in the text to be processed and the corresponding target field.
8. The method as described in claim 7, further comprising, after inputting the text to be processed and the prompt text into the data processing model to obtain the target text output by the data processing model: obtaining a newly added text to be processed and the prompt text corresponding to the newly added text to be processed; and inputting the newly added text to be processed and the prompt text into the data processing model to obtain a newly added target text output by the data processing model, wherein, when the newly added text to be processed and the text to be processed contain the same scene identifier field, the desensitization target field corresponding to the scene identifier field is determined according to the desensitization mapping table, and that desensitization target field is used as the target field corresponding to the scene identifier field in the newly added text to be processed.
9. A computing device, comprising: a memory and a processor; wherein the memory is used to store a computer program / instructions, and the processor is used to execute the computer program / instructions, which, when executed by the processor, implement the steps of the data processing method according to any one of claims 1 to 8.
10. A computer-readable storage medium storing a computer program / instructions that, when executed by a processor, implement the steps of the data processing method according to any one of claims 1 to 8.
11. A computer program product comprising a computer program / instructions that, when executed by a processor, implement the steps of the data processing method according to any one of claims 1 to 8.