Data desensitization method and device, computer device, storage medium and program product

Changed data is identified by comparing hash values of the current and historical data, using the primary key lookup table and the auxiliary data table, and only the changed data is anonymized. This addresses the high computational resource consumption of large-scale data synchronization by identifying data changes efficiently.

CN122241747A · Pending · Publication Date: 2026-06-19 · 湖南长银五八消费金融股份有限公司

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
湖南长银五八消费金融股份有限公司
Filing Date
2026-02-02
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

When processing large-scale data, existing technologies consume excessive computing resources to synchronize Hive table data from the enterprise production environment to the test environment.

Method used

The method queries the target primary key identifier of the initial data table in the primary key lookup table, obtains the historical hash value from the auxiliary data table, compares the current data with the historical data to determine the changed data, and performs a de-identification operation only on the changed data, which avoids de-identifying all data.

Benefits of technology

It reduces the consumption of computing resources, improves the accuracy and efficiency of data change identification, and avoids reprocessing the entire dataset.


Abstract

This application relates to a data anonymization method, apparatus, computer device, storage medium, and program product. In response to a data anonymization request for an initial data table, the method involves: querying a primary key identifier of the initial data table in a primary key lookup table; if the primary key identifier is found, determining the current data of the target data row indicated by the primary key identifier in the initial data table; obtaining an auxiliary data table corresponding to the initial data table; determining the changed data of the initial data table based on the current data and historical hash values, and performing an anonymization operation on the changed data. This scheme determines whether the data in the target data row indicated by the primary key identifier has changed by comparing the historical hash values in the auxiliary data table corresponding to the initial data table with the current data, thereby identifying the changed data of the initial data table and performing an anonymization operation on the changed data. This eliminates the need to anonymize all data in the initial data table, reducing the consumption of computational resources.

Description

Technical Field

[0001] This application relates to the field of big data technology, and in particular to a data desensitization method, apparatus, computer equipment, storage medium, and program product.

Background Technology

[0002] As the scale of enterprise business data expands, enterprise testing environments rely increasingly on anonymized data. Because the Hive tables in the enterprise production environment serve as the "first stop" in the enterprise data warehouse for receiving business data from the production environment, it has become common practice to frequently anonymize their data and synchronize it to the testing environment.

[0003] In related technologies, a common approach is to perform a full table scan and full write to synchronize all data from the enterprise production environment's Hive tables to the test environment after anonymizing the data. However, this method suffers from high computational resource consumption when processing large-scale data.

Summary of the Invention

[0004] Therefore, it is necessary to provide a data anonymization method, apparatus, computer equipment, storage medium, and program product to address the aforementioned technical problems, thereby reducing the consumption of computing resources.

[0005] In a first aspect, this application provides a data anonymization method, including:

[0006] In response to a data anonymization request for an initial data table, the target primary key identifier of the initial data table is queried in the primary key lookup table; wherein, the primary key lookup table contains candidate primary key identifiers corresponding to different candidate data tables, and the candidate primary key identifiers are used to locate data rows in the candidate data tables;

[0007] If the target primary key identifier is found, determine the current data of the target data row indicated by the target primary key identifier in the initial data table; and,

[0008] Obtain the auxiliary data table corresponding to the initial data table; wherein, the auxiliary data table corresponding to the initial data table stores the target primary key identifier of the initial data table and the historical hash value of the historical data of the target data row;

[0009] Based on the current data and the historical hash value, the changed data of the initial data table is determined, and the changed data is de-identified.

[0010] In one embodiment, determining the changed data of the initial data table based on the current data and the historical hash value includes:

[0011] Determine the current hash value of the current data;

[0012] The historical hash value and the current hash value are compared for consistency to obtain the consistency comparison result.

[0013] If the consistency comparison result is inconsistent, then the current data will be used as the changed data of the initial data table.

[0014] In one embodiment, determining the current hash value of the current data includes:

[0015] The fields and their values in the current data are concatenated to obtain a concatenated string; wherein the field value of an empty field in the current data is a preset string.

[0016] The concatenated string is encrypted using a preset encryption algorithm to obtain the current hash value.

[0017] In one embodiment, if the initial data table is a partitioned table, the auxiliary data table stores the target primary key identifier of the initial data table, the historical hash value of the historical data of the target data row, and the partition field corresponding to the initial data table; the partition field is used to jointly locate the historical hash value with the target primary key identifier.

[0018] In one embodiment, the auxiliary data table further stores historical version identifiers; if the consistency comparison result is inconsistent, the method further includes:

[0019] Generate the current version identifier based on the historical version identifier;

[0020] The target primary key identifier, the historical hash value, and the historical version identifier stored in the auxiliary data table are replaced with the target primary key identifier, the current hash value, and the current version identifier.
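The replacement step above can be sketched as follows. The text does not state how the current version identifier is derived from the historical one; this sketch assumes a simple increment of `operation_count`, and the function name `next_auxiliary_record` is hypothetical.

```python
def next_auxiliary_record(record, current_hash):
    """Return the replacement record: same key fields, new hash, new version."""
    return {
        **record,
        "hash_code": current_hash,
        # Assumption: the current version is the historical version plus one.
        "operation_count": record["operation_count"] + 1,
    }

# Illustrative auxiliary record for the row k1=101 (placeholder hash values).
old_record = {"k1": 101, "hash_code": "old-hash", "operation_count": 3}
new_record = next_auxiliary_record(old_record, "new-hash")
```

The old record is left untouched, so a caller could still log or compare both versions before writing the replacement.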

[0021] In one embodiment, performing the desensitization operation on the changed data includes:

[0022] If the changed data is data that has been deleted from the initial data table, then the de-identified data corresponding to the deleted data will be deleted from the target data table.

[0023] If the changed data is new data in the initial data table, then the new data is de-identified, and the de-identified data corresponding to the new data is stored in the target data table;

[0024] If the changed data is modified data in the initial data table, then the de-identified data corresponding to the original data of the modified data already stored in the target data table is deleted, the modified data is de-identified, and the de-identified data corresponding to the modified data is stored in the target data table.

[0025] In a second aspect, this application also provides a data desensitization device, comprising:

[0026] The query module is used to respond to a data anonymization request for an initial data table by querying the target primary key identifier of the initial data table in the primary key lookup table; wherein, the primary key lookup table contains candidate primary key identifiers corresponding to different candidate data tables, and the candidate primary key identifiers are used to locate data rows in the candidate data tables;

[0027] The first determining module is configured to determine the current data of the target data row indicated by the target primary key identifier in the initial data table when the target primary key identifier is found; and to obtain the auxiliary data table corresponding to the initial data table; wherein the auxiliary data table corresponding to the initial data table stores, in association, the target primary key identifier of the initial data table and the historical hash value of the historical data of the target data row;

[0028] The second determining module is used to determine the changed data of the initial data table based on the current data and the historical hash value, and to perform a de-identification operation on the changed data.

[0029] In a third aspect, this application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to perform the following steps:

[0030] In response to a data anonymization request for an initial data table, the target primary key identifier of the initial data table is queried in the primary key lookup table; wherein, the primary key lookup table contains candidate primary key identifiers corresponding to different candidate data tables, and the candidate primary key identifiers are used to locate data rows in the candidate data tables;

[0031] If the target primary key identifier is found, determine the current data of the target data row indicated by the target primary key identifier in the initial data table; and,

[0032] Obtain the auxiliary data table corresponding to the initial data table; wherein, the auxiliary data table corresponding to the initial data table stores the target primary key identifier of the initial data table and the historical hash value of the historical data of the target data row;

[0033] Based on the current data and the historical hash value, the changed data of the initial data table is determined, and the changed data is de-identified.

[0034] In a fourth aspect, this application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, performs the following steps:

[0035] In response to a data anonymization request for an initial data table, the target primary key identifier of the initial data table is queried in the primary key lookup table; wherein, the primary key lookup table contains candidate primary key identifiers corresponding to different candidate data tables, and the candidate primary key identifiers are used to locate data rows in the candidate data tables;

[0036] If the target primary key identifier is found, determine the current data of the target data row indicated by the target primary key identifier in the initial data table; and,

[0037] Obtain the auxiliary data table corresponding to the initial data table; wherein, the auxiliary data table corresponding to the initial data table stores the target primary key identifier of the initial data table and the historical hash value of the historical data of the target data row;

[0038] Based on the current data and the historical hash value, the changed data of the initial data table is determined, and the changed data is de-identified.

[0039] In a fifth aspect, this application also provides a computer program product, including a computer program that, when executed by a processor, performs the following steps:

[0040] In response to a data anonymization request for an initial data table, the target primary key identifier of the initial data table is queried in the primary key lookup table; wherein, the primary key lookup table contains candidate primary key identifiers corresponding to different candidate data tables, and the candidate primary key identifiers are used to locate data rows in the candidate data tables;

[0041] If the target primary key identifier is found, determine the current data of the target data row indicated by the target primary key identifier in the initial data table; and,

[0042] Obtain the auxiliary data table corresponding to the initial data table; wherein, the auxiliary data table corresponding to the initial data table stores the target primary key identifier of the initial data table and the historical hash value of the historical data of the target data row;

[0043] Based on the current data and the historical hash value, the changed data of the initial data table is determined, and the changed data is de-identified.

[0044] The aforementioned data anonymization method, apparatus, computer equipment, storage medium, and program product, in response to a data anonymization request for an initial data table, query a primary key identifier of the initial data table in a primary key lookup table. The primary key lookup table contains candidate primary key identifiers corresponding to different candidate data tables, and these candidate primary key identifiers are used to locate data rows in the candidate data tables. If a target primary key identifier is found, the current data of the target data row indicated by the target primary key identifier in the initial data table is determined. Furthermore, an auxiliary data table corresponding to the initial data table is obtained. This auxiliary data table stores historical hash values associated with the target primary key identifier of the initial data table and historical data of the target data row. Based on the current data and the historical hash values, the changed data of the initial data table is determined, and an anonymization operation is performed on the changed data. This scheme, by using the historical hash values and current data in the auxiliary data table corresponding to the initial data table, determines whether the data in the target data row indicated by the target primary key identifier has changed, thereby identifying the changed data of the initial data table and performing an anonymization operation on the changed data. This eliminates the need to anonymize all data in the initial data table, reducing the consumption of computational resources.

Attached Figure Description

[0045] To more clearly illustrate the technical solutions in the embodiments of this application or related technologies, the drawings used in the description of the embodiments of this application or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0046] Figure 1 is a flowchart of a data anonymization method in one embodiment;

[0047] Figure 2 is a flowchart of the data desensitization method in another embodiment;

[0048] Figure 3 is a structural block diagram of a data desensitization device in one embodiment;

[0049] Figure 4 is an internal structural diagram of a computer device in one embodiment.

Detailed Implementation

[0050] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0051] The data anonymization method provided in this application can be applied to application scenarios involving incremental data anonymization of enterprise data.

[0052] This method can be executed by a server or a terminal. The server can be a standalone physical server, a server cluster or distributed system consisting of multiple physical servers, or a cloud server providing cloud computing services. The terminal can be, but is not limited to, various personal computers, laptops, smartphones, tablets, IoT devices, and portable wearable devices. IoT devices can include smart speakers, smart TVs, smart air conditioners, smart in-vehicle systems, and projection devices. Portable wearable devices can include smartwatches, smart bracelets, and head-mounted displays. Head-mounted displays can be virtual reality (VR) devices, augmented reality (AR) devices, and smart glasses.

[0053] In one exemplary embodiment, as shown in Figure 1, a data anonymization method is provided. Taking the application of this method to a server as an example, the method includes the following steps:

[0054] S101, in response to the data desensitization request for the initial data table, query the target primary key identifier of the initial data table in the primary key lookup table.

[0055] The initial data table is the source table that needs to be de-identified. For example, it can be a source-layer Hive table in the enterprise data warehouse, such as the source-layer Hive table d1.t1, which stores business data in the production environment and supports various structures such as no primary key, single primary key, composite primary key and partitioned table.

[0056] Data masking requests are offline masking task instructions, which include parameters such as the initial data table name, the target data table name, and masking rules, such as the task instruction "synchronize the masked data of table d1.t1 to table desensitize_d1_t1".

[0057] The primary key lookup table contains candidate primary key identifiers corresponding to different candidate data tables. The candidate primary key identifiers are used to locate data rows in the candidate data tables. The primary key lookup table is a tool table used to manage the primary key metadata of all source layer tables. The table structure can be (database_name, table_name, primary_key), where database_name is the name of the database where the initial data table is located, table_name is the name of the initial data table, and primary_key is the name of the primary key field. It supports both single primary keys and composite primary keys: each primary key field occupies one record, so a composite primary key made up of k2 and k3 corresponds to two records. Tables without a primary key have no records.

[0058] The target primary key identifier is a field or combination of fields that uniquely identifies a data row in the initial data table, such as the primary key k1 of table d1.t1, or k2 and k3 of the composite primary key table d2.t2, used to locate a specific data row in the initial data table.

[0059] For example, the server receives a data de-identification request and parses parameters such as the initial data table name (e.g., d1.t1) and the target data table name (e.g., desensitize_d1_t1) in the data de-identification request, triggering the start of the de-identification process.

[0060] Furthermore, the primary key lookup table `table_primary_key` can be accessed via SQL queries, and the target primary key identifier can be filtered based on the `database_name` and `table_name` of the initial data table. For example, when querying the target primary key identifier of the `d1.t1` table, if the returned result is `k1`, then the target primary key identifier is `k1`; if it returns `k2` and `k3`, then the target primary key identifier is the composite primary key identifier `k2` and `k3`.
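The lookup described above can be sketched as follows. This is a minimal illustration that uses Python's built-in `sqlite3` module as a stand-in for Hive; the table and column names come from the text, while the function name `find_primary_key` and the sample rows are assumptions.

```python
import sqlite3

def find_primary_key(conn, database_name, table_name):
    """Return the primary key field names registered for a table (empty if none)."""
    rows = conn.execute(
        "SELECT primary_key FROM table_primary_key "
        "WHERE database_name = ? AND table_name = ? ORDER BY primary_key",
        (database_name, table_name),
    ).fetchall()
    return [row[0] for row in rows]

# In-memory stand-in for the primary key lookup table (SQLite instead of Hive).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_primary_key "
    "(database_name TEXT, table_name TEXT, primary_key TEXT)"
)
conn.executemany(
    "INSERT INTO table_primary_key VALUES (?, ?, ?)",
    [("d1", "t1", "k1"), ("d2", "t2", "k2"), ("d2", "t2", "k3")],
)

single_key = find_primary_key(conn, "d1", "t1")     # single primary key
composite_key = find_primary_key(conn, "d2", "t2")  # one record per key field
no_key = find_primary_key(conn, "d3", "t3")         # table without a primary key
```

An empty result corresponds to the case in which the target primary key identifier is not found.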

[0061] S102, if the target primary key identifier is found, determine the current data of the target data row indicated by the target primary key identifier in the initial data table; and obtain the auxiliary data table corresponding to the initial data table.

[0062] The auxiliary data table corresponding to the initial data table stores the target primary key identifier of the initial data table and the historical hash value of the historical data of the target data row.

[0063] In some optional implementations, if the initial data table is a partitioned table, the auxiliary data table stores the target primary key identifier of the initial data table, the historical hash value of the historical data of the target data row, and the partition field corresponding to the initial data table; the partition field is used to jointly locate the historical hash value with the target primary key identifier.

[0064] A partitioned table is a table whose data is divided into multiple logical parts based on the value of a certain field (the partition field). For example, the d1.t1 table is a partitioned table, the partition field p1 is the date, and the data is stored in partitions such as p1=202405, p1=202406, etc., which is used to optimize the query and management of large-scale data.

[0065] Partition fields are fields used to divide data into partitions, such as date field p1 and region field area. Each partition field value corresponds to a data partition.

[0066] Joint location refers to using both the target primary key identifier and the partition field as query conditions to locate a unique historical hash value in the auxiliary data table, avoiding hash value confusion caused by duplicate primary key identifiers in different partitions.

[0067] For example, when the initial data table is a partitioned table, a partition field column is added when creating the auxiliary data table. The table structure is: primary key field 1, primary key field 2, ..., partition field, historical hash value (hash_code), and historical version identifier (operation_count). For example, if d1.t1 is a partitioned table with partition field p1, the structure of the auxiliary table d1.t1_help is (k1, p1, hash_code, operation_count). When querying historical hash values, the target primary key identifier and partition field can be added to the SQL query statement as filter conditions to ensure that the historical hash value of the target data row under the unique partition is located.

[0068] In the above embodiments, in view of the characteristics of partitioned tables, a partition field is added to the auxiliary data table and the historical hash value is located by combining the primary key and the partition field. This avoids hash value confusion caused by duplicate primary keys in different partitions and improves the accuracy of identifying data changes in partitioned tables. At the same time, the combined location narrows the query range of historical hash values, reduces the amount of data scanning during the query process, and improves query efficiency.
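The joint location described above can be sketched as follows, again with SQLite standing in for Hive. The auxiliary table is named `t1_help` here because the in-memory database has no `d1` schema; the hash values are placeholders and the function name `find_historical_hash` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Auxiliary table for the partitioned table d1.t1, with partition field p1.
conn.execute(
    "CREATE TABLE t1_help "
    "(k1 INTEGER, p1 TEXT, hash_code TEXT, operation_count INTEGER)"
)
conn.executemany(
    "INSERT INTO t1_help VALUES (?, ?, ?, ?)",
    [
        (101, "202405", "hash-of-may-row", 1),   # placeholder hash values
        (101, "202406", "hash-of-june-row", 1),  # same k1, different partition
    ],
)

def find_historical_hash(conn, k1, p1):
    """Locate the historical hash by primary key AND partition field."""
    row = conn.execute(
        "SELECT hash_code FROM t1_help WHERE k1 = ? AND p1 = ?", (k1, p1)
    ).fetchone()
    return row[0] if row else None

may_hash = find_historical_hash(conn, 101, "202405")
june_hash = find_historical_hash(conn, 101, "202406")
missing = find_historical_hash(conn, 101, "202407")
```

Filtering on k1 alone would return two rows here, which is the confusion that the partition field is meant to prevent.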

[0069] The target data row is the row of data in the initial data table that is uniquely identified by the target primary key identifier, such as the row of data corresponding to k1=101 in the d1.t1 table: k1=101, name="Zhang San", age=30.

[0070] The current data is the latest field value of the target data row when the current de-identification task is executed. For example, the latest data of the row k1=101 above is name="Zhang San" and age=31.

[0071] The auxiliary data table is a tool table that corresponds one-to-one with the initial data table. It is dynamically created and maintained as the data masking task progresses. It is used to associate and store the target primary key identifier, the historical data hash value (hash_code) of the target data row, and the version identifier (operation_count). If the initial data table is a partitioned table, the partition field also needs to be stored.

[0072] Historical hash values are the hash values stored in the auxiliary data table at the time of the last de-identification of the target data row. For example, the hash_code of the row k1=101 above was "c81e728d9d4c2f636f067f89cc14862c", which can be calculated from historical data using a preset encryption algorithm.

[0073] For example, based on the queried target primary key identifier, locate the current data of the target data row indicated by the target primary key identifier in the initial data table. For instance, for the target primary key identifier k1 in the initial data table d1.t1, obtain all row data and current field values corresponding to k1.

[0074] Furthermore, auxiliary data tables are dynamically generated based on the initial data table name. For example, the auxiliary data table name corresponding to d1.t1 is d1.t1_help. If the auxiliary data table does not exist, its structure can be determined based on whether the initial data table is a partitioned table, and the auxiliary data table will be dynamically created accordingly. For instance, if the initial data table is a non-partitioned table, the structure of the auxiliary data table is: primary key field 1, primary key field 2, ..., historical hash value (hash_code), and historical version identifier (operation_count). If the initial data table is a partitioned table, the structure of the auxiliary data table is: primary key field 1, primary key field 2, ..., partition field, historical hash value (hash_code), and historical version identifier (operation_count). If an auxiliary data table already exists, it can be directly located and accessed by its table name.
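The dynamic creation described above can be sketched as a function that builds the table definition. The `_help` suffix and the column names `hash_code` and `operation_count` come from the text; the column types (`STRING`, `INT`) and the function name `build_auxiliary_ddl` are assumptions.

```python
def build_auxiliary_ddl(table_name, primary_keys, partition_field=None):
    """Build a CREATE TABLE statement for the auxiliary table of `table_name`."""
    columns = [f"{key} STRING" for key in primary_keys]
    if partition_field is not None:
        # Partitioned source table: store the partition field next to the keys.
        columns.append(f"{partition_field} STRING")
    columns += ["hash_code STRING", "operation_count INT"]
    return f"CREATE TABLE IF NOT EXISTS {table_name}_help ({', '.join(columns)})"

plain_ddl = build_auxiliary_ddl("d2.t2", ["k2", "k3"])
partitioned_ddl = build_auxiliary_ddl("d1.t1", ["k1"], partition_field="p1")
```

Using `IF NOT EXISTS` lets the same statement cover both the first run, when the auxiliary table must be created, and later runs, when it already exists.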

[0075] In some optional implementations, if the target primary key identifier cannot be found, it means that the initial data table does not have a primary key identifier. For example, in the d3.t3 table, the fields are name, age, and phone, but there are no unique identifier fields, making it impossible to locate a unique data row using a single or combined field. In this case, all data in the initial data table can be anonymized and stored in the target data table.

[0076] S103. Based on the current data and historical hash values, determine the changed data in the initial data table and perform a desensitization operation on the changed data.

[0077] The changed data refers to data that has been added, modified, or deleted in the initial data table compared to historical data. For example, adding a row with k1=105, modifying the age field value with k1=101, or deleting a row with k1=104. The desensitization operation involves transforming, replacing, or encrypting sensitive data according to preset rules, such as desensitizing the name "Zhang San" to "Zhang*".

[0078] For example, the current hash value can be determined based on the current data, and the changed data in the initial data table can be determined based on the current hash value and the historical hash values stored in the auxiliary data table. Alternatively, historical data can be traced back based on the historical hash values stored in the auxiliary data table, and then the consistency between the current data and the historical data can be compared to determine the changed data in the initial data table. For different types of changed data, corresponding de-identification and target data table update operations can be performed.

[0079] In some optional implementations, the current hash value of the current data can be determined; and the historical hash value and the current hash value can be compared for consistency to obtain the consistency comparison result; if the consistency comparison result is inconsistent, the current data can be used as the changed data of the initial data table.

[0080] For example, an encryption algorithm can be used to encrypt the current data to obtain the current hash value of the current data.

[0081] In some optional implementations, the fields and their values in the current data can be concatenated to obtain a concatenated string; where the values of empty fields in the current data are preset strings; and a preset encryption algorithm is used to encrypt the concatenated string to obtain the current hash value.

[0082] The concatenated string is a string formed by sorting the field names and values of all fields in the current data according to a preset rule and then connecting them with a delimiter, such as "field1:value1|field2:value2", which is used to unify the data format for encrypted calculations.

[0083] An empty field is a field in the current data whose value is NULL or has not been assigned a value, such as the name field in table d1.t1 where k1=105 has a value of NULL.

[0084] The preset string is a fixed string used to uniformly replace empty field values, such as "NULL", to avoid incorrect hash value calculations caused by inconsistent empty value formats.

[0085] The preset encryption algorithm is used to encrypt the concatenated string, such as Message-Digest Algorithm 5 (MD5) or Secure Hash Algorithm 256-bit (SHA256). The same concatenated string must always generate the same hash value, and different concatenated strings should generate different hash values, apart from a negligible probability of collision.

[0086] For example, you can iterate through all fields of the current data, sort them lexicographically by field name, such as sorting the fields of table d1.t1 as age, k1, and name, to ensure a fixed concatenation order; and concatenate each field in the format "field name: field value", such as concatenating the age field value 30 as "age:30"; if the field is empty, replace the field value with the preset string "NULL", such as concatenating the name field as "name:NULL" when it is empty; finally, connect the concatenation results of all fields with the separator "|" to form the final concatenated string.
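The concatenation steps above can be sketched as follows. The "field name:field value" format, the "|" separator, the lexicographic field order, and the "NULL" placeholder come from the text; the function name `build_concatenated_string` and the sample age for the row k1=105 are assumptions.

```python
NULL_PLACEHOLDER = "NULL"  # preset string that replaces empty field values

def build_concatenated_string(row):
    """Sort fields by name, render each as 'name:value', and join with '|'."""
    parts = []
    for field in sorted(row):
        value = row[field]
        rendered = NULL_PLACEHOLDER if value is None else value
        parts.append(f"{field}:{rendered}")
    return "|".join(parts)

full_row = build_concatenated_string({"k1": 101, "name": "Zhang San", "age": 30})
empty_name = build_concatenated_string({"name": None, "k1": 105, "age": 28})
```

In this sketch a real value equal to the placeholder, or one containing the separator, cannot be distinguished from the formatting characters; escaping such values is left out.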

[0087] You can call the preset encryption algorithm interface, such as the MD5 digest method of Java's MessageDigest class, input a concatenated string, and output the encrypted hash value. For example, after MD5 encryption, you get a 32-character hexadecimal string, which is the current hash value.
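The hashing step can be sketched with Python's standard `hashlib` module in place of the Java interface named above. The UTF-8 encoding of the concatenated string and the function name `md5_hex` are assumptions.

```python
import hashlib

def md5_hex(concatenated):
    """Return the MD5 digest of a string as 32 lowercase hexadecimal characters."""
    return hashlib.md5(concatenated.encode("utf-8")).hexdigest()

current_hash = md5_hex("age:30|k1:101|name:Zhang San")
same_input_hash = md5_hex("age:30|k1:101|name:Zhang San")
other_input_hash = md5_hex("age:31|k1:101|name:Zhang San")
```

The digest is 128 bits long, which is why its hexadecimal form has 32 characters.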

[0088] In the above embodiments, the concatenated string is generated by fixing the field order and uniformly replacing empty fields, which ensures that the same data content generates a consistent concatenated string and avoids hash value calculation deviations caused by disordered field order or inconsistent empty value formats. At the same time, the use of a preset encryption algorithm with a low collision rate ensures that different data content generates different hash values, thereby improving the accuracy of comparing the current hash value with the historical hash value and improving the accuracy of identifying changed data.

[0089] Furthermore, two hash values can be compared by string matching. If the characters are exactly the same, the consistency comparison result is consistent; otherwise, it is inconsistent.

[0090] When the comparison result is inconsistent, it indicates that there is a difference between the current data and the historical data, and the current data is determined to be modified data. A data anonymization operation is then performed on the modified data.
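The consistency comparison can be sketched end to end as follows, assuming MD5 over the sorted "field:value" string with "NULL" for empty values. The function names `row_hash` and `is_changed` are hypothetical.

```python
import hashlib

def row_hash(row):
    """Hash a row: sorted 'field:value' pairs joined by '|', empty values as 'NULL'."""
    text = "|".join(
        f"{name}:{'NULL' if row[name] is None else row[name]}" for name in sorted(row)
    )
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def is_changed(current_row, historical_hash):
    """Compare the current hash with the stored one by plain string equality."""
    return row_hash(current_row) != historical_hash

historical_row = {"k1": 101, "name": "Zhang San", "age": 30}
stored_hash = row_hash(historical_row)  # what the auxiliary table would hold

changed = is_changed({"k1": 101, "name": "Zhang San", "age": 31}, stored_hash)
unchanged = is_changed({"age": 30, "name": "Zhang San", "k1": 101}, stored_hash)
```

The second call passes the same values in a different field order and still compares as consistent, because the fields are sorted before hashing.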

[0091] In this embodiment, hash value consistency comparison is used instead of field-by-field data content comparison. Hash values have the characteristics of fixed length, fast calculation, and unique representation, thus improving the speed of identifying changed data, reducing the computational overhead in the data comparison process, and improving the accuracy of change identification.

[0092] In the above embodiments, the changes in the target data row indicated by the target primary key are determined by comparing the historical hash value with the current data in the auxiliary data table corresponding to the initial data table. This identifies the changed data in the initial data table and performs a desensitization operation on the changed data. In this way, it is not necessary to desensitize all the data in the initial data table, reducing the consumption of computing resources.

[0093] In some optional implementations, when performing data desensitization on changed data, the specific desensitization operation can be determined based on the type of changed data.

[0094] For example, if the changed data is data that has been deleted from the initial data table, then the de-identified data corresponding to the deleted data will be deleted from the target data table. Here, the deleted data refers to data in the initial data table whose target primary key identifier exists in the historical batches of the auxiliary data table but does not exist in the current data. For example, the auxiliary table may have a historical record of k1=104, but the current d1.t1 table does not contain data with k1=104.

[0095] In this case, an SQL DELETE statement can be generated based on the target primary key identifier. If the table is a partitioned table, the deletion must also be conditioned on the partition field, so that the corresponding de-identified data in the target data table is deleted.
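As an illustration of the statement generation described above, the sketch below builds a parameterized DELETE from the primary key values and, for a partitioned table, the partition field. The function name `build_delete`, the table name `target.t1`, the partition column `dt`, and the `?` placeholder style are all hypothetical choices; the source does not specify how the statement is assembled.

```python
def build_delete(table, key_values, partition=None):
    """Build a parameterized DELETE for one de-identified row.

    key_values: mapping of primary key column -> value (single or composite key).
    partition:  optional mapping of partition column -> value.
    """
    conditions = dict(key_values)
    if partition:
        # For a partitioned table, the partition field joins the key
        # in locating the row to delete.
        conditions.update(partition)
    where = " AND ".join(f"{column} = ?" for column in conditions)
    return f"DELETE FROM {table} WHERE {where}", list(conditions.values())
```

For the k1=104 example, `build_delete("target.t1", {"k1": 104}, {"dt": "2026-01-01"})` yields a statement whose WHERE clause combines the key and the partition. Values are passed as parameters rather than formatted into the SQL text; table and column names are assumed to come from trusted metadata.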

[0096] If the changed data is new data in the initial data table, the new data will be anonymized, and the anonymized data will be stored in the target data table. New data refers to data whose target primary key identifier exists in the current data but not in the historical batches of the auxiliary data table. For example, the current d1.t1 table has data with k1=103, but there is no corresponding record in the auxiliary table.

[0097] In this case, a preset de-identification rule engine can be invoked, such as a field replacement tool based on regular expressions, to de-identify sensitive fields of the newly added data, generate an SQL INSERT statement, and write the de-identified data into the target data table.
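A regular-expression-based field replacement tool of the kind mentioned above could look like the following sketch. The rule table `MASKING_RULES`, the field names `phone` and `name`, and the masking formats are hypothetical; the source does not define the rule engine or which fields are sensitive.

```python
import re

# Hypothetical rules: sensitive field name -> (pattern, replacement).
MASKING_RULES = {
    # Keep the first 3 and last 4 digits of an 11-digit number.
    "phone": (re.compile(r"^(\d{3})\d{4}(\d{4})$"), r"\1****\2"),
    # Keep only the first character of a name.
    "name": (re.compile(r"^(.).*$"), r"\1**"),
}


def desensitize(row: dict) -> dict:
    """Return a copy of the row with sensitive string fields masked."""
    masked = dict(row)
    for field, (pattern, replacement) in MASKING_RULES.items():
        value = masked.get(field)
        if isinstance(value, str):
            masked[field] = pattern.sub(replacement, value)
    return masked
```

Note that a value that does not match its pattern passes through unchanged in this sketch; a production rule engine would likely need a fallback for such values so that unexpected formats are not written to the target table unmasked.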

[0098] If the changed data is modified data in the initial data table, then the de-identified data corresponding to the original data of the modified data already stored in the target data table will be deleted, the modified data will be de-identified, and the de-identified data corresponding to the modified data will be stored in the target data table. Modified data refers to data in the initial data table where the target primary key identifier exists in both the current data and the historical batches of the auxiliary data table, but the current hash value is inconsistent with the historical hash value, such as changing the name field value of k1=101 from "Zhang San" to "Zhang Sanfeng".

[0099] In this case, you can first generate an SQL DELETE statement to delete the old de-identified data corresponding to the target primary key identifier in the target data table; then de-identify the modified data and generate an SQL INSERT statement to write the new de-identified data into the target data table.

[0100] In the above embodiments, different desensitization operation procedures were designed for different types of changed data, avoiding a full overwrite of the target data table and reducing storage redundancy and computational overhead during the target data table update process.
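The three change types and their corresponding operations can be sketched as below. Both inputs are assumed to be dictionaries mapping a primary key value to a hash string; the function names `classify_changes` and `plan_operations` are hypothetical and only illustrate the classification logic described in these embodiments.

```python
def classify_changes(current_hashes, historical_hashes):
    """Split primary keys into deleted, added, and modified groups."""
    deleted = sorted(k for k in historical_hashes if k not in current_hashes)
    added = sorted(k for k in current_hashes if k not in historical_hashes)
    modified = sorted(
        k for k in current_hashes
        if k in historical_hashes and current_hashes[k] != historical_hashes[k]
    )
    return {"deleted": deleted, "added": added, "modified": modified}


def plan_operations(changes):
    """Map each changed key to the ordered SQL operations it needs."""
    plan = {}
    for key in changes["deleted"]:
        plan[key] = ["DELETE"]
    for key in changes["added"]:
        plan[key] = ["INSERT"]
    for key in changes["modified"]:
        # Remove the old de-identified row first, then write the new one.
        plan[key] = ["DELETE", "INSERT"]
    return plan
```

With the examples from the text (k1=104 only in history, k1=103 only in the current data, k1=101 present in both with a different hash), the keys fall into the deleted, added, and modified groups respectively, and unchanged keys such as an identical k1=102 produce no operation at all.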

[0101] In some optional implementations, the auxiliary data table also stores historical version identifiers. If the consistency comparison result is inconsistent, the current version identifier can be generated based on the historical version identifier. The target primary key identifier, historical hash value, and historical version identifier stored in the auxiliary data table are then replaced with the target primary key identifier, the current hash value, and the current version identifier.

[0102] The historical version identifier is a field in the auxiliary data table that records the batches of data changes. Each batch of data masking tasks corresponds to a unique version identifier, used to distinguish data snapshots at different points in time. The current version identifier is the version identifier corresponding to the current data masking task, generated by adding 1 to the historical version identifier. For example, if the historical version identifier is 3, the current version identifier is 4, used to mark the batch corresponding to the hash value of the current data.

[0103] For example, when creating the auxiliary data table, an operation_count field can be set. Each time a historical hash value is written, the corresponding batch number is recorded. The batch number is 1 when the de-identification task is executed for the first time, and it increments by 1 each time thereafter.

[0104] You can query the maximum value of the operation_count field in the auxiliary data table, which is the historical version identifier i, and the current version identifier is i+1.

[0105] If the consistency comparison result is inconsistent, an SQL update statement can be executed to replace the hash_code of the record corresponding to the target primary key identifier in the auxiliary data table with the current hash value and the operation_count with the current version identifier. If it is new data, a new record containing the target primary key identifier, the current hash value, and the current version identifier will be inserted. At the same time, the old batch data corresponding to the historical version identifier in the auxiliary data table can be deleted, and only the latest batch can be kept.
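The version handling and auxiliary table update described above can be sketched with Python's built-in `sqlite3` module standing in for the actual storage engine. The column names `hash_code` and `operation_count` come from the text; the table name `t1_help`, the key column `k1`, and the function names are assumptions made for illustration.

```python
import sqlite3


def next_version(conn):
    """Return the current version identifier: max(operation_count) + 1."""
    row = conn.execute(
        "SELECT COALESCE(MAX(operation_count), 0) FROM t1_help"
    ).fetchone()
    return row[0] + 1


def record_change(conn, key, current_hash, version):
    """Replace the stored hash and version for a key, or insert a new record."""
    cursor = conn.execute(
        "UPDATE t1_help SET hash_code = ?, operation_count = ? WHERE k1 = ?",
        (current_hash, version, key),
    )
    if cursor.rowcount == 0:
        # No historical record: this is new data, so insert instead.
        conn.execute(
            "INSERT INTO t1_help (k1, hash_code, operation_count) VALUES (?, ?, ?)",
            (key, current_hash, version),
        )
```

On an empty auxiliary table `next_version` returns 1, matching the rule that the first de-identification task uses batch number 1.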

[0106] Optionally, if the consistency comparison result is consistent, the historical version identifier in the auxiliary data table can be replaced with the current version identifier.

[0107] In the above embodiments, the hash values in the auxiliary data table are managed through version identifiers, so the change history of the data can be tracked clearly. This ensures that the hash value comparison of each de-identification task is based on the snapshot data of adjacent batches and avoids the confusion in change identification that cross-batch comparisons would cause. At the same time, by replacing the hash values and version identifiers of old versions, the auxiliary data table retains only the latest data state, which reduces its storage overhead. The version identifier can also serve as a traceability basis for de-identification tasks, improving the maintainability of the solution.

[0108] In some optional implementations, referring to Figure 2, a flowchart of another data anonymization method is provided, which includes the following steps:

[0109] Step 1: Task initiation and pre-judgment.

[0110] Start the data anonymization task and trigger its execution.

[0111] First, determine whether the initial data table to be processed is a partitioned table. If it is, obtain the partition list, that is, the partitions of the initial data table that have changed. If not, mark the table for full-table processing, that is, no partition filtering is required and the initial data table is processed directly.

[0112] Step 2: Basic verification of primary key and auxiliary tables.

[0113] The target primary key identifier of the initial data table can be obtained from the primary key lookup table. The target primary key identifier can be a single primary key or a composite primary key.

[0114] Next, it is determined whether the initial data table has a primary key: if the target primary key identifier is found in the primary key lookup table, the initial data table has a primary key; if it is not found, the initial data table has no primary key. If there is a primary key, proceed to determine whether the auxiliary table exists; if there is no primary key, proceed to step four.

[0115] Further, determine whether an auxiliary table corresponding to the initial data table, such as d1.t1_help, has been created. If yes, proceed to step three; otherwise, proceed to step four.
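The branching in step 2 can be summarized in a small sketch. The lookup structure (a dictionary from table name to primary key columns), the set of existing tables, and the function name `choose_process` are hypothetical; the `_help` suffix follows the d1.t1_help example in the text.

```python
def choose_process(table, primary_key_lookup, existing_tables):
    """Decide between the incremental (step 3) and full (step 4) process."""
    primary_key = primary_key_lookup.get(table)
    if not primary_key:
        return "full"         # no primary key found: go to step 4
    if f"{table}_help" not in existing_tables:
        return "full"         # auxiliary table not created yet: go to step 4
    return "incremental"      # primary key and auxiliary table exist: step 3
```

A composite primary key is represented here simply as a list with more than one column name, so the same check covers both single and composite keys.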

[0116] Step 3: Incremental desensitization process.

[0117] Query the maximum batch number i in the auxiliary table, which is the historical version identifier: retrieve the batch number of the last de-identification task stored in the auxiliary table.

[0118] Calculate the hash of the current data, i.e. the current hash value, and write it to the auxiliary table as the new batch i+1.

[0119] By comparing the hash values of the old and new batches, changed data can be identified. For example, based on the hash comparison results, three types of data changes can be distinguished:

[0120] If the data exists only in the old batch, a DELETE statement is generated (data deletion).

[0121] If the data exists only in the new batch, an INSERT statement is generated (data addition).

[0122] If the data exists in both batches but the hashes differ, a DELETE statement followed by an INSERT statement is generated (data modification).

[0123] Execute the SQL to update the target table, that is, run the SQL generated above to update the target table storing the de-identified data; at the same time, delete the data of batch i in the auxiliary table, and keep only the latest batch.
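The batch comparison in step 3 can also be expressed directly in SQL over the auxiliary table. The sketch below uses Python's built-in `sqlite3` module as a stand-in engine and assumes an auxiliary table `t1_help(k1, hash_code, operation_count)` that temporarily holds both batch i and batch i+1; the table layout and the function names are assumptions, not taken from the source.

```python
import sqlite3

# Keys present in batch a but missing from batch b.
ONLY_IN_SQL = """
    SELECT a.k1 FROM t1_help a
    LEFT JOIN t1_help b ON b.k1 = a.k1 AND b.operation_count = ?
    WHERE a.operation_count = ? AND b.k1 IS NULL
    ORDER BY a.k1
"""

# Keys present in both batches whose hashes differ.
MODIFIED_SQL = """
    SELECT o.k1 FROM t1_help o
    JOIN t1_help n ON n.k1 = o.k1
    WHERE o.operation_count = ? AND n.operation_count = ?
      AND o.hash_code <> n.hash_code
    ORDER BY o.k1
"""


def diff_batches(conn, old, new):
    """Return (deleted, added, modified) key lists between two batches."""
    deleted = [r[0] for r in conn.execute(ONLY_IN_SQL, (new, old))]
    added = [r[0] for r in conn.execute(ONLY_IN_SQL, (old, new))]
    modified = [r[0] for r in conn.execute(MODIFIED_SQL, (old, new))]
    return deleted, added, modified


def drop_batch(conn, batch):
    """Delete an old batch so that only the latest batch is kept."""
    conn.execute("DELETE FROM t1_help WHERE operation_count = ?", (batch,))
```

After the generated statements have been applied to the target table, calling `drop_batch(conn, i)` leaves only batch i+1 in the auxiliary table, as the step describes.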

[0124] Step 4: Full Desensitization Process.

[0125] Auxiliary tables can be created dynamically based on the attributes (partition or primary key) of the initial data table. A full data hash is then calculated and written to batch 1, meaning the hash value of all data in the initial table is calculated and written to the first batch of the auxiliary table. Finally, full data masking is performed and written to the target table, meaning the entire table's data is masked and the result is written to the target table.

[0126] Step 5: Process Closure.

[0127] Update the task status, that is, mark the execution status of the current de-identification task, whether it was successful or failed.

[0128] The process ends when the entire data anonymization task is completed.

[0129] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.

[0130] Based on the same inventive concept, this application also provides a data desensitization apparatus for implementing the data desensitization method described above. The solution provided by this apparatus is similar to the implementation scheme described in the above method; therefore, the specific limitations in one or more data desensitization apparatus embodiments provided below can be found in the limitations of the data desensitization method described above, and will not be repeated here.

[0131] In one exemplary embodiment, as shown in Figure 3, a data anonymization device is provided, comprising:

[0132] Query module 10 is used to query the target primary key identifier of the initial data table in the primary key query table in response to the data desensitization request for the initial data table; wherein, the primary key query table contains candidate primary key identifiers corresponding to different candidate data tables, and the candidate primary key identifiers are used to locate data rows in the candidate data tables;

[0133] The first determining module 20 is used to determine the current data of the target data row indicated by the target primary key identifier in the initial data table when the target primary key identifier is found; and to obtain the auxiliary data table corresponding to the initial data table; wherein the auxiliary data table corresponding to the initial data table stores, in association, the target primary key identifier of the initial data table and the historical hash value of the historical data of the target data row.

[0134] The second determining module 30 is used to determine the changed data of the initial data table based on the current data and historical hash values, and to perform a desensitization operation on the changed data.

[0135] The above scheme determines whether the data in the target data row indicated by the target primary key has changed by comparing the historical hash value in the auxiliary data table corresponding to the initial data table with the current data. This identifies the changed data in the initial data table and performs a desensitization operation on the changed data. In this way, it is not necessary to desensitize all the data in the initial data table, thus reducing the consumption of computing resources.

[0136] In one embodiment, the second determining module 30 is specifically used for:

[0137] Determine the current hash value of the current data; perform a consistency comparison between the historical hash value and the current hash value to obtain the consistency comparison result; if the consistency comparison result is inconsistent, then use the current data as the changed data of the initial data table.

[0138] In one embodiment, the second determining module 30 is specifically used for:

[0139] Concatenating the values of each field in the current data to obtain a concatenated string, wherein the value of an empty field in the current data is a preset string; and encrypting the concatenated string using a preset encryption algorithm to obtain the current hash value.

[0140] In one embodiment, if the initial data table is a partitioned table, the auxiliary data table stores the target primary key identifier of the initial data table, the historical hash value of the historical data of the target data row, and the partition field corresponding to the initial data table; the partition field is used to jointly locate the historical hash value with the target primary key identifier.

[0141] In one embodiment, the device further includes a replacement module, and the auxiliary data table stores historical version identifiers; if the consistency comparison result is inconsistent, the replacement module is used to:

[0142] Generate the current version identifier based on the historical version identifier; replace the target primary key identifier, historical hash value, and historical version identifier stored in the auxiliary data table with the target primary key identifier, current hash value, and current version identifier.

[0143] In one embodiment, the second determining module 30 is specifically used for:

[0144] If the changed data is data that has been deleted from the initial data table, then the de-identified data corresponding to the deleted data will be deleted from the target data table. If the changed data is newly added data in the initial data table, then the newly added data will be de-identified, and the de-identified data corresponding to the newly added data will be stored in the target data table. If the changed data is modified data in the initial data table, then the de-identified data corresponding to the original data of the modified data already stored in the target data table will be deleted, the modified data will be de-identified, and the de-identified data corresponding to the modified data will be stored in the target data table.

[0145] Each module in the aforementioned data anonymization device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device in hardware form, or stored in the memory of a computer device in software form, so that the processor can call and execute the operations corresponding to each module.

[0146] In one exemplary embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 4. The computer device includes a processor, memory, input/output (I/O) interfaces, and a communication interface. The processor, memory, and I/O interfaces are connected via a system bus, and the communication interface is also connected to the system bus via the I/O interfaces. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and a database. The internal memory provides the environment for running the operating system and computer programs stored in the non-volatile storage media. The database stores both un-anonymized and anonymized data. The I/O interfaces allow the processor to exchange information with external devices. The communication interface allows communication with external terminals via a network connection. When executed by the processor, the computer program implements a data anonymization method.

[0147] Those skilled in the art will understand that the structure shown in Figure 4 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0148] In one exemplary embodiment, a computer device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the data desensitization method described in any of the above embodiments.

[0149] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the steps of the data desensitization method described in any of the above embodiments.

[0150] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps of the data desensitization method described in any of the above embodiments.

[0151] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with relevant regulations.

[0152] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile memory and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, artificial intelligence (AI) processors, etc., and are not limited to these.

[0153] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this application.

[0154] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. A data desensitization method, characterized in that the method includes: in response to a data anonymization request for an initial data table, querying the target primary key identifier of the initial data table in a primary key lookup table, wherein the primary key lookup table contains candidate primary key identifiers corresponding to different candidate data tables, and the candidate primary key identifiers are used to locate data rows in the candidate data tables; if the target primary key identifier is found, determining the current data of the target data row indicated by the target primary key identifier in the initial data table, and obtaining the auxiliary data table corresponding to the initial data table, wherein the auxiliary data table corresponding to the initial data table stores the target primary key identifier of the initial data table and the historical hash value of the historical data of the target data row; and determining the changed data of the initial data table based on the current data and the historical hash value, and performing a de-identification operation on the changed data.

2. The method of claim 1, wherein the step of determining the changed data of the initial data table based on the current data and the historical hash value includes: determining the current hash value of the current data; comparing the historical hash value and the current hash value for consistency to obtain a consistency comparison result; and if the consistency comparison result is inconsistent, using the current data as the changed data of the initial data table.

3. The method of claim 2, wherein determining the current hash value of the current data includes: concatenating the fields and their values in the current data to obtain a concatenated string, wherein the field value of an empty field in the current data is a preset string; and encrypting the concatenated string using a preset encryption algorithm to obtain the current hash value.

4. The method of claim 2, wherein, if the initial data table is a partitioned table, the auxiliary data table stores the target primary key identifier of the initial data table, the historical hash value of the historical data of the target data row, and the partition field corresponding to the initial data table; the partition field is used, together with the target primary key identifier, to locate the historical hash value.

5. The method of claim 2, wherein the auxiliary data table also stores historical version identifiers; if the consistency comparison result is inconsistent, the method further includes: generating the current version identifier based on the historical version identifier; and replacing the target primary key identifier, the historical hash value, and the historical version identifier stored in the auxiliary data table with the target primary key identifier, the current hash value, and the current version identifier.

6. The method according to claim 1, characterized in that performing the data anonymization operation on the changed data includes: if the changed data is data that has been deleted from the initial data table, deleting the de-identified data corresponding to the deleted data from the target data table; if the changed data is new data in the initial data table, de-identifying the new data and storing the de-identified data corresponding to the new data in the target data table; and if the changed data is modified data in the initial data table, deleting the de-identified data corresponding to the original data of the modified data already stored in the target data table, de-identifying the modified data, and storing the de-identified data corresponding to the modified data in the target data table.

7. A data desensitization device, characterized in that the device includes: a query module, used to respond to a data anonymization request for an initial data table by querying the target primary key identifier of the initial data table in the primary key lookup table, wherein the primary key lookup table contains candidate primary key identifiers corresponding to different candidate data tables, and the candidate primary key identifiers are used to locate data rows in the candidate data tables; a first determining module, configured to determine the current data of the target data row indicated by the target primary key identifier in the initial data table when the target primary key identifier is found, and to obtain the auxiliary data table corresponding to the initial data table, wherein the auxiliary data table corresponding to the initial data table stores, in association, the target primary key identifier of the initial data table and the historical hash value of the historical data of the target data row; and a second determining module, used to determine the changed data of the initial data table based on the current data and the historical hash value, and to perform a de-identification operation on the changed data.

8. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 6.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 6.

10. A computer program product, comprising a computer program, characterized in that when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 6.