Automatic sensitivity assessment method and system for stored data, oriented to the opening of government data
The method constructs a virtual connection result distribution and calculates the information entropy increment in an isolated sandbox environment. This addresses two problems in existing technologies: privacy risks from cross-table associations are assessed inaccurately, and the assessment itself can leak data. The result is a quantitative and secure assessment of the sensitivity of government data.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGZHOU YITUO SOFTWARE DEV CO LTD
- Filing Date
- 2026-03-23
- Publication Date
- 2026-06-19
AI Technical Summary
Existing methods for assessing the sensitivity of government data cannot accurately assess the privacy risks caused by cross-table associations, and the assessment process carries the risk of data leakage.
By acquiring metadata statistics from government databases, a virtual connection result distribution is constructed in an isolated sandbox environment. The information entropy increment is calculated, the risk of individual re-identification caused by cross-table logical associations is quantified, and sensitivity assessment results are generated.
It enables accurate assessment of data sensitivity after cross-table joins without accessing actual data records, avoids data leakage, systematically identifies the cascading propagation effect of multi-hop paths, and achieves scientific and comprehensive sensitivity assessment.
Figure CN122241367A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of data security technology, and in particular to an automatic method and system for assessing the sensitivity of stored data for government data disclosure.
Background Technology
[0002] With the deepening of e-government construction and the implementation of data openness and sharing policies, government departments have accumulated a large amount of data involving citizens' personal information. This data is scattered across different business systems and databases, including population information, medical records, social security data, tax information, etc., and is characterized by a large number of data tables, complex relationships between tables, and diverse field types. To improve the efficiency of government services and promote the mining of data value, government departments need to open some of this data to the public.
[0003] Existing methods for assessing the sensitivity of government data typically employ a single-table assessment strategy, evaluating the sensitivity of each data table individually based on whether its fields contain sensitive information such as names, ID numbers, and phone numbers. This method treats each data table as an independent entity, neglecting cross-table logical relationships formed through foreign keys. When multiple data tables are linked by foreign keys, combinations of fields that were individually insensitive may form quasi-identifiers that uniquely identify individuals, significantly increasing the risk of re-identification. This method still classifies these tables as low-sensitivity and releases them directly, leading to privacy breaches after the data is made public.
[0004] Furthermore, existing methods typically require performing cross-table joins that physically connect real data from multiple tables when assessing sensitivity, generating an intermediate result table containing complete personal information. The risk of individual re-identification is then analyzed based on this intermediate result table. This assessment process itself exposes a large amount of sensitive information, posing a data leakage risk. Therefore, existing technologies cannot accurately assess the privacy risks caused by cross-table joins, and they carry a risk of data leakage during the assessment process.
Summary of the Invention
[0005] In view of the aforementioned problems, this application is hereby filed.
[0006] Therefore, this application provides an automatic data sensitivity assessment method and system for open government data storage, which can solve the problems mentioned in the background art.
[0007] To solve the above-mentioned technical problems, this application provides the following technical solution: In the first aspect, this application provides an automatic assessment method for the sensitivity of stored data for government data openness, including: obtaining metadata statistics from government databases, wherein the metadata statistics include the cardinality of association keys, the total number of records in the data table, the histogram of field value distribution, and the weights of association topology edges; Based on the metadata statistics, a virtual connection result distribution based on statistical inference is constructed in an isolated sandbox environment, where access to actual data records in the government database is prohibited. The information entropy increment is calculated based on the distribution of the virtual connection results and the weight of the associated topology edge. The information entropy increment represents the change in individual re-identification risk caused by cross-table logical association. The sensitivity level of the stored data is determined based on the information entropy increment, and a sensitivity assessment result containing data openness policy configuration parameters is generated.
[0008] Preferably, constructing the virtual connection result distribution based on statistical inference in the isolated sandbox environment includes: extracting the cardinality of the association keys of the first data table and the cardinality of the association keys of the second data table from the metadata statistics, and calculating the association key overlap coefficient between the first data table and the second data table; deducing the expected number of records after the cross-table join based on the association key overlap coefficient, the total number of records in the first data table, and the total number of records in the second data table; extracting the value range distribution characteristics of each field from the field value distribution histogram of the first data table and the field value distribution histogram of the second data table, and calculating the joint distribution characteristics of the field combinations after the cross-table join; and combining the expected number of records and the joint distribution characteristics to obtain the virtual connection result distribution.
[0009] Preferably, calculating the joint distribution characteristics of the field combinations after the cross-table join includes: extracting the value range division results of the quasi-identifier fields from the field value distribution histogram of the first data table, where the quasi-identifier fields include the age field, address field, occupation field, and timestamp field; extracting, from the field value distribution histogram of the second data table, the value range division results of the associated fields that are semantically related to the quasi-identifier fields; expanding the value intervals of the quasi-identifier fields against the value intervals of the associated fields to generate a candidate interval combination set for cross-table field combinations; and counting the candidate interval combinations in the set whose expected record number is a single record. The ratio of that count to the total number of candidate interval combinations is the uniqueness ratio, which serves as the core quantitative indicator of the joint distribution characteristics.
[0010] Preferably, the associated topological edge weights are determined in the following ways: Obtain the foreign key relationships between various data tables in the government database, and construct an inter-table relationship topology graph, wherein the inter-table relationship topology graph uses data tables as nodes and foreign key relationships as directed edges; For each directed edge in the table association topology graph, the physical association strength of the directed edge is calculated based on the data type consistency of the foreign key field, the integrity of the foreign key constraint, and the cardinality type of the foreign key association. Extract the frequency of JOIN operations involving the table pairs corresponding to the directed edges from the database query logs, and normalize the frequency of JOIN operations to use as the logical association strength of the directed edges. The physical association strength and the logical association strength are weighted and fused to obtain the comprehensive association strength of the directed edge, and the comprehensive association strength is used as the weight of the association topology edge; When the foreign key association corresponding to the directed edge involves a field that has undergone de-identification, the weight of the associated topology edge is attenuated according to the de-identification strength level of the field.
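The normalization, weighted fusion, and attenuation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the physical association strength is taken as an input already scaled to [0, 1], the fusion weight `alpha` and the `DEID_ATTENUATION` table are hypothetical values, and JOIN frequencies are normalized by dividing by the largest observed count.

```python
# Hypothetical attenuation per de-identification strength level (0 = none).
DEID_ATTENUATION = {0: 1.0, 1: 0.75, 2: 0.5, 3: 0.25}


def normalize_join_frequencies(join_counts):
    """Scale JOIN counts per table pair into [0, 1] by the largest count."""
    peak = max(join_counts.values(), default=0)
    return {
        pair: (count / peak if peak else 0.0)
        for pair, count in join_counts.items()
    }


def edge_weight(physical, logical, alpha=0.5, deid_level=0):
    """Fuse physical and logical strength, then attenuate for de-identification.

    alpha is an assumed fusion weight; the source does not specify one.
    """
    fused = alpha * physical + (1.0 - alpha) * logical
    return fused * DEID_ATTENUATION[deid_level]


# Table names and counts below are made up for illustration.
logical_strengths = normalize_join_frequencies(
    {("user", "medical_record"): 50, ("medical_record", "claim"): 100}
)
plain_weight = edge_weight(1.0, logical_strengths[("user", "medical_record")])
masked_weight = edge_weight(
    1.0, logical_strengths[("user", "medical_record")], deid_level=2
)
```

Keeping the attenuation in a lookup table makes the mapping from de-identification level to weight easy to audit, which may matter more than the exact values chosen.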
[0011] Preferably, the calculation of the information entropy increment includes: For each data table in the government database, based on the total number of records in the data table and the histogram of the field value distribution, the initial information entropy value of the data table before cross-table join is calculated; Based on the expected number of records and the joint distribution characteristics, the connection information entropy value after the virtual connection is calculated for the distribution of the virtual connection results. The difference between the connection information entropy value and the initial information entropy value is calculated to obtain the single-hop information entropy increment. When a multi-hop propagation path is detected in the inter-table association topology graph, the single-hop information entropy increment of each hop is calculated sequentially along the multi-hop propagation path. The single-hop information entropy increment is then weighted according to the association topology edge weights corresponding to each hop. The weighted single-hop information entropy increment is then accumulated hop by hop to obtain the multi-hop cumulative information entropy increment as the information entropy increment. This increment is then stored in the sensitivity assessment result database through the data persistence interface.
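As a rough illustration of the single-hop step, the sketch below assumes that the entropy value is the Shannon entropy (base 2) of a histogram's bucket proportions. The paragraph above does not give the formula, so this choice and the function names are assumptions.

```python
import math


def shannon_entropy(bucket_counts):
    """Shannon entropy (bits) of the proportions implied by histogram counts."""
    total = sum(bucket_counts)
    if total == 0:
        return 0.0
    return -sum(
        (c / total) * math.log2(c / total) for c in bucket_counts if c > 0
    )


def single_hop_increment(initial_counts, joined_counts):
    """Connection entropy minus initial entropy for one hop."""
    return shannon_entropy(joined_counts) - shannon_entropy(initial_counts)


# Four equally filled buckets before the join, two after: entropy drops by 1 bit.
delta = single_hop_increment([5, 5, 5, 5], [10, 10])
```

Under this assumption, a negative increment means the joined distribution is more concentrated than the original one, which is the case the sensitivity thresholds treat as risky.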
[0012] Preferably, the step of weighting the single-hop information entropy increment according to the associated topological edge weights corresponding to each hop includes: Obtain the associated topological edge weight corresponding to the nth hop on the multi-hop propagation path, and use the associated topological edge weight as the base value of the propagation attenuation coefficient; Calculate the hop count attenuation factor based on the hop count n of the multi-hop propagation path; Multiply the base value of the propagation attenuation coefficient by the hop count attenuation factor to obtain the actual propagation attenuation coefficient of the nth hop; Multiply the single-hop information entropy increment of the nth hop by the actual propagation attenuation coefficient to obtain the weighted information entropy increment of the nth hop; When it is detected that the data table involved in the nth hop of the multi-hop propagation path contains a field that has undergone de-identification, a de-identification blocking coefficient is calculated based on the de-identification strength level of the field, and the weighted information entropy increment is multiplied by the de-identification blocking coefficient.
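The hop-by-hop weighting and accumulation can be sketched as below. The paragraph above does not state how the hop count attenuation factor depends on n, so the exponential form `decay ** (n - 1)` is an assumption, as are the dictionary keys and the default `decay` value. The de-identification blocking coefficient is supplied per hop and defaults to 1.0 when no de-identified field is involved.

```python
def multi_hop_entropy_increment(hops, decay=0.8):
    """Accumulate weighted single-hop entropy increments along a path.

    Each hop is a dict with 'delta_entropy', 'edge_weight', and an optional
    'blocking' coefficient for tables containing de-identified fields.
    """
    total = 0.0
    for n, hop in enumerate(hops, start=1):
        hop_factor = decay ** (n - 1)              # assumed attenuation form
        coefficient = hop["edge_weight"] * hop_factor
        weighted = hop["delta_entropy"] * coefficient
        weighted *= hop.get("blocking", 1.0)       # de-identification blocking
        total += weighted
    return total


path = [
    {"delta_entropy": -2.0, "edge_weight": 0.5},
    {"delta_entropy": -1.0, "edge_weight": 1.0, "blocking": 0.5},
]
cumulative = multi_hop_entropy_increment(path, decay=0.5)
```

With these made-up inputs, the first hop contributes -1.0 and the second contributes -0.25, so the cumulative increment is -1.25.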
[0013] Preferably, determining the sensitivity level of stored data includes: The privacy protection benchmark thresholds for the opening of government data are obtained, and the privacy protection benchmark thresholds include a high sensitivity judgment threshold and a medium sensitivity judgment threshold; When the information entropy increment is negative and the absolute value of the information entropy increment is greater than the high sensitivity determination threshold, the corresponding stored data is determined to be of high sensitivity level. When the information entropy increment is negative and the absolute value of the information entropy increment is between the high sensitivity threshold and the medium sensitivity threshold, the corresponding stored data is determined to be of medium sensitivity level. When the information entropy increment is negative and the absolute value of the information entropy increment is less than the medium sensitivity determination threshold, or when the information entropy increment is non-negative, the corresponding stored data is determined to be of low sensitivity level. For the stored data with the high sensitivity level, the key propagation path that causes the absolute value of the information entropy increment to exceed the high sensitivity judgment threshold is extracted from the table association topology graph, and each data table and field on the key propagation path is marked as a high-risk association object.
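The threshold rules above reduce to a short decision function. This is a sketch only: the function name and the example threshold values are hypothetical, and treating a magnitude exactly equal to a threshold as medium is an assumption, since the text does not say how boundary values are handled.

```python
def classify_sensitivity(entropy_increment, high_threshold, medium_threshold):
    """Map an information entropy increment to a sensitivity level."""
    if entropy_increment >= 0:
        return "low"                      # non-negative increment
    magnitude = abs(entropy_increment)
    if magnitude > high_threshold:
        return "high"
    if magnitude >= medium_threshold:
        return "medium"                   # boundary handling is an assumption
    return "low"


levels = [
    classify_sensitivity(value, high_threshold=2.0, medium_threshold=1.0)
    for value in (-2.5, -1.5, -0.5, 0.3)
]
```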
[0014] Preferably, generating the sensitivity assessment result containing data openness policy configuration parameters includes: for stored data at the high sensitivity level, generating a linked desensitization suggestion, which includes a list of fields on the key propagation path that need to be desensitized and a recommended desensitization strength level for each field; for stored data at the medium sensitivity level, generating data opening restriction suggestions, which include restricting cross-table query permissions, prohibiting the export of specific field combinations, and setting data access frequency limits, and which are expressed through an access control policy configuration file; for stored data at the low sensitivity level, generating a data opening license identifier, which indicates that the stored data can be opened directly in its current desensitized state; and summarizing the linked desensitization suggestion, the data opening restriction suggestions, and the data opening license identifier to generate a structured sensitivity assessment result report. The sensitivity assessment result report includes the sensitivity level distribution of each data table, a visualization map of high-risk association paths, and a list of tiered opening policy configurations.
[0015] Preferably, after generating the sensitivity assessment result containing data openness policy configuration parameters, the method further includes: Monitor update events in the government database's data update logs that involve new records, modified field values, and changes to table structure; When the update event is detected to involve a high-risk associated object, an incremental sensitivity reassessment is triggered. The incremental sensitivity reassessment re-acquires the metadata statistics for the data table affected by the update event, reconstructs the virtual connection result distribution in the isolated sandbox environment, and recalculates the information entropy increment. The newly calculated information entropy increment is compared with the historical information entropy increment. When the absolute value of the new information entropy increment increases by more than a preset change range compared with the absolute value of the historical information entropy increment, a sensitivity upgrade alarm is generated. In response to the sensitivity upgrade alarm, adjust the recommended desensitization strength level of the corresponding field in the linkage desensitization suggestion.
[0016] In a second aspect, this application also provides an automatic data sensitivity assessment system for open government data storage, including: a statistical acquisition module, which acquires metadata statistics from the government database, wherein the metadata statistics include the cardinality of association keys, the total number of records in the data table, the histogram of field value distribution, and the weights of association topology edges; a distribution construction module, which constructs a virtual connection result distribution based on statistical inference in an isolated sandbox environment based on the metadata statistics, where the isolated sandbox environment prohibits access to the actual data records in the government database; an entropy increment calculation module, which calculates the information entropy increment based on the virtual connection result distribution and the association topology edge weights, where the information entropy increment represents the change in individual re-identification risk caused by cross-table logical association; and an evaluation generation module, which determines the sensitivity level of the stored data based on the information entropy increment and generates a sensitivity assessment result that includes data openness policy configuration parameters.
[0017] In a third aspect, an electronic device is provided, comprising: a memory, a processor, and a computer program, wherein the computer program is stored in the memory, and the processor executes the computer program to perform the method described in the first aspect of this application and its various possible implementations.
[0018] Implementing this application will have the following beneficial effects: This application provides an automatic data sensitivity assessment method and system for open government data storage. 1. This application constructs a virtual join result distribution based on metadata statistics in an isolated sandbox environment, deducing the data distribution after cross-table joins without accessing actual data records in the government database. Through association key overlap coefficients, Cartesian product reduction calculations, and interval cross-combination deduction algorithms, the expected number of records and joint distribution characteristics are accurately calculated, obtaining a virtual join result distribution that is highly consistent with the actual join results in statistical characteristics. This avoids access to and exposure of real sensitive data during the assessment process, fundamentally eliminating the risk of data leakage during the assessment process. It solves the problem in existing technologies where actual cross-table join operations are required, leading to the exposure of sensitive information, and achieves secure sensitivity assessment.
[0019] 2. This application quantifies the change in individual re-identification risk caused by cross-table logical associations by calculating the information entropy increment. Based on the distribution of virtual connection results and the weights of the association topology edges, the initial information entropy value before the cross-table connection and the connection information entropy value after the connection are calculated, and the difference between the two is used as the information entropy increment. A graph traversal algorithm is used to detect multi-hop propagation paths and calculate the cumulative information entropy increment of multiple hops. The single-hop information entropy increment is weighted and accumulated according to the association topology edge weights, hop count attenuation factor, and desensitization blocking coefficient, systematically identifying the cascading propagation effect of sensitivity through multi-hop paths. Using the information entropy increment as the quantitative basis for sensitivity level determination replaces the qualitative judgment that relies on expert experience in the prior art, solving the problems of the inability to quantitatively assess the privacy risks of cross-table associations and ignoring the cascading propagation effect in the prior art, and realizing the scientific and comprehensive nature of sensitivity assessment.
Attached Figure Description
[0020] To more clearly illustrate the technical solutions in the embodiments of this application or related technologies, the drawings used in the description of the embodiments of this application or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.
[0021] Figure 1 is an overall flowchart of the automatic sensitivity assessment method for stored data for government data opening described in this application.
Figure 2 is an application environment diagram of the automatic sensitivity assessment method for stored data for government data opening described in this application.
Figure 3 is a schematic diagram of the module structure of the automatic sensitivity assessment system for stored data for government data opening described in this application.
Figure 4 is a diagram of the computer device used for the automatic sensitivity assessment method for stored data for government data opening described in this application.
Detailed Implementation
[0022] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0023] In one exemplary embodiment, as shown in Figure 1, an automatic data sensitivity assessment method for open government data storage is provided, including: S1: Obtain metadata statistics from the government database. The metadata statistics include the cardinality of association keys, the total number of records in the data table, the histogram of field value distribution, and the weights of association topology edges.
[0024] It should be noted that the current sensitivity assessment of government data mainly relies on manual review or full scanning of the original data. For example, sensitive fields are identified through keyword matching, or business personnel manually mark the sensitivity level. However, reviewers are prone to misjudgment because of large data volumes and misread context. For example, the sensitivity of the same word may vary across business scenarios, so relying solely on literal matching may lead to false positives or false negatives. Furthermore, manual assessment struggles to identify the hidden risks that arise after cross-table association, so the assessment results do not reflect the actual risk of leakage after the data is opened up.
[0025] Therefore, this application will directly obtain metadata statistics from government databases, such as cardinality of association keys, total number of records in data tables, histogram of field value distribution, and weights of association topology edges, thereby reducing the security and compliance risks caused by scanning the original data content.
[0026] Based on this, since the metadata formats of different departments may differ, direct use of them for calculation may lead to assessment bias. Therefore, this application will first standardize and verify the obtained metadata statistics to eliminate interference caused by inconsistent formats, and then construct a virtual connection distribution based on the verified information to ensure that the input data for subsequent risk quantification is accurate and reliable.
[0027] Metadata statistics refer to summary parameters that describe the structural characteristics and distribution patterns of data tables. For example, the cardinality of the association key represents the number of unique values in a field, and the histogram of field value distribution represents the frequency distribution of values in each interval. It should be noted that the above information can be read directly from the database system directory without accessing the original data rows, and the acquisition process is completed only through the metadata interface with read-only permissions, without involving the physical movement of business data.
[0028] S2: Based on metadata statistics, construct a virtual connection result distribution based on statistical inference in an isolated sandbox environment. The isolated sandbox environment prohibits access to actual data records in the government database.
[0029] Specifically, the process of constructing the virtual connection result distribution includes the following operations: The server extracts the cardinality of the join key from the metadata statistics for both the first and second data tables. The cardinality of a join key refers to the number of unique values in the join key field. For example, if the ID number field in the user table is used as the join key, and the user table has 10,000 records with no duplicate ID numbers, then the cardinality of the join key is 10,000.
[0030] After extracting the cardinality of the join keys, the server calculates the join key overlap coefficient between the first and second data tables using a join key value range analysis algorithm. Further, the algorithm compares the value range of the join keys in the first and second data tables, calculating the proportion of the intersection of these two value ranges to their respective ranges. The join key overlap coefficient represents the proportion of intersection between the join key value ranges of the two tables.
[0031] For example, the range of ID card numbers in the user table is 110000000000000000 to 659999999999999999, while the range of patient ID card numbers in the medical record table is 110000000000000000 to 469999999999999999. The intersection of these two ranges is 110000000000000000 to 469999999999999999. This intersection covers approximately 0.65 of the user table's range and 1.0 of the medical record table's range. The server takes the smaller of these two proportions as the association key overlap coefficient; in this example, it is 0.65.
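The range comparison in this example can be sketched as follows. It is a minimal illustration, assuming each table's join-key value range is summarized by inclusive integer bounds and that ID numbers have 18 digits; the function name `overlap_coefficient` is hypothetical.

```python
def overlap_coefficient(range_a, range_b):
    """Smaller of the two intersection-to-range proportions (0.0 if disjoint)."""
    low = max(range_a[0], range_b[0])
    high = min(range_a[1], range_b[1])
    if high < low:
        return 0.0
    intersection = high - low + 1
    share_a = intersection / (range_a[1] - range_a[0] + 1)
    share_b = intersection / (range_b[1] - range_b[0] + 1)
    return min(share_a, share_b)


user_range = (110000000000000000, 659999999999999999)
record_range = (110000000000000000, 469999999999999999)
coefficient = overlap_coefficient(user_range, record_range)  # about 0.65
```

The intersection spans 36 of the 55 leading-prefix blocks in the user table's range, which is where the value of roughly 0.65 comes from.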
[0032] After calculating the join key overlap coefficient, the server estimates the expected number of records after the cross-table join by applying a Cartesian product reduction calculation to the join key overlap coefficient, the total number of records in the first data table, and the total number of records in the second data table. The Cartesian product reduction calculation scales down the theoretical Cartesian product result according to the join key overlap coefficient.
[0033] Specifically, the Cartesian product reduction calculation process is as follows: The server first calculates the product of the total number of records in the first data table and the total number of records in the second data table to obtain the theoretical Cartesian product result. The theoretical Cartesian product result represents the maximum number of records that may be generated when the two tables are fully cross-joined. The server multiplies the theoretical Cartesian product result by the overlap coefficient of the association key to obtain the expected number of records.
[0034] Continuing the example above, the user table has 10,000 records, and the medical record table has 50,000 records. Theoretically, the Cartesian product would be 500,000,000. Multiplying 500,000,000 by the join key overlap factor of 0.65 yields an expected record count of 325,000,000. This expected record count reflects the actual number of records that might be generated after the two tables are joined using the join key.
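The arithmetic of this example can be reproduced with a few lines. The sketch follows the Cartesian product reduction exactly as described (product of the two record totals, scaled by the overlap coefficient); the function name is hypothetical, and rounding to a whole record count is an added convenience.

```python
def expected_join_records(total_a, total_b, overlap):
    """Theoretical Cartesian product scaled by the join key overlap coefficient."""
    cartesian = total_a * total_b
    return round(cartesian * overlap)


cartesian_size = 10_000 * 50_000
expected = expected_join_records(10_000, 50_000, 0.65)
```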
[0035] After estimating the expected number of records, the server extracts the value range distribution characteristics of each field from the field value distribution histograms of the first and second data tables. The field value distribution histogram divides the field's value range into multiple intervals, with each interval recording the number of records falling within that interval. For example, the age field's value range can be divided into 0-18 years, 19-35 years, 36-60 years, and 60 years and above, with each interval recording the number of people in the corresponding age group.
[0036] After extracting the value range distribution characteristics, the server calculates the joint distribution characteristics of each field combination after cross-table join using an interval cross-combination inference algorithm. The joint distribution characteristics characterize the degree of uniqueness of the multi-field combination after the join.
[0037] Furthermore, the execution process of the interval cross-combination deduction algorithm is as follows: The server extracts the value range division results of quasi-identifier fields from the field value distribution histogram of the first data table using quasi-identifier recognition rules. Quasi-identifier fields include age, address, occupation, and timestamp fields. Specifically, the quasi-identifier recognition rules are based on keyword matching of field names and determination of field data types. The server scans the field names in the field value distribution histogram; if a field name contains keywords such as "age," "address," "occupation," "job," "time," or "date," or if the field data type is date / time, then the field is identified as a quasi-identifier field.
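A minimal sketch of the quasi-identifier recognition rules described above, assuming field names are English and data types are given as plain strings. The keyword list comes from the paragraph; the set of temporal type names and the function name are assumptions.

```python
QUASI_KEYWORDS = ("age", "address", "occupation", "job", "time", "date")
TEMPORAL_TYPES = {"date", "time", "datetime", "timestamp"}  # assumed type names


def is_quasi_identifier(field_name, data_type):
    """Flag a field by name keyword or by date/time data type."""
    name = field_name.lower()
    if any(keyword in name for keyword in QUASI_KEYWORDS):
        return True
    return data_type.lower() in TEMPORAL_TYPES


fields = [
    ("patient_age", "integer"),
    ("home_address", "string"),
    ("created", "timestamp"),
    ("id_number", "string"),
]
quasi_fields = [name for name, dtype in fields if is_quasi_identifier(name, dtype)]
```

Plain substring matching also flags names such as `message`, which contains "age". A stricter variant would likely split field names into tokens before matching.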
[0038] After identifying the quasi-identifier field, the server extracts the value range division results of semantically related fields from the field value distribution histogram of the second data table using a semantic association matching algorithm. The semantic association matching algorithm is implemented based on field name similarity calculation and field data type matching.
[0039] Furthermore, the semantic association matching algorithm is implemented as follows: the server calculates the edit distance between each field name in the second data table and the quasi-identifier field name. Edit distance refers to the minimum number of single-character edit operations required to convert one string into another; edit operations include insertion, deletion, and replacement. The smaller the edit distance, the higher the similarity of the field names. The server marks fields with an edit distance less than 3 as candidate association fields.
[0040] After marking the candidate association fields, the server further determines whether the data type of the candidate association field matches the data type of the quasi-identifier field. If both the quasi-identifier field and the candidate association field are integers, then they match. If both the quasi-identifier field and the candidate association field are strings, then they match. Candidate association fields with matching data types are determined to be association fields.
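The two-stage matching described above (edit distance below 3, then data type equality) can be sketched as follows. The Levenshtein routine is the standard dynamic-programming form; the function names and the sample field list are illustrative assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance with insertion, deletion, and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def match_association_fields(quasi_field, candidates, max_distance=3):
    """Keep candidates whose name is close enough and whose type matches."""
    quasi_name, quasi_type = quasi_field
    return [
        name
        for name, dtype in candidates
        if edit_distance(name, quasi_name) < max_distance and dtype == quasi_type
    ]


second_table_fields = [
    ("age", "integer"),
    ("ages", "integer"),
    ("age", "string"),        # close name, wrong type
    ("diagnosis", "string"),  # unrelated name
]
matched = match_association_fields(("age", "integer"), second_table_fields)
```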
[0041] After extracting the value range division results of the associated fields, the server expands each value range of the quasi-identifier field against each value range of the associated fields using the Cartesian product expansion algorithm to generate a candidate range combination set for cross-table field combinations.
[0042] Specifically, the Cartesian product expansion algorithm pairs each value range of the quasi-identifier field with each value range of the associated field. For example, the age field in the user table has four value ranges (0-18 years, 19-35 years, 36-60 years, and 60 years and older), and the patient age field in the medical record table has four value ranges (0-18 years, 19-35 years, 36-60 years, and 60 years and older). After Cartesian product expansion, 16 candidate range combinations are generated. Each candidate range combination contains a pair of age ranges from the user table and patient age ranges from the medical record table, such as (0-18 years, 0-18 years), (0-18 years, 19-35 years), etc.
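The pairing in this example maps directly onto a Cartesian product of the two interval lists. The sketch below uses Python's `itertools.product`; the interval labels are taken from the example and the variable names are illustrative.

```python
from itertools import product

user_age_ranges = ["0-18", "19-35", "36-60", "60+"]
patient_age_ranges = ["0-18", "19-35", "36-60", "60+"]

# Every user-table range paired with every medical-record-table range.
candidate_combinations = list(product(user_age_ranges, patient_age_ranges))
```

Four ranges on each side give 4 × 4 = 16 combinations, starting with ("0-18", "0-18") and ("0-18", "19-35").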
[0043] After generating the candidate interval combination set, the server uses a probability product to infer the expected number of records for each candidate interval combination after the virtual connection, based on the number of records in the corresponding intervals of the field value distribution histograms of the first and second data tables.
[0044] Furthermore, the probability product calculation process is as follows: The server reads, from the field value distribution histogram of the first data table, the number of records in the first data table's value range corresponding to the candidate interval combination, and divides it by the total number of records in the first data table to obtain the record proportion of that value range. The server likewise reads, from the field value distribution histogram of the second data table, the number of records in the second data table's value range corresponding to the candidate interval combination, and divides it by the total number of records in the second data table to obtain the record proportion of that value range. The server multiplies the two record proportions together and then multiplies the product by the overall expected number of records of the virtual connection to obtain the expected number of records for the candidate interval combination after the virtual connection.
[0045] Continuing with the example above, the user table has 2000 records in the 0-18 age range, a record proportion of 0.2 (20% of the table's records). The medical record table has 10000 records in the 0-18 age range, also a record proportion of 0.2. Multiplying these two proportions together gives 0.04. Multiplying this by the overall expected number of records (325,000,000) gives the expected number of records for the candidate range combination (0-18 years, 0-18 years) after the virtual join: 13,000,000.
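The arithmetic of this example can be checked with a short sketch. It treats the two record proportions as 0.2 each, which is the value implied by their product of 0.04. The table totals of 10,000 and 50,000 records are assumptions back-calculated from those proportions, and the function name is hypothetical.

```python
def expected_join_records(count_a, total_a, count_b, total_b, expected_total):
    """Probability product: proportion in table A times proportion in
    table B, scaled by the overall expected size of the virtual join."""
    proportion_a = count_a / total_a
    proportion_b = count_b / total_b
    return proportion_a * proportion_b * expected_total


# 2000 of an assumed 10,000 user records and 10,000 of an assumed
# 50,000 medical records fall in the 0-18 range (proportion 0.2 each).
estimate = expected_join_records(2000, 10_000, 10_000, 50_000, 325_000_000)
```

Up to floating-point rounding, `estimate` equals 0.2 x 0.2 x 325,000,000 = 13,000,000, matching the figure in the text.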
[0046] After estimating the expected number of records for each candidate interval combination, the server counts the number of candidate interval combinations in the set whose expected record count is a single record. An expected record count of a single record means that this field combination can uniquely identify an individual after a virtual join. The server divides the number of candidate interval combinations whose expected record count is a single record by the total number of candidate interval combinations in the set to obtain the uniqueness ratio. The uniqueness ratio represents the probability that an individual can be uniquely identified after a cross-table join.
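A minimal sketch of the uniqueness ratio follows. The text does not say how a fractional expected count is tested for being "a single record", so the sketch assumes that a count which rounds to 1 qualifies. That rule and the function name are illustrative assumptions.

```python
def uniqueness_ratio(expected_counts):
    """Share of candidate interval combinations whose expected record
    count is a single record (assumed here: the count rounds to 1)."""
    if not expected_counts:
        return 0.0
    single = sum(1 for count in expected_counts if round(count) == 1)
    return single / len(expected_counts)
```

With expected counts of 1.0, 13,000,000, 0.8, and 250 for four combinations, two of them round to a single record, so the ratio would be 0.5.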
[0047] The server uses the uniqueness ratio as the core quantitative indicator of the joint distribution characteristics and outputs it to a temporary storage area in the isolated sandbox environment through a data structure serialization interface. Specifically, the data structure serialization interface converts the uniqueness ratio into a JSON or XML data structure and writes it to a file in the temporary storage area.
[0048] After completing the above operations, the server synthesizes the expected number of records and the joint distribution characteristics to obtain the virtual connection result distribution. The distribution synthesis uses the expected number of records as the overall scale of the virtual connection result distribution and the joint distribution characteristics as its internal structural features. The server stores the virtual connection result distribution in a temporary data structure within the isolated sandbox environment. The temporary data structure uses a hash table or tree structure to store the various parameters of the virtual connection result distribution.
[0049] It should be noted that constructing a virtual join result distribution solves the problem in existing technologies that require actual cross-table join operations to assess data sensitivity. Existing technologies, when assessing the sensitivity of government data, typically require actually joining multiple data tables to generate an intermediate result table containing real personal information, and then analyzing the risk of individual re-identification based on the intermediate result table. Actual join operations expose a large amount of sensitive information, posing a data leakage risk. This application, by deriving the virtual join result distribution based on metadata statistics in an isolated sandbox environment, avoids accessing actual data records, fundamentally eliminating the risk of data leakage during the assessment process.
[0050] Furthermore, the construction of the virtual join result distribution relies on techniques such as association key overlap coefficient, Cartesian product reduction calculation, and interval cross-combination inference algorithms. Without accessing the actual data, it accurately infers the data distribution pattern and individual uniqueness characteristics after cross-table joins. The inferred results are highly consistent with the actual join results in statistical characteristics, supporting subsequent sensitivity assessments while ensuring data security.
[0051] Furthermore, the uniqueness ratio, as a core quantitative indicator of joint distribution characteristics, directly reflects the probability that an individual is uniquely identified after cross-table joins. A higher uniqueness ratio indicates stronger discriminative power of the field combination and a greater risk of individual re-identification. This application uses the uniqueness ratio as input for subsequent information entropy increment calculations, establishing a complete technical chain from data distribution characteristics to privacy risk quantification.
[0052] S3: Calculate the information entropy increment based on the virtual connection result distribution and the associated topological edge weights. The information entropy increment represents the change in individual re-identification risk caused by cross-table logical association.
[0053] Before calculating the information entropy increment, the server needs to determine the weights of the associated topological edges. Specifically, the weights are determined as follows: The server retrieves the foreign key relationships between tables in the government database through a database system directory query interface. The interface reads the stored foreign key constraint definitions from the database system tables, extracting information such as the main table name, dependent table name, primary key field name, and foreign key field name. The server then constructs an inter-table relationship topology graph based on these foreign key relationships. This graph uses data tables as nodes and directed edges to represent foreign key relationships. Each directed edge points from the dependent table to the main table, indicating that the dependent table references the primary key of the main table through a foreign key.
[0054] For example, if the medical record table references the ID number field in the user table through the patient's ID number field, then there exists a directed edge in the inter-table relationship topology graph from the medical record table to the user table. The server uses an adjacency list data structure to store the inter-table relationship topology graph, and each node maintains a list recording all directed edges pointing to that node.
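The adjacency list described above, in which each node keeps the directed edges pointing to it, might be sketched as follows. The tuple layout of the foreign key input and the edge dictionary keys are assumptions made for illustration.

```python
def build_topology(foreign_keys):
    """Build an adjacency list keyed by table name.

    foreign_keys: iterable of
        (dependent_table, fk_field, main_table, pk_field)  # assumed layout
    Each node stores the list of directed edges that point TO it.
    """
    graph = {}
    for dependent, fk_field, main, pk_field in foreign_keys:
        graph.setdefault(dependent, [])  # make sure the source node exists
        graph.setdefault(main, []).append({
            "from": dependent,   # edge starts at the dependent table
            "to": main,          # and points to the main table
            "fk": fk_field,
            "pk": pk_field,
            "weight": None,      # filled in once edge weights are computed
        })
    return graph


graph = build_topology([
    ("medical_record", "patient_id_number", "user", "id_number"),
])
```

In this example the `user` node holds one incoming edge from `medical_record`, and the `medical_record` node has an empty edge list.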
[0055] After constructing the inter-table relationship topology graph, the server calculates the physical relationship strength of each directed edge in the graph based on the data type consistency of the foreign key field, the integrity of the foreign key constraints, and the cardinality type of the foreign key relationship.
[0056] Furthermore, the calculation process for physical association strength is as follows: The server checks whether the data types of the foreign key field and the primary key field corresponding to the directed edge are consistent. If the foreign key field is of type VARCHAR and has a length of 18, and the primary key field is also of type VARCHAR and has a length of 18, then the data types are consistent. When the data types are consistent, the server marks the data type consistency as 1; otherwise, it marks it as 0.
[0057] The server reads the integrity information of foreign key constraints from the database system directory. Foreign key constraint integrity means that every value in a foreign key field in the dependent table can be found in the primary key field of the main table. The server counts the number of records where the foreign key field value in the dependent table successfully matches the primary key field in the main table, and divides this number by the total number of records in the dependent table to obtain the foreign key matching success rate. A higher foreign key matching success rate indicates better foreign key constraint integrity.
[0058] The server identifies the cardinality type of foreign key relationships. Cardinality types include one-to-one, one-to-many, and many-to-many relationships. The server counts the number of records corresponding to each foreign key value in the table. If the number of records corresponding to all foreign key values is 1, it is a one-to-one relationship. If the number of records corresponding to any foreign key value is greater than 1, it is a one-to-many relationship. Many-to-many relationships are implemented through an intermediate table, and the server checks whether an intermediate table exists that references the primary keys of two tables simultaneously.
[0059] After identifying the cardinality type, the server calculates the uniqueness of the foreign key field. Foreign key field uniqueness refers to the proportion of unique values in the foreign key field out of the total number of records in the dependent table. Higher foreign key field uniqueness indicates stronger discriminative power.
[0060] The server multiplies the foreign key match success rate by the uniqueness of the foreign key field to obtain the physical association strength. The physical association strength reflects the reliability of the foreign key association at the database structure level.
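The product stated in this paragraph can be sketched as below. For illustration the function takes plain lists of key values; in practice the two ratios would come from whatever counts the server has available. The function name is hypothetical, and only the two factors named in this paragraph enter the product.

```python
def physical_association_strength(fk_values, pk_values):
    """Foreign key matching success rate multiplied by foreign key
    field uniqueness."""
    total = len(fk_values)
    if total == 0:
        return 0.0
    pk_set = set(pk_values)
    # Share of dependent-table records whose foreign key value is found
    # among the main table's primary key values.
    match_rate = sum(1 for value in fk_values if value in pk_set) / total
    # Share of distinct foreign key values among dependent-table records.
    uniqueness = len(set(fk_values)) / total
    return match_rate * uniqueness
```

For foreign key values `["a", "a", "b", "x"]` against primary keys `["a", "b", "c"]`, three of four records match (0.75) and three of four values are distinct (0.75), so the strength would be 0.75 x 0.75 = 0.5625.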
[0061] After calculating the physical association strength, for each directed edge in the inter-table association topology graph, the server uses the database query log parser to extract from the database query log the frequency of JOIN operations involving the table pair corresponding to that directed edge.
[0062] Specifically, the database query log parser reads the database query log file and analyzes the JOIN keywords in the SQL statements. The server identifies the table names involved in the JOIN statements and determines whether the table names match the table pairs corresponding to the directed edges. If the JOIN statement is "SELECT * FROM medical record table JOIN user table ON medical record table.patient ID number = user table.ID number", then this JOIN operation involves the table pair of the medical record table and the user table.
[0063] The server counts the number of occurrences of JOIN operations involving this table pair in the query log and divides it by the total number of JOIN operations in the query log to obtain the JOIN operation frequency. The JOIN operation frequency reflects how actively this table pair is queried in association in actual business.
[0064] The server normalizes the JOIN operation frequency and uses it as the logical association strength of the directed edge. The normalization operation maps the JOIN operation frequency to the interval from 0 to 1. The server finds the maximum value of the JOIN operation frequencies among all directed edges, divides the JOIN operation frequency of each directed edge by the maximum value to obtain the normalized logical association strength.
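The log parsing and normalization described in the preceding paragraphs might look roughly like this. The sketch assumes each log entry is a single SQL statement with at most one simple two-table JOIN and no table aliases. The regular expression and function names are illustrative, not a general SQL parser.

```python
import re
from collections import Counter

# Matches "FROM <table> [INNER|LEFT|RIGHT] JOIN <table>" (simplified assumption).
JOIN_PATTERN = re.compile(
    r"\bFROM\s+(\w+)\s+(?:INNER\s+|LEFT\s+|RIGHT\s+)?JOIN\s+(\w+)",
    re.IGNORECASE,
)


def join_frequencies(statements):
    """JOIN count per table pair divided by the total number of JOINs."""
    counts = Counter()
    for sql in statements:
        match = JOIN_PATTERN.search(sql)
        if match:
            pair = frozenset((match.group(1).lower(), match.group(2).lower()))
            counts[pair] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}


def normalize_frequencies(freqs):
    """Divide each frequency by the maximum so values fall in 0..1."""
    peak = max(freqs.values(), default=0)
    return {pair: f / peak for pair, f in freqs.items()} if peak else {}
```

If a log contains two JOINs between a medical record table and a user table and one JOIN between a tax table and the user table, the first pair would have frequency 2/3 and normalized strength 1.0, and the second pair would have normalized strength 0.5.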
[0065] After calculating the physical association strength and the logical association strength, the server performs weighted fusion on the physical association strength and the logical association strength to obtain the comprehensive association strength of the directed edge.
[0066] Furthermore, the weighted fusion adopts a linear weighting method. The server multiplies the physical association strength by the physical weight parameter, multiplies the logical association strength by the logical weight parameter, and adds the two to obtain the comprehensive association strength. The sum of the physical weight parameter and the logical weight parameter is 1. Those skilled in the art can select the physical weight parameter and the logical weight parameter according to the degree of emphasis on the reliability of the database structure and the business activity in the actual application scenario.
[0067] The server uses the comprehensive association strength as the associated topology edge weight and writes it into the edge attributes of the inter-table association topology graph through the edge weight update interface. The edge weight update interface modifies the weight field corresponding to the directed edge in the adjacency list data structure and stores the comprehensive association strength as the weight value of the edge.
[0068] After writing the edge weight, the server checks whether the foreign key association corresponding to the directed edge involves fields that have been desensitized. The server obtains the desensitization intensity level of the field through the desensitization status query interface. The desensitization status query interface reads the desensitization method and desensitization parameters of the field from the desensitization operation record table. The desensitization intensity level is divided according to the information retention degree of the desensitization method. For example, mask desensitization retains some characters, and the desensitization intensity level is low. Generalization desensitization replaces the specific value with a range value, and the desensitization intensity level is medium. Substitution desensitization completely replaces it with a random value, and the desensitization intensity level is high.
[0069] After obtaining the desensitization intensity level, the server adjusts the weights of the associated topology edges based on that level. Furthermore, this adjustment multiplies the weight by an attenuation factor. The server determines the attenuation factor based on the desensitization intensity level: 0.8 for a low level, 0.5 for a medium level, and 0.2 for a high level. The server then multiplies the weights of the associated topology edges by the attenuation factor to obtain the adjusted weights. This attenuation adjustment reflects the weakening effect of the desensitization operation on the propagation ability of the association relationships.
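The linear weighted fusion and the attenuation adjustment can be combined in a short sketch. The attenuation factors are the ones given in the text. The default physical weight of 0.5 is an assumption, since the text leaves the weight parameters to the implementer, and the function name is hypothetical.

```python
# Attenuation factors per desensitization intensity level (from the text).
ATTENUATION = {"low": 0.8, "medium": 0.5, "high": 0.2}


def edge_weight(physical, logical, physical_weight=0.5,
                desensitization_level=None):
    """Linear fusion of physical and logical association strength,
    optionally attenuated when the joined fields are desensitized.

    physical_weight=0.5 is an illustrative default; the logical weight
    is 1 - physical_weight so that the two weights sum to 1.
    """
    logical_weight = 1.0 - physical_weight
    combined = physical_weight * physical + logical_weight * logical
    if desensitization_level is not None:
        combined *= ATTENUATION[desensitization_level]
    return combined
```

For a physical strength of 0.6 and a logical strength of 1.0 with equal weights, the comprehensive association strength would be 0.8. If the foreign key field had undergone medium-level desensitization, the adjusted weight would be 0.8 x 0.5 = 0.4.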
[0070] After determining the weights of the associated topological edges, the server calculates the information entropy increment. Specifically, calculating the information entropy increment includes the following operations: For each data table in the government database, the server calculates the initial information entropy value of the data table before cross-table joins based on the total number of records and the histogram of field value distribution in the data table.
[0071] The Shannon entropy formula is: $H = -\sum_{i} p_i \log_2 p_i$, where $p_i$ represents the percentage of records in the i-th interval (expressed as a fraction of the total) and $\sum$ denotes the summation over all intervals. The server reads the number of records in each interval from the field value distribution histogram, divides the number of records in each interval by the total number of records in the data table, and obtains the percentage of records in each interval. The server takes the logarithm to base 2 of the percentage of records in each interval, multiplies it by the percentage of records in each interval, sums the results over all intervals, and takes the negative value to obtain the initial information entropy value.
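The computation described in words in this paragraph (the negative sum, over all intervals, of each record share times its base-2 logarithm) can be sketched as follows. The function name is hypothetical. Empty intervals are skipped, following the usual convention that a zero share contributes nothing to the entropy.

```python
import math


def shannon_entropy(interval_counts):
    """Shannon entropy in bits of a histogram given as record counts."""
    total = sum(interval_counts)
    if total <= 0:
        return 0.0
    entropy = 0.0
    for count in interval_counts:
        if count > 0:  # skip empty intervals (log of 0 is undefined)
            share = count / total
            entropy -= share * math.log2(share)
    return entropy
```

A histogram with four equally populated intervals gives an entropy of 2 bits, while a histogram with all records in one interval gives 0.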
[0072] The initial information entropy value reflects the uniformity of the distribution of records in a data table. A higher initial information entropy value indicates a more uniform distribution of records, making it more difficult to uniquely identify an individual. A lower initial information entropy value indicates a more concentrated distribution of records, making it easier to uniquely identify an individual.
[0073] After calculating the initial information entropy value, the server applies the Shannon entropy formula to the virtual connection result distribution, using the expected number of records and the joint distribution characteristics, to calculate the connection information entropy value after the virtual connection.
[0074] Furthermore, the server reads the expected number of records for each field combination from the joint distribution characteristics, divides the expected number of records for each field combination by the total expected number of records, and obtains the percentage of expected records for each field combination. The server then applies the Shannon entropy formula to the percentage of expected records for each field combination to calculate the connection information entropy value.
[0075] The join entropy value reflects the uniformity of record distribution after a virtual join. Cross-table joins typically increase the dimensionality of field combinations, leading to a more dispersed record distribution, and the join entropy value may be higher than the initial entropy value. However, if a cross-table join introduces highly discriminative field combinations, resulting in a large number of records being uniquely identified, the join entropy value will be significantly lower than the initial entropy value.
[0076] After calculating the connection entropy value, the server calculates the difference between the connection entropy value and the initial entropy value to obtain the single-hop entropy increment. The single-hop entropy increment characterizes the change in entropy caused by a single cross-table join. A negative single-hop entropy increment indicates that the cross-table join reduces entropy, increasing the risk of an individual being uniquely identified. A positive single-hop entropy increment indicates that the cross-table join increases entropy, reducing the risk of an individual being uniquely identified.
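A minimal sketch of the single-hop increment, computed as connection entropy minus initial entropy, is given below. The function names are hypothetical and the inputs are plain lists of counts chosen for illustration.

```python
import math


def entropy_bits(counts):
    """Shannon entropy in bits of a list of (expected) record counts."""
    total = sum(counts)
    if total <= 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


def single_hop_increment(initial_histogram, joined_expected_counts):
    """Connection entropy after the virtual join minus the initial entropy."""
    return entropy_bits(joined_expected_counts) - entropy_bits(initial_histogram)
```

If a table with four equally populated intervals (2 bits) is virtually joined into 16 equally likely combinations (4 bits), the increment is +2. If the joined distribution instead concentrates almost all expected records in one combination, the increment is negative.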
[0077] After calculating the single-hop information entropy increment, the server detects multi-hop propagation paths in the inter-table association topology graph using a graph traversal algorithm. The graph traversal algorithm employs either depth-first search or breadth-first search. Starting from a node in a data table, the server traverses adjacent nodes along directed edges, recording the traversal path. If a traversal path passes through two or more directed edges, then the path is considered a multi-hop propagation path.
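The path detection could be sketched with a depth-first search as below. For simplicity the graph is given here as a dictionary mapping each table to the tables its outgoing edges point to, which is an assumption that differs from the incoming-edge lists described earlier. Skipping already visited tables to avoid cycles is likewise an illustrative choice.

```python
def multi_hop_paths(outgoing, start):
    """Return all simple paths from start that use two or more edges.

    outgoing: hypothetical dict of table name -> list of tables that
    the table's directed edges point to.
    """
    paths = []

    def dfs(node, path):
        for neighbor in outgoing.get(node, []):
            if neighbor in path:      # do not revisit a table (no cycles)
                continue
            new_path = path + [neighbor]
            if len(new_path) - 1 >= 2:  # number of edges on the path
                paths.append(new_path)
            dfs(neighbor, new_path)

    dfs(start, [start])
    return paths
```

With edges prescription -> medical_record -> user, starting from `prescription` yields the single two-hop path through all three tables, while starting from `medical_record` yields no multi-hop path.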
[0078] Upon detecting a multi-hop propagation path, the server sequentially calculates the single-hop information entropy increment for each hop along the path. The server treats the data tables corresponding to two adjacent nodes on the path as a cross-table join and calculates the single-hop information entropy increment for that join. The server repeats this operation until the single-hop information entropy increment for all hops on the path has been calculated.
[0079] After calculating the single-hop information entropy increment for each hop, the server weights the single-hop information entropy increments according to the weights of the associated topological edges corresponding to each hop. Specifically, the weighting operation includes the following steps: The server retrieves the associated topological edge weight corresponding to the nth hop in the multi-hop propagation path through an edge weight query interface. The edge weight query interface reads the weight field value of the directed edge at the nth hop from the adjacency list data structure of the inter-table association topology graph. The server uses this associated topological edge weight as the base value for the propagation attenuation coefficient.
[0080] After obtaining the associated topology edge weights, the server calculates a hop count decay factor based on the hop count n of the multi-hop propagation path. The hop count decay factor decays exponentially as n increases: the server uses the nth power of the decay base as the hop count decay factor. The decay base is less than 1, for example, 0.9. When n is 1, the hop count decay factor is 0.9. When n is 2, the hop count decay factor is 0.81. When n is 3, the hop count decay factor is 0.729. The hop count decay factor reflects the law that sensitivity propagation gradually weakens with increasing path length.
[0081] After calculating the hop count decay factor, the server multiplies the base value of the propagation attenuation coefficient by the hop count decay factor to obtain the actual propagation attenuation coefficient for the nth hop. The actual propagation attenuation coefficient takes into account both the strength of the association and the length of the propagation path.
[0082] After obtaining the actual propagation attenuation coefficient, the server multiplies the single-hop information entropy increment of the nth hop by the actual propagation attenuation coefficient to obtain the weighted information entropy increment of the nth hop. The weighted information entropy increment reflects the actual contribution of that hop in the multi-hop propagation path.
[0083] After calculating the weighted information entropy increment, the server checks whether the data table involved in the nth hop of the multi-hop propagation path contains fields that have undergone desensitization. If it does, the server calculates the desensitization blocking coefficient based on the desensitization strength level of those fields.
[0084] Furthermore, the calculation of the desensitization blocking coefficient is similar to the attenuation adjustment of the associated topological edge weights. The server determines the desensitization blocking coefficient based on the desensitization strength level. When the desensitization strength level is low, the desensitization blocking coefficient is 0.8. When the desensitization strength level is medium, the desensitization blocking coefficient is 0.5. When the desensitization strength level is high, the desensitization blocking coefficient is 0.2. The server multiplies the weighted information entropy increment by the desensitization blocking coefficient to obtain the weighted information entropy increment for the nth hop after considering the desensitization effect.
[0085] After calculating the weighted information entropy increment for each hop, the server accumulates the weighted single-hop information entropy increment for each hop to obtain the multi-hop cumulative information entropy increment. The multi-hop cumulative information entropy increment characterizes the overall information entropy change caused by the multi-hop propagation path.
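The per-hop weighting and accumulation described above can be sketched as follows. The decay base of 0.9 and the blocking coefficients are the example values from the text. The hop dictionary layout and the function name are assumptions.

```python
# Desensitization blocking coefficients per strength level (from the text).
BLOCKING = {"low": 0.8, "medium": 0.5, "high": 0.2}


def cumulative_entropy_increment(hops, decay_base=0.9):
    """Sum the weighted single-hop increments along one propagation path.

    hops: list of dicts in path order (hypothetical layout) with keys
        'increment'        single-hop information entropy increment
        'edge_weight'      associated topological edge weight
        'desensitization'  optional level: 'low', 'medium', or 'high'
    """
    total = 0.0
    for n, hop in enumerate(hops, start=1):
        # Actual propagation attenuation coefficient for the nth hop.
        coefficient = hop["edge_weight"] * decay_base ** n
        weighted = hop["increment"] * coefficient
        level = hop.get("desensitization")
        if level is not None:
            weighted *= BLOCKING[level]
        total += weighted
    return total
```

For a two-hop path with increments -2.0 and -1.0, edge weights 0.5 and 1.0, and medium desensitization on the second hop, the contributions are -2.0 x 0.5 x 0.9 = -0.9 and -1.0 x 1.0 x 0.81 x 0.5 = -0.405, for a cumulative increment of about -1.305.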
[0086] The server uses the single-hop entropy increment or the multi-hop cumulative entropy increment as the entropy increment and stores it in the sensitivity assessment result database through the data persistence interface. The data persistence interface writes the entropy increment into a record in the database table, and the record contains fields such as data table identifier, propagation path identifier, entropy increment value, and calculation timestamp.
[0087] It should be noted that calculating the information entropy increment solves the problem that existing technologies cannot quantitatively assess changes in privacy risk caused by cross-table joins. Existing technologies, when assessing the sensitivity of government data, typically rely on expert experience to determine whether cross-table joins increase the risk of privacy breaches. These assessments lack a quantitative basis and make it difficult to accurately classify risk levels. This application, through the quantitative indicator of information entropy increment, transforms the changes in privacy risk caused by cross-table joins into a calculable and comparable value, providing a scientific basis for the automatic determination of sensitivity levels.
[0088] Furthermore, the introduction of associated topological edge weights allows the calculation of the information entropy increment to take into account differences in the strength of association relationships. Existing technologies treat all cross-table associations as equivalent, ignoring differences in how active each association relationship is in actual business and in the reliability of the data structures. This application accurately characterizes the actual propagation capability of association relationships through a weighted fusion of physical and logical association strengths, avoiding misjudgments caused by treating weak association relationships as strong ones.
[0089] Furthermore, the detection of multi-hop propagation paths and the calculation of multi-hop cumulative information entropy increments solve the problem that existing technologies only consider single cross-table joins while ignoring cascading propagation effects. There are complex relationship networks between tables in government databases, and sensitivity may propagate from one table to another through multi-hop paths. This application systematically detects all possible multi-hop propagation paths using a graph traversal algorithm and calculates the cumulative effect of cascading propagation using a hop count decay factor and a desensitization blocking coefficient, ensuring the comprehensiveness and accuracy of sensitivity assessment.
[0090] Specifically, the exponential decay design of the hop count decay factor conforms to the natural laws of information propagation. As the propagation path length increases, the reliability and relevance of information gradually weaken. This application simulates this law through an exponential decay function, avoiding the overestimation of long-distance propagation paths as high-risk paths. The introduction of the desensitization blocking coefficient further considers the blocking effect of desensitization operations on the propagation path, making the calculation of information entropy increment closer to reality.
[0091] Information entropy increment, as a core quantitative indicator, establishes a complete technical chain from data distribution characteristics to privacy risk quantification. When the information entropy increment is negative and its absolute value is large, it indicates that cross-table associations significantly reduce information entropy, a large number of individuals can be uniquely identified, and the risk of privacy leakage is extremely high. When the information entropy increment is negative but its absolute value is small, it indicates that cross-table associations have a limited impact on information entropy, and the risk of privacy leakage is controllable. When the information entropy increment is positive, it indicates that cross-table associations increase information entropy, reduce the risk of individuals being uniquely identified, and the data can be securely released. This application uses the information entropy increment as a direct basis for subsequent sensitivity level determination, achieving a technical leap from qualitative judgment to quantitative calculation.
[0092] S4: Determine the sensitivity level of stored data based on the information entropy increment, and generate sensitivity assessment results including data openness policy configuration parameters.
[0093] Specifically, the process of determining the sensitivity level of stored data includes the following operations: The server obtains the privacy protection baseline thresholds for government data access by parsing a configuration file. The configuration file is stored in the server's configuration directory and is in INI or YAML format. The server reads the configuration file content, parses the key-value pairs, and extracts the values for the privacy protection baseline thresholds. These thresholds include high-sensitivity and medium-sensitivity thresholds.
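As one possible illustration of the configuration parsing step, the sketch below reads the two thresholds from INI-formatted text with Python's standard `configparser` module. The section name `privacy_baseline` and both key names are hypothetical; the text does not specify them.

```python
import configparser


def load_thresholds(ini_text):
    """Parse INI text and return (high_threshold, medium_threshold).

    The section and key names are assumptions made for this example.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["privacy_baseline"]
    high = section.getfloat("high_sensitivity_threshold")
    medium = section.getfloat("medium_sensitivity_threshold")
    return high, medium


sample_config = """
[privacy_baseline]
high_sensitivity_threshold = 2.0
medium_sensitivity_threshold = 1.0
"""

thresholds = load_thresholds(sample_config)
```

A YAML configuration file would need a third-party parser and is not shown here.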
[0094] The threshold for determining high sensitivity is higher than the threshold for determining medium sensitivity. Those skilled in the art can select the high sensitivity and medium sensitivity thresholds based on the actual security requirements and risk tolerance of government data disclosure. For example, if a government department has high privacy protection requirements, it can set the high sensitivity threshold to 2.0 and the medium sensitivity threshold to 1.0. If a government department wishes to disclose data as much as possible while ensuring security, it can set the high sensitivity threshold to 3.0 and the medium sensitivity threshold to 1.5.
[0095] After obtaining the privacy protection baseline threshold, the server detects the sign and absolute value of the information entropy increment. The server then determines whether the information entropy increment is negative. A negative information entropy increment indicates that cross-table joins reduce information entropy, increasing the risk of individuals being uniquely identified. A non-negative information entropy increment indicates that cross-table joins do not reduce information entropy, and the risk of individual re-identification does not increase.
[0096] When a negative information entropy increment is detected, the server further calculates the absolute value of the information entropy increment and compares the absolute value with the high sensitivity threshold and the medium sensitivity threshold.
[0097] Furthermore, when the absolute value of the information entropy increment exceeds the high sensitivity threshold, the server classifies the corresponding stored data as high sensitivity. A high sensitivity level indicates a significant decrease in information entropy due to cross-table joins, allowing for the unique identification of numerous individuals, posing an extremely high risk of privacy breaches. The stored data cannot be directly made public and must undergo strong desensitization procedures or be prohibited from being accessed.
[0098] When the absolute value of the information entropy increment falls between the high sensitivity threshold and the medium sensitivity threshold, the server classifies the corresponding stored data as medium sensitivity. Medium sensitivity indicates that the decrease in information entropy caused by cross-table joins is moderate, some individuals can be uniquely identified, the risk of privacy breaches is controllable, and the stored data can be opened under restricted conditions, requiring the setting of access permissions and query restrictions.
[0099] When the absolute value of the information entropy increment is less than the medium sensitivity threshold, the server classifies the corresponding stored data as low sensitivity. Low sensitivity indicates that the decrease in information entropy caused by cross-table joins is small, the risk of the individual being uniquely identified is low, and the stored data can be safely accessed.
[0100] When a non-negative information entropy increment is detected, the server directly classifies the corresponding stored data as low-sensitivity. A non-negative information entropy increment indicates that cross-table joins have not reduced information entropy, and may even have increased it. The risk of individual re-identification has not increased, and the stored data can be safely accessed.
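The level determination in the preceding paragraphs can be summarized in a small sketch. The text does not say how a value exactly equal to a threshold is classified, so the boundary handling below (equal to the high threshold counts as medium, equal to the medium threshold counts as medium) is an assumption, as is the function name.

```python
def classify_sensitivity(increment, high_threshold, medium_threshold):
    """Map an information entropy increment to a sensitivity level."""
    if increment >= 0:
        return "low"  # entropy did not decrease: no added re-identification risk
    magnitude = abs(increment)
    if magnitude > high_threshold:
        return "high"
    if magnitude >= medium_threshold:  # boundary handling is an assumption
        return "medium"
    return "low"
```

Using the example thresholds of 2.0 (high) and 1.0 (medium), an increment of -2.5 would be classified as high, -1.5 as medium, -0.4 as low, and +0.7 as low.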
[0101] After determining the sensitivity level, for stored data classified as highly sensitive, the server uses a critical path extraction algorithm to extract from the inter-table relationship topology graph the critical propagation paths, that is, the paths that cause the absolute value of the information entropy increment to exceed the high sensitivity threshold.
[0102] Furthermore, the critical path extraction algorithm is executed as follows: the server traverses all multi-hop propagation paths in the inter-table association topology graph and reads the multi-hop cumulative entropy increment corresponding to each path. The server filters out propagation paths whose absolute value of the multi-hop cumulative entropy increment is greater than the high sensitivity threshold and marks these propagation paths as critical propagation paths.
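The selection step can be sketched as follows. The sketch assumes the multi-hop cumulative entropy increments have already been computed and are available as a mapping from path (a tuple of table names) to increment; that data shape, the function name `extract_critical_paths`, and the sample values are hypothetical.

```python
def extract_critical_paths(path_increments: dict[tuple[str, ...], float],
                           high_threshold: float) -> list[tuple[str, ...]]:
    """Return the paths whose cumulative entropy increment exceeds the
    high sensitivity threshold in absolute value."""
    return [path
            for path, increment in path_increments.items()
            if abs(increment) > high_threshold]


# Hypothetical paths and increments, for illustration only.
paths = {
    ("population", "medical_visit", "insurance_claim"): -2.7,
    ("population", "tax_record"): -0.9,
    ("social_security", "medical_visit"): -2.0,
}
critical = extract_critical_paths(paths, high_threshold=2.0)
```

A path whose increment equals the threshold is not selected, because the description requires the absolute value to be greater than the threshold.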
[0103] The critical propagation path represents the cross-table association link with the strongest sensitivity propagation capability and the highest risk of privacy leakage. The server marks each data table and field on the critical propagation path as a high-risk association. The marking operation adds a risk label to the data table and field, and the risk label includes information such as risk level, discovery time, and association path identifier.
[0104] The server writes the tagging results to the risk object registry. The risk object registry is a table in the database that stores the identifiers and risk information of all high-risk associated objects. The server inserts a new record into the risk object registry, which includes fields such as table name, field name, risk level, association path identifier, and tagging timestamp.
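A minimal sketch of the registry write, using SQLite from the Python standard library purely for illustration. The table name `risk_object_registry`, the column names, and the helper functions are assumptions; the description only lists the kinds of fields a record contains.

```python
import sqlite3
from datetime import datetime, timezone


def create_registry(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) registry table if it does not exist."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS risk_object_registry (
               table_name TEXT NOT NULL,
               field_name TEXT NOT NULL,
               risk_level TEXT NOT NULL,
               path_id    TEXT NOT NULL,
               tagged_at  TEXT NOT NULL
           )"""
    )


def register_critical_path(conn: sqlite3.Connection, path_id: str,
                           fields: list[tuple[str, str]],
                           risk_level: str = "high") -> int:
    """Insert one record per (table, field) on a critical propagation path."""
    tagged_at = datetime.now(timezone.utc).isoformat()
    rows = [(table, field, risk_level, path_id, tagged_at)
            for table, field in fields]
    conn.executemany(
        "INSERT INTO risk_object_registry VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
create_registry(conn)
inserted = register_critical_path(
    conn, "path-001",
    [("population", "id_number"), ("medical_visit", "patient_id")])
```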
[0105] After determining the sensitivity level and marking high-risk associated objects, the server generates a sensitivity assessment result that includes data disclosure policy configuration parameters. Specifically, generating the sensitivity assessment result includes the following operations: The server generates linked desensitization suggestions for stored data with high sensitivity levels. These suggestions include a list of fields along critical propagation paths that require desensitization, along with a recommended desensitization strength level for each field.
[0106] The server extracts all fields from the critical propagation path and constructs a field list. Based on the position and contribution of each field in the critical propagation path, the server determines the recommended desensitization strength level. Fields located at the beginning of the critical propagation path are recommended to have a high desensitization strength level, using substitution desensitization or generalized desensitization. Fields located in the middle of the critical propagation path are recommended to have a medium desensitization strength level, using generalized desensitization or masking desensitization. Fields located at the end of the critical propagation path, if the field itself has already been desensitized, are recommended to have a low desensitization strength level, maintaining the current desensitization state.
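The position-based rule can be sketched as below. The function name and the string labels are hypothetical. The description does not say what to recommend for a field at the end of the path that has not yet been desensitized; this sketch falls back to medium strength in that case, which is an assumption.

```python
def recommend_strength(path_fields: list[str],
                       already_desensitized: set[str]) -> dict[str, str]:
    """Recommend a desensitization strength level for each field on a
    critical propagation path, based on its position in the path."""
    recommendations: dict[str, str] = {}
    last_index = len(path_fields) - 1
    for index, field in enumerate(path_fields):
        if index == 0:
            # Start of the path: substitution or generalization.
            recommendations[field] = "high"
        elif index == last_index and field in already_desensitized:
            # End of the path and already desensitized: keep current state.
            recommendations[field] = "low"
        else:
            # Middle of the path (and, by assumption, an end field that
            # is not yet desensitized): generalization or masking.
            recommendations[field] = "medium"
    return recommendations


# Hypothetical field names, for illustration only.
result = recommend_strength(
    ["population.id_number", "medical_visit.patient_id", "claim.claim_date"],
    already_desensitized={"claim.claim_date"})
```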
[0107] The server generates data access restriction recommendations for stored data with medium sensitivity levels. These recommendations include restricting cross-table query permissions, prohibiting the export of specific field combinations, and setting limits on data access frequency.
[0108] Furthermore, restricting cross-table query permissions means prohibiting users from executing JOIN operations involving medium-sensitivity tables. The server adds rules to the database access control policy to block JOIN queries involving medium-sensitivity tables. Prohibiting the export of specific field combinations means prohibiting users from simultaneously exporting field combinations that might lead to individual re-identification. The server identifies quasi-identifier field combinations in medium-sensitivity tables, adds validation logic to the data export interface, and blocks export requests containing these field combinations. Setting data access frequency limits means limiting the number of times a user can access medium-sensitivity tables per unit of time. The server records the number of user accesses in the database access log; when the number of accesses exceeds the limit, subsequent access requests are rejected.
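Two of these restrictions lend themselves to a short sketch: the export check and the access frequency limit. The names below are hypothetical, and the fixed-window counter is one possible reading of "per unit of time"; the description does not prescribe a particular counting scheme.

```python
from collections import defaultdict


def is_export_blocked(requested_fields: list[str],
                      blocked_combinations: list[frozenset[str]]) -> bool:
    """Block an export that contains every field of any quasi-identifier
    combination."""
    requested = set(requested_fields)
    return any(combo <= requested for combo in blocked_combinations)


class AccessCounter:
    """Fixed-window access counter (an assumed counting scheme)."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._counts: dict[tuple[str, int], int] = defaultdict(int)

    def allow(self, user: str, now: float) -> bool:
        """Record an access at time `now` (seconds) unless the limit for
        the current window has already been reached."""
        key = (user, int(now // self.window_seconds))
        if self._counts[key] >= self.limit:
            return False
        self._counts[key] += 1
        return True


# Hypothetical quasi-identifier combination, for illustration only.
blocked = [frozenset({"age", "address", "occupation"})]
export_a = is_export_blocked(["age", "address", "occupation", "visit_date"], blocked)
export_b = is_export_blocked(["age", "visit_date"], blocked)

counter = AccessCounter(limit=2, window_seconds=60)
decisions = [counter.allow("user-1", t) for t in (0.0, 1.0, 2.0, 61.0)]
```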
[0109] Data access restrictions should be expressed through access control policy configuration files. These files should be in JSON or XML format and include fields such as rule type, target table, restrictions, and effective date. The server will convert the data access restriction recommendations into access control policy configuration files and write them to the configuration directory.
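As an illustration of the JSON form, the following sketch builds one rule and serializes it. The key names (`rule_type`, `target_table`, `restrictions`, `effective_date`), the rule values, and the helper names are assumptions chosen to mirror the fields listed above; the description does not define a schema.

```python
import json


def build_policy_rule(rule_type: str, target_table: str,
                      restrictions: dict, effective_date: str) -> dict:
    """Assemble one access control rule as a plain dictionary."""
    return {
        "rule_type": rule_type,
        "target_table": target_table,
        "restrictions": restrictions,
        "effective_date": effective_date,
    }


def render_policy(rules: list[dict]) -> str:
    """Serialize the rules to a JSON document."""
    return json.dumps({"rules": rules}, indent=2, sort_keys=True)


# Hypothetical rule content, for illustration only.
rule = build_policy_rule(
    rule_type="deny_join",
    target_table="medical_visit",
    restrictions={"max_queries_per_hour": 100},
    effective_date="2026-01-01",
)
policy_text = render_policy([rule])
```

Writing `policy_text` to a file in the configuration directory would complete the step described above; the directory location is deployment specific and is not shown.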
[0110] The server generates a data access license identifier for stored data with low sensitivity levels. This identifier indicates that the stored data can be opened directly in its current desensitization state. The data access license identifier is represented by a Boolean value or an enumerated value; the server sets the identifier for low-sensitivity data tables to "Allow access".
[0111] After generating the linked desensitization suggestions, data access restriction recommendations, and data access license identifiers, the server consolidates them into a structured sensitivity assessment result report.
[0112] Furthermore, the sensitivity assessment results report includes the sensitivity level distribution of each data table, a visualization map of high-risk association paths, and a list of tiered openness strategy configurations.
[0113] The sensitivity level distribution of each data table is presented in tabular form, with columns including data table name, sensitivity level, information entropy increment, and evaluation time. The server iterates through all data tables, reads the sensitivity level and information entropy increment of each table, and populates the table content.
[0114] The high-risk association path visualization map presents the critical propagation paths graphically. The server uses a graphing library to draw the critical propagation paths in the inter-table relationship topology graph as a directed graph. Nodes represent data tables, edges represent foreign key relationships, and the color and thickness of each edge indicate its association topology edge weight. The server labels each node with its sensitivity level and each edge with its information entropy increment contribution value.
[0115] The tiered access policy configuration list presents the access policies corresponding to different sensitivity levels in list format. The list includes columns such as sensitivity level, access policy type, policy parameters, and applicable data tables. The server populates the list based on the linked desensitization suggestions, data access restriction recommendations, and data access license identifiers.
[0116] The server outputs the sensitivity assessment results report in JSON or XML format. JSON uses a nested key-value pair structure, while XML uses a nested tag structure. The server converts the sensitivity level distribution of each data table, the visualization map of high-risk association paths, and the tiered access policy configuration list into JSON or XML data structures and writes them to a file.
[0117] After generating the sensitivity assessment results, the server also performs continuous monitoring and incremental reassessment. Specifically, continuous monitoring and incremental reassessment include the following steps: The server obtains data update logs from the government database through a database change data capture interface. This interface subscribes to the database's change event stream and receives real-time notifications of data changes occurring in the database. The data update logs record all data change operations that occur in the database, including INSERT, UPDATE, DELETE, and ALTER TABLE operations.
[0118] The server monitors update events in the data update logs that involve new records, modified field values, and table structure changes. The server parses the operation type and object in the data update logs. INSERT operations correspond to new record events, UPDATE operations correspond to modified field value events, and ALTER TABLE operations correspond to table structure change events.
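The mapping from operation type to update event can be sketched as a lookup. The event labels and the function name are hypothetical. Note that the paragraph above assigns no event to DELETE operations, so the sketch returns `None` for them rather than guessing.

```python
from typing import Optional

# Operation types named in the description and their update events.
EVENT_BY_OPERATION = {
    "INSERT": "new_record",
    "UPDATE": "field_value_modified",
    "ALTER TABLE": "table_structure_changed",
}


def classify_update_event(operation: str) -> Optional[str]:
    """Return the update event for a logged operation type, or None if
    the description defines no event for it (for example DELETE)."""
    normalized = " ".join(operation.upper().split())
    return EVENT_BY_OPERATION.get(normalized)


events = [classify_update_event(op)
          for op in ("insert", "UPDATE", "alter  table", "DELETE")]
```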
[0119] Upon detecting an update event, the server checks whether the update event involves high-risk associated objects. The server reads the identifiers of all high-risk associated objects from the risk object registry and determines whether the object operated on by the update event is in the list of high-risk associated objects.
[0120] When an update event is detected involving a high-risk associated object, the server triggers an incremental sensitivity reassessment. The incremental sensitivity reassessment re-acquires metadata statistics for the data tables affected by the update event, reconstructs the distribution of virtual connection results in an isolated sandbox environment, and recalculates the information entropy increment.
[0121] Furthermore, the incremental sensitivity reassessment process is as follows: The server identifies the data tables affected by the update event. If the update event involves adding a record or modifying a field value, the affected data table is the table where the update event's operation is located. If the update event involves a table structure change, the affected data tables are the changed data table and all data tables associated with it through foreign key relationships.
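The rule for identifying affected tables can be sketched as follows. Foreign key relationships are assumed to be available as `(referencing_table, referenced_table)` pairs, and "associated through foreign key relationships" is read as direct neighbors in either direction; both are assumptions, as are the names used.

```python
def affected_tables(event_type: str, table: str,
                    foreign_keys: list[tuple[str, str]]) -> set[str]:
    """Return the set of tables to reassess for an update event."""
    if event_type in ("new_record", "field_value_modified"):
        # Only the table the operation was executed on.
        return {table}
    if event_type == "table_structure_changed":
        # The changed table plus its direct foreign key neighbors.
        neighbors = {dst if src == table else src
                     for src, dst in foreign_keys
                     if table in (src, dst)}
        return {table} | neighbors
    raise ValueError(f"unknown event type: {event_type}")


# Hypothetical foreign key pairs, for illustration only.
fks = [("medical_visit", "population"),
       ("insurance_claim", "medical_visit"),
       ("tax_record", "population")]
changed = affected_tables("table_structure_changed", "medical_visit", fks)
inserted = affected_tables("new_record", "medical_visit", fks)
```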
[0122] After identifying the affected data tables, the server re-obtains their metadata statistics through the database metadata collection interface. The server reads the database system catalog tables and extracts information such as the association key cardinality, the total number of records, and the field value distribution histogram of the affected data tables.
[0123] After retrieving the metadata statistics, the server reconstructs the virtual connection result distribution in the isolated sandbox environment. The server executes the virtual connection result distribution construction process described in step S2 and derives the expected number of records and the joint distribution characteristics after cross-table joins from the updated metadata statistics.
[0124] After reconstructing the virtual connection result distribution, the server recalculates the information entropy increment. The server executes the information entropy increment calculation process described in step S3, calculating the new information entropy increment based on the updated virtual connection result distribution and associated topological edge weights.
[0125] After calculating the new information entropy increment, the server compares the new information entropy increment with the historical information entropy increment retrieved from the sensitivity assessment result database. The server reads the information entropy increment of the affected data table at the time of the last assessment from the sensitivity assessment result database, and uses it as the historical information entropy increment.
[0126] The server calculates the difference between the absolute value of the new information entropy increment and the absolute value of the historical information entropy increment, and determines whether the difference exceeds a preset change range. The preset change range represents the minimum change amount that triggers a sensitivity escalation alarm. Those skilled in the art can select the preset change range based on the risk monitoring sensitivity of open government data. For example, if a high level of sensitivity to risk changes is desired, the preset change range can be set to 0.5. If false alarms are to be reduced, the preset change range can be set to 1.0.
[0127] When the difference between the absolute value of the new information entropy increment and the absolute value of the historical information entropy increment exceeds the preset change range, the server generates a sensitivity escalation alarm. The sensitivity escalation alarm is sent to the data administrator as a message that includes information such as the name of the affected data table, the historical information entropy increment, the new information entropy increment, the change amount, and the alarm time.
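The alarm check can be sketched as below. The sketch treats the difference as signed (new magnitude minus historical magnitude), so only an increase triggers the alarm; that reading of "escalation" is an assumption, as are the function name and the message keys.

```python
from datetime import datetime, timezone
from typing import Optional


def check_escalation(table: str, historical_increment: float,
                     new_increment: float,
                     change_range: float) -> Optional[dict]:
    """Return an alarm message if the entropy increment magnitude grew by
    more than the preset change range, otherwise None."""
    change = abs(new_increment) - abs(historical_increment)
    if change <= change_range:
        return None
    return {
        "table": table,
        "historical_increment": historical_increment,
        "new_increment": new_increment,
        "change": change,
        "alarm_time": datetime.now(timezone.utc).isoformat(),
    }


# Using the two example change ranges mentioned in the description.
alarm_sensitive = check_escalation("medical_visit", -1.2, -2.0, change_range=0.5)
alarm_relaxed = check_escalation("medical_visit", -1.2, -2.0, change_range=1.0)
```

With the same pair of increments, the 0.5 setting raises an alarm and the 1.0 setting does not, which illustrates the trade-off between sensitivity to risk changes and false alarms described above.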
[0128] After generating a sensitivity escalation alert, the server responds by adjusting the recommended desensitization strength level of the corresponding field in the linked desensitization suggestions. Furthermore, the server re-determines the sensitivity level of the affected data table based on the new information entropy increment. If the absolute value of the new information entropy increment exceeds the high sensitivity threshold, the server upgrades the sensitivity level of the affected data table from medium or low sensitivity to high sensitivity. The server regenerates the linked desensitization suggestions, increasing the recommended desensitization strength level of fields on critical propagation paths. For example, a field originally recommended at a medium desensitization strength level is upgraded to a high desensitization strength level.
[0129] After adjusting the linked desensitization recommendations, the server updates the sensitivity assessment result report. The server modifies the sensitivity level, information entropy increment, and linked desensitization recommendations of the affected data tables in the sensitivity assessment result report, and regenerates the report file in JSON or XML format.
[0130] After updating the sensitivity assessment result report, the server checks whether the update event involves fields that have already undergone desensitization. The server reads the identifiers of all desensitized fields from the desensitization operation record table and determines whether the object of the update event is in the list of desensitized fields.
[0131] When an update event is detected that involves a field that has already undergone desensitization, the server recalculates the desensitization blocking coefficient. The server obtains the latest desensitization strength level of the field through the desensitization status query interface and re-determines the desensitization blocking coefficient based on that level. If the desensitization strength level of a field changes, the desensitization blocking coefficient changes accordingly.
[0132] After recalculating the desensitization blocking coefficient, the server reassesses the effectiveness of the critical propagation path based on the updated coefficient. The server re-executes the multi-hop cumulative entropy increment calculation process described in step S3, applying the updated desensitization blocking coefficient to the weighted entropy increment calculation. The server determines whether the multi-hop cumulative entropy increment of the critical propagation path still exceeds the high-sensitivity threshold. If it no longer does, the server removes the propagation path from the critical propagation path list and updates the risk object registry.
[0133] It should be noted that determining the sensitivity level and generating assessment results solves the problem of the lack of automated sensitivity grading and access control policy generation capabilities in existing technologies. Existing technologies, after assessing the sensitivity of government data, typically require data administrators to manually formulate data access control policies based on the assessment results, including determining which data can be opened, which data needs to be anonymized, and which data needs to have access restricted. Manual policy formulation is inefficient and prone to inappropriate policies due to lack of experience or oversight. This application, through automated sensitivity level determination and tiered access control policy generation, directly transforms the assessment results into executable configuration parameters, significantly improving the efficiency and security of government data access control.
[0134] Furthermore, the introduction of privacy protection benchmark thresholds makes sensitivity level determination configurable and flexible. Different government departments have different security requirements and risk tolerance for data sharing, and a uniform sensitivity determination standard cannot meet the needs of all scenarios. This application provides privacy protection benchmark thresholds through configuration files, allowing each government department to adjust the determination standard according to its own actual situation, thus realizing the customization of sensitivity assessment.
[0135] Preferably, the critical path extraction algorithm and the high-risk associated object marking provide precise target location for subsequent linked desensitization operations. Existing technologies typically perform desensitization on the entire data table after identifying highly sensitive data, leading to over-desensitization and reduced data usability. This application, through a critical path extraction algorithm, accurately locates the fields and related links that lead to high risk, and performs desensitization only on fields along the critical propagation path, maximizing data usability while ensuring security.
[0136] Specifically, the tiered data access strategy reflects the concept of differentiated management. High-sensitivity data receives linked desensitization suggestions, so that technical means reduce the risk before the data is released. Medium-sensitivity data receives data access restriction recommendations, which limit risk exposure through access control measures. Low-sensitivity data is assigned a data access license identifier and can be released directly without additional processing. This tiered access strategy ensures the security of high-risk data while avoiding excessive restrictions on low-risk data, achieving a balance between security and availability.
[0137] Continuous monitoring and incremental reassessment address the issue of static sensitivity assessment results in existing technologies. Data in government databases is constantly changing; operations such as adding records, modifying field values, and altering table structures can change the data's sensitivity characteristics. Existing technologies typically update sensitivity assessment results through periodic full reassessments, resulting in long assessment cycles and failing to promptly detect changes in risk. This application uses a database change data capture interface to monitor data changes in real time and performs incremental reassessments on affected data tables, achieving dynamic updates of sensitivity assessment results and ensuring the timeliness and accuracy of the assessment outcomes.
[0138] Furthermore, the system establishes a closed-loop management process from risk discovery to risk mitigation by triggering sensitivity escalation alerts and adjusting desensitization recommendations. When a significant increase in data sensitivity is detected, the system automatically generates an alert and adjusts desensitization recommendations, enabling risk response without manual intervention and significantly improving the security of open government data.
[0139] The dynamic updating of the desensitization blocking coefficient and the reassessment of the effectiveness of critical propagation paths keep the sensitivity assessment results consistent with the actual state of the data. When the desensitization strength of a desensitized field changes, the system automatically recalculates the desensitization blocking coefficient and reassesses the risk level of the critical propagation path, so that assessment results are not invalidated by changes in the desensitization status.
[0140] It should be understood that although the steps in the flowcharts of the above embodiments are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the above embodiments may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.
[0141] Based on the same inventive concept, this application also provides a system for automatically assessing the sensitivity of stored data for government data opening. The solution implemented by this system is similar to that described in the above method. Therefore, for the specific limitations of the one or more system embodiments provided below, refer to the limitations of the method described above; they are not repeated here.
[0142] Referring to Figure 2, the diagram illustrates the application environment of the automatic sensitivity assessment method for stored data for government data opening provided in this application embodiment. The client is a terminal device used by government data administrators and communicates with the server via a network. The automatic sensitivity assessment system is deployed on the server, which acquires metadata statistics from the government database through a database metadata collection interface. These statistics include the association key cardinality, the total number of records in the data table, the field value distribution histogram, and the association topology edge weights. The server constructs a virtual connection result distribution based on the metadata statistics in an isolated sandbox environment. Access to actual data records in the government database is prohibited in the isolated sandbox environment, thus avoiding the risk of data leakage during the assessment process. The server calculates the information entropy increment based on the virtual connection result distribution and the association topology edge weights, determines the sensitivity level of the stored data based on the information entropy increment, and generates a sensitivity assessment result containing data openness policy configuration parameters. The data storage system stores data such as the government database, the sensitivity assessment result database, and the risk object registry. The data storage system can be integrated into the server or deployed on a separate database server. The client receives the sensitivity assessment result report generated by the server via the network. Data administrators formulate corresponding data openness policies based on the linked desensitization suggestions, data access restriction recommendations, and data access license identifiers in the report.
[0143] In one exemplary embodiment, as shown in Figure 3, a system for automatically assessing the sensitivity of stored data for government data opening is provided, including: a statistics acquisition module, which acquires metadata statistics from the government database, the metadata statistics including the association key cardinality, the total number of records in the data table, the field value distribution histogram, and the association topology edge weights; a distribution construction module, which constructs, based on the metadata statistics, a virtual connection result distribution based on statistical inference in an isolated sandbox environment, wherein the isolated sandbox environment prohibits access to the actual data records in the government database; an entropy increment calculation module, which calculates the information entropy increment based on the virtual connection result distribution and the association topology edge weights, wherein the information entropy increment represents the change in individual re-identification risk caused by cross-table logical association; and an evaluation generation module, which determines the sensitivity level of the stored data based on the information entropy increment and generates a sensitivity assessment result that includes data openness policy configuration parameters.
[0144] The modules in the aforementioned automated sensitivity assessment system for open government data storage can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the computer device's memory as software, allowing the processor to invoke and execute the corresponding operations.
[0145] In one exemplary embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in Figure 4. The computer device includes a processor, memory, input/output interfaces, a communication interface, a display unit, and an input device. The processor, memory, and input/output interfaces are connected via a system bus, and the communication interface, display unit, and input device are connected to the system bus via the input/output interfaces. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for running the operating system and computer programs stored in the non-volatile storage media. The input/output interfaces are used for exchanging information between the processor and external devices. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, mobile cellular networks, Near Field Communication (NFC), or other technologies. When executed by the processor, the computer program implements the method for automatically assessing the sensitivity of stored data for government data opening. The display unit is used to form a visible image and can be a display screen, a projection device, or a virtual reality imaging device. The display screen can be an LCD screen or an e-ink screen. The input device of the computer device can be a touch layer covering the display screen; buttons, a trackball, or a touchpad provided on the casing of the computer device; or an external keyboard, touchpad, or mouse.
[0146] Those skilled in the art will understand that the structure shown in Figure 4 is merely a block diagram of the portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, combine certain components, or have different component arrangements.
[0147] In one embodiment, a computer device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the above method embodiments.
[0148] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon that, when executed by a processor, implements the steps in the above method embodiments.
[0149] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps in the above method embodiments.
[0150] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with relevant regulations.
[0151] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile memory and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, artificial intelligence (AI) processors, etc., and are not limited to these.
[0152] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this application.
[0153] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.
Claims
1. A method for automatically assessing the sensitivity of stored data for government data sharing, characterized in that, include: Obtain metadata statistics from the government database, including association key cardinality, total number of records in the data table, histogram of field value distribution, and association topology edge weights; Based on the metadata statistics, a virtual connection result distribution based on statistical inference is constructed in an isolated sandbox environment, where access to actual data records in the government database is prohibited. The information entropy increment is calculated based on the distribution of the virtual connection results and the weight of the associated topology edge. The information entropy increment represents the change in individual re-identification risk caused by cross-table logical association. The sensitivity level of the stored data is determined based on the information entropy increment, and a sensitivity assessment result containing data openness policy configuration parameters is generated.
2. The automatic sensitivity assessment method for stored data oriented towards government data openness as described in claim 1, characterized in that: The construction of a virtual connectivity result distribution based on statistical inference in an isolated sandbox environment includes: Extract the cardinality of the association keys of the first data table and the cardinality of the association keys of the second data table from the metadata statistics, and calculate the association key overlap coefficient between the first data table and the second data table; Based on the overlap coefficient of the association key, the total number of records in the first data table, and the total number of records in the second data table, the expected number of records after cross-table join is deduced. Extract the value range distribution characteristics of each field from the field value distribution histogram of the first data table and the field value distribution histogram of the second data table, and calculate the joint distribution characteristics of the field combination after cross-table join; The expected number of records and the joint distribution features are combined to obtain the distribution of the virtual connection results.
3. The method according to claim 2, characterized in that calculating the joint distribution characteristics of the field combinations after the cross-table join comprises: extracting the value interval partitioning results of a quasi-identifier field from the field value distribution histogram of the first data table, the quasi-identifier field including an age field, an address field, an occupation field, and a timestamp field; extracting, from the field value distribution histogram of the second data table, the value interval partitioning results of an associated field that is semantically related to the quasi-identifier field; expanding the value intervals of the quasi-identifier field against the value intervals of the associated field to generate a candidate interval combination set for cross-table field combinations; and counting the candidate interval combinations in the candidate interval combination set whose expected number of records is a single record, and taking the ratio of that count to the total number of candidate interval combinations as a uniqueness ratio, the uniqueness ratio serving as the core quantitative indicator of the joint distribution characteristics.
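The uniqueness ratio of claim 3 can be sketched as follows, as an illustration only. Several points are assumptions not stated in the claim: histograms are represented as mappings from interval label to relative frequency, the two fields are treated as independent, and a combination counts as "a single record" when its expected count rounds to one (0.5 <= count < 1.5). All names are hypothetical.

```python
from itertools import product


def uniqueness_ratio(expected_rows: float,
                     qi_hist: dict,
                     assoc_hist: dict,
                     low: float = 0.5,
                     high: float = 1.5) -> float:
    """Share of candidate interval combinations expected to hold one record.

    qi_hist / assoc_hist map an interval label to its relative frequency.
    Independence between the two fields is assumed here.
    """
    combos = list(product(qi_hist.values(), assoc_hist.values()))
    if not combos:
        return 0.0
    singles = 0
    for p, q in combos:
        expected = expected_rows * p * q  # expected records in this cell
        if low <= expected < high:
            singles += 1
    return singles / len(combos)


# Hypothetical histograms: three age intervals, two district intervals.
age = {"0-17": 0.5, "18-64": 0.25, "65+": 0.25}
district = {"north": 0.5, "south": 0.5}
ratio = uniqueness_ratio(8, age, district)  # 4 of the 6 cells expect one record
```

A ratio close to 1 would indicate that most joined cells isolate a single individual, which is the situation the claim treats as high re-identification risk.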
4. The method according to claim 1, characterized in that the association topology edge weights are determined as follows: obtaining the foreign key relationships between the data tables in the government database, and constructing an inter-table association topology graph, wherein the inter-table association topology graph uses data tables as nodes and foreign key relationships as directed edges; for each directed edge in the inter-table association topology graph, calculating the physical association strength of the directed edge based on the data type consistency of the foreign key field, the integrity of the foreign key constraint, and the cardinality type of the foreign key association; extracting, from the database query logs, the frequency of JOIN operations involving the table pair corresponding to the directed edge, and normalizing the frequency of JOIN operations to obtain the logical association strength of the directed edge; performing weighted fusion of the physical association strength and the logical association strength to obtain the comprehensive association strength of the directed edge, and using the comprehensive association strength as the association topology edge weight; and, when the foreign key association corresponding to the directed edge involves a field that has undergone de-identification, attenuating the association topology edge weight according to the de-identification strength level of the field.
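The edge-weight computation of claim 4 can be sketched as follows, for illustration only. The claim names the inputs but not the scoring, so every numeric choice here is an assumption: the cardinality scores, the equal averaging of the three physical factors, max-normalization of JOIN frequencies, the fusion weight alpha, and the linear attenuation by de-identification level.

```python
# Hypothetical scores for the cardinality type of a foreign key association.
CARDINALITY_SCORE = {"1:1": 1.0, "1:N": 0.7, "N:M": 0.4}


def physical_strength(type_consistent: bool, constraint_integrity: float,
                      cardinality: str) -> float:
    """Average of the three physical factors, each scaled to [0, 1]."""
    return (float(type_consistent) + constraint_integrity
            + CARDINALITY_SCORE[cardinality]) / 3


def normalize_join_frequencies(join_counts: dict) -> dict:
    """Scale JOIN counts from the query log by the largest count."""
    peak = max(join_counts.values(), default=0)
    return {edge: (count / peak if peak else 0.0)
            for edge, count in join_counts.items()}


def edge_weight(physical: float, logical: float, alpha: float = 0.5,
                deid_level: int = 0, max_level: int = 3) -> float:
    """Weighted fusion, attenuated when the key field is de-identified."""
    fused = alpha * physical + (1 - alpha) * logical
    attenuation = 1 - deid_level / (max_level + 1)  # level 0 leaves it unchanged
    return fused * attenuation


logical = normalize_join_frequencies({("resident", "clinic"): 30,
                                      ("clinic", "insurer"): 10})
weight = edge_weight(physical_strength(True, 1.0, "1:1"),
                     logical[("resident", "clinic")])
```

Keeping the attenuation as a separate multiplicative factor means a stronger de-identification level always lowers the weight without changing the ordering produced by the fusion step.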
5. The method according to claim 1, characterized in that calculating the information entropy increment comprises: for each data table in the government database, calculating the initial information entropy value of the data table before the cross-table join based on the total number of records in the data table and the field value distribution histogram; calculating, for the virtual connection result distribution, the connection information entropy value after the virtual connection based on the expected number of records and the joint distribution characteristics; calculating the difference between the connection information entropy value and the initial information entropy value to obtain a single-hop information entropy increment; when a multi-hop propagation path is detected in the inter-table association topology graph, calculating the single-hop information entropy increment of each hop sequentially along the multi-hop propagation path, weighting each single-hop information entropy increment according to the association topology edge weight corresponding to that hop, and accumulating the weighted single-hop information entropy increments hop by hop to obtain a multi-hop cumulative information entropy increment, which is used as the information entropy increment; and storing the information entropy increment in a sensitivity assessment result database through a data persistence interface.
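The single-hop increment of claim 5 can be sketched as follows, as an illustration rather than the claimed formula. The claim does not define the entropy measure. This sketch assumes an "anonymity entropy": the expected base-2 logarithm of the number of records sharing a histogram interval, so that a join which splits records into smaller groups yields a negative increment, matching the sign convention used in claim 7. Function names are hypothetical.

```python
import math
from typing import Sequence


def anonymity_entropy(total_rows: float, interval_probs: Sequence[float]) -> float:
    """Expected log2 size of the group a record falls into.

    interval_probs are relative frequencies from a histogram; groups
    smaller than one record are clamped to a size of one.
    """
    entropy = 0.0
    for p in interval_probs:
        if p > 0:
            group_size = max(total_rows * p, 1.0)
            entropy += p * math.log2(group_size)
    return entropy


def single_hop_increment(rows_before: float, probs_before: Sequence[float],
                         rows_after: float, joint_probs: Sequence[float]) -> float:
    """Connection entropy minus initial entropy; negative means higher risk."""
    return (anonymity_entropy(rows_after, joint_probs)
            - anonymity_entropy(rows_before, probs_before))


# 1024 records in two equal intervals, split into four equal cells by the join.
delta = single_hop_increment(1024, [0.5, 0.5], 1024, [0.25] * 4)
```

In this example each record goes from sharing an interval with 512 records to sharing a cell with 256, so one bit of anonymity is lost.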
6. The method according to claim 5, characterized in that weighting each single-hop information entropy increment according to the association topology edge weight corresponding to that hop comprises: obtaining the association topology edge weight corresponding to the nth hop on the multi-hop propagation path, and using the association topology edge weight as the base value of a propagation attenuation coefficient; calculating a hop count attenuation factor based on the hop number n on the multi-hop propagation path; multiplying the base value of the propagation attenuation coefficient by the hop count attenuation factor to obtain the actual propagation attenuation coefficient of the nth hop; multiplying the single-hop information entropy increment of the nth hop by the actual propagation attenuation coefficient to obtain the weighted information entropy increment of the nth hop; and, when it is detected that the data table involved in the nth hop of the multi-hop propagation path contains a field that has undergone de-identification, calculating a de-identification blocking coefficient based on the de-identification strength level of the field, and multiplying the weighted information entropy increment by the de-identification blocking coefficient.
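The per-hop weighting of claim 6, together with the hop-by-hop accumulation of claim 5, can be sketched as follows. This is illustrative only. The claim does not give the form of the hop count attenuation factor or the blocking coefficient, so a geometric decay (decay ** (n - 1)) and a linear blocking coefficient are assumed, and the `Hop` record is a hypothetical container.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Hop:
    entropy_increment: float  # single-hop information entropy increment
    edge_weight: float        # association topology edge weight of this hop
    deid_level: int = 0       # 0 means no de-identified field on this hop


def hop_attenuation(n: int, decay: float = 0.5) -> float:
    """Assumed geometric decay: the first hop is not attenuated."""
    return decay ** (n - 1)


def blocking_coefficient(deid_level: int, max_level: int = 4) -> float:
    """Assumed linear form: the maximum level blocks propagation entirely."""
    return 1.0 - min(deid_level, max_level) / max_level


def multi_hop_increment(path: Iterable[Hop], decay: float = 0.5) -> float:
    """Accumulate weighted single-hop increments along the path."""
    total = 0.0
    for n, hop in enumerate(path, start=1):
        coefficient = hop.edge_weight * hop_attenuation(n, decay)
        weighted = hop.entropy_increment * coefficient
        if hop.deid_level > 0:
            weighted *= blocking_coefficient(hop.deid_level)
        total += weighted
    return total


path = [Hop(-2.0, 1.0), Hop(-4.0, 0.5, deid_level=2)]
cumulative = multi_hop_increment(path)  # -2.0 + (-4.0 * 0.25 * 0.5) = -2.5
```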
7. The method according to claim 1, characterized in that determining the sensitivity level of the stored data comprises: obtaining privacy protection benchmark thresholds for government data openness, the privacy protection benchmark thresholds including a high sensitivity determination threshold and a medium sensitivity determination threshold; when the information entropy increment is negative and the absolute value of the information entropy increment is greater than the high sensitivity determination threshold, determining the corresponding stored data to be of a high sensitivity level; when the information entropy increment is negative and the absolute value of the information entropy increment lies between the medium sensitivity determination threshold and the high sensitivity determination threshold, determining the corresponding stored data to be of a medium sensitivity level; when the information entropy increment is negative and the absolute value of the information entropy increment is less than the medium sensitivity determination threshold, or when the information entropy increment is non-negative, determining the corresponding stored data to be of a low sensitivity level; and, for the stored data of the high sensitivity level, extracting from the inter-table association topology graph the key propagation path that causes the absolute value of the information entropy increment to exceed the high sensitivity determination threshold, and marking each data table and field on the key propagation path as a high-risk association object.
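The threshold rule of claim 7 can be sketched as follows. The level labels and the function name are hypothetical, and treating both boundary values as belonging to the medium band is an assumption, since the claim only says "between".

```python
def sensitivity_level(entropy_increment: float,
                      high_threshold: float,
                      medium_threshold: float) -> str:
    """Map an information entropy increment to a sensitivity level."""
    if high_threshold < medium_threshold:
        raise ValueError("high threshold must not be below medium threshold")
    if entropy_increment >= 0:
        return "low"  # a non-negative increment is always low
    magnitude = -entropy_increment
    if magnitude > high_threshold:
        return "high"
    if magnitude >= medium_threshold:
        return "medium"  # boundary values assumed to fall in the medium band
    return "low"


# Hypothetical thresholds: high = 2.0, medium = 1.0.
levels = [sensitivity_level(x, 2.0, 1.0) for x in (-2.5, -1.5, -0.5, 0.3)]
```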
8. The method according to claim 7, characterized in that generating the sensitivity assessment result containing data openness policy configuration parameters comprises: for the stored data of the high sensitivity level, generating a linkage desensitization suggestion, the linkage desensitization suggestion including a list of the fields on the key propagation path that need to be desensitized and a recommended desensitization strength level for each field; for the stored data of the medium sensitivity level, generating data access restriction suggestions, the data access restriction suggestions including restricting cross-table query permissions, prohibiting the export of specific field combinations, and setting data access frequency limits, and being expressed through an access control policy configuration file; for the stored data of the low sensitivity level, generating a data opening license identifier, which indicates that the stored data can be opened directly in its current desensitized state; and aggregating the linkage desensitization suggestion, the data access restriction suggestions, and the data opening license identifier to generate a structured sensitivity assessment result report, the sensitivity assessment result report including the sensitivity level distribution of each data table, a visualization map of high-risk association paths, and a list of tiered opening policy configurations, and being output in JSON or XML format.
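A JSON form of the report described in claim 8 might look like the sketch below. The claim fixes the report contents but not a schema, so every key name, the input record layout, and the default rate limit are assumptions made for illustration. The visualization map is represented only by the list of high-risk paths it would be drawn from.

```python
import json


def build_report(assessments: list) -> str:
    """Aggregate per-table assessments into a JSON report.

    Each assessment is a dict with hypothetical keys: table, level, and
    optionally path, fields, blocked, rate_limit.
    """
    report = {"table_levels": {}, "high_risk_paths": [], "opening_policies": []}
    for item in assessments:
        table, level = item["table"], item["level"]
        report["table_levels"][table] = level
        if level == "high":
            report["high_risk_paths"].append(item.get("path", []))
            policy = {"type": "linkage_desensitization",
                      "fields": item.get("fields", {})}  # field -> strength level
        elif level == "medium":
            policy = {"type": "access_restriction",
                      "restrict_cross_table_query": True,
                      "blocked_field_combinations": item.get("blocked", []),
                      "max_requests_per_hour": item.get("rate_limit", 100)}
        else:
            policy = {"type": "open_license", "licensed": True}
        report["opening_policies"].append({"table": table, **policy})
    return json.dumps(report, ensure_ascii=False, indent=2)


report_json = build_report([
    {"table": "resident", "level": "high",
     "path": ["resident", "clinic"], "fields": {"age": 3}},
    {"table": "library_card", "level": "low"},
])
```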
9. The method according to claim 1, characterized in that, after generating the sensitivity assessment result containing data openness policy configuration parameters, the method further comprises: monitoring update events in the data update logs of the government database, the update events involving new records, modified field values, and changes to table structure; when an update event is detected to involve a high-risk association object, triggering an incremental sensitivity reassessment, wherein the incremental sensitivity reassessment re-acquires the metadata statistics for the data table affected by the update event, reconstructs the virtual connection result distribution in the isolated sandbox environment, and recalculates the information entropy increment; comparing the newly calculated information entropy increment with the historical information entropy increment, and generating a sensitivity upgrade alarm when the absolute value of the new information entropy increment exceeds the absolute value of the historical information entropy increment by more than a preset change range; and, in response to the sensitivity upgrade alarm, adjusting the recommended desensitization strength level of the corresponding field in the linkage desensitization suggestion.
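The trigger and alarm logic of claim 9 can be sketched as follows, for illustration only. The claim does not say whether the preset change range is absolute or relative, so an absolute difference is assumed. Raising the recommended level by one step up to a cap is likewise an assumption, and all names are hypothetical.

```python
def touches_high_risk(event_tables, high_risk_tables) -> bool:
    """True when an update event affects any high-risk association object."""
    return not set(event_tables).isdisjoint(high_risk_tables)


def sensitivity_upgrade_alarm(new_increment: float,
                              historical_increment: float,
                              change_range: float) -> bool:
    """Alarm when |new| exceeds |historical| by more than change_range."""
    growth = abs(new_increment) - abs(historical_increment)
    return growth > change_range


def adjusted_strength_level(current_level: int, alarm: bool,
                            max_level: int = 5) -> int:
    """Raise the recommended desensitization strength by one step on alarm."""
    return min(current_level + 1, max_level) if alarm else current_level


if touches_high_risk(["resident"], {"resident", "clinic"}):
    alarm = sensitivity_upgrade_alarm(-3.0, -2.0, change_range=0.5)
    new_level = adjusted_strength_level(2, alarm)
```

Restricting reassessment to events that touch high-risk objects keeps the sandbox recomputation incremental instead of rerunning the full assessment on every update.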
10. A system for automatically assessing the sensitivity of stored data oriented toward government data openness, employing the method according to any one of claims 1 to 9, characterized in that it comprises: a statistics acquisition module, which acquires metadata statistics from the government database, the metadata statistics including association key cardinality, the total number of records in each data table, field value distribution histograms, and association topology edge weights; a distribution construction module, which constructs, based on the metadata statistics, a virtual connection result distribution by statistical inference in an isolated sandbox environment, wherein the isolated sandbox environment prohibits access to the actual data records in the government database; an entropy increment calculation module, which calculates the information entropy increment based on the virtual connection result distribution and the association topology edge weights, wherein the information entropy increment represents the change in individual re-identification risk caused by cross-table logical association; and an assessment generation module, which determines the sensitivity level of the stored data based on the information entropy increment and generates a sensitivity assessment result containing data openness policy configuration parameters.