SQL dynamic desensitization method based on scope fingerprint and operator security matrix
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- JIUZHANG ARITHMETIC (ZHEJIANG) TECH CO LTD
- Filing Date
- 2026-01-21
- Publication Date
- 2026-06-23
Smart Images

Figure CN122263142A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of data security technology, and in particular to a dynamic SQL desensitization method, a dynamic SQL data desensitization system, an electronic device, and a computer-readable storage medium based on scope fingerprinting and operator security matrix. Background Technology
[0002] With the deepening of digital transformation and the mandatory implementation of global data privacy regulations such as the General Data Protection Regulation (GDPR) and the Personal Information Protection Act, effectively protecting sensitive data in the production environment while ensuring business continuity has become a serious challenge for enterprises. Dynamic Data Masking, as a key technology, can mask, replace, or encrypt sensitive data in query results in real time, based on user roles, access context, and other policies, without modifying the underlying stored data, thereby effectively preventing data leakage.
[0003] However, existing SQL dynamic masking technologies generally suffer from the following deep-seated technical limitations when dealing with increasingly complex enterprise-level query scenarios:
[0004] Limitation 1: Coarse-grained scope identification capability. Modern SQL queries (such as data analysis and BI reports) often contain complex structures such as multi-level nested subqueries, common table expressions (CTEs), and window functions. Existing technologies mostly rely on pattern matching based on regular expressions or simple syntax analysis, lacking a systematic and precisely quantifiable mechanism to isolate and identify the scope of different query blocks. This leads to limitations in handling wildcards (…). In scenarios such as expansion, column source tracing, and alias referencing, "scope pollution" or omissions in data masking can easily occur. For example, incorrectly applying the masking rules of the outer query to the inner query, or failing to correctly identify fields defined in the CTE, can lead to query failures or data leaks.
[0005] Limitation 2: Lack of context-aware and differentiated expression processing strategies. When dealing with expressions containing sensitive fields, the existing technologies generally adopt a unified, black-and-white "one-size-fits-all" strategy. For example, once a sensitive column (such as "phone") appears in an expression, whether the expression is CONCAT('电话 : ', phone) or LENGTH(phone), it will be entirely replaced with a fixed masked value (such as ''). This method cannot distinguish the intrinsic semantics of the expression: the former (CONCAT) is structurally secure, and only the sensitive parameter needs to be desensitized to retain the business meaning; while the latter (LENGTH) will disclose the metadata (length) of the data, and the entire expression needs to be replaced to eliminate the risk. This crude processing method leads to a binary dilemma of over-desensitization (destroying business logic, such as splicing format) and under-desensitization (risk of data leakage).
[0006] Therefore, there is an urgent need for a brand-new SQL dynamic desensitization method that can overcome the above defects of the existing technologies, achieve precise scope control for complex SQL, and intelligently select the optimal desensitization strategy according to the context semantics of the expression. Summary of the Invention
[0007] To solve the technical problems existing in the prior art, the present invention provides an advanced SQL dynamic desensitization method and system, aiming to solve a series of technical problems such as desensitization omission, over-desensitization, business logic destruction, and potential data leakage in the existing SQL dynamic desensitization technology due to coarse scope recognition granularity, single expression processing strategy, and lack of context awareness ability.
[0008] On the one hand, it provides a SQL dynamic desensitization method based on scope fingerprint and operator security matrix, including the following steps:
[0009] a) Parse the received original SQL query statement into an AST (Abstract Syntax Tree);
[0010] b) Analysis stage: Perform the first traversal on the AST, analyze its structural information in the AST, generate the corresponding scope fingerprint according to the structural information, and construct a scope map based on the scope fingerprint and its hierarchical relationship;
[0011] c) Rewriting stage: Based on the scope map, perform the second traversal on the AST. When a node identified as a sensitive field is traversed, differentially rewrite the AST according to the preset operator security matrix;
[0012] d) Generate a new, secure SQL query statement according to the rewritten AST.
[0013] Preferably, the structured information on which the scope fingerprint is generated includes at least one of the following: the node path of the query block in the AST, the nesting depth, the signature of the referenced data table, or the hash value of its own content.
[0014] Preferably, when the original SQL query statement contains wildcards ( During the analysis phase, the wildcard is expanded using the scope map to ensure that its expansion range is strictly constrained by the boundary of the fingerprint of its scope.
[0015] Preferably, the step of performing a first traversal of the AST, analyzing its structured information in the AST, and generating a corresponding scope fingerprint based on the structured information includes:
[0016] i. Identify each query block in the AST;
[0017] ii. For each query block, generate a unique and stable scope fingerprint based on its structured information in the AST;
[0018] iii. Using the scope fingerprint and its hierarchical relationship, construct a scope graph, which is used to provide immutable boundary constraints for subsequent rewriting operations.
[0019] Preferably, based on the degree to which operators retain data information, the operator security matrix includes at least the following security levels and their corresponding processing strategies:
[0020] SAFE (Safety): When the operator itself does not disclose information, the operator structure is retained, and only the sensitive field parameters are replaced with custom desensitization function calls;
[0021] UNSAFE: When an operator would leak metadata or part of information of a sensitive field, the entire expression node containing the operator is replaced with a dummy value that is compatible with the return type of the original expression.
[0022] FORMAT_PRESERVING (Format Preservation): When the context of the operator requires preserving the comparability or groupability of the data, sensitive field nodes are replaced with calls to the format preservation desensitization function;
[0023] PASS_THROUGH (Pass-through): When the operator only serves as a structure wrapper or type conversion, the operator structure is preserved, and the context probing and rewriting of its internal parameters are recursively performed.
[0024] Preferably, the second traversal of the AST based on the scope graph, when encountering a node identified as a sensitive field, involves differentially rewriting the AST according to a preset operator security matrix, including:
[0025] i. Probe the direct parent node of the sensitive field node to determine its operator context;
[0026] ii. Query a preset operator security matrix, and obtain the corresponding security level and processing strategy based on the operator context;
[0027] iii. Based on the processing strategy, rewrite the expressions or their parameters that contain sensitive fields in the AST in a differentiated manner.
[0028] Preferably, when a sensitive field appears in the GROUP BY, ORDER BY, or DISTINCT clauses, or as an operand in an equal comparison (=, IN) in the WHERE clause, its context is identified as requiring format preservation. The processing strategy is to rewrite the sensitive field and its corresponding literal in the comparison expression using the same FORMAT_PRESERVING desensitization function.
[0029] Preferably, the method further includes an extension step: when an unknown function not defined in the operator security matrix is encountered during the rewriting phase, the metadata and context information of the function are asynchronously submitted to a Large Language Model (LLM) for security risk analysis, and the operator security matrix is updated according to the classification suggestions returned by the LLM.
[0030] On the other hand, a dynamic SQL data masking system is provided, including:
[0031] A SQL parser is used to parse raw SQL query statements into an abstract syntax tree (AST).
[0032] A scope analyzer is configured to perform a first pass of analysis traversal on the AST, generate a scope fingerprint for each query block, and construct a scope graph as an immutable boundary constraint.
[0033] A configuration center is used to store an operator security matrix that maps SQL operators to preset security levels and processing strategies;
[0034] A context-aware desensitizer is configured to perform a second rewrite traversal on the AST based on the scope graph, and to perform differential rewriting on the AST by probing the parent node operator context of the sensitive field and querying the operator security matrix;
[0035] A SQL generator for generating a new, safe SQL query based on a rewritten AST.
[0036] Preferably, the system further includes a Large Language Model (LLM) enhancement module that interacts with the configuration center, the module being configured as follows:
[0037] a) Receive unknown functions or suspicious content submitted by the context-aware desensitizer;
[0038] b) Call a large language model to perform security classification on the unknown function, or to identify sensitive information on the suspicious content;
[0039] c) Based on the analysis results, generate policy recommendations for updating the configuration center.
[0040] On the other hand, an electronic device is provided, comprising: a processor; and a memory storing computer-readable instructions, wherein when executed by the processor, the computer-readable instructions implement any of the methods described above for SQL dynamic desensitization based on scope fingerprint and operator security matrix.
[0041] On the other hand, a computer-readable storage medium is provided, wherein at least one instruction is stored in the storage medium, the at least one instruction being loaded and executed by a processor to implement any of the above-described SQL dynamic desensitization methods based on scope fingerprints and operator security matrices.
[0042] The beneficial effects of the technical solutions provided in the embodiments of the present invention include at least the following:
[0043] 1. Precise Scope Control: By introducing a "scope fingerprint" mechanism, this invention can precisely divide the scope and bind the source of arbitrarily complex SQL structures such as nested subqueries and CTEs, effectively solving the problem of wildcards caused by scope confusion. Problems such as incorrect expansion and contamination of the desensitization range are eliminated, ensuring the accuracy of the desensitization operation.
[0044] 2. Intelligent Context Adaptability: Through an innovative "operator security matrix" and context probes of AST parent nodes, this invention achieves fine-grained, differentiated desensitization at the expression level. It achieves the optimal balance between ensuring data security (replacing the UNSAFE operator) and maintaining business logic compatibility (retaining the SAFE operator), effectively avoiding the problems of over-desensitization and under-desensitization caused by the "one-size-fits-all" strategy of existing technologies.
[0045] 3. High scalability: The framework of this invention has good scalability. For example, the "operator security matrix" and sensitive data identification pattern can be integrated with the Large Language Model (LLM) to achieve dynamic security classification of unknown user-defined functions (UDFs) and intelligent identification of unstructured sensitive data, enabling the system to self-evolve and adapt to more complex scenarios. Attached Figure Description
[0046] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0047] Figure 1 This is a system architecture diagram of an embodiment of the present invention for implementing the SQL dynamic desensitization method based on scope fingerprint and operator security matrix.
[0048] Figure 2 This is a schematic diagram of the spectral data structure of a domain fingerprint spectrum provided in Embodiment 1 of the present invention;
[0049] Figure 3 This is a flowchart of an AST rewriting method provided in Embodiment 3 of the present invention;
[0050] Figure 4 This is an extended system architecture diagram that incorporates an LLM module, provided by an embodiment of the present invention. Detailed Implementation
[0051] The technical solution of the present invention will now be described with reference to the accompanying drawings.
[0052] In embodiments of the present invention, words such as "exemplarily," "for example," etc., are used to indicate that something is an example, illustration, or description. Any embodiment or design described as "exemplary" in the present invention should not be construed as being more preferred or advantageous than other embodiments or designs. Specifically, the use of the word "exemplary" is intended to present the concept in a concrete manner. Furthermore, in embodiments of the present invention, the meaning expressed by "and / or" can be both, or either one.
[0053] In the embodiments of this invention, the terms "image" and "picture" may sometimes be used interchangeably. It should be noted that, without emphasizing the distinction between them, they convey the same meaning. Similarly, the terms "of," "corresponding (relevant)," and "corresponding" may sometimes be used interchangeably. It should be noted that, without emphasizing the distinction between them, they convey the same meaning.
[0054] In this embodiment of the invention, sometimes a subscript such as W1 may be mistakenly written as a non-subscript form such as W1. When the difference is not emphasized, the meaning they express is the same.
[0055] The hash signature in this application uses MD5 as an example, but it can also be SHA-1, SHA-256 or a custom unique identifier generation algorithm.
[0056] The parser used in this application is sqlglot / Druid, but it is not limited to a specific library. Any parser that can convert SQL into a tree structure is acceptable.
[0057] To make the technical problems, technical solutions and advantages of the present invention clearer, a detailed description will be given below in conjunction with the accompanying drawings and specific embodiments.
[0058] To address the aforementioned technical problems, this invention discloses a compiler-like, abstract syntax tree (AST)-based dynamic SQL de-identification method. The core feature of this method lies in its inclusion of a strongly coupled two-pass processing pipeline (combined with an appendix). Figure 1 As shown in the figure, ensure that you have a precise understanding of the complete structure and scope of the query before rewriting the SQL.
[0059] The method specifically includes the following steps:
[0060] First pass, analysis phase: constructing scope fingerprinting.
[0061] First, the input SQL statement is parsed into a Structured Abstract Syntax Tree (AST). Then, a depth-first traversal is performed on this AST. During the traversal, whenever a query block (such as a SELECT statement, a Common Table Expression (CTE), or a subquery) is entered, the system generates a stable and unique "scope fingerprint" for it based on its path in the AST, nesting depth, and the signatures of the referenced data tables. Simultaneously, the fingerprints of its parent query blocks are recorded, thus constructing a precise query scope graph with clear hierarchical relationships. This graph provides immutable scope boundary constraints for subsequent rewriting operations, effectively ensuring the accuracy of the operations. (See Example 1 for this process.)
[0062] The second rewrite stage involves context-aware desensitization based on the "operator security matrix (see Table 1 below for examples)".
[0063] After completing the first pass of analysis, the AST is traversed a second time. During this traversal, when a field (column) node that is predefined as sensitive is accessed, the system checks the parent node of that node to identify its direct parent operator (such as function calls, arithmetic operators, logical comparison operators, etc.).
[0064] The system then queries a pre-defined, configurable "Operator SecurityMatrix" to determine the security level of the operator. This matrix categorizes common SQL operators into several classes, including:
[0065] SAFE (Safe): This means that the operator itself does not reveal information, keeps the expression structure unchanged, and only performs desensitization and replacement on sensitive parameters (recursively applied to subexpressions).
[0066] Typical operators: CONCAT / ||, UPPER / LOWER, TRIM / LTRIM / RTRIM, REPLACE.
[0067] Example: CONCAT('tel: ', phone) → CONCAT('tel: ', MASK_PHONE(phone)) or CONCAT('tel: ', MASK_PHONE(phone)) (depending on dialect and original parameter order);
[0068] UPPER(name) →UPPER(MASK_NAME(name));
[0069] REPLACE(email, '@ ', '# ') →REPLACE(MASK_EMAIL(email), '@ ', '# ').
[0070] If its output is consumed by the upper-level UNSAFE operator, then the rules of the upper-level operator shall prevail.
[0071] UNSAFE (Unsafe): This means that the operator will reveal some information about the original value (length, substring, numerical relationship, equality, etc.), and the entire expression containing sensitive fields needs to be replaced with a safe dummy that is compatible with the return type.
[0072] Typical operators: LENGTH / LEN, SUBSTRING / SUBSTR / LEFT / RIGHT, arithmetic operations + and - / %, Comparison and Matching = != <> > < >= <= LIKE ILIKE BETWEEN IN NOT IN.
[0073] Example: LENGTH(phone) → 0; SUBSTRING(email, 1, 3) → ' '; price +phone → 0 or NULL; In a Boolean context (such as WHERE), the whole is replaced with a leak-free and logically safe constant condition (such as FALSE).
[0074] FORMAT_PRESERVING (Format Preservation): This refers to the context where "uniqueness / order / grouping stability" needs to be preserved. It uses a desensitization function that preserves "format preservation / determinism" to hide the plaintext while maintaining comparability and groupability.
[0075] Typical contexts: GROUP BY, ORDER BY, and expressions that affect key / index semantics.
[0076] Example: GROUP BY email → GROUP BY MASK_EMAIL_FP(email); ORDER BY email→
[0077] ORDER BY MASK_EMAIL_FP(email); DISTINCT email → DISTINCT MASK_EMAIL_FP(email).
[0078] The FP function is typically a hash / encryption function that guarantees the same input will map to the same output, different inputs will map to different outputs, and this mapping is irreversible.
[0079] PASS_THROUGH (transparent): refers to an operator that does not leak information itself, but only plays a "transparent" role such as wrapping / control flow / type conversion or window semantics.
[0080] Examples include CAST / CONVERT, CASE / WHEN, COALESCE / NULLIF / ISNULL, and most window functions (such as ROW_NUMBER, RANK, LAG / LEAD, FIRST_VALUE / LAST_VALUE, etc.).
[0081] The processing strategy is as follows: keep the operator structure unchanged, recursively perform the same context-aware analysis on its internal parameters as on the window specification; once the SAFE / UNSAFE / FORMAT_PRESERVING operator is encountered inside it, it is processed according to the corresponding strategy.
[0082] Simple example: CAST(email AS TEXT) →CAST(MASK_EMAIL(email) AS TEXT);
[0083] COALESCE(email, 'N / A ') →COALESCE(MASK_EMAIL(email), 'N / A ') ;
[0084] ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at) →
[0085] ROW_NUMBER() OVER (PARTITION BY MASK_EMAIL_FP(email) ORDER BYcreated_at).
[0086]
[0087] Table 1: Examples of Operator Security Matrix
[0088] In this invention, through this two-pass pipeline and differentiated rewriting strategy, a deep understanding and precise control of SQL queries at the structural and semantic levels is achieved.
[0089] Although the embodiments in this application describe two passes (Pass 1 & Pass 2), it can be extended to explain that equivalent alternatives such as "multiple passes" or "maintaining two state stacks simultaneously in one pass" are also within the scope of protection.
[0090] To make the objectives, technical solutions, and advantages of this invention clearer, the following description will be provided in conjunction with the appendix. Figure 1-4 The present invention will be further described in detail below with reference to specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. 1.
[0092] 1.1 Overall Process and System Components
[0093] This invention employs a two-pass pipeline similar to that of a compiler: inputting raw SQL, outputting SQL that has been safely rewritten and can be executed directly. To avoid repetition with the previous overview of the "two-step method / module," this section only provides a skeletal description; details can be found in subsequent subsections (1.2-2).
[0094] like Figure 1 Core components shown (overview):
[0095] 1. SQL Parser: Parses SQL text into AST, compatible with multiple dialects (such as using open-source sqlglot or Druid).
[0096] 2. Scope Analyzer (Pass 1): Generates a "scope fingerprint" and a "scope graph," and precisely expands wildcards based on these. Eliminate scope confusion.
[0097] 3. Configuration Center: Contains a "Sensitive Data Dictionary (which columns are sensitive fields)" and an "Operator Security Matrix (explained above)," providing a consistent basis for decision-making when rewriting.
[0098] 4. Context-Aware Masker (Pass 2): Performs differentiated rewriting based on the parent node context and the "operator security matrix", with built-in special scenario handling such as WHERE / GROUP BY.
[0099] 1. SQL Generator (SQL Generator, sqlglot Generator / Transpiler of Reuse Component 1): Losslessly converts the rewritten AST back to the target dialect SQL.
[0100] The specific algorithms and implementation examples are as follows: 1.2 (Scope Fingerprint), 1.3 (Context-Aware Desensitization), and 1.4 (LLM Enhancement).
[0101] Auditing and Traceability: As a consistent security principle, the system audits and logs all critical SQL rewriting decisions (such as expression replacement, predicate blocking or modification, etc.). The logs include the original SQL fragment, the application's rewriting strategy, the reason for the decision, and a timestamp, providing a solid chain of evidence for post-event traceability, compliance review, and security incident analysis.
[0102] In the following text, each module will first be explained in words, followed by 1-2 examples.
[0103] 1.2 Scope Fingerprint Module
[0104] This module is the cornerstone of the precise scope control in this invention, corresponding to the first pass (analysis phase) of the processing pipeline. Its core task is to perform a complete "survey" of the input AST without rewriting any content, and to draw a precise "sphere of influence map," or "scope map," for each independent query block.
[0105] 1.2.1 Core Algorithm
[0106] The algorithm for generating scope fingerprints and constructing hierarchical relationships is completed through a single depth-first search (DFS) traversal. The specific steps are as follows:
[0107] 1. Traversal and Recognition: The system traverses the AST in Visitor mode, specifically identifying nodes that can generate independent scopes, mainly SELECT expressions.
[0108] 2. Fingerprint Calculation: Whenever a new SELECT node is entered, the system calls the fingerprint generation function, which combines information from multiple dimensions to create a stable and unique string identifier (i.e., fingerprint):
[0109] Node type and path: The type (Select) of this node in the entire AST and its path from the root node to the current node ensure the uniqueness of the position.
[0110] Nesting Depth: This node represents the nesting level relative to the top-level query, used to distinguish queries at different levels.
[0111] Content hash: A hash of the node's own content to distinguish it from sibling queries at the same level.
[0112] Table signature: Extract all tables (or aliases) directly referenced in the FROM and JOIN clauses of this query block, sort them, and generate a hash signature. This is a key feature that distinguishes queries from different data sources.
[0113] 3. Graph Construction: During traversal, the system records the fingerprint of the parent query block of the current node. This stores key-value pairs of {"fingerprint": ScopeInfo} in the global "scope graph" (a hash table). The ScopeInfo object contains metadata about the scope, such as: tables (list of managed tables), depth, and parent_fingerprint (parent scope fingerprint).
[0114] Using the above algorithm, the system can construct a complete query scope graph with clear hierarchical relationships for arbitrarily complex SQL queries in a single traversal (e.g., ...). Figure 2 As shown, this provides immutable scope boundary constraints for the second, precise rewrite.
[0115] Although the processing flow of this module and traditional SQL parsing and optimization techniques (such as Apache Calcite) superficially involve traversing the Abstract Syntax Tree (AST), they differ fundamentally in their core technical objectives, the nature of key outputs, and scope handling mechanisms. These differences are key to the inventiveness of this invention:
[0116] 1) Fundamentally different properties of the products:
[0117] Traditional SQL optimizers: Their core output is a variable query plan or optimized AST for execution. The goal throughout their lifecycle is to continuously modify and equivalently transform the AST to serve the functional objective of performance.
[0118] The module of this invention produces an immutable "ScopeMap" for security constraints. This scopemap is a metadata structure independent of the AST, and together with the original AST, it forms a tuple (AST, ScopeMap). Once generated, this scopemap cannot be modified in subsequent rewriting stages; it serves as a "security contract," defining the boundaries for second-pass rewriting operations.
[0119] 2). Opposing Designs of Scope Handling Mechanisms:
[0120] Traditional SQL optimizers, in pursuit of performance, tend to merge, eliminate, or flatten scopes. For example, common optimizations such as "subquery unnesting" or "view merging" essentially break down the original scope boundaries.
[0121] This invention module, with its core design goal being the opposite, is to identify, solidify, and enforce the isolation of scopes. By generating a stable and unique "scope fingerprint," this module materializes each query block into an independent, inviolable namespace. This effectively prevents security policy application chaos or cross-domain leakage of sensitive information caused by optimization.
[0122] 3) Differences between information flow and architectural patterns:
[0123] Traditional SQL optimizers typically employ a series of cascading, rule-based rewriting processes, where the AST is progressively modified across multiple rules.
[0124] This invention's module employs a strict two-pass architecture. The sole task of the first pass (analysis) is to generate a "scope graph," a security contract, without ever rewriting the abstract stratum (AST). The second pass (rewriting) operates strictly according to this contract, with all decisions constrained by the graph. This tightly coupled architecture of "separation of analysis and execution" ensures the determinism and predictability of the security policy, which is fundamentally different from the opportunistic, performance-oriented transformation logic of optimizers.
[0125] In summary, the core of traditional SQL parsing / optimization techniques is "performance-optimal transformation under semantic equivalence," with its internal state being fluid and serving execution. In contrast, the core of this invention is "safe and minimal rewriting based on static boundaries," and its key technological product, "scope fingerprint and graph," is an externalized, immutable security constraint structure specifically designed to guide subsequent, performance-independent, safe rewriting. This is the most fundamental difference between the two in their technical approaches.
[0126] 1.2.2 Example 1: Nested Subquery Scenario
[0127] Objective: To demonstrate how the system generates different fingerprints for inner and outer layer queries, thereby expanding wildcards ( When identifying the scope of a table, it is important to correctly identify the scope of the table to avoid scope pollution.
[0128] Enter SQL:
[0129] SELECT
[0130] ,
[0131] length(s3.phone_number) as phone_len FROM users s1
[0132] LEFT JOIN (
[0133] SELECT user_id, total_amount
[0134] FROM orders
[0135] WHERE status = 'completed ' ) s3 ON s1.id = s3.user_id.
[0136] 1. Entering the outer query: The scope analyzer first encounters the top-level SELECT statement. The system generates a specific and stable fingerprint for it, which is a composite string composed of multiple parts, specifically structured as follows:
[0137] Parent scope path: This part is empty for top-level queries;
[0138] Node type and depth: For example, Select_0 indicates a SELECT node at level 0;
[0139] Node content hash: A short hash of the node's own content (excluding subqueries) to distinguish it from sibling nodes at the same level;
[0140] Table signature: This is the most crucial part. The system extracts all tables or aliases directly referenced in the FROM and JOIN clauses (s1 and s3 in this example), sorts them, and generates a hash signature.
[0141] The final generated fingerprint is Select_0_a8f5c2_md5(s1,s3). This fingerprint is then used as a key in the "scope map" (a hash table). The analyzer stores the metadata of that scope under this key, as follows:
[0142] { "Select_0_a8f5c2_md5(s1,s3)": { tables: ["s1", "s3"], parent: null}}.
[0143] 2. Entering the Inner Subquery: Next, the analyzer recursively enters the subquery within the LEFT JOIN. This is a new, independent query block. The system generates a completely different fingerprint for it, such as Select_1_b9g6d3_md5(orders). The analyzer identifies the table associated with its FROM clause as orders and records the fingerprint of its parent scope as the outer query.
[0144] Add a record to the scope map:
[0145] { "Select_1_b9g6d3_md5(orders)": { tables: ["orders"], parent: "Select_0_a8f5c2_md5(s1,s3)"}}
[0146] wildcards ( Expanding: In subsequent steps of the analysis phase, when the system needs to expand the outer query... When this happens, it looks up the scope information corresponding to the fingerprint Select_0_a8f5c2_md5(s1,s3). The system learns that its jurisdiction is limited to s1 and s3. Therefore, It is precisely expanded to all columns of table s1 and all columns exposed by subquery s3 (user_id, total_amount).
[0147] The fingerprint generation algorithm described above is based on "path + depth + table signature + content hash", which may cause collisions in certain scenarios. However, there are no restrictions on how the fingerprint generation algorithm is used (any one is acceptable; for example, more distinguishing dimensions can be added, such as node memory address, serial number, or fingerprint collision detection and alarm mechanism can be added directly).
[0148] Results of this phase:
[0149] Upon completion of this stage of processing, this application yields a crucial intermediate result: a clearly structured and unambiguous abstract syntax tree, and a "scope graph" providing complete context for it. In this result, the wildcards of the outer query ( The scope is expanded only: visible columns are strictly limited to all columns of s1 and the columns explicitly projected and exposed by the s3 subquery (user_id, total_amount); any original columns in the inner orders table that are not projected by s3 will not appear in the outer one, thus avoiding overflow of privileged fields. This provides an absolutely reliable and clearly sourced operational basis for the second-pass context-aware desensitizer, ensuring that desensitization rules are only applied to the correct target columns.
[0150] 1.2.3 Example 2: Common Table Expression (CTE) Scenario
[0151] Objective: This is very similar to the subquery process in Example 1, and aims to demonstrate that the process described in this module can also be correctly parsed and processed in the CTE scenario.
[0152] Enter SQL:
[0153] WITH user_info AS (
[0154] SELECT id, name, phone FROM users
[0155] ),
[0156] contact_info AS (
[0157] SELECT user_id, email, address FROM contacts )
[0158] SELECT u.name, c.email
[0159] FROM user_info u
[0160] JOIN contact_info c ON u.id = c.user_id.
[0161] Here is an example of a scope lookup:
[0162] SELECT , LENGTH(s3.phone_number) as phone_len FROM users s1
[0163] LEFT JOIN(SELECT user_id, total_amount
[0164] FROM orders WHERE status = 'completed ') s3
[0165] ON s1.id = s3.user_id.
[0166] This scenario is essentially the same as the subquery processing flow in Example 1, and will not be repeated here. During traversal, the system first processes each CTE in WITH, generating fingerprints for each as an independent query block and recording the columns exposed externally. When the main query references a CTE alias, it traces back to the original table through the scope graph, achieving cross-CTE source tracing and policy application. Therefore, wildcard expansion and desensitization decisions are both constrained by fingerprint boundaries, and the policy only takes effect within the allowed set of columns, preventing it from spreading to irrelevant columns.
[0167] 1.3 Context-Aware Desensitization Module
[0168] This module is the core of the intelligent desensitization decision-making in this invention, corresponding to the second pass (rewrite stage) of the processing pipeline. Its core task is to perform a second traversal of the AST based on the precise scope map constructed in the first pass, and dynamically select and execute the safest desensitization strategy by analyzing the specific expression context of the sensitive fields.
[0169] 1.3.1 Algorithm Flow
[0170] The core of this module is a context analysis algorithm based on "parent node probes", and its decision-making basis is the "operator security matrix" preset in the "configuration center".
[0171] 1. Traversal and Sensitivity Detection: The system traverses the AST after the first pass using the visitor pattern. When a column or identifier node is encountered, the "sensitive data dictionary" is queried to determine if it is a sensitive field. If not, it is skipped.
[0172] 2. Context Probe: If a field is sensitive, the algorithm immediately checks its parent node in the AST. This parent node represents the most direct upper-level operator of the sensitive field, i.e., its context. For example, for the SQL fragment LENGTH(phone), the parent node of the phone node is the LENGTH function call node.
[0173] 3. Query operator security matrix: The system extracts the name of the operator (such as length) from the parent node and uses it as the key to query its preset security level (SAFE, UNSAFE, FORMAT_PRESERVING, PASS_THROUGH, etc.) in the "operator security matrix".
[0174] 4. Differentiated AST rewriting: Execute corresponding strategies according to security level (described in Section 3.2): SAFE → Only replace sensitive parameters and retain the outer structure; UNSAFE → Replace the entire AST with type-compatible security dummy elements; FORMAT_PRESERVING → Use format-preserving desensitization to maintain comparability / grouping stability; PASS_THROUGH → Preserve the operator structure and recursively process its internal parameters and window specifications.
[0175] Through this mechanism, the present invention achieves a deep understanding and precise control of SQL expressions at the semantic level, and can achieve the best balance between ensuring data security and maintaining business logic compatibility.
[0176] 1.3.2 Example 3: Safe and Unsafe Function Contexts
[0177] Objective: By comparing two typical functions, CONCAT and LENGTH, this study demonstrates how the system makes differentiated decisions on "preserving the structure" and "overall replacement" based on the "operator safety matrix".
[0178] Enter SQL snippet:
[0179] SELECT
[0180] CONCAT('Name:', name), -- SAFE operation for sensitive fields LENGTH(phone) -- UNSAFE operation for sensitive fields
[0181] FROM users.
[0182] The administrator's configuration: both name and phone are sensitive fields. CONCAT is SAFE in the security matrix, and LENGTH is UNSAFE. The processing flow is as follows:
[0183] 1. Processing CONCAT('Name:', name):
[0184] Traverse to the name node and confirm it is sensitive;
[0185] Executing the "parent node probe" reveals that its parent node is a CONCAT function call;
[0186] The query for "operator security matrix" shows that the CONCAT level is SAFE.
[0187] Decision: Implement the "parameter replacement only" strategy.
[0188] AST rewriting (as attached) Figure 3 As shown): Only the name node is replaced with the MASK_NAME(name) node, while the CONCAT structure is preserved. The final expression is CONCAT('Name: ', MASK_NAME(name)).
[0189] 2. Handling LENGTH (phone):
[0190] Traversing to the phone node, it was confirmed to be sensitive.
[0191] Executing the "parent node probe" reveals that its parent node is a LENGTH function call.
[0192] The query for "operator security matrix" shows that the level of LENGTH is UNSAFE.
[0193] Decision: Implement the "total replacement" strategy.
[0194] AST rewriting: Replaces the entire LENGTH(phone) function call node with a safe dummy value. Since LENGTH returns an integer, the system will choose CAST(NULL AS INT) or the integer 0 as the replacement node.
[0195] Final output SQL:
[0196] SELECT
[0197] CONCAT('Name:', MASK_NAME(name)),
[0198] 0 -- or CAST(NULL AS INT)
[0199] FROM users.
[0200] This example clearly demonstrates how the present invention intelligently distinguishes the inherent semantic risks of expressions and adaptively selects the optimal desensitization strategy.
[0201] 1.3.3 Example 4: Aggregation and Grouping Context
[0202] Objective: Using GROUP BY email as an example, this paper explains why special format-preserving masking is needed to protect the correctness of business logic in operations such as grouping and sorting.
[0203] Enter SQL:
[0204] SELECT
[0205] email, -- Sensitive field
[0206] SUBSTRING(email, 1, 5), -- UNSAFE operation for sensitive fields.
[0207] COUNT( )
[0208] FROM users
[0209] GROUP BY
[0210] Email.
[0211] Configure email as a sensitive field.
[0212] Problem Analysis:
[0213] Applying standard desensitization (such as replacing all emails with single quotes) to emails in GROUP BY email will cause all different email addresses to be grouped into the same group (because they are all literal single quotes, which will seriously disrupt the business logic of grouping and statistics).
[0214] Processing flow:
[0215] 1. Special Context Recognition: Identifies that the GROUP BY clause is a special context in which data uniqueness and formatting need to be preserved.
[0216] 2. Decision: The system decides to use a special desensitization function that can preserve the uniqueness of the data for replacement.
[0217] 3. AST rewriting:
[0218] 4. For SUBSTRING(email, 1, 5) in the SELECT clause, since SUBSTRING is an UNSAFE operator, the system will replace it entirely with a type-safe dummy value, such as II or CAST(NULL AS TEXT).
[0219] 1. For the email node in the GROUP BY clause, the system will use the format-preserving de-identification function call MASK_EMAIL_FP(email) to replace it to ensure the correctness of the grouping logic.
[0220] For ease of understanding, the results are shown below:
[0221] Enter your email address and the anonymized result (example):
[0222] alice@example.com → em_9Q2K7M4C;
[0223] bob@acme.cn → em_D4K1X2P9;
[0224] alice@foo.com → em_7MZQ3R1H.
[0225] Comparison before and after grouping (example):
[0226] Original group by email: 3 groups {alice@example.com, bob@acme.cn, alice@foo.com};
[0227] After desensitization, GROUP BY MASK_EMAIL_FP (email) remains 3 groups: {em_9Q2K7M4C, em_D4K1X2P9, em_7MZQ3R1H}.
[0228] Each group COUNT ( Aggregate statistics remain unchanged; only the grouping key is safely replaced.
[0229] Final output SQL:
[0230] SELECT
[0231] MASK_EMAIL_FP(email),
[0232] ' ' AS masked_email_substring, -- or CAST(NULL AS TEXT)
[0233] COUNT( FROM users
[0234] GROUP BY
[0235] MASK_EMAIL_FP(email).
[0236] In this way, the present invention not only protects sensitive data in the GROUP BY clause but also ensures the logical correctness of grouping operations, demonstrating its completeness in complex business scenarios. The present invention not only handles standard SELECT queries but also designs specialized processing mechanisms for some special and high-risk boundary scenarios in SQL to ensure the security and completeness of the solution.
[0237] 2. Extended Implementation: Combining Large Language Model (LLM) to handle critical boundary scenarios.
[0238] To address increasingly complex business scenarios, evolving unknown threats, and reduce manual configuration costs, the core framework proposed in this invention can be further deeply integrated with Large Language Models (LLMs), achieving a leapfrog upgrade from "rule-driven" to "intelligence-driven." Figure 4 As shown, in Figure 1 Building upon this foundation, an LLM module is added that interacts with the "Operator Security Matrix" and "Content Pattern Matching" modules. This module demonstrates how the system submits unknown functions or dynamic data to the LLM for intelligent analysis and feeds the analysis results back to the decision-making system. In this extended architecture, the LLM does not directly participate in the real-time rewriting of SQL (to ensure the determinism and security of the system), but rather serves as an intelligent "configuration" and "analysis" assistant for the core engine.
[0239] The core idea is to construct a hybrid model of "LLM for Configuration, AST for Execution". This leverages the powerful natural language understanding and code analysis capabilities of LLM to automatically generate and expand the rules of the "configuration center", while retaining the core AST-based engine of this invention as a deterministic and reliable execution unit. This model combines the advantages of both technologies, solving the problem of potential "illusion" or instability in security-critical scenarios with LLM, and compensating for the shortcomings of pure rule engines in terms of flexibility and intelligence.
[0240] This LLM enhancement mode is particularly suitable for handling boundary scenarios that traditional methods struggle with, such as:
[0241] 2.1 Example 6: Dynamic Security Classification of Unknown Functions (UDFs)
[0242] Challenge: When a new user-defined function (UDF) appears in SQL that is not predefined in the "operator safety matrix," the default approach is to adopt a conservative "all-inclusive rejection" strategy. However, this lacks intelligent judgment regarding the actual risk of the function.
[0243] LLM enhancement solutions:
[0244] i. Triggering: When the core engine encounters an unknown function (such as company_specific_hash(email)) in the AST, it can asynchronously submit the meta-information of that function to the LLM analysis module.
[0245] ii. Input: The content submitted to the LLM includes: the name of the function, the source code of the function (if available from the database or code repository), and the SQL context fragment that uses the function.
[0246] iii. Prompt instruction illustration: "You are a data security expert. Please analyze the following SQL user-defined function. If the input parameter of this function is a sensitive field, determine whether its return value poses a risk of leaking original information, partial information, or statistical information. Based on the risk, classify its security level as SAFE, UNSAFE, FORMAT_PRESERVING, or PASS_THROUGH, and provide a brief reason."
[0247] iv. LLM Analysis and Decision: The LLM analyzes the implementation of `company_specific_hash` to determine if it is a strong hash algorithm. If the analysis shows that the function effectively encrypts data and is irreversible, the LLM returns a SAFE rating; if it finds that it is simply concatenation or encoding, it returns UNSAFE.
[0248] v. Dynamic Updates: The analysis results will be added or suggested to the "Operator Security Matrix" by the administrator, enabling the system to dynamically learn and evolve, and securely handle the ever-increasing custom business logic.
[0249] 2.2 Example 7: Semantic-based dynamic and semi-structured sensitive data identification
[0250] Challenge: Sensitive data (such as home addresses, personal profiles, medical record summaries, etc.) may not have a fixed format or may be buried deep within the complex paths of semi-structured fields such as JSON / XML, making it difficult to completely cover with limited regular expressions.
[0251] LLM enhancement solutions:
[0252] i. Trigger: When processing the results of string literals or JSON / XML path extraction functions, if traditional pattern matching fails to identify them, the system can submit this content to LLM for deep semantic analysis.
[0253] ii. Prompt instruction: "Perform Named Entity Recognition (NER) and content classification on the following string to determine whether it contains sensitive personal information such as name, location, phone number, ID number, email address, medical diagnosis, etc. If it does, please indicate the type of sensitive information."
[0254] iii. LLM Analysis and Decision Making: For the string 'Patient Zhang San visited Peking Union Medical College Hospital on May 10, 2023, due to headache', LLM can accurately identify that it contains the name (Zhang San), date, and medical information, thus determining it to be sensitive data. LLM can also analyze the content extracted from the JSON path data->'details'->'summary'.
[0255] iv. Precise desensitization: Based on the sensitivity type returned by the LLM, the core engine can select a more targeted desensitization strategy, achieving more accurate dynamic data detection than a fixed pattern.
[0256] 2.3 Example 8: Natural Language Strategy Configuration
[0257] Challenge: Defining and maintaining complex de-identification rule configuration files for the entire enterprise requires high technical skills from data security administrators, and the process is tedious and error-prone.
[0258] LLM enhancement solutions:
[0259] i. Human-computer interaction interface: Provide administrators with a natural language input interface.
[0260] ii. Input: Administrators can directly input advanced policy commands, such as: "Anonymize all users' addresses and phone numbers, but allow the BI department head to view the city information in the address, and allow the finance department to view the complete phone number."
[0261] iii. Prompt instruction: "You are a data security policy converter. Please convert the following natural language description of the de-identification requirements into precise YAML format configuration rules that can be executed by this application system. You need to identify the tables, fields, user roles involved, and their respective de-identification algorithm strategies."
[0262] iv. LLM processing and generation: LLM parses the natural language policy and automatically generates or updates the underlying sensitive data dictionary and access control list (ACL) configuration file.
[0263] v. Advantages: It greatly lowers the barrier to entry for using the system, enabling policymakers without a technical background to easily and accurately manage data security policies, thereby improving the system's availability and management efficiency.
[0264] The technical solution of this invention fully considers the engineering feasibility, performance requirements and safety boundaries in the real world in its design, aiming to provide a solution that is both efficient and robust.
[0265] 1. Performance and Efficiency
[0266] The core of this solution is a two-pass AST traversal pipeline, which exhibits the following key performance characteristics:
[0267] Linear time complexity: The core "analysis-rewrite" pipeline traverses the AST nodes, and its time complexity is proportional to the number of nodes (N) in the AST, i.e., O(N). This means that the processing time increases linearly with the complexity of SQL, without the risk of exponential performance explosion, ensuring the predictability of processing efficiency.
[0268] Real-time processing capability: All heavyweight or high-latency operations (such as LLM analysis as described in Section 2) are designed as asynchronous or offline configuration phase tasks, without blocking the real-time SQL query path. Internal benchmarking shows that for the vast majority of complex enterprise-level SQL queries, the complete processing time of this solution (from SQL parsing to final generation) can be completed in milliseconds, fully meeting the stringent performance requirements of a real-time database proxy or gateway component.
[0269] 2. Threat Model and Scope Definition
[0270] Any security solution must clearly define its protection boundaries and threat model. The design philosophy of this invention is to achieve a pragmatic balance between usability and security, with the following security considerations:
[0271] Target Scenario: This solution is primarily designed for controlled environments within enterprises, aiming to defend against data breaches caused by "honest but curious" users and unintentional misoperations. Its core objective is to systematically and on a large scale reduce the exposure of sensitive data without significantly disrupting business availability.
[0272] The tension between security and availability: There is an inherent tension between data security and query freedom. Overly strict rules may hinder normal business analysis, while overly lenient rules may leave room for data leakage. This invention does not seek an unattainable state of "absolute security," but rather provides an interpretable, configurable, and context-aware risk control framework. It allows enterprises to flexibly select and customize differentiated anonymization strategies based on the risk levels of different user roles and business scenarios.
[0273] Regarding evasion and vulnerabilities: For proactive, professional attackers (such as red teams or Advanced Persistent Threats (APTs), any rule-based system may have boundary scenarios that can be bypassed (e.g., through deeply nested CAST functions or the construction of unexpected expressions). This invention treats such issues as part of the ongoing governance of security strategies, rather than as fundamental flaws in the solution itself. The innovation of this invention lies in providing a powerful underlying engine capable of identifying and controlling these patterns, while specific defense strategies require continuous refinement through ongoing operations.
[0274] 3. Engineering Practice and Continuous Governance
[0275] To ensure this solution achieves maximum effectiveness in actual deployment, the following engineering practice principles are recommended to form a safety closed loop:
[0276] Enhanced auditing and traceability: Fine-grained log auditing and alerting are conducted for all SQL rewrite decisions, rule version changes, policy hit reasons, and potential bypass attempts. This provides a solid chain of evidence for security incident response and compliance review.
[0277] Layered protection and default denial: Employ security strategies with varying levels of stringency in different environments such as development, testing, and production. For high-risk operators or unknown functions, a "default denial, whitelist allow" principle can be adopted to control risks to the greatest extent possible.
[0278] Establish a continuous governance cycle: Continuously incorporate known bypass examples and newly discovered risk patterns into the updates of automated regression testing suites and the "operator security matrix." Through a cycle of "vulnerability discovery – patching strategy – configuration update – re-audit," achieve continuous iteration and evolution of security capabilities.
[0279] In summary, the core innovation of this invention lies in providing a systematic SQL rewriting pipeline based on deep semantic understanding. It minimizes the exposure of sensitive data without altering core business logic. It is not a static, one-time solution, but a powerful platform that supports continuous security governance and can adapt to dynamic enterprise risks.
[0280] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.
Claims
1. A dynamic SQL de-identification method based on scope fingerprinting and operator security matrix, characterized in that, Includes the following steps: a) Parse the received raw SQL query statement into an AST (Abstract Syntax Tree); b) Analysis phase: Perform the first traversal of the AST, analyze its structured information in the AST and generate the corresponding scope fingerprint based on the structured information, and construct the scope map based on the scope fingerprint and its hierarchical relationship; c) Rewriting stage: Based on the scope map, the AST is traversed a second time. When a node identified as a sensitive field is encountered, the AST is rewritten in a differentiated manner according to the preset operator security matrix. d) Generate a new, secure SQL query statement based on the rewritten AST.
2. The method according to claim 1, characterized in that, When the original SQL query statement contains a wildcard (*), in the analysis phase, the wildcard is expanded using the scope graph to ensure that its expansion range is strictly constrained by the boundary of the scope fingerprint in which it is located. The structured information on which the scope fingerprint is generated includes at least one of the following: the node path of the query block in the AST, the nesting depth, the signature of the referenced data table, or the hash value of its own content.
3. The method according to claim 1, characterized in that, The first traversal of the AST, analyzing its structured information within the AST, and generating a corresponding scope fingerprint based on the structured information includes: i. Identify each query block in the AST; ii. For each query block, generate a unique and stable scope fingerprint based on its structured information in the AST; iii. Using the scope fingerprint and its hierarchical relationship, construct a scope graph, which is used to provide immutable boundary constraints for subsequent rewriting operations.
4. The method according to claim 1, characterized in that, Based on the degree to which operators retain data information, the operator security matrix includes at least the following security levels and their corresponding processing strategies: SAFE (Safety): When the operator itself does not disclose information, the operator structure is retained, and only the sensitive field parameters are replaced with custom desensitization function calls; UNSAFE: When an operator would leak metadata or part of information of a sensitive field, the entire expression node containing the operator is replaced with a dummy value that is compatible with the return type of the original expression. FORMAT_PRESERVING (Format Preservation): When the context of the operator requires preserving the comparability or groupability of the data, sensitive field nodes are replaced with calls to the format preservation desensitization function; PASS_THROUGH (Pass-through): When the operator only serves as a structure wrapper or type conversion, the operator structure is preserved, and the context probing and rewriting of its internal parameters are recursively performed.
5. The method according to claim 1, characterized in that, Based on the scope graph, a second traversal is performed on the AST. When a node identified as a sensitive field is encountered, the AST is rewritten differentially according to a preset operator security matrix, including: i. Probe the direct parent node of the sensitive field node to determine its operator context; ii. Query a preset operator security matrix, and obtain the corresponding security level and processing strategy based on the operator context; iii. Based on the processing strategy, rewrite the expressions or their parameters that contain sensitive fields in the AST in a differentiated manner; When a sensitive field appears in the GROUP BY, ORDER BY, or DISTINCT clauses, or as an operand in an equal comparison (=, IN) in the WHERE clause, its context is identified as requiring format preservation. The processing strategy is to rewrite the sensitive field and its corresponding literal in the comparison expression using the same FORMAT_PRESERVING desensitization function.
6. The method according to claim 1, characterized in that, It also includes an extension step: when an unknown function not defined in the operator security matrix is encountered during the rewriting phase, the metadata and context information of the function are asynchronously submitted to a Large Language Model (LLM) for security risk analysis, and the operator security matrix is updated based on the classification suggestions returned by the LLM.
7. A SQL dynamic data masking system for implementing the method of any one of claims 1-6, characterized in that, include: A SQL parser is used to parse raw SQL query statements into an abstract syntax tree (AST). A scope analyzer is configured to perform a first pass of analysis traversal on the AST, generate a scope fingerprint for each query block, and construct a scope graph as an immutable boundary constraint. A configuration center is used to store an operator security matrix that maps SQL operators to preset security levels and processing strategies; A context-aware desensitizer is configured to perform a second rewrite traversal on the AST based on the scope graph, and to perform differential rewriting on the AST by probing the parent node operator context of the sensitive field and querying the operator security matrix; A SQL generator for generating a new, safe SQL query based on a rewritten AST.
8. The system according to claim 7, characterized in that, The system also includes a Large Language Model (LLM) enhancement module that interacts with the configuration center, and this module is configured as follows: a) Receive unknown functions or suspicious content submitted by the context-aware desensitizer; b) Call a large language model to perform security classification on the unknown function, or to identify sensitive information on the suspicious content; c) Based on the analysis results, generate policy recommendations for updating the configuration center.
9. An electronic device, characterized in that, The electronic device includes: One or more processors; Memory, on which computer-executable instructions are stored; The processor is configured to execute the instructions to implement the method as described in any one of claims 1 to 6.
10. A computer-readable storage medium, characterized in that, The computer-readable storage medium contains program code that can be invoked by a processor to execute the method as described in any one of claims 1 to 6.