LLM-oriented text-to-sql intent query method and system
By introducing a dual-channel processing architecture of security proactivity and cognitive evolution into the Text-to-SQL system, the problems of security risks and insufficient trustworthiness in high-security scenarios are solved, enabling continuous system evolution and compliance auditing, and improving the system's reliability and auditability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- WUHAN UNIV
- Filing Date
- 2026-04-10
- Publication Date
- 2026-06-12
AI Technical Summary
Existing LLM-driven Text-to-SQL systems suffer from security risks and insufficient trustworthiness in industrial application scenarios with high security and high reliability requirements.
Design a dual-channel processing and enhancement architecture based on security proactivity and cognitive evolution, including a security execution channel and a cognitive enhancement and auditing channel. The security execution channel introduces security constraints and task decomposition mechanisms through the intent parsing phase to ensure that query operations are executed within compliance boundaries; the cognitive enhancement and auditing channel identifies and optimizes cognitive biases by comparing actual results with intermediate expectations.
It has achieved a systematic leap from single error correction to continuous evolution, from passive verification to proactive defense, and from operation logging to compliance auditing, significantly improving the practicality, reliability, and auditability of the Text-to-SQL system in scenarios with high security requirements.
Smart Images

Figure CN121997377B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the intersection of artificial intelligence and database technology, and more specifically, to a Text-to-SQL intent query method and system for LLM. Background Technology
[0002] Text-to-SQL (TTO) technology, based on natural language processing, serves as a crucial bridge connecting human intent with structured data, and its development is closely intertwined with the evolution of artificial intelligence. Early methods primarily relied on rule templates and syntax parsing, mapping natural language keywords to SQL components through predefined rules. While offering strong interpretability, these methods suffered from weak generalization capabilities, struggling to handle complex and diverse user expressions. With the rise of machine learning, statistical learning methods were introduced, training models on labeled data to learn the mapping relationship between language and SQL. However, limitations in annotation costs and semantic understanding depth resulted in insufficient adaptability to complex queries and unknown database patterns.
[0003] In recent years, deep learning and pre-trained language models have driven a paradigm shift in Text-to-SQL technology. In particular, methods based on Large Language Models (LLMs), with their powerful language understanding and generation capabilities, can better handle semantic ambiguity, contextual dependencies, and complex logic in natural language, significantly improving the overall performance of tasks and enabling non-technical users to interact with databases using natural language.
[0004] However, despite continuous technological advancements, existing LLM-driven Text-to-SQL systems still suffer from security risks and insufficient trustworthiness in industrial application scenarios with high security and reliability requirements. Summary of the Invention
[0005] To address the security risks and insufficient credibility of existing technologies, this invention provides a Text-to-SQL intent query method and system for LLM (Language Modeling). The aim is to build a closed-loop system that includes secure intent parsing, secure query compilation, cognitive enhancement based on real feedback, and end-to-end auditing and tracing, so as to realize natural language querying for large language models and achieve highly reliable application in database interactions.
[0006] The core solution of this invention lies in designing a dual-channel processing and enhancement architecture based on security proactivity and cognitive evolution. The first channel is the secure execution channel. This channel introduces security constraints and task decomposition mechanisms during the intent parsing stage, transforming user intents into atomic tasks labeled with security levels; and generates and compiles SQL within a secure query framework composed of database pattern awareness and operation whitelists. Through a multi-layered defense system consisting of pattern verification, operation whitelist verification, and semantic security verification, it ensures that query operations are executed within compliance boundaries, solving the problems of passive security control and easy generation of unauthorized or illegal operations in traditional methods.
[0007] The second channel is the cognitive enhancement and auditing channel. This channel automatically identifies and captures cognitive biases in field mapping and logical understanding by comparing the actual results returned by the database with the intermediate expectations (such as thought chains) of the large language model during the generation process. It generates feedback signals to build an aligned dataset, driving continuous model calibration and optimization. This mechanism upgrades traditional single-execution error correction to a system-level continuous evolution capability, fundamentally alleviating data illusion. Simultaneously, this channel records structured audit logs throughout the entire process, from intent parsing to result return, achieving full traceability and reliable assessment of the operational chain, meeting the rigid requirements of process compliance auditing in high-security fields.
[0008] Through the collaborative work of the above two channels, this invention achieves a systematic leap from single error correction to continuous evolution, from passive verification to proactive defense, and from operation records to compliance auditing, significantly improving the practicality, reliability, and auditability of the Text-to-SQL system in scenarios with high security requirements.
[0009] To achieve the above objectives, a first aspect of the present invention provides a Text-to-SQL intent query method for LLM, comprising:
[0010] It receives natural language queries, drives the large language model to perform intent parsing and task decomposition of natural language queries through prompt templates with built-in security constraint rules, and outputs atomic tasks with security annotations. The atomic tasks with security annotations are represented in a structured form, and each task contains a unique identifier, task description, expected access table list and security attribute object.
[0011] For each atomic task, the database schema information of the tables it is expected to access is obtained from the pre-built knowledge base. The task description and the database schema information are combined to form a hint, which drives the large language model to generate candidate SQL.
[0012] Based on the security labeling of atomic tasks and security constraint rules, the generated candidate SQL is sequentially subjected to multi-level verification, including mode consistency verification, operation whitelist verification and semantic security verification. When all multi-level verifications pass, the candidate SQL is marked as safe and executable.
[0013] Execute SQL marked as safe and executable to obtain the actual query results, capture the intermediate inference results of the large language model during the generation process, compare the actual query results as real data with the intermediate inference results, and optimize the large language model based on the comparison results.
[0014] In one implementation, the security constraint rules include an operation whitelist, table-level access control rules, and resource-level protection rules. The operation whitelist is used to control permissions for specific SQL operations executed on the database, the table-level access control rules are used to define access permissions for different roles or users to specific database tables, and the resource-level protection rules are used to limit the resource consumption of query operations.
[0015] In one implementation, a prompt template with built-in security constraint rules drives a large language model to perform intent parsing and task decomposition on natural language queries, outputting atomic tasks with security annotations, including:
[0016] Perform intent parsing on natural language queries and decompose the natural language queries into independently executable atomic tasks based on the intent parsing results;
[0017] Based on the matching of the current user's role information with the table-level access control rules, appropriate security labels are attached to the atomic tasks, and the atomic tasks with security labels are output. The atomic tasks are represented in a structured form.
[0018] In one implementation, the database schema information includes table names, field names, data types, field comments, primary and foreign key constraints, index information, and relationships between tables.
[0019] In one implementation, based on the security annotations and security constraint rules of atomic tasks, the generated candidate SQL is subjected to multi-layered verification, including pattern consistency verification, operation whitelist verification, and semantic security verification, in sequence, including:
[0020] Check whether the table names and field names referenced in the SQL exist in the database schema and whether the data types match. If both conditions are met, the consistency check passes.
[0021] Check whether the SQL operation type is within the preset operation whitelist, and also check whether it contains dangerous syntax. If it is within the operation whitelist and does not contain dangerous syntax, the operation whitelist verification passes. The dangerous syntax includes data definition language operations, delete and update operations without WHERE conditions, and bypassing the check by using comment characters.
[0022] Check whether the SQL complies with the security annotations of atomic tasks, specifically including whether it accesses tables outside the expected list of accessed tables, whether it meets field-level filtering requirements, and whether it complies with resource protection rules. If it complies, the semantic security check passes.
[0023] In one implementation, the actual query results are compared with intermediate inference results as real data, and the large language model is optimized based on the comparison results, including:
[0024] By comparing real data with intermediate inference results, we can identify cognitive biases in the model's field mapping and logical understanding, and obtain bias data.
[0025] An aligned dataset is formed based on the deviation data;
[0026] Using the aforementioned aligned dataset, supervised fine-tuning or reinforcement learning methods are employed to periodically calibrate and optimize the large language model.
[0027] In one embodiment, the method further includes:
[0028] The system records structured audit logs from user queries to result returns. These audit logs are represented in JSON format and include the user's natural language query, atomic task list, candidate SQL, validation results at each level, execution results, and feedback data.
[0029] Based on the same inventive concept, a second aspect of the present invention provides a Text-to-SQL intent query system for LLM, comprising:
[0030] The security intent parsing module receives natural language queries and drives the large language model to perform intent parsing and task decomposition on the natural language queries through prompt templates with built-in security constraint rules. It outputs atomic tasks with security annotations. The atomic tasks with security annotations are represented in a structured form. Each task contains a unique identifier, task description, expected access table list, and security attribute object.
[0031] The secure SQL compilation module is used to obtain the database schema information of the tables to be accessed from a pre-built knowledge base for each atomic task, combine the task description with the database schema information to form a hint, drive the large language model to generate candidate SQL; and perform multi-level verification on the generated candidate SQL based on the security annotation and security constraint rules of the atomic task, including schema consistency verification, operation whitelist verification and semantic security verification. When all multi-level verifications pass, the candidate SQL is marked as secure and executable.
[0032] The cognitive enhancement feedback module is used to execute SQL marked as safe and executable to obtain the actual query results and capture the intermediate inference results of the large language model during the generation process. The actual query results are compared with the intermediate inference results as real data, and the large language model is optimized based on the comparison results.
[0033] Based on the same inventive concept, a third aspect of the present invention provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, provides an intent query method for LLM-oriented Text-to-SQL as described in the first aspect.
[0034] Based on the same inventive concept, a fourth aspect of the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the intent query method for LLM-oriented Text-to-SQL described in the first aspect.
[0035] Compared with the prior art, the advantages and beneficial technical effects of the present invention are as follows:
[0036] This invention proposes a Text-to-SQL intent query method for LLM (Limited Language Management). After receiving a natural language query, it drives a large language model to parse the query intent and decompose tasks using a prompt template with built-in security constraint rules. It outputs atomic tasks with security annotations, which provide machine-readable authorization boundaries for subsequent semantic security verification. For each atomic task, it retrieves the database schema information of the expected access table from the knowledge base and combines it with the schema information to form a prompt, driving the large language model to generate candidate SQL. This allows SQL generation and compilation within a secure query framework composed of database schema awareness and operation whitelists. A multi-layered defense system consisting of schema verification, operation whitelist verification, and semantic security verification ensures that query operations are executed within compliance boundaries, solving the problems of passive security control and easy unauthorized or illegal operations in traditional methods. Furthermore, the large language model is optimized by comparing actual query results with intermediate inference results, enabling the model to continuously learn from historical errors and gradually improve its understanding of specific database schemas and business semantics. This achieves a secure and controllable conversion from natural language to database operations, reliable result assurance, and traceable compliance processes.
[0037] Furthermore, the structured audit log records the entire process from user query to result return, supporting process traceability and compliance analysis. Through the structured audit log, it provides an immutable operation chain record for high-security fields, meeting the rigid requirements of audit traceability. Attached Figure Description
[0038] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0039] Figure 1 This is a flowchart of a Text-to-SQL intent query method for LLM in an embodiment of the present invention;
[0040] Figure 2 This is a schematic diagram of the security intent parsing and task decomposition process in an embodiment of the present invention;
[0041] Figure 3 This is a flowchart illustrating the secure SQL compilation and multi-layered verification process in an embodiment of the present invention.
[0042] Figure 4 This is a schematic diagram of the cognitive enhancement feedback closed loop in an embodiment of the present invention;
[0043] Figure 5This is a schematic diagram of the end-to-end structured audit log in an embodiment of the present invention. Detailed Implementation
[0044] Guided by the principles of high data reliability and enhanced cognitive intelligence, this invention addresses the security risks and insufficient reliability issues in natural language database query (Text-to-SQL) scenarios by proposing an interactive enhancement method that integrates multi-layered security verification and continuous cognitive learning. Through a closed-loop mechanism encompassing secure intent parsing, secure query compilation, real-feedback-driven cognitive optimization, and end-to-end auditing and traceability, it achieves secure and controllable conversion from natural language to database operations, reliable result assurance, and compliant and traceable processes.
[0045] This invention significantly improves the practicality, reliability, and auditability of intelligent database queries in high-security scenarios such as government affairs and finance, and can be widely applied in business areas such as macroeconomic analysis and compliance data retrieval. For example, in government data queries, it ensures that analysts can obtain economic indicators while completely eliminating the risk of unauthorized access and leakage of sensitive information.
[0046] Example 1
[0047] This embodiment provides a Text-to-SQL intent query method for LLM. Please refer to [link to relevant documentation]. Figure 1 ,include:
[0048] S1: Receives natural language queries, drives the large language model to perform intent parsing and task decomposition of natural language queries through prompt templates with built-in security constraint rules, and outputs atomic tasks with security annotations. The atomic tasks with security annotations are represented in a structured format, and each task contains a unique identifier, task description, expected access table list and security attribute object.
[0049] To address the issue of insecure queries and data illusions arising from large language models in Text-to-SQL tasks, this invention employs a triple verification mechanism involving pattern awareness, operation whitelisting, and semantic security. The core idea is to construct a closed-loop system encompassing intent-secure parsing, query-secure compilation, cognitive enhancement feedback, and end-to-end auditing to ensure secure and reliable database interactions. The cognitive enhancement loop generates feedback by comparing the actual execution results with the model's expectations, driving continuous system optimization. The end-to-end structured auditing module generates complete operation chain logs, meeting compliance and traceability requirements.
[0050] Specifically, S1 is the security intent parsing step. By embedding security constraint rules in the prompt template, it guides the large model to perform intent parsing and task decomposition on natural language queries, resulting in structured atomic tasks with security annotations. Atomic tasks are represented in structured JSON format. Each task includes a unique identifier (id), a task description (desc), a list of expected access tables (tables), and a security attribute object (security). The security attribute object includes security level (level) and field-level filtering notes. The security annotations on the atomic tasks provide machine-readable authorization boundaries for subsequent semantic security verification.
[0051] The security constraint rules include operation whitelists, table-level access control rules, and resource-level protection rules. Operation whitelists are used to control permissions for specific SQL operations executed on the database. Table-level access control rules are used to define access permissions for different roles or users to specific database tables. Resource-level protection rules are used to limit the resource consumption of query operations.
[0052] Specifically, table-level access control rules are used to define access permissions for different roles or users to specific database tables, ensuring that only authorized entities can query sensitive data. The configuration format is `sensitive_tables: {table_name: [role_list]}`. For example, configuring `sensitive_tables: {personal_income: ["role_a"]}` means that only the user with the role "role_a" can access the `personal_income` table. This rule plays a role in the security intent parsing phase. The parser adds corresponding security annotations based on the current user's role information when generating atomic tasks: if the user does not belong to an authorized role for the `personal_income` table, any query intent involving that table will be marked as "high" security level, and a strict permission check will be recorded in the task's security attribute object. In the subsequent semantic security check phase, the compiler extracts all table names actually referenced in the candidate SQL and compares them with the current user's authorized roles. If an unauthorized table is found (e.g., a user with role "role_b" attempting to query the `personal_income` table), the semantic check fails and SQL execution is immediately blocked, while the unauthorized behavior is recorded in the audit log.
[0053] Resource protection rules are used to limit the resource consumption of query operations, preventing complex queries from causing database performance degradation or service interruption. Common configurations include the maximum number of joined tables, the maximum number of returned rows, and execution timeout. For example, `max_join_tables: 3` means that the maximum number of tables allowed to be joined in a single query is 3. This rule guides the generator's query path selection during the secure SQL compilation phase: when building the SQL framework, the generator prioritizes query methods that comply with resource constraints. If the number of tables to be joined exceeds the preset threshold, the generator will attempt to optimize through subqueries, temporary tables, etc., or automatically split the complex query into multiple atomic tasks for separate execution. During the semantic safety verification phase, the verifier analyzes the number of joined tables and nested query depth of candidate SQL queries. If the number of JOINed tables exceeds the set value of `max_join_tables`, it is judged as "complexity exceeded" and verification fails, preventing excessive resource consumption from affecting system stability. For example, when a user queries a join analysis involving 5 tables, the system will trigger the resource protection mechanism, prompting the user to simplify the query or automatically split the execution.
[0054] By defining and applying the above two types of rules in advance, this invention achieves dual security control from access permissions to execution resources, effectively preventing data leakage and system overload risks, and providing reliable protection for database interaction in high-security scenarios.
[0055] S1 can be implemented in the following way:
[0056] S1.1: Perform intent parsing on natural language queries and decompose the natural language queries into independently executable atomic tasks based on the intent parsing results;
[0057] S1.2: Based on the matching of the current user's role information with the table-level access control rules, attach corresponding security labels to the atomic tasks and output the atomic tasks with security labels. The atomic tasks are represented in a structured form.
[0058] Specifically, adding a corresponding security label to an atomic task based on the matching of the current user's role information with the table-level access control rules means determining whether the current user's role information is an authorized role for a specific data table in the table-level access control rules. If not, it indicates a mismatch, and the security label is set to a high security level.
[0059] In specific implementation, such as Figure 2 As shown, when a user enters a natural language query (such as "Compare the GDP growth rates of Wuhan and Changsha over the past three years and analyze the influencing factors"), the security intent parsing module works first. The core of this step lies in security-oriented prompting engineering. This step does not directly send the original query to the LLM, but rather embeds it into a carefully designed prompt template. The specific form of the template is as follows:
[0060] start
[0061] You are a database security assistant. Please analyze user queries and break them down into independently executable atomic database query tasks.
[0062] Please follow these rules:
[0063] 1. Specify the target data table for each task and assess its security level (low / medium / high).
[0064] 2. If the task involves sensitive information such as policies or personnel, the "field-level filtering required" label must be added.
[0065] 3. The output must strictly conform to the following JSON format: {"tasks": [{"id":1, "desc":"...", "tables":[], "security":{"level":"...", "note":"..."}}]}
[0066] User query: "{user_query}"
[0067] Finish
[0068] Here, `id` is the unique identifier for the task, used to uniquely identify an atomic task in subsequent task scheduling, execution tracking, and audit logs, facilitating inter-module collaboration and problem localization; `desc` is the natural language description of the task, clearly outlining the database queries that the subtask needs to execute; `tables` is a list of database table names that the task is expected to access. This field is used for subsequent semantic security checks—during the SQL compilation phase, the system extracts the actual table names referenced in the generated SQL statement and compares them with the `tables` list. If table access is found to be outside the list range, it is considered an unauthorized operation and execution is blocked, thus ensuring that the task only accesses tables within its specified range. The declared data scope is used for operations; security can encapsulate the security attribute information of the task, specifically including two subfields: 'level', which is the security level of the task, such as "low", "medium", and "high". The system can dynamically adjust subsequent processing strategies according to this level, such as triggering additional approval processes for high-level tasks, forcing field-level desensitization, or recording more detailed audit logs; and 'note', which is supplementary explanation of security matters. For example, when the value of note is "field-level filtering required", the subsequent SQL compiler must ensure that the generated query statement only requests a subset of allowed sensitive fields, or desensitize sensitive data before returning the results.
[0069] Furthermore, the placeholder {user_query} used in the prompt template represents the user's original natural language query text. During actual operation, the system replaces this placeholder with the actual user input and submits the complete prompt template to the large language model to drive the model to generate the structured task list mentioned above.
[0070] As indicated here, LLM outputs no longer free text, but a structured task list. For example, for the query above, the output would be:
[0071] start
[0072] {
[0073] "tasks": [
[0074] {"id": 1, "desc": "Query the GDP value of Wuhan City from 2022 to 2024 from the gdp_data table", "tables": ["gdp_data"], "security": {"level": "Low", "note": ""}},
[0075] {"id": 2, "desc": "Query the gdp_value of Changsha City for the same period from the gdp_data table", "tables": ["gdp_data"], "security": {"level": "low", "note": ""}},
[0076] {"id": 3, "desc": "Query the summary of relevant factors from the policy_report table, returning only the public_summary field", "tables": ["policy_report"], "security": {"level": "Medium", "note": "Field-level filtering required"}} ]
[0078] }
[0079] Finish
[0080] This step completes the initial security filtering and task definition at the cognitive level, providing precise, machine-readable input with accompanying security metadata for subsequent operations.
[0081] S2: For each atomic task, retrieve the database schema information of the tables it is expected to access from the pre-built knowledge base, combine the task description with the database schema information to form a hint, and drive the large language model to generate candidate SQL.
[0082] Specifically, S2 is the candidate SQL generation for the safe SQL compilation step.
[0083] Database schema information includes table names, field names, data types, field comments, primary and foreign key constraints, index information, and relationships between tables.
[0084] like Figure 3 As shown, the secure SQL compilation and multi-layered defense verification constituted by S2 and S3 are the core components for ensuring secure database interactions. Taking atomic task 1 as an example, the secure SQL compiler executes the following process:
[0085] First, SQL generation is performed using schema context injection. The compiler combines the description information of the atomic task (i.e., the `desc` field) with the detailed schema information of the target database table obtained from the knowledge base to form enhanced prompt text. Here, schema context specifically refers to the structured description information of the database table, including but not limited to: table name, field name, data type, field comments, primary and foreign key constraints, index information, and relationships between tables. Taking Task 1 as an example, the compiler obtains the complete schema information of the `gdp_data` table from the knowledge base (such as the `year` field being an integer type, `city_name` being a string type, `gdp_value` being a floating-point number type, and corresponding business meaning comments), and embeds this information along with the task description "Query the `gdp_value` of Wuhan City from the `gdp_data` table for 2022-2024" into the prompt template to form structured input, driving the large language model to generate candidate SQL. The candidate SQL is the initial SQL statement generated by the LLM based on the above context. It has integrated the actual structure of the database, providing a foundation for subsequent multi-level validation.
[0086] S3: Based on the security annotation of atomic tasks and security constraint rules, the generated candidate SQL is subjected to multi-level verification of mode consistency, operation whitelist verification and semantic security verification in sequence. When all multi-level verifications pass, the candidate SQL is marked as safe to execute.
[0087] Specifically, S2 is the verification of candidate SQL in the secure SQL compilation step. Among them, schema consistency verification is used to check whether the table names and field names referenced in the SQL exist in the database schema and the data types match. Operation whitelist verification is used to check whether the operation type of the SQL is within the preset operation whitelist and does not contain dangerous syntax. Semantic security verification is used to check whether the SQL complies with the security annotation of atomic tasks. The multi-layer verification realizes full-process security control from source to execution, which is different from the post-remedial verification of existing technologies.
[0088] S3 specifically includes:
[0089] S3.1: Check whether the table names and field names referenced in the SQL exist in the database schema and whether the data types match. If both are satisfied, the consistency check passes.
[0090] S3.2: Check whether the SQL operation type is within the preset operation whitelist, and check whether it contains dangerous syntax. If it is within the operation whitelist and does not contain dangerous syntax, the operation whitelist verification passes. The dangerous syntax includes data definition language operations, delete and update operations without WHERE conditions, and bypassing the check by using comment characters.
[0091] S3.3: Check whether the SQL complies with the security annotations of the atomic task, specifically including whether it accesses tables outside the expected list of accessed tables, whether it meets field-level filtering requirements, and whether it complies with resource protection rules. If it complies, the semantic security check passes.
[0092] In practice, the security verification is not completed in one step, but is implemented by three cascaded verifiers, as shown in the following code:
[0093] start
[0094] # Pseudocode: Core Logic of a Secure SQL Validation Pipeline
[0095] class SafeSQLValidator:
[0096] def validate(self, candidate_sql: str, task_metadata: dict) ->ValidationResult:
[0097] # 1. Schema consistency check (defense against structural illusion)
[0098] schema_check = self._schema_validation(candidate_sql)
[0099] if schema_check.passed:
[0100] return ValidationResult(passed=False, error_type="SCHEMA_MISMATCH",details=schema_check.details)
[0101] # 2. Operation whitelist verification (defense against malicious operations)
[0102] whitelist_check = self._operation_whitelist_check(candidate_sql)
[0103] if not whitelist_check.passed:
[0104] return ValidationResult(passed=False, error_type="OPERATION_VIOLATION", details=whitelist_check.details)
[0105] # 3. Semantic security and compliance verification (defending against unauthorized access and abuse)
[0106] semantic_check = self._semantic_safety_check(candidate_sql, task_metadata)
[0107] if semantic_check.passed:
[0108] return ValidationResult(passed=False, error_type="SEMANTIC_VIOLATION", details=semantic_check.details)
[0109] # All passed
[0110] return ValidationResult(passed=True, safe_sql=candidate_sql)
[0111] def _semantic_safety_check(self, sql: str, task: dict) ->CheckResult:
[0112] Semantic security validation example: Checking for attempts to access tables outside of task declarations.
[0113] # Use an SQL parser to extract all relevant table names
[0114] referenced_tables = self._extract_table_names(sql)
[0115] authorized_tables = task["tables"]
[0116] unauthorized = referenced_tables - set(authorized_tables)
[0117] if unauthorized:
[0118] return CheckResult(passed=False, details=f"Unauthorized access to the table:{unauthorized}")
[0119] # ... Other semantic rule checks (such as sensitive field filtering, JOIN complexity, etc.)
[0120] return CheckResult(passed=True)
[0121] Finish
[0122] In a specific implementation, the SafeSQLValidator presented in the pseudocode above is the core component for implementing multi-layered defense verification. This class encapsulates complete verification logic and provides a unified validate method as the verification entry point. The validate method receives two parameters: first, the candidate SQL statement to be verified, candidate_sql (string type), which is generated by the large language model based on the schema context; second, the metadata of the atomic task, task_metadata (dictionary type), which contains information such as task description, list of authorization tables, and security level, used to guide semantic-level verification. After the method is executed, it returns a ValidationResult object, which contains several fields to reflect the validation result: passed is a boolean value indicating whether the validation passed; error_type is a string, which is only valid when passed is False, and its value is such as "SCHEMA_MISMATCH" (schema inconsistency), "OPERATION_VIOLATION" (operation violation), or "SEMANTIC_VIOLATION" (semantic violation), used to indicate the error category; details is a string that provides a detailed description of the error; safe_sql is a string, which is only valid when the validation passes and stores the final SQL statement that can be safely executed.
[0123] The validator internally performs three layers of validation sequentially using three private methods: First, it calls the `_schema_validation` method to perform schema consistency validation. This method checks whether the table names and field names referenced in the SQL actually exist in the database schema, and whether the data types match (e.g., numeric fields should not be compared with strings), and returns a `CheckResult` object. Next, it calls the `_operation_whitelist_check` method to perform operation whitelist validation. This method parses the SQL operation type (e.g., SELECT, INSERT), determines whether it is within the system's preset operation whitelist (usually only SELECT operations are allowed), and checks for dangerous syntax (e.g., DROP, TRUNCATE, or comment character bypass), also returning a `CheckResult` object. Finally, it calls the `_semantic_safety_check` method to perform semantic safety and compliance validation. This method, combined with task metadata, checks whether the SQL complies with the security constraints declared in the atomic task declaration, such as whether it accesses tables other than the authorized tables, whether it requests sensitive fields (requiring field-level filtering), and whether the JOIN complexity exceeds the limit. In the implementation of _semantic_safety_check, the _extract_table_names method uses the SQL parser to extract all table names involved in the SQL statement, returns a collection type, and performs a set difference operation with the list of tables in the task metadata to detect possible unauthorized access.
[0124] The CheckResult used in the above validation process is a single-item validation result class, which contains two properties: passed (boolean value) indicates whether the validation passed, and details (string) provides detailed information when the validation failed. All symbols appearing in the pseudocode follow Python syntax rules: `self` represents a reference to the current instance; `:` after parameters and `->` after return values are type annotations used to specify the expected type of variables or return values; `#` is a comment symbol, followed by code explanation; `=` is the assignment operator; `self._schema_validation(candidate_sql)` calls the `_schema_validation` method of the current instance and passes in the parameter; `if not schema_check.passed:` is a conditional statement that executes the subsequent code block when `schema_check.passed` is `False`; `return ValidationResult(...)` constructs and returns a validation result object; `f"unauthorized access to tables: {unauthorized}"` is a format string that can embed the value of the variable `unauthorized` into the string; `task["tables"]` retrieves the value of the key "tables" from the dictionary `task`; `set(authorized_tables)` converts the list of authorized tables into a set; `referenced_tables - set(authorized_tables)` is a set difference operation, the result of which is the set of table names accessed unauthorizedly; `if` `unauthorized` checks if the collection is not empty; if not, it executes the subsequent return logic. Through the above rigorous verification process and clear code structure, the system can effectively identify and block various unsafe operations, ensuring that candidate SQL statements are only marked as executable statements when they conform to the security policy at the schema, operation, and semantic levels, thereby guaranteeing the security and controllability of database interactions.
[0125] Only SQL statements that pass all the above checks will be marked as safe to execute and submitted to the database engine. Failure at any step will immediately terminate the process and record detailed violation information in the audit log.
[0126] S4: Execute SQL marked as safe and executable to obtain the actual query results, capture the intermediate inference results of the large language model during the generation process, compare the actual query results as real data with the intermediate inference results, and optimize the large language model based on the comparison results.
[0127] Specifically, S4 is cognitive enhancement feedback. This step alleviates the data illusion problem at its root by comparing the actual results with the model's expectations, which is different from the single-execution error correction mechanism of existing technologies.
[0128] S4 includes:
[0129] S4.1: Compare the real data with the intermediate inference results to identify cognitive biases in the model's field mapping and logical understanding, and obtain bias data;
[0130] S4.2: Form an aligned dataset based on the deviation data;
[0131] S4.3: Using the aforementioned aligned dataset, regularly calibrate and optimize the large language model using supervised fine-tuning or reinforcement learning methods.
[0132] Specifically, S4 constitutes a closed loop of trusted data acquisition and cognitive enhancement.
[0133] like Figure 4 As shown, after executing the secure SQL, not only is the result returned to the user, but a feedback loop of cognitive enhancement is also initiated. When generating the SQL, by requiring the LLM to output a chain of thoughts, the system can capture its implicit assumptions.
[0134] An example of feedback signal generation is as follows:
[0135] Suppose the LLM states in the thought process: "The year, city_name, and gdp_value fields need to be selected from the gdp_data table...", while the actual field corresponding to the city name in the database table is region. When the validator discovers through schema validation that the city_name field does not exist, it will not only correct the SQL but also generate a feedback record:
[0136] start
[0137] Cognitive bias: In table 'gdp_data', the model expects the field 'city_name', but the actual field is 'region'.
[0138] Finish
[0139] Such bias data is continuously collected to form an "alignment dataset". The system periodically uses this dataset to perform supervised fine-tuning (SFT) on a dedicated SQL generation model, or to optimize security prompt templates, thereby achieving continuous calibration and improvement of the understanding of specific databases, forming a virtuous cycle of becoming more accurate with use.
[0140] In one embodiment, the method further includes:
[0141] The system records structured audit logs from user queries to result returns. These audit logs are represented in JSON format and include the user's natural language query, atomic task list, candidate SQL, validation results at each level, execution results, and feedback data.
[0142] Specifically, this invention provides a structured audit log that records the entire process from user query to result return. The audit log is stored in JSON format and includes user query, atomic task list, candidate SQL, verification results at each level, execution results, and feedback data, supporting process traceability and compliance analysis. The structured audit log provides an immutable operation chain record for high-security fields, meeting the rigid requirements of audit traceability.
[0143] like Figure 5 As shown, all interactions during the query process are captured by the structured audit module, generating immutable logs. Log entries are not simple text, but JSON objects with a fixed pattern, facilitating querying and analysis. This design allows auditors to directly answer complex compliance questions such as, "What were all the queries involving the policy_report table last week with a security level of 'medium'?" through database queries.
[0144] Existing technologies disclose methods for generating SQL, such as document 1CN120470020A, which discloses a Text2SQL implementation method and system. This method uses a planner to perform intent analysis and processing path planning on natural language queries, obtaining query intent, domain classification results, and key entity information. The planner's work includes: identifying the intent of the natural language query to determine the query type (fact query, statistical analysis, or trend prediction); classifying the domain based on a preset industry classification system; extracting key entity information (company name, time range, indicator name); and deciding on the subsequent calling strategy for the retrieval and generator based on the query type and complexity. The security intent parsing step of this application receives natural language queries, drives a large language model to perform intent parsing and task decomposition of the query through a prompt template with built-in security constraint rules, and parses and verifies the structured results output by the model, outputting atomic tasks with security annotations. Each atomic task is represented in structured JSON format, containing a unique identifier (id), a task description (desc), a list of expected accessed tables (tables), and a security attribute object (security) (including security level and note).
[0145] The two applications differ significantly in their security intent parsing. Firstly, in terms of output format, Reference 1 outputs query intent, domain classification results, and key entity information (discrete information), while this application outputs a structured list of atomic tasks (JSON format) with security annotations. This application structures and machine-readables the intent parsing results, providing precise input for subsequent verification. Secondly, regarding security embedding, Reference 1 does not embed security constraints at the intent parsing stage, while this application embeds security constraint rules in the prompt template, with the output results containing security annotations, thus achieving proactive security and bringing security control forward to the intent understanding stage. Thirdly, regarding authorization boundaries, Reference 1 does not explicitly limit the range of tables accessible to the query, while this application explicitly limits the expected list of tables accessed for each task through the `tables` field, thus providing machine-readable authorization boundaries for semantic security verification. Finally, regarding security levels, Reference 1 does not classify tasks by security level, while this application uses the `security.level` field to indicate the task's security level (low / medium / high), supporting differentiated security policies. High-level tasks can trigger additional approvals or de-identification. Compared to the literature, the technical effects achievable by this application include: Proactive security, preventing risks at the source: The planner in literature 1 only understands "what the user wants to query," while the security intent parsing in this application further clarifies "what the user is allowed to query." By integrating table-level access control rules into the intent parsing process, atomic tasks with authorized boundaries are output, eliminating the possibility of unauthorized access at the source and achieving proactive security control. Structured output, providing precise input for subsequent verification: The discrete information output by literature 1 requires subsequent modules to parse and correlate it, which can easily lead to information loss or misunderstanding. The structured atomic task list output by this application encapsulates task descriptions, authorization table lists, security levels, and other metadata in a unified JSON format, providing a precise and machine-readable input basis for semantic security verification during the secure SQL compilation stage, ensuring that information transmission between modules is not distorted. Differentiated security strategies, adapting to multi-level security requirements: Through security level labeling, the system can dynamically adjust processing strategies according to the task risk level. For example, "low" level tasks can have simplified validation processes to improve efficiency, "medium" level tasks can be subject to mandatory field-level filtering, and "high" level tasks can trigger additional approval processes or record more detailed audit logs, achieving a balance between security and efficiency. Fine-grained field-level control prevents the leakage of sensitive information: By passing specific security constraints such as "requires field-level filtering" through the `security.note` field, the system can enforce a restriction to returning only a subset of allowed sensitive fields during the subsequent SQL compilation stage, or perform de-identification processing before returning the results. This mechanism solves the problem of fine-grained control of sensitive fields that was difficult to address in Reference 1.
[0146] Regarding the verification mechanism, Reference 1 discloses a verifier that performs multi-dimensional verification on the initial SQL statement, including syntax, table fields, permissions, and logic. If verification fails, the generation logic is iteratively adjusted. Specifically, this includes: syntax verification (checking if it conforms to SQL syntax specifications), table field verification (checking if the referenced tables and fields exist), permission verification (confirming the user has query permissions), logic verification (evaluating the consistency between the SQL logic and the query intent), and complexity verification (evaluating the execution resource consumption). When verification fails, the verifier outputs the specific error type, which serves as the basis for the generator to adjust the SQL generation logic. The secure SQL compilation steps of this application perform multi-layered verification on the candidate SQL, including schema consistency verification, operation whitelist verification, and semantic security verification. Specifically: schema consistency verification: checks whether the table names and field names referenced in the SQL exist in the database schema and whether the data types match. Operation whitelist verification: checks whether the operation type of the SQL is within the preset operation whitelist (usually only SELECT operations are allowed), and also detects whether it contains dangerous syntax (such as DDL operations, DELETE / UPDATE without WHERE conditions, comment character bypass, etc.). Semantic security verification: This checks whether the SQL statement complies with the security annotations of the atomic task, including whether it accesses tables outside the expected access table list, whether it meets field-level filtering requirements, and whether it complies with resource protection rules (such as the maximum number of related tables). The core differences between the two applications include: First, the verification dimension: Reference 1 verifies syntax, table fields, permissions, logic, and complexity, while this application verifies schema consistency, operation whitelists, and semantic security. This application adds a semantic security verification dimension, incorporating task metadata into the verification basis. Second, the verification basis: Reference 1 is based on the SQL statement itself and user permissions, while this application is based on the SQL statement +... The atomic task security annotation (tables, security) enables task-aware verification, extending security constraints from the intent parsing phase to the SQL verification stage. Regarding privilege escalation detection, while Reference 1's permission verification only checks user read / write permissions to tables and fields, this application's semantic security verification checks whether SQL accesses tables outside the authorized tables declared in the task, achieving more refined privilege escalation detection and preventing "legitimate table access with unauthorized permissions." Regarding dangerous operations, Reference 1 does not explicitly mention the detection of dangerous syntax, while this application explicitly detects DDL, DELETE / UPDATE without WHERE conditions, and comment character bypass through operation whitelist verification, constructing a dangerous syntax blacklist rule base to proactively defend against malicious operations. In terms of resource protection, Reference 1 discloses complexity verification to evaluate resource consumption but does not bind it to the task. This application's semantic security verification incorporates resource protection rules (such as the maximum number of associated tables) into the verification, achieving task-level resource control and preventing complex queries from affecting system stability.Compared to Reference 1, the technical effects achievable by this application include: semantic-level security control to prevent unauthorized access: Reference 1's permission verification only checks the user's read and write permissions to tables and fields, but it cannot prevent scenarios where "a user has the right to access a table, but the current query should not access that table" (such as mistakenly associating a sensitive population data table when analyzing GDP). This application's semantic security verification accurately identifies and blocks unauthorized access by comparing the table names actually referenced by the SQL with the expected list of accessed tables declared by the atomic task, achieving refined control from "user-level permissions" to "task-level authorization"; proactive defense against malicious operations to ensure system security: Reference 1 does not explicitly mention the detection of dangerous syntax and mainly relies on user permission control. This application's operation whitelist verification not only restricts operation types but also constructs a blacklist rule base for dangerous syntax, proactively detecting and intercepting malicious or dangerous operations such as DDL operations, DELETE / UPDATE without WHERE conditions, and comment character bypass, effectively preventing SQL injection attacks and data corruption caused by misoperation; task-level resource protection to avoid system overload: Reference 1's complexity verification independently evaluates the resource consumption of SQL but is not bound to the task context. This application incorporates resource protection rules (such as the maximum number of related tables) into semantic security verification, and combines task metadata to determine whether the resource consumption of the current query is reasonable, preventing system performance degradation or service interruption due to complex queries, and ensuring system stability in multi-user concurrent scenarios; a full-process security closed loop achieves end-to-end control: the security intent parsing output of this application produces atomic tasks with security annotations, and the secure SQL compilation performs semantic security verification based on these annotations, forming a security closed loop from "intent understanding" to "SQL verification". The verification and intent parsing in Reference 1 are relatively independent, making it difficult to achieve this kind of integrated end-to-end security control.
[0147] Document 2-CN121070970A discloses a method, apparatus, and electronic device for generating SQL statements, specifically disclosing a method for verifying SQL statements based on a preset keyword blacklist, table name whitelist, and field-database field mapping relationships. This includes: keyword blacklist verification of the SQL statement (detecting potentially dangerous keywords); table name whitelist verification of the SQL statement after passing the blacklist verification (ensuring the existence and validity of referenced table names); and field-database field mapping relationship verification of the SQL statement after passing the whitelist verification (verifying the correctness of field mapping). The secure SQL compilation steps of this application perform multi-layered verification on candidate SQL statements, sequentially performing schema consistency verification, operation whitelist verification, and semantic security verification. Semantic security verification is the core innovation of this application; it combines the security annotations of atomic tasks to check whether the SQL complies with the security constraints declared in the task declaration. The differences between the two applications include: First, the dimensions of comparison: Reference 2 verifies based on keyword blacklists, table name whitelists, and field mappings, while this application verifies based on schema consistency, operation whitelists, and semantic security, adding semantic security verification and incorporating task metadata and security annotations into the verification; Second, the depth of verification: Reference 2 uses static rule matching at the syntactic level, while this application uses dynamic security verification at the semantic level, enabling a leap from "syntactic correctness" to "semantic compliance"; Third, regarding unauthorized access detection: Reference 2's table name whitelist only checks if the table exists, while this application's semantic security verification checks whether tables outside the authorized tables declared in the task are accessed, enabling task-level unauthorized access detection rather than just checking table existence; Fourth, regarding field control: Reference 2's field mapping verifies field existence, while this application's semantic security verification, combined with security.note, checks whether field-level filtering requirements are met, thus supporting fine-grained control of sensitive fields; Finally, regarding context awareness: Reference 2's verification is independent of the task context, while this application's verification combines the security annotations (tables, security) of atomic tasks, enabling context-aware security verification. Therefore, compared to Reference 2, the technical advantages of this application include: a leap from syntactic correctness to semantic compliance: the verification in Reference 2 remains at the syntactic level—checking whether keywords are on the blacklist, whether table names exist, and whether fields exist. This verification can prevent explicit attacks such as SQL injection, but it cannot determine whether SQL statements are compliant in terms of business semantics (for example, when querying "individual income tax", although dangerous keywords are not used, sensitive fields are requested).This application's semantic security verification, combined with task metadata, determines whether SQL statements comply with security constraints at the business semantic level, achieving a qualitative leap from "syntactic correctness" to "semantic compliance." Task-level privilege escalation detection prevents unauthorized access to legitimate tables: Reference 2's table name whitelist only ensures that the tables referenced by the SQL exist in the database, but cannot determine whether the current task has the right to access these tables (for example, user A has the right to access the policy_report table, but the current task should only query GDP data, yet it mistakenly associates the policy report table). This application's semantic security verification accurately identifies and blocks task-level privilege escalation by comparing the actual table names referenced by the SQL with the tables list of the atomic task. Fine-grained control of sensitive fields ensures data privacy: Reference 2's field mapping verification only checks whether the field exists, and cannot determine whether the field should be accessed. This application uses the `security.note` field of atomic tasks to pass constraints such as "field-level filtering required," and enforces checks during the semantic security verification phase to ensure that the SQL only requests a subset of allowed sensitive fields (e.g., only returning the `public_summary` field, rather than the complete policy report), achieving fine-grained control over sensitive fields. Context-aware security verification improves accuracy: The verification in Reference 2 is static and context-independent. The semantic security verification in this application is closely integrated with the security annotations of atomic tasks, enabling dynamic adjustments to the verification strategy based on the task context. For example, for tasks with a "high" security level, stricter field-level filtering checks can be triggered; for tasks involving multi-table joins, compliance with resource protection rules can be checked. This context-aware characteristic significantly improves the accuracy and applicability of verification. Multi-layered progressive verification builds a defense-in-depth system: The three layers of verification in this application are not simply parallel but a progressive defense system. Schema consistency verification ensures the correct SQL structure, operation whitelist verification ensures the legality of operations, and semantic security verification ensures business semantic compliance. The three-layer verification system is progressive and complementary, constructing a defense-in-depth system from syntax to semantics, and from operation to business, which can effectively deal with various security threats.
[0148] In summary, compared with the combination of documents 1 and 2, this application has the following technical advantages:
[0149] The unification of proactive security and defense-in-depth: This application brings security constraints forward to the intent understanding stage through security intent parsing, and constructs a defense-in-depth system through three-layer progressive verification, realizing full-process security control from the source to execution, forming a security paradigm of "proactive defense" rather than "passive remediation".
[0150] Integration of task awareness and context verification: This application integrates the authorization boundary of the intent parsing stage into the SQL verification stage through the security annotation of atomic tasks, realizing semantic security verification of task awareness, and solving deep-seated security problems such as "legitimate operation but unauthorized access" and "legitimate field but sensitive leakage" that are difficult to deal with by existing technologies.
[0151] Balancing Syntactic Correctness and Semantic Compliance: The verification mechanism in this application not only ensures the syntactic correctness of SQL statements, but also ensures that they comply with the security constraints of business semantics through semantic security verification, achieving a leap from "usable" to "trustworthy", and providing a practical technical solution for high-security scenarios such as government affairs and finance.
[0152] Balancing security control and system efficiency: By using security level labeling and differentiated verification strategies, this application can optimize the processing flow while ensuring security. High-level tasks are subject to strict verification, while low-level tasks have simplified processes to improve efficiency, thus achieving a balance between security and efficiency.
[0153] Example 2
[0154] Based on the same inventive concept, this embodiment discloses a Text-to-SQL intent query system for LLM, comprising:
[0155] The security intent parsing module receives natural language queries and drives the large language model to perform intent parsing and task decomposition on the natural language queries through prompt templates with built-in security constraint rules. It outputs atomic tasks with security annotations. The atomic tasks with security annotations are represented in a structured form. Each task contains a unique identifier, task description, expected access table list, and security attribute object.
[0156] The secure SQL compilation module is used to obtain the database schema information of the tables to be accessed from a pre-built knowledge base for each atomic task, combine the task description with the database schema information to form a hint, drive the large language model to generate candidate SQL; and perform multi-level verification on the generated candidate SQL based on the security annotation and security constraint rules of the atomic task, including schema consistency verification, operation whitelist verification and semantic security verification. When all multi-level verifications pass, the candidate SQL is marked as secure and executable.
[0157] The cognitive enhancement feedback module is used to execute SQL marked as safe and executable to obtain the actual query results and capture the intermediate inference results of the large language model during the generation process. The actual query results are compared with the intermediate inference results as real data, and the large language model is optimized based on the comparison results.
[0158] In one implementation, the system further includes a full-link audit module for recording structured audit logs from user query to result return. The audit logs are represented in JSON format and include the user's natural language query, atomic task list, candidate SQL, verification results at each level, execution results, and feedback data.
[0159] Before implementation, basic configuration is required: connect to the target database (such as a macroeconomic indicator database), automatically extract complete schema information through the INFORMATION_SCHEMA system table, and construct a vectorized knowledge base; load the security policy defined in YAML format, which explicitly specifies the operation whitelist (allow_operations: ["SELECT"]), table-level access control rules (sensitive_tables: {personal_income: ["role_a"]}), and resource protection rules (max_join_tables: 3). Since the system in Embodiment 2 of this invention is the same system used in the Text-to-SQL intent query method for LLM in Embodiment 1, those skilled in the art can understand the specific structure and variations of this system based on the method described in Embodiment 1 of this invention, and therefore will not be elaborated upon here. All systems used in the method of Embodiment 1 of this invention fall within the scope of protection of this invention.
[0160] Example 3
[0161] Based on the same inventive concept, the present invention also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements an intent query method for LLM-oriented Text-to-SQL according to one embodiment.
[0162] Since the computer-readable storage medium described in Embodiment 3 of this invention is the same computer-readable storage medium used in implementing the Text-to-SQL intent query method for LLM in Embodiment 1 of this invention, those skilled in the art can understand the specific structure and variations of this computer-readable storage medium based on the method described in Embodiment 1 of this invention, and therefore will not be repeated here. All computer-readable storage media used in the method of Embodiment 1 of this invention fall within the scope of protection of this invention.
[0163] Example 4
[0164] Based on the same inventive concept, the present invention also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the method described in Embodiment 1.
[0165] Since the computer device described in Embodiment 4 of this invention is the same computer device used to implement the Text-to-SQL intent query method for LLM in Embodiment 1 of this invention, those skilled in the art can understand the specific structure and variations of this computer device based on the method described in Embodiment 1 of this invention, and therefore will not be repeated here. All computer devices used in the method of Embodiment 1 of this invention fall within the scope of protection of this invention.
[0166] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0167] This invention is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, generate instructions for implementing the flowchart illustrations and / or block diagrams. Figure 1 One or more processes and / or boxes Figure 1 A device that provides the functions specified in one or more boxes.
[0168] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the invention. Clearly, those skilled in the art can make various modifications and variations to the embodiments of the invention without departing from the spirit and scope of the invention. Thus, if these modifications and variations of the embodiments of the invention fall within the scope of the claims of the invention and their equivalents, the invention also intends to include these modifications and variations.
Claims
1. An intent query method for LLM-oriented Text-to-SQL, characterized in that, include: The system receives natural language queries and uses a built-in security constraint rule prompt template to drive the large language model to perform intent parsing and task decomposition on the natural language queries. It outputs atomic tasks with security annotations, including: parsing the intent of the natural language query; decomposing the natural language query into independently executable atomic tasks based on the intent parsing results; and attaching corresponding security annotations to the atomic tasks based on the matching of the current user's role information and table-level access control rules, outputting security-annotated atomic tasks. The atomic tasks are represented in a structured form. Each security-annotated atomic task contains a unique identifier, task description, a list of expected accessed tables, and a security attribute object. Security constraint rules include operation whitelists, table-level access control rules, and resource-level protection rules. The operation whitelist is used to control access to specific SQL operations executed in the database; table-level access control rules define access permissions for different roles or users to database tables; and resource-level protection rules limit the resource consumption of query operations. For each atomic task, the database schema information of the tables it is expected to access is obtained from the pre-built knowledge base. The task description and the database schema information are combined to form a hint, which drives the large language model to generate candidate SQL. Based on the security annotations and security constraint rules of atomic tasks, the generated candidate SQL is subjected to multi-layered verification, including schema consistency verification, operation whitelist verification, and semantic security verification. This includes: checking whether the table names and field names referenced in the SQL exist in the database schema and whether their data types match; if both are satisfied, the consistency verification passes. It also checks whether the operation type of the SQL is within a preset operation whitelist and whether it contains dangerous syntax; if it is within the whitelist and does not contain dangerous syntax, the operation whitelist verification passes. Dangerous syntax includes data definition language operations, delete and update operations without WHERE conditions, and bypassing checks through comment symbols. Finally, it checks whether the SQL complies with the security annotations of atomic tasks, specifically whether it accesses tables outside the expected access table list, meets field-level filtering requirements, and complies with resource protection rules; if it does, the semantic security verification passes. When all multi-layered verifications pass, the candidate SQL is marked as safe and executable. Execute SQL marked as safe and executable to obtain the actual query results, capture the intermediate inference results of the large language model during the generation process, compare the actual query results as real data with the intermediate inference results, and optimize the large language model based on the comparison results.
2. The Text-to-SQL intent query method for LLM as described in claim 1, characterized in that, Database schema information includes table names, field names, data types, field comments, primary and foreign key constraints, index information, and relationships between tables.
3. The intent query method for LLM-oriented Text-to-SQL as described in claim 1, characterized in that, The actual query results are compared with the intermediate inference results as real data, and the large language model is optimized based on the comparison results, including: By comparing real data with intermediate inference results, we can identify cognitive biases in the model's field mapping and logical understanding, and obtain bias data. An aligned dataset is formed based on the deviation data; Using the aforementioned aligned dataset, supervised fine-tuning or reinforcement learning methods are employed to periodically calibrate and optimize the large language model.
4. The Text-to-SQL intent query method for LLM as described in claim 1, characterized in that, The method further includes: The system records structured audit logs from user queries to result returns. These audit logs are represented in JSON format and include the user's natural language query, atomic task list, candidate SQL, validation results at each level, execution results, and feedback data.
5. A Text-to-SQL intent query system for LLM, characterized in that, Based on the method described in claim 1, it includes: The security intent parsing module receives natural language queries and drives the large language model to perform intent parsing and task decomposition on the natural language queries through prompt templates with built-in security constraint rules. It outputs atomic tasks with security annotations. The atomic tasks with security annotations are represented in a structured form. Each task contains a unique identifier, task description, expected access table list, and security attribute object. The secure SQL compilation module is used to obtain the database schema information of the tables to be accessed from a pre-built knowledge base for each atomic task, combine the task description with the database schema information to form a hint, drive the large language model to generate candidate SQL; and perform multi-level verification on the generated candidate SQL based on the security annotation and security constraint rules of the atomic task, including schema consistency verification, operation whitelist verification and semantic security verification. When all multi-level verifications pass, the candidate SQL is marked as secure and executable. The cognitive enhancement feedback module is used to execute SQL marked as safe and executable to obtain the actual query results and capture the intermediate inference results of the large language model during the generation process. The actual query results are compared with the intermediate inference results as real data, and the large language model is optimized based on the comparison results.
6. A computer-readable storage medium, characterized in that, It stores a computer program that, when executed by a processor, implements an intent query method for LLM-oriented Text-to-SQL as described in any one of claims 1 to 4.
7. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it implements an intent query method for LLM-oriented Text-to-SQL as described in any one of claims 1 to 4.