Structured query statement generation method and device and related products
By constructing a multi-dialect corpus and injecting dialect-specific perturbation strategies, self-correcting code generation samples are generated and pre-trained on a large model. This solves the problem of dialect confusion errors in multi-database environments and achieves SQL generation with high accuracy and self-correction capabilities.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA TELECOM CORP LTD
- Filing Date
- 2026-02-04
- Publication Date
- 2026-06-23
AI Technical Summary
Existing technologies suffer from dialect confusion errors in environments with multiple databases, resulting in insufficient model accuracy and robustness, a lack of self-correction capabilities, and unstable training in cross-dialect environments.
By constructing a multi-dialect corpus, injecting dialect-specific perturbation strategies to generate perturbation SQL query statements, and generating self-correcting code generation samples, the large model is pre-trained with full parameters and fine-tuned by combining ambiguity resolution samples, thereby enhancing the model's cross-database dialect understanding and self-correction capabilities.
It achieves high accuracy and reliability in SQL generation in multi-dialect environments, possesses cross-dialect error immunity and self-correction capabilities, and improves the practicality of the model in real-world applications.
Smart Images

Figure CN121658508B_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of artificial intelligence technology, and in particular to a method, apparatus, electronic device, computer-readable storage medium, and computer program product for generating structured query statements. Background Technology
[0002] This section is intended to provide background or context for the embodiments of this disclosure as set forth in the claims. The description herein is not intended to be a prior art simply because it is included in this section.
[0003] In the field of AI-driven database query language generation, the relevant technologies mainly adopt a supervised fine-tuning paradigm based on a single database dialect to train models to achieve the mapping from natural language to standard SQL (Structured Query Language) statements.
[0004] However, this approach makes the model highly susceptible to dialect confusion errors in real-world environments with multiple databases. Furthermore, due to the lack of dedicated error modeling and correction mechanisms, these errors can propagate continuously, severely impacting the accuracy and robustness of the technology in practical applications. Summary of the Invention
[0005] The purpose of this disclosure is to provide a method, apparatus, electronic device, computer-readable storage medium, and computer program product for generating structured query statements, which can endow large models with accurate SQL generation capabilities and inherent error correction capabilities across database dialects, significantly improving their accuracy and practicality in multi-dialect mixed environments.
[0006] Other features and advantages of this disclosure will become apparent from the following detailed description, or may be learned in part from practice of this disclosure.
[0007] This disclosure provides a method for generating structured query statements, comprising: acquiring a multi-dialect corpus covering multiple database dialects, the multi-dialect corpus including multiple code generation samples; one of the code generation samples including a natural language query statement, a database dialect identifier, and a standard SQL query statement conforming to the database dialect specification generated based on the natural language query statement; acquiring multiple dialect-specific perturbation strategies, wherein the dialect-specific perturbation strategy refers to an error injection scheme designed to address the systematic differences in syntax, functions, and data types of different database dialects; for each code generation sample in the multi-dialect corpus, applying the dialect-specific perturbation strategy to inject errors into the standard SQL query statement to generate a corresponding perturbed SQL query statement; generating a self-correcting code generation sample based on the perturbed SQL query statement and the corresponding standard SQL query statement; and performing full-parameter pre-training on a large model based on the self-correcting code generation samples covering multiple database dialects, so that the large model acquires the ability to convert natural language query statements into standard SQL query statements under a specified database dialect.
[0008] In some embodiments, the method further includes: obtaining a disambiguation code generation sample, the disambiguation code generation sample including an ambiguous natural language query statement, an unambiguous natural language query statement corresponding to the ambiguous natural language query statement, and a disambiguation correction prompt word; and fine-tuning the large model based on the disambiguation code generation sample so that the large model can acquire the ability to identify and clarify query ambiguities during the code generation process.
[0009] In some embodiments, the method further includes: identifying the database dialect with the largest data volume in the multi-dialect corpus as the core dialect, and classifying database dialects other than the core dialect as non-core dialects; generating a first mixed training set based on code generation samples and ambiguity resolution code generation samples in the multi-dialect corpus; in the early stage of fine-tuning the large model using the first mixed training set, mixing the core dialect code generation samples, the ambiguity resolution code generation samples, and non-core dialect code generation samples in an N1:1:1 ratio; and in the later stage of fine-tuning the large model using the first mixed training set. In the intermediate stage of fine-tuning the large model using the mixed training set, the core dialect code generation samples, the ambiguity resolution code generation samples, and the non-core dialect code generation samples are mixed in a ratio of N2:1:1. In the later stage of fine-tuning the large model using the first mixed training set, the core dialect code generation samples, the ambiguity resolution code generation samples, and the non-core dialect code generation samples are mixed in a ratio of N3:1:1, where N1>N2>N3; N1 is an integer greater than 1, N2 is an integer greater than 1, and N3 is an integer greater than 1.
[0010] In some embodiments, the method further includes: identifying the database dialect with the largest data volume in the multi-dialect corpus as the core dialect, and classifying database dialects other than the core dialect as non-core dialects; in the early stage of fine-tuning the large model using the multiple code-generated samples, mixing the core dialect code-generated samples and non-core dialect code-generated samples in a ratio of N4:1; in the middle stage of fine-tuning the large model using the multiple code-generated samples, mixing the core dialect code-generated samples and the non-core dialect code-generated samples in a ratio of N5:1; and in the later stage of fine-tuning the large model using the multiple code-generated samples, mixing the core dialect code-generated samples and the non-core dialect code-generated samples in a ratio of N6:1, where N4>N5>N6; N4 is an integer greater than 1, N5 is an integer greater than 1, and N6 is an integer greater than 1.
[0011] In some embodiments, the plurality of code generation samples include a first sample, the first sample including a first query statement, an identifier of the target database dialect, and a standard SQL query statement generated based on the first query statement that conforms to the target database dialect specification; wherein, the method further includes: performing code generation processing on the first query statement through the large model to generate a first predicted SQL statement conforming to the target database dialect specification; determining at least one generation reward function value among execution feedback reward, transformation consistency reward, syntax verification reward, and recovery capability reward based on the first predicted SQL statement; wherein the reward function value is used to execute a reinforcement learning strategy to drive the parameter update of the large model; wherein the execution feedback reward is used to measure whether the first predicted SQL statement can be executed successfully in a specified database and obtain the correct result; wherein the transformation consistency reward is used to measure the degree of logical and semantic consistency of SQL statements conforming to the target dialect specification generated by the large model for the same query statement through different generation paths; the syntax verification reward is used to measure whether the first predicted SQL statement conforms to syntax specifications and structural constraints; the recovery capability reward is used to measure the ability of the large model to self-correct from generation errors and ultimately achieve success in a single code generation process.
[0012] In some embodiments, the method further includes: performing code generation processing on the first query statement using the large model to generate a second predicted SQL statement that conforms to the core database dialect specification; determining the structural similarity and semantic similarity between the first predicted SQL statement and the second predicted SQL statement; and determining the transformation consistency reward based on the structural similarity and semantic similarity.
[0013] In some embodiments, the method further includes: generating a correct intermediate inference chain for each code generation sample in the multi-dialect corpus, the intermediate inference chain describing the inference process of generating a standard SQL query statement from a natural language query statement; wherein the self-correcting code generation sample further includes an intermediate inference chain for generating the standard SQL query statement from the natural language query statement.
[0014] In some embodiments, error correction information is marked in the self-correcting code generation samples using structured tags to provide explicit error correction signals for the large model. The structured tags include error fragment boundary identifiers, error attribution metadata identifiers, and correction version identifiers.
[0015] This disclosure provides a structured query statement generation apparatus, including: a corpus acquisition module, a perturbation strategy acquisition module, an error injection module, a self-correcting sample generation module, and a pre-training module.
[0016] The corpus acquisition module is used to acquire a multi-dialect corpus covering multiple database dialects. The multi-dialect corpus includes multiple code generation samples. One of the code generation samples includes a natural language query statement, a database dialect identifier, and a standard SQL query statement generated based on the natural language query statement that conforms to the database dialect specification. The perturbation strategy acquisition module can be used to acquire multiple dialect-specific perturbation strategies, where the dialect-specific perturbation strategy refers to an error injection scheme designed to address the systematic differences in syntax, functions, and data types among different database dialects. The error injection module can be used to apply the dialect-specific perturbation strategy to each code generation sample in the multi-dialect corpus, injecting errors into the standard SQL query statement to generate a corresponding perturbation SQL query statement. The self-correcting sample generation module can be used to generate self-correcting code generation samples based on the perturbation SQL query statement and the corresponding standard SQL query statement. The pre-training module can be used to perform full-parameter pre-training on a large model based on the self-correcting code generation samples covering multiple database dialects, enabling the large model to acquire the ability to convert natural language queries into standard SQL query statements under a specified database dialect.
[0017] This disclosure provides an electronic device comprising: a memory and a processor; the memory for storing computer program instructions; and the processor for calling the computer program instructions stored in the memory to implement the method for generating structured query statements as described above.
[0018] This disclosure provides a computer-readable storage medium storing computer program instructions to implement the method for generating structured query statements as described in any of the preceding embodiments.
[0019] This disclosure provides a computer program product or computer program that includes computer program instructions stored in a computer-readable storage medium. The computer program instructions are read from the computer-readable storage medium, and a processor executes the computer program instructions to implement the aforementioned method for generating structured query statements.
[0020] The structured query statement generation method, apparatus, electronic device, computer-readable storage medium, and computer program product provided in this disclosure, by systematically injecting dialect-based perturbation errors into the model and constructing "error-correction" samples for training, enables the model to fundamentally understand the essential differences between different database dialects, thereby acquiring at least two core capabilities: first, the ability to accurately generate SQL statements conforming to specified dialect norms; and second, inherent cross-dialect error immunity and self-correction capabilities. This solution effectively solves the error propagation problem caused by dialect confusion in existing technologies, ultimately achieving highly reliable and accurate SQL generation of a general model in mixed database environments, significantly enhancing the practical value of the technology.
[0021] It should be understood that the above general description and the following detailed description are merely exemplary and do not limit this disclosure. Attached Figure Description
[0022] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure. It is obvious that the drawings described below are merely some embodiments of this disclosure, and those skilled in the art can obtain other drawings based on these drawings without any inventive effort.
[0023] Figure 1 A schematic diagram of a scenario is shown that can be applied to the method or apparatus for generating structured query statements in the embodiments of this disclosure.
[0024] Figure 2 This is a flowchart illustrating a method for generating structured query statements according to an exemplary embodiment.
[0025] Figure 3 This is a flowchart illustrating an ambiguity resolution method according to an exemplary embodiment.
[0026] Figure 4 This is a flowchart illustrating a large model fine-tuning method according to an exemplary embodiment.
[0027] Figure 5This is a method for fine-tuning a large model according to an exemplary embodiment.
[0028] Figure 6 This is a flowchart illustrating a reinforcement learning method according to an exemplary embodiment.
[0029] Figure 7 This is a flowchart illustrating a method for determining conversion consistency rewards according to an exemplary embodiment.
[0030] Figure 8 This is a flowchart illustrating a code generation method according to an exemplary embodiment.
[0031] Figure 9 This is a block diagram illustrating a structured query statement generation apparatus according to an exemplary embodiment.
[0032] Figure 10 A schematic diagram of the structure of an electronic device suitable for implementing embodiments of the present disclosure is shown. Detailed Implementation
[0033] Exemplary embodiments will now be described more fully with reference to the accompanying drawings. However, these exemplary embodiments can be implemented in many forms and should not be construed as limited to the embodiments set forth herein; rather, they are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the exemplary embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar parts, and therefore repeated descriptions of them will be omitted.
[0034] Those skilled in the art will recognize that embodiments of this disclosure can be a system, apparatus, device, method, or computer program product. Therefore, this disclosure can be implemented in the following forms: entirely hardware, entirely software (including firmware, resident software, microcode, etc.), or a combination of hardware and software.
[0035] The features, structures, or characteristics described in this disclosure can be combined in any suitable manner in one or more embodiments. Numerous specific details are provided in the following description to give a thorough understanding of embodiments of this disclosure. However, those skilled in the art will recognize that the technical solutions of this disclosure can be practiced with one or more specific details omitted, or other methods, components, apparatuses, steps, etc., can be employed. In other instances, well-known methods, apparatuses, implementations, or operations are not shown or described in detail to avoid obscuring various aspects of this disclosure.
[0036] In this disclosure, the terms "module" or "unit" refer to a computer program or part of a computer program that has a predetermined function and works with other related parts to achieve a predetermined goal, and can be implemented wholly or partially using software, hardware (such as processing circuitry or memory), or a combination thereof. Similarly, a processor (or multiple processors or memory) can be used to implement one or more modules or units. Furthermore, each module or unit can be part of an overall module or unit that includes the functionality of that module or unit.
[0037] The accompanying drawings are merely illustrative of this disclosure, and the same reference numerals in the drawings denote the same or similar parts, thus omitting repeated descriptions of them. Some block diagrams shown in the drawings do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different network and / or processor devices and / or microcontroller devices.
[0038] The flowchart shown in the accompanying drawings is merely illustrative and does not necessarily include all content and steps, nor does it require execution in the described order. For example, some steps may be broken down, while others may be combined or partially combined; therefore, the actual execution order may change depending on the specific circumstances.
[0039] In the description of this disclosure, unless otherwise stated, " / " means "or," for example, A / B can mean A or B. "And / or" in this document is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, and B alone. Furthermore, "at least one" means one or more, and "multiple" means two or more. The terms "first," "second," etc., do not limit the quantity or order of execution, and "first," "second," etc., do not necessarily imply differences; the terms "contains," "includes," and "has" are used to indicate an open-ended meaning of inclusion and refer to the existence of additional elements / components / etc. besides those listed.
[0040] This disclosure embodiment can be implemented by a terminal and / or a server. The terminal can obtain data from a computer device and display that data. The computer device can interact with the terminal, and can be a server hosting an application, or it can belong to the terminal (i.e., the terminal's backend), etc., without limitation.
[0041] The terminal can be a mobile phone, a laptop computer, or a playback device in a vehicle, etc., without limitation. The terminal can be considered a playback device in a vehicle, and it can display the target application. The terminal is only one example of the devices listed; the terminal in this disclosure is not limited to the listed devices. The target application in this disclosure can be any application capable of displaying multimedia information.
[0042] It is understood that the terminal mentioned in the embodiments of this disclosure can be a computer device, including but not limited to a terminal or a server. In other words, the computer device can be a server or a terminal, or a system composed of a server and a terminal. The terminal mentioned above can be an electronic device, including but not limited to mobile phones, tablets, desktop computers, laptops, handheld computers, in-vehicle devices, augmented reality / virtual reality (AR / VR) devices, head-mounted displays, smart TVs, wearable devices, smart speakers, digital cameras, webcams, and other mobile internet devices (MIDs) with network access capabilities, or terminals in scenarios such as trains, ships, and flights.
[0043] The servers mentioned above can be independent physical servers, server clusters or distributed systems composed of multiple physical servers, or cloud servers that provide basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, vehicle-road cooperation, content delivery networks (CDN), and big data and artificial intelligence platforms.
[0044] Optionally, the data involved in the embodiments of this disclosure may be stored in a computer device or may be stored based on cloud storage technology, without limitation.
[0045] To better understand the above-mentioned objectives, features and advantages of the present invention, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, unless otherwise specified, the embodiments and features in the embodiments of the present disclosure can be combined with each other.
[0046] The following section will first explain some of the terms used in the embodiments of this disclosure so that those skilled in the art can understand them.
[0047] Dialect-Specific Perturbation: A systematic error injection strategy designed to address the differences in syntax, functions, and data types among different database dialects.
[0048] Self-Correction Exemplars: Training samples containing error-correction correspondences, used to train the model's self-correction capabilities.
[0049] Hybrid Reward Mechanism: A multi-dimensional reward calculation method that integrates execution feedback, logical consistency, and syntax validation.
[0050] Core Dialect: The primary database dialect with the richest data resources, serving as the basis for training. It is usually SQLite (Lightweight Structured Query Language Database) or MySQL (My Structured Query Language).
[0051] The preceding text introduced some terms and concepts involved in the embodiments of this disclosure. The following text introduces the technical features involved in the embodiments of this disclosure.
[0052] In code generation technologies, specialized code generation models are typically trained for specific database dialects. These models use pre-trained code models as a foundation and are fine-tuned on dialect-specific datasets through supervised learning. During training, natural language queries are taken as input, and target SQL statements are taken as output.
[0053] In code generation technologies, dialect conversion based on grammar rules and mapping tables is also employed to achieve conversion between different database dialects. These systems first construct a grammar mapping rule library between various database dialects, including function mappings and data type mappings. Then, cross-dialect conversion is achieved through syntax parsing and rule matching.
[0054] Traditional reinforcement learning code generation: In terms of model optimization, existing technologies mainly employ reinforcement learning strategies based on execution feedback. Specifically, the generated SQL is executed in the target database environment, and rewards or penalties are given based on the correctness of the execution results. Then, reinforcement learning algorithms such as PPO (Proximal Policy Optimization) are used to optimize the model strategy.
[0055] The above method has the following main technical problems.
[0056] 1. Dialect Knowledge Separation and Forgetting Issues: Existing single-dialect training models result in complete separation of knowledge about different database dialects, failing to achieve unified cross-dialect understanding. When a model is fine-tuned on a specific dialect, it often suffers catastrophic forgetting of knowledge from other dialects, leading to a severe decline in generalization ability. For example, a model trained on MySQL cannot understand the array operation syntax of PostgreSQL (Post-Structured Query Language database), and a model optimized on Oracle will forget the usage of date functions in SQLite.
[0057] 2. Lack of Error Correction and Error Propagation: Traditional code generation models lack self-correction mechanisms, and once an incorrect SQL statement is generated, it cannot correct itself. More seriously, existing training methods do not specifically learn common error types across dialects, leading to systemic errors when the model encounters dialect confusion. For example, it might incorrectly use MySQL's LIMIT (limit / limit the number of query results) syntax in a query targeting PostgreSQL, or misuse Oracle's (+) outer join syntax in an SQLite environment.
[0058] 3. Sparse Reward Signals and Unstable Training: Existing reinforcement learning methods overly rely on execution feedback as the sole reward signal. However, in multi-dialect scenarios, many low-resource dialects lack a complete execution environment, resulting in extremely sparse reward signals. Furthermore, the large reward gap between successful and failed executions makes the training process unstable and convergence difficult. Especially when dealing with complex queries, subtle grammatical errors can lead to complete execution failure, but the model cannot learn effective correction strategies from such coarse-grained feedback.
[0059] 4. Ambiguity of User Intent and Confusion in Dialect Selection: Existing systems lack intelligent mechanisms for recognizing and processing the dialect intent in user queries. When a user's natural language query does not explicitly specify the target database type, the system often randomly selects or defaults to using a certain dialect, resulting in generated results that do not match the user's actual needs. This lack of ambiguity handling severely impacts the system's usability and user experience.
[0060] To address the above problems, this application proposes a solution, which can be found in the following embodiments.
[0061] The exemplary embodiments of this disclosure will now be described in detail with reference to the accompanying drawings.
[0062] Figure 1 A schematic diagram of a scenario is shown that can be applied to the method or apparatus for generating structured query statements in the embodiments of this disclosure.
[0063] Please refer to Figure 1The diagram illustrates an implementation environment provided by an exemplary embodiment of this disclosure.
[0064] like Figure 1 As shown, system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. Network 104 serves as the medium for providing communication links between terminal devices 101, 102, and 103 and server 105. Network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables, etc.
[0065] Users can use terminal devices 101, 102, and 103 to interact with server 105 via network 104 to receive or send messages, etc. Terminal devices 101, 102, and 103 can be various electronic devices with displays and web browsing capabilities, including but not limited to smartphones, tablets, laptops, desktop computers, wearable devices, virtual reality devices, smart home devices, etc.
[0066] Server 105 can be a server that provides various services, such as a backend management server that supports the devices operated by users using terminal devices 101, 102, and 103. The backend management server can analyze and process received requests and other data, and feed the processing results back to the terminal devices.
[0067] A server can be a standalone physical server, a server cluster or a distributed system consisting of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. This disclosure does not impose any restrictions on this.
[0068] Server 105 may, for example, acquire a multi-dialect corpus covering multiple database dialects, the multi-dialect corpus including multiple code generation samples; one of the code generation samples includes a natural language query statement, a database dialect identifier, and a standard SQL query statement generated based on the natural language query statement that conforms to the database dialect specification; Server 105 may, for example, acquire multiple dialect-specific perturbation strategies, where the dialect-specific perturbation strategy refers to an error injection scheme designed to address the systematic differences in syntax, functions, and data types of different database dialects; Server 105 may, for example, apply the dialect-specific perturbation strategy to each code generation sample in the multi-dialect corpus, injecting errors into the standard SQL query statement to generate a corresponding perturbation SQL query statement; Server 105 may, for example, generate a self-correcting code generation sample based on the perturbation SQL query statement and the corresponding standard SQL query statement; Server 105 may, for example, perform full-parameter pre-training on a large model based on the self-correcting code generation samples covering multiple database dialects, so that the large model acquires the ability to convert natural language query statements into standard SQL query statements under a specified database dialect.
[0069] It should be understood that Figure 1 The number of terminal devices, networks, and servers shown is merely illustrative. Server 105 can be a single physical server or a combination of multiple servers. Depending on actual needs, it can have any number of terminal devices, networks, and servers.
[0070] Under the above system architecture, this disclosure provides a method for generating structured query statements, which can be executed by any electronic device with computing capabilities.
[0071] Figure 2 This is a flowchart illustrating a method for generating structured query statements according to an exemplary embodiment. The method provided in this disclosure can be executed by any electronic device with computing power; for example, the method can be implemented by the aforementioned... Figure 1 The execution can be performed by a server or terminal device in the embodiments, or it can be performed by both a server and a terminal device. In the following embodiments, the server is used as the execution subject for illustration, but this disclosure is not limited to this.
[0072] Reference Figure 2 The method for generating structured query statements provided in this disclosure may include the following steps.
[0073] Step S202: Obtain a multi-dialect corpus covering multiple database dialects. The multi-dialect corpus includes multiple code generation samples. One of the code generation samples includes natural language query statements, database dialect identifiers, and standard SQL query statements that conform to database dialect specifications generated based on the natural language query statements.
[0074] A multi-dialect corpus is a specially constructed dataset whose core feature is that it contains raw training data for various database dialects (such as MySQL, PostgreSQL, Oracle, SQLite, etc.).
[0075] In some embodiments, a code generation sample may correspond to a database dialect.
[0076] In some embodiments, a code generation sample may include a natural language query statement, an identifier of a database dialect, and a standard SQL query statement that conforms to the database dialect specification and is generated based on the natural language query statement.
[0077] Natural language query statements refer to a sentence or paragraph in which a user uses everyday, unstructured language (such as Chinese or English) to express their data query intent.
[0078] In some embodiments, a standard SQL query statement generated based on the natural language query statement that conforms to the database dialect specification can refer to "a 100% correct, executable SQL code that fully complies with the syntax and functional specifications of a specific database dialect (such as MySQL, PostgreSQL, etc.) and accurately realizes the intent expressed by the user's natural language query."
[0079] In some embodiments, a multi-speech corpus can be constructed. ,in For natural language queries, For database schema. For target dialect identifiers, This is a standard SQL statement.
[0080] The database schema can be a "blueprint" or "structural specification" for the database. It defines what tables are in the database, what fields are in each table, what data type each field is, and the relationships between tables, but it does not contain any actual data itself.
[0081] Step S204: Obtain multiple dialect-specific perturbation strategies, where a dialect-specific perturbation strategy refers to an error injection scheme designed to address the systematic differences in syntax, functions, and data types among different database dialects.
[0082] Dialect-specific perturbation strategies are a type of data augmentation method that, during the model training phase, intentionally and systematically applies the syntax, functions, or features of one database dialect to SQL statements in another database dialect in order to teach the model to recognize and correct cross-database errors, thereby creating typical obfuscated errors.
[0083] In some embodiments, the following four types of dialect-specific perturbation strategies can be designed.
[0084] 1. Dialectal grammatical ambiguity perturbation P_syntax(SQL,D_target,D_interfere)→SQL_perturbed.
[0085] Here, SQL refers to the original standard SQL query statement input into the perturbation function.
[0086] P_syntax represents the dialect semantic ambiguity perturbation function, which transforms a standard SQL statement that conforms to the specification into an SQL statement containing specific syntax errors by deliberately obfuscating the specific syntactic structures of different database dialects.
[0087] D_target represents the target database dialect, which is the specific database type that the user ultimately wants the generated SQL statements to run on.
[0088] D_interfere can refer to "interference dialect," which is the database dialect from which incorrect grammar or features are deliberately introduced during dialect grammatical ambiguity perturbation.
[0089] SQL_perturbed refers to the SQL query statement generated by injecting errors into the original standard SQL statement by applying dialect-specific perturbation strategies (such as P_syntax), which contains specific and typical cross-dialect errors.
[0090] 2. Function semantic conflict perturbation P_function(SQL,F_source,F_target)→SQL_perturbed.
[0091] P_function is the specific implementation function of "function semantic conflict perturbation". Its function is to deliberately obfuscate functions with similar functions but different names in different database dialects, and replace a correct function in a standard SQL statement with a function that is invalid or malfunctions in the target dialect, thereby creating a specific function mismatch error.
[0092] SQL: The input, raw, standard SQL query statement. It is the baseline object for perturbation operations. It is a correct SQL statement conforming to a certain dialect specification.
[0093] F_source: The source function, which is a function name that exists in the SQL statement but is incompatible with the target dialect. It is an erroneous function that is intentionally selected and retained in the perturbed SQL, and is the "root cause" of conflicts.
[0094] F_target: The target function, which is the correct function name that exists in the target dialect and is functionally equivalent to F_source. It represents the final correction target. In self-correcting samples, the model needs to replace F_source with F_target.
[0095] SQL_perturbed: The SQL statement generated after perturbation. Its characteristic is that it contains a function semantic conflict error caused by the F_source function. Generation method: It is obtained by replacing the originally correct function in the original SQL statement with the incorrect F_source function.
[0096] 3. Data type incompatibility perturbation: P_datatype(SQL,T_source,T_target) → SQL_perturbed.
[0097] P_datatype: The specific implementation function for "data type incompatibility perturbation". This function deliberately obfuscates data types that represent similar data but have different names or definitions in different database dialects, replacing a correct data type in a standard SQL statement with a data type that is incompatible, unsupported, or semantically different in the target dialect, thereby creating a specific data type error.
[0098] T_source: Source data type, which is a data type name that is intentionally used in an SQL statement and is incompatible with the target dialect. It is an incorrect data type that is selected and retained in the perturbed SQL and is the root cause of type conflicts.
[0099] T_target: The target data type, which is the correct data type name that exists in the target dialect and is semantically closest to the intended representation of T_source. It represents the final correction target. In self-correcting samples, the model needs to replace T_source with T_target.
[0100] SQL_perturbed: The SQL statement generated after perturbation. Its characteristic is that it contains a data type incompatibility error caused by the T_source data type. Generation method: It is obtained by replacing the originally correct data types in the original SQL statement (usually in the table structure definition or conversion functions) with the incorrect T_source data type.
[0101] 4. Logical structure error disturbance P_logic(SQL,Order_correct,Order_wrong)→SQL_perturbed.
[0102] P_logic: The specific implementation function for "logical structure error perturbation". This function generates an SQL statement that may still be syntactically correct but has logical errors in its semantics by deliberately scrambling the correct logical order of key clauses in the SQL statement or modifying their logical operators.
[0103] Order_correct: Correct clause order, describes the standard and correct logical execution order that each clause in an SQL statement should follow under the target dialect. Purpose: Serves as a benchmark for judging correctness and is the target structure that the model ultimately learns and outputs.
[0104] Order_wrong: Incorrect clause order describes an intentionally designed arrangement of clauses that does not conform to SQL syntax rules or logical execution order. Purpose: It is used to tamper with the original SQL, creating a "faulty recipe" to produce a chaotic logical structure.
[0105] SQL_perturbed: The SQL statement generated after perturbation. Its characteristic is that although it may not have syntax errors, the execution logic may not match the user's intent due to the disordered order of clauses or incorrect logical relationships, or it may fail to execute directly in some dialects. Generation method: It is obtained by rearranging the clauses in the original SQL statement or modifying the logical operators according to the error order specified by Order_wrong.
[0106] Step S206: For each code in the multi-dialect corpus, generate a sample, apply a dialect-specific perturbation strategy, inject errors into the standard SQL query statements, and generate the corresponding perturbation SQL query statements.
[0107] In some embodiments, a sample of raw code containing standard SQL (e.g., SELECT AVG(salary) FROM employees) and a target dialect (e.g., PostgreSQL) may be required first. Subsequently, the system will randomly select a strategy from a predefined perturbation strategy library (e.g., function obfuscation, syntax substitution, etc.), such as "function semantic conflict perturbation", and intentionally replace the function supported by the target dialect in the standard SQL (e.g., AVG) with a synonym function that does not exist in the target dialect but exists in other dialects (e.g., AVERAGE), thereby generating a perturbation SQL statement (SELECT AVERAGE(salary) FROM employees) containing typical cross-dialect errors. This "error-correction" pair will serve as a complete teaching sample to train the model's recognition and error correction capabilities.
[0108] Step S208: Generate a self-correcting code generation sample based on the perturbation SQL query statement and the corresponding standard SQL query statement.
[0109] In some embodiments, self-correcting samples can be constructed using the following methods.
[0110] In some embodiments, an error version can be generated by randomly selecting a perturbation function (SQL_error = random_select([P_syntax, P_function, P_datatype, P_logic])(SQL_i): randomly selecting one of the four known dialect-specific perturbation strategies to inject perturbation into standard SQL_i) from the standard SQL statement of the code generation sample.
[0111] A specific example can be shown below.
[0112] / / Examples of dialectal grammatical ambiguity perturbations
[0113] Original SQL: SELECT FROM table1 t1 LEFT JOIN table2 t2 ON t1.id = t2.id
[0114] Perturbation injection: SELECT FROM table1 t1, table2 t2 WHERE t1.id = t2.id(+) / / Intentionally injecting Oracle syntax
[0115] Target dialect: PostgreSQL
[0116] Expected correction: SELECT FROM table1 t1 LEFT JOIN table2 t2 ON t1.id = t2.id / / Restore standard SQL
[0117] Training logic: Make the model recognize that "Oracle's (+) syntax is incorrect in PostgreSQL and LEFTJOIN should be used".
[0118] In some embodiments, a correct intermediate inference chain can also be generated for each code generation sample in the multi-dialect corpus. The intermediate inference chain describes the reasoning process of generating a standard SQL query statement from a natural language query statement. The self-correcting code generation sample also includes an intermediate inference chain for generating a standard SQL query statement from a natural language query statement.
[0119] The intermediate inference chain can be generated by a known trained network model or can be manually labeled based on experience; this application does not impose any restrictions on this.
[0120] In some embodiments, the self-correcting sample enhanced by the intermediate inference chain can be obtained by the following method.
[0121] In some embodiments, the inference chain format may refer to the following: problem understanding --> schema analysis --> SQL construction logic --> final SQL generation.
[0122] Example of inference chain generation.
[0123] Reasoning chain: First, analyze the query requirements: calculate the average salary of each department (problem understanding); then identify the relevant tables in the database: such as the employees table containing department and salary fields (schema analysis); construct grouping and aggregation logic: group by department and calculate the average salary (SQL construction logic); generate SQL: SELECT department, AVG(salary) FROM employees GROUP BY department" (this SQL statement's function is: group the "employees" table by "department" and calculate the average "salary" for each department).
[0124] In some embodiments, error correction information can also be marked in the self-correcting code generation samples through structured tags to provide clear error correction signals for large models. The structured tags include error fragment boundary identifiers, error attribution metadata identifiers, and correction version identifiers.
[0125] In some embodiments, a designed structured markup format can be adopted to provide explicit error correction signals for the model.
[0126] In some embodiments, the boundaries of error segments can be marked by ERR_START (a kind of marker) and ERR_END (a kind of marker).
[0127] In some embodiments, error attribution metadata (DIALECT_SYNTAX (dialect syntax error), FUNC_MISMATCH (function mismatch error), TYPE_INCOMPATIBLE (data type incompatibility error), LOGIC_ERROR (logical structure error)) can be provided by TYPE (a kind of marker).
[0128] In some embodiments, CORRECTION can be used to mark the provision of the correct revised version
[0129] In some embodiments, by combining the inference path (inference chain), dialect-specific perturbation injection, and structured markup format, an example of a self-correcting code generation sample can be as follows.
[0130] [CONTEXT]
[0131] <!-- Start: Context information block, used to define all background information required for generating SQL -->
[0132] Database Schema: CREATE TABLE employees (id INT, name VARCHAR(50),department VARCHAR(30), salary DECIMAL(10,2));
[0133] <!-- Database schema: Defines the structure of the employees table, including four fields: ID, name, department, and salary -->
[0134] Natural Language Query: Query the average salary of each department;
[0135] <!-- Natural language query: The original question raised by the user -->
[0136] Target Dialect: PostgreSQL;
[0137] <!-- Target dialect: Specifies the database specification that the generated SQL should follow, here it is PostgreSQL -->
[0138] [ / CONTEXT]
[0139] <!-- End: Context information block -->
[0140] [QUESTION]
[0141] <!-- Start: Question instruction block, clearly instruct the model to perform a task -->
[0142] Generate SQL query for the above requirement.
[0143] <!-- Instruction content: Require the model to generate the corresponding SQL query based on the above context information -->
[0144] [ / QUESTION]
[0145] <!-- End: Question instruction block -->
[0146] [REASONING]
[0147] <!-- Start: Reasoning process block, showing the complete thinking chain when the model generates SQL -->
[0148] Need to calculate the average salary by department. First, try to construct the SQL:
[0149] <!-- Reasoning step one: Understand the user's intention, that is, a grouped aggregation query is required -->
[0150] [ERR_START|TYPE:FUNC_MISMATCH];
[0151] <!-- Start: Error marker, indicating that the error type is "function mismatch" -->
[0152] SELECT department, AVERAGE(salary) FROM employees GROUP BY department;
[0153] <!-- The initial SQL constructed (with error): Incorrectly used the non - existent AVERAGE function in PostgreSQL -->
[0154] [ERR_END]
[0155] <!-- End: Error marker -->
[0156] Found an error: The aggregation function in PostgreSQL should be AVG instead of AVERAGE;
[0157] <!-- Step two of reasoning: Identify and diagnose the specific error cause -->
[0158] [CORRECTION]
[0159] <!-- Start: Correction marker, indicating that the following is the correct SQL fragment -->
[0160] SELECT department, AVG(salary) FROM employees GROUP BY department;
[0161] <!-- Revised SQL fragment: Replace AVERAGE with the correct aggregation function AVG in the target dialect PostgreSQL -->
[0162] [ / CORRECTION]
[0163] <!-- End: Correction marker -->
[0164] [ / REASONING]
[0165] <!-- End: Reasoning process block -->
[0166] [SQL]
[0167] <!-- Start: Final answer block, providing the complete and correct SQL statement -->
[0168] SELECT department, AVG(salary) FROM employees GROUP BY department;
[0169] <!-- The finally generated standard SQL: An executable query statement conforming to the PostgreSQL dialect specification -->
[0170] [ / SQL]
[0171] <!-- End: Final answer block -->
[0172] Step S210, based on the samples generated by the self-correcting code covering multiple database dialects, perform full-parameter pre-training on the large model so that the large model can obtain the ability to convert natural language query statements into standard SQL query statements in the specified database dialect.
[0173] In some embodiments, the large model can be continuously pre-trained using the self-correcting code generation samples described above. The continuous pre-training strategy is to use a full-parameter training method to feed these self-correcting samples into a code base model (such as CodeLlama) to ensure that the model fully absorbs and internalizes this complex cross-dialect error correction logic.
[0174] The technical solution provided in this embodiment offers a precise summary of the core invention by systematically constructing "error-correction" sample pairs: By designing a systematic perturbation strategy targeting database dialect differences, typical cross-dialect errors are actively injected into standard SQL statements. Furthermore, by integrating inference chains and structured tags, self-correcting samples are constructed that can clearly teach the model "what is an error" and "how to correct it." Subsequently, these samples covering multiple dialects are used to perform full-parameter pre-training on the model, enabling it to fundamentally internalize cross-dialect knowledge. This endows the single model with the core technical advantage of accurately generating SQL in multiple database environments and possessing inherent self-correction capabilities.
[0175] In some embodiments, users often express themselves vaguely in real-world applications (such as not explicitly specifying the target database type), which causes the system to be unable to generate SQL accurately.
[0176] To address the above problems, this application proposes the following solutions.
[0177] Figure 3 This is a flowchart illustrating an ambiguity resolution method according to an exemplary embodiment.
[0178] refer to Figure 3 The above-mentioned method for resolving ambiguity may include the following steps.
[0179] Step S302: Obtain the ambiguity resolution code generation sample. The ambiguity resolution code generation sample includes ambiguous natural language query statements, unambiguous natural language query statements corresponding to the ambiguous natural language query statements, and ambiguity correction prompts.
[0180] Ambiguous natural language queries refer to natural language expressions that contain multiple possible interpretations, making it impossible to directly and uniquely map them to a standard SQL query. This ambiguity can stem from multiple dimensions. For example, the target database may be unclear, the business logic or filtering conditions may be vague, or the aggregation dimensions may be unclear.
[0181] Example of a user query: "Statistics on total sales".
[0182] Ambiguity: Is the statistics based on product, region, or salesperson?
[0183] Unambiguous natural language query statements: Clear statements that can uniquely correspond to a standard SQL query after eliminating all ambiguities that may lead to multiple interpretations.
[0184] Ambiguity correction prompts: These are structured requests proactively generated to retrieve key missing information from users when one or more of the aforementioned ambiguities are detected in a user's query. Their core purpose is to guide the conversation and clarify intent.
[0185] In real-world scenarios, user queries often do not contain explicit dialectal instructions (such as "query sales data" rather than "query sales data using MySQL"), and large models need to be able to handle such ambiguity.
[0186] In some embodiments, disambiguation data dis_am can be constructed, and an example of data construction is shown below:
[0187] / / Construct training samples for ambiguous queries
[0188] Original sample: ("Query average salary for each department", schema, "MySQL", target_sql)
[0189] Ambiguous sample: ("Query average salary for each department", schema, "UNKNOWN", multiple_dialect_options)
[0190] / / Training objective: To enable the model to learn to recognize ambiguity and generate reasonable clarification requests.
[0191] Expected output:
[0192] {
[0193] "ambiguity_detected": True,
[0194] / Ambiguity detection flag: Boolean value.
[0195] - When the value is True, it means that the system has detected ambiguity in the current user's query and cannot directly generate SQL.
[0196] - This is the switch signal that triggers the subsequent clarification process. /
[0197] "clarification_needed": "Please specify the target database type",
[0198] / Necessary clarification information: string.
[0199] - These are easy-to-understand natural language prompts that the system presents directly to the user.
[0200] - It clearly points out the core issue of the current ambiguity and guides users to provide key information. /
[0201] "suggested_options": ["MySQL", "PostgreSQL", "Oracle"]
[0202] / Suggested options: string array.
[0203] - The system provides users with predefined solutions based on available resources, context, or history.
[0204] - This list transforms ambiguous questions into clear multiple-choice questions, greatly simplifying the user experience.
[0205] - The options in the examples are specific database dialects, which directly correspond to the target specification required when generating SQL. /
[0206] }
[0207] Step S304: Fine-tune the large model based on the ambiguity resolution code generation samples so that the large model can identify and clarify query ambiguities during the code generation process.
[0208] This technical solution fine-tunes a large model using specially constructed ambiguity resolution samples, upgrading it from a simple code generation tool into an intelligent interactive system with context awareness. When faced with ambiguous queries such as unclear target databases, vague query conditions, or missing aggregation dimensions, the model can proactively identify unclear intents, generate structured clarification requests and suggested options, and guide users to supplement key information. This transforms open-ended questions into explicit instructions, ultimately significantly improving the accuracy and practicality of natural language to SQL conversion in real-world complex scenarios.
[0209] In related technologies, single-dialect training methods have a fatal flaw: when a model is fine-tuned specifically for a particular dialect (such as MySQL), "catastrophic forgetting" occurs, completely losing the ability to understand other dialects (such as PostgreSQL and Oracle). Furthermore, in real-world applications, users often do not explicitly specify the target database type, causing the system to fail to accurately generate SQL. The following examples will employ a course-based hybrid training strategy to allow the model to master the core dialect while maintaining multi-dialect sensitivity; and during training, specifically cultivate the model's ability to identify and handle ambiguous queries, laying the foundation for subsequent deployment.
[0210] Figure 4 This is a flowchart illustrating a large model fine-tuning method according to an exemplary embodiment.
[0211] refer to Figure 4 The above-mentioned large model fine-tuning method may include the following steps.
[0212] Step S402: In the multi-dialect corpus, the database dialect with the largest amount of data is identified as the core dialect, and the database dialects other than the core dialect are identified as non-core dialects.
[0213] Step S404: Generate a first mixed training set based on code generation samples and ambiguity resolution code generation samples from the multi-speech corpus.
[0214] Step S406: In the early stage of fine-tuning the large model using the first mixed training set, the core dialect code generation samples, the ambiguity resolution code generation samples, and the non-core dialect code generation samples are mixed in a ratio of N1:1:1.
[0215] Step S408: In the intermediate stage of fine-tuning the large model using the first mixed training set, the core dialect code generation samples, the ambiguity resolution code generation samples, and the non-core dialect code generation samples are mixed in a ratio of N2:1:1.
[0216] Step S410: In the later stage of fine-tuning the large model using the first mixed training set, the core dialect code generation samples, ambiguity resolution code generation samples, and non-core dialect code generation samples are mixed in a ratio of N3:1:1, where N1>N2>N3; N1 is an integer greater than 1, N2 is an integer greater than 1, and N3 is an integer greater than 1.
[0217] Specifically, the curriculum-based hybrid dialect fine-tuning strategy involves identifying a core dialect and gradually increasing the proportion of other dialects during training, allowing the model to "review" other dialect knowledge while consolidating its core capabilities. The specific process is shown below.
[0218] 1. Determine the core dialect c_base (usually choose the dialect with the largest amount of data, such as SQLite).
[0219] 2. Construct a mixed training set: Train_mixed = 0.8 × Data_c_base (core database dialect) + 0.1 × Data_dis_am (ambiguity resolution code generation samples) + 0.1 × Data_others (non-core database languages).
[0220] 3. The difficulty of implementing a gradual dialect change increases.
[0221] Dialect distribution in Epoch_i (the i-th training round):
[0222] if i <= Epoch_n / 3: Pure c_base training; / / For the first third of the epochs, train using pure core dialect;
[0223] if Epoch_n / 3 < i <= 2 Epoch_n / 3: Mixing ratio 0.8:0.1:0; / / For the middle third of the rounds, the mixing ratio is 0.8:0.1:0;
[0224] if i > 2 Epoch_n / 3: Mixing ratio 0.6:0.2:0.2 / / For the last third of the rounds, the mixing ratio is 0.8:0.1:0.
[0225] Where i is an integer greater than 0.
[0226] The above methods allow large models to maintain sensitivity to multiple dialects while mastering the core dialect.
[0227] This technical solution employs a course-based, hybrid dialect fine-tuning strategy, designing a progressive training curriculum: in the early stages of training, the model focuses on mastering the core dialect with the richest data, establishing a solid foundation; in the middle stages, non-core dialects and ambiguous samples are gradually introduced to activate the model's cross-dialect perception capabilities; and in the later stages, the proportion of non-core samples is further increased to enhance the model's generalization ability. This training approach, progressing from easy to difficult and from specialized to broad, ensures high performance of the model on core tasks while endowing it with strong generalization capabilities for handling multi-dialect scenarios and fuzzy queries, ultimately achieving accurate and robust SQL generation from a single model in multi-database environments.
[0228] Figure 5 This is a method for fine-tuning a large model according to an exemplary embodiment.
[0229] refer to Figure 5The above-mentioned large model fine-tuning method may include the following steps.
[0230] Step S502: In the multi-dialect corpus, the database dialect with the largest amount of data is identified as the core dialect, and the database dialects other than the core dialect are identified as non-core dialects.
[0231] Step S504: In the early stage of fine-tuning the large model through multiple code generation samples, the core dialect code generation samples and non-core dialect code generation samples are mixed in a ratio of N4:1.
[0232] Step S506: In the intermediate stage of fine-tuning the large model through multiple code generation samples, the core dialect code generation samples and non-core dialect code generation samples are mixed in a ratio of N5:1.
[0233] Step S508: In the later stage of fine-tuning the large model using multiple code-generated samples, the core dialect code-generated samples and non-core dialect code-generated samples are mixed in a ratio of N6:1, where N4>N5>N6. N4 is an integer greater than 1, N5 is an integer greater than 1, and N6 is an integer greater than 1.
[0234] This technical solution employs a progressive and balanced fine-tuning method. In the early stages of training, a higher ratio (N4:1) is used to focus on core dialects to establish a solid foundation. In the middle stages (N5:1), the exposure of non-core dialects is moderately increased to activate cross-dialect perception. In the later stages (N6:1), the ratio of the two is further balanced to enhance generalization ability. This dynamic curriculum, from specialized to broad, ensures that the model deeply masters the advantages of core dialects while effectively preventing the forgetting of non-core dialect knowledge. Ultimately, this endows a single model with robust and accurate SQL generation capabilities in multi-database environments.
[0235] Figure 6 This is a flowchart illustrating a reinforcement learning method according to an exemplary embodiment.
[0236] In some embodiments, multiple code generation samples in the corpus may include a first sample, which includes a first query statement, an identifier of the target database dialect, and a standard SQL query statement that conforms to the target database dialect specification and is generated based on the first query statement.
[0237] refer to Figure 6 The reinforcement learning method described above may include the following steps.
[0238] Step S602: The first query statement is processed by the large model to generate a first predicted SQL statement that conforms to the dialect standard of the target database.
[0239] Step S604: Determine at least one of the following reward function values based on the first prediction SQL statement: execution feedback reward, transformation consistency reward, syntax verification reward, and recovery capability reward. The reward function value is used to execute a reinforcement learning strategy to drive large model parameter updates.
[0240] In some embodiments, reinforcement learning optimization based on a hybrid reward mechanism can design a hybrid reward function: .
[0241] in, It is about implementing feedback rewards, It is a conversion of consistency rewards, It is a grammar check reward and It's a recovery ability reward. These are preset weights.
[0242] The execution feedback reward is used to measure whether the first predicted SQL statement can be executed successfully in the specified database and obtain the correct result.
[0243] In some embodiments, the feedback reward R_exec = {
[0244] +1.0, if the SQL executes successfully in the target database and the result is completely consistent with the gold standard.
[0245] +0.7, if the SQL execution succeeds but the result is partially correct (e.g., the field order is different).
[0246] +0.3, if the SQL syntax is correct but the logic is incorrect (e.g., a WHERE condition is missing).
[0247] -1.0, if SQL execution fails (syntax error, table does not exist, etc.)
[0248] 0.0, if execution fails (target database environment missing)}
[0249] The conversion consistency reward measures the degree of logical and semantic consistency among SQL statements generated by a large model for the same query statement through different generation paths, which conform to the target dialect specification.
[0250] The syntax validation reward is used to measure whether the first predicted SQL statement conforms to syntax rules and structural constraints.
[0251] In some embodiments, a lightweight SQL parser specific to a particular dialect can be used to parse the SQL statements corresponding to that dialect.
[0252] For example: Parser_result = DialectParser[D_target](SQL_direct).
[0253] # Syntax parsing and validation execution steps.
[0254] # 1. `DialectParser`: A dictionary or collection of parsers that pre-loads lightweight SQL parsers for different database dialects (such as MySQL, PostgreSQL, etc.).
[0255] # 2. `[D_target]`: Indexes and retrieves the SQL parser instance specific to the target dialect from `DialectParser` using the target dialect identifier (e.g., 'MySQL', 'PostgreSQL') as the key.
[0256] # 3. `(SQL_direct)`: Passes the predicted SQL statement (`SQL_direct`) directly generated by the model as an input parameter to the dialect parser obtained in the previous step.
[0257] # 4. `Parser_result`: Stores the parser's output. This result is used for subsequent calculation of the syntax reward (R_syntax):
[0258] #- If the SQL statement fully conforms to the target dialect specification, the parsing is successful, and the result can be used for reward calculation.
[0259] #- If the SQL statement contains a syntax error, parsing will fail and an exception will be thrown. This result will also be used as a basis for negative rewards.
[0260] # This line of code is the core of the syntax validation reward (R_syntax) signal generation, ensuring that the generated SQL conforms to the syntax rules of the target database in terms of structure.
[0261] Syntax validation reward R R_syntax = {
[0262] +0.3, if the syntax is completely correct and conforms to dialect norms
[0263] +0.1, if syntax is basically correct but has minor irregularities (such as extra spaces).
[0264] -0.1, the if statement has a minor syntax error but is still parseable (e.g., a missing semicolon).
[0265] -0.2, if a serious syntax error prevents parsing.
[0266] }
[0267] In some embodiments, reinforcement learning training can be performed using the following methods.
[0268] 1. Algorithm selection: PPO (Proximal Policy Optimization).
[0269] 2. Reward weighting configuration:
[0270] / / Standard configuration (balancing all dimensions)
[0271] =0.4 (Execution Feedback) =0.3 (consistency), =0.2 (syntax), =0.1 (recovery);
[0272] # Standard configuration (balancing all aspects)
[0273] =0.5, =0.2, =0.2, =0.1;
[0274] # Rapid Iteration Scenario Configuration (Development and Testing Environment)
[0275] =0.2, =0.4, =0.2, =0.2.
[0276] The resilience reward measures the ability of a large model to self-correct from generation errors and ultimately achieve success in a single code generation process.
[0277] R_recovery: Recovery capability reward. This reward aims to quantify and incentivize the model's ability to "correct mistakes," that is, its ability to successfully recover from a failed generation through a self-correction mechanism.
[0278] # Reward calculation in the iterative error correction process.
[0279] R_recovery = {
[0280] +0.5, if the initial generation fails but is successfully repaired through the self-correction mechanism.
[0281] +0.8, if generation succeeds on the first attempt (optimal case).
[0282] -0.3, if the initial failure also fails and the self-correction fails.
[0283] }
[0284] # Restoring the educational value of rewards: 1. Encourage the model to generate correct SQL on the first try (maximum reward +0.8); 2. Recognize the value of "knowing and correcting mistakes" (second-best reward +0.5); 3. Punish errors that cannot be self-corrected (negative reward -0.3).
[0285] This technical solution provides comprehensive and dense optimization signals for reinforcement learning training by designing a multi-dimensional hybrid reward mechanism that integrates execution feedback, logical consistency, syntax verification, and recovery capabilities. This mechanism not only ensures the functional correctness of SQL through execution rewards, but also enhances logical reliability through consistency rewards, guarantees structural standardization through syntax rewards, and incentivizes the model to develop self-correction capabilities through recovery rewards. This effectively solves the problems of sparse reward signals and unstable training in traditional methods, and ultimately drives the model to achieve synergistic optimization in accuracy, robustness, and inference capabilities. It can adapt to diverse needs from high-reliability production environments to rapid iterative development scenarios.
[0286] Figure 7 This is a flowchart illustrating a method for determining conversion consistency rewards according to an exemplary embodiment.
[0287] refer to Figure 7 The method for determining the consistency reward for the above-mentioned transformation may include the following steps.
[0288] Step S702: The first query statement is processed by the large model to generate a second predicted SQL statement that conforms to the dialect specification of the core database.
[0289] Step S704: Determine the structural similarity and semantic similarity between the first predicted SQL statement and the second predicted SQL statement.
[0290] Step S706: Determine the conversion consistency reward based on structural similarity and semantic similarity.
[0291] R_consistency (transformation consistency reward): In situations where a real database environment is unavailable (or execution is very costly), such as with some niche database dialects, it is difficult to build a perfect test environment. If only R_exec (execution feedback reward) is relied upon, the model will not receive any effective feedback when facing these niche dialects, and learning will stagnate. To solve the problem of "reward sparsity," R_consistency is designed, which uses two different paths to complete the same task and then compares the consistency of the two results.
[0292] 1. Path 1: Direct Generation.
[0293] The model receives the user's natural language query, database schema, and target dialect (such as Oracle); it directly generates an SQL query for Oracle, called SQL_direct = PolicyModel(Q, S, D_target).
[0294] 2. Path Two: First, develop the core dialect, then translate it.
[0295] The model first ignores the target dialect and converts the user's query into a version in the most familiar core dialect (such as SQLite), called SQL_base = PolicyModel(Q, S, c_base); then, it calls a "translator" (auxiliary model) to translate this core dialect version of SQL_base into the target dialect (Oracle) version, called SQL_translated = Translator(SQL_base, c_base -> D_target).
[0296] Comparison and rating.
[0297] Now we have two SQL queries that are also for Oracle: SQL_direct and SQL_translated.
[0298] In theory, if the logic of the model is perfect and consistent, the two queries should be functionally equivalent.
[0299] The reward R_consistency is calculated by comparing the structural similarity (AST-similarity) and semantic equivalence of the two queries: R_consistency = AST_similarity(SQL_direct, SQL_translated) × Semantic_equivalence(SQL_direct, SQL_translated).
[0300] / / Generate two SQL versions in parallel
[0301] SQL_direct = PolicyModel(Q, S, D_target); / / Directly generate the target dialect
[0302] SQL_base = PolicyModel(Q, S, c_base); / / Generate the core dialect first.
[0303] SQL_translated = Translator(SQL_base, c_base → D_target); / / Translate to the target dialect
[0304] / / Calculate the consistency score
[0305] R_consistency = AST_similarity(SQL_direct, SQL_translated) ×Semantic_equivalence(SQL_direct, SQL_translated).
[0306] in:
[0307] AST_similarity: The similarity of the abstract syntax tree structure ∈ [0,1];
[0308] Semantic_equivalence: Semantic equivalence score ∈ [0,1];
[0309] R_syntax: Syntax validation bonus.
[0310] This technical solution innovatively solves the problem of sparse reward signals in reinforcement learning for low-resource dialects by designing a dual-path generation and consistency comparison mechanism. By executing the direct generation path and the core dialect translation path in parallel, and comprehensively evaluating the structural similarity and semantic equivalence of the outputs of the two paths, the solution provides the model with an internally consistent reward signal that does not depend on the execution environment of the real database. This not only effectively ensures the training stability and model performance under low-resource dialects, but also significantly improves the accuracy and robustness of cross-dialect semantic understanding by forcing the model to maintain logical consistency between different generation paths.
[0311] Figure 8 This is a flowchart illustrating a code generation method according to an exemplary embodiment.
[0312] Application scenario: A large enterprise uses multiple database systems, including MySQL (customer data), PostgreSQL (product data), Oracle (financial data), and SQLite (configuration data), and needs a unified natural language query interface.
[0313] refer to Figure 8 The above code generation method may include the following specific implementation steps.
[0314] Phase 1: Self-correcting sample generation (pre-training).
[0315] Collect multi-language SQL query data within the enterprise: 3000 MySQL queries, 2500 PostgreSQL queries, 2000 Oracle queries, and 1500 SQLite queries.
[0316] Perturbation strategies are applied to common enterprise query patterns: in financial queries, the Oracle date function TO_DATE is mistakenly used as MySQL's STR_TO_DATE; in customer statistics, MySQL's GROUP_CONCAT is mistakenly used as PostgreSQL's STRING_AGG; in product analysis, PostgreSQL array operations are mistakenly used as Oracle's VARRAY syntax, etc.
[0317] Example of generating a self-correcting sample.
[0318] Error SQL:
[0319] SELECT customer_id, STRING_AGG(product_name, ',') FROM orders GROUPBY customer_id;
[0320] Target dialect: MySQL;
[0321] Correcting SQL:
[0322] SELECT customer_id, GROUP_CONCAT(product_name) FROM orders GROUP BY customer_id.
[0323] The inference chain is generated, and combined with self-correcting samples and markup language, ultimately producing 12,000 samples (each statement can generate 1-3 types of perturbation samples, approximately 9,000 at a ratio of 1.3). 1.3 The training set consists of 12,000 samples.
[0324] Self-correcting sample example. (See detailed annotations for reference.) Figure 2 The description of the illustrated embodiment will not be repeated here.
[0325] [CONTEXT]
[0326] Database Schema: CREATE TABLE employees (id INT, name VARCHAR(50),department VARCHAR(30), salary DECIMAL(10,2));
[0327] Natural Language Query: Query the average salary for each department;
[0328] Target Dialect: PostgreSQL;
[0329] [CONTEXT]
[0330] [QUESTION]
[0331] Generate SQL query for the above requirement.
[0332] [QUESTION]
[0333] [REASONING]
[0334] We need to calculate the average salary by department. First, let's try constructing the SQL:
[0335] [ERR_START|TYPE:FUNC_MISMATCH]
[0336] SELECT department, AVERAGE(salary) FROM employees GROUP BYdepartment.
[0337] [ERR_END]
[0338] Error found: The aggregate function in PostgreSQL should be AVG instead of AVERAGE;
[0339] [CORRECTION]
[0340] SELECT department, AVG(salary) FROM employees GROUP BY department;
[0341] CORRECTION
[0342] [REASONING]
[0343] [SQL]
[0344] SELECT department, AVG(salary) FROM employees GROUP BY department;
[0345] SQL
[0346] Phase Two: Model Training (Fine-tuning).
[0347] Constructing disambiguation data, as shown in the example below:
[0348] User query: "Query the top 10 customers by sales last month";
[0349] System detection: No database type specified, triggering clarification;
[0350] The system replied: "Please select the target database: [1] Customer Data (MySQL) [2] Product Data (PostgreSQL) [3] Financial Data (Oracle)";
[0351] User selection: [1];
[0352] System generated:
[0353] SELECT customer_id, SUM(amount) FROM sales WHERE date >= DATE_SUB(NOW(), INTERVAL 1 MONTH) GROUP BY customer_id ORDER BY SUM(amount) DESC LIMIT10;
[0354] MySQL was chosen as the core dialect c_base (due to the largest data volume);
[0355] Construct a mixed training set, assuming an epoch of 30:
[0356] The first 10 epochs: 100% MySQL data;
[0357] 11-20 epochs: 80% MySQL + 10% semantically resolved data + 10% other dialect mix;
[0358] 21-30 epochs: 60% MySQL + 20% disambiguation data + 20% other dialect mix.
[0359] Phase 3: Intensive Learning.
[0360] Deploy sandbox instances of each database in an enterprise testing environment.
[0361] Reward weight settings: =0.4 (Execution Feedback) =0.3 (consistency), =0.2 (syntax) =0.1 (Recovery).
[0362] Training process example:
[0363] Query: "Statistics on average salaries across departments";
[0364] Target: PostgreSQL;
[0365] Initial generation: SELECT department, AVG(salary) FROM employees GROUP BY department;
[0366] Results Analysis and Reward Calculation:
[0367] Execution result: The SQL was generated successfully on the first attempt and the result was correct, therefore the execution feedback reward R_exec is +1.0;
[0368] Consistency Bonus: Assuming that the directly generated SQL is highly consistent with the translated SQL, the consistency score R_consistency is 0.9 (example value).
[0369] Syntax Bonus: The SQL syntax is completely correct and conforms to dialect norms, so the syntax bonus R_syntax is +0.3;
[0370] Recovery capability bonus: Since the query was successfully generated in one go, it belongs to the "optimal efficiency" case with the highest efficiency. Therefore, the recovery capability and efficiency bonus R_recovery is +0.8.
[0371] Total reward calculation: R_total = (0.4 × 1.0) + (0.3 × 0.9) + (0.2 × 0.3) + (0.1 × 0.8) = 0.81.
[0372] This technical solution is applicable to intelligent SQL generation systems across database platforms and can adapt to diverse database environments in different industries and application scenarios. It can be applied to multiple professional scenarios such as data analysis platforms, business intelligence systems, database management tools, enterprise data platforms, and automated report generation.
[0373] The above method has the following core technical features.
[0374] Key technology point 1: Self-correcting learning method based on dialect-specific perturbations.
[0375] Technical means: Four systematic dialect-specific perturbation strategies (grammatical ambiguity, function conflict, data type incompatibility, and logical structure error) are adopted to generate enhanced self-correcting samples containing error-correction correspondences. Through continuous pre-training, the model internalizes cross-dialect self-correction capabilities.
[0376] Advantages compared to existing technologies: Existing technologies only focus on general logical errors, while this application designs perturbation strategies for the specific differences of database dialects; Existing technologies use random error injection, while this application uses systematic perturbation design based on dialect linguistic features; Existing technologies lack error correction ability training, while this application trains self-correction ability through error-correction sample pairs.
[0377] Key technology point two: Curriculum-based mixed dialect fine-tuning and interactive ambiguity resolution mechanism.
[0378] Technical approach: A progressive mixed dialect training strategy is designed, using the core dialect as the main component and gradually increasing the proportion of other dialects to prevent knowledge loss. Simultaneously, an interactive ambiguity resolution module is integrated, dynamically injecting explicit dialect contextual information through dialect intent detection and user interaction clarification.
[0379] Advantages compared to existing technologies: Existing technologies use separate training, which leads to knowledge forgetting, while this application maintains multi-dialect capabilities through hybrid training; Existing technologies cannot handle intent ambiguity, while this application provides an active ambiguity identification and resolution mechanism; Existing technologies rely on fixed dialect selection, while this application supports dynamic context injection and adaptation.
[0380] Core Technology Point 3: Hybrid Reward Machine Based on Execution Feedback and Logical Consistency.
[0381] Technical approach: A multi-dimensional reward function is designed that integrates execution feedback (R_exec), transformation consistency (R_consistency), syntax verification (R_syntax), and recovery capability (R_recovery). Among them, transformation consistency is calculated by comparing the results of direct generation and translation transformation, providing dense logical correctness signals for low-resource dialects.
[0382] Advantages compared to existing technologies: Existing technologies rely solely on sparse rewards from execution results, while this invention provides multi-dimensional dense reward signals; Existing technologies cannot handle dialects without an execution environment, while this invention solves the reward sparsity problem through logical consistency; Existing technologies ignore the value of error recovery, while this invention quantifies the behavioral value of "knowing and correcting mistakes" through recovery rewards.
[0383] Key auxiliary technology point four: Multi-dialect linguistic feature modeling and weight adaptive mechanism.
[0384] Technical approach: Based on the linguistic differences in dialects within the database, a structured knowledge base is established, including grammatical rules, function mappings, and data type correspondences. An adaptive weight adjustment mechanism is also designed. It supports personalized optimization for different application scenarios.
[0385] Advantages compared to existing technologies: Existing technologies use uniform model parameters, while this invention supports scenario-based adaptive weight adjustment; Existing technologies ignore the structured relationships between dialects, while this invention establishes a systematic model based on linguistic features; Existing technologies lack interpretability, while this invention provides transparent conversion logic through a structured knowledge base.
[0386] It should be particularly noted that the steps in each embodiment of the above-described method for generating structured query statements can be overlapped, substituted, added to, or deleted from each other. Therefore, these reasonable permutations and combinations of the method for generating structured query statements should also fall within the protection scope of this disclosure, and the protection scope of this disclosure should not be limited to the embodiments.
[0387] It should be noted that the scope of protection of this application should include, but is not limited to, the specific implementation methods described in the embodiments. Any alternative solution that uses a different name but substantially performs the same function and achieves the same technical effect falls within the scope of protection defined by the claims of this application.
[0388] Based on the same inventive concept, this disclosure also provides a structured query statement generation apparatus, as shown in the following embodiment. Since the principle by which this apparatus solves the problem is similar to that of the method embodiment described above, the implementation of this apparatus embodiment can refer to the implementation of the method embodiment described above, and repeated details will not be elaborated further.
[0389] Figure 9 This is a block diagram illustrating an apparatus for generating structured query statements according to an exemplary embodiment. (Refer to...) Figure 9 The structured query statement generation device 900 provided in this embodiment may include: a corpus acquisition module 901, a perturbation strategy acquisition module 902, an error injection module 903, a self-correcting sample generation module 904, and a pre-training module 905.
[0390] The system includes several modules: a corpus acquisition module 901, which acquires a multi-dialect corpus covering various database dialects, including multiple code generation samples; one of these code generation samples includes a natural language query statement, database dialect identifiers, and a standard SQL query statement generated from the natural language query statement that conforms to the database dialect; a perturbation strategy acquisition module 902, which acquires multiple dialect-specific perturbation strategies, where dialect-specific perturbation strategies refer to error injection schemes designed to address the systematic differences in syntax, functions, and data types among different database dialects; an error injection module 903, which applies dialect-specific perturbation strategies to inject errors into the standard SQL query statements of each code generation sample in the multi-dialect corpus, generating corresponding perturbed SQL query statements; a self-correcting sample generation module 904, which generates self-correcting code generation samples based on the perturbed SQL query statements and their corresponding standard SQL query statements; and a pre-training module 905, which performs full-parameter pre-training on the large model based on the self-correcting code generation samples covering various database dialects, enabling the large model to acquire the ability to convert natural language query statements into standard SQL query statements in a specified database dialect.
[0391] It should be noted that the corpus acquisition module 901, perturbation strategy acquisition module 902, error injection module 903, self-correcting sample generation module 904, and pre-training module 905, corresponding to steps S202 to S210 in the method embodiments, implement the same examples and application scenarios as the corresponding steps, but are not limited to the content disclosed in the above method embodiments. It should also be noted that the above modules, as part of the apparatus, can be executed in a computer system such as a set of computer-executable instructions.
[0392] In some embodiments, the structured query statement generation apparatus 900 may further include: an ambiguity resolution sample acquisition module and an ambiguity fine-tuning module.
[0393] The ambiguity resolution sample acquisition module can be used to acquire ambiguity resolution code generation samples, which include ambiguous natural language query statements, unambiguous natural language query statements corresponding to the ambiguous natural language query statements, and ambiguity correction prompts. The ambiguity fine-tuning module can be used to fine-tune the large model based on the ambiguity resolution code generation samples, so that the large model can acquire the ability to identify and clarify query ambiguities during the code generation process.
[0394] In some embodiments, the structured query statement generation device 900 may further include: a first core dialect determination module, a first mixed training set determination module, a first fine-tuning module, a second fine-tuning module, and a third fine-tuning module.
[0395] The first core dialect determination module can be used to determine the database dialect with the largest amount of data in a multi-dialect corpus as the core dialect, and to designate other database dialects as non-core dialects; the first mixed training set determination module can be used to generate a first mixed training set based on code generation samples and ambiguity resolution code generation samples in the multi-dialect corpus; the first fine-tuning module can be used in the early stage of fine-tuning the large model using the first mixed training set to mix core dialect code generation samples, ambiguity resolution code generation samples, and non-core dialect code generation samples in an N1:1:1 ratio; the second... The fine-tuning module can be used in the middle stage of fine-tuning the large model using the first mixed training set, mixing core dialect code generation samples, ambiguity resolution code generation samples, and non-core dialect code generation samples in a ratio of N2:1:1; the third fine-tuning module can be used in the later stage of fine-tuning the large model using the first mixed training set, mixing core dialect code generation samples, ambiguity resolution code generation samples, and non-core dialect code generation samples in a ratio of N3:1:1, where N1>N2>N3; N1 is an integer greater than 1, N2 is an integer greater than 1, and N3 is an integer greater than 1.
[0396] In some embodiments, the structured query statement generation apparatus 900 may further include: a second core dialect determination module, a fourth fine-tuning module, a fifth fine-tuning module, and a sixth fine-tuning module.
[0397] The second core dialect determination module can be used to identify the database dialect with the largest amount of data in a multi-dialect corpus as the core dialect, and to designate other database dialects as non-core dialects. The fourth fine-tuning module can be used in the early stage of fine-tuning the large model through multiple code-generated samples, mixing core dialect code-generated samples and non-core dialect code-generated samples at a ratio of N4:1. The fifth fine-tuning module can be used in the middle stage of fine-tuning the large model through multiple code-generated samples, mixing core dialect code-generated samples and non-core dialect code-generated samples at a ratio of N5:1. The sixth fine-tuning module can be used in the later stage of fine-tuning the large model through multiple code-generated samples, mixing core dialect code-generated samples and non-core dialect code-generated samples at a ratio of N6:1, where N4>N5>N6; N4 is an integer greater than 1, N5 is an integer greater than 1, and N6 is an integer greater than 1.
[0398] In some embodiments, the multiple code generation samples include a first sample, which includes a first query statement, an identifier of the target database dialect, and a standard SQL query statement that conforms to the target database dialect specification and is generated based on the first query statement; wherein, the structured query statement generation device 900 may further include: a first prediction statement generation module and a reward determination module.
[0399] The first predicted statement generation module can be used to generate code for the first query statement through a large model, generating a first predicted SQL statement that conforms to the dialect specification of the target database. The reward determination module can be used to determine at least one generation reward function value among execution feedback reward, transformation consistency reward, syntax verification reward, and recovery capability reward based on the first predicted SQL statement. The reward function value is used to execute reinforcement learning strategies to drive the parameter update of the large model. The execution feedback reward is used to measure whether the first predicted SQL statement can be executed successfully in the specified database and obtain the correct result. The transformation consistency reward is used to measure the degree of logical and semantic consistency of the SQL statements that conform to the target dialect specification generated by the large model for the same query statement through different generation paths. The syntax verification reward is used to measure whether the first predicted SQL statement conforms to the syntax specification and structural constraints. The recovery capability reward is used to measure the ability of the large model to self-correct from generation errors and eventually achieve success in a single code generation process.
[0400] In some embodiments, the structured query statement generation apparatus 900 may further include: a second prediction statement generation module, a similarity determination module, and a conversion consistency reward determination module.
[0401] The second predicted statement generation module can be used to process the code of the first query statement through a large model to generate a second predicted SQL statement that conforms to the dialect specification of the core database; the similarity determination module can be used to determine the structural similarity and semantic similarity between the first predicted SQL statement and the second predicted SQL statement; and the transformation consistency reward determination module can be used to determine the transformation consistency reward based on the structural similarity and semantic similarity.
[0402] In some embodiments, the structured query statement generation apparatus 900 may further include an inference chain determination module.
[0403] The inference chain determination module can be used to generate a correct intermediate inference chain for each code generation sample in the multi-dialect corpus. The intermediate inference chain describes the reasoning process of generating a standard SQL query statement from a natural language query statement. The self-correcting code generation sample also includes an intermediate inference chain for generating a standard SQL query statement from a natural language query statement.
[0404] In some embodiments, error correction information is marked in the self-correcting code generation samples using structured tags to provide clear error correction signals for the large model. The structured tags include error fragment boundary identifiers, error attribution metadata identifiers, and correction version identifiers.
[0405] Since the functions of the device 900 have been described in detail in their respective method embodiments, they will not be repeated here.
[0406] The modules and / or submodules described in the embodiments of this disclosure can be implemented in software or hardware. The described modules and / or submodules can also be located in a processor. The names of these modules and / or submodules do not, in some cases, constitute a limitation on the module and / or submodule itself.
[0407] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a portion of a module or program segment containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer program instructions.
[0408] Furthermore, the above figures are merely illustrative of the processes included in the method according to exemplary embodiments of this disclosure and are not intended to be limiting. It is readily understood that the processes shown in the above figures do not indicate or limit the temporal order of these processes. Additionally, it is readily understood that these processes may be executed synchronously or asynchronously, for example, in multiple modules.
[0409] Figure 10 A schematic diagram of an electronic device suitable for implementing embodiments of the present disclosure is shown. It should be noted that... Figure 10 The illustrated electronic device 1000 is merely an example and should not be construed as limiting the functionality and scope of use of the embodiments disclosed herein.
[0410] like Figure 10As shown, the electronic device 1000 includes a central processing unit (CPU) 1001, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage section 1008 into a random access memory (RAM) 1003. The RAM 1003 also stores various programs and data required for the operation of the electronic device 1000. The CPU 1001, ROM 1002, and RAM 1003 are interconnected via a bus 1004. An input / output (I / O) interface 1005 is also connected to the bus 1004.
[0411] The following components are connected to I / O interface 1005: an input section 1006 including a keyboard, mouse, etc.; an output section 1007 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and speakers, etc.; a storage section 1008 including a hard disk, etc.; and a communication section 1009 including a network interface card such as a LAN card, modem, etc. The communication section 1009 performs communication processing via a network such as the Internet. A drive 1010 is also connected to I / O interface 1005 as needed. A removable medium 1011, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., is installed on drive 1010 as needed so that computer programs read from it can be installed into storage section 1008 as needed.
[0412] In particular, according to embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this disclosure include a computer program product comprising a computer program carried on a computer-readable storage medium, the computer program containing computer program instructions for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication section 1009, and / or installed from removable medium 1011. When the computer program is executed by central processing unit (CPU) 1001, it performs the functions defined above in the system of this disclosure.
[0413] It should be noted that the computer-readable storage medium disclosed herein may be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium may be, for example,—but not limited to—an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this disclosure, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In this disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable computer program instructions. Such propagated data signals may take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. Computer-readable signal media can also be any computer-readable storage medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Computer program instructions contained on a computer-readable storage medium can be transmitted using any suitable medium, including but not limited to: wireless, wire, optical fiber, RF, etc., or any suitable combination thereof.
[0414] In another aspect, this disclosure also provides a computer-readable storage medium, which may be included in the device described in the above embodiments; or it may exist independently and not assembled into the device. The aforementioned computer-readable storage medium carries one or more programs, which, when executed by a device, enable the device to perform the following functions: acquiring a multi-dialect corpus covering multiple database dialects, the multi-dialect corpus including multiple code generation samples; one of the code generation samples includes a natural language query statement, a database dialect identifier, and a standard SQL query statement conforming to the database dialect specification generated based on the natural language query statement; acquiring multiple dialect-specific perturbation strategies, wherein the dialect-specific perturbation strategy refers to an error injection scheme designed to address the systematic differences in syntax, functions, and data types among different database dialects; for each code generation sample in the multi-dialect corpus, applying the dialect-specific perturbation strategy to inject errors into the standard SQL query statement, generating a corresponding perturbation SQL query statement; generating a self-correcting code generation sample based on the perturbation SQL query statement and the corresponding standard SQL query statement; and performing full-parameter pre-training on a large model based on the self-correcting code generation samples covering multiple database dialects, so that the large model acquires the ability to convert natural language query statements into standard SQL query statements under a specified database dialect.
[0415] According to one aspect of this disclosure, a computer program product or computer program is provided, comprising computer program instructions stored in a computer-readable storage medium. The computer program instructions are read from the computer-readable storage medium, and a processor executes the computer program instructions to implement the methods provided in various optional implementations of the above embodiments.
[0416] From the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein can be implemented by software or by combining software with necessary hardware. Therefore, the technical solutions of the embodiments of this disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or portable hard drive) and includes several computer program instructions to cause an electronic device (such as a server or terminal device) to execute the method according to the embodiments of this disclosure.
[0417] Other embodiments of this disclosure will readily occur to those skilled in the art upon consideration of the specification and practice disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this disclosure are indicated by the claims.
[0418] It should be understood that this disclosure is not limited to the detailed structures, drawing arrangements or implementations shown herein; rather, this disclosure is intended to cover various modifications and equivalent arrangements contained within the spirit and scope of the appended claims.
Claims
1. A method for generating structured query statements, characterized in that, include: Acquire a multi-dialect corpus covering multiple database dialects, the multi-dialect corpus including multiple code generation samples; One of the code generation samples includes a natural language query statement, a database dialect identifier, and a standard SQL query statement that conforms to the database dialect specification and is generated based on the natural language query statement. Multiple dialect-specific perturbation strategies are obtained, wherein the dialect-specific perturbation strategies refer to error injection schemes designed to address the systematic differences in syntax, functions and data types of different database dialects; Several dialect-specific perturbation strategies include: transforming standard SQL statements that conform to the specifications into statements containing specific syntax errors; replacing correct functions in standard SQL statements with functions that are invalid or functionally abnormal in the target dialect; replacing correct data types in standard SQL statements with data types that are incompatible, unsupported, or semantically different in the target dialect; and shuffling the correct logical order of key clauses in SQL statements or modifying their logical operators to generate SQL statements that are syntactically correct but have logical errors in semantics. For each code sample generated in the multi-dialect corpus, the dialect-specific perturbation strategy is applied to inject errors into the standard SQL query statements to generate corresponding perturbation SQL query statements; Based on the perturbation SQL query statement and the corresponding standard SQL query statement, a self-correcting code generation sample is generated. Based on self-correcting code generation samples covering multiple database dialects, a large model is pre-trained with full parameters to enable the large model to convert natural language query statements into standard SQL query statements in a specified database dialect.
2. The method according to claim 1, characterized in that, The method further includes: Obtain a sample of disambiguation code generation, which includes an ambiguous natural language query statement, an unambiguous natural language query statement corresponding to the ambiguous natural language query statement, and an ambiguity correction prompt word; The large model is fine-tuned based on the ambiguity resolution code generation samples to enable it to identify and clarify query ambiguities during code generation.
3. The method according to claim 2, characterized in that, The method further includes: In the multi-dialect corpus, the database dialect with the largest amount of data is identified as the core dialect, and the database dialects other than the core dialect are identified as non-core dialects. A first mixed training set is generated based on the code generation samples in the multi-speech corpus and the ambiguity resolution code generation samples. In the early stage of fine-tuning the large model using the first mixed training set, the core dialect code generation samples, the ambiguity resolution code generation samples, and the non-core dialect code generation samples are mixed in an N1:1:1 ratio. In the intermediate stage of fine-tuning the large model using the first mixed training set, the core dialect code generation samples, the ambiguity resolution code generation samples, and the non-core dialect code generation samples are mixed in a ratio of N2:1:
1. In the later stage of fine-tuning the large model using the first mixed training set, the core dialect code generation samples, the ambiguity resolution code generation samples, and the non-core dialect code generation samples are mixed in a ratio of N3:1:1, where N1>N2>N3; N1 is an integer greater than 1, N2 is an integer greater than 1, and N3 is an integer greater than 1.
4. The method according to claim 1, characterized in that, The method further includes: In the multi-dialect corpus, the database dialect with the largest amount of data is identified as the core dialect, and the database dialects other than the core dialect are identified as non-core dialects. In the early stage of fine-tuning the large model using the multiple code-generated samples, core dialect code-generated samples and non-core dialect code-generated samples are mixed in a ratio of N4:
1. In the intermediate stage of fine-tuning the large model using the multiple code generation samples, the core dialect code generation samples and the non-core dialect code generation samples are mixed in a ratio of N5:
1. In the later stage of fine-tuning the large model using the multiple code-generated samples, the core dialect code-generated samples and the non-core dialect code-generated samples are mixed in a ratio of N6:1, where N4>N5>N6; N4 is an integer greater than 1, N5 is an integer greater than 1, and N6 is an integer greater than 1.
5. The method according to claim 1, characterized in that, The plurality of code generation samples include a first sample, which includes a first query statement, an identifier of the target database dialect, and a standard SQL query statement generated based on the first query statement that conforms to the target database dialect specification; wherein, the method further includes: The first query statement is processed by the large model to generate a first predicted SQL statement that conforms to the dialect standard of the target database. Based on the first predicted SQL statement, at least one of the following reward function values is determined: execution feedback reward, transformation consistency reward, syntax verification reward, and recovery capability reward; wherein the reward function value is used to execute a reinforcement learning strategy to drive the update of the large model parameters. The execution feedback reward is used to measure whether the first predicted SQL statement can be executed successfully in the specified database and obtain the correct result; The conversion consistency reward is used to measure the degree of logical and semantic consistency between the SQL statements generated by the large model for the same query statement through different generation paths and conforming to the target dialect specification. The syntax verification reward is used to measure whether the first predicted SQL statement conforms to syntax rules and structural constraints. The resilience reward measures the ability of the large model to self-correct from generation errors and ultimately achieve success in a single code generation process.
6. The method according to claim 5, characterized in that, The method further includes: The first query statement is processed by the large model to generate a second predictive SQL statement that conforms to the core database dialect standard. Determine the structural and semantic similarity between the first predicted SQL statement and the second predicted SQL statement; The conversion consistency reward is determined based on the structural similarity and semantic similarity.
7. The method according to claim 1, characterized in that, The method further includes: For each code sample in the multi-dialect corpus, a correct intermediate inference chain is generated, which describes the reasoning process of generating a standard SQL query from a natural language query. The self-correcting code generation sample also includes an intermediate inference chain that generates the standard SQL query statement from the natural language query statement.
8. The method according to claim 1, characterized in that, Error correction information is marked in the self-correcting code generation samples using structured tags to provide clear error correction signals for the large model. The structured tags include error fragment boundary identifiers, error attribution metadata identifiers, and correction version identifiers.
9. A device for generating structured query statements, characterized in that, include: The corpus acquisition module is used to acquire a multi-dialect corpus covering multiple database dialects, and the multi-dialect corpus includes multiple code generation samples; One of the code generation samples includes a natural language query statement, a database dialect identifier, and a standard SQL query statement that conforms to the database dialect specification and is generated based on the natural language query statement. The perturbation strategy acquisition module is used to acquire multiple dialect-specific perturbation strategies, wherein the dialect-specific perturbation strategy refers to the error injection scheme designed to address the systematic differences in syntax, functions and data types of different database dialects; Several dialect-specific perturbation strategies include: transforming standard SQL statements that conform to the specifications into statements containing specific syntax errors; replacing correct functions in standard SQL statements with functions that are invalid or functionally abnormal in the target dialect; replacing correct data types in standard SQL statements with data types that are incompatible, unsupported, or semantically different in the target dialect; and shuffling the correct logical order of key clauses in SQL statements or modifying their logical operators to generate SQL statements that are syntactically correct but have logical errors in semantics. The error injection module is used to generate samples for each code in the multi-dialect corpus, apply the dialect-specific perturbation strategy, inject errors into the standard SQL query statements, and generate corresponding perturbation SQL query statements. The self-correcting sample generation module is used to generate self-correcting code generation samples based on the perturbation SQL query statement and the corresponding standard SQL query statement; The pre-training module is used to perform full-parameter pre-training on a large model based on self-correcting code generation samples covering multiple database dialects, so that the large model can acquire the ability to convert natural language query statements into standard SQL query statements in a specified database dialect.
10. An electronic device, characterized in that, include: Memory and processor; The memory is used to store computer program instructions; the processor calls the computer program instructions stored in the memory to implement the method for generating structured query statements as described in any one of claims 1-8.
11. A computer-readable storage medium storing computer program instructions, characterized in that, When the computer program instructions are executed by the processor, they implement the method for generating structured query statements as described in any one of claims 1-8.
12. A computer program product comprising computer program instructions stored in a computer-readable storage medium, characterized in that, When the computer program instructions are executed by the processor, they implement the method according to any one of claims 1-8.