A text2sql generation method based on self-correction
By using a structured query generation task model based on meta-learning and self-supervised learning, and dynamically optimizing and revising SQL query statements multiple times, the high error rate and poor adaptability of existing technologies in generating SQL query statements are solved, achieving more efficient and flexible SQL query generation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JOINT WARFARE COLLEGE NAT DEFENSE UNIV OF THE CHINESE PEOPLES LIBERATION ARMY
- Filing Date
- 2025-01-14
- Publication Date
- 2026-06-26
Abstract
Description
Technical Field
[0001] This invention belongs to the field of artificial intelligence technology, and in particular relates to a Text2SQL generation method based on self-correction.
Background Technology
[0002] With the rapid development of natural language processing technology, the technology of converting natural language queries into structured query language (SQL) (i.e., Text2SQL) allows users to interact with databases in natural language and is widely used in scenarios such as intelligent question answering systems, business intelligence analysis, and automated data querying. However, existing Text2SQL technology still faces the following key problems in practical applications: traditional models rely on data from specific domains or with fixed structures for training, making it difficult to quickly adapt to new datasets and limiting their widespread application; in complex query scenarios (such as multi-table joins and nested queries), generated SQL query statements are prone to errors, lack automatic error correction capabilities, and rely solely on manual correction, resulting in low error correction efficiency; static templates or manually designed patterns are difficult to meet dynamically changing and complex query needs, leading to low efficiency and insufficient flexibility in generating SQL query statements.
[0003] Therefore, based on the above problems, there is an urgent need to propose a structured query statement generation method to improve the accuracy and reliability of generated SQL query statements.
Summary of the Invention
[0004] In order to at least solve one or more of the technical problems mentioned above, the present invention proposes a self-correcting Text2SQL generation scheme in several aspects.
[0005] In a first aspect, embodiments of the present invention provide a Text2SQL generation method based on self-correction. The method includes: acquiring a first natural language query text; testing the first natural language query text using a pre-trained structured query generation task model to obtain a first structured query statement corresponding to the first natural language query text, wherein the structured query generation task model is trained using a meta-learning mechanism and a self-supervised learning mechanism; parsing the first natural language query text using a pre-trained large language model to obtain a query operation mode corresponding to the first natural language query text; dynamically optimizing the first structured query statement to obtain a second structured query statement when it is determined that the first structured query statement does not meet the query requirements corresponding to the query operation mode; executing the second structured query statement to obtain a first query result; and correcting and optimizing the second structured query statement based on the first query result to obtain a third structured query statement, wherein the third structured query statement is output as a target structured query statement corresponding to the first natural language query text.
[0006] In some embodiments, the method further includes: when it is determined that the first structured query statement meets the query requirements corresponding to the query operation mode, executing the first structured query statement to obtain a second query result; when it is determined that the second query result does not meet the expected result, modifying and optimizing the first structured query statement according to the second query result to obtain a fourth structured query statement, and outputting the fourth structured query statement as a target structured query statement corresponding to the first natural language query text.
[0007] In some embodiments, dynamically optimizing a first structured query statement includes: selecting a query template related to a first natural language query text from a pre-configured template library; and optimizing the first structured query statement based on the query template to obtain a second structured query statement.
[0008] In some embodiments, the first query result includes an error type indicating an error present in the second structured query statement. Correcting and optimizing the second structured query statement based on the first query result includes: determining an adjustment strategy corresponding to the error type; and correcting the second structured query statement according to the adjustment strategy corresponding to the error type.
[0009] In some embodiments, training the structured query statement generation task model according to the meta-learning mechanism and the self-supervised learning mechanism includes: constructing a labeled first training dataset, the first training dataset including multiple training data pairs, each training data pair including training natural language query text corresponding to a real application scenario, and training structured query statements corresponding to the training natural language query text; obtaining second training datasets corresponding one-to-one with each of multiple different open-source datasets; pre-training the structured query statement generation task model to be trained using the multiple second training datasets and the first training dataset according to the meta-learning algorithm to obtain a preliminary structured query statement generation task model; fine-tuning the preliminary structured query statement generation task model to obtain the current structured query statement generation task model; and supervising the output of the current structured query statement generation task model according to the self-supervised learning algorithm to obtain a trained structured query statement generation task model.
[0010] In some embodiments, before pre-training the structured query generation task model to be trained using multiple second training datasets and a first training dataset according to the meta-learning algorithm, the method further includes: performing semantic expansion on each second natural language query text using a pre-trained large language model to obtain multiple semantically identical but differently expressed third natural language query texts corresponding to each second natural language query text; constructing a third training dataset based on the correspondence between the multiple third natural language query texts, the second natural language query texts, and the training structured query statements; and adding the third training dataset to the first training dataset.
[0011] In some embodiments, after outputting the target structured query statement corresponding to the first natural language query text, the method further includes: using the first natural language query text and the target structured query statement as a new training data pair; and adding the new training data pair to the first training dataset to obtain an updated first training dataset.
[0012] This invention provides a self-correcting Text2SQL generation method. The method first obtains a first natural language query text; then, it tests the first natural language query text using a pre-trained structured query generation task model to obtain a first structured query statement corresponding to the first natural language query text. The structured query generation task model is trained using a meta-learning mechanism and a self-supervised learning mechanism. Next, it parses the first natural language query text using a pre-trained large language model to obtain a query operation pattern corresponding to the first natural language query text. Then, when it is determined that the first structured query statement does not meet the query requirements corresponding to the query operation pattern, the first structured query statement is dynamically optimized to obtain a second structured query statement. The second structured query statement is executed to obtain a first query result. Finally, when it is determined that the first query result does not meet the expected result, the second structured query statement is corrected and optimized based on the first query result to obtain a third structured query statement. The third structured query statement is output as the target structured query statement corresponding to the first natural language query text. Compared with related technologies, the technical solution provided by the embodiments of the present invention performs multiple verifications on the SQL query statements generated by model prediction, automatically discovers and corrects potential errors, and effectively improves the accuracy and reliability of the generated SQL query statements.
Attached Figure Description
[0013] Other features, objects, and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
[0014] Figure 1 is an exemplary schematic diagram of a self-correcting Text2SQL generation method 100 according to an embodiment of the present invention;
[0015] Figure 2 shows an exemplary flowchart of a self-correcting Text2SQL generation method 200 according to other embodiments of the present invention;
[0016] Figure 3 shows an exemplary flowchart of a structured query statement generation task model training method 300 according to other embodiments of the present invention;
[0017] Figure 4 is a schematic diagram of the structure of a self-correcting Text2SQL generation device 400 according to an embodiment of the present invention;
[0018] Figure 5 shows a structural schematic diagram of a structured query statement generation task model training apparatus 500 according to other embodiments of the present invention;
[0019] Figure 6 shows a schematic block diagram of an electronic device 600 according to an embodiment of the present invention.
[0020] The same or similar reference numerals in the accompanying drawings represent the same or similar parts.
Detailed Implementation
[0021] To better understand and explain this invention, a further detailed description will be provided below with reference to the accompanying drawings. This invention is not limited to these specific embodiments. Rather, any modifications or equivalent substitutions made to this invention should be covered within the scope of the claims.
[0022] It should be noted that numerous specific details are provided in the following detailed embodiments. Those skilled in the art should understand that the present invention can be practiced without these specific details. In the various detailed embodiments given below, principles, structures, and components well known in the art are not described in detail in order to highlight the spirit of the invention.
[0023] The self-correcting Text2SQL generation method provided by this invention can be executed by a computer device, which can be a terminal or a server. The terminal can be a smartphone, tablet, laptop, touchscreen, personal computer (PC), personal digital assistant (PDA), or other similar device, and may also include a client. The server can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
[0024] Please refer to Figure 1, which is an exemplary schematic diagram of a self-correcting Text2SQL generation method 100 according to an embodiment of the present invention. As shown in Figure 1, the method includes:
[0025] In step S101, the first natural language query text is obtained.
[0026] In the step described above, natural language query text refers to a user's query request for a database, expressed in natural language. The first natural language query text is the natural language query text that the electronic device receives after model training is complete and that is to be processed.
[0027] In step S102, the trained structured query statement generation task model is used to test the first natural language query text to obtain the first structured query statement corresponding to the first natural language query text. The structured query statement generation task model is trained and learned based on the meta-learning mechanism and the self-supervised learning mechanism.
[0028] In the steps described above, the structured query generation task model is trained using meta-learning and self-supervised learning mechanisms. It can be a pre-trained large language model (such as a Generative Pre-trained Transformer, GPT) or a deep learning model optimized for Text2SQL tasks. Meta-learning algorithms, such as MAML (Model-Agnostic Meta-Learning) and Reptile, iteratively update the model parameters, enabling the model to adapt quickly to new tasks from limited sample data.
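To make the meta-learning idea concrete, the following is a minimal sketch of a Reptile-style outer loop on a toy problem. It is an illustration under stated assumptions, not the patent's training procedure: the "model" is a single weight, each "task" is fitting y = slope * x, and the names `sgd_steps`, `reptile`, and all hyperparameters are hypothetical.

```python
import random

def sgd_steps(theta, task_slope, steps=5, lr=0.1, rng=None):
    """Inner loop: a few SGD steps on one toy task (fit y = task_slope * x)."""
    rng = rng or random.Random(0)
    w = theta
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        grad = 2.0 * (w * x - task_slope * x) * x  # gradient of squared error w.r.t. w
        w -= lr * grad
    return w

def reptile(task_slopes, meta_steps=200, meta_lr=0.1, seed=0):
    """Outer loop: move the shared weight toward each task-adapted weight."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(meta_steps):
        slope = rng.choice(task_slopes)           # sample a task
        adapted = sgd_steps(theta, slope, rng=rng)
        theta += meta_lr * (adapted - theta)      # Reptile update rule
    return theta

# Toy tasks with slopes 1, 2, and 3 (assumed values for illustration).
theta = reptile([1.0, 2.0, 3.0])
```

The learned starting point ends up between the task optima, so a few inner steps suffice to adapt to any one of them. In the patent's setting the tasks would presumably be the different training datasets, which this toy example does not model.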
[0029] Structured query statements are commands used to retrieve specific information from a database. SQL (Structured Query Language) is a programming language specifically designed for managing relational database systems. Through SQL queries, users can specify the data they want to retrieve from the database, including which columns to select, which rows to filter, and how to sort and group the results. The basic structure of an SQL query typically includes the following main parts: The SELECT keyword is followed by the column names (or expressions) to be retrieved, specifying which columns should be included in the query results. The FROM keyword is followed by the name of the table to query. This part specifies which table the data comes from. The WHERE clause specifies the filtering conditions; only rows that meet these conditions will be included in the query results. The GROUP BY clause groups the query results by one or more columns. This clause is often used with aggregate functions (such as COUNT, SUM, AVG, etc.) to perform aggregation operations on each group. The ORDER BY clause specifies how the query results should be sorted. This clause can be followed by one or more column names, and the sort order (ascending or descending) for each column.
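The clauses described above can be seen together in one short query. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `orders` table and its rows are made-up sample data, not taken from the patent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (region TEXT, year INTEGER, order_amount REAL);
INSERT INTO orders VALUES
  ('East', 2019, 100.0), ('East', 2019, 50.0),
  ('West', 2019, 70.0),  ('East', 2018, 30.0);
""")

# SELECT picks columns, FROM names the table, WHERE filters rows,
# GROUP BY forms one group per region, ORDER BY sorts the groups.
rows = conn.execute("""
SELECT region, SUM(order_amount) AS total
FROM orders
WHERE year = 2019
GROUP BY region
ORDER BY total DESC
""").fetchall()
```

Here `rows` holds one (region, total) tuple per region for 2019, largest total first.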
[0030] The trained structured query generation task model is used to test the first natural language query text, generating the corresponding first structured query statement. Assume the user inputs the following natural language query text: "Query the order amount, payment method, and order status of customers in East China in 2019". The trained structured query generation task model can generate the corresponding SQL query as follows:
[0031] SELECT order_amount, payment_method FROM orders WHERE region='East' AND year=2019;
[0032] In step S103, the first natural language query text is parsed using a pre-trained large language model to obtain the query operation mode corresponding to the first natural language query text.
[0033] In step S104, when it is determined that the first structured query statement does not meet the query requirements corresponding to the query operation mode, the first structured query statement is dynamically optimized to obtain the second structured query statement.
[0034] In step S105, the second structured query statement is executed to obtain the first query result.
[0035] In step S106, when it is determined that the first query result does not meet the expected result, the second structured query statement is modified and optimized according to the first query result to obtain a third structured query statement. The third structured query statement is output as the target structured query statement corresponding to the first natural language query text.
[0036] In the steps described above, the query operation pattern refers to the query operation that needs to be performed in the database, determined based on the query intent of the natural language query text. For example, query operation patterns include, but are not limited to, multi-table joins, nested queries, and complex aggregations. In a relational database management system (RDBMS), multi-table joins, nested queries, and complex aggregations are three common query operations used to retrieve and summarize data from one or more tables.
[0037] A multi-table join is an SQL query operation that allows two or more tables to be combined based on certain common columns (usually primary and foreign keys), thereby retrieving data from multiple tables in the same query. Types of multi-table joins include inner join, left outer join, right outer join, full outer join, and cross join. For example, if you have two tables, one is an orders table and the other is a customers table, you can use a multi-table join to join these two tables based on the customer ID (assuming it's a common column between the two tables) so that you can see order information and its corresponding customer information in the same query result.
[0038] A nested query, also known as a subquery, is an SQL query that embeds another query within a single query. The outer query (also called the main query) depends on the result of the inner query (subquery). Nested queries are often used when you need to determine conditions based on the result of another query. For example, you might want to find all orders whose total amount is greater than the average order amount. In this case, you would write a nested query where the inner query calculates the average amount of all orders, and the outer query finds all orders that are greater than that average.
[0039] Complex aggregation refers to the use of aggregate functions (such as SUM, AVG, COUNT, MAX, MIN, etc.) in SQL queries to perform more complex analysis and calculations. Complex aggregation may involve aggregation calculations on multiple columns, grouped aggregation (GROUP BY), filtering with HAVING clauses, and aggregation operations after joining multiple tables. For example, to calculate the total order amount for each customer and only display those customers whose total order amount exceeds a certain threshold, one might first group by customer, then calculate the total order amount for each customer, and finally filter out records whose total amount exceeds the threshold. These three concepts represent different types of operations in SQL queries, demonstrating the powerful capabilities of SQL to handle and analyze complex data relationships stored in relational databases.
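The three query operation patterns can be illustrated side by side. This is a small sketch using Python's built-in `sqlite3` module; the `customers` and `orders` tables, their rows, and the threshold of 200 are assumed sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_amount REAL);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bo');
INSERT INTO orders VALUES (1, 1, 100.0), (2, 1, 300.0), (3, 2, 50.0);
""")

# Multi-table join: pair each order with its customer via the shared customer_id.
joined = conn.execute("""
SELECT c.customer_name, o.order_amount
FROM orders o JOIN customers c ON o.customer_id = c.customer_id
ORDER BY o.order_id
""").fetchall()

# Nested query: orders whose amount exceeds the average order amount.
above_avg = conn.execute("""
SELECT order_id FROM orders
WHERE order_amount > (SELECT AVG(order_amount) FROM orders)
""").fetchall()

# Complex aggregation: per-customer totals, keeping only totals above a threshold.
big_customers = conn.execute("""
SELECT c.customer_name, SUM(o.order_amount) AS total
FROM orders o JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.customer_name
HAVING SUM(o.order_amount) > 200
""").fetchall()
```

With this sample data the average order amount is 150, so only order 2 passes the nested query, and only Ann's total of 400 passes the HAVING filter.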
[0040] When it is determined that the first structured query statement does not meet the query requirements corresponding to the query operation mode, the first structured query statement is dynamically optimized to obtain the second structured query statement. For example, the first natural language query text is: "Query the order amount, payment method, and order status of customers in East China in 2019". The first SQL query statement does not include the "order status" field, nor does it process the associated customer information; therefore, the first SQL query statement does not fully meet the query requirements. To solve this problem, a query template related to the first natural language query text is selected from a pre-configured template library; the first SQL query statement is then optimized based on the query template to obtain the second structured query statement. Query templates include, but are not limited to: join query templates, nested query templates, and aggregate query templates. The second SQL query statement can be as follows:
[0041] SELECT o.order_amount, o.payment_method, o.order_status, c.customer_name FROM orders o
[0042] JOIN customers c ON o.customer_id=c.customer_id
[0043] WHERE o.region='East' AND o.year=2019;
[0044] By dynamically adjusting the syntax of the first structured query statement, the resulting second structured query statement includes the "order status" field (order_status) and the customer name (customer_name) retrieved from the "customers" table through a join. This ensures that the query results accurately display customer information. This approach flexibly meets more complex query needs and generates more precise SQL queries.
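The check-then-optimize step can be sketched as follows. The patent does not disclose the contents of its template library or how requirements are checked, so `JOIN_TEMPLATE`, `selected_columns`, `missing_fields`, and the `required` list are hypothetical, and the SELECT-list parsing is deliberately naive.

```python
import re

# Hypothetical join template; the real template library is not described in the text.
JOIN_TEMPLATE = (
    "SELECT {columns} FROM {left} {la} "
    "JOIN {right} {ra} ON {la}.{key} = {ra}.{key} "
    "WHERE {where};"
)

def selected_columns(sql):
    """Return bare column names from the SELECT list (naive: no expressions or aliases)."""
    match = re.search(r"select\s+(.*?)\s+from\s", sql, flags=re.IGNORECASE | re.DOTALL)
    if not match:
        return []
    return [col.strip().split(".")[-1] for col in match.group(1).split(",")]

def missing_fields(sql, required):
    """List required fields that the statement does not select."""
    present = set(selected_columns(sql))
    return [field for field in required if field not in present]

first_sql = "SELECT order_amount, payment_method FROM orders WHERE region='East' AND year=2019;"
required = ["order_amount", "payment_method", "order_status"]  # assumed to come from the parsed query
missing = missing_fields(first_sql, required)

# Fill the join template to produce the second statement.
second_sql = JOIN_TEMPLATE.format(
    columns="o.order_amount, o.payment_method, o.order_status, c.customer_name",
    left="orders", la="o", right="customers", ra="c", key="customer_id",
    where="o.region='East' AND o.year=2019",
)
```

Running `missing_fields` on `second_sql` returns an empty list, which is the condition under which this sketch would treat the statement as meeting the field requirements.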
[0045] The second structured query statement is executed in the database, and the result is compared with the expected query result. This verification process identifies potential errors in the generated SQL query in real time, such as missing fields, incorrect table joins, and query logic errors. When a deviation is found between the SQL query result and the expected result, the generated SQL query statement is corrected and optimized based on the errors reported in the query results, thereby improving the accuracy of the generated result and the reliability of its execution.
[0046] The self-correcting Text2SQL generation method proposed in this invention first generates SQL query statements using a structured query statement generation task model. Then, it dynamically optimizes the SQL query statements based on the query operation patterns corresponding to the natural language query text. Finally, it further refines the SQL query statements based on the query results. Optimizing the SQL query statements based on query operation patterns better meets complex query needs, significantly improving the flexibility and efficiency of SQL query statement generation. Refining the SQL query statements based on query results improves their accuracy. The structured query statement generation task model, trained using meta-learning and self-supervised learning mechanisms, can quickly adjust its learning strategy to complete testing tasks and effectively enhances the model's generalization ability across different domains.
[0047] Figure 2 shows an exemplary flowchart of a self-correcting Text2SQL generation method 200 according to other embodiments of the present invention. As shown in Figure 2, the method includes:
[0048] In step S201, the first natural language query text is obtained;
[0049] In step S202, the pre-trained structured query statement generation task model is used to test the first natural language query text to obtain the first structured query statement corresponding to the first natural language query text. The structured query statement generation task model is trained and learned based on the meta-learning mechanism and the self-supervised learning mechanism.
[0050] In step S203, the first natural language query text is parsed using a pre-trained large language model to obtain the query operation mode corresponding to the first natural language query text.
[0051] In step S204, when it is determined that the first structured query statement does not meet the query requirements corresponding to the query operation mode, the first structured query statement is dynamically optimized to obtain the second structured query statement.
[0052] In step S205, the second structured query statement is executed to obtain the first query result.
[0053] In step S206, when it is determined that the first query result does not meet the expected result, the second structured query statement is modified and optimized according to the first query result to obtain a third structured query statement. The third structured query statement is output as the target structured query statement corresponding to the first natural language query text.
[0054] In step S207, when it is determined that the first query result meets the expected result, a second structured query statement is output as the target structured query statement corresponding to the first natural language query text.
[0055] In step S208, when it is determined that the first structured query statement satisfies the query requirements corresponding to the query operation mode, the first structured query statement is executed to obtain the second query result.
[0056] In step S209, when it is determined that the second query result meets the expected result, the first structured query statement is output as the target structured query statement.
[0057] In step S210, when it is determined that the second query result does not meet the expected result, the first structured query statement is modified and optimized according to the second query result to obtain a fourth structured query statement. The fourth structured query statement is output as the target structured query statement corresponding to the first natural language query text.
[0058] In the above steps, a pre-trained structured query generation task model is used to generate a first structured query statement corresponding to the first natural language query text. Then, based on the query operation mode corresponding to the first natural language query text, the first structured query statement undergoes a first self-correction check. If the first structured query statement meets the query requirements corresponding to the query operation mode, it indicates that the first SQL query statement conforms to the requirements of the query operation mode. Then, by executing the first SQL query statement, it is determined whether a second self-correction check is needed. Executing the first SQL query statement yields a second query result. If the second query result meets the expected result, the first SQL query statement is output. The first SQL query statement is the target SQL query statement corresponding to the first natural language query text.
[0059] When it is determined that the first structured query statement does not meet the query requirements corresponding to the query operation mode, it indicates that the first structured query statement still has errors that need to be corrected in complex query scenarios. The first SQL query statement is dynamically optimized to correct it, resulting in the second SQL query statement. Then, the second SQL query statement is executed to obtain the first query result, and the feedback from this result drives the second self-correction check. If the first query result meets the expected result, it means that the second SQL query statement has overcome the relevant problems after the first self-correction check and can be directly output. If the first query result does not meet the expected result, the second SQL query statement is corrected and optimized based on the first query result, resulting in the third SQL query statement, which is the SQL query statement after two corrections.
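The branching in steps S201 to S210 can be summarized as a short control-flow sketch. Every callable passed in (`generate`, `parse_mode`, `meets_mode`, `optimize`, `execute`, `meets_expectation`, `correct`) is a placeholder for a component the text describes only functionally; the stub lambdas in the usage part are purely illustrative.

```python
def self_correcting_text2sql(question, generate, parse_mode, meets_mode,
                             optimize, execute, meets_expectation, correct):
    """Two-stage self-correction following steps S201-S210 (placeholders throughout)."""
    sql = generate(question)            # S202: first structured query statement
    mode = parse_mode(question)         # S203: query operation mode
    if not meets_mode(sql, mode):       # S204: first self-correction check
        sql = optimize(sql, mode)       #       second structured query statement
    result = execute(sql)               # S205 or S208: run the current statement
    if meets_expectation(result):       # S207 or S209: output the statement as is
        return sql
    return correct(sql, result)         # S206 or S210: second self-correction

# Usage with stubs that force both corrections to happen.
final = self_correcting_text2sql(
    "example question",
    generate=lambda q: "SQL1",
    parse_mode=lambda q: "join",
    meets_mode=lambda sql, mode: False,
    optimize=lambda sql, mode: "SQL2",
    execute=lambda sql: {"ok": False},
    meets_expectation=lambda result: result["ok"],
    correct=lambda sql, result: "SQL3",
)
```

Because both checks fail in this usage example, the returned value corresponds to the third structured query statement. If both checks passed, the first statement would be returned unchanged.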
[0060] In some embodiments, the second structured query statement is modified and optimized based on the first query result, including determining an adjustment strategy corresponding to the error type; and modifying the second structured query statement according to the adjustment strategy corresponding to the error type.
[0061] After executing an SQL query in a specialized database, comparing the query results with the expected results identifies errors in the SQL query statement. These errors are then used to correct the SQL query. Through multiple feedback loops, the SQL generation strategy is continuously improved, gradually eliminating common errors in SQL queries, such as field mismatches, missing fields, and logical errors. After each correction, the corrected SQL query and its corresponding natural language query text are used as new training data pairs, fed back into the training dataset used to train the structured query generation task model. The updated training dataset is then used to adjust the generation strategy of the structured query generation task model.
[0062] When the query result of an SQL query indicates that some fields are missing, the missing fields are automatically identified and supplemented into the generated SQL query through a feedback mechanism.
[0063] When the results of an SQL query indicate incomplete or inaccurate information, the SQL query should be adjusted based on the difference between the actual data and the expected results. For example, the query conditions can be adjusted, or a join operation can be added to the SQL query.
[0064] When an SQL query result indicates a table join error, the parts of the SQL query related to the table join error are adjusted based on the query result. Table join errors can manifest as missing join conditions, incorrect join condition order, or join conditions that do not correctly relate multiple tables. The adjustment can include supplementing missing fields or conditions in the join conditions, or otherwise ensuring that the joins between different tables are accurate.
[0065] When the results of an SQL query indicate a logical error, aggregation operations, grouping conditions, or filtering conditions are added to ensure the accuracy of the query results. Logical errors include missing aggregation operations and incorrect conditions.
[0066] When the query result of an SQL query differs significantly from the expected result, the query fields or conditions in the SQL query are adjusted to optimize the data range and query logic, so that the query result of the SQL query is closer to the expected result.
[0067] When the result of an SQL query indicates a syntax error, the SQL query is adjusted according to syntax rules to obtain a SQL query that conforms to SQL syntax standards. Syntax errors include, but are not limited to, spelling errors, mismatched parentheses, and incorrect keywords.
[0068] When the query result indicates a data reference error, the error is corrected based on the context of the first natural language query text and feedback information. Data reference errors include, but are not limited to, invalid column names, table names, or constant values.
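One way to picture the mapping from error type to adjustment strategy is a lookup table keyed by a detected error label. The sketch below is an assumption-laden illustration: the labels, the `STRATEGIES` wording, and `classify_error` are hypothetical, and the classifier only recognizes the two error types that SQLite reports directly at execution time. The other types named above would need a comparison against expected results, which is not shown.

```python
import sqlite3

# Hypothetical labels mapped to the adjustment strategies paraphrased from the text.
STRATEGIES = {
    "syntax_error": "rewrite the statement to conform to SQL syntax rules",
    "data_reference_error": "fix column names, table names, or constants using the query context",
    "missing_field": "add the missing fields to the statement",
    "join_error": "supplement or repair the join conditions",
    "logic_error": "add aggregation, grouping, or filtering conditions",
    "data_range_mismatch": "adjust query fields or conditions to narrow the data range",
}

def classify_error(conn, sql):
    """Heuristically label an execution failure from SQLite's error message."""
    try:
        conn.execute(sql).fetchall()
    except sqlite3.OperationalError as exc:
        message = str(exc)
        if "no such column" in message or "no such table" in message:
            return "data_reference_error"
        if "syntax error" in message:
            return "syntax_error"
        return "other_error"
    return None  # the statement executed; result-based checks would go here

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_amount REAL)")

label = classify_error(conn, "SELECT amount FROM orders")  # invalid column name
strategy = STRATEGIES.get(label)
```

Matching on message text is brittle and specific to SQLite; a production system would more likely rely on the database driver's structured error codes.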
[0069] Suppose the user inputs the natural language query text: "Display the order amounts and 'paid' status of all customers in East China in 2019". The initially generated SQL query statement is as follows:
[0070] SELECT order_amount, order_status FROM orders WHERE region='East' AND year=2019;
[0071] After executing the SQL query, the query results are compared with the expected results. When the results do not match the expected results, analysis reveals that the data range for the "Order Status" field in the SQL query is not precisely defined. The query only returns the status of all orders, while the user requirement is to display the order amount for all customers and orders with a "Paid" status. Therefore, the initially generated SQL query did not include a "Paid" status filter. Based on the execution feedback, the data range was found to be inconsistent with the expected results, and a "data range adjustment" strategy was adopted to correct the SQL query. A filter condition was added to the SQL query to limit the order status to "Paid". The final generated SQL query is:
[0072] SELECT order_amount, order_status
[0073] FROM orders WHERE region='East' AND year=2019 AND order_status='Paid';
[0074] Through the above adjustments, executing the revised SQL query statement returns only the amount and status of "paid" orders, thus meeting the user's query requirements. In this process, by automatically identifying data ranges that do not meet expectations and adjusting the filtering conditions in the SQL query statement, the query logic of the revised SQL query statement is optimized, enabling it to accurately return the required data results.
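The "data range adjustment" in this example can be reproduced on sample data. The sketch below uses Python's built-in `sqlite3` module; the rows, the helper `statuses_outside`, and the way the expected status set is supplied are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_amount REAL, order_status TEXT, region TEXT, year INTEGER);
INSERT INTO orders VALUES
  (120.0, 'Paid', 'East', 2019),
  (80.0, 'Pending', 'East', 2019),
  (60.0, 'Paid', 'West', 2019);
""")

initial_sql = "SELECT order_amount, order_status FROM orders WHERE region='East' AND year=2019"

def statuses_outside(rows, allowed):
    """Return the statuses in the result that fall outside the expected set."""
    return sorted({status for _, status in rows if status not in allowed})

initial_rows = conn.execute(initial_sql).fetchall()
unexpected = statuses_outside(initial_rows, {"Paid"})  # expected set assumed from the user request

# Data range adjustment: add the status filter only when the result is too broad.
if unexpected:
    corrected_sql = initial_sql + " AND order_status='Paid'"
else:
    corrected_sql = initial_sql
corrected_rows = conn.execute(corrected_sql).fetchall()
```

After the correction, only the paid order from the East region in 2019 remains in `corrected_rows`.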
[0075] The self-correcting Text2SQL generation method provided by this invention continuously provides feedback and correction when relevant conditions are not met through the above-mentioned two self-correcting SQL query statement processing methods, and gradually optimizes the SQL query statement generation strategy, which significantly improves the accuracy of query results and the efficiency of query statement generation.
[0076] Figure 3 shows an exemplary flowchart of a structured query statement generation task model training method 300 according to other embodiments of the present invention. As shown in Figure 3, the method includes:
[0077] In step S301, a labeled first training dataset is constructed. The first training dataset includes multiple training data pairs. Each training data pair includes a second natural language query text corresponding to the actual application scenario and a third structured query statement corresponding to the second natural language query text.
[0078] In some embodiments, before pre-training the structured query generation task model to be trained using multiple second training datasets and the first training dataset according to the meta-learning algorithm, the method further includes: performing semantic expansion on each second natural language query text using a pre-trained large language model to obtain multiple third natural language query texts that are semantically identical to, but expressed differently from, the corresponding second natural language query text; constructing a third training dataset based on the correspondence between the multiple third natural language query texts, the second natural language query texts, and the training structured query statements; and updating the first training dataset with the third training dataset.
[0079] Suppose that in an order query application, the expert team annotates the natural language query text "Query the total order amount of customers in East China in 2019" from actual sales data, and obtains the corresponding SQL query as follows:
[0080] SELECT SUM(order_amount) FROM orders WHERE region='East' AND year=2019;
[0081] Based on the labeled natural language query text "Query the total order amount of customers in East China in 2019", a pre-trained large language model is used to generate several natural language query texts with different expressions, such as "Statistics on the order amount of customers in East China in 2019", "Calculate the total order amount of all customers in East China in 2019", and "View the total order amount of customers in East China in 2019".
[0082] This approach allows for the handling of different types of query inputs, effectively improving the robustness of the model.
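The construction of the expanded training data can be sketched as follows. The paraphrases are hard-coded from the example above and stand in for the output of the pre-trained large language model; the function names expand_pairs and merge_datasets and the use of (text, SQL) tuples are assumptions made for illustration.

```python
ANNOTATED_SQL = "SELECT SUM(order_amount) FROM orders WHERE region='East' AND year=2019;"
ANNOTATED_TEXT = "Query the total order amount of customers in East China in 2019"

# First training dataset: expert-annotated (text, SQL) pairs.
first_dataset = [(ANNOTATED_TEXT, ANNOTATED_SQL)]

# Stand-in for the large language model's semantic expansion output.
paraphrases = {
    ANNOTATED_TEXT: [
        "Statistics on the order amount of customers in East China in 2019",
        "Calculate the total order amount of all customers in East China in 2019",
        "View the total order amount of customers in East China in 2019",
    ]
}


def expand_pairs(annotated_pairs, paraphrases_by_text):
    """Pair every paraphrase with the SQL statement of the text it came from."""
    expanded = []
    for text, sql in annotated_pairs:
        for variant in paraphrases_by_text.get(text, []):
            expanded.append((variant, sql))
    return expanded


def merge_datasets(first, third):
    """Merge the expanded pairs into the first dataset, dropping duplicates."""
    merged, seen = [], set()
    for pair in first + third:
        if pair not in seen:
            seen.add(pair)
            merged.append(pair)
    return merged


third_dataset = expand_pairs(first_dataset, paraphrases)
updated_first_dataset = merge_datasets(first_dataset, third_dataset)
```

Every paraphrase inherits the annotated SQL statement unchanged, so the expansion adds linguistic variety without requiring additional manual labeling.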
[0083] In step S302, a second training dataset corresponding to each of the multiple different open-source datasets is obtained.
[0084] After constructing the first training dataset, the structured query generation task model is pre-trained using multiple second training datasets and the first training dataset to obtain a preliminary structured query generation task model. The multiple second training datasets can be, for example, open-source Text2SQL datasets such as WikiSQL, Spider, and BIRD.
[0085] The first training dataset is a training dataset for a specific professional field, such as a training dataset for an order query application.
[0086] In some embodiments, after outputting a target structured query statement corresponding to a first natural language query text, the method includes: constructing a new training data pair from the first natural language query text and the target structured query statement; and adding the new training data pair to the first training dataset to obtain an updated first training dataset.
[0087] In step S303, based on the meta-learning algorithm, the structured query statement generation task model to be trained is pre-trained using multiple second training datasets and the first training dataset to obtain a preliminary structured query statement generation task model.
[0088] In the steps described above, a large language model is pre-trained using a meta-learning algorithm (such as Model-Agnostic Meta-Learning, MAML) on both the first and second training datasets. Meta-learning allows the model to acquire processing strategies for general SQL query structures and query operation modes, such as simple queries, join queries, and aggregation queries. Introducing the meta-learning mechanism enables the structured query generation task model to be quickly adjusted based on a small amount of new domain-specific data, optimizing the SQL query generation strategy.
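To make the inner and outer loops of this kind of meta-learning concrete, the following toy sketch uses the first-order approximation of MAML on a one-parameter regression problem. It is not the training procedure of the structured query generation task model: the scalar model y = w * x, the synthetic tasks, the learning rates, and all function names are assumptions chosen so that the example runs with the standard library alone.

```python
def mse(w, data):
    """Mean squared error of the scalar model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def mse_grad(w, data):
    """Gradient of the mean squared error with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in data) / len(data)


def adapt(w, support, inner_lr=0.05, steps=1):
    """Inner loop: a few gradient steps on one task's support set."""
    for _ in range(steps):
        w = w - inner_lr * mse_grad(w, support)
    return w


def first_order_maml(tasks, w0=0.0, inner_lr=0.05, outer_lr=0.05, epochs=50):
    """Outer loop: move the shared initialization using the query-set gradient
    evaluated at the adapted parameters (second-order terms are ignored)."""
    w = w0
    for _ in range(epochs):
        meta_grad = 0.0
        for support, query in tasks:
            adapted = adapt(w, support, inner_lr)
            meta_grad += mse_grad(adapted, query)
        w = w - outer_lr * meta_grad / len(tasks)
    return w


def make_task(slope):
    """Synthetic task: points on the line y = slope * x, split into two sets."""
    support = [(1.0, slope * 1.0), (2.0, slope * 2.0)]
    query = [(3.0, slope * 3.0), (4.0, slope * 4.0)]
    return support, query


# Stand-ins for the multiple source datasets used in pre-training.
meta_w = first_order_maml([make_task(1.0), make_task(2.0), make_task(3.0)])

# A new "domain" with only a small support set for adaptation.
new_support, new_query = make_task(2.5)
loss_from_meta_init = mse(adapt(meta_w, new_support), new_query)
loss_from_zero_init = mse(adapt(0.0, new_support), new_query)
```

After a single adaptation step on the new task, the meta-learned initialization yields a lower query loss than the zero initialization, which is the behavior that quick adjustment from a small amount of domain data relies on.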
[0089] In step S304, the preliminary structured query statement generation task model is fine-tuned to obtain the current structured query statement generation task model.
[0090] In step S305, the output of the current structured query generation task model is supervised and detected according to the self-supervised learning algorithm to obtain the trained structured query generation task model.
[0091] Assume that, after pre-training on the WikiSQL, Spider, and BIRD datasets with the meta-learning algorithm, the large language model has learned the basic structure of general-domain SQL queries and common complex query patterns. When the pre-trained structured query generation task model is applied to a specific domain task (such as customer order data analysis), the meta-learning mechanism uses a small amount of training data from that domain to quickly fine-tune the initial parameters of the pre-trained model, obtaining a query generation strategy adapted to the domain and generating SQL queries that meet its query requirements.
[0092] For example, the natural language query text for customer order data is "Query the total order amount of customers in East China in 2019". Based on existing pre-trained knowledge (such as aggregation queries, join queries, etc.), a small amount of data from the customer order data analysis field is used to quickly fine-tune and generate the correct SQL query statement.
[0093] The meta-learning mechanism eliminates the need to retrain the entire model, significantly improving transfer efficiency.
[0094] During the training phase, a self-supervised learning mechanism can be used to compare the SQL queries generated by the pre-trained model with the labeled SQL queries in the labeled training data pairs. This automatically checks and corrects errors in the generated SQL queries, thereby optimizing the SQL query generation process. By comparing the generated SQL queries with the labeled SQL queries, deviations between them can be identified. When deviations exist (such as missing fields or incorrect conditions), these errors can be corrected promptly through the self-supervised feedback mechanism. After multiple rounds of self-supervised training, the model continuously learns from errors, gradually improving the quality of the generated SQL queries during the training phase.
[0095] Suppose that in order data analysis, a user inputs the natural language query text "Query the total order amount for each customer in East China in 2019". The SQL query statement generated by the pre-trained model is as follows:
[0096] SELECT SUM(order_amount) FROM orders WHERE region='East' AND year=2019 GROUP BY customer_id;
[0097] The annotated SQL query statement is:
[0098] SELECT customer_id, SUM(order_amount) FROM orders WHERE region='East' AND year=2019 GROUP BY customer_id;
[0099] Comparing the generated SQL query with the annotated SQL query reveals that the generated SQL query lacks the "customer_id" field, which is crucial for achieving the user's query objective. Although the "SUM(order_amount)" expression, together with the GROUP BY clause, correctly calculates the total order amount for each customer, without the "customer_id" field the query result cannot show which customer each total belongs to. Therefore, after the self-supervised mechanism checks the generated SQL query, the generation strategy is adjusted based on the check feedback so that the generated SQL query includes the "customer_id" field and meets the user's query requirements. Ultimately, the generated SQL query matches the annotated SQL query.
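The comparison in this example can be sketched as a check of the SELECT lists of the two statements. The regular-expression parsing below is a simplification that assumes a single, non-nested SELECT ... FROM statement; the function names select_items and missing_select_items are illustrative, and a production system would more likely rely on a full SQL parser.

```python
import re


def select_items(sql):
    """Return the normalized items of the SELECT list of a simple query."""
    match = re.search(r"select\s+(.*?)\s+from\s", sql, flags=re.IGNORECASE | re.DOTALL)
    if not match:
        return []
    items, depth, current = [], 0, ""
    for ch in match.group(1):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        # Split only on commas that are not inside a function call.
        if ch == "," and depth == 0:
            items.append(current)
            current = ""
        else:
            current += ch
    items.append(current)
    # Normalize by removing whitespace and lowercasing.
    return [re.sub(r"\s+", "", item).lower() for item in items if item.strip()]


def missing_select_items(generated_sql, annotated_sql):
    """List SELECT items that the annotated query has but the generated one lacks."""
    generated = set(select_items(generated_sql))
    return [item for item in select_items(annotated_sql) if item not in generated]


generated_sql = (
    "SELECT SUM(order_amount) FROM orders "
    "WHERE region='East' AND year=2019 GROUP BY customer_id;"
)
annotated_sql = (
    "SELECT customer_id, SUM(order_amount) FROM orders "
    "WHERE region='East' AND year=2019 GROUP BY customer_id;"
)

missing = missing_select_items(generated_sql, annotated_sql)
```

A non-empty result is the kind of deviation signal that can be fed back as a training correction.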
[0100] The proposed training method for a structured query generation task model addresses the poor cross-dataset adaptability of existing Text2SQL technologies by introducing a meta-learning approach. Adaptive meta-learning enables the model to quickly adjust its learning strategy based on new datasets, optimizing its generalization ability across different domains and tasks, and flexibly adapting to various situations while achieving rapid transfer. Furthermore, by introducing a self-supervised optimization mechanism, the model automatically learns and discovers potential problems in the training dataset, reducing the need for manual annotation. Additionally, by comparing the predicted SQL queries with standard SQL queries and performing self-correction based on the results, the accuracy of the generated SQL queries is improved.
[0101] Please refer to Figure 4, which is a schematic diagram of the structure of a self-correcting Text2SQL generation device 400 according to an embodiment of the present invention. As shown in Figure 4, the structured query statement generation device 400 includes:
[0102] The natural language query text acquisition module 401 is configured to acquire the first natural language query text.
[0103] The structured query statement generation module 402 is configured to test the first natural language query text using a pre-trained structured query statement generation task model to obtain the first structured query statement corresponding to the first natural language query text. The structured query statement generation task model is trained and learned based on a meta-learning mechanism and a self-supervised learning mechanism.
[0104] The query operation mode parsing module 403 is configured to use a pre-trained large language model to parse the first natural language query text and obtain the query operation mode corresponding to the first natural language query text.
[0105] The structured query statement dynamic optimization module 404 is configured to dynamically optimize the first structured query statement to obtain a second structured query statement when it is determined that the first structured query statement does not meet the query requirements corresponding to the query operation mode.
[0106] The structured query statement execution module 405 is configured to execute the second structured query statement to obtain the first query result.
[0107] The structured query statement correction module 406 is configured to correct and optimize the second structured query statement based on the first query result when it is determined that the first query result does not meet the expected result, so as to obtain a third structured query statement. The third structured query statement is output as the target structured query statement corresponding to the first natural language query text.
[0108] In some embodiments, the structured query statement execution module 405 is configured to execute the first structured query statement and obtain the second query result when it is determined that the first structured query statement meets the query requirements corresponding to the query operation mode.
[0109] The structured query statement correction module 406 is configured to correct and optimize the first structured query statement based on the second query result when it is determined that the second query result does not meet the expected result, so as to obtain a fourth structured query statement. The fourth structured query statement is output as the target structured query statement corresponding to the first natural language query text.
[0110] In some embodiments, the structured query statement dynamic optimization module 404 is configured to select a query template related to the first natural language query text from a pre-configured template library; and optimize the first structured query statement according to the query template to obtain a second structured query statement.
[0111] In some embodiments, the first query result includes an error type indicating an error present in the second structured query statement. The structured query statement correction module 406 is configured to determine an adjustment strategy corresponding to the error type and correct the second structured query statement according to the adjustment strategy corresponding to the error type.
[0112] The self-correcting Text2SQL generation device provided by this invention continuously provides feedback and correction when relevant conditions are not met through the above-mentioned two self-correcting SQL query statement processing methods, and gradually optimizes the SQL query statement generation strategy, which significantly improves the accuracy of query results and the efficiency of query statement generation.
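The cooperation of modules 401 to 406 can be summarized as a control-flow sketch. Each module is represented by a callable passed in as a parameter; the function name generate_target_sql, the parameter names, and the stub implementations used in the demonstration are all hypothetical and only show the order in which the two correction stages are applied.

```python
def generate_target_sql(
    text,
    generate,           # module 402: model inference, text -> first SQL statement
    parse_mode,         # module 403: text -> query operation mode
    meets_mode,         # module 404 check: (sql, mode) -> bool
    optimize,           # module 404 action: (sql, text) -> optimized SQL
    execute,            # module 405: sql -> query result
    meets_expectation,  # module 406 check: result -> bool
    correct,            # module 406 action: (sql, result) -> corrected SQL
):
    """Apply the two self-correction stages in order and return the target SQL."""
    candidate = generate(text)
    mode = parse_mode(text)
    # Stage 1: dynamic optimization against the query operation mode.
    if not meets_mode(candidate, mode):
        candidate = optimize(candidate, text)
    # Stage 2: correction based on execution feedback.
    result = execute(candidate)
    if not meets_expectation(result):
        candidate = correct(candidate, result)
    return candidate


# Stub modules for demonstration only; real modules would call the trained
# model, a large language model, and a database.
target_sql = generate_target_sql(
    "Display the order amounts and 'paid' status of all customers in East China in 2019",
    generate=lambda text: "SELECT order_amount, order_status FROM orders",
    parse_mode=lambda text: "filter_query",
    meets_mode=lambda sql, mode: "WHERE" in sql,
    optimize=lambda sql, text: sql + " WHERE region='East' AND year=2019",
    execute=lambda sql: [(100.0, "Paid"), (250.0, "Unpaid")],
    meets_expectation=lambda rows: all(status == "Paid" for _, status in rows),
    correct=lambda sql, rows: sql + " AND order_status='Paid'",
)
```

When the first statement already satisfies the query operation mode, stage 1 is skipped and the same execution check is applied to the first statement directly, which corresponds to the branch that yields the fourth structured query statement.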
[0113] Figure 5 shows a structural schematic diagram of a structured query statement generation task model training apparatus 500 according to other embodiments of the present invention. As shown in Figure 5, the device 500 includes:
[0114] The training data construction module 501 is configured to construct a labeled first training dataset. The first training dataset includes multiple training data pairs. Each training data pair includes training natural language query text corresponding to the actual application scenario and training structured query statements corresponding to the training natural language query text.
[0115] The open-source data acquisition module 502 is configured to acquire a second training dataset corresponding to each of the multiple different open-source datasets.
[0116] The meta-learning pre-training module 503 is configured to pre-train the structured query generation task model to be trained using multiple second training datasets and the first training dataset according to the meta-learning algorithm, so as to obtain a preliminary structured query generation task model.
[0117] The meta-learning fine-tuning module 504 is configured to fine-tune the preliminary structured query statement generation task model to obtain the current structured query statement generation task model.
[0118] The self-supervised learning module 505 is configured to perform supervised detection on the output of the current structured query generation task model according to the self-supervised learning algorithm, so as to obtain the trained structured query generation task model.
[0119] In some embodiments, the device 500 further includes:
[0120] The text expansion module is configured to use a pre-trained large language model to perform semantic expansion on each second natural language query text to obtain multiple third natural language query texts with the same semantics but different text expressions corresponding to each second natural language query text.
[0121] The training data construction module 501 is also configured to construct a third training dataset based on the correspondence between the multiple third natural language query texts, the second natural language query texts, and the training structured query statements, and to update the first training dataset with the third training dataset.
[0122] In some embodiments, the device further includes:
[0123] The training data pair building module is configured to use the first natural language query text and the target structured query statement as a new training data pair.
[0124] The training data update module is configured to add new training data pairs to the first training dataset to obtain the updated first training dataset.
[0125] The structured query generation task model training device proposed in this invention solves the problem of poor cross-dataset adaptability in existing Text2SQL technology by introducing a meta-learning method. Adaptive meta-learning enables the model to quickly adjust its learning strategy based on the data in the new dataset, optimize the model's generalization ability in different domains, flexibly respond to different domains and tasks, and achieve rapid transfer. Furthermore, by introducing a self-supervised optimization mechanism, it automatically learns and discovers potential problems in the training dataset, reducing the need for manual annotation. It also improves the accuracy of generated SQL queries by comparing the predicted SQL queries with standard SQL queries and performing self-correction based on the results.
[0126] Figure 6 shows a schematic block diagram of an electronic device 600 according to an embodiment of the present invention. As shown in Figure 6, the electronic device 600 may include a processor 601 and a memory 602. The memory 602 stores computer program instructions for executing a self-correcting Text2SQL generation method. When the computer program instructions are executed by the processor 601, the electronic device 600 executes the method described above with reference to Figures 1 to 3. For details of how the electronic device 600 performs the self-correcting Text2SQL generation method, see the foregoing embodiments, which are not repeated here.
[0127] An embodiment of the present invention provides an electronic device, the device comprising: a processor; and a memory storing computer program instructions for executing a self-correcting Text2SQL generation method, wherein when the computer program instructions are executed by the processor, the electronic device performs the method described with reference to Figures 1 to 3.
[0128] This invention provides a computer-readable storage medium comprising computer program instructions for executing a self-correcting Text2SQL generation method, wherein when the computer program instructions are executed by a processor, the method described with reference to Figures 1 to 3 is implemented.
[0129] Those skilled in the art will recognize that embodiments of the present invention can be implemented as a system, method, or computer program product. Therefore, the present invention can be specifically implemented as entirely hardware, entirely software (including firmware, resident software, microcode, etc.), or a combination of hardware and software, generally referred to herein as a "circuit," "module," "unit," or "system." Furthermore, in some embodiments, the present invention can also be implemented as a computer program product contained in one or more computer-readable media, which contains computer-readable program code.
[0130] Any combination of one or more computer-readable media may be used. A computer-readable medium can be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (not exhaustive) of a computer-readable storage medium may include: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this document, a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in connection with an instruction execution system, apparatus, or device.
[0131] Program code contained on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wire, optical fiber, RF, etc., or any suitable combination thereof.
[0132] It should be understood that each block of a flowchart and/or block diagram, as well as combinations of blocks in a flowchart and/or block diagram, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device to produce a machine, such that the instructions, when executed by the computer or other programmable data processing device, create means for implementing the functions/operations specified in the blocks of the flowchart and/or block diagram.
[0133] While numerous embodiments of the invention have been shown and described herein, it will be apparent to those skilled in the art that such embodiments are provided by way of example only. Many modifications, alterations, and alternatives will occur to those skilled in the art without departing from the spirit and essence of the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in the practice of the invention. The appended claims are intended to define the scope of protection of the invention and therefore cover equivalents or alternatives within the scope of these claims.
Claims
1. A Text2SQL generation method based on self-correction, characterized in that the method includes: constructing a labeled first training dataset, wherein the first training dataset includes multiple training data pairs, and each training data pair includes a training natural language query text corresponding to a real application scenario and a training structured query statement corresponding to the training natural language query text; obtaining a second training dataset corresponding to each of multiple different open-source datasets; pre-training, according to a meta-learning algorithm, the structured query statement generation task model to be trained using the multiple second training datasets and the first training dataset to obtain a preliminary structured query statement generation task model; fine-tuning the preliminary structured query statement generation task model to obtain a current structured query statement generation task model; performing supervised detection on the output of the current structured query statement generation task model according to a self-supervised learning algorithm to obtain a trained structured query statement generation task model; obtaining a first natural language query text; testing the first natural language query text using the trained structured query statement generation task model to obtain a first structured query statement corresponding to the first natural language query text; parsing the first natural language query text using a pre-trained large language model to obtain a query operation mode corresponding to the first natural language query text; when it is determined that the first structured query statement does not meet the query requirements corresponding to the query operation mode, dynamically optimizing the first structured query statement to obtain a second structured query statement; executing the second structured query statement to obtain a first query result; and when it is determined that the first query result does not meet an expected result, correcting and optimizing the second structured query statement according to the first query result to obtain a third structured query statement, and outputting the third structured query statement as the target structured query statement corresponding to the first natural language query text.
2. The method according to claim 1, characterized in that the method further includes: when it is determined that the first structured query statement satisfies the query requirements corresponding to the query operation mode, executing the first structured query statement to obtain a second query result; and when it is determined that the second query result does not meet the expected result, correcting and optimizing the first structured query statement according to the second query result to obtain a fourth structured query statement, and outputting the fourth structured query statement as the target structured query statement corresponding to the first natural language query text.
3. The method according to claim 1 or 2, characterized in that dynamically optimizing the first structured query statement includes: selecting a query template related to the first natural language query text from a pre-configured template library; and optimizing the first structured query statement according to the query template to obtain the second structured query statement.
4. The method according to claim 1, characterized in that the first query result includes an error type indicating an error present in the second structured query statement, and correcting and optimizing the second structured query statement according to the first query result includes: determining an adjustment strategy corresponding to the error type; and correcting the second structured query statement according to the adjustment strategy corresponding to the error type.
5. The method according to claim 1, characterized in that, before pre-training the structured query generation task model to be trained using multiple second training datasets and the first training dataset according to the meta-learning algorithm, the method further includes: performing semantic expansion on each second natural language query text using a pre-trained large language model to obtain multiple third natural language query texts that have the same semantics as, but different text expressions from, the corresponding second natural language query text; constructing a third training dataset based on the correspondence between the multiple third natural language query texts, the second natural language query texts, and the training structured query statements; and updating the first training dataset with the third training dataset.
6. The method according to claim 1, characterized in that, after outputting the target structured query statement corresponding to the first natural language query text, the method further includes: using the first natural language query text and the target structured query statement as a new training data pair; and adding the new training data pair to the first training dataset to obtain the updated first training dataset.