Dataset generation method and apparatus, electronic device, and medium
By generating and filtering NL2SQL datasets using large models, the problem of low efficiency in manual annotation is solved, enabling the generation of diverse and reliable datasets and supporting applications in complex SQL query scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING PACTERA JINXIN TECH LTD
- Filing Date
- 2025-06-09
- Publication Date
- 2026-06-19
AI Technical Summary
Existing methods for generating NL2SQL datasets rely on manual annotation, which is inefficient and costly, making it difficult to generate sufficiently realistic and diverse samples, thus limiting model performance and practical applications.
The system leverages large models to generate table creation statements based on business scenarios, batches database table data, and ensures dataset diversity and reliability by filtering and verifying the matching of natural language questions with SQL query statements.
It generates diverse and reliable datasets that can handle complex multi-table joins and nested subqueries, improving the model's ability to be applied in real-world business scenarios.
Smart Images

Figure CN120596501B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer technology, and in particular to a method, apparatus, electronic device, and medium for generating datasets. Background Technology
[0002] In the field of Natural Language Processing (NLP), the task of converting natural language queries into Structured Query Language (SQL), or NL2SQL, has become a highly anticipated research direction. In existing technologies, when performing NLP based on neural network models, the training, optimization, and testing of these models rely on high-quality, large-scale NL2SQL datasets. Traditional methods for constructing NL2SQL datasets primarily rely on manual annotation, where professionals familiar with SQL manually convert natural language questions into corresponding SQL query statements. While this method ensures data quality, it suffers from inefficiency and high costs, making it difficult to meet the needs of large-scale datasets. Furthermore, manually annotated datasets often lack diversity, failing to cover the various complex query scenarios that may be encountered in real-world applications. Related technologies utilize existing question-answer pairs to automatically generate SQL queries, but the SQL queries generated by this method are often too simplistic and fail to reflect the complex query needs of the real world. Existing methods for automatically generating question-answer pairs often struggle to produce sufficiently realistic and diverse samples. This not only limits the performance of NL2SQL models but also hinders the widespread application of this technology in real-world business scenarios. Summary of the Invention
[0003] This invention provides a dataset generation method, apparatus, electronic device, and medium to address the shortcomings of traditional dataset generation methods that automatically generate SQL queries using existing question-answer pairs, which makes it difficult to produce sufficiently realistic and diverse samples, thus limiting the performance of NL2SQL models.
[0004] This invention provides a dataset generation method, comprising:
[0005] Determine the database tables corresponding to the business scenario, and use the large model to generate table creation statements based on the natural language questions and answers in the business scenario;
[0006] Based on the table creation statement, the large model is used to generate multiple data entries for each of the database tables in batches.
[0007] The large model is used to generate natural language description questions and SQL query statements that match the natural language description questions in batches based on multiple data records in each of the database tables.
[0008] The natural language problem description and the SQL query statement matching the natural language problem description are filtered and verified at least once to generate a dataset in which the natural language problem description is converted into SQL query statement.
[0009] According to the dataset generation method provided by the present invention, the table creation statement includes a description of the table name, field descriptions and primary and foreign key relationships, as well as multiple sample data insertion statements for each database table, wherein the sample data of the primary and foreign keys in the multiple sample data insertion statements for each database table are related.
[0010] According to the dataset generation method provided by the present invention, the step of generating multiple data records in batches for each database table based on the table creation statement using the large model includes:
[0011] Based on the table creation statement and data requirement prompts, multiple data entries are generated in batches for each table through the large model. The data requirement prompts include at least one of the following: the data must meet the constraints of field explanations and database attributes; the data must be diverse; the data must be generated completely according to the data in actual production; the primary and foreign key fields must correspond and be consistent; and the data between multiple database tables must have an intersection.
[0012] If an error occurs when the table creation statement is executed in the database, the data is modified based on the error information and the table creation statement in the large model.
[0013] The system receives verification information indicating whether the data meets actual production requirements. Based on this verification information, it modifies or regenerates new data using the large model based on natural language questions and the table creation statements.
[0014] According to the dataset generation method provided by the present invention, the step of using the large model to generate natural language description questions and SQL query statements matching the natural language description questions in batches based on multiple data entries in each of the database tables includes:
[0015] Randomly select a database table and obtain the table information corresponding to the database table. The table information includes at least one of the following: table name, table name comment, field name, field comment, field database attribute, and sampled data.
[0016] The large model generates a natural language description of the question corresponding to the database table and an SQL query statement that matches the natural language description of the question based on the table information of the database table.
[0017] Replace the dataset tables and generate corresponding natural language description questions and SQL query statements that match the natural language description questions for each dataset table.
[0018] According to the dataset generation method provided by the present invention, the step of generating a natural language description question corresponding to the database table and an SQL query statement matching the natural language description question based on the table information of the database table using a large model includes:
[0019] By using different prompt words, different large models, sampling different table information and sampled data, natural language questions and their corresponding SQL query statements are generated synchronously.
[0020] And / or, by using different prompt words, different large models, sampling different table information and sampled data, first generate natural language questions and then generate corresponding SQL query statements.
[0021] According to the dataset generation method provided by the present invention, the natural language description of the problem and the SQL query statements matching the natural language description of the problem are filtered and verified at least once to generate a dataset in which the natural language problem description is converted into SQL query statements, including:
[0022] The SQL query statement is executed in the database, and the SQL query statements that generate errors and those that return null values are filtered out.
[0023] According to the dataset generation method provided by the present invention, the natural language description of the problem and the SQL query statements matching the natural language description of the problem are filtered and verified at least once to generate a dataset in which the natural language problem description is converted into SQL query statements, including:
[0024] Based on the data format and page design requirements of the large model and dataset files, a dataset verification system is constructed. Based on the dataset verification system, the natural language description of the problem and the SQL query statement matching the natural language description of the problem are optimized to generate a dataset in which the natural language problem description is converted into an SQL query statement.
[0025] The present invention also provides a dataset generation apparatus, comprising:
[0026] The table creation statement generation module is used to determine the database table corresponding to the business scenario and generate table creation statements based on the natural language questions and answers in the business scenario using a large model;
[0027] The data generation module is used to generate multiple data entries for each database table in batches based on the table creation statement and the large model.
[0028] The matching generation module is used to generate natural language description questions and SQL query statements that match the natural language description questions in batches based on multiple data in each of the database tables using the large model.
[0029] The filtering and verification module is used to perform at least one filtering and verification on the natural language description of the question and the SQL query statement that matches the natural language description of the question, and generate a dataset in which the natural language question description is converted into SQL query statement.
[0030] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the dataset generation method as described in any of the preceding claims.
[0031] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the dataset generation method described in any of the preceding claims.
[0032] The dataset generation method, apparatus, electronic device, and medium provided by this invention utilize a large model to generate table creation statements based on natural language questions and answers in business scenarios. Based on the table creation statements, the large model generates multiple data entries for each database table in batches, which can ensure the diversity and reliability of the dataset. By performing at least one filtering and verification on the questions described in natural language and the SQL query statements that match the questions described in natural language, the accuracy of the data in the dataset can be improved, which is conducive to its widespread application in actual business scenarios. Attached Figure Description
[0033] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0034] Figure 1 This is one of the flowcharts illustrating the dataset generation method provided in this embodiment of the invention;
[0035] Figure 2 This is a schematic diagram illustrating the query and response of the Claude-3.5-Sonnet large model provided in an embodiment of the present invention;
[0036] Figure 3 This is a schematic diagram of questions and answers in a human resources department business scenario provided by an embodiment of the present invention;
[0037] Figure 4 This is a schematic diagram of the prompt words for the Claude-3.5-Sonnet large model provided in the embodiments of the present invention;
[0038] Figure 5This is a schematic diagram of batch data generation prompt words provided in an embodiment of the present invention;
[0039] Figure 6 This is a schematic diagram of the prompt words used for verification through a large model in the program provided in this embodiment of the invention;
[0040] Figure 7 This is a schematic diagram of the prompt words used for the first time in the initial version generation system provided by this embodiment of the invention;
[0041] Figure 8 This is a schematic diagram of the manual verification system provided in an embodiment of the present invention;
[0042] Figure 9 This is the second flowchart illustrating the dataset generation method provided in this embodiment of the invention;
[0043] Figure 10 This is a schematic diagram of the functional structure of the dataset generation device provided in an embodiment of the present invention;
[0044] Figure 11 This is a functional structure diagram of the electronic device provided in an embodiment of the present invention. Detailed Implementation
[0045] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0046] Figure 1 A flowchart of the dataset generation method provided in the embodiments of the present invention is shown below. Figure 1 As shown, the dataset generation method provided in this embodiment of the invention includes:
[0047] Step 101: Determine the database table corresponding to the business scenario, and use the large model to generate table creation statements based on the natural language questions and answers in the business scenario;
[0048] In this embodiment of the invention, the large-scale model selected is the Claude-3.5-Sonnet model. After the large-scale model provides the database tables, it undergoes manual review and evaluation to check the reasonableness of the model's response and whether the provided database tables are the main database tables required for the business scenario. If the manual review reveals issues, such as too many redundant tables, too few tables, or tables that do not meet the needs of the scenario, the large-scale model can be asked to revise its response based on the previous step, or the previous question can be corrected before the large-scale model responds again.
[0049] If only the business scenario is defined, the questions and answers are relatively open-ended within that scenario, and there are no special subjective requirements. You can directly ask the larger model which main database tables are needed to construct a reporting system for a specific business scenario. For example... Figure 2 The example shown is the query and response to the Claude-3.5-Sonnet large model. If there are specific business scenarios requiring answers, the questions and answers can be manually compiled, and then the large model can directly generate table creation statements based on the questions and answers. For example... Figure 3 , Figure 4 The questions and answers, along with hints, are shown in the example business scenarios.
[0050] Step 102: Based on the table creation statement, use the large model to generate multiple data records for each of the database tables in batches;
[0051] Step 103: Using the large model, generate natural language description questions and SQL query statements that match the natural language description questions in batches based on multiple data records in each of the database tables;
[0052] Step 104: Perform at least one filtering and verification on the natural language description of the problem and the SQL query statements that match the natural language description of the problem to generate a dataset of natural language problem descriptions converted into SQL query statements.
[0053] It should be noted that the large model in the embodiments of the present invention is a pre-trained large model based on natural language processing (NLP), including but not limited to Claude-3.5-Sonnet, ChatGPT, DeepSeek-Coder, etc.
[0054] Traditional methods for generating SQL queries from natural language data often utilize existing question-and-answer pairs to automatically generate SQL queries. However, these queries are frequently overly simplistic and fail to reflect the complexities of real-world queries. Furthermore, existing question-and-answer pair generation methods struggle to produce sufficiently realistic and diverse samples when handling advanced SQL features such as complex multi-table joins and nested subqueries. This not only limits model performance but also hinders the widespread application of this technology in real-world business scenarios.
[0055] The dataset generation method provided in this invention determines the database tables corresponding to a business scenario, and uses a large model to generate table creation statements based on natural language questions and answers in the business scenario. Based on the table creation statements, the large model generates multiple data entries for each database table in batches. The large model then uses these data entries to generate natural language-described questions and matching SQL query statements in batches. The natural language-described questions and matching SQL query statements are then filtered and verified at least once to generate a dataset of natural language question descriptions converted into SQL query statements. By using a large model to generate table creation statements and database table data, and then generating natural language questions and corresponding SQL query statements based on the database table data, the diversity and reliability of the dataset can be guaranteed. It can handle complex multi-table joins, nested subqueries, and other advanced SQL features. The at least one filtering and verification process improves the accuracy of the data in the dataset, which is beneficial for its widespread application in real-world business scenarios.
[0056] Based on any of the above embodiments, the table creation statement includes, but is not limited to, a description of the table name, field descriptions and primary and foreign key relationships, as well as multiple sample data insertion statements for each database table, wherein the sample data of the primary and foreign keys in the multiple sample data insertion statements for each database table are related.
[0057] This invention sets up associations for primary and foreign keys in multiple sample data insertion statements, making it applicable to scenarios involving complex multi-table joins, nested subqueries, and other advanced SQL features.
[0058] In this embodiment of the invention, after determining the table structure required for the business scenario, the large language model generates table creation statements that conform to PostgreSQL syntax. The requirements for the table creation statements are also given: the table name and its description, field definitions and descriptions, primary and foreign key relationships, and sample data insertion statements. It is necessary to ensure that the primary and foreign key relationships of the sample data are correct so that related queries can return valid results. For example, the prompts are: When building a banking financial institution supervision system, the following are some essential core table names: Financial Institution Basic Information Table (Financial_Institution_Info), Balance Sheet (Balance_Sheet), Income Statement (Income_Statement), Capital Adequacy Ratio Table (Capital_Adequacy_Ratio), Liquidity Ratio Table (Liquidity_Ratio), Non-Performing Loans Table (Non_Performing_Loans), Risk Exposure Table (Risk_Exposure), Related Party Transactions Table (Related_Party_Transactions), Regulatory Reports Table (Regulatory_Reports), and Compliance Records Table (Compliance_Records);
[0059] Please provide the PostgreSQL table creation statements for the above tables. The table creation statements should include a description of the table names, field descriptions, primary and foreign key relationships, and several sample data insertion statements for each table. Note that the sample primary and foreign key data should have overlap to ensure that there is data after the join. The generated data should be as realistic as possible to meet the requirements of actual production use. The final result should only return the table creation statements that can be executed in the PostgreSQL database, without any other explanations or descriptions.
[0060] In this embodiment of the invention, after generating the table creation statements in the large model, manual checks and modifications are performed. These checks include verifying whether the database tables and their fields meet the needs of the business scenario, whether fields need to be added or removed, whether the primary and foreign key design of the tables is reasonable, whether the generated sample data matches the actual situation in production, whether the table creation statements are generated as required, and whether there are any errors during execution. If any problems are found, the large model can be further modified and improved based on the specific issues.
[0061] Based on any of the above embodiments, the step of using the large model to generate multiple data records for each database table in batches based on the table creation statement includes:
[0062] Step 201: Based on the table creation statement and data requirement prompts, generate multiple data entries for each table in batches using the large model. The data requirement prompts include at least one of the following: data must comply with field explanations and database attribute constraints; data must be diverse; data must be generated entirely according to actual production data; primary and foreign key fields must correspond consistently; and data between multiple database tables must have overlap.
[0063] Step 202: Execute the table creation statement in the database. If an error occurs, modify the data based on the error information and the table creation statement using the large model.
[0064] Step 203: Receive verification information on whether the data meets the actual production requirements, and based on the verification information, modify or regenerate new data through the large model based on natural language questions and the table creation statement.
[0065] In this embodiment of the invention, after the tables are created, the large model generates 10,000 rows of data for each table based on the table creation statements. Requirements for the generated data are given, such as compliance with field interpretation and database attribute constraints, data diversity, generation entirely based on actual production data, and consistency between primary and foreign key fields to ensure data overlap after joins. Note that this step requires a high-performance large model such as Claude-3.5-Sonnet or ChatGPT. For example... Figure 5 The prompt words shown.
[0066] After generating the database table insert statements, execute them in the database. If errors occur, the model should modify the SQL insert statements based on the error messages. After successful modification and execution, manually check whether the table data in the database meets the requirements of actual production. For example, check whether names and addresses are real names and addresses, whether dates and times conform to the corresponding formats and current business dates and times, whether financial institution names are real financial institution names, and whether financial institution profits and liabilities match real-world situations. If problems are found, modify the model accordingly or regenerate new answers based on the problems and table creation statements. For example, generate 10,000 sample data insert statements for each table. Note that the sample data for primary and foreign keys should have overlap to ensure data is present after a join. The generated data should be as realistic as possible, meeting the actual production requirements of bank risk control scenarios. The final result should only return the table creation statements that can be executed in the PostgreSQL database, without any other explanations or descriptions. The data should be in a Chinese environment, and should be diverse, reasonable, and realistic. For example, names should be real names used in daily life, not names like "Zhang San" or "Li Si".
[0067] The above table creation statements are for reports in a banking regulatory business scenario system. Please generate 10,000 rows of data based on these statements and return them using SQL insert statements. The generated data must strictly adhere to table name comments, field name comments, and database constraints. The data must be diverse, scientific, and reasonable, strictly conforming to real-world and actual production data. For example, names and related transaction parties should be real bank or financial institution names, and the timeframes should be reasonable. For fields such as total assets, deposits, loans, and non-performing loan ratio, the unit of measurement should be (yuan), and the data should be within a reasonable range applicable to actual banking operations. For example, the unit should be uniformly "yuan," and the data range should be as follows:
[0068] Net interest income: RMB 500 million to RMB 5 billion;
[0069] Non-interest income: RMB 100 million to RMB 1 billion;
[0070] Operating expenses: 300 million to 3 billion yuan;
[0071] Loan loss provisions: RMB 50 million to RMB 500 million;
[0072] Net profit: 250 million to 2.5 billion yuan;
[0073] These ranges are reasonable for large commercial banks, but it's important to adjust them based on the institution's total assets or other indicators. Fields with primary and foreign key relationships should maintain consistent data to ensure overlap after table joins. Data should be calculated according to established rules, and the generated data should strictly adhere to these rules. Finally, the SQL query statement for the PostgreSQL database should only return 10,000 rows of inserted data in batches.
[0074] Based on any of the above embodiments, the step of using the large model to generate natural language description questions and SQL query statements matching the natural language description questions in batches based on multiple data entries in each of the database tables includes:
[0075] Step 301: Randomly select a database table and obtain the table information corresponding to the database table. The table information includes at least one of the following: table name, table name comment, field name, field comment, field database attribute, and sampled data.
[0076] Step 302: Generate a natural language description of the question corresponding to the database table and an SQL query statement matching the natural language description of the question based on the table information of the database table using the large model;
[0077] Step 303: Replace the dataset table and generate a corresponding natural language description question and an SQL query statement that matches the natural language description question for each dataset table.
[0078] In this embodiment of the invention, the step of generating a natural language description of the question corresponding to the database table and an SQL query statement matching the natural language description of the question based on the table information of the database table using a large model includes:
[0079] By using different prompt words, different large models, sampling different table information and sampled data, natural language questions and their corresponding SQL query statements are generated synchronously.
[0080] And / or, by using different prompt words, different large models, sampling different table information and sampled data, first generate natural language questions and then generate corresponding SQL query statements.
[0081] In an embodiment of the present invention, after the data generation in actual production is completed, database table information is obtained, that is, table name, table name comment, field name, field comment, field database attribute, sampling data, etc. Based on the information in the database, the large model generates questions described in natural language and SQL query statements. To ensure the diversity and complexity of the generated SQL query statements and questions, through different prompt words, different models, sampling different table information and sampling data, generate questions and SQL synchronously, generate questions first and then generate SQL, etc., in diverse ways, and generate according to different complexity requirements. For example, ask the Claude-3.5-Sonnet and Chatgpt models with the following prompts: [{'table_name':'stock_basic', 'table_comment': None, 'columns': [{'column_name': 'ts_code', 'data_type': 'character varying', 'column_default': None, 'constraints': ['PRIMARY KEY'], 'comment': 'Stock code'}, {'column_name': 'name', 'data_type': 'character varying', 'column_default': None, 'constraints': [], 'comment': 'Stock name'}],'sample_data': {'ts_code': '603012.SH', 'name': 'Chuangli Group'}}, {'table_name':'stock_prices', 'table_comment': None, 'columns': [{'column_name': 'amount', 'data_type': 'double precision', 'column_default': None, 'constraints': [], 'comment': 'Transaction amount'}…].
[0082] Based on the database table information provided above, generate 100 natural language questions and their corresponding SQL queries. The database table information includes table names and descriptions, business descriptions of field data, technical attributes and constraints of the fields, and several sample data entries. Each generated question should involve at least two tables, with each table using at least one field, and should ideally include conditional filtering or aggregation operations. Please try asking questions from different perspectives, including questions based on the sampled data in `sample_data`, to increase the diversity of questions and SQL queries. Please combine multiple tables to generate complex SQL queries with corresponding multi-table joins. Finally, ensure that the generated SQL queries completely correspond to the natural language questions, that field types match during JOIN operations, and that they execute correctly on the PostgreSQL database. The natural language questions should also be clear and easy to understand. The result should be returned in the format: `[{"question":"natural language question description", "sql":"SQL query statement"},{...},{....}]`, without any further explanation.
[0083] In this embodiment of the invention, diverse prompts are designed to guide the large model in generating complex SQL queries. For example: generating an SQL query involving multi-table joins to retrieve all orders of a user and their corresponding product information; generating an SQL query containing nested subqueries to retrieve users whose total order amount is greater than their average order amount; and generating an SQL query containing aggregate functions to retrieve the total number of orders for each user. When generating samples, different prompts are randomly selected to ensure that the generated SQL queries cover various complex scenarios.
[0084] Based on any of the above embodiments, the natural language description of the problem and the SQL query statements matching the natural language description of the problem are filtered and verified at least once to generate a dataset of natural language problem descriptions converted into SQL query statements, including:
[0085] Step 401: Execute the SQL query operation in the database, and filter out SQL query statements that generate errors and SQL query statements that return null values;
[0086] In this embodiment of the invention, after generating natural language questions and corresponding SQL query statements, the SQL query is executed in the database. SQL queries that generate errors or return null values are filtered out. This initial verification filters out the correct SQL queries and questions. Then, a larger model is used for further checking and verification, and problematic data samples are labeled and filtered out. The prompts used in the program for verification through the larger model are as follows: Figure 6 As shown.
[0087] Step 402: Based on the data format and page design requirements of the large model and dataset files, construct a dataset verification system. Based on the dataset verification system, optimize the natural language description of the problem and the SQL query statement that matches the natural language description of the problem, and generate a dataset in which the natural language problem description is converted into an SQL query statement.
[0088] In this embodiment of the invention, a dataset verification system is constructed based on the data format and page design requirements of a large model combined with dataset files, including:
[0089] Step 601: Design the initial version of the prompts based on the page design requirements and dataset format;
[0090] Step 602: Iteratively modify the prompts, add descriptions of the labeled options, clarify the requirements for page layout and interaction design, and generate the final prompts;
[0091] Verification prompts may include phrases such as "Please check if the following SQL statement matches the problem description and verify if its logic is correct," or "Please check if the data returned by the following SQL statement meets the requirements of the business scenario."
[0092] Step 603: Input the final prompt word into the model to generate the initial version of the system code;
[0093] Step 604: Run the initial version of the system code to test whether each function works properly;
[0094] Step 605: Collect user feedback on the system's page layout, interaction design, and functions;
[0095] Step 606: Adjust the prompt words based on the feedback, input the optimized prompt words into the model, and generate a dataset verification system.
[0096] In this embodiment of the invention, based on the Claude-3.5-Sonnet model, combined with the data format of the generated and saved dataset file, and according to the page design requirements, prompt words are designed, iteratively modified and improved, and finally the model generates a dataset verification system that meets the requirements. For example, the prompt words used for the first version of the system are as follows: Figure 7 As shown. After multiple modifications and improvements, the final generated dataset verification system page is as follows. Figure 8 As shown.
[0097] After the initial screening and verification of the dataset samples, the dataset verification system verifies and annotates each sample based on the generated dataset. The main verification contents include whether the natural language question can be converted into an SQL query statement, whether the SQL query accurately reflects the query intent of the question and obtains the correct answer through execution, and whether there are any problems with the SQL association logic, WHERE conditions, aggregation calculations, etc., and whether they correspond to the question described in the natural language.
[0098] like Figure 9 As shown, the dataset generation method provided in this embodiment of the invention specifically includes:
[0099] Step 1: Determine the business scenario:
[0100] Define clear business needs, such as identifying target business scenarios (e.g., banking supervision, e-commerce, logistics). Identify the problems that need to be solved in this scenario (e.g., querying user information, calculating order amounts). Compile a list of problems and answers, listing the key issues in the business scenario, and providing detailed answers to each problem, ensuring that the answers cover all necessary information.
[0101] Step 2: Generate the database table structure:
[0102] Based on the questions and answers in the business scenario, a database table structure containing multiple tables is generated using a large model. Primary and foreign key relationships are ensured between the tables.
[0103] Check whether the generated table structure meets the business requirements and ensure that the fields and constraints in the table structure are reasonable.
[0104] Step 3: Generate table data:
[0105] Generate table data that meets actual production requirements based on the table structure, ensuring that the data conforms to the field constraints and data range in the business scenario.
[0106] Check whether the generated data meets the actual production requirements (such as the reasonableness of fields such as name, address, and date).
[0107] Step 4: Obtain table information and extract data:
[0108] Extract metadata for each table from the database, including table name, field names, field types, constraints, etc. Randomly select a small sample of data from each table for subsequent question and SQL statement generation.
[0109] Step 5: Generate natural language questions and SQL query statements:
[0110] Design diverse prompts based on table information and sample data, ensuring that the prompts cover information such as table name, field name, field comments, and sample data.
[0111] Generate natural language questions and corresponding SQL query statements using a large model, ensuring that each question involves at least two tables and each table uses at least one field.
[0112] Check if the generated SQL statement matches the problem description and ensure that the SQL statement can be executed correctly on PostgreSQL.
[0113] Step 6: Execute the SQL query statement:
[0114] Execute the generated SQL query in the database, capture the execution results and possible error messages, filter the data samples that meet the requirements, and select the SQL statements and data samples that are successfully executed and return non-empty result sets.
[0115] Step 7: Model Validation
[0116] The valid SQL statements, their corresponding questions, and execution results are input into the large model for validation. The validation results are then obtained, including whether the SQL statements are correct and whether the returned data matches the question description.
[0117] Based on the verification results, problematic SQL statements and their descriptions are marked, and problematic data samples are filtered out, retaining high-quality data.
[0118] Step 8: Generate a dataset verification system:
[0119] Design the system's page layout, including a problem display area, SQL display area, execution result area, annotation area, and editing area. Implement system functions, including displaying problems and SQL queries, executing SQL queries, and supporting manual annotation and correction. Deploy the system to a server or local environment for use by human verifiers.
[0120] Step 9: Manual verification of annotations:
[0121] Verification personnel use the system to verify each data sample, marking issues and correcting SQL statements. The verified, high-quality NL2SQL dataset is then saved to a file for later use.
[0122] This invention, based on a large-scale model, constructs information such as the names of the main database tables and descriptions and uses of related tables required for specific business scenarios. Based on this table information, high-performance large-scale models such as Claude-3.5-Sonnet and ChatGPT are used to generate the database table creation statements. After successful table creation, large-scale models such as Claude-3.5-Sonnet and ChatGPT, along with optimized prompts, are used to batch generate real-world table data consistent with actual production environments. The large-scale model uses database table information, including table names, table name comments, field names, field comments, field database attributes, and sampled data, to pose natural language questions and generate corresponding SQL queries. By using different prompts, different models, sampling different table information and sampled data, and employing diverse methods such as generating questions first and then generating SQL, and varying SQL complexity requirements, the diversity and complexity of generated SQL and questions are ensured. A dataset verification system generated using the Claude-3.5-Sonnet large-scale model is used to facilitate manual verification.
[0123] The dataset generation method provided in this embodiment of the invention constructs a high-quality dataset with more than 6,000 single tables, multi-table joins, nested subqueries, aggregate queries, and other types of data across six business scenarios.
[0124] The dataset generation apparatus provided by the present invention is described below. The dataset generation apparatus described below and the dataset generation method described above can be referred to in correspondence.
[0125] Figure 10 This is a schematic diagram of the structure of the dataset generation device provided in an embodiment of the present invention, as shown below. Figure 10 As shown, the dataset generation apparatus provided in this embodiment of the invention includes:
[0126] The table creation statement generation module 1001 is used to determine the database table corresponding to the business scenario and generate a table creation statement based on the natural language questions and answers in the business scenario using a large model.
[0127] Data generation module 1002 is used to generate multiple data entries for each database table in batches based on the table creation statement and the large model.
[0128] The matching generation module 1003 is used to generate natural language description questions and SQL query statements that match the natural language description questions in batches based on multiple data in each of the database tables using the large model;
[0129] The filtering and verification module 1004 is used to perform at least one filtering and verification on the natural language description of the problem and the SQL query statement that matches the natural language description of the problem, and generate a dataset of natural language problem descriptions converted into SQL query statements.
[0130] The dataset generation apparatus provided in this embodiment of the invention determines the database tables corresponding to a business scenario, and uses a large model to generate table creation statements based on natural language questions and answers in the business scenario; based on the table creation statements, the large model generates multiple data entries for each database table in batches; the large model uses the multiple data entries in each database table to generate natural language description questions and SQL query statements matching the natural language description questions in batches; the natural language description questions and the SQL query statements matching the natural language description questions are filtered and verified at least once to generate a dataset in which natural language question descriptions are converted into SQL query statements. By using a large model to generate table creation statements and database table data, and then generating natural language questions and corresponding SQL query statements based on the database table data, the diversity and reliability of the dataset can be guaranteed. It can handle complex multi-table joins, nested subqueries, and other advanced SQL features. The at least one filtering and verification can improve the accuracy of the data in the dataset, which is beneficial for its widespread application in actual business scenarios.
[0131] Figure 11 An example is a schematic diagram of the physical structure of an electronic device, such as... Figure 11 As shown, the electronic device may include a processor 1110, a communications interface 1120, a memory 1130, and a communication bus 1140. The processor 1110, communications interface 1120, and memory 1130 communicate with each other via the communication bus 1140. The memory 1130 includes computer programs, an operating system, and acquired data. The processor 1110 can call logical instructions in the memory 1130 to execute a dataset generation method. This method includes: determining a database table corresponding to a business scenario; generating table creation statements based on natural language questions and answers in the business scenario using a large model; generating multiple data entries for each database table in batches based on the table creation statements using the large model; generating natural language description questions and matching SQL query statements based on the multiple data entries in each database table in batches using the large model; and performing at least one filtering and verification on the natural language description questions and matching SQL query statements to generate a dataset of natural language question descriptions converted into SQL query statements.
[0132] Furthermore, the logical instructions in the aforementioned memory 1130 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to related technologies, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0133] On the other hand, the present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements a dataset generation method provided by the above methods. This method includes: determining a database table corresponding to a business scenario; generating table creation statements based on natural language questions and answers in the business scenario using a large model; generating multiple data entries for each database table in batches based on the table creation statements using the large model; generating natural language-described questions and SQL query statements matching the natural language-described questions in batches based on the multiple data entries in each database table using the large model; performing at least one filtering and verification on the natural language-described questions and the SQL query statements matching the natural language-described questions to generate a dataset of natural language question descriptions converted into SQL query statements.
[0134] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0135] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the parts that contribute to the related technology, can be embodied in the form of software products. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0136] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A data set generation method characterized by, include: Determine the database tables corresponding to the business scenario, and use the large model to generate table creation statements based on the natural language questions and answers in the business scenario; Based on the table creation statement, the large model is used to generate multiple data entries for each of the database tables in batches. The large model is used to generate natural language description questions and SQL query statements that match the natural language description questions in batches based on multiple data records in each of the database tables. The natural language problem description and the SQL query statement matching the natural language problem description are filtered and verified at least once to generate a dataset in which the natural language problem description is converted into SQL query statement; The table creation statement includes a description of the table name, field descriptions, and primary and foreign key relationships, as well as multiple sample data insertion statements for each database table. The sample data of the primary and foreign keys in the multiple sample data insertion statements for each database table are related. The process of generating multiple rows of data for each database table in batches based on the table creation statement and the large model includes: Based on the table creation statement and data requirement prompts, multiple data entries are generated in batches for each table through the large model. The data requirement prompts include at least one of the following: the data must meet the constraints of field explanations and database attributes; the data must be diverse; the data must be generated completely according to the data in actual production; the primary and foreign key fields must correspond and be consistent; and the data between multiple database tables must have an intersection. If an error occurs when the table creation statement is executed in the database, the data is modified based on the error information and the table creation statement in the large model. The system receives verification information indicating whether the data meets actual production requirements. Based on this verification information, it modifies or regenerates new data using the large model based on natural language questions and the table creation statements.
2. The dataset generation method according to claim 1, characterized in that, The process of using the large model to generate natural language descriptions of questions and matching SQL query statements based on multiple data entries in each of the database tables includes: Randomly select a database table and obtain the table information corresponding to the database table. The table information includes at least one of the following: table name, table name comment, field name, field comment, field database attribute, and sampled data. The large model generates a natural language description of the question corresponding to the database table and an SQL query statement that matches the natural language description of the question based on the table information of the database table. Replace the dataset tables and generate corresponding natural language description questions and SQL query statements that match the natural language description questions for each dataset table.
3. The data set generation method of claim 2, wherein, The process of generating a natural language description of the question corresponding to the database table and an SQL query statement matching the natural language description of the question based on the table information of the database table using a large model includes: By using different prompt words, different large models, sampling different table information and sampled data, natural language questions and their corresponding SQL query statements are generated synchronously. And / or, by using different prompt words, different large models, sampling different table information and sampled data, first generate natural language questions and then generate corresponding SQL query statements.
4. The data set generation method of claim 1, wherein, The natural language problem description and the SQL query statements matching the natural language problem description are filtered and validated at least once to generate a dataset of natural language problem descriptions converted into SQL query statements, including: The SQL query statement is executed in the database, and the SQL query statements that generate errors and those that return null values are filtered out.
5. The data set generation method of claim 1, wherein, The natural language problem description and the SQL query statements matching the natural language problem description are filtered and validated at least once to generate a dataset of natural language problem descriptions converted into SQL query statements, including: Based on the data format and page design requirements of the large model and dataset files, a dataset verification system is constructed. Based on the dataset verification system, the natural language description of the problem and the SQL query statement matching the natural language description of the problem are optimized to generate a dataset in which the natural language problem description is converted into an SQL query statement.
6. A data set generating apparatus characterized by comprising: include: The table creation statement generation module is used to determine the database table corresponding to the business scenario and generate table creation statements based on the natural language questions and answers in the business scenario using a large model; The data generation module is used to generate multiple data records for each database table in batches based on the table creation statement and the large model. The table creation statement includes a description of the table name, field descriptions and primary and foreign key relationships, as well as multiple sample data insertion statements for each database table, wherein the sample data of the primary and foreign keys in the multiple sample data insertion statements for each database table are related. The process of generating multiple data entries for each database table using the large model based on the table creation statement includes: generating multiple data entries for each table using the large model based on the table creation statement and data requirement prompts, wherein the data requirement prompts include at least one of the following: data must meet the constraints of field interpretation and database attributes; data must be diverse; data must be generated entirely according to actual production data; primary and foreign key fields must correspond consistently; and data between multiple database tables must have overlap. The process involves executing the table creation statement in the database; if an error occurs, modifying the data based on the error information and the table creation statement using the large model; receiving verification information indicating whether the data meets actual production requirements; and modifying or regenerating new data based on the verification information, using the large model based on natural language processing and the table creation statement. The matching generation module is used to generate natural language description questions and SQL query statements that match the natural language description questions in batches based on multiple data in each of the database tables using the large model. The filtering and verification module is used to perform at least one filtering and verification on the natural language description of the question and the SQL query statement that matches the natural language description of the question, and generate a dataset in which the natural language question description is converted into SQL query statement.
7. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that, When the processor executes the program, it implements the dataset generation method as described in any one of claims 1 to 5.
8. A non-transitory readable storage medium having stored thereon a computer program, characterized in that, When the computer program is executed by a processor, it implements the dataset generation method as described in any one of claims 1 to 5.
Citation Information
Patent Citations
Data processing method, system and device, storage medium and program product
CN118132684A
System and method for exchanging data and commands between an object oriented system and relational system
US6163776A