A method and system for generating intelligent SQL question answering based on a large-scale database model.
By constructing a large model prompt word and vector database, and utilizing the embedding vectors describing SQL historical statements and business query requests, accurate SQL queries are generated, solving the problem of low accuracy in NL2SQL and improving user experience and data query accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- YUNNAN BAIYAO GRP MEDICINE E-COMMERCE CO LTD
- Filing Date
- 2024-12-30
- Publication Date
- 2026-06-30
Smart Images

Figure CN122309534A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of computer software technology, and in particular to a method and system for generating intelligent SQL question answering based on a large model database. Background Technology
[0002] Large models refer to machine learning models with a massive number of parameters and complex computational structures. These models are typically built from deep neural networks and have billions or even hundreds of billions of parameters. The purpose of large models is to improve their expressive power and predictive performance, enabling them to handle more complex tasks and data. Large models have wide applications in various fields, including natural language processing, computer vision, speech recognition, and recommender systems. By training on massive amounts of data to learn complex patterns and features, large models possess stronger generalization capabilities and can make accurate predictions on unseen data.
[0003] Database SQL execution history is a mechanism used to record the SQL statements executed in a database. It helps database administrators and developers with: 1. Auditing: Tracking modifications made to the database by users or the system. 2. Debugging: Viewing the execution of SQL statements to help locate problems. 3. Performance analysis: Understanding the execution frequency and duration of SQL statements for performance optimization. 4. Auditing: Ensuring compliance with requirements such as GDPR. Different databases support their own methods for querying database SQL execution history; for example, MySQL uses `general_log`, PostgreSQL uses `pg_stat_statements`, etc.
[0004] NL2SQL (Natural Language to SQL) is a technology that combines natural language processing and database querying. It aims to convert user queries in natural language into SQL query statements, thereby enabling automatic interaction between natural language question answering and databases.
[0005] ChatBI, also known as conversational BI, is a business intelligence tool based on Natural Language Processing (NLP) technology. It interacts with users through dialogue, helping them quickly obtain the data and analysis results they need. However, in practical applications, ChatBI still has some shortcomings, including limitations in conditional queries, insufficient semantic matching, and limitations in user experience. Its NL2SQL technology path, which directly generates SQL from a large language model, is prone to problems such as low data query accuracy (60%-70%, lower for cross-table queries) and inconsistent data definitions. Summary of the Invention
[0006] In view of this, this application proposes a method for generating intelligent SQL question answering based on a large model database to solve the problems reflected in the background technology.
[0007] This application provides a method for generating intelligent SQL question answering based on a large model database, characterized by comprising: Based on historical SQL statements, construct large model suggestion words to obtain a description of business query requests; Based on the description of the business query request, establish a correspondence with the historical SQL statements and add them to the corresponding vector database; Upon receiving a business query request, it is matched using a vector database to obtain the most similar historical SQL statement and the corresponding business query request description. Based on the most similar historical SQL statements and the corresponding business query request descriptions, construct large model suggestion words and generate the SQL statement for the business query request.
[0008] Optionally, the method for generating intelligent SQL question answering based on a large model database is characterized by: The vector database includes embedding vectors that describe data source names, database names, SQL history statements, and business query requests. The embedding vector described in the business query request serves as an index to the vector database; The data source name, database name, SQL history statements, and business query request description are used as fields in the vector database.
[0009] Optionally, upon receiving a business query request, a vector database is used for matching to obtain the most similar historical SQL statement and the corresponding business query request description, including: Received business query requests; The received natural language business query requests are transformed into query vectors; Based on the query vector, the N closest approximate vectors are obtained by querying the vector database according to vector similarity. The fields returned by the obtained similar vector include the data source name, database name, SQL history statements, and business query request description.
[0010] Optionally, the method for generating intelligent SQL question answering based on a large model database is characterized by further comprising: The SQL history statements are exported according to different databases using different SQL history statement query tools or methods; The vector database selected is a database like FAISS and Milvus, which handle large-scale vector data.
[0011] Optionally, the step of constructing large model suggestion words based on SQL history statements includes: A large model of suggestion words is constructed for the SQL history statements. The suggestion words for the SQL history statements include relevant content related to the query request wording. The specific prompts include the name of the database used, the specific content of the SQL history statements used in the query, and the requirement for the large model to provide the two most likely query request formats.
[0012] Optionally, the method for generating intelligent SQL question answering based on a large model database is characterized by further comprising: The results returned by the large model are split to obtain multiple business query request descriptions. Therefore, the SQL history statements and business query request descriptions have a one-to-many relationship.
[0013] Optionally, the step of constructing large model suggestion words based on the most similar historical SQL statements and the corresponding business query request descriptions includes: A large model prompt word is constructed based on the SQL history statements and the corresponding business query request descriptions. For each business query request, a corresponding SQL statement is generated, including relevant content on the corresponding relationship. The specific prompts include the database name used, a description of the query request as an example, the corresponding SQL statement, and a requirement for the large model to provide the corresponding SQL statement based on the specific description of the business query request.
[0014] This application also provides software for constructing a data indicator system based on a large model, characterized in that it obtains corresponding large model prompts based on different business systems and generates data indicator information.
[0015] Optionally, the method for generating a data indicator system based on a large model is characterized in that the generated data indicator information can also be imported into a data indicator platform for CRUD operations.
[0016] This application also provides an electronic device, characterized in that it includes: Memory, processor, and computer programs stored in said memory and executable on said processor The processor is configured to implement the method of any one of claims 1 to 7 when executing the computer program.
[0017] The beneficial effects of this application are: by utilizing the SQL execution history of the database, the accuracy of NL2SQL is significantly improved, overcoming the previous problem of low accuracy of NL2SQL in multi-table question answering, and providing a new method for the application of NL2SQL and ChatBI in more complex databases. Attached Figure Description
[0018] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the accompanying drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the accompanying drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.
[0019] Figure 1 The flowchart illustrates a method for generating intelligent SQL question answering based on a large model, as disclosed in this application. Figure 2 This document presents a flowchart illustrating a business query request disclosed in this application. Detailed Implementation
[0020] Various exemplary embodiments, features, and aspects of this application will now be described in detail with reference to the accompanying drawings. The same reference numerals in the drawings denote elements that have the same or similar functions. Although various aspects of the embodiments are shown in the drawings, they are not necessarily drawn to scale unless specifically indicated otherwise.
[0021] The terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Therefore, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this application, "multiple" means two or more, unless otherwise explicitly specified.
[0022] The term “exemplary” as used herein means “serving as an example, embodiment, or illustration.” Any embodiment illustrated herein as “exemplary” is not necessarily to be construed as superior to or better than other embodiments.
[0023] Furthermore, to better illustrate this application, numerous specific details are provided in the following detailed embodiments. Those skilled in the art should understand that this application can be implemented without certain specific details. In some instances, methods, means, components, and circuits well-known to those skilled in the art have not been described in detail in order to highlight the main points of this application.
[0024] This application applies to the creation of a knowledge base mapping SQL to natural language questions by annotating SQL statements in the SQL execution history of a database using a large model, and then storing this knowledge in a vector database. During NL2SQL question answering, the natural language question is first matched against the vector database to obtain the closest SQL statement. The matched question and SQL statement are then used as context to ask the large model, resulting in a more accurate SQL statement. This provides a new method for applying NL2SQL and ChatBI to complex databases.
[0025] NL2SQL (Natural Language to SQL) is a technology that combines natural language processing and database querying. It aims to convert user queries, input in natural language, into SQL queries, thereby enabling automatic interaction between natural language question answering and the database. Following a typical enterprise reporting workflow and the development approach, we first obtain customer text information through the UI or front-end, transmit it to the back-end for database storage, and then either display the work order or directly query the database. Data developers write the SQL code, and finally, the query results are visualized.
[0026] ChatBI, also known as conversational BI, is a business intelligence tool based on Natural Language Processing (NLP) technology. It interacts with users through dialogue, helping them quickly obtain the data and analysis results they need. The core of ChatBI lies in its ability to understand and process natural language queries, transforming complex data analysis needs into simple conversations. Compared to traditional BI, it is simpler and easier to use, further simplifying the data analysis process and improving enterprise decision-making efficiency.
[0027] like Figure 1 As shown in the flowchart, the method for generating intelligent SQL question answering based on a large model includes the following: S100: Based on historical SQL statements, construct large model suggestion words to obtain a description of the business query request.
[0028] Specifically, a large model of query suggestions is constructed using historical SQL statements obtained from the database. These suggestions must include the name of the queried database. Based on these suggestions, the large model is used to obtain a description of the business query request. The large model returns multiple descriptions of business query requests, thus establishing a one-to-many correspondence between historical SQL statements and their corresponding descriptions. The historical SQL statements refer to the recorded and viewed history of SQL statements executed in the database, which is extremely useful for database auditing, performance analysis, and troubleshooting.
[0029] The SQL history statement construction model includes prompts related to the query request format. Specifically, the prompt is: "You are an AAA database expert, meaning you understand both business logic and databases. A business expert described a database query requirement in Chinese, and you wrote the corresponding SQL statement. Please directly provide the two most likely ways the business expert might have phrased the query request. SQL statement: XXX." Here, XXX represents the specific SQL statement, and AAA is the name of the database being queried.
[0030] S200 establishes a correspondence between the business query request description and the historical SQL statements, and adds them to the corresponding vector database.
[0031] Specifically, the business query request description and its embedding vector obtained in S100 are linked with the SQL history statements, database names, and data source names in the vector database using a function to facilitate subsequent query operations. Furthermore, the business query request description, its embedding vector, the SQL history statements, the database names, and the data source names are added to the vector database.
[0032] The embedding vector refers to the process of mapping high-dimensional data (such as text, images, and videos) to a low-dimensional space. An embedding vector is an N-dimensional real-valued vector that represents the input data as points in a continuous numerical space. In large models, because embedding vectors contain semantic information, words with similar meanings will have their embedding vectors located close together in space. S300, upon receiving a business query request, uses a vector database to match and obtain the most similar historical SQL statement and the corresponding business query request description.
[0033] Specifically, when a user queries a business request description using natural language, the system first converts the natural language question into a vector. This vector is then matched against the vector database described in S200 to obtain the N results that are closest to the vector. The returned results include the business query request description, the SQL history statements, the database name, and the data source name.
[0034] S400: Based on the most similar historical SQL statement and the corresponding business query request description, construct a large model prompt word and generate the SQL statement for the business query request.
[0035] Specifically, the large model prompt words are constructed by using the closest SQL historical statement obtained in S300 and its corresponding business query request description. The prompt words specify the correspondence between the business query request description and the SQL historical statement, and the large model is required to provide the SQL statement corresponding to the business query request based on the correspondence.
[0036] Specifically, a large model prompt is constructed for the SQL historical statements and their corresponding business query request descriptions. For each business query request, a corresponding SQL statement is generated, including relevant content about the corresponding relationships. The specific prompt is: "You are an AAA database expert. Given that the SQL statement corresponding to business query request description YYY is XXX, please provide the SQL statement corresponding to the following business query request description ZZZ." XXX is the closest historical SQL statement matched from the vector database in S300, YYY is the business query request description corresponding to XXX in S300, ZZZ refers to the newly proposed business query request, and AAA is the name of the query database.
[0037] The above method involves matching natural language questions into a vector database to obtain the closest historical SQL statements. Then, using newly proposed questions and SQL statements as the corresponding context, a larger model is asked to obtain more accurate SQL statements, thus improving the accuracy of NL2SQL and overcoming the previous problem of low accuracy in multi-table question answering.
[0038] like Figure 1 As described in S200, the business query request description and SQL history statements are established in a corresponding relationship and added to the corresponding vector database.
[0039] Specifically, establish a mapping relationship between SQL history statements and business query request descriptions, and add data such as data source name, database name, SQL history statements, business query request descriptions, and embedding vectors of business query request descriptions to the corresponding vector database.
[0040] The embedding vector describing the business query request serves as the index of the vector database; the data source name, database name, SQL history statements, and business query request description serve as fields in the vector database.
[0041] Specifically, SQL history statements are exported from different databases, and different databases support different SQL history statement query tools or methods. For example, PostgreSQL uses pg_stat_statements, MySQL uses performance_schema.events_statements_history, SQL Server uses sys.dm_exec_query_stats, Oracle uses v$sql and DBA_HIST_SQLTEXT, etc. Third-party tools can also be used to support obtaining database SQL history statements.
[0042] Specifically, vector databases such as FAISS and Milvus, which handle large-scale vector data, can be selected.
[0043] like Figure 1 As described in S300, the received business query requests are matched using a vector database, and the specific process is as follows: Figure 2 The flowchart of the business query request is shown, including the following: Specifically, when a new business query request is received, the natural language description of the business query request is first converted into a vector; then, the vector database is searched based on the converted vector to find vectors that are similar to this vector; then, the vector database will return N closest results (N is a natural number such as 1, 2, 3); the fields contained in these returned results are the data source name, database name, SQL history statements, and business query request description.
[0044] like Figure 1 As described in S100, a large model of prompt words is constructed based on historical SQL statements.
[0045] Specifically, we construct large-scale model prompts and generate corresponding business query request descriptions for SQL historical statements through large-scale model question answering.
[0046] Taking the Oracle EBS system as an example, the prompt could be: "You are an Oracle EBS database expert, meaning you understand both business logic and databases. A business expert described a database query requirement in Chinese, and you are asked to write the corresponding SQL statement. You then wrote the following SQL statement. Please directly provide the two most likely ways the business expert might have phrased the query request. SQL statement: XXX", where XXX is the specific SQL statement. The results returned by the large model can be broken down into multiple business query request descriptions, with a one-to-many relationship between the SQL statements and the business query request descriptions.
[0047] The Oracle EBS system is an enterprise-level comprehensive application software suite. This software mainly uses Oracle's database technology and is based on a modular architecture design. It provides comprehensive management tools and data analysis methods for enterprise management, and integrates multiple management modules such as enterprise resource planning, customer relationship management, and supply chain management.
[0048] like Figure 1 As described in S400, large model suggestion words are constructed based on the most similar historical SQL statements and the corresponding business query request descriptions.
[0049] Specifically, a large model of prompts is constructed, using the most similar historical SQL statements and their corresponding business query request descriptions as context to generate corresponding SQL statements for new business query requests.
[0050] Taking the Oracle EBS system as an example, the prompt is "You are an Oracle EBS database expert. The SQL statement corresponding to the business query request description YYY is known to be XXX. Please provide the SQL statement corresponding to the business query request description ZZZ below." Here, XXX is the closest historical SQL statement matched from the vector database in the fourth step, YYY is the business query request description corresponding to XXX in the fourth step, and ZZZ refers to the newly proposed business query request.
[0051] The various embodiments of this application have been described above. These descriptions are exemplary and not exhaustive, nor are they limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles, practical application, or improvement of the technology in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.
Claims
1. A method for generating intelligent SQL question-answering based on a large-scale database model, characterized in that, include: Based on historical SQL statements, construct large model suggestion words to obtain a description of business query requests; Based on the description of the business query request, establish a correspondence with the historical SQL statements and add them to the corresponding vector database; Upon receiving a business query request, it is matched using a vector database to obtain the most similar historical SQL statement and the corresponding business query request description. Based on the most similar historical SQL statements and the corresponding business query request descriptions, construct large model suggestion words and generate the SQL statement for the business query request.
2. The method for generating intelligent SQL question answering based on a large model database as described in claim 1, characterized in that: The vector database includes embedding vectors that describe data source names, database names, SQL history statements, and business query requests. The embedding vector described in the business query request serves as an index to the vector database; The data source name, database name, SQL history statements, and business query request description are used as fields in the vector database.
3. The method for generating intelligent SQL question answering based on a large model database as described in claim 1, characterized in that, Upon receiving a business query request, a vector database is used for matching to obtain the most similar historical SQL statement and the corresponding business query request description, including: Received business query requests; The received natural language business query requests are transformed into query vectors; Based on the query vector, the N closest approximate vectors are obtained by querying the vector database according to vector similarity. The fields returned by the obtained similar vector include the data source name, database name, SQL history statements, and business query request description.
4. The method for generating intelligent SQL question answering based on a large model database as described in claim 1, characterized in that, Also includes: The SQL history statements are exported according to different databases using different SQL history statement query tools or methods; The vector database selected is a database like FAISS and Milvus, which handle large-scale vector data.
5. The method for generating intelligent SQL question answering based on a large model database as described in claim 1, characterized in that, The process of constructing large-scale model suggestion words based on SQL history statements includes: A large model of suggestion words is constructed for the SQL history statements. The suggestion words for the SQL history statements include relevant content related to the query request wording. The specific prompts include the name of the database used, the specific content of the SQL history statements used in the query, and the requirement for the large model to provide the two most likely query request formats.
6. The method for generating intelligent SQL question answering based on a large model database as described in claim 5, characterized in that, Also includes: The results returned by the large model are split to obtain multiple business query request descriptions. Therefore, there is a one-to-many relationship between the SQL history statements and the business query request descriptions.
7. The method for generating intelligent SQL question answering based on a large model database as described in claim 1, characterized in that, The process of constructing large model suggestion words based on the most similar historical SQL statements and corresponding business query request descriptions includes: A large model prompt word is constructed based on the SQL history statements and the corresponding business query request descriptions. For each business query request, a corresponding SQL statement is generated, including relevant content on the corresponding relationship. The specific prompts include the database name used, a description of the query request as an example, the corresponding SQL statement, and a requirement for the large model to provide the corresponding SQL statement based on the specific description of the business query request.
8. A software for constructing a data indicator system based on a large model, characterized in that, Based on different business systems, obtain corresponding large model prompts and generate data indicator information.
9. The method for generating a data indicator system based on a large model according to claim 8, characterized in that, The generated data metrics information can also be imported into the data metrics platform for CRUD operations.
10. An electronic device, characterized in that, include: Memory, processor, and computer programs stored in said memory and executable on said processor The processor is configured to implement the method of any one of claims 1 to 7 when executing the computer program.