Method, device and medium for converting natural language into structured query language

CN122240655A · Pending · Publication Date: 2026-06-19 · HANGZHOU WOQU NETWORK TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HANGZHOU WOQU NETWORK TECH
Filing Date
2026-05-19
Publication Date
2026-06-19

Smart Images

  • Figure CN122240655A_ABST

Abstract

This application relates to the field of natural language processing technology, specifically to methods, devices, and media for converting natural language into structured query language. It includes: acquiring a user-configured business table to be learned and a learning description for the business table; querying the business data corresponding to the business table to be learned and cleaning the business data; inputting the learning description of the business table to be learned, the corresponding business data, and a third prompt word template into a large model to obtain the data relationships output by the large model; storing the data relationships output by the large model into a graph database and a vector database; obtaining supplementary knowledge of the user's question from the graph database and the vector database, and inputting the user's question, the first prompt word template, a preset table structure, and the supplementary knowledge into the large model for reasoning to obtain the structured query language output by the large model. This invention can solve the problem of the large workload associated with manually entering relationships into a knowledge base.

Description

Technical Field

[0001] This invention relates to the field of natural language processing technology, and in particular to methods, devices and media for converting natural language into structured query language.

Background Technology

[0002] Natural Language to SQL (NL2SQL) is a technology that transforms user-generated natural language queries into structured SQL queries. Its purpose is to lower the barrier to database queries, allowing non-technical users without specialized database knowledge to easily obtain the data they need by inputting natural language descriptions. With the development of deep learning technology, NL2SQL is no longer limited to rule-based matching but is based on generative large-scale models to achieve natural language to SQL conversion.

[0003] However, in vertical domains, existing NL2SQL methods still fall short in establishing accurate connections between the natural language question and the database tables, resulting in low accuracy of the generated SQL. Most current NL2SQL implementations are based on general-purpose large models, but these models lack industry knowledge for specific vertical sectors such as finance, power, and banking, as well as internal enterprise data. As a result, the system may be unable to determine which tables and fields to query. For example, if a bank wants to "count high-net-worth users," the rules and definition of "high-net-worth" are enterprise-specific. Without supplementary business knowledge, asking this question directly will result in incorrect SQL. Furthermore, polysemy is common, and existing methods struggle to match ambiguous terms to the correct database fields.

[0004] Furthermore, while retrieval-augmented generation (RAG) knowledge bases can solve some domain-specific data issues, in practice enterprise users need to clean and organize their enterprise knowledge bases themselves. Moreover, most industries already have data dictionaries stored in their databases, and converting this existing database data into knowledge base documents is very time-consuming and labor-intensive. Enterprises also often have many internal rules. For instance, a table might store region codes, while the actual query refers to a region by name, and the relationships between regions are defined by the number of digits in the code, not by the name. For example, the code for a power supply station in a certain region might be 33401, while the code for a power supply station in a sub-region under that region might be 3340102. To calculate the electricity consumption of all districts and counties under that region, the region's name cannot be used; the code 33401 must be used. Many similar reverse-lookup scenarios exist, and manually entering these relationships into the knowledge base would be a considerable workload.

Summary of the Invention

[0005] The purpose of this invention is to provide a method, device, and medium for converting natural language into structured query language, so as to solve the problem of the large workload involved in manually entering relationships into the knowledge base.

[0006] According to a first aspect of the present invention, a method for converting natural language into structured query language is provided, the method comprising the following steps: Obtain the user-configured business table to be learned and the user-configured learning description of the business table to be learned; the learning description of the business table to be learned includes the relationship between the data in the business table to be learned.

[0007] Query the business data corresponding to the business table to be learned, and clean the business data; the data cleaning includes removing duplicate data and outdated data.

[0008] The learning description of the business table to be learned, the corresponding business data, and the third prompt word template are input into the large model to obtain the data relationships output by the large model; the third prompt word template includes an instruction to learn the data relationships.

[0009] Store the data relationships output by the large model into a graph database and a vector database.

[0010] Supplementary knowledge of the user's question is obtained from graph databases and vector databases. The user's question, the first prompt word template, the preset table structure, and the supplementary knowledge are input into the large model for reasoning to obtain the structured query language output by the large model. The first prompt word template includes a task description that converts the user's question into a structured query language.

[0011] Furthermore, the process of acquiring supplementary knowledge of the user's question includes: identify the keywords in the user's question and determine the type of each keyword; the types of keywords include first-type keywords and second-type keywords; the first-type keywords are keywords configured in the retrieval-augmented generation knowledge base, and the second-type keywords are all keywords other than the first-type keywords.

[0012] Obtain the ratio of the number of first-category keywords in user questions to the total number of keywords in user questions, and adjust the initial recall weights of first-category keywords, second-category keywords, and vectorized initial recall weights based on the ratio to obtain the target recall weights of first-category keywords, second-category keywords, and vectorized target recall weights; the initial recall weight of first-category keywords is greater than the initial recall weight of second-category keywords, and the target recall weight of first-category keywords is greater than the target recall weight of second-category keywords.

[0013] The combined weight of each set of search results recalled from the graph database and the vector database is determined based on the target recall weight of the first type of keywords, the target recall weight of the second type of keywords, and the vectorized target recall weight. Supplementary knowledge is then determined based on the combined weight of each set of search results recalled.

[0014] Furthermore, determining the comprehensive weight of each set of search results recalled from the graph database and the vector database based on the target recall weight of the first type of keywords, the target recall weight of the second type of keywords, and the vectorized target recall weight includes: for any set of recalled search results, if the search result is recalled only by keyword recall, its comprehensive weight is the sum of the target recall weights of the corresponding keywords; if the search result is recalled only by vectorized recall, its comprehensive weight is the vectorized target recall weight; if the search result is recalled by both keyword recall and vectorized recall, its comprehensive weight is the sum of the target recall weights of the corresponding keywords and the vectorized target recall weight.

[0015] Furthermore, adjusting the initial recall weight of the first type of keyword, the initial recall weight of the second type of keyword, and the vectorized initial recall weight based on the ratio includes: when the ratio of the number of first-type keywords to the total number of keywords in the user's question is less than or equal to a preset ratio threshold, the initial recall weight of the first type of keyword is increased and determined as the target recall weight of the first type of keyword; the initial recall weight of the second type of keyword is decreased and determined as the target recall weight of the second type of keyword; and the vectorized initial recall weight is decreased and determined as the vectorized target recall weight.

[0016] Furthermore, the process of obtaining the table structure includes: the user question, the second prompt word template, the table names included in the data source, the field names included in each table of the data source, and the supplementary knowledge are input into the large model for reasoning, and a list of table names output by the large model is obtained; the second prompt word template includes an instruction to filter, from the table names included in the data source, the table names related to the user question.

[0017] The table structure is constructed based on the list of table names output by the large model; the table structure includes the field names corresponding to each table name in the list of table names and the explanation of the fields.

[0018] Furthermore, the method also includes: the structured query language output by the large model is subjected to syntax validation. If the validation fails, the structured query language output by the large model is determined to be an erroneous result. Otherwise, the structured query language output by the large model is executed. If an error occurs during execution, the structured query language output by the large model is determined to be an erroneous result. Otherwise, the prompt composed of the user question, the first prompt word template, the preset table structure, and the supplementary knowledge is input into the large model multiple times for reasoning, and the structured query language corresponding to the target question is obtained through voting.

[0019] Furthermore, determining supplementary knowledge based on the comprehensive weight of each group of retrieved results includes: sorting the retrieved results according to the corresponding comprehensive weight, and determining the top preset number of retrieved results with the largest comprehensive weight as supplementary knowledge.

[0020] Furthermore, the retrieval-augmented generation knowledge base includes glossary entries, business logic explanations, and case studies.

[0021] According to a second aspect of the present invention, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the above-described natural language to structured query language processing method.

[0022] According to a third aspect of the present invention, a computer-readable storage medium is provided, the computer-readable storage medium storing a computer program, which, when executed by a processor, implements the above-described natural language to structured query language processing method.

[0023] Compared with the prior art, the present invention has at least the following beneficial effects: This invention inputs the learning instructions of the business table to be learned (including the relationships between data in the business table), the corresponding business data, and a third prompt word template including the relationships between the learning data into a large model. The large model then obtains the data relationships existing in the business table and stores these data relationships in a graph database and a vector database. Thus, this invention can automatically learn the knowledge in the business table and obtain the data relationships. This avoids the need for manual input of these data relationships from the business table into the graph database and vector database, reducing the workload of manual labor.

[0024] Furthermore, this invention determines the type of keywords in the user question and extracts core information (first-type keywords) and auxiliary information (second-type keywords) directly related to the retrieval enhancement knowledge base from the user question. The retrieval enhancement knowledge base is configured by the user, is highly specialized, and stores domain knowledge lacking in large models. Compared to second-type keywords, if first-type keywords exist in the user question, the knowledge corresponding to those first-type keywords is more important and can more effectively compensate for the lack of domain knowledge in large models. The recall weight of keywords and the recall weight of vectorization are adjusted according to the ratio of the number of first-type keywords to the total number of keywords in the user question. The ratio of the number of first-type keywords to the total number of keywords in the user question can characterize the influence of first-type keywords on the vectorized results corresponding to the user question, indirectly reflecting the degree of correlation between the vectorized results and first-type keywords. Adjusting the recall weight of keywords and the recall weight of vectorization according to this ratio also changes the degree of reference of the keyword recall method and the vectorized recall method in the recall stage, so that the knowledge recalled according to the comprehensive weight is more in line with the user's actual needs. Based on this, the recalled knowledge can be input into the large model to make up for the lack of domain knowledge in the large model and improve the accuracy of SQL generation.

Attached Figure Description

[0025] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0026] Figure 1 is a flowchart of the natural language to structured query language processing method provided in Embodiment 1 of the present invention; Figure 2 is a flowchart of the process of acquiring supplementary knowledge of the user's question provided in Embodiment 1 of the present invention; Figure 3 is a flowchart of the process of obtaining the table structure provided in Embodiment 1 of the present invention.

Detailed Implementation

[0027] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0028] Embodiment 1: According to this embodiment, as shown in Figure 1, a method for converting natural language into structured query language is provided, the method comprising the following steps: S100, obtain the user-configured business table to be learned and the user-configured learning description of the business table to be learned; the learning description of the business table to be learned includes the relationships between the data in the business table to be learned.

[0029] For example, the business table stores region codes. The learning description configured by the user is: the relationship between regions is determined by the number of digits in the code; for example, the code of a power supply station in a certain region is 33401, and the code of a power supply station in a sub-region under this region is 3340102.
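The region-code rule above can be illustrated with a minimal Python sketch. It assumes, based only on the 33401 / 3340102 example, that a sub-region code is the parent code followed by extra digits; the function names and the column name `region_code` are hypothetical and do not come from the application.

```python
def is_sub_region(parent_code: str, child_code: str) -> bool:
    """Return True if child_code extends parent_code with extra digits.

    Assumption: the hierarchy is a prefix relation, as suggested by the
    example codes 33401 (region) and 3340102 (sub-region).
    """
    return len(child_code) > len(parent_code) and child_code.startswith(parent_code)


def sub_region_predicate(column: str, parent_code: str):
    """Build a parameterized SQL predicate selecting codes that start with parent_code.

    Note that a prefix match also selects the parent code itself.
    """
    return f"{column} LIKE ?", (parent_code + "%",)


predicate, params = sub_region_predicate("region_code", "33401")
```

A generated query would then filter on the code prefix rather than on the region's name.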

[0030] S200, query the business data corresponding to the business table to be learned, and clean the business data; the data cleaning includes removing duplicate data and outdated data.
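As a rough illustration of the cleaning in S200, the sketch below drops duplicate and outdated rows. The application does not define either term, so two assumptions are made here: rows are duplicates when they share the same values in a set of key fields, and a row is outdated when its date field is earlier than a cutoff. All names are hypothetical.

```python
from datetime import date


def clean_rows(rows, key_fields, date_field, cutoff):
    """Drop outdated rows and keep only the first row for each key.

    rows: list of dicts; key_fields: field names that identify a record;
    date_field: name of a field holding datetime.date values;
    cutoff: rows dated before this are treated as outdated (assumption).
    """
    seen = set()
    cleaned = []
    for row in rows:
        if row[date_field] < cutoff:
            continue  # outdated
        key = tuple(row[f] for f in key_fields)
        if key in seen:
            continue  # duplicate
        seen.add(key)
        cleaned.append(row)
    return cleaned


rows = [
    {"code": "33401", "name": "Region A", "updated": date(2025, 3, 1)},
    {"code": "33401", "name": "Region A", "updated": date(2025, 3, 1)},
    {"code": "3340102", "name": "Sub-region A2", "updated": date(2019, 1, 1)},
]
cleaned = clean_rows(rows, ["code"], "updated", date(2024, 1, 1))
```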

[0031] S300, input the learning description of the business table to be learned, the corresponding business data, and the third prompt word template into the large model, and obtain the data relationships output by the large model; the third prompt word template includes an instruction to learn the data relationships.

[0032] S400, store the data relationships output by the large model into a graph database and a vector database.

[0033] In this embodiment, the data relationships output by the large model are stored in a graph database in the form of nodes and edges, and the text corresponding to the data relationships output by the large model is vectorized and stored in a vector database.
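The dual storage described above can be sketched with in-memory stand-ins. This is only an illustration: a real system would use an actual graph database and an embedding model, whereas here the "graph" is an adjacency map and the "vector" is a bag-of-words count. The class name and the relation label `parent_of` are hypothetical.

```python
from collections import Counter, defaultdict


class KnowledgeStore:
    """Toy stand-in for a graph database plus a vector database."""

    def __init__(self):
        self.edges = defaultdict(list)  # head node -> [(relation, tail node)]
        self.vectors = []               # [(relation text, bag-of-words vector)]

    def add_relation(self, head, relation, tail):
        # Graph side: store the relationship as an edge between two nodes.
        self.edges[head].append((relation, tail))
        # Vector side: vectorize the text form of the same relationship.
        text = f"{head} {relation} {tail}"
        self.vectors.append((text, Counter(text.lower().split())))


store = KnowledgeStore()
store.add_relation("33401", "parent_of", "3340102")
```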

[0034] This embodiment inputs the learning instructions for the business table to be learned (including the relationships between data in the business table), the corresponding business data, and a third prompt word template including the relationships between the data to be learned into the large model. The large model then obtains the data relationships existing in the business table and stores these data relationships in the graph database and vector database. Thus, this embodiment can automatically learn the knowledge in the business table and obtain the data relationships. This avoids the need for manual input of these data relationships from the business table into the graph database and vector database, reducing the workload of manual labor.

[0035] S500: Obtain supplementary knowledge of the user's question from the graph database and vector database, and input the user's question, the first prompt word template, the preset table structure and the supplementary knowledge into the large model for reasoning, and obtain the structured query language output by the large model; the first prompt word template includes a task description that converts the user's question into a structured query language.
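A minimal sketch of how the inputs of S500 might be assembled into one prompt follows. The wording of the template is an assumption made for illustration; the application only states that the first prompt word template contains a task description for converting the user's question into structured query language.

```python
# Hypothetical template text; not taken from the application.
FIRST_PROMPT_TEMPLATE = (
    "Task: convert the user's question into a SQL query.\n"
    "Table structure:\n{table_structure}\n"
    "Supplementary knowledge:\n{knowledge}\n"
    "Question: {question}\n"
    "SQL:"
)


def build_sql_prompt(question, table_structure, knowledge, template=FIRST_PROMPT_TEMPLATE):
    """Fill the template with the question, table structure, and recalled knowledge."""
    knowledge_text = "\n".join(f"- {item}" for item in knowledge)
    return template.format(
        table_structure=table_structure,
        knowledge=knowledge_text,
        question=question,
    )


prompt = build_sql_prompt(
    "Total electricity consumption under region 33401?",
    "Table usage: region_code, kwh",
    ["Sub-region codes start with the parent region code."],
)
```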

[0036] As a specific implementation, the process of acquiring supplementary knowledge of the user's question includes, as shown in Figure 2: S501, identify the keywords in the user's question and determine the type of each keyword; the types of keywords include first-type keywords and second-type keywords; the first-type keywords are keywords configured in the retrieval-augmented generation knowledge base, and the second-type keywords are all keywords other than the first-type keywords.

[0037] In this embodiment, the purpose of determining the type of keywords in the user's question is to make a preliminary diagnosis of the user's domain knowledge requirement. The appearance of the first type of keywords directly indicates that the user's question touches on the knowledge blind spot of the large model; the purpose of keyword classification is not for simple word grouping, but for the subsequent quantification of the urgency and direction of knowledge supplementation.

[0038] In this embodiment, the retrieval-augmented generation knowledge base is constructed in a targeted manner: it contains enterprise-specific concepts that a large model cannot learn, or cannot fully learn, during pre-training (such as the specific formula for the potential-customer conversion rate, or the specific association between activity codes and marketing activities). The keyword classification process described above maps the question to these specific concepts. As a specific implementation, the retrieval-augmented generation knowledge base includes glossary entries, business logic explanations, and case studies. Glossary entries resolve semantic ambiguity: when a term is ambiguous, the term and its specific semantic explanation can be configured in the glossary. Business logic explanations attach business rules to keywords. For example, if a power company wants to query the number of meters that are abnormally running backwards, it can configure the keyword "meters running backwards" in the business logic and configure the rule explanation as: the daily frozen forward / reverse total active energy reading, or the reading at any of the peak / flat / valley rates, minus the previous day's forward / reverse total active energy reading is less than 0; if any of these conditions is met, the meter is determined to be running backwards. Case studies let the system learn an approach from examples: when a corresponding question is asked, the system can generate accurate SQL based on the case studies.

[0039] As a specific implementation method, Natural Language Processing (NLP) tools are used to extract keywords from user queries. For any extracted keyword, if the extracted keyword matches a pre-configured keyword in the retrieval enhancement knowledge base, then the keyword is determined to be a first-category keyword; otherwise, the keyword is determined to be a second-category keyword.
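The classification rule just described can be sketched as follows. Keyword extraction itself is left out, since the application only says an NLP tool is used; the example keywords and the exact-match comparison are assumptions made for illustration.

```python
def classify_keywords(keywords, kb_keywords):
    """Split extracted keywords into first-type (configured in the
    knowledge base) and second-type (everything else)."""
    first_type = [k for k in keywords if k in kb_keywords]
    second_type = [k for k in keywords if k not in kb_keywords]
    return first_type, second_type


# Hypothetical data: keywords extracted from a question, and the set of
# keywords pre-configured in the knowledge base.
kb_keywords = {"meters running backwards", "high-net-worth"}
extracted = ["meters running backwards", "count", "last month"]
first_type, second_type = classify_keywords(extracted, kb_keywords)
```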

[0040] S502, obtain the ratio of the number of first-category keywords in the user question to the total number of keywords in the user question, and adjust the initial recall weight of the first-category keywords, the initial recall weight of the second-category keywords, and the vectorized initial recall weight according to the ratio to obtain the target recall weight of the first-category keywords, the target recall weight of the second-category keywords, and the vectorized target recall weight; the initial recall weight of the first-category keywords is greater than the initial recall weight of the second-category keywords, and the target recall weight of the first-category keywords is greater than the target recall weight of the second-category keywords.

[0041] In this embodiment, the ratio of the number of first-category keywords in the user question to the total number of keywords in the user question quantifies the specialization level of the current query. This ratio characterizes the influence of the first-category keywords on the vectorized results corresponding to the user question, indirectly reflecting the correlation between the vectorized results and the first-category keywords. As a specific implementation, adjusting the initial recall weight of the first-category keywords, the initial recall weight of the second-category keywords, and the initial recall weight of the vectorized results based on this ratio includes: when the ratio of the number of first-category keywords to the total number of keywords in the user question is less than or equal to a preset threshold, increasing the initial recall weight of the first-category keywords to determine the target recall weight of the first-category keywords, decreasing the initial recall weight of the second-category keywords to determine the target recall weight of the second-category keywords, and decreasing the initial recall weight of the vectorized results to determine the target recall weight of the vectorized results. Therefore, when the ratio is small, meaning the first type of keywords in the user question have a small impact on the vectorized results corresponding to the user question, and the correlation between the vectorized results and the first type of keywords is also small, this embodiment increases the target recall weight of the first type of keywords and decreases the target recall weight of the second type of keywords and the vectorized results. 
This is beneficial for increasing the overall weight of the search results corresponding to the first type of keywords, which is beneficial for prioritizing the recall of the search results corresponding to the first type of keywords, and also beneficial for prioritizing the recall of knowledge missing in the large model. Conversely, if the ratio is large, then the first type of keywords in the user question have a large impact on the vectorized results corresponding to the user question, and the correlation between the vectorized results and the first type of keywords is also large. In this case, even without adjusting the weights, it is possible to ensure the priority recall of knowledge missing in the large model.

[0042] As a specific implementation, the preset ratio threshold is an empirical value. For example, the preset ratio threshold is 0.3, 0.4, or 0.5. When the ratio is less than the preset ratio threshold T, the target recall weight of the first type of keyword is w1’, w1’ = w1×(1 + k1×(1 - r)), where w1 is the initial recall weight of the first type of keyword, k1 is the preset weight influence coefficient of the ratio on the first type of keyword, k1>0, and r is the ratio. The target recall weight of the second type of keyword is w2’, w2’ = w2×(k2×r + c1), where w2 is the initial recall weight of the second type of keyword, k2 is the preset weight influence coefficient of the ratio on the second type of keyword, 0<k2<1, c1 is the minimum value of the target recall weight of the second type of keyword, c1>0, and k2×T + c1<1. The target recall weight of vectorization is w3’, w3’ = w3×(k3×r + c2), where w3 is the initial recall weight of vectorization, k3 is the preset weight influence coefficient of the ratio on vectorization, 0<k3<1, c2 is the minimum value of the target recall weight of vectorization, c2>0, and k3×T + c2<1. Among them, w1, w2, w3, k1, k2, k3, c1, and c2 can be empirical values or obtained by fitting based on historical data (the loss of fitting can be measured by the accuracy of the structured query language finally output by the large model). When the ratio is greater than or equal to the preset ratio threshold T, the weights are not adjusted, that is, the initial recall weight of the first type of keyword is equal to the target recall weight of the first type of keyword, the initial recall weight of the second type of keyword is equal to the target recall weight of the second type of keyword, and the initial recall weight of vectorization is equal to the target recall weight of vectorization.
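The three formulas above can be written out as a short Python sketch. The symbols follow the paragraph (r, T, w1 to w3, k1 to k3, c1, c2); the class and function names, and the numeric values in the usage lines, are illustrative assumptions rather than values from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecallWeights:
    first: float   # w1: first-type keyword recall weight
    second: float  # w2: second-type keyword recall weight
    vector: float  # w3: vectorized recall weight


def adjust_weights(initial, r, T, k1, k2, k3, c1, c2):
    """Return target weights for ratio r and threshold T.

    Below the threshold: w1' = w1*(1 + k1*(1 - r)),
    w2' = w2*(k2*r + c1), w3' = w3*(k3*r + c2).
    At or above the threshold the weights are left unchanged.
    """
    if r >= T:
        return initial
    return RecallWeights(
        first=initial.first * (1 + k1 * (1 - r)),
        second=initial.second * (k2 * r + c1),
        vector=initial.vector * (k3 * r + c2),
    )


# Illustrative values only; they satisfy k2*T + c1 < 1 and k3*T + c2 < 1.
initial = RecallWeights(first=1.0, second=0.5, vector=0.8)
target = adjust_weights(initial, r=0.25, T=0.5, k1=0.4, k2=0.5, k3=0.5, c1=0.5, c2=0.5)
```

With these values the first-type weight grows while the other two shrink, which is the direction the paragraph describes for a small ratio.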

[0043] S503, determine the comprehensive weight of each group of retrieval results recalled from the graph database and the vector database according to the target recall weights of the first type of keyword, the second type of keyword, and the vectorization, and determine supplementary knowledge according to the comprehensive weight of each group of retrieved results.

[0044] In this embodiment, two recall methods are employed: keyword recall and vectorized recall. It should be understood that search results obtained using different keyword recall methods may be the same, and search results obtained using both keyword and vectorized recall methods may also be the same. As a specific implementation, determining the comprehensive weight of each set of search results recalled from the graph database and vector database based on the target recall weights of the first type of keywords, the second type of keywords, and the target recall weight of the vectorized recall method includes: for any set of recalled search results, if the search result is recalled only by the keyword recall method, then the comprehensive weight of the search result is the sum of the target recall weights of the keywords corresponding to the search result; if the search result is recalled only by the vectorized recall method, then the comprehensive weight of the search result is the target recall weight of the vectorized recall method; if the search result is recalled by both the keyword recall method and the vectorized recall method, then the comprehensive weight of the search result is the sum of the target recall weights of the keywords corresponding to the search result and the target recall weights of the vectorized recall method.
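The three cases above reduce to one accumulation rule: each keyword that recalls a result adds that keyword's target weight, and vectorized recall adds the vectorized target weight. A minimal sketch, with hypothetical result identifiers and weights:

```python
def combined_weights(keyword_hits, vector_hits, keyword_weight, vector_weight):
    """Compute the comprehensive weight of every recalled result.

    keyword_hits: {keyword: [result ids recalled by that keyword]}
    vector_hits: result ids recalled by vectorized recall
    keyword_weight: {keyword: target recall weight of that keyword}
    vector_weight: vectorized target recall weight
    """
    totals = {}
    for keyword, results in keyword_hits.items():
        for rid in set(results):
            totals[rid] = totals.get(rid, 0.0) + keyword_weight[keyword]
    for rid in set(vector_hits):
        totals[rid] = totals.get(rid, 0.0) + vector_weight
    return totals


totals = combined_weights(
    keyword_hits={"meters running backwards": ["a", "b"], "count": ["b"]},
    vector_hits=["b", "c"],
    keyword_weight={"meters running backwards": 1.3, "count": 0.3},
    vector_weight=0.5,
)
```

Here result "a" is keyword-only, "c" is vector-only, and "b" is recalled by two keywords and by vectorized recall, so it accumulates all three weights.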

[0045] As a specific implementation method, determining supplementary knowledge based on the comprehensive weight of each retrieved set includes: sorting the retrieved results according to their corresponding comprehensive weights, and determining the top preset number of retrieved results with the highest comprehensive weights as supplementary knowledge. The preset number is positively correlated with the number of first-category keywords in the user question. Therefore, when the user question contains a large number of first-category keywords, more retrieved results are allowed to be input into the large model, which helps improve the accuracy of the structured query language output by the large model. Thus, the more times a retrieved result is recalled, and the greater the keyword or vectorized target recall weight at the time of recall, the greater the comprehensive weight of that retrieved result, and the greater its probability of being used as supplementary knowledge.
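The selection step can be sketched as a sort followed by a cut-off. The application states only that the preset number is positively correlated with the number of first-type keywords; the linear rule `base + per_keyword * n` used below is an assumption chosen for illustration.

```python
def select_supplementary(totals, num_first_type, base=3, per_keyword=2):
    """Return the ids of the highest-weighted results.

    The number kept grows with the count of first-type keywords
    (hypothetical linear rule: base + per_keyword * num_first_type).
    """
    limit = base + per_keyword * num_first_type
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [rid for rid, _ in ranked[:limit]]


weights = {"a": 1.3, "b": 2.1, "c": 0.5, "d": 0.1}
few = select_supplementary(weights, num_first_type=0, base=2, per_keyword=1)
more = select_supplementary(weights, num_first_type=1, base=2, per_keyword=1)
```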

[0046] As a specific implementation, the process of obtaining the table structure includes, as shown in Figure 3: S510, input the user question, the second prompt word template, the table names included in the data source, the field names included in each table of the data source, and the supplementary knowledge into the large model for reasoning, and obtain the list of table names output by the large model; the second prompt word template includes an instruction to filter, from the table names included in the data source, the table names related to the user question.

[0047] In this embodiment, some table names in the data source are unrelated to the user's question, while others are related to the user's question. The purpose of reasoning through the large model is to filter out the table names related to the user's question from the table names included in the data source.

[0048] In this embodiment, the supplementary knowledge mentioned above is also knowledge that the large model needs when inferring which table names in the data source are related to the user's question. The process of acquiring supplementary knowledge has been explained above and will not be repeated here.

[0049] S520, construct a table structure based on the list of table names output by the large model; the table structure includes the field names corresponding to each table name in the list of table names and the explanation of the fields.

[0050] In this embodiment, the table structure is an explanation of the table, including the name of each field and its corresponding explanation. For example, the table is named employees, and its field names include employee_id and first_name, where employee_id is the employee ID, which is auto-incrementing, and first_name is the employee's name, which can be up to 50 characters long and cannot be empty.
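Building the table structure from the filtered list of table names can be sketched as a lookup in a field catalog. The catalog layout and the output text format below are assumptions made for illustration; only the employees example comes from the paragraph above.

```python
def build_table_structure(table_names, catalog):
    """Render field names and explanations for the selected tables.

    catalog: {table name: {field name: explanation}} (hypothetical layout).
    """
    lines = []
    for name in table_names:
        lines.append(f"Table {name}:")
        for field, explanation in catalog[name].items():
            lines.append(f"  - {field}: {explanation}")
    return "\n".join(lines)


catalog = {
    "employees": {
        "employee_id": "employee ID, auto-incrementing",
        "first_name": "employee's name, up to 50 characters, not empty",
    },
    "departments": {"department_id": "department ID"},
}
structure = build_table_structure(["employees"], catalog)
```

Only the tables returned by the filtering step are rendered, so unrelated tables such as `departments` in this example stay out of the prompt.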

[0051] In one specific implementation, the method further includes: the structured query language output by the large model undergoes syntax validation. If validation fails, the output is deemed an erroneous result. Otherwise, the output is executed. If an error occurs during execution, the output is likewise deemed an erroneous result. Otherwise, the prompt composed of the user question, the first prompt word template, the preset table structure, and the supplementary knowledge is input into the large model multiple times for reasoning, and the structured query language corresponding to the target question is obtained through voting. Those skilled in the art will recognize that syntax validation of structured query language is existing technology and is not elaborated upon here. This specific implementation improves the accuracy of the final result returned to the user.

[0052] In one specific implementation, the prompt formed from the user question, the first prompt word template, the preset table structure, and the supplementary knowledge is input into the large model three times for reasoning. For each structured query language output that passes syntax validation and executes without error, the number of times the large model produced it is counted. The structured query language output the most times is taken as the structured query language corresponding to the target question. It should be understood that the structured query language corresponding to the target question is the final result returned to the user.
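The validation, execution, and voting steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation of this application: SQLite stands in for the target database, the `generate` callable is a hypothetical stand-in for one call to the large model with the assembled prompt, the generated statements are assumed to be read-only queries, and outputs are compared after whitespace normalization.

```python
import sqlite3
from collections import Counter


def is_valid_and_executable(conn, sql):
    """Return True if `sql` compiles and runs without error on `conn`."""
    try:
        # Validation: EXPLAIN compiles the statement without running the query.
        # In SQLite this also rejects references to missing tables or columns.
        conn.execute("EXPLAIN " + sql)
        # Execution check: run the statement and fetch its rows.
        conn.execute(sql).fetchall()
    except sqlite3.Error:
        return False
    return True


def vote_sql(generate, conn, rounds=3):
    """Call `generate` `rounds` times and return the most frequent valid SQL."""
    counts = Counter()
    for _ in range(rounds):
        # Normalize whitespace and a trailing semicolon so that trivially
        # different spellings of the same statement share one vote.
        sql = " ".join(generate().split()).rstrip(";")
        if is_valid_and_executable(conn, sql):
            counts[sql] += 1
    if not counts:
        return None  # every output was an incorrect result
    return counts.most_common(1)[0][0]
```

With `rounds=3` this matches the three inference passes described above. Counting only outputs that survive both checks means an incorrect result can never win the vote.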

[0053] This embodiment determines the type of each keyword in the user question and extracts from the user question both the core information directly related to the retrieval enhancement knowledge base (first-type keywords) and the auxiliary information (second-type keywords). The retrieval enhancement knowledge base is configured by the user, is highly specialized, and stores domain knowledge that large models lack. If first-type keywords exist in the user question, the knowledge corresponding to them is more important than that corresponding to second-type keywords and compensates more effectively for the large model's lack of domain knowledge. The keyword recall weights and the vectorized recall weight are adjusted according to the ratio of the number of first-type keywords to the total number of keywords in the user question. This ratio characterizes the influence of first-type keywords on the vectorized result corresponding to the user question, and thus indirectly reflects the degree of correlation between the vectorized result and the first-type keywords. Adjusting the keyword recall weights and the vectorized recall weight according to this ratio changes how heavily the keyword recall method and the vectorized recall method are relied upon in the recall stage, so that the knowledge recalled based on the combined weight better matches the user's actual needs. The recalled knowledge can then be input into the large model to make up for its lack of domain knowledge and improve the accuracy of SQL generation.
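The weighting scheme can be made concrete with a short Python sketch. It is a hedged illustration only: the combined weight follows the rule recited in claim 3 and the default of 100 results follows claim 7, while the threshold of 0.5, the step of 0.1, the behavior above the threshold, and all function and field names are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class RecallWeights:
    first_type: float   # weight of a first-type keyword hit
    second_type: float  # weight of a second-type keyword hit
    vector: float       # weight of a vectorized-recall hit


def adjust_weights(initial, n_first, n_total, threshold=0.5, step=0.1):
    """Derive target weights from the share of first-type keywords.

    The threshold, the step size, and leaving the weights unchanged above
    the threshold are illustrative assumptions.
    """
    ratio = n_first / n_total if n_total else 0.0
    if ratio <= threshold:
        return RecallWeights(
            first_type=initial.first_type + step,
            second_type=max(initial.second_type - step, 0.0),
            vector=max(initial.vector - step, 0.0),
        )
    return initial


def combined_weight(weights, first_hits, second_hits, vector_hit):
    """Sum the target weights of every recall path that returned the result."""
    total = first_hits * weights.first_type + second_hits * weights.second_type
    if vector_hit:
        total += weights.vector
    return total


def select_supplementary(results, weights, top_k=100):
    """Rank recalled results by combined weight and keep the top `top_k`."""
    ranked = sorted(
        results,
        key=lambda r: combined_weight(
            weights, r["first_hits"], r["second_hits"], r["vector_hit"]
        ),
        reverse=True,
    )
    return [r["text"] for r in ranked[:top_k]]
```

For example, with initial weights of 0.5, 0.3, and 0.4 and one first-type keyword among four keywords, the ratio is 0.25, so the sketch raises the first-type weight and lowers the other two. A result recalled both by a first-type keyword and by vectorized recall then outranks one recalled by vectorized recall alone.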

[0054] Example 2: This embodiment provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it performs the following steps: Obtain the user-configured business table to be learned and the user-configured learning description of the business table to be learned; the learning description of the business table to be learned includes the relationship between the data in the business table to be learned.

[0055] Query the business data corresponding to the business table to be learned, and clean the business data; the data cleaning includes removing duplicate data and outdated data.

[0056] The learning description of the business table to be learned, the corresponding business data, and the third prompt word template are input into the large model to obtain the data relationships output by the large model; the third prompt word template includes a task description for learning data relationships.

[0057] Store the data relationships output by the large model into a graph database and a vector database.

[0058] Supplementary knowledge of the user's question is obtained from the graph database and the vector database. The user's question, the first prompt word template, the preset table structure, and the supplementary knowledge are input into the large model for reasoning to obtain the structured query language output by the large model. The first prompt word template includes a task description for converting the user's question into structured query language.
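The learning steps performed by the processor (cleaning the business data, assembling the learning prompt, and storing the learned relationships) can be sketched in Python. Everything named here is an assumption made for illustration: the `updated_at` column and cutoff date used to decide what counts as outdated, the `{description}` and `{data}` placeholders in the template, the triple format of the model output, and the plain dictionary and list that stand in for the graph database and the vector database.

```python
from datetime import date


def clean_rows(rows, cutoff):
    """Drop exact duplicates and rows last updated before `cutoff`."""
    seen, cleaned = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen or row["updated_at"] < cutoff:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


def build_learning_prompt(template, description, rows):
    """Fill the third prompt word template with the description and data."""
    data = "\n".join(str(row) for row in rows)
    return template.format(description=description, data=data)


def store_relationships(triples, graph, vector_index, embed):
    """Write each (subject, relation, object) triple to both stores."""
    for subject, relation, obj in triples:
        # Graph store stand-in: adjacency list keyed by subject.
        graph.setdefault(subject, []).append((relation, obj))
        # Vector store stand-in: (embedding, text) pairs.
        text = f"{subject} {relation} {obj}"
        vector_index.append((embed(text), text))
```

The call to the large model is omitted deliberately; in this sketch its output is assumed to have already been parsed into triples before `store_relationships` is called.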

[0059] Example 3: This embodiment provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, it performs the following steps: Obtain the user-configured business table to be learned and the user-configured learning description of the business table to be learned; the learning description of the business table to be learned includes the relationship between the data in the business table to be learned.

[0060] Query the business data corresponding to the business table to be learned, and clean the business data; the data cleaning includes removing duplicate data and outdated data.

[0061] The learning description of the business table to be learned, the corresponding business data, and the third prompt word template are input into the large model to obtain the data relationships output by the large model; the third prompt word template includes a task description for learning data relationships.

[0062] Store the data relationships output by the large model into a graph database and a vector database.

[0063] Supplementary knowledge of the user's question is obtained from the graph database and the vector database. The user's question, the first prompt word template, the preset table structure, and the supplementary knowledge are input into the large model for reasoning to obtain the structured query language output by the large model. The first prompt word template includes a task description for converting the user's question into structured query language.

[0064] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

[0065] While specific embodiments of the invention have been described in detail by way of examples, those skilled in the art should understand that the examples are for illustrative purposes only and are not intended to limit the scope of the invention. Those skilled in the art should also understand that various modifications can be made to the embodiments without departing from the scope and spirit of the invention.

Claims

1. A method for converting natural language to structured query language, characterized in that, The method includes the following steps: Obtain the user-configured business table to be learned and the user-configured learning description of the business table to be learned; the learning description of the business table to be learned includes the relationship between the data in the business table to be learned; Query the business data corresponding to the business table to be learned, and perform data cleaning on the business data; the data cleaning includes removing duplicate data and outdated data; Input the learning description of the business table to be learned, the corresponding business data, and the third prompt word template into the large model to obtain the data relationships output by the large model; the third prompt word template includes a task description for learning data relationships; Store the data relationships output by the large model into a graph database and a vector database; Obtain supplementary knowledge of the user's question from the graph database and the vector database, and input the user's question, the first prompt word template, the preset table structure, and the supplementary knowledge into the large model for reasoning to obtain the structured query language output by the large model; the first prompt word template includes a task description for converting the user's question into structured query language.

2. The method for converting natural language to structured query language according to claim 1, characterized in that, The process of acquiring supplementary knowledge of the user's question includes: Identify keywords in the user's question and determine the type of each keyword; the types of keywords include first-type keywords and second-type keywords; the first-type keywords are keywords configured in the retrieval enhancement knowledge base, and the second-type keywords are keywords other than the first-type keywords; Obtain the ratio of the number of first-type keywords in the user's question to the total number of keywords in the user's question; Based on this ratio, adjust the initial recall weight of first-type keywords, the initial recall weight of second-type keywords, and the vectorized initial recall weight to obtain the target recall weight of first-type keywords, the target recall weight of second-type keywords, and the vectorized target recall weight; the initial recall weight of first-type keywords is greater than the initial recall weight of second-type keywords, and the target recall weight of first-type keywords is greater than the target recall weight of second-type keywords; Determine the combined weight of each set of search results recalled from the graph database and the vector database based on the target recall weight of first-type keywords, the target recall weight of second-type keywords, and the vectorized target recall weight; Determine the supplementary knowledge based on the combined weight of each set of recalled search results.

3. The method for converting natural language to structured query language according to claim 2, characterized in that, Determining the combined weight of each set of search results recalled from the graph database and the vector database based on the target recall weight of first-type keywords, the target recall weight of second-type keywords, and the vectorized target recall weight includes: For any recalled search result, if the result is recalled only by keyword recall, its combined weight is the sum of the target recall weights of the corresponding keywords; if the result is recalled only by vectorized recall, its combined weight is the vectorized target recall weight; if the result is recalled by both keyword recall and vectorized recall, its combined weight is the sum of the target recall weights of the corresponding keywords and the vectorized target recall weight.

4. The method for converting natural language to structured query language according to claim 2, characterized in that, Adjusting the initial recall weight of first-type keywords, the initial recall weight of second-type keywords, and the vectorized initial recall weight based on the ratio includes: when the ratio of the number of first-type keywords to the total number of keywords in the user's question is less than or equal to a preset ratio threshold, the initial recall weight of first-type keywords is increased and determined as the target recall weight of first-type keywords; the initial recall weight of second-type keywords is decreased and determined as the target recall weight of second-type keywords; and the vectorized initial recall weight is decreased and determined as the vectorized target recall weight.

5. The method for converting natural language to structured query language according to claim 1, characterized in that, The process of obtaining the table structure includes: The user question, the second prompt word template, the table names included in the data source, the field names included in any table included in the data source, and the supplementary knowledge are input into the large model for reasoning, and a list of table names output by the large model is obtained; the second prompt word template includes filtering table names related to the user question from the table names included in the data source; The table structure is constructed based on the list of table names output by the large model; the table structure includes the field names corresponding to each table name in the list of table names and the explanation of the fields.

6. The method for converting natural language to structured query language according to claim 1, characterized in that, The method further includes: The structured query language output by the large model is subjected to syntax validation. If the validation fails, the structured query language output by the large model is determined to be an erroneous result. Otherwise, the structured query language output by the large model is executed. If an error occurs during execution, the structured query language output by the large model is determined to be an erroneous result. Otherwise, the prompt formed from the user question, the first prompt word template, the preset table structure, and the supplementary knowledge is input into the large model multiple times for reasoning, and the structured query language corresponding to the target question is obtained through voting.

7. The method for converting natural language to structured query language according to claim 6, characterized in that, Determining the supplementary knowledge based on the combined weight of each set of recalled search results includes: sorting the recalled search results by combined weight and taking the 100 results with the highest combined weight as the supplementary knowledge.

8. The method for converting natural language to structured query language according to claim 2, characterized in that, The retrieval enhancement knowledge base includes definitions, business logic explanations, and case studies.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the computer program, it implements the natural language to structured query language processing method as described in any one of claims 1 to 8.

10. A computer-readable storage medium storing a computer program, characterized in that, When the computer program is executed by the processor, it implements the natural language to structured query language processing method as described in any one of claims 1 to 8.