Database query-based processing method and apparatus, device, and medium

By acquiring user queries, filtering candidate database schemas, and generating database query statements, the problem of inaccurate database query statements is solved, achieving an efficient and accurate query process, lowering the user threshold, and improving the efficiency and accuracy of the database system.

WO2026123824A1PCT designated stage Publication Date: 2026-06-18CHINA TELECOM ARTIFICIAL INTELLIGENCE TECHNOLOGY (BEIJING) CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
CHINA TELECOM ARTIFICIAL INTELLIGENCE TECHNOLOGY (BEIJING) CO LTD
Filing Date
2025-09-05
Publication Date
2026-06-18

AI Technical Summary

Technical Problem

In existing technologies, the generation of database query statements is inaccurate, leading to inaccurate and incomplete query results, which affects the efficiency of the database system and the user experience, and also poses a risk of data loss.

Method used

By acquiring user query questions, filtering candidate database schemas, extracting key content information and converting it into content identifiers recognizable by the database, determining the query statement skeleton, and filling in the content identifiers and target database schemas to generate database query statements, the query process is optimized using large models and context learning.

🎯Benefits of technology

It enables the automated generation of accurate database query statements, lowers the barrier for users to query the database, improves query efficiency and accuracy, and reduces labor costs.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN2025119429_18062026_PF_FP_ABST
    Figure CN2025119429_18062026_PF_FP_ABST
Patent Text Reader

Abstract

The present application provides a database query-based processing method and apparatus, a device, and a medium. The method comprises: acquiring a user query question; on the basis of the user query question, selecting candidate database schemas from among a plurality of database schemas in a preset database, and extracting key content information from the user query question; converting the key content information into a content identifier that is recognizable by the database, and selecting a target database schema from among the candidate database schemas; determining function structure information of the user query question, and determining a query statement skeleton of the user query question on the basis of the function structure information; and filling the query statement skeleton with the content identifier and the target database schema to obtain a database query statement, and performing a query operation in the database by means of the database query statement.
Need to check novelty before this filing date? Find Prior Art

Description

Database query-based processing methods, apparatus, equipment, and media

[0001] Related applications

[0002] This application claims priority to Chinese patent application filed on December 12, 2024, with application number 202411835078.5, entitled "A Processing Method, Apparatus, Device and Medium Based on Database Query", the entire contents of which are incorporated herein by reference. Technical Field

[0003] This application relates to the field of computer technology, and in particular to a processing method, apparatus, device and medium based on database query. Background Technology

[0004] Databases, as core tools for data storage and management, play a vital role in various industries. However, with the continuous expansion of database size and the increasing complexity of data, the problem of inaccurate database query generation has gradually become prominent, especially in databases involving numerous fields and complex relationships between them.

[0005] In existing technologies, inaccurate database query generation not only affects the accuracy and completeness of query results but also reduces the efficiency of the database system and the user experience. On the one hand, inaccurate queries may prevent users from obtaining the necessary data information, thus hindering their ability to make correct decisions and judgments. On the other hand, inaccurate queries may also lead to risks such as data errors and data loss, causing unnecessary losses to the company's business operations. Summary of the Invention

[0006] In view of the above problems, in order to overcome the above problems or at least partially solve the above problems, this application proposes a processing method, apparatus, device and medium based on database query.

[0007] This application provides a database query-based processing method in a first aspect. The method includes: obtaining a user query question; selecting candidate database patterns from multiple database patterns in a preset database based on the user query question, and extracting key content information from the user query question; converting the key content information into a content identifier recognizable by the database, and selecting a target database pattern from the candidate database patterns; determining the function structure information of the user query question, and determining the query statement skeleton of the user query question based on the function structure information; and filling the content identifier and the target database pattern into the query statement skeleton to obtain a database query statement, and performing a query operation on the database using the database query statement.

[0008] In some embodiments, the step of selecting candidate database patterns from multiple database patterns in a preset database based on the user query question includes: determining the question vector of the user query question and determining the database pattern vectors of multiple database patterns in the preset database; calculating the similarity between the question vector and the database pattern vectors, and selecting candidate database patterns from the multiple database patterns based on the similarity.

[0009] In some embodiments, the step of selecting a target database pattern from the candidate database patterns includes: performing text matching on the key content information and the candidate database patterns, and selecting a target database pattern from the candidate database patterns based on the text matching results; and / or performing semantic matching on the key content information and the candidate database patterns, and selecting a target database pattern from the candidate database patterns based on the semantic matching results.

[0010] In some embodiments, before executing the query operation in the database using the database query statement, the method further includes: executing the database query statement in the database and obtaining the execution result; if the execution result is an error message, adjusting the database query statement until the execution result of the adjusted database query statement is a non-error message.

[0011] In some embodiments, the method is applied to a large model, and the method further includes: determining target sample data matching the user query question from a preset sample dataset; using the target sample data as prompt words for the large model; wherein each sample data in the sample dataset is a tuple of a sample query question and a sample query statement.

[0012] In some embodiments, determining target sample data matching the user query question from the preset sample dataset includes: determining the function set, table, and field set of each sample data in the preset sample dataset; and determining the target sample data matching the user query question from the sample dataset based on the function set, table, and field set.

[0013] In some embodiments, the user query question is a question in natural language form, and the database query statement is a statement that the database can recognize.

[0014] This application provides a database query-based processing apparatus in a second aspect. The apparatus includes: a user query question acquisition module for acquiring user query questions; a question deconstruction module for filtering candidate database patterns from multiple database patterns in a preset database based on the user query question, and extracting key content information from the user query question; a pattern linking module for converting the key content information into content identifiers recognizable by the database, and filtering target database patterns from the candidate database patterns; a skeleton determination module for determining the function structure information of the user query question, and determining the query statement skeleton of the user query question based on the function structure information; and a database query statement generation module for filling the content identifiers and target database patterns into the query statement skeleton to obtain a database query statement, and performing a query operation on the database using the database query statement.

[0015] In some embodiments, the question deconstruction module is further configured to: determine the question vector of the user query question, and determine the database pattern vector of multiple database patterns in a preset database; calculate the similarity between the question vector and the database pattern vector, and filter candidate database patterns from the multiple database patterns based on the similarity.

[0016] In some embodiments, the pattern linking module is further configured to: perform text matching on the key content information and candidate database patterns, and filter out a target database pattern from the candidate database patterns based on the text matching results; and / or perform semantic matching on the key content information and candidate database patterns, and filter out a target database pattern from the candidate database patterns based on the semantic matching results.

[0017] In some embodiments, the apparatus further includes: an execution result acquisition module, configured to execute the database query statement in the database and obtain the execution result; and an adjustment module, configured to adjust the database query statement when the execution result is an error message, until the execution result of the adjusted database query statement is a non-error message.

[0018] In some embodiments, the apparatus is applied to a large model, and the apparatus further includes: a target sample data determination module, configured to determine target sample data matching the user query question from a preset sample dataset; and to use the target sample data as prompt words for the large model; wherein each sample data in the sample dataset is a tuple of a sample query question and a sample query statement.

[0019] In some embodiments, the target sample data determination module includes: a set information determination module, used to determine the function set, table, and field set of each sample data in a preset sample dataset; and a sample data matching submodule, used to determine target sample data matching the user query question from the sample dataset based on the function set, table, and field set.

[0020] In some embodiments, the user query question is a question in natural language form, and the database query statement is a statement that the database can recognize.

[0021] In a third aspect, this application provides an electronic device including a processor, a memory, and a computer program stored in the memory and capable of running on the processor, wherein the computer program, when executed by the processor, implements the method described above.

[0022] In a fourth aspect, this application provides a non-volatile computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the method described above.

[0023] The technical solution of this application has the following advantages:

[0024] According to embodiments of this application, the following steps are achieved: obtaining user query questions; selecting candidate database patterns from multiple database patterns in a pre-set database based on the user query questions, and extracting key content information from the user query questions; converting the key content information into database-recognizable content identifiers, and selecting target database patterns from the candidate database patterns; determining the function structure information of the user query questions, and determining the query statement skeleton based on the function structure information; filling the content identifiers and target database patterns into the query statement skeleton to obtain the database query statement, and executing the query operation in the database using the database query statement. This realizes the automatic generation and execution of database query statements based on user query questions, accurately generating database query statements while lowering the threshold for users to query the database, and significantly improving query efficiency and accuracy. Attached Figure Description

[0025] To more clearly illustrate the technical solution of this application, the drawings used in the description of this application will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0026] Figure 1 is a flowchart illustrating a database query optimization method based on the combination of content and structural knowledge, provided in some embodiments of this application.

[0027] Figure 2 is a flowchart of a database query-based processing method provided in some embodiments of this application.

[0028] Figure 3 is a schematic diagram of the implementation process of a problem deconstruction module provided in some embodiments of this application.

[0029] Figure 4 is a schematic diagram of the implementation process of a bidirectional mode link module provided in some embodiments of this application.

[0030] Figure 5 is a schematic diagram of the implementation process of an SQL generation module provided in some embodiments of this application.

[0031] Figure 6 is a structural block diagram of a database query-based processing device provided in some embodiments of this application. Detailed Implementation

[0032] To make the above-mentioned objectives, features, and advantages of this application more apparent and understandable, the application will be further described in detail below with reference to the accompanying drawings and specific embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those skilled in the art based on the embodiments of this application without inventive effort are within the scope of protection of this application.

[0033] Among related technologies, generative large language models are a powerful and popular technique in the field of natural language processing today. Based on massive amounts of data, deep attention networks based on Transformers (sequence models with attention mechanisms), and reinforcement learning based on human feedback, generative large language models can learn rich latent semantics, thereby understanding natural language and generating corresponding responses.

[0034] Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) that aims to enable computers to understand, process, and generate human language. Large Language Models (LLMs), also known as large-scale models, are AI models designed to understand and generate human language. Trained on massive amounts of text data, they can perform a wide range of tasks, including text summarization, translation, and sentiment analysis, and contain billions of parameters. A pipeline system is a technique used to organize multiple tasks in a specific order, forming a processing flow. Each task receives input data sequentially, generates an output, and then passes the output to the next task as its input.

[0035] Generative large language models primarily employ a two-stage training strategy: first, pre-training on large-scale unsupervised data; then, fine-tuning and reinforcement learning using labeled data for specific tasks. This strategy enables the model to perform well with limited labeled data and generate responses that best meet human needs.

[0036] In practical applications, large models (generative large language models) have been widely used to convert natural language text into database queries. For example, schema information from the database (including table creation statements and primary and foreign keys) is input into the large model as contextual information to establish a connection between the large model and database knowledge. Then, few-shot learning is used to guide the large model to generate Structured Query Language (SQL) statements. However, databases typically contain numerous fields, and the relationships between fields and between fields and values ​​are complex, which poses a challenge to the accurate generation of database queries by large models.

[0037] Therefore, in this application embodiment, an innovative semantic bridging enhancement framework for text to SQL is proposed by combining multi-channel joint technology. This framework improves the conversion efficiency from natural language questions to SQL queries by redefining the SQL generation paradigm.

[0038] This application also employs a Text-to-SQL In-context learning strategy based on a large model. This strategy maps key content and structural information from the problem to database schemas. Through bidirectional schema linking and a refined example selection strategy, the context learning process of the large model is optimized. It is applicable to various cross-domain databases and exhibits strong scalability and adaptability. By leveraging the collaborative work of content and structure channels, necessary information can be accurately extracted from the original problem, generating accurate SQL queries and effectively reducing the complexity of model processing and resource consumption.

[0039] Text-to-SQL (TTO) is a technique in natural language processing used to convert natural language questions or instructions into structured query language (SQL). The goal of this technology is to enable computers to understand database query requirements described by humans in natural language and translate them into SQL query statements that can be directly executed on the database. This is achieved by combining knowledge of language models, syntax analysis, semantic parsing, and database schemas. Applications of Text-to-SQL technology can include automated question-answering systems, intelligent improvements to database query interfaces, and semantic search.

[0040] In-context learning is a concept in machine learning and artificial intelligence, particularly in the research and application of large language models (LLMs). In in-context learning, the model does not simply rely on a pre-trained static knowledge base, but rather dynamically adjusts its behavior based on the current input context. This allows the model to adapt to new inputs on the fly without requiring retraining or fine-tuning. New tasks can be learned quickly by observing a small number of examples.

[0041] In the database domain, a schema is a logical concept used to organize objects within a database. Objects in a schema typically include tables, indexes, data types, sequences, stored procedures, primary keys, and foreign keys. Schema linking refers to the process of establishing links between unstructured expressions of user queries and tables and fields in the database schema during database interactions. This schema linking process allows for effective modeling of the relationships between user queries and fields.

[0042] Figure 1 shows a flowchart of a database query optimization method based on the combination of content and structural knowledge provided in this embodiment of the application. It includes a problem structure module, a bidirectional mode connection module, an SQL generation module, and constructs a collaborative working mechanism between the content channel and the structure channel.

[0043] The question deconstruction module is primarily used for initial screening of the database schema, extracting entity mentions from user questions. These entity mentions are typically entity phrases or adverbs of degree, such as names of people, places, or organizations. They are closely related to the content and structural hierarchy of the final generated SQL.

[0044] The bidirectional schema linking module serves as a transition from user query to SQL generation, acting as a bridge between the query structure stage and the SQL generation stage. This stage combines content information with the database schema to obtain candidate identifiers in the content pipeline and further derives the SQL function from the structure information. All processes in this stage are implemented using in-context learning.

[0045] SQL Generation Module: This module is a crucial step in transforming natural language questions into executable SQL queries. Through a well-designed query skeleton and meticulous information filling, it ensures the accuracy and effectiveness of the queries. First, it generates an SQL skeleton based on the SQL function and the original question. Then, it fills the SQL skeleton with candidate fields and tables to complete the final SQL query.

[0046] The embodiments of this application can solve the following problems existing in the related technology:

[0047] 1. Information Redundancy and Low Context Learning Efficiency: In practical applications, information redundancy is a common problem, preventing the model from effectively focusing on key content. In this embodiment, by linking and optimizing the input context using bidirectional patterns, irrelevant information can be filtered out, significantly improving context learning efficiency.

[0048] 2. Pattern matching difficulties: Existing models often perform poorly in matching natural language problems with database schemas. In this embodiment, structural information is accurately extracted through structural channels to ensure effective matching with the database schema, thereby improving the accuracy of SQL generation.

[0049] 3. Poor cross-domain adaptability: Many models perform well in specific domains but fail when applied across different domains. In the embodiments of this application, the model can adapt to the database query requirements of different domains, improving its applicability.

[0050] The inventive point of this application's embodiments is:

[0051] 1. A problem understanding scheme based on joint learning of content and structure.

[0052] In database query systems, there is a complex semantic mapping relationship between the textual description of a question and the database schema. In this embodiment, by constructing a content channel and a structure channel, key content information and structural information are extracted from the question, respectively, and then mapped to the database schema. This dual-channel mechanism not only understands the keywords in the question but also captures the semantic structure of the question statement, thus providing a foundation for generating accurate SQL queries.

[0053] 2. Context optimization strategy based on sample selection.

[0054] Based on the similarity assessment between the problem and historical data, a two-stage screening strategy is designed from both structural and content perspectives to select small sample data as context, thereby improving the learning ability of large models and their adaptability to the SQL domain.

[0055] 3. A pluggable SQL generation framework based on content and structure.

[0056] SQL query generation is a multi-step process involving multiple stages such as problem understanding, pattern matching, and query construction. In this embodiment, the designed framework is based on a general text-to-SQL process, which can be flexibly adjusted according to different application scenarios and database requirements, allowing for the selection of whether to add additional content and structure analysis modules, thus exhibiting high scalability and adaptability.

[0057] In practical applications, the embodiments of this application have the following advantages:

[0058] 1. Efficient Content and Structure Information Extraction. Through innovative content and structure channels, key content and structure information can be accurately extracted from user questions. This efficient information extraction method can effectively identify fields and table content relevant to user questions, thereby improving the accuracy of SQL generation.

[0059] 2. Fully leverage the generation capabilities of large-scale models to reduce manual labor costs. By utilizing the generation capabilities of large-scale language models, direct conversion from natural text to query statements is achieved. This reduces reliance on manually defined rules and lowers labor costs.

[0060] 3. Flexible adaptation to different database schemas. Through bidirectional schema linking and a refined example selection strategy, the context learning process of large language models is optimized. This not only improves the efficiency of natural language to SQL conversion but also enhances the model's adaptability to databases in different domains and with different database schemas.

[0061] The present application will be further described below with reference to the accompanying drawings:

[0062] Referring to FIG2, a flowchart of a database query-based processing method provided by some embodiments of this application is shown, which may specifically include the following steps 201 to 205.

[0063] Step 201: Obtain the user's query question.

[0064] In some embodiments of this application, the user query question is a question in natural language form.

[0065] Among these, questions in natural language form refer to user queries that are expressed in everyday language (such as Chinese or English), rather than in programming code, mathematical formulas, or other non-natural language formats. For example, a user query like "What will the weather be like in Beijing tomorrow?" is an example of a question expressed in natural language.

[0066] In practical applications, user queries can be obtained and processed to generate database query statements.

[0067] Step 202: Based on the user's query question, select candidate database patterns from multiple database patterns in the preset database, and extract key content information from the user's query question.

[0068] In this context, a pre-built database refers to one or more database systems that have been established and have stored a large amount of data. These databases can contain different types of data.

[0069] In database systems, a schema refers to the structure or organization of data. Different database schemas can represent different data views or access methods. Multiple database schemas mean that a pre-built database contains various different data structures or access paths. Based on the content of the user's query, schemas containing the information required by the user can be selected from multiple database schemas as candidate schemas.

[0070] After identifying candidate database schemas, keywords or phrases crucial to the query results can be extracted from the user's query as key content information. This key content information can then be used as query conditions or filter criteria to retrieve the specific information the user needs from the database.

[0071] As some examples, key information can be data in the form of text, numbers, dates, etc.

[0072] In some embodiments of this application, the step of selecting candidate database patterns from multiple database patterns in a preset database based on the user query question includes: determining the question vector of the user query question and determining the database pattern vectors of multiple database patterns in the preset database; calculating the similarity between the question vector and the database pattern vectors, and selecting candidate database patterns from the multiple database patterns based on the similarity.

[0073] The user query question vector refers to the process of converting the user query question (natural language question) into a mathematical representation. The database schema vector refers to the process of converting database schema information into a mathematical vector representation.

[0074] After determining the question vector and the database pattern vector, the similarity between the two can be calculated to determine the similarity result. Based on the similarity calculation result, candidate database patterns that match the user's query question can be selected from multiple database patterns.

[0075] As examples, the question structure module can be used to initially filter the database schema and extract entity mentions from user questions. These entity mentions can be entity phrases or adverbs of degree, and are closely related to the content and structural hierarchy of the final generated SQL.

[0076] In some examples, a database filter can also be designed in the question structure module to filter out database schemas that are closely related to the question from the user's query questions.

[0077] For example, 20,000 SQL statements and question tuples could be collected from user history interaction data.<question,SQL> Then, rule functions are used to extract tables and fields from the SQL, forming...<question,table.column> The binary pairs are used as training data. The database filter selects a pre-trained language model (BERT) as the underlying semantic encoder to represent `question` (field) and `table.column` (field). Then, a post-interaction method based on maximum similarity (e.g., MaxSim) is used to calculate the similarity between `question` and `table.column`, and the degree of association between the table and nodes is determined by the similarity score. The calculation formula is as follows: O q =Normalize(CNN(BERT) Q (q)) O c =Normalize(CNN(BERT) C (c))

[0078] Among them, O q and O c It is a vector representation encoded using the dual-tower model; i and j represent the dimension values ​​of a vector; score represents the similarity between the user's query and the current database schema vector table.column(; by setting a similarity threshold, preliminary filtering of database schemas can be performed.)

[0079] In some examples, after initial screening of the database schema, key content and structural information in user queries can be extracted, such as nouns, entity words, degree adverbs, etc., from the original user query.

[0080] Figure 3 illustrates the implementation process of the problem deconstruction module. As shown in Figure 3, the predefined database schema consists of data tables and fields that have been initially filtered from the database schema by the retrieval module. Foreign key information defines which field links different data tables. Knowledge clues describe the relationship between the user's question and the schema. This information, along with the user's query, forms the prompts for the problem deconstruction module, which then become the module's output (key content information). For example, the prompts might include content information extracted from the user's query: gender, client, and average salary; and structural information: the youngest, the lowest, and average.

[0081] Step 203: Convert the key content information into a content identifier that the database can recognize, and filter out the target database pattern from the candidate database patterns.

[0082] As examples, converting key content information into database-recognizable content identifiers refers to transforming the key content information into a format that the database can understand and store. For example, this could involve converting the key content information into a specific data type (such as integers, floating-point numbers, strings, dates, etc.); converting the key content information into OR data encoding (such as converting text to UTF-8 encoding); or assigning database-recognizable content identifiers to the key content information based on the specific identifier system used by the database (such as primary keys, foreign keys, etc.).

[0083] In some embodiments of this application, the step of selecting a target database pattern from the candidate database patterns includes: performing text matching on the key content information and the candidate database patterns, and selecting a target database pattern from the candidate database patterns based on the text matching results; and / or performing semantic matching on the key content information and the candidate database patterns, and selecting a target database pattern from the candidate database patterns based on the semantic matching results.

[0084] Since a database schema defines the structure of data in a database, including tables, columns, data types, and relationships, it's possible to match key content information with candidate database schemas (textual matching and / or semantic matching), and then select the target database schema from the candidate schemas based on the matching results.

[0085] Text matching refers to comparing key content information with various candidate database patterns at the text level. For example, string or word-based matching checks whether words or phrases in the key content information directly appear in the descriptions of candidate database patterns. Based on the text matching results, candidate database patterns that highly match or contain key information at the text level are selected. These selected patterns are considered to be closest to the target requirements and are used as the target database pattern.

[0086] Semantic matching refers to understanding the meaning of key content information and candidate database patterns. It utilizes Natural Language Processing (NLP) techniques, such as semantic analysis and entity recognition, to identify potential connections or similarities between key content information and candidate database patterns, even if they use different words or expressions. Through semantic matching, candidate database patterns that, although differing in textual expression, are highly relevant or consistent with the key content information in meaning are selected and used as target database patterns.

[0087] As examples, a bidirectional schema linking module can be used to convert key content information into database-recognizable content identifiers and filter target database schemas from candidate database schemas. This bidirectional schema linking module acts as a bridge between the user query and SQL generation, bridging the question structure stage and the SQL generation stage. At this stage, content information is combined with the database schema (the data obtained from preprocessing and extracting key information mentioned above) to obtain candidate identifiers in the content pipeline, and further, SQL functions are derived from the structure information through the structure channel.

[0088] In some examples, the content channel is primarily used for information transformation and enhancement. It combines keywords extracted from user queries with the database schema, leveraging external evidence (such as domain knowledge and knowledge bases) to convert natural language expressions into database-recognizable identifiers. Then, foreign key relationships in the database (where other related information in some tables is recorded through foreign key relationships) are used as contextual knowledge to guide the expansion of the candidate set in the larger model.

[0089] Since several preliminary database schemas (candidate database schemas) have already been obtained in the problem deconstruction module, these candidate database schemas can be further filtered through bidirectional schema linking. For example, literal matching can be used to match the set of candidate word fragments extracted from the content channel with the table names and column names in the database schemas, selecting the most relevant table fields. Then, for table fields not directly mentioned in the question, potential matching items are filtered out by identifying the semantic similarity between the field's description information or field value examples and the question content. Finally, based on the results of the two matching processes, as well as knowledge clues, foreign keys, primary keys, and other dependencies, database schemas with low task relevance are eliminated from the candidate database schemas to generate a target database schema with less redundant information based on the user's query question.

[0090] The structural pipeline is used to extract function structure information from the query. User language expressions typically correspond to standard function operations, such as sorting (ORDER BY) and averaging (AVG). By using a large model, it's determined whether there are any implicit function operations in the user's query. Then, the commonly used function names and definitions in the database are provided to the large model as context. The large model needs to combine this information to map fragments of the user's query to specific function structures in the SQL and determine the function's constraints, such as ascending or descending order.

[0091] Figure 4 illustrates the implementation process of the bidirectional model linking module. In the content channel, the word "average salary" extracted from the question structure module is directly replaced with the field [A11] in the database using the referential description in the knowledge clues. Furthermore, based on foreign key knowledge hints, the fields connecting the two related tables are [district] and [district_id], thus this field is added to the candidate set. Combining the knowledge in the schema, the initially extracted "later birthdate" is replaced with the corresponding field [birth_data] in the database. Finally, the extracted candidate tables and fields are bidirectionally linked with the predefined database schema to obtain the bidirectionally filtered database schema. In the structure channel, the word fragments output in the previous stage are mapped to functions in the database, filtering out candidate functions [ORDER BY], [ASC], [DESC], and [AVG]. The model output in this stage includes: candidate tables / fields, candidate functions, and the bidirectionally filtered database schema.

[0092] Step 204: Determine the function structure information of the user query question, and determine the query statement skeleton of the user query question based on the function structure information.

[0093] In this context, function structure information refers to the structured representation of the various components of the query and their interrelationships, identified after parsing the user's query. The query skeleton refers to the basic query statement framework constructed based on the function structure information to execute the user's query. For example, the framework could be a template for a structured query language (such as SQL), containing the basic elements required to execute the query.

[0094] As examples, user queries can be parsed to identify the components of the query and their interrelationships, forming a structured representation that serves as function structure information. After parsing the function structure information, the basic query statement framework (i.e., the query statement skeleton of the user query) can be constructed based on this information.

[0095] Step 205: Fill the content identifier and target database schema into the query statement skeleton to obtain the database query statement, and execute the query operation in the database through the database query statement.

[0096] After determining the query skeleton for the user's query, information such as content identifiers (e.g., specific field names, table names) and the target database schema (e.g., the actual table structure and field names in the database) can be populated into the query skeleton. The populated query skeleton then becomes a complete, executable database query.

[0097] As examples, after obtaining a database query statement, it can be sent to a Database Management System (DBMS). The DBMS parses and executes the query statement. During execution, the DBMS calculates the query results based on the content of the query statement and the actual data in the database, and returns them to the query initiator.

[0098] In some embodiments of this application, the database query statement is a statement that the database can recognize.

[0099] In some embodiments of this application, before the database query operation is performed on the database using the database query statement, the method further includes: executing the database query statement on the database and obtaining the execution result; if the execution result is an error message, adjusting the database query statement until the execution result of the adjusted database query statement is a non-error message.

[0100] As examples, after executing a database query, the DBMS calculates the query result based on the content of the query and the actual data in the database, and returns this result to the query initiator. This result can be a data record set, a single data value, the number of rows affected (such as insert, update, or delete operations), or an error message (if there are errors in the query).

[0101] If the DBMS returns an error message, it indicates that the database query statement encountered a problem during execution. These problems can be caused by syntax errors, logical errors, insufficient permissions, or non-existent data. When the execution result is an error, the database query statement needs to be corrected. This may include fixing syntax errors, adjusting logical conditions, changing field or table names, and adding necessary permissions.

[0102] After adjusting the database query statement, you need to execute the adjusted query statement again in the database and check the execution result. If the execution result is no longer an error message, but a normal query result (such as a data record set, a single data value, etc.), then the adjustment is effective and the database query statement can be executed correctly. If the execution result is still an error message, you need to continue adjusting the database query statement until you find the correct database query statement.

[0103] As examples, a natural language query (user query question) can be transformed into an executable SQL query using an SQL generation module. Accuracy and effectiveness of the query are ensured through a carefully designed query skeleton and meticulous information filling. The SQL skeleton is generated based on the SQL function and the original question, and then filled with candidate fields and tables to complete the final SQL query. Specifically, this includes:

[0104] 1. Skeleton generation: The candidate functions recommended by the structure channel and the user query question are provided as context to the large model, enabling it to build an SQL query skeleton. This skeleton defines the basic structure of the query, including how each function combines fields. The specific table names and field names will be defaulted using the [MASK] character.

[0105] 2. Skeleton filling: Based on the candidate tables and field information provided by the content channel, use the large model to complete the mapping and filling of placeholders in the skeleton. If necessary, add WHERE clauses to the SQL skeleton to filter records and ensure that the returned results are consistent with the user's question intent.

[0106] 3. LLM-based error correction module: Design a large model-based error correction agent. After each skeleton filling execution, the SQL for the current round will be obtained. Then, the SQL library function execution module will be called to obtain the SQL execution result. If the SQL execution result is incorrect, or a null value is retrieved, the system will roll back to the previous module to re-extract the structural and skeleton information until the SQL execution no longer produces errors.

[0107] Figure 5 illustrates the implementation process of the SQL generation module. As shown in Figure 5, the candidate functions output from the structure channel in the bidirectional mode linking module are used to generate the SQL skeleton using the large model. The table names and field names are filled with placeholders [table_name] and [column_name], respectively. The resulting SQL skeleton is as follows: "SELECT T1.[column_name] FROM [table_name] AS T1 INNER JOIN [table_name] AS T2 ON T1.[column_name]=T2.[column_name] ORDER BY T2.[column_name]ASC,T1.[column_name]DESC LIMIT 1"

[0108] The SQL statement means: Find records in the "Customer" table that match the "Region" table, sort them in ascending order by the A11 field in the "Region" table and descending order by the "Date of Birth" field in the "Customer" table, and then return the "Gender" field value of the first record in the sorted results. The corresponding user question: What is the gender of the youngest customer who opened an account at a branch in a region with the lowest average wage?

[0109] Then, the candidate tables / fields and schema provided by the content channel are input into the large model as contextual knowledge along with the SQL skeleton to obtain the final SQL query:

[0110] "SELECT T1.gender(Select the "gender" field from the "Customers" table).

[0111] FROM client AS T1 (Data comes from the "client" table, which is aliased as T1).

[0112] INNER JOIN district AS T2 (This performs an inner join with the "district" table and aliases the "district" table to T2).

[0113] ON T1.district_id=T2.district_id(The join condition is that the “District ID” field in the “Customer” table is equal to the “District ID” field in the “District” table).

[0114] ORDER BY T2.A11 ASC, T1.birth_date DESC (Sort by the A11 field in the "Region" table in ascending order (A11 represents average salary), and by the "Date of Birth" field in the "Customer" table in descending order).

[0115] LIMIT 1 (limits the query result to return only one row of data). Execute the process shown in Figure 5.

[0116] In some embodiments of this application, the method is applied to a large model, and the method further includes: determining target sample data matching the user query question from a preset sample dataset; using the target sample data as prompt words for the large model; wherein each sample data in the sample dataset is a tuple of a sample query question and a sample query statement.

[0117] The pre-set sample dataset contains multiple sample data sets. These sample data sets are used to match user queries, identifying those that match the user's query as target sample data, which is then used as prompts for the larger model.

[0118] Hint words instruct the large model how to generate database queries corresponding to user queries. Each sample in the sample dataset is a tuple consisting of a sample query question and a sample query statement. Each sample contains a possible user question (sample query question) and a query statement written for that question (sample query statement). This structure allows the model to learn the mapping between query questions and query statements, enabling it to generate appropriate query statements based on new query questions.

[0119] As examples, 20,000 manually verified sample data entries can be collected from a large database interaction history.<question,SQL> (Two tuples) are used to form a sample dataset Q. Then, each question q in Q is... i These are respectively used as context inputs to the large model, enabling it to execute the problem structure module and the bidirectional schema linking module, extracting candidate table fields and functions from the user's problem. Assume F... i It is the final set of functions, I i It is the final table and field set, for each q i Define the complexity C of a task decoding. i The calculation process is expressed by the following formula: C i =|I i |+|F i |

[0120] Based on F i and C i Determine q in the candidate set i Does it share a similar decoding approach with the current user's question q? Selecting k-shot examples with similar SQL structures and question expressions serves as the basis for minimum sample learning, thereby guiding the large model's learning.

[0121] The specific implementation process is shown in pseudocode below:

[0122] Algorithm name: Small sample selection strategy based on content and structural similarity.

[0123] Input: Example candidate set Q, target user question q, target number of samples k.

[0124] Output: The first K selected candidate sample sets.

[0125] 1: S1 <- Empty Set (Construct an empty set S1 to store the initial small sample set).

[0126] 2: for each question qi in Q do(for each user question q in the sample set Q) i )

[0127] 3: ci,Fi <- InformationExtraction(qi) (This calls the InformationExtraction function to extract information from question q.) i (Extract content information and structural information).

[0128] 4: if SimilarityStructure(q,ci,Fi)then(determines the similarity between the target user question q and the current sample user question q) i(structural similarity).

[0129] 5: S1 <- Add qi to S1 (If similar, then add question q) i Add to set S1).

[0130] 6: end if

[0131] 7: end for

[0132] 8: S <- SimilarityExpression(q,k,S1) (Determines the content similarity between the target user's question q and the samples in S1, and selects the top k samples as the final sample set S).

[0133] 9: return S

[0134] In some embodiments of this application, determining target sample data matching the user query question from the preset sample dataset includes: determining the function set, table, and field set of each sample data in the preset sample dataset; and determining the target sample data matching the user query question from the sample dataset based on the function set, table, and field set.

[0135] The sample dataset consists of multiple tables, each containing multiple fields (columns). The set of tables and fields refers to these tables and the collection of their respective fields. These tables and fields constitute the structural framework of the data. The function set can be a collection of date conversion functions, string processing functions, mathematical calculation functions, etc.

[0136] By analyzing user queries, we determine the functions (such as filtering, sorting, aggregation, etc.) to be used, as well as the tables and fields to be queried. Then, we extract sample data that matches the query from the sample dataset as target sample data.

[0137] In this embodiment, the following steps are taken: First, a user query question is obtained. Then, candidate database patterns are selected from multiple database patterns in a pre-set database based on the user query question, and key content information is extracted from the user query question. This key content information is converted into a database-recognizable content identifier, and a target database pattern is selected from the candidate database patterns. Next, the function structure information of the user query question is determined, and the query statement skeleton is determined based on this information. Finally, the content identifier and the target database pattern are filled into the query statement skeleton to obtain the database query statement. The query operation is then executed in the database using this database query statement. This achieves automated generation and execution of database query statements based on user query questions. While accurately generating database query statements, it lowers the barrier for users to query the database, significantly improving query efficiency and accuracy.

[0138] It should be noted that, for the sake of simplicity, the method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments of this application are not limited to the described order of actions, because according to the embodiments of this application, some steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of this application.

[0139] Referring to FIG6, a schematic diagram of the structure of a database query-based processing device provided in some embodiments of this application is shown, which may specifically include: a user query question acquisition module 601, a question deconstruction module 602, a pattern linking module 603, a skeleton determination module 604, and a database query statement generation module 605.

[0140] The user query question acquisition module 601 is used to acquire user query questions.

[0141] The problem deconstruction module 602 is used to filter candidate database patterns from multiple database patterns in a preset database based on the user query question, and extract key content information from the user query question.

[0142] The pattern linking module 603 is used to convert the key content information into a content identifier that the database can recognize, and to filter out the target database pattern from the candidate database patterns.

[0143] The skeleton determination module 604 is used to determine the function structure information of the user query question, and determine the query statement skeleton of the user query question based on the function structure information.

[0144] The database query statement generation module 605 is used to fill the content identifier and the target database schema into the query statement skeleton to obtain the database query statement, and to perform a query operation in the database through the database query statement.

[0145] In some embodiments of this application, the problem deconstruction module 602 is further configured to:

[0146] Determine the question vector of the user's query question, and determine the database schema vector of multiple database schemas in the preset database;

[0147] Calculate the similarity between the question vector and the database pattern vector, and select candidate database patterns from the multiple database patterns based on the similarity.

[0148] In some embodiments of this application, the pattern linking module 603 is further configured to:

[0149] Perform text matching on the key content information and candidate database patterns, and based on the text matching results, filter out the target database pattern from the candidate database patterns; and / or

[0150] Semantic matching is performed on the key content information and candidate database patterns, and the target database pattern is selected from the candidate database patterns based on the results of the semantic matching.

[0151] In some embodiments of this application, the apparatus further includes:

[0152] The execution result acquisition module is used to execute the database query statement in the database and obtain the execution result;

[0153] The adjustment module is used to adjust the database query statement when the execution result is an error message, until the execution result of the adjusted database query statement is a non-error message.

[0154] In some embodiments of this application, the apparatus is applied to a large model, and the apparatus further includes:

[0155] The target sample data determination module is used to determine target sample data that matches the user query question from a preset sample dataset; and to use the target sample data as prompt words for the large model; wherein, each sample data in the sample dataset is a tuple of sample query question and sample query statement.

[0156] In some embodiments of this application, the target sample data determination module includes:

[0157] The set information determination module is used to determine the function set, table, and field set for each sample data in a preset sample dataset;

[0158] The sample data matching submodule is used to determine target sample data that matches the user query question from the sample dataset based on the function set, table and field set.

[0159] In some embodiments of this application, the user query question is a question in natural language form, and the database query statement is a statement that the database can recognize.

[0160] Some embodiments of this application also provide an electronic device, including a processor, a memory, and a computer program stored in the memory and capable of running on the processor, wherein the computer program, when executed by the processor, implements the method described above.

[0161] Some embodiments of this application also provide a computer-readable storage medium on which a computer program is stored, and which, when executed by a processor, implements the method described above.

[0162] Some embodiments of this application also provide a computer program product, including a computer program that, when executed by a processor, implements the method described above.

[0163] As the device embodiment is basically similar to the method embodiment, the description is relatively simple, and relevant parts can be found in the description of the method embodiment.

[0164] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation portals are provided for users to choose to authorize or refuse.

[0165] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.

[0166] Those skilled in the art will understand that embodiments of this application can be provided as methods, apparatus, or computer program products. Therefore, embodiments of this application can take the form of entirely hardware embodiments, entirely software embodiments, or embodiments combining software and hardware aspects. Furthermore, embodiments of this application can take the form of computer program products implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0167] This application describes embodiments with reference to flowchart illustrations and / or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of this application. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in one or more blocks of the flowchart illustrations and / or one or more blocks of the block diagrams.

[0168] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing terminal device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flowcharts and / or one or more block diagrams.

[0169] These computer program instructions may also be loaded onto a computer or other programmable data processing terminal equipment to cause a series of operational steps to be performed on the computer or other programmable terminal equipment to produce a computer-implemented process, such that the instructions, which execute on the computer or other programmable terminal equipment, provide steps for implementing the functions specified in one or more flowcharts and / or one or more block diagrams.

[0170] It should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes the aforementioned element.

[0171] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0172] The above embodiments are merely illustrative of several implementation methods of this application, and their descriptions are relatively specific and detailed. However, they should not be construed as limiting the scope of this application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

A database query-based processing method includes: Get the user's query question; Based on the user's query question, candidate database patterns are selected from multiple database patterns in the preset database, and key content information is extracted from the user's query question. The key content information is converted into content identifiers that the database can recognize, and the target database pattern is selected from the candidate database patterns; Determine the function structure information of the user query question, and determine the query statement skeleton of the user query question based on the function structure information; as well as The content identifier and the target database schema are filled into the query statement skeleton to obtain the database query statement, and the query operation is performed on the database through the database query statement. According to the method of claim 1, wherein selecting candidate database schemas from a plurality of database schemas in a preset database based on the user query question includes: Determine the question vector of the user's query question, and determine the database schema vector of multiple database schemas in the preset database; Calculate the similarity between the question vector and the database pattern vector, and select candidate database patterns from the multiple database patterns based on the similarity. According to the method of claim 1, wherein selecting the target database pattern from the candidate database patterns comprises: The key content information and candidate database patterns are matched with text, and the target database pattern is selected from the candidate database patterns based on the text matching results. and / or Semantic matching is performed on the key content information and candidate database patterns, and the target database pattern is selected from the candidate database patterns based on the results of the semantic matching. The method according to any one of claims 1 to 3, wherein before performing the query operation on the database through the database query statement, it further includes: Execute the database query statement in the database and obtain the execution result; If the execution result is an error, the database query statement is adjusted until the execution result of the adjusted database query statement is a non-error statement. The method according to any one of claims 1 to 3, wherein the method is applied to a large model, the method further comprising: Determine target sample data that matches the user's query question from a pre-set sample dataset; The target sample data is used as the prompt words for the large model; wherein, each sample data in the sample dataset is a tuple of a sample query question and a sample query statement. According to the method of claim 5, determining the target sample data matching the user query question from the preset sample dataset includes: Determine the set of functions, tables, and fields for each sample data in the pre-defined sample dataset; Based on the set of functions, tables, and fields, target sample data matching the user query question is determined from the sample dataset. According to the method of claim 1, the user query question is a question in natural language form, and the database query statement is a statement that the database can recognize. A database query-based processing device, comprising: The user query question acquisition module is used to acquire user query questions. The problem deconstruction module is used to filter candidate database patterns from multiple database patterns in a preset database based on the user's query question, and extract key content information from the user's query question. The pattern linking module is used to convert the key content information into content identifiers that the database can recognize, and to filter out the target database pattern from the candidate database patterns; The skeleton determination module is used to determine the function structure information of the user query question, and determine the query statement skeleton of the user query question based on the function structure information; as well as The database query statement generation module is used to fill the content identifier and the target database schema into the query statement skeleton to obtain the database query statement, and to execute the query operation in the database through the database query statement. The apparatus of claim 8, wherein the problem deconstruction module is further configured to: Determine the question vector of the user's query question, and determine the database schema vector of multiple database schemas in the preset database; Calculate the similarity between the question vector and the database pattern vector, and select candidate database patterns from the multiple database patterns based on the similarity. The apparatus of claim 8, wherein the mode linking module is further configured to: Perform text matching on the key content information and candidate database patterns, and based on the text matching results, filter out the target database pattern from the candidate database patterns; and / or Semantic matching is performed on the key content information and candidate database patterns, and the target database pattern is selected from the candidate database patterns based on the results of the semantic matching. The apparatus according to any one of claims 8 to 10, the apparatus further comprising: The execution result acquisition module is used to execute the database query statement in the database and obtain the execution result; The adjustment module is used to adjust the database query statement when the execution result is an error message, until the execution result of the adjusted database query statement is a non-error message. The apparatus according to any one of claims 8 to 10, applied to a large model, further includes: The target sample data determination module is used to determine target sample data that matches the user query question from a preset sample dataset. The target sample data is used as the prompt words for the large model; wherein, each sample data in the sample dataset is a tuple of a sample query question and a sample query statement. The apparatus according to claim 12, wherein the target sample data determination module comprises: The set information determination module is used to determine the function set, table, and field set for each sample data in a preset sample dataset; The sample data matching submodule is used to determine target sample data that matches the user query question from the sample dataset based on the function set, table and field set. The apparatus according to claim 8, wherein the user query question is a question in natural language form, and the database query statement is a statement that the database can recognize. An electronic device includes a processor, a memory, and a computer program stored in the memory and capable of running on the processor, wherein the computer program, when executed by the processor, implements the method as described in any one of claims 1 to 7. A non-volatile computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the method as described in any one of claims 1 to 7.