Data retrieval method and apparatus

By performing semantic clustering on the hypothetical questions of the tools and filtering tool description information, and by combining user input information to optimize similarity calculation, the problem of low efficiency in existing tool retrieval methods is solved, and efficient and accurate tool retrieval is achieved.

CN122196241A · Pending · Publication Date: 2026-06-12 · LENOVO (BEIJING) LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: LENOVO (BEIJING) LTD
Filing Date: 2026-01-30
Publication Date: 2026-06-12

Application Information

Patent Timeline

30 Jan 2026: Application

12 Jun 2026: Publication

CN122196241A

IPC: G06F16/903; G06F16/355; G06F40/30; G06F18/22; G06F9/448; G06N5/04

AI Tagging

Application Domain

Semantic analysis; Other databases querying

AI Technical Summary

⚠Technical Problem

Existing retrieval tools and methods have limited accuracy and coverage in complex or multi-intent tasks, resulting in low efficiency.

⚗Method used

By semantically clustering the hypothetical questions of the tools, question clustering information is generated. Combined with user input information, the first set of tools is determined. Target tools are then selected by combining the tool description information. Similarity calculation is optimized using preset weights to reduce interference from irrelevant context.

🎯Benefits of technology

It significantly improves the efficiency and accuracy of tool retrieval, reduces irrelevant contextual interference, shortens prompt length, and enhances the relevance and efficiency of model generation.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides a data retrieval method and device. The method comprises the following steps: determining that a model needs to perform tool calling according to user input, and obtaining user input information; determining a first tool set from a plurality of tools in a tool library according to at least one question clustering information of each tool and the user input information, wherein the question clustering information is obtained by performing semantic clustering on a corresponding hypothetical question of each tool; and determining a target tool according to the first tool set, wherein the target tool is used to determine a tool required by the model to perform a task corresponding to the user input.

Description

Technical Field

[0001] This application relates to the field of data processing, and more particularly to a data retrieval method and apparatus.

Background Technology

[0002] Tool retrieval is a crucial step in enhancing a model's external capabilities while it generates inference results. Its goal is to select, from a large pool of available tools, the set of tools that best fits the semantics of the current task, thereby reducing irrelevant context, shortening prompt length, and improving the relevance and efficiency of model generation. High-quality tool retrieval not only improves task execution accuracy but also effectively mitigates contextual interference. Current mainstream methods are generally based on semantic matching or information retrieval mechanisms: they calculate the similarity between the input query and the tool description or example questions, and select the Top-K tools as candidates. While this approach is simple to implement, its retrieval accuracy and coverage remain limited in complex or multi-intent tasks.

Summary of the Invention

[0003] This application provides a data retrieval method and apparatus.

[0004] One embodiment of this application provides a data retrieval method, the method comprising:

[0005] Determining, based on user input, that the model needs to perform tool calling, and obtaining user input information; determining a first toolset from multiple tools in the tool library based on at least one piece of question clustering information of each tool and the user input information, wherein the question clustering information is obtained by semantic clustering of the hypothetical questions corresponding to each tool; and determining a target tool based on the first toolset, wherein the target tool is used to determine the tool required for the model to perform the task corresponding to the user input.

[0006] The step of determining the first toolset based on at least one piece of question clustering information of each tool and the user input information includes: First candidate tools are determined based on at least one piece of question clustering information of each tool and the user input information; The similarity between the hypothetical questions corresponding to each of the first candidate tools and the user input information is calculated; The first toolset is determined based on the similarity calculation results and the first candidate tools.

[0007] The step of determining the first toolset based on at least one piece of question clustering information of each tool and the user input information includes: Second candidate tools are determined based on at least one piece of question clustering information of each tool and the user input information; The hypothetical questions and the tool description information of each second candidate tool are concatenated, and the similarity between the concatenated result and the user input information is calculated; The first toolset is determined based on the similarity calculation results and the second candidate tools.

[0008] The semantic clustering of the hypothetical questions corresponding to each tool includes one of the following: Semantic clustering is performed on the hypothetical question corresponding to each tool to obtain the question clustering information; Semantic clustering is performed on the hypothetical questions corresponding to multiple tools in the tool library to obtain the question clustering information.

[0009] The step of determining the target tool based on the first toolset includes: The second toolset is determined based on the tool descriptions of each tool in the tool library and the user input information. The target tool is determined based on the first toolset and the second toolset.

[0010] The step of determining the target tool based on the first toolset and the second toolset includes: Obtain a first score for each tool in the first toolset, where the first score represents the degree of correlation between each tool in the first toolset and the user input information; Obtain a second score for each tool in the second toolset, the second score representing the degree of relevance of each tool in the second toolset to the user input information; The target score for each tool in the toolset is determined based on the first score and the second score, wherein the toolset is the union of the first toolset and the second toolset; The target tool is determined based on the target score and the tool set.

[0011] The step of determining the target score for each tool in the toolset based on the first score and the second score includes: If the target candidate tool belongs to both the first toolset and the second toolset, then the target score is determined based on the first score and the second score of the target candidate tool. If the target candidate tool belongs only to the first toolset, then the first score of the target candidate tool is determined as the target score; If the target candidate tool belongs only to the second toolset, then the second score of the target candidate tool is determined as the target score.

[0012] The step of determining the second toolset based on the tool description information of each tool in the tool library and the user input information includes: A first association score is determined based on the degree of text overlap between each tool's description information and the user input information; A second association score is determined based on the semantic similarity between each tool's description information and the user input information; The second toolset is determined based on the first association score and the second association score of each tool.

[0013] The method further includes: Upon detecting a newly added tool in the tool library, semantic clustering is performed on the hypothetical questions corresponding to the newly added tool to obtain question clustering information. If it is determined that a hypothetical question of the newly added tool is related to another tool in the tool library, the hypothetical question is added to the hypothetical questions of the related tool. The hypothetical questions of that tool, now including the newly added ones, are semantically re-clustered to obtain the question clustering information corresponding to that tool.

[0014] Another embodiment of this application provides a data retrieval device, the device comprising: A data acquisition module, used to determine, based on user input, that the model needs to perform tool calling, and to obtain user input information; A processing module, used to determine a first toolset from multiple tools in the tool library based on at least one piece of question clustering information of each tool and the user input information, wherein the question clustering information is obtained by semantic clustering of the hypothetical questions corresponding to each tool; The processing module is further configured to determine a target tool based on the first toolset, wherein the target tool is used to determine the tool required for the model to perform the task corresponding to the user input.

[0015] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this application, nor is it intended to limit the scope of this application. Other features of this application will become readily apparent from the following description.

Attached Figure Description

[0016] The above and other objects, features, and advantages of exemplary embodiments of this application will become readily apparent from the following detailed description taken in conjunction with the accompanying drawings. Several embodiments of this application are illustrated in the drawings by way of example and not limitation. In the accompanying drawings, the same or corresponding reference numerals indicate the same or corresponding parts.

[0017] Figure 1 shows a flowchart of a data retrieval method according to an embodiment of this application;
Figure 2 shows a flowchart of a data retrieval method according to another embodiment of this application;
Figure 3 shows a flowchart of a data retrieval method according to another embodiment of this application;
Figure 4 shows a flowchart of a data retrieval method according to another embodiment of this application;
Figure 5 shows a flowchart of a data retrieval method according to another embodiment of this application;
Figure 6 shows a flowchart of a data retrieval method according to another embodiment of this application;
Figure 7 shows a flowchart of a data retrieval method according to another embodiment of this application;
Figure 8 shows a flowchart of a data retrieval method according to another embodiment of this application;
Figure 9 shows a schematic diagram of the structure of a data retrieval apparatus according to an embodiment of this application;
Figure 10 shows a schematic diagram of the composition structure of an electronic device according to an embodiment of this application.

Detailed Implementation

[0018] To make the objectives, features, and advantages of this application more apparent and understandable, the technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0019] To improve the efficiency, accuracy, and coverage of tool retrieval, one embodiment of this application provides a data retrieval method. As shown in Figure 1, the method includes: Step 101: Determine, based on user input, that the model needs to perform tool calling, and obtain user input information.

[0020] In this embodiment, when the user input includes keywords related to tasks requiring tool assistance, such as analyzing reports or generating charts, it is determined that the model needs to invoke a tool. Alternatively, it can be determined that the model needs to invoke a tool when the user input involves a task beyond the model's built-in capabilities. Furthermore, it can be determined that the model needs to invoke a tool when a preset tool invocation trigger rule is triggered based on the intent information determined by the user input.
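The keyword-based variant of this check can be sketched as follows. This is a minimal illustration only: the function name `needs_tool_call` and the keyword list are assumptions, since the source mentions analyzing reports and generating charts merely as examples of tool-assisted tasks.

```python
# Hypothetical trigger keywords; the source names "analyzing reports" and
# "generating charts" only as examples of tasks that need a tool.
TOOL_KEYWORDS = ("analyze", "report", "chart")


def needs_tool_call(user_input: str, keywords=TOOL_KEYWORDS) -> bool:
    """Return True when the input mentions any tool-related keyword."""
    text = user_input.lower()
    return any(keyword in text for keyword in keywords)
```

A capability-based or intent-rule-based check, the other two options the paragraph lists, would replace the keyword test with a different predicate but keep the same boolean interface.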

[0021] Step 102: From multiple tools in the tool library, determine a first toolset based on at least one question clustering information of each tool and the user input information. The question clustering information is obtained by semantic clustering of the hypothetical questions corresponding to each tool.

[0022] Hypothetical questions are generated for each tool to predict questions users might ask related to its functionality. For example, hypothetical questions generated for financial analysis tools include "How do I calculate the profit margin?" and "How do I analyze the balance sheet?"

[0023] Question clustering information is the representative semantic center extracted after semantic clustering of multiple hypothetical questions. Question clustering information can cover the core semantics of the corresponding tool's hypothetical questions.

[0024] The first toolset can be determined in two passes: the user input is first compared with the question clustering information of each tool in the tool library and the tools with high similarity are selected; the user input is then compared with the hypothetical questions of the selected tools, and the tools with high similarity among them form the first toolset.

[0025] Step 103: Determine the target tool based on the first toolset. The target tool is used to determine the tool required for the model to perform the task corresponding to the user input.

[0026] The target tools are used to assist the model in performing specific operations corresponding to the user's input. For example, when the user inputs "analyze financial statements and generate visualization charts", the target tools are ultimately determined to include financial analysis tools and chart generation tools. The financial analysis tools are used to assist the model in completing financial statement data analysis, and the chart generation tools are used to assist the model in generating data visualization charts.

[0027] The target tool can be determined by further sorting and filtering the first toolset, such as by the similarity between the user input information and the tool description information of each tool in the first toolset, and / or the textual overlap between the user input information and the tool description information of each tool in the first toolset.

[0028] Because each tool has a large number of hypothetical questions, directly using these questions for retrieval would result in excessive computational load, leading to low retrieval efficiency and slow speed. The solution described above, however, first selects tools with high similarity based on user input and question clustering information. Then, it determines a first toolset based on user input and the hypothetical questions of each tool, significantly improving retrieval efficiency and speed. Furthermore, the first toolset is further sorted and filtered using tool description information to determine target tools, further improving the accuracy and coverage of tool retrieval, reducing interference from irrelevant context, shortening prompt length, and ultimately enhancing the relevance and efficiency of model generation.

[0029] This application also provides a data retrieval method in one example. As shown in Figure 2, determining the first toolset based on at least one piece of question clustering information of each tool and the user input information includes: Step 201: Determine the first candidate tools based on at least one piece of question clustering information of each tool and the user input information.

[0030] For example, the user input information is converted into a corresponding user input information vector. The tool library includes tool A, tool B, tool C, and tool D. Tool A includes question cluster information A, B, C, and D; tool B includes question cluster information C, E, F, and G; tool C includes question cluster information A, H, I, and J; and tool D includes question cluster information D and E. The similarity between each piece of question cluster information of each tool and the user input information vector is determined, and for each tool the highest of these similarities is taken as the similarity between that tool and the user input information vector. The three tools with the highest similarity are selected as the first candidate tools, namely tool B, tool D, and tool A.

[0031] Step 202: Calculate the similarity between the hypothetical questions corresponding to each of the first candidate tools and the user input information.

[0032] Following the example above, the first candidate tools are tool B, tool D, and tool A. Tool B includes 30 hypothetical questions, tool D includes 20 hypothetical questions, and tool A includes 25 hypothetical questions. The similarity between the user input information vector and each of these 75 hypothetical questions is calculated.

[0033] Step 203: Determine the first toolset based on the similarity calculation results and the first candidate tools.

[0034] Continuing with the previous example, the maximum similarity among multiple similarity scores for each tool is taken as the similarity between that tool and the user's input information vector. The first candidate tools are then ranked according to this similarity, resulting in tool B, tool A, and tool D. The two tools with the highest similarity are selected and defined as the first toolset, which includes tool B and tool A.

[0035] It should be noted that the mean, median, or maximum value among the multiple similarities of each tool can be determined as the similarity between that tool and the user's input information vector. The specific settings can be configured according to the requirements.
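The coarse-to-fine selection in Steps 201 to 203 can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the function name `first_toolset`, the dictionary layout, the parameters `n_candidates` and `n_final`, and the use of cosine similarity are all hypothetical, since the source does not name a similarity measure.

```python
import math
from statistics import mean, median


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Aggregation options named in paragraph [0035].
AGGREGATORS = {"max": max, "mean": mean, "median": median}


def first_toolset(query_vec, tools, n_candidates=3, n_final=2, agg="max"):
    """tools: {name: {"clusters": [vec, ...], "questions": [vec, ...]}}"""
    # Stage 1: score each tool by its best-matching cluster centre.
    coarse = {
        name: max(cosine(query_vec, c) for c in t["clusters"])
        for name, t in tools.items()
    }
    candidates = sorted(coarse, key=coarse.get, reverse=True)[:n_candidates]
    # Stage 2: compare only the candidates' hypothetical questions.
    aggregate = AGGREGATORS[agg]
    fine = {
        name: aggregate([cosine(query_vec, q) for q in tools[name]["questions"]])
        for name in candidates
    }
    return sorted(fine, key=fine.get, reverse=True)[:n_final]
```

Only the candidates that survive stage 1 have their hypothetical questions scored, which is where the reduction in computation comes from.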

[0036] As an alternative way of determining the first toolset from the similarity calculation results and the first candidate tools, the similarity between a tool's question clustering information and the user input information and the similarity between the tool's hypothetical questions and the user input information can be weighted and summed using preset weights. The first candidate tools are then sorted and filtered based on the weighted sums to determine the first toolset.

[0037] In the above scheme, the first candidate tools are determined by combining the question clustering information of each tool with the user input information. Then, the similarity between the hypothetical questions corresponding to the first candidate tools and the user input information is calculated. Finally, the first candidate tools are ranked and filtered by taking the mean, median, or maximum of multiple similarities, or by combining the similarities between the question clustering information and the individual hypothetical questions and summing them according to preset weights, to determine the first toolset. This effectively solves the problems of high computational load, low retrieval efficiency, and slow retrieval speed caused by direct retrieval when there are many hypothetical questions corresponding to each tool, reducing unnecessary data computation and significantly improving retrieval efficiency and speed.

[0038] This application also provides a data retrieval method in one example. As shown in Figure 3, determining the first toolset based on at least one piece of question clustering information of each tool and the user input information includes: Step 301: Determine second candidate tools based on at least one piece of question clustering information of each tool and the user input information.

[0039] Similarly, we first determine the similarity between the question clustering information of each tool and the user input information, and then sort and filter the tools according to the similarity to determine the second candidate tool.

[0040] Step 302: After concatenating the hypothetical questions and the tool description information of each second candidate tool, calculate the similarity with the user input information.

[0041] Each hypothetical question of a second candidate tool can be concatenated with the tool description information. The similarity between each concatenated text and the user input information is then calculated, and the maximum of these similarities is taken as the similarity between the second candidate tool and the user input information.

[0042] Alternatively, all the hypothetical questions of the second candidate tool can be concatenated with the tool description information, and the similarity between the single concatenated text and the user input information is calculated to obtain the similarity between the second candidate tool and the user input information.
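The two concatenation variants can be compared with a short sketch. Everything here is illustrative: `embed` is a toy bag-of-words stand-in for whatever embedding model an implementation would use, cosine similarity is an assumed measure, and the function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def sim_per_question(query, questions, description):
    """Concatenate each question with the description; keep the maximum."""
    q = embed(query)
    return max(cosine(q, embed(f"{h} {description}")) for h in questions)


def sim_all_questions(query, questions, description):
    """Concatenate all questions with the description; score once."""
    q = embed(query)
    return cosine(q, embed(" ".join(questions) + " " + description))
```

The per-question variant costs one similarity computation per hypothetical question but is not diluted by unrelated questions; the all-questions variant needs a single computation per tool.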

[0043] It should be noted that, after the second candidate tools are obtained, the similarity between the hypothetical questions corresponding to each second candidate tool and the user input information can be calculated first, and third candidate tools can be determined from the second candidate tools based on these similarity results. The hypothetical questions and the tool description information of each third candidate tool are then concatenated, and the similarity between the concatenated text and the user input information is calculated. These similarity calculation results are likewise used to determine the first toolset.

[0044] Step 303: Determine the first toolset based on the similarity calculation results and the second candidate tools.

[0045] The second or third candidate tools are ranked and filtered based on their similarity to the user's input information to determine the first toolset.

[0046] In the above scheme, second candidate tools are determined by combining the question clustering information of each tool with the user input information. Then, the hypothetical question and tool description information of the second candidate tool are concatenated and their similarity is calculated with the user input information. Alternatively, a third candidate tool can be selected from the second candidate tools based on the similarity between the hypothetical question and the user input information, and then the above concatenation and similarity calculation are performed on the third candidate tool. Finally, the second or third candidate tools are ranked and selected based on the similarity calculation results to determine the first tool set. This effectively solves the problems of high computational cost, low retrieval efficiency, and slow retrieval speed caused by direct retrieval when there are many hypothetical questions for each tool. It significantly reduces unnecessary data computation and significantly improves retrieval efficiency and speed. Simultaneously, by concatenating the hypothetical question and tool description information, and combining the semantic relationship between the tool and the user input, the accuracy and coverage of tool retrieval are improved, interference from irrelevant context is reduced, and the prompt length is shortened, thereby effectively improving the targeting and efficiency of model generation.

[0047] This application also provides a data retrieval method in one example, wherein the semantic clustering of the hypothetical questions corresponding to each tool includes one of the following: Semantic clustering is performed on the hypothetical questions corresponding to each tool to obtain the question clustering information.

[0048] For each tool in the tool library, obtain all the hypothetical questions corresponding to that tool, convert these hypothetical questions into corresponding vectors, and then use a preset clustering algorithm to perform semantic clustering on the vectors corresponding to all the hypothetical questions of that tool. After the clustering is completed, at least one representative semantic center that can cover the core semantics of all the hypothetical questions of that tool is obtained, that is, at least one question clustering information corresponding to that tool.

[0049] For example, the financial analysis tool in the tool library corresponds to 30 hypothetical questions. First, each hypothetical question is converted into a corresponding vector. Then, the K-means clustering algorithm is used to cluster these 30 vectors. After clustering, 5 pieces of question clustering information are obtained for the tool.
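A plain K-means pass over one tool's question vectors might look like the sketch below. The source names K-means but gives no parameters, so the function name `kmeans`, the iteration count, the seed, and the use of cluster centroids as the "representative semantic centers" are assumptions for illustration.

```python
import random


def kmeans(vectors, k, iters=20, seed=0):
    """Plain K-means; returns (centers, labels)."""
    rng = random.Random(seed)
    # Initialise centres from k distinct input vectors.
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centre by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(v, centers[j])),
            )
            for v in vectors
        ]
        # Update step: move each centre to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centre if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels
```

In the example above, `vectors` would hold the 30 question embeddings and `k` would be 5; the returned centers would then serve as the tool's question clustering information.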

[0050] Semantic clustering is performed on the hypothetical questions corresponding to multiple tools in the tool library to obtain the question clustering information.

[0051] For all tools in the tool library, obtain all the hypothetical questions of these tools, convert all the hypothetical questions into corresponding vectors, and then use a preset clustering algorithm to perform semantic clustering on the vectors corresponding to all the hypothetical questions. After the clustering is completed, at least one question clustering information of the tool library is obtained.

[0052] For example, the tool library includes a financial analysis tool, a report interpretation tool, and a data statistics tool. The financial analysis tool has 30 corresponding hypothetical questions, the report interpretation tool has 25, and the data statistics tool has 22. Each hypothetical question is converted into a corresponding vector, and the K-means algorithm is then used to cluster these 77 vectors, resulting in 12 pieces of question clustering information for the tool library.

[0053] It should be noted that if semantic clustering is performed on the hypothetical questions corresponding to multiple tools in the tool library to obtain question clustering information, then the tool corresponding to each question clustering information needs to be labeled in the tool library. For example, if a certain question clustering information is obtained by clustering multiple hypothetical questions from tool A and tool B, then this question clustering information needs to be labeled as the question clustering information of tool A and tool B.
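The labeling step for library-wide clustering can be sketched as a simple grouping. The function name `label_clusters` and the parallel-list input layout are assumptions; the cluster ids would come from whatever clustering algorithm was run over all tools' hypothetical questions.

```python
from collections import defaultdict


def label_clusters(question_owner, question_cluster):
    """Map each cluster id to the set of tools whose questions fall in it.

    question_owner[i] is the tool that owns hypothetical question i;
    question_cluster[i] is the cluster id assigned to question i.
    """
    cluster_tools = defaultdict(set)
    for tool, cluster_id in zip(question_owner, question_cluster):
        cluster_tools[cluster_id].add(tool)
    return dict(cluster_tools)
```

A cluster whose set contains more than one tool corresponds to the case described above, where one piece of question clustering information is labeled as belonging to both tool A and tool B.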

[0054] In the above scheme, performing semantic clustering on the hypothetical question corresponding to each tool individually can accurately obtain the core semantics of the tool's hypothetical question, providing targeted retrieval information for subsequent searches related to that tool. Conversely, performing semantic clustering on the hypothetical questions corresponding to all tools in the tool library and labeling the tools corresponding to each question cluster information can solve the problem of semantic fragmentation between different tools, establish cross-tool semantic connections, and ensure that question cluster information covers the common semantics of all tools. By adapting to different retrieval scenarios through these two clustering methods, the accuracy and coverage of retrieval are guaranteed while reducing data computation.

[0055] This application also provides a data retrieval method in one example. As shown in Figure 4, determining the target tool based on the first toolset includes: Step 401: Determine the second toolset based on the tool description information of each tool in the tool library and the user input information.

[0056] After determining the first toolset, the similarity between the tool description information of each tool in the tool library and the user input information is determined. Based on this similarity, the tools are sorted and filtered to determine the second toolset.

[0057] Step 402: Determine the target tool based on the first toolset and the second toolset.

[0058] The target tool can be determined from the first tool set and the second tool set based on the similarity between each tool in the first tool set and the user input information, and the similarity between each tool in the second tool set and the user input information.

[0059] In the above scheme, after determining the first toolset, a second toolset is obtained by sorting and filtering the tools in the tool library based on the similarity between the tool description information and the user input information. The target tools are then determined by combining the first and second toolsets. This approach utilizes the semantic similarity association information between the question clustering information and the user input information in the first toolset, and also incorporates the tool function information from the tool description information in the second toolset. This solves the problem of incomplete semantic coverage and insufficient matching accuracy caused by searching based on a single dimension. It significantly improves the accuracy and coverage of tool retrieval.

[0060] This application also provides a data retrieval method in one example. As shown in Figure 5, determining the target tool based on the first toolset and the second toolset includes: Step 501: Obtain a first score for each tool in the first toolset, wherein the first score represents the degree of correlation between each tool in the first toolset and the user input information.

[0061] Specifically, the first score of each tool in the first toolset can be determined using the following formula:

$$S_1 = w_c \cdot S_c + w_h \cdot S_h$$

[0062] where $S_c$ is the similarity between the tool and the user input information determined based on the question clustering information corresponding to the tool, $S_h$ is the similarity between the tool and the user input information determined based on the hypothetical questions corresponding to the tool, $w_c$ is the preset weight corresponding to $S_c$, and $w_h$ is the preset weight corresponding to $S_h$.
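As a small worked sketch, the first score can be computed as a weighted sum of the two similarities, which is the form the description of preset weights suggests. The function names and the default weights 0.4 and 0.6 are hypothetical; the source does not give weight values.

```python
def first_score(cluster_sim, question_sim, w_cluster=0.4, w_question=0.6):
    """Weighted sum of the two similarities; default weights are hypothetical."""
    return w_cluster * cluster_sim + w_question * question_sim


def rank_by_first_score(candidates):
    """candidates: {tool: (cluster_sim, question_sim)} -> tools, best first."""
    scores = {t: first_score(c, q) for t, (c, q) in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

With these weights, a tool with a strong hypothetical-question match can outrank one whose cluster match is stronger, which is the balancing effect the preset weights are meant to provide.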

[0063] Step 502: Obtain a second score for each tool in the second toolset, wherein the second score characterizes the degree of association between each tool in the second toolset and the user input information.

[0064] Specifically, the second score for each tool in the second toolset can be determined by a formula whose terms are defined as follows.

[0065] In the formula, one quantity is the degree of text overlap between the tool description information and the user input information, and another is the similarity between the tool description information and the user input information. These quantities are combined using preset weights.
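As with the first score, the formula is not reproduced in this text. A form consistent with the definitions above, again assuming a weighted sum, is sketched below. The symbols $S_2$, $o$, $s_{\mathrm{desc}}$, $\gamma$ and $\delta$ are illustrative names, not the original notation.

```latex
% Assumed form: weighted sum of text overlap and description similarity from [0065].
% t = tool, q = user input information; all symbol names are hypothetical.
S_2(t) = \gamma \, o(t, q) + \delta \, s_{\mathrm{desc}}(t, q)
```

Here $o$ is the degree of text overlap, $s_{\mathrm{desc}}$ is the similarity between the tool description information and the user input information, and $\gamma$, $\delta$ are the preset weights.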

[0066] Step 503: Determine the target score for each tool in the toolset based on the first score and the second score, wherein the toolset is the union of the first toolset and the second toolset.

[0067] If a tool belongs to both the first and second toolsets, its target score is determined based on its first and second scores. If a tool belongs to the first toolset but not the second, its target score is determined based on its first score. If a tool belongs to the second toolset but not the first, its target score is determined based on its second score.

[0068] Step 504: Determine the target tool based on the target score and the tool set.

[0069] The tools in the toolset are sorted and filtered according to the target score to determine the target tool.

[0070] In the above scheme, a first score is determined for each tool in the first toolset and a second score for each tool in the second toolset. The first score combines the tool's question clustering information and the similarity between the hypothetical question and the user input information, with preset weights balancing their influence. The second score combines the textual overlap and similarity between the tool's description information and the user input information, again with preset weights balancing their influence. Then, for the union of the first and second toolsets, a target score is determined according to the different sets to which the tools belong. Finally, target tools are sorted and selected based on the target scores. This approach utilizes the relationship between tools and user input information from multiple dimensions, taking into account both the semantic information of the hypothetical question and the functional information of the tool, thus effectively combining information from the two toolsets. This avoids the limitations of single-dimensional retrieval, reduces data computation, and further improves the accuracy and coverage of tool retrieval.

[0071] This application also provides a data retrieval method in one example. As shown in Figure 6, determining the target score for each tool in the toolset based on the first score and the second score includes: Step 601: If the target candidate tool belongs to both the first toolset and the second toolset, then determine the target score based on the first score and the second score of the target candidate tool.

[0072] For example, the preset weight for the first score is 0.3, and the preset weight for the second score is 0.7. The first toolset includes tools A, B, C, and D, with first scores of 0.8, 0.75, 0.78, and 0.85, respectively. The second toolset includes tools E, F, D, and A, with second scores of 0.6, 0.9, 0.7, and 0.65, respectively. Tool A belongs to both the first and second toolsets, so its target score is 0.3 × 0.8 + 0.7 × 0.65 = 0.695. Tool D also belongs to both toolsets, so its target score is 0.3 × 0.85 + 0.7 × 0.7 = 0.745.

[0073] Step 602: If the target candidate tool belongs only to the first toolset, then the first score of the target candidate tool is determined as the target score.

[0074] Continuing with the above example, tool B belongs only to the first toolset, so the target score for tool B is 0.75; tool C belongs only to the first toolset, so the target score for tool C is 0.78.

[0075] Step 603: If the target candidate tool belongs only to the second toolset, then the second score of the target candidate tool is determined as the target score.

[0076] Continuing with the previous example, tool E belongs only to the second toolset, so the target score for tool E is 0.6; tool F belongs only to the second toolset, so the target score for tool F is 0.9.

[0077] In the above scheme, the target score is determined based on whether the target candidate tool belongs to the first toolset and the second toolset. If the target candidate tool belongs to both sets, the target score is determined by combining its first and second scores, enabling retrieval based on multiple dimensions and avoiding the bias caused by retrieval based on a single dimension. If the target candidate tool belongs to only one set, the first or second score corresponding to that set is directly used as the target score, ensuring the stability of the retrieval.
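Steps 601 to 603, followed by the sorting and filtering of Step 504, can be sketched in a few lines of Python using the numbers from the worked example. This is a minimal sketch, not the patented implementation. The function names `target_scores` and `select_targets` and the cut-off `top_k` are assumptions for illustration. The weighted sum for tools in both toolsets is inferred from the example values 0.695 and 0.745.

```python
from typing import Dict, List


def target_scores(first: Dict[str, float], second: Dict[str, float],
                  w_first: float = 0.3, w_second: float = 0.7) -> Dict[str, float]:
    """Compute a target score for every tool in the union of both toolsets."""
    merged: Dict[str, float] = {}
    for tool in sorted(first.keys() | second.keys()):
        if tool in first and tool in second:
            # Step 601: in both toolsets, combine both scores (assumed weighted sum).
            merged[tool] = w_first * first[tool] + w_second * second[tool]
        elif tool in first:
            # Step 602: only in the first toolset, keep the first score.
            merged[tool] = first[tool]
        else:
            # Step 603: only in the second toolset, keep the second score.
            merged[tool] = second[tool]
    return merged


def select_targets(scores: Dict[str, float], top_k: int) -> List[str]:
    """Sort by descending target score (ties broken by name) and keep top_k tools."""
    return sorted(scores, key=lambda tool: (-scores[tool], tool))[:top_k]


# Values from the worked example in paragraphs [0072] to [0076].
first_scores = {"A": 0.8, "B": 0.75, "C": 0.78, "D": 0.85}
second_scores = {"E": 0.6, "F": 0.9, "D": 0.7, "A": 0.65}

scores = target_scores(first_scores, second_scores)
ranked = select_targets(scores, top_k=3)  # top_k=3 is a hypothetical cut-off
print(ranked)
```

With these inputs, tools F, C and B rank ahead of D and A. A tool found by only one retrieval path can therefore outrank a tool found by both, because its single score is carried over unchanged.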

[0078] This application also provides a data retrieval method in one example. As shown in Figure 7, determining the second toolset based on the tool description information of each tool in the tool library and the user input information includes: Step 701: Determine the first association score based on the text overlap between the tool description information of each tool and the user input information.

[0079] In this embodiment, the first association score can be obtained as the proportion of words that occur in both the tool description information and the user input information relative to the total number of words in the user input information. Alternatively, the first association score can be obtained by counting the task-related keywords matched between the tool description information and the user input information, with each match count assigned a corresponding score. As a further alternative, the term relevance between the tool description information and the user input information can be calculated using a sparse retrieval method such as BM25 (a probabilistic information retrieval model) or SPLADE (a sparse retrieval model), and this relevance is used directly as the first association score.
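The first of these options, the word-overlap proportion, might look like the following sketch. The tokenization (lowercased alphanumeric runs) and the use of distinct words rather than word occurrences are assumptions made here. The function name `overlap_score` and the sample strings are hypothetical.

```python
import re
from typing import Set


def _words(text: str) -> Set[str]:
    """Lowercase the text and return its distinct alphanumeric words (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(description: str, user_input: str) -> float:
    """Share of distinct user-input words that also occur in the tool description."""
    query_words = _words(user_input)
    if not query_words:
        return 0.0  # avoid division by zero for empty input
    return len(query_words & _words(description)) / len(query_words)


# Hypothetical tool description and user input.
score = overlap_score(
    "Calculate quarterly tax payments and file tax returns",
    "How do I calculate my quarterly tax?",
)
print(round(score, 3))
```

In this example three of the seven distinct input words ("calculate", "quarterly", "tax") appear in the description, so the score is 3/7.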

[0080] Step 702: Determine the second association score based on the semantic similarity between the tool description information of each tool and the user input information.

[0081] In this embodiment, the tool description information and user input information can be converted into vectors respectively, and then the cosine similarity, Manhattan distance, Pearson correlation coefficient or Euclidean distance between the two vectors can be determined as the second association score.
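Of the measures listed, cosine similarity is sketched below in plain Python. How the texts are turned into vectors (the embedding model) is outside this sketch, and the toy vectors are hypothetical. Note that Manhattan and Euclidean distances shrink as vectors get closer, so they would presumably need to be converted before serving as a score where higher means more related.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors, or 0.0 if either is zero."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 2-D vectors standing in for embeddings of a tool description and a user input.
description_vec = [1.0, 1.0]
input_vec = [1.0, 0.0]
second_association = cosine_similarity(description_vec, input_vec)
print(round(second_association, 4))
```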

[0082] Step 703: Determine the second toolset based on the first association score and the second association score of each tool.

[0083] A third association score can be obtained as a weighted sum of the first and second association scores using preset weights. The tools are then sorted and filtered by the third association score to determine the second toolset.

[0084] In the above scheme, a first association score is determined based on the textual overlap between the tool description and user input, and a second association score is determined based on the semantic similarity between the two. These two association scores are then weighted and summed using preset weights to obtain a third association score. Finally, the tools are ranked and filtered based on the third association score to determine the second toolset. This approach combines both the direct textual association between the tool description and user input, as well as their semantic association, thus avoiding the limitations of a single-dimensional judgment. This further improves the retrieval accuracy of the second toolset.
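The weighted sum and filtering of Step 703 can be sketched as follows. The weights 0.4 and 0.6, the cut-off `top_k`, the tool names and the score values are all hypothetical. The source only states that preset weights are used and that the tools are sorted and filtered.

```python
from typing import Dict, List, Tuple


def second_toolset(overlap: Dict[str, float], semantic: Dict[str, float],
                   w_overlap: float = 0.4, w_semantic: float = 0.6,
                   top_k: int = 2) -> Tuple[List[str], Dict[str, float]]:
    """Weighted sum of both association scores, then keep the top_k tools."""
    third = {
        tool: w_overlap * overlap[tool] + w_semantic * semantic[tool]
        for tool in overlap
    }
    # Sort by descending third association score; ties are broken by tool name.
    ranked = sorted(third, key=lambda tool: (-third[tool], tool))
    return ranked[:top_k], third


# Hypothetical first (text overlap) and second (semantic) association scores.
overlap_scores = {"tax_filing": 0.5, "finance_analysis": 0.25, "weather": 0.0}
semantic_scores = {"tax_filing": 0.9, "finance_analysis": 0.7, "weather": 0.1}

selected, third_scores = second_toolset(overlap_scores, semantic_scores)
print(selected)
```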

[0085] This application also provides a data retrieval method in one example. As shown in Figure 8, the method further includes: Step 801: When a new tool is detected in the tool library, semantic clustering is performed on the hypothetical questions corresponding to the new tool to obtain question clustering information.

[0086] For example, if a new tax filing assistance tool is detected in the tool library, 30 hypothetical questions corresponding to this tool are generated using a language model, and these 30 hypothetical questions are converted into corresponding vectors. Then, the K-means clustering algorithm is used to perform semantic clustering on these vectors to obtain five pieces of question clustering information for the tool.
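The clustering step can be illustrated with a small, dependency-free K-means. This sketch makes several assumptions: the question clustering information is represented by cluster centroids, initial centroids are simply the first k vectors (so the run is deterministic), and four toy 2-D vectors stand in for the 30 embedded questions. In practice a library implementation such as scikit-learn's `KMeans` would be a more typical choice.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def kmeans(vectors: List[Vector], k: int, iterations: int = 20) -> List[List[float]]:
    """Plain K-means returning k centroids; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: math.dist(v, centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            [sum(dim) / len(members) for dim in zip(*members)] if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids


# Toy stand-ins for embedded hypothetical questions (two obvious groups).
question_vectors = [(0.0, 0.0), (10.0, 10.0), (0.0, 2.0), (10.0, 12.0)]
cluster_info = kmeans(question_vectors, k=2)
print(cluster_info)
```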

[0087] Step 802: If a hypothetical question of the newly added tool is determined to be related to another tool in the tool library, add that hypothetical question to the hypothetical questions of that tool.

[0088] Following the example above, semantic analysis is performed on the 30 hypothetical questions of the tax filing assistance tool. Two hypothetical questions, "How to calculate quarterly tax payments?" and "How to verify the consistency between tax and financial data?", are identified as having high semantic similarity to the hypothetical questions of an existing financial analysis tool in the tool library, indicating a correlation. Therefore, these two hypothetical questions are added to the hypothetical questions of the financial analysis tool, increasing their number from 30 to 32.

[0089] Step 803: Perform semantic clustering again on the hypothetical questions of the tool to which hypothetical questions were added, to obtain the question clustering information corresponding to that tool.

[0090] Following the example above, after two new hypothetical questions are added to the financial analysis tool, the tool's question clustering information needs to be updated. The K-means clustering algorithm is used to perform semantic clustering on the vectors corresponding to the updated 32 hypothetical questions, obtaining the tool's new question clustering information, which replaces the tool's original question clustering information.

[0091] In the above scheme, when a new tool is added to the tool library, semantic clustering is first performed on the hypothetical questions of the new tool to obtain corresponding question clustering information, ensuring that the new tool can quickly acquire index information for retrieval. When a correlation is detected between the hypothetical questions of the new tool and other tools in the tool library, these hypothetical questions are added to the hypothetical questions of the corresponding tools, updating the semantic relationships between different tools and avoiding semantic fragmentation. Then, the tool with the added hypothetical questions is re-semantically clustered, ensuring that the question clustering information of that tool covers the core semantics of all updated hypothetical questions, guaranteeing the accuracy of its question clustering information. This achieves rapid information updates after a new tool is added to the tool library.
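The relatedness check of Step 802 is described only as "high semantic similarity", so the sketch below makes its own assumptions: a question of the new tool counts as related when its best cosine similarity against the existing tool's questions reaches a threshold of 0.9. The threshold, the function names, the toy vectors, and the questions other than "How to calculate quarterly tax payments?" are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vec = Tuple[float, ...]


def cosine(a: Vec, b: Vec) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def related_questions(new_tool: Dict[str, Vec], existing_tool: Dict[str, Vec],
                      threshold: float = 0.9) -> List[str]:
    """Questions of the new tool whose best match in the existing tool meets the threshold."""
    related = []
    for text, vec in new_tool.items():
        best = max(cosine(vec, other) for other in existing_tool.values())
        if best >= threshold:
            related.append(text)
    return related


# Toy embeddings: question text -> vector.
tax_tool = {
    "How to calculate quarterly tax payments?": (1.0, 0.1),
    "How to submit a tax return online?": (0.0, 1.0),
}
finance_tool = {"How to analyze quarterly cash flow?": (1.0, 0.0)}

to_add = related_questions(tax_tool, finance_tool)
# Step 802: copy the related questions into the existing tool's question set.
finance_tool.update({text: tax_tool[text] for text in to_add})
# Step 803 would now re-run semantic clustering on finance_tool's vectors.
print(to_add, len(finance_tool))
```

Only the tool whose question set changed needs to be re-clustered, which is what keeps the update after adding a tool limited in scope.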

[0092] To implement the above data retrieval method, as shown in Figure 9, an example of this application provides a data retrieval apparatus, including: a data acquisition module 901, used to determine the tool calls required by the model based on user input and to obtain user input information; and a processing module 902, used to determine a first toolset from multiple tools in the tool library based on at least one question clustering information of each tool and the user input information, wherein the question clustering information is obtained by semantic clustering of the hypothetical questions corresponding to each tool. The processing module 902 is further configured to determine a target tool based on the first toolset, wherein the target tool is used to determine the tool required for the model to perform the task corresponding to the user input.

[0093] The processing module 902 is further configured to determine a first candidate tool based on at least one question clustering information of each tool and the user input information; the processing module 902 is further configured to perform similarity calculation between the hypothetical questions corresponding to each of the first candidate tools and the user input information; and the processing module 902 is further configured to determine the first toolset based on the similarity calculation results and the first candidate tools.

[0094] The processing module 902 is further configured to determine a second candidate tool based on at least one question clustering information of each tool and the user input information; the processing module 902 is further configured to concatenate the hypothetical questions and tool description information of each second candidate tool and perform similarity calculation with the user input information; and the processing module 902 is further configured to determine the first toolset based on the similarity calculation results and the second candidate tools.

[0095] The processing module 902 is further configured to perform semantic clustering on the hypothetical questions corresponding to each tool to obtain the question clustering information; The processing module 902 is also used to perform semantic clustering on the hypothetical questions corresponding to multiple tools in the tool library to obtain the question clustering information.

[0096] The processing module 902 is further configured to determine a second toolset based on the tool description information of each tool in the tool library and the user input information; The processing module 902 is further configured to determine the target tool based on the first toolset and the second toolset.

[0097] The processing module 902 is further configured to obtain a first score for each tool in the first toolset, wherein the first score represents the degree of correlation between each tool in the first toolset and the user input information; The processing module 902 is further configured to obtain a second score for each tool in the second toolset, wherein the second score represents the degree of correlation between each tool in the second toolset and the user input information; The processing module 902 is further configured to determine the target score of each tool in the toolset based on the first score and the second score, wherein the toolset is the union of the first toolset and the second toolset; The processing module 902 is further configured to determine the target tool based on the target score and the tool set.

[0098] The processing module 902 is further configured to determine a target score based on a first score and a second score of the target candidate tool if the target candidate tool belongs to both the first toolset and the second toolset. The processing module 902 is further configured to determine the first score of the target candidate tool as the target score if the target candidate tool belongs only to the first toolset; The processing module 902 is further configured to determine the second score of the target candidate tool as the target score if the target candidate tool belongs only to the second toolset.

[0099] The processing module 902 is further configured to determine a first association score based on the text overlap between the tool description information of each tool and the user input information; The processing module 902 is further configured to determine a second association score based on the semantic similarity between the tool description information of each tool and the user input information; The processing module 902 is further configured to determine the second toolset based on the first association score and the second association score of each tool.

[0100] The processing module 902 is further configured to, when a newly added tool is detected in the tool library, perform semantic clustering on the hypothetical questions corresponding to the newly added tool to obtain question clustering information; the processing module 902 is further configured to, if a hypothetical question of the newly added tool is determined to be related to another tool in the tool library, add that hypothetical question to the hypothetical questions of that tool; and the processing module 902 is further configured to perform semantic clustering again on the hypothetical questions of the tool to which hypothetical questions were added, to obtain the question clustering information corresponding to that tool.

[0101] According to embodiments of this disclosure, this disclosure also provides an electronic device and a readable storage medium.

[0102] Figure 10 shows a schematic block diagram of an example electronic device 1000 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and / or claimed herein.

[0103] As shown in Figure 10, the electronic device 1000 includes a computing unit 1001, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a random access memory (RAM) 1003. The RAM 1003 may also store various programs and data required for the operation of the electronic device 1000. The computing unit 1001, ROM 1002, and RAM 1003 are interconnected via a bus 1004. An input / output (I / O) interface 1005 is also connected to the bus 1004.

[0104] Multiple components in electronic device 1000 are connected to I / O interface 1005, including: input unit 1006, such as keyboard, mouse, etc.; output unit 1007, such as various types of displays, speakers, etc.; storage unit 1008, such as disk, optical disk, etc.; and communication unit 1009, such as network card, modem, wireless transceiver, etc. Communication unit 1009 allows electronic device 1000 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.

[0105] The computing unit 1001 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1001 performs the various methods and processes described above, such as data retrieval methods. For example, in some embodiments, the data retrieval method may be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 1008. In some embodiments, part or all of the computer program may be loaded and / or installed on the electronic device 1000 via ROM 1002 and / or communication unit 1009. When the computer program is loaded into RAM 1003 and executed by the computing unit 1001, one or more steps of the data retrieval method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the data retrieval method by any other suitable means (e.g., by means of firmware).

[0106] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0107] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0108] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0109] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0110] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.

[0111] Computer systems can include clients and servers. Clients and servers are generally located far apart and typically interact via communication networks. Client-server relationships are created by computer programs running on the respective computers and having a client-server relationship with each other. Servers can be cloud servers, servers in distributed systems, or servers incorporating blockchain technology.

[0112] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution disclosed in this disclosure can be achieved, and this is not limited herein.

[0113] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this disclosure, "a plurality of" means two or more, unless otherwise explicitly specified.

[0114] The above description is merely a specific embodiment of this disclosure, but the scope of protection of this disclosure is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this disclosure should be included within the scope of protection of this disclosure. Therefore, the scope of protection of this disclosure should be determined by the scope of the claims.

Claims

1. A data retrieval method, the method comprising: Determining, based on user input, that a model needs to call a tool, and obtaining user input information; From multiple tools in the tool library, a first toolset is determined based on at least one question clustering information of each tool and the user input information, wherein the question clustering information is obtained by semantic clustering of the hypothetical questions corresponding to each tool; The target tool is determined based on the first toolset, and the target tool is used to determine the tool required for the model to perform the task corresponding to the user input.

2. The method according to claim 1, wherein determining the first toolset based on at least one question clustering information of each tool and the user input information comprises: Based on at least one question clustering information of each tool and the user input information, a first candidate tool is determined; The similarity between the hypothetical questions corresponding to each of the first candidate tools and the user input information is calculated; The first toolset is determined based on the similarity calculation results and the first candidate tools.

3. The method according to claim 1 or 2, wherein determining the first toolset based on at least one question clustering information of each tool and the user input information comprises: A second candidate tool is determined based on at least one question clustering information of each tool and the user input information; The hypothetical questions and tool description information of each second candidate tool are concatenated and then compared with the user input information to calculate the similarity; The first toolset is determined based on the similarity calculation results and the second candidate tools.

4. The method according to claim 1, wherein the semantic clustering of the hypothetical questions corresponding to each tool includes one of the following: Semantic clustering is performed on the hypothetical question corresponding to each tool to obtain the question clustering information; Semantic clustering is performed on the hypothetical questions corresponding to multiple tools in the tool library to obtain the question clustering information.

5. The method according to claim 1, wherein determining the target tool based on the first toolset comprises: The second toolset is determined based on the tool descriptions of each tool in the tool library and the user input information. The target tool is determined based on the first toolset and the second toolset.

6. The method according to claim 5, wherein determining the target tool based on the first toolset and the second toolset comprises: Obtain a first score for each tool in the first toolset, where the first score represents the degree of correlation between each tool in the first toolset and the user input information; Obtain a second score for each tool in the second toolset, the second score representing the degree of relevance of each tool in the second toolset to the user input information; The target score for each tool in the toolset is determined based on the first score and the second score, wherein the toolset is the union of the first toolset and the second toolset; The target tool is determined based on the target score and the tool set.

7. The method of claim 6, wherein determining the target score for each tool in the toolset based on the first score and the second score comprises: If the target candidate tool belongs to both the first toolset and the second toolset, then the target score is determined based on the first score and the second score of the target candidate tool. If the target candidate tool belongs only to the first toolset, then the first score of the target candidate tool is determined as the target score; If the target candidate tool belongs only to the second toolset, then the second score of the target candidate tool is determined as the target score.

8. The method according to claim 5, wherein determining the second toolset based on the tool description information of each tool in the tool library and the user input information comprises: The first association score is determined based on the degree of text overlap between the tool description information and the user input information of each tool; A second association score is determined based on the semantic similarity between the tool description information and the user input information of each tool; The second toolset is determined based on the first association score and the second association score of each tool.

9. The method according to claim 1, further comprising: Upon detecting a new tool in the tool library, semantic clustering is performed on the hypothetical questions corresponding to the new tool to obtain question clustering information; If it is determined that a hypothetical question of the newly added tool is related to another tool in the tool library, the hypothetical question is added to the hypothetical questions of that tool; Semantic clustering is performed again on the hypothetical questions of the tool to which the hypothetical question was added, to obtain the question clustering information corresponding to that tool.

10. A data retrieval device, the device comprising: The data acquisition module is used to determine the tools that the model needs to call based on user input and to obtain user input information; The processing module is used to determine a first toolset from multiple tools in the tool library based on at least one question clustering information of each tool and the user input information, wherein the question clustering information is obtained by semantic clustering of the hypothetical questions corresponding to each tool; The processing module is further configured to determine a target tool based on the first toolset, wherein the target tool is used to determine the tool required for the model to perform the task corresponding to the user input.