A method for accelerating AI agent tool calls based on a two-level caching mechanism

By constructing a two-level caching mechanism and vectorizing the tool set, the method optimizes the tool invocation process of AI agents, addressing low tool invocation efficiency and improving response speed.

CN122309673A · Pending · Publication Date: 2026-06-30 · ZHEJIANG ZHUANZHUZHILIAN TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: ZHEJIANG ZHUANZHUZHILIAN TECH CO LTD
Filing Date: 2026-04-03
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

With existing technologies, when there are many tools and the context length is limited, the efficiency of AI agents calling tools is low and the response latency for repeated queries is high.

Method used

A two-level caching mechanism is adopted to construct a historical question-answer pair cache and a tool cache. Through a dual-path recall mechanism of semantic retrieval and BM25 retrieval, reusable answers are retrieved first from the historical question-answer pair cache. If no answer is found, candidate tools are recalled from the tool cache for the large language model to select and execute. The tool set is also processed by vectorization to filter candidate tools.

Benefits of technology

The method reduces unnecessary tool calls, lowers response latency for repeated queries, improves tool call efficiency and scalability, and is suitable for multi-domain AI agent systems.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a method for accelerating AI agent tool invocation based on a two-level caching mechanism, relating to the field of data processing technology. First, the tool set is vectorized to generate dense semantic vectors and BM25 sparse representations, constructing a tool cache database. Upon receiving a user query, a dual-path retrieval is performed in the historical question-answer pair cache database using semantic retrieval and BM25 text retrieval. If the matching degree exceeds a threshold, the historical answer is directly returned; otherwise, a dual-path retrieval is performed in the tool cache database to obtain candidate tools. After a large language model selects the tool for execution and generates the final answer, the new question-answer pair is written to the historical question-answer pair cache, achieving a closed-loop update. This method is suitable for efficient AI agent invocation in multi-tool scenarios.

Description

Technical Field

[0001] This invention relates to the field of data processing technology, and more specifically to a method for accelerating AI agent tool invocation based on a two-level caching mechanism.

Background Technology

[0002] AI agents are software systems that use AI to achieve goals and complete tasks on behalf of users. Their core characteristics are the ability to autonomously design workflows and invoke tools. As agent systems that control tools to solve problems, they typically include modules for perception, planning, memory, and tool usage, and achieve complex task processing through large language models.

[0003] Common AI agent tool invocation schemes receive user input through a large language model, which autonomously selects suitable tools from a tool list, obtains and processes the tools' execution results, and outputs a response to the user. The patent "A Dynamic Tool Selection and Optimization System and Method for External Tool Invocation by a Large Language Model" first constructs a tool feature space along three dimensions: functional characteristics, performance indicators, and resource requirements. It then employs a dynamic adaptive matching engine to optimize tool selection and combination, and uses a reinforcement learning mechanism to update the invocation strategy based on tool invocation monitoring data, continuously improving the accuracy of tool selection. To address incorrect tool selection, incomplete invocation, and out-of-order invocation by large language model agents, the publication "A Dynamic Tool Selection and Reflection Method Based on Retrieval Augmentation Generation (RAG) Technology" first uses retrieval-augmented generation to match candidate tools, then uses pre-constructed prompts to constrain the order in which the large model invokes tools, and finally triggers reflection at the task termination stage to verify whether any uninvoked tool is still needed. Experiments show that this method improves tool invocation accuracy and task completion rate and reduces the number of inversions.

[0004] Most of the aforementioned tool-calling solutions focus on improving the accuracy of tool calls from large models, while relatively neglecting the crucial dimension of tool call efficiency. Meanwhile, the tool-calling capabilities of large models themselves are continuously improving. According to the latest rankings of the Berkeley Function Calling Leaderboard (BFCL v4) (updated 2025-11-03), the accuracy of multi-round tool calls is approaching 70% (actual value 69.12%), and the accuracy of single-round tool calls is as high as 89.02%. Given this trend, it is foreseeable that as the capabilities of large models continue to improve, the marginal benefit of relying solely on engineering methods to further increase accuracy will gradually decrease, and its overall cost-effectiveness will continue to decline.

[0005] Therefore, how to accurately and effectively invoke the required tools when the context length of a large model is limited is a problem that those skilled in the art urgently need to solve.

Summary of the Invention

[0006] In view of this, the present invention provides an AI agent tool invocation acceleration method based on a two-level caching mechanism to solve the problems of low efficiency in AI agent tool invocation and high latency in repeated query response under the conditions of a large number of tools and limited context length in the prior art.

[0007] To achieve the above objectives, the present invention adopts the following technical solution. A method for accelerating AI agent tool calls based on a two-level caching mechanism includes:
performing vectorized representation processing on each tool in the agent tool set, in which the tool name and tool description of each tool are combined to form tool information text, a dense semantic vector and a BM25 sparse vector are generated from the tool information text, and the tool information text, dense semantic vector, and BM25 sparse vector are written into the tool cache database;
receiving a user query, and performing semantic retrieval and BM25 text retrieval on the user query in the historical question-answer pair cache database to obtain matching results of candidate historical question-answer pairs;
when the maximum matching score of the candidate historical question-answer pairs is greater than a preset threshold, directly outputting the answer in the corresponding historical question-answer pair;
when the maximum matching score of the candidate historical question-answer pairs is less than or equal to the preset threshold, performing semantic retrieval and BM25 sparse retrieval on the user query in the tool cache database to obtain matching results of candidate tools, and forming a candidate tool set based on those matching results;
inputting the candidate tool set into a large language model, which selects and executes the target tool, then generating the final answer based on the tool's execution result and returning it to the user.

[0008] Preferably, the method further includes: combining the user query and the final answer to form a new question-answer pair, and writing the new question-answer pair into the historical question-answer pair cache database to achieve a closed-loop update of that database. Specifically, the new question-answer pair is vectorized and encoded to obtain a dense semantic vector for semantic retrieval, and a text index for BM25 text retrieval is established for the new question-answer pair.

[0009] Preferably, the vectorized representation processing of each tool in the agent tool set includes:
obtaining the tool name and tool description of each tool;
combining the tool name and tool description to form the tool information text;
encoding the tool information text with a word embedding model to obtain a dense semantic vector;
segmenting the tool information text with a word segmentation tool and generating the corresponding BM25 sparse vector based on the BM25 algorithm;
writing the tool information text, the dense semantic vector, and the BM25 sparse vector into the text field, dense vector field, and sparse vector field of the tool cache database, respectively.

[0010] Preferably, the tool cache database is a vector database, a dense vector index is established for the dense vector field, and the metric type of the dense vector index is cosine similarity; a sparse inverted index is established for the sparse vector field to support sparse retrieval based on BM25; The historical question-and-answer pair cache database establishes a vector index for the dense semantic vectors of the question-and-answer pairs to support semantic retrieval of historical question-and-answer pairs.

[0011] Preferably, the step of performing semantic retrieval and BM25 text retrieval on the user query in the historical question-answer pair cache database to obtain the matching results of candidate historical question-answer pairs includes:
vectorizing and encoding the user query to obtain a query dense vector;
performing vector retrieval in the historical question-answer pair cache database based on the query dense vector to obtain the first matching score of each candidate historical question-answer pair;
performing BM25 text retrieval in the historical question-answer pair cache database based on the user query to obtain the second matching score of each candidate historical question-answer pair;
fusing the first matching score and the second matching score by weighting to obtain the final matching score of each candidate historical question-answer pair, and taking the candidate with the largest final matching score as the matching result.

[0012] Preferably, the step of performing semantic retrieval and BM25 sparse retrieval on the user query in the tool cache database to obtain matching results of candidate tools, and forming a candidate tool set based on those matching results, includes:
normalizing the user query;
performing dense vector semantic retrieval in the tool cache database based on the query dense vector to obtain the third matching score of each candidate tool;
performing BM25 sparse retrieval in the tool cache database based on the normalized user query to obtain the fourth matching score of each candidate tool;
fusing the third matching score and the fourth matching score by weighting to obtain the final matching score of each candidate tool;
sorting the candidate tools from highest to lowest by final matching score, and selecting the top_k highest-scoring candidate tools as the matching results, which form the candidate tool set.

[0013] Compared with the prior art, the present invention has the following beneficial effects: This invention constructs a two-level caching mechanism consisting of a historical question-answer pair cache and a tool cache. After receiving a user query, the AI agent first tries to retrieve a reusable answer from the historical question-answer pair cache. If no answer is found there, the AI agent retrieves candidate tools from the tool cache for the large language model to select and execute. This reduces unnecessary tool calls and lowers the response latency of repeated queries.

[0014] Meanwhile, this invention pre-vectorizes the tool set and filters the tools based on a dual-path recall mechanism of semantic retrieval and BM25 retrieval, inputting only candidate tools with high matching degree into the large language model, avoiding loading all tool descriptions into the model context, which helps reduce context occupation and improve the calling efficiency and scalability in multi-tool scenarios.

[0015] Furthermore, this invention constructs the historical question-answer pair cache and the tool cache with retrieval methods adapted to each. The historical question-answer pair cache supports semantic retrieval and BM25 text retrieval, while the tool cache supports semantic retrieval and BM25 sparse retrieval, thereby improving retrieval adaptability for different types of cached data and making the method suitable for tool call acceleration in multi-domain AI agent systems.

Attached Figure Description

[0016] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0017] Figure 1 is a schematic flowchart of the method provided by the present invention.

Detailed Implementation

[0018] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort are within the scope of protection of the present invention.

[0019] This invention discloses a method for accelerating AI agent tool calls based on a two-level caching mechanism. As shown in Figure 1, it includes the following steps: First, each tool in the agent tool set is vectorized. The tool name and tool description of each tool are concatenated to form tool information text (in the format "tool name.tool description", denoted as T). A dense semantic vector and a BM25 sparse vector are generated from the tool information text, and the tool information text, dense semantic vector, and BM25 sparse vector are written into the tool cache database.

[0020] After receiving a user query, the system first performs semantic retrieval and BM25 text retrieval on the user query in the historical question-and-answer pair cache database to obtain the matching results of candidate historical question-and-answer pairs and determine the final matching score of the candidate historical question-and-answer pairs.

[0021] When the maximum matching score of a candidate historical question-and-answer pair is greater than a preset threshold, the answer in the corresponding historical question-and-answer pair is directly output; when the maximum matching score of a candidate historical question-and-answer pair is less than or equal to the preset threshold, semantic retrieval and BM25 sparse retrieval are performed on the user query in the tool cache database to obtain the matching results of the candidate tools, and a candidate tool set is formed based on the matching results.

[0022] Subsequently, the candidate tool set is input into the large language model, which selects the target tool from the candidate tool set and executes it. Based on the tool execution result, the final answer is generated and returned to the user.

[0023] Finally, the user query and the final answer are combined to form a new question-answer pair, and the new question-answer pair is written into the historical question-answer pair cache database to achieve a closed-loop update of the historical question-answer pair cache database.
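The steps in [0019] to [0023] can be sketched as the following control flow. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `TwoLevelCacheAgent`, its callable fields, and the threshold value 0.8 are hypothetical, and the retrieval, large language model, and cache-write components are passed in as plain functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class TwoLevelCacheAgent:
    # All fields are hypothetical stand-ins for the components in the text.
    search_qa: Callable[[str], Optional[Tuple[str, float]]]  # best (answer, score) or None
    recall_tools: Callable[[str], List[str]]                 # candidate tool names
    select_and_run: Callable[[str, List[str]], str]          # LLM picks a tool, runs it, answers
    write_back: Callable[[str, str], None]                   # stores the new question-answer pair
    threshold: float = 0.8                                   # assumed value, not from the source

    def answer(self, query: str) -> Tuple[str, str]:
        """Return (answer, path), where path names the branch that was taken."""
        # Level 1: try to reuse a historical answer.
        hit = self.search_qa(query)
        if hit is not None and hit[1] > self.threshold:
            return hit[0], "qa_cache"
        # Level 2: recall candidate tools and let the model choose among them.
        candidates = self.recall_tools(query)
        final_answer = self.select_and_run(query, candidates)
        # Closed-loop update of the historical question-answer pair cache.
        self.write_back(query, final_answer)
        return final_answer, "tool_call"
```

With an in-memory dictionary as the cache, the first call for a query goes through the tool path and the second call for the same query is served from the historical cache without invoking the model again.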

[0024] In one specific embodiment, it is assumed that the system contains n tools; for ease of explanation, n=50 in this embodiment. The system first performs vectorized representation processing on these 50 tools. Specifically, the tool name and tool description of each tool are concatenated to form tool information text (in the format "tool name.tool description", denoted as T). T is then encoded using the word embedding model Qwen3-embedding-0.6B to obtain a 1024-dimensional dense semantic vector. At the same time, the Jieba word segmentation tool (version 0.42.1) is used to segment T into a word sequence, and the corresponding BM25 sparse vector is generated based on the BM25 algorithm. Finally, T, the dense semantic vector, and the BM25 sparse vector are written into the text field (text), dense vector field (text_dense), and sparse vector field (text_sparse) of the Milvus vector database (version 2.6.0), respectively, to form the tool cache database.
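The sparse side of this step can be illustrated in plain Python. The sketch below builds the "tool name.tool description" text and computes per-term BM25 weights for each tool. It rests on several assumptions: a regex tokenizer stands in for Jieba, the parameters `k1=1.5` and `b=0.75` are common defaults rather than values from the source, the function names are hypothetical, and the dense embedding and the database write are omitted.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Stand-in for a word segmenter such as Jieba (assumption for this sketch).
    return re.findall(r"\w+", text.lower())


def tool_info_text(name: str, description: str) -> str:
    # "tool name.tool description" format, denoted T in the text.
    return f"{name}.{description}"


def bm25_sparse_vectors(docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[Dict[str, float]]:
    """Return one {term: BM25 weight} mapping per document."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {}
        for term, freq in tf.items():
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            tf_part = freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(toks) / avgdl))
            vec[term] = idf * tf_part
        vectors.append(vec)
    return vectors
```

Each resulting mapping plays the role of the sparse vector stored alongside T; a term that appears in only one tool description receives a higher weight than a term shared by many tools.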

[0025] The Milvus index configuration is as follows: A dense vector index is created for the dense vector field `text_dense`, with the metric type set to cosine similarity. In this embodiment, a FLAT CPU index is used; however, a GPU index, such as `GPU_IVF_FLAT`, can be used for scenarios requiring higher retrieval performance. A `SPARSE_INVERTED_INDEX` index is created for the sparse vector field `text_sparse` to support BM25-based sparse retrieval.
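A FLAT index performs exhaustive, exact search: the query vector is compared against every stored vector. The following sketch shows what that amounts to with the cosine similarity metric. It is a plain-Python illustration and does not use the Milvus API; the function names and the `limit` parameter are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def flat_search(query: Sequence[float], vectors: List[Sequence[float]], limit: int = 3) -> List[Tuple[int, float]]:
    """Score every stored vector against the query and return the best (index, score) pairs."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]
```

Exhaustive search is exact but its cost grows linearly with the number of stored vectors, which is acceptable for a tool set of the size used in this embodiment.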

[0026] Subsequently, when the AI agent system receives a user query q, it first searches the historical question-answer pair cache database, Redis (version 8.0). It uses the word embedding model Qwen3-embedding-0.6B to represent query q as a 1024-dimensional query dense vector, and then performs vector retrieval in Redis based on cosine similarity to obtain the first matching score of each candidate historical question-answer pair. At the same time, a BM25-based full-text search is performed on the historical question-answer pairs in Redis to obtain the second matching score of each candidate. The first matching score and the second matching score are then fused by weighting to obtain the final matching score of each candidate historical question-answer pair.

[0027] In this embodiment, the fusion weights and the threshold take preset values. The largest final matching score among the candidate historical question-answer pairs is compared with the preset threshold: when it is greater than the threshold, the system directly returns the answer in the corresponding historical question-answer pair; when it is less than or equal to the threshold, the system switches to the tool recall process.
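One plausible reading of the weighted fusion and threshold test is a linear combination of the two scores followed by a strict comparison. The sketch below assumes exactly that. The weights 0.7 and 0.3, the threshold 0.8, and the function names are illustrative assumptions, since the source values are not available here; it also assumes both scores have already been scaled to a comparable range.

```python
from typing import Dict, Optional


def fuse(dense: Dict[str, float], bm25: Dict[str, float],
         w_dense: float, w_bm25: float) -> Dict[str, float]:
    """Weighted sum of two score maps; a candidate missing from one path scores 0 there."""
    ids = set(dense) | set(bm25)
    return {i: w_dense * dense.get(i, 0.0) + w_bm25 * bm25.get(i, 0.0) for i in ids}


def lookup_answer(dense: Dict[str, float], bm25: Dict[str, float],
                  answers: Dict[str, str],
                  w_dense: float = 0.7, w_bm25: float = 0.3,
                  threshold: float = 0.8) -> Optional[str]:
    """Return the cached answer of the best candidate, or None to signal tool recall."""
    fused = fuse(dense, bm25, w_dense, w_bm25)
    if not fused:
        return None
    best_id = max(fused, key=fused.get)
    # Strictly greater than the threshold returns the cached answer;
    # less than or equal falls through to the tool recall process.
    return answers[best_id] if fused[best_id] > threshold else None
```

Returning `None` rather than raising keeps the cache miss an ordinary outcome that the caller handles by moving on to tool recall.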

[0028] In the tool recall process, the system performs dense vector semantic retrieval and BM25 sparse retrieval on the user query in the Milvus tool cache database. Specifically, the user query is first normalized and a query dense vector is generated; then, based on the query dense vector, dense vector semantic retrieval is performed in the tool cache database to obtain the third matching score of each candidate tool. Based on the normalized user query, BM25 sparse retrieval is performed in the tool cache database to obtain the fourth matching score of each candidate tool. The third matching score and the fourth matching score are then fused by weighting to obtain the final matching score of each candidate tool.

[0029] In this embodiment, the fusion weights take preset values. The system sorts the candidate tools from highest to lowest by final matching score and selects the top_k highest-scoring tools as the candidate tool set. In this embodiment, top_k=5. To further improve recall accuracy, the system can also apply keyword filtering to the candidate tool results.
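The tool recall step can be sketched as score fusion followed by a top_k cut. The example below is a minimal illustration under assumptions: raw BM25 scores are min-max scaled before fusion so that they are comparable to cosine scores (the source does not say how the two scales are reconciled), the equal weights are placeholders, and the function names are hypothetical. Only the default `top_k=5` comes from the embodiment.

```python
import heapq
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to [0, 1]; if all scores are equal, every entry becomes 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {name: 0.0 for name in scores}
    return {name: (value - lo) / (hi - lo) for name, value in scores.items()}


def recall_tools(dense: Dict[str, float], bm25_raw: Dict[str, float],
                 w_dense: float = 0.5, w_bm25: float = 0.5,
                 top_k: int = 5) -> List[Tuple[str, float]]:
    """Fuse dense and BM25 scores per tool and keep the top_k highest-scoring tools."""
    bm25 = min_max(bm25_raw)
    names = set(dense) | set(bm25)
    fused = {n: w_dense * dense.get(n, 0.0) + w_bm25 * bm25.get(n, 0.0) for n in names}
    # nlargest returns the entries sorted from highest to lowest fused score.
    return heapq.nlargest(top_k, fused.items(), key=lambda item: item[1])
```

Only the names in the returned list, together with their descriptions, would then be placed in the model context, which is what keeps the prompt short when the full tool set is large.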

[0030] Next, the large language model receives the above set of candidate tools and autonomously selects the most suitable target tool to call and execute; the execution result of the tool is processed by the large language model to generate and output the final answer.

[0031] Finally, the system combines the user query and the final answer into a new question-answer pair in the format "query: answer". In a background daemon process, the new question-answer pair is first written to the Redis historical question-answer pair cache database in text format, and a text index for BM25 text retrieval is created for it; at the same time, the word embedding model represents the new question-answer pair as a 1024-dimensional dense semantic vector, which is written to a Redis vector field to support subsequent semantic retrieval of historical question-answer pairs.
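Writing the new pair in the background keeps the cache update off the response path. The sketch below illustrates that idea with a daemon thread and a queue, which stand in for the background daemon process described above. The class `BackgroundCacheWriter` and its `write` callable are hypothetical; a real deployment would pass a function that stores the text and its embedding in the cache database.

```python
import queue
import threading
from typing import Callable


def format_qa_pair(query: str, answer: str) -> str:
    # "query: answer" format used for the new question-answer pair.
    return f"{query}: {answer}"


class BackgroundCacheWriter:
    """Queues new question-answer pairs and writes them from a daemon thread."""

    def __init__(self, write: Callable[[str], None]) -> None:
        self._write = write  # hypothetical storage function supplied by the caller
        self._tasks: "queue.Queue[str]" = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, query: str, answer: str) -> None:
        # Returns immediately; the write happens on the worker thread.
        self._tasks.put(format_qa_pair(query, answer))

    def _run(self) -> None:
        while True:
            pair = self._tasks.get()
            try:
                self._write(pair)
            finally:
                self._tasks.task_done()

    def flush(self) -> None:
        # Block until every queued pair has been written.
        self._tasks.join()
```

The `flush` method is mainly useful in tests or at shutdown, when the caller needs to know that all pending pairs have reached the cache.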

[0032] In this embodiment, Redis establishes an HNSW vector index for the dense semantic vectors. The distance metric is cosine similarity; the maximum number of neighbor connections per graph node, M, can be set to 16; the search width during index construction, EF_CONSTRUCTION, can be set to 200; and the search width during retrieval, EF_RUNTIME, can be set to 20. In this way, the historical question-answer pair cache database has both semantic retrieval and BM25 text retrieval capabilities, and works together with the tool cache to realize the two-level cache invocation mechanism.
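The HNSW parameters above map onto the arguments of the Redis `FT.CREATE` command. The sketch below only assembles the argument list and does not connect to a server. The index name `qa_idx`, key prefix `qa:`, field names `text` and `embedding`, and the `FLOAT32` element type are assumptions for illustration; the dimension, metric, M, EF_CONSTRUCTION, and EF_RUNTIME values are those of the embodiment.

```python
from typing import List


def hnsw_index_command(index: str = "qa_idx", prefix: str = "qa:",
                       dim: int = 1024, m: int = 16,
                       ef_construction: int = 200, ef_runtime: int = 20) -> List[str]:
    """Build the FT.CREATE arguments for a text field plus an HNSW vector field."""
    vector_args = [
        "TYPE", "FLOAT32",            # element type is an assumption
        "DIM", str(dim),
        "DISTANCE_METRIC", "COSINE",
        "M", str(m),
        "EF_CONSTRUCTION", str(ef_construction),
        "EF_RUNTIME", str(ef_runtime),
    ]
    return [
        "FT.CREATE", index, "ON", "HASH", "PREFIX", "1", prefix,
        "SCHEMA",
        "text", "TEXT",               # full-text field for keyword retrieval
        # The number after HNSW is the count of attribute arguments that follow.
        "embedding", "VECTOR", "HNSW", str(len(vector_args)), *vector_args,
    ]
```

The resulting list can be passed to a Redis client's generic command-execution call; keeping the builder separate from the connection makes the configuration easy to inspect and test.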

[0033] The various embodiments in this specification are described in a progressive manner, and the same or similar parts between the various embodiments can be referred to each other. Regarding the methods disclosed in this embodiment, since they correspond to the methods disclosed in other parts of this invention, the description is relatively brief, and relevant details can be found in the foregoing description.

[0034] The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may also be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not limited to the embodiments shown herein, but should be accorded the widest scope of protection consistent with the principles and novel features disclosed herein.

Claims

1. A method for accelerating AI agent tool calls based on a two-level caching mechanism, characterized in that it includes:
performing vectorized representation processing on each tool in the agent tool set, in which the tool name and tool description of each tool are combined to form tool information text, a dense semantic vector and a BM25 sparse vector are generated from the tool information text, and the tool information text, dense semantic vector, and BM25 sparse vector are written into the tool cache database;
receiving a user query, and performing semantic retrieval and BM25 text retrieval on the user query in the historical question-answer pair cache database to obtain matching results of candidate historical question-answer pairs;
when the maximum matching score of the candidate historical question-answer pairs is greater than a preset threshold, directly outputting the answer in the corresponding historical question-answer pair;
when the maximum matching score of the candidate historical question-answer pairs is less than or equal to the preset threshold, performing semantic retrieval and BM25 sparse retrieval on the user query in the tool cache database to obtain matching results of candidate tools, and forming a candidate tool set based on those matching results;
inputting the candidate tool set into a large language model, which selects and executes the target tool, then generating the final answer based on the tool's execution result and returning it to the user.

2. The method for accelerating AI agent tool invocation based on a two-level caching mechanism according to claim 1, characterized in that it further includes:
combining the user query and the final answer to form a new question-answer pair, and writing the new question-answer pair into the historical question-answer pair cache database to achieve a closed-loop update of that database;
specifically, the new question-answer pair is vectorized and encoded to obtain a dense semantic vector for semantic retrieval, and a text index for BM25 text retrieval is established for the new question-answer pair.

3. The method for accelerating AI agent tool calls based on a two-level caching mechanism according to claim 1, characterized in that the vectorized representation processing of each tool in the agent tool set includes:
obtaining the tool name and tool description of each tool;
combining the tool name and tool description to form the tool information text;
encoding the tool information text with a word embedding model to obtain a dense semantic vector;
segmenting the tool information text with a word segmentation tool and generating the corresponding BM25 sparse vector based on the BM25 algorithm;
writing the tool information text, the dense semantic vector, and the BM25 sparse vector into the text field, dense vector field, and sparse vector field of the tool cache database, respectively.

4. The method for accelerating AI agent tool calls based on a two-level caching mechanism according to claim 3, characterized in that: The tool cache database is a vector database. A dense vector index is established for the dense vector field, and the metric type of the dense vector index is cosine similarity. A sparse inverted index is established for the sparse vector field to support sparse retrieval based on BM25. The historical question-and-answer pair cache database establishes a vector index for the dense semantic vectors of the question-and-answer pairs to support semantic retrieval of historical question-and-answer pairs.

5. The method for accelerating AI agent tool invocation based on a two-level caching mechanism according to claim 1, characterized in that the step of performing semantic retrieval and BM25 text retrieval on the user query in the historical question-answer pair cache database to obtain the matching results of candidate historical question-answer pairs includes:
vectorizing and encoding the user query to obtain a query dense vector;
performing vector retrieval in the historical question-answer pair cache database based on the query dense vector to obtain the first matching score of each candidate historical question-answer pair;
performing BM25 text retrieval in the historical question-answer pair cache database based on the user query to obtain the second matching score of each candidate historical question-answer pair;
fusing the first matching score and the second matching score by weighting to obtain the final matching score of each candidate historical question-answer pair, and taking the candidate with the largest final matching score as the matching result.

6. The method for accelerating AI agent tool calls based on a two-level caching mechanism according to claim 5, characterized in that the step of performing semantic retrieval and BM25 sparse retrieval on the user query in the tool cache database to obtain matching results of candidate tools, and forming a candidate tool set based on those matching results, includes:
normalizing the user query;
performing dense vector semantic retrieval in the tool cache database based on the query dense vector to obtain the third matching score of each candidate tool;
performing BM25 sparse retrieval in the tool cache database based on the normalized user query to obtain the fourth matching score of each candidate tool;
fusing the third matching score and the fourth matching score by weighting to obtain the final matching score of each candidate tool;
sorting the candidate tools from highest to lowest by final matching score, and selecting the top_k highest-scoring candidate tools as the matching results, which form the candidate tool set.