A tool intelligent agent system
By combining the analysis, planning, execution, and evaluation modules of the tool intelligent agent system with semantic parsing and dynamic updating of capability boundaries, the invention addresses the difficulty of characterizing differences in tool capabilities during multi-tool collaboration, thereby improving decision-making accuracy and execution robustness in visual content generation tasks.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING UNIV OF POSTS & TELECOMM
- Filing Date
- 2026-02-25
- Publication Date
- 2026-06-30
AI Technical Summary
Existing intelligent agent systems lack accurate modeling of the differences in tool capabilities in multi-tool collaboration, leading to task execution errors. This is especially true in the field of visual content generation, where it is difficult to accurately characterize the actual performance boundaries of tools, affecting decision-making accuracy and execution robustness.
The proposed tool intelligent agent system includes an analysis module for semantic parsing and task summarization, a planning module for generating subtasks and judging task completion, an execution module for capability preference analysis and matching the best-performing tool, and an evaluation module for multi-dimensional quantitative evaluation. Together they form a closed decision-making loop and dynamically update the capability boundary matrix to improve the accuracy of tool selection.
It improves the accuracy of tool selection in complex tasks, reduces the error rate, enhances the robustness and adaptability of decision-making, optimizes the decision-making process through execution feedback, and improves task completion.
Description
Technical Field
[0001] This invention relates to the field of intelligent agent technology, and more specifically to a tool intelligent agent system.
Background Technology
[0002] With the rise of large-scale models and agent technologies, more and more scholars are applying agent technology to various fields. Agents are capable of autonomous reasoning, decision-making, and execution, making them naturally suited to handling diverse tasks and dynamic needs. Unlike traditional fixed-rule generation models, agents, through continuous interaction with users, can understand complex requirements and adaptively adjust generation strategies. The introduction of agents means that the generation process is no longer limited to a single dataset and rules, but can be flexibly adjusted according to different contexts and needs, thereby achieving output results that better meet user expectations.
[0003] Current intelligent agent technology typically uses large language models (LLM / MLLM) as its inference engine. It receives user input, analyzes control intentions, integrates task information through a pre-defined inference process, and transmits instructions to downstream tools to progressively advance task execution.
[0004] However, current research shows that within an agent framework, multi-tool collaboration among agents cannot successfully complete complex tasks solely through descriptions of each tool's capabilities. In agent systems, when a large number of tools participate in a task, simple textual descriptions often fail to accurately reflect their actual capabilities, especially when different tools have similar functions. For example, although text-to-image generation tools based on different generative models (such as SD3 and FLUX) may have highly similar textual descriptions, such as "text-to-image generation tools," in actual task execution, even models with identical functions may produce significantly different results depending on their generative preferences. A lack of precise capability descriptions can lead to incorrect tool selection during task execution, thus affecting the generation results. Therefore, accurately modeling the capability differences between tools and effectively distinguishing their performance in task execution becomes crucial for improving task execution efficiency and task completion rates.
[0005] In research based on large language models, models such as GPT and Claude excel at understanding and planning, but they still fall short in terms of specialized division of labor and precise execution. Therefore, researchers have developed multi-agent collaborative mechanisms, mainly including the following two paradigms:
[0006] One type is the engineering paradigm based on fixed processes, which ensures the reliability and output quality of task execution through preset roles and standardized operations. MetaGPT assigns fixed roles such as product manager and engineer to agents and forces the production of standardized intermediate artifacts such as PRDs and design documents, forming a structured pipeline. For example, the invention patent application with publication number CN120256065A, entitled "Agent-Based Cloud Task Scheduling Method, Device, Equipment and Medium", achieves full-process automation of cloud resource scheduling through a fixed call sequence of "description generation → topology generation → load prediction → scheme generation".
[0007] Another type is the interactive paradigm based on dynamic dialogue. It does not have a fixed process, but rather generates solutions dynamically through dialogue between intelligent agents, which is suitable for exploratory tasks. Represented by AutoGen, it achieves a closed loop of automatic code execution and debugging by building dialogue-enabled intelligent agent pairs (such as AssistantAgent and UserProxyAgent).
[0008] Furthermore, in response to the rapid increase in the number of tools, patent application CN119623478A, entitled "A Method and System for Semantic Parsing of Tool Calls in Low-Resource Data Environments," generates training data for tool logical expressions using context-free grammars and employs a self-training mechanism to iteratively optimize the semantic parsing model, enabling the model to accurately understand user semantics in low-resource environments. Patent application CN120256065A, entitled "A Method, Device, Equipment, and Medium for Cloud Task Scheduling Based on Intelligent Agents," proposes dynamically prioritizing candidate tools based on semantic analysis and relevance indicators, as well as historical data such as call frequency and success rate, to improve call accuracy and resource efficiency. Patent application CN118296337A, entitled "A Method for Evaluating the Accuracy of Tool Calls in Large Models," designs specific evaluation indicators for different application scenarios and calculates the similarity between predicted results and real tool labels to generate tool call evaluation information.
[0009] While the aforementioned methods enhance task execution capabilities through division of labor, processes, or dynamic sequencing, current research primarily focuses on agent task planning algorithms and tool scheduling strategies, emphasizing the rationality of planning logic. These works largely rely on the ideal assumption that "tool invocation will inevitably succeed," lacking a systematic evaluation of the actual performance of tools and neglecting the uncertainty of tool selection and its impact on the accuracy of the final decision. Furthermore, existing systems typically use generic textual descriptions to define tool capabilities, making it difficult to accurately characterize their true performance boundaries. This problem is particularly pronounced in the field of visual content generation. Although existing systems (such as MetaGPT and GenArtist) have improved task execution through task decomposition and multi-model scheduling, their tool descriptions remain relatively general, failing to clearly distinguish the specializations and applicable scenarios of different tools. For example, in image generation tasks, common text-to-image tools are described as "capable of generating images that conform to text semantics." This description fails to reflect the performance differences between different models and is insufficient to support agents in accurately matching tools in complex tasks, thus posing risks to overall planning and execution.
[0010] Therefore, there is an urgent need to build an intelligent agent tool invocation mechanism that can accurately characterize the actual capability boundaries of tools, integrate execution feedback, and support dynamic evaluation, so as to improve decision-making accuracy and execution robustness under complex tasks.
Summary of the Invention
[0011] In view of the above problems, the present invention proposes a tool intelligent agent system to overcome or at least partially solve the above problems.
[0012] To achieve the above objectives, the present invention adopts the following technical solution:
[0013] This invention provides a tool intelligent agent system, comprising:
[0014] The analysis module is used to parse the task requirements input by the user and generate a structured task summary and evaluation objectives;
[0015] The planning module is used to analyze the task summary and generate the first subtask to be executed; and to determine, based on the execution score of the current subtask, whether the overall task is complete, and to generate subsequent subtasks if it is not.
[0016] The execution module is used to analyze the capability preferences of the current subtask, match the best-performing tool from the tool library based on the preference weights to execute the current subtask, and obtain the execution results;
[0017] The evaluation module is used to perform multi-dimensional quantitative evaluation of the execution results based on the evaluation objectives, and generate an execution score for the current subtask and a textual evaluation of the results.
[0018] Furthermore, the analysis module includes:
[0019] The semantic parsing submodule is used to perform semantic parsing on the text commands input by the user; and when the user input includes an image, it extracts the semantic information of the image and performs joint semantic parsing on the image semantic information and the text commands.
[0020] The task summary generation submodule is used to generate a structured task summary based on the semantic parsing results;
[0021] The evaluation target generation submodule is used to generate corresponding evaluation targets based on the task summary using preset evaluation dimensions.
[0022] Furthermore, the planning module includes:
[0023] The initial subtask generation submodule is used to receive the task summary and the tool library basic capability description document, and selectively combine the image semantic information to decompose the task summary to generate the first subtask to be executed.
[0024] Furthermore, the planning module also includes:
[0025] The judgment submodule is used to compare the execution score of the current subtask with a preset score threshold. If the execution score of the current subtask is less than the preset score threshold and the current iteration round is less than the preset maximum number of iterations, then the total task is determined to be incomplete.
[0026] The subsequent subtask generation submodule is used, when the overall task is not complete, to input the current historical trajectory information, the task summary, and the tool library basic capability description document, optionally combined with the image semantic information, into the trained large language model to predict and generate the next subtask.
[0027] The current historical trajectory information includes the sub-tasks that have been executed in the current round and their corresponding execution scores.
[0028] Furthermore, the loss function of the large language model during training is expressed as:
[0029]
[0030] where L(θ) denotes the loss function during training; θ denotes the model parameters; E_{t~u} denotes the expectation; B denotes the tool library's basic capability description document; σ denotes the sigmoid function; p_ref denotes the reference probability of p_θ predicted by the fixed initial model; h_{t-1} denotes the historical trajectory information of round t-1; and s_i denotes the semantic information of the image. The remaining symbols of the formula denote, respectively, the task summary, the optimal sample of round t, the worst sample of round t, the regularization hyperparameter, the planning module, and the total number of iterations of the task trajectory.
[0031] Furthermore, the execution module includes:
[0032] The capability preference analysis submodule is used to assign performance-related preference weights to the current subtask based on preset capability boundary dimensions and the tool library's basic capability description document.
[0033] The matching submodule is used to perform a dot product operation between the preference weights and the capability boundary matrix composed of the multi-dimensional capability boundary values of each tool in the tool library, so as to obtain the performance matching degree of each tool when executing the current subtask.
[0034] The tool selection submodule is used to sort the tool index according to the performance matching degree, obtain the theoretical execution effect ranking, and select the tool with the highest performance matching degree as the best performance tool to execute the current subtask.
[0035] The task execution submodule is used to call the best performance tool to execute the current subtask and obtain the execution result.
[0036] Furthermore, the execution module also includes:
[0037] The capability boundary matrix update submodule is used to select from the tool library the m tools with the highest performance matching degree and to randomly select n additional tools from the remaining tools; to have the selected m + n tools execute the same subtask and compare the execution results to obtain an actual execution performance ranking; and to update the capability boundary matrix based on the difference between the actual execution performance ranking and the theoretical execution performance ranking.
[0038] Furthermore, the task execution submodule includes:
[0039] The input condition confirmation unit is used to obtain the usage document information of the selected tool from the basic capability description document of the tool library; and determine the input conditions required for the selected tool to perform the current subtask based on the usage document information.
[0040] The execution unit is used to call the selected tool to execute the current subtask according to the input conditions and obtain a structured execution result.
[0041] Furthermore, in the evaluation module, the execution score corresponding to the current subtask is represented as follows:
[0042]
[0043] where e_t denotes the execution score of the round-t subtask, obtained by weighting the scores of the individual dimensions; L denotes the total number of dimensions; o_t denotes the execution result of round t; and g_i denotes the evaluation objective of the i-th dimension. The remaining symbols of the formula denote the evaluation weight of the i-th dimension and the evaluation module.
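The weighted scoring described in this paragraph can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `execution_score`, `weights`, and `score_dimension` are hypothetical, and the toy per-dimension scorer merely stands in for the evaluation module, which in the described system is driven by a multimodal large model.

```python
from typing import Callable, Sequence


def execution_score(
    result: str,
    objectives: Sequence[str],
    weights: Sequence[float],
    score_dimension: Callable[[str, str], float],
) -> float:
    """Weighted sum of per-dimension scores for one round's result o_t."""
    if len(objectives) != len(weights):
        raise ValueError("one weight is needed per evaluation objective")
    return sum(w * score_dimension(result, g) for w, g in zip(weights, objectives))


# Toy stand-in for the evaluation module (assumption): a dimension scores 1.0
# when the objective text occurs in the result description, otherwise 0.0.
def toy_scorer(result: str, objective: str) -> float:
    return 1.0 if objective in result else 0.0


score = execution_score(
    result="two red apples on a table",
    objectives=["two", "apples", "green"],
    weights=[0.5, 0.25, 0.25],
    score_dimension=toy_scorer,
)
```

In this toy call the first two objectives are met and the third is not, so only the first two weights contribute to the score.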
[0044] Furthermore, the tool intelligent agent system is applied to downstream tasks; the downstream tasks include at least one of image editing tasks, image generation tasks, and visual understanding tasks.
[0045] As can be seen from the above technical solution, compared with the prior art, the present invention discloses a tool intelligent agent system, which has the following beneficial effects:
[0046] This invention realizes a tool selection process based on task capability preferences and tool capability boundaries, which fundamentally reduces tool selection and task execution errors caused by misjudgment of tool capabilities, thereby improving decision-making accuracy and execution robustness under complex tasks.
[0047] This invention supports dynamic updating: the system self-optimizes as experience in using the tools accumulates and adapts better, resolving the mismatch between predefined tool capability boundaries on the one hand and actual task capability requirements and actual tool capabilities on the other. It also creates a synergy between planning and execution: feeding the execution results back to the planning stage helps build a closed decision-making loop of continuous improvement.
Attached Figure Description
[0048] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.
[0049] Figure 1 is a schematic diagram of the inference framework of the tool intelligent agent system provided in an embodiment of the present invention.
[0050] Figure 2 is a schematic diagram of the inference framework in the image generation task provided in an embodiment of the present invention.
[0051] Figure 3 is a schematic diagram illustrating task decomposition and execution in the image generation task provided in an embodiment of the present invention.
[0052] Figure 4 is a schematic diagram of the adaptive preference update mechanism in the image generation task provided in an embodiment of the present invention.
[0053] Figure 5 is a schematic diagram of the planning and optimization strategy for capability alignment in the image generation task provided in an embodiment of the present invention.
[0054] Figure 6 is a schematic diagram showing a visual comparison of the present invention with other existing methods in image generation and image editing tasks, as provided in an embodiment of the present invention.
[0055] Figure 7 is a schematic diagram illustrating the visualization results of multiple examples of image editing tasks provided in an embodiment of the present invention.
[0056] Figure 8 is a schematic diagram illustrating the visualization results of multiple examples of image generation tasks provided in an embodiment of the present invention.
[0057] Figure 9 is a schematic diagram illustrating the visualization results of multiple examples of customized image generation provided in an embodiment of the present invention.
[0058] Figure 10 is a schematic diagram of the reasoning process for a visual understanding task provided in an embodiment of the present invention.
[0059] Figure 11 is a schematic diagram of the time consumption comparison provided in an embodiment of the present invention.
[0060] Figure 12 is a schematic diagram of the word consumption comparison provided in an embodiment of the present invention.
Detailed Implementation
[0061] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0062] This invention discloses a tool intelligent agent system. As shown in Figure 1, it includes a planning module and an execution module that use a large language model as the inference engine, and an analysis module and an evaluation module that use a multimodal large model as the inference engine; wherein:
[0063] The analysis module is used to parse the task requirements input by the user and generate a structured task summary and evaluation objectives;
[0064] The planning module is used to analyze the task summary and generate the first subtask to be executed; and to determine, based on the execution score of the current subtask, whether the overall task is complete, and to generate subsequent subtasks if it is not.
[0065] The execution module is used to analyze the capability preferences of the current subtask, match the best-performing tool from the tool library based on the preference weights to execute the current subtask, and obtain the execution results;
[0066] The evaluation module is used to perform multi-dimensional quantitative evaluation of the execution results based on the evaluation objectives, and generate the execution score of the current subtask and textual result evaluation.
[0067] The planning module, execution module, and evaluation module are connected in sequence to form a closed loop of iterative execution based on execution scores.
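The closed loop described above can be sketched as follows. This is only an illustrative skeleton under stated assumptions: `run_closed_loop`, `LoopResult`, and the three callables are hypothetical names, the toy modules replace the model-driven planning, execution, and evaluation modules, and the stopping rule follows the score threshold and maximum iteration count used by the judgment submodule.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

History = List[Tuple[str, float]]  # (subtask, execution score) pairs


@dataclass
class LoopResult:
    best_output: Optional[str]
    best_score: float
    history: History = field(default_factory=list)


def run_closed_loop(
    plan: Callable[[History], str],
    execute: Callable[[str], str],
    evaluate: Callable[[str], float],
    score_threshold: float,
    max_rounds: int,
) -> LoopResult:
    """Iterate plan -> execute -> evaluate until the score reaches the
    threshold or the round budget is used up; keep the best output."""
    history: History = []
    best_output, best_score = None, float("-inf")
    for _ in range(max_rounds):
        subtask = plan(history)        # planning module
        output = execute(subtask)      # execution module
        score = evaluate(output)       # evaluation module
        history.append((subtask, score))
        if score > best_score:
            best_output, best_score = output, score
        if score >= score_threshold:   # overall task judged complete
            break
    return LoopResult(best_output, best_score, history)


# Toy modules (assumption): every round yields a slightly better score.
toy_scores = iter([0.4, 0.7, 0.9, 0.95])
result = run_closed_loop(
    plan=lambda h: f"subtask-{len(h) + 1}",
    execute=lambda u: f"output of {u}",
    evaluate=lambda o: next(toy_scores),
    score_threshold=0.85,
    max_rounds=5,
)
```

Passing the modules as callables keeps the loop independent of any particular model or tool library, which is a design choice of this sketch rather than something the source specifies.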
[0068] Next, each of the above modules will be explained in detail.
[0069] 1. Analysis Module:
[0070] This analysis module includes a semantic parsing submodule, a task summary generation submodule, and an evaluation target generation submodule; wherein:
[0071] (1) Semantic parsing submodule, which is used to perform semantic parsing directly on the text instruction when the user input is a plain text instruction; and, when the user inputs multimodal information combining text and image, to analyze the image with a visual language model to extract the image semantic information s_i, and to perform joint semantic parsing on the image semantic information s_i and the text instruction;
[0072] (2) Task summary generation submodule, which is used to extract the core user intent through logical reasoning based on the semantic parsing results and to generate a structured task summary;
[0073] (3) Evaluation target generation submodule, which is used to generate the corresponding evaluation targets from the task summary according to preset evaluation dimensions; taking image generation or image editing tasks as an example, the preset evaluation dimensions may include the number of objects, object types, object attributes, spatial relationships, and global image semantics.
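One way to picture the output of the analysis module is as a small structured record with one evaluation target per preset dimension. The sketch below is an assumption for illustration only: `AnalysisOutput`, `build_evaluation_targets`, and the `describe` callable (a stand-in for the multimodal large model) are hypothetical and not taken from the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Preset evaluation dimensions named in the text for image generation/editing.
EVALUATION_DIMENSIONS = [
    "object count",
    "object types",
    "object attributes",
    "spatial relationships",
    "global image semantics",
]


@dataclass
class AnalysisOutput:
    task_summary: str                   # structured task summary
    image_semantics: Optional[str]      # s_i, None for text-only input
    evaluation_targets: Dict[str, str]  # one target per preset dimension


def build_evaluation_targets(
    task_summary: str, describe: Callable[[str, str], str]
) -> Dict[str, str]:
    """Ask `describe` for one evaluation target per preset dimension."""
    return {dim: describe(task_summary, dim) for dim in EVALUATION_DIMENSIONS}


summary = "Generate two red apples on a wooden table"
targets = build_evaluation_targets(
    summary,
    describe=lambda s, dim: f"[{dim}] expected for: {s}",
)
analysis = AnalysisOutput(summary, None, targets)
```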
[0074] 2. Planning Module:
[0075] The planning module includes an initial subtask generation submodule, a judgment submodule, and a subsequent subtask generation submodule; wherein:
[0076] (1) Initial subtask generation submodule, which is used to receive the task summary and the tool library basic capability description document B, optionally combined with the image semantic information s_i, and to further analyze the task summary to generate the first subtask to be executed, u_1. Specifically, a basic text prompt template is set for the initial subtask generation submodule and is used by the large language model in the planning module to reason about subtask generation. The received task summary, the tool library basic capability description document B, and the image semantic information s_i (if present) are filled into the positions of the corresponding variables in the basic text prompt template, yielding a text prompt for generating the first subtask. This text prompt is input into the large language model, which reasons about and outputs the first subtask u_1.
[0077] The tool library basic capability description document B above is a textual description of the basic capabilities of the tool library, for example, "The image generation function can generate images from text input and can also generate images under the joint control of a reference image and text" or "The image editing function can add, delete, or replace objects."
[0078] Here, optionally combining the image semantic information s_i means that when the user's input contains an image, the initial subtask generation submodule needs to incorporate the image semantic information s_i. The first subtask to be executed, u_1, is then represented as:
[0079]
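The template-filling step can be illustrated with Python's standard `string.Template`. The template wording, the variable names, and the function `build_initial_prompt` are assumptions made for this sketch; the source does not disclose the actual prompt text. The image section is included only when image semantic information is available.

```python
from string import Template
from typing import Optional

# Hypothetical template; the actual wording is not given in the source.
INITIAL_SUBTASK_TEMPLATE = Template(
    "Task summary:\n${task_summary}\n\n"
    "Tool library capabilities:\n${capabilities}\n\n"
    "${image_section}"
    "Propose the first subtask to execute."
)


def build_initial_prompt(
    task_summary: str,
    capabilities: str,
    image_semantics: Optional[str] = None,
) -> str:
    """Fill the template; the image section appears only when s_i exists."""
    image_section = (
        f"Image semantics:\n{image_semantics}\n\n" if image_semantics else ""
    )
    return INITIAL_SUBTASK_TEMPLATE.substitute(
        task_summary=task_summary,
        capabilities=capabilities,
        image_section=image_section,
    )


text_only = build_initial_prompt("Draw a cat", "Image generation from text")
with_image = build_initial_prompt("Add a hat", "Image editing", "a cat on a sofa")
```

The resulting string would then be passed to the large language model of the planning module, which is outside the scope of this sketch.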
[0080] (2) Judgment submodule, which is used to compare the execution score e_t of the current subtask with a preset score threshold; if the execution score of the current subtask is less than the preset score threshold and the current iteration round is less than the preset maximum number of iterations, the overall task is determined to be incomplete.
[0081] (3) Subsequent subtask generation submodule, which is used, when the overall task is not complete, to input the current historical trajectory information h_t, the task summary, and the tool library basic capability description document B, optionally combined with the image semantic information s_i, into the trained large language model to predict and generate the next subtask u_{t+1}, represented as:
[0082]
[0083] Here, optionally combining the image semantic information s_i means that when the user's input contains an image, the subsequent subtask generation submodule needs to incorporate the image semantic information s_i;
[0084] The above historical trajectory information h_t includes the subtasks executed up to the current round and their corresponding execution scores, represented as h_t = {(u_1, e_1), ..., (u_t, e_t)}.
[0085] The execution module and the evaluation module then process the next subtask u_{t+1} generated by the planning module, correcting or further advancing the historical subtasks, until the execution score reaches the preset threshold or the number of executed steps reaches the maximum number of steps; the task execution result with the best evaluation score, o_best, is then output. The optimal task execution result o_best is the final visual content generated from the user input according to the user's requirements.
[0086] The training steps for the aforementioned large language model include:
[0087] 1) In each iteration round t, the planning module generates k candidate subtasks;
[0088] 2) The execution module and the evaluation module process each of the k candidate subtasks in turn, yielding an execution result and an execution score for each candidate subtask of round t;
[0089] 3) The candidate subtask with the highest execution score is selected as the optimal sample and the candidate subtask with the lowest execution score as the worst sample, and the large language model is trained with them; the loss function during training is expressed as:
[0090]
[0091] where L(θ) denotes the loss function during training; θ denotes the model parameters to be updated; E_{t~u} denotes the expectation; B denotes the tool library's basic capability description document; σ denotes the sigmoid function; p_ref denotes the reference probability of p_θ predicted by the fixed initial model; h_{t-1} denotes the historical trajectory information of round t-1; and s_i denotes the semantic information of the image. The remaining symbols of the formula denote, respectively, the task summary, the optimal sample of round t, the worst sample of round t, the regularization hyperparameter, the planning module, and the total number of iterations of the task trajectory.
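The ingredients listed above (a sigmoid, a regularization hyperparameter, a fixed reference model, and a best/worst sample pair per round) match the general shape of a Direct Preference Optimization (DPO) style objective. Since the formula itself is not available here, the following sketch is an assumption about that shape rather than the patent's exact loss. The names `select_preference_pair`, `dpo_pair_loss`, `trajectory_loss`, and `beta` are hypothetical, log-probabilities are passed in as plain numbers instead of being computed by a model, and the expectation is approximated by averaging over rounds.

```python
import math
from typing import Dict, List, Tuple


def select_preference_pair(scored: Dict[str, float]) -> Tuple[str, str]:
    """Return (optimal, worst) candidate subtasks by execution score."""
    best = max(scored, key=scored.get)
    worst = min(scored, key=scored.get)
    return best, worst


def dpo_pair_loss(
    logp_best: float,
    logp_worst: float,
    ref_logp_best: float,
    ref_logp_worst: float,
    beta: float,
) -> float:
    """-log sigmoid(beta * [(log-ratio of best) - (log-ratio of worst)])."""
    margin = beta * ((logp_best - ref_logp_best) - (logp_worst - ref_logp_worst))
    # Numerically stable form of -log(sigmoid(margin)) = log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def trajectory_loss(
    pairs: List[Tuple[float, float, float, float]], beta: float
) -> float:
    """Average the pair loss over the rounds of one task trajectory."""
    return sum(dpo_pair_loss(*p, beta=beta) for p in pairs) / len(pairs)


best, worst = select_preference_pair({"crop": 0.2, "inpaint": 0.9, "recolor": 0.5})
neutral = dpo_pair_loss(-1.0, -1.0, -1.0, -1.0, beta=0.1)   # policy equals reference
improved = dpo_pair_loss(-0.5, -2.0, -1.0, -1.0, beta=0.1)  # policy favors the best sample
```

When the trained model assigns the same probabilities as the reference model, the margin is zero and the pair loss equals log 2; shifting probability toward the optimal sample lowers it.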
[0092] 3. Execution Module:
[0093] The execution module includes a capability preference analysis submodule, a matching submodule, a tool selection submodule, a task execution submodule, and a capability boundary matrix update submodule; wherein:
[0094] (1) Capability preference analysis submodule, which is used to analyze the key performance dimensions to focus on when executing the current subtask u_t; specifically, it assigns performance-related preference weights W_task to the current subtask based on the preset capability boundary dimensions D and the tool library basic capability description document B. A basic text prompt template is set for the capability preference analysis submodule and is used by the large language model in the execution module to reason about generating the preference weights W_task. The received capability boundary dimensions D, the tool library basic capability description document B, and the current subtask u_t are filled into the positions of the corresponding variables in the basic text prompt template, yielding a text prompt for generating the preference weights W_task for subtask u_t. This text prompt is input into the large language model, which reasons about and outputs the preference weights W_task.
[0095] (2) Matching submodule, which is used to perform a dot product operation between the preference weights W_task and the capability boundary matrix M_p, composed of the capability boundaries of each tool in the preset tool library, to obtain the performance matching degree S_tool of each tool for executing the current subtask u_t, represented as:
[0096]
[0097] where the operator in the formula denotes a normalization operation.
[0098] The preference weights W_task and the capability boundary matrix M_p have consistent dimensions. The preference weights W_task are presented in vector form, where each dimension is one of the preset capability boundary dimensions D and each value represents the attention that subtask u_t allocates to the corresponding dimension of D; for example, the higher the weight assigned to a dimension, the higher the performance requirement on that dimension. In the capability boundary matrix M_p, each row holds the capability boundary information of one tool in the tool library, each entry of a row corresponds to one of the preset capability boundary dimensions D, and its value represents the tool's task execution performance in that dimension; the capability boundary values of each tool are obtained through the capability assessment system.
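The matching step can be sketched with plain Python lists. The function names and the example numbers are hypothetical, and because the source does not say which normalization is used, scaling the scores so that they sum to 1 is an assumption of this sketch.

```python
from typing import List


def normalize(values: List[float]) -> List[float]:
    """Scale values to sum to 1 (assumed; the source names no specific method)."""
    total = sum(values)
    if total == 0:
        return [0.0 for _ in values]
    return [v / total for v in values]


def matching_degree(w_task: List[float], m_p: List[List[float]]) -> List[float]:
    """Dot product of W_task with each tool's row of M_p, then normalization."""
    raw = [sum(w * c for w, c in zip(w_task, row)) for row in m_p]
    return normalize(raw)


# Three illustrative capability dimensions and three tools (made-up values).
w_task = [0.6, 0.3, 0.1]
m_p = [
    [0.9, 0.4, 0.5],  # tool 0: strongest on the first dimension
    [0.5, 0.8, 0.6],  # tool 1
    [0.3, 0.3, 0.9],  # tool 2: strongest on the last dimension
]
s_tool = matching_degree(w_task, m_p)
```

Because the subtask puts most of its weight on the first dimension, tool 0 receives the highest matching degree in this example even though tool 2 has the largest single capability value.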
[0099] (3) Tool selection submodule, which is used to sort the tool indices according to the performance matching degree S_tool to obtain the theoretical execution performance ranking, represented as:
[0100]
[0101] where R denotes the tool indices sorted by performance matching degree, and the operator in the formula sorts the values from largest to smallest and outputs the corresponding indices. The tool ranked first in R, i.e., the tool with the highest performance matching degree, is therefore selected as the best performance tool Tool_top1 to execute the current subtask u_t;
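The ranking step amounts to a descending argsort over the matching degrees. In the sketch below, `rank_tools` and the tool names are hypothetical, and the scores are made-up values.

```python
from typing import List


def rank_tools(s_tool: List[float]) -> List[int]:
    """Tool indices sorted by matching degree, highest first."""
    return sorted(range(len(s_tool)), key=lambda i: s_tool[i], reverse=True)


tool_names = ["tool_a", "tool_b", "tool_c"]  # hypothetical tool library
ranking = rank_tools([0.36, 0.43, 0.21])     # theoretical ranking R
top1 = tool_names[ranking[0]]                # best performance tool
```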
[0102] (4) Task execution submodule, which is used to call the selected best performance tool Tool_top1 to execute the current subtask u_t; the task execution submodule includes:
[0103] 1) Input condition confirmation unit, which is used to obtain the usage document information of the selected best performance tool Tool_top1 from the tool library basic capability description document, and to determine, based on the usage document information, the input conditions c that the selected tool Tool_top1 requires to execute the current subtask u_t. Specifically, for the selected best performance tool Tool_top1, the large language model in the execution module performs inference over the tool's usage document information and the subtask u_t to obtain conditional input c for Tool_top1 that meets the format requirements. The usage document information includes the tool name, the tool description, a description and examples of the tool's input conditions, and supplementary information.
[0104] 2) Execution unit, which is used to call the selected tool to execute the current subtask according to the input conditions c and obtain a structured execution result; specifically, the system enters the tool execution phase, runs the execution code of tool Tool_top1 with the input conditions c as conditional input, and obtains the corresponding formatted output o_t. The execution result output in each round is the visual content of the current task.
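The two units can be sketched as a usage-document record plus two small functions. Everything named here (`ToolSpec`, `confirm_inputs`, `execute_subtask`, the toy tool) is hypothetical, and the dictionary of proposed inputs stands in for the conditions c that the large language model would infer from the usage document and the subtask.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class ToolSpec:
    name: str
    description: str
    input_schema: Dict[str, str]        # parameter name -> description
    run: Callable[..., Dict[str, Any]]  # the tool's execution code


def confirm_inputs(tool: ToolSpec, proposed: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the documented parameters and fail if any are missing."""
    missing = [k for k in tool.input_schema if k not in proposed]
    if missing:
        raise ValueError(f"{tool.name} is missing inputs: {missing}")
    return {k: proposed[k] for k in tool.input_schema}


def execute_subtask(tool: ToolSpec, conditions: Dict[str, Any]) -> Dict[str, Any]:
    """Call the tool and wrap its output in a structured result o_t."""
    return {"tool": tool.name, "inputs": conditions, "output": tool.run(**conditions)}


# Toy text-to-image tool that returns a file name instead of rendering anything.
toy_tool = ToolSpec(
    name="toy_text_to_image",
    description="Generates an image from a text prompt.",
    input_schema={"prompt": "text describing the image", "size": "e.g. 512x512"},
    run=lambda prompt, size: {"image": f"{prompt[:10]}_{size}.png"},
)
c = confirm_inputs(toy_tool, {"prompt": "two red apples", "size": "512x512", "extra": 1})
o_t = execute_subtask(toy_tool, c)
```

Validating the inputs against the documented schema before the call is a design choice of this sketch; it makes a malformed condition fail early instead of inside the tool.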
[0105] (5) Capability boundary matrix update submodule, used for:
[0106] 1) Use an "exploration-exploitation" strategy: select from the tool library the top m tools with the highest performance matching degree S_tool, and randomly select n additional tools from the remaining tools, to increase the likelihood of selecting potentially high-performing tools; the theoretical performance ranking R_theory of these m + n tools is expressed as:
[0107]
[0108] where the first operator selects the top m tools; the second operator randomly samples n tools; and l denotes the total number of tools in the tool library.
[0109] 2) Execute the same subtask u_t with each of the selected m + n tools, and compare the execution results to obtain the actual execution performance ranking.
[0110] 3) Update the capability boundary matrix based on the difference between the actual performance ranking and the theoretical performance ranking; expressed as:
[0111]
[0112] where the left-hand side denotes the updated capability boundary matrix; the formula includes a normalization operation; M_p denotes the capability boundary matrix before the update; W_task denotes the preference weights; η denotes the update step size; and Δ denotes the direction coefficient, which reflects the difference between the theoretical execution performance ranking R_theory and the actual execution performance ranking R_actual. The actual performance ranking R_actual is obtained by using a multimodal large model to compare and evaluate the candidate outputs of the multiple tools. Taking image generation or image editing tasks as an example, the images output by the multiple tools are input into the multimodal large model simultaneously; the model judges which image best matches the subtask and ranks the actual performance from best to worst. During the update of the capability boundary matrix M_p, when a tool's actual ranking is better than its theoretical ranking, its performance boundary scores increase according to the preference weights and the importance distribution of the specific task across dimensions; otherwise, the scores decrease accordingly.
[0113] It should be noted that, in the existing technology, the values of the tool capability boundary matrix M_p in the tool library may originate from benchmark tests on large datasets (i.e., benchmark evaluation scores reported in existing authoritative papers) or from expert-style evaluations based on previous tool usage. These boundaries may be inaccurate because of differences in task-related dimensions or subjective bias. To improve the accuracy of the tool performance boundary scores, this invention employs an adaptive preference update mechanism, implemented by the aforementioned capability boundary matrix update submodule, which iteratively adjusts the values of the tool capability boundary matrix M_p based on actual tool usage.
[0114] For newly added tools that lack sufficient usage experience or benchmark results (i.e., lack high-quality M_p scores), their scores can be initialized with the average performance boundary scores of similar tools in the current tool library on the corresponding dimensions, to ensure that their potential is not overlooked in subsequent tool use and iterative updates.
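The initialization of a newly added tool described above can be sketched as a per-dimension average. This is only an illustration: the function name `init_new_tool` and the example scores are hypothetical, and how "similar tools" are identified is not specified here and is left to the caller.

```python
def init_new_tool(similar_rows):
    """Return initial boundary scores for a new tool.

    similar_rows: rows of M_p for tools judged similar to the new tool
    (one list of per-dimension scores per tool). Choosing the similar
    tools is outside this sketch.
    """
    if not similar_rows:
        raise ValueError("at least one similar tool is required")
    count = len(similar_rows)
    # zip(*rows) walks the matrix column by column, i.e. dimension by dimension.
    return [sum(column) / count for column in zip(*similar_rows)]


# Hypothetical scores of two similar tools on two dimensions.
new_row = init_new_tool([[0.8, 0.4], [0.6, 0.2]])
```

The new row can then be appended to M_p and refined by the same update mechanism as the other tools.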
[0115] 4. Evaluation Module:
[0116] This evaluation module is used to perform multi-dimensional quantitative evaluation of the execution results based on the evaluation objectives, and generate an execution score for the current subtask; the execution score for the current subtask is represented as follows:
[0117]
[0118] where e_t denotes the execution score of the t-th round task, obtained by weighting the individual dimensions; the evaluation weights are defined for the first through the L-th dimension, where L denotes the total number of dimensions, and their values can be set in a specific implementation; o_t denotes the execution result of the t-th round; g_i denotes the evaluation objective of the i-th dimension; and the function symbol denotes the evaluation module.
[0119] The execution scores of the current subtask u_t in each dimension are input into the LLM in text form for summarization, and an evaluation summary for the current subtask u_t is output.
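The weighted execution score can be illustrated with a short sketch. It assumes that the evaluation module has already produced one score per dimension for o_t against each objective g_i, and it assumes uniform weights when none are given; both the function name `execution_score` and the uniform default are assumptions made for illustration.

```python
def execution_score(dim_scores, weights=None):
    """Weighted sum of per-dimension scores, giving e_t for one round."""
    count = len(dim_scores)
    if weights is None:
        # Assumption: equal weight for each of the L dimensions.
        weights = [1.0 / count] * count
    if len(weights) != count:
        raise ValueError("one weight per dimension is required")
    return sum(w * s for w, s in zip(weights, dim_scores))


# Hypothetical per-dimension scores for four evaluation objectives.
e_t = execution_score([1.0, 0.0, 0.5, 0.5])
```

Passing an explicit `weights` list lets a deployment emphasize the dimensions that matter most for a given task.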
[0120] The aforementioned tool-based intelligent agent system is applied to downstream tasks; these downstream tasks include at least one of image editing, image generation, and visual understanding tasks.
[0121] Next, taking the image generation task as an example, the tool intelligent agent system provided by the present invention will be described in detail.
[0122] I. Taking the text-to-image generation task as an example, the reasoning process is shown in Figure 2:
[0123] 1. Task Requirements Analysis:
[0124] After the user submits the task request, the analysis module receives the user input, which can be a plain text command or multimodal information combining text and images. In the example of Figure 2, the user entered the text requirement "A puppy wearing sunglasses and a Santa hat runs along a country road in winter, surrounded by a Christmas atmosphere" and a picture of a corgi. Therefore, in this task, the user expects the puppy in the generated image to be the corgi shown in the picture.
[0125] Subsequently, the analysis module extracts the image semantic information s_i from the user input, performs semantic parsing of the text command, and extracts the user's core intent through logical reasoning to generate a structured task summary. In the example of Figure 2, the analysis module integrates the user input into a task summary and image semantic information through a preset reasoning process. The task summary is "Generate a corgi wearing sunglasses and a Santa hat running on a country road decorated with Christmas decorations," the image semantic information is "A corgi stands on a rock with cherry blossom trees in the background," and the structured evaluation objective is "{number:{{sunglasses:1}, {corgi:1}}, background: country road decorated with Christmas decorations, posture: running}."
[0126] The analysis module generates the evaluation objectives for the final task output according to preset evaluation dimensions. In the example of Figure 2, since the final image semantics of this image generation task only involve "a corgi," "sunglasses," "Santa hat," and "a country road with a Christmas atmosphere," the analysis module consolidates the evaluation objective for the task into "{Number:{{Sunglasses:1}, {Corgi:1}}, Background: Country road decorated with Christmas, Posture: Running}". It should be noted that the dimensions involved in the evaluation objective differ between image generation tasks.
[0127] 2. Task Planning:
[0128] The planning module receives from the analysis module the task summary, the image semantic information s_i (if the user provides image input), and the tool library's basic capability description document B, and, using its own experience and common sense, decomposes the task to obtain the first-round subtask u_1 that the execution module can execute, expressed as:
[0129]
[0130] As shown in the example of Figure 2, the planning module generates the first-round subtask u_1, namely "generating a dog on a snowy country road", based on the task summary, the input image semantics, and the tool library's basic capability description document.
[0131] 3. Performance-driven tool selection:
[0132] For the subtask u_t from the planning module, the execution module performs capability preference analysis according to the description of u_t, in order to determine the key performance dimensions that need attention when executing u_t. Based on the preset capability boundary dimensions D and the tool library's basic capability description document B, performance-related preference weights are assigned to subtask u_t:
[0133] In a specific implementation, as shown in the example of Figure 2, when the execution module receives the subtask u_t, it follows a preset reasoning chain and assigns preference weights to each capability boundary dimension according to the capabilities that u_t requires. For example, for the subtask "Generate a dog on a snowy country road" shown in the figure, the execution module determines through reasoning that the object-count dimension has the highest weight for this task, while the weights of the other dimensions are relatively small, and it directly outputs the normalized weight vector over the dimensions.
[0134] The preference weights W_task corresponding to subtask u_t are combined by a dot product with the capability boundary matrix M_p, formed from the capability boundaries of each tool in the preset tool library, to calculate the performance matching degree S_tool of each tool in the tool library for executing the subtask, expressed as:
[0135]
[0136] where the operator denotes a normalization operation.
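The matching-degree computation can be sketched as a matrix-vector product followed by normalization. The layout of M_p (one row per tool, one column per dimension), the sum-to-one normalization, and all numeric values below are assumptions made for illustration; the normalization actually used is not specified in the text.

```python
def matching_degree(w_task, m_p):
    """Return S_tool: one normalized matching degree per tool.

    w_task: preference weight per capability dimension.
    m_p:    capability boundary matrix, m_p[j][d] = score of tool j on dimension d.
    """
    raw = [sum(w * s for w, s in zip(w_task, row)) for row in m_p]
    total = sum(raw)
    if total == 0:
        # Degenerate case: no tool has any score on the weighted dimensions.
        return [1.0 / len(raw)] * len(raw)
    # Assumption: normalize so the matching degrees sum to 1.
    return [r / total for r in raw]


# Hypothetical weights that emphasize the first (object-count) dimension.
w_task = [0.7, 0.1, 0.1, 0.1]
m_p = [
    [0.9, 0.5, 0.5, 0.5],  # tool 0: strong on the first dimension
    [0.4, 0.9, 0.9, 0.9],  # tool 1: strong on the other dimensions
]
s_tool = matching_degree(w_task, m_p)
```

Because the weights concentrate on the first dimension, tool 0 receives the larger matching degree even though tool 1 is stronger on three of the four dimensions.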
[0137] In the example of Figure 2, the tool performance boundary matrix M_p represents the performance of every tool in the tool library across the different dimensions; the stronger the capability, the larger the value. It can be initialized from authoritative evaluation benchmarks built on large datasets (taken from existing task evaluation benchmark papers) or from expert evaluation.
[0138] The tool indices are sorted from largest to smallest according to the value of the performance matching degree S_tool:
[0139]
[0140] where R denotes the tool indices sorted by performance matching degree, and the sorting operator orders the matching degrees from largest to smallest and outputs the corresponding indices. The tool ranked first in R, i.e., the tool with the highest performance matching degree, is therefore selected as the best tool Tool_top1 for executing the current subtask u_t;
[0141] In the example of Figure 2, the obtained performance matching degrees are ranked from largest to smallest, and the best tool is found to be FLUX.
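The ranking step amounts to a descending argsort over the matching degrees. The following minimal sketch uses hypothetical values and a hypothetical helper name, `rank_tools`.

```python
def rank_tools(s_tool):
    """Return tool indices ordered by matching degree, largest first."""
    return sorted(range(len(s_tool)), key=lambda i: s_tool[i], reverse=True)


s_tool = [0.21, 0.47, 0.32]  # hypothetical matching degrees for three tools
R = rank_tools(s_tool)       # tool indices, best match first
tool_top1 = R[0]             # index of the tool chosen for the subtask
```

Ties keep their original index order because Python's sort is stable, which makes the selection deterministic.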
[0142] 4. Tool Invocation and Execution:
[0143] Based on the user documentation of the selected best tool Tool_top1 and the subtask u_t, the execution module analyzes and infers the input conditions c that Tool_top1 requires to execute u_t. In the example of Figure 2, FLUX is chosen as the best tool for the subtask. Based on the FLUX user documentation, the execution module analyzes subtask u_1 and produces the input condition c for FLUX: "A dog on a snowy country road."
[0144] The system then enters the tool execution phase, where it runs the relevant code of the FLUX tool with the input conditions c as conditional input to obtain the corresponding formatted output o_1.
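The input condition confirmation step can be pictured as assembling a prompt from the tool's user documentation and the subtask. The sketch below is hypothetical throughout: the `ToolDoc` fields mirror the documentation items listed for the tool library (name, description, input description and examples, supplementary information), while the prompt wording and the example field values are placeholders, and the call to the large language model is omitted.

```python
from dataclasses import dataclass


@dataclass
class ToolDoc:
    name: str
    description: str
    input_spec: str          # description and examples of the input conditions
    supplementary: str = ""  # any additional notes from the documentation


def build_condition_prompt(doc: ToolDoc, subtask: str) -> str:
    """Assemble the text a language model would read to infer the input c."""
    return (
        f"Tool: {doc.name}\n"
        f"Description: {doc.description}\n"
        f"Input format and examples: {doc.input_spec}\n"
        f"Notes: {doc.supplementary}\n"
        f"Subtask: {subtask}\n"
        "Return the input for this tool in the required format."
    )


# Placeholder documentation values, for illustration only.
doc = ToolDoc("FLUX", "text-to-image generation", "a single text prompt")
prompt = build_condition_prompt(doc, "Generate a dog on a snowy country road")
```

The model's reply would then be validated against the tool's input format before the tool is invoked.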
[0145] 5. Evaluation of sub-task execution effectiveness:
[0146] Based on the evaluation objectives produced by the analysis module, the evaluation module performs a multi-dimensional evaluation of the execution result o_t of the t-th round and gives an execution score for each dimension. The execution scores of the dimensions are weighted and summed using the following formula to obtain a comprehensive score.
[0147]
[0148] where e_t denotes the execution score (i.e., the comprehensive score) of the t-th round task, obtained by weighting the individual dimensions; the evaluation weights are defined for the first through the L-th dimension, where L denotes the total number of dimensions, and their values can be set in a specific implementation; o_t denotes the execution result of the t-th round; g_i denotes the evaluation objective of the i-th dimension; and the function symbol denotes the evaluation module.
[0149] The execution scores of the current subtask u_t in each dimension are input into the LLM in text form for summarization, and an evaluation summary for the current subtask u_t is output.
[0150] In the example of Figure 2, the execution result received a weighted score of 20 points after evaluation, and a comprehensive evaluation summary was given: "lacking running posture, sunglasses, and Christmas atmosphere".
[0151] 6. Decision optimization and implementation:
[0152] The system compares the execution result score e_t of each stage with a preset threshold to determine quantitatively whether the user task has been completed. If the score e_t does not meet the threshold, the historical trajectory information h_t (composed of the historical subtasks and the evaluation information of their execution results), the task summary, the input image semantic information s_i (if the user provides image input), and the tool library's basic capability description document B are taken as input to the planning module to plan the next stage of the task, correcting or further advancing the historical subtasks; the next subtask u_{t+1} is expressed as:
[0153]
[0154] Repeat steps 3-5, iteratively executing subtasks until the execution result score reaches the preset threshold or the number of executed steps reaches the maximum, and output the task execution result with the best evaluation score, o_best.
[0155] In the example of Figure 2, the agent breaks the user's task down and executes it step by step. Over multiple rounds of execution, it iteratively refines the base image generated in the first round, gradually approaching the output the user intended.
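The closed decision loop of steps 2 through 6 can be sketched as follows. The callables `plan`, `execute`, and `evaluate` are hypothetical stand-ins for the planning, execution, and evaluation modules, and the stub scores in the usage example are invented for illustration.

```python
def run_task(plan, execute, evaluate, threshold, max_steps):
    """Iterate plan -> execute -> evaluate until the score meets the
    threshold or max_steps is reached; return the best-scoring output."""
    history = []  # (subtask, score, summary) per round, i.e. the trajectory
    best_score, best_output = float("-inf"), None
    for _ in range(max_steps):
        subtask = plan(history)            # next subtask from the trajectory
        output = execute(subtask)          # tool selection and invocation
        score, summary = evaluate(output)  # comprehensive score and summary
        history.append((subtask, score, summary))
        if score > best_score:
            best_score, best_output = score, output
        if score >= threshold:
            break
    return best_output, best_score, history


# Usage with stub modules: scores improve over three rounds.
scores = iter([20, 60, 90])
o_best, e_best, trajectory = run_task(
    plan=lambda h: f"subtask-{len(h) + 1}",
    execute=lambda u: f"output-of-{u}",
    evaluate=lambda o: (next(scores), "summary"),
    threshold=80,
    max_steps=5,
)
```

Tracking the best output separately from the last output matters when the step limit is reached before any round meets the threshold.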
[0156] II. Taking the text-to-image generation task as an example, the task breakdown and execution process is shown in Figure 3:
[0157] Similar to Figure 2, when a user inputs the creative requirement "Create a hand-drawn illustration depicting a warm autumn scene: a field with eight green cabbages growing on soil covered with fallen leaves, and a red bird foraging in the distance; the illustration should convey seasonality and a sense of life," the analysis module analyzes and summarizes the task, and the task planning process then begins.
[0158] During planning, the planning module draws on its own experience and common sense to produce the first subtask, "Generate 8 cabbages to grow in the soil." In the performance-driven tool selection process, since the number of cabbages is the most important aspect of this subtask, the output preference weights W_task give a higher proportion to the quantity dimension. Consequently, in the weighted calculation with the performance boundary matrix M_p, the Creati-Layout tool receives the highest score.
[0159] The process then moves to tool execution: the best tool, Creati-Layout, performs the subtask "generating 8 cabbages growing on soil" and successfully generates the base image, which contains eight cabbages arranged in two rows against a soil background.
[0160] Since the output did not meet the end user's needs, the process of task planning, tool selection, tool execution, and result evaluation needed to be repeated cyclically. Through multiple rounds of execution, different tools were selected to execute different sub-tasks to ensure optimal execution results for each sub-task. Finally, after multiple rounds of execution, the expected result was output.
[0161] III. Taking the text-to-image generation task as an example, the execution process of the adaptive preference update mechanism is shown in Figure 4:
[0162] The performance boundary scores in the tool capability boundary matrix M_p of the preset tool library may originate from benchmark tests on large datasets (i.e., benchmark evaluation scores reported in existing authoritative papers) or from expert-style evaluations based on previous tool usage. These boundaries may be inaccurate because of differences in task-related dimensions or subjective bias. To improve the accuracy of the tool performance boundary scores, this invention proposes an adaptive preference update mechanism that iteratively adjusts the values of the tool capability boundary matrix M_p based on actual tool usage.
[0163] As an extension of the embodiment of Figure 2 described above, in the first round the planning module gives the subtask "Generate a dog on a snowy country road".
[0164] During tool selection, an "exploration-exploitation" strategy is used: the top m tools with the highest performance matching degree S_tool are selected from the tool library, and n additional tools are randomly selected from the remaining tools to increase the likelihood of selecting potentially high-performing tools:
[0165]
[0166] where the first operator selects the top m tools; the second operator randomly samples n tools; and l denotes the total number of tools in the tool library.
[0167] In the embodiment of Figure 2, based on the performance matching degree S_tool, the best tool, the second-best tool, and one random tool are selected, with the theoretical ranking "[1,2,3]". The three selected tools then execute the same subtask, and their execution results are compared to obtain the ranking of actual execution performance. The difference between the theoretical ranking and the actual ranking is calculated, and the values of the capability boundary matrix M_p are updated accordingly, expressed as:
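The exploration-exploitation selection can be sketched as taking the top m indices by matching degree and sampling n more from the remainder. The function name `select_candidates` and the example values are hypothetical, and uniform sampling without replacement is an assumption about how the random selection is made.

```python
import random


def select_candidates(s_tool, m, n, rng=None):
    """Return m top-ranked tool indices followed by n randomly sampled ones."""
    if rng is None:
        rng = random.Random()
    ranked = sorted(range(len(s_tool)), key=lambda i: s_tool[i], reverse=True)
    exploit = ranked[:m]                   # highest matching degrees
    remainder = ranked[m:]
    # Assumption: uniform sampling without replacement from the remaining tools.
    explore = rng.sample(remainder, min(n, len(remainder)))
    return exploit + explore


# Hypothetical matching degrees for a library of l = 5 tools.
s_tool = [0.10, 0.40, 0.30, 0.15, 0.05]
candidates = select_candidates(s_tool, m=2, n=1, rng=random.Random(0))
```

The order of the returned list is the theoretical ranking of the m + n candidates: exploited tools first, explored tools last.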
[0168]
[0169] where the left-hand side denotes the updated capability boundary matrix; the formula includes a normalization operation; M_p denotes the capability boundary matrix before the update; W_task denotes the preference weights; η denotes the update step size; and Δ denotes the direction coefficient, which reflects the difference between the theoretical execution performance ranking R_theory and the actual execution performance ranking R_actual. The actual performance ranking R_actual is obtained by using a multimodal large model to compare and evaluate the candidate outputs of the multiple tools. Taking image generation or image editing tasks as an example, the images output by the multiple tools are input into the multimodal large model simultaneously; the model judges which image best matches the subtask and ranks the actual performance from best to worst. During the update of the capability boundary matrix M_p, when a tool's actual ranking is better than its theoretical ranking, its performance boundary scores increase according to the preference weights and the importance distribution of the specific task across dimensions; otherwise, the scores decrease accordingly.
[0170] In the embodiment of Figure 2, the second-best tool fits the subtask best in terms of visual semantics, while the random tool matches the reference image better than the best tool in terms of the puppy's appearance. The multimodal large model therefore gives the actual performance ranking "[3,1,2]". The difference between the theoretical ranking "[1,2,3]" and the actual ranking "[3,1,2]" is calculated to obtain the direction coefficients for the capability boundary matrix M_p, and M_p is then adjusted and updated according to the task preference weights W_task.
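One way to realize the described update is sketched below. The exact formula is not restated here, so every specific is an assumption: the direction coefficient is taken as the sign of (theoretical rank minus actual rank), the adjustment η·Δ·W_task is added to the tool's row of M_p, and clipping to [0, 1] stands in for the unspecified normalization.

```python
def update_matrix(m_p, w_task, tools, rank_theory, rank_actual, eta=0.05):
    """Return a new capability boundary matrix after one update round.

    tools:       row indices in m_p of the m + n evaluated tools.
    rank_theory: theoretical rank position of each tool (1 = best).
    rank_actual: actual rank position of each tool (1 = best).
    """
    updated = [row[:] for row in m_p]  # leave the input matrix untouched
    for j, r_t, r_a in zip(tools, rank_theory, rank_actual):
        # Assumed direction coefficient: +1 if the tool did better than
        # predicted (smaller actual rank), -1 if worse, 0 if unchanged.
        delta = (r_t > r_a) - (r_t < r_a)
        updated[j] = [
            min(1.0, max(0.0, score + eta * delta * w))  # clip as a stand-in
            for score, w in zip(updated[j], w_task)
        ]
    return updated


# Hypothetical two-dimensional matrix for three tools.
m_p = [[0.8, 0.5], [0.6, 0.5], [0.4, 0.5]]
new_m_p = update_matrix(m_p, [0.9, 0.1], [0, 1, 2], [1, 2, 3], [3, 1, 2])
```

Scaling the step by W_task means the dimensions that mattered most for the subtask move the most, which matches the described behaviour of increasing or decreasing scores according to the preference weights.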
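The rankings in this example can be read as per-tool rank positions: the best, second-best, and random tools have theoretical positions 1, 2, 3 and actual positions 3, 1, 2. The sketch below converts the best-to-worst ordering judged by the multimodal model into those positions and takes the difference; the tool labels and the sign convention (theoretical minus actual, so positive means better than predicted) are assumptions made for illustration.

```python
def rank_positions(order_best_to_worst, tools):
    """Map a best-to-worst ordering to each tool's rank position (1 = best)."""
    position = {tool: i + 1 for i, tool in enumerate(order_best_to_worst)}
    return [position[tool] for tool in tools]


tools = ["best", "second_best", "random"]  # labels for the three candidates
r_theory = rank_positions(["best", "second_best", "random"], tools)
r_actual = rank_positions(["second_best", "random", "best"], tools)
# Positive difference: the tool performed better than its theoretical rank.
difference = [t - a for t, a in zip(r_theory, r_actual)]
```

Here `r_theory` is [1, 2, 3], `r_actual` is [3, 1, 2], and `difference` is [-2, 1, 1], so the best tool's scores would be lowered while the other two would be raised.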
[0171] IV. Taking the text-to-image generation task as an example, the execution process of the capability alignment planning and optimization strategy is shown in Figure 5:
[0172] As an extension of the embodiment of Figure 2 described above, in the first round the planning module gives the subtask "Generate a dog on a snowy country road." After execution, the corresponding result o_{t-1} is obtained.
[0173] To further strengthen the step-aware decision-making capability of the planning module, this invention extends the Step-aware Preference Optimization (SPO) method proposed by Liang et al. and proposes a capability-aligned planning optimization strategy that aligns the decision-making process with the performance boundaries of tool execution.
[0174] For each execution phase t, the planning module generates k candidate subtasks through random sampling. To ensure diversity of the candidate subtasks, a proportion β of the subtask samples is generated without using empirical information, while the remaining samples use it. This empirical information comes from the task graphs of tasks the agent has successfully executed in the past.
[0175] Each candidate subtask is executed to obtain the corresponding execution output, and the evaluation module evaluates the results to obtain the corresponding evaluation scores.
[0176] The subtask with the highest evaluation score is selected as the optimal sample, and the subtask with the lowest score as the worst sample. These provide preference samples for training the planning module, so that the planning module learns to output the optimal sample with higher probability.
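The construction of preference samples can be sketched in two steps: generate k candidates, a proportion β of them without experience, then keep the best- and worst-scoring candidates. The helper names, the `propose` callable standing in for the planning module, and the rounding of β·k are assumptions made for illustration.

```python
import random


def sample_candidates(propose, k, beta, experience, rng):
    """Generate k candidate subtasks; round(beta * k) use no experience."""
    n_plain = round(beta * k)
    candidates = []
    for i in range(k):
        use_hint = i >= n_plain and bool(experience)
        hint = rng.choice(experience) if use_hint else None
        candidates.append(propose(hint))
    return candidates


def preference_pair(candidates, scores):
    """Return (optimal sample, worst sample) by evaluation score."""
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    return candidates[best], candidates[worst]


# Usage with a stub planner that simply echoes the experience hint it received.
cands = sample_candidates(lambda hint: hint, k=4, beta=0.5,
                          experience=["graph-a", "graph-b"],
                          rng=random.Random(0))
pair = preference_pair(["a", "b", "c"], [0.2, 0.9, 0.5])
```

Each (optimal, worst) pair, together with the context the planner saw, forms one training example for the preference objective.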
[0177] After collecting sufficient trajectory data, the large language model is trained using the following formula:
[0178] The loss function during training is expressed as:
[0179]
[0180] where L(θ) denotes the loss function during training; θ denotes the model parameters to be updated; E_{t~u} denotes the expectation; B denotes the tool library's basic capability description document; σ denotes the sigmoid function; p_ref denotes the reference probability predicted by the fixed initial model of p_θ; h_{t-1} denotes the historical trajectory information of round t-1; s_i denotes the image semantic information; and the remaining symbols denote, respectively, the task summary, the optimal sample of round t, the worst sample of round t, the regularization hyperparameter, the planning module, and the total number of iterations of the task trajectory.
[0181] Finally, the planning module loads the weights of the trained large language model and uses them in the system's task planning process.
[0182] Figure 6 is a visual comparison of the effects of the present invention with other existing methods in image generation and image editing tasks, wherein:
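The symbols defined above (a sigmoid, a reference probability from a fixed initial model, a regularization hyperparameter, and an optimal and a worst sample per round) are consistent with a DPO-style pairwise preference objective. The sketch below computes such a loss for a single round from log-probabilities; the exact form is an assumption rather than the formula of this invention, and the argument names are hypothetical.

```python
import math


def preference_loss(logp_best, logp_worst, ref_logp_best, ref_logp_worst, reg=0.1):
    """Assumed DPO-style loss for one round: -log(sigmoid(margin)).

    logp_*:     log-probability of the sample under the model being trained.
    ref_logp_*: log-probability of the sample under the fixed reference model.
    reg:        regularization hyperparameter scaling the margin.
    """
    margin = reg * ((logp_best - ref_logp_best) - (logp_worst - ref_logp_worst))
    # -log(sigmoid(x)) equals softplus(-x); computed in a numerically stable way.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# With no preference shift relative to the reference, the loss is log(2).
neutral = preference_loss(-2.0, -2.0, -2.0, -2.0)
# When the trained model favours the optimal sample more, the loss drops.
improved = preference_loss(-1.0, -3.0, -2.0, -2.0, reg=1.0)
```

Averaging this quantity over rounds sampled from the collected trajectories gives an estimate of the expectation in the training objective.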
[0183] For the image generation tasks with the inputs "An astronaut cat wearing a spacesuit with a white gas canister is fishing on a rock-covered asteroid. A star-shaped bait hangs on the fishing line, and blue space fish swim against a cosmic background with spiral galaxies and twinkling stars. Next to the cat is a small bucket filled with freshly caught glowing fish. A small UFO is watching from the upper right corner." and "A tranquil tennis court scene: A smooth wooden rocking chair sits beside a green mesh fence, and a corgi wearing green glasses sits quietly on it. The court has a bright blue surface and clear white lines, with a single tennis ball on the left.", this invention, compared with other existing methods, generates more reasonable structures with higher accuracy in entity categories, counts, and attributes. Other methods suffer from issues such as missing entities, incorrect positions, and incorrect attributes.
[0184] For the image editing tasks with the inputs "Place a blue spoon to the right of the cookie, garnish with a mint leaf next to the stemless strawberry, and then place the entire dessert on a marble countertop." and "Please change the image to the following scene: A child walks across a meadow carrying a dragon-shaped kite. The child is wearing a dark red sweater. Mountains rise in the background, and a rainbow streaks across the sky.", this invention follows the input instructions more accurately than other existing methods. Other methods suffer from failing to follow the editing instructions, missing steps, and over-editing.
[0185] Figure 7 is a visualization of several examples of image editing tasks provided by the present invention. The image editing tasks include: 1) Modifying the original image, replacing the wooden plate with a polished marble plate, removing the fork in the middle, and repainting the chocolate cake as a lighter caramel color. 2) Editing the original image, changing the clothing to a denim jacket style, changing the background to a concert stage scene, and updating the guitar body to a bright green. 3) Modifying the original image, making the fire hydrant look rusty and old, changing the woman's shirt color to a soft blue, and setting the scene against a cityscape background. 4) Editing the original image, changing the chocolate cake to blue, removing the background flag, and adding a bouquet with falling confetti. 5) Editing the original image, redesigning the fire hydrant into a star pattern, changing the ground to lush green grass, placing a white rabbit on it, and setting the scene to dusk.
[0186] Figure 8 is a visualization of several examples of image generation tasks provided by the present invention. The image generation tasks include: 1) Drawing an image: A woman wearing a yellow hat and dress sits on a garden bench, holding a basket of red roses. The scene should create a classic and romantic atmosphere. 2) Drawing a detailed close-up image showing the texture and cracks of a rustic, weathered wooden wall. The image should convey a sense of realism and wear, with slight water stains on the top indicating its age and the effects of the natural environment. 3) Generating a tranquil landscape painting. The scene should include a winding stone path leading through flowers in the foreground to a cozy red-roofed house nestled among trees, with hazy mountains in the distance. An elderly person should be working in a nearby field. 4) Generating a tranquil and atmospheric modern bedroom scene. The composition should focus on a bedside table by a large window, on which sits a dark blue backpack, a soft pink toothbrush, and a glowing smartphone. The view outside the window shows a night scene.
[0187] Figure 9 is a visualization of several examples of customized image generation provided by the present invention. The tasks involved include: 1) This toy is searching for treasure at the bottom of the sea; 2) Generating an image: a cat wearing a colorful hat on a riverbank with flowers and shimmering water; 3) Transforming this cat into a red-haired, beret-wearing, roaring tiger.
[0188] Figure 10 is an example of the tool intelligent agent system applied to a visual understanding task, whose reasoning process is similar to the image generation embodiment of Figure 3. The system is applicable to tasks and tools in any domain. For visual understanding tasks, however, conventional large models are often constrained by their strong prior knowledge, which leads to wrong answers even on simple tasks. For example, as shown in Figure 10, large models typically answer 10 when asked how many fingers a person has. In actual images, however, especially of anime characters, the number of fingers may not be 10. The tool intelligent agent system proposed in this invention therefore provides a task flow for counting a person's fingers, as shown in Figure 10. For the user-input task "Please tell me how many fingers the person in the image has?", after understanding the input image and task information, the planning module reasons about and breaks down the task. For this task, the first subtask given by the planning module is "Detect the position of the person's hand in the image". The execution module then performs performance-driven tool selection: as in Figure 3, preference weights for detecting hand positions are assigned based on this subtask, and a dot product with the tool performance boundary matrix yields a ranking of the matching degree of the different detection tools. The best tool, "Groundingdino," is selected to detect the hand positions. Subsequent steps include segmenting the left or right hand by finger to obtain the actual positions and number of fingers. After multiple rounds of execution, the number of detected fingers is summarized, and an answer that satisfies the understanding task is output.
The above image generation, image editing, and image understanding tasks effectively demonstrate that the tool agent system proposed in this invention is applicable to various tasks in various fields and has strong robustness and generalization ability.
[0189] To fully verify the effectiveness of the tool intelligent agent system provided by the present invention, the embodiments of the present invention adopted the following comparative experiments.
[0190] I. Quantitative Results and Analysis:
[0191] In this embodiment of the invention, three different benchmarks were used: T2I-CompBench (Huang et al., 2023), OneIG-Bench (Chang et al., 2025), and Complex-Edit (Yang et al., 2025b) to objectively evaluate its visual reasoning performance in image generation and editing tasks from multiple perspectives.
[0192] 1. Basic Image Generation Comparison:
[0193] In this embodiment of the invention, the invention is compared with various image generation methods based on the basic task, as shown in Table 1.
[0194] Table 1: Comparison of this invention with various image generation methods on basic image generation
[0195]
[0196] Among them, T2I-CompBench evaluates images in terms of attribute binding and object relationships. From Table 1, it can be seen that: (1) Traditional models, such as FLUX and SD3, are still competitive in terms of texture, non-spatial and complexity metrics, and their performance is similar to or surpasses that of CoT (Chain-of-Thought) based methods (T2I-R1, GoT). (2) CoT based methods rely on fine-tuning of LLM (Large Language Model), which limits their task scope; simple prompts may lead to overly complex interpretations and inaccurate images. (3) Agent-based methods (GenArtist, T2I-Copilot) use self-correction to regenerate low-quality outputs, thereby improving reliability. (4) This invention can match the most suitable model to adjust its capabilities according to different tasks, thereby achieving optimal performance in all dimensions.
[0197] 2. Advanced Image Generation Comparison:
[0198] To further evaluate the effectiveness of this invention in visual reasoning, the performance of various methods under different scenarios and complex text prompts was evaluated on OneIG-Bench, as shown in Table 2:
[0199] Table 2: Comparison of this invention with various image generation methods on advanced image generation
[0200]
[0201] Table 2 shows that: (1) For more complex generation tasks, FLUX and SD3 perform significantly worse on the reasoning metric than methods that integrate an LLM, indicating that integrating an LLM improves the ability to handle complex information. (2) Regarding alignment accuracy, GoT and GenArtist perform worse than other methods, indicating that a single large model has limited capacity when handling complex tasks. (3) Both T2I-Copilot and this invention use agent collaboration mechanisms to plan each step of visual reasoning more accurately when handling cross-domain information. (4) Because the toolset limits its generation capabilities, this invention does not show a significant advantage over other methods on the alignment and text metrics. However, its performance-driven tool selection enables more intelligent planning and thus brings a significant advantage in reasoning.
[0202] II. Efficiency Comparison:
[0203] 1. Comparison of time consumption:
[0204] To verify the inference efficiency of this invention, QWen3-VL-32B (QWen3-VL, 2025) was used uniformly as the LLM for this invention, GenArtist, and T2I-Copilot in this embodiment. Inference was performed on the same dataset as in Table 1, and the time consumed by task planning, tool selection, and image evaluation in each round was recorded. As shown in Figure 11, compared with the baseline methods, the method of this embodiment shows a significant reduction in time consumption in all three processes.
[0205] In particular, T2I-Copilot's fixed toolset minimizes its tool selection time, whereas GenArtist's detailed text descriptions of tool capabilities require more inference time when the number of tools is large. In contrast, this invention achieves a significantly lower tool selection time than GenArtist by analyzing subtasks and outputting capability-matching preference weights.
[0206] 2. Comparison of token consumption:
[0207] To demonstrate the efficiency of this invention in tool selection, embodiments of this invention extend the problem to managing a large-scale tool library in a future agent community. Specifically, GPT-4o (Fang et al., 2025b) is used to simulate a large tool library containing 10 to 200 tools and to generate tool information comprising text descriptions and multi-dimensional ratings. The “complex_vel” subset of T2I-CompBench is used as the task, and the performance-driven tool selection of this invention is compared with traditional text-based methods, with a maximum output length of 8192 tokens. The total token consumption (input and output) of the two methods is compared. As shown in Figure 12, traditional text-based methods consume more tokens because they struggle to define tool capabilities, leading to an exponential increase in token consumption with the number of tools, without addressing the issue of selection correctness. This invention focuses on task-specific dimensions and is therefore unaffected by the number of tools. As the number of dimensions increases (from d=4 to d=16), the token consumption of this invention grows only slowly, mainly because of the longer input prompts. This demonstrates the superior efficiency of this invention in tool management and selection for future-oriented intelligent agents.
[0208] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.
[0209] The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims
1. A tool intelligent agent system, characterized by comprising: an analysis module, used to parse the task requirements input by the user and generate a structured task summary and evaluation objectives; a planning module, used to analyze the task summary and generate the first subtask to be executed, to determine the completion status of the overall task based on the execution score of the current subtask, and to generate subsequent subtasks if the overall task is not complete; an execution module, used to analyze the capability preferences of the current subtask, match the best-performing tool from the tool library based on the preference weights to execute the current subtask, and obtain the execution result; and an evaluation module, used to perform multi-dimensional quantitative evaluation of the execution result based on the evaluation objectives, and to generate the execution score of the current subtask and a textual result evaluation; wherein the execution module comprises: a capability preference analysis submodule, used to assign performance-related preference weights to the current subtask based on preset capability boundary dimensions and the tool library basic capability description document; a matching submodule, used to compute the dot product of the preference weights with the capability boundary matrix composed of the multi-dimensional capability boundary values of each tool in the tool library, so as to obtain the performance matching degree of each tool for executing the current subtask; a tool selection submodule, used to sort the tool indices by performance matching degree to obtain the theoretical execution performance ranking, and to select the tool with the highest performance matching degree as the best-performing tool to execute the current subtask; and a task execution submodule, used to call the best-performing tool to execute the current subtask and obtain the execution result; and wherein the execution module further comprises: a capability boundary matrix update submodule, used to select the top m tools with the highest performance matching degree from the tool library and randomly select n additional tools from the remaining tools; to have each of the selected m + n tools execute the same subtask and compare the execution results to obtain the actual execution performance ranking; and to update the capability boundary matrix based on the difference between the actual execution performance ranking and the theoretical execution performance ranking.
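The matching, selection, and update steps of claim 1 can be illustrated with a minimal Python sketch. The dot product, the descending sort, and the choice of the top m plus n random tools follow the claim; the update rule in `update_capability` is purely an assumption, since the claim only states that the matrix is updated from the difference between the two rankings. All function names, the learning rate `lr`, and the example numbers are hypothetical.

```python
import numpy as np


def match_tools(weights, capability):
    """Dot product of preference weights with the capability boundary matrix.

    weights:    shape (d,), preference weights of the current subtask
    capability: shape (num_tools, d), one row of boundary values per tool
    Returns the matching degrees and the tool indices sorted best first.
    """
    scores = capability @ weights
    ranking = [int(i) for i in np.argsort(-scores, kind="stable")]
    return scores, ranking


def select_candidates(ranking, m, n, rng):
    """Top-m tools by matching degree plus n tools drawn at random from the rest."""
    top, rest = ranking[:m], ranking[m:]
    k = min(n, len(rest))
    extra = [int(i) for i in rng.choice(rest, size=k, replace=False)] if k else []
    return top + extra


def update_capability(capability, weights, scores, candidates, actual_order, lr=0.05):
    """Hypothetical update rule (an assumption, not taken from the source).

    A tool that ranks higher in practice than in theory has its row moved
    along the preference weights; a tool that ranks lower is moved against them.
    """
    theoretical_order = sorted(candidates, key=lambda tool: -scores[tool])
    updated = capability.copy()
    for tool in candidates:
        shift = theoretical_order.index(tool) - actual_order.index(tool)
        updated[tool] += lr * shift * weights
    return updated


# Illustrative numbers only: 4 tools, 3 capability dimensions.
capability = np.array([
    [0.9, 0.2, 0.5],
    [0.4, 0.8, 0.6],
    [0.3, 0.3, 0.9],
    [0.6, 0.6, 0.1],
])
weights = np.array([0.2, 0.1, 0.7])

scores, ranking = match_tools(weights, capability)
best_tool = ranking[0]  # tool selected to execute the current subtask

candidates = select_candidates(ranking, m=2, n=1, rng=np.random.default_rng(0))
# Pretend the theoretical runner-up actually produced the best result.
actual_order = [candidates[1], candidates[0], candidates[2]]
updated = update_capability(capability, weights, scores, candidates, actual_order)
```

Because the LLM only has to output the d preference weights, the per-tool work reduces to one matrix-vector product, which is one plausible reading of why tool selection stays cheap as the tool library grows.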
2. The tool intelligent agent system as described in claim 1, characterized in that the analysis module comprises: a semantic parsing submodule, used to perform semantic parsing on the text commands input by the user and, when the user input includes an image, to extract the semantic information of the image and perform joint semantic parsing on the image semantic information and the text commands; a task summary generation submodule, used to generate a structured task summary based on the semantic parsing results; and an evaluation objective generation submodule, used to generate the corresponding evaluation objectives from the task summary using preset evaluation dimensions.
3. The tool intelligent agent system as described in claim 2, characterized in that the planning module comprises: an initial subtask generation submodule, used to receive the task summary and the tool library basic capability description document and, optionally in combination with the image semantic information, to decompose the task summary to generate the first subtask to be executed.
4. The tool intelligent agent system as described in claim 2, characterized in that the planning module further comprises: a judgment submodule, used to compare the execution score of the current subtask with a preset score threshold, and to determine that the overall task is incomplete if the execution score of the current subtask is less than the preset score threshold and the current iteration round is less than the preset maximum number of iterations; and a subsequent subtask generation submodule, used, when the overall task is incomplete, to input the current historical trajectory information, the task summary, and the tool library basic capability description document, optionally combined with the image semantic information, into the trained large language model to predict and generate the next subtask; wherein the current historical trajectory information includes the subtasks that have been executed as of the current round and their corresponding execution scores.
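The judgment logic of claim 4 can be sketched as a small control loop. This is a minimal illustration under stated assumptions: `execute_and_score` and `plan_next` are hypothetical stand-ins for the execution/evaluation modules and the trained large language model, and the default threshold and round limit are placeholder values, not values from the source.

```python
def task_incomplete(score, threshold, round_index, max_rounds):
    """Claim 4 condition: score below threshold and rounds still available."""
    return score < threshold and round_index < max_rounds


def run_closed_loop(first_subtask, execute_and_score, plan_next,
                    threshold=0.8, max_rounds=5):
    """Run subtasks until the score reaches the threshold or rounds run out.

    Returns the historical trajectory as a list of (subtask, score) pairs.
    """
    history = []
    subtask, round_index = first_subtask, 1
    while True:
        score = execute_and_score(subtask)
        history.append((subtask, score))
        if not task_incomplete(score, threshold, round_index, max_rounds):
            return history
        # The planner sees the full trajectory so far, as in claim 4.
        subtask = plan_next(history)
        round_index += 1


def make_scorer(scores):
    """Hypothetical stub: replays a fixed list of execution scores."""
    remaining = iter(scores)
    return lambda subtask: next(remaining)


def plan_next(history):
    """Hypothetical stub for the LLM planner."""
    return f"refine output of round {len(history)}"


trajectory = run_closed_loop("draw a cat", make_scorer([0.3, 0.5, 0.9]), plan_next)
```

In this sketch the loop stops on either condition, so a task whose score never reaches the threshold still terminates after `max_rounds` rounds.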
5. The tool intelligent agent system as described in claim 4, characterized in that the loss function of the large language model during training is expressed as: (formula not reproduced in this text), wherein L(θ) denotes the loss function during training; θ denotes the model parameters; E_{t~u} denotes the expectation; B denotes the tool library basic capability description document; σ denotes the sigmoid function; p_ref denotes the reference probability predicted by the fixed initial model of p_θ; h_{t-1} denotes the historical trajectory information of round t-1; s_i denotes the image semantic information; and the symbols not reproduced in this text denote, respectively, the task summary, the optimal sample of round t, the worst sample of round t, the regularization hyperparameter, the planning module, and the total number of iterations of the task trajectory.
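The quantities defined in claim 5 (a sigmoid, a regularization hyperparameter, a reference probability from a fixed initial model, and an optimal and a worst sample per round) are consistent with a pairwise preference objective in the style of direct preference optimization. The following is only a sketch under that assumption and is not taken from the source: the symbols S (task summary), y_t^+ and y_t^- (optimal and worst samples of round t), β (regularization hyperparameter), T (total number of iterations), and the uniform distribution over rounds are assumed names for quantities the claim describes in words. The planning module symbol mentioned in the claim is omitted here.

```latex
\mathcal{L}(\theta) = -\,\mathbb{E}_{t \sim \mathcal{U}(1,T)}
\left[
  \log \sigma\!\left(
    \beta \log \frac{p_\theta\!\left(y_t^{+} \mid x_t\right)}{p_{\mathrm{ref}}\!\left(y_t^{+} \mid x_t\right)}
    - \beta \log \frac{p_\theta\!\left(y_t^{-} \mid x_t\right)}{p_{\mathrm{ref}}\!\left(y_t^{-} \mid x_t\right)}
  \right)
\right],
\qquad x_t = \left(S,\, B,\, h_{t-1},\, s_i\right)
```

Under this reading, minimizing the loss raises the probability of the best subtask of each round relative to the worst one, while the ratio to p_ref keeps the trained model close to the fixed initial model.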
6. The tool intelligent agent system as described in claim 1, characterized in that the task execution submodule comprises: an input condition confirmation unit, used to obtain the usage document information of the selected tool from the tool library basic capability description document, and to determine, based on the usage document information, the input conditions required for the selected tool to execute the current subtask; and an execution unit, used to call the selected tool to execute the current subtask according to the input conditions and obtain a structured execution result.
7. The tool intelligent agent system as described in claim 1, characterized in that, in the evaluation module, the execution score corresponding to the current subtask is expressed as: (formula not reproduced in this text), wherein e_t denotes the execution score of the round-t subtask, obtained by weighting the individual dimensions; L denotes the total number of dimensions; o_t denotes the execution result of round t; g_i denotes the evaluation objective of the i-th dimension; and the two symbols not reproduced in this text denote, respectively, the evaluation weight of the i-th dimension and the evaluation module.
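Read together, the definitions in claim 7 describe a weighted sum over L dimensions of the evaluation module's per-dimension judgment of the round-t result. A minimal Python sketch of that reading follows; the function name, the `evaluate` callable standing in for the evaluation module, and the example dimensions and values are all assumptions, not part of the source.

```python
def execution_score(result, objectives, weights, evaluate):
    """Weighted sum of per-dimension evaluations of one execution result.

    result:     the execution result of the current round (o_t)
    objectives: one evaluation objective per dimension (g_1 ... g_L)
    weights:    one evaluation weight per dimension
    evaluate:   callable (result, objective) -> float, standing in for the
                evaluation module
    """
    if len(objectives) != len(weights):
        raise ValueError("need exactly one weight per evaluation objective")
    return sum(w * evaluate(result, g) for w, g in zip(weights, objectives))


# Hypothetical stub: look up precomputed per-dimension scores for a result.
def lookup_evaluator(result, objective):
    return result[objective]


example_result = {"alignment": 0.8, "quality": 0.6, "text": 1.0}
example_objectives = ["alignment", "quality", "text"]
example_weights = [0.5, 0.3, 0.2]

score = execution_score(example_result, example_objectives,
                        example_weights, lookup_evaluator)
```

The sketch does not require the weights to sum to 1; whether the source normalizes them is not stated in the claim.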
8. The tool intelligent agent system as described in claim 1, characterized in that the tool intelligent agent system is applied to downstream tasks, the downstream tasks including at least one of image editing tasks, image generation tasks, and visual understanding tasks.