Agent large model optimization method, device, equipment and program product

By constructing a closed-loop optimization process and multi-agent collaboration, the problem of the lack of self-correction capability of the agent system is solved, and autonomous optimization under task execution feedback is realized, which improves the adaptability and robustness of the agent in complex environments.

CN122242549A · Pending · Publication Date: 2026-06-19 · CHINA MOBILE M2M +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHINA MOBILE M2M
Filing Date
2026-01-26
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing intelligent agent systems based on large language models lack dynamic self-correction capabilities and cannot continuously optimize based on task execution feedback, resulting in insufficient adaptability in real-world, dynamic task scenarios.

Method used

By constructing a closed-loop optimization process, task instructions are input into the agent, an initial response is generated and executed in the task execution environment, feedback is obtained and self-optimization is performed, parameters are adjusted using a large model, forming multi-agent collaboration and semantic alignment, and combined with policy gradient learning and backtracking mechanisms, the model can achieve continuous autonomous evolution.

Benefits of technology

It significantly improves the output accuracy and task adaptability of the agent in complex and dynamic environments, and enhances the robustness and overall performance of the system.

✦ Generated by Eureka AI based on patent content.

Figure CN122242549A_ABST

Abstract

This application discloses a method, apparatus, device, and program product for optimizing a large-scale intelligent agent model, aiming to solve the problems in existing technologies where intelligent agents lack dynamic self-correction capabilities and cannot continuously optimize based on task execution feedback. The solution includes: inputting task instructions to the intelligent agent to obtain a preliminary generated response output by the intelligent agent; executing the preliminary generated response in a task execution environment to obtain an execution result of the preliminary generated response; inputting the execution result of the preliminary generated response into the intelligent agent, so that the intelligent agent optimizes the preliminary generated response based on the execution result of the preliminary generated response to obtain an improved generated response; executing the improved generated response in the task execution environment to obtain an execution result of the improved generated response; and optimizing the parameters of the large-scale model carried by the intelligent agent based on the task instructions and the execution result of the improved generated response.

Description

Technical Field

[0001] This application relates to the field of large model application technology, and in particular to a method, apparatus, device and program product for optimizing large models of intelligent agents.

Background Technology

[0002] In the field of artificial intelligence, an agent is a software entity capable of perceiving its environment, making autonomous decisions, and executing actions to complete a specific task. With the development of Large Language Models (LLMs), their powerful semantic understanding and content generation capabilities are often used as the "brain" of agents, enabling them to handle complex natural language tasks. However, existing agent systems based on LLMs often lack the ability to dynamically self-correct based on task execution results. After outputting a response, the agent struggles to continuously optimize its behavior based on environmental feedback, resulting in insufficient adaptability in real-world, dynamic task scenarios and limiting the overall performance and application effectiveness of the system.

Summary of the Invention

[0003] This application proposes a method, apparatus, device, and program product for optimizing large-scale intelligent agent models, aiming to solve the problems in the prior art where intelligent agents lack dynamic self-correction capabilities and cannot continuously optimize based on task execution feedback.

[0004] Accordingly, the technical solution of this application is as follows. In a first aspect, a method for optimizing a large model of an intelligent agent is provided, including: inputting a task instruction to the agent to obtain a preliminary generated response output by the agent; executing the preliminary generated response in a task execution environment to obtain an execution result of the preliminary generated response; inputting the execution result of the preliminary generated response into the agent, so that the agent optimizes the preliminary generated response based on that execution result to obtain an improved generated response; executing the improved generated response in the task execution environment to obtain an execution result of the improved generated response; and optimizing the parameters of the large model carried by the agent based on the task instruction and the execution result of the improved generated response.
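The five steps of the method can be sketched as a single control loop. The following is a minimal illustration only, not the implementation of this application: the names `LoopRecord`, `run_closed_loop`, `toy_generate` and `toy_execute` are hypothetical, the agent and the environment are replaced by toy stand-ins, and the parameter update is reduced to a callback.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LoopRecord:
    """Everything produced by one pass through steps S101-S104."""
    instruction: str
    preliminary: str
    preliminary_result: str
    improved: str
    improved_result: str


def run_closed_loop(
    instruction: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    execute: Callable[[str], str],
    update_parameters: Callable[[LoopRecord], None],
) -> LoopRecord:
    # S101: the agent outputs a preliminary generated response.
    preliminary = generate(instruction, None, None)
    # S102: execute it in the task execution environment.
    preliminary_result = execute(preliminary)
    # S103: feed the execution result back so the agent can revise its response.
    improved = generate(instruction, preliminary, preliminary_result)
    # S104: execute the improved response in the same environment.
    improved_result = execute(improved)
    record = LoopRecord(instruction, preliminary, preliminary_result,
                        improved, improved_result)
    # S105: pass the instruction and the improved result to the trainer.
    update_parameters(record)
    return record


def toy_generate(instruction: str, previous: Optional[str],
                 feedback: Optional[str]) -> str:
    """Hypothetical stand-in for the agent: adds a pump only after feedback."""
    if feedback is None:
        return "start=06:00;duration=30"
    return f"{previous};pump=P1"


def toy_execute(response: str) -> str:
    """Hypothetical stand-in for the environment."""
    if "pump=" in response:
        return "success"
    return "failure: missing pump information"
```

Calling `run_closed_loop("irrigate north field", toy_generate, toy_execute, log.append)` with an empty list `log` yields a record whose preliminary result is a failure and whose improved result is a success; in a real system the callback would perform the parameter optimization of step S105.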

[0005] In a second aspect, a device for optimizing a large model of an intelligent agent is provided, comprising: a task processing module, used to input a task instruction to the agent to obtain a preliminary generated response output by the agent; a first verification module, used to execute the preliminary generated response in a task execution environment to obtain an execution result of the preliminary generated response; a response improvement module, used to input the execution result of the preliminary generated response into the agent, so that the agent optimizes the preliminary generated response based on that execution result to obtain an improved generated response; a second verification module, used to execute the improved generated response in the task execution environment to obtain an execution result of the improved generated response; and a model training module, used to optimize the parameters of the large model carried by the agent based on the task instruction and the execution result of the improved generated response.

[0006] In a third aspect, embodiments of this application provide an electronic device, including: a processor; and a memory configured to store computer-executable instructions which, when executed, cause the processor to perform the method described in the first aspect.

[0007] In a fourth aspect, a computer program product is provided, the computer program product including a computer-readable storage medium storing a computer program operable to cause a computer to perform the method described in the first aspect.

[0008] This embodiment first inputs task instructions to the agent, which then generates an initial response using its large model. The initial response is then executed in the task execution environment, and the result is re-inputted to the agent, prompting it to optimize the initial response using the large model, resulting in an improved response. This improved response is then executed again in the task execution environment to verify the optimization effect. Finally, by combining the execution results of the task instructions and the improved response, the parameters of the agent's large model are optimized. This is equivalent to achieving closed-loop training of the agent's large model under external manual annotation, guided by optimization results. In application, through continuous task execution and large model parameter update cycles, the agent can autonomously accumulate experience and dynamically adjust its generation strategy, thereby significantly improving output accuracy, task adaptability, and overall system robustness in complex and dynamic environments.

Attached Figure Description

[0009] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in the embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0010] Figure 1 is a flowchart illustrating the intelligent agent large model optimization method according to an embodiment of this application.

[0011] Figure 2 is a schematic diagram of the application architecture of the intelligent agent large model optimization method according to an embodiment of this application.

[0012] Figure 3 is a schematic diagram of the structure of the intelligent agent large model optimization device according to an embodiment of this application.

[0013] Figure 4 is a schematic diagram of the structure of an electronic device according to an embodiment of this application.

Detailed Implementation

[0014] To enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this specification, and not all embodiments. Based on the embodiments in this specification, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of this specification.

[0015] As mentioned earlier, existing agent systems based on large language models (hereinafter referred to as large models) often lack the ability to dynamically self-correct based on task execution results. After outputting a response, the agent struggles to continuously optimize its behavior based on environmental feedback, resulting in insufficient adaptability in real-world, dynamic task scenarios and limiting the overall performance and application effectiveness of the system. Therefore, this application proposes an agent large model optimization method, apparatus, device, and program product, aiming to solve the problems of existing agents lacking dynamic self-correction capabilities and being unable to continuously optimize based on task execution feedback.

[0016] The technical solutions provided by the various embodiments of this application will be described in detail below with reference to the accompanying drawings.

[0017] One embodiment of this application provides a method for optimizing a large model of an intelligent agent. Figure 1 is a flowchart of this method, which includes: S101, input the task instruction to the agent to obtain the preliminary generated response output by the agent.

[0018] The purpose of this step is to provide an initial, evaluable baseline output for subsequent self-correction and model optimization based on environmental feedback. The system first needs to obtain the agent's preliminary understanding and response to task instructions based on its current knowledge state (i.e., the parameters of the large model it carries). Only on this basis can its execution effectiveness be verified in real or simulated environments, thereby identifying problems, obtaining feedback, and initiating subsequent iterative optimization cycles. Therefore, this step is a crucial link connecting task intent with actual execution and triggering the entire self-learning mechanism.

[0019] Furthermore, in this embodiment, there can be multiple agents, collectively forming a multi-agent collaborative system. Each agent is equipped with its own corresponding large model, which can share the same pre-trained foundation but are adapted to its specific responsibilities or perspectives through independent lightweight fine-tuning (e.g., using low-rank adaptation techniques). Upon receiving the same task instruction, each agent independently generates an initial response based on its own optimized model parameters. This parallel design of multiple agents enables the system to simultaneously understand and initially solve the task from multiple dimensions or professional perspectives, thus providing a rich information source and diverse foundation for subsequent collaboration, comparison, and semantic alignment, significantly enhancing the system's robustness and creativity in handling complex tasks.

[0020] It should be noted that the aforementioned intelligent agent and the task instructions it processes have a wide range of application scenarios. For example, in the field of water conservancy management, the intelligent agent can be applied to tasks such as configuring automated irrigation systems, generating reservoir scheduling plans, and writing water affairs reports. In this scenario, the task instructions can be specific operation requests generated in actual business operations, such as a user inputting in natural language, "Configure an irrigation plan for the northern farmland that starts at 6:00 AM every day, lasts for 30 minutes, and has an irrigation volume of 500 liters." The task instructions can also be data samples specifically designed for training and optimizing the large model of the intelligent agent, aiming to improve the model's task completion capabilities in a specific domain through repeated execution and feedback. The content of the task instructions usually needs to be preprocessed and converted into structured data (such as JSON or XML format) so that the intelligent agent can accurately parse it. After receiving this instruction, the intelligent agent's built-in large language model outputs a structured preliminary generated response according to its built-in generation strategy, such as a configuration document containing fields such as irrigation area, time, duration, and water volume. This process can be formally represented as $y_i^{(0)} = \pi_{\theta_i}(x)$, where $x$ denotes the structured task instruction that is input, $\pi_{\theta_i}$ denotes the generation strategy defined by the large model carried by the $i$-th intelligent agent, whose parameters are $\theta_i$, and $y_i^{(0)}$ is the preliminary generated response output by that intelligent agent.
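As an illustration of the preprocessing mentioned above, the irrigation request could be converted into JSON along the following lines. This is a sketch under stated assumptions: the class `IrrigationTask`, the function `to_structured_instruction`, the `task_type` label and all field names are hypothetical, since the text only states that the instruction is converted into structured data such as JSON or XML.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class IrrigationTask:
    # Hypothetical field names chosen to mirror the example request.
    area: str
    start_time: str          # 24-hour "HH:MM"
    duration_minutes: int
    volume_liters: int


def to_structured_instruction(task: IrrigationTask) -> str:
    """Serialize the task as JSON so the agent can parse it unambiguously."""
    payload = {"task_type": "irrigation_config", "params": asdict(task)}
    return json.dumps(payload, sort_keys=True)


task = IrrigationTask(
    area="northern farmland",
    start_time="06:00",
    duration_minutes=30,
    volume_liters=500,
)
structured_instruction = to_structured_instruction(task)
```

The resulting string plays the role of the structured task instruction that is passed to the agent in step S101.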

[0021] S102, execute the preliminary generated response in the task execution environment to obtain the execution result of the preliminary generated response.

[0022] The significance of this step lies in placing the initial generated response, still in a theoretical or configuration state, within a verifiable context for practical testing, thereby obtaining objective and realistic feedback on the execution effect. This is a crucial data source for achieving closed-loop self-optimization. Without interactive verification with the real or simulated world, the system cannot identify potential defects, errors, or mismatches in its output within specific application scenarios. Therefore, executing the initial generated response and obtaining its results essentially opens a channel for "practice to verify truth" for the system, enabling subsequent self-correction and model optimization to be based on concrete and quantifiable effect evaluations, rather than subjective or theoretical speculation.

[0023] The aforementioned task execution environment refers to a system or platform capable of receiving and running responses (such as configuration instructions and operation plans) generated by intelligent agents, and simulating or actually producing corresponding results. It can be a highly simulated digital twin environment or sandbox system for securely and cost-effectively testing various configuration schemes; it can also be a real physical system or business platform, such as a real water conservancy project control system or irrigation network. After the environment executes the initial generated response, it produces an execution result, which particularly emphasizes the natural language description of execution failures or adverse consequences. For example, in a water management scenario, after executing an incomplete irrigation configuration, the environment might return natural language feedback such as "The generated irrigation configuration lacks key equipment information, causing the water pump to fail to start." This natural language-based failure description constitutes the direct input for the subsequent understanding and self-correction of the intelligent agent.
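A sandbox-style task execution environment that answers in natural language might look like the sketch below. It is only an illustration: `REQUIRED_FIELDS`, `execute_in_sandbox` and the field names are hypothetical, and a real environment would run or simulate the configuration rather than merely check for missing fields.

```python
from typing import Dict, Tuple

# Hypothetical set of fields a complete irrigation configuration must carry.
REQUIRED_FIELDS = ("area", "start_time", "duration_minutes",
                   "volume_liters", "pump_id")


def execute_in_sandbox(config: Dict[str, object]) -> Tuple[bool, str]:
    """Return (success flag, natural-language description of the outcome)."""
    missing = [name for name in REQUIRED_FIELDS if name not in config]
    if missing:
        return False, (
            "The generated irrigation configuration is missing required "
            f"fields ({', '.join(missing)}), so the water pump could not "
            "be started."
        )
    return True, "The irrigation configuration was applied and the pump started."
```

Returning the description alongside the success flag keeps the semantic detail that the agent needs for self-correction in step S103.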

[0024] Furthermore, in a preferred embodiment, when there are multiple agents, before executing the improved generated responses of each agent in the task execution environment for final verification, the system can introduce a collaboration and alignment step between agents. Specifically, the system calculates the semantic similarity between the improved generated responses of any two agents. This semantic similarity can be obtained by calculating the cosine value between the semantic embedding vectors of their respective responses, and is formally represented as $s_{ij} = \cos(e_i, e_j) = \frac{e_i \cdot e_j}{\lVert e_i \rVert \, \lVert e_j \rVert}$, where $e_i$ and $e_j$ denote the semantic embedding vectors of the improved generated responses of the $i$-th and $j$-th agents, and $s_{ij}$ is the semantic similarity between the two. If the calculation reveals two improved generated responses whose semantic similarity is below a preset threshold, this indicates a significant semantic divergence in the two agents' understanding of the task or in their solutions. In this case, the system can input one of these responses (referred to as the first target improved generated response) into the agent that generated the other response (referred to as the second target improved generated response). The receiving agent then attempts to integrate the effective semantic content of the first target improved generated response and to adjust and optimize its own second target improved generated response. The significance of this is that it establishes a collaborative feedback mechanism based on semantic consistency. By requiring agents to examine and integrate solutions from other perspectives that differ semantically, it promotes knowledge sharing and strategy collaboration among agents. This not only reduces contradictions and improves the consistency of the overall solution at the output level, but also prompts each agent to adjust its generation strategy at the model learning level to better align with other agents, so that it spontaneously generates more collaborative and globally consistent responses in future tasks, ultimately improving the reliability and quality of the results output by the multi-agent system as a whole.
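The pairwise comparison described above can be sketched as follows. The function names `cosine_similarity` and `divergent_pairs` are hypothetical, the embedding vectors are assumed to come from some external embedding model, and the threshold value is left to the caller.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as maximally dissimilar
    return dot / (norm_a * norm_b)


def divergent_pairs(
    embeddings: Dict[str, Sequence[float]], threshold: float
) -> List[Tuple[str, str]]:
    """Return agent pairs whose responses fall below the similarity threshold."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        if cosine_similarity(vec_a, vec_b) < threshold:
            pairs.append((id_a, id_b))
    return pairs
```

For each returned pair, the response of the first agent would be passed to the second agent as additional input, so that the second agent can revise its own response.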

[0025] S103, the execution result of the preliminary generated response is input into the agent, so that the agent can optimize the preliminary generated response based on the execution result of the preliminary generated response to obtain an improved generated response.

[0026] The significance of this step lies in endowing the agent with the ability to self-reflect and make immediate corrections based on actual performance results, transforming the one-time static generation process into an iterative and learnable dynamic optimization process. By re-inputting the agent with the execution results returned by the environment, especially those representing failures (such as error messages described in natural language), the system is essentially guiding the agent to conduct a "post-mortem review." The agent needs to integrate its initial task understanding (i.e., the initial generated response) with objective feedback from the environment, analyze the root cause of the problem, and generate a revised, improved generated response aimed at solving the problem. This mechanism simulates the cognitive process of humans learning from practice and improving through trial and error, and is a key technology for enabling the model to continuously and autonomously evolve in task scenarios.

[0027] The task execution environment can be a high-fidelity simulation system or a real physical or business environment. The specific choice depends on the application stage and objectives: if this method aims to optimize the parameters of a large model during a specific training phase, the task execution environment typically uses a high-fidelity simulation system to allow for extensive trial and error and iteration under safe, controllable, and low-cost conditions; if this method aims to achieve online optimization and adaptation of a large model in actual deployment applications, the task execution environment is a real business system or physical environment, and its feedback directly reflects the execution effect of the solution in the actual scenario. It is important to note that the execution results fed back by the task execution environment, especially the "failure consequences" (Failure_Description) described in natural language, are irreplaceable key inputs that trigger and drive this optimization process: First, unlike simple binary success / failure identifiers or abstract error codes, the failure consequences described in natural language contain rich semantic information. It not only indicates "what went wrong," but also usually implies "the specific manifestation of the problem" and "possible contextual factors" (for example, "the pump model does not match the pipeline pressure" indicates the problem entity and its interrelationships). Secondly, this descriptive format perfectly matches the native capabilities of large language models, enabling agents to directly understand and analyze complex logic and causal clues in environmental feedback, much like understanding human instructions, without requiring the design of an additional complex structured error explanation mechanism for the system. 
Finally, it transforms specific, contextualized failure experiences into "learning material" that agents can directly process, providing clear, specific, and actionable optimization directions for subsequent self-correction. For example, in an intelligent irrigation configuration task, if the environment returns a description such as "the pump model in the irrigation plan does not match the current network pressure, resulting in insufficient flow," it not only informs of the failure but also directly points to the specific problem of the mismatch between equipment parameters and system state in the configuration scheme, thus guiding the agent to accurately adjust relevant parameters during regeneration. Therefore, this natural language feedback is the core link connecting environmental execution and agent cognition, enabling efficient semantic-level self-correction.

[0028] In practical implementation, the key design point of this step is how to effectively input the execution results of the initially generated response into the agent to drive optimization. The optimization process is not simply about throwing error descriptions at the model; rather, it requires constructing a context-rich, guided input prompt. Specifically, the input received by the agent is a combination of the original task instruction (or context representing user requests) and the execution result (especially the failure description). For example, a well-constructed prompt might be: "Original task: Configure the irrigation system for the northern farmland, starting at 6 AM daily for 30 minutes, using 500 liters of water. Execution feedback: After executing the configuration you previously generated, the system reported 'Missing critical equipment information, causing it to fail to start.' Based on this feedback, please optimize and regenerate a complete and executable irrigation configuration."
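A prompt of this kind could be assembled with a small helper such as the one below. The function name `build_refinement_prompt` and the exact wording of the template are hypothetical; including the previous response in the prompt is likewise an assumption, since the text only requires the original task instruction and the execution result.

```python
def build_refinement_prompt(
    task_instruction: str,
    previous_response: str,
    failure_description: str,
) -> str:
    """Combine the original task with the environment feedback in one prompt."""
    return (
        f"Original task: {task_instruction}\n"
        f"Previous response: {previous_response}\n"
        f"Execution feedback: {failure_description}\n"
        "Based on this feedback, please optimize and regenerate a complete "
        "and executable response."
    )
```

Placing the original task first keeps the goal in view while the feedback line directs the agent to the specific defect.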

[0029] The significance of this approach lies in placing errors within the complete task context, ensuring that the agent does not lose sight of the original goal during optimization while still accurately correcting the exposed problems. Based on this combined prompt, the agent uses the reasoning and generation capabilities of its onboard large language model to output optimized content, i.e., an improved generated response. This process can be expressed by the formula $y_i^{(1)} = \pi_{\theta_i}(p_i)$, where $\pi_{\theta_i}$ is the generation strategy model of the $i$-th agent with parameters $\theta_i$, the input $p_i$ is the prompt combined with the failure description, and the output $y_i^{(1)}$ is the improved generated response produced by the intelligent agent. This improved generated response is then submitted to the task execution environment for verification, forming a closed loop of "generation-execution-feedback-optimization" that continues until the output meets the requirements.

[0030] S104, execute the improved generated response in the task execution environment to obtain the execution result of the improved generated response.

[0031] This step is a crucial verification and evaluation stage in the closed-loop optimization process of this embodiment. Its core significance lies in practically testing the improved generated response after optimization through the self-correction mechanism, thereby obtaining a direct and objective evaluation of the optimization effect. Executing the improved generated response and obtaining its execution result is essentially a verification of the agent's self-correction behavior in step S103. The quality of this execution result directly reflects the agent's ability to understand, analyze, and regenerate based on previous failure feedback, serving as the fundamental basis for judging whether a single round of optimization was successful and whether the problem was solved. More importantly, this execution result will serve as the core training signal and guide for subsequent parameter optimization of the large model carried by the agent. A successful execution result (such as the task being completed correctly) provides a positive reinforcement signal, while an execution result that still has problems indicates the direction for further optimization and may even trigger a more complex system backtracking mechanism. Therefore, this step is the bridge that transforms the immediate effect of "self-correction" into the long-term capability of "model evolution," ensuring that the learning and optimization of the entire system is based on continuous environmental interaction and effect verification.

[0032] It should be noted that the basic principles and processes for implementing the improved generated response and obtaining its execution result in the task execution environment can be found in the description of the initial generated response in step S102. In short, the system submits the improved generated responses output by each agent to the same task execution environment (which can be a high-fidelity simulation system or a real business environment). This environment executes corresponding operations or simulates corresponding processes based on the content of the improved generated response, ultimately producing an execution result. This result preferably includes a clear natural language description, not only describing the final state (success / failure) but also detailing key phenomena or remaining issues during the execution process. For example, in a water conservancy scheduling scenario, after executing an optimized reservoir water release plan, the environment might report, "According to the new plan, the downstream water level has stabilized below the warning line, but the flow of the eastern tributary is still slightly higher than the standard value." This specific description will become a key input for subsequent reward calculation, evaluation of the overall optimization effect, and determination of parameter adjustment strategies. Through this step, each optimization attempt by the agent can obtain immediate evaluation from the environment, thereby driving its model parameters to continuously evolve towards producing better results in actual tasks.

[0033] S105, optimize the parameters of the large model carried by the agent based on the task instruction and the execution result of the improved generated response.

[0034] The significance of this step lies in transforming the experience gained from a single task execution and error correction loop (reflected in the improved execution results of generated responses) into a persistent improvement in the agent's internal model capabilities. If only the immediate content optimization in step S103 is performed, the system can only generate an improved response for a specific problem; the model itself remains unchanged, and it may repeat the same mistakes when handling similar new tasks. By optimizing the parameters of a large model, this step aims to adjust the model's generation strategy, enabling it to directly generate more accurate and robust responses when faced with the same or similar task instructions in the future, thereby reducing reliance on repeated error correction loops. Therefore, this step is a crucial process for distilling the "optimization effect" in a specific scenario into a general "task capability," achieving a leap from passive correction based on feedback to active evolution based on learning, allowing the system to achieve autonomous and continuous performance growth through continuous environmental interaction.

[0035] In its implementation, a preferred parameter optimization mechanism in this embodiment is policy gradient learning based on a comprehensive reward. Specifically, the system calculates the comprehensive reward value corresponding to the improved generated response currently output by each agent. This reward value is not a single-dimensional score but a composite value that integrates multiple evaluation indicators, designed so that the overall reward value is positively correlated with the values of a series of specified evaluation parameters. These key evaluation parameters include, but are not limited to: 1) Task matching degree: the degree of matching $M_i$ between the improved generated response and the expected task objective, which can be obtained through a predefined evaluation function or manual scoring; 2) Semantic synergy: the semantic similarity between the improved generated response and the improved generated responses of all other agents, $S_i = \sum_{j \neq i} \cos(e_i, e_j)$, where $e_i$ is the semantic embedding vector of the $i$-th agent's improved generated response and $\cos$ is cosine similarity; 3) Context consistency: the contextual consistency $C_i$ between the improved generated response and the historical generated content of the agent itself.

[0036] Correspondingly, the calculation of the comprehensive reward can be formally expressed as: $R_i = M_i + \lambda_1 S_i + \lambda_2 C_i$

[0037] where $\lambda_1$ and $\lambda_2$ are hyperparameters that weight the different reward items. The optimization objective is to maximize the expected comprehensive reward obtained by each agent, using reinforcement learning techniques such as policy gradient methods, and thereby to drive updates of the model parameters $\theta_i$, for example with the update $\theta_i \leftarrow \theta_i + \eta \nabla_{\theta_i} \mathbb{E}[R_i]$, where $\eta$ is the learning rate. The deeper significance of this design lies in guiding the model not only to pursue the completion of a single task (task matching), but also to consider the collaborative efficiency among multiple agents (semantic synergy) and the behavioral consistency in long-term tasks (contextual consistency). This multi-objective optimization strategy ensures that the trained agent model is collaborative and takes a long-term view, enabling it to handle complex sequential decision-making and generation tasks.
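The reward and update described above can be illustrated with a toy example. This is a sketch under explicit assumptions, not the formula of this application: the reward is taken to be a weighted sum of the three named terms, the weights `lambda_sim` and `lambda_ctx` and all function names are hypothetical, and the large model is replaced by a softmax policy over a few candidate responses so that a REINFORCE-style update can be written out in full.

```python
import math
from typing import List, Sequence


def composite_reward(
    task_match: float,
    peer_similarities: Sequence[float],
    context_consistency: float,
    lambda_sim: float = 0.5,   # assumed weight of the semantic synergy term
    lambda_ctx: float = 0.5,   # assumed weight of the context consistency term
) -> float:
    """Weighted sum of task matching, semantic synergy and context consistency."""
    return (task_match
            + lambda_sim * sum(peer_similarities)
            + lambda_ctx * context_consistency)


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def reinforce_step(
    logits: Sequence[float], chosen: int, reward: float, learning_rate: float
) -> List[float]:
    """One gradient-ascent step on reward * log pi(chosen) for a softmax policy."""
    probs = softmax(logits)
    # d/d logit_k of log pi(chosen) equals 1[k == chosen] - pi(k).
    return [
        logit + learning_rate * reward * ((1.0 if k == chosen else 0.0) - prob)
        for k, (logit, prob) in enumerate(zip(logits, probs))
    ]
```

With a positive reward the step raises the probability of the chosen response, and with a negative reward it lowers it, which is the behavior the policy gradient update relies on.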

[0038] To further enhance system stability and resilience, this embodiment also introduces a system backtracking and security optimization mechanism. Specifically, the system continuously records the state information of each agent. One key aspect of this state information is the parameter snapshot of the large model carried by the agent after each successful parameter optimization. During the agent's subsequent generation of an initial response or optimization correction, the system monitors its performance in real time. If a preset backtracking trigger condition is met, the system will interrupt the current optimization path and, based on the previous (or earlier) stable parameter snapshot recorded in the state information, reset and adjust the parameters of the large model carried by the agent.

[0039] As an example, the backtracking trigger condition can be designed as at least one of the following types. First, quality failure trigger: the quality assessment value calculated from the execution result of the improved generated response is lower than a preset quality threshold, indicating that the optimization has produced a low-quality result. Second, continuous failure trigger: the execution results of the improved generated responses output by the agent fail to achieve the expected goal for a preset number of consecutive rounds (e.g., three rounds), suggesting that the model may be trapped in a local optimum or an incorrect learning direction. Third, drastic parameter change trigger: during a round of parameter optimization, the magnitude of the parameter change of the large model carried by the agent (measured, for example, by the gradient norm) exceeds a preset stability threshold, indicating that the update process may be unstable and at risk of divergence.
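The three trigger types can be checked with a few comparisons. The following is a minimal sketch only; the function name, the string labels, and all threshold values are hypothetical placeholders rather than values taken from the specification.

```python
def check_backtracking_triggers(quality, recent_successes, change_norm,
                                quality_threshold=0.6, failed_rounds=3,
                                stability_threshold=10.0):
    """Return the list of trigger types that fire for the current round.

    quality          -- quality assessment value of the latest improved response
    recent_successes -- booleans, one per past round, True if the goal was achieved
    change_norm      -- magnitude of the latest parameter change (e.g. a gradient norm)
    All thresholds are illustrative assumptions.
    """
    fired = []
    # Type 1: quality failure.
    if quality < quality_threshold:
        fired.append("quality_failure")
    # Type 2: continuous failure over the last `failed_rounds` rounds.
    last = recent_successes[-failed_rounds:]
    if len(last) == failed_rounds and not any(last):
        fired.append("continuous_failure")
    # Type 3: drastic parameter change.
    if change_norm > stability_threshold:
        fired.append("drastic_change")
    return fired
```

Returning a list rather than a single value reflects the wording "at least one of the following types": several conditions may hold in the same round.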

[0040] When backtracking is triggered, the system does not simply roll back the parameters but performs a targeted corrective optimization. Specifically, the system analyzes the failure mode based on the type of backtracking trigger condition and determines a matching parameter adjustment amount for the large model accordingly. Based on this adjustment amount, the previously stable model parameters selected from the state history (…) are adjusted to obtain new target parameters, and the agent model's current parameters are then updated to the target parameters. The significance of matching the parameter adjustment to the trigger type is that backtracking is not merely "going back to the past" but a "reboot with lessons learned." For example, for backtracking triggered by consecutive failures, the adjustment amount may involve specific fine-tuning for the identified failure modes (such as using LoRA techniques to make minor adjustments to the relevant parameter modules), thereby injecting the correct learning signal while restoring a stable baseline. This mechanism improves the robustness of the system during long-term automated optimization, helps prevent performance collapse caused by accumulated errors or training instability, and keeps the learning process safe and controllable.
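The "rollback plus correction" step can be sketched as follows. This is an illustration under simplifying assumptions: parameters are a flat list of floats, the adjustment amount is looked up from a hypothetical table keyed by trigger type, and the class and function names are invented for the example. In the specification the adjustment may instead come from a LoRA fine-tuning pass.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    params: list                                    # current model parameters (flattened)
    snapshots: list = field(default_factory=list)   # stable parameter snapshots

    def record_snapshot(self):
        """Store a copy of the parameters after a successful optimization round."""
        self.snapshots.append(list(self.params))


def backtrack(state, trigger_type, corrections):
    """Reset to the latest stable snapshot, then apply a trigger-specific adjustment.

    corrections -- hypothetical mapping from trigger type to an adjustment vector.
    """
    if not state.snapshots:
        raise ValueError("no stable snapshot recorded")
    stable = state.snapshots[-1]
    delta = corrections[trigger_type]
    # Target parameters = stable snapshot + adjustment matched to the trigger type.
    state.params = [p + d for p, d in zip(stable, delta)]
    return state.params


state = AgentState(params=[1.0, 2.0])
state.record_snapshot()
state.params = [9.0, 9.0]                           # an unstable update
target = backtrack(state, "continuous_failure",
                   {"continuous_failure": [0.5, -0.5]})
```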

[0041] To further illustrate the practical application of this embodiment, the following example, a configuration task for an intelligent irrigation system in water management, demonstrates the complete application process of the multi-agent collaborative self-correction optimization method and system. The example assumes that, in an agricultural area, irrigation parameters need to be dynamically configured by an intelligent system to achieve efficient use of water resources and healthy crop growth. The implementation process mainly includes the following stages: I. System Initialization and Input Processing. a. Agent Loading and Fine-tuning: The system loads the base parameters of a shared pre-trained large language model. Low-rank adaptation (LoRA) is then used to perform lightweight fine-tuning for four agents with different functions, giving each agent (numbered 1 to 4) its own independent policy model. The agents are: Agent 1 (Irrigation Strategy Generation): Responsible for generating specific irrigation configuration schemes based on instructions.

[0042] Agent 2 (Configuration Verification): Responsible for performing logical and compliance verification on the generated configuration.

[0043] Agent 3 (Error Detection and Correction): Responsible for analyzing environmental feedback and locating problems.

[0044] Agent 4 (Log Recording): Responsible for maintaining the task execution history.

[0045] b. Task queue and feedback module configuration: An independent task queue and feedback receiving module is configured for each agent, so that the agents can process tasks in parallel and receive information from the environment or from other agents.

[0046] c. Input Data Preparation: The input processing module receives user instructions, such as: "Configure the irrigation system for the fields in the northern area, set it to start automatically at 6:00 AM daily for 30 minutes, with a water volume of 500 liters." The instruction is parsed and verified, converted into structured input data, and distributed to the task queue of each agent.

[0047] II. Preliminary Response Generation and Environmental Feedback. Based on the structured input, Agent 1 (irrigation strategy generation) uses its policy model to generate a preliminary response, for example a configuration draft in JSON format: { "irrigation_system": { "area": "North Field", "schedule": "06:00", "duration_minutes": 30, "water_volume_liters": 500 } } The preliminary generated response is sent to the simulated irrigation system for execution. After execution, the environment returns a failure description in natural language: "The generated irrigation configuration lacks critical equipment information, causing the system to fail to start."

III. Self-correction Mechanism. Agent 1 receives the above failure description, combines it with the original task instructions to form a new prompt, and invokes its model again to generate an improved response. The new response supplements the missing information: { "irrigation_system": { "area": "North Field", "schedule": "06:00", "duration_minutes": 30, "water_volume_liters": 500, "equipment_type": "Drip Irrigation", "maintenance_personnel": "John Doe" } } The improved generated response is then resubmitted to the environment for verification. If the problem is resolved, the loop ends; otherwise, the self-correction process is repeated.
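The generate, execute, and correct loop of stages II and III can be sketched with a stub agent and a simulated environment. Everything below is illustrative: `stub_agent` stands in for the large model call, `simulated_environment` stands in for the irrigation simulator, and treating `equipment_type` and `maintenance_personnel` as the required fields is an assumption inferred from the example response, not a rule stated in the specification.

```python
import json

# Assumption: these are the fields the simulated irrigation system needs in order to start.
REQUIRED_FIELDS = ("equipment_type", "maintenance_personnel")


def simulated_environment(response_json):
    """Execute a configuration and return (success, natural-language feedback)."""
    config = json.loads(response_json)["irrigation_system"]
    missing = [name for name in REQUIRED_FIELDS if name not in config]
    if missing:
        return False, ("The generated irrigation configuration lacks critical "
                       "equipment information: " + ", ".join(missing))
    return True, "Configuration accepted."


def stub_agent(instruction, feedback=None):
    """Stand-in for the model: adds the missing fields once feedback is supplied."""
    config = {"area": "North Field", "schedule": "06:00",
              "duration_minutes": 30, "water_volume_liters": 500}
    if feedback is not None:
        config["equipment_type"] = "Drip Irrigation"
        config["maintenance_personnel"] = "John Doe"
    return json.dumps({"irrigation_system": config})


def self_correct(instruction, agent, environment, max_rounds=3):
    """Run the loop until the environment accepts the response or rounds run out."""
    response = agent(instruction)
    history = []
    for _ in range(max_rounds):
        ok, message = environment(response)
        history.append((response, ok, message))
        if ok:
            break
        # Combine the original instruction with the failure description.
        response = agent(instruction, feedback=message)
    return response, history


final, history = self_correct("Configure irrigation for the northern fields.",
                              stub_agent, simulated_environment)
```

The recorded history (response, success flag, feedback) is the kind of material a later parameter-optimization step could consume. If `max_rounds` is exhausted, the returned response is the last regenerated one and has not been verified.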

[0048] IV. Collaboration and Semantic Alignment among Agents. In more complex collaboration modes, the system can introduce a semantic alignment mechanism. For example: a. Semantic embedding computation: Agent 2 (configuration verification) analyzes the improved response of Agent 1 and generates its own verification opinions or supplementary configuration. The system then computes the semantic embedding vector of each of the two outputs.

[0049] b. Semantic Similarity Calculation and Collaboration: The system calculates the cosine similarity between the two embedding vectors. A low value indicates a discrepancy in understanding between the two agents. In that case the system can guide Agent 1 to refer to the content generated by Agent 2 and adjust its own strategy to improve output consistency. This adjustment can be achieved through additional LoRA fine-tuning, that is, by adding an increment Δ derived from the collaborative feedback to the parameters of Agent 1.
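The similarity check in steps a and b can be sketched as follows. The embedding vectors are assumed to come from some external embedding model that is not shown here; the function names and the threshold value are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def misaligned_pairs(embeddings, threshold=0.8):
    """Return (agent_a, agent_b, similarity) for every pair below the threshold.

    embeddings -- dict mapping an agent id to the embedding of its response
    threshold  -- illustrative value; the specification only says "preset threshold"
    """
    flagged = []
    for a, b in combinations(sorted(embeddings), 2):
        similarity = cosine(embeddings[a], embeddings[b])
        if similarity < threshold:
            flagged.append((a, b, similarity))
    return flagged


pairs = misaligned_pairs({1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 0.1]})
```

Each flagged pair identifies two agents whose outputs diverge; one agent's response would then be passed to the other as reference content for adjustment.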

[0050] V. Contextual Consistency Check and Collaborative Rewards. To ensure the continuity of long-term tasks, the system performs the following: a. History Maintenance and Consistency Calculation: Agent 1 maintains its own historical generated content. The system calculates the contextual consistency between the current improved response and this history.

[0051] b. Calculation of comprehensive reward and policy update: Calculate a comprehensive reward for agent 1. This is used to guide the global optimization of its model parameters:

[0052] Here, the task matching term is the degree to which the improved response matches the expected goal, and the weight parameters mentioned above balance the importance of task matching, semantic collaboration, and contextual consistency.

[0053] Subsequently, the model parameters of agent 1 are updated using the policy gradient method:

[0054] Here, the step size of the update is set by the learning rate. The update aims to make the agent generate better responses in the future.
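The reward and update formulas of this step did not survive in the text above. One form consistent with the surrounding description is given below as a sketch; every symbol is an assumed notation, and the update is written in the standard REINFORCE style of policy gradient, which the specification does not name explicitly.

```latex
% Assumed notation (not from the source):
%   y_1'  improved response of Agent 1      g    expected goal
%   y_2'  response of Agent 2               H_1  generation history of Agent 1
%   x     structured task input             \pi_{\theta_1}  policy of Agent 1
%   \alpha, \beta, \gamma  reward weights   \eta  learning rate
R_1 = \alpha \, M(y_1', g) + \beta \, \operatorname{sim}(y_1', y_2') + \gamma \, C(y_1', H_1)

\theta_1 \leftarrow \theta_1 + \eta \, R_1 \, \nabla_{\theta_1} \log \pi_{\theta_1}(y_1' \mid x)
```

Here M is the task matching degree, sim is the cosine similarity of the response embeddings, and C is the contextual consistency with the history, matching the three terms described in [0052].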

[0055] VI. System Backtracking and Multi-round Training. To ensure system stability, a backtracking mechanism is introduced: a. State recording: The system continuously records snapshots of the state of each agent, including its model parameters, its generated content, and so on.

[0056] b. Backtracking Triggering and Adjustment: When specific conditions are met (such as consistently substandard generation quality or an abnormal parameter update magnitude), the system triggers backtracking. For example, the agent's model parameters are rolled back to the previous stable version, and a corrective parameter adjustment amount for the identified problem is applied to obtain new target parameters.

[0057] c. Multiple iterations: The system starts a new round of training and task execution based on the backtracked state, and iterates in this way until the performance reaches a satisfactory level.

[0058] VII. Lightweight Fine-tuning and Batch Training. To improve efficiency, the system employs an efficient training strategy: a. Lightweight fine-tuning: All agent models are based on the shared base parameters. Each agent achieves personalization by adding its own LoRA incremental parameters to the shared base, that is, the parameters of each agent equal the shared base parameters plus that agent's increment. This significantly reduces computational overhead.
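The "shared base plus per-agent increment" structure can be sketched with plain Python lists. This follows the usual LoRA formulation, in which the increment is the product of two low-rank matrices; the specification does not spell out the factorization, so the matrix names and shapes below are assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def effective_weight(base, lora_b, lora_a):
    """Agent weight = shared base + low-rank increment (lora_b @ lora_a).

    The base matrix is never modified, so all agents can share one copy.
    """
    delta = matmul(lora_b, lora_a)
    return [[w + d for w, d in zip(row_w, row_d)]
            for row_w, row_d in zip(base, delta)]


def lora_param_count(d_out, d_in, rank):
    """Trainable parameters per agent under the low-rank factorization."""
    return rank * (d_out + d_in)


base = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # shared by all agents
agent1_b = [[1.0], [0.0], [0.0]]                              # 3 x 1 (rank 1)
agent1_a = [[0.0, 2.0, 0.0]]                                  # 1 x 3
agent1_weight = effective_weight(base, agent1_b, agent1_a)
```

For a 1024 by 1024 weight matrix and rank 8, each agent trains 16,384 parameters instead of 1,048,576, which is where the reduction in overhead comes from.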

[0059] b. Batch Training: In actual training, the system collects a batch of task data and then performs a centralized parameter update. The update formula integrates the agent's own reward signal and the cooperative signals from other agents:

[0060] Here, the quantities in the formula are the batch size, the batch index, the collaboration weight, and the semantic similarity between the responses of two agents. This approach improves training stability and collaborative efficiency.
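The batch update formula itself is missing from the text above. The following is only one possible form that uses exactly the quantities listed (batch size, batch index, collaboration weight, pairwise semantic similarity); all symbols are assumed notation, and the way the cooperative signal enters the update is a guess rather than the formula of the specification.

```latex
% Assumed notation (not from the source):
%   B  batch size          b  batch index          \lambda  collaboration weight
%   R_i^{(b)}  comprehensive reward of agent i on sample b
%   \operatorname{sim}(\cdot,\cdot)  semantic similarity between two agents' responses
\theta_i \leftarrow \theta_i + \eta \cdot \frac{1}{B} \sum_{b=1}^{B}
  \Big( R_i^{(b)} + \lambda \sum_{j \neq i} \operatorname{sim}\big(y_i^{(b)}, y_j^{(b)}\big) \Big)
  \, \nabla_{\theta_i} \log \pi_{\theta_i}\big(y_i^{(b)} \mid x^{(b)}\big)
```

In this form the agent's own reward and the similarity-based cooperative signal are combined into one scalar that weights the policy gradient of each sample, and the result is averaged over the batch.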

[0061] c. Parameter synchronization: Periodically synchronize the LoRA parameters of each agent using the "average parameter method" or the performance-based "weighted aggregation method" to promote knowledge sharing and policy alignment.
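The two synchronization options named above can be sketched as follows, treating each agent's LoRA parameters as a flat list of floats. The function names are hypothetical, and how the performance scores are obtained is left open, as it is in the text.

```python
def average_sync(agent_params):
    """Average parameter method: element-wise mean over all agents."""
    count = len(agent_params)
    return [sum(values) / count for values in zip(*agent_params)]


def weighted_sync(agent_params, scores):
    """Weighted aggregation method: element-wise mean weighted by performance scores."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must sum to a positive value")
    return [sum(s * v for s, v in zip(scores, values)) / total
            for values in zip(*agent_params)]


lora_params = [[1.0, 2.0], [3.0, 4.0]]          # two agents, two parameters each
mean_params = average_sync(lora_params)
weighted_params = weighted_sync(lora_params, [1.0, 3.0])
```

With equal scores the weighted method reduces to the plain average; unequal scores pull the shared parameters toward the better-performing agents.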

[0062] Figure 3 presents a complete system architecture diagram of the multi-agent collaborative self-correction optimization method proposed in this embodiment. The architecture forms a closed-loop workflow from task input to continuous autonomous model optimization. The process begins with natural language commands input by the user, which are semantically parsed and structured by the input processing module and dynamically scheduled to multiple agents by the agent management module. In the generation and correction module, each agent first generates a preliminary response based on its large language model. Then, upon receiving environmental feedback from the task execution environment (e.g., a water management environment), particularly failure consequences described in natural language, it triggers the self-correction mechanism to analyze and reconstruct the response, ultimately producing an improved generated response. Meanwhile, the semantic alignment and collaboration module calculates and compares the semantic embedding vectors of each agent's output so that the content generated by the agents remains semantically consistent, thereby supporting efficient collaboration. The context management module maintains each agent's historical generation and feedback records, ensuring logical coherence of content across multiple rounds of task execution. Continuous improvement of system performance is driven by the policy update and optimization module, which calculates reward signals by integrating task matching degree, collaboration consistency, and contextual coherence. The fine-tuning and training module uses lightweight techniques such as low-rank adaptation to perform iterative updates of the large model parameters efficiently.
The parameter synchronization and sharing module further promotes knowledge transfer and policy alignment among different agents, accelerating the collective learning process. To ensure the long-term stability of the system, the backtracking and optimization module continuously monitors the system status. Once preset conditions such as a decline in generation quality or abnormal parameters are detected, state rollback and targeted parameter adjustments are triggered to effectively prevent error accumulation. All functional modules work closely together under the unified coordination of the system to form an autonomous system integrating perception, decision-making, execution, feedback, learning, and optimization, realizing autonomous iteration and collaborative evolution of multiple agents in complex task scenarios.

[0063] In summary, the method in this embodiment achieves automated training and continuous improvement of the agent's large-scale model by constructing a complete "generation-execution-feedback-optimization" closed loop. Specifically, the task instructions are first input into the agent, which then uses its built-in large language model to understand and reason, outputting an initial generated response. Subsequently, this initial generated response is executed in a simulated or real task execution environment to verify its actual effectiveness, and the execution results (especially problems or failure information described in natural language) are re-inputted into the agent as key feedback. Based on this feedback, the agent uses its large-scale model's reflection and correction capabilities to analyze and optimize the initial generated response, generating a higher-quality improved generated response. This improved generated response is then executed again in the task execution environment to verify whether the optimization effectively solves the previous problems. Finally, the system integrates the actual execution results of the original task instructions and the improved generated response, calculates multi-dimensional performance rewards, and uses these rewards to guide the targeted optimization of the agent's large-scale model parameters. This process is equivalent to achieving closed-loop, self-supervised training of the agent's large-scale model without external manual annotation, guided by the actual optimization effect of task execution. Through continuous application and the constant cycle of task execution and large model parameter updates, the agent can autonomously accumulate experience in dealing with various scenarios and dynamically adjust its internal generation strategy. This significantly improves the accuracy of output results, adaptability to changing tasks, and overall robustness of the system in complex and dynamic real-world environments.

[0064] In addition, corresponding to the method shown in Figure 1, this embodiment also provides an agent large model optimization device. Figure 3 is a schematic diagram of the structure of the agent large model optimization device 300, which includes: the task processing module 310, used to input task instructions to the agent to obtain the preliminary generated response output by the agent.

[0065] The first verification module 320 is used to execute the preliminary generated response in the task execution environment to obtain the execution result of the preliminary generated response.

[0066] The response improvement module 330 is used to input the execution result of the preliminary generated response into the agent, so that the agent can optimize the preliminary generated response based on the execution result of the preliminary generated response to obtain an improved generated response.

[0067] The second verification module 340 is used to execute the improved generated response in the task execution environment to obtain the execution result of the improved generated response.

[0068] The model training module 350 is used to optimize the parameters of the large model carried by the agent based on the task instructions and the execution result of the improved generated response.

[0069] Optionally, there are multiple intelligent agents, each equipped with its own corresponding large model, and each intelligent agent independently generates a preliminary generation response based on the task instruction, and independently optimizes its own generated preliminary generation response.

[0070] Optionally, before executing the improved generated response in the task execution environment, the second verification module 340 is further configured to: calculate the semantic similarity between any two improved generated responses; if there are two target improved generated responses with a semantic similarity lower than a preset threshold, the first target improved generated response is input into the agent that generates the second target improved generated response, and the agent that generates the second target improved generated response integrates the semantic content of the first target improved generated response to adjust the second target improved generated response.

[0071] Optionally, the model training module 350 optimizes the parameters of the large model carried by the agent based on the execution results of the task instruction and the improved generated response, including: calculating the comprehensive reward value corresponding to the improved generated response for each agent; wherein the comprehensive reward value is positively correlated with the value of specified parameters, the specified parameters including at least one of the following: the matching degree between the improved generated response to which the comprehensive reward value belongs and the expected task objective, the semantic similarity between the improved generated response to which the comprehensive reward value belongs and the improved generated responses generated by other agents, and the degree of contextual consistency between the improved generated response to which the comprehensive reward value belongs and the historical generated content of the agent; and optimizing the parameters of the large model carried by each agent by maximizing the comprehensive reward value corresponding to each agent.

[0072] Optionally, the apparatus in this embodiment further includes: The model backtracking module is used to: record the state information of the agent, including the parameters of the large model carried by the agent after each parameter optimization; and if a preset backtracking trigger condition is met during the process of the agent generating the initial generation response or optimizing the initial generation response, optimize the parameters of the large model carried by the agent based on the parameters of the previous optimization recorded in the state information.

[0073] Optionally, the backtracking trigger condition belongs to at least one of the following backtracking trigger types: the quality assessment value calculated based on the execution result of the improved generated response is lower than a preset quality threshold; the execution result of the improved generated response output by the agent for a consecutive preset number of rounds fails to achieve the expected goal; the parameter change amplitude of the large model carried by the agent exceeds a preset stability threshold during a round of parameter optimization.

[0074] Optionally, the model backtracking module optimizes the parameters of the large model carried by the agent based on the parameters of the previous optimization recorded in the state information, including: determining the matching parameter adjustment amount for the large model based on the backtracking trigger type to which the backtracking trigger condition belongs; adjusting the parameters of the previous optimization recorded in the state information according to the parameter adjustment amount to obtain the target parameters; and adjusting the parameters of the large model carried by the agent to the target parameters.

[0075] In summary, the device in this embodiment achieves automated training and continuous improvement of the agent's large-scale model by constructing a complete "generation-execution-feedback-optimization" closed loop. Specifically, the task instruction is first input into the agent, which then uses its built-in large language model to understand and reason, outputting an initial generated response. Subsequently, this initial generated response is executed in a simulated or real task execution environment to verify its actual effect, and the execution results (especially problems or failure information described in natural language) are re-inputted into the agent as key feedback. Based on this feedback, the agent uses its large-scale model's reflection and correction capabilities to analyze and optimize the initial generated response, generating a higher-quality improved generated response. This improved generated response is then executed again in the task execution environment to verify whether the optimization effectively solves the previous problems. Finally, the system integrates the actual execution results of the original task instruction and the improved generated response, calculates multi-dimensional performance rewards, and uses these rewards to guide the targeted optimization of the agent's large-scale model's parameters. This process is equivalent to achieving closed-loop, self-supervised training of the agent's large-scale model without external manual annotation, guided by the actual optimization effect of task execution. Through continuous application and the constant cycle of task execution and large model parameter updates, the agent can autonomously accumulate experience in dealing with various scenarios and dynamically adjust its internal generation strategy. This significantly improves the accuracy of output results, adaptability to changing tasks, and overall robustness of the system in complex and dynamic real-world environments.

[0076] It should be noted that the agent large model optimization device in this embodiment can serve as the execution body of the method shown in Figure 1, and can therefore implement the steps and functions of the method shown in Figure 1.

[0077] Figure 4 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. Referring to Figure 4, at the hardware level, the electronic device includes a processor, and optionally also includes an internal bus, a network interface, and memory. The memory may include main memory, such as high-speed random-access memory (RAM), and may also include non-volatile memory, such as at least one disk drive. Of course, the electronic device may also include hardware required for other services.

[0078] The processor, network interface, and memory can be interconnected via an internal bus, which can be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, or an EISA (Extended Industry Standard Architecture) bus, etc. The bus can be divided into an address bus, a data bus, a control bus, etc. For ease of representation, the bus is shown in Figure 4 as a single double-headed arrow, but this does not mean that there is only one bus or one type of bus.

[0079] Memory is used to store computer programs. Specifically, a computer program may include program code, which includes computer operation instructions. Memory may include main memory and non-volatile memory, and provides the computer program to the processor.

[0080] Specifically, the processor reads the corresponding computer program from non-volatile memory into main memory and then runs it, forming at the logical level the agent large model optimization device shown in Figure 3. Correspondingly, the processor executes the program stored in memory and specifically performs the following operations: inputting the task instructions into the agent to obtain the preliminary generated response output by the agent.

[0081] The preliminary generated response is executed in the task execution environment to obtain the execution result of the preliminary generated response.

[0082] The execution result of the initial generated response is input into the agent, which then optimizes the initial generated response based on the execution result to obtain an improved generated response.

[0083] The improved generated response is executed in the task execution environment to obtain the execution result of the improved generated response.

[0084] Based on the task instructions and the execution result of the improved generated response, the parameters of the large model carried by the agent are optimized.

[0085] This embodiment of the electronic device achieves automated training and continuous improvement of a large-scale agent model by constructing a complete "generation-execution-feedback-optimization" closed loop. Specifically, the task instruction is first input into the agent, which then uses its built-in large language model to understand and reason, outputting an initial generated response. Subsequently, this initial generated response is executed in a simulated or real task execution environment to verify its actual effect, and the execution results (especially problems or failure information described in natural language) are re-inputted into the agent as key feedback. Based on this feedback, the agent uses its large-scale model's reflection and correction capabilities to analyze and optimize the initial generated response, generating a higher-quality improved generated response. This improved generated response is then executed again in the task execution environment to verify whether the optimization effectively solves the previous problems. Finally, the system integrates the actual execution results of the original task instruction and the improved generated response, calculates multi-dimensional performance rewards, and uses these rewards to guide the targeted optimization of the large-scale agent model's parameters. This process is equivalent to achieving closed-loop, self-supervised training of the large-scale agent model without external manual annotation, guided by the actual optimization effect of task execution. Through continuous application and the constant cycle of task execution and large model parameter updates, the agent can autonomously accumulate experience in dealing with various scenarios and dynamically adjust its internal generation strategy. This significantly improves the accuracy of output results, adaptability to changing tasks, and overall robustness of the system in complex and dynamic real-world environments.

[0086] The agent large model optimization method disclosed in the embodiment shown in Figure 1 of this specification can be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above method can be completed by the integrated logic circuit in the processor or by software instructions. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of this application can be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software module can reside in a storage medium well established in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory, and the processor reads information from the memory and completes the steps of the above method in combination with its hardware.

[0087] Of course, in addition to software implementation, the electronic device described in this specification does not exclude other implementation methods, such as logic devices or a combination of hardware and software. In other words, the execution subject of the following processing flow is not limited to each logic unit, but can also be hardware or logic devices.

[0088] Furthermore, embodiments of this application also propose a computer program product, including a computer-readable storage medium storing one or more computer programs, the one or more computer programs including instructions.

[0089] When the aforementioned instructions are executed by a portable electronic device that includes multiple applications, they enable the portable electronic device to perform the steps of the method shown in Figure 1, including: inputting the task instructions into the agent to obtain the preliminary generated response output by the agent.

[0090] The preliminary generated response is executed in the task execution environment to obtain the execution result of the preliminary generated response.

[0091] The execution result of the initial generated response is input into the agent, which then optimizes the initial generated response based on the execution result to obtain an improved generated response.

[0092] The improved generated response is executed in the task execution environment to obtain the execution result of the improved generated response.

[0093] Based on the task instructions and the execution result of the improved generated response, the parameters of the large model carried by the agent are optimized.

[0094] Those skilled in the art will understand that the embodiments of this specification can be provided as methods, systems, or computer program products. Therefore, this specification may take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this specification may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0095] The foregoing has described specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are possible or may be advantageous.

[0096] The above are merely embodiments of this specification and are not intended to limit the scope of this specification. Various modifications and variations can be made to this specification by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this specification should be included within the scope of the claims of this specification. Furthermore, all other embodiments obtained by those skilled in the art without inventive effort should fall within the protection scope of this document.

Claims

1. An agent large model optimization method, characterized in that it comprises: inputting task instructions to an agent to obtain a preliminary generated response output by the agent; executing the preliminary generated response in a task execution environment to obtain an execution result of the preliminary generated response; inputting the execution result of the preliminary generated response into the agent, so that the agent optimizes the preliminary generated response based on the execution result of the preliminary generated response to obtain an improved generated response; executing the improved generated response in the task execution environment to obtain an execution result of the improved generated response; and optimizing the parameters of the large model carried by the agent based on the task instructions and the execution result of the improved generated response.

2. The method according to claim 1, characterized in that there are multiple agents, each carrying its own corresponding large model, and each agent independently generates a preliminary generated response based on the task instructions and independently optimizes its own preliminary generated response.

3. The method according to claim 2, characterized in that, before the improved generated responses are executed in the task execution environment, the method further includes: calculating the semantic similarity between every two of the improved generated responses; and, if there are two target improved generated responses whose semantic similarity is below a preset threshold, inputting the first target improved generated response into the agent that generated the second target improved generated response, so that this agent integrates the semantic content of the first target improved generated response and adjusts the second target improved generated response.
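The pairwise check in claim 3 can be sketched as follows. The claim does not specify how semantic similarity is computed; as a loudly labeled assumption, this sketch uses bag-of-words cosine similarity as a simple stand-in, and the names `cosine_similarity` and `pairs_needing_alignment` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (an assumed proxy for semantic similarity)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * vb[word] for word, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairs_needing_alignment(
    responses: Dict[str, str], threshold: float
) -> List[Tuple[str, str]]:
    """Return (first_agent, second_agent) pairs whose improved responses
    have similarity below the preset threshold."""
    return [
        (first, second)
        for first, second in combinations(responses, 2)
        if cosine_similarity(responses[first], responses[second]) < threshold
    ]
```

For each returned pair, the first agent's improved response would be passed to the second agent so that it can adjust its own response; that adjustment step depends on the agent and is not modeled here.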

4. The method according to claim 1, characterized in that optimizing the parameters of the large model carried by the agent based on the task instructions and the execution result of the improved generated response includes: for each agent, calculating a comprehensive reward value corresponding to its improved generated response, wherein the comprehensive reward value is positively correlated with the value of a specified parameter, the specified parameter including at least one of the following: the degree of matching between the improved generated response to which the comprehensive reward value belongs and the expected task objective; the semantic similarity between that improved generated response and the improved generated responses generated by other agents; and the degree of contextual consistency between that improved generated response and the historical generated content of the agent to which it belongs; and optimizing the parameters of the large model carried by each agent by maximizing the comprehensive reward value corresponding to that agent.
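One simple way to obtain a reward that is positively correlated with each specified parameter is a weighted sum with non-negative weights. The sketch below assumes this form; the claim itself does not fix the formula, and `RewardWeights`, `comprehensive_reward`, and the default weight values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical weights; non-negative values keep the reward
    non-decreasing in every component."""
    task_match: float = 0.5
    peer_similarity: float = 0.3
    context_consistency: float = 0.2


def comprehensive_reward(
    task_match: float,
    peer_similarity: float,
    context_consistency: float,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of the three specified parameters (assumed form)."""
    if min(weights.task_match, weights.peer_similarity, weights.context_consistency) < 0:
        raise ValueError("weights must be non-negative")
    return (
        weights.task_match * task_match
        + weights.peer_similarity * peer_similarity
        + weights.context_consistency * context_consistency
    )
```

Setting a weight to zero drops the corresponding component, which matches the "at least one of" wording of the claim. The parameter update that maximizes this value is left out of the sketch.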

5. The method according to claim 1, characterized in that the method further includes: recording state information of the agent, the state information including the parameters of the large model carried by the agent after each parameter optimization; and, while the agent is generating the preliminary generated response or optimizing the preliminary generated response, if a preset backtracking trigger condition is met, optimizing the parameters of the large model carried by the agent based on the parameters after the previous optimization recorded in the state information.

6. The method according to claim 5, characterized in that the backtracking trigger condition belongs to at least one of the following backtracking trigger types: a quality assessment value calculated based on the execution result of the improved generated response is lower than a preset quality threshold; the execution results of the improved generated responses output by the agent in a preset number of consecutive rounds fail to achieve the expected goal; the change in the parameters of the large model carried by the agent during one round of parameter optimization exceeds a preset stability threshold.

7. The method according to claim 6, characterized in that optimizing the parameters of the large model carried by the agent based on the parameters after the previous optimization recorded in the state information includes: determining, based on the backtracking trigger type to which the backtracking trigger condition belongs, a matching parameter adjustment amount for the large model; adjusting, according to the parameter adjustment amount, the parameters after the previous optimization recorded in the state information to obtain target parameters; and adjusting the parameters of the large model carried by the agent to the target parameters.
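The backtracking logic of claims 5 to 7 can be sketched as below. The claims do not say how the parameter adjustment amount is applied, so this sketch makes a labeled assumption: each trigger type maps to a fraction of the rejected parameter change that is kept on top of the recorded parameters, with 0.0 meaning a full rollback. `Trigger`, `KEEP_FRACTION`, `detect_trigger`, `backtrack`, and all numeric values are hypothetical.

```python
from enum import Enum
from typing import Dict, List, Optional


class Trigger(Enum):
    LOW_QUALITY = "low_quality"            # quality value below the preset threshold
    REPEATED_FAILURE = "repeated_failure"  # preset number of consecutive failed rounds
    UNSTABLE_UPDATE = "unstable_update"    # parameter change above the stability threshold


# Assumed adjustment amounts: fraction of the rejected change that is kept.
KEEP_FRACTION: Dict[Trigger, float] = {
    Trigger.LOW_QUALITY: 0.5,
    Trigger.REPEATED_FAILURE: 0.0,
    Trigger.UNSTABLE_UPDATE: 0.25,
}


def detect_trigger(
    quality: float,
    quality_threshold: float,
    consecutive_failures: int,
    max_failures: int,
    param_change: float,
    stability_threshold: float,
) -> Optional[Trigger]:
    """Return the first matching trigger type, or None if no condition is met."""
    if param_change > stability_threshold:
        return Trigger.UNSTABLE_UPDATE
    if consecutive_failures >= max_failures:
        return Trigger.REPEATED_FAILURE
    if quality < quality_threshold:
        return Trigger.LOW_QUALITY
    return None


def backtrack(recorded: List[float], current: List[float], trigger: Trigger) -> List[float]:
    """Adjust the recorded parameters by the trigger-specific amount
    to obtain the target parameters."""
    keep = KEEP_FRACTION[trigger]
    return [r + keep * (c - r) for r, c in zip(recorded, current)]
```

The order in which `detect_trigger` checks the conditions is also an assumption; the claims only require that the condition belong to at least one of the three types.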

8. An agent large model optimization apparatus, characterized in that the apparatus includes: a task processing module, used to input task instructions to an agent to obtain a preliminary generated response output by the agent; a first verification module, used to execute the preliminary generated response in a task execution environment to obtain an execution result of the preliminary generated response; a response improvement module, used to input the execution result of the preliminary generated response into the agent, so that the agent optimizes the preliminary generated response based on that execution result to obtain an improved generated response; a second verification module, used to execute the improved generated response in the task execution environment to obtain an execution result of the improved generated response; and a model training module, used to optimize the parameters of the large model carried by the agent based on the task instructions and the execution result of the improved generated response.

9. An electronic device, comprising: a processor; and a memory arranged to store computer-executable instructions, characterized in that the executable instructions, when executed, cause the processor to perform the method according to any one of claims 1 to 7.

10. A computer program product comprising a computer-readable storage medium having a computer program stored thereon, characterized in that the computer program is operable to cause a computer to perform the method according to any one of claims 1 to 7.