Multi-agent collaborative training method and device, equipment and readable storage medium

By deconstructing and optimizing the multi-agent processing flow, and generating overall agent rewards, the problem of poor scenario adaptability of multi-agent collaborative training technology is solved, achieving stable and efficient training results under different application scenarios.

CN122242557A · Pending · Publication Date: 2026-06-19 · GUANGZHOU QUWAN NETWORK TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: GUANGZHOU QUWAN NETWORK TECH CO LTD
Filing Date: 2026-03-23
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing multi-agent collaborative training technologies have poor scenario adaptability and cannot take into account the training effect under different application forms. As a result, it is difficult to maintain stable training effect and reasonable resource consumption in scenarios with different task complexity and system operating resources.

Method used

By acquiring collaborative training samples of the training target and multi-agent processing flow, the multi-agent processing flow in each sample is decomposed, the processing sub-tasks and ideas of each training agent are determined, data distillation and collaborative optimization are performed, and the overall reward of the agent is generated until the preset conditions are met, thereby realizing the fine-tuning training of the target agent.

Benefits of technology

It enhances the scenario adaptability of multi-agent collaborative training, making it suitable for different deployment environments, improving the stability and efficiency of training, and ensuring the collaborative consistency and execution efficiency of small-parameter agents in complex scenarios.

Generated by Eureka AI based on patent content.


Abstract

This application discloses a multi-agent collaborative training method, apparatus, device, and readable storage medium. The method includes acquiring training targets and collaborative training samples containing corresponding collaborative training tasks and corresponding multi-agent processing flows; when data distillation is required, performing data distillation on each target agent based on the processing sub-tasks, processing ideas, and processing results of each training agent in different multi-agent processing flows until a preset stopping condition is reached; when collaborative optimization is required, using each target agent to process each collaborative training task sequentially to generate task execution data; generating an overall agent reward based on the task execution data of the same collaborative training task and the multi-agent processing flow; and fine-tuning training each target agent based on the overall agent rewards until a stopping condition is reached. Therefore, this method can adapt to the collaborative training needs of different deployment environments, improving the scenario adaptability of the collaborative training in this application.

Description

Technical Field

[0001] This application relates to the field of intelligent agent technology, and more specifically, to a multi-agent collaborative training method, apparatus, device, and readable storage medium.

Background Technology

[0002] Multi-agent collaborative training technology has become an important research direction in the field of intelligent decision-making. However, most existing training schemes are designed for specific task scenarios and operating environments, and have significant limitations in actual deployment and application.

[0003] Because existing collaborative training mechanisms are generally designed for specific task scenarios, they suffer from insufficient scenario adaptability, only meeting training needs under a single scenario. In scenarios with varying task complexity, system resources, and deployment environments, they struggle to maintain stable training results and reasonable resource consumption. To address the poor scenario adaptability and inability to balance training effectiveness across different application scenarios in existing multi-agent collaborative training technologies, this application aims to provide a multi-agent collaborative training method that improves the scenario adaptability and operational stability of the training framework, enabling efficient and stable collaborative training across various application scenarios.

Summary of the Invention

[0004] In view of this, this application provides a multi-agent collaborative training method, apparatus, device and readable storage medium to solve the shortcomings of existing multi-agent collaborative training technologies, such as poor scenario adaptability and inability to take into account training effects under different application forms.

[0005] To achieve the above objectives, the following solution is proposed:

[0006] A multi-agent collaborative training method, comprising:

[0007] Obtain the training target and multiple collaborative training samples containing the corresponding collaborative training tasks and the corresponding multi-agent processing flow;

[0008] When the training target indicates that data distillation is to be performed on the multiple target agents participating in the training, the multi-agent processing flow in each collaborative training sample is decomposed to determine the processing sub-tasks, processing ideas and processing results of each training agent involved in each multi-agent processing flow. Based on the processing sub-tasks, processing ideas and processing results of each training agent in different multi-agent processing flows, data distillation is performed on each target agent until a preset first stopping condition is reached.

[0009] When the training target indicates that the multiple target agents participating in the training are to be collaboratively optimized, each target agent processes each collaborative training task in turn to generate task execution data; based on the task execution data corresponding to the same collaborative training task and the multi-agent processing flow, the overall agent reward corresponding to each collaborative training task is generated; based on the overall agent reward, each target agent is fine-tuned until a preset second stopping condition is reached.

[0010] Optionally, the data distillation of each target agent based on the processing sub-tasks and processing results of each training agent in different multi-agent processing flows includes:

[0011] Based on the processing sub-tasks, processing ideas, and processing results corresponding to the same training agent, multiple sample data corresponding to that training agent are generated.

[0012] Each training agent is paired with each target agent to determine all the sample data corresponding to each target agent, and data distillation is performed on the target agent using the sample data corresponding to each target agent.

[0013] Optionally, the step of performing data distillation on the target agent using the sample data corresponding to each target agent includes:

[0014] For each target agent, all real processing tasks of the target agent are transferred to a snapshot version of the target agent; each processing sub-task of the target agent is sequentially input into the target agent to obtain training processing data containing a processing strategy and a computed result; each processing idea is compared with its corresponding processing strategy to generate a process reward; each processing result is compared with its corresponding computed result to generate a result reward; and the parameters of the target agent are adjusted based on the process reward and result reward corresponding to each sample data.

[0015] Optionally, the step of generating the overall agent reward corresponding to each collaborative training task based on task execution data and multi-agent processing flow corresponding to the same collaborative training task includes:

[0016] Extract the task execution process and the task execution result from each task execution data, and extract the execution idea and the final execution result from each multi-agent processing flow;

[0017] Each task execution process is compared with the corresponding execution idea to evaluate how concise the task execution process is and to generate a process reward;

[0018] Each task execution result is compared with the corresponding final execution result to evaluate the effectiveness of the task execution result and to generate a result reward;

[0019] The process reward and the result reward of each collaborative training task are combined to generate the overall agent reward for that collaborative training task.

[0020] Optionally, comparing each task execution process with the corresponding execution idea, evaluating how concise the task execution process is, and generating a process reward includes:

[0021] Using an AI model, each task execution process is compared with the corresponding execution idea to evaluate the conciseness of the tool calls, the rationality of the process logic, and the conciseness of the steps in the task execution process, and to generate a process reward.

[0022] Optionally, comparing each task execution result with the corresponding final execution result to evaluate the effectiveness of the task execution result and generate a result reward includes:

[0023] Using an AI model, the execution result of each task is compared with the corresponding final execution result to evaluate the intuitiveness of the task execution result, the richness of the display method, and the matching degree with the corresponding collaborative training task, and to generate a result reward.
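The criteria listed in [0021] and [0023] can be pictured as a rubric whose per-criterion scores are aggregated into a single reward. The following is a minimal sketch only. It assumes that the AI model returns a score in [0, 1] for each criterion and that the criteria are weighted equally; neither assumption is stated in this application, and the names `PROCESS_CRITERIA`, `RESULT_CRITERIA`, and `rubric_reward` are hypothetical.

```python
from typing import Mapping, Sequence

# Hypothetical criterion keys mirroring the aspects listed in [0021] and [0023].
PROCESS_CRITERIA = ("tool_call_conciseness", "logic_rationality", "step_conciseness")
RESULT_CRITERIA = ("intuitiveness", "presentation_richness", "task_match")


def rubric_reward(scores: Mapping[str, float], criteria: Sequence[str]) -> float:
    """Average the per-criterion scores (assumed to lie in [0, 1])."""
    missing = [name for name in criteria if name not in scores]
    if missing:
        raise KeyError(f"missing criterion scores: {missing}")
    # Equal weighting is an assumption; the application does not fix the weights.
    return sum(scores[name] for name in criteria) / len(criteria)
```

In such a sketch the same function serves both rewards: it would be called once with the process criteria and once with the result criteria.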

[0024] A multi-agent collaborative training method, comprising:

[0025] Obtain multiple first training samples containing collaborative training tasks and multi-agent processing flows;

[0026] Each heavyweight agent processes each collaborative training task in turn to generate task execution data; based on the task execution data corresponding to the same collaborative training task and the multi-agent processing flow, the overall agent reward for each collaborative training task is generated; based on the overall agent reward, each heavyweight agent is fine-tuned until the stopping condition is met;

[0027] Each fine-tuned heavyweight agent processes each collaborative training task in turn to generate second training samples containing the collaborative training task and the task execution flow;

[0028] Decompose each task execution flow to determine the processing sub-tasks, processing ideas, and processing results of each heavyweight agent involved in that task execution flow; based on the processing sub-tasks, processing ideas, and processing results of the different heavyweight agents, perform data distillation on each lightweight agent until a preset stopping condition is reached.
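The method of [0024] to [0028] chains the two training modes: the heavyweight agents are fine-tuned first, and their execution flows then serve as distillation data for the lightweight agents. The ordering can be sketched as follows. This is an illustration under assumptions: the three callables are hypothetical stand-ins for the fine-tuning, execution, and distillation procedures, and samples are simplified to pairs of strings.

```python
from typing import Callable, List, Sequence, Tuple

# (collaborative training task, processing flow or execution flow)
Sample = Tuple[str, str]


def two_stage_training(
    first_samples: Sequence[Sample],
    fine_tune_heavy: Callable[[Sequence[Sample]], None],
    run_heavy: Callable[[str], str],
    distill_light: Callable[[Sequence[Sample]], None],
) -> List[Sample]:
    """Fine-tune heavyweight agents, regenerate samples, then distill."""
    # Stage 1: reward-based fine-tuning of the heavyweight agents.
    fine_tune_heavy(first_samples)
    # Stage 2: the fine-tuned agents re-run every task to produce second samples.
    second_samples = [(task, run_heavy(task)) for task, _flow in first_samples]
    # Stage 3: the lightweight agents are distilled from the second samples.
    distill_light(second_samples)
    return second_samples
```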

[0029] A multi-agent collaborative training device, comprising:

[0030] The acquisition module is used to acquire the training target and multiple collaborative training samples containing the corresponding collaborative training tasks and the corresponding multi-agent processing flow.

[0031] The decomposition module is used to decompose the multi-agent processing flow in each collaborative training sample when the training target indicates that data distillation is to be performed on the multiple target agents participating in the training, and to determine the processing sub-tasks, processing ideas and processing results of each training agent involved in each multi-agent processing flow; based on the processing sub-tasks, processing ideas and processing results of each training agent in different multi-agent processing flows, data distillation is performed on each target agent until a preset first stopping condition is reached.

[0032] The generation module is used to generate task execution data by having each target agent process each collaborative training task in turn when the training target indicates that the multiple target agents participating in the training are to be collaboratively optimized; to generate, based on the task execution data corresponding to the same collaborative training task and the multi-agent processing flow, the overall agent reward corresponding to each collaborative training task; and to fine-tune each target agent based on the overall agent reward until a preset second stopping condition is reached.

[0033] A multi-agent collaborative training device, comprising a memory and a processor;

[0034] The memory is used to store programs;

[0035] The processor is used to execute the program to implement each step of the above-described multi-agent collaborative training method.

[0036] A readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the various steps of the multi-agent collaborative training method described above.

[0037] As can be seen from the above technical solution, the multi-agent collaborative training method provided in this application obtains a training target and multiple collaborative training samples, each containing a collaborative training task and the corresponding multi-agent processing flow. When the training target indicates data distillation, this application decomposes the multi-agent processing flow in each collaborative training sample to determine the processing sub-tasks, processing ideas, and processing results of each training agent involved in each multi-agent processing flow. Based on the processing sub-tasks, processing ideas, and processing results of each training agent in different multi-agent processing flows, data distillation is performed on each target agent until a preset first stopping condition is reached. By decomposing multiple multi-agent processing flows in this way, the target agents, which act as student models, can learn from the training agents, which act as teacher models, the abstract collaborative processing logic, the interaction methods, and the techniques and know-how for decomposing large tasks through the processing sub-tasks, processing ideas, and processing results. The application thereby provides explicit collaborative strategies for each target agent acting as a student model, so that target agents with a small number of parameters can develop problem-solving strategies for complex scenarios after collaborative training. This makes the application suitable for collaborative training of small-parameter agents where system resources are limited. When the training target indicates that the multiple target agents participating in the training are to be collaboratively optimized, each target agent processes each collaborative training task in turn to generate task execution data. Based on the task execution data corresponding to the same collaborative training task and the multi-agent processing flow, an overall agent reward is generated for each collaborative training task, and each target agent is fine-tuned based on the overall agent rewards until a preset second stopping condition is reached. In collaborative fine-tuning, this application therefore uses the training samples as a reward reference for the training status of the target agents, and the overall reward guides each target agent to attend to the overall collaboration. Specifically, when the collaborative logic of a training sample is superior, the overall reward guides each target agent to learn the collaborative logic of that training sample. Conversely, when the collaborative logic of the target agents is superior, the overall reward guides each target agent to keep its current collaborative logic when handling the corresponding collaborative training task, thereby improving the collaborative consistency and execution efficiency of the multiple agents. By providing different collaborative training methods for different training requirements, this application solves the problem of a single approach being unable to match different training needs. For data distillation, it takes the specific processing ideas of the multi-agent processing flow as the target, so that each target agent learns a specific collaborative logic. For collaborative fine-tuning, it takes the multi-agent processing flow as a reference, so that each target agent can choose the better collaborative logic. Thus, this application can adapt to the collaborative training needs of different deployment environments, improving its scenario adaptability for collaborative training.

Attached Figure Description

[0038] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of this application. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0039] Figure 1 is a flowchart of a multi-agent collaborative training method disclosed in an embodiment of this application;

[0040] Figure 2 is a structural block diagram of a multi-agent collaborative training device disclosed in an embodiment of this application;

[0041] Figure 3 is a flowchart of another multi-agent collaborative training method disclosed in an embodiment of this application;

[0042] Figure 4 is a hardware structure block diagram of a multi-agent collaborative training device disclosed in an embodiment of this application.

Detailed Implementation

[0043] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0044] This application provides a multi-agent collaborative training method, which can be applied to various multi-agent training systems or multi-agent management systems, as well as to various computer terminals or smart terminals. The executing entity can be the processor or server of the computer terminal or smart terminal.

[0045] Next, the multi-agent collaborative training method of this application is described in detail with reference to Figure 1. It includes the following steps:

[0046] Step S1: Obtain the training target and multiple collaborative training samples containing the corresponding collaborative training tasks and the corresponding multi-agent processing flow.

[0047] Specifically, the training target can be obtained in response to user input.

[0048] The training target may include the target identifier of each target agent participating in the training, as well as the training type.

[0049] The training type indicates what the training is meant to achieve, for example data distillation or fine-tuning optimization.

[0050] Each target agent trained using the data distillation method can be a lightweight agent, i.e., an agent with few parameters, while each target agent trained using the fine-tuning optimization method can be a heavyweight agent, i.e., an agent with many parameters.
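As an illustration of [0048] to [0050], a training target carrying target identifiers and a training type could be represented and routed to step S2 or step S3 as sketched below. The names `TrainingType`, `TrainingTarget`, and `select_step` are hypothetical and are not taken from this application.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class TrainingType(Enum):
    DATA_DISTILLATION = "data_distillation"
    FINE_TUNING_OPTIMIZATION = "fine_tuning_optimization"


@dataclass(frozen=True)
class TrainingTarget:
    target_agent_ids: Tuple[str, ...]  # identifiers of the participating target agents
    training_type: TrainingType


def select_step(target: TrainingTarget) -> str:
    """Return the step that handles this training target."""
    if target.training_type is TrainingType.DATA_DISTILLATION:
        return "S2"  # data distillation of lightweight target agents
    return "S3"      # collaborative fine-tuning of heavyweight target agents
```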

[0051] Collaborative training tasks can come from a variety of different business scenarios, such as data prediction scenarios and data aggregation scenarios.

[0052] The training agents included in different collaborative training samples can be different. For example, a data prediction scenario can include a coordinator (Coordinator Agent), a sales data assistant (Sale_Data_QA Agent), a senior data analyst (Data_AnalystX Agent), a data visualization agent (Data_Visualization Agent), and a data reporting agent (Data_ReportX Agent).

[0053] A data aggregation scenario can include a data agent coordinator (Coordinator Agent), a sales data intelligent assistant (Sale_Data_QA Agent), a data visualization expert (Data_Visualization Agent), and an enterprise data reporting expert (Data_ReportX Agent).

[0054] The Coordinator Agent can invoke the send_message method through its custom tool to schedule the Sale_Data_QA Agent, Data_AnalystX Agent, Data_Visualization Agent, and Data_ReportX Agent.
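The scheduling described in [0054] can be pictured with a toy coordinator. This application names the send_message method but does not give its signature, so the signature, the `register` helper, and the `trace` attribute below are assumptions made purely for illustration.

```python
from typing import Callable, Dict, List, Tuple


class Coordinator:
    """Toy coordinator that dispatches messages to registered sub-agents."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[str], str]] = {}
        # Each entry is (agent name, message sent, reply received).
        self.trace: List[Tuple[str, str, str]] = []

    def register(self, agent_name: str, handler: Callable[[str], str]) -> None:
        """Make a sub-agent reachable under the given name (hypothetical helper)."""
        self._handlers[agent_name] = handler

    def send_message(self, agent_name: str, message: str) -> str:
        """Forward a message to one sub-agent and record the exchange."""
        if agent_name not in self._handlers:
            raise KeyError(f"unknown agent: {agent_name}")
        reply = self._handlers[agent_name](message)
        self.trace.append((agent_name, message, reply))
        return reply
```

Recording every exchange in `trace` is one simple way to obtain the invocation order and returned results that a multi-agent processing flow needs.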

[0055] The multi-agent processing flow may differ under different collaborative training tasks within the same business scenario.

[0056] For example, when the collaborative training task is to analyze the current June sales data to predict the July sales data, the Coordinator Agent can first receive and parse the data, and then schedule the Sale_Data_QA Agent to execute the task. With the support of three types of tools, mysql_mcp, math_mcp, and sandbox_mcp, the Sale_Data_QA Agent completes the query of June sales data and basic insights. After the processing results are sent back to the Coordinator Agent, they are uniformly output as AgentOutput, and AgentTrace and AgentSFT Samples are generated for process recording and model training.

[0057] As another example, when the collaborative training task is to analyze June sales data from the previous five years to predict this year's June sales data, at least two Sale_Data_QA Agent instances are scheduled to query the June sales data of different years in parallel. The Sale_Data_QA Agent uses tools such as mysql_mcp, math_mcp, sandbox_mcp, and chart_mcp to produce initial insights. The Coordinator Agent then passes the results to the Data_AnalystX Agent for multi-dimensional in-depth analysis and model validation. The analysis results are passed in turn to the Data_Visualization Agent, which generates professional charts, and to the Data_ReportX Agent, which packages them into an HTML report. Finally, the Coordinator Agent summarizes the results and outputs them as AgentOutput, while also generating AgentTrace and AgentSFT Samples for process tracing and model fine-tuning.

[0058] An AI model can be used to score the results of multiple historical collaborative tasks, and the historical tasks whose result reward scores exceed a preset first threshold are selected as initial training samples. An AI model can then be used to score the process of each initial training sample, and the initial training samples whose process reward scores exceed a preset second threshold are used to generate the collaborative training samples.
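The two-stage selection in [0058] can be sketched as follows. The scoring callables stand in for the AI model, whose interface is not specified in this application; the function and parameter names are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_training_samples(
    history: Sequence[T],
    result_score: Callable[[T], float],
    process_score: Callable[[T], float],
    first_threshold: float,
    second_threshold: float,
) -> List[T]:
    """Filter historical tasks by result score, then by process score."""
    # Stage 1: keep tasks whose result reward score exceeds the first threshold.
    initial = [h for h in history if result_score(h) > first_threshold]
    # Stage 2: of those, keep the ones whose process reward score exceeds the second.
    return [h for h in initial if process_score(h) > second_threshold]
```

Scoring the process only for tasks that already passed the result filter avoids spending model calls on samples that would be discarded anyway.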

[0059] A multi-agent processing flow characterizes the running trajectory and output results of the multiple agents.

[0060] Specifically, this can include information such as the order in which agents are invoked, the results returned by agents, and the data from tool calls.

[0061] The tool call data may also include the tool call parameters, the SQL executed by the call, and the call results.

[0062] Multi-dimensional data such as Agent Name, Query Task, Task Desc, Tool Name, Tool Arguments, Call Status, Success Result, Failure Result, Error Reason, End Reply, and Reply content can be extracted to construct the corresponding multi-agent processing flow.
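One possible in-memory shape for a step of such a flow, built from the field names listed in [0062], is sketched below. The field types, the defaults, and the helper `invocation_order` are assumptions; this application lists the fields but not their representation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class FlowStep:
    """One recorded step of a multi-agent processing flow (types are assumed)."""
    agent_name: str
    query_task: str
    task_desc: str = ""
    tool_name: Optional[str] = None
    tool_arguments: Dict[str, Any] = field(default_factory=dict)
    call_status: Optional[str] = None
    success_result: Optional[str] = None
    failure_result: Optional[str] = None
    error_reason: Optional[str] = None
    end_reply: bool = False
    reply_content: Optional[str] = None


def invocation_order(steps: List[FlowStep]) -> List[str]:
    """Return agent names in the order they first appear in the flow."""
    order: List[str] = []
    for step in steps:
        if step.agent_name not in order:
            order.append(step.agent_name)
    return order
```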

[0063] Step S2: When the training target indicates that data distillation is to be performed on the multiple target agents participating in the training, the multi-agent processing flow in each collaborative training sample is decomposed to determine the processing sub-tasks, processing ideas and processing results of each training agent involved in each multi-agent processing flow. Based on the processing sub-tasks, processing ideas and processing results of each training agent in different multi-agent processing flows, data distillation is performed on each target agent until the preset first stopping condition is reached.

[0064] Specifically, semantic analysis can be used to identify whether the training target indicates data distillation of target agents that are small-parameter models, or collaborative logic optimization of target agents that are large-parameter models.

[0065] The decomposition can be achieved by analyzing, in each multi-agent processing flow, the overall processing idea of the data agent coordinator, the processing sub-tasks assigned to each sub-agent, the processing idea of each training agent, the processing result of each sub-agent, and the overall processing result that the data agent coordinator outputs by integrating the processing results of the sub-agents.

[0066] Each training agent includes a data agent coordinator and various sub-agents.

[0067] Therefore, the processing subtasks corresponding to the same multi-agent processing flow may include the corresponding collaborative training task and the processing subtasks assigned to each sub-agent by the data agent coordinator.

[0068] Each sub-agent's processing sub-task can include its input and output information.

[0069] The processing ideas corresponding to the same multi-agent processing flow can include the overall task processing idea of the data agent coordinator and the processing idea for each sub-task.

[0070] The processing results corresponding to the same multi-agent processing flow may include the overall processing result output by the data agent coordinator integrating the processing results of each sub-agent, as well as the processing result of each sub-agent.
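The decomposition described in [0065] to [0070] separates the coordinator's overall record from the records of the sub-agents. A minimal sketch, assuming each flow has already been reduced to one record per agent and that the coordinator is identified by name (both are assumptions, as are the names `AgentRecord` and `decompose_flow`):

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class AgentRecord:
    agent: str
    sub_task: str  # for the coordinator: the collaborative training task itself
    idea: str      # processing idea
    result: str    # processing result


def decompose_flow(
    flow: Sequence[AgentRecord], coordinator: str = "Coordinator Agent"
) -> Tuple[Optional[AgentRecord], List[AgentRecord]]:
    """Split a flow into the coordinator's overall record and sub-agent records."""
    overall: Optional[AgentRecord] = None
    sub_agents: List[AgentRecord] = []
    for record in flow:
        if record.agent == coordinator:
            # If the coordinator appears more than once, the last record wins.
            overall = record
        else:
            sub_agents.append(record)
    return overall, sub_agents
```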

[0071] Using the processing sub-tasks, processing ideas, and processing results of each training agent in different multi-agent processing flows as sample data, each target agent can be trained until each target agent converges.

[0072] During the iteration process, each target agent can be trained independently or in parallel.

[0073] Step S3: When the training target indicates that the multiple target agents participating in the training are to be collaboratively optimized, each target agent processes each collaborative training task in turn to generate task execution data; based on the task execution data corresponding to the same collaborative training task and the multi-agent processing flow, the overall agent reward corresponding to each collaborative training task is generated; based on the overall agent reward, each target agent is fine-tuned until the preset second stopping condition is reached.

[0074] Specifically, by treating each target agent as a whole and processing each collaborative training task sequentially, task execution data for each target agent can be obtained.

[0075] Task execution data may include the overall task processing ideas of the data agent coordinator and the final processing results of the data agent coordinator, but may not include the input and output information of each sub-agent.

[0076] Based on the task execution data and multi-agent processing flow corresponding to the same collaborative training task, the overall process reward and overall result reward for each collaborative training task can be generated; based on the overall process reward and overall result reward, fine-tuning training can be performed on each target agent until each target agent converges.
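This application states that the process reward and the result reward are combined into an overall reward but does not fix the combination rule. The sketch below assumes a convex weighted sum purely for illustration; the function name and the `process_weight` parameter are hypothetical.

```python
def overall_agent_reward(
    process_reward: float, result_reward: float, process_weight: float = 0.5
) -> float:
    """Combine the two rewards with a convex weight (an assumed rule)."""
    if not 0.0 <= process_weight <= 1.0:
        raise ValueError("process_weight must lie in [0, 1]")
    return process_weight * process_reward + (1.0 - process_weight) * result_reward
```

Because a single scalar is shared by all target agents for a given task, every agent is pushed toward the same overall outcome rather than toward its own local score.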

[0077] During training, a specific target agent may call upon other target agents to complete the task assigned to it. The calling target agent can be scored and fine-tuned based on the processing results and processing ideas of other tasks similar to the assigned task.

[0078] Specifically, during the iteration process, some target agents may converge while others do not. In this case, fine-tuning of the converged target agents can be stopped, and the non-converged target agents continue to be fine-tuned based on the overall process reward and the overall result reward until all target agents converge.
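The selective stopping described in [0078] can be sketched as a loop over the set of agents that have not yet converged. The callables `fine_tune_step` and `has_converged` and the `max_rounds` guard are hypothetical; this application does not specify the convergence test.

```python
from typing import Callable, Iterable


def fine_tune_until_converged(
    agent_ids: Iterable[str],
    fine_tune_step: Callable[[str], None],
    has_converged: Callable[[str], bool],
    max_rounds: int = 100,
) -> int:
    """Fine-tune only the agents that have not yet converged; return rounds used."""
    active = set(agent_ids)
    rounds = 0
    while active and rounds < max_rounds:
        for agent_id in sorted(active):
            fine_tune_step(agent_id)
        # Converged agents are dropped and receive no further updates.
        active = {a for a in active if not has_converged(a)}
        rounds += 1
    return rounds
```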

[0079] As can be seen from the above technical solution, the multi-agent collaborative training method provided in this application can obtain training targets and multiple collaborative training samples containing corresponding collaborative training tasks and corresponding multi-agent processing flows. Subsequently, if data distillation is performed, this application can decompose the multi-agent processing flow in each collaborative training sample to determine the processing sub-tasks, processing ideas, and processing results of each training agent involved in each multi-agent processing flow. Based on the processing sub-tasks, processing ideas, and processing results of each training agent in different multi-agent processing flows, data distillation is performed on each target agent until a preset first stopping condition is reached. Thus, this application can decompose multiple multi-agent processing flows so that the target agent belonging to the student model can learn the abstract collaborative processing logic, interaction methods, and large-scale task decomposition techniques and intelligence of the training agents belonging to the teacher model through processing sub-tasks, processing ideas, and processing results. The application employs a collaborative training technique to provide explicit collaborative strategies for each target agent belonging to the student model. This allows target agents with a small number of parameters to develop problem-solving strategies for complex scenarios after collaborative training. This makes the application applicable to collaborative training scenarios for small-parameter agents with limited system resources. When the training target representation is used to collaboratively optimize multiple target agents participating in the training, each target agent can sequentially process each collaborative training task to generate task execution data. 
Based on the task execution data corresponding to the same collaborative training task and the multi-agent processing flow, an overall reward for each agent corresponding to the collaborative training task is generated. Based on the overall rewards of each agent, each target agent is fine-tuned until a preset second stopping condition is reached. Therefore, when facing collaborative fine-tuning tasks, this application uses training samples as a reward reference for the training status of target agents, and can guide each target agent to focus on the overall collaborative situation by generating an overall reward. Specifically, when the collaborative logic of the training samples is superior, overall rewards can be used to guide each target agent to learn the corresponding collaborative logic strategy of the training samples. Furthermore, when the collaborative logic of each target agent is superior, overall rewards can be used to guide each target agent to maintain its current collaborative logic strategy when handling corresponding collaborative training tasks, thereby improving the collaborative consistency and execution efficiency of multiple agents. It is evident that this application can solve the problem of a single approach being unable to match different training needs by providing different collaborative training methods for different training requirements. When facing scenarios requiring data distillation, this application uses the specific processing ideas of the multi-agent processing flow as the target, achieving the goal of enabling each target agent to learn specific collaborative logic. When facing collaborative fine-tuning requirements, this application uses the multi-agent processing flow as a reference, achieving the goal of enabling each target agent to choose a better collaborative logic. 
Thus, this application can adapt to the collaborative training needs of different deployment environments, improving its scenario adaptability for collaborative training.
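For illustration only, the choice between the two training branches summarized above can be sketched as a simple dispatch on the training target. The enum `TrainingTarget`, the function `run_training`, and the two callables are hypothetical names introduced for this sketch; the application does not prescribe any particular interface.

```python
from enum import Enum
from typing import Callable, Dict, List


class TrainingTarget(Enum):
    # Hypothetical labels for the two training requirements described above.
    DATA_DISTILLATION = "data_distillation"
    COLLABORATIVE_OPTIMIZATION = "collaborative_optimization"


def run_training(
    target: TrainingTarget,
    samples: List[dict],
    distill: Callable[[List[dict]], str],
    fine_tune: Callable[[List[dict]], str],
) -> str:
    """Route the same collaborative training samples to one of two branches."""
    branches: Dict[TrainingTarget, Callable[[List[dict]], str]] = {
        # Distillation treats the multi-agent processing flow as the target.
        TrainingTarget.DATA_DISTILLATION: distill,
        # Collaborative optimization treats the flow as a reward reference.
        TrainingTarget.COLLABORATIVE_OPTIMIZATION: fine_tune,
    }
    return branches[target](samples)
```

The same samples feed both branches; only the role of the multi-agent processing flow (learning target versus reward reference) changes.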

[0080] In some embodiments of this application, the process of data distillation for each target agent in step S2, based on the processing sub-tasks, processing ideas, and processing results of each training agent in different multi-agent processing flows, is described in detail below:

[0081] S20. Based on the processing sub-tasks, processing ideas and processing results corresponding to the same training agent, generate multiple sample data corresponding to the training agent.

[0082] Specifically, the processing sub-tasks, processing ideas, and processing results of the same training agent and the same collaborative training task can be used to generate one of the sample data corresponding to that training agent.
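As a minimal sketch of step S20, assuming each decomposed flow is available as a list of step dictionaries with the hypothetical keys "agent", "sub_task", "idea" and "result", one sample can be built per training agent and per collaborative training task as follows. The class `SampleData` and the function `build_samples` are illustrative names, not part of the application.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SampleData:
    agent: str                    # training agent the sample belongs to
    task_id: str                  # collaborative training task it came from
    sub_tasks: List[str] = field(default_factory=list)
    ideas: List[str] = field(default_factory=list)
    results: List[str] = field(default_factory=list)


def build_samples(flows: Dict[str, List[dict]]) -> Dict[str, List[SampleData]]:
    """Build one sample per (training agent, task) from decomposed flows."""
    by_key: Dict[Tuple[str, str], SampleData] = {}
    for task_id, steps in flows.items():
        for step in steps:
            key = (step["agent"], task_id)
            sample = by_key.setdefault(key, SampleData(step["agent"], task_id))
            sample.sub_tasks.append(step["sub_task"])
            sample.ideas.append(step["idea"])
            sample.results.append(step["result"])
    # Regroup so that each training agent maps to all of its samples.
    per_agent: Dict[str, List[SampleData]] = {}
    for (agent, _task_id), sample in by_key.items():
        per_agent.setdefault(agent, []).append(sample)
    return per_agent
```

An agent that appears in several tasks therefore receives several samples, which matches the "multiple sample data" of step S20.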

[0083] S21. Pair each training agent with each target agent, determine all sample data corresponding to each target agent, and use the sample data corresponding to each target agent to perform data distillation on the target agent.

[0084] Specifically, based on the functional similarity between each training agent and each target agent, each training agent and each target agent can be paired to obtain sample data corresponding to each target agent.
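The application does not state how functional similarity is measured. As one possible sketch, assuming each agent is described by a set of capability tags (an assumption of this example), the pairing can use the Jaccard overlap of those tags. The functions `jaccard` and `pair_agents` are hypothetical.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two capability-tag sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_agents(
    training_agents: Dict[str, Set[str]],
    target_agents: Dict[str, Set[str]],
) -> Dict[str, str]:
    """Map each training agent to its most functionally similar target agent."""
    pairs: Dict[str, str] = {}
    for teacher, teacher_tags in training_agents.items():
        pairs[teacher] = max(
            target_agents,
            key=lambda student: jaccard(teacher_tags, target_agents[student]),
        )
    return pairs
```

The sample data of each training agent can then be assigned to the target agent it is paired with.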

[0085] Each target agent can be trained and its parameters optimized using sample data from each target agent.

[0086] During training, a coordinator can send each sample data point to the corresponding target agent and, based on the target agent's processing result and the original task reward, distribute the actual reward to that target agent. The parameters of the target agent can then be adjusted based on this actual reward.

[0087] Furthermore, a coordinator can be used for timed batch training.

[0088] In some embodiments of this application, an optional method is provided for data distillation of each target agent based on the processing sub-tasks and results of each training agent in different multi-agent processing flows. Through this method, the processing status of training agents paired with the target agent can be specifically used as training samples for that target agent, thereby improving the effectiveness of the training process.

[0089] In some embodiments of this application, the process of performing data distillation on the target agent using the sample data corresponding to each target agent in step S21 is described in detail, and the steps are as follows:

[0090] S210. For each target agent, all real processing tasks of the target agent are transferred to the snapshot version of the target agent; each processing sub-task of the target agent is sequentially input into the target agent to obtain training processing data of the target agent containing processing strategies and computational results; each processing idea is compared with its corresponding processing strategy to generate a process reward; each processing result is compared with its corresponding computational result to generate a result reward; based on the process reward and result reward corresponding to each sample data, the parameters of the target agent are adjusted.

[0091] Specifically, the snapshot version of an agent can be a snapshot copy of that agent, that is, a copy of the agent saved in its current state.

[0092] Training data can also include input information, output information, and the tools invoked.
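As a rough sketch of the two rewards in step S210, the comparison between a processing idea and a processing strategy can be approximated with a string similarity. The use of `difflib.SequenceMatcher` here is only a stand-in chosen for this example, since the application does not specify the comparison method, and `distillation_rewards` is a hypothetical function.

```python
from difflib import SequenceMatcher
from typing import Tuple


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a semantic comparison."""
    return SequenceMatcher(None, a, b).ratio()


def distillation_rewards(
    idea: str, strategy: str, expected_result: str, computed_result: str
) -> Tuple[float, float]:
    """Return (process reward, result reward) for one sample."""
    # Process reward: how close the student's strategy is to the teacher's idea.
    process = similarity(idea, strategy)
    # Result reward: full credit for an exact match, partial credit otherwise.
    if expected_result.strip() == computed_result.strip():
        result = 1.0
    else:
        result = similarity(expected_result, computed_result)
    return process, result
```

Both values can then be used together when adjusting the parameters of the target agent.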

[0093] The coordinator can transfer all the actual processing tasks of the target agent to the snapshot version of the target agent.

[0094] The coordinator can contain a task queue corresponding to each target agent.
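The coordinator behavior described above (one task queue per target agent, with real tasks rerouted to the snapshot version during training) can be sketched as follows. The class `Coordinator` and its method names are hypothetical and only illustrate the routing idea.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Set


class Coordinator:
    """Keeps per-agent queues and reroutes real tasks while an agent trains."""

    def __init__(self, agent_names: Iterable[str]) -> None:
        names = list(agent_names)
        self.live_queues: Dict[str, Deque[str]] = {n: deque() for n in names}
        self.snapshot_queues: Dict[str, Deque[str]] = {n: deque() for n in names}
        self.training_queues: Dict[str, Deque[str]] = {n: deque() for n in names}
        self.in_training: Set[str] = set()

    def start_training(self, agent: str) -> None:
        self.in_training.add(agent)

    def submit_real_task(self, agent: str, task: str) -> str:
        """Real tasks go to the snapshot copy while the agent is in training."""
        if agent in self.in_training:
            self.snapshot_queues[agent].append(task)
            return "snapshot"
        self.live_queues[agent].append(task)
        return "live"

    def submit_training_task(self, agent: str, sub_task: str) -> None:
        """Training sub-tasks always go to the agent being trained."""
        self.training_queues[agent].append(sub_task)
```

Keeping the training queue separate from the snapshot queue is what prevents training traffic from interfering with real business tasks.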

[0095] During training, some target agents may invoke other target agents. In this case, the coordinator can issue tasks to the corresponding target agents. Since the processing logic of these target agents may differ from that of the corresponding multi-agent processing flow, the same collaborative training task may, for some target agents, correspond to two computational results during training: one obtained by the target agent during training, and the other computed from the sample data corresponding to the collaborative training task. At this point, process rewards and result rewards can be applied to the target agent based on other similar tasks in its memory.

[0096] At this point, the overall processing result corresponding to the collaborative training task can be compared with the overall computational result obtained during training. If they are consistent, every participating target agent produced a correct output. If they are inconsistent, at least one participating target agent produced an incorrect output, although the analysis and reasoning of any individual agent may still be correct. Such an agent can have a correct output of its own and still receive negative feedback because of errors made by other agents. In that case, the memory can be searched for similar successful experiences to compare with the current situation, and under certain circumstances the negative feedback can be cancelled and positive feedback granted instead. The memory search can use vector recall to retrieve the top_k semantically similar questions. The reward is then split into a process reward and an outcome reward. If the semantics are similar, a process reward of at most 0.5 is granted. If both the semantics and the values are identical, an additional outcome reward of at most 0.5 is granted; otherwise, the outcome reward is 0. When several similar questions are recalled, the process-reward assessments are combined by a weighted average that uses semantic similarity as the weight. The final reward can therefore be a weighted average of the process reward and the outcome reward.
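One way to read the memory-based reward described above is sketched below. The 0.5 caps come from the text; the similarity threshold `min_sim`, the default `top_k`, the `Recalled` record, and the decision to add the two capped parts (equivalent to an equal-weight average of two scores in [0, 1]) are assumptions of this sketch. Vector recall itself is not shown and is assumed to have produced the `Recalled` items.

```python
from dataclasses import dataclass
from typing import List

PROCESS_CAP = 0.5  # maximum process reward stated in the text
OUTCOME_CAP = 0.5  # maximum outcome reward stated in the text


@dataclass
class Recalled:
    similarity: float     # semantic similarity to the current case, in [0, 1]
    process_score: float  # agreement of the current reasoning with this case, in [0, 1]
    value_matches: bool   # True when the numerical result is also identical


def memory_reward(recalled: List[Recalled], top_k: int = 3, min_sim: float = 0.8) -> float:
    """Reward recovered from similar successful experiences in memory."""
    # Keep the top_k most similar cases, then drop those that are not similar enough.
    hits = sorted(recalled, key=lambda r: r.similarity, reverse=True)[:top_k]
    hits = [r for r in hits if r.similarity >= min_sim]
    total_weight = sum(r.similarity for r in hits)
    if not hits or total_weight == 0:
        return 0.0
    # Similarity-weighted average of the process assessments, scaled to the cap.
    process = PROCESS_CAP * sum(r.similarity * r.process_score for r in hits) / total_weight
    # The outcome reward is granted only when semantics and values both match.
    outcome = OUTCOME_CAP if any(r.value_matches for r in hits) else 0.0
    return process + outcome
```

With this reading, an agent whose reasoning matches a remembered success but whose values differ can recover at most 0.5, and a full match can recover up to 1.0.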

[0097] As can be seen from the above technical solution, this embodiment provides an optional method for data distillation of a target agent using the sample data corresponding to each target agent. Through this method, the actual processing tasks of the target agent can be transferred to its snapshot version. This operation prevents the training process from interfering with actual tasks, ensuring the normal operation of real business. Combining process rewards and outcome rewards, scoring is performed from both the result and process dimensions, further improving the effectiveness and accuracy of the data distillation process.

[0098] In some embodiments of this application, the process of generating the overall agent reward corresponding to each collaborative training task based on the task execution data and multi-agent processing flow corresponding to the same collaborative training task is described in detail, and the steps are as follows:

[0099] S30. Extract the task execution process and task execution results from the execution data of each task, and extract the execution ideas and final execution results from each multi-agent processing flow.

[0100] Specifically, by identifying the Agent Name, Query Task, Task Desc, Tool Name, Tool Arguments, Results, and Failures, the processing ideas and results can be extracted from the execution data of each task, thus obtaining the task execution process and results.

[0101] Likewise, by identifying the Agent Name, Query Task, Task Desc, Tool Name, Tool Arguments, Results, and Failures, the execution ideas and final execution results can be extracted from each multi-agent processing flow.
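As a minimal sketch of the extraction in step S30, assume each record of the execution data is a dictionary keyed by the field names listed above; this record layout is an assumption of the example, as is the function `split_trace`.

```python
from typing import Dict, List, Tuple

# Fields that describe what was done, as opposed to what came out of it.
PROCESS_FIELDS = ("Agent Name", "Query Task", "Task Desc", "Tool Name", "Tool Arguments")


def split_trace(records: List[Dict[str, object]]) -> Tuple[List[Dict[str, object]], object]:
    """Return (process steps, final result) for one task's execution records."""
    # Missing fields are kept as None so every step has the same shape.
    steps = [{name: record.get(name) for name in PROCESS_FIELDS} for record in records]
    final = None
    for record in records:
        # The last record with a result and no failure supplies the final result.
        if record.get("Results") is not None and not record.get("Failures"):
            final = record["Results"]
    return steps, final
```

The same function can be applied to the task execution data and to the reference multi-agent processing flow, so that both sides are compared in the same shape.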

[0102] S31. Compare the execution process of each task with the corresponding execution approach, evaluate the simplification of the task execution process, and generate process rewards.

[0103] Specifically, the simplification of each task execution process and its corresponding execution approach can be compared to determine the process reward for the corresponding task execution process.

[0104] S32. Compare the execution result of each task with the corresponding final execution result, evaluate the effectiveness of the task execution result, and generate a result reward.

[0105] Specifically, the validity of each task execution result can be compared with the corresponding final execution result to generate a result reward.

[0106] S33. Combine the process reward and result reward of each collaborative training task to generate the overall agent reward for the corresponding collaborative training task.

[0107] Specifically, the process weight can be determined according to the degree to which the process represents the overall collaborative performance;

[0108] The result weight can be determined according to the degree to which the final result represents the overall collaborative performance;

[0109] The overall agent reward for each collaborative training task can be calculated based on the process weight and process reward, and the result weight and result reward for each collaborative training task.
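The combination in step S33 can be sketched as a normalized weighted sum. The default weights below are placeholders chosen for the example; the application leaves the weight values to be determined per scenario, and `overall_reward` is a hypothetical function.

```python
def overall_reward(
    process_reward: float,
    result_reward: float,
    process_weight: float = 0.4,
    result_weight: float = 0.6,
) -> float:
    """Weighted combination of the process reward and the result reward."""
    total = process_weight + result_weight
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Dividing by the total keeps the output on the same scale as the inputs.
    return (process_weight * process_reward + result_weight * result_reward) / total
```

Because the same overall reward is shared by all target agents that worked on the task, it rewards the joint outcome instead of any single agent's step.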

[0110] As can be seen from the above technical solution, this embodiment provides an optional method for generating the overall agent reward corresponding to each collaborative training task based on task execution data and multi-agent processing flow corresponding to the same collaborative training task. By evaluating the simplification of the task execution process and the effectiveness of the task execution results separately, this method can more comprehensively and objectively reflect the performance of multiple agents in collaborative training tasks, guiding them towards more efficient and optimized collaboration, and improving the accuracy and scientific rigor of the overall agent reward generation in multi-agent collaborative training.

[0111] In some embodiments of this application, step S31, which involves comparing each task execution process with its corresponding execution approach, evaluating the simplification of the task execution process, and generating a process reward, is described in detail below:

[0112] S310. Using an AI model, compare the execution process of each task with the corresponding execution approach, evaluate the simplification of tool calls, the rationality of process logic, and the simplification of steps in the task execution process, and generate process rewards.

[0113] Specifically, a semantic AI model can be used to analyze the execution process of each task and its matching execution ideas, evaluate the simplification of tool calls, the rationality of process logic, and the simplification of steps in the execution process of the task, and generate process rewards.

[0114] Tool invocation simplification can be determined based on whether the number of invocations is reduced during task execution, or whether invocations of tools and / or agents are reduced.

[0115] The rationality of the process logic can be determined based on whether there are redundant steps in the task execution process, such as whether the agent is called to obtain data that is unrelated to the corresponding task execution result.

[0116] The simplification of steps can be determined based on whether the total number of execution steps is less than the execution approach.
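In the application these three dimensions are evaluated by an AI model. Purely to make the criteria concrete, the sketch below replaces the model with three rule-based checks. The step layout (keys "agent" and "tool"), the set `relevant_agents`, and the choice to count "no worse than the reference" as a pass are all assumptions of this example.

```python
from typing import List, Set


def process_reward(
    steps: List[dict],
    reference_steps: List[dict],
    relevant_agents: Set[str],
) -> float:
    """Average of three pass/fail checks mirroring the dimensions above."""

    def tool_calls(trace: List[dict]) -> int:
        return sum(1 for step in trace if step.get("tool"))

    # Tool invocation simplification: no more tool calls than the reference flow.
    fewer_tool_calls = tool_calls(steps) <= tool_calls(reference_steps)
    # Process logic rationality: no step calls an agent unrelated to the result.
    no_redundant_steps = all(step["agent"] in relevant_agents for step in steps)
    # Step simplification: no more steps in total than the reference flow.
    fewer_steps = len(steps) <= len(reference_steps)
    return (fewer_tool_calls + no_redundant_steps + fewer_steps) / 3
```

A model-based evaluator could return graded scores for each dimension in place of the pass/fail values used here.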

[0117] As can be seen from the above technical solution, this embodiment provides an optional method to compare the execution process of each task with its corresponding execution approach, evaluate the simplification of the task execution process, and generate process rewards. Through this method, the effectiveness of collaborative training of the task execution process can be comprehensively evaluated from multiple dimensions, such as tool invocation, the rationality of process logic, and step simplification, further improving the reliability of the process rewards in this application.

[0118] In some embodiments of this application, the process of step S32, which involves comparing the execution result of each task with the corresponding final execution result to evaluate the validity of the task execution result and generate a result reward, is described in detail below:

[0119] S320: Using an AI model, the execution result of each task is compared with the corresponding final execution result. The intuitiveness of the task execution result display, the richness of the display method, and the matching degree with the corresponding collaborative training task are evaluated, and a result reward is generated.

[0120] Specifically, a semantic AI model can be used to analyze the execution results of each task and the corresponding final execution results, evaluate the intuitiveness of the task execution results, the richness of the display methods, and the matching degree with the corresponding collaborative training tasks, and generate result rewards.

[0121] Intuitiveness can reflect whether the task execution results are presented in an intuitive way, such as through charts.

[0122] The richness of display methods can be determined based on whether the task execution results are displayed in multiple ways.

[0123] Matching degree can characterize whether the task execution result completes the collaborative training task.
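A possible shape for the result reward is sketched below, assuming the AI model is wrapped as a callable that returns one score per dimension. The `Judge` signature, the dimension keys, and the equal weighting are hypothetical; the application only names the three dimensions.

```python
from typing import Callable, Dict

DIMENSIONS = ("intuitiveness", "richness", "matching")

# Hypothetical judge: (task, execution result, reference result) -> score per dimension.
Judge = Callable[[str, str, str], Dict[str, float]]


def result_reward(task: str, result: str, reference: str, judge: Judge) -> float:
    """Average the judge's three dimension scores after clamping each to [0, 1]."""
    scores = judge(task, result, reference)
    # A dimension the judge did not score counts as 0.
    clamped = [min(1.0, max(0.0, scores.get(name, 0.0))) for name in DIMENSIONS]
    return sum(clamped) / len(DIMENSIONS)
```

Clamping guards against a judge that returns out-of-range scores, so the result reward stays comparable with the process reward.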

[0124] As can be seen from the above technical solution, this embodiment provides an optional method for comparing the execution result of each task with the corresponding final execution result, evaluating the effectiveness of the task execution result, and generating a result reward. Through this method, multi-dimensional information such as intuitiveness, richness of display methods, and matching degree with the corresponding collaborative training task can be considered together when evaluating the result reward, resulting in higher accuracy of the result reward.

[0125] Next, with reference to Figure 2, the multi-agent collaborative training device provided in this application is described in detail. The device described below and the multi-agent collaborative training method described above may be cross-referenced.

[0126] As shown in Figure 2, the multi-agent collaborative training device may include:

[0127] The acquisition module 10 is used to acquire the training target and multiple collaborative training samples containing the corresponding collaborative training tasks and the corresponding multi-agent processing flow.

[0128] The decomposition module 20 is used to, when the training target indicates that data distillation is to be performed on the multiple target agents participating in the training, decompose the multi-agent processing flow in each collaborative training sample and determine the processing sub-tasks, processing ideas and processing results of each training agent involved in each multi-agent processing flow; and, based on the processing sub-tasks, processing ideas and processing results of each training agent in different multi-agent processing flows, perform data distillation on each target agent until a preset first stopping condition is reached;

[0129] The generation module 30 is used to, when the training target indicates that the multiple target agents participating in the training are to be collaboratively optimized, use each target agent to process each collaborative training task in turn to generate task execution data; generate, based on the task execution data corresponding to the same collaborative training task and the multi-agent processing flow, the overall agent reward corresponding to each collaborative training task; and fine-tune each target agent based on the overall agent reward until a preset second stopping condition is reached.

[0130] Furthermore, the decomposition module 20 may include:

[0131] The sample data generation unit is used to generate multiple sample data corresponding to the same training agent based on the processing sub-tasks, processing ideas and processing results.

[0132] The target agent pairing unit pairs each training agent with each target agent, determines all sample data corresponding to each target agent, and performs data distillation on the target agent using the sample data corresponding to each target agent.

[0133] Furthermore, the target agent pairing unit may include:

[0134] The parameter adjustment subunit is used to, for each target agent, transfer all real processing tasks of the target agent to the snapshot version of the target agent; sequentially input each processing sub-task of the target agent into the target agent to obtain training processing data of the target agent containing processing strategies and computational results; compare the similarity between each processing idea and its corresponding processing strategy to generate a process reward; compare the similarity between each processing result and its corresponding computational result to generate a result reward; and adjust the parameters of the target agent based on the process reward and result reward corresponding to each sample data.

[0135] Furthermore, the generation module 30 may include:

[0136] The execution strategy extraction unit is used to extract the task execution process and task execution results from each task execution data, and to extract the execution ideas and final execution results from each multi-agent processing flow.

[0137] The process reward generation unit is used to compare the execution process of each task with the corresponding execution approach, evaluate the simplification of the task execution process, and generate process rewards.

[0138] The result reward generation unit is used to compare the execution result of each task with the corresponding final execution result, evaluate the effectiveness of the task execution result, and generate a result reward.

[0139] The overall reward generation unit is used to combine the process reward and result reward of each collaborative training task to generate the overall agent reward for the corresponding collaborative training task.

[0140] Furthermore, the process reward generation unit may include:

[0141] The step simplification assessment subunit uses an AI model to compare the execution process of each task with the corresponding execution approach, evaluate the simplification of tool calls, the rationality of process logic, and the simplification of steps in the task execution process, and generate process rewards.

[0142] Furthermore, the result reward generation unit may include:

[0143] The intuitiveness assessment subunit is used to compare the execution result of each task with the corresponding final execution result using an AI model. It evaluates the intuitiveness of the task execution result, the richness of the display method, and the matching degree with the corresponding collaborative training task, and generates a result reward.

[0144] In some embodiments of this application, a large-parameter model can be trained first, and a small-parameter model can be obtained by data distillation based on the trained large-parameter model. This allows for the processing of complex collaborative tasks with relatively limited system resources. Therefore, this application provides specific implementation methods for the above-described training process.

[0145] Next, with reference to Figure 3, the specific implementation described above is detailed below:

[0146] Step S100: Obtain multiple first training samples containing collaborative training tasks and multi-agent processing flows.

[0147] Specifically, the first training sample is equivalent to the collaborative training sample mentioned above. For details, please refer to step S1 above, which will not be repeated here.

[0148] Step S200: Use each heavy agent to process each collaborative training task in sequence to generate task execution data; based on the task execution data corresponding to the same collaborative training task and the multi-agent processing flow, generate the overall agent reward corresponding to each collaborative training task; based on the overall agent reward, fine-tune the training of each heavy agent until the stopping condition is met.

[0149] Specifically, each heavy agent is equivalent to each target agent in step S3. For details, please refer to step S3 above, which will not be repeated here.

[0150] Step S300: Use each fine-tuned heavy agent to process each collaborative training task in turn, and generate a second training sample containing the collaborative training task and the task execution flow.

[0151] Specifically, each collaborative training task can be input into each fine-tuned heavy agent to obtain the task execution flow of the corresponding collaborative training task.

[0152] The task execution flow here may include information such as the order of agent invocation, the results returned by agents, and tool call data.

[0153] The task execution flow refers to the task execution status of each heavy agent after the stopping condition is met;

[0154] Task execution data refers to the task execution status of each heavy agent during the training process;

[0155] The task execution flow and the task execution data for the same collaborative training task can be the same or different.

[0156] Step S400: Decompose the execution flow of each task, determine the processing sub-tasks, processing ideas and processing results of each heavy agent involved in the execution flow of each task; based on the processing sub-tasks, processing ideas and processing results of different heavy agents, perform data distillation on each lightweight agent until the preset stopping condition is reached.

[0157] Specifically, the task execution flow here is equivalent to the multi-agent processing flow in step S2, and each lightweight agent is equivalent to the target agent in step S2. Therefore, for a detailed introduction, please refer to step S2 above, which will not be repeated here.
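Steps S100 to S400 can be read as a four-stage pipeline. The sketch below only shows how the stages hand data to one another; the four callables (`fine_tune_heavy`, `run_heavy`, `decompose`, `distill_light`) are hypothetical placeholders for the training procedures described above and are supplied by the caller.

```python
from typing import Any, Callable, List, Tuple


def heavy_to_light_pipeline(
    tasks: List[str],
    reference_flows: List[Any],
    fine_tune_heavy: Callable[[List[Tuple[str, Any]]], Any],
    run_heavy: Callable[[Any, str], Any],
    decompose: Callable[[Any], Any],
    distill_light: Callable[[List[Any]], Any],
) -> Any:
    """Chain the four steps: tune heavy agents, then distill into light ones."""
    # S100: first training samples pair each task with its reference flow.
    first_samples = list(zip(tasks, reference_flows))
    # S200: fine-tune the heavy agents with the overall agent reward.
    tuned_heavy = fine_tune_heavy(first_samples)
    # S300: second training samples pair each task with the tuned agents' flow.
    second_samples = [(task, run_heavy(tuned_heavy, task)) for task in tasks]
    # S400: decompose each flow and distill the lightweight agents from it.
    decomposed = [decompose(flow) for _task, flow in second_samples]
    return distill_light(decomposed)
```

The lightweight agents never see the first training samples directly; they learn only from flows produced by the fine-tuned heavy agents.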

[0158] As can be seen from the above technical solution, this embodiment provides a novel multi-agent collaborative training method. By using a coordinated and optimized heavy agent as the teacher model, knowledge is transferred to a lightweight agent, which can effectively reduce the consumption of system operating resources while ensuring the collaborative training effect. After data distillation, the lightweight agent can efficiently handle complex collaborative tasks in resource-constrained environments, improving the practicality and scalability of the multi-agent system. This allows agents to better cooperate when handling collaborative tasks, improving the efficiency and quality of task execution.

[0159] Next, a multi-agent collaborative training device corresponding to the above training process will be provided, which may include:

[0160] The first collaborative training module is used to acquire multiple first training samples containing collaborative training tasks and multi-agent processing flows.

[0161] The second collaborative training module is used to process each collaborative training task sequentially by each heavy agent and generate task execution data; based on the task execution data corresponding to the same collaborative training task and the multi-agent processing flow, it generates the overall reward of the agent corresponding to each collaborative training task; based on the overall reward of each agent, it performs fine-tuning training on each heavy agent until the stopping condition is met.

[0162] The third collaborative training module is used to process each collaborative training task sequentially using each fine-tuned heavy agent to generate a second training sample containing the collaborative training task and the task execution flow.

[0163] The fourth collaborative training module is used to break down the execution process of each task, determine the processing sub-tasks, processing ideas and processing results of each heavy agent involved in the execution process of each task, and perform data distillation on each lightweight agent based on the processing sub-tasks, processing ideas and processing results of different heavy agents until the preset stopping conditions are reached.

[0164] The multi-agent collaborative training device provided in this application embodiment can be applied to multi-agent collaborative training equipment, such as PC terminals, cloud platforms, servers, and server clusters. Optionally, Figure 4 shows a block diagram of the hardware structure of the multi-agent collaborative training equipment. Referring to Figure 4, the hardware structure of the multi-agent collaborative training equipment may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4.

[0165] In this embodiment of the application, the number of processor 1, communication interface 2, memory 3, and communication bus 4 is at least one, and processor 1, communication interface 2, and memory 3 communicate with each other through communication bus 4;

[0166] Processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of this application.

[0167] Memory 3 may include high-speed RAM, and may also include non-volatile memory, such as at least one disk storage device;

[0168] The memory stores a program, which the processor can call. The program is used for:

[0169] Obtain the training target and multiple collaborative training samples containing the corresponding collaborative training tasks and the corresponding multi-agent processing flow;

[0170] When the training target indicates that data distillation is to be performed on the multiple target agents participating in the training, the multi-agent processing flow in each collaborative training sample is decomposed, and the processing sub-tasks, processing ideas and processing results of each training agent involved in each multi-agent processing flow are determined. Based on the processing sub-tasks, processing ideas and processing results of each training agent in different multi-agent processing flows, data distillation is performed on each target agent until a preset first stopping condition is reached.

[0171] When the training target indicates that the multiple target agents participating in the training are to be collaboratively optimized, each target agent processes each collaborative training task in turn to generate task execution data; based on the task execution data corresponding to the same collaborative training task and the multi-agent processing flow, the overall agent reward corresponding to each collaborative training task is generated; based on the overall agent reward, each target agent is fine-tuned until the preset second stopping condition is reached.

[0172] or,

[0173] Obtain multiple first training samples containing collaborative training tasks and multi-agent processing flows;

[0174] Each heavy agent sequentially processes each collaborative training task to generate task execution data; based on the task execution data corresponding to the same collaborative training task and the multi-agent processing flow, the overall agent reward for each collaborative training task is generated; based on the overall agent reward, each heavy agent is fine-tuned until the stopping condition is met.

[0175] Each fine-tuned heavy agent processes each collaborative training task sequentially, generating a second training sample containing the collaborative training task and the task execution flow.

[0176] Deconstruct the execution flow of each task, and determine the processing sub-tasks, processing ideas, and processing results of each heavy agent involved in the execution flow of each task; based on the processing sub-tasks, processing ideas, and processing results of different heavy agents, perform data distillation on each lightweight agent until the preset stopping conditions are reached.

[0177] Optionally, for the refined and extended functions of the program, refer to the description above.

[0178] This application embodiment also provides a readable storage medium that can store a program suitable for execution by a processor, the program being used for:

[0179] Obtain the training target and multiple collaborative training samples containing the corresponding collaborative training tasks and the corresponding multi-agent processing flow;

[0180] When the training target indicates that data distillation is to be performed on the multiple target agents participating in the training, the multi-agent processing flow in each collaborative training sample is decomposed, and the processing sub-tasks, processing ideas and processing results of each training agent involved in each multi-agent processing flow are determined. Based on the processing sub-tasks, processing ideas and processing results of each training agent in different multi-agent processing flows, data distillation is performed on each target agent until a preset first stopping condition is reached.

[0181] When the training target indicates that the multiple target agents participating in the training are to be collaboratively optimized, each target agent processes each collaborative training task in turn to generate task execution data; based on the task execution data corresponding to the same collaborative training task and the multi-agent processing flow, the overall agent reward corresponding to each collaborative training task is generated; based on the overall agent reward, each target agent is fine-tuned until the preset second stopping condition is reached.

[0182] or,

[0183] Obtain multiple first training samples containing collaborative training tasks and multi-agent processing flows;

[0184] Each heavy agent sequentially processes each collaborative training task to generate task execution data; based on the task execution data corresponding to the same collaborative training task and the multi-agent processing flow, the overall agent reward for each collaborative training task is generated; based on the overall agent reward, each heavy agent is fine-tuned until the stopping condition is met.

[0185] Each fine-tuned heavy agent processes each collaborative training task sequentially, generating a second training sample containing the collaborative training task and the task execution flow.

[0186] Deconstruct the execution flow of each task, and determine the processing sub-tasks, processing ideas, and processing results of each heavy agent involved in the execution flow of each task; based on the processing sub-tasks, processing ideas, and processing results of different heavy agents, perform data distillation on each lightweight agent until the preset stopping conditions are reached.

[0187] Optionally, for the refined and extended functions of the program, refer to the description above.

[0188] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0189] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.

[0190] The above description of the disclosed embodiments enables those skilled in the art to make or use this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. The various embodiments of this application can be combined with each other. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A multi-agent collaborative training method, characterized by comprising: obtaining a training target and multiple collaborative training samples, each containing a collaborative training task and the corresponding multi-agent processing flow; when the training target indicates that data distillation is to be performed on the multiple target agents participating in the training, decomposing the multi-agent processing flow in each collaborative training sample to determine the processing sub-tasks, processing ideas, and processing results of each training agent involved in each multi-agent processing flow, and performing data distillation on each target agent, based on the processing sub-tasks, processing ideas, and processing results of each training agent in the different multi-agent processing flows, until a preset first stopping condition is reached; when the training target indicates that the multiple target agents participating in the training are to be collaboratively optimized, having each target agent process each collaborative training task in turn to generate task execution data, generating the overall agent reward corresponding to each collaborative training task based on the task execution data and the multi-agent processing flow corresponding to the same collaborative training task, and fine-tuning the target agents based on each overall agent reward until a preset second stopping condition is reached.

2. The multi-agent collaborative training method according to claim 1, characterized in that the process of performing data distillation on each target agent, based on the processing sub-tasks, processing ideas, and processing results of each training agent in the different multi-agent processing flows, includes: generating multiple sample data corresponding to each training agent, based on the processing sub-tasks, processing ideas, and processing results corresponding to that same training agent; pairing the training agents with the target agents to determine all the sample data corresponding to each target agent; and performing data distillation on each target agent using the sample data corresponding to that target agent.

3. The multi-agent collaborative training method according to claim 2, characterized in that the step of performing data distillation on each target agent using the sample data corresponding to that target agent includes: for each target agent, transferring all real processing tasks of the target agent to a snapshot version of the target agent; sequentially inputting each processing sub-task of the target agent into the target agent to obtain training processing data containing processing strategies and computational results; comparing each processing idea with its corresponding processing strategy to generate a process reward; comparing each processing result with its corresponding computational result to generate a result reward; and adjusting the parameters of the target agent based on the process reward and result reward corresponding to each sample data.

4. The multi-agent collaborative training method according to claim 1, characterized in that the process of generating the overall agent reward corresponding to each collaborative training task, based on the task execution data and the multi-agent processing flow corresponding to the same collaborative training task, includes: extracting the task execution process and the task execution result from each task execution data, and extracting the execution idea and the final execution result from each multi-agent processing flow; comparing each task execution process with the corresponding execution idea to evaluate the simplification of the task execution process and generate a process reward; comparing each task execution result with the corresponding final execution result to evaluate the effectiveness of the task execution result and generate a result reward; and combining the process reward and the result reward corresponding to each collaborative training task to generate the overall agent reward corresponding to that collaborative training task.

5. The multi-agent collaborative training method according to claim 4, characterized in that the process of comparing each task execution process with the corresponding execution idea to evaluate the simplification of the task execution process and generate a process reward includes: using an AI model to compare each task execution process with the corresponding execution idea, so as to evaluate the simplification of tool calls, the rationality of the process logic, and the simplification of steps in the task execution process, and to generate the process reward.

6. The multi-agent collaborative training method according to claim 4, characterized in that the process of comparing each task execution result with the corresponding final execution result to evaluate the effectiveness of the task execution result and generate a result reward includes: using an AI model to compare each task execution result with the corresponding final execution result, so as to evaluate the intuitiveness of the task execution result, the richness of its display method, and its degree of matching with the corresponding collaborative training task, and to generate the result reward.

7. A multi-agent collaborative training method, characterized by comprising: obtaining multiple first training samples, each containing a collaborative training task and a multi-agent processing flow; having each heavy agent sequentially process each collaborative training task to generate task execution data; generating the overall agent reward corresponding to each collaborative training task based on the task execution data and the multi-agent processing flow corresponding to the same collaborative training task; fine-tuning each heavy agent based on each overall agent reward until a stopping condition is met; having each fine-tuned heavy agent sequentially process each collaborative training task to generate a second training sample containing the collaborative training task and the task execution process; deconstructing each task execution process to determine the processing sub-tasks, processing ideas, and processing results of each heavy agent involved in that task execution process; and performing data distillation on each lightweight agent, based on the processing sub-tasks, processing ideas, and processing results of the different heavy agents, until a preset stopping condition is reached.

8. A multi-agent collaborative training device, characterized by comprising: an acquisition module, used to acquire a training target and multiple collaborative training samples, each containing a collaborative training task and the corresponding multi-agent processing flow; a decomposition module, used, when the training target indicates that data distillation is to be performed on the multiple target agents participating in the training, to decompose the multi-agent processing flow in each collaborative training sample, to determine the processing sub-tasks, processing ideas, and processing results of each training agent involved in each multi-agent processing flow, and to perform data distillation on each target agent, based on the processing sub-tasks, processing ideas, and processing results of each training agent in the different multi-agent processing flows, until a preset first stopping condition is reached; and a generation module, used, when the training target indicates that the multiple target agents participating in the training are to be collaboratively optimized, to have each target agent process each collaborative training task sequentially to generate task execution data, to generate the overall agent reward corresponding to each collaborative training task based on the task execution data and the multi-agent processing flow corresponding to the same collaborative training task, and to fine-tune the target agents based on each overall agent reward until a preset second stopping condition is reached.

9. A multi-agent collaborative training device, characterized by comprising a memory and a processor; the memory is used to store a program; the processor is used to execute the program to implement each step of the multi-agent collaborative training method as described in any one of claims 1-7.

10. A readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements each step of the multi-agent collaborative training method as described in any one of claims 1-7.
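
For illustration only, the two branches recited in claim 1 can be sketched as a dispatch on the training target, with each branch repeated until its own stopping condition is reached. The names `TrainingTarget`, `run_until`, and `train`, the callable-based interface, and the `max_rounds` safety limit are hypothetical and do not appear in this application; the training steps themselves are represented by placeholder callables that return a metric.

```python
from enum import Enum
from typing import Callable


class TrainingTarget(Enum):
    DATA_DISTILLATION = "data_distillation"
    COLLABORATIVE_OPTIMIZATION = "collaborative_optimization"


def run_until(step: Callable[[], float],
              should_stop: Callable[[int, float], bool],
              max_rounds: int = 1000) -> int:
    # Repeat one training step until the stopping condition holds.
    # max_rounds is an assumed safety limit, not part of the claims.
    for round_index in range(1, max_rounds + 1):
        metric = step()
        if should_stop(round_index, metric):
            return round_index
    return max_rounds


def train(target: TrainingTarget,
          distill_step: Callable[[], float],
          optimize_step: Callable[[], float],
          first_stop: Callable[[int, float], bool],
          second_stop: Callable[[int, float], bool]) -> int:
    # Select the branch indicated by the training target.
    if target is TrainingTarget.DATA_DISTILLATION:
        return run_until(distill_step, first_stop)
    if target is TrainingTarget.COLLABORATIVE_OPTIMIZATION:
        return run_until(optimize_step, second_stop)
    raise ValueError(f"unsupported training target: {target}")


# Placeholder steps: a distillation loss that falls and a reward that rises.
losses = iter([0.9, 0.5, 0.1, 0.05])
rewards = iter([0.1, 0.8, 0.9])
distill_rounds = train(TrainingTarget.DATA_DISTILLATION,
                       lambda: next(losses), lambda: next(rewards),
                       lambda r, loss: loss < 0.2, lambda r, rew: rew >= 0.7)
optimize_rounds = train(TrainingTarget.COLLABORATIVE_OPTIMIZATION,
                        lambda: next(losses), lambda: next(rewards),
                        lambda r, loss: loss < 0.2, lambda r, rew: rew >= 0.7)
print(distill_rounds, optimize_rounds)
```

Keeping the two stopping conditions as separate callables mirrors the distinction between the first and second stopping conditions, which may be defined on different quantities.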