Data processing method and related system

By using a shared reflector in a multi-agent system to determine the cause information of each agent, the collaboration of multiple agents is optimized, the problems of response accuracy and time in multi-agent systems are solved, and more efficient response generation is achieved.

WO2026066400A9 · PCT designated stage · Publication Date: 2026-06-11 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
HUAWEI TECH CO LTD
Filing Date
2025-06-28
Publication Date
2026-06-11

AI Technical Summary

Technical Problem

In multi-agent systems, the division of labor among agents depends on the researcher's prior knowledge, resulting in poor accuracy and long processing time in generating responses.

Method used

By using a shared reflector when multiple agents collaborate, the cause information of each agent can be determined, and the collaboration of the multi-agent system can be optimized through the reflector, thereby improving response accuracy and reducing generation time.

Benefits of technology

It improves the accuracy of response generation in multi-agent systems, reduces computational resource requirements, and makes model training more stable and efficient.

✦ Generated by Eureka AI based on patent content.

Abstract

A data processing method, comprising: acquiring a question; on the basis of the question and by means of interaction between a plurality of agents, obtaining a plurality of pieces of first interaction information, the plurality of pieces of first interaction information comprising a first reply to the question; when the first reply is not a correct reply to the question, on the basis of the plurality of pieces of first interaction information and by means of a reflector, determining reason information corresponding to each of the agents, the reason information being a reason for obtaining the incorrect reply; and, on the basis of the question and the reason information corresponding to each of the agents and by means of interaction between the plurality of agents, obtaining a second reply to the question. The present application can improve the collaboration capability of multi-agent systems and the accuracy of generated replies, and reduce time overheads of generation processes.

Description

A data processing method and related system

[0001] This application claims priority to Chinese Patent Application No. 202411394315.9, filed with the State Intellectual Property Office of China on September 30, 2024, entitled “A Data Processing Method and Related System”, the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This application relates to the field of artificial intelligence (AI) technology, and more particularly to a data processing method, apparatus, computing device, chip, computer-readable storage medium, and computer program product.

Background Technology

[0003] Large-scale language models (LLMs) encapsulate rich human knowledge within their massive parameters, achieving good performance in many complex natural language tasks (including question answering). Leveraging the powerful language expression, reasoning, and planning capabilities of large models, autonomous agents based on them can effectively understand and generate instructions, interact and make decisions in complex environments, and achieve good results in various downstream tasks. To address even more complex task scenarios, building upon single-agent research, researchers have proposed constructing multi-agent systems, breaking down complex tasks into several sub-tasks and assigning them to multiple agents with different domain skills and expertise.

[0004] In multi-agent systems, the division of labor among agents largely relies on the researcher's prior knowledge, resulting in poor accuracy and long generation times for the generated responses. Therefore, improving the accuracy of response generation through multi-agent systems is a pressing technical problem that needs to be solved.

Summary of the Invention

[0005] This application provides a data processing method, as well as an apparatus, computing device cluster, computer-readable storage medium, and computer program product corresponding to the above method.

[0006] In a first aspect, this application provides a data processing method, the method comprising: acquiring a question; obtaining multiple pieces of first interaction information based on the question through interaction between multiple agents, the multiple pieces of first interaction information including a first response to the question; when the first response is not a correct response to the question, determining, based on the multiple pieces of first interaction information, cause information corresponding to each agent through a reflector, the cause information being the reason for obtaining an incorrect response; and obtaining a second response to the question based on the question and the cause information corresponding to each agent through interaction between the multiple agents.

[0007] The method proposed in this application can reflect on the interaction information obtained by multiple agents when the collaboration among multiple agents fails to obtain a correct response. The purpose of the reflection is to determine the reason why the correct response was not obtained and the reasons of each agent. Based on this information, the agent participates in the next round of multi-agent collaboration to determine the response. This can improve the collaboration capability of the multi-agent system, the accuracy of the generated response, and reduce the time cost of the generation process.
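The flow described in paragraphs [0006] and [0007] can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation claimed in this application: `Agent`, `Reflector`, `run_round`, `solve`, and the `is_correct` check are hypothetical names, and the agents are stubbed as plain Python callables rather than LLMs.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical types: an agent maps (question, optional reason) to a message;
# a reflector maps (agent name, question, transcript) to that agent's reason.
Agent = Callable[[str, Optional[str]], str]
Reflector = Callable[[str, str, List[str]], str]


def run_round(question: str, agents: Dict[str, Agent],
              reasons: Dict[str, str]) -> List[str]:
    """One round of interaction; the last message is treated as the reply."""
    return [agent(question, reasons.get(name)) for name, agent in agents.items()]


def solve(question: str, agents: Dict[str, Agent], reflector: Reflector,
          is_correct: Callable[[str], bool]) -> str:
    first = run_round(question, agents, reasons={})
    if is_correct(first[-1]):
        return first[-1]
    # The first reply was wrong: derive cause information for each agent
    # from the first-round interaction information.
    reasons = {name: reflector(name, question, first) for name in agents}
    second = run_round(question, agents, reasons)
    return second[-1]


# Toy stubs (assumptions for illustration only).
def planner(question: str, reason: Optional[str]) -> str:
    return "plan: add the numbers" if reason else "plan: multiply the numbers"


def solver(question: str, reason: Optional[str]) -> str:
    return "5" if reason else "6"


def toy_reflector(name: str, question: str, transcript: List[str]) -> str:
    return f"{name} used the wrong operation"


answer = solve("What is 2 + 3?", {"planner": planner, "solver": solver},
               toy_reflector, is_correct=lambda reply: reply == "5")
```

In this toy run the first round ends with an incorrect reply, so the reflector is consulted once per agent and the second round produces the reply stored in `answer`.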

[0008] In one possible implementation, the same reflector is used when determining the cause information corresponding to each of the agents.

[0009] This application proposes a shared reflector for constructing a multi-agent collaborative system. Considering the homogeneity among different reflectors—that is, their action spaces (reflections) are consistent, and their optimization objectives are also completely identical (aiding in solving the overall task)—this application proposes that all agents share the same reflector, and through carefully designed prompts, the shared reflector can perceive the role information of different agents. This sharing mechanism not only reduces the demand for computational resources but also results in more training data and more stable training.
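One way to picture the shared reflector of paragraph [0009] is a single reflection function that receives a role-aware prompt for each agent. The sketch below is an assumption-laden illustration: the prompt wording, `build_reflector_prompt`, and the stubbed `shared_reflector` are hypothetical and stand in for a single underlying model.

```python
from typing import Dict, List


def build_reflector_prompt(role: str, question: str,
                           transcript: List[str]) -> str:
    """Build a role-aware prompt so one reflector can serve every agent."""
    history = "\n".join(transcript)
    return (
        f"You are reflecting on behalf of the agent with role: {role}.\n"
        f"Question: {question}\n"
        f"Interaction so far:\n{history}\n"
        "Explain why this agent contributed to the incorrect reply."
    )


def shared_reflector(prompt: str) -> str:
    """Stand-in for a single reflector model shared by all agents."""
    # A real system would call one model here; this stub echoes the role line.
    return "reflection for: " + prompt.splitlines()[0]


roles = ["planner", "solver"]
transcript = ["planner: multiply the numbers", "solver: 6"]
reflections: Dict[str, str] = {
    role: shared_reflector(build_reflector_prompt(role, "What is 2 + 3?", transcript))
    for role in roles
}
```

Because every role goes through the same `shared_reflector`, all reflections would contribute training data to one model, which is the resource and stability argument made above.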

[0010] In one possible implementation, the plurality of agents includes a target agent; obtaining a second response to the question through interaction between the plurality of agents based on the question and the cause information corresponding to each agent includes: obtaining second interaction information corresponding to each agent through interaction between the plurality of agents based on the question and the cause information corresponding to each agent, wherein the plurality of second interaction information includes the second response.

[0011] In one possible implementation, the second interaction information corresponding to the target agent is obtained in the following way:

[0012] Based on the question and the reason information corresponding to the target agent, the second interaction information of the target agent is obtained through the interaction between the target agent and other agents.

[0013] In one possible implementation, the second interaction information corresponding to the target agent is obtained in the following way:

[0014] Based on the question and the reason information corresponding to multiple agents, including the target agent, the second interaction information of the target agent is obtained through the interaction between the target agent and other agents.
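The two alternatives in paragraphs [0012] and [0014] differ only in which cause information reaches the target agent. A minimal sketch, assuming the cause information is injected into the agent's prompt as text (the function names and prompt layout are hypothetical):

```python
from typing import Dict


def reasons_for_agent(target: str, all_reasons: Dict[str, str],
                      share_all: bool) -> Dict[str, str]:
    """Select the cause information placed in the target agent's prompt.

    share_all=False: only the target agent's own reason ([0012]).
    share_all=True: reasons of multiple agents, including the target ([0014]).
    """
    if share_all:
        return dict(all_reasons)
    return {target: all_reasons[target]}


def build_agent_prompt(question: str, reasons: Dict[str, str]) -> str:
    lines = [f"Question: {question}", "Reasons the previous reply was wrong:"]
    lines += [f"- {name}: {reason}" for name, reason in reasons.items()]
    return "\n".join(lines)


all_reasons = {"planner": "chose multiplication", "solver": "did not verify the plan"}
own_only = build_agent_prompt("What is 2 + 3?", reasons_for_agent("solver", all_reasons, False))
shared = build_agent_prompt("What is 2 + 3?", reasons_for_agent("solver", all_reasons, True))
```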

[0015] In one possible implementation, the method further includes:

[0016] Obtaining a label for the cause information corresponding to each agent, and a weight corresponding to the cause information, where the weight indicates the degree of positive impact that using the reflector has on the accuracy of the obtained second interaction information compared to not using the reflector; and fine-tuning the reflector based on the label and the weight. For example, a loss can be constructed based on the label and the cause information, and the loss can then be adjusted using the weight.
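As a hedged illustration of paragraph [0016], the sketch below builds a loss from the label and scales it by the weight. The choice of a mean negative log-likelihood and of simple multiplication by the weight are assumptions; the application only says that a loss is constructed and then adjusted using the weight.

```python
import math
from typing import List


def weighted_reflection_loss(label_token_probs: List[float], weight: float) -> float:
    """Negative log-likelihood of the label tokens, scaled by the weight.

    label_token_probs: probability the reflector assigns to each label token
    (hypothetical input; a real system would read these from model logits).
    weight: how much using the reflection improved the second-round result.
    """
    nll = -sum(math.log(p) for p in label_token_probs) / len(label_token_probs)
    return weight * nll


probs = [0.5, 0.25]
helpful = weighted_reflection_loss(probs, weight=1.0)
neutral = weighted_reflection_loss(probs, weight=0.0)
```

Under this assumption, a reflection that made no difference (weight 0) contributes nothing to the update, while a helpful one contributes its full likelihood loss.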

[0017] This application proposes constructing counterfactual rewards as a supervisory signal for fine-tuning individual agents, and using a proximal policy optimization method to fine-tune the agent reflector. First, the reflections of all agents are added to the prompts of the corresponding decision model to obtain the overall reward. Then, the reflections of each agent are removed sequentially, and the interaction with the environment is repeated, with the resulting reward serving as the marginal reward. The difference between the overall reward and the marginal reward is used as an evaluation of the individual agent's contribution in the cooperation process, and this is used as a supervisory signal to fine-tune the agent reflector. The reflector is optimized using a proximal policy optimization method to maximize the environmental reward.
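The counterfactual credit assignment in paragraph [0017] can be sketched as below. `rollout_reward` is a hypothetical stand-in for repeating the interaction with the environment using a given set of reflections, and the toy reward function is an assumption for illustration; the proximal policy optimization update itself is not shown.

```python
from typing import Callable, Dict

RewardFn = Callable[[Dict[str, str]], float]


def counterfactual_rewards(reflections: Dict[str, str],
                           rollout_reward: RewardFn) -> Dict[str, float]:
    """Credit each agent with (overall reward) - (reward without its reflection)."""
    overall = rollout_reward(reflections)
    credits = {}
    for name in reflections:
        without = {k: v for k, v in reflections.items() if k != name}
        marginal = rollout_reward(without)  # repeat the interaction without this reflection
        credits[name] = overall - marginal
    return credits


# Toy environment (assumption): the task succeeds only if the solver's
# reflection is present; the planner's reflection makes no difference.
def toy_reward(active: Dict[str, str]) -> float:
    return 1.0 if "solver" in active else 0.0


credits = counterfactual_rewards(
    {"planner": "chose multiplication", "solver": "did not verify the plan"},
    toy_reward,
)
```

Each per-agent difference would then serve as the supervisory signal for fine-tuning the reflector on that agent's reflection.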

[0018] In one possible implementation, the agent is a Large Language Model (LLM).

[0019] In a second aspect, this application provides a data processing apparatus, the apparatus comprising:

[0020] A question acquisition module, configured to acquire a question.

[0021] A response confirmation module, configured to: obtain multiple pieces of first interaction information based on the question through interaction between multiple agents, the multiple pieces of first interaction information including a first response to the question; when the first response is not a correct response to the question, determine, based on the multiple pieces of first interaction information and by means of a reflector, the reason information corresponding to each agent, the reason information being the reason for obtaining an incorrect response; and obtain a second response to the question based on the question and the reason information corresponding to each agent through interaction between the multiple agents.

[0022] In one possible implementation, the same reflector is used when determining the cause information corresponding to each agent.

[0023] In one possible implementation, the plurality of agents includes a target agent;

[0024] The response confirmation module is specifically used for:

[0025] Based on the question and the reason information corresponding to each agent, the second interaction information corresponding to each agent is obtained through the interaction between the multiple agents, and the multiple second interaction information includes the second response.

[0026] In one possible implementation, the second interaction information corresponding to the target agent is obtained in the following manner:

[0027] Based on the question and the reason information corresponding to the target agent, the second interaction information of the target agent is obtained through the interaction between the target agent and other agents.

[0028] In one possible implementation, the second interaction information corresponding to the target agent is obtained in the following manner:

[0029] Based on the question and the reason information corresponding to multiple agents, including the target agent, the second interaction information of the target agent is obtained through the interaction between the target agent and other agents.

[0030] In one possible implementation, the apparatus further includes:

[0031] The model training module is used to obtain the label of the cause information corresponding to each agent, and the weight corresponding to the cause information; the weight indicates the degree of positive influence of using the reflector on the accuracy of the obtained second interaction information compared with not using the reflector;

[0032] The reflector is fine-tuned based on the labels and weights.

[0033] In one possible implementation, the agent is a Large Language Model (LLM).

[0034] In a third aspect, this application provides a computing device. The computing device includes a memory and a processor; the memory stores code, and the processor is configured to execute the code, wherein when the code is executed, the computing device performs a method as described in the first aspect or any implementation thereof.

[0035] In a fourth aspect, this application provides a chip. The chip includes at least one processing unit and an interface circuit, the interface circuit being used to provide program instructions or data to the at least one processing unit, the at least one processing unit being used to execute the program instructions to implement a method as described in the first aspect or any implementation thereof.

[0036] In a fifth aspect, this application provides a computing device cluster. The computing device cluster includes at least one computing device, which includes at least one processor and at least one memory. The at least one processor and the at least one memory communicate with each other. The at least one processor is configured to execute instructions stored in the at least one memory to cause the computing device or the computing device cluster to perform a method as described in the first aspect or any implementation thereof.

[0037] In a sixth aspect, this application provides a computer-readable storage medium storing instructions that instruct a computing device or a cluster of computing devices to perform the data processing method described in the first aspect or any implementation thereof.

[0038] In a seventh aspect, this application provides a computer program product containing instructions that, when run on a computing device or a cluster of computing devices, causes the computing device or cluster of computing devices to perform the data processing method described in the first aspect or any implementation thereof.

[0039] The implementations provided in the above aspects can be further combined in this application to provide additional implementations.

Attached Figure Description

[0040] Figure 1A is a schematic diagram of a structural framework for artificial intelligence.

[0041] Figures 1B and 2 are schematic diagrams of the application system framework of the present invention;

[0042] Figure 3 is a schematic diagram of an optional hardware structure for the terminal;

[0043] Figure 4 is a schematic diagram of a server structure;

[0044] Figure 5 is a schematic diagram of a system architecture according to this application;

[0045] Figure 6 illustrates the process of a cloud service.

[0046] Figure 7 is a flowchart illustrating a data processing method provided in an embodiment of this application;

[0047] Figures 8 to 9C are schematic diagrams of an application architecture according to an embodiment of this application;

[0048] Figure 9D illustrates a beneficial effect of an embodiment of this application;

[0049] Figure 10 is a schematic diagram of a data processing device provided in an embodiment of this application;

[0050] Figure 11 is a schematic diagram of a terminal device provided in an embodiment of this application;

[0051] Figure 12 is a schematic diagram of a server structure provided in an embodiment of this application;

[0052] Figure 13 is a schematic diagram of a chip structure provided in an embodiment of this application.

Detailed Implementation

[0053] The embodiments of this application are described below with reference to the accompanying drawings. The terminology used in the implementation section of this application is for explaining specific embodiments only and is not intended to limit the scope of this application.

[0054] Those skilled in the art will recognize that, as technology advances and new scenarios emerge, the technical solutions provided in the embodiments of this application are equally applicable to similar technical problems.

[0055] The terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such terms are interchangeable where appropriate; this is merely a way of distinguishing objects with the same attributes in the embodiments of this application. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion, so that a process, method, system, product, or apparatus that comprises a series of elements is not necessarily limited to those elements, but may include other elements not explicitly listed or inherent to those processes, methods, products, or apparatuses.

[0056] The terms "substantially," "about," and similar terms used in this application are used as approximations, not as terms of degree, and are intended to take into account the inherent biases of measurements or calculations known to those skilled in the art. Furthermore, the term "may" used in describing embodiments of this application refers to "one or more possible embodiments." The terms "use," "using," and "used" used in this application can be considered synonymous with the terms "utilize," "utilizing," and "utilized," respectively. Additionally, the term "exemplary" is intended to refer to an instance or illustration.

[0057] First, the overall workflow of an artificial intelligence system is described with reference to Figure 1A, which is a structural diagram of the main framework of artificial intelligence. The framework is elaborated along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from data acquisition to processing, for example the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data undergoes a condensation process from data to information to knowledge to wisdom. The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (the technologies that provide and process it) to the industrial ecosystem of the system.

[0058] (1) Infrastructure

[0059] Infrastructure provides computing power to support artificial intelligence systems, enabling communication with the external world and providing support through a basic platform. This communication occurs through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); and the basic platform includes distributed computing frameworks and related platform guarantees and support, which may include cloud storage and computing, interconnected networks, etc. For example, sensors communicate with the outside world to acquire data, and this data is provided to intelligent chips in the distributed computing system provided by the basic platform for computation.

[0060] (2) Data

[0061] Data at the layer above the infrastructure represents the data sources in the field of artificial intelligence. The data involves graphics, images, voice, and text, as well as IoT data from traditional devices, including business data from existing systems and sensor data such as force, displacement, liquid level, temperature, and humidity.

[0062] (3) Data processing

[0063] Data processing typically includes methods such as data training, machine learning, deep learning, search, reasoning, and decision-making.

[0064] Among them, machine learning and deep learning can perform intelligent information modeling, extraction, preprocessing, and training on data, including symbolization and formalization.

[0065] Reasoning refers to the process in which, in a computer or intelligent system, the machine thinks and solves problems by simulating human intelligent reasoning, based on reasoning control strategies and using formalized information. Typical functions include search and matching.

[0066] Decision-making refers to the process of making decisions based on intelligent information after reasoning, and it typically provides functions such as classification, sorting, and prediction.

[0067] (4) General ability

[0068] After the data processing mentioned above, the results of the data processing can be used to form some general capabilities, such as algorithms or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.

[0069] (5) Smart Products and Industry Applications

[0070] Intelligent products and industry applications refer to products and applications of artificial intelligence systems in various fields. They are the encapsulation of overall artificial intelligence solutions, productizing intelligent information decision-making and realizing practical applications. Their application areas mainly include: intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, smart cities, etc.

[0071] Next, the application scenarios of this application are introduced. This application can be applied to, but is not limited to, applications with AI-generated content (AIGC) functionality (hereinafter referred to as text generation applications) or cloud services provided by cloud-side servers, which are introduced separately below:

[0072] I. Text Generation Applications

[0073] The product form of this application embodiment can be a text generation application. Text generation applications can run on terminal devices or cloud-based servers.

[0074] In one possible implementation, a text generation application can perform a text generation task based on input data (e.g., a question), wherein the text generation application can perform the text generation task in response to the input data (e.g., a question) and obtain the text generation result.

[0075] In one possible implementation, a user can open a text generation application installed on a terminal device and enter input data (e.g., a question). The text generation application can generate text from the input data using the methods provided in the embodiments of this application and present the text generation result to the user (the presentation method may include, but is not limited to, displaying, saving, uploading to the cloud, etc.).

[0076] In one possible implementation, a user can open a text generation application installed on a terminal device and input data. The text generation application can then send the input data to a cloud-based server. The cloud-based server uses the method provided in this application to generate text from the input data and sends the generated text back to the terminal device. The terminal device can then present the generated text to the user (the presentation method may include, but is not limited to, displaying, saving, or uploading to the cloud).

[0077] The following sections will introduce the text generation application in this application from the perspectives of functional architecture and product architecture that implements the functions.

[0078] Referring to Figure 1B, which is a schematic diagram of the functional architecture of the text generation application in this embodiment of the application:

[0079] In one possible implementation, as shown in Figure 1B, a text generation application 102 may receive input parameters 101 (e.g., containing input data) and produce a text generation result 103. The text generation application 102 may execute on at least one computer system and includes computer code that, when executed by one or more computers, causes the computers to run a natural language model trained by the methods provided in the embodiments of this application.

[0080] Referring to Figure 2, which is a schematic diagram of the entity architecture of the text generation application in an embodiment of this application:

[0081] Referring to Figure 2, which illustrates a system architecture, the system may include a terminal 100 and a server 200. The server 200 may include one or more servers (Figure 2 illustrates this with one server as an example), and the server 200 may provide text generation services to one or more terminals.

[0082] The terminal 100 may have a text generation application installed or a webpage related to text generation open. The application or webpage can provide an interface. The terminal 100 can receive relevant parameters input by the user on the text generation interface and send the parameters to the server 200. The server 200 can obtain the processing result based on the received parameters and return the processing result to the terminal 100.

[0083] It should be understood that in some optional implementations, the terminal 100 can also complete the action of obtaining the processing result based on the received parameters on its own, without the need for the server to cooperate. This application embodiment is not limited to this.

[0084] The product form of terminal 100 in Figure 2 will be described next;

[0085] The terminal 100 in this application embodiment can be a mobile phone, tablet computer, wearable device, vehicle device, augmented reality (AR) / virtual reality (VR) device, laptop computer, ultra-mobile personal computer (UMPC), netbook, personal digital assistant (PDA), etc., and this application embodiment does not impose any restrictions on it.

[0086] Figure 3 shows a schematic diagram of an optional hardware structure for terminal 100.

[0087] Referring to Figure 3, terminal 100 may include components such as a radio frequency unit 110, a memory 120, an input unit 130, a display unit 140, a camera 150 (optional), an audio circuit 160 (optional), a speaker 161 (optional), a microphone 162 (optional), a processor 170, an external interface 180, and a power supply 190. Those skilled in the art will understand that Figure 3 is merely an example of a terminal or multi-functional device and does not constitute a limitation on the terminal or multi-functional device; it may include more or fewer components than illustrated, or combine certain components, or use different components.

[0088] The input unit 130 can be used to receive input numerical or character information, and to generate key signal inputs related to user settings and function control of the portable multi-functional device. Specifically, the input unit 130 may include a touchscreen 131 (optional) and / or other input devices 132. The touchscreen 131 can collect touch operations performed by the user on or near it (such as operations performed by the user using fingers, knuckles, styluses, or any suitable object on or near the touchscreen), and drive the corresponding connection devices according to a pre-set program. The touchscreen can detect the user's touch actions, convert the touch actions into touch signals and send them to the processor 170, and can receive and execute commands sent by the processor 170; the touch signal includes at least touch point coordinate information. The touchscreen 131 can provide an input interface and an output interface between the terminal 100 and the user. In addition, various types of touchscreens, such as resistive, capacitive, infrared, and surface acoustic wave, can be used to implement the touchscreen. Besides the touchscreen 131, the input unit 130 may also include other input devices. Specifically, other input devices 132 may include, but are not limited to, one or more of the following: physical keyboard, function keys (such as volume control buttons, power buttons, etc.), trackball, mouse, joystick, etc.

[0089] Other input devices 132 can receive input data, etc.

[0090] The display unit 140 can be used to display information input by the user or information provided to the user, various menus of the terminal 100, interactive interfaces, file display, and / or playback of any multimedia file. In this embodiment, the display unit 140 can be used to display the interface of a text generation application, the generated text generation results, etc.

[0091] The memory 120 can be used to store instructions and data. The memory 120 may primarily include an instruction storage area and a data storage area. The data storage area can store various types of data, such as multimedia files and text. The instruction storage area can store software units such as operating systems, applications, and instructions required for at least one function, or subsets or extended sets thereof. It may also include non-volatile random access memory. It provides the processor 170 with hardware, software, and data resources for managing the computing device, supporting control software and applications. It is also used for storing multimedia files, as well as storing running programs and applications.

[0092] The processor 170 is the control center of the terminal 100. It connects various parts of the terminal 100 via various interfaces and lines. By running or executing instructions stored in the memory 120 and calling data stored in the memory 120, it performs various functions and processes data of the terminal 100, thereby controlling the terminal device as a whole. Optionally, the processor 170 may include one or more processing units; preferably, the processor 170 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface, and applications, and the modem processor mainly handles wireless communication. It is understood that the modem processor may not be integrated into the processor 170. In some embodiments, the processor and memory can be implemented on a single chip; in some embodiments, they can also be implemented separately on independent chips. The processor 170 can also be used to generate corresponding operation control signals, send them to the corresponding components of the computing processing device, read and process data in the software, especially read and process data and programs in the memory 120, so that the various functional modules therein perform corresponding functions, thereby controlling the corresponding components to act according to the instructions.

[0093] The memory 120 can be used to store software code related to the data processing method, and the processor 170 can execute the steps of the data processing method and can also schedule other units (such as the above-mentioned input unit 130 and display unit 140) to achieve the corresponding functions.

[0094] The radio frequency unit 110 (optional) can be used for receiving and transmitting signals during information transmission or calls. For example, it can receive downlink information from the base station and process it for the processor 170; additionally, it can transmit uplink data to the base station. Typically, the RF circuit includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low-noise amplifier (LNA), a duplexer, etc. Furthermore, the radio frequency unit 110 can also communicate wirelessly with network devices and other devices. This wireless communication can use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), etc.

[0095] In this embodiment of the application, the radio frequency unit 110 can send input data to the server 200 and receive the text generation result sent by the server 200.

[0096] It should be understood that the radio frequency unit 110 is optional and can be replaced with other communication interfaces, such as a network port.

[0097] The terminal 100 also includes a power supply 190 (such as a battery) that supplies power to various components. Preferably, the power supply can be logically connected to the processor 170 through a power management system, thereby enabling functions such as charging, discharging, and power consumption management through the power management system.

[0098] Terminal 100 also includes an external interface 180, which can be a standard Micro USB interface or a multi-pin connector, which can be used to connect terminal 100 to other devices for communication or to connect a charger to charge terminal 100.

[0099] Although not shown, terminal 100 may also include a flash, a wireless fidelity (WiFi) module, a Bluetooth module, and sensors with various functions, which will not be described in detail here. Some or all of the methods described below can be applied to terminal 100 as shown in Figure 3.

[0100] The product form of the server 200 in Figure 2 is described next.

[0101] Figure 4 provides a schematic diagram of the structure of a server 200. As shown in Figure 4, the server 200 includes a bus 201, a processor 202, a communication interface 203, and a memory 204. The processor 202, the memory 204, and the communication interface 203 communicate with each other via the bus 201.

[0102] Bus 201 can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of representation, only one thick line is used in Figure 4, but this does not indicate that there is only one bus or one type of bus.

[0103] The processor 202 can be any one or more of the following processors: central processing unit (CPU), graphics processing unit (GPU), microprocessor (MP), or digital signal processor (DSP).

[0104] Memory 204 may include volatile memory, such as random access memory (RAM). Memory 204 may also include non-volatile memory, such as read-only memory (ROM), flash memory, hard disk drive (HDD), or solid state drive (SSD).

[0105] The memory 204 can be used to store software code related to the data processing method. The processor 202 can execute this code to perform the steps of the data processing method, and can also schedule other units to achieve the corresponding functions.

[0106] It should be understood that the aforementioned terminal 100 and server 200 can be centralized or distributed devices. The processors (e.g., processor 170 and processor 202) in the aforementioned terminal 100 and server 200 can be hardware circuits (such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), general-purpose processors, digital signal processors (DSPs), microprocessors or microcontrollers, etc.) or combinations of these hardware circuits. For example, the processor can be a hardware system with instruction execution capabilities, such as a CPU or DSP, or a hardware system without instruction execution capabilities, such as an ASIC or FPGA, or a combination of the aforementioned hardware systems without instruction execution capabilities and hardware systems with instruction execution capabilities.

[0107] It should be understood that the steps related to the model inference process in the embodiments of this application involve AI-related operations. When performing AI operations, the instruction execution architecture of the terminal device and the server is not limited to the processor-memory architecture described above. The system architecture provided in the embodiments of this application will be described in detail below with reference to Figure 5.

[0108] Figure 5 is a schematic diagram of the system architecture provided in an embodiment of this application. As shown in Figure 5, the system architecture 500 includes an execution device 510, a training device 520, a database 530, a client device 540, a data storage system 550, and a data acquisition device 560.

[0109] The execution device 510 includes a calculation module 511, an I / O interface 512, a preprocessing module 513, and a preprocessing module 514. The calculation module 511 may include a target model / rule 501, while the preprocessing modules 513 and 514 are optional.

[0110] The execution device 510 can be a terminal device or a server that runs the aforementioned text generation application.

[0111] The data acquisition device 560 is used to collect training samples. Training samples can be program files (including program code and program input data), etc. After collecting the training samples, the data acquisition device 560 stores these training samples in the database 530.

[0112] The training device 520 can train the neural network to be trained on the training samples maintained in the database 530 to obtain the target model / rule 501.

[0113] It should be noted that in practical applications, the training samples maintained in database 530 may not all come from the data acquisition device 560; they may also be received from other devices. Furthermore, it should be noted that training device 520 may not necessarily train the target model / rule 501 entirely based on the training samples maintained in database 530; it may also obtain training samples from the cloud or other sources for model training. The above description should not be construed as limiting the embodiments of this application.

[0114] The target model / rule 501 trained by the training device 520 can be applied to different systems or devices, such as the execution device 510 shown in Figure 5. The execution device 510 can be a terminal, such as a mobile terminal, tablet computer, laptop computer, augmented reality (AR) / virtual reality (VR) device, vehicle terminal, etc., or it can be a server, etc.

[0115] Specifically, the training device 520 can transfer the trained model to the execution device 510.

[0116] In Figure 5, the execution device 510 is configured with an input / output (I / O) interface 512 for data interaction with external devices. Users can input data to the I / O interface 512 through the client device 540 (e.g., input data in the embodiment of this application).

[0117] Preprocessing modules 513 and 514 are used to preprocess the input data received from the I / O interface 512. It should be understood that preprocessing modules 513 and 514 may be absent, or only one preprocessing module may be used. When preprocessing modules 513 and 514 are absent, the calculation module 511 can be used directly to process the input data.

[0118] During the preprocessing of input data by the execution device 510, or during the calculation module 511 of the execution device 510 performing calculations and other related processes, the execution device 510 can call data, code, etc. in the data storage system 550 for corresponding processing, or store the data, instructions, etc. obtained from the corresponding processing into the data storage system 550.

[0119] Finally, the I / O interface 512 provides the processing results (such as text generation results) to the client device 540, thereby providing them to the user.

[0120] In the scenario shown in Figure 5, the user can manually provide input data, which can be done through the interface provided by I / O interface 512. Alternatively, the client device 540 can automatically send input data to I / O interface 512. If user authorization is required for the client device 540 to automatically send input data, the user can set the corresponding permissions in the client device 540. The user can view the output results of the execution device 510 on the client device 540, which can be presented in various forms such as display, sound, or animation. The client device 540 can also act as a data acquisition terminal, collecting the input data fed to the I / O interface 512 and the output results returned by it, as shown in the figure, and storing them as new sample data in database 530. Alternatively, the I / O interface 512 can store its input data and output results directly in database 530 as new sample data, without going through the client device 540.

[0121] It is worth noting that Figure 5 is merely a schematic diagram of a system architecture provided in an embodiment of this application. The positional relationships between the devices, components, modules, etc., shown in the figure do not constitute any limitation. For example, in Figure 5, the data storage system 550 is an external memory relative to the execution device 510. In other cases, the data storage system 550 can also be placed in the execution device 510. It should be understood that the aforementioned execution device 510 can be deployed in the client device 540.

[0122] From the inference side of the model:

[0123] In this embodiment, the computing module 511 of the execution device 510 can obtain the code stored in the data storage system 550 to implement the steps related to the model inference process in this embodiment.

[0124] In this embodiment of the application, the computing module 511 of the execution device 510 may include hardware circuits (such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), general-purpose processors, digital signal processors (DSPs), microprocessors or microcontrollers, etc.) or combinations of these hardware circuits. For example, the computing module 511 may be a hardware system with instruction execution capabilities, such as a CPU or DSP, or a hardware system without instruction execution capabilities, such as an ASIC or FPGA, or a combination of the aforementioned hardware systems without instruction execution capabilities and hardware systems with instruction execution capabilities.

[0125] Specifically, the computing module 511 of the execution device 510 can be a hardware system with the function of executing instructions. The steps related to the model inference process provided in this application embodiment can be software code stored in the memory. The computing module 511 of the execution device 510 can obtain the software code from the memory and execute the obtained software code to implement the steps related to the model inference process provided in this application embodiment.

[0126] It should be understood that the computing module 511 of the execution device 510 can be a combination of a hardware system without the function of executing instructions and a hardware system with the function of executing instructions. Some steps related to the model inference process provided in the embodiments of this application can also be implemented by the hardware system in the computing module 511 of the execution device 510 without the function of executing instructions, which is not limited here.

[0127] From the training side of the model:

[0128] In this embodiment, the training device 520 can obtain the code stored in the memory (not shown in Figure 5, which can be integrated into the training device 520 or deployed separately from the training device 520) to implement the steps related to model training in this embodiment.

[0129] In this embodiment of the application, the training device 520 may include hardware circuits (such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), general-purpose processors, digital signal processors (DSPs), microprocessors or microcontrollers, etc.) or combinations of these hardware circuits. For example, the training device 520 may be a hardware system with instruction execution capabilities, such as a CPU or DSP, or a hardware system without instruction execution capabilities, such as an ASIC or FPGA, or a combination of the aforementioned hardware systems without instruction execution capabilities and hardware systems with instruction execution capabilities.

[0130] It should be understood that the training device 520 can be a combination of a hardware system without the function of executing instructions and a hardware system with the function of executing instructions. Some steps related to model training provided in the embodiments of this application can also be implemented by the hardware system in the training device 520 without the function of executing instructions, which is not limited here.

[0131] II. Cloud-based text generation services provided by the server:

[0132] In one possible implementation, the server can provide text generation services to the client side through an application programming interface (API).

[0133] In this process, the terminal device can send relevant parameters (such as input data) to the server through the API provided by the cloud. The server can obtain the processing result (such as text generation result) based on the received parameters and return the processing result to the terminal.

[0134] The description of the terminal and server can be found in the above embodiments, and will not be repeated here.

[0135] Figure 6 illustrates the process of using a text generation cloud service provided by a cloud platform.

[0136] 1. Activate and purchase the text generation service.

[0137] 2. Users can download the software development kit (SDK) corresponding to the text generation service. Cloud platforms typically provide multiple development versions of the SDK for users to choose from based on their development environment needs, such as a Java version SDK, a Python version SDK, a PHP version SDK, an Android version SDK, etc.

[0138] 3. After downloading the corresponding version of the SDK to their local machine according to their needs, users can import the SDK project into their local development environment, configure and debug it in the local development environment, and develop other functions in the local development environment to form an application that integrates text generation capabilities.

[0139] 4. When a text generation application needs to generate text, it can trigger a text generation API call. When the application triggers text generation, it sends an API request to the running instance of the text generation service in the cloud environment. The API request carries the input data, which is then processed by the running instance in the cloud environment to obtain the processing result (such as the text generation result).

[0140] 5. The cloud environment returns the processing result to the application, thus completing a text generation service call.
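The call in steps 4 and 5 can be illustrated with a minimal sketch. The endpoint URL, the JSON field names (`input`, `result`), and the bearer-token header are hypothetical placeholders chosen for illustration; they are not defined by this application or by any particular cloud platform. The sketch only builds the request and parses a response body; it does not send anything over the network.

```python
import json
import urllib.request

# Hypothetical endpoint of the text generation service (assumption).
API_URL = "https://example.invalid/v1/text-generation"


def build_request(input_data: str, api_key: str) -> urllib.request.Request:
    """Build the API request that carries the input data (step 4)."""
    body = json.dumps({"input": input_data}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


def parse_response(raw: bytes) -> str:
    """Extract the processing result returned by the cloud (step 5)."""
    return json.loads(raw.decode("utf-8"))["result"]


request = build_request("What is the capital of France?", api_key="PLACEHOLDER")
```

In an application, the request would be passed to `urllib.request.urlopen` and the returned bytes handed to `parse_response`.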

[0141] In addition to applications and cloud services, the implementation of this application can also be in a large-scale application SDK.

[0142] Since the embodiments of this application involve a large number of neural network applications, for ease of understanding, the relevant terms and concepts such as neural networks involved in the embodiments of this application will be introduced below.

[0143] (1) Neural Network

[0144] A neural network can be composed of neural units. A neural unit can be defined as a computational unit that takes $x_s$ (i.e., input data) and an intercept of 1 as input. The output of this computational unit can be:

$$h_{W,b}(x) = f\left(W^{T}x\right) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

[0145] Where s = 1, 2, ..., n, where n is a natural number greater than 1, $W_s$ is the weight of $x_s$, and b is the bias of the neural unit. f is the activation function of the neural unit, used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer, and the activation function can be the sigmoid function. A neural network is a network formed by connecting multiple of the above-mentioned individual neural units together, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field, which can be a region composed of several neural units.
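As an illustration, the computation of a single neural unit (a weighted sum of the inputs plus the bias, passed through a sigmoid activation) can be sketched as follows. The function and variable names are chosen here for illustration only.

```python
import math
from typing import Callable, Sequence


def sigmoid(z: float) -> float:
    """Sigmoid activation function f."""
    return 1.0 / (1.0 + math.exp(-z))


def neural_unit(
    x: Sequence[float],
    w: Sequence[float],
    b: float,
    f: Callable[[float], float] = sigmoid,
) -> float:
    """Return f(sum_s W_s * x_s + b) for one neural unit."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    z = sum(ws * xs for ws, xs in zip(w, x)) + b
    return f(z)


# The weighted sum is 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0, and sigmoid(0) = 0.5.
output = neural_unit([1.0, 2.0], [0.5, -0.25], 0.0)
```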

[0146] (2) Backpropagation algorithm

[0147] Convolutional neural networks can employ backpropagation (BP) to correct the parameters in the initial super-resolution model during training, thereby reducing the reconstruction error loss. Specifically, forward propagation of the input signal to the output generates an error loss; this error loss information is then propagated back to update the parameters in the initial super-resolution model, leading to convergence of the error loss. The backpropagation algorithm is an error-loss-driven backpropagation process aimed at obtaining the optimal parameters of the super-resolution model, such as the weight matrix.

[0148] (3) Loss Function

[0149] In training a deep neural network, to ensure the output closely approximates the desired predicted value, we compare the network's prediction with the target value. Based on the difference, we update the weight vector of each layer (usually pre-configuring parameters before the initial update). For example, if the prediction is too high, the weight vector is adjusted to predict a lower value. This adjustment continues until the deep neural network predicts the target value or a value very close to it. Therefore, we need to predefine "how to compare the difference between the predicted and target values," which is the loss function or objective function. These are important equations used to measure the difference between the predicted and target values. Taking the loss function as an example, a higher output value (loss) indicates a greater difference, and training the deep neural network becomes a process of minimizing this loss.
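The idea of training as loss minimization can be shown with a deliberately small sketch: a model with one weight, a squared-error loss, and plain gradient descent. The squared-error loss, the learning rate, and the step count are assumptions made for the example; this application does not prescribe them.

```python
from typing import List, Tuple


def squared_error(prediction: float, target: float) -> float:
    """Loss: larger values mean the prediction is further from the target."""
    return (prediction - target) ** 2


def train_weight(
    x: float, target: float, w: float = 0.0, lr: float = 0.1, steps: int = 50
) -> Tuple[float, List[float]]:
    """Fit prediction = w * x by repeatedly stepping against the loss gradient."""
    losses = []
    for _ in range(steps):
        prediction = w * x
        losses.append(squared_error(prediction, target))
        grad = 2.0 * (prediction - target) * x  # d(loss)/dw
        w -= lr * grad  # a prediction that is too high lowers w, and vice versa
    return w, losses


weight, loss_history = train_weight(x=1.0, target=2.0)
```

The recorded losses shrink at every step, which is the "process of minimizing this loss" in miniature.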

[0150] (4) Intelligent agent: An independent entity that can think and interact with its environment.

[0151] (5) Intelligent agents based on large language models: Based on large language models, task planning, decision-making mechanisms and environmental interaction capabilities are combined to enable them to perform complex tasks in dynamic environments.

[0152] (6) Multi-agent collaborative system based on large language model: A system composed of multiple agents based on a large language model. The agents in the system usually have different domain skills and expertise, and cooperate to complete tasks by interacting with the environment multiple times according to a certain division of labor.

[0153] (7) Reward function: Defines the learning objective of the agent. At each step of the interaction, the environment passes a value to the agent, which is called the reward. The agent's goal is to maximize the total reward during the interaction.

[0154] In recent years, with the emergence of large language models, intelligent agents based on large language models have become a research hotspot in the field of artificial intelligence. Leveraging the powerful language expression, reasoning, and planning capabilities of large models, autonomous intelligent agents based on these models can effectively understand and generate instructions, interact and make decisions in complex environments, and have achieved good results in various downstream tasks. To cope with more complex task scenarios, researchers have proposed constructing multi-agent systems based on single-agent research, breaking down complex tasks into several sub-tasks and assigning them to multiple intelligent agents with different domain skills and expertise.

[0155] In multi-agent systems, the division of labor among agents largely relies on the researcher's prior knowledge, resulting in poor accuracy and long generation times for the generated responses. Therefore, improving the accuracy of response generation through multi-agent systems is a pressing technical problem that needs to be solved.

[0156] To improve the performance of multi-agent systems across various task scenarios, researchers have designed various multi-agent collaboration frameworks and encoded the roles and collaboration methods of agents as prompts. However, limited by the contextual understanding capabilities of large models and the complexity of division of labor and collaboration, these frameworks cannot fully leverage the collaborative capabilities of agents. Clearly defining roles and facilitating communication and collaboration is a crucial step in enhancing the problem-solving capabilities of multi-agent systems. A natural approach is to apply supervised fine-tuning to the agent collaboration process within the system. However, this leads to unacceptable computational resource consumption and impacts the generalization ability of large language models. Inspired by the reflective progress of human collaborators in team collaboration, this application proposes optimizing the multi-agent collaboration process using an agent's self-reflection mechanism. This self-reflection mechanism can transform numerical rewards from the environment into textual feedback and add it as additional context to the prompts of the agent's decision-making model. This structure can improve the agent's role understanding and communication / collaboration capabilities in specific tasks without sacrificing the agent's general capabilities.

[0157] While self-reflection mechanisms can iteratively optimize decision-making models based on prompts, effective reflection still requires the model to have a sufficient understanding of complex scenarios to reasonably analyze the reasons for task failures. This poses a challenge to large language models with frozen parameters. To address this, Retroformer proposes using open-source language models (such as Llama and ChatGLM) as reflectors, utilizing numerical feedback from the environment as supervisory signals to fine-tune the reflector in single-agent scenarios. However, directly extending this method to multi-agent collaborative scenarios presents several problems. On one hand, numerical feedback from the environment can only evaluate the overall effect of cooperation, not the contribution of individual agents in the system. Directly using it as a supervisory signal in multi-agent systems leads to serious credit assignment problems, making it difficult to encourage agents to improve individual behavior based on collective interests, or even make personal sacrifices for the collective good. On the other hand, in multi-agent systems, the reflector is required to perceive the role information of agents, understand the interaction relationships between agents, and make targeted improvements based on the task division of roles. A natural approach is to train a reflector for each agent. However, as the number of agents increases, the time and computational cost required for training the reflectors will also increase, posing a challenge to applications in real-world scenarios.

[0158] Therefore, for multi-agent cooperation scenarios, this application proposes training a plug-and-play reflector (COPPER) to efficiently reflect on the multi-agent cooperation process, thereby improving the performance of the multi-agent system on complex tasks. To address the first challenge mentioned above, this application proposes constructing counterfactual rewards as a supervisory signal for fine-tuning individual agents. Specifically, this application first adds the reflections of all agents to the prompts of the corresponding decision model to obtain an overall reward. Then, the reflections of each agent are removed sequentially, and the process of interacting with the environment is repeated, with the obtained rewards used as marginal rewards. The difference between the overall reward and the marginal reward is used as the contribution evaluation of an individual agent in the cooperation process, and this is used as a supervisory signal for fine-tuning the agent reflector. The reflector is optimized using a proximal policy optimization method to maximize the environmental reward. To address the second challenge, considering the homogeneity between different reflectors, i.e., their action spaces (reflections) are consistent, and their optimization objectives are also completely consistent (aiding in solving the overall task), this application proposes that all agents share the same reflector, and through carefully designed prompts, the shared reflector can perceive the role information of different agents. This sharing mechanism not only reduces the demand for computing resources, but also makes the model training data more abundant and the training more stable.
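The counterfactual reward construction described above can be sketched as follows. This is a minimal illustration, not the training procedure itself: `run_episode` is a hypothetical callback standing in for a full interaction with the environment, and the toy reward function exists only to make the example runnable. The proximal policy optimization step that consumes these signals is omitted.

```python
from typing import Callable, Dict

Reflections = Dict[str, str]  # agent name -> reflection text


def counterfactual_rewards(
    reflections: Reflections,
    run_episode: Callable[[Reflections], float],
) -> Dict[str, float]:
    """Score each agent's reflection as overall reward minus marginal reward."""
    # Overall reward: every agent's reflection is present in its prompt.
    overall = run_episode(reflections)
    contributions = {}
    for agent in reflections:
        # Marginal reward: repeat the interaction with this agent's reflection removed.
        without_agent = {a: r for a, r in reflections.items() if a != agent}
        marginal = run_episode(without_agent)
        contributions[agent] = overall - marginal
    return contributions


def toy_episode(active: Reflections) -> float:
    """Toy environment (assumption): each present reflection adds a fixed gain."""
    gains = {"teacher": 0.5, "student": 0.25}
    return 0.25 + sum(gains[a] for a in active)


signals = counterfactual_rewards(
    {"teacher": "Ask narrower questions.", "student": "Cite the source passage."},
    toy_episode,
)
```

Note that each agent costs one extra episode, so scoring N agents takes N + 1 interactions with the environment.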

[0159] 1) The division of labor among agents in multi-agent systems largely relies on the researcher's prior knowledge. Clearly defining the division of labor and facilitating communication and collaboration is a crucial step in improving the problem-solving capabilities of multi-agent systems for complex issues.

[0160] 2) Due to the limitations of the contextual understanding capabilities of large models and the complexity of division of labor, such frameworks cannot fully leverage the collaborative capabilities of agents.

[0161] For multi-agent cooperative scenarios, this invention proposes training a plug-and-play reflector, COPPER, to efficiently reflect on the multi-agent cooperative process, thereby improving the performance of multi-agent systems on complex tasks. To address the first challenge, this invention proposes constructing counterfactual rewards as a supervisory signal for fine-tuning individual agents. Specifically, this invention first adds the reflections of all agents to the prompts of the corresponding decision model to obtain an overall reward. Then, it sequentially removes the reflections of each agent, repeating the interaction with the environment, and uses the resulting rewards as marginal rewards. The difference between the overall reward and the marginal rewards is used as the contribution evaluation of an individual agent in the cooperative process, and this is used as a supervisory signal for fine-tuning the agent reflector. The reflector is optimized using a proximal policy optimization method to maximize environmental rewards. To address the second challenge, considering the homogeneity between different reflectors—that is, their action spaces (reflections) are consistent, and their optimization objectives are also completely consistent (aiding in solving the overall task)—this invention proposes that all agents share the same reflector, and through carefully designed prompts, the shared reflector can perceive the role information of different agents. This sharing mechanism not only reduces the demand for computing resources, but also makes the model training data more abundant and the training more stable.

[0162] The data processing method of this application is described below. Referring to the flowchart of the data processing method shown in Figure 7, this method can be executed by a data processing system. The method includes:

[0163] 701. Acquire a question;

[0164] 702. Based on the question, obtain a plurality of pieces of first interaction information through interaction between multiple agents, where the pieces of first interaction information include a first response to the question;

[0165] To address more complex task scenarios, complex tasks can be broken down and assigned to multiple expert agents, thereby improving the model's performance on complex tasks by constructing a multi-agent system. For multi-hop reasoning question-answering scenarios, this application embodiment constructs a "teacher-student" collaborative model. In mathematics and chess scenarios, this application embodiment adopts a collaborative debate mechanism. The agent collaboration mechanism adopted in this application embodiment is shown in Figure 8.

[0166] Specifically, agents in the environment can be configured to speak in a specified order, and efficient communication and collaboration between agents can be achieved by maintaining a shared message pool. For the k-th question in the environment, at time t, agent i (i = t mod N) first retrieves the previous historical interaction records from the message pool and obtains the current environment state $s_{k,t}$. The decision-making process of the agent can be represented as:

[0167] Where $p_i$ defines the role and action space of the current agent. After making a decision, the agent adds the new message $(s_{k,t}, a_{k,t})$ to the message pool. However, in actual interaction, because of the number of agents and decision steps, the historical interaction records may become excessively long. Therefore, for each agent in the environment, this embodiment introduces a context model to iteratively update the interaction history and uses it as the agent's short-term memory. The short-term memory update process can be represented as:

[0168] The decision-making process of the corresponding intelligent agent can be represented as:
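The speaking order, shared message pool, and short-term memory described in this passage can be sketched as below. Everything named here is an illustrative assumption: `Agent`, `summarize`, and `run_round` are hypothetical, the context model is replaced by simple truncation, and the toy environment treats the last action as the next state. In a real system the `decide` callback would wrap a large language model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (state, action)


@dataclass
class Agent:
    role_prompt: str  # plays the part of p_i: role and action space
    decide: Callable[[str, str, str], str]  # (role_prompt, memory, state) -> action
    memory: str = ""  # short-term memory maintained by the context model


def summarize(memory: str, message: Message, max_chars: int = 200) -> str:
    """Stand-in context model: fold the new message in, keep the recent tail."""
    state, action = message
    updated = f"{memory} [{state} -> {action}]".strip()
    return updated[-max_chars:]


def run_round(agents: List[Agent], initial_state: str, steps: int) -> List[Message]:
    """Let agents speak in a fixed order (i = t mod N) over a shared message pool."""
    pool: List[Message] = []
    state = initial_state
    for t in range(steps):
        agent = agents[t % len(agents)]
        action = agent.decide(agent.role_prompt, agent.memory, state)
        message = (state, action)
        pool.append(message)
        for a in agents:  # every agent's short-term memory sees the new message
            a.memory = summarize(a.memory, message)
        state = action  # toy environment (assumption)
    return pool


def toy_policy(role_prompt: str, memory: str, state: str) -> str:
    return f"{role_prompt}:{len(memory)}"


team = [Agent("teacher", toy_policy), Agent("student", toy_policy)]
pool = run_round(team, "Q", steps=3)
```

Bounding the memory length is what keeps the prompt size stable as the number of agents and decision steps grows.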

[0169] 703. When the first response is not a correct response to the question, determine, by means of a reflector and based on the pieces of first interaction information, the cause information corresponding to each of the agents, where the cause information is the reason the incorrect response was obtained;

[0170] In one possible implementation, the same reflector is used when determining the cause information corresponding to each of the agents.

[0171] This application proposes a shared reflector for constructing a multi-agent collaborative system. Considering the homogeneity among different reflectors—that is, their action spaces (reflections) are consistent, and their optimization objectives are also completely identical (aiding in solving the overall task)—this application proposes that all agents share the same reflector, and through carefully designed prompts, the shared reflector can perceive the role information of different agents. This sharing mechanism not only reduces the demand for computational resources but also results in more training data and more stable training.
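A minimal sketch of the sharing mechanism: one reflector callable serves every agent, and only the prompt changes per role. The prompt wording and the names `build_reflector_prompt` and `reflect_for_all` are hypothetical; the stub reflector stands in for a fine-tuned language model.

```python
from typing import Callable, Dict, List


def build_reflector_prompt(
    role: str, role_description: str, trajectory: List[str], reward: float
) -> str:
    """Assemble a role-aware prompt so a shared reflector knows whom it reflects for."""
    lines = [
        f"You are reflecting on behalf of the agent with role: {role}.",
        f"Role description: {role_description}",
        f"Environment reward: {reward}",
        "Interaction trajectory:",
    ]
    lines += [f"- {message}" for message in trajectory]
    lines.append("Explain why the team failed and what this agent should do differently.")
    return "\n".join(lines)


def reflect_for_all(
    roles: Dict[str, str],
    trajectory: List[str],
    reward: float,
    reflector: Callable[[str], str],
) -> Dict[str, str]:
    """Call the same reflector once per agent, varying only the prompt."""
    return {
        role: reflector(build_reflector_prompt(role, description, trajectory, reward))
        for role, description in roles.items()
    }


seen_prompts: List[str] = []


def stub_reflector(prompt: str) -> str:
    """Stand-in for the shared reflector model (assumption)."""
    seen_prompts.append(prompt)
    return f"reflection #{len(seen_prompts)}"


reflections = reflect_for_all(
    {"teacher": "Guides the student.", "student": "Answers the question."},
    ["teacher: hint", "student: wrong answer"],
    0.0,
    stub_reflector,
)
```

Because every agent's prompt and outcome passes through the same model, all of them can serve as training examples for that one reflector.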

[0172] When collaboration among the agents fails to produce a correct response, the method proposed in this application reflects on the interaction information obtained by the agents. The purpose of the reflection is to determine why the correct response was not obtained and what each agent contributed to the failure. Based on this information, each agent participates in the next round of multi-agent collaboration to determine the response. This can improve the collaboration capability of the multi-agent system and the accuracy of the generated response, and reduce the time cost of the generation process.

[0173] Inspired by the reflective progress of human collaborators in teamwork, this application proposes to optimize multi-agent collaboration by leveraging a self-reflection mechanism. This mechanism transforms numerical rewards from the environment into textual feedback, adding it as additional context to the agent's decision-making model. This structure enhances the agent's role understanding and collaborative communication capabilities in specific tasks without sacrificing general agent capabilities.

[0174] For example, to enhance the collaborative capabilities of multi-agent systems in specific scenarios while maintaining the generality of agent decision-making models, this application proposes introducing a multi-agent reflection mechanism. Guided by environmental reward signals, reflections generated by a reflector are used to continuously optimize the prompts of the decision-making model. Figure 9A illustrates the details of the multi-agent reflection framework. Specifically, this application adds a reflector to each agent $i$ in the system. i This reflector takes the interaction trajectory between the agent system and the environment, as well as the reward signals from the environment, as input to reflect on the problems existing in the cooperation process and formulate new action plans. It uses the current role information p of the agent... i This is also added to the input of the reflector, thus obtaining a reflection for the current agent's action space. For an agent $i$ in the environment, its reflector... i The process of reflection can be defined as:

[0175] Where k represents the k-th problem in the environment, λ represents the λ-th attempt at solving that problem, and T represents the length of the interaction trajectory during the λ-th attempt. Since the agent's short-term memory is updated iteratively, the short-term memory at time T contains the complete action information of the agent in this interaction. By using the short-term memory of all agents in the system at time T as input to the reflector of the current agent, the agent gains a global perspective and can effectively reflect on the problems it encountered in the cooperation process.

[0176] The reflections generated by agent i are stored in its long-term memory. At the λ-th attempt on problem k, agent i's long-term memory contains all the reflections generated from its previous λ-1 attempts. This application's embodiments incorporate the agent's long-term memory into the prompts of its decision-making model, thereby helping the agent make better decisions. Therefore, for the λ-th attempt at problem k, the agent's decision-making process can be defined as:
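A possible form of the reflection and decision steps, consistent with the definitions in the surrounding paragraphs, is sketched below. The notation is assumed for illustration and is not taken from this application: $M^{\mathrm{ref}}_i$ and $M^{\mathrm{dec}}_i$ denote the reflector and the decision model of agent $i$, $sm$ and $lm$ denote short-term and long-term memory, $N$ is the number of agents, and $o$ and $a$ are the observation and action at step $t$.

$$
\begin{aligned}
\mathrm{ref}_{i,k,\lambda} &= M^{\mathrm{ref}}_i\left(p_i,\; sm_{1,k,\lambda,T},\,\dots,\,sm_{N,k,\lambda,T},\; r_{k,\lambda}\right) \\
lm_{i,k,\lambda} &= \left\{\mathrm{ref}_{i,k,1},\,\dots,\,\mathrm{ref}_{i,k,\lambda-1}\right\} \\
a_{i,k,\lambda,t} &= M^{\mathrm{dec}}_i\left(p_i,\; lm_{i,k,\lambda},\; sm_{i,k,\lambda,t-1},\; o_{i,k,\lambda,t}\right)
\end{aligned}
$$

The first line reflects on the full trajectory of all agents together with the reward, the second collects the reflections of earlier attempts, and the third conditions each action on those reflections.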

[0177] By utilizing the agent's reflection mechanism and guided by environmental reward signals, the agent can learn from historical interaction records, continuously improve its cooperative ability, and realize the self-learning and self-evolution of the multi-agent collaborative system.

[0178] 704. Based on the question and the reason information corresponding to each of the intelligent agents, a second answer to the question is obtained through the interaction between the multiple intelligent agents.

[0179] In one possible implementation, the plurality of agents includes a target agent; based on the question and the cause information corresponding to each agent, second interaction information corresponding to each agent can be obtained through the interaction between the plurality of agents, and the plurality of second interaction information includes the second response.

[0180] In one possible implementation, the second interaction information corresponding to the target agent is obtained by: based on the question and the reason information corresponding to the target agent, the second interaction information of the target agent is obtained through the interaction between the target agent and other agents.

[0181] In one possible implementation, the second interaction information corresponding to the target agent is obtained in the following way: based on the question and the reason information corresponding to multiple agents including the target agent, the second interaction information of the target agent is obtained through the interaction between the target agent and other agents.

[0182] For example, for problem k in the environment, during the λ-th attempt, the agent system first interacts with the environment to generate an interaction trajectory $\tau_{k,\lambda}$. Subsequently, the environment generates a reward signal $r_{k,\lambda}$. This application's embodiments convert environmental signals into natural language, guiding the agents to reflect on historical interactions. In this process, the input and the output of the reflector of agent $i$ are taken as training data. This embodiment uses the reflections generated by all agents in the system on K tasks as training data for a shared reflector. The training data D can be defined as follows:

[0183] Where Λ is the maximum number of attempts and K is the total number of problems in the environment.

[0184] In one possible implementation, the method further includes: obtaining a label for cause information corresponding to each agent, and a weight corresponding to the cause information; the weight indicating the degree of positive impact of using the reflector on the accuracy of the obtained second interaction information compared to not using the reflector; and fine-tuning the reflector according to the label and the weight.

[0185] This application proposes constructing counterfactual rewards as a supervisory signal for fine-tuning individual agents, and using a proximal policy optimization method to fine-tune the agent reflector. First, the reflections of all agents are added to the prompts of the corresponding decision model to obtain the overall reward. Then, the reflections of each agent are removed sequentially, and the interaction with the environment is repeated, with the resulting reward serving as the marginal reward. The difference between the overall reward and the marginal reward is used as an evaluation of the individual agent's contribution in the cooperation process, and this is used as a supervisory signal to fine-tune the agent reflector. The reflector is optimized using a proximal policy optimization method to maximize the environmental reward.

[0186] During the fine-tuning of the reflector, it is necessary to score the reflections generated by the agent, that is, to score each piece of training data. A natural way to construct the score is to use the difference between the scores of two consecutive attempts as the score for the current reflection. However, this would give every agent in the environment the same reflection score, making it impossible to distinguish the contribution of each agent's reflection to the cooperation. Therefore, embodiments of this application propose introducing counterfactual scores to provide personalized evaluations for agents in the environment. The construction process of the counterfactual reward is shown in Figure 9B.

[0187] Specifically, the embodiment of this application calculates the difference in scores between two attempts as the overall reward of the agent system. The reflection of a given agent i in the environment is then removed, one agent at a time, and the interaction is repeated to obtain a new score for that round; the reward obtained in this way is called the marginal reward. The score of the reflection generated by agent i is obtained by subtracting the marginal reward from the overall reward. The resulting reward is called the counterfactual reward, and it is used as training data for the shared reflector.

[0188] The embodiments of this application may include three stages: supervised training, reward model training, and proximal policy optimization.

[0189] In the first stage, the reflector is fine-tuned through supervised training using positive samples from the collected reflective data. The loss function for model training in this stage is as follows:

[0190] Where x represents the input to the reflector and y represents the generated text.

[0191] In the second stage, considering the cost of constructing training data, unlike standard RLHF, this embodiment does not construct paired positive and negative samples. Instead, it predicts reflection scores by training a regression model. This embodiment uses the mean squared error loss function to optimize the reward model $R_\phi$. This process can be represented as:

[0192] In the third stage, the model obtained from supervised training is further fine-tuned using a proximal policy optimization method. The trained reward model is used to score the model's output. The training objective of this process is to maximize environmental rewards, and its loss function is as follows:
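For reference, the three stages are commonly written with objectives of the following form. This is a sketch based on standard RLHF formulations, not the exact expressions of this application: $\pi_\theta$ (the reflector being trained), $\pi_{\mathrm{SFT}}$ (the reflector after the first stage), $D^{+}$ (the positive samples), $s$ (the counterfactual score of a sample), and the coefficient $\beta$ are assumed notation.

$$
\begin{aligned}
\mathcal{L}_{\mathrm{SFT}}(\theta) &= -\,\mathbb{E}_{(x,y)\sim D^{+}} \sum_{t=1}^{|y|} \log \pi_\theta\left(y_t \mid x,\, y_{<t}\right) \\
\mathcal{L}_{\mathrm{RM}}(\phi) &= \mathbb{E}_{(x,y,s)\sim D} \left(R_\phi(x,y) - s\right)^2 \\
\mathcal{L}_{\mathrm{PPO}}(\theta) &= -\,\mathbb{E}_{x\sim D,\; y\sim \pi_\theta(\cdot\mid x)} \left[ R_\phi(x,y) - \beta \log \frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{SFT}}(y\mid x)} \right]
\end{aligned}
$$

The penalty term weighted by $\beta$ is an assumption drawn from common RLHF practice, and the clipped surrogate used inside proximal policy optimization is omitted for brevity.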

[0193] The scenario described in this application is a multi-agent collaborative system based on a large language model. The main structure of this scenario is shown in Figure 9C, which includes a multi-agent system, tools, environment, and memory.

[0194] The basic operational logic of a multi-agent system based on a large language model is as follows: First, the agents in the system utilize the powerful natural language processing capabilities of the large language model to understand and process input, perform planning and reasoning, and generate preliminary task execution plans. Then, during task execution, the agents rely on various tools to complete specific operations, such as generating text responses, executing code, or performing database queries, and dynamically interacting with the external environment. The agents obtain feedback through interaction with the environment and store this feedback in a memory module. Next, the agents use the large language model to reflect on the stored feedback, analyze and summarize problems and shortcomings in the execution process, and propose improvement strategies. The memory module not only records past experiences and feedback but also supports agents in updating their knowledge base and decision-making models within the environment, supporting continuous self-learning and adaptation. In this way, the multi-agent system can continuously improve its task execution capabilities and environmental adaptability through a cycle of understanding, execution, storing feedback, reflection, and learning.

[0195] The core device in this application's embodiments is primarily a shared reflector fine-tuning framework for multi-agent systems. Within this framework, to improve the efficiency of reflector fine-tuning in multi-agent systems while retaining personalized reflection capabilities, this invention proposes constructing a shared reflection framework and incorporating agent role descriptions into the reflector prompts. The multi-agent system obtains a series of task trajectories through continuous interaction with the environment. For trajectories with incorrect answers, the system utilizes a shared reflector to generate a reflection for each agent, which is then added to the prompts for the agent's next task response. A counterfactual reward method is used to obtain the reward score for agent reflections in the environment. A counterfactual reward dataset is collected, and the RLHF algorithm is used to fine-tune the shared reflector.

[0196] The framework proposed in this application consists of two phases: Phase 1 involves counterfactual data collection, and Phase 2 involves fine-tuning the shared reflector. In Phase 1, this application first adds the reflections of all agents to the prompts of the corresponding decision model to obtain an overall reward. Then, the reflections of each agent are removed sequentially, and the interaction with the environment is repeated, with the resulting reward used as a marginal reward. The difference between the overall reward and the marginal reward is used as an evaluation of the contribution of a single agent in the cooperation process, and this is used as a supervision signal to fine-tune the agent reflector. In Phase 2, this invention uses a proximal policy optimization method to optimize the reflector, maximizing environmental rewards to improve the reflection capability of the shared reflector in a multi-agent system.

[0197] For example, an algorithm process corresponding to this application is shown below:

[0198] Suppose the task is a multi-hop question-answering reasoning task. For this task, we assume a multi-agent system with two agents: a student agent and a teacher agent. The student agent can use a search engine to search for information related to keywords or choose to answer questions. The teacher agent can guide the student agent's subsequent actions based on the question information and the student agent's answer process, including two options: [Rethink] and [Continue]. The student agent and teacher agent in the environment participate in question-answering in turn until the student agent provides an answer or a preset number of steps is reached.
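The turn-taking between the two agents can be sketched as follows. This is a minimal illustration under stated assumptions: `run_student_teacher`, `student_act`, and `teacher_act` are hypothetical names, and the toy policies stand in for the large-language-model calls, which are not specified here.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical signatures: the student returns ("search", keyword) or
# ("answer", text); the teacher returns "[Rethink]" or "[Continue]".
StudentFn = Callable[[str, List[str], str], Tuple[str, str]]
TeacherFn = Callable[[str, List[str]], str]


def run_student_teacher(question: str, student_act: StudentFn,
                        teacher_act: TeacherFn,
                        max_steps: int) -> Tuple[Optional[str], List[str]]:
    """Alternate student and teacher turns until an answer or the step limit."""
    history: List[str] = []
    guidance = "[Continue]"
    for _ in range(max_steps):
        kind, content = student_act(question, history, guidance)
        history.append(f"student:{kind}:{content}")
        if kind == "answer":
            return content, history
        guidance = teacher_act(question, history)
        history.append(f"teacher:{guidance}")
    return None, history  # step limit reached without an answer


def toy_student(question: str, history: List[str], guidance: str) -> Tuple[str, str]:
    # Toy policy: search twice, then answer.
    searches = sum(1 for h in history if h.startswith("student:search"))
    if searches >= 2:
        return "answer", "42"
    return "search", f"keyword-{searches}"


def toy_teacher(question: str, history: List[str]) -> str:
    return "[Continue]"


answer, history = run_student_teacher("toy question", toy_student, toy_teacher, max_steps=5)
no_answer, _ = run_student_teacher("toy question", toy_student, toy_teacher, max_steps=2)
```

With the toy policies, the first call ends when the student answers on its third turn, while the second call stops at the step limit and returns no answer.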

[0199] Assuming the training set contains 10 tasks, the multi-agent system will attempt to solve each task at most Λ times. If the answer is correct, it stops. If the answer is incorrect, it generates a reflection, such as, "Last time I got stuck in an infinite loop while searching for a certain keyword; I will try to change the keyword next time." The newly generated reflection is added to the agent's prompts in the next round to help the agent answer the question.
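The attempt loop described above can be sketched as follows. This is an illustrative outline only: `Agent`, `solve_with_reflection`, `run_episode`, and `reflect` are hypothetical names, a single `reflect` callable plays the role of the shared reflector, and the toy environment replaces the real interaction with the decision models.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Agent:
    role: str
    long_term_memory: List[str] = field(default_factory=list)

    def prompt(self, question: str) -> str:
        # Reflections from earlier attempts are added to the decision prompt.
        reflections = "\n".join(self.long_term_memory)
        return f"Role: {self.role}\nReflections:\n{reflections}\nQuestion: {question}"


def solve_with_reflection(question: str, agents: List[Agent],
                          run_episode: Callable[[List[str]], Tuple[str, float, bool]],
                          reflect: Callable[[str, str, float], str],
                          max_attempts: int) -> Tuple[int, float]:
    """Attempt the question up to max_attempts times, reflecting after failures."""
    score = 0.0
    for attempt in range(1, max_attempts + 1):
        prompts = [agent.prompt(question) for agent in agents]
        trajectory, score, correct = run_episode(prompts)
        if correct:
            return attempt, score
        if attempt < max_attempts:
            # The same reflector serves every agent; the role makes it agent-specific.
            for agent in agents:
                agent.long_term_memory.append(reflect(agent.role, trajectory, score))
    return max_attempts, score


def toy_episode(prompts: List[str]) -> Tuple[str, float, bool]:
    # Toy environment: succeeds only once every prompt carries a reflection.
    correct = all("change the keyword" in p for p in prompts)
    trajectory = "search(new keyword) -> answer" if correct else "search(keyword) -> loop"
    return trajectory, (1.0 if correct else 0.0), correct


def toy_reflect(role: str, trajectory: str, score: float) -> str:
    return f"[{role}] Last attempt scored {score}; I will change the keyword next time."


agents = [Agent("student"), Agent("teacher")]
attempts, final_score = solve_with_reflection(
    "toy question", agents, toy_episode, toy_reflect, max_attempts=5)
```

In this toy run the first attempt fails, each agent stores one reflection, and the second attempt succeeds.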

[0200] In a multi-hop reasoning question-answering task, we use the F1-Score (a value between 0 and 1) to evaluate the answers. Assume the multi-agent system scores 0.5 on the λ-th answer and 0.9 on the (λ+1)-th answer. Subtracting the former from the latter, we obtain the overall reward for the multi-agent system (the combined improvement from the student agent and the teacher agent) as 0.9 - 0.5 = 0.4. Then, we remove the reflection of the student agent and the reflection of the teacher agent in turn from the (λ+1)-th attempt and repeat it, resulting in new (λ+1)-th round scores of 0.7 and 0.8, respectively. The marginal rewards for the student and teacher agents are calculated as 0.7 - 0.5 = 0.2 (the task improvement without the student agent's reflection) and 0.8 - 0.5 = 0.3 (the improvement without the teacher agent's reflection), respectively. Subtracting the marginal rewards from the overall reward, we obtain the reward for the student agent's reflection as 0.4 - 0.2 = 0.2 and the reward for the teacher agent's reflection as 0.4 - 0.3 = 0.1.
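The arithmetic of this example can be reproduced with a short sketch. The function name `counterfactual_rewards` and its argument names are hypothetical; the scores are the ones given in the example.

```python
from typing import Dict


def counterfactual_rewards(prev_score: float, new_score: float,
                           ablated_scores: Dict[str, float]) -> Dict[str, float]:
    """Score each agent's reflection as overall reward minus marginal reward.

    prev_score:     score of the previous attempt
    new_score:      score of the next attempt with all reflections present
    ablated_scores: score of the next attempt with one agent's reflection removed
    """
    overall = new_score - prev_score
    rewards = {}
    for agent, ablated in ablated_scores.items():
        marginal = ablated - prev_score
        rewards[agent] = overall - marginal
    return rewards


rewards = counterfactual_rewards(0.5, 0.9, {"student": 0.7, "teacher": 0.8})
# Rounded for display because of floating-point arithmetic.
print({agent: round(value, 2) for agent, value in rewards.items()})
# {'student': 0.2, 'teacher': 0.1}
```

Because the previous score cancels out, each counterfactual reward equals the drop in score observed when that agent's reflection is removed (0.9 - 0.7 and 0.9 - 0.8 here).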

[0201] The reward for a single agent's reflection is obtained according to the described counterfactual reward construction method. This data is collected and used as training data for the shared reflector.

[0202] The training of the shared reflector can be divided into three steps: first, select positive examples from the training set to perform supervised fine-tuning of the shared reflector; then, train the reward function using a linear regression model; and finally, use the trained reward function to further fine-tune the shared reflector using the PPO algorithm.

[0203] The following section uses a mobile phone smart assistant as an example to introduce the technical solution of this application embodiment. In the smart assistant, users can converse with the intelligent agent system. The assistant can answer user questions based on existing knowledge and can also call tools to retrieve relevant data. To provide better user service, we assume that the smart assistant system consists of multiple collaborative expert intelligent agents, such as a music intelligent agent and an encyclopedia intelligent agent. For a user request, the intelligent agents discuss and arrive at a final answer through collaboration.

[0204] For each agent in the mobile assistant intelligent agent system, a large model with fixed parameters and good general task performance is selected as the initial decision model. A shared reflection model is set up for the intelligent agent system to continuously improve the quality of user responses through continuous reflection. The shared reflector in the system can be fine-tuned using the method of the embodiments of this application. First, the mobile assistant intelligent agent system is launched to interact with users. User feedback is collected through certain strategies, and counterfactual data is collected from multiple user feedback interactions. Then, the RLHF algorithm is used to fine-tune the shared reflector. This framework is a basic framework for a multi-agent collaborative system based on a large language model. It has broad adaptability, is very flexible, and is easy to apply to various multi-agent collaborative scenarios.

[0205] This application's embodiments selected HotPotQA, GSM8K, and Checkmate in One Move to test the model's capabilities in multi-hop reasoning question answering, mathematics, and chess. In all three scenarios, this application's embodiments used exact match as the model evaluation metric. The following is an introduction to the datasets.

[0206] HotPotQA is a question-and-answer dataset that focuses on fact-based multi-hop reasoning, aiming to improve the interpretability of question-and-answer systems. In this dataset, an agent needs to reason between two or more Wikipedia paragraphs to arrive at an answer. The dataset contains 90,447 question-and-answer pairs.

[0207] The GSM8K dataset is a collection of 8.5K math problems designed for elementary school students. These math problems were carefully crafted by human experts to ensure linguistic diversity; they primarily focus on basic arithmetic operations such as addition, subtraction, multiplication, and division; each problem requires 2 to 8 steps of reasoning to arrive at a solution.

[0208] The Checkmate in One Move dataset is derived from the BIG-Bench Chess-State Tracking benchmark and aims to evaluate the effectiveness of moves in chess. Given a sequence of moves that can result in checkmate in one move, the agent is required to predict the moves that will lead to the checkmate. The dataset contains 3500 chess games.

[0209] In this embodiment, gpt-3.5-turbo is selected as the agent's decision model, and longchat-7b-16k is selected as the reflector to be fine-tuned. On the HotPotQA and Checkmate in One Move datasets, this embodiment randomly selects 2000 data points as the training set and 100 data points as the test set. On the GSM8K dataset, because the agent's higher success rate results in fewer reflection attempts, this embodiment randomly selects 3000 data points as the training set and 100 data points as the test set. Reflection data for reflector fine-tuning is collected on the training set. Specifically, the maximum number of attempts is set to 5, the decision model temperature is 0, and the reflection model temperature is 0.9. During the testing phase, to ensure the reproducibility of the results, both the decision model temperature and the reflector temperature are set to 0. LoRA is used for efficient fine-tuning of the reflector model, and the various stages of RLHF are implemented using the trl package of Hugging Face. During supervised training and proximal policy optimization, this embodiment searches for the number of training epochs within the range of {1,2,3,4}, the batch size within the range of {64,128,256}, and the learning rate within the range of {1e-4,2e-4,3e-4,5e-4}. GPT-2 is selected as the reward model, with a learning rate of 5e-5, 3 training epochs, and a batch size of 16. This embodiment trains the model on four A800 80 GB GPUs.
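The search space above can be written down as a small configuration sketch. It assumes an exhaustive grid over the three ranges, which the text does not state, and all names (`search_grid`, `REWARD_MODEL_CONFIG`, and so on) are hypothetical.

```python
from itertools import product
from typing import Dict, Iterator, Union

# Ranges given for supervised training and proximal policy optimization.
EPOCHS = [1, 2, 3, 4]
BATCH_SIZES = [64, 128, 256]
LEARNING_RATES = [1e-4, 2e-4, 3e-4, 5e-4]

# Fixed reward-model settings given in the text.
REWARD_MODEL_CONFIG = {
    "model": "GPT-2",
    "learning_rate": 5e-5,
    "epochs": 3,
    "batch_size": 16,
}


def search_grid() -> Iterator[Dict[str, Union[int, float]]]:
    """Yield every combination of epochs, batch size, and learning rate."""
    for epochs, batch_size, lr in product(EPOCHS, BATCH_SIZES, LEARNING_RATES):
        yield {"epochs": epochs, "batch_size": batch_size, "learning_rate": lr}


grid = list(search_grid())
print(len(grid))  # 48 candidate configurations (4 x 3 x 4)
```

Under this assumption each training stage has 48 candidate configurations; a random or staged search over the same ranges would be an equally valid reading of the text.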

[0210] Figure 9D shows the experimental results of different methods after 5 rounds of trials. The darkest line represents COPPER, the lightest line represents Reflexion using GPT-3.5 as the reflector, the middle-depth line represents Reflexion using LongChat without fine-tuning as the reflector, and the gray line represents ReAct or CoT.

[0211] Experimental results show that different methods exhibit roughly the same pattern on the three datasets: (1) The self-reflection mechanism enables the model performance to increase with the number of attempts, indicating that reflection can help the model learn from historical errors, thereby avoiding errors in the next round of attempts and improving task performance. (2) The method COPPER proposed in this application has a significant improvement in reflection performance compared to LongChat without fine-tuning, and even surpasses the method using GPT-3.5 as the reflector. The possible reason is that the reflector learns to reflect on the reasons for the failure of the task based on historical attempts during the fine-tuning process, and can deeply understand the role played by the current agent in cooperation, thereby providing targeted improvement solutions for the current agent in cooperation.

[0212] Compared to existing technologies, this application's embodiments use numerical rewards from the environment as supervisory signals to fine-tune the reflector. On one hand, this application's embodiments propose counterfactual rewards to evaluate the contribution of individual agents in the system, mitigating the credit allocation problem to some extent. On the other hand, this application's embodiments propose a shared reflector mechanism, enabling the reflector to reflect on roles while reducing computational resource requirements during model training and improving training stability. Experimental results on three public datasets show that COPPER has stronger reflective capabilities than the baseline model. Compared to the initial success rate, COPPER improves performance by 31.8%, 18.5%, and 86.4% on the HotPotQA, GSM8K, and Checkmate in One Move datasets, respectively.

[0213] Referring to Figure 10, which is a schematic diagram of the structure of a data processing apparatus provided in an embodiment of this application, as shown in Figure 10, the data processing apparatus 1000 provided in this embodiment of the application includes:

[0214] The question acquisition module 1001 is used to acquire questions.

[0215] The response confirmation module 1002 is used to obtain multiple first interaction information based on the question through interaction between multiple agents, the multiple first interaction information including a first response to the question; when the first response is not a correct response to the question, based on the multiple first interaction information, a reflector determines the reason information corresponding to each agent, the reason information being the reason for obtaining an incorrect response; based on the question and the reason information corresponding to each agent, a second response to the question is obtained through interaction between the multiple agents.

[0216] In one possible implementation, the same reflector is used when determining the reason information corresponding to each agent.

[0217] In one possible implementation, the plurality of agents includes a target agent;

[0218] The response confirmation module is specifically used for:

[0219] Based on the question and the reason information corresponding to each agent, the second interaction information corresponding to each agent is obtained through the interaction between the multiple agents, and the multiple second interaction information includes the second response.

[0220] The second interaction information corresponding to the target intelligent agent is obtained in the following manner:

[0221] Based on the question and the reason information corresponding to the target agent, the second interaction information of the target agent is obtained through the interaction between the target agent and other agents.

[0222] The second interaction information corresponding to the target intelligent agent is obtained in the following manner:

[0223] Based on the question and the reason information corresponding to multiple agents, including the target agent, the second interaction information of the target agent is obtained through the interaction between the target agent and other agents.

[0224] In one possible implementation, the apparatus further includes:

[0225] The model training module is used to obtain the label of the cause information corresponding to each agent, and the weight corresponding to the cause information; the weight indicates the degree of positive influence of using the reflector on the accuracy of the obtained second interaction information compared with not using the reflector;

[0226] The reflector is fine-tuned based on the labels and weights.

[0227] In one possible implementation, the agent is a Large Language Model (LLM).

[0228] The following describes an execution device provided in an embodiment of this application. Please refer to Figure 11, which is a schematic diagram of the structure of an execution device provided in an embodiment of this application. Specifically, the execution device 1100 includes: a receiver 1101, a transmitter 1102, a processor 1103, and a memory 1104 (the number of processors 1103 in the execution device 1100 can be one or more; Figure 11 shows one processor as an example). The processor 1103 may include an application processor 11031 and a communication processor 11032. In some embodiments of this application, the receiver 1101, transmitter 1102, processor 1103, and memory 1104 can be connected via a bus or other means.

[0229] Memory 1104 may include read-only memory and random access memory, and provides instructions and data to processor 1103. A portion of memory 1104 may also include non-volatile random access memory (NVRAM). Memory 1104 stores processor and operation instructions, executable modules, or data structures, or subsets thereof, or extended sets thereof, wherein the operation instructions may include various operation instructions for implementing various operations.

[0230] Processor 1103 controls the operation of the execution device. In specific applications, the various components of the execution device are coupled together through a bus system, which may include not only the data bus, but also power buses, control buses, and status signal buses. However, for clarity, all buses are referred to as the bus system in the diagram.

[0231] The methods disclosed in the embodiments of this application can be applied to or implemented by the processor 1103. The processor 1103 can be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above method can be completed by the integrated logic circuitry in the hardware of the processor 1103 or by instructions in software form. The processor 1103 can be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. The processor 1103 can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application can be directly embodied in the execution of a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software module can reside in a mature storage medium in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. This storage medium is located in memory 1104. Processor 1103 reads the information from memory 1104 and, in conjunction with its hardware, completes the steps involved in the model inference process described above.

[0232] Receiver 1101 can be used to receive input digital or character information, and to generate signal inputs related to the settings and function control of the execution device. Transmitter 1102 can be used to output digital or character information through the first interface; transmitter 1102 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; transmitter 1102 may also include a display device such as a display screen.

[0233] This application embodiment also provides a server device. Please refer to Figure 12. Figure 12 is a schematic diagram of a server structure provided in this application embodiment. Specifically, the server 1200 is implemented by one or more servers. The server 1200 can vary significantly due to different configurations or performance. It may include one or more central processing units (CPUs) 1212 (e.g., one or more processors) and memory 1232, and one or more storage media 1230 (e.g., one or more mass storage devices) for storing application programs 1242 or data 1244. The memory 1232 and storage media 1230 can be temporary or persistent storage. The program stored in the storage media 1230 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Furthermore, the CPU 1212 may be configured to communicate with the storage media 1230 and execute the series of instruction operations in the storage media 1230 on the server 1200.

[0234] Server 1200 may also include one or more power supplies 1226, one or more wired or wireless network interfaces 1250, one or more input / output interfaces 1258; or, one or more operating systems 1241, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.

[0235] In this embodiment, the central processing unit 1212 is used to execute the data processing method described in the above embodiment.

[0236] This application also provides a computer program product that, when run on a computer, causes the computer to perform steps as performed by the aforementioned execution device, or causes the computer to perform steps as performed by the aforementioned training device.

[0237] This application also provides a computer-readable storage medium storing a program for signal processing, which, when run on a computer, causes the computer to perform steps as performed by the aforementioned execution device, or causes the computer to perform steps as performed by the aforementioned training device.

[0238] The execution device, training device, or terminal device provided in this application embodiment can specifically be a chip. The chip includes a processing unit and a communication unit. The processing unit can be, for example, a processor, and the communication unit can be, for example, an input / output interface, pins, or circuits. The processing unit can execute computer execution instructions stored in the storage unit to cause the chip within the execution device to execute the data processing method described in the above embodiments, or to cause the chip within the training device to execute the data processing method described in the above embodiments. Optionally, the storage unit can be a storage unit within the chip, such as a register or cache. Alternatively, the storage unit can be a storage unit located outside the chip within the wireless access device, such as a read-only memory (ROM) or other types of static storage devices capable of storing static information and instructions, such as random access memory (RAM).

[0239] Specifically, please refer to Figure 13, which is a schematic diagram of a chip structure provided in an embodiment of this application. The chip can be represented as a neural network processor (NPU) 1300. The NPU 1300 is mounted as a coprocessor on the host CPU, and tasks are assigned by the host CPU. The core part of the NPU is the arithmetic circuit 1303, which is controlled by the controller 1304 to extract matrix data from the memory and perform multiplication operations.

[0240] In some implementations, the arithmetic circuit 1303 internally includes multiple processing engines (PEs). In some implementations, the arithmetic circuit 1303 is a two-dimensional systolic array. The arithmetic circuit 1303 can also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1303 is a general-purpose matrix processor.

[0241] For example, suppose we have an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit retrieves the corresponding data of matrix B from the weight memory 1302 and caches it in each PE of the arithmetic circuit. The arithmetic circuit retrieves the data of matrix A from the input memory 1301 and performs matrix operations with matrix B. The partial result or the final result of the obtained matrix is stored in the accumulator 1308.
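The role of the accumulator can be illustrated with a small software sketch that computes C = A × B in slices and sums the partial results. It is not a model of the actual dataflow of the arithmetic circuit; the function name `matmul_accumulate` and the `tile` parameter are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul_accumulate(a: Matrix, b: Matrix, tile: int = 2) -> Matrix:
    """Multiply a (n x k) by b (k x m), accumulating partial results per k-slice."""
    n, k, m = len(a), len(b), len(b[0])
    # Plays the role of the accumulator: holds partial, then final, results.
    acc = [[0.0] * m for _ in range(n)]
    for k0 in range(0, k, tile):
        k1 = min(k0 + tile, k)
        for i in range(n):
            for j in range(m):
                # Partial dot product over the current slice of the inner dimension.
                partial = sum(a[i][kk] * b[kk][j] for kk in range(k0, k1))
                acc[i][j] += partial
    return acc


A = [[1, 2, 3], [4, 5, 6]]        # input matrix
B = [[7, 8], [9, 10], [11, 12]]   # weight matrix
C = matmul_accumulate(A, B)
print(C)  # [[58.0, 64.0], [139.0, 154.0]]
```

After the first slice the accumulator holds only partial sums; it holds the final matrix once every slice of the inner dimension has been processed.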

[0242] Unified memory 1306 is used to store input and output data. Weight data is directly transferred to weight memory 1302 via Direct Memory Access Controller (DMAC) 1305. Input data is also transferred to unified memory 1306 via DMAC.

[0243] BIU stands for Bus Interface Unit, which is used for interaction between the AXI bus and the DMAC and the Instruction Fetch Buffer (IFB) 1309.

[0244] The bus interface unit (BIU) 1310 is used by the instruction fetch buffer 1309 to fetch instructions from external memory, and by the direct memory access controller 1305 to fetch the original data of input matrix A or weight matrix B from external memory.

[0245] The DMAC is mainly used to move input data from external DDR memory to the unified memory 1306, to move weight data to the weight memory 1302, or to move input data to the input memory 1301.

[0246] The vector computation unit 1307 includes multiple processing units that, when needed, further process the output of the arithmetic circuit 1303, for example through vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparisons. It is mainly used for computation in neural network layers other than convolutional and fully connected layers, such as batch normalization, pixel-level summation, and upsampling of feature planes.

[0247] In some implementations, the vector computation unit 1307 can store the processed output vector in the unified memory 1306. For example, the vector computation unit 1307 can apply a linear or nonlinear function to the output of the arithmetic circuit 1303, such as performing linear interpolation on the feature planes extracted by a convolutional layer, or accumulating a vector of values to generate activation values. In some implementations, the vector computation unit 1307 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 1303, for example, for use in subsequent layers of the neural network.
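As a rough illustration of this kind of post-processing, the sketch below applies a batch-normalization-style step followed by a nonlinear function to a matrix result. The function names, the choice of ReLU as the nonlinear function, the `eps` value, and the omission of learned scale and shift parameters are all assumptions made for illustration; the text above does not specify them.

```python
import math
from typing import List

Matrix = List[List[float]]


def batch_normalize(batch: Matrix, eps: float = 1e-5) -> Matrix:
    """Normalize each column (feature) to zero mean and unit variance across the rows."""
    n = len(batch)
    columns = list(zip(*batch))
    means = [sum(col) / n for col in columns]
    variances = [sum((v - m) ** 2 for v in col) / n for col, m in zip(columns, means)]
    return [
        [(v - m) / math.sqrt(var + eps) for v, m, var in zip(row, means, variances)]
        for row in batch
    ]


def relu(batch: Matrix) -> Matrix:
    """Example nonlinear function: clamp negative values to zero."""
    return [[max(0.0, v) for v in row] for row in batch]


def vector_postprocess(matrix_output: Matrix) -> Matrix:
    """Produce activation values that could be fed to a subsequent layer."""
    return relu(batch_normalize(matrix_output))


out = vector_postprocess([[58.0, 64.0], [139.0, 154.0]])
print(out)
```

In this example the first row falls below each column mean and is clamped to zero, while the second row normalizes to values just under 1.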

[0248] The instruction fetch buffer 1309, connected to the controller 1304, stores the instructions used by the controller 1304.

[0249] The unified memory 1306, input memory 1301, weight memory 1302, and instruction fetch buffer 1309 are all on-chip memories. The external memory is private to this NPU hardware architecture.

[0250] The processor mentioned above can be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control the execution of the above program.

[0251] It should also be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. In addition, in the device embodiment drawings provided in this application, the connection relationship between modules indicates that they have a communication connection, which can be implemented as one or more communication buses or signal lines.

[0252] From the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by software plus the necessary general-purpose hardware, or by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memory, special-purpose components, and the like. Generally, any function performed by a computer program can readily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can be diverse, such as analog circuits, digital circuits, or special-purpose circuits. For this application, however, a software implementation is the preferred approach in most cases. Based on this understanding, the technical solution of this application, in essence, or the part that contributes over the prior art, can be embodied in the form of a software product. This computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disk, and includes several instructions that cause a computer device (which may be a personal computer, a training device, a network device, or the like) to execute the methods described in the various embodiments of this application.

[0253] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product.

[0254] The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are produced. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired means (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device, such as a training device or data center, that integrates one or more usable media. The usable medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)).

Claims

1. A data processing method, characterized in that the method comprises: acquiring a question; obtaining, based on the question and through interaction between a plurality of agents, a plurality of pieces of first interaction information, the plurality of pieces of first interaction information including a first reply to the question; when the first reply is not a correct reply to the question, determining, by a reflector and based on the plurality of pieces of first interaction information, reason information corresponding to each agent, the reason information being a reason for obtaining the incorrect reply; and obtaining, based on the question and the reason information corresponding to each agent, a second reply to the question through interaction between the plurality of agents.

2. The method according to claim 1, characterized in that the same reflector is used to determine the reason information corresponding to each of the agents.

3. The method according to claim 1 or 2, characterized in that the plurality of agents includes a target agent; and the obtaining, based on the question and the reason information corresponding to each agent, a second reply to the question through interaction between the plurality of agents comprises: obtaining, based on the question and the reason information corresponding to each agent, second interaction information corresponding to each agent through interaction between the plurality of agents, the plurality of pieces of second interaction information including the second reply.

4. The method according to claim 3, characterized in that the second interaction information corresponding to the target agent is obtained as follows: based on the question and the reason information corresponding to the target agent, the second interaction information of the target agent is obtained through interaction between the target agent and the other agents.

5. The method according to claim 3, characterized in that the second interaction information corresponding to the target agent is obtained as follows: based on the question and the reason information corresponding to multiple agents including the target agent, the second interaction information of the target agent is obtained through interaction between the target agent and the other agents.

6. The method according to any one of claims 1 to 5, characterized in that the method further comprises: obtaining a label of the reason information corresponding to each agent and a weight corresponding to the reason information, the weight indicating the degree to which using the reflector, compared with not using the reflector, positively affects the accuracy of the obtained second interaction information; and fine-tuning the reflector based on the labels and the weights.

7. The method according to any one of claims 1 to 6, characterized in that each agent is a large language model (LLM).

8. A data processing apparatus, characterized in that the apparatus comprises: a question retrieval module, configured to acquire a question; and a response confirmation module, configured to: obtain, based on the question and through interaction between a plurality of agents, a plurality of pieces of first interaction information, the plurality of pieces of first interaction information including a first reply to the question; when the first reply is not a correct reply to the question, determine, by a reflector and based on the plurality of pieces of first interaction information, reason information corresponding to each agent, the reason information being a reason for obtaining the incorrect reply; and obtain, based on the question and the reason information corresponding to each agent, a second reply to the question through interaction between the plurality of agents.

9. The apparatus according to claim 8, characterized in that the same reflector is used to determine the reason information corresponding to each of the agents.

10. The apparatus according to claim 8 or 9, characterized in that the plurality of agents includes a target agent; and the response confirmation module is specifically configured to: obtain, based on the question and the reason information corresponding to each agent, second interaction information corresponding to each agent through interaction between the plurality of agents, the plurality of pieces of second interaction information including the second reply.

11. The apparatus according to claim 10, characterized in that the second interaction information corresponding to the target agent is obtained as follows: based on the question and the reason information corresponding to the target agent, the second interaction information of the target agent is obtained through interaction between the target agent and the other agents.

12. The apparatus according to claim 10, characterized in that the second interaction information corresponding to the target agent is obtained as follows: based on the question and the reason information corresponding to multiple agents including the target agent, the second interaction information of the target agent is obtained through interaction between the target agent and the other agents.

13. The apparatus according to any one of claims 8 to 12, characterized in that the apparatus further comprises: a model training module, configured to obtain a label of the reason information corresponding to each agent and a weight corresponding to the reason information, the weight indicating the degree to which using the reflector, compared with not using the reflector, positively affects the accuracy of the obtained second interaction information; and to fine-tune the reflector based on the labels and the weights.

14. The apparatus according to any one of claims 8 to 13, characterized in that each agent is a large language model (LLM).

15. A computing device, characterized in that the computing device comprises a memory and a processor; the memory stores code, and the processor is configured to execute the code, wherein when the code is executed, the computing device performs the method according to any one of claims 1 to 7.

16. A chip, characterized in that the chip comprises at least one processing unit and an interface circuit, the interface circuit being configured to provide program instructions or data to the at least one processing unit, and the at least one processing unit being configured to execute the program instructions to implement the method according to any one of claims 1 to 7.

17. A computing device cluster, characterized in that the computing device cluster comprises at least one computing device, the at least one computing device including at least one processor and at least one memory, the at least one memory storing computer-readable instructions; the at least one processor executes the computer-readable instructions to cause the computing device cluster to perform the method according to any one of claims 1 to 7.

18. A computer-readable storage medium, characterized in that it comprises computer-readable instructions, the computer-readable instructions being used to implement the method according to any one of claims 1 to 7.

19. A computer program product, characterized in that it comprises computer-readable instructions, the computer-readable instructions being used to implement the method according to any one of claims 1 to 7.
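For readers who want a concrete picture of the flow recited in claims 1 to 5, the following is a minimal sketch under stated assumptions and not the claimed implementation. The agents and the reflector are toy stand-in functions rather than LLMs, the sequential turn order in `interact` is an assumed interaction pattern, and `is_correct` is a hypothetical checker, since how correctness is judged is not specified here. The `share_all` flag switches between giving each agent only its own reason information (as in claim 4) and giving it the reason information of multiple agents (as in claim 5).

```python
from typing import Callable, Dict, List, Optional

# Hypothetical signatures: an agent sees the question, the transcript so far, and optional reason text.
Agent = Callable[[str, List[str], Optional[str]], str]
# A single shared reflector returns one reason per agent name.
Reflector = Callable[[str, List[str]], Dict[str, str]]


def interact(question: str, agents: Dict[str, Agent], reasons: Dict[str, str],
             share_all: bool = False) -> List[str]:
    """Run one round in which each agent speaks once; the last message is the reply."""
    transcript: List[str] = []
    all_reasons = " ".join(reasons.values()) or None
    for name, agent in agents.items():
        hint = all_reasons if share_all else reasons.get(name)
        transcript.append(agent(question, transcript, hint))
    return transcript


def answer_with_reflection(question: str, agents: Dict[str, Agent], reflector: Reflector,
                           is_correct: Callable[[str], bool], share_all: bool = False) -> str:
    first = interact(question, agents, {})           # first interaction information
    if is_correct(first[-1]):                        # first reply is already correct
        return first[-1]
    reasons = reflector(question, first)             # reason information per agent
    second = interact(question, agents, reasons, share_all)
    return second[-1]                                # second reply


def solver(question: str, transcript: List[str], reason: Optional[str]) -> str:
    # Toy behaviour: makes an operator-precedence slip unless it receives reason information.
    return "2 + 2 * 3 = 8" if reason else "2 + 2 * 3 = 12"


def reporter(question: str, transcript: List[str], reason: Optional[str]) -> str:
    # Toy behaviour: relays the value after "=" in the previous message.
    return transcript[-1].split("=")[-1].strip()


def toy_reflector(question: str, transcript: List[str]) -> Dict[str, str]:
    return {"solver": "Addition was applied before multiplication.",
            "reporter": "An unchecked result was relayed."}


toy_agents = {"solver": solver, "reporter": reporter}
reply = answer_with_reflection("What is 2 + 2 * 3?", toy_agents, toy_reflector,
                               lambda r: r == "8")
print(reply)  # 8
```

In this toy run the first round yields the incorrect reply "12", the shared reflector returns a reason for each agent, and the second round yields "8".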