Method for training multi-agent interaction model, multi-agent interaction method and device
By splitting the visual language navigation task into localization, path planning, and navigation sub-tasks, and employing a multi-agent interaction model based on emergent language and reinforcement learning, the high cost and low generalization issues of multi-agent navigation tasks are solved, achieving low-cost, efficient navigation results and wide applicability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING UNIV OF POSTS & TELECOMM
- Filing Date
- 2022-12-28
- Publication Date
- 2026-06-19
AI Technical Summary
Existing multi-agent navigation tasks suffer from high implementation costs, poor generalization, and low navigation efficiency. In particular, in machine-to-machine collaboration scenarios, existing solutions rely on natural language dialogue and limited communication and interaction rules, resulting in high model training costs and insufficient generalization ability.
A multi-agent interaction model is adopted to break down the visual language navigation task into three sub-tasks: localization, path planning, and navigation. Collaborative navigation is carried out using localization networks, path planning networks, and navigation networks. Emergent language is used for interaction between agents to avoid the high annotation costs associated with natural language interaction. Network parameters are optimized through reinforcement learning.
It achieves low-cost and efficient visual language navigation tasks, improves the generalization ability and navigation efficiency of multi-agent systems in different environments, reduces dependence on labeled data, and adapts to more application scenarios.
Figure CN115900723B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to artificial intelligence technology, and in particular to a training method for a multi-agent interaction model, a multi-agent interaction method, and a corresponding device.
Background Technology
[0002] Currently, visual-language navigation (VLN) has become a research hotspot. It involves two roles: an embodied agent and a user. These two agents cooperate through language interaction to guide the agent to a target location or complete a specific task. From a linguistic perspective, current VLN tasks primarily focus on human-computer dialogue, i.e., only modeling the agent's language capabilities on one side, while the user can only provide fixed instructions or simple gestures. However, with the widespread emergence of machine-to-machine collaboration scenarios, navigation tasks are no longer limited to human-machine cooperation; numerous navigation applications also exist in machine-to-machine collaboration scenarios. Therefore, to apply VLN to multi-agent scenarios, a navigation task in a multi-agent collaborative scenario is abstracted: this task involves two agents with unequal capabilities, namely the guide agent and the tourist agent. The guide does not know the tourist's initial location but needs to interact and cooperate with the tourist to guide the tourist to the target location.
[0003] In the process of realizing this invention, the inventors discovered that existing multi-agent navigation tasks suffer from problems such as high implementation cost, poor generalization, and low navigation efficiency. Through research, the specific reasons for these problems were found to be as follows:
[0004] Existing multi-agent navigation schemes employ random walks for modeling, allowing the target agent to move randomly. The navigation agent and target agent communicate via natural language to locate the target agent, and the task ends once the target agent is located at the target position. Because this navigation method lacks guidance for the target agent's movement, it results in random movement, hindering the target agent's ability to quickly reach the navigation target, leading to poor navigation performance and low efficiency. Furthermore, the use of natural language for dialogue requires a large number of finely labeled interaction instances as training data, increasing the cost of model learning. Additionally, some navigation tasks are limited by predefined communication rules, restricting them to a limited number of simple scenarios. This results in models trained for these tasks lacking scalability and unable to be applied to other scenarios, leading to poor generalization.
Summary of the Invention
[0005] In view of this, the main objective of the present invention is to provide a training method for a multi-agent interaction model, a multi-agent interaction method, and a corresponding device, which can efficiently complete visual language navigation tasks with low implementation cost and strong generalization performance.
[0006] To achieve the above objectives, the technical solution proposed in this embodiment of the invention is as follows:
[0007] A training method for a multi-agent interaction model, comprising:
[0008] When the target agent being navigated needs to obtain the next hop path, the navigation agent uses the localization network of the multi-agent interaction model to locate the first position of the target agent based on the object feature information currently observed by the target agent.
[0009] The navigation agent uses the path planning network of the multi-agent interaction model to determine the shortest path from the first position to the target position based on the first position and the target position of navigation.
[0010] The navigation agent uses the navigation network of the multi-agent interaction model to navigate the target agent, including: generating a path navigation language for the target agent with the next hop path based on the shortest path and sending it to the target agent to trigger the target agent to predict the next hop movement path based on the path navigation language and move accordingly according to the prediction result;
[0011] Once the target agent reaches the target location, the corresponding loss function values are calculated based on the output results of the corresponding modules of the navigation agent and the target agent in the localization network and the navigation network, respectively, and the network parameters of the corresponding modules are optimized and adjusted using the loss function values.
[0012] This invention also proposes a multi-agent interaction method, including:
[0013] When navigating a target agent, the navigation agent uses a pre-trained multi-agent interaction model to interact with the target agent and assist the target agent in predicting the next-hop movement path, so that the target agent moves according to the next-hop movement path; wherein, the multi-agent interaction model is obtained in advance using any of the training methods described above.
[0014] This invention also proposes a training device for a multi-agent interaction model, including a navigation agent and a target agent;
[0015] The navigation agent includes a processor and a memory; wherein the memory stores an application program that can be executed by the processor, for causing the processor to perform the operations performed by the navigation agent in any of the training methods described above;
[0016] The target agent includes a processor and a memory; wherein the memory stores an application program that can be executed by the processor, which causes the processor to perform the operations performed by the target agent in any of the training methods described above.
[0017] This invention also proposes a multi-agent interaction device, including a navigation agent and a target agent;
[0018] The navigation agent includes a processor and a memory; wherein, the memory stores an application program that can be executed by the processor, which enables the processor to perform the operations performed by the navigation agent in the multi-agent interaction method described above;
[0019] The target agent includes a processor and a memory; wherein the memory stores an application program that can be executed by the processor, which enables the processor to perform the operations performed by the target agent in the multi-agent interaction method described above.
[0020] In summary, the training scheme and interaction scheme of the multi-agent interaction model proposed in this invention, when performing visual language navigation tasks, decompose the execution of the visual language navigation task into three sub-tasks: localization, path planning, and navigation. These three sub-tasks are modeled separately as a localization network, a path planning network, and a navigation network, which are used together to achieve the complete navigation task. During the navigation task execution, the navigation agent guides the target agent through path navigation and generates corresponding navigation language based on the navigation results to guide it in selecting the planned route, thus helping the target agent reach the navigation target location more quickly. Simultaneously, the agents interact using emergent language spontaneously learned during the interaction process rather than natural language, thereby avoiding the high cost of using labeled data for model training. Furthermore, since the environmental features of each observation point (that is, the object feature information observable at the observation point) are introduced during the navigation task execution, richer environmental data can be used to enhance the language expression and generalization capabilities of the agents through multi-turn language interactions, so that the multi-agent interaction model is no longer restricted to a narrow set of application scenarios. Therefore, by employing the above embodiments of the present invention, visual language navigation tasks can be completed efficiently with low cost and strong generalization performance.
Attached Figure Description
[0021] Figure 1 is a schematic diagram illustrating the process of agents engaging in dialogue using emergent language, according to an embodiment of the present invention;
[0022] Figure 2 is a schematic diagram of the multi-agent interaction model structure according to an embodiment of the present invention;
[0023] Figure 3 is a schematic diagram of the training method for a multi-agent interaction model according to an embodiment of the present invention.
Detailed Implementation
[0024] To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0025] The multi-agent visual-language navigation task implemented in this invention specifically involves a Tourist agent and a Guide agent located in different environments. Guided by the Guide, the Tourist reaches the target location by moving between different positions. The Guide possesses all environmental information but is unaware of the Tourist's location. Each environment contains several observation points. These observation points refer to observations of the environment obtained from different locations within the environment, generally including typical elements such as obstacles, passageways, and objects to be observed.
[0026] Assume each environment has N observation points: $Env = \{r_0, r_1, \ldots, r_{N-1}\}$. Their connectivity can be represented by the graph $G = (V, E)$, where V represents the observation point features and E represents the connectivity between observation points. The two agents in the task, Tourist and Guide, reside in the aforementioned environment Env and converse using the language $m = w_0 w_1 \ldots w_{len-1}$, $w_i \in W$, where len is the maximum sentence length and W is the vocabulary. The agents also need to perform actions to interact with the environment; for example, the Tourist can perform a movement action g to reach a new observation point r and obtain new observations.
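The environment and message definitions above can be pictured with a small data structure. This is only an illustrative sketch: the class name `Environment`, the helper `is_valid_message`, the use of integer word ids, and the toy feature values are all assumptions made here, not part of the patent.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """Observation points r_0..r_{N-1} with features V and connectivity E (hypothetical layout)."""
    features: list[list[float]]  # V: one feature vector per observation point
    edges: set[tuple[int, int]] = field(default_factory=set)  # E: undirected links

    def neighbors(self, r: int) -> list[int]:
        """Observation points reachable from r with a single movement action."""
        out = set()
        for a, b in self.edges:
            if a == r:
                out.add(b)
            elif b == r:
                out.add(a)
        return sorted(out)


def is_valid_message(m: list[int], vocab_size: int, max_len: int) -> bool:
    """A message is a sequence of word ids from W, at most `max_len` words long."""
    return len(m) <= max_len and all(0 <= w < vocab_size for w in m)


# Toy environment with three observation points on a line: r_0 - r_1 - r_2.
env = Environment(
    features=[[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]],
    edges={(0, 1), (1, 2)},
)
```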
[0027] To overcome the problems of high cost, low navigation efficiency, and poor generalization in existing navigation task implementation schemes, this invention proposes to introduce emergent language from existing multi-agent communication methods to realize interaction between agents, thereby avoiding the high cost of using labeled data for model training when using natural language interaction.
[0028] Emergent language, or automatically emerging machine language, is a language spontaneously learned by intelligent agents during interactions. It learns and improves during model training, and its more flexible dialogue methods allow it to better adapt to new scenarios. Furthermore, it employs reinforcement learning to learn a task-oriented language from scratch. Since it does not require additional language supervision signals, it effectively reduces model training costs.
[0029] Figure 1 is a schematic diagram illustrating the process of dialogue between agents using emergent language in an embodiment of the present invention. First, the target agent on the left generates descriptive emergent language based on its observations of its environment. The navigation agent receives the language from the target agent and, based on the description in the language and information from all observation points, predicts the target agent's location. Then, based on the predicted location of the target agent and the known target location, the navigation agent plans the route (e.g., the shortest path) that the target agent should take. Finally, the navigation agent uses emergent language to instruct the target agent on the next movement action to be taken and transmits this instruction to the target agent. The target agent, based on the instructions from the navigation agent, determines its next movement path and executes the corresponding movement action.
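The dialogue loop described in this paragraph can be sketched as a simple control flow. This is a minimal illustration, not the patented implementation: the five callables (`describe`, `localize`, `plan_next_hop`, `instruct`, `follow`) and the `max_turns` cap are hypothetical stand-ins for the trained modules, and the demo replaces them with trivial oracle functions on a line of four observation points.

```python
def run_episode(start, goal, describe, localize, plan_next_hop, instruct, follow,
                max_turns=20):
    """Run one guide/tourist dialogue; returns (reached_goal, turns_used)."""
    position, history = start, []
    for turn in range(max_turns):
        if position == goal:
            return True, turn
        msg_tourist = describe(position)             # tourist describes its observation
        guess = localize(history, msg_tourist)       # guide predicts the tourist's location
        hop = plan_next_hop(guess, goal)             # guide plans the next hop
        msg_guide = instruct(hop)                    # guide encodes the hop as a message
        position = follow(position, history, msg_guide)  # tourist decodes and moves
        history.append((msg_tourist, msg_guide))
    return position == goal, max_turns


# Oracle stand-ins: messages are just the position / next hop wrapped in a list.
result = run_episode(
    start=0,
    goal=3,
    describe=lambda pos: [pos],
    localize=lambda history, msg: msg[0],
    plan_next_hop=lambda guess, goal: guess + 1 if guess < goal else guess - 1,
    instruct=lambda hop: [hop],
    follow=lambda pos, history, msg: msg[0],
)
```

With perfect modules the tourist needs exactly one turn per hop; in training, the learned modules replace the oracles and the outcome of each turn feeds the reward.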
[0030] Figure 2 is a schematic diagram of the multi-agent interaction model structure according to an embodiment of the present invention. As shown in Figure 2, the model breaks down the execution of the visual language navigation task into three sub-tasks: localization, path planning, and navigation. These three sub-tasks correspond to the localization network, path planning network, and navigation network, respectively. This invention utilizes these three networks to jointly achieve the complete navigation task.
[0031] Figure 3 is a schematic diagram of the training method for a multi-agent interaction model according to an embodiment of the present invention. The specific implementation of this embodiment is described below in conjunction with Figure 2 and Figure 3.
[0032] As shown in Figure 3, the training method for the multi-agent interaction model in this embodiment of the invention mainly includes the following steps:
[0033] Step 301: When the target agent being navigated needs to obtain the next hop path, the navigation agent uses the localization network of the multi-agent interaction model to locate the first position of the target agent based on the object feature information currently observed by the target agent.
[0034] This step utilizes a localization network based on a multi-agent interaction model. The navigation agent determines the current location of the target agent based on the target agent's description of the observation point. This allows subsequent steps to determine the shortest path from the target agent to the target location.
[0035] In one implementation, the following steps 3011-3014 can be used to locate the first position of the target agent based on the object feature information currently observed by the target agent:
[0036] Step 3011: The navigation agent uses a first observation point encoder to obtain a first feature representation of each observation point in a preset electronic map, based on the encoding information of all first paths originating from that observation point and the object feature information that can be observed at that observation point; the first path is a single-hop path; the encoding information of the first path is the result of encoding the position features of the corresponding path.
[0037] This step is used to obtain the first feature representation of each observation point in the environment Env where the navigation agent and the target agent are located.
[0038] The electronic map is the electronic map G = (V, E) corresponding to the environment Env where the navigation agent and the target agent are located.
[0039] The first observation point encoder is specifically composed of a fully connected layer, which takes the path features $g_i$ of all first paths of observation point i and the object features $f_i$ that can be observed at observation point i as input, and outputs the first feature representation of the corresponding observation point i. The specific formula is as follows:
[0040]
[0041] Where i represents the observation point number.
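As a rough illustration of a fully connected observation point encoder, the sketch below pools the single-hop path encodings into $g_i$, joins them with $f_i$, and applies one dense layer. The sum pooling, the concatenation, the tanh activation, the dimensions, and the names `make_fc`, `fc`, and `encode_observation_point` are all assumptions made for this example; the patent text does not specify them.

```python
import math
import random


def make_fc(in_dim, out_dim, seed=0):
    """Randomly initialised weights and zero biases for one fully connected layer."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    biases = [0.0] * out_dim
    return weights, biases


def fc(x, weights, biases):
    """y = tanh(Wx + b); the activation is an assumption of this sketch."""
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def encode_observation_point(path_codes, f_i, weights, biases):
    """First feature representation of point i from its path codes and object features."""
    # Assumption: sum-pool the encodings of all single-hop paths leaving point i.
    g_i = [sum(column) for column in zip(*path_codes)]
    return fc(g_i + f_i, weights, biases)  # list concatenation [g_i ; f_i]


W, b = make_fc(in_dim=4, out_dim=3)
first_feature = encode_observation_point(
    path_codes=[[1.0, 0.0], [0.0, 1.0]],  # two single-hop paths leaving point i
    f_i=[0.5, 0.2],                       # object features observed at point i
    weights=W,
    biases=b,
)
```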
[0042] Step 3012: The navigation agent uses a structural encoder to obtain all second paths on the electronic map with each observation point as the endpoint. For each second path, the first feature representation of all nodes on the path is encoded to obtain the encoding result of the path. For each observation point i, the average value of the encoding results of all second paths of observation point i is calculated to obtain the second feature representation of the observation point. The number of hops of the second path is the same as the current dialogue round.
[0043] This step utilizes the structural encoder in the localization network to obtain a second feature representation for each observation point based on the first feature representation of each observation point. This second feature representation incorporates graph structure information, so that the current location of the target agent can be predicted based on this second feature representation, combined with the feature information of the observation point fed back by the target agent and historical interaction information, using similarity distribution.
[0044] It should be noted that, in order to use historical interaction information to calculate the similarity distribution in subsequent steps, the number of hops for each sampled second path needs to be the same as the current dialogue round to ensure the accuracy of localization.
[0045] Specifically, for each observation point i, the second feature representation of the observation point is obtained by averaging the encoding results of the K second paths ending at observation point i, where K represents the number of second paths ending at observation point i and k indexes the k-th such path.
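The averaging over second paths can be sketched as follows. Several details are assumptions of this example rather than statements from the patent: paths are enumerated as walks that may revisit nodes, the per-path encoder is replaced by a simple mean of node features, and the names `paths_ending_at`, `mean_vec`, and `second_feature` are hypothetical.

```python
def paths_ending_at(adj, end, hops):
    """All walks with exactly `hops` edges that end at observation point `end`."""
    paths = [[end]]
    for _ in range(hops):
        # Extend every partial path backwards by one neighbor of its first node.
        paths = [[n] + p for p in paths for n in adj[p[0]]]
    return paths


def mean_vec(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def second_feature(adj, first_feats, i, hops):
    """Average of the encodings of all `hops`-hop paths ending at point i."""
    paths = paths_ending_at(adj, i, hops)
    # Placeholder path encoder: mean of the first feature representations on the path.
    encoded = [mean_vec([first_feats[n] for n in p]) for p in paths]
    return mean_vec(encoded)


adj = {0: [1], 1: [0, 2], 2: [1]}               # line graph r_0 - r_1 - r_2
first_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feat_r1 = second_feature(adj, first_feats, i=1, hops=1)  # dialogue round 1
```

Setting `hops` to the current dialogue round mirrors the requirement that the second path length match the number of turns so far.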
[0046] Step 3013: The target agent uses a second observation point encoder to obtain a third feature representation of the first observation point based on the encoded information of all the first paths starting from the first observation point and the object feature information observed at the first observation point; and uses a first language generation module to generate the location feature language information of the first observation point based on the third feature representation using an autoregressive method and sends it to the navigation agent; the first observation point is the current observation point of the target agent.
[0047] In this step, the target agent uses the second observation point encoder in the localization network to generate a third feature representation of its observation point. The third feature representation contains encoded information about the location of the target agent and environmental information, and is equivalent to the first feature representation of the observation point obtained in step 3011. Then, the first language generation module converts the third feature representation of the target agent's observation point into machine language, and the resulting location feature language information is sent to the navigation agent so that the navigation agent can learn the characteristics of the current observation point of the target agent.
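To make the autoregressive generation concrete, the sketch below samples a discrete message one word at a time, feeding each chosen word back into the state before the next choice. The state update rule, the parameter shapes, and the names `generate_message`, `W_out`, and `E_tok` are assumptions for illustration; the patent does not disclose the internals of the language generation module.

```python
import math
import random


def generate_message(feature, W_out, E_tok, max_len, rng):
    """Sample w_0..w_{len-1}; returns the word ids and the summed log-probability."""
    state = list(feature)
    message, log_prob = [], 0.0
    for _ in range(max_len):
        # Distribution over the vocabulary given the current state (softmax of logits).
        logits = [sum(w * s for w, s in zip(row, state)) for row in W_out]
        top = max(logits)
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        word = rng.choices(range(len(probs)), weights=probs)[0]
        message.append(word)
        log_prob += math.log(probs[word])
        # Autoregression: fold the embedding of the chosen word back into the state.
        state = [math.tanh(s + e) for s, e in zip(state, E_tok[word])]
    return message, log_prob


rng = random.Random(0)
vocab_size, dim, max_len = 5, 3, 4
W_out = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]
E_tok = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]
message, log_prob = generate_message([0.3, -0.2, 0.7], W_out, E_tok, max_len, rng)
```

The accumulated log-probability is what a policy-gradient loss would later scale by the reward.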
[0048] Step 3014: The navigation agent uses the first language understanding module to encode the concatenation of the historical interaction information between the current navigation agent and the target agent and the location feature language information, obtaining the corresponding semantic encoding vector; based on the similarity distribution between the semantic encoding vector and the second feature representations, it predicts the current location of the target agent and obtains the first position.
[0049] In this step, the navigation agent uses the first language understanding module in the localization network to encode the concatenation of the historical interaction information between the current navigation agent and the target agent and the location feature language information, obtaining a semantic encoding vector. The similarity distribution between this semantic encoding vector and the second feature representation of each observation point obtained in step 3012 is then calculated, and the current location of the target agent is predicted from that distribution.
[0050] The historical interaction information refers to all dialogue information exchanged between the navigation agent and the target agent.
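The similarity-based prediction in step 3014 can be sketched as a softmax over similarity scores. The dot-product similarity, the greedy argmax, and the names `softmax` and `localize` are assumptions of this example; during reinforcement learning one would typically sample from the distribution rather than take the maximum.

```python
import math


def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def localize(semantic_vec, second_feats):
    """Return (predicted observation point, similarity distribution over all points)."""
    scores = [sum(a * b for a, b in zip(semantic_vec, feat)) for feat in second_feats]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


semantic_vec = [1.0, 0.0]                              # encoded history + message
second_feats = [[0.2, 0.9], [0.9, 0.1], [0.5, 0.5]]    # one vector per observation point
first_position, distribution = localize(semantic_vec, second_feats)
```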
[0051] Step 302: The navigation agent uses the path planning network of the multi-agent interaction model to determine the shortest path from the first position to the target position based on the first position and the target position of navigation.
[0052] In this step, the navigation agent uses the localization result of the target agent obtained in step 301 (the first position) and the current navigation target location to determine the shortest path from the first position to the target location, so that subsequent steps can guide the target agent's next movement route based on this shortest path. Existing methods, such as the Floyd (Floyd-Warshall) algorithm, can be used to obtain the shortest path, but this is not a limitation.
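A compact sketch of the Floyd-Warshall algorithm with next-hop tracking is shown below, since the next hop is what the navigation network needs downstream. Unit edge weights and an undirected graph are assumptions made for this example, as are the function names.

```python
def floyd_warshall(n, edges):
    """All-pairs shortest paths; returns (dist, nxt) where nxt[i][j] is the first hop."""
    inf = float("inf")
    dist = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    nxt = [[None] * n for _ in range(n)]
    for a, b in edges:  # undirected edges with unit weight (assumption)
        dist[a][b] = dist[b][a] = 1
        nxt[a][b], nxt[b][a] = b, a
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    return dist, nxt


def shortest_path(nxt, src, dst):
    """Reconstruct the node sequence from src to dst; empty list if unreachable."""
    if src == dst:
        return [src]
    if nxt[src][dst] is None:
        return []
    path = [src]
    while src != dst:
        src = nxt[src][dst]
        path.append(src)
    return path


# Ring of five observation points: 0-1-2-3-4-0.
dist, nxt = floyd_warshall(5, [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)])
path_to_2 = shortest_path(nxt, 0, 2)
path_to_3 = shortest_path(nxt, 0, 3)
```

Because the table is computed once per map, the navigation agent can look up `nxt[first_position][target]` on every dialogue turn without re-running the search.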
[0053] Step 303: The navigation agent uses the navigation network of the multi-agent interaction model to navigate the target agent, including: generating a path navigation language for the next hop path for the target agent based on the shortest path and sending it to the target agent, so as to trigger the target agent to predict the next hop movement path based on the path navigation language and move accordingly according to the prediction result.
[0054] In this step, the navigation agent will use the model's navigation network to guide the target agent on its next movement route.
[0055] As shown in Figure 2, the navigation network may specifically include a route information encoder, a language generation module, and a language understanding module. Accordingly, in one embodiment, the following steps 3031-3033 may be used to navigate the target agent:
[0056] Step 3031: The navigation agent uses a route information encoder to encode the next-hop path of the target agent indicated by the shortest path, and obtains the feature code of the next-hop path; the second language generation module uses the feature code of the next-hop path to generate the corresponding path navigation language and sends it to the target agent.
[0057] In this step, based on the shortest path obtained in step 302, the next-hop path from the current location of the target agent is found, encoded, and then converted into machine language for communication using the language generation module. The corresponding path navigation language is then sent to the target agent for it to make a next-hop path decision.
[0058] Step 3032: The target agent uses a second language understanding module to encode the concatenation of the historical interaction information between the navigation agent and the target agent and the path navigation language, obtaining the corresponding navigation encoding representation.
[0059] It's important to note that in practical applications, the following scenario might occur: First, the target agent is at position 1. It then chooses an incorrect single-hop path, arriving at position 2. Second, it chooses the correct path at position 2 and returns to position 1. Upon returning to position 1, the target agent might repeat steps one and two due to the same input. To avoid this, this step incorporates historical interaction information between the current navigation agent and the target agent, concatenating it with the path navigation language. This allows the target agent to avoid repeating incorrect paths based on this historical interaction information.
[0060] Step 3033: The target agent predicts the next-hop path based on the similarity distribution between the navigation encoding representation and the encoding results of the first paths corresponding to the first observation point, and moves along the predicted next-hop path.
[0061] In this step, the navigation encoding representation obtained from the path navigation language sent by the navigation agent is used to find, among all the first paths corresponding to the first observation point, the first path with the highest similarity, which is taken as the next-hop path.
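The highest-similarity selection in step 3033 amounts to an argmax over the candidate single-hop paths. In the brief sketch below, the dot-product similarity, the dictionary layout keyed by neighbor id, and the name `choose_next_hop` are assumptions for illustration.

```python
def choose_next_hop(nav_code, path_codes):
    """Pick the neighbor whose single-hop path encoding best matches the navigation code.

    `path_codes` maps a neighbor id to the encoding of the path leading to it.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return max(path_codes, key=lambda neighbor: dot(nav_code, path_codes[neighbor]))


# Two candidate paths from the current observation point, to neighbors 2 and 5.
next_hop = choose_next_hop([0.0, 1.0], {2: [1.0, 0.0], 5: [0.1, 0.8]})
```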
[0062] Step 304: After the target agent reaches the target location, calculate the corresponding loss function values based on the output results of the corresponding modules of the navigation agent and the target agent in the localization network and the navigation network, respectively, and use the loss function values to optimize and adjust the network parameters of the corresponding modules.
[0063] In this step, a localization loss function value and a navigation loss function value are calculated for the navigation agent based on the output results of its corresponding modules in the localization network and the navigation network, respectively; likewise, a localization loss function value and a navigation loss function value are calculated for the target agent based on the output results of its corresponding modules in the localization network and the navigation network, respectively.
[0064] In one implementation, the loss function value can be calculated using the following method:
[0065] Step 3041: Based on the output results of the navigation agent in the corresponding module of the positioning network during the positioning process, calculate the positioning loss function value corresponding to the navigation agent according to the reinforcement learning method.
[0066] This step can be implemented using existing methods, specifically by calculating the localization loss function value based on the following formula 1:
[0067]
[0068] where:
[0069] The localization loss function value corresponding to the navigation agent.
[0070] R(·): The reward function during the localization process. A positive reward is given if the navigation agent correctly predicts the location of the target agent; otherwise, a negative penalty is given.
[0071] $\pi_{guide}$: The action function of the navigation agent. It gives the probability that the navigation agent, in state $s_{guide}$, predicts a given location for the target agent.
[0072] $s_{guide}$: The state information of the navigation agent. It includes the historical interaction information $h_t$, the information of all observation points ($g_i$, $f_i$), and the connectivity G between observation points.
[0073] The location of the target agent predicted by the navigation agent.
[0074] Step 3042: Based on the output results of the target agent in the corresponding module of the localization network during the localization process, calculate the localization loss function value corresponding to the target agent according to the reinforcement learning method.
[0075] This step can be implemented using existing methods, specifically, the localization loss function value can be calculated based on the following formula 2:
[0076]
[0077] where:
[0078] The localization loss function value corresponding to the target agent.
[0079] R(·): The reward function during the localization process. If the navigation agent correctly predicts the location of the target agent, a positive reward is given; otherwise, a negative penalty is given.
[0080] M: The length of the sentence generated by the target agent.
[0081] $\pi_{tourist}$: The action function of the target agent. It gives the probability that the target agent, in state $s_{tourist}$, generates each word of the sentence.
[0082] $s_{tourist}$: The state information of the target agent. It includes the observation information of the observation point where the target agent is located (the encoded information of the first paths $g_{tourist}$ and the object feature information $f_{tourist}$ observed at the observation point).
[0083] The words in the language generated by the target agent.
[0084] Step 3043: Based on the output results of the navigation agent in the corresponding module of the navigation network during the navigation process, calculate the navigation loss function value corresponding to the navigation agent according to the reinforcement learning method.
[0085] This step can be implemented using existing methods, specifically, the navigation loss function value in this step can be calculated based on the following formula 3:
[0086]
[0087] where:
[0088] The navigation loss function value corresponding to the navigation agent.
[0089] R(·): The reward function during navigation. A positive reward is given if the target agent correctly predicts the next path planned by the navigation agent; otherwise, a negative penalty is given.
[0090] M: The length of the sentence generated by the navigation agent.
[0091] $\pi_{guide}$: The action function of the navigation agent. It gives the probability that the navigation agent, in state $s_{guide}$, generates each word of the sentence.
[0092] $s_{guide}$: The state information of the navigation agent. It includes the route planned by the navigation agent for the next step of the target agent.
[0093] The words in the language generated by the navigation agent.
[0094] Step 3044: Based on the output results of the target agent in the corresponding module of the navigation network during the navigation process, calculate the navigation loss function value corresponding to the target agent according to the reinforcement learning method.
[0095] This step can be implemented using existing methods, specifically by calculating the navigation loss function value based on the following formula 4:
[0096]
[0097] where:
[0098] The navigation loss function value corresponding to the target agent.
[0099] R(·): The reward function during navigation. A positive reward is given if the target agent correctly predicts the next path planned by the navigation agent; otherwise, a negative penalty is given.
[0100] $\pi_{tourist}$: The action function of the target agent. It gives the probability that the target agent, in state $s_{tourist}$, predicts a given next movement.
[0101] $s_{tourist}$: The state information of the target agent. It includes the historical interaction information $h_t$ and the path information $g_i$ of its current observation point.
[0102] The next-hop path predicted by the target agent.
[0103] In practice, there is no requirement for the order of execution of steps 3041 to 3044.
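Formulas 1 to 4 themselves are not reproduced in this text. For orientation only, and assuming they follow the standard REINFORCE policy-gradient form suggested by the symbol definitions, the four losses would look roughly as follows. The loss symbols $\mathcal{L}$, the predicted location $\hat{r}$, the predicted next hop $\hat{g}$, the zero-based word index, and the conditioning on earlier words $w_{<i}$ are all notational assumptions made here, not the patent's own notation.

```latex
% Hedged sketch: standard REINFORCE-style losses, not the patent's formulas.
\begin{align}
\mathcal{L}^{loc}_{guide}   &= -R(\cdot)\,\log \pi_{guide}\!\left(\hat{r} \mid s_{guide}\right) \\
\mathcal{L}^{loc}_{tourist} &= -R(\cdot)\sum_{i=0}^{M-1} \log \pi_{tourist}\!\left(w_i \mid w_{<i},\, s_{tourist}\right) \\
\mathcal{L}^{nav}_{guide}   &= -R(\cdot)\sum_{i=0}^{M-1} \log \pi_{guide}\!\left(w_i \mid w_{<i},\, s_{guide}\right) \\
\mathcal{L}^{nav}_{tourist} &= -R(\cdot)\,\log \pi_{tourist}\!\left(\hat{g} \mid s_{tourist}\right)
\end{align}
```

In this form, a positive reward lowers the loss when the chosen action was likely, so minimizing it raises the probability of rewarded predictions and words; a negative reward does the opposite.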
[0104] It should be noted that the above method employs reinforcement learning algorithms for training to address the problem that the discrete information transmission between the target agent and the navigation agent prevents gradient backpropagation. In the localization task, the reward function R(·) provides a positive reward when the navigation agent selects correctly and a penalty when it guesses incorrectly. Similarly, in the guidance task, the target agent receives a positive reward for a correct selection and a negative reward for an incorrect selection. The target agent's state information includes historical interaction information and the feature information of its current observation point. The navigation agent's state information includes historical interaction information, the features of all observation points, and structural information.
[0105] In the above formula, each word $w_i$ is selected from the vocabulary list. Note that the vocabulary has no initial meaning before training. Taking the localization task as an example, the word representations in the given vocabulary list are first randomly initialized. When the navigation agent correctly guesses the location of the target agent, both agents receive positive rewards. This drives the target agent to increase the probability of using that round's words in similar states, and drives the navigation agent to adjust its word representations so that it is more likely to select similar observation points when it receives the current sentence. Conversely, when the navigation agent mislocates the target agent, both agents are penalized, which reduces the probability that the target agent uses these words in that state and the probability that the navigation agent selects that observation point when it receives that sentence. The agents' language generation and language understanding abilities thus develop together until the agents reach a consensus and form a discrete protocol, that is, an emergent language.
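The effect of the reward on word usage can be illustrated with a minimal sketch. It assumes a categorical word policy parameterized by logits and a plain REINFORCE update; the vocabulary size, learning rate, and function names are hypothetical and are not taken from the patent.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def reinforce_update(logits, word, reward, lr=0.5):
    """One REINFORCE step on the logits of a categorical word policy.

    The gradient of log pi(word) with respect to the logits is
    one_hot(word) - softmax(logits). Scaling it by the reward raises the
    probability of the word after a positive reward and lowers it after
    a penalty.
    """
    probs = softmax(logits)
    grad_log_prob = -probs          # new array, so logits are not modified
    grad_log_prob[word] += 1.0
    return logits + lr * reward * grad_log_prob

# Hypothetical setup: a 5-word vocabulary with randomly initialized logits,
# so no word carries meaning before training.
rng = np.random.default_rng(0)
logits = rng.normal(size=5)
word = 2  # index of the word the target agent used in this round

p_before = softmax(logits)[word]
p_after_reward = softmax(reinforce_update(logits, word, reward=+1.0))[word]
p_after_penalty = softmax(reinforce_update(logits, word, reward=-1.0))[word]
print(p_before, p_after_reward, p_after_penalty)
```

After a positive reward the probability of the used word rises, and after a penalty it falls, which is the mechanism by which initially meaningless words acquire a shared meaning.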
[0106] As can be seen from the above embodiment of the model training method, when the model is trained to perform the visual language navigation task, the task is broken down into three sub-tasks: localization, path planning, and navigation. The three sub-tasks are modeled separately as a localization network, a path planning network, and a navigation network, and the three networks work together to complete the navigation task. During execution of the navigation task, the navigation agent performs path navigation for the target agent and generates the corresponding navigation language based on the navigation result to guide the target agent in selecting the planned route, thus helping the target agent reach the navigation target location more quickly. At the same time, the agents interact using an emergent language learned spontaneously during interaction rather than natural language, which avoids the high cost of using labeled data for model training. Furthermore, because the environmental features of each observation point (that is, the object feature information that can be observed at that observation point) are introduced during execution of the navigation task, richer environmental data can be used to enhance the agents' language expression and generalization capabilities through multi-turn language interaction, so that the multi-agent interaction model is no longer restricted to a narrow set of application scenarios. Therefore, with the above embodiment of the present invention, the visual language navigation task can be completed efficiently, at low cost, and with strong generalization performance.
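The path planning sub-task can be sketched as follows. This is an illustration only: it assumes the map is an unweighted graph of observation points and uses breadth-first search as a stand-in for the path planning network, whose internal design is not specified here. The graph and the function names are hypothetical.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search on an unweighted graph; returns a node list or None."""
    if start == goal:
        return [start]
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                if neighbor == goal:
                    # Walk back through the parents to rebuild the path.
                    path = [goal]
                    while parents[path[-1]] is not None:
                        path.append(parents[path[-1]])
                    return path[::-1]
                queue.append(neighbor)
    return None

def next_hop(path):
    """Return the first edge (current point, next point) of a path, or None at the goal."""
    return (path[0], path[1]) if path and len(path) > 1 else None

# Hypothetical map: observation points as nodes, single-hop paths as edges.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
path = shortest_path(graph, "A", "E")
print(path, next_hop(path))
```

In the decomposition described above, the located position would play the role of `start`, the navigation target the role of `goal`, and the result of `next_hop` is what the navigation agent would then describe to the target agent in the emergent language.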
[0107] Based on the above-described training method embodiment for the multi-agent interaction model, this invention also proposes a multi-agent interaction method, which includes the following:
[0108] When navigating a target agent, the navigation agent uses a pre-trained multi-agent interaction model to interact with the target agent and assist the target agent in predicting the next-hop movement path, so that the target agent moves according to the next-hop movement path; wherein, the multi-agent interaction model is obtained in advance using any of the multi-agent interaction model training methods described above.
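The target agent's prediction step can be sketched as a similarity distribution between the encoded navigation message and the encodings of the single-hop paths leaving its current observation point. The sketch below assumes dot-product similarity followed by a softmax; the vectors are made-up values standing in for the outputs of trained modules, and the function name is hypothetical.

```python
import numpy as np

def similarity_distribution(query, candidates):
    """Softmax over dot-product similarities between a query vector and candidate rows."""
    scores = candidates @ query
    scores = scores - scores.max()  # subtract the maximum for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

# Hypothetical encodings; in the method these would come from trained modules.
navigation_code = np.array([1.0, 0.0, 1.0])  # encoding of the received navigation message
candidate_paths = np.array([
    [0.9, 0.1, 0.8],    # single-hop path 0 from the current observation point
    [0.0, 1.0, 0.0],    # single-hop path 1
    [-1.0, 0.0, 0.5],   # single-hop path 2
])

probs = similarity_distribution(navigation_code, candidate_paths)
predicted = int(np.argmax(probs))  # index of the path the target agent would follow
print(probs, predicted)
```

At inference time the most similar path is taken greedily here; during training, sampling from `probs` instead would supply the exploration that a reinforcement learning update needs.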
[0109] Based on the above-described training method embodiment for the multi-agent interaction model, the present invention also proposes a training device for the multi-agent interaction model, which includes a navigation agent and a target agent.
[0110] The navigation agent includes a processor and a memory; wherein the memory stores an application program that can be executed by the processor, which enables the processor to perform the operations performed by the navigation agent in any of the multi-agent interaction model training methods described above.
[0111] The target agent includes a processor and a memory; wherein the memory stores an application program that can be executed by the processor, which enables the processor to perform the operations performed by the target agent in any of the multi-agent interaction model training methods described above.
[0112] The training methods and devices for the aforementioned multi-agent interaction models are based on the same inventive concept. Since the methods and devices solve problems in similar ways, the implementation of the devices and methods can refer to each other, and the repeated parts will not be described again.
[0113] Based on the above embodiments of the multi-agent interaction method, the present invention also proposes a multi-agent interaction device, which includes a navigation agent and a target agent;
[0114] The navigation agent includes a processor and a memory; wherein the memory stores an application program that can be executed by the processor, for causing the processor to perform operations performed by the navigation agent as described in the multi-agent interaction method;
[0115] The target agent includes a processor and a memory; wherein the memory stores an application program that can be executed by the processor, which enables the processor to perform the operations performed by the target agent in the multi-agent interaction method described above.
[0116] The above-mentioned multi-agent interaction methods and devices are based on the same inventive concept. Since the methods and devices solve problems in similar ways, the implementation of the devices and methods can refer to each other, and the repeated parts will not be described again.
[0117] Furthermore, each embodiment of the present invention can be implemented by a data processing program executed by a data processing device such as a computer. Clearly, the data processing program constitutes the present invention. Moreover, the data processing program, typically stored in a storage medium, is executed by directly reading the program from the storage medium or by installing or copying the program to the storage device (such as a hard disk and / or memory) of the data processing device. Therefore, such a storage medium also constitutes the present invention. The storage medium can use any type of recording method, such as paper storage media (e.g., paper tape), magnetic storage media (e.g., floppy disks, hard disks, flash memory), optical storage media (e.g., CD-ROMs), magneto-optical storage media (e.g., MOs), etc.
[0118] Furthermore, the steps described in this invention can be implemented not only by a data processing program but also by hardware, such as logic gates, switches, application-specific integrated circuits (ASICs), programmable logic controllers, and embedded microcontrollers. Therefore, such hardware capable of implementing the methods described in this invention can also constitute this invention.
[0119] The solutions described in this specification and embodiments, if involving the processing of personal information, will be processed only on the premise of having a legal basis (such as obtaining the consent of the personal information subject, or being necessary for the performance of a contract), and will only be processed within the scope stipulated or agreed upon. A user's refusal to process personal information beyond what is necessary for basic functions will not affect the user's use of basic functions.
[0120] In summary, the above are merely preferred embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A training method for a multi-agent interaction model, characterized in that it comprises: when the target agent being navigated needs to obtain the next-hop path, the navigation agent uses the localization network of the multi-agent interaction model to locate the first position of the target agent based on the object feature information currently observed by the target agent; the navigation agent uses the path planning network of the multi-agent interaction model to determine the shortest path from the first position to the navigation target position based on the first position and the target position; the navigation agent uses the navigation network of the multi-agent interaction model to navigate the target agent, including: generating, based on the shortest path, a path navigation language for the target agent's next-hop path and sending it to the target agent, to trigger the target agent to predict the next-hop movement path based on the path navigation language and to move according to the prediction result; once the target agent reaches the target position, the corresponding loss function values are calculated based on the output results of the corresponding modules of the navigation agent and the target agent in the localization network and the navigation network, respectively, and the network parameters of the corresponding modules are optimized and adjusted using the loss function values.
2. The method of claim 1, wherein the step of locating the first position of the target agent based on the object feature information currently observed by the target agent includes: the navigation agent uses a first observation point encoder to obtain a first feature representation of each observation point in a preset electronic map, based on the encoding information of all first paths originating from that observation point and the object feature information that can be observed at that observation point, where a first path is a single-hop path and the encoding information of a first path is the result of encoding the position features of that path; the navigation agent uses a structural encoder to obtain all second paths on the electronic map that have each observation point as the endpoint, encodes, for each second path, the first feature representations of all nodes on the path to obtain the encoding result of the path, and, for each observation point i, calculates the average of the encoding results of all second paths of observation point i to obtain the second feature representation of that observation point, where the number of hops of a second path is the same as the current dialogue round; the target agent uses a second observation point encoder to obtain a third feature representation of the first observation point based on the encoding information of each single-hop path starting from the first observation point and the object feature information observed at the first observation point, uses a first language generation module to generate, autoregressively from the third feature representation, the position feature language information of the first observation point, and sends it to the navigation agent, where the first observation point is the current observation point of the target agent; the navigation agent uses a first language understanding module to encode the concatenation of the historical interaction information between the navigation agent and the target agent and the position feature language information into a corresponding semantic encoding vector, predicts the current position of the target agent based on the similarity distribution between the semantic encoding vector and the second feature representations, and thereby obtains the first position.
3. The method of claim 2, wherein navigating the target agent includes: the navigation agent uses a route information encoder to encode the next-hop path of the target agent indicated by the shortest path, obtaining the feature code of the next-hop path, and uses a second language generation module to generate the corresponding path navigation language based on the feature code of the next-hop path and send it to the target agent; the target agent uses a second language understanding module to encode the concatenation of the historical interaction information between the navigation agent and the target agent and the path navigation language, thereby obtaining the corresponding navigation code representation; the target agent predicts the next-hop path based on the similarity distribution between the navigation code representation and the encoding results of the first paths corresponding to the first observation point, and moves according to the prediction result.
4. The method of claim 1, wherein calculating the corresponding loss function values based on the output results of the corresponding modules of the navigation agent and the target agent in the localization network and the navigation network, respectively, includes: based on the output results of the navigation agent in the corresponding module of the localization network during the localization process, calculating the localization loss function value corresponding to the navigation agent according to the reinforcement learning method; based on the output results of the target agent in the corresponding module of the localization network during the localization process, calculating the localization loss function value corresponding to the target agent according to the reinforcement learning method; based on the output results of the navigation agent in the corresponding module of the navigation network during the navigation process, calculating the navigation loss function value corresponding to the navigation agent according to the reinforcement learning method; based on the output results of the target agent in the corresponding module of the navigation network during the navigation process, calculating the navigation loss function value corresponding to the target agent according to the reinforcement learning method.
5. A multi-agent interaction method, comprising: when navigating a target agent, the navigation agent uses a pre-trained multi-agent interaction model to interact with the target agent and assist the target agent in predicting the next-hop movement path, so that the target agent moves according to the next-hop movement path; wherein the multi-agent interaction model is obtained in advance using any of the training methods described in claims 1 to 4.
6. A device for training a multi-agent interaction model, comprising a navigation agent and a target agent; the navigation agent includes a processor and a memory, wherein the memory stores an application program that can be executed by the processor, for causing the processor to perform the operations performed by the navigation agent in any of the training methods described in claims 1 to 4; the target agent includes a processor and a memory, wherein the memory stores an application program that can be executed by the processor, for causing the processor to perform the operations performed by the target agent in any of the training methods described in claims 1 to 4.
7. A multi-agent interaction device, characterized in that it comprises a navigation agent and a target agent; the navigation agent includes a processor and a memory, wherein the memory stores an application program that can be executed by the processor, for causing the processor to perform the operations performed by the navigation agent in the multi-agent interaction method as described in claim 5; the target agent includes a processor and a memory, wherein the memory stores an application program that can be executed by the processor, for causing the processor to perform the operations performed by the target agent in the multi-agent interaction method as described in claim 5.