Robot control method, device, electronic device, computer-readable storage medium and computer program product

By employing a first and second network mechanism for adversarial training, combined with a visual language model for multi-view matching, the problem of low accuracy in robot task planning and execution in existing reinforcement learning methods is solved, achieving efficient and accurate task execution and generalization capabilities.

CN122185147A · Pending · Publication Date: 2026-06-12 · BEIJING CO WHEELS TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING CO WHEELS TECH CO LTD
Filing Date
2024-12-10
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing reinforcement learning methods suffer from problems such as low sample efficiency, high dependence on explicit world models, weak generalization ability, and incomplete task description in robot task planning, resulting in low accuracy of robot task instruction planning and execution.

Method used

An adversarial training mechanism using a first network and a second network is adopted. The first network generates action instructions, while the second network supervises the task completion. A visual language model is used for multi-view matching and feedback optimization, reducing the dependence on the explicit world model and improving the network training effect.

Benefits of technology

It improves the accuracy of robot task instruction planning and execution, enhances the model's generalization ability in new environments, reduces dependence on samples, and achieves self-optimization and adjustment.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides a robot control method and device, electronic equipment, computer readable storage medium and computer program product. The method comprises: reasoning according to task information and first environment information through a first network to determine an action instruction; controlling the robot to execute the action instruction and obtaining second environment information after the execution of the action instruction; determining the task completion condition according to the second environment information and the task information through a second network; wherein the first network and the second network are obtained by performing adversarial training on an initial first network for generating sample training instructions and an initial second network for determining the task completion condition corresponding to the sample training instructions. Through the application, the accuracy of instruction planning and execution of the robot task can be improved.

Description

Technical Field

[0001] This application relates to artificial intelligence technology, and more particularly to a robot control method, apparatus, electronic device, computer-readable storage medium, and computer program product.

Background Technology

[0002] Currently, reinforcement learning is frequently used in autonomous intelligent agents and robot task planning. In these methods, the agent or robot learns how to perform tasks, move, and explore through interaction with its environment. However, most reinforcement learning methods require extensive interaction with the environment to acquire useful information, which is impractical in many real-world applications. This leads to poor model training results, and consequently, lower accuracy when using the trained model for robot task instruction planning and execution.

Summary of the Invention

[0003] This application provides a robot control method, device, electronic device, computer-readable storage medium, and computer program product, which can improve the accuracy of robot task instruction planning and execution.

[0004] The technical solution of this application is implemented as follows:

[0005] This application provides a robot control method, the method comprising:

[0006] Performing reasoning, through a first network, based on task information and first environment information to determine an action instruction;

[0007] Controlling the robot to execute the action instruction, and obtaining second environment information after the action instruction is executed;

[0008] Determining, through a second network, a task completion status based on the second environment information and the task information;

[0009] Wherein the first network and the second network are obtained by adversarial training of an initial first network used to generate sample training instructions and an initial second network used to determine the task completion status corresponding to the sample training instructions.

[0010] This application provides a robot control device, the device comprising:

[0011] The instruction reasoning module is used to infer action instructions based on task information and first environment information through the first network;

[0012] The control module is used to control the robot to execute the action instructions and to acquire second environmental information after the action instructions are executed.

[0013] The determination module is used to determine the task completion status based on the second environment information and the task information through the second network; wherein the first network and the second network are obtained by adversarial training on an initial first network used to generate sample training instructions and an initial second network used to determine the task completion status corresponding to the sample training instructions.

[0014] Optionally, the robot control device further includes a training module. The training module is configured to: determine an initial first network and an initial second network; through the initial first network and the initial second network respectively, perform inference based on first sample task information and first sample environment information to output a first training action instruction, and perform inference based on the first sample task information, the first training action instruction, and second sample environment information to output a first task completion probability, where the second sample environment information represents the environment information obtained when the first sample task is completed starting from the first sample environment information; and update the network parameters of the initial first network and the initial second network based on the first task completion probability until a first training objective is achieved, thereby obtaining the first network and the second network.

[0015] Optionally, the training module is further configured to, during one round of training, use the initial first network to perform inference based on the first sample task information and the first sample environment information, output the first training action instruction, and use the initial second network to perform inference based on the first sample task information, the first training action instruction, and the second sample environment information, output the task completion probability; during another round of training, use the initial second network to perform inference based on the first sample task information and the first sample environment information, output the first training action instruction, and use the initial first network to perform inference based on the first sample task information, the first training action instruction, and the second sample environment information, output the task completion probability.

[0016] Optionally, the training module is further configured to use the original first network to perform inference based on the second sample task information and the third sample environment information, and output a second training action instruction; and update the network parameters of the original first network according to the first sample training action instruction and the second training action instruction corresponding to the second sample task information and the third sample environment information, until the second training objective is achieved, thereby obtaining the initial first network.

[0017] Optionally, the training module is further configured to use the original second network to perform inference based on the third sample task information, the fourth sample environment information, and the second sample training action instruction, and output the second task completion probability corresponding to the second sample training action instruction; and update the network parameters of the original second network based on the second task completion probability and the true value of the completion probability corresponding to the second sample training action instruction, until the third training objective is reached, thereby obtaining the initial second network.

[0018] Optionally, the training module is further configured to determine a first parameter adjustment amount corresponding to the initial first network with the objective of minimizing the error between the first task completion probability and the target task completion probability, and update the network parameters of the initial first network according to the first parameter adjustment amount; and to determine a second parameter adjustment amount corresponding to the initial second network with the objective of maximizing the error between the first task completion probability and the target task completion probability, and update the network parameters of the initial second network according to the second parameter adjustment amount.

[0019] Optionally, the first environmental information includes: a first image acquired from a first perspective; the second environmental information includes: a second image acquired from a second perspective; the second perspective is different from the first perspective; the determining module is further configured to perform template matching on the first environmental information and the second environmental information through the second network to determine the region of interest in the second environmental information; and determine the task completion status based on the region of interest and the task information.

[0020] Optionally, the determining module is further configured to, after determining the task completion status, send the task completion status and the second environmental information to the first network if the task completion status indicates that the task is not completed, so that the first network generates a new action instruction based on the task information, the task completion status and the second environmental information to complete the task.

[0021] Optionally, the first network includes a first visual language model; the second network includes a second visual language model.

[0022] This application provides an electronic device, the electronic device comprising:

[0023] Memory is used to store executable instructions for a computer;

[0024] The processor, when executing computer-executable instructions stored in the memory, implements the robot control method provided in the embodiments of this application.

[0025] This application provides a computer-readable storage medium storing a computer program or computer-executable instructions for implementing the robot control method provided in this application when executed by a processor.

[0026] This application provides a computer program product, including a computer program or computer-executable instructions, which, when executed by a processor, implement the robot control method provided in this application.

[0027] This application has the following beneficial effects:

[0028] The first and second networks are obtained through adversarial training of an initial first network and an initial second network. The initial first network generates training action commands, while the initial second network determines the task completion status corresponding to the training action commands. This allows the generated training action commands to be continuously optimized during training through a feedback mechanism, which enhances the network training effect and consequently improves the accuracy of command planning and completion monitoring based on the trained first and second networks. Thus, the trained first network infers action commands based on task information and first environmental information, while the trained second network determines the task completion status based on the task information and the second environmental information obtained after the robot executes the action commands. Because the second network monitors the completion of the action commands generated by the first network, task completion is evaluated in real time, enabling self-optimization and adjustment. The cooperation of the two networks improves the accuracy of robot task command planning and execution.

Attached Figure Description

[0029] Figure 1 is an optional flowchart of the robot control method provided in an embodiment of this application;

[0030] Figure 2 is an optional flowchart of the robot control method provided in an embodiment of this application;

[0031] Figure 3 is a schematic diagram of the process of applying the robot control method provided in an embodiment of this application to a real-world scenario;

[0032] Figure 4 is a schematic diagram of an optional structure of a robot control device;

[0033] Figure 5 is a schematic diagram of an optional structure of the electronic device provided in the embodiments of this application.

Detailed Implementation

[0034] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings. The described embodiments should not be regarded as limitations on this application. All other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0035] In the following description, references are made to “some embodiments,” which describe a subset of all possible embodiments. However, it is understood that “some embodiments” may be the same subset or different subsets of all possible embodiments and may be combined with each other without conflict.

[0036] In the following description, the terms "first, second, third" are used merely to distinguish similar objects and do not represent a specific ordering of objects. It is understood that "first, second, third" may be interchanged in a specific order or sequence where permitted, so that the embodiments of this application described herein can be implemented in an order other than that illustrated or described herein.

[0037] In this application embodiment, the terms "module" or "unit" refer to a computer program or part of a computer program that has a predetermined function and works with other related parts to achieve a predetermined goal, and can be implemented wholly or partially using software, hardware (such as processing circuitry or memory), or a combination thereof. Similarly, a processor (or multiple processors or memory) can be used to implement one or more modules or units. Furthermore, each module or unit can be part of an overall module or unit that includes the functionality of that module or unit.

[0038] Unless otherwise defined, all technical and scientific terms used in the embodiments of this application have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in the embodiments of this application is for the purpose of describing the embodiments of this application only and is not intended to limit this application.

[0039] In the implementation of this application, the collection and processing of relevant data should strictly comply with the requirements of relevant national laws and regulations, obtain the informed consent or separate consent of the personal information subject, and carry out subsequent data use and processing within the scope of laws and regulations and the authorization of the personal information subject.

[0040] Currently, reinforcement learning methods are frequently used in autonomous intelligent agents and robot task planning. In these methods, the agent or robot learns how to perform tasks, move, and explore through interaction with the environment. Common reinforcement learning methods include Q-learning and SARSA, which primarily make decisions by learning a state-action value function.

[0041] However, reinforcement learning methods typically suffer from the following significant drawbacks when applied to robot control:

[0042] I. Sample efficiency problem: In most reinforcement learning methods, intelligent agents or robots need to interact with the environment extensively to obtain useful information, which is impractical in many real-world applications.

[0043] II. The World Model Problem: Current reinforcement learning algorithms typically require an explicit world model, that is, they need a specific function or equation to describe the dynamic relationship between the intelligent agent and the environment. However, in many real-world problems, such a model is unavailable or difficult to obtain.

[0044] III. Generalization Ability Problem: In current reinforcement learning frameworks, the generalization ability of models is usually relatively weak. That is to say, if the robot works in a new environment, even if this environment is very similar to the training environment, the robot may still perform very poorly.

[0045] IV. Task Description Problem: In current training frameworks, the robot's task is usually given by the programmer in the form of a reward function. However, this type of method often fails to capture all the requirements and details of the task, especially for complex tasks.

[0046] Furthermore, in robot control, such as during robotic arm operations, when the robotic arm's actions cause errors due to its inability to perceive the global environment, a feedback mechanism is needed to correct and optimize its movements to ensure the accurate completion of the task. In addition, when multi-view cameras are introduced into robot control scenarios to obtain a global view, the impact of errors in the classification and recognition of images acquired from different perspectives on the monitoring of completion status becomes a concern.

[0047] In summary, current robot control systems have shortcomings in network training and performance supervision in multi-view scenarios. The action commands generated by the trained network model have low accuracy in completing tasks, thus reducing the accuracy of robot task instruction planning and execution.

[0048] This application provides a robot control method, device, electronic device, computer-readable storage medium, and computer program product, which can improve the learning efficiency and generalization ability of the task planning system, reduce the dependence on a large number of samples and explicit world models, and more accurately understand and execute complex tasks, thereby improving the accuracy of robot task instruction planning and execution.

[0049] The robot control method provided in this application can be applied to electronic devices. In some embodiments, the electronic device may include a control module on the robot, or it may include a server. That is, the robot collects first environmental information and second environmental information and sends them to the server. The server generates action commands and monitors the completion of the action commands based on the acquired task information, first environmental information, and second environmental information. The specific choice depends on the actual situation, and this application does not limit it.

[0050] Referring to Figure 1, which is a schematic diagram of an optional flowchart of the robot control method provided in an embodiment of this application: as shown in Figure 1, the method can be implemented by executing steps S101-S103, as follows:

[0051] S101. Through the first network, reasoning is performed based on task information and first environment information to determine action instructions.

[0052] In S101, the task information may include tasks in the form of voice or text, which are used to instruct the robot to perform the corresponding tasks. For example, the task information may include "Please move the cup on the table from point A to point B".

[0053] The first environmental information may include environmental information collected from the robot's environment before the task is executed. For example, it may include color images, depth maps, heat maps, etc. The specific selection is based on the actual situation, and this application embodiment does not limit it.

[0054] In some embodiments, the first environmental information includes an image captured from a first viewpoint. Exemplarily, the first viewpoint may include a top-down view or a bird's-eye view. The first environmental information includes a target object related to the task information. For example, for the task information "Please move the cup on the table from point A to point B," the first environmental information may include an image of the table with the cup placed on it, captured from a bird's-eye view.

[0055] In this embodiment, the first network can perform reasoning based on task information and first environmental information. Based on the first environmental information, it can plan at least one action step required for the robot to complete the task information and generate at least one action instruction corresponding to the at least one action step. For example, for the task information "Please move the cup on the table from point A to point B", combined with an image of the table with the cup taken from a bird's-eye view, action steps such as "move the robotic arm above point A", "grab the cup", "move the robotic arm to point B", and "release the cup" can be planned, and action instructions corresponding to each action step can be generated.

[0056] For example, the motion command can be output in the form of a vector of robot motion: T[x,y,z,rx,ry,rz,h], where (x,y,z) is the position of the point of force application on the object, (rx,ry,rz) is the rotation angle of the robotic arm, and h is the degree of opening of the gripper.
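The vector form above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `MotionCommand`, its methods, the units, and every numeric value in the sample plan are hypothetical and do not come from the application.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class MotionCommand:
    """One action instruction in the form T[x, y, z, rx, ry, rz, h]."""

    x: float   # position of the point of force application on the object
    y: float
    z: float
    rx: float  # rotation angle of the robotic arm
    ry: float
    rz: float
    h: float   # degree of opening of the gripper

    @classmethod
    def from_vector(cls, values: Sequence[float]) -> "MotionCommand":
        """Build a command from a 7-element vector, rejecting other lengths."""
        if len(values) != 7:
            raise ValueError(f"expected 7 components, got {len(values)}")
        return cls(*(float(v) for v in values))

    def to_vector(self) -> List[float]:
        """Return the command as the flat list [x, y, z, rx, ry, rz, h]."""
        return [self.x, self.y, self.z, self.rx, self.ry, self.rz, self.h]


# Hypothetical plan for "move the cup on the table from point A to point B".
# All numbers are made up; metres, radians and a 0..1 gripper opening are assumptions.
plan = [
    MotionCommand.from_vector([0.30, 0.10, 0.25, 0.0, 3.14, 0.0, 1.0]),  # above point A, gripper open
    MotionCommand.from_vector([0.30, 0.10, 0.05, 0.0, 3.14, 0.0, 0.2]),  # grab the cup
    MotionCommand.from_vector([0.60, 0.40, 0.05, 0.0, 3.14, 0.0, 0.2]),  # move to point B
    MotionCommand.from_vector([0.60, 0.40, 0.05, 0.0, 3.14, 0.0, 1.0]),  # release the cup
]
```

Keeping the seven components named rather than positional makes it harder to confuse the position, rotation and gripper fields when an action step is turned into an instruction.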

[0057] S102. Control the robot to execute action commands and obtain the second environmental information after the action commands are executed.

[0058] In step S102, the robot executes action commands inferred from the first network. After the action commands are executed, it collects information about the current environment again to obtain second environmental information. It can be understood that the second environmental information represents the environmental information obtained after the action commands alter the first environmental information.

[0059] Similarly, the second environmental information may include color images, depth maps, heat maps, etc., and the specific selection is made according to the actual situation. This application embodiment does not limit the selection.

[0060] In some embodiments, the second environmental information includes a second image acquired from a second perspective; the second perspective differs from the first perspective. For example, the first perspective is a bird's-eye view or a top-down view, while the second perspective can be a frontal view. That is, after the robot executes an action command, information can be acquired from another angle, and the task completion status can be determined based on the environmental information acquired from that other angle. This avoids the impact of perspective limitations on task execution and supervision, improving the accuracy of task execution.

[0061] S103. Through the second network, determine the task completion status based on the second environment information and task information.

[0062] In this embodiment of the application, the second network performs reasoning based on the second environment information and the task information to determine whether the state of the target object displayed in the second environment information matches the state after the task information has been completed, thereby outputting the task completion status.

[0063] In some embodiments, task completion status may include the task completion probability, or it may include the task completion probability and auxiliary information, etc., which are not limited in this application embodiment. Here, auxiliary information may include descriptive information about the task completion status, such as success or failure flag information, failure reason information, etc.

[0064] In some embodiments, for second environmental information collected from another perspective, a second network can be used to perform template matching between the first and second environmental information to determine the region of interest (ROI) in the second environmental information; based on the ROI and task information, the task completion status can be determined. In this way, by matching the first and second perspectives and then monitoring the completion status, multi-perspective monitoring is achieved, improving the accuracy of task completion monitoring.
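To make the idea of locating a region of interest concrete, the sketch below uses classical normalized cross-correlation (NCC) template matching on grayscale images stored as nested lists. This is a swapped-in technique for illustration only: in the application the matching is performed by the second network, and plain NCC would not by itself be robust to a real change of viewpoint. The function name `match_template` and the toy image are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Image = Sequence[Sequence[float]]


def _centered(values: List[float]) -> List[float]:
    """Subtract the mean so the score ignores uniform brightness offsets."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def match_template(image: Image, template: Image) -> Tuple[Tuple[int, int, int, int], float]:
    """Return ((row, col, height, width), score) of the best NCC match."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    if th > ih or tw > iw:
        raise ValueError("template must fit inside the image")
    t = _centered([v for row in template for v in row])
    t_norm = math.sqrt(sum(v * v for v in t))
    best_score, best_pos = -math.inf, (0, 0)
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            w = _centered([image[r + i][c + j] for i in range(th) for j in range(tw)])
            denom = t_norm * math.sqrt(sum(v * v for v in w))
            # Flat windows (zero variance) carry no information; score them 0.
            score = sum(a * b for a, b in zip(w, t)) / denom if denom > 0 else 0.0
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return (best_pos[0], best_pos[1], th, tw), best_score


# Toy example: a 3x3 pattern (standing in for the target object seen in the
# first image) is pasted into an otherwise empty second image at row 4, column 6.
pattern = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]]
image = [[0.0] * 12 for _ in range(10)]
for i, row in enumerate(pattern):
    for j, value in enumerate(row):
        image[4 + i][6 + j] = value

roi, score = match_template(image, pattern)
```

The returned rectangle plays the role of the region of interest, which would then be judged against the task information.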

[0065] In some embodiments, the first network includes a first visual-language model (VLM); the second network includes a second visual-language model (second VLM). Through VLM, multimodal inputs of images and text can be combined to efficiently subdivide complex tasks into a series of goals and actions, accurately parsing and executing complex tasks, and achieving efficient and precise action instruction planning. This greatly improves the accuracy and efficiency of instruction planning. Furthermore, VLM has excellent generalization ability, enabling it to cope with more diverse and complex real-world environments.

[0066] In this embodiment of the application, the first network and the second network are obtained by performing adversarial training on an initial first network used to generate training action instructions and an initial second network used to determine the task completion status corresponding to the training action instructions.

[0067] In this embodiment, adversarial training is used to solve the sample efficiency problem in instruction planning of current reinforcement learning methods, reduce the dependence on explicit world models, and enhance the model's generalization ability in new environments.

[0068] In some embodiments, after the task completion status is determined, if it indicates that the task is not completed, the second network sends the second environmental information to the first network. The first network can then generate new action commands based on the task information and the second environmental information to complete the task. For example, if the first action command fails to move the cup to the correct position, an image captured after the first action command was executed can be sent to the first network. The first network generates new action commands based on this image and the task information, and the robot executes them starting from the state left by the first action command, repeating until the cup is moved to the designated position.
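The feedback loop formed by S101-S103 and the retry described above can be sketched as follows. This is a minimal sketch, not the application's implementation: `run_task`, `CompletionStatus`, the probability threshold and the attempt limit are all assumptions, and the two networks are replaced by toy callables acting on a one-dimensional "cup position".

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class CompletionStatus:
    probability: float                    # task completion probability
    failure_reason: Optional[str] = None  # optional auxiliary information


def run_task(
    task: str,
    first_env: Any,
    plan: Callable[[str, Any, Optional[CompletionStatus]], Any],  # stands in for the first network
    execute: Callable[[Any], Any],                                # robot; returns the new environment information
    judge: Callable[[str, Any], CompletionStatus],                # stands in for the second network
    threshold: float = 0.9,   # assumed cut-off for "task completed"
    max_attempts: int = 10,   # assumed safety limit, not in the source
) -> Tuple[bool, int]:
    """Plan, execute, judge, and re-plan from the latest observation until done."""
    env, status = first_env, None
    for attempt in range(1, max_attempts + 1):
        action = plan(task, env, status)   # S101: infer an action instruction
        env = execute(action)              # S102: execute and re-observe
        status = judge(task, env)          # S103: determine completion status
        if status.probability >= threshold:
            return True, attempt
    return False, max_attempts


class SlippingGripper:
    """Toy 1-D world in which only 60% of each commanded displacement is realized."""

    def __init__(self) -> None:
        self.position = 0.0

    def execute(self, displacement: float) -> float:
        self.position += 0.6 * displacement
        return self.position


TARGET = 1.0  # hypothetical coordinate of point B


def toy_plan(task: str, position: float, status: Optional[CompletionStatus]) -> float:
    return TARGET - position  # command the remaining distance


def toy_judge(task: str, position: float) -> CompletionStatus:
    if abs(TARGET - position) < 0.05:
        return CompletionStatus(1.0)
    return CompletionStatus(0.0, "cup not at point B")


world = SlippingGripper()
done, attempts = run_task("move the cup to point B", 0.0, toy_plan, world.execute, toy_judge)
```

In this toy world the first three commands fall short, so the loop re-plans from the observed position each time and succeeds on the fourth attempt. The planner also receives the previous status, matching the variant in [0020] where the completion status is sent along with the second environmental information.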

[0069] It is understood that the first and second networks in this embodiment are obtained by adversarial training of an initial first network and an initial second network. The initial first network is used to generate training action instructions, and the initial second network is used to determine the task completion status corresponding to the training action instructions. This allows for continuous optimization and improvement of the generated training action instructions through a feedback mechanism during training, improving the network training effect and consequently enhancing the accuracy of instruction planning and completion monitoring based on the trained first and second networks. Thus, the trained first network infers action instructions based on task information and first environment information; and the trained second network determines task completion status based on second environment information and task information obtained after the robot executes the action instructions. This achieves real-time evaluation of task completion status through the second network monitoring the robot action instructions generated by the first network, enabling self-optimization and adjustment. The cooperation of the two networks improves the accuracy of robot task instruction planning and execution.

[0070] In some embodiments, the first network and the second network can be trained using the method shown in Figure 2, as follows:

[0071] S001. Determine the initial first network and the initial second network.

[0072] In some embodiments, the parameters of the VLM architecture neural network can be initialized to obtain an initial first network and an initial second network. Then, adversarial training is performed on the initial first network and the initial second network using a training sample set. During adversarial training, one of the initial first network and the initial second network outputs an action instruction, while the other outputs a probability of completing the task relative to the sample task information. This probability of completion is used as supervision information for iterative optimization of the network. In this way, the initial first network and the initial second network form a Generative Adversarial Network (GAN), which can reduce the workload of pre-labeling samples in a self-supervised manner and perform online adaptation and learning during task execution through the interaction and coordination of the two initial networks.
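One way to write down how the completion probability could supervise both networks, following the opposing objectives stated in [0018], is sketched below for a round in which the first network generates the instruction. The notation is an assumption introduced here: q is the sample task information, e_1 and e_2 are the first and second sample environment information, p^* is the target task completion probability, and \eta_1, \eta_2 are learning rates. The squared error is only one possible error measure; the application does not name one, and the sketch assumes E is differentiable in both parameter sets.

```latex
\begin{aligned}
a &= f_{\theta_1}(q, e_1), \qquad
p = g_{\theta_2}(q, a, e_2), \qquad
E(\theta_1, \theta_2) = \bigl(p - p^{*}\bigr)^{2}, \\
\theta_1 &\leftarrow \theta_1 - \eta_1 \,\nabla_{\theta_1} E
\quad \text{(first network: minimize the error)}, \\
\theta_2 &\leftarrow \theta_2 + \eta_2 \,\nabla_{\theta_2} E
\quad \text{(second network: maximize the error)}.
\end{aligned}
```

The opposite signs of the two updates are what make the training adversarial: the same error term pulls the two parameter sets in opposing directions.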

[0073] In some embodiments, before jointly performing self-supervised adversarial training on the initial first network and the initial second network, the initial first network and the initial second network can be obtained separately through supervised pre-training to optimize the accuracy of the initial first network inferring action instructions and the accuracy of the initial second network in judging the probability of task completion.

[0074] For example, using the original first network, inference is performed based on the second sample task information and the third sample environment information to output the second training action instruction; based on the first sample training action instruction and the second training action instruction corresponding to the second sample task information and the third sample environment information, the network parameters of the original first network are updated until the second training objective is achieved, thus obtaining the initial first network.

[0075] In this way, by optimizing the original first network with a large amount of training data to obtain the initial first network, the training action instructions generated by the initial first network when used for adversarial training with the initial second network can be as consistent as possible with the optimization target, thereby increasing the sample efficiency of adversarial training.

[0076] For example, using the original second network, inference is performed based on the third sample task information, the fourth sample environment information, and the second sample training action instructions to output the second task completion probability corresponding to the second sample training action instructions; based on the second task completion probability and the true value of the completion probability corresponding to the second sample training action instructions, the network parameters of the original second network are updated until the third training objective is reached, thus obtaining the initial second network.

[0077] In this way, by optimizing the original second network with a large amount of training data to obtain the initial second network, the initial second network can generate task completion probabilities that are as consistent as possible with the optimization objective when used for adversarial training against the initial first network, thereby increasing the sample efficiency of adversarial training.
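The pre-training of the second network in [0076] can be sketched in the same spirit. The example below assumes that the true value of the completion probability is a 0/1 label and that the inputs are summarized by one scalar feature; it trains a toy logistic model with the gradient of the binary cross-entropy loss. The feature, the function name `pretrain_second_network`, and the fixed epoch count used in place of the third training objective are assumptions for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def pretrain_second_network(samples, lr=0.5, epochs=500):
    """Fit a toy logistic judge P(completed) = sigmoid(w * feature + b).

    samples: list of (feature, completed) pairs with completed in {0.0, 1.0}.
    `feature` is a hypothetical score summarizing the task information, the
    environment information and the sample training action instruction.
    """
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in samples:
            p = sigmoid(w * x + b)
            # Gradient of binary cross-entropy with respect to the logit is (p - y).
            grad_w += (p - y) * x / n
            grad_b += (p - y) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical labelled data: negative scores failed, positive scores completed.
samples = [(-2.0, 0.0), (-1.0, 0.0), (-0.5, 0.0), (0.5, 1.0), (1.0, 1.0), (2.0, 1.0)]
w, b = pretrain_second_network(samples)
p_done = sigmoid(w * 1.5 + b)   # judged probability for a clearly completed case
p_fail = sigmoid(w * -1.5 + b)  # judged probability for a clearly failed case
```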

[0078] S002. Using the initial first network and the initial second network respectively, reasoning is performed based on the first sample task information and the first sample environment information to output the first training action instruction, and reasoning is performed based on the first sample task information, the first training action instruction and the second sample environment information to output the first task completion probability.

[0079] In this embodiment of the application, a training sample set for adversarial training of the initial first network and the initial second network can be predetermined.

[0080] The training sample set may include: first sample task information, first sample environment information, and second sample environment information. The second sample environment information represents the environment information obtained when the robot completes the sample task based on the first sample environment information. In other words, the second sample environment information can be pre-collected after the robot executes the correct action instructions within the first sample environment information to complete the first sample task. Then, the first sample task information, the first sample environment information, and the second sample environment information are used to form a set of training samples. Multiple sets of training samples are used to form a training sample set, which is then used to perform adversarial training on the initial first network and the initial second network.
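For illustration, one possible in-memory layout of the training sample set described in [0080] is sketched below. The class name `TrainingSample`, the field names, and the use of short numeric tuples in place of real environment observations (for example, images) are assumptions of this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TrainingSample:
    task_info: str                 # first sample task information
    env_before: Tuple[float, ...]  # first sample environment information
    env_after: Tuple[float, ...]   # second sample environment information,
                                   # collected after the task was completed correctly


def build_training_set(records) -> List[TrainingSample]:
    """Group (task, env_before, env_after) records into a training sample set."""
    return [TrainingSample(task, tuple(before), tuple(after))
            for task, before, after in records]


# Hypothetical records; the numeric values carry no physical meaning.
training_set = build_training_set([
    ("put the cup on the tray", [0.1, 0.9], [0.8, 0.2]),
    ("close the drawer", [0.5, 0.5], [0.0, 1.0]),
])
```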

[0081] It should be noted that in some embodiments, the second sample environment information and the first sample environment information can be collected from different angles. In this way, the trained second network can supervise the task completion status of the first network from different perspectives.

[0082] In some embodiments, during adversarial training, the initial first network can act as a producer (or executor; the P-Module, which plays the role of the generator in a generative adversarial network (GAN)), performing inference based on the first sample task information and the first sample environment information to output a first training action instruction. The initial second network can act as a judge (or supervisor; the J-Module, which plays the role of the discriminator in a GAN), performing inference based on the first sample task information, the first training action instruction, and the second sample environment information to output a first task completion probability.

[0083] In some embodiments, during a training round, an initial first network is used to infer based on the first sample task information and the first sample environment information to output a first training action instruction, and an initial second network is used to infer based on the first sample task information, the first training action instruction and the second sample environment information to output a task completion probability.

[0084] In another round of training, the initial second network is used to infer based on the first sample task information and the first sample environment information, and outputs the first training action instruction. The initial first network is also used to infer based on the first sample task information, the first training action instruction and the second sample environment information, and outputs the task completion probability.

[0085] In other words, during the training process, the roles of producer and judge can be interchanged between the initial first network and the initial second network. This allows both the initial first network and the initial second network to fully understand the task information through network training, thereby improving the training effect.
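The role interchange of [0083] to [0085] can be sketched as follows. The source only states that the roles are swapped between one round and another; alternating on every round, as done here, is an assumption, as are the class `DualRoleNetwork` and its placeholder `produce` and `judge` methods.

```python
class DualRoleNetwork:
    """Toy stand-in for a network that can act as producer or as judge."""

    def __init__(self, name):
        self.name = name

    def produce(self, task, env_before):
        # Placeholder for inferring a training action instruction.
        return f"{self.name}-action({task})"

    def judge(self, task, action, env_after):
        # Placeholder for inferring a task completion probability.
        return 0.5


def run_round(round_index, first, second, task, env_before, env_after):
    """Even rounds: first network produces. Odd rounds: roles are swapped."""
    producer, judge = (first, second) if round_index % 2 == 0 else (second, first)
    action = producer.produce(task, env_before)
    probability = judge.judge(task, action, env_after)
    return producer.name, judge.name, action, probability


first, second = DualRoleNetwork("first"), DualRoleNetwork("second")
rounds = [run_round(r, first, second, "stack blocks", "env_i", "env_i_prime")
          for r in range(4)]
```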

[0086] S003. Based on the first task completion probability, update the network parameters of the initial first network and the initial second network until the first training objective is achieved, and obtain the first network and the second network.

[0087] In S003, after each round of training, the network parameters of the initial first network and the initial second network are updated based on the first task completion probability obtained in that round of training.

[0088] In the case where the initial first network acts as the producer, its goal is to generate more accurate action instructions, i.e., to reduce the error between the task completion probability and the target task completion probability. Here, the target task completion probability represents the probability threshold of successful task completion; for example, the target task completion probability can be 90% or 100%, without specific limitation. Therefore, in some embodiments, with the goal of minimizing the error between the first task completion probability and the target task completion probability, a first parameter adjustment amount corresponding to the initial first network is determined, and the network parameters of the initial first network are updated according to the first parameter adjustment amount.

[0089] In the case where the initial second network acts as the judge, its goal is to more accurately determine the task completion probability and identify task failures. Therefore, in some embodiments, the adjustment amount of the second parameter corresponding to the initial second network can be determined with the objective of maximizing the error between the first task completion probability and the target task completion probability, and the network parameters of the initial second network can be updated according to the adjustment amount of the second parameter.
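The opposing update directions of [0088] and [0089] can be made concrete with a small numerical sketch. It assumes a squared error between the first task completion probability and the target, and a toy model in which that probability is the sigmoid of the sum of one producer parameter and one judge parameter. The model, the names, and the learning rate are hypothetical; only the descent and ascent directions are taken from the text.

```python
import math


def completion_probability(theta_p, theta_j):
    """Toy first task completion probability (hypothetical model)."""
    return 1.0 / (1.0 + math.exp(-(theta_p + theta_j)))


def squared_error(theta_p, theta_j, target):
    return (completion_probability(theta_p, theta_j) - target) ** 2


def error_gradient(theta_p, theta_j, target):
    """d(error)/d(theta); identical for both parameters in this toy model."""
    p = completion_probability(theta_p, theta_j)
    return 2.0 * (p - target) * p * (1.0 - p)


def first_parameter_adjustment(theta_p, theta_j, target, lr=0.5):
    # Producer: gradient descent, moving so as to minimize the error.
    return -lr * error_gradient(theta_p, theta_j, target)


def second_parameter_adjustment(theta_p, theta_j, target, lr=0.5):
    # Judge: gradient ascent, moving so as to maximize the error.
    return lr * error_gradient(theta_p, theta_j, target)


theta_p, theta_j, target = 0.0, 0.0, 0.9
before = squared_error(theta_p, theta_j, target)
after_producer = squared_error(
    theta_p + first_parameter_adjustment(theta_p, theta_j, target), theta_j, target)
after_judge = squared_error(
    theta_p, theta_j + second_parameter_adjustment(theta_p, theta_j, target), target)
```

Applied in isolation, the producer's adjustment lowers the error and the judge's adjustment raises it, which is the adversarial relationship the two paragraphs describe.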

[0090] For example, the training process of adversarial training in this application embodiment can be performed according to the following steps:

[0091] Step 1, Initialization Phase: The model parameters of the initial first network (P-Module) and the initial second network (J-Module) are randomly initialized.

[0092] Step 2: Select a batch of environmental states i (first sample environmental information) and first sample task information and input them into P-Module to generate task vector T (equivalent to the first training action instruction).

[0093] Step 3: Input the environmental state i' (second sample environmental information), the first sample task information, and the generated task vector T into the J-Module to generate the probability P of task completion (first task completion probability).

[0094] Step 4: Update the model parameters of P-Module and J-Module according to the probability of task completion.

[0095] Repeat steps 2-4 until the preset number of iterations is reached or the stopping condition is met.
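Steps 1 to 4 and the repetition rule of [0095] can be outlined as a loop. The sketch below is illustrative only: each module is reduced to a single scalar parameter, the task information is omitted, the environment states are scalars, and the learning rates and stopping tolerance are invented for the example. Whether such a toy game converges depends on these choices; the intent is to show the control flow, not a tuned training procedure.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, target=0.9, max_iters=200, batch_size=2,
          lr_p=1.0, lr_j=0.2, tol=1e-3, seed=0):
    """samples: list of (env_before, env_after) scalar pairs (hypothetical)."""
    rng = random.Random(seed)
    # Step 1: randomly initialize the P-Module and J-Module parameters.
    theta_p = rng.uniform(-0.5, 0.5)
    theta_j = rng.uniform(-0.5, 0.5)
    iteration, mean_p = 0, 0.0
    for iteration in range(1, max_iters + 1):
        batch = rng.sample(samples, batch_size)  # Step 2: select a batch
        grad_p = grad_j = mean_p = 0.0
        for env_before, env_after in batch:
            task_vector = theta_p * env_before              # Step 2: task vector T
            p = sigmoid(task_vector * env_after + theta_j)  # Step 3: probability P
            g = 2.0 * (p - target) * p * (1.0 - p)          # d(squared error)/d(logit)
            grad_p += g * env_before * env_after / batch_size
            grad_j += g / batch_size
            mean_p += p / batch_size
        # Step 4: the P-Module descends on the error, the J-Module ascends.
        theta_p -= lr_p * grad_p
        theta_j += lr_j * grad_j
        if abs(mean_p - target) < tol:  # assumed stopping condition
            break
    return iteration, mean_p, theta_p, theta_j


samples = [(1.0, 1.0), (0.8, 1.0), (1.0, 0.9), (0.9, 0.9)]
iterations, mean_p, theta_p, theta_j = train(samples)
```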

[0096] Thus, the P-Module and J-Module constitute a dynamically optimized generative adversarial network. The optimization objective for training this generative adversarial network can be expressed as $\arg\min_{G}\max_{D} V(D, G)$, where $V(D, G)$ represents the performance of the P-Module (the $G$ function) and the J-Module (the $D$ function) after one round of iteration.
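The application does not give an explicit form for the value function. One reading that is consistent with the update rules of [0088] and [0089], offered here only as an assumption, takes the value function to be the expected squared error between the task completion probability and the target task completion probability. The symbols $\tau$ (first sample task information), $i$ and $i'$ (first and second sample environment information) and $P^{*}$ (target task completion probability) are introduced for this illustration.

```latex
\arg\min_{G}\,\max_{D}\; V(D, G),
\qquad
V(D, G) \;=\; \mathbb{E}_{(\tau,\, i,\, i')}
\Bigl[\bigl(D\bigl(\tau,\; G(\tau, i),\; i'\bigr) - P^{*}\bigr)^{2}\Bigr]
```

Under this reading, $G(\tau, i)$ is the first training action instruction, $D(\tau, G(\tau, i), i')$ is the first task completion probability, the P-Module adjusts $G$ to reduce the error, and the J-Module adjusts $D$ to increase it.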

[0097] Understandably, a generative adversarial network (GAN) avoids the difficulty of hand-designing a loss function by introducing a discriminator. Furthermore, in this process, the P-Module and J-Module continuously improve the efficiency and accuracy of task planning based on each other's feedback. This enriches the research perspective on task planning, improves sample efficiency, enhances generalization ability, and improves the accuracy of task description and of the feedback mechanism.

[0098] This application provides a robot control method applicable to real-world scenarios. As shown in Figure 3, the execution network (equivalent to the first network) generates motion commands based on the general instruction (equivalent to the task information) and the first image (equivalent to the first environment information) and sends them to the robot. The robot executes the motion commands, then acquires a second image (equivalent to the second environment information) and sends it to the supervision network (equivalent to the second network). The supervision network determines whether the general instruction has been completed (equivalent to determining the task completion status) based on the second image and the general instruction. If the instruction is determined to be complete, processing of the instruction ends; if it is incomplete, the second image and failure information are sent to the execution network so that the execution network can generate new motion commands. The execution network and the supervision network can each be implemented using a VLM network.
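The closed loop of Figure 3 can be outlined as follows. This is a minimal sketch under stated assumptions: the networks and the robot are replaced by toy callables, an "image" is reduced to a single integer position, and the retry limit `max_attempts` is added for safety even though the source does not bound the number of retries.

```python
def control_loop(task, first_image, execution_network, robot,
                 supervision_network, max_attempts=5):
    """Plan, execute, supervise; on failure feed the new image and failure info back."""
    image, failure = first_image, None
    for attempt in range(1, max_attempts + 1):
        command = execution_network(task, image, failure)  # motion command
        image = robot(command)                             # execute, capture second image
        done, failure = supervision_network(task, image)   # task completion status
        if done:
            return True, attempt
    return False, max_attempts


# Toy stand-ins (hypothetical): the "image" is just the gripper position x.
def make_toy_robot(start_x):
    state = {"x": start_x}

    def robot(command):
        state["x"] += command
        return state["x"]

    return robot


def toy_execution_network(task, image, failure):
    return 1 if image < task["goal_x"] else -1  # step toward the goal


def toy_supervision_network(task, image):
    done = image == task["goal_x"]
    failure = None if done else f"x={image}, expected {task['goal_x']}"
    return done, failure


result = control_loop({"goal_x": 3}, 0, toy_execution_network,
                      make_toy_robot(0), toy_supervision_network)
```

Here the loop needs three attempts, each moving the toy robot one unit, before the supervisor reports completion.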

[0099] It is understood that the embodiments of this application construct a generative adversarial network consisting of an execution network (P-Module) and a supervision network (J-Module), enabling the instruction planning process of robot tasks to have iterative optimization characteristics and enhancing the robustness of the system. Furthermore, using a visual language model (VLM) as the basic framework for both the executor and supervisor, and leveraging the advantages of deep learning and vision-language processing technologies, not only is the efficiency and accuracy of task planning improved, but the model also possesses superior generalization ability. Thus, the accuracy of robot task instruction planning and execution is enhanced.

[0100] This application provides a robot control device. As shown in Figure 4, the robot control device 1 may include:

[0101] Instruction reasoning module 11 is used to determine action instructions by reasoning based on task information and first environment information through the first network;

[0102] The control module 12 is used to control the robot to execute the action command and to obtain the second environmental information after the action command is executed;

[0103] The determination module 13 is used to determine the task completion status through the second network based on the second environment information and the task information; wherein the first network and the second network are obtained by performing adversarial training on an initial first network used to generate sample training instructions and an initial second network used to determine the task completion status corresponding to the sample training instructions.

[0104] In some embodiments, the robot control device 1 further includes a training module. The training module is configured to: determine an initial first network and an initial second network; through the initial first network and the initial second network respectively, perform inference based on first sample task information and first sample environment information to output a first training action instruction, and perform inference based on the first sample task information, the first training action instruction and second sample environment information to output a first task completion probability, where the second sample environment information represents the environment information obtained when the sample task is completed starting from the first sample environment information; and update the network parameters of the initial first network and the initial second network based on the first task completion probability until a first training objective is achieved, thereby obtaining the first network and the second network.

[0105] In some embodiments, the training module is further configured to, during one round of training, use the initial first network to perform inference based on the first sample task information and the first sample environment information, output the first training action instruction, and use the initial second network to perform inference based on the first sample task information, the first training action instruction, and the second sample environment information, output the task completion probability; during another round of training, use the initial second network to perform inference based on the first sample task information and the first sample environment information, output the first training action instruction, and use the initial first network to perform inference based on the first sample task information, the first training action instruction, and the second sample environment information, output the task completion probability.

[0106] In some embodiments, the training module is further configured to use the original first network to perform inference based on the second sample task information and the third sample environment information, and output a second training action instruction; and update the network parameters of the original first network based on the first sample training action instruction corresponding to the second sample task information and the third sample environment information and the second training action instruction, until the second training objective is achieved, thereby obtaining the initial first network.

[0107] In some embodiments, the training module is further configured to use the original second network to perform inference based on the third sample task information, the fourth sample environment information, and the second sample training action instruction, and output the second task completion probability corresponding to the second sample training action instruction; and update the network parameters of the original second network based on the second task completion probability and the true value of the completion probability corresponding to the second sample training action instruction, until the third training objective is reached, thereby obtaining the initial second network.

[0108] In some embodiments, the training module is further configured to: determine a first parameter adjustment amount corresponding to the initial first network with the objective of minimizing the error between the first task completion probability and the target task completion probability; and update the network parameters of the initial first network according to the first parameter adjustment amount; and determine a second parameter adjustment amount corresponding to the initial second network with the objective of maximizing the error between the first task completion probability and the target task completion probability; and update the network parameters of the initial second network according to the second parameter adjustment amount.

[0109] In some embodiments, the first environmental information includes: a first image acquired from a first perspective; the second environmental information includes: a second image acquired from a second perspective; the second perspective is different from the first perspective; the determining module 13 is further configured to perform template matching on the first environmental information and the second environmental information through the second network to determine the region of interest in the second environmental information; and determine the task completion status based on the region of interest and the task information.
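In [0109] the template matching is performed by the second network itself. Purely to illustrate what locating a region of interest by template matching means, the sketch below swaps in classical sum-of-squared-differences matching on tiny grayscale grids. It handles only a translated patch, not a genuine change of viewpoint, and the function names and image values are hypothetical.

```python
def crop(image, top, left, height, width):
    """Return the height x width sub-grid whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def match_template(image, template):
    """Return ((row, col), score) of the best match by sum of squared differences."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best_pos, best_score = None, float("inf")
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            score = sum((image[r + dr][c + dc] - template[dr][dc]) ** 2
                        for dr in range(th) for dc in range(tw))
            if score < best_score:
                best_pos, best_score = (r, c), score
    return best_pos, best_score


# First image: the object of interest occupies rows 1-2, columns 1-2.
first_image = [
    [0, 0, 0, 0, 0],
    [0, 9, 8, 0, 0],
    [0, 7, 6, 0, 0],
    [0, 0, 0, 0, 0],
]
# Second image: the same object appears shifted, standing in for a new view.
second_image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 9, 8, 0],
    [0, 0, 7, 6, 0],
]
template = crop(first_image, 1, 1, 2, 2)
(roi_top, roi_left), score = match_template(second_image, template)
region_of_interest = crop(second_image, roi_top, roi_left, 2, 2)
```

The recovered region of interest would then be passed, together with the task information, to whatever component decides the task completion status.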

[0110] In some embodiments, the determining module 13 is further configured to, after determining the task completion status, send the task completion status and the second environmental information to the first network if the task completion status indicates that the task is not completed, so that the first network generates a new action instruction based on the task information, the task completion status and the second environmental information in order to complete the task.

[0111] In some embodiments, the first network includes a first visual language model; the second network includes a second visual language model.

[0112] It should be noted that the description of the above device embodiments is similar to the description of the above method embodiments, and has similar beneficial effects. For technical details not disclosed in the device embodiments of this application, please refer to the description of the method embodiments of this application for understanding.

[0113] This application provides an electronic device. As shown in Figure 5, the electronic device 3 may include a memory 32 and a processor 33. The memory 32 and the processor 33 are connected via a communication bus 34; the memory 32 stores executable instructions; the processor 33 executes the executable instructions stored in the memory 32 to implement the robot control method provided in this application embodiment.

[0114] This application provides a computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to execute any of the robot control methods provided in this application.

[0115] In some embodiments, the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or it may be a variety of devices including one or any combination of the above-mentioned memories.

[0116] In some embodiments, executable instructions may take the form of a program, software, software module, script, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

[0117] As an example, executable instructions may, but do not necessarily, correspond to files in a file system. They may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple co-located files (e.g., files that store one or more modules, subroutines, or code sections).

[0118] As an example, executable instructions can be deployed to execute on a single computing device, or on multiple computing devices located in one location, or on multiple computing devices distributed across multiple locations and interconnected via a communication network.

[0119] The above description is merely an embodiment of this application and is not intended to limit the scope of protection of this application. Any modifications, equivalent substitutions, and improvements made within the spirit and scope of this application are included within the scope of protection of this application.

Claims

1. A robot control method, characterized in that the method comprises: performing reasoning, through a first network, based on task information and first environment information to determine an action instruction; controlling the robot to execute the action instruction, and obtaining second environment information after the action instruction is executed; and determining, through a second network, a task completion status based on the second environment information and the task information; wherein the first network and the second network are obtained by adversarial training of an initial first network used to generate training action instructions and an initial second network used to determine the task completion status corresponding to the training action instructions.

2. The method according to claim 1, characterized in that the method further comprises: determining the initial first network and the initial second network; performing reasoning, through the initial first network and the initial second network respectively, based on first sample task information and first sample environment information to output a first training action instruction, and based on the first sample task information, the first training action instruction and second sample environment information to output a first task completion probability, wherein the second sample environment information represents the environment information obtained when the sample task is completed based on the first sample environment information; and updating the network parameters of the initial first network and the initial second network based on the first task completion probability until a first training objective is achieved, thereby obtaining the first network and the second network.

3. The method according to claim 2, characterized in that the performing reasoning, through the initial first network and the initial second network respectively, to output the first training action instruction and the first task completion probability comprises: in one training round, using the initial first network to perform reasoning based on the first sample task information and the first sample environment information to output the first training action instruction, and using the initial second network to perform reasoning based on the first sample task information, the first training action instruction and the second sample environment information to output the task completion probability; and in another training round, using the initial second network to perform reasoning based on the first sample task information and the first sample environment information to output the first training action instruction, and using the initial first network to perform reasoning based on the first sample task information, the first training action instruction and the second sample environment information to output the task completion probability.

4. The method according to claim 2 or 3, characterized in that determining the initial first network comprises: performing reasoning, using an original first network, based on second sample task information and third sample environment information to output a second training action instruction; and updating the network parameters of the original first network based on the second training action instruction and on the first sample training action instruction corresponding to the second sample task information and the third sample environment information, until a second training objective is achieved, thereby obtaining the initial first network.

5. The method according to claim 2 or 3, characterized in that determining the initial second network comprises: performing reasoning, using an original second network, based on third sample task information, fourth sample environment information and a second sample training action instruction to output a second task completion probability corresponding to the second sample training action instruction; and updating the network parameters of the original second network based on the second task completion probability and the true value of the completion probability corresponding to the second sample training action instruction, until a third training objective is achieved, thereby obtaining the initial second network.

6. The method according to claim 2 or 3, characterized in that updating the network parameters of the initial first network and the initial second network based on the first task completion probability comprises: determining, with the goal of minimizing the error between the first task completion probability and a target task completion probability, a first parameter adjustment amount corresponding to the initial first network, and updating the network parameters of the initial first network according to the first parameter adjustment amount; and determining, with the goal of maximizing the error between the first task completion probability and the target task completion probability, a second parameter adjustment amount corresponding to the initial second network, and updating the network parameters of the initial second network according to the second parameter adjustment amount.

7. The method according to any one of claims 1-3, characterized in that the first environmental information includes a first image captured from a first perspective; the second environmental information includes a second image captured from a second perspective; the second perspective is different from the first perspective; and determining the task completion status through the second network based on the second environmental information and the task information comprises: performing, using the second network, template matching between the first environmental information and the second environmental information to determine a region of interest in the second environmental information; and determining the task completion status based on the region of interest and the task information.

8. The method according to any one of claims 1-3, characterized in that, after determining the task completion status, the method further comprises: if the task completion status indicates that the task is not completed, sending the task completion status and the second environmental information to the first network, so that the first network generates a new action instruction based on the task information, the task completion status and the second environmental information to complete the task.

9. The method according to any one of claims 1-3, characterized in that the first network includes a first visual language model, and the second network includes a second visual language model.

10. A robot control device, characterized in that the device comprises: an instruction reasoning module, used to perform reasoning, through a first network, based on task information and first environment information to determine an action instruction; a control module, used to control the robot to execute the action instruction and to acquire second environmental information after the action instruction is executed; and a determination module, used to determine a task completion status through a second network based on the second environment information and the task information; wherein the first network and the second network are obtained by adversarial training of an initial first network used to generate sample training instructions and an initial second network used to determine the task completion status corresponding to the sample training instructions.

11. An electronic device, characterized in that the electronic device comprises: a memory, used to store computer-executable instructions; and a processor, which implements the method according to any one of claims 1 to 9 when executing the computer-executable instructions stored in the memory.

12. A computer-readable storage medium storing computer-executable instructions, characterized in that, when the computer-executable instructions or computer program are executed by a processor, they implement the method according to any one of claims 1 to 9.

13. A computer program product comprising computer-executable instructions, characterized in that, when the computer-executable instructions or computer program are executed by a processor, they implement the method according to any one of claims 1 to 9.