A robot control method, a robot, a storage medium, and a program product
By combining a visual-language multimodal large model (VLM) with a dynamic discriminator, the robot can understand natural language commands and environmental images, generate suitable displacement sequences, and adjust its actions in real time. This addresses the stability problem that arises when the robot performs diverse and complex tasks and helps the tasks complete smoothly.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANGHAI TASHI ZHIHANG TECHNOLOGY CO LTD
- Filing Date
- 2026-05-15
- Publication Date
- 2026-06-12
AI Technical Summary
Existing robots struggle to perform diverse and complex tasks and suffer from poor dynamic stability, making them prone to falling or getting damaged, which can prevent them from completing tasks successfully.
The system employs a visual language multimodal large model (VLM) to jointly understand user natural language commands and environmental images, generating robot displacement sequences adapted to the task. A dynamic discriminator is also introduced to adjust the robot's actions in real time to maintain balance and avoid damage.
This enables robots to perform diverse and complex tasks, with smoother and more stable movements, preventing them from losing balance and falling, and improving the accuracy of task completion.
Figure CN122185252A_ABST
Description
Technical Field
[0001] This application relates to the field of robotics technology, and in particular to a robot control method, a robot, a storage medium, and a program product.
Background Technology
[0002] With the rapid iteration of robotics technology, humanoid robots have been gradually applied to various fields such as industrial production, home services, and scientific research, undertaking various complex tasks such as pushing boxes, kicking balls, precision operations, and assisting in nursing care.
[0003] Currently, robot full-body control frameworks (such as VisualMimic) typically deploy AI models (e.g., high-level policy networks) for task execution. These models are trained individually for specific tasks (e.g., pushing boxes, kicking a ball). When the robot needs to perform other tasks, the model must be retrained. This involves redesigning the reward function and performing large-scale reinforcement learning to train a new model capable of performing the new task, which is then deployed in the robot. This results in the robot being limited to performing a single task, making it difficult to meet the demands of diverse tasks. Furthermore, the robot suffers from poor dynamic stability during instruction execution; for example, it is prone to tipping over or sustaining damage, preventing it from successfully completing tasks.
[0004] Therefore, how to enable robots to accurately perform complex and diverse tasks is a technical problem that urgently needs to be solved.
Summary of the Invention
[0005] To enable robots to accurately perform complex and diverse tasks, embodiments of this application provide a robot control method, a robot, a storage medium, and a program product.
[0006] In a first aspect, embodiments of this application provide a robot control method for a robot, the method comprising: detecting a first instruction and acquiring a first environmental image; processing the first instruction and the first environmental image based on a visual-language multimodal large model to obtain first displacement sequence information of the robot executing the first instruction; processing the first displacement sequence information and the robot's current state information based on a dynamic discriminator to obtain a first probability that the robot will maintain balance and not suffer damage within a first preset time period when performing an action based on the first displacement sequence information; generating second displacement sequence information based on the first probability and the first displacement sequence information when the first probability is less than a preset threshold; and controlling the robot's movement based on the second displacement sequence information.
[0007] It is understandable that, since VLM has the ability to semantically associate and generalize across tasks, the above scheme can use VLM to jointly understand user natural language commands and environmental images to generate robot displacement sequences that are adapted to the task, enabling the robot to perform diverse and complex tasks.
[0008] Building on this, a dynamic constraint learning mechanism is introduced. A dynamic discriminator evaluates the probability that the robot will maintain balance and avoid damage while executing an instruction based on the displacement sequence, so that the robot's actions can be adjusted in real time. For example, if the robot leans forward significantly while executing an instruction, the probability of maintaining balance and avoiding damage can be determined to be low, and the robot's actions can be adjusted to limit further forward leaning, which helps prevent the robot from losing balance and falling. Compared with implementations that use fixed Gaussian distribution statistics to hard-clip the action space, this scheme uses dynamic boundary constraints that allow the robot to explore a larger action space while remaining within a safe range. As a result, extreme actions such as pushing heavy objects and turning rapidly are executed more smoothly and stably, which helps the task be completed successfully and accurately.
[0009] In one possible implementation of the first aspect, the first displacement sequence information includes displacement sequence information of key points of the robot within a future second preset time period, and the key points of the robot include head key points, hand key points, foot key points and root key points.
[0010] In one possible implementation of the first aspect, generating second displacement sequence information based on the first probability and the first displacement sequence information includes: generating second displacement sequence information based on the first probability and the first displacement sequence information by means of gradient descent or Lagrange multiplier method.
[0011] In one possible implementation of the first aspect, the method further includes: acquiring a second environmental image a third preset duration after acquiring the first environmental image; processing the first instruction and the second environmental image based on the visual-language multimodal large model to obtain third displacement sequence information of the robot executing the first instruction; processing the third displacement sequence information and the robot's current state information based on the dynamic discriminator to obtain a second probability that the robot will maintain balance and not suffer damage within the first preset time period when it performs an action based on the third displacement sequence information; generating fourth displacement sequence information based on the second probability and the third displacement sequence information if the second probability is less than the preset threshold; and controlling the robot's movement based on the fourth displacement sequence information.
[0012] In one possible implementation of the first aspect, the robot's current state information includes at least one of the following: joint angular velocity, center of mass position, foot pressure, and inertia tensor; the dynamic discriminator is trained based on multiple task instructions and the state information of the robot during the execution of multiple task instructions.
[0013] In one possible implementation of the first aspect, controlling the robot's motion based on the second displacement sequence information includes:
[0014] The second displacement sequence information is processed by a key point tracker to obtain the robot's behavior control information, and the robot's movement is controlled based on the behavior control information.
[0015] In one possible implementation of the first aspect, the robot's motion is controlled based on the first displacement sequence information when the first probability is greater than or equal to a preset threshold.
[0016] In a second aspect, embodiments of this application provide a robot, including: one or more processors; one or more memories; the one or more memories storing one or more programs, which, when executed by the processors, cause the robot to perform the method of any one of the first aspects.
[0017] In a third aspect, embodiments of this application provide a computer-readable storage medium storing a program or instructions that, when executed by an electronic device, implement the method of any one of the first aspects.
[0018] In a fourth aspect, embodiments of this application provide a computer program product including instructions that, when executed, cause the method of any one of the first aspects to be implemented.
Attached Figure Description
[0019] Figure 1 is a flowchart of a robot control method according to an embodiment of this application.
[0020] Figure 2 is a schematic diagram of the structure of a robot according to an embodiment of this application.
[0021] Figure 3 is a schematic diagram, according to an embodiment of this application, comparing a static clipping control method with the dynamic human motion space (HMS) learning mechanism based on a dynamic discriminator described in the embodiments of this application.
[0022] Figure 4 is a schematic diagram of the structure of a robot according to an embodiment of this application.
Detailed Implementation
[0023] The illustrative embodiments of this application include, but are not limited to, a robot control method, a robot, a storage medium, and a program product.
[0024] The robot mentioned in the embodiments of this application can be any intelligent robot such as a humanoid robot, bipedal robot, multi-legged robot, wheel-leg hybrid robot, service robot, or industrial robot. This application does not make any specific limitation in this regard.
[0025] To enable robots to accurately perform complex and diverse tasks, this application provides a method comprising: detecting a first instruction and acquiring a first environmental image; processing the first instruction and the first environmental image based on a Vision-Language Multimodal Large Model (VLM) to obtain first displacement sequence information of the robot executing the first instruction; processing the first displacement sequence information based on a dynamic discriminator to obtain a first probability that the robot will maintain balance and avoid damage within a first preset time period when performing actions based on the first displacement sequence information; if the first probability is less than a preset threshold, generating second displacement sequence information based on the first probability and the first displacement sequence information; and controlling the robot's movement based on the second displacement sequence information.
[0026] It is understandable that, since VLM has the ability to semantically associate and generalize across tasks, the above scheme can use VLM to jointly understand user natural language commands and environmental images to generate robot displacement sequences that are adapted to the task, enabling the robot to perform diverse and complex tasks.
[0027] Building on this, a dynamic constraint learning mechanism is introduced. A dynamic discriminator evaluates the probability that the robot will maintain balance and avoid damage while executing an instruction based on the displacement sequence, so that the robot's actions can be adjusted in real time. For example, if the robot leans forward significantly while executing an instruction, the probability of maintaining balance and avoiding damage can be determined to be low, and the robot's actions can be adjusted to limit further forward leaning, which helps prevent the robot from losing balance and falling. Compared with implementations that use fixed Gaussian distribution statistics to hard-clip the action space, this scheme uses dynamic boundary constraints that allow the robot to explore a larger action space while remaining within a safe range. As a result, extreme actions such as pushing heavy objects and turning rapidly are executed more smoothly and stably, which helps the task be completed successfully and accurately.
[0028] The robot control method provided in the embodiments of this application will be described in detail below.
[0029] Figure 1 shows a flowchart of a robot control method, which can be executed by a robot. For example, as shown in Figure 1, the method includes:
[0030] 101: Detect a first instruction and acquire a first environmental image.
[0031] In some embodiments, the first instruction can be a natural language instruction, such as: "Push the chair on the left under the table" or "Pick up the package on the floor for me".
[0032] It can be understood that after detecting the first instruction, the robot can acquire environmental images at a preset frequency. The first environmental image can be any frame acquired by the robot at the preset frequency during the execution of the first instruction. For example, the first environmental image can be the first frame acquired by the robot after detecting the first instruction, or it can be the second frame, the third frame, and so on.
[0033] In some embodiments, the first environmental image may be a visual depth image, an RGB image, or the like.
[0034] In some embodiments, the first environmental image may be a single frame or multiple frames; the embodiments of this application do not limit this.
[0035] 102: Based on VLM, the first instruction and the first environment image are processed to obtain the first displacement sequence information of the robot executing the first instruction.
[0036] In some embodiments, VLM can employ a Transformer architecture.
[0037] In some embodiments, processing the first instruction and the first environment image based on the VLM to obtain the first displacement sequence information of the robot executing the first instruction may include: extracting spatial features of the first environment image through a visual encoder (such as ViT), and simultaneously extracting semantic features of the first instruction (such as "push the chair on the left under the table") through a text encoder. The spatial features and semantic features are input into the VLM model, and the VLM model outputs the first displacement sequence information of the first instruction.
[0038] In some embodiments, the first displacement sequence information includes the displacement sequence information of the robot's key points within a future second preset duration (e.g., 2 seconds, or 100 frames). The displacement sequence information of a key point can be the coordinates of that key point's positions over a continuous time series.
[0039] In some embodiments, the robot's key points may include head key points, hand key points, foot key points, and root key points. Specifically, the hand key points may include key points for the left and right hands, and the foot key points may include key points for the left and right feet; that is, the robot's key points may include 6 sets of key points.
[0040] In some embodiments, the number and type of the robot's key points can be adjusted according to the actual form of the robot. For example, when the robot switches to a four-armed form, the hand key points are configured as dedicated key points for each of the four hands, and so on; the embodiments of this application do not limit this.
[0041] In some embodiments, the specific value of the second preset duration can be determined according to actual needs, and this application embodiment does not limit it.
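As an illustration of the data layout described in [0038] to [0041], the displacement sequence information can be pictured as an array indexed by frame, key point, and coordinate. The following is a minimal sketch only: the 3-D coordinates, the 50 Hz frame rate (under which 2 seconds corresponds to 100 frames), the key point names, and the helper functions are assumptions made for illustration and are not specified by this application.

```python
import numpy as np

# Hypothetical key point names for a humanoid form (6 sets of key points).
KEYPOINTS = ("head", "left_hand", "right_hand", "left_foot", "right_foot", "root")

FRAME_RATE_HZ = 50   # assumed frame rate
HORIZON_S = 2.0      # example value of the second preset duration
NUM_FRAMES = int(FRAME_RATE_HZ * HORIZON_S)  # 100 frames


def empty_displacement_sequence(keypoints=KEYPOINTS, num_frames=NUM_FRAMES):
    """Return an array of shape (num_frames, num_keypoints, 3) of xyz positions."""
    return np.zeros((num_frames, len(keypoints), 3), dtype=np.float64)


def keypoint_track(sequence, name, keypoints=KEYPOINTS):
    """Return the (num_frames, 3) position track of one named key point."""
    return sequence[:, keypoints.index(name), :]


seq = empty_displacement_sequence()
# Example: the right hand moves 0.3 m forward along x over the horizon.
seq[:, KEYPOINTS.index("right_hand"), 0] = np.linspace(0.0, 0.3, NUM_FRAMES)

# A four-armed form only changes the key point list, not the layout.
four_arm = ("head", "hand_1", "hand_2", "hand_3", "hand_4",
            "left_foot", "right_foot", "root")
seq_four_arm = empty_displacement_sequence(keypoints=four_arm)
```

Keeping the key point list as a parameter is one way to accommodate the adaptive adjustment of key points mentioned in [0040].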
[0042] 103: Based on the dynamic discriminator, the first displacement sequence information and the current state information of the robot are processed to obtain the first probability that the robot will maintain balance and not be damaged within a first preset time period when it executes the first instruction based on the first displacement sequence information.
[0043] In some embodiments, the first displacement sequence information and the robot's current state information can be input into the dynamic discriminator. The dynamic discriminator can output the first probability that the robot will maintain balance and not suffer damage (e.g., joint damage) within a first preset time period after performing an action based on the first displacement sequence information.
[0044] In some embodiments, the robot’s current state information includes at least one of the following: joint angular velocity, center of mass position, foot pressure, and inertia tensor.
[0045] In some embodiments, the first probability is a value between 0 and 1. The higher the value, the better the action conforms to the human motion space (HMS) and the less likely the robot is to fall, that is, the easier it is to maintain balance and avoid damage.
[0046] In some embodiments, the first preset duration can be determined according to actual needs, for example, it can be the duration of the next 10 frames.
[0047] In some embodiments, the dynamic discriminator is trained based on multiple task instructions and state information of the robot during the execution of the multiple task instructions. The state information includes at least one of the following: joint angular velocity, center of mass position, foot pressure, and inertia tensor. The dynamic discriminator is a neural network-based discriminator used to predict the probability that, given the current state and an action to be performed, the robot will maintain balance and avoid damage from reverse joint bending within the next k frames (e.g., 10 frames).
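The input and output of the dynamic discriminator described in [0043] to [0047] can be sketched as follows. This is a minimal sketch, assuming a small fully connected network with a sigmoid output; the class name, layer sizes, the 23-joint state layout, and the random (untrained) weights are all illustrative assumptions, and the network structure used in practice is not specified by this application.

```python
import numpy as np


class DynamicDiscriminator:
    """Toy MLP mapping (action, state) to a probability in (0, 1).

    The weights are random placeholders, not trained parameters.
    """

    def __init__(self, action_dim, state_dim, hidden_dim=64, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = action_dim + state_dim
        self.w1 = rng.normal(0.0, 0.1, size=(in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.1, size=hidden_dim)
        self.b2 = 0.0

    def __call__(self, action, state):
        # Flatten and concatenate the displacement sequence and the state vector.
        x = np.concatenate([np.ravel(action), np.ravel(state)])
        h = np.tanh(x @ self.w1 + self.b1)
        logit = h @ self.w2 + self.b2
        return float(1.0 / (1.0 + np.exp(-logit)))  # sigmoid -> probability


# Assumed state layout: 23 joint angular velocities, 3-D center of mass,
# 2 foot pressures, and a flattened 3x3 inertia tensor.
STATE_DIM = 23 + 3 + 2 + 9
ACTION_DIM = 100 * 6 * 3  # 100 frames, 6 key points, xyz

discriminator = DynamicDiscriminator(ACTION_DIM, STATE_DIM)
probability = discriminator(np.zeros((100, 6, 3)), np.zeros(STATE_DIM))
```

One plausible training setup, again an assumption, is binary classification: each recorded (state, action) pair is labeled by whether the robot stayed balanced and undamaged over the following k frames.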
[0048] 104: If the first probability is less than a preset threshold, generate the second displacement sequence information based on the first probability and the first displacement sequence information.
[0049] It is understandable that when the first probability is less than a preset threshold (e.g., 90%), it indicates that the robot is highly likely to fail to maintain balance or suffer damage after executing the action corresponding to the current first displacement sequence information. Therefore, the first displacement sequence information can be optimized to obtain the second displacement sequence information. When the first probability is greater than or equal to the preset threshold, it indicates that the robot can maintain balance and avoid damage after executing the action corresponding to the current first displacement sequence information. In this case, the robot's movement can be controlled based on the first displacement sequence information.
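The branching described in [0049] can be sketched as a small selection function. This is a sketch under stated assumptions: the function name and the `correct_fn` callback are hypothetical, and 0.9 is only the example threshold mentioned above.

```python
def select_displacement_sequence(first_sequence, first_probability,
                                 correct_fn, threshold=0.9):
    """Return the sequence to execute, correcting it only when needed.

    correct_fn is a hypothetical callback that produces the second
    displacement sequence information from the first one.
    """
    if not 0.0 <= first_probability <= 1.0:
        raise ValueError("first_probability must lie in [0, 1]")
    if first_probability >= threshold:
        # Predicted safe enough: execute the first sequence as is.
        return first_sequence
    # Predicted unsafe: generate the second displacement sequence.
    return correct_fn(first_sequence, first_probability)
```

Note that a probability exactly equal to the threshold falls on the "no correction" side, matching the "greater than or equal to" wording above.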
[0050] In some embodiments, the preset threshold can be determined based on actual needs, and this application does not limit it.
[0051] In some embodiments, the second displacement sequence information may include optimized displacement sequence information of the robot's key points.
[0052] In some embodiments, generating second displacement sequence information based on a first probability and first displacement sequence information includes: generating second displacement sequence information based on a first probability and first displacement sequence information using gradient descent or Lagrange multiplier method.
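One way to read the Lagrange multiplier option, offered here as an assumption rather than as the formulation of this application, is as a search for the action closest to the first displacement sequence information whose predicted probability reaches the preset threshold. Writing $a_{\text{raw}}$ for the first displacement sequence information, $D(a \mid s_t)$ for the discriminator output in the current state $s_t$, and $\epsilon$ for the preset threshold:

```latex
\begin{aligned}
\min_{a}\quad & \tfrac{1}{2}\,\lVert a - a_{\text{raw}} \rVert^{2}
\qquad \text{s.t.}\quad D(a \mid s_t) \ge \epsilon, \\[4pt]
\mathcal{L}(a, \lambda) &= \tfrac{1}{2}\,\lVert a - a_{\text{raw}} \rVert^{2}
  - \lambda \bigl( D(a \mid s_t) - \epsilon \bigr), \qquad \lambda \ge 0, \\[4pt]
\nabla_a \mathcal{L} = 0 \;&\Longrightarrow\;
a = a_{\text{raw}} + \lambda \, \nabla_a D(a \mid s_t).
\end{aligned}
```

Under this reading, the stationarity condition has the same form as a gradient-based correction, with the multiplier $\lambda$ playing the role of a step size.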
[0053] For example, taking gradient descent as an example, the method for generating the second displacement sequence information using gradient descent can be referenced by the following formula:
[0054] $a_{\text{final}} = a_{\text{raw}} + \eta \nabla_a D(a \mid s_t)$
[0055] Here, $a_{\text{final}}$ is the final execution instruction, i.e., the second displacement sequence information.
[0056] $a_{\text{raw}}$ is the original instruction, i.e., the first displacement sequence information.
[0057] $D(a \mid s_t)$ is the dynamic discriminator, a trained function used to evaluate the "safety probability" of performing action $a$ (corresponding to the first displacement sequence information) in the robot's current state $s_t$, i.e., to output the first probability.
[0058] $\nabla_a$ is the gradient operator with respect to the action, used to find the direction in which the function $D(a \mid s_t)$ grows fastest. In other words, it represents the direction in which the key point coordinates should be fine-tuned to make the action safer.
[0059] $\nabla_a D(a \mid s_t)$ is the safety gradient, a "vector guide" in the action space. When $a_{\text{raw}}$ would make the robot unstable, it guides the action to be "pulled back" into a safe region.
[0060] $\eta$ (eta) is the correction step size / learning rate, used to control the strength of the correction. If $\eta$ is large, the corrected action is strictly limited to a very safe range. If $\eta$ is small, the corrected action tends to retain the original intent of $a_{\text{raw}}$, with only slight modification.
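The correction formula in [0054] can be illustrated with a single update step. Because the gradient term is added, the step increases the discriminator output (gradient ascent on D, or equivalently gradient descent on its negative). The sketch below is illustrative only: `toy_discriminator` is a hypothetical stand-in for the trained discriminator, the two-dimensional action and the `forward_lean` state field are assumptions, and the finite-difference gradient is used purely to keep the example self-contained (a trained network would normally supply gradients through automatic differentiation).

```python
import numpy as np


def numerical_gradient(d_fn, action, state, eps=1e-5):
    """Central-difference estimate of the gradient of d_fn w.r.t. the action."""
    action = np.asarray(action, dtype=np.float64)
    grad = np.zeros_like(action)
    for i in range(action.size):
        step = np.zeros_like(action)
        step.flat[i] = eps
        grad.flat[i] = (d_fn(action + step, state)
                        - d_fn(action - step, state)) / (2.0 * eps)
    return grad


def correct_action(d_fn, a_raw, state, eta):
    """One step of a_final = a_raw + eta * grad_a D(a | s_t), evaluated at a_raw."""
    a_raw = np.asarray(a_raw, dtype=np.float64)
    return a_raw + eta * numerical_gradient(d_fn, a_raw, state)


def toy_discriminator(action, state):
    """Hypothetical D(a | s_t): the probability drops as the action leaves a
    ball whose radius shrinks with the forward lean stored in the state."""
    radius = 1.0 - state["forward_lean"]
    return 1.0 / (1.0 + np.exp(4.0 * (np.linalg.norm(action) - radius)))


state = {"forward_lean": 0.3}
a_raw = np.array([1.2, 0.0])
a_final = correct_action(toy_discriminator, a_raw, state, eta=1.0)
```

In this toy setting the corrected action moves back toward the safe region along the x axis, and its predicted probability is higher than that of the original action.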
[0061] It can be understood that the trained dynamic discriminator defines a dynamic feasible region for the current state, that is, the set of all actions $a$ satisfying $D(a \mid s_t) \ge \epsilon$ (where $\epsilon$ is the preset threshold) in the current state $s_t$. When the first probability output by the dynamic discriminator is greater than or equal to the preset threshold, the action falls within the dynamic feasible region and no action correction is needed. When the first probability output by the dynamic discriminator is less than the preset threshold, the action falls outside the dynamic feasible region. In this case, action correction is required, for example, by projecting the first displacement sequence information onto the nearest boundary point of the dynamic feasible region (a human-like and safe boundary point) to generate the second displacement sequence information.
[0062] It is understandable that when the robot is in a stable posture, the dynamic feasible region (or the allowable HMS range) is relatively large. When the robot is in an unstable posture such as single-leg support, leaning forward, or high-speed movement, the dynamic feasible region will shrink.
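The dynamic feasible region of [0061] and its shrinking under unstable postures in [0062] can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not the method of this application: the discriminator is a hand-written logistic function of the distance from a safe center, the scalar `instability` summarizes the robot state, and the projection is approximated by repeating small gradient steps until the probability first reaches the threshold.

```python
import numpy as np

K = 8.0  # assumed sharpness of the toy discriminator


def safe_radius(instability):
    """Hypothetical: the feasible radius shrinks as the posture gets less stable."""
    return 1.0 - 0.6 * instability


def discriminator(action, instability):
    """Toy D(a | s_t): high inside the safe ball, low outside it."""
    margin = safe_radius(instability) - np.linalg.norm(action)
    return 1.0 / (1.0 + np.exp(-K * margin))


def discriminator_grad(action, instability):
    """Analytic gradient of the toy discriminator with respect to the action."""
    norm = np.linalg.norm(action)
    if norm == 0.0:
        return np.zeros_like(action)
    d = discriminator(action, instability)
    return -K * d * (1.0 - d) * action / norm


def project_to_feasible(a_raw, instability, threshold=0.9, eta=0.05,
                        max_iters=5000):
    """Step along the safety gradient until D(a | s_t) >= threshold."""
    a = np.asarray(a_raw, dtype=np.float64).copy()
    for _ in range(max_iters):
        if discriminator(a, instability) >= threshold:
            return a
        a = a + eta * discriminator_grad(a, instability)
    raise RuntimeError("no feasible action found within max_iters")


a_stable = project_to_feasible([1.5, 0.0], instability=0.0)
a_unstable = project_to_feasible([1.5, 0.0], instability=0.5)
```

With the same raw action, the corrected action for the less stable posture ends up closer to the safe center, which mirrors the shrinking feasible region. Because the toy gradient becomes very small far outside the region, a practical implementation would likely need a more robust step rule.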
[0063] 105: Controlling robot motion based on second displacement sequence information.
[0064] In some embodiments, controlling robot motion based on second displacement sequence information includes: processing the second displacement sequence information through a key point tracker to obtain robot behavior control information, and controlling robot motion based on the behavior control information.
[0065] In some embodiments, the behavior control information can be the displacement sequence information of each joint of the robot. After obtaining the displacement sequence information of each joint, the motors can be controlled based on the displacement sequence information of each joint to realize the movement of the robot.
[0066] In some embodiments, when the first probability is greater than or equal to a preset threshold, the method of controlling the robot's movement based on the first displacement sequence information can refer to the method of controlling the robot's movement based on the second displacement sequence information, which will not be described here.
[0067] In some embodiments, the keypoint tracker can reuse the low-level keypoint tracker of the VisualMimic algorithm, and the keypoint tracker can be trained by injecting fluctuation noise generated by dynamic boundaries to enhance its adaptability to high-level instruction adjustments.
[0068] In some embodiments, a second environmental image can be acquired a third preset duration after the first environmental image is acquired. The third preset duration can be determined by the preset frequency at which environmental images are acquired; that is, the next frame can be acquired according to the preset frequency and the subsequent displacement sequence determined, until the first instruction is completed.
[0069] For example, after acquiring the second environmental image, the first command and the second environmental image are processed based on a visual-language multimodal large model to obtain the third displacement sequence information for the robot to execute the first command. The third displacement sequence information and the robot's current state information are then processed based on a dynamic discriminator to obtain a second probability that the robot will maintain balance and avoid damage within a first preset time period when performing actions based on the third displacement sequence information. If the second probability is less than a preset threshold, a fourth displacement sequence information is generated based on the second probability and the third displacement sequence information. The robot's movement is then controlled based on the fourth displacement sequence information.
[0070] It is understood that in the embodiments of this application, during the execution of the first instruction, environmental images can be acquired in real time, and the above steps 102-105 can be executed to update the displacement sequence information in real time during the execution of the first instruction, thereby improving the accuracy of action execution.
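The repeated execution of steps 102 to 105 for each newly acquired image can be sketched as a loop. This is a minimal sketch: every callable passed in (`vlm`, `discriminator`, `correct`, `track_and_actuate`, `read_state`) is a hypothetical placeholder for the corresponding component, and the stubs in the demo exist only to make the example runnable.

```python
import numpy as np


def control_loop(instruction, images, read_state, vlm, discriminator,
                 correct, track_and_actuate, threshold=0.9):
    """Run steps 102-105 once per acquired environmental image."""
    executed = []
    for image in images:
        sequence = vlm(instruction, image)             # step 102
        state = read_state()
        probability = discriminator(sequence, state)   # step 103
        if probability < threshold:                    # step 104
            sequence = correct(sequence, probability, state)
        track_and_actuate(sequence)                    # step 105
        executed.append(sequence)
    return executed


# Demo with stub components; an "image" is just a scalar here.
def _vlm(instruction, image):
    return np.full(3, float(image))


def _discriminator(sequence, state):
    return 0.95 if sequence.sum() <= 3.0 else 0.5


def _correct(sequence, probability, state):
    return 0.5 * sequence


sent = []
executed = control_loop("push the box", [0.5, 2.0], lambda: {},
                        _vlm, _discriminator, _correct, sent.append)
```

In the demo, the first sequence is executed unchanged and the second is corrected before execution, because only the second falls below the threshold.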
[0071] In summary, the above scheme uses VLM to jointly understand user natural language commands and environmental images to generate robot displacement sequences adapted to the task, enabling the robot to perform diverse and complex tasks.
[0072] Building on this, a dynamic constraint learning mechanism is introduced. A dynamic discriminator evaluates the probability that the robot will maintain balance and avoid damage while executing an instruction based on the displacement sequence, so that the robot's actions can be adjusted in real time. For example, if the robot leans forward significantly while executing an instruction, the probability of maintaining balance and avoiding damage can be determined to be low, and the robot's actions can be adjusted to limit further forward leaning, which helps prevent the robot from losing balance and falling. Compared with implementations that use fixed Gaussian distribution statistics to hard-clip the action space, this scheme uses dynamic boundary constraints that allow the robot to explore a larger action space while remaining within a safe range. As a result, extreme actions such as pushing heavy objects and turning rapidly are executed more smoothly and stably, which helps the task be completed successfully and accurately.
[0073] Furthermore, with the help of VLM, the robot can understand unfamiliar text instructions and translate them into reasonable actions, solving the engineering problem of having to retrain the model for each task. Additionally, the introduction of a dynamic discriminator, based on a dynamic HMS learning mechanism, can automatically adjust constraints according to the robot's current real-time posture, which is more accurate than static statistical clipping. For example, when the body leans forward significantly, dynamic constraints will automatically limit further forward leaning, effectively preventing falls. Moreover, compared to static clipping, dynamic boundaries allow the robot to explore a larger motion space while remaining safe, making extreme movements such as pushing heavy objects or rapid turns smoother.
[0074] The control method in the embodiments of this application will be further described below in conjunction with the structure of the robot.
[0075] Figure 2 shows a schematic diagram of the structure of a robot. For example, as shown in Figure 2, robot 100 may include a multimodal input layer, a general and dynamic control layer, and a low-level execution layer.
[0076] The multimodal input layer can be used to detect a first instruction and acquire a first environmental image. For example, the first instruction can be the natural language instruction "push the box", and the first environmental image can be an RGB image captured by an RGB camera or a visual depth image captured by a depth camera.
[0077] The general and dynamic control layer can include a general keypoint generator and a dynamic discriminator (or dynamic HMS boundary learner).
[0078] A general key point generator can be a VLM, used to process the first instruction and the first environment image to obtain the first displacement sequence information of the robot executing the first instruction.
[0079] The dynamic HMS boundary learner is used to process the first displacement sequence information and the robot's current state information to obtain the first probability that the robot will maintain balance and not suffer damage within a first preset time period when it performs an action based on the first displacement sequence information.
[0080] The adaptive clipping and projection module is used to generate second displacement sequence information based on the first probability and the first displacement sequence information when the first probability is less than a preset threshold.
[0081] The low-level execution layer is used to control the robot's motion based on the second displacement sequence information. For example, the low-level execution layer includes a keypoint tracker (or low-level keypoint tracker). The keypoint tracker can be an MLP network, used to obtain the displacement sequence information of each joint of the robot based on the second displacement sequence information. After the displacement sequence information of each joint is obtained, the motors can be controlled based on it to realize the robot's motion. The low-level execution layer can also be used to send the robot's current state information to the dynamic HMS boundary learner.
[0082] In some embodiments, the robot can run the general keypoint generator and the dynamic HMS boundary learner at a frequency of 50 Hz and the low-level execution layer at a frequency of 1000 Hz to achieve dexterous and coherent full-body operation.
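The two frequencies mentioned in [0082] imply that the low-level layer runs 20 times for each high-level update. The sketch below simulates this ratio in a single loop; it is an illustration under assumptions, with `plan_fn` and `track_fn` as hypothetical placeholders for the high-level and low-level layers, and a real system would more likely use separate threads or processes with real-time scheduling.

```python
HIGH_LEVEL_HZ = 50     # keypoint generator + dynamic HMS boundary learner
LOW_LEVEL_HZ = 1000    # low-level execution layer
TICKS_PER_UPDATE = LOW_LEVEL_HZ // HIGH_LEVEL_HZ  # 20 low-level ticks per update


def run_ticks(num_ticks, plan_fn, track_fn):
    """Simulate num_ticks low-level ticks, replanning every TICKS_PER_UPDATE."""
    target = None
    plan_calls = 0
    track_calls = 0
    for tick in range(num_ticks):
        if tick % TICKS_PER_UPDATE == 0:
            target = plan_fn(tick)   # refresh the key point targets at 50 Hz
            plan_calls += 1
        track_fn(target)             # track the latest targets at 1000 Hz
        track_calls += 1
    return plan_calls, track_calls


# One simulated second: 50 high-level updates and 1000 low-level ticks.
counts = run_ticks(LOW_LEVEL_HZ, plan_fn=lambda tick: tick,
                   track_fn=lambda target: None)
```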
[0083] Figure 3 The diagram illustrates a comparison between static clipping and the dynamic HMS learning mechanism based on a dynamic discriminator mentioned in the embodiments of this application. Static clipping can involve hard clipping of the action space using fixed Gaussian distribution statistics.
[0084] Figure 3 Figure (a) illustrates a static HMS truncation technique. This approach uses a fixed static HMS boundary to rigidly constrain the robot's motion space. Simple directional trajectories or optimized target trajectories generated by the robot based on task instructions are directly truncated if they conflict with the preset rectangular static boundary, ultimately leading to motion obstruction or even failure. While this method can avoid actions exceeding the safe range, it rigidly restricts the robot's degrees of freedom and cannot adjust the constraint boundary according to real-time posture and task scenario, resulting in stiff robot movements and a high failure rate in complex tasks.
[0085] Figure 3(b) presents the dynamic HMS learning mechanism proposed in this application. By introducing a dynamic discriminator, the system can assess the safety probability of robot actions in real time and generate optimized trajectory boundaries that change with the robot's posture and task progress. The optimized target trajectory generated by the robot based on the VLM is corrected into a smooth actual execution trajectory under the guidance of the dynamic boundary. That is, the trajectory is no longer abruptly truncated by fixed boundaries, but is finely adjusted toward a balanced, low-risk direction under the safety guidance of the dynamic boundary, which both preserves the original task intent and keeps the trajectory within the safe and feasible domain. This dynamic constraint mechanism realizes adaptive adjustment of the safety boundary: when the robot is in a stable posture, it is allowed to explore a larger action space; when the robot is in an unstable state, such as single-leg support or a large forward tilt, the action is strongly guided back toward the safety center to avoid loss of balance or joint damage. For example, suppose the language command detected by the robot is "Please pick up the package on the ground". After the keypoint sequence for bending over and picking up the object is generated based on the aforementioned VLM, the dynamic discrimination mechanism monitors in real time that, while the robot bends according to this keypoint sequence, its center of gravity shifts towards its toes. The mechanism then dynamically reduces the allowable range of forward arm extension, preventing the robot from losing balance due to overextension.
Compared to the static truncation scheme, the dynamic HMS learning mechanism not only ensures the safety of robot movement but also makes the execution of extreme actions such as pushing heavy objects and rapid turns smoother and more stable, improving the accuracy of complex tasks.
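The contrast between Figure 3(a) and Figure 3(b) can be illustrated with a minimal Python sketch. It is not the learned dynamic discriminator described in this application; it only mimics the package-pickup example, in which the allowable forward arm reach shrinks as the center of mass moves toward the toes. The names `Box`, `static_truncate`, `dynamic_limit`, and `dynamic_adjust`, the one-dimensional reach model, and the linear scaling rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Limit on forward hand reach, in metres (hypothetical 1-D model)."""
    lo: float
    hi: float


def static_truncate(reach_targets, box):
    """Static scheme: clamp every target to one fixed boundary."""
    return [min(max(r, box.lo), box.hi) for r in reach_targets]


def dynamic_limit(box, com_offset, foot_half_length):
    """Shrink the upper reach limit as the centre of mass nears the toes.

    com_offset is the forward distance of the centre of mass from the
    middle of the support foot. The linear scaling is an assumption.
    """
    margin = max(0.0, 1.0 - abs(com_offset) / foot_half_length)
    return Box(box.lo, box.lo + (box.hi - box.lo) * margin)


def dynamic_adjust(reach_targets, com_offsets, box, foot_half_length):
    """Dynamic scheme: recompute the boundary at every step of the sequence."""
    adjusted = []
    for reach, com in zip(reach_targets, com_offsets):
        limit = dynamic_limit(box, com, foot_half_length)
        adjusted.append(min(max(reach, limit.lo), limit.hi))
    return adjusted


if __name__ == "__main__":
    box = Box(0.0, 0.6)
    targets = [0.2, 0.5, 0.8]
    # Centre of mass drifts forward as the robot bends over.
    com = [0.0, 0.05, 0.1]
    print(static_truncate(targets, box))
    print(dynamic_adjust(targets, com, box, foot_half_length=0.1))
```

In this sketch the static scheme applies the same limit regardless of posture, whereas the dynamic scheme tightens the limit step by step as the stability margin falls. A learned discriminator would replace the hand-written `dynamic_limit` rule.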
[0086] Figure 4 shows a schematic diagram of the structure of a robot. In the embodiment illustrated in Figure 4, robot 100 may include one or more processors 101, system control logic 102 connected to at least one of the processors 101, system memory 103 connected to system control logic 102, non-volatile memory (NVM) 104 connected to system control logic 102, and network interface 106 connected to system control logic 102.
[0087] In some embodiments, processor 101 may include one or more single-core or multi-core processors. In some embodiments, processor 101 may include any combination of general-purpose processors and special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In embodiments where robot 100 employs an Evolved Node B (eNB) or Radio Access Network (RAN) controller, processor 101 may be configured to perform the corresponding embodiments. For example, processor 101 may be used to execute the robot control method provided in embodiments of this application.
[0088] In some embodiments, system control logic 102 may include any suitable interface controller to provide any suitable interface to at least one of the processors 101 and/or to any suitable device or component that communicates with system control logic 102.
[0089] In some embodiments, system control logic 102 may include one or more memory controllers to provide an interface to system memory 103. System memory 103 may be used to load and store data and/or instructions.
[0090] In some embodiments, the system memory 103 of the robot 100 may include any suitable volatile memory, such as suitable dynamic random access memory (DRAM).
[0091] NVM 104 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions. In some embodiments, NVM 104 may include any suitable non-volatile memory such as flash memory and/or any suitable non-volatile storage device, such as at least one of a hard disk drive (HDD), a compact disc (CD) drive, and a digital versatile disc (DVD) drive.
[0092] NVM 104 may include a portion of the storage resources on the device on which robot 100 is mounted, or it may be accessible by the device without necessarily being part of the device. For example, NVM 104 may be accessed over a network via network interface 106.
[0093] Specifically, system memory 103 and NVM 104 may each include a temporary copy and a permanent copy of instructions 105. Instructions 105 may include instructions that, when executed by at least one of the processors 101, cause robot 100 to implement the robot control method as described in embodiments of this application. In some embodiments, instructions 105, or hardware, firmware, and/or software components thereof, may additionally or alternatively reside in system control logic 102, network interface 106, and/or processor 101.
[0094] Network interface 106 may include a transceiver for providing a radio interface to robot 100, thereby enabling communication with any other suitable devices (such as front-end modules, antennas, etc.) via one or more networks. In some embodiments, network interface 106 may be integrated into other components of robot 100.
[0095] In some embodiments, at least one of the processors 101 may be packaged together with the logic of one or more controllers for system control logic 102 to form a system in a package (SiP). In some embodiments, at least one of the processors 101 may be integrated on the same die with the logic of one or more controllers for system control logic 102 to form a system on a chip (SoC).
[0096] The robot 100 may further include an input/output (I/O) device 107. The I/O device 107 may include a user interface that enables a user to interact with the robot 100, and a peripheral component interface designed so that peripheral components can also interact with the robot 100.
[0097] It is understood that the structure illustrated in Figure 4 does not constitute a specific limitation on robot 100. In other embodiments of this application, robot 100 may include more or fewer components than illustrated, or combine some components, or separate some components, or have different component arrangements. The illustrated components may be implemented by hardware or software, or a combination of software and hardware.
[0098] This application provides a chip, including: one or more processors; one or more memories; the one or more memories storing one or more programs, which, when executed by the processors, cause the chip to execute the robot control method provided in this application.
[0099] This application provides a robot, including: one or more processors; one or more memories; the one or more memories storing one or more programs, which, when executed by the processors, cause the robot to perform the robot control method provided in this application.
[0100] This application provides a computer-readable storage medium storing a program or instructions. When the program or instructions are executed by an electronic device, the robot control method provided in this application is implemented.
[0101] This application provides a computer program product, including instructions, which, when executed, cause the robot control method provided in this application to be implemented.
[0102] This application provides an electronic device, including: one or more processors; one or more memories; the one or more memories storing one or more programs, which, when executed by the processors, cause the electronic device to perform the robot control method provided in this application. In some embodiments, the electronic device may be a robot, or it may be a control device other than a robot.
[0103] Various embodiments of the mechanisms disclosed in this application can be implemented in hardware, software, firmware, or combinations of these implementation methods. Embodiments of this application can be implemented as computer programs or program code executable on a programmable system, the programmable system including at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
[0104] Program code can be applied to input instructions to execute the functions described in this application and generate output information. The output information can be applied to one or more output devices in a known manner. For the purposes of this application, the processing system includes any system having a processor such as, for example, a digital signal processor, a microcontroller, an application-specific integrated circuit, or a microprocessor.
[0105] The program code can be implemented using a high-level procedural language or an object-oriented programming language to communicate with the processing system. Assembly language or machine language can also be used when needed. In fact, the mechanisms described in this application are not limited to any particular programming language. In either case, the language can be a compiled language or an interpreted language.
[0106] In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed via a network or through other computer-readable media. Therefore, machine-readable media may include any mechanism for storing or transmitting information in a machine-readable (e.g., computer-readable) form, including but not limited to floppy disks, optical discs, read-only memory, magneto-optical disks, random access memory, erasable programmable read-only memory, electrically erasable programmable read-only memory, magnetic cards or optical cards, flash memory, or tangible machine-readable storage used to transmit information over the Internet by means of electrical, optical, acoustic, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Therefore, machine-readable media include any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a machine-readable (e.g., computer-readable) form.
[0107] In the accompanying drawings, some structural or methodological features may be shown in a specific arrangement and/or order. However, it should be understood that such a specific arrangement and/or order may not be necessary. Rather, in some embodiments, these features may be arranged in a manner and/or order different from that shown in the illustrative drawings. Furthermore, the inclusion of structural or methodological features in a particular figure does not imply that such features are required in all embodiments, and in some embodiments, these features may be omitted or may be combined with other features.
[0108] It should be noted that all units/modules mentioned in the device embodiments of this application are logical units/modules. Physically, a logical unit/module can be a physical unit/module, a part of a physical unit/module, or a combination of multiple physical units/modules. The physical implementation of these logical units/modules themselves is not the most important factor; the combination of functions implemented by these logical units/modules is the key to solving the technical problems proposed in this application. Furthermore, to highlight the innovative aspects of this application, the above-described device embodiments of this application have not introduced units/modules that are not closely related to solving the technical problems proposed in this application. This does not mean that the above-described device embodiments do not contain other units/modules.
[0109] It should be noted that in the examples and description of this patent, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising a" does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element.
[0110] Although this application has been illustrated and described with reference to certain preferred embodiments thereof, those skilled in the art should understand that various changes in form and detail may be made thereto without departing from the spirit and scope of this application.
Claims
1. A robot control method, characterized in that the method is applied to a robot and includes: upon detecting a first instruction, acquiring a first environmental image; processing the first instruction and the first environmental image based on a visual language multimodal large model to obtain first displacement sequence information for the robot to execute the first instruction; processing, by a dynamic discriminator, the first displacement sequence information and current state information of the robot to obtain a first probability that the robot will maintain balance and not suffer damage within a first preset time period when it performs an action based on the first displacement sequence information; if the first probability is less than a preset threshold, generating second displacement sequence information based on the first probability and the first displacement sequence information; and controlling the robot's movement based on the second displacement sequence information.
2. The method according to claim 1, characterized in that the first displacement sequence information includes displacement sequence information of the robot's key points within a second preset time period in the future, and the robot's key points include head key points, hand key points, foot key points, and root key points.
3. The method according to claim 1, characterized in that the step of generating second displacement sequence information based on the first probability and the first displacement sequence information includes: generating the second displacement sequence information based on the first probability and the first displacement sequence information using either gradient descent or the Lagrange multiplier method.
4. The method according to any one of claims 1-3, characterized in that the method further includes: acquiring a second environmental image after a third preset time period has elapsed since the first environmental image was acquired; processing the first instruction and the second environmental image based on the visual language multimodal large model to obtain third displacement sequence information for the robot to execute the first instruction; processing, by the dynamic discriminator, the third displacement sequence information and the current state information of the robot to obtain a second probability that the robot will maintain balance and not be damaged within the first preset time period when it performs an action based on the third displacement sequence information; if the second probability is less than the preset threshold, generating fourth displacement sequence information based on the second probability and the third displacement sequence information; and controlling the robot's movement based on the fourth displacement sequence information.
5. The method according to any one of claims 1-3, characterized in that the robot's current state information includes at least one of the following: joint angular velocity, center of mass position, foot pressure, and inertia tensor; and the dynamic discriminator is trained based on multiple task instructions and the state information of the robot during the execution of the multiple task instructions.
6. The method according to any one of claims 1-3, characterized in that controlling the robot's movement based on the second displacement sequence information includes: processing the second displacement sequence information by a key point tracker to obtain behavior control information of the robot, and controlling the robot's movement based on the behavior control information.
7. The method according to any one of claims 1-3, characterized in that, when the first probability is greater than or equal to the preset threshold, the robot's movement is controlled based on the first displacement sequence information.
8. A robot, characterized in that it includes: one or more processors; and one or more memories, the one or more memories storing one or more programs that, when executed by the one or more processors, cause the robot to perform the method of any one of claims 1-7.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a program or instructions that, when executed by an electronic device, implement the method of any one of claims 1-7.
10. A computer program product, characterized in that it includes instructions that, when executed, cause the method of any one of claims 1-7 to be implemented.
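The threshold check and gradient-based correction recited in claims 1, 3, and 7 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: `toy_discriminator` is a hypothetical closed-form stand-in for the trained dynamic discriminator, the single scalar `state_tilt` stands in for the robot's current state information, and the gradient is estimated by finite differences. Gradient descent on the negative probability is written here as gradient ascent on the probability; the Lagrange multiplier alternative is not shown.

```python
import math


def toy_discriminator(displacements, state_tilt):
    """Stand-in for the dynamic discriminator (hypothetical).

    Returns a value in (0, 1]; larger displacements and a larger body
    tilt both lower the estimated probability of staying balanced.
    """
    effort = sum(d * d for d in displacements)
    return math.exp(-(effort + state_tilt ** 2))


def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference gradient of f at the list x."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + eps] + x[i + 1:]
        down = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def correct_sequence(first_seq, state_tilt, threshold, step=0.1, max_iters=500):
    """Return the (sequence, probability) pair to act on.

    If the first probability meets the threshold, the first sequence is
    returned unchanged. Otherwise it is nudged along the gradient of the
    probability until the threshold is met or max_iters is reached.
    """
    def score(seq):
        return toy_discriminator(seq, state_tilt)

    seq = list(first_seq)
    prob = score(seq)
    for _ in range(max_iters):
        if prob >= threshold:
            break
        grad = numerical_gradient(score, seq)
        seq = [s + step * g for s, g in zip(seq, grad)]
        prob = score(seq)
    return seq, prob


if __name__ == "__main__":
    second_seq, prob = correct_sequence([0.8, 0.6], state_tilt=0.2, threshold=0.8)
    print(second_seq, prob)
```

With this toy discriminator the correction scales the displacements down while keeping their direction, which loosely mirrors the stated goal of preserving task intent. A real system would also need a fallback for the case where the iteration limit is reached before the threshold is met.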