A dual-arm robot multi-environment adaptive cooperation control method and system based on RDT
By adopting a multi-environment adaptive cooperative control method based on RDT, the problems of cross-environment generalization and safety of dual-arm robots under different lighting conditions are solved, and stable cooperative control under different lighting conditions is achieved, improving the accuracy and safety of task execution.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI UNIVERSE CHANGYOU ROBOT CO LTD
- Filing Date
- 2026-03-12
- Publication Date
- 2026-06-12
AI Technical Summary
Existing dual-arm robot control methods generalize poorly across environments, carry a high collision risk during dual-arm collaboration, rely on visual perception that is strongly affected by ambient lighting, and lack multi-environment adaptive processing strategies, all of which can lead to task execution failure.
A multi-environment adaptive cooperative control method based on RDT is adopted. Through multi-environment data acquisition, unified state vector encoding, multi-view visual perception and environmental adaptive processing, combined with the diffusion process of the RDT model, a safe action sequence is generated to achieve dual-arm cooperative control.
It achieves cross-environment generalization capability under different lighting conditions, improves the safety of dual-arm collaboration and the robustness of visual perception, reduces deployment costs, and improves the accuracy and continuity of task execution.
Figure CN121848404B_ABST
Description
Technical Field
[0001] This application relates to the field of robot control technology, and in particular to a multi-environment adaptive cooperative control method and system for a dual-arm robot based on RDT.
Background Technology
[0002] With the rapid development of robotics technology, dual-arm robots, possessing more human-like operational capabilities, are increasingly widely used in fields such as industrial assembly, service interaction, and medical assistance. In practical applications, dual-arm robots often need to perform collaborative tasks under different lighting environments (such as indoor natural light and indoor artificial light). Changes in ambient lighting can severely affect the stability of the robot's visual perception system. At the same time, controlling spatial constraints during dual-arm collaboration is difficult and prone to collisions, leading to task failure.
[0003] Most existing dual-arm robot control methods are designed for single, fixed environments. When environmental conditions such as lighting change, data needs to be re-collected and the model retrained, resulting in poor cross-environment generalization ability and high deployment costs. Some methods with certain environmental adaptability have not achieved effective fusion of multi-environment data, resulting in low model adaptation efficiency to new environments. Furthermore, they lack a dual-arm coordination constraint mechanism based on a unified state representation, leading to insufficient real-time performance and accuracy in collision detection. At the same time, traditional visual perception modules have not designed adaptive processing strategies for different lighting environments, resulting in poor robustness of visual feature extraction, which further affects the execution accuracy of dual-arm collaborative tasks.
[0004] The Robotics Diffusion Transformer (RDT) model has been initially applied in the field of robot motion generation due to its long sequence modeling capabilities and multimodal condition generation advantages. However, the original RDT model lacks dual-arm collision detection functionality and does not consider visual perception and motion adaptation issues in multiple environments. Furthermore, its built-in 128-dimensional state vector encoding rules and core configuration of the diffusion process have not formed a standardized implementation solution, making it impossible to directly apply to dual-arm collaborative control scenarios in multiple environments.
[0005] To address the aforementioned issues, there is an urgent need for a dual-arm robot control method that can achieve multi-environment adaptability, improve the safety of dual-arm collaboration, and enhance the robustness of visual perception. This method would solve technical problems in existing technologies, such as poor cross-environment generalization ability, high risk of dual-arm collision, and significant environmental influence on visual perception.
Summary of the Invention
[0006] To overcome the problems existing in the prior art, this application provides a multi-environment adaptive cooperative control method and system for dual-arm robots based on RDT.
[0007] This application provides a multi-environment adaptive cooperative control method and system for a dual-arm robot based on RDT, which adopts the following technical solution:
[0008] A multi-environment adaptive cooperative control method for a dual-arm robot based on RDT includes the following steps:
[0009] S1: Multi-environment data acquisition step. Data on the collaborative task of the dual-arm robot is collected under different lighting conditions. An environment type label and a task type label are added to each data sample. The environment types include indoor natural light environment and indoor artificial light environment. Simultaneously, multi-view image data from the head camera, left wrist camera, and right wrist camera are acquired.
S2: Dual-arm coordination constraint step. The states and movements of the left and right arms of the dual-arm robot are uniformly encoded into a 128-dimensional unified state vector provided by the RDT model. Based on this 128-dimensional unified state vector, a dual-arm collision detection and movement adjustment mechanism is implemented.
S3: Multi-view visual perception step. The multi-view image data acquired in step S1 is fused, and an environment-adaptive image processing strategy is applied to the fused image according to the environment type label.
S4: Action generation step based on the RDT model. Using language commands, the multi-view image features processed in step S3, and the 128-dimensional unified state vector from step S2 as conditions, random noise is gradually denoised through the diffusion process of the RDT model to generate a sequence of collaborative actions for the dual-arm robot for the next 64 steps. The dual-arm coordination constraint mechanism from step S2 is applied again after action generation to ensure the safety of the action sequence.
[0010] Further, in step S2, the mechanism for implementing dual-arm collision detection and motion adjustment based on a 128-dimensional unified state vector specifically includes: S21: extracting the motion trajectories of the left and right arm end effectors from the motion sequence corresponding to the 128-dimensional unified state vector; S22: calculating the minimum spatial distance between the left and right arms by combining the geometry of the robot body and the gripper; S23: setting a preset safety distance threshold, comparing the minimum spatial distance with the safety distance threshold, and performing collision detection; if the minimum spatial distance is less than the safety distance threshold, it is determined that there is a collision risk, and step S24 is executed; if the minimum spatial distance is greater than or equal to the safety distance threshold, it is determined that there is no collision risk, and the motion sequence remains unchanged; S24: dynamically adjusting the motion sequence, prioritizing ensuring no collision between the two arms, and minimizing the modification of the original motion sequence under the premise of satisfying collision constraints, to ensure the continuity of task execution.
[0011] Furthermore, in step S2, the 128-dimensional unified state vector encodes the joint positions and the end effector posture information of the left and right arms. The end effector posture adopts the 6D rotation representation of the RDT model in place of the traditional quaternion or Euler angle representation, thus avoiding the gimbal lock problem.
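The 6D rotation representation mentioned above can be illustrated with a short sketch. It assumes the common construction in which the 6D vector holds the first two columns of the 3x3 rotation matrix and the third column is recovered by Gram-Schmidt orthogonalization; the function names are hypothetical, and the exact layout used inside the 128-dimensional state vector is not specified in this application.

```python
import math


def rotmat_to_6d(R):
    """Keep the first two columns of a 3x3 rotation matrix as a 6-vector."""
    return [R[0][0], R[1][0], R[2][0], R[0][1], R[1][1], R[2][1]]


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def sixd_to_rotmat(d6):
    """Rebuild an orthonormal rotation matrix from a 6-vector (Gram-Schmidt)."""
    a1, a2 = d6[:3], d6[3:]
    b1 = _normalize(a1)
    dot = sum(x * y for x, y in zip(b1, a2))
    # Remove the component of a2 along b1, then normalize.
    b2 = _normalize([y - dot * x for x, y in zip(b1, a2)])
    b3 = _cross(b1, b2)
    # b1, b2, b3 are the columns of the rotation matrix.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def rot_z(theta):
    """Rotation about the z axis, used here only as test input."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
```

Because the decoder re-orthogonalizes its input, a slightly perturbed 6D vector (for example, a network output) still maps to a valid rotation matrix, which is one reason this representation suits learned action generation.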
[0012] Further, in step S3, the environment-adaptive image processing strategy specifically includes: S31: calculating the average brightness of the fused image; if the average brightness is lower than a preset brightness threshold, automatically increasing the image brightness and contrast; if the average brightness is higher than or equal to the preset brightness threshold, maintaining standard image processing parameters; S32: applying different image enhancement parameters according to the environment type label, setting differentiated color jitter and noise injection parameters for indoor natural light environment and indoor artificial light environment respectively, wherein the noise includes Gaussian noise, Laplacian noise, and Poisson noise; S33: extracting features from the image processed by steps S31 and S32 to obtain multi-view fused visual features for subsequent action generation.
[0013] Furthermore, in step S4, the RDT model uses Transformer as the denoising network for the diffusion model, supports long sequence action modeling, and the diffusion process is a conditional diffusion process, using language instructions, multi-view visual features and the robot's 128-dimensional unified state vector as conditions to guide the generation of dual-arm collaborative action sequences, ensuring the matching of action sequences with task instructions and environmental states.
[0014] A multi-environment adaptive collaborative control system for a dual-arm robot based on RDT is disclosed to implement the aforementioned control method. The system includes a data acquisition module, a model fine-tuning module, and a model deployment module. The data acquisition module synchronously acquires multimodal collaborative task data through motion capture teaching, visual perception, and real-time TCP data transmission. The model fine-tuning module processes and transforms the acquired data, fine-tuning the training model through visual perception, multi-environment data fusion, and state encoding to generate a dedicated collaborative control model. The model deployment module loads the fine-tuned model, combines real-time environmental perception and robotic arm state data to generate a safe action sequence and drive the dual-arm robot to complete the collaborative task. The modules interact via industrial Ethernet, with a data transmission latency of less than 10ms, meeting the real-time control requirements of the dual-arm robot. The system is configured with a 128-dimensional unified state vector inherent in the RDT model. This 128-dimensional unified state vector includes the joint positions of the left and right arms of the dual-arm robot and the end effector posture information. The end effector posture is represented by 6D rotation to avoid discontinuous rotation angles.
[0015] Preferably, the data acquisition module includes: a motion capture teaching unit for capturing the operator's end effector TCP pose data; a data transmission unit for transmitting the TCP pose data to the robotic arm control program, wherein the robotic arm control program has a built-in robotic arm TCP inverse kinematics unit; the robotic arm TCP inverse kinematics unit, for converting the TCP pose data into robotic arm joint angle commands; an RTDE data transmission unit for transmitting the robotic arm joint angle commands to the UR robotic arm and transmitting the real-time status data of the UR robotic arm back to the robotic arm control program; and a multimodal data synchronization unit for synchronously recording the visual data from the head camera, left wrist camera, and right wrist camera, as well as the body status data of the left and right robotic arms, at a period of 100 ms (10 Hz), and packaging them into task segments.
[0016] Preferably, the model fine-tuning module includes: a data processing and conversion unit, used to clean, remove outliers and standardize the format of the collected task segments, extract environmental and collaborative object features from the image data, fuse and encode the visual features with the robotic arm state data, and finally encode the fused state into a low-dimensional vector; and a model training unit, used to fine-tune the pre-trained model based on the fused data to generate a collaborative control model adapted to a specific scenario.
[0017] Preferably, the model deployment module includes: a real-time perception unit for acquiring environmental images through multiple cameras and performing enhancement processing; a state data acquisition unit for acquiring state data of the joint angles and end-effector poses of the dual-arm robot in real time; a model deployment unit for loading the fine-tuned model, receiving visual and state data, and outputting action commands; an action generation unit for integrating discrete action commands into a continuous and smooth action sequence and sending it to the robotic arm control program; a collision detection unit for performing safety prediction on the action sequence received by the robotic arm control program; and a UR robotic arm for receiving the safe action sequence after collision detection from the robotic arm control program and completing the collaborative task.
[0018] Preferably, the data acquisition module uses a head camera, a left wrist camera, and a right wrist camera to simultaneously acquire robot status data. The robot status data includes data from the left and right robotic arms. The environmental types acquired by the data acquisition module include indoor natural light environment and indoor artificial light environment.
[0019] Preferably, the collision detection unit is used to extract the motion trajectories of the left and right arm end effectors from the action sequence corresponding to the 128-dimensional unified state vector, calculate the minimum spatial distance between the left and right arms by combining the geometry of the robot body and the gripper, preset a safety distance threshold, and compare the minimum spatial distance with the threshold to determine whether there is a collision risk; the robotic arm control program is used to dynamically adjust the action sequence when the collision detection unit determines that there is a collision risk, minimize the modification of the original action sequence while prioritizing ensuring that the two arms do not collide, and keep the action sequence unchanged if it is determined that there is no collision risk.
[0020] Preferably, the visual perception in the data acquisition module, model fine-tuning module, and model deployment module is implemented through a visual perception module. The visual perception module includes an image fusion unit and an adaptive processing unit connected in sequence. The image fusion unit is used to perform pixel-level fusion of multi-view image data acquired by the data acquisition module to obtain a multi-view fused image. The adaptive processing unit is used to calculate the average brightness of the fused image and adaptively adjust the brightness and contrast according to the brightness value. At the same time, it applies differentiated image enhancement parameters to the indoor natural light environment and the indoor artificial light environment according to the environment type label. The image enhancement parameters include color jitter parameters and noise injection parameters. The noise includes Gaussian noise, Laplacian noise, and Poisson noise.
[0021] Preferably, the action generation unit is equipped with a pre-trained RDT model, which uses a Transformer as the denoising network for the diffusion model. The action generation unit is used to generate a sequence of collaborative actions for the dual-arm robot for the next 64 steps by gradually denoising random noise through the conditional diffusion process of the RDT model, using language commands, multi-view fused visual features extracted by the visual perception module, and a 128-dimensional unified state vector output by the state encoding module as conditions, while invoking collision detection to verify the safety of the action sequence.
[0022] Preferably, the model fine-tuning module is configured with GPU training computing power to input multi-environment labeled data collected by the data acquisition module into the RDT model, complete the fine-tuning of the model, and improve the model's cross-environment generalization ability and action generation accuracy.
[0023] Beneficial effects
[0024] Compared with the prior art, the present invention has the following beneficial effects:
[0025] 1. Strong cross-environment generalization ability and low deployment cost: Through the multi-environment data fusion mechanism, environment type labels are added to the data samples so that data from different lighting environments is trained in a unified way. The model can directly perform tasks under different lighting conditions, such as indoor natural light and artificial light, without being retrained for each environment. This greatly reduces the deployment and maintenance cost of the robot and achieves zero-shot or few-shot adaptation across environments.
[0026] 2. High safety and good coordination of dual-arm collaboration: For the first time, a dual-arm collision detection and motion adjustment mechanism is implemented based on the 128-dimensional unified state vector of the RDT model. It combines the geometry of the robot body and gripper to make accurate collision judgments. When a collision risk is detected, the motion sequence is dynamically adjusted to prioritize safety and minimize the modification of the original motion sequence. This avoids dual-arm collisions and ensures the continuity of task execution, thereby improving the success rate of dual-arm collaborative tasks.
[0027] 3. High robustness of visual perception and strong resistance to environmental interference: Through multi-view camera image fusion and environmentally adaptive image processing strategies, the brightness and contrast are automatically adjusted according to the lighting environment. Different enhancement parameters are applied to different environments, which effectively improves the stability of the visual perception system under different lighting conditions and reduces the impact of changes in ambient lighting on visual feature extraction and task execution.
[0028] 4. High flexibility in motion generation and excellent task execution accuracy: Based on the conditional diffusion process of the RDT model, a 64-step dual-arm collaborative motion sequence is generated with language instructions, multi-view visual features and robot state as conditions. The Transformer denoising network supports long sequence modeling, and the diffusion process ensures the smoothness and continuity of the motion sequence. At the same time, it supports multimodal motion distribution and can generate adaptive motion sequences according to different environments and task requirements, thus improving the flexibility and accuracy of task execution.
[0029] 5. Modular system architecture with strong scalability: The system is divided into a data acquisition module, a model fine-tuning module, and a model deployment module that work together. Combined with multi-environment data annotation and environment adaptive processing functions, the system can directly execute tasks under different lighting conditions such as indoor natural light and artificial light, without the need for re-debugging and training for each environment. This greatly reduces deployment and maintenance costs and significantly improves the system's cross-environment generalization ability.
Attached Figure Description
[0030] Figure 1 is a schematic diagram of a multi-environment adaptive cooperative control method for a dual-arm robot based on RDT.
[0031] Figure 2 is a schematic diagram of the data acquisition module in a multi-environment adaptive cooperative control system for a dual-arm robot based on RDT.
[0032] Figure 3 is a schematic diagram of the model fine-tuning module and the model deployment module of a multi-environment adaptive collaborative control system for a dual-arm robot based on RDT.
Detailed Implementation
[0033] This application is described in further detail below with reference to Figures 1-3.
[0034] Example 1
[0035] This embodiment provides a multi-environment adaptive cooperative control method for a dual-arm robot based on RDT. As shown in Figure 1, the method includes the following steps:
[0036] S1: Multi-environment data acquisition steps
[0037] Demonstration data of the dual-arm robot performing collaborative tasks such as assembly, grasping, and handling is collected under two lighting conditions: indoor natural light environments, such as industrial assembly stations near windows and service robot work areas under natural light; and indoor artificial light environments, such as workshops without natural light and medical assistance scenarios with artificial lighting. Environment type labels and task type labels (assembly / grasping / handling) are added to each data sample. Simultaneously, multi-view image data is collected through the head camera (fixed to the robot support), the left wrist camera, and the right wrist camera to provide multi-dimensional information for subsequent visual perception.
[0038] S2: Dual-arm coordinated constraint steps
[0039] The states and movements of the left and right arms of the dual-arm robot are uniformly encoded into a 128-dimensional unified state vector built into the RDT model. This vector encodes the joint positions of the left and right arms and the end effector pose information, and the end effector pose is represented by 6D rotation to avoid gimbal lock issues. Based on this 128-dimensional unified state vector, a dual-arm collision detection and motion adjustment mechanism is implemented. As shown in Figure 3, the specific steps include:
S21: Extracting the motion trajectories of the left and right arm end effectors from the action sequence corresponding to the 128-dimensional unified state vector;
S22: Obtaining the geometric shape parameters of the robot body and gripper based on the robot's CAD model, and calculating the minimum spatial distance between the left and right arms based on the trajectory information;
S23: Setting a preset safety distance threshold of 5cm and comparing the minimum spatial distance with the threshold; if it is less than 5cm, it is determined that there is a collision risk, and S24 is executed; if it is greater than or equal to 5cm, there is no collision risk, and the action sequence remains unchanged;
S24: Dynamically adjusting the action sequence through the action generation interface of the RDT model, adjusting the motion path of the left and right arm end effectors by trajectory offset, prioritizing the absence of collision between the two arms, and, while satisfying collision constraints, preserving the motion trend and task execution rhythm of the original action sequence as much as possible so that the modification of the original action sequence is minimized.
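Steps S21 to S23 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: each arm is reduced to its end effector position per step, the body and gripper geometry from the CAD model is approximated by an optional bounding-sphere radius, and all function and parameter names are hypothetical.

```python
import math

SAFETY_DISTANCE_M = 0.05  # 5 cm safety distance threshold from S23


def min_arm_distance(left_traj, right_traj, left_radius=0.0, right_radius=0.0):
    """Return (minimum clearance in metres, step index where it occurs).

    Each trajectory is a list of (x, y, z) end effector positions in metres,
    one per action step. The radii are a bounding-sphere stand-in for the
    real body and gripper geometry (an assumption of this sketch).
    """
    best, best_step = math.inf, -1
    for step, (p, q) in enumerate(zip(left_traj, right_traj)):
        clearance = math.dist(p, q) - left_radius - right_radius
        if clearance < best:
            best, best_step = clearance, step
    return best, best_step


def has_collision_risk(left_traj, right_traj, threshold=SAFETY_DISTANCE_M,
                       left_radius=0.0, right_radius=0.0):
    """Apply the S23 rule: risk if the minimum clearance is below the threshold."""
    clearance, step = min_arm_distance(
        left_traj, right_traj, left_radius, right_radius)
    return clearance < threshold, clearance, step
```

For two effectors approaching each other along the x axis, `has_collision_risk` flags the first sequence whose closest approach falls below 5 cm and reports the step at which it happens, which is the information S24 needs to decide where to offset the trajectory.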
[0040] S3: Multi-view visual perception steps
[0041] Images from the head camera, left wrist camera, and right wrist camera acquired in S1 are fused to obtain a multi-view fused image. An environment-adaptive image processing strategy is then applied to the fused image based on the environment type label. Specifically:
S31: The average brightness of the fused image is calculated, with a preset brightness threshold of 120 (on the 0-255 grayscale range). If the average brightness is below 120, gamma correction is used to automatically enhance the image brightness and contrast (gamma value set to 0.6-0.8). If the average brightness is greater than or equal to 120, standard image processing parameters are maintained.
S32: Differentiated image enhancement parameters are set according to the environment type label. For indoor natural light environments, the color jitter coefficient is set to 0.1 and the Gaussian noise variance is set to 0.01. For indoor artificial light environments, the color jitter coefficient is set to 0.05 and the Laplacian noise coefficient is set to 0.005. Poisson noise is also injected to improve the model's robustness to noise.
S33: A convolutional neural network (CNN) is used to extract features from the processed image, obtaining 256-dimensional multi-view fused visual features for subsequent action generation.
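The brightness rule of S31 and the parameter selection of S32 can be sketched as below. The sketch assumes a grayscale image stored as rows of 0-255 integers and a gamma of 0.7 (inside the stated 0.6-0.8 range); the dictionary keys and function names are hypothetical, and the color jitter and noise injection themselves are not implemented here, only their parameters are looked up.

```python
BRIGHTNESS_THRESHOLD = 120  # S31 threshold on the 0-255 grayscale range

# Per-environment enhancement parameters from S32 (key names are hypothetical).
ENHANCEMENT_PARAMS = {
    "indoor_natural_light": {
        "color_jitter": 0.1, "noise": "gaussian", "noise_param": 0.01},
    "indoor_artificial_light": {
        "color_jitter": 0.05, "noise": "laplacian", "noise_param": 0.005},
}


def mean_brightness(image):
    """Average grey level of an image given as rows of 0-255 integers."""
    total = sum(sum(row) for row in image)
    count = sum(len(row) for row in image)
    return total / count


def gamma_correct(image, gamma):
    """Apply out = 255 * (in / 255) ** gamma; gamma < 1 brightens the image."""
    lut = [round(255 * (v / 255) ** gamma) for v in range(256)]
    return [[lut[v] for v in row] for row in image]


def adapt_image(image, env_label, gamma=0.7):
    """Brighten dark images (S31) and pick the enhancement parameters (S32)."""
    params = ENHANCEMENT_PARAMS[env_label]
    if mean_brightness(image) < BRIGHTNESS_THRESHOLD:
        image = gamma_correct(image, gamma)
    return image, params
```

A lookup table is used for the gamma curve because every pixel value maps to one of only 256 outputs, so the power function is evaluated 256 times regardless of image size.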
[0042] S4: Action generation steps based on RDT model
[0043] The user-input language commands, such as "grab the workpiece and assemble it to the designated position," the 256-dimensional multi-view visual features extracted by S3, and the 128-dimensional unified state vector of S2 are used as conditional inputs to the pre-trained RDT model. The RDT model uses a Transformer as a denoising network to gradually denoise random noise through a conditional diffusion process, generating a sequence of collaborative actions for the dual-arm robot for the next 64 steps. Each action corresponds to a joint control command for the robot. After the action sequence is generated, the dual-arm coordination constraint mechanism of S2 is called again for safety verification to ensure that the generated action sequence has no collision risk. Finally, the verified action sequence is sent to the actuator of the dual-arm robot to control the robot to perform the collaborative task.
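The idea of "gradually denoising random noise" into a 64-step action sequence can be illustrated with a generic DDPM-style ancestral sampling loop. This is a stand-in sketch and not the RDT sampler: the linear noise schedule, the 12-dimensional action (6 joints per arm), and the `predict_noise` callback are all assumptions. In a real system the callback would be the Transformer denoising network, conditioned on the language command, the visual features, and the state vector.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    span = beta_end - beta_start
    return [beta_start + span * t / (num_steps - 1) for t in range(num_steps)]


def sample_action_chunk(predict_noise, horizon=64, action_dim=12,
                        num_steps=1000, seed=0):
    """Denoise Gaussian noise into a horizon x action_dim action sequence.

    predict_noise(x, t, alpha_bar_t) must return a noise estimate with the
    same shape as x; any conditioning is captured inside that callback.
    """
    rng = random.Random(seed)
    betas = linear_beta_schedule(num_steps)
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    # Start from pure Gaussian noise.
    x = [[rng.gauss(0.0, 1.0) for _ in range(action_dim)]
         for _ in range(horizon)]
    for t in reversed(range(num_steps)):
        eps = predict_noise(x, t, alpha_bars[t])
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        scale = 1.0 / math.sqrt(alphas[t])
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise on last step
        x = [[scale * (x[i][j] - coef * eps[i][j]) + sigma * rng.gauss(0.0, 1.0)
              for j in range(action_dim)] for i in range(horizon)]
    return x


def make_oracle_denoiser(target):
    """Toy denoiser that knows the clean actions; for demonstration only."""
    def predict_noise(x, t, alpha_bar):
        s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
        return [[(xi - s * ti) / n for xi, ti in zip(xr, tr)]
                for xr, tr in zip(x, target)]
    return predict_noise
```

With the oracle denoiser the loop recovers the target sequence, which checks the update rule in isolation. After sampling, the resulting sequence would still be passed through the S2 collision check before being sent to the robot, as described above.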
[0044] The training process of the RDT model is as follows: the multi-environment labeled data collected by S1 is input into the initial RDT model, and pre-training is performed first (batch size 32, training rounds 100), and then fine-tuned for specific collaborative tasks (batch size 16, training rounds 50). During the training process, the dual-arm coordination constraint loss and visual feature matching loss are incorporated to improve the model's action generation accuracy and cross-environment generalization ability.
[0045] Example 2
[0046] This embodiment provides a multi-environment adaptive cooperative control system for a dual-arm robot based on RDT, used to implement the control method of Embodiment 1, as shown in Figures 2 and 3:
[0047] 1. System hardware configuration:
[0048] Dual-arm robot: It adopts two UR5 collaborative robotic arms, with the left and right arms arranged symmetrically. The end effector is equipped with a two-finger gripper with a gripper opening and closing stroke of 0-80mm and a repeatability of ±0.03mm.
[0049] Visual acquisition equipment: The head camera uses an Intel Realsense D455 depth camera (1280×720 resolution, 30fps), and the left and right wrist cameras use Intel Realsense D435. The head camera is mounted on the robot support at a height of 0.7m, and the wrist cameras are fixed to the side of the gripper at the end of the robotic arm.
[0050] Computing equipment: Data acquisition and model deployment use an Intel Core i7-12700K processor and 32GB DDR5 memory, while the model fine-tuning module is equipped with an NVIDIA RTX 4090 GPU (24GB VRAM).
[0051] Communication module: Industrial Ethernet switch (supports Gigabit Ethernet), RTDE data transmission unit is compatible with UR robotic arm's native communication protocol, and the actual measured data transmission delay is 85±10ms;
[0052] Auxiliary equipment: The motion capture teaching unit adopts the QingTong motion capture system, with a positioning accuracy of ±0.1mm, and supports real-time TCP pose capture.
[0053] 2. System software parameter settings
[0054] Data acquisition parameters: multimodal data synchronization period 100 ms (10 Hz), single task segment duration 10s, including 640 frames of visual data (head camera + dual wrist cameras) and 100 sets of robotic arm status data (joint angles, end pose); environment type labels are divided into "indoor natural light" (brightness range 500-1500 lux) and "indoor artificial light" (brightness range 800-2000 lux).
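A 100 ms synchronization grid over a 10 s task segment yields 100 synchronized records, matching the 100 sets of arm status data stated above. One simple way to align streams that arrive at different rates is nearest-timestamp matching, sketched below; the stream names and functions are hypothetical, and the application does not state which alignment rule the synchronization unit uses.

```python
import bisect


def nearest_sample(timestamps, values, t):
    """Return the value whose timestamp is closest to t (timestamps sorted)."""
    i = bisect.bisect_left(timestamps, t)
    if i == 0:
        return values[0]
    if i == len(timestamps):
        return values[-1]
    before, after = timestamps[i - 1], timestamps[i]
    return values[i] if after - t < t - before else values[i - 1]


def synchronize(streams, period_s=0.1, duration_s=10.0):
    """Resample every stream onto a common grid of period_s seconds.

    streams maps a stream name to (sorted timestamps in seconds, values).
    Returns one record per grid tick with the nearest sample of each stream.
    """
    num_ticks = round(duration_s / period_s)
    records = []
    for k in range(num_ticks):
        t = k * period_s
        record = {"t": t}
        for name, (timestamps, values) in streams.items():
            record[name] = nearest_sample(timestamps, values, t)
        records.append(record)
    return records
```

For example, a 30 fps camera stream and a faster arm-state stream can both be passed in, and each of the 100 records then holds the frame and the arm state closest to its tick.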
[0055] Visual processing parameters: The image fusion unit adopts a weighted average pixel fusion algorithm, with weights allocated as follows: head camera 0.4, left wrist camera 0.3, and right wrist camera 0.3; Adaptive processing unit brightness adjustment threshold: brightness is increased by 50% when below 300 lux and decreased by 30% when above 1800 lux; Differentiation enhancement parameters: color jitter amplitude ±10% and Gaussian noise variance 0.01 in natural light environment, color jitter amplitude ±5% and Laplacian noise coefficient 0.005 in artificial light environment;
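The weighted average pixel fusion with weights 0.4 / 0.3 / 0.3 can be sketched as follows. The sketch assumes the three views have already been resized and aligned to the same resolution and are given as grayscale rows of numbers; the view names and the function are hypothetical.

```python
# Fusion weights from the visual processing parameters above.
FUSION_WEIGHTS = {"head": 0.4, "left_wrist": 0.3, "right_wrist": 0.3}


def fuse_views(views, weights=FUSION_WEIGHTS):
    """Pixel-wise weighted average of same-sized grayscale images.

    views maps a view name to an image stored as rows of numbers.
    Raises ValueError if the images do not share one resolution.
    """
    names = list(weights)
    height = len(views[names[0]])
    width = len(views[names[0]][0])
    for name in names:
        image = views[name]
        if len(image) != height or any(len(row) != width for row in image):
            raise ValueError(f"view {name!r} has a different resolution")
    total = sum(weights.values())  # normalise in case weights do not sum to 1
    return [
        [sum(weights[n] * views[n][r][c] for n in names) / total
         for c in range(width)]
        for r in range(height)
    ]
```

With uniform test images of grey level 100 (head), 200 (left wrist), and 50 (right wrist), each fused pixel is 0.4 x 100 + 0.3 x 200 + 0.3 x 50 = 115.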
[0056] Model parameters: The RDT model uses a Transformer denoising network with 12 layers, 8 attention heads, and 512 hidden layer dimensions; it has 1000 diffusion steps and a denoising learning rate of 1e-4; the 128-dimensional unified state vector includes left arm joint position (6D), right arm joint position (6D), left arm distal end 6D pose (6D), and right arm distal end 6D pose (6D), with the remaining 98 dimensions encoding environmental features and collaborative tasks;
[0057] Safety parameters: Collision detection safety distance threshold is 50mm, minimum spatial distance calculation adopts Euclidean distance algorithm, action sequence adjustment step size is ≤5mm / step; the action generation unit outputs action sequence frame rate of 10Hz, single step action duration is 100ms, and the future 64 steps of action cover a 6.4s collaborative process.
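One way to read the adjustment step size of at most 5 mm per step is as a cap on how far each correction pass may move a waypoint. The sketch below follows that reading, which is an assumption: wherever the Euclidean distance between the two end effectors is below the 50 mm threshold, both waypoints are pushed apart symmetrically by at most 5 mm per pass until the threshold is met. It is a naive heuristic with hypothetical names, not the trajectory-offset procedure of this application, and it does not smooth neighbouring steps.

```python
import math


def push_apart(left_traj, right_traj, threshold=0.050, max_shift=0.005,
               max_passes=20):
    """Separate waypoints that are closer than threshold (metres).

    Each pass moves an offending pair apart along the line joining them,
    shifting each waypoint by at most max_shift. Waypoints that already
    satisfy the threshold are left untouched, so the change is minimal.
    """
    left = [list(p) for p in left_traj]
    right = [list(q) for q in right_traj]
    for _ in range(max_passes):
        changed = False
        for p, q in zip(left, right):
            d = math.dist(p, q)
            if d >= threshold:
                continue
            if d == 0.0:
                direction = [1.0, 0.0, 0.0]  # arbitrary axis for coincident points
            else:
                direction = [(qi - pi) / d for pi, qi in zip(p, q)]
            shift = min(max_shift, (threshold - d) / 2)
            for axis in range(3):
                p[axis] -= shift * direction[axis]
                q[axis] += shift * direction[axis]
            changed = True
        if not changed:
            break
    return left, right
```

The timing figures above are consistent with each other: at 10 Hz each action lasts 100 ms, so a 64-step sequence covers 64 x 0.1 s = 6.4 s of motion.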
[0058] 3. Specific Implementation Process
[0059] (1) Data acquisition stage
[0060] Set up the test environment: Set up an indoor natural light scene (no additional light source, relying on natural light from outside the window) and an indoor artificial light scene respectively;
[0061] Motion capture teaching: The operator demonstrates a collaborative task (such as "cooperative gripping and assembling parts with both arms") through the QingTong motion capture system. The motion capture teaching unit captures TCP pose data and sends it to the robotic arm control program through the data transmission unit. The TCP inverse kinematics unit converts the data into joint angle commands (6 joints in the left arm and 6 joints in the right arm).
[0062] Real-time data transmission and synchronization: The RTDE data transmission unit sends joint angle commands to the UR5e robotic arm and simultaneously sends back real-time status data of the robotic arm (joint angles, end-effector pose); the multimodal data synchronization unit records visual data from the head camera and dual wrist cameras and robotic arm status data at a period of 100 ms (10 Hz).
[0063] (2) Model fine-tuning stage
[0064] Data preprocessing: The data processing and conversion unit cleans and removes outliers and standardizes the format of 1,000 task segments. It extracts environmental and part features (such as part size, position, and color) from the image data, fuses the visual features (256-dimensional) with the robotic arm state data (24-dimensional) into a 384-dimensional vector, and then reduces the dimensionality to a 128-dimensional unified state vector.
[0065] Model fine-tuning: The preprocessed multi-environment labeled data was input into the RDT model and trained using an RTX 4090 GPU for 20,000 training rounds with a batch size of 32. After fine-tuning, the success rate of single-arm grasping tasks was >80%; the success rate of dual-arm collaborative tasks was >80%.
[0066] (3) Model deployment and task execution phase
[0067] Real-time perception and state acquisition: The real-time perception unit acquires environmental images through multiple cameras, and after image fusion and adaptive processing, extracts multi-view fused visual features; the state data acquisition unit acquires the joint angles and end-effector poses of the dual-arm robot in real time, and encodes them into a 128-dimensional unified state vector.
[0068] Action generation and safety verification: The model deployment unit loads the fine-tuned RDT model and, conditioned on the fused visual features and the 128-dimensional state vector, generates a sequence of 64 future actions through a conditional diffusion process. The robotic arm control program performs safety verification on the action sequence, and the collision detection unit calculates the minimum spatial distance between the end effectors of the two arms (the measured minimum distance is 30mm, which is greater than the threshold of 20mm, so there is no risk of collision).
[0069] Task execution: The robotic arm control program receives the safe action sequence and sends it to the UR5e dual-arm robot. The two arms complete collaborative tasks such as part grasping, alignment, and assembly according to the generated continuous action sequence. The whole process is collision-free and the movements are smooth. The success rate of single-arm grasping is >80%; the success rate of dual-arm collaborative tasks is >80%, meeting the needs of industrial collaboration.
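The idea of generating a 64-step action sequence by progressively denoising random noise can be sketched with a generic DDPM-style ancestral sampler. This is not claimed to be the sampler or noise schedule used by the RDT model: the linear beta schedule, the 50 sampling steps, the 12-dimensional action (6 joints per arm) and the placeholder `zero_denoiser` are all assumptions made for illustration. In a real system the denoiser would be the fine-tuned Transformer conditioned on language, vision and state.

```python
import numpy as np

def zero_denoiser(x, t, cond):
    """Placeholder for the trained denoising network (hypothetical).

    A real network would predict the noise in x from the timestep t and
    the conditions (language, visual features, state vector).
    """
    return np.zeros_like(x)

def sample_actions(denoiser, cond, horizon=64, action_dim=12,
                   steps=50, seed=0):
    """DDPM-style reverse process producing a (horizon, action_dim) array."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)   # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal((horizon, action_dim))  # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x, t, cond)
        # Posterior mean given the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) \
            / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x
```

The returned array would then be handed to the safety verification described above before being sent to the arms.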
[0070] 4. Implementation effect verification
[0071] Real-time verification: The measured average data transmission delay is 82ms, and the action sequence generation response time is ≤10ms, which meets the real-time control requirement of ≤100ms.
[0072] Multi-environment adaptation verification: under natural light (800 lux) and artificial light (1500 lux) environments;
[0073] Safety verification: 30 extreme collaborative scenarios were simulated (minimum theoretical distance between the two arms is 20mm). The collision detection unit accurately identified the risks in all cases. After the action sequence was adjusted, the minimum actual distance was 30mm, and no collisions occurred.
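The distance check and adjustment described here can be sketched as follows. This is a simplified illustration under stated assumptions: it compares only end-effector positions at matching timesteps, whereas the application also takes the geometry of the robot body and gripper into account, and the `push_apart` adjustment is a naive stand-in for the dynamic adjustment, not the method of the application. Only the 20 mm threshold comes from the text.

```python
import numpy as np

def min_arm_distance(left_traj, right_traj):
    """Minimum distance between two end-effector trajectories.

    left_traj, right_traj: (T, 3) arrays of positions in mm, sampled at
    the same timesteps. Returns (distance_mm, timestep_index).
    """
    d = np.linalg.norm(left_traj - right_traj, axis=1)
    i = int(np.argmin(d))
    return float(d[i]), i

def is_safe(left_traj, right_traj, threshold_mm=20.0):
    """No collision risk when the minimum distance is >= the threshold."""
    dist, _ = min_arm_distance(left_traj, right_traj)
    return dist >= threshold_mm

def push_apart(left_traj, right_traj, threshold_mm=20.0):
    """Naive adjustment (assumption): move violating waypoints apart
    symmetrically along their connecting line, leaving others unchanged."""
    left = left_traj.astype(float).copy()
    right = right_traj.astype(float).copy()
    diff = right - left
    dist = np.linalg.norm(diff, axis=1)
    bad = dist < threshold_mm
    # Unit separating direction; fall back to +x if the points coincide.
    unit = np.zeros_like(diff)
    unit[:, 0] = 1.0
    nz = dist > 1e-9
    unit[nz] = diff[nz] / dist[nz, None]
    shift = (threshold_mm - dist[bad]) / 2.0
    left[bad] -= unit[bad] * shift[:, None]
    right[bad] += unit[bad] * shift[:, None]
    return left, right
```

Waypoints that already satisfy the threshold are returned unmodified, which reflects the stated goal of minimizing changes to the original action sequence.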
[0074] Task completion rate verification: The "dual-arm collaborative assembly" task was executed 100 times in a row, and 80 times were successfully completed, with a completion rate of 80%. The average time was 45 seconds, and the movements were smooth and without any lag.
[0075] The above are all preferred embodiments of this application, and are not intended to limit the scope of protection of this application. Therefore, all equivalent changes made in accordance with the structure, shape and principle of this application should be covered within the scope of protection of this application.
Claims
1. A multi-environment adaptive cooperative control method for a dual-arm robot based on RDT, characterized in that it includes the following steps: S1: Multi-environment data acquisition step: collect demonstration data of dual-arm robot collaborative tasks under different lighting conditions, add an environment type label and a task type label to each data sample, the environment type including an indoor natural light environment and an indoor artificial light environment, and simultaneously collect multi-view image data from the head camera, left wrist camera and right wrist camera; S2: Dual-arm coordination constraint step, which encodes the state and action of the left and right arms of the dual-arm robot into the 128-dimensional unified state vector provided by the RDT model, and implements a dual-arm collision detection and action adjustment mechanism based on the 128-dimensional unified state vector, specifically including: S21: Extract the motion trajectories of the left and right arm end effectors from the action sequence corresponding to the 128-dimensional unified state vector; S22: Calculate the minimum spatial distance between the left and right arms by combining the geometry of the robot body and the gripper; S23: Set a preset safety distance threshold, compare the minimum spatial distance with the safety distance threshold, and perform collision detection; if the minimum spatial distance is less than the safety distance threshold, it is determined that there is a collision risk, and step S24 is executed; if the minimum spatial distance is greater than or equal to the safety distance threshold, it is determined that there is no collision risk, and the action sequence remains unchanged; S24: Dynamically adjust the action sequence, giving priority to avoiding collisions between the arms, and minimizing modifications to the original action sequence while satisfying collision constraints; S3: Multi-view visual perception step, which fuses the multi-view image data collected in step S1 and applies an environment-adaptive image processing strategy to the fused image based on the environment type label; S4: Action generation step based on the RDT model, which uses language instructions, the multi-view image features processed in step S3 and the 128-dimensional unified state vector in step S2 as conditions, and progressively denoises random noise through the diffusion process of the RDT model to generate a sequence of collaborative actions for the dual-arm robot for the next 64 steps. After the actions are generated, the dual-arm coordination constraint mechanism of step S2 is applied again to ensure the safety of the action sequence.
2. The multi-environment adaptive cooperative control method for a dual-arm robot based on RDT according to claim 1, characterized in that: In step S2, the 128-dimensional unified state vector encoding includes the joint positions of the left and right arms and the end effector posture information, and the end effector posture is represented by 6D rotation to avoid gimbal lock problems.
3. The multi-environment adaptive cooperative control method for a dual-arm robot based on RDT according to claim 1, characterized in that: In step S3, the environment-adaptive image processing strategy specifically includes: S31: Calculate the average brightness of the fused image. If the average brightness is lower than the preset brightness threshold, automatically increase the image brightness and contrast; if the average brightness is higher than or equal to the preset brightness threshold, maintain the standard image processing parameters. S32: Apply different image enhancement parameters according to the environment type label, and set differentiated color jitter and noise injection parameters for indoor natural light environment and indoor artificial light environment respectively. The noise includes Gaussian noise, Laplacian noise and Poisson noise. S33: Extract features from the image processed by steps S31 and S32 to obtain multi-view fused visual features.
4. The multi-environment adaptive cooperative control method for a dual-arm robot based on RDT according to claim 1, characterized in that: In step S4, the RDT model uses Transformer as the denoising network for the diffusion model. The diffusion process is a conditional diffusion process, which uses language instructions, multi-view visual features and the robot's 128-dimensional unified state vector as conditions to guide the generation of the dual-arm collaborative action sequence.
5. A multi-environment adaptive cooperative control system for a dual-arm robot based on RDT, used to implement the control method described in any one of claims 1-4, characterized in that it includes a data acquisition module, a model fine-tuning module, and a model deployment module; The data acquisition module synchronously collects multimodal collaborative task data through motion capture teaching, visual perception, and TCP real-time data transmission. The model fine-tuning module is used to process and transform the collected data, and to fine-tune the pre-trained model through visual perception, multi-environment data fusion and state coding to generate a dedicated collaborative control model. The model deployment module is used to load the fine-tuned model, combine real-time environmental perception and robotic arm state data to generate a safe action sequence and drive the dual-arm robot to complete collaborative tasks. The modules interact with each other via industrial Ethernet, with data transmission latency of less than 10ms, meeting the real-time control requirements of the dual-arm robot. The system is configured with a 128-dimensional unified state vector built into the RDT model. The 128-dimensional unified state vector contains the joint positions of the left and right arms of the dual-arm robot and the end effector posture information. The end effector posture is represented by 6D rotation to avoid the problem of discontinuous rotation angles.
6. A multi-environment adaptive cooperative control system for a dual-arm robot based on RDT according to claim 5, characterized in that: The data acquisition module includes: The motion capture teaching unit is used to capture the TCP pose data of the operator's end effector; The data transmission unit is used to transmit TCP pose data to the robotic arm control program. The robotic arm control program has a built-in robotic arm TCP inverse kinematics unit. The robotic arm TCP inverse kinematics unit is used to convert TCP pose data into robotic arm joint angle commands; The RTDE data transmission unit is used to transmit the joint angle commands of the robotic arm to the UR robotic arm and to send the real-time status data of the UR robotic arm back to the robotic arm control program. The multimodal data synchronization unit is used to synchronously record visual data from the head camera, left wrist camera, and right wrist camera, as well as the body state data of the left and right robotic arms, at 100 ms intervals, and package them into task segments.
7. A multi-environment adaptive cooperative control system for a dual-arm robot based on RDT according to claim 5, characterized in that: The model fine-tuning module includes: The data processing and conversion unit is used to clean, remove outliers and standardize the format of the collected task segments, extract environmental and collaborative object features from the image data, fuse and encode the visual features with the robotic arm state data, and finally encode the fused state into a low-dimensional vector. The model training unit is used to fine-tune the pre-trained model based on the fused data to generate a collaborative control model adapted to a specific scenario.
8. The multi-environment adaptive cooperative control system for a dual-arm robot based on RDT according to claim 5, characterized in that: The model deployment module includes: A real-time sensing unit is used to acquire environmental images through multiple cameras and perform enhancement processing; The status data acquisition unit is used to acquire the status data of the joint angles and end-effector pose of the dual-arm robot in real time. The model deployment unit is used to load the fine-tuned model, receive visual and status data, and output action commands. The action generation unit is used to integrate discrete action commands into a continuous and smooth action sequence and send it to the robotic arm control program. The collision detection unit is used to make safety predictions on the action sequences received by the robotic arm control program. The UR robotic arm is used to receive the safe action sequence from the robotic arm control program after collision detection and complete collaborative tasks.
9. A multi-environment adaptive cooperative control system for a dual-arm robot based on RDT according to claim 6, characterized in that: The data acquisition module uses a head camera, a left wrist camera, and a right wrist camera to simultaneously acquire robot status data. The robot status data includes data from the left and right robotic arms. The environmental types collected by the data acquisition module include indoor natural light environments and indoor artificial light environments.
10. A multi-environment adaptive cooperative control system for a dual-arm robot based on RDT according to claim 8, characterized in that: The collision detection unit is used to extract the motion trajectory of the left and right arm end effectors, calculate the minimum spatial distance between the left and right arms, and compare it with a preset safe distance threshold to determine the collision risk; the robotic arm control program is used to dynamically adjust the action sequence when a collision risk is determined, and minimize the modification of the original action sequence while ensuring no collision.
11. A multi-environment adaptive cooperative control system for a dual-arm robot based on RDT according to any one of claims 6, 7, and 8, characterized in that: The visual perception in the data acquisition module, model fine-tuning module, and model deployment module is implemented through a visual perception unit. The visual perception unit includes an image fusion unit and an adaptive processing unit connected in sequence. The image fusion unit is used to perform pixel-level fusion of multi-view image data. The adaptive processing unit is used to adaptively adjust the brightness and contrast of the fused image and apply differentiated image enhancement parameters according to the environment type label. The image enhancement parameters include color jitter parameters and noise injection parameters.
12. A multi-environment adaptive cooperative control system for a dual-arm robot based on RDT according to claim 8, characterized in that: The action generation unit is equipped with a pre-trained RDT model, which uses Transformer as the denoising network for the diffusion model. The action generation unit is used to generate a 64-step dual-arm collaborative action sequence based on language instructions, multi-view fused visual features, and a 128-dimensional unified state vector, and calls the collision detection unit to verify the safety of the action sequence.
13. A multi-environment adaptive cooperative control system for a dual-arm robot based on RDT according to claim 5, characterized in that: The model fine-tuning module is equipped with GPU training computing power to input multi-environment labeled data into the RDT model to complete fine-tuning.
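For reference, the 6D rotation representation named in claims 2 and 5 can be illustrated with the commonly used construction that keeps the first two columns of a rotation matrix and recovers the third by Gram-Schmidt orthogonalization. This is a minimal sketch assuming that standard construction; the application does not state which column or row convention the RDT model uses, and the function names are hypothetical.

```python
import numpy as np

def matrix_to_6d(R):
    """Flatten the first two columns of a 3x3 rotation matrix to 6 values."""
    return np.concatenate([R[:, 0], R[:, 1]])

def sixd_to_matrix(v):
    """Rebuild a valid rotation matrix from any 6D vector whose two
    halves are non-zero and not parallel (Gram-Schmidt)."""
    a1, a2 = v[:3], v[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2 = a2 - np.dot(b1, a2) * b1   # remove the component along b1
    b2 = a2 / np.linalg.norm(a2)
    b3 = np.cross(b1, b2)           # completes a right-handed frame
    return np.stack([b1, b2, b3], axis=1)
```

Because the mapping back to a matrix is continuous, small changes in the 6 values produce small changes in orientation, which is the property the claims refer to when contrasting it with angle-based representations.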