A robot control system and method based on task-independent visual alignment
By employing a task-agnostic vision-aligned robot control method, the problems of coupling between visual understanding and task actions, as well as poor hardware adaptability, in robot systems have been solved. This method enables precise operation of robots in dynamic scenarios and improves hardware adaptability and flexibility, while reducing data and hardware dependencies.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- TIANJIN UNIV
- Filing Date
- 2026-05-18
- Publication Date
- 2026-06-12
AI Technical Summary
Existing robot control systems are difficult to apply flexibly in dynamic and complex scenarios because visual understanding is tightly coupled to task actions, hardware adaptability is poor, data requirements are high, multi-source heterogeneous data is insufficiently fused, and hardware dependence is strong.
A task-agnostic visual alignment robot control method is adopted. By integrating a large model fine-tuning strategy for directional optimization, a multimodal feature fusion mechanism for deep association, and a dynamically adaptable diffusion model calibration scheme, a basic visual alignment capability system that is reusable across tasks and compatible across hardware is constructed, enabling precise operation control of the robot based on first-person visual information.
It reduces the data requirements for robot operation, improves robustness and hardware adaptability in different environments, achieves "one-time training, multi-task adaptation", and does not require the use of a third-view camera for hand-eye system calibration.
Figure CN122185255A_ABST
Description
Technical Field
[0001] This invention belongs to the fields of computer science, artificial intelligence, and robotics, and involves large model fine-tuning, multimodal feature fusion, and diffusion model technology. Specifically, it is a robot control method based on task-independent visual alignment.
Background Technology
[0002] In the field of robot control, with the rapid development of the vision-language-action (VLA) model, robots have the ability to map visual-text input into operational actions. However, they still face key problems such as the coupling of visual understanding and task actions, poor hardware adaptability, and high data requirements, which restrict the flexible application of robots in dynamic and complex scenarios.
[0003] Existing VLA models mostly employ an end-to-end training architecture, requiring the binding of visual feature extraction with task-specific action generation during training. This leads to redundant updates to the overall parameters when switching tasks and necessitates collecting massive amounts of complete operation sequences as training data, resulting in extremely high data acquisition costs. To reduce the complexity of fine-tuning and data dependence, the industry has gradually adopted an adapter fine-tuning strategy, injecting a small number of learnable parameters into a pre-trained large multimodal model (LMM) to achieve functional adaptation. However, existing adapter fine-tuning solutions mostly focus on task-specific scenarios and lack targeted optimization for the general fundamental capability of robot visual alignment. This makes it difficult to form a reusable visual processing module across tasks and fails to meet the needs of efficient robot deployment in multi-task scenarios.
[0004] Meanwhile, robot control requires the collaborative processing of multi-source heterogeneous data, including visual images (first-person / third-person perspective), text commands (human operation requirements), and motion parameters (robotic arm motion information). Existing fusion solutions often employ simple stitching or fusion methods dominated by single-modal features, failing to fully explore the correlations between different modal data. This results in fused features that cannot accurately reflect the "vision-motion" mapping relationship, thus affecting the robot's operational accuracy. Furthermore, existing fusion strategies largely rely on third-person cameras to provide global visual information. However, the fixed hand-eye system formed by the third-person camera and the robotic arm is hardware-dependent. When the camera-robotic arm configuration changes, tedious recalibration of the hand-eye system is required, severely limiting the robot's hardware adaptability and flexibility.
Summary of the Invention
[0005] To address the technical problems existing in the prior art, this invention provides a robot control method based on task-independent visual alignment. By innovatively integrating a large model fine-tuning strategy for directional optimization, a multimodal feature fusion mechanism for deep association, and a dynamically adaptable diffusion model calibration scheme, a basic visual alignment capability system that can be reused across tasks, is compatible across hardware, and efficiently utilizes data is constructed, enabling the robot to achieve precise operation control in dynamic scenes based on first-person visual information.
[0006] To address the problems existing in the prior art, the present invention adopts the following technical solution: a robot control system based on task-independent viewpoint alignment, the system comprising a first visual image extraction module, a second visual image extraction module, a relative pose determination module, a visual image calibration module, and a robotic arm control module, wherein: the first visual image extraction module comprises a task parser, a gated recurrent unit, and an image processing unit; the relative pose determination module comprises a visual encoder, a front-end adapter, a projection layer, a back-end adapter, a language encoder, and a decoder; the visual image calibration module comprises a semantic feature extraction unit, a multilayer perceptron neural network, a cross-modal feature fusion module, a diffusion model, and a calibration unit; wherein:
The first visual image extraction module extracts the target object and its spatiotemporal relationships from human instructions, simultaneously acquires a top-down view image of the object, and uses the top-down view image as the task-related target image.
The second visual image extraction module acquires the robot's current first-person view image through a camera mounted on the wrist of the robotic arm and records it as the current image.
The relative pose determination module takes as input the current image, the target image, and the corresponding text prompt, and calculates the camera's six-degree-of-freedom relative transformation motion between the current image and the target image. It then determines whether the current first-person view image and the top-down view image are aligned. If they are not aligned, execution continues; if they are aligned, the task is complete, and the robot can perform subsequent operations based on the aligned view. Specifically, a feature matching similarity threshold is set, and the feature matching similarity between the current image and the target image is calculated. If the similarity does not reach the threshold, the images are determined to be misaligned, and the subsequent calibration and movement steps are performed. If the threshold is reached, the images are determined to be aligned.
The visual image calibration module generates a conditional feature vector related to the hand-eye transformation by combining the visual geometric features of the current image and the target image with the kinematic semantic features of the robot's actions. Through iterative optimization of the diffusion process, it calibrates the hand-eye error between the robotic arm actuator and the wrist camera, compensating for hardware-specific pose deviations.
[0007] Furthermore, the construction process of the relative pose determination module includes: adapters are applied at the modality alignment projection layers of the visual encoder and the language encoder, with the adaptation rank of the front-end adapter and the back-end adapter set to 32; the relative pose determination module divides the six-degree-of-freedom relative pose parameters into 256 discrete intervals according to percentiles, each interval corresponding to an action token; action tokens and language tokens share a feature embedding space; the adapters are adjusted during training of the multimodal large model, with the cross-entropy loss over action tokens and language tokens as the optimization objective, so that the multimodal large model acquires a task-independent visual-pose mapping capability.
[0008] Furthermore, the process by which the multilayer perceptron neural network obtains visual geometric features includes: extracting 4×4 uniformly distributed grid anchor points from the target image, forming a rectangular lattice of 16 anchor points; using the COTR algorithm to match the grid anchor points against the current image at pixel level, yielding deformed grid anchor points; applying Fourier positional encoding to the two-dimensional coordinates of the grid anchor points and of the deformed grid anchor points respectively, mapping each to a 512-dimensional descriptor; and inputting the encoded descriptors into dual Image-MLP branches, which process the features at time steps t and t+1 respectively, with outlier noise suppressed by a Gaussian filter, to output a first visual geometric feature and a second visual geometric feature.
[0009] Furthermore, the semantic feature extraction unit converts the robot's relative transformation action between time steps t and t+1 into a six-degree-of-freedom relative pose vector composed of Euler angles and a translation vector; the vector is then input into a lightweight model and mapped to a 512-dimensional feature space to obtain the kinematic semantic feature.
[0010] Furthermore, the process of generating the conditional feature vector related to the robot hand-eye transformation through cross-modal feature fusion includes: concatenating the visual geometric features and the kinematic semantic feature into a unified feature matrix, which serves as the key-value pair of the cross-attention mechanism; initializing 8 trainable query vectors, which aggregate cross-modal relationships through the cross-attention operation, with the attention scores scaled according to the feature dimension; and compressing the attention output with a linear projection layer into a single vector, which is the conditional feature vector related to the hand-eye transformation.
[0011] Furthermore, the process by which the visual image calibration module calibrates the hand-eye error between the robotic arm actuator and the wrist camera through iterative optimization of the diffusion process, compensating for hardware-specific pose deviations, includes:
Encoding the six-degree-of-freedom relative pose transformation between the robotic arm actuator and the wrist camera as 4 spatial points, of which the first three are orthogonally distributed on the unit sphere and the fourth, p4, has no geometric constraint.
Iteratively denoising the noisy point cloud with the diffusion model, guided by the conditional feature vector related to the hand-eye transformation. The optimization uses a total loss function that combines four terms: a Rank-N contrastive loss, which enhances discrimination between different hand-eye configurations and is defined over the conditional feature vectors using a hyperparameter, the negative L2 norm as the similarity function, and the set of samples with a higher label distance; a diffusion loss, which constrains the denoising effect by comparing the denoised point cloud predicted by the model, given the diffusion model parameters and randomly sampled noise, against the ground-truth point cloud labels; a spherical constraint loss, which forces the three constrained points to lie on the unit sphere; and an orthogonal constraint loss, which ensures that the three constrained points are mutually orthogonal.
Decoding the optimized point cloud into the hand-eye transformation matrix, which is used to calibrate the relative transformation action and compensate for hardware-specific pose deviations. The resulting calibrated hand-eye transformation matrix is input to the robotic arm control module.
[0012] The present invention can also adopt the following technical solution: a method for controlling a robot using the system described in any one of claims 1-6, comprising the following steps:
Step 1: Target information extraction and image acquisition.
Step 2: First-person view image acquisition: acquire the robot's current first-person view image using a camera mounted on the robotic arm's wrist, and record it as the current image.
Step 3: Relative pose estimation: input the current image, the target image, and the corresponding text prompt into a pre-trained multimodal large model, estimate the six-degree-of-freedom relative pose between the two images, and obtain the relative transformation action.
Step 4: Alignment judgment and looping: determine whether the current first-person view image and the top-down view image are aligned. If they are not aligned, continue execution; if they are aligned, the task is complete, and the robot can perform subsequent operations based on the aligned view.
Step 5: Generative online hand-eye error calibration: an online hand-eye calibration module based on a conditional diffusion model is adopted. By combining the visual geometric features of the current image and the target image with the kinematic semantic features of the robot's actions, a conditional feature vector related to the hand-eye transformation is generated. Through iterative optimization of the diffusion process, the hand-eye error between the robotic arm actuator and the wrist camera is calibrated to compensate for hardware-specific pose deviations.
Step 6: Robotic arm movement control: move the robotic arm according to the calibrated transformation matrix, then return to Step 2 and continue execution.
[0013] Beneficial effects: 1. Existing end-to-end VLA models for robot operation must be trained on complete demonstration sequences that run from the robot's initial state to completion of the operation. They place high demands on data completeness, and such training data is difficult to obtain. To address this, the invention proposes a robot control system and method based on task-independent view alignment. Its main feature is that the complete manipulation task is decoupled into two independent sub-tasks: visual alignment and operation execution. The visual alignment process, in which the gripper moves to the vicinity of the object to be operated on, is a general capability that can be learned once and reused repeatedly. Training it separately reduces the model's need for complete operation sequence data and improves its robustness across environments.
[0014] 2. This invention focuses on the implementation of the visual alignment submodule, clarifies the module's input and output and the complete iterative execution process, solves the core problem of coordinate system transformation, and can achieve better results when used with existing end-to-end VLA models; at the same time, this invention achieves "one-time training, multi-task adaptation" by adapting the visual alignment capability to a large model, without the need to retrain or update global parameters for new tasks, which greatly reduces the cost of cross-task deployment.
[0015] 3. This invention is based on dynamic hand-eye calibration using a first-view camera and a diffusion model. It does not rely on a third-view fixed hand-eye system or external calibration tools, and can adapt to different camera-robotic arm configurations, eliminating hardware adaptation limitations.
Attached Figure Description
[0016] Figure 1 is a schematic diagram of a robot control system based on task-independent visual alignment according to the present invention; Figure 2 is a flowchart of a robot control method based on task-independent visual alignment according to the present invention.
Detailed Implementation
[0017] The robot control method based on task-independent visual alignment of the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments. All other embodiments obtained by those skilled in the art based on the technical solutions of the present invention without inventive effort are within the scope of protection of the present invention.
[0018] As shown in Figure 1, this invention provides a robot control system based on task-independent viewpoint alignment. The system includes a first visual image extraction module, a second visual image extraction module, a relative pose determination module, a visual image calibration module, and a robotic arm control module. The first visual image extraction module includes a task parser and an image processing unit. The relative pose determination module includes a visual encoder, a front-end adapter, a feature similarity threshold judgment unit, a projection layer, a back-end adapter, and a large language model decoder. The visual image calibration module includes a semantic feature extraction unit, a multilayer perceptron neural network, a cross-modal feature fusion module, a diffusion model, and a calibration unit.
This invention divides functions between two cameras. The image processing unit uses one camera to assist in acquiring spatiotemporal relationships and to generate a top-down view of the target object. The second visual image extraction module uses the camera mounted on the robotic arm wrist to acquire images of the operating scene. Images from both cameras serve as input data for subsequent processing.
Specifically, the first visual image extraction module extracts the target object and its spatiotemporal relationships from human commands and simultaneously acquires a top-down view image of the object, using this image as the task-related target image. For target information extraction and image acquisition, the GRU task parser decomposes the semantic information of the input human command, identifies the target object or objects involved in the command, and determines the spatiotemporal dependencies between the objects to form structured data. For a single-target task, a standard top-down view image of the target object is acquired directly by camera. For a multi-target composite task, top-down images of the target objects at each stage are acquired.
The second visual image extraction module acquires the robot's current first-person view image through the camera mounted on the robotic arm's wrist and records it as the current image.
The relative pose determination module takes as input the current image, the target image, and the corresponding text prompt, and estimates the camera's six-degree-of-freedom (6-DoF) relative pose between the current image and the target image, consisting of a 3D rotation R and a 3D translation t. It then determines whether the current first-person view image and the top-down view image are aligned. If they are not aligned, execution continues; if they are aligned, the task is complete, and the robot can perform subsequent operations based on the aligned view.
The visual image calibration module generates a conditional feature vector related to the hand-eye transformation by combining the visual geometric features of the current image and the target image with the kinematic semantic features of the robot's actions. Through iterative optimization of a diffusion process, it calibrates the hand-eye error between the robotic arm actuator and the wrist camera, compensating for hardware-specific pose deviations. The invention encodes the six-degree-of-freedom hand-eye transformation between the robotic arm actuator and the wrist camera as four spatial points and adds geometric constraints to achieve conditional diffusion calibration. The model finally outputs a calibration matrix, completing the calibration and correction of the hand-eye parameters.
Alignment judgment and loop control in the relative pose determination module: a feature matching similarity threshold is set (cosine similarity ≥ 0.95), and the feature matching similarity between the current image and the target image (or the target image of the current stage) is calculated. If the similarity does not reach the threshold, the images are considered misaligned, and the subsequent calibration and movement steps continue. If the similarity reaches the threshold, the images are considered aligned. In a single-target task, the robot can then perform subsequent operations. In a multi-target task, it proceeds to the next stage, updates the target image, and repeats this step and the subsequent steps until all stages are aligned.
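The threshold test in this paragraph can be illustrated with a minimal sketch. It assumes the current and target images have already been reduced to feature vectors by some matcher that the text does not specify; the names `cosine_similarity` and `is_aligned` are hypothetical, and only the 0.95 threshold comes from the description.

```python
import math

ALIGN_THRESHOLD = 0.95  # cosine-similarity threshold stated in the description


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as "not similar"
    return dot / (norm_a * norm_b)


def is_aligned(current_feat, target_feat, threshold=ALIGN_THRESHOLD):
    """True when the views count as aligned (similarity reaches the threshold)."""
    return cosine_similarity(current_feat, target_feat) >= threshold
```

When `is_aligned` returns False, the loop would continue with calibration and movement; when it returns True, a single-target task ends and a multi-target task advances to its next stage image.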
[0019] Relative pose calculation in the relative pose determination module: the current image, the target image (single-target task) or the target image of the current stage (multi-target task), and the corresponding text prompt are input into the trained multimodal large model. The model outputs a token sequence, which is decoded to obtain the camera's six-degree-of-freedom relative transformation action, consisting of a 3D rotation R and a 3D translation t.
[0020] The multimodal large model can use any LMM as its backbone network. For fine-tuning, LoRA learnable adapters are injected into the modality alignment projection layers of the image encoder and the language encoder, and into the Transformer layers of the language encoder, with the adaptation rank set to 32. A binning strategy divides the six-degree-of-freedom relative pose parameters (3D rotation R, 3D translation t) into 256 discrete intervals between the 1st and 99th percentiles, with each interval corresponding to an action token. If the vocabulary of the selected multimodal large model contains fewer than 256 special tokens, the 256 least frequently occurring tokens in the vocabulary are replaced with action tokens, so that action tokens and language tokens share the feature embedding space. During training, only the injected learnable adapters are fine-tuned, with the cross-entropy loss of the predicted tokens as the optimization objective. Training uses the next-token prediction task, so that the model acquires a task-independent vision-pose mapping capability.
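The binning strategy can be sketched as follows for a single pose dimension. The sketch assumes uniform-width bins between the 1st and 99th percentiles of the training values, values outside that range clipped into the edge bins, and bin centers used when decoding; the text states only the 256 intervals and the percentile range, so these choices and all function names are hypothetical.

```python
import bisect

NUM_BINS = 256  # number of discrete intervals stated in the description


def percentile(sorted_vals, q):
    """Linearly interpolated percentile of a sorted list, q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def make_bin_edges(values, num_bins=NUM_BINS, q_low=1.0, q_high=99.0):
    """Uniform bin edges spanning the 1st-99th percentile of one pose dimension."""
    ordered = sorted(values)
    low, high = percentile(ordered, q_low), percentile(ordered, q_high)
    step = (high - low) / num_bins
    return [low + i * step for i in range(num_bins + 1)]


def to_token_id(value, edges):
    """Map a continuous value to an action-token index in [0, num_bins - 1]."""
    clipped = min(max(value, edges[0]), edges[-1])
    idx = bisect.bisect_right(edges, clipped) - 1
    return min(idx, len(edges) - 2)


def from_token_id(token_id, edges):
    """Decode an action-token index back to the center of its bin."""
    return 0.5 * (edges[token_id] + edges[token_id + 1])
```

Each of the six pose dimensions would get its own edge list, and the resulting indices would be mapped onto the 256 vocabulary entries reserved for action tokens.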
[0021] Visual geometric feature extraction by the multilayer perceptron neural network: 4×4 uniformly distributed grid anchor points are extracted from the target image (or the target image of the current stage), forming a rectangular lattice of 16 anchor points. The COTR algorithm matches the grid anchor points against the current image at pixel level, yielding a deformed grid. The 2D coordinates of the grid anchor points and of the deformed grid are Fourier-encoded and mapped to 512-dimensional descriptors. These descriptors are then input into dual Image-MLP branches that process the features at time steps t and t+1, respectively. Outlier noise is suppressed using a Gaussian filter, and the two resulting visual geometric features are output.
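A minimal sketch of the anchor grid and the Fourier positional encoding follows. The description fixes the 4×4 grid and the 512-dimensional descriptor but not the anchor placement or the frequency schedule, so placing anchors at cell centers, normalizing coordinates to [0, 1], and using 128 geometrically spaced frequencies (sin and cos for x and y, giving 4 × 128 = 512 values) are all assumptions, as are the function names. COTR matching and the Image-MLP branches are not shown.

```python
import math


def grid_anchors(width, height, n=4):
    """n x n anchor points at cell centers, as (x, y) pixel coordinates."""
    return [((col + 0.5) * width / n, (row + 0.5) * height / n)
            for row in range(n) for col in range(n)]


def fourier_encode(x, y, width, height, dim=512):
    """Encode one 2D point as a dim-length vector of sin/cos features."""
    num_freqs = dim // 4           # sin and cos, for x and for y
    u, v = x / width, y / height   # assumed normalization to [0, 1]
    features = []
    for k in range(num_freqs):
        # Hypothetical schedule: frequencies grow geometrically from pi.
        freq = math.pi * (2.0 ** (8.0 * k / num_freqs))
        features.extend((math.sin(freq * u), math.cos(freq * u),
                         math.sin(freq * v), math.cos(freq * v)))
    return features


def encode_grid(points, width, height):
    """Descriptors for a list of anchor (or deformed-anchor) coordinates."""
    return [fourier_encode(x, y, width, height) for x, y in points]
```

The same `encode_grid` call would be applied once to the target-image anchors and once to their matched positions in the current image before the descriptors enter the two MLP branches.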
[0022] Kinematic semantic feature extraction by the semantic feature extraction unit: the robot's 6D motion between time steps t and t+1 is converted into a 6D vector composed of Euler angles (roll, pitch, yaw) and a translation vector. This vector is then input into a lightweight Text-MLP (containing 2 hidden layers with ReLU activation) and mapped to a 512-dimensional feature space to obtain the kinematic semantic feature.
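The conversion to a 6D vector can be sketched as below, assuming the motion is given as a 4×4 homogeneous transform and that the rotation follows the common Z-Y-X (yaw, pitch, roll) convention; the text does not state the convention, and the helper names are hypothetical. The extraction is only well defined away from pitch = ±90°. The Text-MLP that follows is not shown.

```python
import math


def make_transform(roll, pitch, yaw, translation):
    """4x4 homogeneous transform with R = Rz(yaw) @ Ry(pitch) @ Rx(roll)."""
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr, tx],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr, ty],
        [-sp, cp * sr, cp * cr, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def pose_to_vector(T):
    """Return [roll, pitch, yaw, tx, ty, tz] from a 4x4 transform."""
    pitch = math.asin(max(-1.0, min(1.0, -T[2][0])))  # clamp for safety
    roll = math.atan2(T[2][1], T[2][2])
    yaw = math.atan2(T[1][0], T[0][0])
    return [roll, pitch, yaw, T[0][3], T[1][3], T[2][3]]
```

A round trip through `make_transform` and `pose_to_vector` is a quick way to check that the chosen convention is applied consistently on both sides.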
[0023] Cross-modal feature fusion: the visual geometric features and the kinematic semantic feature are concatenated into a unified feature matrix, which serves as the key-value pair of the cross-attention mechanism. 8 trainable query vectors are initialized; they aggregate cross-modal relationships through the cross-attention operation, with the attention scores scaled according to the feature dimension. The attention output is then compressed by a linear projection layer into a single vector, which is the conditional feature vector related to the hand-eye transformation.
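The fusion step can be sketched with standard scaled dot-product attention, softmax(Q K^T / sqrt(d)) V, which matches the description of queries, a shared key-value matrix, and a feature dimension used for scaling. The original formulas are not recoverable from the text, so this is an assumption. The demo uses a small feature dimension and a hypothetical output size of 32, with random weights standing in for trained parameters; only the 8 queries come from the description.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def linear(vec, weight, bias):
    """Dense layer: weight is a list of rows, one per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def demo_cond_feature(seed=0, d=16, num_queries=8, out_dim=32):
    """Fuse three feature rows into one conditional vector (toy sizes)."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(3)]  # two visual, one kinematic
    queries = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(num_queries)]
    weight = [[rng.gauss(0, 0.1) for _ in range(num_queries * d)]
              for _ in range(out_dim)]
    attended = cross_attention(queries, rows, rows)  # keys and values share the matrix
    flat = [x for row in attended for x in row]
    return linear(flat, weight, [0.0] * out_dim)
```

Using the same matrix for keys and values follows the description of a single unified feature matrix serving as the key-value pair.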
[0024] Diffusion process and error calibration: the six-degree-of-freedom hand-eye transformation between the robotic arm actuator and the wrist camera is encoded as 4 spatial points, of which the first three are orthogonally distributed on the unit sphere and the fourth, p4, has no geometric constraint. The conditional diffusion model is implemented with a Concatsquash MLP. Guided by CondFeature, it iteratively denoises the noisy point cloud (the number of diffusion steps is set to 100). The optimization uses a total loss function that combines a Rank-N contrastive loss, a diffusion loss, a spherical constraint loss, and an orthogonal constraint loss. The Rank-N contrastive loss enhances discrimination between different hand-eye configurations; it is defined over the conditional feature vectors related to the hand-eye transformation, using a hyperparameter, the negative L2 norm as the similarity function, and the set of samples with a higher label distance.
[0025] The diffusion loss constrains the denoising effect by comparing the denoised point cloud predicted by the model, given the diffusion model parameters and randomly sampled noise, against the ground-truth point cloud labels. The spherical constraint loss forces the three constrained points to lie on the unit sphere. The orthogonal constraint loss ensures that the three constrained points are mutually orthogonal. The optimized point cloud is decoded into the hand-eye transformation matrix, which is used to calibrate the camera's six-degree-of-freedom relative transformation motion and compensate for hardware-specific pose deviations, resulting in a calibrated transformation matrix.
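The four-point encoding and its geometric constraints can be illustrated with the sketch below. It assumes the three constrained points are the columns of the rotation matrix and the unconstrained fourth point is the translation; the description says only that three points lie orthogonally on the unit sphere and that p4 is free, so this mapping is a guess. The squared-error forms of the two constraint losses and the Gram-Schmidt decoding are likewise hypothetical stand-ins, and the diffusion model itself is not shown.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def encode_points(R, t):
    """Four points: the three columns of R, then the translation t."""
    cols = [[R[r][c] for r in range(3)] for c in range(3)]
    return cols + [list(t)]


def spherical_loss(points):
    """Penalize the first three points for leaving the unit sphere."""
    return sum((norm(p) - 1.0) ** 2 for p in points[:3])


def orthogonal_loss(points):
    """Penalize pairwise non-orthogonality of the first three points."""
    p1, p2, p3 = points[:3]
    return dot(p1, p2) ** 2 + dot(p1, p3) ** 2 + dot(p2, p3) ** 2


def decode_transform(points):
    """Rebuild a valid 4x4 transform from (possibly noisy) points."""
    p1, p2, _, p4 = points
    e1 = [x / norm(p1) for x in p1]
    proj = dot(p2, e1)
    u2 = [x - proj * e for x, e in zip(p2, e1)]  # Gram-Schmidt step
    e2 = [x / norm(u2) for x in u2]
    e3 = cross(e1, e2)                           # enforces a right-handed frame
    T = [[e1[r], e2[r], e3[r], p4[r]] for r in range(3)]
    T.append([0.0, 0.0, 0.0, 1.0])
    return T
```

Both constraint losses are zero for an exact rotation, so during training they only push the denoised points back toward a valid rigid transform, and the decoding step guarantees a proper rotation even when small residual errors remain.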
[0026] As shown in Figure 2, the present invention also provides a robot control method based on task-independent visual alignment:
Step 1: Target information extraction and image acquisition.
Step 2: First-person view image acquisition: acquire the robot's current first-person view image using a camera mounted on the robotic arm's wrist, and record it as the current image.
Step 3: Relative pose estimation: input the current image, the target image, and the corresponding text prompt into a pre-trained multimodal large model (LMM), estimate the 6-DoF relative pose between the two images, and obtain the camera's 6-DoF relative transformation motion, consisting of a 3D rotation R and a 3D translation t.
Step 4: Alignment judgment and looping: determine whether the current first-person view image and the top-down view image are aligned. If they are not aligned, continue execution; if they are aligned, the task is complete, and the robot can perform subsequent operations based on the aligned view.
Step 5: Generative online hand-eye error calibration: an online hand-eye calibration module based on a conditional diffusion model is adopted. By combining the visual geometric features of the current image and the target image with the kinematic semantic features of the robot's actions, a conditional feature vector related to the hand-eye transformation is generated. Through iterative optimization of the diffusion process, the hand-eye error between the robotic arm actuator and the wrist camera is calibrated to compensate for hardware-specific pose deviations.
Step 6: Robotic arm movement control: move the robotic arm according to the calibrated transformation matrix, then return to Step 2 and continue execution.
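The loop formed by Steps 2 to 6 can be sketched as a plain control loop. Every callable here is a hypothetical placeholder for the corresponding module, and `max_iters` is an added safety bound that the method itself does not mention. The toy demo replaces the camera, the multimodal model, and the diffusion calibration with one-dimensional stand-ins purely to show the control flow.

```python
def alignment_loop(capture, estimate_motion, is_aligned, calibrate, move,
                   target, max_iters=50):
    """Run Steps 2-6 until aligned; return the number of moves, or None."""
    for step in range(max_iters):
        current = capture()                            # Step 2: wrist-camera image
        motion = estimate_motion(current, target)      # Step 3: relative pose
        if is_aligned(current, target):                # Step 4: alignment check
            return step
        command = calibrate(current, target, motion)   # Step 5: hand-eye correction
        move(command)                                  # Step 6: move, then loop
    return None


def run_toy_demo():
    """1D stand-in: the 'image' is just the gripper position."""
    state = {"pos": 0.0}

    def capture():
        return state["pos"]

    def estimate_motion(current, target):
        return target - current

    def aligned(current, target):
        return abs(target - current) < 0.01

    def calibrate(current, target, motion):
        return 0.5 * motion  # hypothetical gain standing in for calibration

    def move(command):
        state["pos"] += command

    return alignment_loop(capture, estimate_motion, aligned, calibrate, move,
                          target=1.0)
```

In the toy run the remaining error halves on every move, so the loop stops after the seventh move, once the error falls below the 0.01 tolerance.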
Claims
1. A robot control system based on task-independent viewpoint alignment, the system comprising a first visual image extraction module, a second visual image extraction module, a relative pose determination module, a visual image calibration module, and a robotic arm control module, characterized in that: the first visual image extraction module comprises a task parser, a gated recurrent unit, and an image processing unit; the relative pose determination module comprises a visual encoder, a front-end adapter, a projection layer, a back-end adapter, a language encoder, and a decoder; the visual image calibration module comprises a semantic feature extraction unit, a multilayer perceptron neural network, a cross-modal feature fusion module, a diffusion model, and a calibration unit; wherein: the first visual image extraction module extracts the target object and its spatiotemporal relationships from human instructions, simultaneously acquires a top-down view image of the object, and uses the top-down view image as the task-related target image; the second visual image extraction module acquires the robot's current first-person view image through a camera mounted on the wrist of the robotic arm and records it as the current image; the relative pose determination module takes as input the current image, the target image, and the corresponding text prompt, and calculates the camera's six-degree-of-freedom relative transformation motion between the current image and the target image; it determines whether the current first-person view image and the top-down view image are aligned, and if they are not aligned, execution continues, while if they are aligned, the task is complete and the robot can perform subsequent operations based on the aligned view; wherein a feature matching similarity threshold is set and the feature matching similarity between the current image and the target image is calculated, such that if the similarity does not reach the threshold, the images are determined to be misaligned and the subsequent calibration and movement steps are performed, and if the threshold is reached, the images are determined to be aligned; the visual image calibration module generates a conditional feature vector related to the hand-eye transformation by combining the visual geometric features of the current image and the target image with the kinematic semantic features of the robot's actions, and, through iterative optimization of the diffusion process, calibrates the hand-eye error between the robotic arm actuator and the wrist camera, compensating for hardware-specific pose deviations.
2. The robot control system based on task-independent viewpoint alignment according to claim 1, characterized in that: the construction process of the relative pose determination module includes: modality alignment projection layers are used to align the visual encoder and the language encoder, and the adaptation rank of the front-end adapter and the back-end adapter is set to 32; the relative pose determination module divides the six-degree-of-freedom relative pose parameters into 256 discrete intervals according to percentiles, each interval corresponding to an action token; action tokens and language tokens share a feature embedding space; the adapters are adjusted during training of the multimodal large model, with the cross-entropy loss over action tokens and language tokens as the optimization objective, so that the multimodal large model acquires a task-independent visual-pose mapping capability.
3. The robot control system based on task-independent viewpoint alignment according to claim 1, characterized in that: the process by which the multilayer perceptron neural network obtains visual geometric features includes: extracting 4×4 uniformly distributed grid anchor points from the target image, forming a rectangular lattice of 16 anchor points; using the COTR algorithm to match the grid anchor points against the current image at pixel level, yielding deformed grid anchor points; applying Fourier positional encoding to the two-dimensional coordinates of the grid anchor points and of the deformed grid anchor points respectively, mapping each to a 512-dimensional descriptor; and inputting the encoded descriptors into dual Image-MLP branches, which process the features at time steps t and t+1 respectively, with outlier noise suppressed by a Gaussian filter, to output a first visual geometric feature and a second visual geometric feature.
4. The robot control system based on task-independent viewpoint alignment according to claim 1, characterized in that: the semantic feature extraction unit converts the robot's relative transformation action between time steps t and t+1 into a six-degree-of-freedom relative pose vector composed of Euler angles and a translation vector; the vector is then input into a lightweight model and mapped to a 512-dimensional feature space to obtain the kinematic semantic feature.
5. The robot control system based on task-independent viewpoint alignment according to claim 1, characterized in that the process of generating conditional feature vectors related to the robot hand-eye transformation through cross-modal feature fusion includes: concatenating the visual geometric features and the kinematic semantic features into a unified feature matrix, which serves as the key-value pair of the cross-attention mechanism; initializing 8 trainable query vectors, which aggregate cross-modal relationships through a cross-attention operation whose calculation involves the feature dimension; and compressing the aggregated output with a linear projection layer into a 3D vector to obtain the conditional feature vector related to the hand-eye transformation.
6. The robot control system based on task-independent viewpoint alignment according to claim 1, characterized in that the visual image calibration module, through iterative optimization of the diffusion process, calibrates the hand-eye error between the robotic arm actuator and the wrist camera and compensates for specific hardware pose deviations, the process including: encoding the six-degree-of-freedom relative pose transformation between the robotic arm actuator and the wrist camera as 4 spatial points, of which the points other than p4 are orthogonally distributed on the unit sphere and p4 has no geometric constraints; using the diffusion model, guided by the conditional feature vector related to the hand-eye transformation, to iteratively denoise the noisy point cloud, where the optimization uses a total loss function comprising the following terms: a Rank-N contrast loss, which enhances the discriminative power between different hand-eye configurations and is computed from the conditional feature vector related to the hand-eye transformation, a hyperparameter, a negative L2 norm term, and the set of samples whose label distance is higher; a diffusion loss, which constrains the denoising effect and is computed from the denoised point cloud predicted by the model (the optimized point cloud representation), the ground-truth point cloud label used for supervision, the parameters of the diffusion model, and randomly sampled noise; a spherical constraint loss, which forces the constrained points to lie on the unit sphere; and an orthogonal constraint loss, which ensures that the constrained points are orthogonal to each other; decoding the hand-eye transformation matrix from the optimized point cloud and using it to calibrate the relative transformation action, thereby compensating for specific pose deviations of the hardware; and inputting the resulting calibrated hand-eye transformation matrix to the robotic arm control module.
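The point-based pose encoding and the two geometric constraint losses of claim 6 can be illustrated with the sketch below. The formulas themselves are not legible in this text, so the following are assumptions made for illustration: the first three points are taken to be the columns of the rotation matrix and p4 the translation, the spherical loss is the squared deviation of each point norm from 1, the orthogonal loss is the sum of squared pairwise dot products, and decoding uses Gram-Schmidt orthonormalization. All function names are hypothetical, and the diffusion model and Rank-N contrast loss are not reproduced.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def encode_pose(R, t):
    """Assumed encoding: p1..p3 are the columns of R, p4 is the translation."""
    columns = [[R[r][c] for r in range(3)] for c in range(3)]
    return columns + [list(t)]


def spherical_loss(points):
    """Penalize p1..p3 for leaving the unit sphere (p4 is unconstrained)."""
    return sum((norm(p) - 1.0) ** 2 for p in points[:3])


def orthogonal_loss(points):
    """Penalize non-orthogonality between each pair of p1..p3."""
    p1, p2, p3 = points[:3]
    return dot(p1, p2) ** 2 + dot(p1, p3) ** 2 + dot(p2, p3) ** 2


def decode_pose(points):
    """Decode 4 points into a 4x4 transform via Gram-Schmidt on p1 and p2.

    The third axis is rebuilt as a cross product, so p3 is not read here.
    """
    p1, p2, _, p4 = points
    e1 = [x / norm(p1) for x in p1]
    proj = dot(e1, p2)
    u2 = [b - proj * a for a, b in zip(e1, p2)]
    e2 = [x / norm(u2) for x in u2]
    e3 = [e1[1] * e2[2] - e1[2] * e2[1],
          e1[2] * e2[0] - e1[0] * e2[2],
          e1[0] * e2[1] - e1[1] * e2[0]]
    rows = [[e1[r], e2[r], e3[r], p4[r]] for r in range(3)]
    return rows + [[0.0, 0.0, 0.0, 1.0]]
```

Both losses are zero for an exact rigid transform and grow as a denoised point cloud drifts away from a valid rotation, which is why they can act as regularizers alongside the diffusion loss.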
7. A method for controlling a robot using the system described in any one of claims 1-6, characterized in that it includes the following steps:
Step 1: Target information extraction and image acquisition;
Step 2: First-person view image acquisition: the robot's current first-person view image is acquired through a camera mounted on the wrist of the robotic arm and recorded as the current image;
Step 3: Relative pose estimation: based on the pre-trained multimodal large model, the current image, the target image, and the corresponding text prompt are input, and the six-degree-of-freedom relative pose between the two images is estimated to obtain the relative transformation action;
Step 4: Alignment judgment and looping: determine whether the current first-person view image and the target image are aligned; if they are not aligned, continue with step 5; if they are aligned, the task is completed, and the robot can perform subsequent operations based on the aligned view;
Step 5: Generative online hand-eye error calibration: an online hand-eye calibration module based on a conditional diffusion model is adopted; by combining the visual geometric features of the current image and the target image with the kinematic semantic features of the robot's actions, conditional feature vectors related to the hand-eye transformation are generated; through iterative optimization of the diffusion process, the hand-eye error between the robotic arm actuator and the wrist camera is calibrated to compensate for specific pose deviations of the hardware;
Step 6: Control of the robotic arm movement: move the robotic arm according to the calibrated transformation matrix, and then return to step 2 to continue execution.
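The looping structure of steps 2 to 6 in claim 7 can be sketched as a simple closed loop. This is an illustration only: every callable passed in (`acquire_image`, `estimate_relative_pose`, `is_aligned`, `calibrate`, `move_arm`) is a hypothetical placeholder for the corresponding module of the system, and the `max_iters` safety limit is an assumption that does not appear in the claim.

```python
def alignment_loop(acquire_image, estimate_relative_pose, is_aligned,
                   calibrate, move_arm, target_image, prompt, max_iters=50):
    """Run steps 2-6 repeatedly until the view is aligned.

    Returns (aligned, iterations_used). All callables are placeholders.
    """
    for step in range(max_iters):
        current = acquire_image()                                   # Step 2
        action = estimate_relative_pose(current, target_image, prompt)  # Step 3
        if is_aligned(current, target_image):                       # Step 4
            return True, step
        calibrated = calibrate(current, target_image, action)       # Step 5
        move_arm(calibrated)                                        # Step 6
    return False, max_iters                                         # assumed safety stop
```

Passing the modules in as callables keeps the loop independent of any particular arm or camera driver, in line with the hardware adaptability the method aims for.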