Distributed collection method and system for multimodal grasping data oriented towards embodied intelligence

Cross-system clock synchronization and UDP collaborative control are combined with golden pose perturbation and tactile data acquisition to address the I/O bottleneck and multimodal alignment problems in robot data acquisition. This enables the generation and automated annotation of high-quality embodied intelligence training data, which in turn improves the training efficiency and generalization ability of the model.

CN122311284A · Pending · Publication Date: 2026-06-30 · XIAMEN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
XIAMEN UNIV
Filing Date
2026-03-27
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing robot data acquisition solutions suffer from bottlenecks in I/O bandwidth and processing power, insufficient multimodal alignment accuracy, and low efficiency in negative sample generation and labeling, making it difficult to obtain high-quality embodied intelligence training data.

Method used

By using cross-system clock synchronization and UDP collaborative control, skill generalization sample construction based on golden pose perturbation, high-fidelity acquisition of sensor data through virtual-real fusion, and automatic evaluation of multimodal tactile spatiotemporal consistency operators, high-precision time alignment, efficient generalization sample generation, and automated high-quality annotation of multimodal grasping data in a distributed heterogeneous environment are achieved.

Benefits of technology

It achieves high-precision time alignment of multimodal data capture, generates high-quality generalized samples covering real challenges such as pose bias, contact imbalance, and rigidity adaptation, reduces the cost of manual annotation, and improves the training efficiency and generalization ability of embodied intelligent models.


Figure CN122311284A_ABST

Abstract

This invention discloses a distributed acquisition method and system for multimodal grasping data for embodied intelligence. The method includes: achieving high-precision clock synchronization between the Linux master control terminal and the Windows vision slave control terminal through the Chrony protocol, with the deviation controlled within 2ms, and establishing a UDP start / stop control protocol; employing preheating asynchronous writing to ensure zero-frame loss synchronization of the visual stream; acquiring the golden pose and tactile incremental intensity based on manual teaching to generate physically reasonable six-degree-of-freedom pose perturbation and force perturbation samples; integrating dual-sided optical tactile sensors to solve the problem of missing proprioception caused by serial port congestion; aligning tactile, proprioception, and visual data with a unified timestamp in the offline stage, exporting the standard Zarr format, and directly supporting supervised fine-tuning of VLA large models. This invention ensures continuous high-fidelity acquisition of proprioception data under serial port congestion through bus arbitration, and achieves distributed acquisition of multimodal grasping data through tactile tensor entropy, nonlinear collapse detection, and cross-modal cross-correlation analysis.

Description

Technical Field

[0001] This invention relates to the intersection of robotics and artificial intelligence, specifically to a distributed acquisition method and system for multimodal grasping data oriented towards embodied intelligence.

Background Technology

[0002] With the rapid development of embodied AI technology, large vision-language-action models (VLA models, such as pi0) place extremely high demands on the quality of robot manipulation datasets. High-quality embodied AI training data must cover not only visual images and robot poses, but also high-fidelity tactile physical interaction information, so that the agent can perceive the physical properties of objects (such as hardness, friction, and mass distribution).

[0003] However, existing robot data acquisition solutions suffer from the following significant technical bottlenecks:

[0004] I / O bandwidth and processing capacity bottlenecks: The simultaneous operation of high-resolution vision sensors and high-frequency tactile sensors (especially optical imaging-based tactile sensors) generates significant pressure on the USB bus. When performing high-throughput data writes in a single system, it often leads to system freezes, frame drops, or severe desynchronization of vision and tactile data.

[0005] Insufficient multimodal alignment accuracy: Heterogeneous sensors (cameras, haptic serial ports, robot network ports) often operate at different frequencies and protocols, making it difficult to guarantee microsecond-level timestamp synchronization across systems and modalities, resulting in deviations in timing control of the trained model.

[0006] Negative sample generation and annotation are inefficient: Existing teaching methods mostly rely on manual demonstration of successful samples, lacking automated negative sample generation logic (such as grasping offset, slip, and drop), and the cost of manual annotation of failure reasons (such as one-sided contact and unstable grasping) is extremely high, which seriously restricts the large-scale production of datasets.

[0007] Therefore, there is an urgent need for a distributed data acquisition solution that can record multimodal interaction processes with high stability and high fidelity, and has automated sample expansion and labeling functions.

Summary of the Invention

[0008] To address the aforementioned issues, this invention proposes a distributed data acquisition method and system for multimodal grasping data oriented towards embodied intelligence. Through cross-system clock synchronization and UDP collaborative control, skill generalization sample construction based on golden pose perturbation, high-fidelity acquisition of sensor data through virtual-real fusion, and automatic evaluation of multimodal tactile spatiotemporal consistency operators, this invention achieves high-precision time alignment, efficient generalization sample generation, and automated high-quality annotation of multimodal grasping data oriented towards embodied intelligence in a distributed heterogeneous environment.

[0009] In one aspect, a distributed acquisition method for multimodal grasping data oriented towards embodied intelligence includes:

[0010] S1: Establish a cross-system collaborative environment, limit the system clock deviation between the Linux master control terminal and the Windows vision slave control terminal to within a preset threshold through a clock synchronization protocol, and establish a UDP-based start and stop control protocol.

[0011] S2: In the cross-system collaborative environment, perform manual drag teaching for a specific object, and record the golden pose at the moment of successful grasping and the tactile incremental intensity relative to the background noise.

[0012] S3: Based on the golden pose and the tactile incremental intensity, generate pose perturbations (containing translation and rotation deviations) and force fluctuations through random sampling, and construct non-optimal grasping conditions as skill generalization samples.

[0013] S4: Based on the skill generalization samples, drive the Linux master control terminal to execute the grasping process, collect tactile data and robot body perception data in real time, send trigger commands to the Windows vision slave control terminal through the UDP-based start and stop control protocol, and synchronously record the visual video stream at the corresponding moment.

[0014] S5: Automatically determine the grasping state based on the collected tactile data, calculate the transformation matrix between the current grasping pose and the golden pose based on the current grasping state, and output the transformation matrix as a correction label.

[0015] Furthermore, in S1, the Windows vision slave adopts a preheating asynchronous write logic. Specifically, after the Windows vision slave completes clock synchronization with the Linux master, it starts the camera and maintains continuous video streaming, but does not write to the disk immediately. When it receives the start command sent by the Linux master through the established UDP control protocol, it immediately starts an independent thread that writes all subsequent image frames, together with their hardware-generated timestamps, to local storage, using the timestamp of the start command as the reference. This ensures that the visual data has no frame loss at the start of the task and that the time axis zero point is strictly aligned.

[0016] Furthermore, in S1, the clock synchronization protocol is the Chrony timing protocol, which limits the system clock deviation to within 2ms.

[0017] Furthermore, in S3, the pose perturbation takes the golden pose as its reference and is generated by randomly sampling a six-degree-of-freedom rigid body transformation comprising translational deviations and rotational deviations about the axes; the force fluctuation takes the reference grasping force as its baseline, with random force fluctuations injected according to a preset ratio.

[0018] Furthermore, in S4, the collection of tactile data and robot proprioceptive data specifically includes: tactile physical field, robot proprioceptive stream, and visual image stream. The tactile physical field, robot proprioceptive stream, and visual image stream are aligned offline according to the nearest neighbor rule using a unified system timestamp, and exported as Zarr or HDF5 format conforming to the LeRobot data protocol for supervised fine-tuning training of VLA multimodal large models.

[0019] Furthermore, in S4, the tactile data is collected by left and right dual-sided optical tactile sensors. The Linux host terminal packages the physical information acquired by the tactile sensors into a floating-point raw array stream of a preset bit width and publishes it, including a depth force matrix in 32FC1 format and a dual-channel shear force matrix in 32FC2 format, which preserve the fine physical field characteristics during the object interaction process.

[0020] Furthermore, in S4, the process of real-time acquisition of tactile data and robot body perception data includes collaborative safeguarding steps:

[0021] S41, monitor the buffer level and blocking status of the gripper control command queue and the serial port status feedback queue in real time; when the mean value of the tactile sensor depth force matrix exceeds the noise floor threshold by a preset amplitude for 3 consecutive frames and the absolute value of the gripper current signal slope is greater than a preset threshold, determine that the gripper has entered the dynamic control cycle and trigger the bus arbiter to suspend the status monitoring thread's periodic reads of the physical serial port.

[0022] S42, during the suspension period of the state monitoring thread's periodic read request to the physical serial port, extract the joint target position and velocity sequence from the control command stream, input it into the preset gripper kinematic transfer function model, perform numerical integration through the fourth-order Runge-Kutta method, generate a virtual pose sequence that is strictly aligned with the timestamp of the robot's body perception data, and use the virtual pose sequence as virtual spacetime filling for the collected robot body perception data stream during the suspension period;

[0023] S43, when the pose change rate of the robot's body perception data is less than 0.01 rad / s for 5 consecutive frames and the standard deviation of tactile intensity is less than 10% of the noise floor, it is determined that the gripper has entered the static locking stage. The bus arbiter restores the physical serial port reading permission and uses the latest physical pose feedback value as a benchmark to calculate the pose residual between the virtual pose sequence and the latest physical pose feedback value in the most recent 100ms window using the exponential weighted recursive least squares method. The bias compensation parameters of the shadow observer are updated to complete the drift correction.

[0024] Furthermore, in S5, the grasping state is automatically determined based on the collected tactile data, specifically including:

[0025] S51, based on the pressure field matrix acquired in real time by the dual-sided optical tactile sensors, construct left-side and right-side tactile response tensors respectively. Each tensor is formed by splicing the depth force matrix and the dual-channel shear force matrix of the corresponding side along the channel dimension. The intensity components of the left-side and right-side tactile response tensors, after deducting background noise, are calculated and used to generate the cross-modal cross-correlation coefficient;

[0026] S52 extracts the temporal nonlinear decay features of the intensity component, integrates and constructs a full-cycle tactile intensity time-series curve, locates the enhancement stage interval on the time-series curve, identifies the nonlinear collapse segment by jointly detecting curvature abrupt changes and the duration of monotonous intensity decrease after local extrema, calculates and outputs the temporal envelope decay factor of the identified nonlinear collapse segment.

[0027] S53, based on the normalized decay rate extracted from the temporal nonlinear decay features, the left-side and right-side tactile response tensors, and the cross-modal cross-correlation coefficient, calculate a multimodal tactile spatiotemporal consistency operator for quantitative assessment, using the following formula:

[0028] ;

[0029] where the multimodal tactile spatiotemporal consistency operator serves as a comprehensive steady-state evaluation metric that quantifies the degree of physical consistency between the robot's end effector and the manipulated object in terms of spatial distribution symmetry and temporal dynamic envelope. Based on the mapping between its real-time value and preset spatial bias thresholds, the grasping state is automatically classified as an optimal sample, a pose-bias sub-optimal sample, or a failure negative sample, and robot action correction labels are generated accordingly. The formula further contains a preset small positive constant that ensures numerical stability and a dynamic sensitivity weight that adapts to objects of different rigidity.

[0030] In another aspect, a distributed acquisition system for multimodal grasping data oriented towards embodied intelligence includes:

[0031] The collaborative environment construction module establishes a cross-system collaborative environment, limits the system clock deviation between the Linux master control terminal and the Windows visual slave control terminal to within a preset threshold through a clock synchronization protocol, and establishes a start and stop control protocol based on UDP protocol.

[0032] The grasping module, in a cross-system collaborative environment, performs manual dragging teaching on a specific object, and records the golden pose at the moment of successful grasping and the tactile increment intensity relative to the background noise.

[0033] The random sampling module, based on the golden pose and tactile incremental intensity, generates pose perturbations and force fluctuations containing translation and rotation deviations through random sampling, and constructs non-optimal grasping conditions as skill generalization samples.

[0034] The data acquisition module, based on skill generalization samples, drives the Linux master control terminal to execute the grasping process, collects tactile data and robot body perception data in real time, and sends trigger commands to the Windows slave control terminal through a UDP-based start and stop control protocol, and synchronously records the visual video stream at the corresponding moment.

[0035] The label output module automatically determines the grasping state based on the collected tactile data, and calculates the transformation matrix between the current grasping pose and the golden pose based on the current grasping state. The transformation matrix is then output as the model correction label.

[0036] The present invention adopts the above technical solution and has the following beneficial effects:

[0037] (1) This invention controls the clock deviation between the Linux master control terminal and the Windows vision slave control terminal within 2ms through the Chrony protocol, and combines the preheating asynchronous writing mechanism and UDP triggering protocol to ensure that there are no dropped frames at the beginning of the visual video stream and that the timestamps of all modalities (tactile / body / vision) are strictly unified, which significantly improves the quality and reliability of the multi-source synchronous data required for VLA model training.

[0038] (2) This invention collects gold samples and actively constructs non-optimal grasping conditions through pose six degrees of freedom perturbation and force random fluctuation. Based on the tactile physical field and kinematic virtual filling mechanism, it ensures the continuity of the ontology data during the dynamic control period and generates high-quality generalized samples that cover real challenges such as pose bias, contact imbalance and rigidity adaptation.

[0039] (3) Through the multimodal tactile spatiotemporal consistency operator, a quantitative indicator that integrates bilateral tactile symmetry, temporal decay dynamics, and cross-modal cross-correlation, the present invention adaptively distinguishes three types of grasping states: optimal, suboptimal, and failure. It also directly outputs the pose correction transformation matrix as a supervision label, which greatly reduces the cost of manual annotation and supports the progression of embodied intelligence models from passive imitation to closed-loop feedback fine-tuning.

Description of the Drawings

[0040] Figure 1 is a flowchart of the distributed acquisition method for multimodal grasping data oriented towards embodied intelligence, according to an embodiment of the present invention;

[0041] Figure 2 is a hardware schematic diagram of a multimodal grasping system for embodied intelligence, according to an embodiment of the present invention;

[0042] Figure 3 is a schematic diagram of a distributed multimodal data stream synchronization architecture, according to an embodiment of the present invention;

[0043] Figure 4 is a flowchart of the automated sample generation and embodied quality determination process, according to an embodiment of the present invention;

[0044] Figure 5 is a diagram of the distributed acquisition system for multimodal grasping data oriented towards embodied intelligence, according to an embodiment of the present invention.

Detailed Implementation

[0045] The present invention will be further described in detail below with reference to the embodiments and accompanying drawings, but the embodiments of the present invention are not limited thereto.

[0046] As shown in Figure 1, the present invention provides a distributed acquisition method for multimodal grasping data oriented towards embodied intelligence, comprising:

[0047] S1 establishes a cross-system collaborative environment, limits the system clock deviation between the Linux master control terminal and the Windows visual slave control terminal to within a preset threshold through a clock synchronization protocol, and establishes a start and stop control protocol based on UDP protocol.

[0048] Specifically, the Windows vision slave uses a preheating asynchronous write logic. After synchronizing its clock with the Linux master, the Windows vision slave starts the camera and maintains continuous video streaming, but does not write to the disk immediately. When it receives a start command from the Linux master via the established UDP control protocol, it immediately starts an independent thread that writes all subsequent image frames, together with their hardware-generated timestamps, to local storage, using the timestamp of the start command as the reference. This ensures that the vision data is free of frame loss and that the timeline zero point is strictly aligned at the start of the task.

[0049] Specifically, the clock synchronization protocol is the Chrony timing protocol, which limits the system clock deviation to within 2ms.
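The 2 ms bound can be checked with a simple offset estimate. Chrony is an NTP implementation, so the sketch below uses the standard NTP offset and round-trip formulas on the four timestamps of one request/response exchange. The `Exchange` type, the function names, and the idea of running this check before each acquisition cycle are illustrative assumptions, not part of the described method.

```python
from dataclasses import dataclass

MAX_OFFSET_S = 0.002  # the 2 ms bound stated in the text


@dataclass
class Exchange:
    """Four timestamps of one request/response round trip, in seconds."""
    t1: float  # request sent      (slave clock)
    t2: float  # request received  (master clock)
    t3: float  # response sent     (master clock)
    t4: float  # response received (slave clock)


def clock_offset(x: Exchange) -> float:
    """Master-minus-slave offset, assuming a symmetric network delay."""
    return ((x.t2 - x.t1) + (x.t3 - x.t4)) / 2.0


def round_trip_delay(x: Exchange) -> float:
    """Time spent on the network, excluding the master's processing time."""
    return (x.t4 - x.t1) - (x.t3 - x.t2)


def within_budget(x: Exchange, limit: float = MAX_OFFSET_S) -> bool:
    """True when the estimated offset is inside the allowed deviation."""
    return abs(clock_offset(x)) <= limit
```

The estimate is only as good as the symmetry assumption: an asymmetric path biases the offset by up to half the round-trip delay, which is one reason to keep both machines on the same switch.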

[0050] Specifically, the Windows vision slave terminal adopts a "preheating" asynchronous write logic, which means keeping the camera's underlying pipeline active before the task is triggered. After receiving the start signal, it uses an independent multi-threaded disk writer to synchronously map the image frame with the real-time system timestamp to eliminate data loss caused by cold start delay.
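The preheating logic can be illustrated with a small recorder that consumes frames all the time but persists them only after the start command. This is a minimal sketch under stated assumptions: the class name `WarmupRecorder`, the `sink` callback standing in for the disk writer, and the choice to store timestamps relative to the start time are all hypothetical.

```python
import queue
import threading


class WarmupRecorder:
    """Consumes frames while idle; persists them only after start()."""

    def __init__(self, sink):
        self._sink = sink                    # callable(rel_ts, frame), e.g. a disk writer
        self._queue = queue.Queue()
        self._recording = threading.Event()
        self._t0 = None
        self._writer = None

    def start(self, t0):
        """Called when the start command arrives; t0 becomes the time-axis zero."""
        self._t0 = t0
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()
        self._recording.set()

    def on_frame(self, timestamp, frame):
        """Camera callback: warm-up frames are dropped, later frames are queued."""
        if self._recording.is_set() and timestamp >= self._t0:
            self._queue.put((timestamp - self._t0, frame))

    def stop(self):
        """Stop accepting frames and wait until queued frames are written."""
        self._recording.clear()
        self._queue.put(None)                # sentinel: tells the writer to finish
        self._writer.join()

    def _drain(self):
        """Writer thread: the camera callback never blocks on storage."""
        while True:
            item = self._queue.get()
            if item is None:
                break
            self._sink(*item)
```

Because the camera pipeline is already running when `start` is called, the first recorded frame does not pay the cold-start cost, and the queue decouples capture from disk latency.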

[0051] S2, in a cross-system collaborative environment, performs manual dragging teaching on a specific object, recording the golden pose at the moment of successful grasping and the tactile increment intensity relative to the background noise.

[0052] S3, based on the golden pose and tactile incremental intensity, generates pose perturbations and force fluctuations containing translation and rotation deviations through random sampling, and constructs non-optimal grasping conditions as skill generalization samples.

[0053] Specifically, the pose disturbance is based on the golden pose and is generated by a six-degree-of-freedom rigid body transformation that includes translational deviation and rotational deviation around the axis; the force fluctuation is based on the reference gripping force and is a random force value fluctuation injected according to a preset ratio.

[0054] S4, based on skill generalization samples, drives the Linux master control terminal to execute the grasping process, collects tactile data and robot body perception data in real time, and sends trigger commands to the Windows slave control terminal through a UDP-based start and stop control protocol, and synchronously records the visual video stream at the corresponding moment.

[0055] Specifically, the collected tactile data and robot proprioception data include: tactile physical field, robot proprioception stream, and visual image stream. The tactile physical field, robot proprioception stream, and visual image stream are aligned offline using the nearest neighbor rule through a unified system timestamp, and exported in Zarr or HDF5 format conforming to the LeRobot data protocol for supervised fine-tuning training of VLA multimodal large models.
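The nearest neighbor rule can be sketched as follows: for every timestamp of a reference stream, take the closest sample from each other stream. The function names, the dictionary layout of `streams`, and the `max_gap` tolerance (rows with no sufficiently close sample are discarded) are assumptions for illustration; the export to Zarr or HDF5 is left out.

```python
from bisect import bisect_left


def nearest_index(sorted_ts, t):
    """Index of the timestamp closest to t (ties go to the earlier sample)."""
    i = bisect_left(sorted_ts, t)
    if i == 0:
        return 0
    if i == len(sorted_ts):
        return len(sorted_ts) - 1
    return i - 1 if t - sorted_ts[i - 1] <= sorted_ts[i] - t else i


def align(reference_ts, streams, max_gap=0.02):
    """Pick, for each reference timestamp, the nearest sample of every stream.

    streams maps a name to (sorted timestamps, samples). A row is kept only
    if every stream has a sample within max_gap seconds (assumed tolerance).
    """
    rows = []
    for t in reference_ts:
        row = {"t": t}
        for name, (ts, samples) in streams.items():
            j = nearest_index(ts, t)
            if abs(ts[j] - t) > max_gap:
                row = None
                break
            row[name] = samples[j]
        if row is not None:
            rows.append(row)
    return rows
```

Choosing the highest-rate stream as the reference keeps every sample of that stream, while lower-rate streams are repeated as needed.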

[0056] Specifically, the tactile data is collected by optical tactile sensors on both the left and right sides. The Linux host terminal packages the physical information acquired by the tactile sensors into a 32-bit floating-point raw array stream and publishes it, including a depth force matrix in 32FC1 format and a dual-channel shear force matrix in 32FC2 format, which are used to preserve the fine physical field characteristics during the interaction process of objects.

[0057] Specifically, the process of real-time acquisition of tactile data and robot body perception data includes collaborative support steps:

[0058] S41, monitor the buffer level and blocking status of the gripper control command queue and the serial port status feedback queue in real time; when the mean value of the tactile sensor depth force matrix exceeds 200% of the noise floor threshold for 3 consecutive frames and the absolute value of the gripper current signal slope is greater than the preset threshold, determine that the gripper has entered the dynamic control cycle and trigger the bus arbiter to suspend the status monitoring thread's periodic reads of the physical serial port.

[0059] S42, during the suspension period of the state monitoring thread's periodic read request to the physical serial port, extract the joint target position and velocity sequence from the control command stream, input it into the preset gripper kinematic transfer function model, perform numerical integration through the fourth-order Runge-Kutta method, generate a virtual pose sequence that is strictly aligned with the timestamp of the robot's body perception data, and use the virtual pose sequence as virtual spacetime filling for the collected robot body perception data stream during the suspension period;

[0060] S43, when the pose change rate of the robot's body perception data is less than 0.01 rad / s for 5 consecutive frames and the standard deviation of tactile intensity is less than 10% of the noise floor, it is determined that the gripper has entered the static locking stage. The bus arbiter restores the physical serial port reading permission and uses the latest physical pose feedback value as a benchmark to calculate the pose residual between the virtual pose sequence and the latest physical pose feedback value in the most recent 100ms window using the exponential weighted recursive least squares method. The bias compensation parameters of the shadow observer are updated to complete the drift correction.
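Steps S41 to S43 can be sketched in three small pieces: a consecutive-frame trigger, a fourth-order Runge-Kutta integrator that produces the virtual sequence, and an exponentially weighted recursive least squares estimate of a constant bias. The text does not give the gripper transfer function, so the first-order lag model, its time constant `tau`, the forgetting factor `lam`, and all names below are assumptions.

```python
class ConsecutiveTrigger:
    """Fires once a condition has held for n consecutive frames."""

    def __init__(self, n):
        self.n, self.count = n, 0

    def update(self, condition):
        self.count = self.count + 1 if condition else 0
        return self.count >= self.n


def rk4_step(f, x, u, dt):
    """One classical fourth-order Runge-Kutta step for dx/dt = f(x, u)."""
    k1 = f(x, u)
    k2 = f(x + 0.5 * dt * k1, u)
    k3 = f(x + 0.5 * dt * k2, u)
    k4 = f(x + dt * k3, u)
    return x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0


def first_order_lag(tau):
    """Hypothetical gripper model: position follows its target with lag tau."""
    return lambda x, u: (u - x) / tau


def virtual_positions(x0, targets, dt, tau=0.05):
    """Model-predicted positions, one per tick of the suspended interval."""
    f, x, out = first_order_lag(tau), x0, []
    for u in targets:
        x = rk4_step(f, x, u, dt)
        out.append(x)
    return out


def ewrls_bias(residuals, lam=0.9):
    """Exponentially weighted RLS estimate of a constant bias (regressor = 1)."""
    theta, p = 0.0, 1e6            # large initial covariance: trust the data
    for r in residuals:
        k = p / (lam + p)          # gain
        theta += k * (r - theta)   # correct the estimate with the new residual
        p = (p - k * p) / lam      # covariance update with forgetting
    return theta
```

In use, the residuals passed to `ewrls_bias` would be the differences between virtual and measured positions inside the most recent 100 ms window, and the returned bias would be subtracted from later virtual values.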

[0061] S5 automatically determines the grasping state based on the collected tactile data, and calculates the transformation matrix between the current grasping pose and the golden pose based on the current grasping state, and outputs the transformation matrix as the model correction label.
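The correction label is a rigid transformation between two poses. The sketch below represents each pose as a 4×4 homogeneous matrix (nested lists) and defines the label as the transform that, applied after the current pose, yields the golden pose. That composition order is an assumed convention; the text does not state which frame the label is expressed in.

```python
def matmul(a, b):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t):
    """Inverse of [[R, p], [0, 1]] is [[R^T, -R^T p], [0, 1]]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    q = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [q[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def correction_label(t_current, t_gold):
    """Transform C such that t_current @ C == t_gold (assumed convention)."""
    return matmul(invert_rigid(t_current), t_gold)
```

When the current pose already equals the golden pose, the label is the identity matrix, which is a convenient sanity check for the recorded data.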

[0062] Specifically, the grasping state is automatically determined based on the collected tactile data, including:

[0063] S51, based on the pressure field matrix acquired in real time by the dual-sided optical tactile sensors, construct left-side and right-side tactile response tensors respectively. Each tensor is formed by splicing the depth force matrix and the dual-channel shear force matrix of the corresponding side along the channel dimension. The intensity components of the left-side and right-side tactile response tensors, after deducting background noise, are calculated and used to generate the cross-modal cross-correlation coefficient;

[0064] S52 extracts the temporal nonlinear decay features of the intensity component, integrates and constructs a full-cycle tactile intensity time-series curve, locates the enhancement stage interval on the time-series curve, identifies the nonlinear collapse segment by jointly detecting curvature abrupt changes and the duration of monotonous intensity decrease after local extrema, calculates and outputs the temporal envelope decay factor of the identified nonlinear collapse segment.

[0065] S53, based on the normalized decay rate extracted from the temporal nonlinear decay features, the left-side and right-side tactile response tensors, and the cross-modal cross-correlation coefficient, calculate a multimodal tactile spatiotemporal consistency operator for quantitative assessment, using the following formula:

[0066] ;

[0067] where the multimodal tactile spatiotemporal consistency operator serves as a comprehensive steady-state evaluation metric that quantifies the degree of physical consistency between the robot's end effector and the manipulated object in terms of spatial distribution symmetry and temporal dynamic envelope. Based on the mapping between its real-time value and preset spatial bias thresholds, the grasping state is automatically classified as an optimal sample, a pose-bias sub-optimal sample, or a failure negative sample, and robot action correction labels are generated accordingly. The formula further contains a preset small positive constant that ensures numerical stability and a dynamic sensitivity weight that adapts to objects of different rigidity.
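The ingredients of S51 and S52 can be sketched without the operator itself, whose formula is not available here. The code below shows channel splicing of one tactile frame, a Pearson coefficient as one plausible reading of the left/right cross-correlation, and a simple collapse detector. The detection thresholds (`min_len`, `curv_min`) and the definition of the decay factor as an exponential rate are assumptions.

```python
import math


def splice(depth, shear):
    """Channel concatenation: depth (H x W) + shear (H x W x 2) -> H x W x 3."""
    return [[[d] + list(s) for d, s in zip(drow, srow)]
            for drow, srow in zip(depth, shear)]


def pearson(x, y):
    """Correlation of two equally long intensity series (0.0 if either is flat)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def collapse_segment(series, dt, min_len=3, curv_min=1.0):
    """First local maximum followed by a sufficiently long monotonic decrease.

    Returns (peak index, end index, exponential decay rate) or None.
    """
    for p in range(1, len(series) - 1):
        if not (series[p - 1] < series[p] > series[p + 1]):
            continue                                   # not a local maximum
        end = p
        while end + 1 < len(series) and series[end + 1] < series[end]:
            end += 1                                   # extend the decreasing run
        curvature = max(abs(series[i - 1] - 2 * series[i] + series[i + 1])
                        for i in range(p, end))        # largest second difference
        if end - p >= min_len and curvature >= curv_min and series[end] > 0:
            rate = math.log(series[p] / series[end]) / ((end - p) * dt)
            return p, end, rate
    return None
```

For example, the series `[0, 2, 5, 9, 10, 8, 4, 2, 1, 1]` sampled at 0.1 s peaks at index 4 and decreases until index 8, giving a rate of ln(10) / 0.4.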

[0068] Specifically, the hardware connection of this embodiment is shown in Figure 2. High-fidelity continuous data production is achieved through a high degree of decoupling between hardware and software. The core hardware carrier includes a work platform equipped with a six-DOF UR5 collaborative robot. The robotic arm's end effector is a Dahuan AG95 two-finger translational gripper, with two optical tactile sensors attached symmetrically to the inner sides of the gripper's fingertips to capture precise contact force fields. In terms of computing architecture, this embodiment employs a dual-machine distributed deployment scheme. The first computing terminal is a Linux host running Ubuntu, responsible for closed-loop motion control of the robotic arm through the Real-Time Data Exchange (RTDE) interface, tactile data stream parsing, and ROS master node communication. The second computing terminal is a slave host running Windows, whose graphics processing capabilities are used for real-time encoding and asynchronous disk dumping of the RealSense camera group's video streams. Two cameras are used: a D435i mounted at the robotic arm wrist provides a hand-eye view, and a D455 at the side of the workspace provides a third-person panoramic view.

[0069] Specifically, the initialization of the distributed collaborative environment is fundamental to ensuring the spatiotemporal alignment of multimodal data. The system's distributed multimodal data stream synchronization architecture is shown in Figure 3. The Linux host, Windows host, and robot control box are connected to the same subnet via a gigabit Ethernet switch, and static IP addresses are configured for mutual communication. To eliminate cross-system clock drift, a Chrony timing server is deployed on the Linux side, and a NetTime synchronization client is deployed on the Windows side. High-frequency heartbeat packets are used to force the clock deviation between the two systems to be locked within 2 milliseconds. During the sensor warm-up phase, the Windows slave terminal starts the camera's underlying pipeline first, keeping the video buffer continuously refreshed to avoid automatic exposure adjustments and frame rate jitter caused by the RealSense hardware during cold starts. At the same time, the haptic node on the Linux side encapsulates the original gel deformation image into a 32-bit floating-point RAW array and publishes depth force field (32FC1 format) and shear force field (32FC2 format) data through ROS topics. For fine manipulation of small objects such as cherry tomatoes, the system uses a bilinear interpolation algorithm to downsample the haptic array to 160×120 resolution, significantly reducing the bus load of distributed communication while ensuring the integrity of physical features. The data recording process employs distributed synchronous triggering logic and a "shadow command" communication avoidance mechanism. When the Linux master control terminal starts a new acquisition cycle, it sends a feature string command to the Windows slave control terminal via the UDP protocol, achieving synchronous initiation of visual video recording and Rosbag physical feature recording.
During the gripper's tactile feedback-based closing action, the system detects a surge in serial port communication load. At this time, the shadow command synchronization mechanism is automatically activated: the state acquisition thread temporarily suspends the cyclic reading operation of the gripper's physical position register and directly uses the command pulse value currently sent by the master control thread to fill the gripper position field in the robot's body perception state vector. This mechanism effectively solves the communication blockage caused by read / write command conflicts in high-speed control cycles, ensuring the continuity of the gripper closing action and the equidistant sampling of the robot's state stream (including 6-axis joint angles, 6-axis TCP poses, and 1-axis gripper position).
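The trigger command and the shadow command filling can be sketched as below. The text only says that a feature string is sent over UDP, so the `NAME|timestamp` message layout, the function names, and the dictionary form of the robot state are assumptions made for illustration.

```python
import socket
import time


def encode_command(name, timestamp):
    """Serialize a command as 'NAME|seconds' (hypothetical wire format)."""
    return f"{name}|{timestamp:.6f}".encode("ascii")


def decode_command(payload):
    """Inverse of encode_command: returns (name, timestamp)."""
    name, ts = payload.decode("ascii").split("|")
    return name, float(ts)


def send_command(name, addr, timestamp=None):
    """Send one datagram to the vision slave at addr = (host, port)."""
    stamp = time.time() if timestamp is None else timestamp
    payload = encode_command(name, stamp)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, addr)
    return payload


def fill_gripper_position(state, measured, commanded, bus_busy):
    """Use the commanded value while the serial bus is busy, else the reading.

    Returns a new state dict; the source tag records which value was used.
    """
    value, source = (commanded, "shadow") if bus_busy else (measured, "measured")
    return {**state, "gripper_pos": value, "gripper_src": source}
```

Tagging each sample with its source keeps the substitution visible in the dataset, so shadow-filled samples can be filtered or corrected offline. UDP gives no delivery guarantee, so a real deployment would likely add an acknowledgement or a repeated send.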

[0070] Specifically, in this embodiment of the invention, multi-dimensional demonstration benchmark teaching is performed for objects with different geometric shapes and physical properties. For highly rigid cylinders such as canned cola, fragile small spheres such as cherry tomatoes, and irregular flexible objects such as bananas, the operator uses a drag teaching mode to guide the robotic arm's end effector to the optimal grasping pose P_gold. Subsequently, the operator triggers the gripper to close via keyboard control commands and monitors, in real time, the intensity increment in the tactile data stream after background noise is deducted. Taking a high-rigidity canned cola as an example, the system records the total tactile intensity increment F_gold (e.g., 12,000 units) when a stable gripping state is reached, and stores the current body perception pose in the gold sample library. During this process, the system automatically calibrates the background white noise of the tactile sensor in real time. By sampling the static initial value of each pixel before grasping, it ensures that all subsequent mechanical labels are based on pure physical interaction increments, eliminating zero-point drift caused by changes in ambient light or sensor self-heating.
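
As an illustration of the background deduction described above, the following Python sketch computes a per-pixel static baseline and a total intensity increment. The function names are hypothetical, and two details are assumptions not stated in the source: the baseline is taken as the per-pixel mean of several pre-contact frames, and negative differences are clipped to zero.

```python
from typing import List, Sequence

Frame = List[List[float]]  # one tactile field as rows of pixel values


def static_baseline(frames: Sequence[Frame]) -> Frame:
    """Per-pixel mean of frames captured before contact (assumed calibration rule)."""
    if not frames:
        raise ValueError("at least one pre-contact frame is required")
    rows, cols, n = len(frames[0]), len(frames[0][0]), len(frames)
    return [[sum(f[r][c] for f in frames) / n for c in range(cols)]
            for r in range(rows)]


def intensity_increment(frame: Frame, baseline: Frame) -> float:
    """Total increase over the baseline; decreases are clipped to zero (assumption)."""
    return sum(max(v - b, 0.0)
               for frame_row, base_row in zip(frame, baseline)
               for v, b in zip(frame_row, base_row))
```

Under these assumptions, the value returned by `intensity_increment` at the stable gripping state would play the role of F_gold.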

[0071] Specifically, in this embodiment of the invention, an automated sample generalization mechanism is used to construct the "optimal-non-optimal" sample pairs required for embodied intelligence training. Starting from the golden pose P_gold, the system injects random noise within a preset range into the SE(3) six-degree-of-freedom space, including a positional deviation of ±1.5 cm and a rotational offset of ±10 degrees. For objects with axisymmetric characteristics, such as canned cola, the sampling logic introduces a rotational constraint mechanism that keeps the rotation around the approach axis (X-axis) fixed at zero and generates random attitude perturbations only in the pitch and yaw dimensions (Y-axis, Z-axis). In the force dimension, the system generates random target forces by applying a scaling factor of 0.6 to 1.4 to F_gold. This physically constrained random sampling strategy simulates the perceptual biases that embodied agents may encounter in real-world deployments, providing crucial negative sample data for VLA models such as pi0 to learn how to recover from a non-ideal initial state to a stable grasping state.
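
The sampling strategy above can be sketched in Python as follows. The ranges (±1.5 cm, ±10 degrees, force scale 0.6 to 1.4) come from the text; the uniform distribution, the per-axis independence, the Z-Y-X rotation order, and all names are assumptions made for illustration.

```python
import math
import random
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]


@dataclass
class Perturbation:
    translation_m: List[float]  # x, y, z offsets in metres
    rotation_rad: List[float]   # rotations about X, Y, Z in radians
    force_scale: float          # multiplier applied to the golden force


def sample_perturbation(rng: random.Random, axisymmetric: bool = False,
                        max_shift_m: float = 0.015, max_angle_deg: float = 10.0,
                        force_range=(0.6, 1.4)) -> Perturbation:
    """Draw one perturbation; uniform sampling per axis is an assumption."""
    limit = math.radians(max_angle_deg)
    translation = [rng.uniform(-max_shift_m, max_shift_m) for _ in range(3)]
    rotation = [rng.uniform(-limit, limit) for _ in range(3)]
    if axisymmetric:
        rotation[0] = 0.0  # no rotation about the approach (X) axis
    return Perturbation(translation, rotation, rng.uniform(*force_range))


def to_matrix(p: Perturbation) -> Matrix:
    """4x4 homogeneous transform with R = Rz @ Ry @ Rx (assumed convention)."""
    rx, ry, rz = p.rotation_rad
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    rot = [[cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
           [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
           [-sy, cy * sx, cy * cx]]
    tx, ty, tz = p.translation_m
    return [rot[0] + [tx], rot[1] + [ty], rot[2] + [tz], [0.0, 0.0, 0.0, 1.0]]
```

The target pose for one sampling cycle would then be obtained by composing the golden pose matrix with `to_matrix(p)`, and the target force by multiplying F_gold by `p.force_scale`.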

[0072] Specifically, in this embodiment of the invention, the automatic labeling logic for embodied interaction quality is achieved through real-time mechanical feature comparison. During the static holding phase after the object is lifted to a preset height by the robotic arm, the tactile intensity envelope is continuously recorded. If the total tactile intensity at the end of the holding phase is less than 5% of the intensity at the moment of grasping, the system automatically determines a "fall failure," marks the current sample as a failure case, and triggers emergency protection logic, canceling subsequent return actions and directly executing safe evacuation. In addition, the system performs "single-sided contact" determination by comparing the real-time intensity deviation of the left and right fingertip tactile sensors. If the intensity on one side exceeds the contact threshold (e.g., 0.01) while the other side is below the noise floor threshold (e.g., 0.002), the sample is recorded as a bias failure state. Before the task ends, the system automatically calculates correction labels, including the inverse transformation matrix of the current pose matrix relative to the golden pose matrix, and the difference between the actual achieved steady-state force and the golden teaching force. This label information, along with the 13-dimensional proprioception vector, is uniformly written into a binary data packet.
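
The two judgments above reduce to threshold comparisons, sketched below in Python. The thresholds (5%, 0.01, 0.002) are the example values from the text; the label names, the order in which the checks run, and the assumption that the per-finger intensities are expressed on the same scale as the thresholds are illustrative choices.

```python
from enum import Enum


class GraspLabel(Enum):
    SUCCESS = "success"
    FALL_FAILURE = "fall_failure"
    SINGLE_SIDED_CONTACT = "single_sided_contact"


def label_grasp(grasp_intensity: float, hold_end_intensity: float,
                left_intensity: float, right_intensity: float,
                fall_ratio: float = 0.05, contact_threshold: float = 0.01,
                noise_floor: float = 0.002) -> GraspLabel:
    """Classify one sample; checking the fall condition first is an assumption."""
    # Fall: intensity at the end of holding dropped below 5% of the grasp value.
    if grasp_intensity > 0 and hold_end_intensity < fall_ratio * grasp_intensity:
        return GraspLabel.FALL_FAILURE
    # Single-sided contact: one finger clearly in contact, the other at noise level.
    stronger = max(left_intensity, right_intensity)
    weaker = min(left_intensity, right_intensity)
    if stronger > contact_threshold and weaker < noise_floor:
        return GraspLabel.SINGLE_SIDED_CONTACT
    return GraspLabel.SUCCESS
```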

[0073] Specifically, the multimodal trajectory data generated in this embodiment of the invention is converted into a standardized embodied intelligence format through a post-processing unit. By performing nearest-neighbor alignment between the MP4 video stream with system timestamp index stored on the Windows side and the Rosbag physical feature stream recorded on the Linux side, timestamp matching ensures strict temporal consistency between visual frames and the tactile depth field, shear force field, and robot state vector. Taking over 100 collected interaction trajectories including cherry tomatoes, bananas, and canned cola as an example, the processing program converts them into a Zarr or HDF5 format dataset conforming to the Hugging Face LeRobot standard. This dataset not only preserves the original physical interaction details but also contains precise motion correction labels, which can be directly used for supervised fine-tuning of large VLA models, significantly improving the embodied agent's grasping robustness and skill generalization level when facing unknown objects and uncertain working conditions.
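
The nearest-neighbor alignment step can be sketched in Python with a binary search over sorted timestamps. The function name is hypothetical, and the sketch assumes both streams carry timestamps in seconds on the already synchronized clock, with the sensor timestamps sorted in ascending order.

```python
from bisect import bisect_left
from typing import List, Sequence


def nearest_indices(frame_times: Sequence[float],
                    sample_times: Sequence[float]) -> List[int]:
    """For each video frame time, return the index of the closest sensor sample.

    sample_times must be sorted in ascending order.
    """
    if not sample_times:
        raise ValueError("sample_times must not be empty")
    matches = []
    for t in frame_times:
        i = bisect_left(sample_times, t)
        if i == 0:
            matches.append(0)
        elif i == len(sample_times):
            matches.append(len(sample_times) - 1)
        else:
            before, after = sample_times[i - 1], sample_times[i]
            # Ties go to the earlier sample.
            matches.append(i if after - t < t - before else i - 1)
    return matches
```

In practice a maximum tolerated gap (for example a few milliseconds) would likely be enforced so that frames without a sufficiently close sample are dropped; that threshold is not specified in the source.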

[0074] Specifically, the detailed process for executing automated sample generation and interaction quality assessment is shown in Figure 4. This process aims to automatically generate training samples covering different working conditions from a single manual teaching session. First, the operator performs interactive baseline teaching on a specific object (such as a cherry tomato, banana, or canned cola). The system monitors the tactile topics to lock onto the golden pose P_gold and the incremental intensity F_gold at the moment of successful object grasping. The system then enters N repeated sampling cycles. For each automated sample, the system first injects random perturbations into the SE(3) space to generate a noise matrix containing multi-axis positional and rotational biases, for example, performing rotation-constrained sampling around the axis of symmetry for a cola can. The system then drives the robot to the target noisy pose P_target and simultaneously starts cross-system multimodal recording.

[0075] Specifically, as shown in the judgment logic of Figure 4, after the robot completes the lifting action and enters the static holding phase, the system evaluates the embodied interaction quality in real time. The judgment logic proceeds in parallel along two physical dimensions. The first is the fall judgment, where the system compares the tactile intensity envelope at the end of the holding period with that at the moment of grasping. If the intensity attenuation ratio exceeds a preset threshold (e.g., 95% attenuation), the sample is automatically judged as a slip or fall failure. The second is the unilateral contact judgment, where the system calculates the real-time force difference between the left and right tactile sensors to identify whether a unilateral contact state has occurred due to excessive pose deviation. If the judgment result is a failure, the system marks the sample as a negative sample and forces a safe evacuation action; if the judgment result is a success, it returns the robot to its original position and marks the sample as a positive sample. Before the end of each cycle, the system automatically calculates the transformation matrix between the current pose and the golden pose as a correction label. Through this process, the system can efficiently produce high-quality interaction trajectories covering both successful grasps and various typical failure modes, spanning the precise force control of cherry tomatoes, the irregular grasping of bananas, and the rigid interaction of cola cans.
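
The correction label can be illustrated with a small Python sketch over 4x4 homogeneous matrices. The text does not fix the composition order, so the sketch assumes the label is inv(T_current) @ T_gold, that is, the golden pose expressed in the current end-effector frame; the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def invert_rigid(T: Matrix) -> Matrix:
    """Invert a rigid transform: rotation R^T, translation -R^T t."""
    rot_t = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(rot_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [rot_t[0] + [t_inv[0]], rot_t[1] + [t_inv[1]],
            rot_t[2] + [t_inv[2]], [0.0, 0.0, 0.0, 1.0]]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Product of two 4x4 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def correction_label(T_current: Matrix, T_gold: Matrix) -> Matrix:
    """Transform that maps the current pose onto the golden pose (assumed convention)."""
    return matmul(invert_rigid(T_current), T_gold)
```

With this convention, `matmul(T_current, correction_label(T_current, T_gold))` reproduces `T_gold`, so the label can be read as the motion the end effector still has to perform.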

[0076] In summary, this invention addresses the urgent need for high-quality, multimodal, spatiotemporally aligned datasets for large-scale VLA model training in the field of embodied intelligence, as well as the technical problems that existing single-machine acquisition solutions exhibit when facing high-bandwidth image streams and high-frequency tactile physical fields, such as I/O load imbalance, data frame loss, and spatiotemporal desynchronization due to heterogeneous sources. By constructing a distributed heterogeneous computing architecture with a Linux master control terminal and a Windows vision slave control terminal, the computationally intensive real-time video encoding and the highly real-time robot closed-loop control are physically isolated, eliminating hardware resource conflicts during high-throughput multimodal data acquisition and ensuring the integrity of the acquisition sequence. Furthermore, by combining the Chrony and NTP network clock synchronization protocols with UDP synchronous trigger signals, the system achieves close alignment of cross-system and cross-modal data on a microsecond-level time axis, providing a highly reliable time reference for embodied agents to learn the fine correlation between vision, touch, and action. The system introduces a shadow command synchronization mechanism and an SE(3) spatial noise injection strategy, which solve the bottleneck of multi-threaded preemption of serial communication and realize the automated production of "optimal-non-optimal" sample pairs for objects with diverse physical characteristics such as cherry tomatoes, bananas, and canned cola.
The system integrates automatic discrimination logic for slippage, fall, and one-sided contact based on mechanical feedback features, and can autonomously produce standardized embodied intelligence datasets with high-precision pose and force correction labels, and is fully compatible with mainstream pre-training protocols such as LeRobot. This invention greatly reduces the production cost of embodied interaction data, preserves high-fidelity original physical field details, and provides solid multimodal underlying data support for building generalized embodied intelligence with strong generalization capabilities.

[0077] Specifically, in this embodiment, the data acquisition process can also be completed through the following steps:

[0078] Cross-system collaborative environment initialization: The clock deviation between Linux and Windows is limited to within 2 ms through the Chrony time synchronization protocol, and a UDP-based START/STOP control protocol is established;

[0079] Interactive baseline teaching: Manual drag teaching for a specific object, recording the perfect grasping pose P_gold and the tactile increment intensity F_gold relative to the noise floor;

[0080] Skill generalization sample generation: Based on P_gold, disturbance poses containing translation and axial rotation deviations are generated by random sampling, and force fluctuations are injected into F_gold according to a preset ratio to construct non-optimal grasping conditions;

[0081] Distributed synchronous data capture: The master control terminal executes the capture process and records tactile and proprioceptive data, and synchronously sends network commands to trigger the slave control terminal to record the visual video stream at the corresponding moment;

[0082] Embodied Interaction Automatic Judgment: Based on real-time tactile feedback, the system automatically determines the grab success state, sliding state, and one-sided contact state, and calculates the SE(3) transformation matrix between the current pose and the golden pose as the model correction label.
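
The UDP-based START/STOP control protocol named in the steps above could look like the following Python sketch. The source only states that a feature string command is sent over UDP; the `ACTION|episode|timestamp` layout, the function names, and the use of the sender's wall-clock time are hypothetical.

```python
import socket
import time
from typing import Tuple

VALID_ACTIONS = ("START", "STOP")


def encode_command(action: str, episode_id: int, timestamp: float) -> bytes:
    """Serialize a command as 'ACTION|episode|timestamp' (hypothetical layout)."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return f"{action}|{episode_id}|{timestamp:.6f}".encode("ascii")


def decode_command(payload: bytes) -> Tuple[str, int, float]:
    """Parse a command produced by encode_command."""
    action, episode, stamp = payload.decode("ascii").split("|")
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return action, int(episode), float(stamp)


def send_command(action: str, episode_id: int, address: Tuple[str, int]) -> None:
    """Send one command datagram to the vision slave at (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(encode_command(action, episode_id, time.time()), address)
```

On the master side, a cycle would call `send_command("START", n, slave_address)` immediately before starting the Rosbag recording and `send_command("STOP", n, slave_address)` after the holding phase. Because UDP gives no delivery guarantee, a real deployment would probably add an acknowledgement or a retry, which the source does not describe.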

[0083] Specifically, in practical applications, this invention can be divided into the following parts:

[0084] Distributed computing architecture: It consists of a Linux master control terminal running the ROS system and robot motion control logic, and a Windows vision slave control terminal running a high-performance vision capture unit;

[0085] Robot execution unit: includes the collaborative robot and its end effector with two-finger translational gripper, used to perform grasping and lifting actions on objects of different shapes;

[0086] Multimodal sensing array: Two tactile sensors are equipped on the inside of the gripper's fingertips, and two RGB vision sensors are installed on the wrist of the robotic arm and in the third-person perspective, respectively, to capture the physical and visual features of the object during interaction.

[0087] Spatiotemporal synchronization communication unit: Used to achieve millisecond-level clock alignment between master and slave control terminals via local area network, and to achieve millisecond-level synchronous triggering of recording tasks using UDP protocol;

[0088] The main control unit packages the physical information acquired by the tactile sensor into a 32-bit floating-point raw array stream and publishes it, and realizes closed-loop force control based on the feedback from the tactile sensor through a subscription feedback mechanism.
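
As a rough illustration of the 32-bit floating-point raw array stream, the Python sketch below packs a tactile field into interleaved float32 bytes and restores it. One channel per pixel corresponds to the 32FC1 depth field and two channels to the 32FC2 shear field; the function names, the row-major interleaved layout, and the native byte order are assumptions, and the actual ROS message wrapping is not shown.

```python
from array import array
from typing import List, Sequence

Field = List[List[List[float]]]  # field[row][col] -> channel values


def pack_field(field: Sequence[Sequence[Sequence[float]]]) -> bytes:
    """Flatten row-major with interleaved channels into float32 bytes."""
    flat = array("f", (v for row in field for pixel in row for v in pixel))
    return flat.tobytes()


def unpack_field(data: bytes, rows: int, cols: int, channels: int) -> Field:
    """Rebuild the nested field from bytes produced by pack_field."""
    flat = array("f")
    flat.frombytes(data)
    if len(flat) != rows * cols * channels:
        raise ValueError("payload size does not match the requested shape")
    values = iter(flat)
    return [[[next(values) for _ in range(channels)] for _ in range(cols)]
            for _ in range(rows)]
```

At the 160×120 resolution mentioned in the description, a 32FC1 frame packed this way occupies 160 × 120 × 4 = 76,800 bytes and a 32FC2 frame twice that.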

[0089] In summary, this invention addresses the stringent requirements of embodied intelligent agents (such as VLA models) for high-quality, multimodal, and spatiotemporally aligned data by constructing a distributed architecture consisting of a Linux master control terminal and a Windows vision slave control terminal. The system, in conjunction with a collaborative robot, dual-channel high-resolution optical tactile sensors, and dual-channel visual sensors, achieves deep interactive data acquisition of objects of diverse shapes throughout the entire process of grasping, lifting, and holding. Through cross-system NTP clock synchronization and UDP triggering mechanisms, the I / O conflict between high-bandwidth vision and high-frequency tactile sampling is resolved; by utilizing golden sample teaching combined with SE(3) spatial noise injection technology, automated generation of "optimal-non-optimal" sample pairs is achieved. The system also integrates automatic labeling logic based on mechanical feedback for sliding and unilateral contact, and the generated data is fully compatible with embodied intelligent dataset standards such as LeRobot, effectively improving the training efficiency and generalization ability of embodied intelligent agents in grasping skills in complex environments.

[0090] In addition, the present invention also includes the following technical effects:

[0091] (1) By designing a distributed heterogeneous system, the computationally intensive video coding task and the control logic with extremely high real-time requirements are decoupled at the physical layer, which solves the hardware resource conflict during multimodal data acquisition and ensures the integrity of high-frequency sampling sequences above 30Hz and the consistency of the time axis.

[0092] (2) The high-fidelity physical field array storage technology is adopted to fully preserve the deformation characteristics and force distribution of the tactile sensor at the microscale, providing the underlying features that are superior to traditional grayscale images for the embodied intelligent agent to learn the fine physical interaction of objects.

[0093] (3) A closed-loop logic for automatic alignment of “optimal-non-optimal” samples was constructed. Combined with automatic labeling technology based on real-time force feedback, the production cost of embodied interactive data was greatly reduced, enabling the trained agent to have stronger autonomous correction and generalization capabilities when facing grasping bias.

[0094] (4) The introduced shadow instruction synchronization mechanism solves the original contradiction between serial communication protocol and concurrent sampling task at the hardware level, and significantly improves the smoothness and authenticity of robot state recording during complex interactive actions.

[0095] As shown in Figure 5, this invention also discloses a distributed data acquisition system for multimodal data capture oriented towards embodied intelligence, comprising:

[0096] Collaborative environment construction module 51 establishes a cross-system collaborative environment, limits the system clock deviation between the Linux master control terminal and the Windows visual slave control terminal to within a preset threshold through a clock synchronization protocol, and establishes a start and stop control protocol based on UDP protocol.

[0097] The grasping module 52, in a cross-system collaborative environment, performs manual dragging teaching for a specific object, and records the golden pose at the moment of successful grasping and the tactile increment intensity relative to the background noise.

[0098] The random sampling module 53, based on the golden pose and tactile incremental intensity, generates pose perturbations and force fluctuations containing translation and rotation deviations through random sampling, and constructs non-optimal grasping conditions as skill generalization samples.

[0099] The data acquisition module 54, based on skill generalization samples, drives the Linux master control terminal to execute the grasping process, collects tactile data and robot body perception data in real time, and sends trigger commands to the Windows slave control terminal through the start and stop control protocol based on UDP protocol, and synchronously records the visual video stream at the corresponding moment.

[0100] The label output module 55 automatically determines the grasping state based on the collected tactile data, and calculates the transformation matrix between the current grasping pose and the golden pose based on the current grasping state, and uses the transformation matrix as the model correction label output.

[0101] The specific implementation of the distributed data acquisition system for multimodal data capture oriented towards embodied intelligence described in this embodiment is the same as that of the distributed data acquisition method for multimodal data capture oriented towards embodied intelligence.

[0102] Although the invention has been specifically shown and described in conjunction with preferred embodiments, those skilled in the art should understand that various changes in form and detail may be made to the invention without departing from the spirit and scope of the invention as defined in the appended claims, all of which shall be within the scope of protection of the invention.

Claims

1. A distributed data acquisition method for multimodal data capture oriented towards embodied intelligence, characterized in that it comprises the following steps:
S1, establishing a cross-system collaborative environment, limiting the system clock deviation between the Linux master control terminal and the Windows vision slave control terminal to within a preset threshold through a clock synchronization protocol, and establishing a start and stop control protocol based on the UDP protocol;
S2, in the cross-system collaborative environment, performing manual drag teaching for a specific object, and recording the golden pose at the moment of successful grasping and the tactile increment intensity relative to the background noise;
S3, based on the golden pose and the tactile increment intensity, generating pose perturbations containing translation and rotation deviations and force fluctuations through random sampling, and constructing non-optimal grasping conditions as skill generalization samples;
S4, based on the skill generalization samples, driving the Linux master control terminal to execute the grasping process, collecting tactile data and robot body perception data in real time, sending trigger commands to the Windows vision slave control terminal through the UDP-based start and stop control protocol, and synchronously recording the visual video stream at the corresponding moment;
S5, automatically determining the grasping state based on the collected tactile data, calculating the transformation matrix between the current grasping pose and the golden pose based on the current grasping state, and outputting the transformation matrix as a model correction label.

2. The distributed data acquisition method for multimodal data capture oriented towards embodied intelligence according to claim 1, characterized in that, in S1, the Windows vision slave control terminal uses a warm-up asynchronous write logic. Specifically, after the Windows vision slave control terminal completes clock synchronization with the Linux master control terminal, it starts the camera and maintains continuous video streaming, but does not write to disk immediately. When it receives the start command sent by the Linux master control terminal through the established UDP control protocol, it immediately starts an independent thread that, taking the timestamp of that moment as the reference, synchronously writes all subsequent image frames along with their hardware-generated timestamps to local storage. This ensures that the vision data is free of frame loss and that the time axis zero point is strictly aligned at the start of the task.

3. The distributed data acquisition method for multimodal data capture oriented towards embodied intelligence according to claim 1, characterized in that, in S1, the clock synchronization protocol is the Chrony timing protocol, which limits the system clock deviation to within 2 ms.

4. The distributed data acquisition method for multimodal data capture oriented towards embodied intelligence according to claim 1, characterized in that, in S3, the pose perturbation is based on the golden pose and is generated by a six-degree-of-freedom rigid body transformation that includes translational deviation and rotational deviation around the axis; the force fluctuation is based on the reference gripping force and is a random force value fluctuation injected according to a preset ratio.

5. The distributed data acquisition method for multimodal data capture oriented towards embodied intelligence according to claim 1, characterized in that, in S4, the collected data specifically include: a tactile physical field, a robot proprioception stream, and a visual image stream. The tactile physical field, robot proprioception stream, and visual image stream are aligned offline according to the nearest neighbor rule using a unified system timestamp, and exported in Zarr or HDF5 format conforming to the LeRobot data protocol for supervised fine-tuning of VLA multimodal large models.

6. The distributed data acquisition method for multimodal data capture oriented towards embodied intelligence according to claim 1, characterized in that, in S4, the tactile data is collected by left and right optical tactile sensors. The Linux master control terminal packages the physical information acquired by the tactile sensors into a floating-point raw array stream of a preset bit width and publishes it, including a depth force matrix in 32FC1 format and a dual-channel shear force matrix in 32FC2 format, which are used to preserve the fine physical field characteristics during the object interaction process.

7. The distributed data acquisition method for multimodal data capture oriented towards embodied intelligence according to claim 1, characterized in that, in S4, the process of real-time acquisition of tactile data and robot body perception data includes the following collaborative safeguarding steps:
S41, monitoring in real time the buffer level and blocking status of the gripper control command queue and the serial port status feedback queue; when the mean value of the tactile sensor depth force matrix exceeds the noise floor threshold by a preset amplitude for 3 consecutive frames and the absolute value of the gripper current signal slope is greater than a preset threshold, determining that the gripper has entered the dynamic control cycle and triggering the bus arbiter to suspend the state monitoring thread's periodic reads of the physical serial port;
S42, while the state monitoring thread's periodic read requests to the physical serial port are suspended, extracting the joint target position and velocity sequence from the control command stream, inputting it into the preset gripper kinematic transfer function model, performing numerical integration by the fourth-order Runge-Kutta method to generate a virtual pose sequence that is strictly aligned with the timestamps of the robot body perception data, and using the virtual pose sequence as virtual spatiotemporal filling for the robot body perception data stream during the suspension period;
S43, when the pose change rate of the robot body perception data is less than 0.01 rad/s for 5 consecutive frames and the standard deviation of tactile intensity is less than 10% of the noise floor, determining that the gripper has entered the static locking stage; the bus arbiter then restores the physical serial port read permission and, using the latest physical pose feedback value as a benchmark, calculates the pose residual between the virtual pose sequence and the latest physical pose feedback value within the most recent 100 ms window using the exponentially weighted recursive least squares method, and updates the bias compensation parameters of the shadow observer to complete the drift correction.
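
The fourth-order Runge-Kutta filling in S42 can be illustrated with a Python sketch. The claim does not disclose the gripper kinematic transfer function model, so the sketch substitutes an assumed first-order lag, dx/dt = (u - x) / tau, driven only by the commanded position held constant between timestamps; the velocity sequence is not used, and `tau` and the function names are hypothetical.

```python
from typing import Callable, List, Sequence


def rk4_step(f: Callable[[float, float], float], t: float, x: float, dt: float) -> float:
    """One classical fourth-order Runge-Kutta step for dx/dt = f(t, x)."""
    k1 = f(t, x)
    k2 = f(t + dt / 2, x + dt * k1 / 2)
    k3 = f(t + dt / 2, x + dt * k2 / 2)
    k4 = f(t + dt, x + dt * k3)
    return x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6


def virtual_positions(timestamps: Sequence[float], commands: Sequence[float],
                      x0: float, tau: float = 0.05) -> List[float]:
    """Integrate the assumed lag model so each output aligns with one timestamp."""
    positions = [x0]
    for i in range(1, len(timestamps)):
        u = commands[i - 1]  # command held constant over the interval
        dt = timestamps[i] - timestamps[i - 1]
        step = rk4_step(lambda t, x, u=u: (u - x) / tau,
                        timestamps[i - 1], positions[-1], dt)
        positions.append(step)
    return positions
```

Because the integration is evaluated at the recorded timestamps, the virtual sequence can be written directly into the proprioception stream for the suspended interval; the drift correction of S43 is not covered by this sketch.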

8. The distributed data acquisition method for multimodal data capture oriented towards embodied intelligence according to claim 1, characterized in that, in S5, automatically determining the grasping state based on the collected tactile data specifically includes:
S51, based on the pressure field matrices acquired in real time by the dual-sided optical tactile sensors, constructing a left-side tactile response tensor and a right-side tactile response tensor, each formed by concatenating the depth force matrix and the dual-channel shear force matrix of the corresponding side along the channel dimension; and calculating, for the left-side and right-side tactile response tensors respectively, the intensity components after deducting background noise, which are used to generate the cross-modal cross-correlation coefficient;
S52, extracting the temporal nonlinear decay features of the intensity components, integrating them to construct a full-cycle tactile intensity time-series curve, locating the enhancement stage interval on the time-series curve, identifying the nonlinear collapse segment by jointly detecting abrupt curvature changes and the duration of monotonic intensity decrease after local extrema, and calculating and outputting the temporal envelope decay factor of the identified nonlinear collapse segment;
S53, based on the normalized decay rate, the left-side tactile response tensor, and the right-side tactile response tensor, combined with the cross-modal cross-correlation coefficient and the extracted temporal nonlinear decay features, calculating a multimodal tactile spatiotemporal consistency operator for quantitative assessment. The operator is calculated by a formula whose expression and symbols are not reproduced in this text. The operator serves as a comprehensive steady-state evaluation metric that quantifies the degree of physical consistency between the robot's end effector and the manipulated object in terms of spatial distribution symmetry and temporal dynamic envelope. Based on the mapping relationship between the real-time value of the operator and preset spatial bias thresholds, the grasping state is automatically classified into optimal samples, pose bias sub-optimal samples, or failure negative samples, and robot action correction labels are generated accordingly. The formula further contains a preset small positive constant used to ensure numerical stability and a dynamic sensitivity weight used to adapt to the characteristics of objects with different rigidities.

9. A distributed data acquisition system for multimodal data capture oriented towards embodied intelligence, characterized in that it comprises:
a collaborative environment construction module, which establishes a cross-system collaborative environment, limits the system clock deviation between the Linux master control terminal and the Windows vision slave control terminal to within a preset threshold through a clock synchronization protocol, and establishes a start and stop control protocol based on the UDP protocol;
a grasping module, which, in the cross-system collaborative environment, performs manual drag teaching on a specific object, and records the golden pose at the moment of successful grasping and the tactile increment intensity relative to the background noise;
a random sampling module, which, based on the golden pose and the tactile increment intensity, generates pose perturbations containing translation and rotation deviations and force fluctuations through random sampling, and constructs non-optimal grasping conditions as skill generalization samples;
a data acquisition module, which, based on the skill generalization samples, drives the Linux master control terminal to execute the grasping process, collects tactile data and robot body perception data in real time, sends trigger commands to the Windows vision slave control terminal through the UDP-based start and stop control protocol, and synchronously records the visual video stream at the corresponding moment;
a label output module, which automatically determines the grasping state based on the collected tactile data, calculates the transformation matrix between the current grasping pose and the golden pose based on the current grasping state, and outputs the transformation matrix as the model correction label.