Multi-modal fusion perception and control method and device for robot dexterous hand

By establishing a unified timing organization and a quality and reliability mechanism at the software level, spectral, tactile, visual, and mechanical signals are combined into a fusion closed loop in which the modalities constrain and verify one another. This addresses the temporal and semantic correspondence problem in multimodal fusion perception and control for robot dexterous hands, and enables stable grasping and consistent manipulation of complex targets.

CN122185244A · Pending · Publication Date: 2026-06-12 · GLITTERINTECH (XUZHOU) LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: GLITTERINTECH (XUZHOU) LTD
Filing Date: 2026-05-13
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing multimodal fusion perception methods and closed-loop interactive control strategies for robot dexterous hands have structural deficiencies. Multimodal signals acquired during the same grasping process are difficult to keep in consistent temporal and semantic correspondence. Data quality and reliability are not evaluated explicitly, which makes stable operation quality hard to achieve. Real-time monitoring of, and adjustment to, contact point migration, force changes, and surface state changes is also missing.

Method used

By establishing a unified timing organization and a quality and reliability mechanism at the software level, spectral, tactile, visual, and mechanical signals are combined into a fusion closed loop that is mutually constrained, mutually verified, and jointly drives control. A multi-encoder, cross-modal fusion structure performs joint representation learning, and gating or cross-modal attention mechanisms provide information complementarity and conflict suppression. On this basis, a hierarchical closed-loop control with a high-frequency inner loop and a policy outer loop is constructed.

Benefits of technology

The method achieves stable closed-loop execution of the robot dexterous hand in complex environments, improves the stability and consistency of operation on complex targets, and enhances the robustness and reliability of grasping objects that are similar in appearance, differ widely in material, or are flexible or reflective.



Abstract

The application provides a multimodal fusion perception and control method and device for a robot dexterous hand. The method comprises the following steps: collecting multimodal data during operation of the dexterous hand, adding timestamps and metadata to the multimodal data based on the same time reference, writing the multimodal data into a ring buffer, and sorting it by time; preprocessing the multimodal data and evaluating its quality to obtain, for each modality, a standardized input and a quality score to be fed to a multimodal fusion model; inputting the standardized inputs and quality scores of each modality, together with the current operation stage, into the multimodal fusion model, and obtaining from its output high-level semantic results and control-related quantities for closed-loop operation; and converting the high-level semantic results and control-related quantities into executable closed-loop control behaviors to control the motion of the dexterous hand. The application enables cooperative closed-loop control of the robot dexterous hand.

Description

Technical Field

[0001] This invention relates to the field of machine learning and intelligent control, and in particular to a multimodal fusion perception and control method and apparatus for a robot dexterous hand.

Background Technology

[0002] In industrial and application scenarios such as intelligent manufacturing, logistics sorting, agricultural product harvesting, and service robots, the technological development of robotic dexterous hands shows a significant trend of evolving from achieving basic grasping functions to pursuing higher levels of grasping accuracy, operational stability, and intelligent perception and understanding of targets and the environment. As tasks expand from simple handling to precise sorting, flexible object processing, and precision assembly, end effectors not only need to complete grasping actions but also need to continuously perceive, judge, and adaptively adjust during contact to maintain stable operational quality under complex environments and diverse object conditions.

[0003] Existing multimodal fusion perception methods and closed-loop interactive control strategies for robot dexterous hands generally suffer from structural deficiencies. On the one hand, spectral, tactile, and force / pose signals often belong to different acquisition links and processing flows and lack a unified mechanism for data organization, time alignment, and coordination. As a result, the timing and semantics of multimodal signals from the same grasping process are difficult to keep in consistent correspondence. There is also no explicit evaluation of data quality and reliability with respect to conditions such as saturation, low signal-to-noise ratio, drift, oscillation, and impact. Fusion often relies on fixed weights or empirical thresholds, so the fused recognition output is difficult to use as a reliable input for the control strategy. Moreover, the control strategy lacks adaptive constraints and adjustment mechanisms for recognition preconditions such as sampling posture, contact stability, and signal-to-noise ratio, resulting in insufficient coordination between perception-based decision-making and closed-loop control. On the other hand, many solutions perform "one-off" identification or after-the-fact verification and lack online monitoring and action correction throughout the contact and handling process. This makes it difficult to respond in time to feature drift caused by contact point migration, force changes, and surface condition changes. Consequently, when slippage is imminent, the risk of pressure damage increases, or spectral confidence decreases, necessary adjustments such as increasing or decreasing gripping force, switching impedance, resampling after changing posture, and verification rollback cannot be triggered.
At the same time, tactile / mechanical signals usually require millisecond-level high-frequency closed loops, while spectral links have low sampling frequencies and variable integration time / readout delays. In actual engineering, spectral acquisition can often be further subdivided into different working modes: first, a rapid light intensity / echo intensity measurement for optical path alignment and posture / force optimization (only the absolute or integrated light intensity is observed, which is relatively fast), and then a slow full-spectrum sampling for material / category identification (longer integration time and more significant delay). Without unified timing modeling and sampling scheduling, full-spectrum sampling can easily be triggered while the optical path is misaligned or the contact is unsteady, leading to problems such as underexposure, saturation, low signal-to-noise ratio, and poor repeatability. Furthermore, the lag between sampling and inference may mislead the control, making it difficult to achieve a stable closed-loop execution of "first optimize light intensity - then sample several frames of the full spectrum - verify if necessary - if not satisfied, back off and retry".

[0004] Therefore, there is an urgent need for a multimodal fusion perception and closed-loop interactive control scheme for robot dexterous hands in real operation processes, so that the end effector can achieve a collaborative closed loop of consistent data, reliable decision-making, and executable actions throughout the entire process of "contact-sampling-identification-adjustment-verification".

Summary of the Invention

[0005] This invention provides a multimodal fusion perception and control method and device for a robot dexterous hand, enabling the robot dexterous hand to achieve a collaborative closed loop of consistent data, reliable decision-making, and executable actions throughout the actual operation process.

[0006] In one aspect, embodiments of the present invention provide a multimodal fusion perception and control method for a robot dexterous hand, the method comprising:

[0007] collecting multimodal data during operation of the dexterous hand, adding timestamps and metadata to the multimodal data based on the same time base, writing the multimodal data into a circular buffer, and sorting it by time;

[0008] preprocessing the multimodal data and evaluating its quality to obtain, for each modality, a standardized input and a quality score used as inputs to a multimodal fusion model;

[0009] inputting the standardized inputs and quality scores of each modality, as well as the current operation stage, into the multimodal fusion model, and obtaining, based on the output of the multimodal fusion model, high-level semantic results and control-related quantities for closed-loop operation; and

[0010] converting the high-level semantic results and control-related quantities into executable closed-loop control behaviors to control the motion of the dexterous hand.

[0011] Optionally, the multimodal data includes any one or more of the following: spectral data, tactile feature data, visual feature data, mechanical data, and auxiliary state data.

[0012] Optionally, the timestamp includes: a hard timestamp and / or a soft timestamp.

[0013] Optionally, the preprocessing of the multimodal data includes any one or more of the following:

[0014] performing spectral correction and standardization on the spectral data;

[0015] filtering the tactile feature data and extracting features from it;

[0016] filtering and standardizing the mechanical data;

[0017] calibrating, correcting, denoising, and standardizing the visual feature data, and generating geometric constraints related to grasping and spectral sampling.

[0018] Optionally, the quality evaluation of the multimodal data includes: obtaining a quality score for each modality based on the validity and / or stability of that modality's data.

[0019] Optionally, the multimodal fusion model adopts a quality-driven gating and weighted fusion mechanism, which triggers a closed-loop resampling process when the quality of key modal data is consistently below a threshold or the recognition confidence is insufficient to support subsequent actions.

[0020] Optionally, the output of the multimodal fusion model includes any one or more of the following information:

[0021] The category and / or material of the target object and its confidence level are used for sorting rule matching and strategy selection;

[0022] Operation-related mechanical objectives or constraint parameters are used for closed-loop force control and impedance parameter settings;

[0023] Suggestions on gripping techniques, as well as corresponding finger position adjustments or micro-movement ranges.

[0024] In another aspect, embodiments of the present invention also provide a multimodal fusion perception and control device for a robot dexterous hand, the device comprising:

[0025] The data acquisition module is used to collect multimodal data during operation of the dexterous hand, add timestamps and metadata to the multimodal data based on the same time base, write the multimodal data into a circular buffer, and sort it by time;

[0026] The preprocessing module is used to preprocess and evaluate the quality of the multimodal data to obtain standardized inputs and quality scores for each modality used as inputs to the multimodal fusion model.

[0027] The fusion module is used to input the standardized inputs and quality scores of each modality, as well as the current operation stage, into the multimodal fusion model. The quality scores are used as fusion weights or gating conditions. Based on the output of the multimodal fusion model, high-level semantic results and control-related quantities for closed-loop operation are obtained.

[0028] The control module is used to convert the high-level semantic results and control-related quantities into executable closed-loop control behaviors to control the dexterous hand movements.

[0029] Optionally, the data acquisition module includes:

[0030] The spectral information acquisition unit is used to acquire the reflection and / or transmission spectral data of the target object within a set spectral range;

[0031] The tactile information acquisition unit is used to acquire tactile feature data output by the tactile array of the dexterous hand; the tactile feature data includes any one or more of the following: contact pressure distribution, contact area, contact center, shear force, and vibration force;

[0032] A visual information acquisition unit is used to acquire visual feature data of the target object, including RGB images, depth images, or point cloud data of the target object.

[0033] The mechanical information acquisition unit is used to collect mechanical data directly related to grasping stability. The mechanical data includes any one or more of the following: force, torque, normal force, and tangential force.

[0034] An auxiliary status information acquisition unit is used to acquire auxiliary status data of the dexterous hand. The auxiliary status data includes any one or more of the following: joint encoder readings, motor current, temperature, and inertial information.

[0035] Optionally, the sampling frequency of the same information acquisition unit may be the same or different at different operation stages.

[0036] Optionally, the spectral information acquisition unit uses an adaptive triggering mechanism jointly driven by a state machine supervisor and multi-source events to acquire the spectral data.

[0037] Optionally, the preprocessing module includes: a preprocessing unit and a quality evaluation unit;

[0038] The preprocessing unit is used to preprocess the multimodal data; the preprocessing includes any one or more of the following:

[0039] performing spectral correction and standardization on the spectral data;

[0040] filtering the tactile feature data and extracting features from it;

[0041] filtering and standardizing the mechanical data;

[0042] calibrating, correcting, denoising, and standardizing the visual feature data, and generating geometric constraints related to grasping and spectral sampling.

[0043] Optionally, the multimodal fusion model includes: a multimodal coding layer and a multimodal neural network fusion layer.

[0044] Optionally, the control module includes: a state machine supervisor and a multi-mode controller;

[0045] The state machine supervisor is used for event-driven switching and scheduling between different operation phases;

[0046] The multi-mode controller is used to achieve stable execution in each stage using control modes such as position control, force control, or impedance control, and to update the target force and impedance parameters and micro-motion strategy online according to the upper-level scheduling instructions.
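
A minimal Python sketch of the split between the state machine supervisor and the multi-mode controller described above. The stage names follow the "contact establishment - perception recognition - verification and adjustment - subsequent execution" sequence used in this document; the event names, the transition table, and the stage-to-control-mode mapping are illustrative assumptions and are not taken from the application.

```python
from enum import Enum, auto


class Stage(Enum):
    CONTACT = auto()        # contact establishment
    RECOGNIZE = auto()      # perception recognition
    VERIFY_ADJUST = auto()  # verification and adjustment
    EXECUTE = auto()        # subsequent execution


# Hypothetical event-driven transitions; unknown events leave the stage unchanged.
TRANSITIONS = {
    (Stage.CONTACT, "contact_stable"): Stage.RECOGNIZE,
    (Stage.RECOGNIZE, "confident"): Stage.VERIFY_ADJUST,
    (Stage.RECOGNIZE, "low_quality"): Stage.CONTACT,      # back off and resample
    (Stage.VERIFY_ADJUST, "verified"): Stage.EXECUTE,
    (Stage.VERIFY_ADJUST, "mismatch"): Stage.RECOGNIZE,   # re-identify
}

# Hypothetical choice of control mode per stage for the multi-mode controller.
CONTROL_MODE = {
    Stage.CONTACT: "position",
    Stage.RECOGNIZE: "force",
    Stage.VERIFY_ADJUST: "impedance",
    Stage.EXECUTE: "impedance",
}


def step(stage: Stage, event: str) -> Stage:
    """Return the next stage for an event, or the same stage if no rule matches."""
    return TRANSITIONS.get((stage, event), stage)


stage = Stage.CONTACT
for event in ["contact_stable", "low_quality", "contact_stable", "confident", "verified"]:
    stage = step(stage, event)
```

Keeping the transitions in a table rather than in nested conditionals makes the back-off paths (here `low_quality` and `mismatch`) explicit and easy to audit.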

[0047] Compared with the prior art, the technical solution of the embodiments of the present invention has the following beneficial effects:

[0048] The multimodal fusion perception and control method and device for a robot dexterous hand provided in this invention address two system characteristics of the actual grasping process: asynchronous multi-rate operation (timing alignment and scheduling problems caused by differences in sampling frequency and delay across modalities), and multi-source uncertainty with strong coupling to the contact state (the need for online monitoring, verification, and back-off throughout the process, because modality quality fluctuates dynamically and contact conditions change with the action). By establishing a unified timing organization and a quality and reliability mechanism at the software layer, the spectral, tactile, visual, and mechanical signals are upgraded from parallel, independent perception channels to a fusion closed loop that is mutually constrained, mutually verified, and jointly drives control. Thus, without changing the basic hardware motion capabilities of the dexterous hand, a stable closed-loop execution of "contact establishment - perception recognition - verification and adjustment - subsequent execution" on the target is achieved, improving operational stability and consistency for complex targets (such as those that are similar in appearance, differ widely in material, or are flexible or reflective).

[0049] Furthermore, addressing the contradiction between low-frequency spectral data with variable exposure delays and high-frequency tactile and mechanical data requiring millisecond-level closed-loop processing, this invention introduces data with hard and soft timestamps and a multimodal circular buffer into the software architecture. This manages data from different modalities along a unified timeline and uses a sliding window approach to form trainable and inferable time-series samples. Building upon this, data quality and reliability descriptions are introduced for each modality, enabling fusion and decision-making to no longer rely on fixed weights. Instead, it adaptively selects the more reliable modality data at different operational stages and under different environmental conditions, fundamentally alleviating the temporal and semantic mismatch problems caused by multimodal fragmentation.
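
The sliding-window idea in the paragraph above can be sketched as follows. This is one possible reading, not the application's implementation: the function name `window_sample`, the zero-order hold of the latest low-rate (spectral) sample, and the reported sample age are assumptions made for illustration. Timestamps are integers (for example milliseconds) on the shared time base.

```python
import bisect
from typing import Any, Dict, List, Tuple

Stamped = Tuple[int, Any]  # (timestamp on the shared time base, value)


def window_sample(t_now: int, width: int,
                  fast: List[Stamped], slow: List[Stamped]) -> Dict[str, Any]:
    """Build one time-series sample ending at t_now.

    fast: high-rate stream (e.g. tactile), sorted by timestamp.
    slow: low-rate stream (e.g. spectral), sorted by timestamp.
    """
    fast_t = [t for t, _ in fast]
    lo = bisect.bisect_left(fast_t, t_now - width)   # first sample inside the window
    hi = bisect.bisect_right(fast_t, t_now)          # one past the last sample <= t_now

    slow_t = [t for t, _ in slow]
    k = bisect.bisect_right(slow_t, t_now) - 1       # latest slow sample <= t_now
    held = slow[k] if k >= 0 else None

    return {
        "fast": [v for _, v in fast[lo:hi]],
        "slow": held[1] if held else None,
        # Age of the held slow sample, so downstream logic can discount stale data.
        "slow_age": (t_now - held[0]) if held else None,
    }


tactile = [(t, t * 10) for t in range(10)]   # one sample per time unit
spectral = [(2, "A"), (7, "B")]              # sparse, irregular samples
sample = window_sample(t_now=6, width=3, fast=tactile, slow=spectral)
early = window_sample(t_now=1, width=3, fast=tactile, slow=spectral)
```

Reporting the age of the held spectral sample alongside its value is one way to let a later quality score penalize stale low-rate data.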

[0050] Furthermore, at the model layer, a multi-encoder and cross-modal fusion structure is employed to jointly represent data from different dimensions. Through gating or cross-modal attention mechanisms, information complementarity and conflict suppression are achieved under quality scoring and confidence constraints. The model output does not merely provide a single material and / or category label, but rather simultaneously outputs recognition results and risk estimates and action suggestion parameters that can be directly invoked by closed-loop control in a multi-task manner. These parameters include target gripping force range, impedance / stiffness / damping adjustment direction, and fingertip micro-motion direction and amplitude. This integrated "recognition-risk-parameter" output allows the output results to better serve closed-loop control, rather than simply acting as an offline classifier, thus providing an executable and interpretable interface for subsequent control strategies.

[0051] Furthermore, at the control layer, a hierarchical closed loop consisting of a high-frequency inner loop and a strategy outer loop is constructed. The high-frequency inner loop, primarily based on tactile and mechanical sensing, continuously maintains grasping stability and safety, achieving anti-slip, force limiting, and impact suppression. The outer loop, centered on the output of the fusion model, switches and schedules strategies based on material recognition confidence, slip / pressure damage risk, and quality score. For example, it adaptively increases force and coordinates with finger position fine-tuning when friction is low or slip risk increases; it reduces normal force and switches to compliant impedance when fragility or pressure damage risk increases; it performs posture resampling and re-judgment when recognition confidence decreases or spectral quality is insufficient; and it confirms recognition consistency through one or more verification samplings after grasping, triggering a back-off and retry or changing the subsequent sorting / assembly path when necessary. Through the closed-loop structure of "acquisition-fusion-control-reacquisition", this invention embeds multimodal perception into the operation sequence, achieving continuous and consistent control of contact establishment, perception recognition, and subsequent actions, thereby improving robustness and usability for complex objects and environments.

Attached Figure Description

[0052] Figure 1 is a flowchart of a multimodal fusion perception and control method for a robot dexterous hand provided in an embodiment of the present invention;

[0053] Figure 2 is a flowchart of quality-driven multimodal fusion and closed-loop resampling in an embodiment of the present invention;

[0054] Figure 3 is a schematic diagram of a state change process of a state machine in an embodiment of the present invention;

[0055] Figure 4 is a schematic diagram of a multimodal fusion perception and control device for a robot dexterous hand provided in an embodiment of the present invention.

Detailed Implementation

[0056] To make the above-mentioned objectives, features and beneficial effects of the present invention more apparent and understandable, specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

[0057] Current dexterous hand perception systems primarily rely on two pathways: tactile / mechanical and visual. Tactile and mechanical signals (such as force / torque, tactile arrays, joint currents / torques, pose encoders, etc.) are used to characterize contact states, force changes, slippage trends, and grasping stability. Visual perception, such as external cameras or depth cameras, is used for target localization, pose estimation, and coarse-grained category recognition. This "visual guidance + tactile / force control" closed-loop model effectively improves grasping success rate and operational accuracy in many scenarios, but its core capabilities are still mainly concentrated on geometric and contact-level representation.

[0058] When applications further require distinguishing between targets that are "similar in appearance but different in material / process", or when reliable operation is required for flexible, transparent / semi-transparent, highly reflective, or weakly textured objects, relying solely on vision and touch often leads to uncertainty: vision may be affected by lighting, reflection, and occlusion, making it difficult to reliably distinguish material differences; while tactile / mechanical signals can reflect contact and friction behavior, they are difficult to directly provide clues about the inherent physical and chemical properties of objects. This can lead to misjudgments in recognition and action strategies or require repeated trial and error, thereby reducing efficiency and increasing the risk of damage.

[0059] Spectral sensing can provide the reflection / absorption characteristics of materials in specific wavelength bands (such as visible / near-infrared), and has natural separability and interpretability in terms of materials, coatings, water content, maturity, etc., thus serving as a key supplementary sensory dimension "beyond touch and vision". Introducing a miniature spectrometer into the dexterous hand operation link, and performing time-series alignment, quality assessment, and fusion modeling with tactile and mechanical signals on the software side, allows the system to simultaneously complete material identification and stability assessment during contact and grasping, thereby supporting a closed-loop process of "identification-grabbing-verification-subsequent operation", significantly improving robustness and intelligence in complex objects and environments.

[0060] The robotic dexterous hand exhibits the following characteristics during actual grasping:

[0061] (1) Asynchronous multi-rate operation: differences in sampling frequency and delay across modalities create timing alignment and scheduling problems;

[0062] (2) Multi-source uncertainty and strong coupling with the contact state: the quality of each modality fluctuates dynamically and contact conditions change with the action, so online monitoring, verification, and rollback are needed throughout the entire process.

[0063] To address the aforementioned characteristics, this invention provides a multimodal fusion perception and control method and apparatus for a robot dexterous hand. By establishing a unified temporal organization and quality reliability mechanism at the software layer, spectral, tactile, visual, and mechanical signals are upgraded from parallel and independent perception channels to a fusion closed loop that is mutually constrained, mutually verified, and jointly driven for control. This enables stable closed-loop execution of "contact establishment—perception recognition—verification and adjustment—subsequent execution" for the target object to be manipulated (such as fruit, garbage, medicine, etc.) without changing the basic hardware motion capabilities of the dexterous hand. This significantly improves the stability and consistency of operation on complex target objects (such as those with similar appearance, large material differences, flexibility, reflectivity, etc.).

[0064] Figure 1 is a flowchart of a multimodal fusion perception and control method for a robot dexterous hand provided in an embodiment of the present invention.

[0065] In step 101, multimodal data during the dexterous hand operation process is collected, and timestamps and metadata are added to the multimodal data based on the same time reference. The multimodal data is then written into a circular buffer and sorted by time.

[0066] In some embodiments, the multimodal data may include, but is not limited to, any one or more of the following: spectral data, tactile feature data, visual feature data, mechanical data, etc.

[0067] The spectral data may include the reflection and / or transmission spectral data of the target object within a set spectral range (such as the near-infrared or mid-infrared band).

[0068] The tactile feature data may be tactile feature information output by a tactile array arranged on the fingertips and / or finger pads of the dexterous hand, such as contact pressure distribution, contact area, contact center, shear stress, and / or dynamic vibration signals generated by relative sliding or impact.

[0069] The visual feature data may include: RGB image, depth image or point cloud data of the target object, used to determine the appearance and geometric information of the target object.

[0070] The mechanical data may include information such as forces or torques, normal forces, and tangential forces that are directly related to gripping stability.

[0071] In some embodiments, the multimodal data may further include auxiliary state data, such as joint encoder readings, motor current, temperature, and inertial information (e.g., acceleration, angular velocity). This information can be used to characterize the kinematic state and operating environment of the dexterous hand, providing contextual constraints for subsequent multimodal data fusion. For example, the joint encoder provides pose states such as joint angles and velocities to correlate contact positions with posture changes; motor current can be used to estimate joint load and friction states; temperature is used to compensate for spectral drift and tactile zero-point drift; and inertial information can be used to compensate for dynamic disturbances caused by rapid movement.

[0072] It should be noted that the acquisition of data for each modality can be performed by the corresponding information acquisition unit, and this embodiment of the invention does not limit this. Since different information acquisition units operate at different frequencies—for example, tactile sensors and pressure sensors typically operate at higher frequencies, while spectrometers operate at lower frequencies—a unified time base and synchronization mechanism can be set to ensure the consistency of the timing of data from different modalities. A timestamp is appended to the data from each acquisition channel during acquisition, and the acquired data is written to a multimodal circular buffer. The circular buffer, also known as a circular queue, is a fixed-size, contiguous memory data structure.

[0073] In practice, each modal data can correspond to an independent circular buffer in order to better support modal data with different sampling rates and data sizes.
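
As a rough illustration of per-modality circular buffers with time-ordered readout, the following Python sketch uses `collections.deque` with a fixed `maxlen`, which discards the oldest entry once the buffer is full. The class and field names (`Sample`, `MultimodalRingBuffer`, `window`) and the example capacities are hypothetical and chosen only for this example; a production system would more likely use preallocated contiguous memory.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(order=True)
class Sample:
    t: float                                   # timestamp on the shared time base (s)
    modality: str = field(compare=False)
    value: Any = field(compare=False)
    meta: Dict[str, Any] = field(compare=False, default_factory=dict)


class MultimodalRingBuffer:
    """One fixed-capacity buffer per modality; the oldest samples are overwritten."""

    def __init__(self, capacities: Dict[str, int]):
        self._buffers = {name: deque(maxlen=cap) for name, cap in capacities.items()}

    def write(self, sample: Sample) -> None:
        self._buffers[sample.modality].append(sample)

    def window(self, t_start: float, t_end: float) -> List[Sample]:
        """Return samples from all modalities in [t_start, t_end], sorted by time."""
        hits = [s for buf in self._buffers.values() for s in buf
                if t_start <= s.t <= t_end]
        return sorted(hits)   # Sample orders by timestamp only


buf = MultimodalRingBuffer({"tactile": 3, "spectral": 2})
for i in range(5):                               # 5 writes into a 3-slot buffer
    buf.write(Sample(t=0.001 * i, modality="tactile", value=i))
buf.write(Sample(t=0.0025, modality="spectral", value=[0.1, 0.2],
                 meta={"integration_ms": 20}))
merged = buf.window(0.0, 1.0)
```

Separate capacities per modality let a high-rate tactile stream and a low-rate spectral stream cover a similar time span without one evicting the other.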

[0074] In this embodiment of the invention, the method of adding timestamps to the multimodal data may be to add hard timestamps and / or add soft timestamps, and this embodiment of the invention does not limit the method.

[0075] A hard timestamp is a timestamp applied directly by hardware at the physical moment of data acquisition (such as the end of an ADC conversion, the edge of an external trigger signal, or the instant the sensor completes sampling). This is usually done by hardware logic without software intervention. The hardware can be the sensor itself or a component in the acquisition link closest to the sensor, such as an FPGA or a dedicated timing chip. Hard timestamps typically come from a unified clock, timer, FPGA, or MCU counter, and have low jitter and are reproducible (i.e., under exactly the same input conditions, the timestamp error range is fixed and predictable rather than random).

[0076] A soft timestamp is a timestamp that is obtained by software from the system time and appended to the data. It records the "moment when the data was observed by the software".
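
The difference between the two kinds of timestamp can be sketched as below. The sketch assumes a hardware tick counter with a known tick rate, a previously estimated offset between the hardware clock and the host clock, and a fixed assumed software latency; all of these names and parameters (`tick_hz`, `offset_s`, `soft_latency_s`) are hypothetical and not specified in the application.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stamp:
    hard_s: Optional[float]  # from a hardware counter; None if the channel has none
    soft_s: float            # host clock reading when software observed the data


def hard_from_ticks(ticks: int, tick_hz: float) -> float:
    """Convert a hardware counter value to seconds on the hardware clock."""
    return ticks / tick_hz


def make_stamp(ticks: Optional[int], tick_hz: float = 1_000_000.0) -> Stamp:
    """Attach both timestamps at the moment the driver hands data to software."""
    hard = hard_from_ticks(ticks, tick_hz) if ticks is not None else None
    return Stamp(hard_s=hard, soft_s=time.monotonic())


def to_common_time(stamp: Stamp, offset_s: float, soft_latency_s: float = 0.0) -> float:
    """Map a stamp onto the shared time base.

    Prefers the hard timestamp (shifted by the hardware-to-host clock offset).
    Falls back to the soft timestamp minus an assumed transport latency.
    """
    if stamp.hard_s is not None:
        return stamp.hard_s + offset_s
    return stamp.soft_s - soft_latency_s


with_hw = to_common_time(Stamp(hard_s=2.5, soft_s=10.0), offset_s=7.0)
without_hw = to_common_time(Stamp(hard_s=None, soft_s=10.0), offset_s=7.0,
                            soft_latency_s=0.5)
```

Storing both values, where available, keeps the low-jitter hard timestamp for alignment while the soft timestamp remains useful for estimating transport delay.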

[0077] The metadata is a set of structured information used to describe "how, when, where, and by whom" the corresponding modal data was generated. The metadata content differs between modalities. For example, the metadata corresponding to spectral data may include integration time, gain, saturation flag, effective pixel range, etc.; the metadata corresponding to visual feature data may include camera exposure time, gain, intrinsic parameter calibration version, etc.

[0078] In this embodiment of the invention, metadata can be recorded, transmitted, and stored together with the original sampling data to better ensure the validity of the sampling data. Unless otherwise stated, the modal data mentioned thereafter refers to a data sequence that includes sampling data, metadata, and timestamps.

[0079] In step 102, the multimodal data is preprocessed and its quality is evaluated to obtain the standardized inputs and quality scores of each modality used as inputs to the multimodal fusion model.

[0080] The purpose of preprocessing multimodal data is to eliminate scale differences, noise and drift effects caused by different sensors, improve data quality, generate standardized inputs that can be used for training and online inference of multimodal fusion models, reduce the interference of non-ideal factors in motion contact process on recognition and control, and enable the fusion model to obtain more stable and transferable feature representations.

[0081] Different modal data require different processing methods, for example:

[0082] Spectral data undergoes spectral correction and standardization. Spectral correction may include processes such as dark current and / or background subtraction, white reference normalization, wavelength axis calibration, and intensity normalization to reduce the impact of light source intensity fluctuations, integration time variations, and ambient light leakage on spectral amplitude.
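
A common way to combine dark subtraction with white reference normalization is the per-band ratio R = (S - D) / (W - D), where S is the raw sample spectrum, D the dark spectrum, and W the white reference spectrum. The Python sketch below applies that ratio and then an L2 intensity normalization. It assumes all three spectra were taken with the same integration time and wavelength axis; the application does not state which normalization it uses, so the specific formulas here are illustrative.

```python
import math
from typing import List, Sequence


def reflectance(raw: Sequence[float], dark: Sequence[float],
                white: Sequence[float], eps: float = 1e-9) -> List[float]:
    """Per-band (raw - dark) / (white - dark); bands with no usable reference give 0."""
    out = []
    for s, d, w in zip(raw, dark, white):
        denom = w - d
        out.append((s - d) / denom if abs(denom) > eps else 0.0)
    return out


def unit_norm(spectrum: Sequence[float]) -> List[float]:
    """Scale a spectrum to unit L2 norm to reduce dependence on overall intensity."""
    norm = math.sqrt(sum(x * x for x in spectrum))
    return [x / norm for x in spectrum] if norm > 0 else list(spectrum)


raw = [30.0, 60.0, 110.0]
dark = [10.0, 10.0, 10.0]
white = [110.0, 110.0, 210.0]
refl = reflectance(raw, dark, white)
shape = unit_norm([3.0, 4.0])
```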

[0083] The tactile feature data undergoes filtering and characterization processing. This may include tactile zero-point calibration, drift compensation, spatial smoothing, temporal low-pass or band-pass filtering, and outlier repair to suppress the impact of sensor noise and local failures on the features. Furthermore, fundamental feature quantities related to closed-loop operation can be extracted from the tactile feature data, such as contact area, contact center, peak and average pressure, shear stress and / or vibration intensity, texture-related features, or micro-slip indications, so that subsequent fusion models and control strategies can jointly determine "contact morphology, contact stability, and slip trend."
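
Some of the basic tactile features named above (contact area, contact center, peak and average pressure) can be computed from a single pressure frame as in the following sketch. The contact threshold, the cell area, and the use of a pressure-weighted centroid as the "contact center" are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def tactile_features(grid: List[List[float]], threshold: float,
                     cell_area_mm2: float = 1.0) -> Dict[str, object]:
    """Extract simple contact features from one tactile-array pressure frame."""
    # Cells whose pressure exceeds the threshold are treated as in contact.
    contact = [(r, c, p) for r, row in enumerate(grid)
               for c, p in enumerate(row) if p > threshold]
    if not contact:
        return {"area": 0.0, "center": None, "peak": 0.0, "mean": 0.0}

    total = sum(p for _, _, p in contact)
    center: Optional[Tuple[float, float]] = (
        sum(r * p for r, _, p in contact) / total,   # pressure-weighted row
        sum(c * p for _, c, p in contact) / total,   # pressure-weighted column
    )
    return {
        "area": len(contact) * cell_area_mm2,
        "center": center,
        "peak": max(p for _, _, p in contact),
        "mean": total / len(contact),
    }


frame = [
    [0, 0, 0],
    [0, 2, 2],
    [0, 0, 0],
]
features = tactile_features(frame, threshold=0.5)
```

Tracking how `center` moves between consecutive frames is one simple cue for contact point migration or incipient slip, although the application does not specify how it detects these.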

[0084] The mechanical data undergoes filtering and standardization. This may include sensor zero-point compensation, low-pass filtering, impact or oscillation detection, coordinate system transformation (e.g., from sensor coordinate system to end-effector or finger coordinate system), and necessary resampling and alignment. The processed data outputs characteristics representing grasping stability and safety, such as normal force, tangential force, torque amplitude, and rate of change of force. Abnormal contacts (e.g., impact contact or high-frequency oscillation) are also marked, providing preliminary information for subsequent control loop limiting, gain adjustment, and safety protection.
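A minimal sketch of the mechanical features named above follows: it splits a measured force vector into normal and tangential components relative to a contact normal, estimates the rate of change of force by finite differences, and marks an impact when that rate exceeds a limit. The function names and the rate-limit criterion are assumptions for illustration; the embodiment leaves the detection method open.

```python
import math
from typing import List, Sequence, Tuple


def decompose_force(force: Sequence[float],
                    normal: Sequence[float]) -> Tuple[float, float]:
    """Return (normal force, tangential force magnitude) for a contact normal."""
    n_len = math.sqrt(sum(c * c for c in normal))
    if n_len == 0.0:
        raise ValueError("contact normal must be non-zero")
    unit = [c / n_len for c in normal]
    # Projection of the force onto the unit normal.
    f_normal = sum(f * u for f, u in zip(force, unit))
    # What remains after removing the normal component lies in the contact plane.
    tangential = [f - f_normal * u for f, u in zip(force, unit)]
    f_tangential = math.sqrt(sum(t * t for t in tangential))
    return f_normal, f_tangential


def force_rate(samples: Sequence[float], dt: float) -> List[float]:
    """Finite-difference rate of change between consecutive samples."""
    return [(b - a) / dt for a, b in zip(samples, samples[1:])]


def has_impact(samples: Sequence[float], dt: float, rate_limit: float) -> bool:
    """Mark an impact when any force rate exceeds the (assumed) limit."""
    return any(abs(r) > rate_limit for r in force_rate(samples, dt))
```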

[0085] The visual feature data undergoes calibration, denoising, and standardization, and geometric constraints related to grasping and spectral sampling are generated to provide stable input for subsequent multimodal fusion inference and closed-loop control. Preprocessing of the visual feature data may include: loading camera intrinsic and extrinsic calibration parameters; distortion correction and coordinate system unification (e.g., converting point cloud data from the camera coordinate system to the end-effector or world coordinate system); exposure and gain equalization and brightness normalization; image denoising and deblurring; depth filtering and hole filling; and point cloud outlier removal and downsampling. These steps reduce the impact of factors such as illumination changes, reflections, high dynamic range, weak textures, transparent or semi-transparent materials, and motion blur on the stability of visual features.

[0086] The quality scores of each modality data can be used to support subsequent fusion weight allocation, gating selection and anomaly handling in closed-loop control, so as to improve the reliability of subsequent multimodal data fusion.

[0087] For example, the quality of spectral data, tactile feature data, mechanical data, and visual feature data are evaluated respectively, and the corresponding quality scores are calculated as q_spec, q_tact, q_force, and q_vis.

[0088] The quality score can be generated based on the validity and / or stability indicators of each modality's data, such as:

[0089] The quality score q_spec of spectral data can be obtained by comprehensively considering indicators such as signal-to-noise ratio, saturation ratio, echo intensity range, baseline drift, and repeatability.

[0090] The quality score q_tact of tactile feature data can be determined by indicators such as the effective channel ratio, noise level, zero-point drift, and contact distribution stability.

[0091] The quality score q_force of mechanical data can be determined by indicators such as signal variance, impact detection results, oscillation amplitude, and sensor saturation / frame dropping.

[0092] The quality score q_vis of visual feature data can be obtained by comprehensively considering indicators such as sharpness (e.g., gradient energy or Laplacian variance), exposure and saturation ratio, degree of motion blur, effective field of view ratio, degree of occlusion, effective depth pixel ratio, point cloud density, and noise level.
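One possible way to combine such indicators into a single score is sketched below for q_spec: each indicator is mapped onto [0, 1] by a linear ramp and the results are averaged with fixed weights. The ramp end points, the weights, and the choice of three indicators are hypothetical and would need tuning for a real sensor; the embodiment does not specify how the indicators are combined.

```python
def clip01(x: float) -> float:
    """Clamp a value to the interval [0, 1]."""
    return max(0.0, min(1.0, x))


def ramp(value: float, bad: float, good: float) -> float:
    """Map value linearly to [0, 1] with bad -> 0 and good -> 1.

    Works for "higher is better" (good > bad) and "lower is better"
    (good < bad) indicators alike.
    """
    if good == bad:
        raise ValueError("bad and good must differ")
    return clip01((value - bad) / (good - bad))


def spectral_quality(snr_db: float,
                     saturation_ratio: float,
                     baseline_drift: float) -> float:
    """Weighted average of normalized indicators (assumed thresholds)."""
    indicators = {
        "snr": ramp(snr_db, bad=10.0, good=30.0),
        "saturation": ramp(saturation_ratio, bad=0.05, good=0.0),
        "drift": ramp(baseline_drift, bad=0.10, good=0.0),
    }
    weights = {"snr": 0.5, "saturation": 0.3, "drift": 0.2}
    return sum(weights[k] * indicators[k] for k in weights)
```

The same pattern could be reused for q_tact, q_force, and q_vis by swapping in the indicators listed for each modality.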

[0093] The quality score can be input to the fusion model layer as a gating weight, to reduce the influence of a modality whose quality has declined or to trigger resampling. It can also be output to the decision and control layer to drive closed-loop actions such as backtracking and retrying, re-orienting and resampling, or safety force limiting.

[0094] Furthermore, preprocessing can perform time alignment and windowed encoding on multimodal signals with different sampling frequencies and delays. Specifically, spectral frames and the corresponding tactile and mechanical data segments within the same time window are read from the circular buffer based on the unified timestamps, forming a fixed-length or adaptive-length time-series sample as input to the fusion model. For example, in a non-limiting embodiment, the effective timestamp t_spec of a frame of full-spectrum data is used as the center time, and the tactile array sequence and the force or torque sequence within the time window [t_spec-100ms, t_spec+100ms] are extracted. High-frequency tactile and mechanical data can be downsampled at a fixed step size, or statistical features such as mean, peak value, rate of change, variance, and contact center drift can be extracted. At the same time, the visual feature data closest to this time window and its quality score are read. One or more frames of spectral data, the tactile feature data, the mechanical data, the visual feature data, and the metadata of each modality within this time window are then combined to form a time-series sample for training or online inference of the multimodal fusion model. For fixed-length samples, the window width is preset; for adaptive-length samples, the window width can be dynamically adjusted according to the contact establishment time, steady-state maintenance time, or task stage.

[0095] For low-frequency spectral feature data with variable integration time, strategies such as interpolation, resampling, or nearest neighbor alignment can be used during alignment to match tactile / mechanical high-frequency data, thereby forming unified temporal samples that can be used for training and online inference.
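The window extraction and nearest-neighbor alignment described above can be sketched as follows. The example assumes that each stream is stored as a time-sorted list of integer millisecond timestamps with a parallel list of values; the function names and the dictionary layout of the sample are hypothetical.

```python
from bisect import bisect_left, bisect_right
from statistics import mean, pvariance
from typing import Dict, List, Sequence


def window_slice(timestamps_ms: Sequence[int], values: Sequence[float],
                 center_ms: int, half_width_ms: int) -> List[float]:
    """Values whose timestamps fall in [center - half, center + half]."""
    lo = bisect_left(timestamps_ms, center_ms - half_width_ms)
    hi = bisect_right(timestamps_ms, center_ms + half_width_ms)
    return list(values[lo:hi])


def nearest_index(timestamps_ms: Sequence[int], t_ms: int) -> int:
    """Index of the timestamp closest to t_ms (ties go to the earlier one)."""
    if not timestamps_ms:
        raise ValueError("empty timestamp list")
    i = bisect_left(timestamps_ms, t_ms)
    if i == 0:
        return 0
    if i == len(timestamps_ms):
        return i - 1
    before, after = timestamps_ms[i - 1], timestamps_ms[i]
    return i if (after - t_ms) < (t_ms - before) else i - 1


def build_sample(t_spec_ms: int, spectrum: Sequence[float],
                 force_ts_ms: Sequence[int], force_vals: Sequence[float],
                 vis_ts_ms: Sequence[int], vis_feats: Sequence[object],
                 half_width_ms: int = 100) -> Dict[str, object]:
    """Assemble one fusion sample anchored on a spectral frame."""
    segment = window_slice(force_ts_ms, force_vals, t_spec_ms, half_width_ms)
    stats = None
    if segment:
        # Statistical aggregation of the high-frequency segment.
        stats = {"mean": mean(segment), "peak": max(segment),
                 "variance": pvariance(segment)}
    k = nearest_index(vis_ts_ms, t_spec_ms)
    return {"t_spec_ms": t_spec_ms, "spectrum": list(spectrum),
            "force_segment": segment, "force_stats": stats,
            "visual": vis_feats[k],
            "visual_lag_ms": vis_ts_ms[k] - t_spec_ms}
```

Recording `visual_lag_ms` keeps the alignment error of the nearest-neighbor match visible to downstream consumers, which could feed into the quality assessment of the sample.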

[0096] Through the above preprocessing, the dexterous hand can obtain a more consistent and robust multimodal representation under dynamic grasping conditions, providing a reliable data foundation for subsequent fusion recognition and closed-loop control.

[0097] In step 103, the standardized inputs and quality scores of each modality, as well as the current operation stage, are input into the multimodal fusion model. Based on the output of the multimodal fusion model, high-level semantic results and control-related quantities for closed-loop operation are obtained.

[0098] In one non-limiting embodiment, the multimodal fusion model includes at least a multimodal coding layer and a multimodal neural network fusion layer. The multimodal coding layer includes encoders corresponding to each modality of data, performing feature extraction and embedding representation on the respective modality data. The multimodal neural network fusion layer maps the multimodal features extracted by the multimodal coding layer (i.e., the embedding representations of each modality) to a unified feature space, and performs cross-modal interaction and information complementarity on the features within the unified feature space, thereby completing the fusion of multimodal features and achieving a comprehensive judgment of the target object's category / material, contact state, and grasping strategy.

[0099] It should be noted that the output of the multimodal neural network fusion layer is preferably in a multi-task form to simultaneously support recognition and control requirements. The specific information output can be determined according to the specific application needs. For example, in a non-limiting embodiment, the output of the multimodal neural network fusion layer may include: target object category and / or material recognition results and confidence levels, used for sorting rule matching and strategy selection; it may also output operation-related mechanical targets or constraint parameters, such as recommended normal force range, torque magnitude and direction, stability margin, or friction-related estimates, used for closed-loop force control and impedance parameter settings; and it may also output gripping method suggestions, such as two-finger gripping, multi-finger envelope, fingertip touch, fingertip surface touch, etc., as well as corresponding finger position adjustment direction or micro-motion amplitude suggestions.

[0100] The multimodal neural network fusion layer can use windowed alignment input for asynchronous multi-rate data. Specifically, low-frequency spectral frames can be used as anchors, and corresponding high-frequency tactile feature data sequences and mechanical data sequences can be selected within preset time windows before and after them. These sequences can then be aligned to a unified input format through nearest neighbor matching, linear interpolation, resampling, statistical aggregation, or temporal coding. For example, a single fusion input sample can be constructed using one frame of spectrum, plus 100ms of tactile feature data and mechanical data segments before and after it, plus nearest neighbor visual features. Alternatively, N consecutive frames of spectral data can be used to correspond to N local time windows, which are then combined with consecutive high-frequency segments to form a multi-step temporal input. This allows the learning of the dynamic process of 'contact establishment—stable sampling—grasping adjustment', so that stable output can be maintained even when there are contact changes and motion disturbances when a dexterous hand grasps a target object.

[0101] The multimodal neural network fusion layer can realize the unified representation and joint reasoning of data such as spectral material fingerprint information, tactile contact morphology and micro-slip cues, as well as mechanical force trends and dynamic constraints. Its output is a multidimensional decision quantity that can directly drive grasping control, thus providing a reliable, interpretable, and adaptive core input for subsequent hierarchical closed-loop control.

[0102] For example, in a non-limiting embodiment, the output of the multimodal fusion model includes at least:

[0103] (1) The category and / or material identification results of the target object and the identification confidence level;

[0104] (2) At least one auxiliary control quantity related to gripping stability, preferably one of slip risk, stability margin, or recommended gripping force range.

[0105] In some embodiments, the output of the multimodal fusion model may further include any one or more of the following:

[0106] (1) Contact state type, such as point contact, line / edge contact or surface contact;

[0107] (2) Pressure damage risk assessment;

[0108] (3) Suggestions on the direction or magnitude of impedance parameter adjustment;

[0109] (4) Suggestions on the direction and amplitude of fingertip micro-movements;

[0110] (5) Whether a strategy of resampling, verification or rollback is required.

[0111] In practical implementation, specific output items can be selected according to application requirements, so that the output of the multimodal fusion model can not only meet the basic recognition and grasping control requirements, but also provide richer closed-loop scheduling information in more complex scenarios.

[0112] In this embodiment of the invention, the multimodal fusion model adopts a quality-driven gating and weighted fusion mechanism to achieve the reliability of multimodal fusion.

[0113] Figure 2 is a flowchart of quality-driven multimodal fusion and closed-loop resampling in an embodiment of the present invention.

[0114] In step 201, the standardized inputs and quality scores of each modality (such as q_spec, q_tact, q_force, and q_vis), together with the current operation stage, are input into the multimodal fusion model. The model outputs the fusion weights w_spec, w_tact, w_force, and w_vis of each modality, which are used to perform weighted fusion or selective masking of the embedding features of each modality, and the fused features are output together with the recognition confidence c and the risk estimate r.

[0115] The fused features can be determined based on the recognition task required for a specific application and the corresponding dexterous hand movement requirements, etc., and the embodiments of the present invention do not limit this.

[0116] The recognition confidence level c can be, for example, the confidence level of the recognition result of the target object's type or material.
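The weighted fusion and selective masking of step 201 could be sketched as follows. In the embodiment the fusion weights are produced by the fusion model itself; here, as a simplifying assumption, they are obtained by masking modalities below a threshold and normalizing the remaining quality scores. The function names, the masking threshold, and the requirement that all embeddings share one dimension are also assumptions of this example.

```python
from typing import Dict, List


def fusion_weights(quality: Dict[str, float],
                   mask_below: float = 0.2) -> Dict[str, float]:
    """Mask low-quality modalities and normalize the rest to sum to 1."""
    kept = {m: q for m, q in quality.items() if q >= mask_below}
    total = sum(kept.values())
    if total <= 0.0:
        raise ValueError("no modality passes the quality gate")
    # Masked modalities receive weight 0.0 so they drop out of the fusion.
    return {m: kept.get(m, 0.0) / total for m in quality}


def fuse(embeddings: Dict[str, List[float]],
         weights: Dict[str, float]) -> List[float]:
    """Weighted sum of per-modality embeddings of equal dimension."""
    dim = len(next(iter(embeddings.values())))
    fused = [0.0] * dim
    for modality, vec in embeddings.items():
        if len(vec) != dim:
            raise ValueError("all embeddings must share one dimension")
        w = weights.get(modality, 0.0)
        for k, v in enumerate(vec):
            fused[k] += w * v
    return fused
```

For example, with quality scores `{"spec": 0.75, "tact": 0.25, "vis": 0.1}` the visual modality is masked and the fused feature is dominated by the spectral embedding. A learned gate or cross-modal attention layer would replace `fusion_weights` in a trained model.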

[0117] In step 202, a quality check is performed on each modal data to determine whether the closed-loop resampling process is triggered; if so, step 203 is executed; otherwise, step 206 is executed.

[0118] Specifically, when the quality of data in a certain modality deteriorates but the quality of data in other modalities is relatively high, a flexible strategy of "de-weighting and continuing inference" can be implemented to maintain the continuity of the closed loop; when the quality of key modal data continues to be below the threshold or the identification confidence c is insufficient to support subsequent actions, the closed loop resampling process is triggered.

[0119] In step 203, the closed-loop resampling process is triggered, which proceeds as condition improvement → resampling → verification / rollback. Condition improvement may include, for example, adjusting fingertip posture / incident angle, adjusting clamping force or impedance parameters, changing the contact surface, or adjusting spectral integration time and / or gain.

[0120] Among these, "condition improvement" refers to targeted adjustments to key conditions affecting the reliability of sampling and recognition based on the current sources of quality degradation. For example, if the quality score q_spec of spectral data is low and exhibits underexposure, saturation, or insufficient echo intensity, then priority should be given to improving spectral sampling conditions, including adjusting fingertip posture, incident angle, relative distance, integration time, gain, or local supplemental lighting. If the quality score q_tact of tactile feature data is low and exhibits unstable contact distribution, local failure, or enhanced slippage, then priority should be given to improving contact conditions, including adjusting clamping force, contact position, contact area, or fingertip micro-movement direction. If the quality score q_force of mechanical data is low and exhibits impact, oscillation, or excessive force fluctuations, then priority should be given to improving dynamic conditions, including reducing approach speed, adjusting impedance stiffness / damping, or extending steady-state holding time. If the quality score q_vis of visual feature data is low, then priority should be given to improving visual observation conditions, including adjusting the viewing angle, ROI (Region of Interest), exposure parameters, supplemental lighting, or target relocation.

[0121] Specifically, based on the quality score vector, anomaly type label, and current operation stage, the system can automatically select the condition category that needs improvement and the corresponding action.

[0122] In step 204, several frames of data are resampled and the quality score is recalculated.

[0123] In step 205, it is determined whether the quality meets the standard; if so, step 206 is executed; otherwise, the process returns to step 203, or the grasp is terminated to avoid the risk of slippage or pressure damage.

[0124] In step 206, the high-level semantic results and control-related quantities for closed-loop operation are output.

[0125] Through the coupling mechanism of "quality scoring → fusion weights → action scheduling → resampling", the output of the multimodal fusion model can reliably drive the control, and the control action can also in turn ensure the credibility of the measurable conditions and the recognition results of the target object.
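The flow of steps 202 to 206 can be illustrated with a small control loop. In this sketch, `sample` and `improve` are hypothetical callbacks standing in for the acquisition layer and the condition-improvement actions, and the threshold, the retry limit, and the rule of improving the worst failing key modality first are assumptions; the embodiment selects the action from the quality score vector, anomaly type label, and operation stage.

```python
from typing import Callable, Dict, Sequence, Tuple


def closed_loop_resample(sample: Callable[[], Dict[str, float]],
                         improve: Callable[[str], None],
                         key_modalities: Sequence[str],
                         threshold: float = 0.6,
                         max_attempts: int = 3) -> Tuple[str, Dict[str, float]]:
    """Retry sampling until key modalities pass, or give up.

    Returns ("proceed", quality) when every key modality meets the
    threshold, or ("abort", quality) after max_attempts failed retries.
    """
    quality = sample()
    attempts = 0
    while True:
        # Steps 202 / 205: quality check on the key modalities.
        failing = [m for m in key_modalities if quality[m] < threshold]
        if not failing:
            return "proceed", quality      # step 206: output results
        if attempts == max_attempts:
            return "abort", quality        # terminate the grasp
        worst = min(failing, key=lambda m: quality[m])
        improve(worst)                     # step 203: condition improvement
        quality = sample()                 # step 204: resample and rescore
        attempts += 1
```

Modalities that are not listed in `key_modalities` never trigger a retry here, which corresponds to the "de-weighting and continuing inference" branch in which the fusion simply relies less on them.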

[0126] Continuing to refer to Figure 1, in step 104, the high-level semantic results and control-related quantities are transformed into executable closed-loop control behaviors to control the dexterous hand movements.

[0127] For example, the target object category and / or material identification results, their confidence scores, modal quality scores, and grasping stability-related quantities (such as slip risk r, stability margin, etc.) output by the multimodal fusion model can be transformed into executable closed-loop control behaviors, forming a backtrackable and verifiable operation link of "identification-control-re-identification / re-control".

[0128] In a non-limiting embodiment, a layered architecture of "state machine supervisor + multi-mode controller" can be adopted: the state machine supervisor is responsible for event-driven switching and scheduling between different operation stages, and the multi-mode controller is responsible for stable execution in each stage using control modes such as position control, force control or impedance control, and updates the target force and impedance parameters and micro-motion strategy online according to the upper-level scheduling instructions.

[0129] Referring to Figure 3, which is a schematic diagram of a state change process of a state machine in an embodiment of the present invention, each state represents an operation stage, as detailed below:

[0130] S1 represents the task input stage, where the current grabbing / sorting / assembly task and constraints are received and threshold and strategy parameters are initialized before proceeding to S2.

[0131] S2 represents the target approach stage, in which visual guidance is used to locate the target, estimate the posture, and pre-shape the fingers to approach the target, and then proceed to S3.

[0132] S3 represents the pre-contact stage, in which a light touch is established using compliant impedance or position control and tactile / mechanical signals are collected, providing conditions for subsequent steady-state determination and sampling triggering.

[0133] S4 represents the steady-state determination stage, in which the state machine determines whether the stable contact window conditions are met based on windowed tactile / mechanical indicators. If the conditions are not met (No), it can return to the S2 target approach stage (or fine-tune the contact angle, contact pressure, and fingertip posture and hold briefly in the pre-contact stage before making the determination again) to improve contact and optical path conditions. If the conditions are met (Yes), it enters S5.

[0134] S5 represents the event-triggered spectral acquisition stage, which triggers spectral acquisition under steady-state conditions and records sampling metadata and a spectral quality score q_spec, to avoid acquiring underexposed, saturated, or low signal-to-noise-ratio spectral frames when contact is unstable or the optical path is misaligned. When the spectral quality score q_spec falls below a preset threshold, the state machine can revert to the adjustment path corresponding to S3 pre-contact or S4 steady-state determination and retry sampling after changing the orientation, adjusting the force, or briefly holding steady, thereby reducing the probability of low-quality spectra entering fusion inference and causing misjudgments. After completing S5, it proceeds to S6. It should be noted that in the S5 stage, the state machine triggers focused sampling of key low-frequency modalities once steady-state contact, optical path conditions, and quality constraints are satisfied. It can preferentially trigger spectral acquisition and record the corresponding sampling metadata and quality score. Meanwhile, high-frequency modal data such as tactile, mechanical, and auxiliary state data can continue to be acquired continuously or periodically, while visual modal data can be synchronously updated or nearest-neighbor matched as needed, thereby forming multimodal aligned samples corresponding to this key sampling.

[0135] S6 represents the multimodal fusion recognition stage, in which aligned spectral, tactile, and mechanical (optionally visual) segments are input into the multimodal fusion model, which outputs the material and / or category of the target object and the confidence level c, and simultaneously outputs control-related quantities such as slip risk r, stability margin, and pressure damage risk; the process then proceeds to S7.

[0136] S7 represents the strategy selection stage, in which the grasping method and control parameters are selected based on the identification results, confidence level, and quality scores, and it is determined whether re-verification is required and under what triggering conditions; the process then proceeds to S8.

[0137] S8 represents the closed-loop gripping execution stage. During gripping and handling, force control / impedance control is the primary mode, maintaining safety constraints while continuously monitoring signals such as slip risk r, abnormal pressure distribution, and sudden force changes. If necessary, adaptive gripping force adjustment, finger position fine-tuning, impedance parameter switching, or micro-motion strategies are performed within S8 to form an inner closed loop. After gripping, the process enters S9.

[0138] S9 represents the post-grab verification stage. During stable holding or handling, spectral / tactile verification is triggered again to verify recognition consistency and stability. If the verification fails, the process will revert to the S1 task input (or return to the approach / pre-contact stage to retry) according to the "failure" path shown in the figure to reconfirm and adjust the strategy. If the verification is successful, the process will proceed to S10.

[0139] S10 represents the subsequent operation stage, in which tasks such as sorting and delivery, assembly and insertion, transfer or stripping are performed.

[0140] Through the state machine scheduling of S1 to S10 and the collaboration of the multi-mode controller, the entire mapping from multimodal fusion results to grasping control forms a closed loop that can be rolled back and verified, enhancing the robustness of the system under dynamic contact, complex materials, and uncertain environments.
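The stage transitions of S1 to S10 can be written down as a small transition function. The sketch below is one possible reading of the description: the enum member names are hypothetical, the rollback target for a failed S5 is taken to be S3 (the text also allows S4), and a failed S9 returns to S1 (the text also allows the approach or pre-contact stage).

```python
from enum import Enum


class Stage(Enum):
    S1_TASK_INPUT = 1
    S2_TARGET_APPROACH = 2
    S3_PRE_CONTACT = 3
    S4_STEADY_STATE_CHECK = 4
    S5_SPECTRAL_ACQUISITION = 5
    S6_FUSION_RECOGNITION = 6
    S7_STRATEGY_SELECTION = 7
    S8_GRIP_EXECUTION = 8
    S9_POST_GRAB_VERIFICATION = 9
    S10_SUBSEQUENT_OPERATION = 10


# Rollback targets for the stages that carry a check (assumed choices
# among the alternatives the description allows).
_ON_FAILURE = {
    Stage.S4_STEADY_STATE_CHECK: Stage.S2_TARGET_APPROACH,
    Stage.S5_SPECTRAL_ACQUISITION: Stage.S3_PRE_CONTACT,
    Stage.S9_POST_GRAB_VERIFICATION: Stage.S1_TASK_INPUT,
}


def next_stage(stage: Stage, ok: bool = True) -> Stage:
    """Return the stage that follows `stage`.

    `ok` is the outcome of the stage's check and is only consulted for
    S4, S5, and S9; other stages always advance. S10 is terminal.
    """
    if stage is Stage.S10_SUBSEQUENT_OPERATION:
        return stage
    if not ok and stage in _ON_FAILURE:
        return _ON_FAILURE[stage]
    return Stage(stage.value + 1)
```

Keeping the rollback targets in a table makes it straightforward to merge, omit, or add states for a particular application without touching the transition logic.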

[0141] It should be noted that the state change flow shown in Figure 3 is merely an exemplary state machine scheduling method in this embodiment of the invention, used to illustrate the stage switching relationship between multimodal fusion sensing results and closed-loop control behavior, and is not the only limitation of the invention. Under different application scenarios, target objects, execution tasks, or hardware configurations, the division of operation stages, state entry conditions, rollback conditions, verification triggering timing, and state transition paths can all be adaptively adjusted or tailored. For example, in some applications, the verification state can be omitted, the pre-contact state and the steady-state determination state can be merged, or extended states such as anomaly handling states and handling monitoring states can be added; all of these should fall within the protection scope of this invention.

[0142] Accordingly, embodiments of the present invention also provide a multimodal fusion sensing and control device for a robot dexterous hand. Figure 4 is a structural schematic of the device.

[0143] In this embodiment, the multimodal fusion sensing and control device 400 for a robot dexterous hand includes the following modules:

[0144] The data acquisition module 401 is used to acquire multimodal data during the operation of the dexterous hand, add timestamps to the multimodal data based on the same time base, write the multimodal data into a circular buffer and sort it by time;

[0145] Preprocessing module 402 is used to preprocess and evaluate the quality of the multimodal data to obtain standardized inputs and quality scores for each modality used as inputs to the multimodal fusion model.

[0146] The fusion module 403 is used to input the standardized inputs and quality scores of each modality, as well as the current operation stage, into the multimodal fusion model. The quality scores are used as fusion weights or gating conditions. Based on the output of the multimodal fusion model, high-level semantic results and control-related quantities for closed-loop operation are obtained.

[0147] The control module 404 is used to convert high-level semantic results and control-related quantities into executable closed-loop control behaviors to control the movements of the dexterous hand.

[0148] In a non-limiting embodiment, the data acquisition module 401 may include any one or more of the following information acquisition units: a spectral information acquisition unit, a tactile information acquisition unit, a visual information acquisition unit, and a mechanical information acquisition unit. Further, it may also include an auxiliary state information acquisition unit. Wherein:

[0149] The spectral information acquisition unit is used to acquire the reflection and transmission spectral data of the target object within a set filter range. The spectral information acquisition unit can preferably be a miniaturized chip-level spectrometer or another type of miniaturized spectrometer, such as the spectrometers described in patent documents CN116893148A, CN116539155B, CN118885713B, and CN118896915B, but the embodiments of the invention are not limited thereto.

[0150] The tactile information acquisition unit is used to acquire tactile feature data output from the tactile array of the dexterous hand. The tactile feature data includes any one or more of the following: contact pressure distribution, contact area, contact center, shear force, vibration force, etc. The tactile array can be located at the fingertips and / or finger pads of the dexterous hand;

[0151] The visual information acquisition unit is used to acquire visual feature data of the target object. This visual feature data may include, for example, RGB images, depth images, or point cloud data of the target object; for instance, an RGB camera may be used to acquire RGB images, a depth camera to acquire depth images, and a structured light camera to acquire point cloud data. Using this visual feature data, the appearance and geometric information of the target object can be determined, which can then be used for target detection, grasp point and / or pose estimation, occlusion detection, and providing distance and angle-of-incidence constraints for non-contact spectral sampling.

[0152] The mechanical information acquisition unit is used to collect mechanical data directly related to gripping stability. This mechanical data may include, but is not limited to, any one or more of the following: force, torque, normal force, tangential force, etc. The mechanical data may originate from a six-axis force / torque sensor mounted on the wrist of the dexterous hand, a miniature force sensor mounted on the fingertip, or equivalent mechanical quantities estimated from the joint motor current / torque. The mechanical information acquisition unit can acquire mechanical data at a relatively high operating frequency. Furthermore, it can perform basic filtering on the mechanical signals and mark anomalies such as impact contact, oscillation, and saturation, so that the subsequent control inner loop can achieve anti-slip control, force limiting, and suppression of pressure damage risk.

[0153] The auxiliary state information acquisition unit is used to collect auxiliary state data of the dexterous hand. This auxiliary state data may include, but is not limited to, any one or more of the following: joint encoder readings, motor current, temperature, and inertial information. The auxiliary state information is used to characterize the kinematics and operating state of the dexterous hand, providing contextual constraints for multimodal fusion. For example, the joint encoder provides pose states such as joint angles and velocities to correlate contact position and posture changes; motor current is used to estimate joint load and friction state; temperature is used to compensate for spectral drift and tactile zero-point drift; and the IMU (Inertial Measurement Unit) is used to compensate for dynamic disturbances caused by rapid movement.

[0154] It should be noted that different information acquisition units can be triggered to perform data acquisition by corresponding drive modules, and the operating frequencies of different information acquisition units can be different. Typically, the tactile information acquisition unit and the force sensors operate at higher frequencies, while the spectrometer operates at a lower frequency. For example, the tactile information acquisition unit can read the raw data or feature data of the tactile array at a higher frequency and perform basic synchronization marking and buffer management on the acquired tactile data.

[0155] Furthermore, the same information acquisition unit can also use different sampling frequencies at different operation stages, i.e., it can operate in a variable-frequency mode. For example, the visual information acquisition unit can be preset with two modes: a fast observation mode and a high-quality sampling mode. In the approach and pre-contact stages, the visual information acquisition unit acquires low-resolution or ROI image / point cloud data at a higher frame rate for target tracking, distance estimation, occlusion judgment, and incident angle constraint, and outputs a visual quality score q_vis (based on, for example, sharpness, exposure, occlusion ratio, and effective field of view). After entering the steady-state sampling stage or the verification stage, the visual information acquisition unit switches to the high-quality sampling mode with high resolution, full field of view, and multi-frame fusion to improve the reliability of pose estimation and category priors. Furthermore, when the visual quality score q_vis is lower than a set threshold, the control module 404 can also trigger the visual information acquisition unit to adjust the viewing angle, add supplemental lighting, reposition, or switch the ROI strategy, and then resample.

[0156] The approach phase refers to the stage where the dexterous hand moves towards the target under visual guidance, but effective contact has not yet been detected by tactile or mechanical signals. The pre-contact phase refers to the stage where the end effector has entered the contact establishment area, and the tactile array or normal force signal begins to show a low-amplitude response, but a contact state that meets the stable sampling conditions has not yet been formed. The verification phase refers to the stage where, at a preset time after completing a recognition, grasping, or handling process, the system resamples the target using spectral, visual, tactile, or combined methods to verify the consistency of the previous recognition results, confirm whether the current contact or holding state still meets the requirements for subsequent operations, and decide whether to continue execution, resample, adjust control parameters, or retry.

[0157] In some embodiments, the spectral information acquisition unit may employ a conventional fixed-period triggering method.

[0158] In some embodiments, the spectral information acquisition unit may also employ an adaptive triggering mechanism jointly driven by a state machine supervisor and multi-source events. The state machine supervisor may be located in the control module 404. Specifically, the state machine supervisor comprehensively considers the current task stage, visual guidance results, tactile contact state, mechanical stability indicators, and the trend of optical echo intensity / light intensity changes to determine whether to trigger the corresponding sampling mode. The multi-source events include at least:

[0159] (1) Visual events: the target enters the preset field of view, the target pose estimation converges, the relative distance between the end effector and the target enters the threshold range, and the incident angle meets the preset constraints;

[0160] (2) Contact events: The tactile array detects that the contact area is greater than the threshold, the contact center is stable, and the normal force reaches the pre-contact threshold;

[0161] (3) Stable events: Normal force fluctuation, tangential force fluctuation, contact center drift velocity, and vibration amplitude are less than the corresponding threshold within a continuous preset time window;

[0162] (4) Optical events: echo intensity or integrated light intensity enters the effective range, saturation ratio is lower than the threshold, signal-to-noise ratio is higher than the threshold;

[0163] (5) Abnormal events: impact contact, increased risk of slippage, underexposure / saturation of spectrum, and aggravated visual obstruction.
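
As a non-limiting illustration, the stability event in item (3) could be evaluated over a window of recent samples roughly as follows. This is only a sketch: the function names, the dictionary keys, and the use of peak-to-peak range as the fluctuation measure are assumptions made for the example and are not specified in this description.

```python
def peak_to_peak(values):
    """Fluctuation measure assumed for this sketch: max minus min over the window."""
    return max(values) - min(values)


def stability_event(window, thresholds):
    """Return True when every monitored signal stays below its threshold.

    `window` maps a signal name to the samples collected during the preset
    time window; `thresholds` maps the same names to their limits.
    Both layouts are hypothetical.
    """
    required = ("normal_force", "tangential_force", "center_drift_speed", "vibration")
    # An empty or missing signal cannot confirm stability.
    if any(not window.get(name) for name in required):
        return False
    return (
        peak_to_peak(window["normal_force"]) < thresholds["normal_force"]
        and peak_to_peak(window["tangential_force"]) < thresholds["tangential_force"]
        and max(abs(v) for v in window["center_drift_speed"]) < thresholds["center_drift_speed"]
        and max(abs(v) for v in window["vibration"]) < thresholds["vibration"]
    )
```

The supervisor would call such a check on each new window and raise the stability event only when it returns True.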

[0164] For example, during the target approach phase, when the visual feature data indicates that the target has entered the sampleable distance and the incident angle is close to the preset range, the light intensity optimization mode is triggered first. In this mode, only the absolute light intensity, or the echo and integrated intensity, is collected, and the light intensity is fine-tuned by adjusting the fingertip posture, relative distance, incident and / or exit angles, and clamping force until it reaches the preset optimal or stable range, thereby completing optical path alignment and exposure condition optimization. Once the tactile feature data and mechanical data confirm that a stable light touch has been formed and the echo intensity has entered the target range, the full-spectrum sampling mode is triggered. In this mode, one or more frames of spectral data are collected for identification over a longer integration time, and a spectral quality score is generated for each frame. If the quality score is insufficient after full-spectrum sampling, an abnormal event triggers a fallback to posture adjustment or force adjustment, and resampling is performed after the conditions improve. The absolute light intensity, echo, and integrated intensity can be measured by independent photodetectors located at or near the fingertip, on the end effector, or coaxially / nearly coaxially with the sampling optical path, or obtained by effective-band integration counting from a miniaturized spectrometer. Both the photodetector and the spectrometer are integrated or installed on the side of the dexterous hand near the target, and together with the light-emitting / illumination unit, they form a local spectral sampling assembly.
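
The two-stage triggering in this example can be summarized, purely as an illustrative sketch, by a small decision function. The enum members, the argument names, and the default quality threshold of 0.6 are hypothetical; the description itself does not fix any of these values.

```python
from enum import Enum


class SamplingMode(Enum):
    IDLE = "idle"
    INTENSITY_OPTIMIZATION = "intensity_optimization"
    FULL_SPECTRUM = "full_spectrum"
    FALLBACK_ADJUST = "fallback_adjust"


def select_mode(visual_ok, light_touch_stable, echo_in_range,
                last_quality=None, min_quality=0.6):
    """Pick the next sampling mode from the current event flags.

    visual_ok          -- target within sampleable distance and incident angle near range
    light_touch_stable -- tactile and mechanical data confirm a stable light touch
    echo_in_range      -- echo intensity has entered the target range
    last_quality       -- quality score of the latest full-spectrum frame, if any
    min_quality        -- assumed acceptance threshold (hypothetical value)
    """
    # An insufficient score after full-spectrum sampling forces a fallback
    # to posture or force adjustment before resampling.
    if last_quality is not None and last_quality < min_quality:
        return SamplingMode.FALLBACK_ADJUST
    if light_touch_stable and echo_in_range:
        return SamplingMode.FULL_SPECTRUM
    if visual_ok:
        return SamplingMode.INTENSITY_OPTIMIZATION
    return SamplingMode.IDLE
```

Ordering the checks this way means a poor-quality frame always takes priority over the contact and optical conditions, which is one possible reading of the fallback behavior described above.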

[0165] The intensity optimization mode is primarily used to complete optical path alignment, attitude optimization, distance adjustment, and exposure condition optimization before formal identification sampling. Its output is preferably used as the basis for sampling condition determination, quality assessment, and control adjustment, and is usually not directly used as the final discrimination input for the material identification model. The full-spectrum sampling mode collects one or more frames of full-spectrum data as the main identification input for the subsequent fusion model. In an optional implementation, the data collected during the intensity optimization stage can also be used as auxiliary contextual information input to the fusion model to characterize sampling conditions and signal reliability, but it does not replace the main identification function of the full-spectrum sampling data.

[0166] In this embodiment of the invention, the operation stage can be determined by the state machine supervisor in the control module 404 based on the relative distance, the tactile threshold, the normal force threshold, and the associated stability conditions.

[0167] It should be noted that, to address the timing mismatch caused by differences in sampling frequency and delay across modalities, a unified time base and synchronization mechanism can be established. Accordingly, data from each acquisition channel is tagged with a hard or soft timestamp during acquisition and written to the multimodal circular buffer. The methods for adding hard and soft timestamps have been explained in detail previously and will not be repeated here.

[0168] The unified time base can be provided by a master clock module. The master clock module may be implemented by the main control computing unit or by an independent timing controller; it provides a monotonically increasing high-precision system clock and distributes synchronization signals or trigger signals to each acquisition channel.

[0169] For information acquisition units that support external trigger synchronization (such as spectrometers and cameras), hard timestamps are applied: the timing controller outputs a hardware trigger pulse and writes the trigger time t_trig into the data frame. For example, hard timestamps are added for spectral information acquisition units that have an exposure / integration process. In addition, metadata such as the integration time T_int, gain, readout delay Δ_read, and saturation flag can be recorded. This allows the effective timestamp t_spec of the spectral frame to be calculated (e.g., as the exposure midpoint t_trig + T_int / 2, or as the exposure end time), and t_spec, rather than the data arrival time, is used as the time reference during alignment.
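
As a minimal sketch of the calculation just described, the effective timestamp can be derived from the trigger time and the integration time. The class name, field names, and the `reference` argument are assumptions introduced for illustration; only the two formulas (exposure midpoint and exposure end) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpectralFrameMeta:
    """Per-frame metadata; field names are hypothetical."""
    t_trig: float            # hardware trigger time, in seconds on the unified time base
    t_int: float             # integration time T_int, in seconds
    gain: float = 1.0
    read_delay: float = 0.0  # readout delay, recorded but not used in t_spec here
    saturated: bool = False


def effective_timestamp(meta, reference="midpoint"):
    """Return t_spec for a spectral frame.

    "midpoint" gives t_trig + T_int / 2; "end" gives t_trig + T_int.
    """
    if reference == "midpoint":
        return meta.t_trig + meta.t_int / 2.0
    if reference == "end":
        return meta.t_trig + meta.t_int
    raise ValueError(f"unknown reference: {reference!r}")
```

For instance, a frame triggered at 10.0 s with a 0.5 s integration time has a midpoint timestamp of 10.25 s, regardless of when the frame actually arrives at the host.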

[0170] For information acquisition units that do not support external trigger synchronization, a soft timestamp can be assigned at the driver layer based on the data reception time, and the fixed bias and the upper bound on jitter of the channel can be obtained through offline or online calibration for delay compensation.
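
One possible way to obtain the fixed bias and jitter bound and to apply the compensation is sketched below. It assumes that, during calibration, each soft-timestamped sample can be paired with a reference time on the unified time base (for example, from a known stimulus); that pairing, the use of the median as the bias estimate, and the function names are assumptions of this sketch.

```python
from statistics import median


def calibrate_channel(arrival_times, reference_times):
    """Estimate a channel's fixed delay bias and its jitter upper bound.

    arrival_times   -- soft timestamps recorded at the driver layer
    reference_times -- true event times on the unified time base (assumed available)
    """
    if not arrival_times or len(arrival_times) != len(reference_times):
        raise ValueError("need equally sized, non-empty calibration lists")
    delays = [a - r for a, r in zip(arrival_times, reference_times)]
    bias = median(delays)                         # robust estimate of the fixed delay
    jitter = max(abs(d - bias) for d in delays)   # worst observed deviation from the bias
    return bias, jitter


def compensate(t_arrival, bias):
    """Shift a soft timestamp back by the calibrated fixed delay."""
    return t_arrival - bias
```

The jitter bound is not removed by compensation; it can be kept as channel metadata so that later alignment steps know how much residual timing uncertainty to tolerate.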

[0171] The information acquisition unit of each acquisition channel writes the acquired modal data, along with timestamps and metadata, into the circular buffer of the corresponding channel, sorted by time. In this embodiment, the circular buffer is a data cache structure in which data is written sequentially in chronological order and the oldest data is overwritten once the buffer's storage space is full. This structure is used to continuously store multimodal sampling data from the most recent period within limited storage resources, supporting online time alignment, sliding window sampling, and real-time inference.

[0172] It should be noted that the circular buffer can hold the raw data of each modality, or a combination of raw data and preprocessed feature data, stored in association with metadata such as timestamps, sampling modes, integration times, gain, quality scores, and operation stages. For example, for high-bandwidth channels, the corresponding circular buffer can retain raw data from the most recent time period; for low-bandwidth channels or channels with extracted stable features, the corresponding circular buffer can simultaneously store raw data and feature data to balance real-time performance and traceability.
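
A per-channel buffer with this behavior could be sketched as follows. The class and method names are hypothetical, and the sorted-list implementation is chosen for brevity; an embedded implementation would more likely use a preallocated fixed-size array with head and tail indices.

```python
import bisect
from itertools import count


class TimestampedRingBuffer:
    """Keeps the most recent `capacity` entries of one channel, ordered by timestamp."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = []      # sorted list of (timestamp, seq, data, meta)
        self._seq = count()   # tie-breaker so payloads are never compared

    def write(self, timestamp, data, meta=None):
        """Insert an entry in time order; drop the oldest entry when full."""
        entry = (timestamp, next(self._seq), data, meta or {})
        bisect.insort(self._items, entry)
        if len(self._items) > self.capacity:
            self._items.pop(0)  # the entry with the oldest timestamp is overwritten

    def window(self, t_start, t_end):
        """Return (timestamp, data, meta) for entries with t_start <= t <= t_end."""
        lo = bisect.bisect_left(self._items, (t_start,))
        hi = bisect.bisect_right(self._items, (t_end, float("inf")))
        return [(t, d, m) for t, _, d, m in self._items[lo:hi]]
```

Inserting by timestamp rather than by arrival order lets a late-arriving frame (for example, a spectral frame with a long readout delay) still land in its correct position for sliding-window queries.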

[0173] Through the above architecture design, the acquisition layer to which the data acquisition module 401 belongs can reliably acquire multimodal sensing data, unify time organization and structured output, avoid the timing inconsistency and uncontrollable fusion caused by the independent processing of spectral, tactile and mechanical signals, and provide a stable data foundation for the subsequent preprocessing layer, fusion model layer and hierarchical closed-loop control layer.

[0174] In this embodiment of the invention, the preprocessing module 402 is located in the preprocessing layer. It corrects, filters, normalizes, and evaluates the quality of raw data from multiple modalities, such as spectral, tactile, and mechanical data, generates standardized inputs that can be used for training and online inference of the multimodal fusion model, and outputs a quality score for each modality to support subsequent fusion weight allocation, gating selection, and anomaly handling in closed-loop control.
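
To illustrate how per-modality quality scores might feed fusion weight allocation and gating, the following sketch gates out low-quality modalities and normalizes the remaining scores into weights. The gating threshold of 0.3 and the simple proportional weighting are assumptions of this example; the fusion model described here may instead learn its gating or use cross-modal attention.

```python
def fusion_weights(quality, gate_threshold=0.3):
    """Turn per-modality quality scores into normalized fusion weights.

    quality        -- dict mapping modality name to a score in [0, 1]
    gate_threshold -- assumed cut-off below which a modality is excluded
    """
    gated = {m: q for m, q in quality.items() if q >= gate_threshold}
    total = sum(gated.values())
    if total == 0:
        # No modality is trustworthy; the caller would typically trigger resampling.
        return {m: 0.0 for m in quality}
    return {m: gated.get(m, 0.0) / total for m in quality}
```

With scores of 0.5 for spectral, 0.5 for tactile, and 0.1 for visual data, the visual channel is gated out and the other two each receive a weight of 0.5.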

[0175] Different preprocessing methods can be used for sampling data of different modalities. For details, please refer to the description in the embodiments of the present invention, which will not be repeated here.

[0176] By preprocessing the sampling data of different modalities, the scale differences, noise and drift effects caused by different sensors can be eliminated, and the interference of non-ideal factors in the motion contact process on recognition and control can be reduced, so that the multimodal fusion model can obtain more stable and more transferable feature representation.

[0177] In this embodiment of the invention, the input of the multimodal fusion model can be generated by a sliding window. High-frequency tactile feature data and mechanical data within the window can be downsampled, aggregated into features, or statistically encoded. Low-frequency spectral data and visual feature data can be aligned with the window using nearest neighbor matching, interpolation, or resampling. At the same time, metadata such as trigger mode and exposure parameters are retained as quality assessment and gating conditions, so that asynchronous multi-rate data streams can be used for training, inference, and scheduling on a unified time axis.
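
The window construction can be sketched as below, using statistical encoding for one high-frequency channel and nearest neighbor matching for one low-frequency channel. The function names, the choice of statistics, the use of the window end as the matching reference, and the `max_age` staleness limit are all assumptions of this sketch.

```python
from statistics import mean, pstdev


def aggregate_high_rate(samples, t_start, t_end):
    """Statistically encode (timestamp, value) samples that fall inside the window."""
    vals = [v for t, v in samples if t_start <= t <= t_end]
    if not vals:
        return None
    return {"mean": mean(vals), "std": pstdev(vals),
            "min": min(vals), "max": max(vals), "count": len(vals)}


def nearest_low_rate(frames, t_ref, max_age):
    """Nearest neighbor matching: pick the frame closest in time to t_ref.

    Returns None when the closest frame is further than max_age from t_ref.
    """
    if not frames:
        return None
    t, frame = min(frames, key=lambda tf: abs(tf[0] - t_ref))
    return frame if abs(t - t_ref) <= max_age else None


def build_window_input(force_samples, spectral_frames, t_end, window_len, max_age):
    """Assemble one model input for the window [t_end - window_len, t_end]."""
    t_start = t_end - window_len
    return {
        "t_start": t_start,
        "t_end": t_end,
        "force": aggregate_high_rate(force_samples, t_start, t_end),
        "spectrum": nearest_low_rate(spectral_frames, t_end, max_age),
    }
```

A `None` entry signals that a modality has no usable data for the window, which a quality-driven gate can treat as a zero-quality input.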

[0178] In this embodiment of the invention, a non-limiting structure of the control module 404 may include a state machine supervisor and a multi-mode controller. The state machine supervisor is used for event-driven switching and scheduling between different operation stages; the multi-mode controller is used to achieve stable execution in each stage using control modes such as position control, force control, or impedance control, and to update the target force, impedance parameters, and micro-motion strategy online according to upper-level scheduling instructions. For the state changes and control methods in different operation stages, refer to Figure 3 and the related examples and descriptions, which will not be repeated here.
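
The division of labor between the supervisor and the multi-mode controller can be illustrated with a small event-driven transition table. The stage names, event names, transitions, and the control mode assigned to each stage below are hypothetical placeholders chosen for the sketch; the actual state changes are those described with reference to Figure 3.

```python
from enum import Enum, auto


class Stage(Enum):
    APPROACH = auto()
    PRE_CONTACT = auto()
    SAMPLING = auto()
    GRASP = auto()
    VERIFY = auto()


# (current stage, event) -> next stage; illustrative only.
TRANSITIONS = {
    (Stage.APPROACH, "contact_detected"): Stage.PRE_CONTACT,
    (Stage.PRE_CONTACT, "contact_stable"): Stage.SAMPLING,
    (Stage.SAMPLING, "recognized"): Stage.GRASP,
    (Stage.SAMPLING, "quality_low"): Stage.PRE_CONTACT,   # fall back and re-establish contact
    (Stage.GRASP, "grasp_stable"): Stage.VERIFY,
    (Stage.VERIFY, "verify_failed"): Stage.SAMPLING,      # resample on inconsistency
}

# Control mode the multi-mode controller would run in each stage (assumed mapping).
CONTROL_MODE = {
    Stage.APPROACH: "position",
    Stage.PRE_CONTACT: "impedance",
    Stage.SAMPLING: "force",
    Stage.GRASP: "force",
    Stage.VERIFY: "impedance",
}


class Supervisor:
    """Event-driven stage switching; unknown events leave the stage unchanged."""

    def __init__(self):
        self.stage = Stage.APPROACH

    def on_event(self, event):
        self.stage = TRANSITIONS.get((self.stage, event), self.stage)
        return self.stage, CONTROL_MODE[self.stage]
```

Keeping the transitions in a table rather than in nested conditionals makes the backward edges (fallback and resampling) explicit, which matches the reversible nature of the closed-loop process.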

[0179] The multimodal fusion perception and control method and device for robot dexterous hands provided in this invention comprise an acquisition layer, a preprocessing layer, a fusion model layer, and a decision and control layer, which form a closed-loop link with the dexterous hand actuator. The acquisition layer is responsible for synchronously acquiring, uniformly timestamping, and organizing multi-source information such as vision, spectrum, touch, mechanics, and pose. The preprocessing layer corrects, filters, and normalizes the data from each modality and generates quality scores. The fusion model layer jointly represents and infers the multimodal features to output recognition results and control-related quantities. The decision and control layer, based on the fusion output and the quality and / or risk indicators, implements state machine scheduling and multi-mode control execution, thereby completing a reversible closed-loop process of "approach → non-contact spectrum acquisition / light intensity optimization → contact → full-spectrum sampling → recognition → grasping → verification → subsequent operation" and achieving efficient and accurate coordination of control and recognition.

[0180] It should be understood that the term "and / or" in this article is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, or B existing alone. Additionally, the character " / " in this article indicates that the preceding and following related objects have an "or" relationship.

[0181] In this application's embodiments, "multiple" refers to two or more. The descriptions of "first," "second," etc., appearing in this application's embodiments are merely illustrative and for distinguishing the described objects; they have no order and do not indicate a specific limitation on the number of devices in this application's embodiments, nor do they constitute any limitation on the embodiments of this application.

[0182] While the present invention has been disclosed above, it is not limited thereto. Any person skilled in the art can make various modifications and alterations without departing from the spirit and scope of the invention; therefore, the scope of protection of the present invention should be determined by the scope defined in the claims.

Claims

1. A multimodal fusion sensing and control method for a robot dexterous hand, characterized in that the method includes: collecting multimodal data during operation of the dexterous hand, adding timestamps and metadata to the multimodal data based on the same time base, writing the multimodal data into a circular buffer, and sorting it by time; preprocessing the multimodal data and evaluating its quality to obtain, for each modality, a standardized input and a quality score for use as inputs to a multimodal fusion model; inputting the standardized inputs and quality scores of each modality, together with the current operation stage, into the multimodal fusion model; obtaining, based on the output of the multimodal fusion model, high-level semantic results and control-related quantities for closed-loop operation; and transforming the high-level semantic results and control-related quantities into executable closed-loop control behaviors to control the movements of the dexterous hand.

2. The multimodal fusion sensing and control method for a robot dexterous hand according to claim 1, characterized in that the multimodal data includes any one or more of the following: spectral data, tactile feature data, visual feature data, mechanical data, and auxiliary state data.

3. The multimodal fusion sensing and control method for a robot dexterous hand according to claim 1, characterized in that the timestamps include hard timestamps and / or soft timestamps.

4. The multimodal fusion sensing and control method for a robot dexterous hand according to claim 2, characterized in that the preprocessing of the multimodal data includes any one or more of the following: subjecting the spectral data to spectral correction and standardization; filtering and characterizing the tactile feature data; filtering and standardizing the mechanical data; and calibrating, correcting, denoising, and standardizing the visual feature data, and generating geometric constraints related to grasping and to the acquisition of spectral data.

5. The multimodal fusion sensing and control method for a robot dexterous hand according to claim 1, characterized in that the quality evaluation of the multimodal data includes: obtaining the quality score of each modality's data based on the validity and / or stability of that modality's data.

6. The multimodal fusion sensing and control method for a robot dexterous hand according to claim 5, characterized in that the multimodal fusion model employs a quality-driven gating and weighted fusion mechanism, and a closed-loop resampling process is triggered when the quality of key modal data remains below a threshold or the recognition confidence is insufficient to support subsequent actions.

7. The multimodal fusion sensing and control method for a robot dexterous hand according to claim 1, characterized in that the output of the multimodal fusion model includes any one or more of the following: the category and / or material of the target object and its confidence level, used for sorting rule matching and strategy selection; operation-related mechanical targets or constraint parameters, used for closed-loop force control and impedance parameter setting; and suggested grasping modes, together with the corresponding finger position adjustments or micro-motion ranges.

8. A multimodal fusion sensing and control device for a robot dexterous hand, characterized in that the device includes: a data acquisition module, used to collect multimodal data during operation of the dexterous hand, add timestamps and metadata to the multimodal data based on the same time base, write the multimodal data into a circular buffer, and sort it by time; a preprocessing module, used to preprocess the multimodal data and evaluate its quality to obtain, for each modality, a standardized input and a quality score for use as inputs to a multimodal fusion model; a fusion module, used to input the standardized inputs and quality scores of each modality, together with the current operation stage, into the multimodal fusion model, wherein the quality scores are used as fusion weights or gating conditions, and to obtain, based on the output of the multimodal fusion model, high-level semantic results and control-related quantities for closed-loop operation; and a control module, used to convert the high-level semantic results and control-related quantities into executable closed-loop control behaviors to control the movements of the dexterous hand.

9. The multimodal fusion sensing and control device for a robot dexterous hand according to claim 8, characterized in that the data acquisition module includes: a spectral information acquisition unit, used to acquire reflection and transmission spectral data of the target object within a set filter range; a tactile information acquisition unit, used to acquire tactile feature data output by the tactile array of the dexterous hand, the tactile feature data including any one or more of the following: contact pressure distribution, contact area, contact center, shear force, and vibration force; a visual information acquisition unit, used to acquire visual feature data of the target object, including RGB images, depth images, or point cloud data of the target object; a mechanical information acquisition unit, used to acquire mechanical data directly related to grasping stability, the mechanical data including any one or more of the following: force, torque, normal force, and tangential force; and an auxiliary status information acquisition unit, used to acquire auxiliary status data of the dexterous hand, the auxiliary status data including any one or more of the following: current, temperature, and inertial information of the joint encoder and / or motor.

10. The multimodal fusion sensing and control device for a robot dexterous hand according to claim 9, characterized in that the sampling frequency of the same information acquisition unit may be the same or different at different operation stages.

11. The multimodal fusion sensing and control device for a robot dexterous hand according to claim 9, characterized in that the spectral information acquisition unit uses an adaptive triggering mechanism driven by a state machine supervisor and multi-source events to acquire the spectral data.

12. The multimodal fusion sensing and control device for a robot dexterous hand according to claim 9, characterized in that the preprocessing module includes a preprocessing unit and a quality evaluation unit; the preprocessing unit is used to preprocess the multimodal data, and the preprocessing includes any one or more of the following: subjecting the spectral data to spectral correction and standardization; filtering and characterizing the tactile feature data; filtering and standardizing the mechanical data; and calibrating, correcting, denoising, and standardizing the visual feature data, and generating geometric constraints related to grasping and to the acquisition of spectral data.

13. The multimodal fusion sensing and control device for a robot dexterous hand according to claim 8, characterized in that the multimodal fusion model includes a multimodal coding layer and a multimodal neural network fusion layer.

14. The multimodal fusion sensing and control device for a robot dexterous hand according to any one of claims 8 to 13, characterized in that the control module includes a state machine supervisor and a multi-mode controller; the state machine supervisor is used for event-driven switching and scheduling between different operation stages; and the multi-mode controller is used to achieve stable execution in each stage using control modes such as position control, force control, or impedance control, and to update the target force, impedance parameters, and micro-motion strategy online according to the upper-level scheduling instructions.