A multi-modal glove-based robot adaptive grasping method and system

An adaptive grasping method that integrates tactile and posture data in real time using multimodal gloves solves the problem of poor robot grasping adaptability and improves the success rate and stability of grasping fragile and deformable objects.

CN122210643A · Pending · Publication Date: 2026-06-16 · WUHAN SHUZHI INNOVATION TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
WUHAN SHUZHI INNOVATION TECHNOLOGY CO LTD
Filing Date
2026-05-14
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing robot grasping methods suffer from poor adaptability and low success rate in complex environments due to incomplete perception information and lack of closed-loop feedback adjustment capabilities. They are particularly prone to damage or slippage when grasping fragile or easily deformable objects.

Method used

An adaptive grasping method based on multimodal gloves is adopted, which integrates tactile sensing components and posture sensing components. Through a phased grasping process, tactile data and posture data are fused in real time to achieve closed-loop control, generate fine-tuning action commands, and improve grasping stability.

Benefits of technology

The method raises the robot's success rate in grasping deformable or fragile objects from approximately 37% to about 83%, reduces the risk of object breakage or slippage, and demonstrates dexterous operation capabilities and system robustness.


Abstract

The application provides a robot adaptive grasping method and system based on a multi-modal glove, and relates to the technical field of robots, wherein the method comprises: controlling the robot to perform a staged grasping, which comprises: a pre-grasping stage: controlling the robot to approach a target object; an adaptive grasping stage: after contacting the target object, real-time fusion of tactile data and posture data from the multi-modal glove is performed to evaluate the stability of the current grasping state, and when it is determined to be unstable, fine-tuning action instructions for adjusting the dexterous hand of the robot are generated and executed. Through the application, the grasping adaptability and success rate are improved, comprehensive multi-modal information fusion and utilization are realized, and the defects of poor grasping adaptability and low success rate caused by incomplete perception information and lack of closed-loop feedback regulation capability in the prior art robot grasping method are solved.

Description

Technical Field

[0001] This invention relates to the field of robotics, and in particular to a robot adaptive grasping method and system based on a multimodal glove.

Background Technology

[0002] Robotic grasping is one of the core research areas in robotics. Traditional robotic grasping methods mostly rely on pre-programmed trajectories or vision-based open-loop control. These methods perform reasonably well in structured environments, but their adaptability and success rate decrease significantly when facing unstructured environments, objects with uncertain poses, or when interacting with fragile or deformable objects.

[0003] Specifically, relying solely on visual information for grasping makes it difficult for robots to perceive the physical interactions during the contact process, such as contact forces and sliding forces. This leads to the robot easily crushing fragile items (like eggs) or slipping on smooth surfaces due to insufficient grasping force. Although some studies have introduced force / tactile sensors, they are often used only as simple contact switches or force overload protection, failing to effectively integrate rich tactile data with robot posture data to achieve human-like dexterity and adaptive closed-loop regulation.

[0004] Therefore, existing technologies generally suffer from problems such as a single perception dimension, a lack of in-depth understanding of the contact state, and an inability to perform real-time closed-loop adaptive adjustments during the grasping process. This greatly limits the robot's ability to operate dexterously in complex and dynamic environments.

Summary of the Invention

[0005] This invention provides a robot adaptive grasping method and system based on a multimodal glove, which solves the defects of existing robot grasping methods, such as poor grasping adaptability and low success rate due to incomplete perception information and lack of closed-loop feedback adjustment capability.

[0006] In a first aspect, the present invention provides a robot adaptive grasping method based on a multimodal glove, wherein a multimodal glove integrating a tactile sensing component and a posture sensing component is worn on the robot's dexterous hand. The method comprises controlling the robot to perform phased grasping, which includes: a pre-grasping phase, in which the robot is controlled to approach a target object; and an adaptive grasping phase, in which, after contact with the target object, tactile data and posture data from the multimodal glove are fused in real time to assess the stability of the current grasping state and, if the state is determined to be unstable, fine-tuning instructions for adjusting the robot's dexterous hand are generated and executed.

[0007] According to the robot adaptive grasping method based on a multimodal glove provided by the present invention, the pre-grasping phase is executed through a pre-grasping model and includes: acquiring the current RGB image, which is captured by a vision camera mounted on the multimodal glove; and invoking the pre-trained pre-grasping model to control the robot's dexterous hand to approach the target object based on the RGB image, wherein the pre-grasping model incorporates a diffusion policy.

[0008] According to the robot adaptive grasping method based on a multimodal glove provided by the present invention, invoking the pre-trained pre-grasping model and controlling the robot's dexterous hand to approach the target object based on the RGB image includes: in each control cycle, acquiring the RGB image and synchronously reading the robot's body state to form an observation; generating an action sequence of a preset length with the pre-grasping model, conditioned on the observation; executing a portion of the action sequence of a specific length that does not exceed the preset length and, after execution, updating the current RGB image and forming a new observation; and, in the next control cycle, generating subsequent trajectory segments by diffusion sampling based on the new observation, the subsequent trajectory segments being used to correct the unexecuted part of the previous control cycle. The pre-grasping phase ends when the target relative error or the distance from the end of the robot's dexterous hand to the target object meets a set threshold.

[0009] According to the robot adaptive grasping method based on a multimodal glove provided by the present invention, the adaptive grasping phase is executed by an adaptive grasping model and includes: acquiring, in real time, contact pressure distribution data and three-dimensional motion posture data of the robot's dexterous hand during the grasping process, and generating a tactile sequence and a hand posture sequence; extracting high-dimensional spatial features from the tactile sequence and kinematic features from the hand posture sequence; performing cross-modal information fusion on the high-dimensional spatial features and the kinematic features to generate a fused feature sequence; and determining the stability of the current grasping state based on the fused feature sequence and, when the state is determined to be unstable, generating and executing fine-tuning instructions for adjusting the robot's dexterous hand.

[0010] According to the robot adaptive grasping method based on a multimodal glove provided by the present invention, the adaptive grasping model introduces an adaptive grasping strategy with three parts: initial grasp, grasp stability determination, and adaptive adjustment. During the initial grasp, the robot's dexterous hand closes at a constant speed until it contacts the target object, which completes the initial grasp. After the initial grasp is completed, the robot arm performs a preset lifting action, and the stability of the grasp is evaluated during lifting. When the grasp is unstable, dexterous hand action instructions for multiple future steps are output to fine-tune the grasping action.

[0011] According to the robot adaptive grasping method based on a multimodal glove provided by the present invention, the initial grasp includes: continuously monitoring the tactile signals output by the tactile sensing component at the fingertips; and, when the first-order time derivative of the tactile signal exceeds a given threshold, determining that the robot's dexterous hand is in full contact with the target object and that the initial grasp is complete.

[0012] According to the robot adaptive grasping method based on a multimodal glove provided by the present invention, evaluating the stability of the grasp during lifting includes: analyzing the fused feature sequence with a stability estimator to obtain a likelihood value characterizing the stability of the current grasping state, wherein the stability estimator is built on a Gaussian mixture model.

[0013] According to the robot adaptive grasping method based on a multimodal glove provided by the present invention, outputting dexterous hand action instructions for multiple future steps and fine-tuning the grasping action includes: using the kinematic features of the hand posture sequence as queries and the high-dimensional spatial features of the tactile sequence as keys and values, and fusing the information through a multi-head attention mechanism to obtain a fused feature sequence; decoding the fused feature sequence in a non-autoregressive manner to generate a future action sequence; and mapping and normalizing the future action sequence to obtain the final predicted action sequence, from which the dexterous hand action instructions are generated.

[0014] According to the robot adaptive grasping method based on a multimodal glove provided by the present invention, the tactile sensing component is a flexible piezoresistive tactile sensor array, and the posture sensing component consists of optical markers.

[0015] In a second aspect, the present invention also provides a robot adaptive grasping system, comprising: a multimodal glove integrating a tactile sensing component and a posture sensing component; a robot including a dexterous hand adapted to wear the multimodal glove; a processor; and instructions stored in memory and executed by the processor, the instructions being used to perform the method described in the first aspect.

[0016] Compared with the prior art, the present invention has the following beneficial effects: 1. Improved grasping success rate and adaptability: By mimicking the human grasping mechanism of "visual-guided approach and tactile closed-loop adjustment," this method increases the robot's grasping success rate on easily deformable / fragile objects (such as strawberries and eggs) from approximately 37% in the traditional open-loop mode to about 83%. Experimental results demonstrate that this method can effectively identify contact status and dynamically adjust grasping force, significantly reducing the risk of object breakage or slippage.

[0017] 2. Achieving Complete "Posture-Tactile" Multimodal Teaching: This invention integrates flexible tactile sensing and optical motion capture into a single glove, achieving for the first time synchronous and high-fidelity acquisition of hand movement posture and fingertip tactile pressure during human grasping. This provides robot learning with complete and coupled teaching data containing both "force" and "position," solving the fundamental problem of missing information in traditional teaching data.

[0018] 3. Solves the mapping challenge of heterogeneous human-robot kinematics: By employing a "data homology" strategy, that is, using the same multimodal tactile glove in both the teaching and execution phases, the robot's sensory interface is made consistent with that of the human instructor. This avoids the complex human-robot kinematic mapping problem and achieves "zero-shot" or "few-shot" transfer from human skills to robot execution, reducing the difficulty and cost of system deployment.

[0019] 4. Possesses closed-loop adaptive adjustment capability: The phased grasping strategy proposed in this invention, especially the "stability assessment-cross-modal fusion-motion prediction" closed-loop control mechanism in the adaptive grasping phase, enables the robot to dynamically adjust its grasping force and posture based on real-time tactile feedback. This endows the robot with dexterous manipulation capabilities when facing objects of unknown materials and shapes or dynamic disturbances, enhancing the system's robustness and generalization ability.

[0020] 5. Modular system, easy to expand: The multimodal tactile glove designed in this invention is an independent module. Its sensing functions (such as temperature, humidity, texture vibration, etc.) can be flexibly expanded according to task requirements without modifying the robot's dexterous hand structure. This provides a highly adaptable solution for the future application of robots in more complex scenarios (such as medical, service, and special operations).

Attached Figure Description

[0021] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0022] Figure 1 is a flowchart of the robot adaptive grasping method based on a multimodal glove provided by the present invention; Figure 2 is a schematic diagram of the adaptive grasping model in an embodiment of the present invention; Figure 3 is a schematic diagram of the robot adaptive grasping system provided by the present invention.

Detailed Implementation

[0023] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0024] Referring to Figure 1, this application provides a robot adaptive grasping method based on a multimodal glove. Its core objective is to address the low grasping success rate and poor adaptability of existing robot grasping technologies, which are caused by a single perception dimension, difficulties in mapping between heterogeneous human and robot kinematics, and a lack of closed-loop adaptive adjustment capabilities. This method mimics the natural human grasping behavior of "visually guided approach and tactile closed-loop adjustment," decomposing the grasping process into two stages, pre-grasping and adaptive grasping, to achieve dexterous and stable grasping of different objects.

[0025] The method provided in this application first requires wearing a multimodal glove integrating tactile and posture sensing components on the robot's dexterous hand. The core of this method lies in controlling the robot to execute a phased grasping process. The first phase is the pre-grasping phase, whose goal is to control the robot from a starting position, through a series of movements, to bring its dexterous hand close to the target object. This step is designed to complete the macroscopic localization of the grasping task, guiding the robot's end effector to a suitable area for contact operations, laying the foundation for subsequent fine manipulation. By decomposing the grasping task, the complexity of a single model can be effectively reduced, allowing the robot to first solve the "where to go" problem.

[0026] After the pre-grasping phase, the robot's dexterous hand makes or is about to make contact with the target object, at which point the process enters the adaptive grasping phase. In this phase, the core of the method shifts to real-time, closed-loop feedback control using a multimodal glove. Specifically, the system fuses tactile and posture data from the multimodal glove in real time. This fusion is not a simple data stitching but aims to construct a feature representation that comprehensively reflects the current grasping state. Based on this fused information, the system continuously evaluates the stability of the current grasping state. Stability assessment is crucial for adaptive grasping, providing the robot with a basis for deciding "whether adjustment is needed" and "how to adjust." When the system determines that the current grasping state is unstable, such as when the object tends to slide or experiences uneven force, the method immediately generates and executes fine-tuning instructions to adjust the robot's dexterous hand. These fine-tuning actions aim to proactively improve the grasping state until it returns to stability. Through this closed-loop "perception-evaluation-adjustment" cycle, this method enables the robot to dynamically adjust its grasping strategy based on "feel" after contact, much like a human, thereby significantly improving its adaptability to objects of different shapes, materials, and weights.

[0027] Furthermore, in a preferred embodiment, to make the pre-grasping phase more accurate and robust, this phase can be executed using a pre-grasping model. This model utilizes real-time RGB images captured by a vision camera mounted on the multimodal glove. Unlike a fixed external camera, this embodiment mounts the camera on the tactile glove worn over a rigid dexterous hand, creating a relatively fixed geometric relationship between the camera and the dexterous hand's end effector. This design provides the robot with a "hand-eye integrated" first-person perspective, allowing the camera's field of view to cover both the target object and the end effector region at the same time, which facilitates continuous observation of relative pose changes during approach. The pre-grasping model is pre-trained and incorporates a diffusion policy (DP). The diffusion policy is introduced to address uncertainties present in the pre-grasping process, such as visual noise and jitter in target pose estimation. The diffusion policy learns to generate smooth action sequences from noise, and it handles multimodal distributions in the teaching data (i.e., multiple reasonable approach paths in the same scene) better than traditional deterministic models. This results in more coherent and stable approach trajectories, avoids the motion jitter or hesitation that may occur with traditional methods, and improves the success rate and trajectory quality of the pre-grasping phase.

[0028] In real-world systems, the approach process during the pre-grasping phase is often affected simultaneously by visual measurement errors, execution errors, viewpoint changes, and teaching noise. For example, the camera-based estimate of the target's relative pose may fluctuate, and the teaching trajectories themselves may contain multiple equivalent approach paths. Therefore, pre-grasping is not a single deterministic mapping problem, but rather a conditional generation problem: generating feasible and coherent action sequences under uncertain observations.

[0029] Compared to the deterministic regression policies commonly used in traditional behavior cloning, the diffusion policy has modeling characteristics that better suit the action generation needs of this invention. Classical behavior cloning often learns a single-point mapping from observation to action by minimizing the mean squared error. When the same observation corresponds to multiple reasonable actions, the model tends to output a compromise action near the mean, which shows up as averaging, hesitation, or local jitter and leads to increased alignment errors near the end of the approach. Furthermore, the step-by-step prediction of traditional policies is prone to error accumulation and distribution shift during long rollouts, causing the trajectory to gradually deviate from the taught distribution and drift. Even with improved stochastic forms such as mixture density networks and VAEs, problems such as insufficient multimodal coverage, unstable sampling, or difficulty in ensuring temporal consistency are still common. Especially in scenarios that require continuous and smooth robot approach trajectories, these defects directly affect arrival accuracy and execution stability.
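The averaging effect of mean-squared-error regression can be checked numerically. The sketch below is illustrative only: it assumes a hypothetical one-dimensional action for which half of the demonstrations approach from the left (-1.0) and half from the right (+1.0) under the same observation. These values are not taken from the patent.

```python
import numpy as np

# Hypothetical demonstrations for one and the same observation:
# half approach from the left (-1.0), half from the right (+1.0).
actions = np.array([-1.0] * 50 + [1.0] * 50)


def mse(prediction: float) -> float:
    """Mean squared error of one constant prediction against all demonstrations."""
    return float(np.mean((actions - prediction) ** 2))


# Scan candidate point predictions and keep the one with the lowest MSE.
candidates = np.linspace(-1.5, 1.5, 301)
best = float(candidates[np.argmin([mse(c) for c in candidates])])

print(f"MSE-optimal point prediction: {best:.2f}")
print(f"MSE at the compromise: {mse(best):.2f}, MSE at a real mode: {mse(1.0):.2f}")
```

The MSE-optimal point prediction sits between the two demonstrated modes, so it corresponds to an action that no demonstration actually took. A policy that models the action distribution can instead commit to one of the modes.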

[0030] The diffusion policy treats the action sequence as a conditionally generated variable and obtains finite-horizon action segments through iterative sampling that progressively removes noise. It can therefore represent the multimodal distribution in the teaching data more fully while maintaining temporal consistency. On the one hand, the diffusion policy does not output a single point estimate of the action but learns the action distribution, enabling the model to generate clear and executable trajectories even when multiple feasible approach paths exist, rather than a compromise near the mean. On the other hand, the diffusion policy typically generates action segments, which gives it short-term planning characteristics. The generated trajectories are more continuous in velocity and acceleration and have a smoother spatial shape, which helps reduce the risk of oscillations and abrupt changes during robot arm execution. Furthermore, in terms of the training objective, the denoising generation mechanism is equivalent to recovering clean action sequences from noisy samples, making the model more robust to teaching jitter, visual pose estimation noise, and local anomalies, and less susceptible to a small number of inconsistent samples.

[0031] For the reasons mentioned above, this method uses a diffusion policy for the closed-loop pre-grasping approach to the target object: in each control cycle, the system uses the latest visual feedback as a condition to generate and execute a segment of the approach trajectory. In the next cycle, it resamples and corrects the subsequent trajectory based on updated observations, thus forming a continuously replanning visual closed-loop control. This strategy can promptly correct the approach direction and step size when there are slight perturbations in the target's relative pose, fluctuations in observation noise, or execution errors, stably guiding the end effector into the target's neighborhood until the arrival threshold condition is satisfied.

[0032] In another preferred embodiment, the pre-grasping execution process based on the diffusion policy described above is further specified. Within each control cycle, the system not only acquires the RGB image but also simultaneously reads the robot's body state (e.g., low-dimensional information such as end-effector pose, joint angles, and joint velocities); these two parts of information together constitute a complete observation. This observation is input into the pre-grasping model, which then generates an action sequence of a preset length H conditioned on it. Each action is an end-effector pose increment. However, to suppress error accumulation, this embodiment does not execute the entire sequence, but only a portion of a specific length (e.g., the first several steps). Once that portion has been executed, the system immediately updates the current RGB image and robot state to form a new observation. Diffusion sampling is then performed again to generate subsequent trajectory segments that correct the parts not executed in the previous cycle. Because the camera is mounted on the tactile glove and rigidly bound to the end effector of the rigid dexterous hand, end-effector movement causes synchronous changes in the viewpoint and target scale, allowing visual feedback to sensitively reflect changes in approach direction and distance. Therefore, even with slight target perturbations, end-effector tracking errors, or visual estimation jitter, the approach direction and step size can be adjusted promptly in the next cycle through re-conditioned sampling, achieving rapid convergence under perturbation. Finally, when the target relative error or the distance from the end effector to the target meets a set threshold, the pre-grasping approach phase ends, and the end effector state is used as the initial condition for subsequent grasping alignment and contact operations.
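The generate, partially execute, re-observe cycle described above can be sketched as a receding-horizon loop. This is a minimal illustration under stated assumptions, not the patented implementation: `sample_action_segment` is a hypothetical stand-in for diffusion sampling (it reads the target position directly instead of inferring it from an image), the segment length of 8 is the value given for deployment in this embodiment, and the number of executed steps, the threshold, and the noise scale are assumed values.

```python
import numpy as np

HORIZON = 8        # length of each generated action segment (deployment value in this embodiment)
EXEC_STEPS = 4     # assumed number of steps executed per cycle (must not exceed HORIZON)
THRESHOLD = 0.01   # assumed arrival threshold on the end-effector-to-target distance


def sample_action_segment(observation, horizon, rng):
    """Hypothetical stand-in for diffusion sampling: noisy pose increments toward the target."""
    position, target = observation
    step = (target - position) / horizon
    return step + rng.normal(scale=0.0005, size=(horizon, position.size))


def pre_grasp_approach(start, target, max_cycles=100, seed=0):
    """Run the closed-loop approach; return the final position and the cycles used."""
    rng = np.random.default_rng(seed)
    position = np.asarray(start, dtype=float)
    target = np.asarray(target, dtype=float)
    for cycle in range(max_cycles):
        # Termination check: distance to the target meets the set threshold.
        if np.linalg.norm(target - position) <= THRESHOLD:
            return position, cycle
        # Form a fresh observation and generate a full-length segment from it.
        observation = (position.copy(), target)
        segment = sample_action_segment(observation, HORIZON, rng)
        # Execute only the first EXEC_STEPS increments; the rest is discarded
        # and replaced by the next cycle's resampled segment.
        for delta in segment[:EXEC_STEPS]:
            position = position + delta
    return position, max_cycles


final, cycles = pre_grasp_approach(start=[0.0, 0.0, 0.5], target=[0.3, 0.1, 0.1])
print(cycles, np.linalg.norm(final - np.array([0.3, 0.1, 0.1])))
```

Because each segment is regenerated from the latest observation, errors introduced while executing one segment are corrected by the next one rather than accumulating over the whole approach.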

[0033] More specifically, to ensure expressive power while meeting the real-time requirements of closed-loop inference, this embodiment selects ResNet18 as the visual encoder and rescales every input image to a uniform size. With the classification head removed, the ResNet18 backbone produces a 512-dimensional visual feature vector through global average pooling. This vector is mapped to the conditional feature dimension by a linear layer and then concatenated with the low-dimensional state to form a global condition vector. The main considerations for choosing ResNet18 are its moderate parameter count, low inference overhead, and stable feature representation, which balance the need to recognize visual detail during the pre-grasping approach against the computational constraints of online closed-loop control.
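The conditioning path (pooled backbone features, linear projection, concatenation with the robot state) can be sketched with plain NumPy. The backbone itself is not reproduced here; `feature_map` stands for its final convolutional output, and the condition dimension (64), the state dimension (7), and the random weights are assumptions for illustration only.

```python
import numpy as np


def build_condition(feature_map, state, weight, bias):
    """Build a global condition vector.

    feature_map: (512, H, W) output of the backbone without its classification head.
    state:       (S,) low-dimensional robot state.
    weight/bias: parameters of the linear layer, shapes (C, 512) and (C,).
    """
    visual = feature_map.mean(axis=(1, 2))      # global average pooling -> (512,)
    projected = weight @ visual + bias          # linear layer -> (C,)
    return np.concatenate([projected, state])   # global condition vector -> (C + S,)


rng = np.random.default_rng(0)
feature_map = rng.normal(size=(512, 7, 7))   # assumed spatial size of the feature map
state = np.arange(7.0)                       # assumed 7-dimensional robot state
weight = rng.normal(size=(64, 512)) * 0.01   # assumed condition dimension of 64
bias = np.zeros(64)

condition = build_condition(feature_map, state, weight, bias)
print(condition.shape)
```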

[0034] The model training parameters are set as shown in Table 1. Each prediction is a fixed-length action sequence whose actions are end-effector pose increments. The model learns the conditional distribution of this action sequence given a global condition, which is obtained by fusing visual features with the robot's body state. During training, Gaussian noise is progressively injected into the action sequence by forward diffusion. The number of training diffusion steps is set to 50, with a linear noise schedule that starts at 0.0001 and ends at 0.02. The denoising network is a one-dimensional conditional U-Net that applies convolutions along the time dimension of the action sequence, with a kernel size of 5 and a multi-stage downsampling channel configuration. The embedding dimension of the diffusion time step is set to 128, and an input perturbation of strength 0.1 is introduced to improve robustness to observation noise and execution disturbances. The loss function is the mean squared error of the noise prediction, which enables the model to generate continuous, smooth, and executable action segments when multiple feasible approach trajectories exist. This embodiment uses the AdamW optimizer for end-to-end training, jointly optimizing the visual encoder and the denoising network for 125 training epochs. The number of diffusion sampling steps at inference is set to 16 to balance generation quality against online computational overhead. A receding-horizon control strategy is adopted during deployment: each control cycle generates an action segment of length 8 based on the latest visual feedback and executes only the first few steps. In the next cycle, the system observes again and resamples the subsequent actions, thereby continuously correcting the trajectory and suppressing error accumulation, which meets the stability and real-time requirements of the closed-loop pre-grasping approach.
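The noise schedule and training target stated above can be written out as a short sketch. It assumes the standard DDPM-style forward process, which the text does not name explicitly, and an action dimension of 6 chosen only for illustration. The 50 training steps, the linear schedule from 0.0001 to 0.02, and the segment length of 8 are taken from this embodiment. The denoising network and the 16-step sampler are not shown.

```python
import numpy as np

TRAIN_STEPS = 50                    # training diffusion steps (from the text)
BETA_START, BETA_END = 1e-4, 2e-2   # linear noise schedule endpoints (from the text)

betas = np.linspace(BETA_START, BETA_END, TRAIN_STEPS)
alphas_cumprod = np.cumprod(1.0 - betas)   # cumulative signal retention per step


def add_noise(actions, t, rng):
    """Forward diffusion (DDPM form assumed): noise a clean action segment to step t."""
    noise = rng.normal(size=actions.shape)
    signal_scale = np.sqrt(alphas_cumprod[t])
    noise_scale = np.sqrt(1.0 - alphas_cumprod[t])
    return signal_scale * actions + noise_scale * noise, noise


def noise_prediction_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and injected noise."""
    return float(np.mean((predicted_noise - true_noise) ** 2))


rng = np.random.default_rng(0)
clean_segment = np.zeros((8, 6))   # 8 steps, assumed 6-D pose increments
noisy_segment, injected = add_noise(clean_segment, t=TRAIN_STEPS - 1, rng=rng)
print(noisy_segment.shape, alphas_cumprod[0], alphas_cumprod[-1])
```

During training, a network would receive the noisy segment, the step index, and the global condition, and its output would be compared against `injected` with `noise_prediction_loss`.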

[0035] Table 1. Pre-grasping model training parameter settings

[0036] Furthermore, in a preferred embodiment, the adaptive grasping phase is also executed through a dedicated adaptive grasping model to achieve intelligent control of fine-tuning operations after contact. This model first collects two key data streams in real time during the grasping process: one is contact pressure distribution data obtained through tactile sensing components, forming a tactile sequence; the other is three-dimensional motion posture data of the robot's dexterous hand obtained through posture sensing components, forming a hand posture sequence. Next, the model extracts features from these two raw data streams. For example, it extracts high-dimensional spatial features reflecting pressure distribution and contact area from the tactile sequence, and kinematic features reflecting joint angles and speeds from the hand posture sequence. This step aims to transform the raw sensor signals into more informative representations. Subsequently, the model performs cross-modal information fusion on the extracted high-dimensional spatial features and kinematic features to generate a fused feature sequence. Finally, the model determines the stability of the current grasping state based on this fused feature sequence, and if it determines the state to be unstable, it generates and executes motion commands for fine-tuning the dexterous hand. This complete process, from multimodal data acquisition to action command generation, constitutes a highly efficient closed-loop adaptive control system.

[0037] As shown in Figure 2, corresponding to the above process, the adaptive grasping model includes the following modules: Feature extraction module: High-dimensional spatial features of tactile sequences are extracted using a residual network (ResNet); kinematic features of hand posture sequences are extracted using a multilayer perceptron (MLP).

[0038] Cross-modal fusion module: Employs a multi-head attention mechanism, using posture features as queries and tactile features as keys and values to fuse information and generate a fused feature sequence that integrates historical tactile and posture information.

[0039] Stability assessment module: Analyzes the fused feature sequence using a Gaussian Mixture Model (GMM) and outputs a grasping stability likelihood value to determine whether the current grasping state is stable.

[0040] Action prediction module: Based on a Transformer encoder, it processes the fused feature sequence in a non-autoregressive manner to output a sequence of dexterous hand joint movements for future multiple steps. The model is trained using the dexterous hand joint movement sequence of a human instructor during the adaptation phase as the ground truth.
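The cross-modal fusion module in the list above can be sketched as multi-head cross-attention in NumPy. This is an illustration under assumptions: the feature dimension (16), the number of heads (4), the sequence lengths, and the random projection weights are placeholders, and the output projection that usually follows multi-head attention is omitted for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_modal_attention(pose_feats, tactile_feats, w_q, w_k, w_v, num_heads):
    """Fuse posture features (queries) with tactile features (keys and values).

    pose_feats: (T_p, D), tactile_feats: (T_t, D). Returns fused (T_p, D) features
    and the attention weights of shape (num_heads, T_p, T_t).
    """
    t_p, d = pose_feats.shape
    d_head = d // num_heads
    # Project, then split the feature dimension into heads: (heads, T, d_head).
    q = (pose_feats @ w_q).reshape(t_p, num_heads, d_head).transpose(1, 0, 2)
    k = (tactile_feats @ w_k).reshape(-1, num_heads, d_head).transpose(1, 0, 2)
    v = (tactile_feats @ w_v).reshape(-1, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    fused = weights @ v                                   # (heads, T_p, d_head)
    # Merge heads back into one feature dimension.
    return fused.transpose(1, 0, 2).reshape(t_p, d), weights


rng = np.random.default_rng(0)
dim, heads = 16, 4
pose = rng.normal(size=(10, dim))      # e.g. MLP output over 10 posture frames
tactile = rng.normal(size=(12, dim))   # e.g. ResNet output over 12 tactile frames
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))

fused, attn = cross_modal_attention(pose, tactile, w_q, w_k, w_v, heads)
print(fused.shape, attn.shape)
```

Each row of the fused sequence is indexed by a posture frame, so the output keeps the time base of the hand motion while pulling in the tactile frames most relevant to it.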

[0041] In another preferred embodiment, a more specific adaptive grasping strategy is introduced into the above-mentioned adaptive grasping model. This strategy decomposes the complex adaptive process into three logically clear parts: initial grasp, grasping stability determination, and adaptive adjustment. First, in the initial grasping stage, the robot's dexterous hand performs a closing motion at a constant speed until it makes contact with the target object, completing the initial grasp. The goal of this stage is to establish an initial, evaluable contact state. After completing the initial grasp, the robot's robotic arm performs a preset lifting motion, and during the lifting process, the system continuously evaluates the stability of the grasp. The lifting motion is an active perturbation that can effectively expose unstable grasping states (e.g., the object begins to slide during the lifting process). If the evaluation finds that the grasp is unstable, the system will trigger an adaptive adjustment mechanism, outputting dexterous hand action instructions for the next few steps to fine-tune the grasping motion in order to achieve a more stable state. This three-stage strategy design clearly simulates the "trial-and-error - evaluation - adjustment" behavior pattern of humans when grasping uncertain objects.
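The three-part strategy can be sketched as a small control routine. The callables passed in (`close_step`, `in_contact`, `lift_step`, `stability`, `adjust`) are hypothetical hooks into a robot or simulator, and the step limits are assumed safety guards; none of these names come from the patent.

```python
def adaptive_grasp(close_step, in_contact, lift_step, stability, adjust,
                   threshold, lift_steps, max_close_steps=1000, max_adjustments=20):
    """Run one grasp attempt and return (success, number_of_adjustments)."""
    # Part 1: initial grasp. Close at constant speed until contact is detected.
    for _ in range(max_close_steps):
        if in_contact():
            break
        close_step()
    else:
        return False, 0   # no contact detected within max_close_steps

    # Parts 2 and 3: lift in small steps, evaluate stability after each step,
    # and fine-tune the hand while the grasp is judged unstable.
    adjustments = 0
    for _ in range(lift_steps):
        lift_step()
        while stability() < threshold:
            if adjustments >= max_adjustments:
                return False, adjustments
            adjust()
            adjustments += 1
    return True, adjustments


class ToyGripper:
    """Toy stand-in for hand and arm, used only to exercise the control flow."""

    def __init__(self):
        self.gap, self.grip, self.height = 5, 0.0, 0

    def close_step(self):
        self.gap -= 1

    def in_contact(self):
        return self.gap <= 0

    def lift_step(self):
        self.height += 1

    def stability(self):
        return self.grip

    def adjust(self):
        self.grip += 0.25


toy = ToyGripper()
success, count = adaptive_grasp(toy.close_step, toy.in_contact, toy.lift_step,
                                toy.stability, toy.adjust, threshold=1.0, lift_steps=3)
print(success, count)
```

Keeping the stability check inside the lifting loop reflects the role of the lift as an active perturbation: instability is only revealed, and corrected, while the object is being raised.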

[0042] Furthermore, to accurately determine the completion time of the initial grasp, this embodiment provides a method based on the rate of change of tactile signals. During the closing process of the dexterous hand, the system continuously monitors the tactile signals output by the fingertip tactile sensor components. When the dexterous hand has not yet contacted the object, the tactile signal typically remains at a low baseline level. Once the fingertip makes contact with the object, the pressure increases rapidly, causing a sudden change in the tactile signal. Therefore, this sudden change can be captured by calculating the first-order time derivative of the tactile signal. When this derivative value exceeds a preset threshold, it can be determined that the dexterous hand has made sufficient contact with the target object. At this point, the system immediately stops the closing action, completing the initial grasp. The advantage of this method is that it does not rely on prior knowledge of the object's size or position, but is entirely based on real-time tactile feedback, thus exhibiting strong adaptability to objects of different sizes and shapes.
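
As a minimal sketch of this contact test, the following function approximates the first-order time derivative by a finite difference and reports the first sample at which it exceeds the threshold. It assumes the sensor array output has already been reduced to one scalar per sample (for example, a summed pressure); that reduction, the function name, and the numeric values are illustrative assumptions and are not specified in this embodiment.

```python
from typing import Optional, Sequence


def detect_contact(
    samples: Sequence[float], dt: float, threshold: float
) -> Optional[int]:
    """Return the index of the first sample whose rate of change exceeds
    the threshold, or None if no contact is detected."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    for k in range(1, len(samples)):
        rate = (samples[k] - samples[k - 1]) / dt  # finite-difference derivative
        if rate > threshold:
            return k
    return None


# Baseline noise followed by a sharp pressure rise (made-up readings).
readings = [0.02, 0.02, 0.03, 0.02, 0.45, 0.90]
print(detect_contact(readings, dt=0.01, threshold=10.0))
```

Because only the change between consecutive samples is tested, a slowly drifting baseline does not trigger the stop, whereas a fixed absolute pressure threshold would have to be tuned per sensor.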

[0043] In one optional implementation, a method for evaluating grasping stability is specifically described. During the lifting process, the system uses a stability estimator to analyze the aforementioned fused feature sequence and obtain a likelihood value that quantifies the stability of the current grasping state. The higher this likelihood value, the more similar the current grasping state is to the "stable" grasping states recorded in the teaching data. To construct this estimator, this embodiment employs a Gaussian Mixture Model (GMM). A GMM can learn the complex probability distribution of tactile and posture features in the joint feature space under a stable grasping state. By calculating the probability density (i.e., the likelihood value) of the current fused features under this GMM, the stability of the current grasping state can be effectively determined. Compared to a simple threshold on individual features, the GMM-based estimator captures the correlation between features, providing a more reliable and statistically grounded measure of stability.
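
The likelihood computation can be illustrated with a deliberately small example. The sketch below assumes a two-dimensional state (one tactile feature and one posture feature) so that the covariance inverse has a closed form; the real feature space is higher-dimensional, and the component weights, means, and covariances shown are made-up values, not parameters from the teaching data.

```python
import math
from typing import Sequence, Tuple

Vec2 = Tuple[float, float]
Cov2 = Tuple[Vec2, Vec2]
Component = Tuple[float, Vec2, Cov2]  # (weight, mean, covariance)


def gaussian_pdf_2d(x: Vec2, mean: Vec2, cov: Cov2) -> float:
    """Density of a 2-D Gaussian with full covariance at point x."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Squared Mahalanobis distance via the closed-form 2x2 inverse.
    maha = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * maha) / (2.0 * math.pi * math.sqrt(det))


def gmm_likelihood(x: Vec2, components: Sequence[Component]) -> float:
    """Weighted sum of component densities: the stability likelihood."""
    return sum(w * gaussian_pdf_2d(x, mean, cov) for w, mean, cov in components)


# Hypothetical "stable grasp" model; x = (tactile feature, posture feature).
stable_gmm = [
    (0.6, (1.0, 0.5), ((0.04, 0.01), (0.01, 0.02))),
    (0.4, (1.5, 0.8), ((0.09, -0.02), (-0.02, 0.05))),
]
near = gmm_likelihood((1.0, 0.5), stable_gmm)  # close to a learned stable state
far = gmm_likelihood((0.1, 0.1), stable_gmm)   # unlike any learned stable state
print(near > far)
```

The off-diagonal covariance entries are what let the estimator express the correlation between tactile and posture features; a per-feature threshold corresponds to ignoring them.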

[0044] Furthermore, when the grasp is deemed unstable, precise fine-tuning instructions need to be generated. This embodiment provides an action generation method based on a multi-head attention mechanism. Specifically, the model uses the kinematic features of the hand posture sequence as queries and the high-dimensional spatial features of the tactile sequence as keys and values. The underlying logic of this setup is to use "how the hand moves" (posture features) to query "what the hand feels" (tactile features), thereby allowing the model to focus on the tactile feedback most relevant to the current hand posture. After information fusion through the multi-head attention mechanism, a deeply fused feature sequence is obtained. Then, a non-autoregressive method is used to decode this fused feature sequence, generating the actions for multiple future time steps at once. Compared to the autoregressive method, which generates actions one step at a time, non-autoregressive decoding has a faster inference speed and is more suitable for real-time control scenarios. Finally, the generated future action sequence is mapped and normalized to obtain the final predicted action sequence, and based on this, specific action instructions to drive the dexterous hand are generated. This series of operations enables the robot to generate the fine-tuning movements that are most likely to restore stability based on the specific "symptoms" of instability (i.e., specific tactile-posture patterns).

[0045] For example, during the initial grasping phase, the dexterous hand closes at a constant low speed. The closing motion is executed without a preset stopping time; instead, the output signal of the fingertip tactile sensor array is continuously monitored. This embodiment defines contact as the instant when the first time derivative of the tactile signal exceeds a given threshold, using the abrupt change in tactile readings as the marker that the initial grasp is complete. For the tactile observation $s_t$ at time $t$, the tactile rate of change is defined as:

$$\dot{s}_t = \frac{s_t - s_{t-\Delta t}}{\Delta t}$$

[0046] where $\dot{s}_t$ indicates the tactile rate of change and $\Delta t$ indicates the sampling time interval. When $\lVert \dot{s}_t \rVert$ is detected to exceed the sensitivity threshold $\epsilon$ for the first time, the dexterous hand movement immediately stops, and the current joint configuration is locked as the initial grasping posture. This process can be described as follows:

$$t^{*} = \min\{\, t : \lVert \dot{s}_t \rVert > \epsilon \,\}, \qquad q_{\mathrm{init}} = q_{t^{*}}$$

where $q_t$ is the joint configuration of the dexterous hand at time $t$ and $q_{\mathrm{init}}$ is the initial grasping posture.

[0047] After the initial grasp is completed, the robotic arm performs a preset lifting motion. This stage requires evaluating the stability of the current dexterous hand grasp, which is modeled as a binary classification problem: stable or unstable. Considering the high degrees of freedom and complex contact topology of the dexterous hand, a Gaussian mixture model (GMM) is used to construct the stability estimator. The current grasping state is defined as a vector $x = (x_T, x_P)$, where $x_T$ is the tactile information and $x_P$ is the dexterous hand posture. By training on human teaching data, the GMM learns the probability density distribution of finger contact patterns and joint configurations in the joint space under stable grasping conditions. Let $x$ be the initial grasp configuration. Within a Gaussian mixture model containing $m$ Gaussian components, the likelihood of the grasp configuration $x$ is defined as follows:

$$L(x) = \sum_{i=1}^{m} \pi_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i)$$

[0048] where $L(x)$ represents the likelihood value, $\pi_i$ represents the prior probability of the $i$-th component (the probability distribution of the latent variable), $\mu_i$ represents the mean, $\Sigma_i$ describes the covariance, and $\mathcal{N}$ represents a Gaussian distribution. The mean and covariance are defined as follows:

$$\mu_i = \begin{pmatrix} \mu_{T,i} \\ \mu_{P,i} \end{pmatrix}, \qquad \Sigma_i = \begin{pmatrix} \Sigma_{TT,i} & \Sigma_{TP,i} \\ \Sigma_{PT,i} & \Sigma_{PP,i} \end{pmatrix}$$

[0049] where $\mu_{T,i}$ indicates the mean of the tactile feature vectors in the $i$-th Gaussian component; $\mu_{P,i}$ indicates the mean of the posture feature vectors in the $i$-th Gaussian component; $\Sigma_{TT,i}$ indicates the autocovariance matrix of the tactile features in the $i$-th Gaussian component (describing the degree of dispersion of the tactile features themselves); $\Sigma_{TP,i}$ indicates the cross-covariance matrix of tactile and posture features in the $i$-th Gaussian component (describing the correlation between the two types of features); $\Sigma_{PT,i}$ is the transpose of $\Sigma_{TP,i}$, i.e., the cross-covariance matrix of posture and tactile features in the $i$-th Gaussian component; and $\Sigma_{PP,i}$ indicates the autocovariance matrix of the posture features in the $i$-th Gaussian component (describing the degree of dispersion of the posture features themselves).

[0050] When the likelihood value exceeds a given threshold $\delta$, the current grasp is considered stable. The discrimination threshold $\delta$ takes values in the range $[a, b]$, where $a$ and $b$ are the minimum and maximum, over the Gaussian components, of the likelihood value at twice the standard deviation, calculated as follows:

$$a = \min_{1 \le i \le m} L_i, \qquad b = \max_{1 \le i \le m} L_i, \qquad L_i = \frac{1}{\sqrt{(2\pi)^{d}\,\lvert \Sigma_i \rvert}} \exp\!\left(-\tfrac{1}{2}\cdot 2^{2}\right)$$

[0051] where $L_i$ indicates the likelihood value of the $i$-th Gaussian component at twice the standard deviation from its mean, which is used to define the confidence boundary of the stable grasping state; $d$ represents the dimension of the fused feature vector, which is the sum of the tactile feature dimension and the posture feature dimension; and $m$ represents the total number of Gaussian components in the Gaussian mixture model.
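
As a hedged illustration of how such a threshold range could be computed, the sketch below evaluates each component's density at a Mahalanobis distance of two standard deviations, where the exponent equals $-\tfrac{1}{2}\cdot 2^2 = -2$, and takes the minimum and maximum over the components. It assumes unweighted component densities and takes the covariance determinants as inputs; the function names and the example determinants are hypothetical.

```python
import math
from typing import Sequence, Tuple


def density_at_two_sigma(cov_det: float, dim: int) -> float:
    """Gaussian density at Mahalanobis distance 2 from the component mean."""
    if cov_det <= 0:
        raise ValueError("covariance determinant must be positive")
    return math.exp(-2.0) / math.sqrt((2.0 * math.pi) ** dim * cov_det)


def threshold_range(cov_dets: Sequence[float], dim: int) -> Tuple[float, float]:
    """Return (a, b): the smallest and largest two-sigma densities."""
    values = [density_at_two_sigma(det, dim) for det in cov_dets]
    return min(values), max(values)


# Two hypothetical components in a 2-D feature space.
a, b = threshold_range([1.0, 4.0], dim=2)
print(a < b)
```

A threshold near $b$ accepts only states well inside the tightest component, while a threshold near $a$ is more permissive; where to set it within the range is a tuning choice.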

[0052] The adaptive grasping process employs an encoder architecture, taking historical multimodal observation sequences as input and directly outputting dexterous hand movement instructions for future multiple steps. The overall network consists of three functional modules: a ResNet-18 module for extracting tactile spatial features, a multi-head attention fusion module for achieving modal interaction, and a Transformer encoder and action prediction head for capturing temporal dependencies. The following sections will elaborate on how the model outputs action sequences to drive dexterous hand grasping from tactile pressure and joint angle sequences.

[0053] During the deployment phase, the model's input consists of the multimodal observation sequence of the current time and the previous N frames (N=5), including the tactile observation sequence $S$ and the dexterous hand motion observation sequence $P$. The model aims to predict subsequent joint movement sequences based on previous sequence information, thereby driving the dexterous hand to perform fine grasping posture adjustments. The specific process of the adaptive grasping strategy is as follows. The tactile sequence is first fed into a ResNet-18 network, which acts as a feature extractor and maps the raw tactile sequence $S$ to a high-dimensional feature representation $F_t$, where $d$ is the embedding dimension. The data at each time step of the pose sequence $P$ is flattened and mapped to the same embedding dimension $d$ through a multilayer perceptron, yielding the pose feature sequence $F_p$. This multilayer perceptron consists of two linear layers with ReLU activation and Dropout in between, and is used to extract kinematic features from the pose. Pose features are used as Q (Query), and tactile features are used as K (Key) and V (Value), as shown in the following expressions:

$$Q = F_p W_Q, \qquad K = F_t W_K, \qquad V = F_t W_V$$

[0054] where $W_Q$ represents the Query projection matrix, which linearly transforms the pose features $F_p$ into the query vector Q required by the attention mechanism; $W_K$ represents the Key projection matrix, which linearly transforms the tactile features $F_t$ into the key vector K required by the attention mechanism; and $W_V$ represents the Value projection matrix, which linearly transforms the tactile features $F_t$ into the value vector V required by the attention mechanism. All three are learnable parameters of the model. Information is then fused using a multi-head attention mechanism:

$$F_{\mathrm{fused}} = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W_O, \qquad \mathrm{head}_j = \mathrm{softmax}\!\left(\frac{Q_j K_j^{\top}}{\sqrt{d_k}}\right) V_j$$

[0055] where $F_{\mathrm{fused}}$ represents the fused feature sequence; $W_O$ represents the multi-head attention output projection matrix, used to linearly transform the concatenated output features of the attention heads into the final cross-modal fused feature sequence; $h$ is the number of attention heads; $Q_j$, $K_j$, and $V_j$ are the query, key, and value of the $j$-th head; and $d_k$ is the per-head key dimension. The projection matrices are learnable parameters of the model.
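
The query/key/value roles can be made concrete with a minimal single-head sketch in plain Python. It is a simplification under stated assumptions: the embodiment uses multiple heads and an output projection, which are omitted here, and the feature matrices and identity projections below are made-up illustrative values.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    f_p: Matrix, f_t: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix
) -> Matrix:
    """Single-head cross-attention: posture queries attend to tactile keys."""
    q = matmul(f_p, w_q)  # queries from posture features
    k = matmul(f_t, w_k)  # keys from tactile features
    v = matmul(f_t, w_v)  # values from tactile features
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)  # one fused row per posture time step


identity = [[1.0, 0.0], [0.0, 1.0]]
pose_feats = [[1.0, 0.0], [0.0, 1.0]]     # two time steps, embedding size 2
tactile_feats = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention(pose_feats, tactile_feats, identity, identity, identity)
print(fused)
```

Each fused row is a weighted average of tactile value rows, with weights set by how well the posture query at that time step matches each tactile key, which is the sense in which posture "queries" touch.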

[0056] The attention output yields the fused feature sequence, which is passed to the Transformer encoder. The encoder consists of four layers, each containing self-attention and a feedforward network with residual connections and layer normalization, and uses eight attention heads and a 512-dimensional feedforward network. The encoder uses self-attention to capture long-range dependencies between any two time steps in the sequence, outputting an encoded feature sequence. The decoder then generates the future action sequence in a non-autoregressive manner, that is, all steps in a single forward pass rather than one step at a time. The decoder output is an action sequence of length N, which is then mapped and normalized by a multilayer perceptron to obtain the final predicted action sequence.

[0057] The specific parameter settings for the adaptive grasping model during training are shown in Table 2.

[0058] Table 2 Adaptive grasping model training parameter settings

[0059] In a preferred embodiment, specific hardware selection was made for the sensing components on the multimodal glove. Specifically, the tactile sensing component can be a flexible piezoresistive tactile sensor array. The advantage of this sensor is that its flexible material can conform well to the curved surface of the glove without excessively hindering the movement of the dexterous hand; at the same time, the sensor array form can provide rich information on the distribution of contact pressure, not just single-point contact force. The posture sensing component can specifically be optical markers. These markers are attached to key joints of the glove and can be tracked with high precision by an external optical motion capture system, thereby calculating the three-dimensional motion posture of the entire hand in real time. This combination of "flexible tactile sensing + optical posture" provides a solid hardware foundation for acquiring high-quality, high-fidelity multimodal teaching and feedback data.

[0060] To verify the effectiveness of the above method, a systematic grasping experiment was conducted on a real robotic platform. The experimental platform included the Elite collaborative robotic arm, the WEM-G30 dexterous hand, a wrist RGB camera, the FZMotion optical motion capture system, and a self-developed multimodal haptic glove. Ten easily deformable or fragile objects, such as sponge, silicone, strawberry, tomato, and raw egg, were selected as test objects. The experiment consisted of three stages: pre-grasping, initial grasping, and adaptation. In the pre-grasping stage, the diffusion strategy model generated the end-effector trajectory based on visual feedback, successfully guiding the dexterous hand to a safe position with a probability of approximately 77%. In the initial grasping and adaptation stages, after the tactile threshold was triggered and the closing motion stopped, the Gaussian mixture model evaluated the grasping stability in real time; when instability was detected, the cross-modal attention model fused the historical 5-frame tactile sequence and joint posture to predict future joint fine-tuning movements for a stable grasp. Twenty repetitions were performed on each of the ten objects (a total of 200 trials), with 165 successful grasps, resulting in an overall success rate of 82.5%. In comparison, the success rate of the traditional open-loop preset grasping mode is only 37% under the same platform and object conditions. This application improves the success rate by 45.5 percentage points, which verifies the effectiveness and significant technical progress of the multimodal haptic glove and the phased adaptive grasping strategy.
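
The reported figures follow directly from the trial counts given above, as the short check below shows; the variable names are illustrative.

```python
objects = 10
repetitions = 20
successes = 165

trials = objects * repetitions             # 200 trials in total
adaptive_rate = 100.0 * successes / trials
baseline_rate = 37.0                       # open-loop preset grasping, in percent
improvement = adaptive_rate - baseline_rate

print(trials, adaptive_rate, improvement)
```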

[0061] The present invention also provides a robot adaptive grasping system, comprising: a multimodal glove integrating tactile sensing components and posture sensing components; a robot including a dexterous hand adapted to wear the multimodal glove; a processor; and instructions stored in a memory and executed by the processor, the instructions being used to perform the above-described method.

[0062] Specifically, as shown in Figure 3, the system includes a multimodal glove integrating tactile and posture sensing components, and a robot including a dexterous hand adapted to wear the multimodal glove. Additionally, the system includes a processor, such as the host computer shown in Figure 3, and instructions stored in memory and executed by the processor. These instructions are configured to perform the aforementioned robot adaptive grasping method. In a practical deployment, the system may also include a switch for data communication and an optical motion capture system for posture capture. These hardware components work together to form a complete robotic system capable of performing highly intelligent, adaptive grasping tasks.

[0063] The data acquisition process is as follows: Each motion capture camera transmits raw image data to the host computer via a switch. FZMotion software then identifies, tracks, and calculates the 3D pose of the marker points, outputting a standard dexterous hand motion sequence, specifically the position and Euler angles of each finger joint at each time step. The haptic glove transmits the raw pressure signals collected by the fingertip sensor array to the host computer in real time via Bluetooth. The host computer uses a Python program to receive and synchronize these two heterogeneous data streams, ensuring precise alignment of tactile and motion information in the time dimension, and stores both in the same file.
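
The synchronization method itself is not detailed in this embodiment. One common approach, sketched here purely as an assumption, is nearest-timestamp matching: each motion capture frame is paired with the closest tactile sample, and pairs farther apart than a tolerance are discarded. The function name, the tolerance, and the timestamps are hypothetical, and both streams are assumed to be stamped against the same clock and sorted in ascending order.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple


def align_nearest(
    mocap_ts: Sequence[float], tactile_ts: Sequence[float], max_gap: float
) -> List[Tuple[int, int]]:
    """Pair each mocap frame index with the nearest tactile sample index.

    Pairs whose timestamps differ by more than max_gap seconds are dropped.
    """
    pairs: List[Tuple[int, int]] = []
    for i, t in enumerate(mocap_ts):
        j = bisect_left(tactile_ts, t)
        # The nearest sample is either just before or just after position j.
        candidates = [c for c in (j - 1, j) if 0 <= c < len(tactile_ts)]
        if not candidates:
            continue
        best = min(candidates, key=lambda c: abs(tactile_ts[c] - t))
        if abs(tactile_ts[best] - t) <= max_gap:
            pairs.append((i, best))
    return pairs


mocap = [0.00, 0.10, 0.20, 0.30]
tactile = [0.01, 0.12, 0.19, 0.45]
print(align_nearest(mocap, tactile, max_gap=0.05))
```

Dropping unmatched frames rather than interpolating keeps every stored pair grounded in an actual tactile reading, at the cost of a slightly shorter sequence when Bluetooth packets arrive late.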

[0064] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A robot adaptive grasping method based on a multimodal glove, wherein a multimodal glove integrating tactile sensing components and posture sensing components is worn on the dexterous hand of a robot, characterized in that the method includes: controlling the robot to perform phased grasping, the phased grasping including: a pre-grasping phase: controlling the robot to approach a target object; and an adaptive grasping phase: after contact with the target object, fusing tactile data and posture data from the multimodal glove in real time to assess the stability of the current grasping state, and, when it is determined to be unstable, generating and executing fine-tuning action instructions for adjusting the robot's dexterous hand.

2. The robot adaptive grasping method based on a multimodal glove according to claim 1, characterized in that the pre-grasping phase is executed through a pre-grasping model and includes: acquiring the current RGB image, the RGB image being acquired by a vision camera configured on the multimodal glove; and invoking the pre-trained pre-grasping model and, in combination with the RGB image, controlling the robot's dexterous hand to approach the target object; wherein the pre-grasping model incorporates a diffusion strategy.

3. The robot adaptive grasping method based on a multimodal glove according to claim 1, characterized in that invoking the pre-trained pre-grasping model and, in combination with the RGB image, controlling the robot's dexterous hand to approach the target object includes: in each control cycle, acquiring the RGB image and synchronously reading the robot's body state to form an observation; using the pre-grasping model, with the observation as the condition, to generate an action sequence of a preset length; executing a portion of the action sequence of a specific length and, after execution, updating the current RGB image and generating a new observation, the specific length not exceeding the preset length; in the next control cycle, generating subsequent trajectory segments by diffusion sampling based on the new observation, the subsequent trajectory segments being used to correct the unexecuted part of the previous control cycle; and ending the pre-grasping phase when the target relative error or the distance from the end of the robot's dexterous hand to the target object meets a set threshold.

4. The robot adaptive grasping method based on a multimodal glove according to claim 1, characterized in that the adaptive grasping phase is executed through an adaptive grasping model and includes: acquiring, in real time, contact pressure distribution data and three-dimensional motion posture data of the robot's dexterous hand during the grasping process, and generating a tactile sequence and a hand posture sequence; extracting the high-dimensional spatial features of the tactile sequence and the kinematic features of the hand posture sequence; performing cross-modal information fusion on the high-dimensional spatial features and the kinematic features to generate a fused feature sequence; and determining the stability of the current grasping state based on the fused feature sequence and, when it is determined to be unstable, generating and executing fine-tuning action instructions for adjusting the robot's dexterous hand.

5. The robot adaptive grasping method based on a multimodal glove according to claim 4, characterized in that the adaptive grasping model introduces an adaptive grasping strategy, which includes three parts: initial grasp, grasping stability determination, and adaptive adjustment; in the initial grasping phase, the robot's dexterous hand performs a closing motion at a constant speed until it contacts the target object, thereby completing the initial grasp; after the initial grasp is completed, the robot's arm performs a preset lifting motion, and the stability of the grasp is evaluated during the lifting process; and when the grasp is unstable, dexterous hand action instructions for multiple future steps are output to fine-tune the grasping action.

6. The robot adaptive grasping method based on a multimodal glove according to claim 5, characterized in that the initial grasping phase includes: continuously monitoring the tactile signals output by the tactile sensing component at the fingertips; and when the first time derivative of the tactile signal exceeds a given threshold, determining that the robot's dexterous hand is in sufficient contact with the target object and completing the initial grasp.

7. The robot adaptive grasping method based on a multimodal glove according to claim 5, characterized in that evaluating the stability of the grasp during the lifting process includes: analyzing the fused feature sequence using a stability estimator to obtain a likelihood value characterizing the stability of the current grasping state; wherein the stability estimator is built based on a Gaussian mixture model.

8. The robot adaptive grasping method based on a multimodal glove according to claim 5, characterized in that outputting dexterous hand action instructions for multiple future steps to fine-tune the grasping action includes: using the kinematic features of the hand posture sequence as queries and the high-dimensional spatial features of the tactile sequence as keys and values, and fusing information through a multi-head attention mechanism to obtain the fused feature sequence; decoding the fused feature sequence using a non-autoregressive method to generate a future action sequence; and mapping and normalizing the future action sequence to obtain the final predicted action sequence, and generating dexterous hand action instructions.

9. The robot adaptive grasping method based on a multimodal glove according to claim 1, characterized in that the tactile sensing component is a flexible piezoresistive tactile sensor array, and the posture sensing component is an optical marker.

10. A robot adaptive grasping system, characterized in that it includes: a multimodal glove integrating tactile sensing components and posture sensing components; a robot including a dexterous hand adapted to wear the multimodal glove; a processor; and instructions stored in a memory and executed by the processor, the instructions being used to perform the method as described in any one of claims 1-9.