A dexterous hand motion planning method, device, and medium based on visual perception

By generating executable master and shadow trajectories using multimodal visual data and a neural symbolic hybrid planner, the problem of difficulty in quantifying perceptual uncertainty in the motion planning of robot dexterous hands is solved, improving the reliability and stability of the trajectory, reducing the probability of replanning, and ensuring the safe execution of dexterous hands in complex environments.

CN122008246B · Active · Publication Date: 2026-06-23 · HANGZHOU HEIMAN TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HANGZHOU HEIMAN TECHNOLOGY CO LTD
Filing Date
2026-04-07
Publication Date
2026-06-23


Abstract

The application discloses a dexterous hand motion planning method, device, and medium based on visual perception, relating to the field of robot control technology. The method includes: performing object instance segmentation and object six-degree-of-freedom pose estimation based on multimodal visual data acquired in real time, to generate an environmental state representation including object category, segmentation mask, and pose information; and, when the arbitration decision indicates that the comprehensive risk difference index does not satisfy the preset stable execution condition, initiating global replanning to generate new planning instructions, and continuously updating the environmental state representation while the executable main trajectory is executed. The application performs covariance analysis on the object six-degree-of-freedom pose estimation results and applies counterfactual perturbations, forming a multi-state environment representation around perception uncertainty in the trajectory planning stage and enabling multi-trajectory planning under the same task instruction constraints.

Description

Technical Field

[0001] This invention relates to the field of robot control technology, and in particular to a method, device and medium for dexterous hand motion planning based on visual perception.

Background Technology

[0002] In assembly, sorting, and precision operation scenarios, robotic dexterous hands are gradually adopting a technical approach that combines visual perception and learning-based planning. Common processes include multimodal visual acquisition, instance segmentation and six-degree-of-freedom pose estimation, task instruction encoding, and joint space trajectory prediction. These processes are combined with kinematic and collision constraints to generate executable trajectories. These methods are continuously being applied and expanded in complex object manipulation and unstructured environments.

[0003] Existing technologies typically use single-shot or single-point pose estimation results as deterministic inputs to end-to-end planning. The pose uncertainty caused by perception noise, occlusion, and pose jitter is difficult to express explicitly in the planning stage. The planning output lacks a quantitative basis for its sensitivity to perception disturbances, which can easily lead to inconsistencies between trajectory evaluation and the perception of execution risk. The core contradiction is that the impact of perception uncertainty on planning reliability is difficult to characterize in a computable way at the generation stage.

Summary of the Invention

[0004] In view of the aforementioned existing problems, the present invention is proposed.

[0005] Therefore, this invention provides a visual perception-based dexterous hand motion planning method to solve the problem that perceptual uncertainty is difficult to quantify explicitly in planning, which affects the reliable determination of dexterous hand trajectories.

[0006] To solve the above-mentioned technical problems, the present invention provides the following technical solution:

[0007] In a first aspect, the present invention provides a dexterous hand motion planning method based on visual perception, comprising: performing object instance segmentation and six-degree-of-freedom pose estimation based on real-time acquired multimodal visual data to generate an environmental state representation containing object category, segmentation mask, and pose information; inputting the environmental state representation and task instructions into a neural symbolic hybrid planner for trajectory planning to generate an executable main trajectory that satisfies kinematic and collision constraints; applying a counterfactual perturbation to the six-degree-of-freedom pose of the object based on a perceptual uncertainty representation derived from the environmental state representation, and generating a corresponding shadow trajectory through a trajectory generator; calculating scores of the executable main trajectory and the shadow trajectory in terms of feasibility margin, expected execution risk, and visual feature space consistency, and fusing the scores to obtain a comprehensive risk differential index; comparing the comprehensive risk differential index with a preset dynamic threshold to determine an arbitration decision, and determining the trajectory processing path based on the arbitration decision; when the arbitration decision indicates that the comprehensive risk differential index meets the preset stable execution condition, sending the executable main trajectory to the control layer to drive the dexterous hand; and when the arbitration decision indicates that the comprehensive risk differential index does not meet the preset stable execution condition, initiating global replanning to generate new planning instructions, and continuously updating the environmental state representation during the execution of the executable main trajectory.

[0008] As a preferred embodiment of the dexterous hand motion planning method based on visual perception described in this invention, the multimodal visual data includes synchronously acquired two-dimensional color image data, depth image data spatially aligned with the two-dimensional color image data, and three-dimensional point cloud data reconstructed based on the depth image data.

[0009] As a preferred embodiment of the dexterous hand motion planning method based on visual perception described in this invention, the task instruction refers to a task-level instruction set that predefines and describes the dexterous hand operation target, operation method, and operation sequence.

[0010] The neural symbolic hybrid planner generates an initial trajectory through the neural network part and outputs an executable main trajectory after applying kinematic and collision constraints through the symbolic logic part.

[0011] The kinematic and collision constraints refer to the physical limits of the angles, angular velocities, and angular accelerations of each joint of the dexterous hand, as well as the minimum safe distance maintained between the dexterous hand / robotic arm and obstacles in the environment.

[0012] As a preferred embodiment of the dexterous hand motion planning method based on visual perception described in this invention, the neural network part includes a feedforward encoding structure that encodes environmental state representation and task instructions, a recurrent prediction network that performs temporal expansion, and a fully connected output layer that outputs a sequence of joint space path points.

[0013] The symbolic logic part includes a process for verifying the reachability of joint space path points, a process for performing collision detection based on obstacle information, and a process for performing trajectory optimization for joint space path points that do not meet kinematic and collision constraints.

[0014] As a preferred embodiment of the dexterous hand motion planning method based on visual perception described in this invention, wherein: the perceptual uncertainty characterization is obtained by performing covariance analysis on the six-degree-of-freedom pose estimation of the object;

[0015] The specific steps for generating the corresponding shadow trajectory using a trajectory generator are as follows.

[0016] Extract confidence parameters that characterize the reliability of current perception from multimodal visual data;

[0017] The direction and magnitude of the counterfactual perturbation are determined based on the confidence parameters. The six-degree-of-freedom pose of the object after applying the counterfactual perturbation is used as a new environmental state representation and input into the trajectory generator.

[0018] The trajectory generator reuses the neural network portion of the neural symbolic hybrid planner to generate shadow trajectories in a simplified computational manner.

[0019] As a preferred embodiment of the visual perception-based dexterous hand motion planning method of the present invention, the scores are fused to obtain the comprehensive risk difference index through the following specific steps:

[0020] The feasibility margin score and the expected execution risk score are calculated based on the executable main trajectory.

[0021] Based on the changes of the executable main trajectory and the shadow trajectory in the visual feature space, the visual feature space consistency score is calculated.

[0022] The differences between the feasibility margin score, expected execution risk score, and visual feature space consistency score of the executable main trajectory and the corresponding scores of the shadow trajectory are calculated. The differences are then weighted and fused to generate the comprehensive risk difference index.
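
The weighted fusion in these steps can be sketched as follows. This is only an illustration under stated assumptions: the source does not give the fusion formula, so the use of absolute differences, the weight values, and the names `risk_difference_index` and `SCORE_KEYS` are hypothetical.

```python
from typing import Dict

# Hypothetical keys for the three scores named in the text.
SCORE_KEYS = ("feasibility_margin", "execution_risk", "visual_consistency")


def risk_difference_index(main_scores: Dict[str, float],
                          shadow_scores: Dict[str, float],
                          weights: Dict[str, float]) -> float:
    """Weighted sum of main-vs-shadow score differences.

    Assumption: each difference enters as an absolute value, so the index
    grows whenever the two trajectories disagree on any score.
    """
    return sum(
        weights[k] * abs(main_scores[k] - shadow_scores[k]) for k in SCORE_KEYS
    )
```

With identical score sets for the main and shadow trajectories the index is zero, which matches the intuition that a plan insensitive to the perturbation carries little additional risk.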

[0023] As a preferred embodiment of the dexterous hand motion planning method based on visual perception described in this invention, the preset dynamic threshold is set based on the environmental state representation and historical operation data of the task instructions;

[0024] The specific steps for determining the arbitration decision are as follows:

[0025] When the comprehensive risk differential index is lower than the preset dynamic threshold, the arbitration decision is to output an executable main trajectory.

[0026] When the comprehensive risk differential index is higher than the preset dynamic threshold, the arbitration decision is to trigger a global replanning.
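
A minimal sketch of this threshold comparison is given below. The names `Decision` and `arbitrate` are illustrative, and the handling of an index exactly equal to the threshold is an assumption, since the text only covers the strictly lower and strictly higher cases.

```python
from enum import Enum


class Decision(Enum):
    EXECUTE_MAIN = "execute_main"  # output the executable main trajectory
    REPLAN = "replan"              # trigger global replanning


def arbitrate(risk_index: float, dynamic_threshold: float) -> Decision:
    """Compare the comprehensive risk index with the dynamic threshold.

    Assumption: equality is treated conservatively as a replanning case.
    """
    if risk_index < dynamic_threshold:
        return Decision.EXECUTE_MAIN
    return Decision.REPLAN
```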

[0027] As a preferred embodiment of the dexterous hand motion planning method based on visual perception described in this invention, the preset stable execution condition refers to the state where the comprehensive risk difference index is lower than the preset dynamic threshold.

[0028] The steps for initiating global replanning to generate new planning instructions and continuously updating the environmental state representation during the execution of the executable main trajectory are as follows:

[0029] The arbitration decision triggers a global replanning signal, and the executable main trajectory is re-generated based on the environmental state representation and task instructions, and a new planning instruction is output.

[0030] The control layer interrupts the current execution and switches to driving the dexterous hand according to the new planning instructions, while the environmental state representation is updated online and the motion plan is adjusted in real time.

[0031] In a second aspect, the present invention provides a computer device including a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, it implements any step of the visual perception-based dexterous hand motion planning method as described in the first aspect of the present invention.

[0032] In a third aspect, the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein, when the computer program is executed by a processor, it implements any step of the visual perception-based dexterous hand motion planning method as described in the first aspect of the present invention.

[0033] The beneficial effects of this invention are as follows: by performing covariance analysis on the six-degree-of-freedom pose estimation results of objects obtained from multimodal visual perception, and adaptively applying counterfactual perturbations based on the degree of perception stability, a multi-state environment representation centered on perception fluctuations can be formed in the trajectory planning stage, characterizing the propagation path of pose uncertainty in motion planning; shadow trajectories are generated based on perturbation poses, enabling the trajectory generation process to synchronously output multiple candidate trajectory expressions under the same task instruction constraints, providing a consistent planning input basis for trajectory evaluation and decision-making.

Attached Figure Description

[0034] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the following description of the embodiments will be briefly introduced. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0035] Figure 1 is a flowchart of the visual perception-based dexterous hand motion planning method.

[0036] Figure 2 is a flowchart of generating the environmental state representation.

[0037] Figure 3 is a flowchart of generating counterfactual perturbations and shadow trajectories.

[0038] Figure 4 is a flowchart of the comprehensive risk differential assessment and arbitration decision-making.

[0039] Figure 5 is a comparative data chart showing the magnitude of the counterfactual perturbation and the execution deviation at the end of the trajectory.

[0040] Figure 6 is a comparative data graph showing the execution deviation between the executable main trajectory and the shadow trajectory.

Detailed Implementation

[0041] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

[0042] Many specific details are set forth in the following description in order to provide a full understanding of the invention. However, the invention may also be practiced in other ways different from those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the invention. Therefore, the invention is not limited to the specific embodiments disclosed below.

[0043] Secondly, the term "one embodiment" or "embodiment" as used herein refers to a specific feature, structure, or characteristic that may be included in at least one implementation of the present invention. The phrase "in one embodiment" appearing in different places in this specification does not necessarily refer to the same embodiment, nor is it a single or selective embodiment that is mutually exclusive with other embodiments.

[0044] Referring to Figures 1-6, one embodiment of the present invention provides a dexterous hand motion planning method based on visual perception, including the following steps:

[0045] S1: Based on real-time acquired multimodal visual data, perform object instance segmentation and object six-degree-of-freedom pose estimation to generate an environmental state representation containing object category, segmentation mask and pose information.

[0046] The multimodal visual data includes synchronously acquired two-dimensional color image data, spatially aligned depth image data, and three-dimensional point cloud data reconstructed from the depth image data.

[0047] S1.1: Perform time alignment and spatial registration on the synchronously acquired two-dimensional color image data and depth image data to generate joint data in which the two-dimensional color image data and the depth image data correspond one-to-one at the pixel level.

[0048] Specifically, two-dimensional color image data and depth image data at the same moment are acquired based on a synchronous acquisition mechanism (such as hardware-triggered synchronization or software synchronization mechanism based on a unified timestamp).

[0049] Distortion correction processing corresponding to camera intrinsic parameter calibration is performed on the two-dimensional color image data. Based on the extrinsic parameter transformation relationship between the color camera corresponding to the two-dimensional color image data and the depth camera corresponding to the depth image data, the depth image data is mapped to the image coordinate system where the two-dimensional color image data is located. This ensures that each pixel position in the two-dimensional color image data is associated with a unique corresponding depth value, forming joint data of two-dimensional color image data and depth image data that are synchronized in time and correspond to each pixel in space.

[0050] S1.2: Perform 3D point cloud reconstruction based on the joint data of 2D color image data and depth image data to generate 3D point cloud data that spatially corresponds one-to-one with the 2D color image data.

[0051] Specifically, based on the depth value corresponding to each pixel in the depth image data and combined with the camera intrinsic parameters, the pixel coordinates in the joint data of the two-dimensional color image data and the depth image data are back-projected into the camera coordinate system to generate three-dimensional point cloud data containing three-dimensional spatial coordinates and corresponding color information, so that the three-dimensional point cloud data and the two-dimensional color image data are consistent in spatial position and semantic information.

[0052] It should be noted that back-projecting the pixel coordinates in the joint data of two-dimensional color image data and depth image data into the camera coordinate system means using the pixel coordinates in the two-dimensional color image data as the back-projection reference, using the depth values in the depth image data that correspond one-to-one with the pixel coordinates as the distance constraint, and combining the camera intrinsic parameters to complete the back-projection calculation from pixel coordinates to three-dimensional spatial coordinates.
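
The back-projection described above can be sketched as follows. This is an illustration that assumes a pinhole camera model with focal lengths `fx`, `fy` and principal point `cx`, `cy`; the source only says "camera intrinsic parameters", so the model, the function names, and the omission of color attachment are assumptions.

```python
from typing import Dict, List, Optional, Tuple

Pixel = Tuple[int, int]
Point = Tuple[float, float, float]


def back_project(u: int, v: int, depth: float,
                 fx: float, fy: float, cx: float, cy: float) -> Optional[Point]:
    """Back-project pixel (u, v) with its depth into the camera frame.

    Assumes a pinhole model; returns None when the depth is not positive.
    """
    if depth <= 0.0:
        return None
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def depth_image_to_points(depth_image: List[List[float]],
                          fx: float, fy: float,
                          cx: float, cy: float) -> Dict[Pixel, Point]:
    """Map each pixel (u, v) with valid depth to its 3D point, keeping the
    one-to-one pixel/point correspondence."""
    points: Dict[Pixel, Point] = {}
    for v, row in enumerate(depth_image):
        for u, d in enumerate(row):
            p = back_project(u, v, d, fx, fy, cx, cy)
            if p is not None:
                points[(u, v)] = p
    return points
```

Keying the result by pixel keeps the correspondence with the color image explicit, so the color value for a point can be looked up at the same `(u, v)` index.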

[0053] S1.3: Perform object instance segmentation based on two-dimensional color image data, and generate object category information and segmentation mask corresponding to each object instance.

[0054] Specifically, using two-dimensional color image data as input, instance segmentation is performed to identify the object instances in the scene one by one and to output the object category label corresponding to each object instance and the segmentation mask that identifies the object region in the two-dimensional color image coordinate system.

[0055] The segmentation mask identifies the pixel range belonging to the same object instance in the form of a pixel set, and serves as the spatial constraint for filtering the corresponding object instance point cloud in the 3D point cloud data.

[0056] S1.4: Extract object instance point clouds from 3D point cloud data based on segmentation mask, and perform six-degree-of-freedom pose estimation of the object based on the object instance point clouds to generate an environmental state representation.

[0057] Specifically, based on the pixel position of the segmentation mask in the two-dimensional color image coordinate system, three-dimensional points corresponding one-to-one with the pixels of the segmentation mask are selected from the three-dimensional point cloud data to form object instance point clouds.
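
As a rough sketch of this selection step, assume the point cloud is stored as a mapping from pixel coordinates to 3D points and the mask as a set of pixel coordinates. Both data layouts and the function name are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]
Point = Tuple[float, float, float]


def extract_instance_points(mask_pixels: Set[Pixel],
                            pixel_to_point: Dict[Pixel, Point]) -> List[Point]:
    """Select the 3D points whose source pixels lie inside the mask.

    Mask pixels without a 3D point (for example, missing depth) are skipped.
    """
    return [pixel_to_point[p] for p in sorted(mask_pixels) if p in pixel_to_point]
```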

[0058] Based on the point cloud of object instances, six-DOF pose estimation is performed to obtain the 3D position and pose parameters of the object instances in the camera coordinate system. The object category information, segmentation mask, and object six-DOF pose information are then associated and encapsulated to generate an environmental state representation that describes the semantic attributes and spatial geometric state of objects in the current scene. For example, the object six-DOF pose information includes translation parameters along the three coordinate axes and rotation parameters around the three coordinate axes.

[0059] S2: Input the environmental state representation and task instructions into the neural symbolic hybrid planner to perform trajectory planning and generate an executable master trajectory that satisfies kinematic and collision constraints.

[0060] S2.1: Perform joint encoding on the environmental state representation and task instructions to generate a state instruction encoding vector for trajectory prediction.

[0061] Task instructions refer to a set of task-level instructions that predefine the target, operation method, and operation sequence of the dexterous hand. The neural symbolic hybrid planner generates an initial trajectory through the neural network part and outputs an executable main trajectory after applying kinematic and collision constraints through the symbolic logic part. Kinematic and collision constraints refer to the physical limits of the angles, angular velocities, and angular accelerations of each joint of the dexterous hand, as well as the minimum safe distance between the dexterous hand, the robotic arm, and obstacles in the environment.

[0062] The neural network part includes a feedforward encoding structure that encodes environmental state representation and task instructions, a recurrent prediction network that performs temporal unfolding, and a fully connected output layer that outputs a sequence of joint space path points. The symbolic logic part includes a process for verifying the reachability of joint space path points, a process for performing collision detection based on obstacle information, and a process for optimizing the trajectory of joint space path points that do not meet kinematic and collision constraints.

[0063] Specifically, the object category information, segmentation mask, and object six-DOF pose information in the environmental state representation are structurally expanded, and the task-level instruction set regarding the dexterous hand's operation target, operation method, and operation sequence in the task instructions is vectorized. For the pixel set form of the segmentation mask, it is converted into a fixed-dimensional numerical representation through rasterization or feature extraction methods (such as resampling the pixel set into a fixed-size binary image and flattening it into a one-dimensional vector), and then structurally expanded together with the preset field information.

[0064] Structured unfolding refers to converting object category information, segmentation mask, and object six-DOF pose information into a fixed-dimensional data representation according to a preset field order. Vectorized representation refers to mapping the descriptions of dexterous hand operation targets, operation methods, and operation sequences in the task-level instruction set into numerical vectors consistent with the data representation dimensions. The preset field order refers to unfolding and concatenating the corresponding data in a fixed arrangement order of object category information, segmentation mask information, and object six-DOF pose information to form a structured data representation.
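
The structured unfolding of the environmental state can be illustrated as below. The category list, the one-hot encoding of the category, the nearest-neighbour resampling, and the grid size are all assumptions made for the sketch; the source fixes only the field order (category, mask, pose) and the idea of resampling the mask to a fixed-size binary image.

```python
from typing import List, Sequence

CATEGORIES = ["bolt", "nut", "gear"]  # hypothetical label set


def unfold_state(category: str,
                 mask: Sequence[Sequence[int]],
                 pose6d: Sequence[float],
                 grid: int = 4) -> List[float]:
    """Concatenate category, mask, and pose in the preset field order.

    The mask is resampled to a grid x grid binary image by nearest-neighbour
    sampling and flattened; the category is one-hot encoded (an assumption).
    """
    if len(pose6d) != 6:
        raise ValueError("pose6d must have 6 elements")
    one_hot = [1.0 if c == category else 0.0 for c in CATEGORIES]
    rows, cols = len(mask), len(mask[0])
    resampled = [
        float(mask[(i * rows) // grid][(j * cols) // grid] > 0)
        for i in range(grid)
        for j in range(grid)
    ]
    return one_hot + resampled + [float(p) for p in pose6d]
```

The output length depends only on the label set and `grid`, not on the original mask size, which is what gives the planner a fixed-dimensional input.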

[0065] The feedforward encoding structure in the neural network part performs feature mapping on the environmental state representation and the task instructions respectively, and concatenates the results along the feature dimension to form a unified state instruction encoding vector, which represents the joint constraints of the current environmental geometric state and the operation intention.

[0066] S2.2: Based on the state instruction encoding vector, generate the initial trajectory in the form of a joint space path point sequence through temporal unfolding by the recurrent prediction network.

[0067] Furthermore, the state instruction encoding vector is fed as the initial input into the recurrent prediction network in the neural network part. Recursive calculations are performed along the time step dimension, progressively outputting the joint space path point vectors of the dexterous hand at consecutive time steps and forming a sequence of joint space path points arranged in temporal order, which serves as the initial trajectory without explicit physical constraints. Each joint space path point contains the joint angle values of each joint of the dexterous hand at the corresponding time step.

[0068] The specific steps of the computation process for the neural network are as follows:

[0069] The feedforward encoding structure performs an encoding mapping on the input, expressed as:

[0070] $$z_0 = f_{enc}(x_{env}, x_{task});$$

[0071] In the formula, $z_0$ represents the initial state instruction encoding vector obtained after joint feature mapping of the environmental state representation and the task instructions through the feedforward encoding structure; $x_{env}$ represents the vectorized representation of the environmental state; $x_{task}$ represents the vectorized representation of the task instruction; and $f_{enc}$ denotes the feedforward encoding structure.

[0072] The recursive calculation of the recurrent prediction network at time step $t$ is expressed as:

[0073] $$h_t = f_{rnn}(h_{t-1}, z_0);$$

[0074] In the formula, $h_t$ represents the hidden state vector of the recurrent prediction network at time step $t$; $f_{rnn}$ represents the state update function of the recurrent prediction network; and $h_{t-1}$ represents the hidden state vector of the recurrent prediction network at time step $t-1$.

[0075] The fully connected output layer generates joint space path points, expressed as:

[0076] $$q_t = f_{out}(h_t);$$

[0077] In the formula, $q_t$ represents the joint space path point vector at time step $t$, and the vectors $q_t$ for all time steps $t$ form the initial trajectory in sequence; $f_{out}$ represents the mapping function of the fully connected output layer that maps the hidden state vectors of the recurrent prediction network to the path point vectors in the joint space.

[0078] It should be noted that the feedforward encoding structure $f_{enc}$ refers to using a multi-layer feedforward neural network to perform linear transformations and non-linear activation operations on the numerical representation of the environmental state and task instructions to generate a fixed-dimensional state instruction encoding vector; the state update function $f_{rnn}$ of the recurrent prediction network refers to a mapping function based on a recurrent neural network structure, which computes the hidden state vector of the current time step using the hidden state vector from the previous time step and the state instruction encoding vector; and the fully connected output layer mapping function $f_{out}$ refers to using a fully connected neural network structure to convert the hidden state vectors output by the recurrent prediction network at the corresponding time step into joint space path point vectors representing the angle values of each joint of the dexterous hand. $f_{enc}$, $f_{rnn}$, and $f_{out}$ are all conventional components of neural network structures, and their specific implementation can adopt any known feedforward neural network, recurrent neural network, or fully connected network structure.
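
The encode, recur, and output sequence described above can be sketched as a plain rollout loop. The three component functions are passed in as callables because the source leaves their implementation open; the toy stand-ins, the zero initial hidden state, and all names here are hypothetical and carry no trained weights.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def rollout(x_env: Sequence[float],
            x_task: Sequence[float],
            f_enc: Callable[[Sequence[float], Sequence[float]], Vector],
            f_rnn: Callable[[Vector, Vector], Vector],
            f_out: Callable[[Vector], Vector],
            h_init: Vector,
            steps: int) -> List[Vector]:
    """Encode once, then unroll the recurrent update for `steps` time steps."""
    z0 = f_enc(x_env, x_task)        # state instruction encoding vector
    h = h_init                       # assumed initial hidden state
    trajectory: List[Vector] = []
    for _ in range(steps):
        h = f_rnn(h, z0)             # hidden state for the current step
        trajectory.append(f_out(h))  # joint space path point
    return trajectory


# Toy stand-ins for the trained networks (hypothetical, fixed weights).
def toy_enc(x_env: Sequence[float], x_task: Sequence[float]) -> Vector:
    return [math.tanh(sum(x_env)), math.tanh(sum(x_task))]


def toy_rnn(h: Vector, z: Vector) -> Vector:
    return [math.tanh(0.5 * hi + 0.5 * zi) for hi, zi in zip(h, z)]


def toy_out(h: Vector) -> Vector:
    return [h[0] + h[1], h[0] - h[1], 2.0 * h[0]]  # three joint angles
```

For example, `rollout([0.1, 0.2], [0.3], toy_enc, toy_rnn, toy_out, [0.0, 0.0], steps=5)` returns five path points of three joint angles each; the result is the unconstrained initial trajectory that the symbolic logic part would then check.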

[0079] S2.3: Perform reachability verification and collision detection on the joint space path points in the initial trajectory, and identify path points that do not meet the kinematic and collision constraints.

[0080] Specifically, for each joint space path point in the initial trajectory, the angle value, angular velocity value, and angular acceleration value of each joint of the dexterous hand corresponding to the joint space path point are read one by one and compared with the preset physical limits of joint angle, angular velocity, and angular acceleration. When any joint exceeds the corresponding physical limit in any physical quantity, the joint space path point is determined not to meet the joint space reachability requirement.

[0081] After the joint space reachability determination is completed, the minimum spatial distance between the dexterous hand and robotic arm, in the spatial posture corresponding to the joint space path point, and each environmental obstacle is calculated based on the object six-degree-of-freedom pose information and the environmental obstacle spatial information contained in the environmental state representation, and this minimum spatial distance is compared with the preset minimum safe distance. When the minimum spatial distance is less than the minimum safe distance, the joint space path point is determined to have a collision risk.

[0082] Joint space path points that are determined not to meet the joint space reachability requirement or to have a collision risk are uniformly identified as joint space path points that do not meet the kinematic constraints or the requirement to maintain a minimum safe distance from obstacles in the environment.

[0083] It should be noted that the pre-set physical limits for each joint angle, joint angular velocity, and joint angular acceleration are all set based on the corresponding mechanical structure parameters and rated performance parameters of the actuator. For example, the physical limits for each joint angle are set according to the maximum allowable rotation range of the joint structure, the physical limits for joint angular velocity are set according to the rated speed of the actuator, and the physical limits for joint angular acceleration are set according to the maximum output torque of the actuator and the joint inertia parameters.
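
The reachability and collision checks of S2.3 can be sketched as follows. Two points are assumptions rather than part of the source: joint angular velocity and acceleration are estimated by backward finite differences with a fixed time step `dt` (starting from rest), and the minimum distance to obstacles is supplied by a hypothetical `min_distance` callable standing in for whatever geometric collision checker is used.

```python
from typing import Callable, List, Sequence


def find_violations(trajectory: Sequence[Sequence[float]],
                    dt: float,
                    q_min: Sequence[float], q_max: Sequence[float],
                    v_max: Sequence[float], a_max: Sequence[float],
                    min_distance: Callable[[Sequence[float]], float],
                    d_safe: float) -> List[int]:
    """Return indices of path points that break a joint limit or come
    closer to an obstacle than the minimum safe distance."""
    bad: List[int] = []
    n = len(q_min)
    prev_v = [0.0] * n  # assumption: the hand starts at rest
    for k, q in enumerate(trajectory):
        if k == 0:
            v = [0.0] * n
        else:
            v = [(q[j] - trajectory[k - 1][j]) / dt for j in range(n)]
        a = [(v[j] - prev_v[j]) / dt for j in range(n)]
        unreachable = any(
            not (q_min[j] <= q[j] <= q_max[j])
            or abs(v[j]) > v_max[j]
            or abs(a[j]) > a_max[j]
            for j in range(n)
        )
        collision_risk = min_distance(q) < d_safe
        if unreachable or collision_risk:
            bad.append(k)
        prev_v = v
    return bad
```

The returned indices are the path points that the subsequent trajectory optimization step would need to adjust.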

[0084] S2.4: Perform trajectory optimization on joint space path points that are identified as not satisfying kinematic and collision constraints, and output an executable main trajectory that satisfies the constraints.

[0085] Specifically, for the identified joint space path points that do not meet the kinematic and collision constraints, trajectory optimization is performed on the corresponding path points while maintaining the overall temporal continuity of the trajectory. By adjusting the values of the joint space path points, the optimized joint space path points simultaneously meet the physical limits of the angle, angular velocity, and angular acceleration of each joint of the dexterous hand, and also meet the minimum safe distance constraints between the dexterous hand, the robotic arm, and obstacles in the environment. This corrects the initial trajectory into an executable main trajectory that meets all kinematic and collision constraints.

[0086] S3: Based on the perceptual uncertainty representation derived from the environmental state representation, counterfactual perturbations are applied to the six-degree-of-freedom pose of the object, and the corresponding shadow trajectory is generated through the trajectory generator; the perceptual uncertainty representation is obtained by performing covariance analysis on the six-degree-of-freedom pose estimation of the object.

[0087] S3.1: Extract confidence parameters that characterize the current perception reliability from multimodal visual data.

[0088] Specifically, based on multimodal visual data, the six-degree-of-freedom pose estimation results of the same object instance at multiple adjacent acquisition times are statistically summarized.

[0089] The six-degree-of-freedom pose of the object at each moment is represented as a six-dimensional vector containing three-dimensional position parameters and three-dimensional attitude parameters. The mean vector and covariance matrix of the six-dimensional vector are calculated using a sample set of the six-dimensional vectors. The diagonal elements of the covariance matrix are used as variance measures in the six degrees of freedom directions. Furthermore, the variance measures in the six degrees of freedom directions are normalized and fused into a confidence parameter, so that the confidence parameter can characterize the stability of the current six-degree-of-freedom pose estimation of the object in a single scalar form. The higher the value of the confidence parameter, the lower the variance measure obtained by the covariance analysis and the more stable the six-degree-of-freedom pose estimation of the object.

[0090] The confidence parameter specifically includes: the normalized fusion result of three variance measures based on the three-dimensional position parameters and three variance measures based on the three-dimensional pose parameters.
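The covariance analysis and confidence fusion described above can be sketched as follows. The sample mean and covariance follow the description directly; the fusion formula is an assumption, since the description only requires that the six variance measures be normalized and fused so that a higher confidence corresponds to lower variance. The reference variances `scales` and both function names are hypothetical.

```python
from statistics import fmean


def pose_covariance(samples):
    """Return (mean, covariance) of 6-D pose vectors [x, y, z, rx, ry, rz]."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two pose samples")
    dim = len(samples[0])
    mean = [fmean(s[i] for s in samples) for i in range(dim)]
    cov = [
        [
            sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
            for j in range(dim)
        ]
        for i in range(dim)
    ]
    return mean, cov


def confidence_from_covariance(cov, scales):
    """Fuse the diagonal variances into a single scalar in (0, 1].

    Assumption: each variance is divided by a reference variance scales[i],
    the normalized values are averaged, and the average is mapped through
    1 / (1 + x) so that lower variance gives higher confidence.
    """
    normalized = [cov[i][i] / scales[i] for i in range(len(scales))]
    return 1.0 / (1.0 + fmean(normalized))
```

With identical pose samples every variance is zero and the confidence is 1.0; any spread in the samples lowers it.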

[0091] S3.2: Determine the direction and magnitude of the counterfactual perturbation based on the confidence parameter, and use the six-degree-of-freedom pose of the object after applying the counterfactual perturbation as a new environmental state representation, and input it into the trajectory generator.

[0092] Specifically, based on the covariance matrix, the direction of the degree of freedom corresponding to the maximum variance measure of the diagonal elements of the covariance matrix is selected as the direction of counterfactual perturbation. The magnitude of counterfactual perturbation is determined based on the confidence parameter. The magnitude of counterfactual perturbation and the confidence parameter have an inverse relationship, so that the magnitude of counterfactual perturbation is larger when the confidence parameter is lower and smaller when the confidence parameter is higher.

[0093] After determining the direction and magnitude of the counterfactual perturbation, the counterfactual perturbation is decomposed into position perturbation components and attitude perturbation components according to the arrangement order of the three-dimensional position parameters and three-dimensional attitude parameters in the object's six-degree-of-freedom pose information.

[0094] Specifically, for the three-dimensional position parameters, the displacement corresponding to the counterfactual perturbation amplitude is superimposed component by component to the original three-dimensional position parameters along the determined counterfactual perturbation direction; for the three-dimensional attitude parameters, the attitude offset corresponding to the counterfactual perturbation amplitude is superimposed component by component to the original three-dimensional attitude parameters; after completing the component-by-component superposition of the position parameters and attitude parameters, the six-degree-of-freedom pose information of the object after applying the counterfactual perturbation is obtained.

[0095] The object's six-DOF pose information after applying counterfactual perturbation is used to replace the object's six-DOF pose information of the corresponding object instance in the environmental state representation, while keeping the object category information and segmentation mask unchanged in the environmental state representation, forming a new environmental state representation and using it as input to the trajectory generator.
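The perturbation steps of S3.2 can be sketched as below. The direction rule (largest diagonal variance) and the replacement of only the pose field follow the description; the amplitude rule `max_amplitude * (1 - confidence)` is merely one simple inverse relationship chosen for illustration, and using the same amplitude for position and attitude components is likewise an assumption. `ObjectState`, `mask_id`, and `perturb_object_state` are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ObjectState:
    category: str
    mask_id: int   # hypothetical handle to the segmentation mask
    pose: tuple    # (x, y, z, rx, ry, rz)


def perturb_object_state(state, variances, confidence, max_amplitude):
    """Return a copy of state whose pose carries the counterfactual perturbation."""
    # Direction: the degree of freedom with the largest variance measure.
    axis = max(range(6), key=lambda i: variances[i])
    # Amplitude: assumed inverse relationship with the confidence parameter.
    amplitude = max_amplitude * (1.0 - confidence)
    direction = [1.0 if i == axis else 0.0 for i in range(6)]
    # Decompose into position and attitude components, in pose order.
    position_delta = [amplitude * d for d in direction[:3]]
    attitude_delta = [amplitude * d for d in direction[3:]]
    deltas = position_delta + attitude_delta
    new_pose = tuple(p + d for p, d in zip(state.pose, deltas))
    # Category and mask are kept unchanged; only the pose is replaced.
    return replace(state, pose=new_pose)
```

Using a frozen dataclass with `dataclasses.replace` makes it explicit that the original environmental state representation is left intact for the main trajectory.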

[0096] S3.3: The trajectory generator reuses the neural network portion of the neural symbolic hybrid planner to generate shadow trajectories in a simplified computational manner.

[0097] Specifically, the new environmental state representation and task instructions are input as pairs into the trajectory generator.

[0098] The trajectory generator calls the neural network part of the neural symbolic hybrid planner and reuses the feedforward encoding structure, recurrent prediction network and fully connected output layer contained in the neural network part to perform joint encoding and temporal unfolding of the new environmental state representation and task instructions, and outputs the initial trajectory in the form of a joint space path point sequence as the shadow trajectory.

[0099] It should be noted that the meaning of reusing the neural network part is limited to the trajectory generator only executing the encoding and prediction calculation process of the neural network part, without executing the verification process of kinematic and collision constraints and the trajectory optimization process of the symbolic logic part, so as to simplify the calculation method to generate the shadow trajectory and achieve parallel operation with the executable main trajectory generation process.
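The reuse relationship can be pictured with the structural sketch below. The two callables stand in for the neural network part and the symbolic logic part; their internals are not modeled, and running the two planning calls on a thread pool is only one possible way to realize the parallel operation. All names here (`HybridPlanner`, `plan_main`, `plan_shadow`, `plan_both`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class HybridPlanner:
    def __init__(self, neural_part, symbolic_part):
        self.neural_part = neural_part      # (env_state, task) -> initial trajectory
        self.symbolic_part = symbolic_part  # initial trajectory -> constrained trajectory

    def plan_main(self, env_state, task):
        """Full pipeline: neural prediction, then constraint checking and optimization."""
        return self.symbolic_part(self.neural_part(env_state, task))

    def plan_shadow(self, perturbed_env_state, task):
        """Simplified pipeline: neural prediction only, no symbolic processing."""
        return self.neural_part(perturbed_env_state, task)


def plan_both(planner, env_state, perturbed_env_state, task):
    """Generate the main and shadow trajectories concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        main_future = pool.submit(planner.plan_main, env_state, task)
        shadow_future = pool.submit(planner.plan_shadow, perturbed_env_state, task)
        return main_future.result(), shadow_future.result()
```

The shadow path skips the symbolic part entirely, which is what keeps its computation lighter than the main path.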

[0100] Preferably, by performing covariance analysis on the six-degree-of-freedom pose estimation results of objects based on multimodal visual data, a confidence parameter representing the degree of perceptual stability is formed. The direction and magnitude of the counterfactual perturbation are adaptively determined based on the confidence parameter. Under the condition of keeping the object category information and segmentation mask unchanged, the environmental state representation after perturbation is generated and the shadow trajectory is generated in parallel. This allows the trajectory planning stage to explicitly cover the propagation range of perceptual errors in the pose space. Compared with the conventional technique of generating trajectories based only on a single pose estimation result, this invention incorporates perceptual uncertainty factors in the planning stage, improving the stability of the trajectory against visual noise and posture fluctuations.

[0101] After performing covariance analysis on the six-DOF pose estimation results of the object and determining the degree of perception stability based on the covariance, counterfactual perturbations of different magnitudes are applied to the pose estimation results. Corresponding trajectory planning results are then generated under the same task instruction constraints. Simulation analysis is performed to analyze the relationship between the counterfactual perturbation magnitude and the trajectory end-point execution deviation. The relevant results are shown in Figure 5.

[0102] In the upper part of Figure 5, the horizontal axis represents the counterfactual perturbation amplitude determined based on the covariance of the object's six-DOF pose estimation, and the vertical axis represents the end-effector bias of the corresponding trajectory planning result. The figure shows the distribution of planning results under low, medium, and high perceptual uncertainty conditions, and introduces the ideal propagation relationship between pose perturbation and trajectory bias as a comparison. With changes in the level of perceptual uncertainty, under the same counterfactual perturbation amplitude, the end-effector bias of the trajectory exhibits different growth trends, reflecting the cumulative impact of pose estimation uncertainty on the motion planning results.

[0103] The lower half of Figure 5 shows a magnified view of the corresponding perturbation interval, illustrating the deviation distribution among multiple shadow trajectories generated by the counterfactual perturbation pose under conditions of small perturbation amplitude. The shadow trajectories under different perceptual uncertainties exhibit differentiated distribution characteristics in the planning output, reflecting a multi-state environmental representation centered around perceptual fluctuations.

[0104] S4: Calculate the scores of the executable main trajectory and shadow trajectory in terms of feasibility margin, expected execution risk and visual feature space consistency, and fuse the scores to obtain the comprehensive risk difference index.

[0105] S4.1: Calculate the feasibility margin score and the expected execution risk score based on the executable main trajectory.

[0106] Specifically, the sequence of joint space path points of the executable main trajectory is analyzed point by point in time step, and the remaining amount of each joint angle, joint angular velocity and joint angular acceleration of the dexterous hand relative to the corresponding physical limit is calculated for each joint space path point.

[0107] Based on the collision detection process, the minimum safe distance remaining relative to each obstacle in the environment is calculated for the dexterous hand and the robotic arm in the spatial posture corresponding to the joint space path point. The minimum value among the joint angle remaining amount, joint angular velocity remaining amount, joint angular acceleration remaining amount, and minimum safe distance remaining amount is determined as the feasibility margin value of the joint space path point.

[0108] It should be noted that the minimum safe distance margin refers to the distance margin calculated by taking the minimum Euclidean distance between the surfaces of the dexterous hand and the robotic arm and the surfaces of various obstacles in the environment, based on the geometric model of the dexterous hand and the robotic arm, under the spatial posture corresponding to the path point in the joint space, and then calculating the difference between the minimum Euclidean distance and the preset minimum safe distance.

[0109] After calculating the feasibility margin values for all joint space path points, the minimum feasibility margin value over the entire trajectory is taken and mapped by normalization to generate the feasibility margin score.

[0110] Simultaneously, the proportions of time steps in the executable main trajectory that are close to the physical limits for joint angular velocity, joint angular acceleration, and minimum safe distance are calculated, and the proportions are summarized to generate the expected execution risk score.
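The two scores of S4.1 can be sketched as follows. Taking the minimum remaining amount per path point and then over the whole trajectory follows the description; everything else is an assumption made for illustration: a single set of limits shared by all joints, dividing each remaining amount by its own limit so the four quantities are comparable, treating a normalized remaining amount below `near` as "close to the limit", and averaging the three proportions. `Limits`, `step_margins`, and `score_trajectory` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    q_min: float    # lower joint angle limit
    q_max: float    # upper joint angle limit
    dq_max: float   # angular velocity limit
    ddq_max: float  # angular acceleration limit
    d_safe: float   # preset minimum safe distance


def step_margins(q, dq, ddq, clearance, lim):
    """Normalized remaining amounts (angle, velocity, acceleration, distance)."""
    half_range = (lim.q_max - lim.q_min) / 2.0
    angle = min(min(qi - lim.q_min, lim.q_max - qi) for qi in q) / half_range
    vel = min(lim.dq_max - abs(v) for v in dq) / lim.dq_max
    acc = min(lim.ddq_max - abs(a) for a in ddq) / lim.ddq_max
    dist = (clearance - lim.d_safe) / lim.d_safe
    return angle, vel, acc, dist


def score_trajectory(waypoints, lim, near=0.1):
    """waypoints: list of (q, dq, ddq, clearance); returns (feasibility, risk)."""
    margins = [step_margins(q, dq, ddq, c, lim) for q, dq, ddq, c in waypoints]
    # Feasibility margin score: worst margin over the trajectory, clipped to [0, 1].
    feasibility = max(0.0, min(1.0, min(min(m) for m in margins)))
    # Expected execution risk: share of steps near the velocity, acceleration,
    # and distance limits, averaged over the three quantities.
    n = len(margins)
    proportions = [sum(1 for m in margins if m[k] < near) / n for k in (1, 2, 3)]
    risk = sum(proportions) / 3.0
    return feasibility, risk
```

Here `clearance` is the minimum Euclidean distance to any obstacle at that path point, which would come from the collision detection process and is supplied as an input in this sketch.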

[0111] S4.2: Calculate the visual feature space consistency score based on the changes of the executable main trajectory and the shadow trajectory in the visual feature space.

[0112] Specifically, within the same time step corresponding to the executable main trajectory and the shadow trajectory, based on the segmentation mask, the local region of the same object instance is determined in the multimodal visual data, and a visual feature vector of uniform dimension is extracted within the local region.

[0113] For each time step, the difference metric between the visual feature vector corresponding to the executable main trajectory and the visual feature vector corresponding to the shadow trajectory is calculated. The difference metric values of all time steps are normalized and summarized to generate a visual feature space consistency score, so that the visual feature space consistency score can reflect the overall consistency between the executable main trajectory and the shadow trajectory in the visual feature space.
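A minimal sketch of this per-time-step comparison is given below. The description does not name a specific difference metric or normalization, so Euclidean distance, the mapping `d / (1 + d)`, and averaging over time steps are all assumptions chosen for illustration; `consistency_score` is a hypothetical name, and feature extraction from the masked local region is assumed to have happened upstream.

```python
import math


def consistency_score(main_features, shadow_features):
    """Return a score in (0, 1]; 1.0 means identical features at every time step.

    main_features, shadow_features: equal-length lists of feature vectors,
    one vector of uniform dimension per time step.
    """
    if len(main_features) != len(shadow_features) or not main_features:
        raise ValueError("feature sequences must be non-empty and equally long")
    # Assumed difference metric: Euclidean distance between the two vectors.
    diffs = [math.dist(a, b) for a, b in zip(main_features, shadow_features)]
    # Assumed normalization: squash each distance into [0, 1).
    normalized = [d / (1.0 + d) for d in diffs]
    return 1.0 - sum(normalized) / len(normalized)
```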

[0114] S4.3: For the feasibility margin score, the expected execution risk score, and the visual feature space consistency score, calculate the difference between the score of the executable main trajectory and the corresponding score of the shadow trajectory, and perform weighted fusion of the differences to generate the comprehensive risk difference index.

[0115] Specifically, the differences between the executable main trajectory and the shadow trajectory in terms of feasibility margin score, expected execution risk score, and visual feature space consistency score are calculated, and each difference is normalized to a uniform scale.

[0116] After the difference normalization is completed, the three types of differences are weighted and summarized according to the preset weight configuration. The weighted summary result is output as the comprehensive risk difference index, so that the comprehensive risk difference index can characterize the comprehensive difference level between the executable main trajectory and the shadow trajectory in terms of safety margin, execution risk and perceived consistency.

[0117] The preset weight configuration is a fixed weight ratio obtained by offline statistics on the impact of different scoring items on the success or failure of trajectory execution based on historical operation data. The weight ratio reflects the relative contribution of the feasibility margin score difference, the expected execution risk score difference, and the visual feature space consistency score difference to the execution stability in historical operations.
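The weighted fusion of S4.3 can be sketched as follows, assuming each trajectory carries three scores that already lie in [0, 1], so that an absolute difference is on a uniform scale. The dictionary keys, the example weights, and the function name `comprehensive_risk_difference` are hypothetical; the description states only that the weights are a fixed ratio obtained offline from historical operation data.

```python
SCORE_KEYS = ("feasibility", "risk", "consistency")


def comprehensive_risk_difference(main_scores, shadow_scores, weights):
    """Weighted sum of per-score absolute differences, normalized by total weight."""
    diffs = {k: abs(main_scores[k] - shadow_scores[k]) for k in SCORE_KEYS}
    total_weight = sum(weights[k] for k in SCORE_KEYS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[k] * diffs[k] for k in SCORE_KEYS) / total_weight
```

For example, with weights 0.5, 0.3, and 0.2 and differences of 0.2, 0.2, and 0.0, the index is 0.5 × 0.2 + 0.3 × 0.2 + 0.2 × 0.0 = 0.16.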

[0118] While generating an executable main trajectory based on the environmental state representation, a shadow trajectory is generated in parallel based on the counterfactual perturbation pose. The execution deviation between the main trajectory and the shadow trajectory during multiple rounds of planning is simulated and recorded. The relevant results are shown in Figure 6.

[0119] In the upper part of Figure 6, the horizontal axis represents the planning time step number, and the vertical axis represents the trajectory execution deviation change at the corresponding time step. One curve represents the execution deviation of the executable main trajectory, and the other curve represents the statistical results of the shadow trajectory deviation generated under the same task instruction constraints. During the planning and execution process, the deviation fluctuation range of the shadow trajectory is relatively large, and this fluctuation is consistent with the introduction of counterfactual perturbation pose.

[0120] The lower half of Figure 6 shows a magnified view of the corresponding time step interval, illustrating the deviation between the main trajectory and the shadow trajectory in areas with significant perceptual fluctuations. At certain time steps, the deviation between the shadow trajectory and the executable main trajectory shows a significant increase, reflecting the differentiation of planning results in the time dimension under different perceptual states.

[0121] S5: Compare the comprehensive risk differential index with the preset dynamic threshold to determine the arbitration decision; the preset dynamic threshold is set based on the environmental state characterization and historical operation data of the task instructions.

[0122] Specifically, during the historical execution process, for multiple historical operation records that are consistent with the current task instruction type and have similar environmental state representation, the comprehensive risk difference index generated in the corresponding operation and the corresponding execution result identifier are read. The historical operation data includes historical environmental state representation, historical task instructions, result identifiers of whether the historical executable main trajectory was successfully completed, and historical comprehensive risk difference index.

[0123] The historical operation data is filtered to extract the operation records that were successfully completed in the historical executable main trajectory, and the comprehensive risk difference index in the corresponding operation records is summarized.

[0124] Based on the comprehensive risk differential index corresponding to successful operations, the distribution range of the comprehensive risk differential index is statistically analyzed, and the upper quantile value in the distribution range is selected as a preset dynamic threshold to reflect the acceptable upper limit of risk under the current environmental state and task instructions. The "upper quantile value" refers to the critical value located at a high percentage position (such as the 75th or 90th percentile) after sorting the comprehensive risk differential index corresponding to historical successful operations by numerical value. This value is used to distinguish between the vast majority of low-risk successful cases and the few critical-risk cases.
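The threshold selection can be sketched as below. Filtering to successful records and taking an upper quantile follow the description; the record layout, the linear interpolation between neighboring sorted values, and the name `dynamic_threshold` are assumptions. Matching on "similar environmental state representation" is not modeled here and is assumed to have been applied before the records are passed in.

```python
def dynamic_threshold(records, task_type, quantile=0.9):
    """Upper-quantile threshold over indices of successful historical operations.

    records: list of dicts with hypothetical keys "task_type", "success", "index".
    quantile: position in [0, 1], e.g. 0.75 or 0.9.
    """
    values = sorted(
        r["index"] for r in records
        if r["task_type"] == task_type and r["success"]
    )
    if not values:
        raise ValueError("no successful historical records for this task type")
    # Linear interpolation between the two nearest sorted values.
    pos = quantile * (len(values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(values) - 1)
    frac = pos - lo
    return values[lo] + frac * (values[hi] - values[lo])
```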

[0125] S5.1: When the comprehensive risk difference index is lower than the preset dynamic threshold, the arbitration decision is to output an executable main trajectory; when the comprehensive risk difference index is higher than the preset dynamic threshold, the arbitration decision is to trigger global replanning.

[0126] Specifically, the comprehensive risk difference index is compared with the preset dynamic threshold. When the comprehensive risk difference index is less than the preset dynamic threshold, it is determined that the executable main trajectory remains stable within the difference range corresponding to the feasibility margin score, expected execution risk score, and visual feature space consistency score. Based on the determination result, an arbitration decision identifier indicating output of the executable main trajectory is generated, and the executable main trajectory is determined as the trajectory result adopted in the current execution stage.

[0127] Furthermore, the comprehensive risk difference index is compared with the preset dynamic threshold. When the comprehensive risk difference index is greater than the preset dynamic threshold, it is determined that the difference between the executable main trajectory and the shadow trajectory in at least one of the feasibility margin score, expected execution risk score, or visual feature space consistency score exceeds the historical stable range. Based on the determination result, an arbitration decision identifier is generated to trigger global replanning, and the arbitration decision identifier is used as the trigger condition for trajectory regeneration.
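The comparison in S5.1 reduces to a single branch, sketched below. The description covers only the "lower than" and "higher than" cases; treating an index exactly equal to the threshold as a replanning trigger is a conservative assumption of this sketch, and the names `Decision` and `arbitrate` are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    OUTPUT_MAIN = "output_executable_main_trajectory"
    REPLAN = "trigger_global_replanning"


def arbitrate(risk_difference_index, dynamic_threshold):
    """Return the arbitration decision identifier for one planning round."""
    if risk_difference_index < dynamic_threshold:
        return Decision.OUTPUT_MAIN
    # Index above the threshold (or equal, by assumption): regenerate the trajectory.
    return Decision.REPLAN
```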

[0128] Preferably, by fusing the differential results of the executable main trajectory and the shadow trajectory in terms of feasibility margin, execution risk and visual consistency, and combining them with dynamic thresholds for arbitration, the trajectory execution decision is based on the controllable impact of perceived disturbances. Compared with conventional judgment methods based on fixed thresholds or single collision detection, this invention reduces the probability of unnecessary replanning triggers and maintains execution continuity.

[0129] S6: Determine the trajectory processing path based on the arbitration decision. When the comprehensive risk difference index indicated by the arbitration decision meets the preset stable execution conditions, send the executable main trajectory to the control layer to drive the dexterous hand to execute. When the comprehensive risk difference index indicated by the arbitration decision does not meet the preset stable execution conditions, initiate global replanning to generate new planning instructions, and continuously update the environmental state representation during the execution of the executable main trajectory.

[0130] Here, the preset stable execution condition refers to the state where the comprehensive risk difference index is lower than the preset dynamic threshold.

[0131] S6.1: Trigger a global replanning signal through arbitration decision, re-execute the executable main trajectory generation based on environmental state representation and task instructions, and output new planning instructions.

[0132] Specifically, when the arbitration decision indicator indicates that the comprehensive risk difference index meets the preset stable execution conditions, the joint space path points in the executable main trajectory are arranged in chronological order, and a corresponding time step number is assigned to each joint space path point to form a trajectory data structure. The trajectory data structure is sent to the control layer, which extracts the joint space path points in chronological order according to the time step number and outputs the corresponding joint angle control quantity to drive the dexterous hand to execute along the executable main trajectory.

[0133] When the arbitration decision indicator indicates that the comprehensive risk difference index does not meet the preset stable execution conditions, the trajectory planning process is re-executed based on the current environmental state representation and the current task instructions to generate a new executable main trajectory. The joint space path points in the new executable main trajectory are arranged in chronological order, and a corresponding time step number is assigned to each joint space path point to form a new planning instruction. The new planning instruction serves as the input for the control layer to switch execution paths. For example, the new planning instruction may include a version identifier to distinguish the planning results generated in different rounds.

[0134] S6.2: The current execution is interrupted by the control layer, and the process is switched to drive the dexterous hand based on the new planning instructions to perform online updates of the environmental state representation and real-time adjustments to the motion plan.

[0135] Specifically, when the control layer receives a new planning instruction, it stops parsing the joint space path point sequence currently being executed and records the time step number that has been completed as the switching time step number.

[0136] Based on the switching time step number, select the joint space path point corresponding to the switching time step number from the joint space path point sequence corresponding to the new planning instruction, and use it as the first execution path point after the switch. Then, start from this path point to drive the dexterous hand to execute the motion trajectory corresponding to the new planning instruction.
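The switching rule can be sketched as follows, assuming a planning instruction is represented as a chronologically ordered list of `(time_step_number, joint_angles)` pairs. That representation and the name `switch_plan` are hypothetical; the rule itself (resume at the path point whose number equals the switching time step number) follows the description.

```python
def switch_plan(switching_step, new_plan):
    """Return the part of new_plan to execute after the switch.

    switching_step: time step number recorded when execution was interrupted
    new_plan: list of (time_step_number, joint_angles), in chronological order
    """
    steps = [step for step, _ in new_plan]
    if switching_step not in steps:
        raise ValueError("new plan has no path point for the switching time step")
    start = steps.index(switching_step)
    # The path point at the switching time step is the first one executed.
    return new_plan[start:]
```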

[0137] During the execution of the new planning instruction, multimodal visual data is repeatedly collected and a new environmental state representation is generated at each time step interval. The new environmental state representation and the current task instructions are then input into the trajectory planning process to generate an updated executable main trajectory. The time step interval is set so that the environmental state representation is updated once every several frames of visual data. It is set with reference to the typical frame rate of the visual sensor (e.g., 30 frames/second) and the trajectory update delay, for example one update every 5-10 frames, and a shortest time window between updates is enforced to avoid frequent jitter.
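As a worked illustration of the interval setting, at 30 frames/second an update every 5 to 10 frames corresponds to roughly 167 ms to 333 ms between updates. The sketch below picks the interval as the trajectory update delay expressed in frames, clamped to the 5-10 frame range; this selection rule is an assumption made for illustration and is not stated in the description, and both function names are hypothetical.

```python
def update_interval_frames(latency_ms, fps=30, min_frames=5, max_frames=10):
    """Frames between environment updates; latency_ms and fps must be integers."""
    # Integer ceiling of the trajectory update delay measured in frames.
    needed = (latency_ms * fps + 999) // 1000
    # Never update more often than min_frames (shortest time window) or
    # less often than max_frames.
    return max(min_frames, min(max_frames, needed))


def should_update(frame_index, interval_frames):
    """True on the frames where the environmental state representation is refreshed."""
    return frame_index % interval_frames == 0
```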

[0138] When the updated executable main trajectory deviates from the new planning instruction being executed at a joint space path point, and the arbitration decision identifier indicates that global replanning is triggered, the execution process corresponding to the current new planning instruction is stopped, a new planning instruction is regenerated, and the interruption and switching process is repeated to achieve continuous updating of the environmental state representation and dynamic adjustment of the motion planning results.

[0139] This embodiment also provides a computer device applicable to the dexterous hand motion planning method based on visual perception, including: a memory and a processor; the memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions to implement the dexterous hand motion planning method based on visual perception as proposed in the above embodiment.

[0140] The computer device can be a terminal, comprising a processor, memory, communication interface, display screen, and input devices connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, carrier networks, NFC (Near Field Communication), or other technologies. The display screen can be an LCD screen or an e-ink screen. The input devices can be a touch layer covering the display screen, buttons, a trackball, or a touchpad on the computer device's casing, or an external keyboard, touchpad, or mouse.

[0141] This embodiment also provides a storage medium on which a computer program is stored. When executed by a processor, the program implements the dexterous hand motion planning method based on visual perception as proposed in the above embodiments. The storage medium can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk.

[0142] In summary, this invention performs covariance analysis on the six-DOF pose estimation results of objects obtained from multimodal visual perception and adaptively applies counterfactual perturbations based on the degree of perception stability. This enables the formation of a multi-state environment representation around perception fluctuations during the trajectory planning stage, characterizing the propagation path of pose uncertainty in motion planning. Based on the perturbation pose, a shadow trajectory is generated, allowing the trajectory generation process to synchronously output multiple candidate trajectory expressions under the same task instruction constraints, providing a consistent planning input basis for trajectory evaluation and decision-making.

[0143] It should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all such modifications or substitutions should be covered within the scope of the claims of the present invention.

Claims

1. A dexterous hand motion planning method based on visual perception, characterized in that it comprises: Based on real-time acquired multimodal visual data, object instance segmentation and object six-degree-of-freedom pose estimation are performed to generate an environmental state representation that includes object category, segmentation mask and pose information; The environmental state representation and task instructions are input into the neural symbolic hybrid planner for trajectory planning, generating an executable main trajectory that satisfies kinematic and collision constraints; Based on the perception uncertainty representation derived from the environmental state representation, counterfactual perturbations are applied to the six-degree-of-freedom pose of the object, and the corresponding shadow trajectory is generated through the trajectory generator; The scores of the executable main trajectory and the shadow trajectory in terms of feasibility margin, expected execution risk and visual feature space consistency are calculated, and the scores are fused to obtain the comprehensive risk difference index; The comprehensive risk difference index is compared with a preset dynamic threshold to determine the arbitration decision; Based on the arbitration decision, the trajectory processing path is determined: when the comprehensive risk difference index indicated by the arbitration decision meets the preset stable execution conditions, the executable main trajectory is sent to the control layer to drive the dexterous hand to execute; when the comprehensive risk difference index indicated by the arbitration decision does not meet the preset stable execution conditions, global replanning is initiated to generate new planning instructions, and the environmental state representation is continuously updated during the execution of the executable main trajectory.

2. The dexterous hand motion planning method based on visual perception as described in claim 1, characterized in that: The multimodal visual data includes synchronously acquired two-dimensional color image data, spatially aligned depth image data, and three-dimensional point cloud data reconstructed from the depth image data.

3. The dexterous hand motion planning method based on visual perception as described in claim 2, characterized in that: The task instructions refer to a set of task-level instructions that predefines and describes the target, method, and sequence of operations for dexterous hand operations. The neural symbolic hybrid planner generates an initial trajectory through the neural network part and outputs an executable main trajectory after applying kinematic and collision constraints through the symbolic logic part. The kinematic and collision constraints refer to the physical limits of the angles, angular velocities, and angular accelerations of each joint of the dexterous hand, as well as the minimum safe distance maintained between the dexterous hand / robotic arm and obstacles in the environment.

4. The dexterous hand motion planning method based on visual perception as described in claim 3, characterized in that: The neural network part includes a feedforward encoding structure that encodes environmental state representation and task instructions, a recurrent prediction network that performs temporal unfolding, and a fully connected output layer that outputs a sequence of joint space path points. The symbolic logic part includes a process for verifying the reachability of joint space path points, a process for performing collision detection based on obstacle information, and a process for performing trajectory optimization for joint space path points that do not meet kinematic and collision constraints.

5. The dexterous hand motion planning method based on visual perception as described in claim 4, characterized in that: The perception uncertainty representation is obtained by performing covariance analysis on the six-degree-of-freedom pose estimation of the object; The specific steps for generating the corresponding shadow trajectory using a trajectory generator are as follows: Extract confidence parameters that characterize the reliability of current perception from multimodal visual data; The direction and magnitude of the counterfactual perturbation are determined based on the confidence parameters, and the six-degree-of-freedom pose of the object after applying the counterfactual perturbation is used as a new environmental state representation and input into the trajectory generator; The trajectory generator reuses the neural network portion of the neural symbolic hybrid planner to generate shadow trajectories in a simplified computational manner.

6. The dexterous hand motion planning method based on visual perception as described in claim 5, characterized in that: The specific steps for fusing the scores to obtain the comprehensive risk difference index are as follows: The feasibility margin score and the expected execution risk score are calculated based on the executable main trajectory; Based on the changes of the executable main trajectory and the shadow trajectory in the visual feature space, the visual feature space consistency score is calculated; For the feasibility margin score, the expected execution risk score, and the visual feature space consistency score, the difference between the score of the executable main trajectory and the corresponding score of the shadow trajectory is calculated, and the differences are weighted and fused to generate the comprehensive risk difference index.

7. The dexterous hand motion planning method based on visual perception as described in claim 6, characterized in that: The preset dynamic threshold is set based on the environmental state representation and the historical operation data of the task instructions. The specific steps for determining the arbitration decision are as follows: When the comprehensive risk difference index is lower than the preset dynamic threshold, the arbitration decision is to output the executable main trajectory. When the comprehensive risk difference index is higher than the preset dynamic threshold, the arbitration decision is to trigger global replanning.
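
The arbitration rule of claim 7 can be sketched as a threshold comparison. The claim does not state how the dynamic threshold is derived from historical operation data, so the rule below (mean plus `k` standard deviations of past index values, with a fallback `default`) is purely an assumption, as are the names `Decision`, `dynamic_threshold` and `arbitrate`. The claim is also silent on the case where the index equals the threshold; the sketch conservatively treats it as a replanning trigger.

```python
from enum import Enum
from statistics import fmean, pstdev


class Decision(Enum):
    EXECUTE_MAIN = "execute_main"
    REPLAN = "replan"


def dynamic_threshold(history, k=1.0, default=0.2):
    """Assumed rule: mean + k * std of past risk difference index values."""
    if len(history) < 2:
        return default  # not enough history to estimate a spread
    return fmean(history) + k * pstdev(history)


def arbitrate(index, threshold):
    """Output the main trajectory below the threshold, otherwise replan."""
    return Decision.EXECUTE_MAIN if index < threshold else Decision.REPLAN


if __name__ == "__main__":
    threshold = dynamic_threshold([0.1, 0.2, 0.3])
    print(arbitrate(0.16, threshold), arbitrate(0.5, threshold))
```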

8. The dexterous hand motion planning method based on visual perception as described in claim 7, characterized in that: The preset stable execution condition refers to the state in which the comprehensive risk difference index is lower than the preset dynamic threshold. The steps for initiating global replanning to generate a new planning instruction and continuously updating the environmental state representation during execution of the executable main trajectory are as follows: The arbitration decision triggers a global replanning signal, the executable main trajectory is regenerated based on the environmental state representation and the task instructions, and a new planning instruction is output. The control layer interrupts the current execution and switches to driving the dexterous hand according to the new planning instruction, so that the environmental state representation is updated online and the motion plan is adjusted in real time.
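
The replanning behavior of claim 8 might be sketched as a bounded loop that refreshes the environmental state representation before each planning attempt. The callables `perceive`, `plan` and `risk_index`, and the `max_replans` bound, are hypothetical; the claim does not limit the number of replanning rounds, and interruption of a running execution by the control layer is not modeled here.

```python
def plan_until_stable(perceive, plan, risk_index, threshold, max_replans=3):
    """Replan until the stable execution condition holds.

    perceive   : () -> state, refreshes the environmental state representation
    plan       : state -> trajectory, regenerates the executable main trajectory
    risk_index : (state, trajectory) -> float, comprehensive risk difference index
    Returns (trajectory, replans_used), or (None, max_replans) if no
    attempt satisfied the condition.
    """
    for attempt in range(max_replans + 1):
        state = perceive()          # online update of the state
        trajectory = plan(state)    # new planning instruction
        if risk_index(state, trajectory) < threshold:
            return trajectory, attempt
    return None, max_replans


if __name__ == "__main__":
    indices = iter([0.5, 0.3, 0.1])
    result = plan_until_stable(
        perceive=lambda: "state",
        plan=lambda s: [s],
        risk_index=lambda s, t: next(indices),
        threshold=0.2,
    )
    print(result)
```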

9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that: When the processor executes the computer program, it implements the steps of the dexterous hand motion planning method based on visual perception as described in any one of claims 1 to 8.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that: When the computer program is executed by a processor, it implements the steps of the dexterous hand motion planning method based on visual perception as described in any one of claims 1 to 8.