A two-handed articulated object pose estimation method and related devices
Using a transformer-based method, features of both hands and the articulated object are extracted and made to interact, relative position queries are generated, and pose parameters are regressed. This addresses the inconsistency in the reconstruction of articulated objects in existing methods, improves the accuracy and plausibility of pose estimation, and is applicable to fields such as human-computer interaction and virtual reality.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SOUTH CHINA UNIV OF TECH
- Filing Date
- 2025-11-17
- Publication Date
- 2026-06-23
Description
Technical Field
[0001] This application relates to the fields of computer vision and artificial intelligence, and in particular to a method and related equipment for estimating the pose of a two-handed articulated object.
Background Technology
[0002] The human hand is the primary medium for interacting with the physical world. Enabling machines to understand the interactive postures of the human hand and objects is a key technology for improving the naturalness of human-computer interaction, promoting the development of virtual reality / augmented reality (VR / AR) applications, and realizing imitation-based robotic skill learning. By accurately estimating the three-dimensional postures of the hand and objects, the system can generate more realistic virtual scenes or guide robotic arms to complete complex tasks.
[0003] Existing hand-object pose estimation methods primarily focus on rigid objects. These methods typically reconstruct a 3D mesh by estimating the parameters of a parametric model of the hand (such as the MANO model) and the object's 6-DOF pose (rotation and translation). To improve the plausibility of the reconstruction results, existing techniques aim to model the dependencies between the hand and the object, such as mutual occlusion and contact constraints.
[0004] However, the situation becomes more complex in scenarios involving two-handed manipulation of articulated objects (such as scissors, laptops, pliers, etc.). Articulated objects contain one or more kinematic pairs, and their state is determined by the local parameter of the hinge angle. This is fundamentally different in granularity from the global rotation and translation parameters of rigid objects. Existing methods for modeling hand-object dependencies on rigid objects cannot effectively capture the strong correlation between the relative spatial positions of the hands and the hinge state of the object. Directly applying existing methods leads to inconsistencies between the estimated hand pose, object pose, and hinge angle, resulting in significant deviations in the reconstructed mesh in terms of relative position and contact area, making it difficult to meet the high requirements for pose accuracy in practical applications.
Summary of the Invention
[0005] The main objective of this application is to propose a transformer-based method and related device for estimating the pose of a two-handed articulated object. The aim is to significantly improve the accuracy and rationality of the three-dimensional pose estimation of the two hands and the articulated object by innovatively representing the relative positional relationship of the two hands and associating it with the articulation angle of the object.
[0006] To achieve the above objectives, one aspect of this application proposes a method for estimating the pose of a two-handed articulated object, the method comprising:
[0007] Feature extraction steps: Extract global features from the input RGB image, and extract the region features of the left hand, right hand and articulated object respectively. Concatenate the global features, left hand features, right hand features and object features along the sequence dimension and add position encoding to generate joint features;
[0008] Query extraction steps: Based on the global features, predict the probability that each pixel in the image belongs to the left hand and the right hand, and select the top N pixels with the highest probability of belonging to the left hand and the right hand respectively to form the left hand position query and the right hand position query.
[0009] The query interaction steps are as follows: The left-hand position query and the right-hand position query are concatenated and processed through a self-attention mechanism to extract the relative position information of the two hands and output the relative position query of the two hands; The left-hand position query, the right-hand position query, and the relative position query of the two hands are mapped to the left-hand parameter query, the right-hand parameter query, and the correction amount query, respectively; The left-hand parameter query, the right-hand parameter query, the correction amount query, and a learnable object parameter query are concatenated to generate a joint query;
[0010] Query-feature interaction step: Perform cross-attention matching between the joint query and the joint feature, wherein, through the attention masking mechanism, the left-hand parameter query, the right-hand parameter query, the relative position of the two hands query and the object parameter query are matched with their corresponding features respectively;
[0011] Parameter regression steps: Decode the left-hand parameter query, right-hand parameter query, object parameter query, and relative position query of both hands after matching features, and regress the pose, shape, and translation parameters of the left hand, the pose, shape, and translation parameters of the right hand, the rotation, initial value of the hinge angle, and translation parameters of the object, as well as the hinge angle correction amount; wherein, the initial value of the hinge angle and the hinge angle correction amount are added to obtain the final object hinge angle; and output the three-dimensional mesh of both hands and the hinged object in the camera coordinate system based on all parameters.
[0012] In some embodiments, in the feature extraction step, the convolutional neural networks used to extract features from the left-hand RGB image and the right-hand RGB image are the same network sharing weights.
[0013] In some embodiments, the position encoding is generated using sine and cosine functions to encode the position information of features in the horizontal and vertical directions.
[0014] In some embodiments, the dimensionality reduction process of the global features in the query extraction step specifically includes: sequentially passing through a convolutional layer, an activation function, and an upsampling operation to reduce the feature channel dimension and improve the spatial resolution of the feature map.
[0015] In some embodiments, in the query extraction step, the dimensionality-reduced feature map is convolved by the prediction head, and then the probability of each pixel belonging to the left or right hand is obtained by the softmax function.
[0016] In some embodiments, the self-attention mechanism in the query interaction step adopts a transformer encoder structure, which extracts the relative position information of the hands from the hand position query through query, key and value calculation and feedforward neural network.
[0017] In some embodiments, the attention masking mechanism in the query-feature interaction step is specifically as follows: when calculating the attention weight for each type of query and feature, a very large negative bias is applied to the feature positions that are not intended to be matched by that type of query, so that their attention weight after softmax approaches zero.
[0018] In some embodiments, in the parameter regression step, the regression-obtained hand pose and shape parameters are input into the MANO model to obtain a three-dimensional mesh of the hands, which is then transformed into the camera coordinate system by combining translation parameters; the three-dimensional mesh of the hinged object is obtained by rotating the standard model of the object according to the rotation parameters, rotating its upper half about the rotation axis according to the final hinge angle of the object, and then combining the translation parameters.
[0019] To achieve the above objectives, another aspect of this application provides an electronic device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the method described above.
[0020] To achieve the above objectives, another aspect of the embodiments of this application proposes a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method described above.
[0021] To achieve the above objectives, another aspect of the embodiments of this application proposes a computer program product, including a computer program that, when executed by a processor, implements the method described above.
[0022] The embodiments of this application provide at least the following beneficial effects. This application provides a method, electronic device, storage medium, and program product for estimating the pose of a two-handed articulated object. The scheme extracts features of the hands and the object from an input RGB image and encodes them into joint features; predicts, from the global features, left-hand and right-hand position queries that represent the hand regions; passes the two position queries through a self-attention mechanism to generate a relative hand position query that represents the positional relationship between the two hands, and combines it with the mapped hand parameter queries and the object parameter query into a joint query; uses an attention masking mechanism so that the joint query interacts with the joint features and each type of query is matched with its most relevant features; and finally regresses the pose, shape, and translation parameters of the hands, the rotation, translation, and initial hinge angle of the object, and the hinge angle correction predicted from the relative hand position query. The initial hinge angle and the correction are added to obtain the final hinge angle, and a three-dimensional mesh of the hands and the articulated object is output in the camera coordinate system. By characterizing the relative positional relationship of the two hands and associating it with the hinge angle of the articulated object, this application improves the reconstruction of hands and articulated objects, most notably in the relative position and contact area of the left hand, right hand, and object. It offers advantages in pose plausibility and can be applied to fields such as human-computer interaction, virtual reality, augmented reality, and imitation-based robotic skill learning, providing them with pose information for hands and articulated objects.
Attached Figure Description
[0023] Figure 1 is a flowchart of the steps of the two-handed articulated object pose estimation method provided in this application.
[0024] Figure 2 is a qualitative comparison between the embodiments of this application and the best existing methods on the Arctic dataset.
[0025] Figure 3 is an overall flowchart of the transformer-based pose estimation method for a two-handed articulated object provided in the embodiments of this application.
[0026] Figure 4 is a schematic diagram of the hardware structure of the electronic device provided in the embodiments of this application.
Detailed Implementation
[0027] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to limit it. In the following description, when referring to the accompanying drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with those of this application; they are merely examples of apparatuses and methods consistent with some aspects of the embodiments of this application as detailed in the appended claims.
[0028] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.
[0029] Before providing a detailed description of the embodiments of this application, some of the nouns and terms involved in the embodiments of this application will be explained first. The nouns and terms involved in the embodiments of this application are subject to the following interpretations.
[0030] 1) The Transformer is a deep learning model based on the self-attention mechanism. It was proposed by the Google team in 2017 and was initially used for natural language processing (NLP) tasks, such as machine translation. It has now become a core architecture in the field of artificial intelligence and is widely used in scenarios such as text generation and sentiment analysis.
[0031] We interact with the world using our hands, and studying hand and object postures is a crucial step in helping machines understand human behavior. It has important applications in human-computer interaction, virtual reality, augmented reality, and imitation-based machine skill learning. By recognizing the postures of a user's hands and objects, the system can respond accordingly to the user's actions, improving the naturalness and fluency of the interaction, reducing limitations on input devices, and thus expanding the application scenarios of human-computer interaction. Hand and object posture estimation can also enhance the experience of virtual reality and augmented reality, increasing interactivity and having significant implications for games and medicine. When a robotic arm mimics human hand operations, hand posture provides motion information during interaction, while object posture provides grasping information, making the actions imitated by the robotic arm more physically plausible, thereby increasing the success rate of the robotic arm's object manipulation and enabling the development of robots that can be applied in various industries such as manufacturing, healthcare, and search and rescue operations. These applications all place demands on the plausibility of the estimated 3D meshes of hands and objects.
[0032] Traditionally, pose estimation methods primarily estimate the 3D coordinates of joints. With the advent of parametric hand models such as the MANO model, the estimated pose parameters can be input into the parametric model to obtain a 3D mesh of the hand. Additionally, objects are often given a standard template, and their 3D meshes are obtained by transforming the estimated object pose parameters.
[0033] Previous methods for bimanual object pose estimation primarily addressed information loss caused by mutual occlusion between the hands and the object, as well as the ill-posed problem of recovering 3D pose from 2D images. Furthermore, previous research has mainly focused on bimanual rigid object pose estimation, with less attention paid to bimanual articulated object pose estimation. However, bimanual articulated object pose estimation presents unique challenges, and simply applying methods from bimanual rigid object pose estimation is insufficient. Typically, to ensure more plausible hand and object poses, pose estimation algorithms need to characterize the dependency between the hands and the object and use this dependency to constrain the hand-object pose. For bimanual articulated objects, in addition to the rotational and translational dependencies between the hands and the object that also arise with rigid objects, it is crucial to consider the relationship between the hands and the hinge angle of the articulated object. Unlike the well-studied global targets of rigid object pose estimation, namely object rotation and translation, the hinge angle is a local target that represents the relative relationship between two parts of the object. We observed that this target differs in granularity from the global targets, and that its dependency on the hands also differs. Therefore, the traditional methods for modeling hand-object dependencies on rigid objects are not suitable as they stand. Existing hand-articulated object algorithms are not designed for this one-to-one relationship.
[0034] In view of this, this application provides a transformer-based method for estimating the pose of a two-handed articulated object, an electronic device, a storage medium, and a program product, specifically including: 1) Feature extraction: Extracting features of the hands and the articulated object through three convolutional neural networks to provide semantic information for subsequent modules; 2) Query extraction: Extracting the regions belonging to the hands from the features by regressing the probability of each pixel belonging to the left and right hands, and using them as queries for the left and right hands to provide two-dimensional information for subsequent modules; 3) Query interaction: Extracting the relative position query of the hands containing the relative positional relationship information of the hands from the queries of the left and right hands through a self-attention module, which is used to represent the relative positional relationship of the hands; 4) Query-feature interaction: Matching the query output by the query interaction module with the features output by the feature extraction module to provide appearance information for the query; 5) Parameter regression: The query further regresses the pose parameters of the hands and the articulated object through a linear layer, wherein the relative position query of the hands is used to predict the articulation angle correction of the object, which is used to connect the hands and the articulated object.
[0035] The transformer-based pose estimation method for hand-held articulated objects provided in this application relates to the field of hand-held object pose estimation. This method can be applied to terminals, servers, or software running on either a terminal or a server. In some embodiments, the terminal can be a smartphone, tablet, laptop, desktop computer, smart speaker, smartwatch, or in-vehicle terminal, but is not limited to these. The server can be configured as an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The server can also be a node server in a blockchain network. The software can be an application implementing the hand-held articulated object pose estimation method, but is not limited to the above forms.
[0036] This application can be used in a wide variety of general-purpose or special-purpose computer system environments or configurations. Examples include: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. This application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform specific tasks or implement specific abstract data types. This application can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.
[0037] As shown in Figure 1, this embodiment provides a method for estimating the pose of a two-handed articulated object, including the following steps:
[0038] S1, Feature Extraction.
[0039] Given an RGB image, a backbone network (such as ResNet50) is used to extract global features. Then, based on the provided left-hand, right-hand, and object bounding boxes, corresponding image regions are cropped, and their features are extracted separately using a convolutional neural network (CNN). The left-hand and right-hand feature extraction networks can share weights. Finally, the global features, left-hand features, right-hand features, and object features are concatenated along the sequence dimension, and sine and cosine positional encodings are added to generate joint features.
[0040] S2, Query Extraction.
[0041] The global features output by the feature extraction module are dimensionality-reduced and upsampled to restore spatial resolution and refine the features. Then, two prediction heads are used to regress the probability of each pixel on the feature map belonging to the left and right hands, respectively. The top N pixels with the highest probabilities (e.g., N=40) are selected to form left-hand and right-hand position queries, respectively. This step utilizes 2D segmentation masks for supervised learning, injecting spatial location information into the queries.
[0042] S3, Query Interaction.
[0043] The left-hand and right-hand position queries are concatenated and input into a transformer module based on a self-attention mechanism to capture the spatial dependency between the two hands, outputting a relative hand position query containing information about their relative positions. Simultaneously, the left-hand position query, right-hand position query, and relative hand position query are mapped to a left-hand parameter query, a right-hand parameter query, and a correction value query, respectively, using a multilayer perceptron (MLP). Finally, the left-hand parameter query, right-hand parameter query, correction value query, and a learnable object parameter query are concatenated to form a joint query.
[0044] S4, Query-Feature Interaction.
[0045] The joint query and joint features are input into an interaction module based on the DETR framework. The joint query first passes through a self-attention module to enhance its internal representation, and then performs cross-attention calculation with the joint features. During this process, an attention masking mechanism is employed, forcing left-hand parameter queries to prioritize matching left-hand and global features, right-hand parameter queries to prioritize matching right-hand and global features, relative hand position queries to prioritize matching features related to both hands and the object, and object parameter queries to prioritize matching object features. This ensures that all types of queries can obtain appearance information from the most relevant features.
[0046] S5, Parametric Regression.
[0047] Decode the various queries after the interaction:
[0048] S51. Left / Right Hand Parameter Query: The hand's pose parameters (e.g., 16×6D), shape parameters (e.g., 10D), and translation parameters (e.g., 1D) are obtained through MLP regression.
[0049] S52. Object parameter query: The rotation, translation parameters and initial values of the hinge angle of the object are obtained by MLP regression.
[0050] S53. Relative hand position query: The hinge angle correction is obtained by MLP regression.
[0051] S54. The final hinge angle is obtained by adding the initial value and the correction amount.
[0052] S55. Input the hand parameters into the MANO model to obtain a 3D mesh of both hands, and transform it to the camera coordinate system using translation parameters. Transform the standard object model according to rotation parameters and the final hinge angle, and combine it with translation parameters to obtain a 3D mesh of the object.
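The "16×6D" pose parameters in S51 suggest that each of the 16 hand joint rotations is regressed as a 6-dimensional vector. The source does not state how such a vector becomes a rotation. The sketch below assumes the common Gram-Schmidt construction, in which the six numbers are read as two 3D vectors that are orthonormalised into the first two columns of a rotation matrix; the function name `rot6d_to_matrix` and the random inputs are illustrative only.

```python
import numpy as np

def rot6d_to_matrix(x6):
    """Map (..., 6) vectors to (..., 3, 3) rotation matrices via Gram-Schmidt.

    Assumption: the 6 numbers are two 3D vectors a1 and a2. The source only
    says "6D" and does not specify this layout.
    """
    x6 = np.asarray(x6, dtype=np.float64)
    a1, a2 = x6[..., :3], x6[..., 3:]
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    # Remove the component of a2 along b1, then normalise.
    a2_perp = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1
    b2 = a2_perp / np.linalg.norm(a2_perp, axis=-1, keepdims=True)
    b3 = np.cross(b1, b2)  # completes a right-handed orthonormal frame
    return np.stack([b1, b2, b3], axis=-1)  # b1, b2, b3 become the columns

rng = np.random.default_rng(0)
pose_6d = rng.normal(size=(16, 6))   # stand-in for one hand's regressed pose
R = rot6d_to_matrix(pose_6d)         # (16, 3, 3)
```

Whatever the 6D convention actually used, the resulting rotations would still need converting to the input format expected by the MANO implementation at hand.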
[0053] The solutions of the embodiments of this application will be described in detail below with reference to the accompanying drawings and specific application examples.
[0054] As shown in Figure 2, this embodiment provides a transformer-based method for estimating the pose of a two-handed articulated object, including the following steps:
[0055] Step 1, Feature Extraction. An RGB image is processed through a ResNet50 network to obtain global features. Then, the RGB image is cropped along the bounding boxes of the left hand, right hand, and object to obtain RGB images of the left hand, right hand, and object. The RGB images of the left hand and right hand are processed through a ResNet50 network to obtain left hand and right hand features, and the RGB image of the object is processed through a ResNet50 network to obtain object features. Global features, hand features, and object features are concatenated along the sequence dimension and positional encoding is added to obtain joint features. The output of this step is the global features and the joint features.
[0056] In some embodiments, the network used in step 1 to extract features from the left and right hand images is shared. This design is because both the left and right hands belong to the hand region and have similar appearance features, so the same network is used to extract features.
[0057] In some embodiments, the joint features in step 1 use the same manually designed sine and cosine position encoding as the transformer model, that is, the position of the feature in the horizontal and vertical directions is encoded with sine and cosine respectively.
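A minimal sketch of a two-dimensional sine and cosine position encoding of the kind described above. The 7×7 grid size, the `temperature` constant of 10000 (the usual transformer convention), and the choice to give half of the channels to the vertical position and half to the horizontal position are assumptions, not details taken from the source.

```python
import numpy as np

def sincos_position_encoding_2d(height, width, dim, temperature=10000.0):
    """Return a (height * width, dim) encoding.

    The first dim/2 channels encode the row (vertical) index and the last
    dim/2 channels encode the column (horizontal) index, each as sin then cos.
    """
    assert dim % 4 == 0, "dim must be divisible by 4"
    half = dim // 2
    freqs = temperature ** (-np.arange(0, half, 2) / half)   # (half / 2,)
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")

    def encode(pos):
        angles = pos.reshape(-1, 1) * freqs.reshape(1, -1)   # (H*W, half/2)
        return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

    return np.concatenate([encode(ys), encode(xs)], axis=1)

# Hypothetical 7x7 feature map with 512 channels.
pe = sincos_position_encoding_2d(7, 7, 512)   # (49, 512)
```

The encoding is added element-wise to features flattened to the same `(H*W, dim)` layout.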
[0058] Step 2, Query Extraction. The global features output by the feature extraction module are processed through two convolutional layers and two upsampling operations, reducing the feature channel dimension from 2048 to 512. These features are then input into two prediction heads to predict the probability of each pixel in the feature map belonging to the left hand and the right hand, respectively. The 40 pixels with the highest probability of belonging to the left hand are then selected to form the left-hand position query, and the 40 pixels with the highest probability of belonging to the right hand are selected to form the right-hand position query. The output of this step is the left-hand position query and the right-hand position query, both with a dimension of 40×512.
[0059] In some embodiments, the dimensionality reduction step described in step 2 first involves a convolutional layer reducing the feature channel dimension from 2048 to 512, followed by a ReLU activation function, then a nearest neighbor interpolation layer that quadruples both the height and width of the input feature map, followed by another convolutional layer with the feature channel dimension remaining at 512, another ReLU activation function, and finally, nearest neighbor interpolation that doubles the height and width of the feature map again. This dimensionality reduction operation refines the feature representation and restores the spatial resolution of the feature map, preparing for the subsequent extraction of the hand region from the feature map.
[0060] In some embodiments, the step 2 of predicting the left-hand position query using the prediction head is as follows: The feature map is passed through a convolutional layer with the channel dimension unchanged, and then through a softmax operation to obtain the probability that each pixel belongs to the left hand. The 40 pixels with the highest probability of belonging to the left hand are then selected, and the region formed by these 40 pixels constitutes the left-hand position query. Similarly, the right-hand position query is obtained. This operation is to obtain the two-dimensional position queries for both hands.
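The top-N selection can be sketched as follows. This assumes the prediction head has already produced one score per pixel, and it applies the softmax over all pixels of the map, which is only one possible reading of the text; because top-N selection depends only on the ordering of scores, a different normalisation would pick the same pixels. The function name and the 16×16 map size are hypothetical.

```python
import numpy as np

def select_position_queries(feature_map, hand_logits, n=40):
    """Pick the n pixels most likely to belong to a hand.

    feature_map: (C, H, W) refined global features.
    hand_logits: (H, W) per-pixel scores from one prediction head (assumed).
    Returns the (n, C) position query and the flat indices of the chosen pixels.
    """
    c, h, w = feature_map.shape
    logits = hand_logits.reshape(-1)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax over pixels (assumption)
    top = np.argsort(-probs)[:n]              # indices of the n largest probabilities
    queries = feature_map.reshape(c, h * w).T[top]
    return queries, top

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(512, 16, 16))  # hypothetical spatial size
left_logits = rng.normal(size=(16, 16))
left_query, left_idx = select_position_queries(feature_map, left_logits, n=40)
```

Running the same function with the right-hand head's scores would give the right-hand position query.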
[0061] Step 3, Query Interaction. The left-hand position query and right-hand position query are concatenated along the sequence dimension and then fed into a transformer network for self-attention, outputting a relative hand position query with a dimension of 80×512. The left-hand and right-hand position queries are each fed into an MLP to obtain a left-hand parameter query and a right-hand parameter query, each with a dimension of 18×512. The first 16 queries along the sequence dimension are used to predict pose parameters, the 17th query is used to predict shape parameters, and the 18th query is used to predict translation parameters. The relative hand position query is then fed into an MLP to obtain a correction query with a dimension of 1×512, used to predict the correction amount for the hinge angle. The left-hand parameter query, right-hand parameter query, correction query, and object parameter query are concatenated along the sequence dimension to obtain a joint query, where the object parameter query is a learnable query with a dimension of 2×512. The output of this step is the joint query.
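The bookkeeping of the joint query can be illustrated with placeholder arrays. The random tensors below merely stand in for the MLP outputs and the learnable object query, and the `SLICES` table is a hypothetical convenience that follows from the stated dimensions (18 + 18 + 1 + 2 = 39 queries).

```python
import numpy as np

D = 512
rng = np.random.default_rng(0)

# Placeholders for the MLP outputs and the learnable object query.
left_param_q = rng.normal(size=(18, D))
right_param_q = rng.normal(size=(18, D))
correction_q = rng.normal(size=(1, D))
object_q = rng.normal(size=(2, D))

# Concatenate along the sequence dimension.
joint_query = np.concatenate(
    [left_param_q, right_param_q, correction_q, object_q], axis=0
)  # (39, 512)

# Row ranges of each sub-query inside the joint query.
SLICES = {
    "left_pose": slice(0, 16),
    "left_shape": slice(16, 17),
    "left_trans": slice(17, 18),
    "right_pose": slice(18, 34),
    "right_shape": slice(34, 35),
    "right_trans": slice(35, 36),
    "hinge_correction": slice(36, 37),
    "object": slice(37, 39),
}
```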
[0062] In some embodiments, the self-attention module that produces the relative hand position query in step 3 is structured as follows. As in the self-attention module of a transformer encoder, the input is projected by three linear layers into queries, keys, and values, and attention is computed among them. The result is passed through a feedforward layer and finally mapped to the relative hand position query by another linear layer. This module extracts relative position information from the input left-hand and right-hand position queries.
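The attention computation at the core of this module might look like the sketch below. It is deliberately reduced to a single head and leaves out the feedforward layer, the final linear mapping, residual connections, and normalisation; the random weight matrices are placeholders for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over (T, D) tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, T) pairwise similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 512
left_q = rng.normal(size=(40, d))
right_q = rng.normal(size=(40, d))
tokens = np.concatenate([left_q, right_q], axis=0)            # (80, 512)
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
relative_q, attn = self_attention(tokens, w_q, w_k, w_v)      # (80, 512), (80, 80)
```

Because left-hand and right-hand tokens share one sequence, every output token can attend to pixels of both hands, which is what lets the output carry relative position information.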
[0063] Step 4, Query-Feature Interaction. The joint features obtained in the feature extraction step and the joint query obtained in the query interaction step are matched using a DETR-style module. Specifically, the joint query first passes through a self-attention module, which helps the model capture the relationships among the different parameter queries. Its output and the joint features are then input into the cross-attention module. Through the attention masking mechanism, the left-hand parameter query is matched with the left-hand features and global features, the right-hand parameter query with the right-hand features and global features, the relative hand position query with the left-hand, right-hand, object, and global features, and the object parameter query with the object features.
[0064] In some embodiments, the query-feature matching in step 4 is implemented through an attention masking mechanism: for each type of query, a very large negative number is added to the attention scores between that query and the features it is not expected to match, so that the corresponding attention weights tend to zero after softmax. This allows the left-hand, right-hand, and articulated object parameters, which differ in granularity, to be matched with appropriate features.
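The masking rule can be sketched as an additive bias on the cross-attention scores. The token counts per feature group, the constant `NEG = -1e9`, and the use of one query per type are illustrative assumptions; the `ALLOWED` table follows the matching described in step 4.

```python
import numpy as np

NEG = -1e9  # "very large negative number"; the exact value is an assumption

# Hypothetical layout of the joint features along the sequence dimension.
feature_groups = ["global"] * 3 + ["left"] * 2 + ["right"] * 2 + ["object"] * 2
query_groups = ["left", "right", "relative", "object"]

# Which feature groups each query type may attend to.
ALLOWED = {
    "left": {"left", "global"},
    "right": {"right", "global"},
    "relative": {"left", "right", "object", "global"},
    "object": {"object"},
}

def build_bias(query_groups, feature_groups):
    """Return a (num_queries, num_features) bias: 0 where allowed, NEG elsewhere."""
    bias = np.zeros((len(query_groups), len(feature_groups)))
    for i, q in enumerate(query_groups):
        for j, f in enumerate(feature_groups):
            if f not in ALLOWED[q]:
                bias[i, j] = NEG
    return bias

def masked_softmax(scores, bias):
    z = scores + bias
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 9))           # stand-in for query-key dot products
bias = build_bias(query_groups, feature_groups)
weights = masked_softmax(scores, bias)     # blocked positions receive weight 0
```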
[0065] Step 5, Parameter Regression. After feature matching, the left-hand and right-hand parameter queries output 16×6D pose parameters, 10D shape parameters, and 1D translation parameters for the left and right hands respectively through the MLP. The object parameter query outputs the initial hinge angle, rotation, and translation parameters of the object through the MLP. The relative position query of the two hands outputs the correction amount of the hinge angle through the MLP. The initial hinge angle value and the hinge angle correction amount are added together to obtain the hinge angle. The pose and shape parameters of the hands are input into the MANO model to obtain the 3D mesh of the hands, and then the translation parameters are added to transform it into the camera coordinate system. Simultaneously, the standard object model is first rotated according to the rotation parameters, and then the upper half is rotated along the rotation axis according to the predicted hinge angle, thus obtaining the 3D mesh of the object, which is then transformed into the camera coordinate system by adding the translation parameters. The 3D meshes of the hands and the object output in this step are the target output of this invention.
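A rough sketch of how the articulated object mesh could be posed from the regressed parameters. It assumes the hinge axis passes through the origin of the object's canonical frame and that a boolean mask marks the vertices of the upper half; neither detail is given in the source. The sketch articulates the canonical model first and then applies the global rotation, which gives the same result as rotating first and articulating about the rotated axis.

```python
import numpy as np

def axis_angle_matrix(axis, angle):
    """Rodrigues' formula: rotation by `angle` (radians) about `axis`."""
    axis = np.asarray(axis, dtype=np.float64)
    axis = axis / np.linalg.norm(axis)
    k = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * k + (1.0 - np.cos(angle)) * (k @ k)

def pose_articulated_object(verts, top_mask, hinge_axis,
                            hinge_init, hinge_delta, global_rot, trans):
    """Return posed (V, 3) vertices in the camera frame and the final hinge angle."""
    hinge = hinge_init + hinge_delta                 # final angle = initial + correction
    out = np.array(verts, dtype=np.float64)
    # Rotate only the upper half about the hinge axis (row vectors, hence .T).
    out[top_mask] = out[top_mask] @ axis_angle_matrix(hinge_axis, hinge).T
    return out @ global_rot.T + trans, hinge

# Toy example: two vertices, the first belongs to the upper half.
verts = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
top_mask = np.array([True, False])
posed, hinge = pose_articulated_object(
    verts, top_mask, hinge_axis=[0.0, 0.0, 1.0],
    hinge_init=np.pi / 4, hinge_delta=np.pi / 4,
    global_rot=np.eye(3), trans=np.array([0.0, 0.0, 1.0]),
)
```

In the toy example the upper-half vertex swings by 90 degrees about the z axis while the lower-half vertex is only translated.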
[0066] The training details of the algorithm will be discussed in the following parts:
[0067] (1) Label preparation
[0068] This algorithm is trained and tested on the Arctic dataset, which contains over two million images depicting scenes of two hands manipulating articulated objects (such as scissors, pliers, laptops, etc.). The dataset includes one first-person view and eight third-person viewpoints, and provides complete labels, including MANO parameters (pose and shape parameters) and translation parameters in the world coordinate system for both hands, 2D and 3D keypoints for both hands and the object, and rotation, articulation angle, and translation parameters in the world coordinate system for the object. Hand-object vertex pairs with a Euclidean distance within 3 mm are defined as contact pairs. In the data preprocessing stage, this embodiment uses a neural 3D mesh renderer to render a binary segmentation mask for both hands from the 3D mesh labels. This segmentation mask is used in the segmentation loss during training.
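The contact-pair definition in paragraph [0068] can be sketched as a brute-force search. The function name and the toy vertices below are hypothetical; only the 3 mm threshold comes from the text, and coordinates are assumed to be in millimeters.

```python
import math

CONTACT_THRESHOLD_MM = 3.0  # threshold stated in [0068]


def find_contact_pairs(hand_vertices, object_vertices,
                       threshold=CONTACT_THRESHOLD_MM):
    """Return (hand_index, object_index) pairs whose Euclidean distance
    is within the threshold."""
    pairs = []
    for i, h in enumerate(hand_vertices):
        for j, o in enumerate(object_vertices):
            if math.dist(h, o) <= threshold:
                pairs.append((i, j))
    return pairs


hand = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
obj = [(0.0, 2.0, 0.0), (10.0, 0.0, 5.0)]
pairs = find_contact_pairs(hand, obj)
```

The double loop is quadratic in the number of vertices; for full hand and object meshes a spatial index such as a k-d tree would be a natural replacement.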
[0069] (2) Loss function
[0070] The training loss functions include: pose loss, shape loss, 2D reprojection loss, and 3D keypoint loss, which supervise the reconstruction of the left and right hands, with initial weights of 10.0, 0.001, 5.0, and 5.0, respectively; object 2D reprojection loss, 3D keypoint loss, rotation loss, hinge angle loss, and contact deviation loss, which supervise the object reconstruction, with initial weights of 1.0, 5.0, 1.0, 1.0, and 1.0, respectively; two-hand relative translation loss and right-hand-object relative translation loss, which measure the relative positional relationship between the left hand, right hand, and object, with initial weights of 1.0 and 1.0, respectively; world coordinate system translation loss, which supervises the absolute translation of the hands and the object in the world coordinate system, with initial weights of 1.0 and 1.0, respectively; and segmentation loss, which supervises the pixel category predictions of the intermediate step 2, with initial weights of 10.0 for both the left-hand and right-hand categories. Except for the contact deviation loss and the segmentation loss, all loss functions are the mean squared error between the predicted value and the corresponding label. The contact deviation loss is defined as the Euclidean distance loss between the hand-object vertices of each contact pair and the label, while the segmentation loss uses cross-entropy loss to supervise the classification of pixels.
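The weighted combination of loss terms in paragraph [0070] can be sketched as follows. The dictionary keys are hypothetical labels for the terms listed above; only the numeric initial weights come from the text, and the sketch covers the summation and the mean squared error rather than the contact deviation or cross-entropy terms.

```python
# Initial weights from [0070]; the key names are illustrative assumptions.
INITIAL_WEIGHTS = {
    "hand_pose": 10.0, "hand_shape": 0.001, "hand_kp2d": 5.0, "hand_kp3d": 5.0,
    "obj_kp2d": 1.0, "obj_kp3d": 5.0, "obj_rot": 1.0, "obj_hinge": 1.0,
    "contact_dev": 1.0,
    "rel_trans_hands": 1.0, "rel_trans_right_obj": 1.0,
    "world_trans_hands": 1.0, "world_trans_obj": 1.0,
    "seg_left": 10.0, "seg_right": 10.0,
}


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(terms, weights=INITIAL_WEIGHTS):
    """Weighted sum of the already-computed loss terms."""
    return sum(weights[name] * value for name, value in terms.items())


hinge_term = mse([1.0, 2.0], [1.0, 4.0])
loss = total_loss({"hand_pose": 0.5, "hand_shape": 2.0, "obj_hinge": hinge_term})
```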
[0071] (3) Training settings
[0072] The Adam optimizer was used during training, with a learning rate of e-4 and a batch size of 32. Training was conducted for 20 epochs in third-person perspective and then for 50 epochs in first-person perspective.
[0073] (4) Adaptive training strategy
[0074] The difficulty of reducing each loss is measured by the rate at which that loss decreases. The slower a loss decreases, the larger the weight assigned to it, in order to accelerate its decrease.
[0075] Regarding inference, given an RGB image containing two hands and an articulated object, the algorithm outputs a 3D mesh of the hands and the articulated object in the camera coordinate system.
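The text does not give a formula for the adaptive strategy in paragraph [0074]. One possible realisation, shown purely as an assumption, scales each initial weight by a softmax over the ratio of the current loss to the previous loss, so that a term whose loss has barely decreased receives a larger weight. The function name, the ratio-based rate, and the temperature parameter are hypothetical.

```python
import math


def adaptive_weights(base_weights, prev_losses, curr_losses, temperature=1.0):
    """Scale each base weight by how slowly its loss is decreasing.

    A ratio curr/prev close to (or above) 1 means the loss hardly decreased,
    so the term is treated as difficult and its weight is increased.
    """
    names = list(base_weights)
    ratios = {n: curr_losses[n] / max(prev_losses[n], 1e-12) for n in names}
    exps = {n: math.exp(ratios[n] / temperature) for n in names}
    norm = sum(exps.values())
    # Multiply by the number of terms so the scale factors average to 1.
    return {n: base_weights[n] * len(names) * exps[n] / norm for n in names}


weights = adaptive_weights(
    base_weights={"fast_term": 1.0, "slow_term": 1.0},
    prev_losses={"fast_term": 1.0, "slow_term": 1.0},
    curr_losses={"fast_term": 0.5, "slow_term": 0.99},
)
```

Here the term whose loss fell from 1.0 to 0.99 ends up with a larger weight than the term whose loss fell to 0.5, while the scale factors still sum to the number of terms.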
[0076] To assess the effectiveness of this algorithm, qualitative and quantitative experiments were conducted on the first-person views of the Arctic dataset, and comparisons were made with other methods for pose estimation of two-handed articulated objects. The qualitative experimental results are shown in Figure 3, and the quantitative experimental results are shown in Table 1. The indicators used in the quantitative experiments are defined as follows:
[0077] 1) AAE: The mean absolute error between the predicted hinge angle and the labeled value. It measures the accuracy of the predicted relative rotation between the two parts of the articulated object. The smaller the value, the better.
[0078] 2) Success rate: The percentage of vertices whose L2 error between the predicted and labeled values is less than 5% of the object's diameter. It measures the object reconstruction performance. The higher the percentage, the better.
[0079] 3) MPJPE: The mean Euclidean distance, in millimeters, between the 21 predicted joints of each hand and the labeled joints, computed after subtracting the 3D coordinates of the root joint. It measures the hand reconstruction performance. The smaller the value, the better.
[0080] 4) MRRPE / r / l: The error between the predicted distance between the root joints of the left and right hands and the labeled distance between the root joints of the two hands. It measures the relative position between the two hands. The smaller the error, the better.
[0081] 5) MRRPE / r / o: The error between the predicted right-hand-object root joint distance and the labeled right-hand-object root joint distance. It measures the relative positional relationship between the hand and the object. The smaller the error, the better.
[0082] 6) CDev: The average distance, in the prediction results, between the hand vertex and the object vertex of each contact pair. It measures the contact between the hand and the object. The smaller the value, the better.
[0083] Table 1 Quantitative comparison with other methods
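The indicators defined above can be sketched in a few lines. The function names are hypothetical, angles are compared without wrap-around handling, and the MRRPE sketch compares relative root translation vectors, which is one common reading of the definition and an assumption here rather than the exact evaluation code used in the experiments.

```python
import math


def aae(pred_angles, gt_angles):
    """Mean absolute error between predicted and labeled hinge angles."""
    return sum(abs(p - g) for p, g in zip(pred_angles, gt_angles)) / len(pred_angles)


def success_rate(pred_vertices, gt_vertices, diameter, ratio=0.05):
    """Percentage of vertices whose L2 error is below ratio * object diameter."""
    hits = sum(math.dist(p, g) < ratio * diameter
               for p, g in zip(pred_vertices, gt_vertices))
    return 100.0 * hits / len(pred_vertices)


def root_relative(joints, root=0):
    """Subtract the root joint from every joint."""
    r = joints[root]
    return [[j[i] - r[i] for i in range(3)] for j in joints]


def mpjpe(pred_joints, gt_joints, root=0):
    """Mean per-joint Euclidean error after root alignment."""
    p, g = root_relative(pred_joints, root), root_relative(gt_joints, root)
    return sum(math.dist(a, b) for a, b in zip(p, g)) / len(p)


def mrrpe(pred_root_a, pred_root_b, gt_root_a, gt_root_b):
    """Error between predicted and labeled root-to-root offsets."""
    pred_rel = [b - a for a, b in zip(pred_root_a, pred_root_b)]
    gt_rel = [b - a for a, b in zip(gt_root_a, gt_root_b)]
    return math.dist(pred_rel, gt_rel)


def cdev(pred_hand_vertices, pred_object_vertices, contact_pairs):
    """Mean predicted distance between the vertices of each labeled contact pair."""
    dists = [math.dist(pred_hand_vertices[i], pred_object_vertices[j])
             for i, j in contact_pairs]
    return sum(dists) / len(dists)
```

Because MPJPE is computed after root alignment, a prediction that is shifted as a whole still scores zero on MPJPE; the MRRPE terms are what capture such errors in relative placement.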
[0084]
[0085] As shown in Table 1, the quantitative experimental results demonstrate that, compared to the SOTA method, the proposed method achieves a 6.6% improvement in MPJPE, a 5.2% improvement in MRRPE / r / l, a 5.3% improvement in MRRPE / r / o, a 7.3% improvement in success rate, and a 7.2% improvement in CDev. This indicates that the proposed method more fully considers the dependencies between the two hands and the articulated object, resulting in improvements in the two-hand reconstruction metrics, the positional relationship between the two hands, the positional relationship between the hand and the object, and the contact relationship between the hand and the object. The qualitative experimental results in Figure 1 also demonstrate the advantages of this invention over the previous best methods in terms of the relative positions of the hands and the object. Combining the quantitative and qualitative experiments, it can be seen that this invention solves the problem of inaccurate prediction of the relative positions of the left hand, right hand, and object in previous methods for estimating the pose of two-handed articulated objects. It also achieves better results in the contact relationship between the hands and the object, and can reconstruct a more reasonable 3D mesh of the hands and the articulated object, providing the two-handed articulated object pose required in fields such as human-computer interaction, virtual reality, augmented reality, and imitation-based robotic skill learning.
[0086] In summary, compared with the prior art, this application has the following advantages and beneficial effects:
[0087] 1) For the task of pose estimation of two-handed articulated objects, a transformer-based pose estimation method that considers the relationship between the two hands and the articulated object is proposed. Experimental results show that the proposed method significantly improves the reconstruction results of the two hands and the articulated object compared with previous methods, especially in terms of the relative positions and contact relationships of the left hand, right hand, and the object. It significantly outperforms the state-of-the-art methods and provides a more effective solution for pose estimation in fields such as human-computer interaction, virtual reality, augmented reality, and imitation-based machine skill learning.
[0088] 2) The method proposed in this application also provides technical support for subsequent work on the pose estimation of two-handed articulated objects. This application constructs a left-hand-right-hand position query containing 2D information by designing a query extraction module and a query interaction module to predict pixels belonging to the left and right hands. It then uses an attention mechanism to capture the relative positional relationship between the left and right hands from the 2D information-containing left-hand-right-hand position query, providing a way to express the relative positional relationship between the two hands. This application also uses queries containing the relative positions of the two hands to predict the hinge angle of the object, thereby associating the relative positions of the two hands with the articulated object and providing a way to characterize the consistency of the two-handed articulated object. This application further demonstrates through qualitative and quantitative experiments that the two-handed articulated object association proposed in this invention is beneficial to the reconstruction effect and relative position of the two hands and the object. This proves that the relative positions of the two hands are consistent with the rotation angle of the articulated object, verifying the significance of characterizing the consistency of the two-handed articulated object and using it for pose estimation of the two-handed articulated object. This lays the groundwork for further exploration of ways to extract the relative positions of the two hands and associate them with the two-handed articulated object, and for developing a more effective pose estimation method for the two-handed articulated object.
[0089] 3) An innovative association model for two hands and an articulated object is established: By introducing the "relative position query of the two hands" and using it to predict the "hinge angle correction amount", this application for the first time explicitly models, within the pose estimation framework, the physical consistency between the relative spatial relationship of the hands and the articulation state of the object, addressing a core difficulty in this field.
[0090] 4) Significantly improves reconstruction accuracy and rationality: Experiments on public datasets show that the present invention significantly outperforms existing best methods in several key indicators, including hand reconstruction error (MPJPE), hand relative position error (MRRPE / r / l), hand-object relative position error (MRRPE / r / o), object reconstruction success rate, and hand-object contact accuracy (CDev), especially in terms of significant improvement in contact area and relative position.
[0091] 5) Provides a general technical framework: The query extraction, interaction and matching mechanism proposed in this application provides a new idea and technical foundation for subsequent research on the pose estimation problem of interaction between hands and complex objects.
[0092] 6) It has broad application prospects: The high-precision and reasonable three-dimensional mesh generated by this application can be directly applied to fields such as human-computer interaction, virtual reality, augmented reality, and robot imitation learning, thereby improving the realism and practicality of these applications.
[0093] This application also provides an electronic device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the above-described method. This electronic device can be any smart terminal, including tablet computers, in-vehicle computers, etc.
[0094] It is understood that the content of the above method embodiments is applicable to this device embodiment. The specific functions implemented by this device embodiment are the same as those of the above method embodiments, and the beneficial effects achieved are also the same as those achieved by the above method embodiments.
[0095] Referring to Figure 4, which illustrates the hardware structure of an electronic device according to another embodiment, the electronic device includes:
[0096] The processor 401 can be implemented using a general-purpose CPU (Central Processing Unit), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this application.
[0097] The memory 402 can be implemented as a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 402 can store the operating system and other applications. When the technical solutions provided in the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 402 and is called and executed by the processor 401 using the methods described in the embodiments of this application.
[0098] Input / output interface 403 is used to implement information input and output;
[0099] The communication interface 404 is used to enable communication and interaction between this device and other devices. Communication can be achieved through wired means (such as USB, Ethernet cable, etc.) or wireless means (such as mobile network, WIFI, Bluetooth, etc.).
[0100] Bus 405 transmits information between various components of the device (e.g., processor 401, memory 402, input / output interface 403, and communication interface 404);
[0101] The processor 401, memory 402, input / output interface 403 and communication interface 404 are connected to each other within the device via bus 405.
[0102] This application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described method.
[0103] It is understood that the content of the above method embodiments is applicable to this storage medium embodiment. The specific functions implemented in this storage medium embodiment are the same as those in the above method embodiments, and the beneficial effects achieved are also the same as those achieved in the above method embodiments.
[0104] Memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs. Furthermore, memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory may optionally include memory remotely located relative to the processor, and these remote memories can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
[0105] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the above-described method.
[0106] It is understood that the content of the above method embodiments is applicable to the embodiments of this program product. The specific functions implemented in the embodiments of this program product are the same as those in the above method embodiments, and the beneficial effects achieved are also the same as those achieved in the above method embodiments. The executable computer program code or "code" used to perform the various embodiments can be written in high-level programming languages such as C, C++, Python, Smalltalk, Java, JavaScript, Visual Basic, Structured Query Language (e.g., Transact-SQL), Perl, or in various other programming languages.
[0107] The embodiments described in this application are for the purpose of more clearly illustrating the technical solutions of the embodiments of this application, and do not constitute a limitation on the technical solutions provided by the embodiments of this application. As those skilled in the art will know, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of this application are also applicable to similar technical problems.
[0108] Those skilled in the art will understand that the technical solutions shown in the figures do not constitute a limitation on the embodiments of this application, and may include more or fewer steps than shown, or combine certain steps, or different steps.
[0109] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0110] Those skilled in the art will understand that all or some of the steps in the methods disclosed above, as well as the functional modules / units in the systems and devices, can be implemented as software, firmware, hardware, or suitable combinations thereof.
[0111] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in the specification and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms “comprising” and “having,” and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0112] It should be understood that in this application, "at least one (item)" means one or more, and "a plurality of" means two or more. "And / or" is used to describe the relationship between related objects, indicating that three relationships can exist. For example, "A and / or B" can represent three cases: only A exists, only B exists, and both A and B exist simultaneously, where A and B can be singular or plural. The character " / " generally indicates that the preceding and following related objects are in an "or" relationship. "At least one (item) of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one (item) of a, b, or c can represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can be single or multiple.
[0113] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units described above is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.
[0114] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0115] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0116] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes multiple instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing programs, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0117] The preferred embodiments of the present application have been described above with reference to the accompanying drawings, but this does not limit the scope of the claims of the present application. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and substance of the embodiments of the present application shall be within the scope of the claims of the present application.
Claims
1. A method for estimating the pose of a two-handed articulated object, characterized in that the method includes the following steps:
Feature extraction step: extract global features from the input RGB image, and extract the region features of the left hand, the right hand, and the articulated object respectively; concatenate the global features, left-hand features, right-hand features, and object features along the sequence dimension and add position encoding to generate joint features;
Query extraction step: based on the global features, predict the probability that each pixel in the image belongs to the left hand and the right hand, and select the top N pixels with the highest probability of belonging to the left hand and the right hand respectively to form the left-hand position query and the right-hand position query;
Query interaction step: the left-hand position query and the right-hand position query are concatenated and processed through a self-attention mechanism to extract the relative position information of the two hands and output the relative position query of the two hands; the left-hand position query, the right-hand position query, and the relative position query of the two hands are mapped to the left-hand parameter query, the right-hand parameter query, and the correction amount query, respectively; the left-hand parameter query, the right-hand parameter query, the correction amount query, and a learnable object parameter query are concatenated to generate a joint query;
Query-feature interaction step: perform cross-attention matching between the joint query and the joint features, wherein, through the attention masking mechanism, the left-hand parameter query, the right-hand parameter query, the relative position query of the two hands, and the object parameter query are matched with their corresponding features respectively;
Parameter regression step: decode the left-hand parameter query, the right-hand parameter query, the object parameter query, and the relative position query of the two hands after feature matching, and regress the pose, shape, and translation parameters of the left hand, the pose, shape, and translation parameters of the right hand, the rotation, initial value of the hinge angle, and translation parameters of the object, as well as the hinge angle correction amount; wherein the initial value of the hinge angle and the hinge angle correction amount are added to obtain the final object hinge angle; and output the 3D mesh of both hands and the articulated object in the camera coordinate system based on all parameters;
wherein the self-attention mechanism in the query interaction step adopts a transformer encoder structure, which extracts the relative position information of the hands from the hand position queries through query, key, and value calculation and a feedforward neural network; the hand pose and shape parameters obtained from the regression are input into the MANO model to obtain the three-dimensional mesh of the hands, which is then transformed into the camera coordinate system by combining the translation parameters; and the three-dimensional mesh of the articulated object is obtained by rotating the standard model of the object according to the rotation parameters, rotating its upper half about the rotation axis according to the final hinge angle of the object, and then combining the translation parameters.
2. The method according to claim 1, characterized in that, in the feature extraction step, the convolutional neural networks used to extract features from the left-hand RGB image and the right-hand RGB image are the same network with shared weights.
3. The method according to claim 1, characterized in that the position encoding is generated using sine and cosine functions and is used to encode the position information of features in the horizontal and vertical directions.
4. The method according to claim 1, characterized in that the query extraction step specifically includes dimensionality reduction of the global features by sequentially passing through convolutional layers, activation functions, and upsampling operations to reduce the feature channel dimension and improve the spatial resolution of the feature map.
5. The method according to claim 1, characterized in that, in the query extraction step, the feature map after dimensionality reduction is convolved by the prediction head, and then the probability of each pixel belonging to the left or right hand is obtained by the softmax function.
6. The method according to claim 1, characterized in that the attention masking mechanism in the query-feature interaction step is as follows: when calculating the attention weight for each type of query and feature, a large negative bias is applied to the feature positions that are not desired to be matched by this type of query, so that their attention weights after softmax approach zero.
7. An electronic device, characterized in that the electronic device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the method according to any one of claims 1 to 6.
8. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the method of any one of claims 1 to 6.