A category-level 6D pose estimation method based on geometrically perceived keypoints
By fusing cross-modal features and dynamic keypoint proposals in an end-to-end structured perception network and optimizing the keypoint distribution, this method addresses the limited accuracy and generalization of existing category-level 6D pose estimation methods under complex conditions, and achieves higher pose estimation accuracy and robustness.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- LIAO NING GONG CHENG JI SHU DA XUE E ER DUO SI YAN JIU YUAN
- Filing Date
- 2025-09-30
- Publication Date
- 2026-06-26
Abstract
Description
Technical Field
[0001] This invention relates to the field of pose estimation technology, specifically to a category-level 6D pose estimation method based on geometrically perceived keypoints.
Background Technology
[0002] 6D pose estimation aims to simultaneously predict the position, orientation, and scale of a target object in three-dimensional space. It is a crucial technology for robots to achieve accurate perception and stable control in tasks such as grasping, assembly, navigation, and human-robot interaction. Compared with traditional instance-level methods, category-level 6D pose estimation does not rely on specific CAD models and can adapt to differences in appearance and shape among instances of the same category, making it more flexible and more widely applicable in practice. However, existing methods still fall short in cross-instance generalization and pose prediction accuracy when explicit geometric priors are lacking, when the target deforms non-rigidly, when texture is missing, or when scale is ambiguous.
[0003] For category-level 6D pose estimation, existing research has mainly focused on implicit regression methods and keypoint detection methods based on unified canonical space modeling. Early representative methods, such as the NOCS framework, use RGB-D input to predict a per-pixel mapping to canonical space and combine it with depth information to compute the pose. The SPD network was subsequently proposed, introducing an explicit deformation module to better adapt to intra-category structural differences. DualPoseNet was later designed, introducing dual decoders and a consistency loss to further improve accuracy in the absence of a CAD model. Cascaded fusion and reconstruction mechanisms have also been added to improve the spatial alignment between RGB and point cloud features. 6D-ViT uses a Transformer to fuse image, point cloud, and prior shape information, so that the pose representation draws on multiple modalities. Query6DoF introduced, for the first time, a sparse query mechanism as a shape prior or embedding representation, and further improved rotation accuracy through point voting and confidence-driven strategies. These methods have gradually improved performance in deformation and occlusion scenarios, but they remain limited when facing complex structures, low texture, and large viewpoint changes.
[0004] To enhance the understanding of geometric structure and scale, several recent methods have incorporated keypoint mechanisms and attention modeling. For example, SAR-Net enhances structure recovery through symmetric completion. VI-Net introduces a rotation branch decoupling strategy, improving the accuracy and stability of rotation estimation. HS-Pose applies 3D graph convolution to keypoint structures, fusing local and global structural representations. AG-Pose is the first to combine instance-adaptive keypoint detection with a geometry-aware alignment mechanism, effectively improving pose accuracy. A paradigm unifying template-based and template-free approaches has been proposed, enhancing image reasoning through neural implicit representations. GeoRef further strengthens the joint learning of structure and scale. GCE-Pose combines semantic reconstruction and global context aggregation, achieving explicit structure and scale recovery for the first time. Furthermore, Transformer-based adaptive keypoint methods enhance representation consistency. However, current methods still face challenges: the keypoint distribution relies on fixed strategies and cannot adapt to structure; the fusion of point cloud geometry and image semantics is insufficient; and scale estimation accuracy is limited by weak structure awareness.
Summary of the Invention
[0005] To address the shortcomings of existing technologies, the present invention aims to propose a category-level 6D pose estimation method based on geometrically perceived keypoints, comprising:
[0006] Obtain a target image containing a single target object and a target point cloud of the target object;
[0007] The target image and target point cloud are input into an end-to-end structured perception network to obtain a 6D pose estimate, which includes a rotation matrix R, a translation vector t, and a scale vector s. The end-to-end structured perception network includes a cross-modal feature fusion module, a dynamic keypoint proposal network, a spatial geometric attention module, and a geometric perception reconstruction and scale estimation module.
[0008] Calculate the total loss, and based on the total loss, perform multiple iterations to update the parameters in the end-to-end structured perception network through end-to-end backpropagation to obtain the trained end-to-end structured perception network.
[0009] The image and point cloud of the object to be detected are acquired. The image and point cloud of the object to be detected are then input into the trained end-to-end structured perception network to obtain the 6D pose estimate of the object to be detected.
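For illustration only, the role of the three outputs R, t, and s can be sketched in Python. The convention used here (scale each axis, then rotate, then translate) and the function name `apply_pose` are assumptions made for this sketch; the source does not state the composition order.

```python
def apply_pose(points, R, t, s):
    """Map canonical-frame points to the camera frame as R @ (s * p) + t.

    The composition order is an assumed convention, not taken from the source.
    """
    transformed = []
    for p in points:
        scaled = [s[a] * p[a] for a in range(3)]  # per-axis scale vector s
        rotated = [sum(R[r][c] * scaled[c] for c in range(3)) for r in range(3)]
        transformed.append([rotated[r] + t[r] for r in range(3)])
    return transformed


# 90-degree rotation about the z-axis, a shift along z, and a stretch along x.
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
camera_points = apply_pose([[1.0, 0.0, 0.0]], R, t=[0.0, 0.0, 1.0], s=[2.0, 1.0, 1.0])
```

Under this convention the canonical point (1, 0, 0) is first stretched to (2, 0, 0), rotated onto the y-axis, and then shifted along z.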
[0010] Optionally, inputting the target image and target point cloud into the end-to-end structured perception network to obtain the 6D pose estimate includes:
[0011] The target image and target point cloud are input into the cross-modal feature fusion module to obtain a multimodal joint representation, which contains multiple multimodal features, each corresponding to the three-dimensional coordinates of a point in the target point cloud;
[0012] The multimodal joint representation is input into the dynamic keypoint proposal network to obtain the set of keypoint coordinates and the set of keypoint features;
[0013] The set of keypoint coordinates, the set of keypoint features, the multimodal joint representation, and the target point cloud are input into the spatial geometric attention module to obtain the updated keypoint feature set;
[0014] The target point cloud and the updated keypoint feature set are input into the geometric perception reconstruction and scale estimation module to obtain the 6D pose estimate.
[0015] Optionally, inputting the target image and target point cloud into the cross-modal feature fusion module to obtain the multimodal joint representation includes:
[0016] The target image is input into the image encoder to obtain the image patch feature sequence, i.e., the patch token sequence, which is characterized by the number of patches and the dimension of each token. Upsampling interpolation and a 1×1 convolution are then applied to the patch token sequence to obtain the image feature tensor, where N represents the number of spatial positions of the image features and C1 represents the number of channels of the image features;
[0017] The target point cloud is input into a PointNet++ architecture to obtain point cloud geometric features at multiple scales. These geometric features are then upsampled and interpolated to obtain multiple sampled features. All sampled features are concatenated and passed through two 1×1 convolutional layers to obtain the final point cloud structural features, where C2 represents the number of channels of the final point cloud structural features;
[0018] The final point cloud structural features and the image feature tensor are concatenated to obtain the multimodal joint representation, as expressed by the following formula:
[0019] ;
[0020] where C represents the feature dimension (number of channels) of the multimodal joint representation.
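As a minimal sketch of the concatenation step, assuming both branches have already been brought to the same N positions, the fusion can be written as a per-point join of the two feature rows. The function name `fuse_features` and the ordering (image channels first) are assumptions; the source does not specify the order.

```python
def fuse_features(image_feats, point_feats):
    """Concatenate per-point image features (N x C1) and point cloud features (N x C2).

    Returns an N x (C1 + C2) list of rows. Channel ordering is an assumption.
    """
    if len(image_feats) != len(point_feats):
        raise ValueError("both branches must provide one feature row per point")
    return [list(img) + list(geo) for img, geo in zip(image_feats, point_feats)]


# Two points, C1 = 2 image channels and C2 = 3 geometric channels.
joint = fuse_features([[0.1, 0.2], [0.3, 0.4]], [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
```

Each row of `joint` then has C1 + C2 channels and stays aligned with one point of the target point cloud.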
[0021] Optionally, inputting the multimodal joint representation into the dynamic keypoint proposal network to obtain the set of keypoint coordinates and the set of keypoint features includes:
[0022] Suppose the dynamic keypoint proposal network has K keypoints. A query vector is initialized for each keypoint, and the query vectors of all keypoints form a query matrix, where each query vector has a fixed dimension. The multimodal joint representation is mapped through linear projection to a key-value pair consisting of a key vector and a value vector V. The query matrix, the key vector, and the value vector V are input into the multi-layer self-attention module and the self-attention module, and the query matrix is iteratively updated multiple times to obtain the keypoint descriptors, where each row vector of the keypoint descriptors is a fused semantic and structural representation of one keypoint;
[0023] Based on the keypoint descriptors and the multimodal joint representation, an attention heatmap is constructed, specifically through the following formula:
[0024] ;
[0025] Based on the attention heatmap, the set of keypoint coordinates and the set of keypoint features are obtained from the target point cloud, specifically through the following formulas:
[0026] ;
[0027] ;
[0028] Here, each element of the set of keypoint coordinates represents the three-dimensional coordinates of the i-th keypoint, and the set of keypoint features contains the keypoint feature corresponding to each keypoint.
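The heatmap step can be illustrated with a short Python sketch. It assumes one common form, in which the heatmap is a row-wise softmax over descriptor-feature dot products and the keypoint coordinates and features are heatmap-weighted sums over the points; the exact formulas are not reproduced in this text, so this form and the name `propose_keypoints` are assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def propose_keypoints(descriptors, feats, points):
    """Assumed form: heatmap = softmax(Q F^T); keypoints = heatmap-weighted sums."""
    heatmap = [
        softmax([sum(q * f for q, f in zip(q_row, f_row)) for f_row in feats])
        for q_row in descriptors
    ]
    kpt_xyz = [
        [sum(w * p[a] for w, p in zip(h_row, points)) for a in range(3)]
        for h_row in heatmap
    ]
    kpt_feat = [
        [sum(w * f[c] for w, f in zip(h_row, feats)) for c in range(len(feats[0]))]
        for h_row in heatmap
    ]
    return heatmap, kpt_xyz, kpt_feat


feats = [[10.0, 0.0], [0.0, 10.0]]          # N = 2 points, C = 2 channels
points = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
heatmap, kpt_xyz, kpt_feat = propose_keypoints([[10.0, 0.0]], feats, points)
```

Because the single descriptor aligns with the first point's feature, almost all heatmap mass falls on that point and the proposed keypoint lands very close to it.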
[0029] Optionally, inputting the set of keypoint coordinates, the set of keypoint features, the multimodal joint representation, and the target point cloud into the spatial geometric attention module to obtain the updated keypoint feature set includes:
[0030] For each keypoint in the set of keypoint coordinates, a KNN search is performed in the target point cloud to find the k neighboring points of the keypoint, giving the set of coordinates of these neighboring points; the corresponding features are then taken from the multimodal joint representation to obtain the feature set of the neighboring points;
[0031] The relative positional difference between each keypoint and its neighboring points is calculated. Based on this relative positional difference, a geometric encoding is computed using a small multilayer perceptron (MLP), specifically through the following formulas:
[0032] ;
[0033] ;
[0034] Here, the quantities involved are the three-dimensional coordinates of the j-th neighboring point in the set of neighboring point coordinates, the relative positional difference between the i-th keypoint and the j-th neighboring point, and the geometric encoding of the i-th keypoint with respect to the j-th neighboring point;
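The neighbor search and relative offsets can be sketched as follows. This is a brute-force illustration rather than the patented implementation: the function name `knn_offsets` and the sign convention (neighbor minus keypoint) are assumptions, and the small MLP that turns each offset into a geometric encoding is omitted.

```python
import math


def knn_offsets(keypoint, points, k):
    """Return indices of the k nearest points and their offsets from the keypoint.

    Offsets are computed as neighbor - keypoint (an assumed sign convention).
    """
    order = sorted(range(len(points)), key=lambda n: math.dist(keypoint, points[n]))[:k]
    offsets = [[points[n][a] - keypoint[a] for a in range(3)] for n in order]
    return order, offsets


cloud = [[3.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
neighbor_ids, offsets = knn_offsets([0.0, 0.0, 0.0], cloud, k=2)
```

The returned indices can also be used to gather the neighbors' rows from the multimodal joint representation, which gives the neighborhood feature set described above.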
[0035] The geometric encoding and the corresponding neighboring point feature are concatenated and fused to obtain the neighborhood-enhanced feature, through the following formula:
[0036] ;
[0037] Here, the fusion network consists of multiple linear layers, ReLU activations, and normalization modules; the remaining quantities are the feature of the j-th neighboring point in the feature set of neighboring points and the neighborhood-enhanced feature between the i-th keypoint and the j-th neighboring point;
[0038] The cosine similarity between keypoint features and neighborhood enhancement features is calculated, and the attention weights are obtained through Softmax normalization, specifically using the following formula:
[0039] ;
[0040] Here, the quantities involved are the attention weight between the i-th keypoint and the j-th neighboring point, the keypoint feature of the i-th keypoint, and the temperature coefficient;
[0041] The neighborhood enhancement features and keypoint features are weighted and aggregated to obtain the geometrically enhanced keypoint features, which is achieved through the following formula:
[0042] ;
[0043] Here, the result is the geometrically enhanced keypoint feature of the i-th keypoint. The geometrically enhanced keypoint features of all keypoints together constitute the local enhancement features;
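A minimal sketch of the attention and aggregation steps for one keypoint is given below. The cosine similarity, temperature, and Softmax follow the text; the final combination (keypoint feature plus the attention-weighted sum of neighborhood-enhanced features) is an assumption, as are the name `attend` and the default temperature value.

```python
import math


def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def attend(kpt_feat, nbr_feats, tau=0.1):
    """Softmax over cosine similarities divided by temperature tau, then aggregate."""
    logits = [cosine(kpt_feat, g) / tau for g in nbr_feats]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    aggregated = [
        sum(weights[j] * nbr_feats[j][c] for j in range(len(nbr_feats)))
        for c in range(len(kpt_feat))
    ]
    # Assumed combination: residual addition of the keypoint feature.
    enhanced = [a + b for a, b in zip(kpt_feat, aggregated)]
    return weights, enhanced


weights, enhanced = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], tau=0.1)
```

A smaller temperature sharpens the weights toward the most similar neighbor, while a larger one spreads them more evenly.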
[0044] Global average pooling is performed on all geometrically enhanced keypoint features to obtain the global mean feature, specifically achieved through the following formula:
[0045] ;
[0046] Here, the result is the global mean feature;
[0047] An absolute geometric encoding is applied to the 3D coordinates of each keypoint to obtain the absolute positional encoding of the keypoint, through the following formula:
[0048] ;
[0049] Here, the result is the absolute positional encoding of the i-th keypoint;
[0050] The relative positions between pairs of keypoints are calculated, and the structural relationship encoding of each keypoint is then computed from them, through the following formula:
[0051] ;
[0052] Here, the quantities involved are the structural relationship encoding of the i-th keypoint and the three-dimensional coordinates of the m-th keypoint;
[0053] The absolute positional encoding and the structural relationship encoding of each keypoint are added together to obtain the global structure-aware geometric encoding, through the following formula:
[0054] ;
[0055] Here, the result is the global structure-aware geometric encoding of the i-th keypoint. The global structure-aware geometric encodings of all keypoints constitute the global structural geometric features;
[0056] The local enhancement features, the global mean feature, and the global structural geometric features are concatenated along the channel dimension, fused using an MLP, and then passed through a residual connection and a nonlinear activation to obtain the globally enhanced features, specifically through the following formula:
[0057] ;
[0058] A KNN search is performed on the local enhancement features to obtain a small-range neighborhood feature set, and the mean of all features in this set is taken to obtain the local aggregated features, specifically through the following formula:
[0059] ;
[0060] Here, Mean denotes the mean operation;
[0061] The local aggregated features and the local enhancement features are fused, and their representation is enhanced using a multilayer perceptron (MLP), to obtain the fused features, specifically through the following formula:
[0062] ;
[0063] Max pooling and a residual connection are applied to the fused features to obtain the updated keypoint feature set, specifically through the following formula:
[0064] ;
[0065] Here, Pooling denotes max pooling.
[0066] Optionally, inputting the target point cloud and the updated keypoint feature set into the geometric perception reconstruction and scale estimation module to obtain the 6D pose estimate includes:
[0067] The target point cloud is input into a multi-layer convolutional encoder to extract geometric embedding features, specifically through the following formula:
[0068] ;
[0069] Here, the operator denotes the multi-layer convolutional encoder;
[0070] Using a multi-scale convolutional cross-attention mechanism, fine-grained enhancement is applied to the updated keypoint feature set to obtain the enhanced keypoint feature set;
[0071] The enhanced keypoint feature set and the geometric embedding features are fused and averaged to obtain global semantic features. The global semantic features are then concatenated with the enhanced keypoint features to obtain the reconstruction input features, specifically through the following formula:
[0072] ;
[0073] Here, Mean denotes the mean operation;
[0074] The reconstruction input features are fed into a shape residual decoder to generate dense point cloud residuals at the keypoints, and these residuals are then superimposed on the target point cloud to obtain the reconstructed point cloud, specifically through the following formulas:
[0075] ;
[0076] ;
[0077] Here, the quantities involved are the dense point cloud residuals at the keypoints and an operation that translates and expands point by point at the keypoints;
[0078] The updated keypoint feature set is input into an MLP to obtain the set of normalized predicted coordinates in the NOCS frame, specifically through the following formula:
[0079] ;
[0080] The set of normalized predicted coordinates, the set of keypoint coordinates, and the updated keypoint feature set are concatenated, and a global pose semantic vector is extracted through multi-layer convolution and global pooling, from which the 6D pose estimate is regressed. The 6D pose estimate includes a rotation matrix R, a translation vector t, and a scale vector s, specifically through the following formulas:
[0081] ;
[0082] ;
[0083] Here, Concat denotes concatenation, applied to unify the feature representation, and MLP_R, MLP_t, and MLP_s each denote one of three multilayer perceptrons.
[0084] Optionally, calculating the total loss includes:
[0085] Calculate the pose regression loss, specifically through the following formula:
[0086] ;
[0087] Here, the quantities involved are the ground-truth rotation matrix, the ground-truth translation vector, and the ground-truth scale vector;
[0088] Calculate the NOCS coordinate loss, specifically through the following formula:
[0089] ;
[0090] Here, the quantities involved are the smoothness loss and the set of ground-truth keypoint coordinates;
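The text names only a "smoothness loss" for the NOCS coordinates. One widely used choice for coordinate regression is the Smooth L1 loss, sketched below purely as an assumption; the function name `smooth_l1`, the threshold `beta`, and the mean reduction are not taken from the source.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 loss over all coordinates.

    Quadratic for small errors (|d| < beta), linear for large ones.
    """
    total = 0.0
    count = 0
    for p_row, g_row in zip(pred, target):
        for p, g in zip(p_row, g_row):
            d = abs(p - g)
            total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
            count += 1
    return total / count


# One keypoint: a small error on x, a large error on y, none on z.
loss = smooth_l1([[0.0, 0.0, 0.0]], [[0.5, 2.0, 0.0]])
```

The quadratic region keeps gradients small near the target, and the linear region limits the influence of outlier coordinates.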
[0091] Calculate the structural coverage loss, specifically through the following formula:
[0092] ;
[0093] Here, the quantity involved is the three-dimensional coordinates of the n-th point in the target point cloud;
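A coverage term of this kind is often written as a one-sided nearest-neighbor distance: every point of the cloud should have some keypoint nearby. The sketch below assumes that form (mean over points of the distance to the closest keypoint); the formula itself is not reproduced in this text, and `coverage_loss` is a hypothetical name.

```python
import math


def coverage_loss(points, keypoints):
    """Mean distance from each cloud point to its nearest keypoint (assumed form)."""
    return sum(min(math.dist(p, k) for k in keypoints) for p in points) / len(points)


cloud = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
single = coverage_loss(cloud, [[0.0, 0.0, 0.0]])
spread = coverage_loss(cloud, [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
```

Keypoints that spread over the whole object drive this value toward zero, whereas keypoints clustered in one region leave distant points uncovered.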
[0094] Calculate the keypoint diversity loss, specifically through the following formula:
[0095] ;
[0096] Here, the quantities involved are the three-dimensional coordinates of the m-th keypoint, the three-dimensional coordinates of the i-th keypoint, and the distance threshold;
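Since the definitions mention pairs of keypoints and a distance threshold, one plausible reading is a hinge penalty on keypoint pairs that lie closer than the threshold. The sketch below assumes that form and an average over ordered pairs; `diversity_loss` and `delta` are hypothetical names.

```python
import math


def diversity_loss(keypoints, delta):
    """Average hinge penalty max(0, delta - distance) over ordered keypoint pairs."""
    count = len(keypoints)
    if count < 2:
        return 0.0
    total = 0.0
    for i in range(count):
        for m in range(count):
            if i != m:
                total += max(0.0, delta - math.dist(keypoints[i], keypoints[m]))
    return total / (count * (count - 1))


kpts = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [1.0, 0.0, 0.0]]
penalty = diversity_loss(kpts, delta=0.5)
```

Only the first two keypoints are closer than the threshold here, so they alone contribute to the penalty.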
[0097] Calculate the keypoint surface constraint loss, specifically through the following formula:
[0098] ;
[0099] Here, the foreground point cloud set is the target point cloud after removing outliers, where outliers are points in the target point cloud that belong to the background region; the remaining quantity is the three-dimensional coordinates of the a-th point in the foreground point cloud set;
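A surface constraint of this kind can be read as the mirror image of a coverage term: each keypoint should lie close to some foreground point. The following sketch assumes that form (mean over keypoints of the distance to the nearest foreground point); the name `surface_loss` is hypothetical and the exact formula is not reproduced in this text.

```python
import math


def surface_loss(keypoints, foreground):
    """Mean distance from each keypoint to its nearest foreground point (assumed form)."""
    return sum(min(math.dist(k, p) for p in foreground) for k in keypoints) / len(keypoints)


foreground = [[0.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
off_surface = surface_loss([[0.0, 0.0, 1.0]], foreground)
on_surface = surface_loss([[0.0, 0.0, 3.0]], foreground)
```

Restricting the search to foreground points keeps keypoints from drifting toward background regions of the segmented cloud.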
[0100] The pose regression loss, NOCS coordinate loss, structural coverage loss, keypoint diversity loss, and keypoint surface constraint loss are combined by weighted summation to obtain the total loss, specifically through the following formula:
[0101] ;
[0102] Here, the weighting hyperparameters balance the contribution of each loss term.
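The weighted summation can be sketched as below. The loss names used as dictionary keys and the numeric weights are placeholders chosen for this example; the source does not give the hyperparameter values.

```python
def total_loss(losses, weights):
    """Weighted sum of named loss terms; every loss term must have a weight."""
    return sum(weights[name] * value for name, value in losses.items())


# Placeholder values, not taken from the source.
losses = {"pose": 1.0, "nocs": 2.0, "coverage": 0.5, "diversity": 0.0, "surface": 0.25}
weights = {"pose": 1.0, "nocs": 0.5, "coverage": 2.0, "diversity": 1.0, "surface": 4.0}
combined = total_loss(losses, weights)
```

Looking each weight up by name makes a missing weight raise an error instead of silently dropping a loss term.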
[0103] The beneficial effects of adopting the above technical solution are as follows:
[0104] First, based on the fusion of RGB image and point cloud features, this invention designs a dynamic keypoint proposal module that generates category-consistent and geometrically adaptive keypoints, enhancing adaptability to intra-class deformation and weakly textured targets. Second, a spatial geometric attention module is introduced to model the spatial structural relationships between keypoints and optimize the keypoint distribution. Finally, a geometric perception reconstruction and scale estimation module simultaneously performs point cloud reconstruction and scale regression without prior CAD model knowledge, using a multi-task loss for end-to-end optimization. Experimental results on the REAL275, CAMERA25, and HouseCat6D datasets demonstrate that the proposed method outperforms existing methods on multiple metrics, including IoU and pose accuracy, and exhibits stronger generalization and robustness in complex real-world scenes.
Attached Figure Description
[0105] Figure 1 is a schematic diagram of a category-level 6D pose estimation method based on geometrically perceived keypoints in an embodiment of the present invention;
[0106] Figure 2 is a schematic diagram of the cross-modal feature fusion module in an embodiment of the present invention;
[0107] Figure 3 is a schematic diagram of the structure for generating keypoint descriptors in an embodiment of the present invention;
[0108] Figure 4 is a schematic diagram of the spatial geometric attention module in an embodiment of the present invention, wherein (a) is the local geometry modeling path and (b) is the global structure modeling path;
[0109] Figure 5 shows visualization results in the embodiments of the present invention, wherein (a1) to (a18) are pose prediction results of AG-Pose on the REAL275 dataset, and (b1) to (b18) are pose prediction results of the present invention on the REAL275 dataset.
Detailed Implementation
[0110] The specific embodiments of the present invention will be described in further detail below with reference to the accompanying drawings and examples. The following examples are for illustrative purposes only and are not intended to limit the scope of the invention.
[0111] Current category-level 6D pose estimation methods often model geometric structure and scale insufficiently when explicit geometric priors are lacking, when objects deform strongly, or when scale is ambiguous. This limits cross-instance generalization and estimation accuracy. To enhance the robustness and accuracy of category-level 6D pose estimation, this invention provides a category-level 6D pose estimation method based on geometrically perceived keypoints. The core idea is to construct a structure-aware framework with robust geometric representation capabilities through cross-modal feature fusion, a spatial geometric attention mechanism, and a structure reconstruction module. The main contributions of this invention are as follows:
[0112] 1) This invention proposes a geometric perception reconstruction key point network that integrates multimodal information. This method breaks through the bottlenecks of existing methods in intra-class deformation, structural reasoning and scale adaptation by introducing multi-source feature extraction, dynamic key point proposal and spatial geometric modeling mechanism.
[0113] 2) A dynamic keypoint proposal module is designed that no longer relies on fixed templates or manually defined keypoint distributions. Instead, through end-to-end training, it automatically generates from images and point clouds a set of keypoints that are structurally consistent and robust across a category, thereby improving the consistency and geometric representation of the keypoints. It is particularly suitable for object instances with large deformation and weak texture.
[0114] 3) A spatial geometric attention mechanism is proposed, which guides attention to semantically related and geometrically close regions by calculating the geometric affinity between key points in Euclidean space, thereby improving the model's robustness in recognition under occlusion and structural confusion.
[0115] 4) A geometric perception reconstruction module is introduced, which jointly performs local point cloud reconstruction and scale estimation. It constructs the structural representation of the target under the condition of no shape prior and explicitly models the scale by using volume regularization and reconstruction loss to assist the pose estimation network, thus filling the gap in volume estimation in existing methods.
[0116] 5) Systematic evaluations were conducted on multiple datasets, including REAL275, CAMERA25, and HouseCat6D, to verify the accuracy, stability, and generalization ability of the method of this invention without prior knowledge.
[0117] This invention provides a category-level 6D pose estimation method based on geometrically perceived keypoints. With reference to Figure 1, the method may include the following steps:
[0118] Obtain a target image containing a single target object and a target point cloud of the target object;
[0119] Specifically, Mask R-CNN is used to perform instance segmentation on the input RGB-D image, obtaining the mask and class label of each target instance. The segmentation mask is then used to crop the RGB image of each individual target and the corresponding point cloud. The point cloud is obtained by back-projecting the depth map and then downsampling to a limited number of points. The image and the point cloud serve as input to the network.
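The back-projection and downsampling can be sketched with the standard pinhole camera model. The intrinsics `fx`, `fy`, `cx`, `cy`, the function names, and the use of uniform random sampling are assumptions for this illustration; the source does not state the sampling strategy or the point budget.

```python
import random


def backproject(depth, mask, fx, fy, cx, cy):
    """Lift masked pixels with valid depth to 3D camera coordinates (pinhole model)."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if mask[v][u] and z > 0:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def downsample(points, limit, seed=0):
    """Keep at most `limit` points by uniform random sampling (an assumed strategy)."""
    if len(points) <= limit:
        return list(points)
    return random.Random(seed).sample(points, limit)


depth = [[0.0, 2.0], [2.0, 2.0]]             # zero marks a missing depth reading
mask = [[True, True], [True, True]]          # instance mask from segmentation
cloud = backproject(depth, mask, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
sampled = downsample(cloud, limit=2)
```

Pixels outside the instance mask or without a depth reading are skipped, so the resulting cloud contains only the segmented object's visible surface.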
[0120] The target image and target point cloud are input into an end-to-end structured perception network to obtain a 6D pose estimate, which includes a rotation matrix R, a translation vector t, and a scale vector s. The end-to-end structured perception network includes a cross-modal feature fusion module (CMFF), a dynamic keypoint proposal network (DKPN), a spatial geometric attention module (SGA), and a geometric perception reconstruction and scale estimation module (in Figure 1, this module comprises a geometry-aware reconstruction module (GARM) and a pose and size estimator).
[0121] The end-to-end structure perception network extracts complementary features from RGB images and point clouds to guide key point generation and structure modeling, and finally regresses the rotation, translation and scale parameters of the target.
[0122] The fused features are first fed into the dynamic keypoint proposal module. This module uses multiple attention modules to update the initialized query vectors, generating a set of keypoints that are both category-consistent and instance-adaptive. By matching the similarity between the keypoint queries and the fused features, a keypoint heatmap is generated, and the corresponding geometric point set is selected from the point cloud based on this heatmap, yielding a set of keypoints with geometric perception capability. On this basis, to further enhance the structural stability among keypoints, the network introduces a spatial geometric attention module, which constructs for each keypoint a multi-scale adjacency graph based on Euclidean distance, using multiple neighborhood sizes. By integrating local structure and global context, it enhances geometric consistency and outputs structure-aware features.
[0123] Subsequently, the network models and reconstructs the geometric structure of the target object through a geometry-aware reconstruction module. This module utilizes a multi-scale convolutional cross-attention mechanism to enhance structural residuals at the keypoint level. Finally, a reconstructed point cloud is output through a shape decoder, and the scale vector of the target is regressed based on the reconstructed structure, achieving self-supervised modeling of the object's volume information. In the output stage, the network integrates feature information from the keypoint branch, the structural modeling branch, and the reconstruction branch, and inputs it into the geometry-aware reconstruction and scale estimation module to regress the target's rotation matrix, translation vector, and scale vector, thereby achieving complete category-level 6D pose estimation.
[0124] Specifically, the target image and target point cloud are input into the cross-modal feature fusion module to obtain a multimodal joint representation, which contains multiple multimodal features, each corresponding to the three-dimensional coordinates of a point in the target point cloud;
[0125] In the category-level 6D pose estimation task, RGB images carry semantic information about the object's appearance, while point clouds carry its geometric features. Their information is complementary, but the two modalities differ significantly. Therefore, this invention designs a cross-modal feature fusion module, the overall architecture of which is shown in Figure 2. This module adopts a dual-branch structure, extracting semantic features from the image and geometric features from the point cloud separately, and then fusing them in a shared space.
[0126] With reference to Figure 2, the target image is input into the image encoder to obtain the image patch feature sequence, i.e., the patch token sequence, which is characterized by the number of patches and the dimension of each token;
[0127] The image encoder can be the visual self-supervised pre-trained model DINOv2-ViT-S14, which has good general feature extraction capabilities and can extract semantically clear and locally sensitive patch token representations from the input image.
[0128] Upsampling interpolation and a 1×1 convolution are applied to the patch token sequence to obtain the image feature tensor, where N represents the number of spatial positions of the image features and C1 represents the number of channels of the image features.
[0129] Here, the upsampling interpolation corresponds to R&I in Figure 2, and the 1×1 convolution corresponds to MLP.
[0130] The processing of the target image to obtain the image feature tensor corresponds to the Multi-Scale Vision Encoder part of Figure 1.
[0131] The target point cloud is input into a PointNet++ structure to obtain point cloud geometric features at multiple scales (number of points × number of channels), i.e., the 512×64, 256×128, 128×256, and 64×512 features in Figure 2. Specifically, local geometric features are extracted through multi-scale region sampling and radius neighborhood aggregation. Each module models local geometric context through dual-radius neighborhood construction, region sampling, and a multi-channel MLP, and enlarges the receptive field layer by layer to extract multi-scale structural features ranging from fine-grained detail to global shape.
[0132] To fully integrate geometric information at different scales, point cloud geometric features at multiple scales are upsampled and interpolated to achieve a uniform spatial resolution, resulting in multiple sampled features. All sampled features are then concatenated and passed through a two-layer 1×1 convolutional perceptron module to obtain the final point cloud structural features. Where C2 represents the number of channels for the final point cloud structural features;
[0133] The processing of the target point cloud to obtain the final point cloud structural features corresponds to the PointCloud Hierarchical Encoder part of Figure 1.
[0134] The final point cloud structural features and the image feature tensor are concatenated to obtain the multimodal joint representation, as expressed by the following formula:
[0135] ;
[0136] where C represents the feature dimension (number of channels) of the multimodal joint representation.
[0137] The quality of the keypoint representation directly affects the accuracy of rotation, translation, and scale estimation. Instance-level tasks can rely on CAD templates or manually defined keypoint positions, whereas category-level tasks must handle multiple instances within a category that differ in structure, texture, and scale. Fixed keypoint strategies therefore struggle to remain robust and to generalize. This invention proposes a Dynamic Keypoint Proposal Module (DKPN), which generates a set of keypoints with semantic consistency, structural stability, and instance adaptability from the cross-modal fusion features through guided querying and a fusion-feature awareness mechanism. This enhances the spatial representation capability during pose estimation. Its architecture is shown in Figure 3.
[0138] The multimodal joint representation is input into the dynamic keypoint proposal network to obtain the set of keypoint coordinates and the set of keypoint features;
[0139] Suppose the dynamic keypoint proposal network has K keypoints. A query vector is initialized for each keypoint, and the query vectors of all keypoints form a query matrix, where each query vector has a fixed dimension. The multimodal joint representation is mapped through linear projection to a key-value pair consisting of a key vector and a value vector V. As shown in Figure 3, the query matrix, the key vector, and the value vector V are input into the multi-layer self-attention module and the self-attention module, and the query matrix is iteratively updated multiple times to obtain the keypoint descriptors, where each row vector of the keypoint descriptors is a fused semantic and structural representation of one keypoint;
[0140] In Figure 3, AttnBlock represents the multi-layer self-attention module and the self-attention module, and the expanded AttnBlock shows the processing performed by these modules.
[0141] To associate the generated keypoints with their actual 3D positions, an attention heatmap (HeatMap) is constructed based on the keypoint descriptors and the multimodal joint representation, using the following formula:
[0142] ;
[0143] Based on the attention heatmap, the keypoint coordinate set is obtained from the target point cloud, together with the keypoint feature set, using the following formulas:
[0144] ;
[0145] ;
[0146] Here, each element of the keypoint coordinate set is the three-dimensional coordinate of the i-th keypoint, and the keypoint feature set contains the keypoint feature corresponding to each keypoint.
[0147] The dynamic keypoint proposal network can automatically learn the most discriminative structural keypoints, improving adaptability to occlusion and intra-class deformation.
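The heatmap-based selection described above can be illustrated with a minimal sketch. The formulas themselves are not reproduced in this text, so the sketch assumes a common form: the heatmap is a softmax over dot-product similarities between each keypoint descriptor and each per-point feature, and keypoint coordinates and features are heatmap-weighted sums. The function name `propose_keypoints` and the dot-product similarity are illustrative assumptions, not the formulas of this invention.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def propose_keypoints(descriptors, features, points):
    """Soft-select keypoints from a point cloud (hypothetical helper).

    descriptors: K x C keypoint descriptors
    features:    N x C per-point multimodal features
    points:      N x 3 point coordinates
    Returns (heatmap K x N, keypoint coords K x 3, keypoint features K x C).
    """
    heatmap, coords, kp_feats = [], [], []
    for d in descriptors:
        # Assumed similarity: dot product between descriptor and point feature.
        scores = [sum(a * b for a, b in zip(d, f)) for f in features]
        w = softmax(scores)  # one heatmap row; sums to 1 over the N points
        heatmap.append(w)
        # Keypoint coordinate = heatmap-weighted average of point coordinates.
        coords.append([sum(wi * p[c] for wi, p in zip(w, points)) for c in range(3)])
        # Keypoint feature = heatmap-weighted average of point features.
        kp_feats.append(
            [sum(wi * f[c] for wi, f in zip(w, features)) for c in range(len(d))]
        )
    return heatmap, coords, kp_feats
```

Because each heatmap row is a convex combination over the input points, every proposed keypoint in this sketch lies inside the convex hull of the target point cloud.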
[0148] Although the dynamic keypoint proposal module generates keypoints with semantic and geometric representation capabilities, the feature smoothing and attention normalization introduced during modality fusion leave the spatial relationships between keypoints without explicit modeling, which reduces spatial discriminability in the pose decoding stage. This invention therefore introduces a Spatial Geometric Attention (SGA) module, shown in Figure 4. The module is built on a keypoint-anchored perception mechanism: through local geometric relationship modeling, keypoint structure perception, and a multi-scale fusion strategy, it captures the spatial context of keypoints in the point cloud and strengthens global consistency. The SGA module consists of two sub-paths, the local geometric modeling path shown in Figure 4(a) and the global structure modeling path shown in Figure 4(b), followed by a local fine aggregation mechanism that further improves fine-grained structural expression.
[0149] The keypoint coordinate set, the keypoint feature set, the multimodal joint representation, and the target point cloud are input into the spatial geometric attention module to obtain the updated keypoint feature set;
[0150] Specifically, with reference to Figure 4(a), for each keypoint in the keypoint coordinate set, a KNN search over the target point cloud yields the k neighborhood points of that keypoint and the coordinate set of these neighborhood points, and the corresponding neighborhood point feature set is gathered from the multimodal joint representation;
[0151] The relative distance between each neighboring point and its keypoint is calculated using the following formula:
[0152] ;
[0153] The relative positional difference between each keypoint and its neighboring points is calculated, and a small multilayer perceptron (MLP) encodes this difference nonlinearly into a geometric code, using the following formulas:
[0154] ;
[0155] ;
[0156] Here, the formulas use the three-dimensional coordinates of the j-th neighboring point in the neighborhood point coordinate set, the relative positional difference between the i-th keypoint and the j-th neighboring point, and the geometric code of the i-th keypoint and the j-th neighboring point;
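The neighborhood search and relative geometry described above can be sketched as follows. This is a brute-force illustration under stated assumptions: the offset is taken as neighbor minus keypoint (the sign convention is not recoverable from this text), the distance is Euclidean, and the function name `knn_relative_geometry` is hypothetical. The MLP that turns offsets into geometric codes is omitted.

```python
import math


def knn_relative_geometry(keypoints, points, k):
    """For each keypoint, find its k nearest points and their relative geometry.

    Returns, per keypoint, a list of (point_index, offset, distance) tuples
    sorted by increasing distance, where offset = neighbor - keypoint
    (assumed sign convention).
    """
    result = []
    for kp in keypoints:
        candidates = []
        for idx, p in enumerate(points):
            offset = [p[c] - kp[c] for c in range(3)]
            dist = math.sqrt(sum(o * o for o in offset))
            candidates.append((idx, offset, dist))
        # Brute-force KNN: sort all points by distance and keep the first k.
        candidates.sort(key=lambda item: item[2])
        result.append(candidates[:k])
    return result
```

The returned point indices can be used to gather the matching rows of the multimodal joint representation, which gives the neighborhood point feature set for each keypoint.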
[0157] The geometric code and its corresponding neighborhood point feature set are concatenated and fused to obtain the neighborhood enhanced feature, which is achieved through the following formula:
[0158] ;
[0159] Here, the fusion mapping is composed of multiple linear layers, ReLU activations, and normalization modules; it captures the joint geometric-semantic distribution and improves the discriminative ability of local point features. The formula takes the feature of the j-th neighboring point in the neighborhood point feature set and yields the neighborhood enhanced feature between the i-th keypoint and the j-th neighboring point;
[0160] To adaptively weight neighbor features based on semantic similarity, the cosine similarity between keypoint features and neighborhood enhancement features is calculated, and attention weights are obtained through Softmax normalization. This is achieved using the following formula:
[0161] ;
[0162] Here, the formula involves the attention weight between the i-th keypoint and the j-th neighboring point, the keypoint feature of the i-th keypoint, and a temperature coefficient;
[0163] The neighborhood enhancement features and keypoint features are weighted and aggregated to obtain the geometrically enhanced keypoint features, which is achieved through the following formula:
[0164] ;
[0165] Here, the result is the geometrically enhanced keypoint feature of the i-th keypoint. The geometrically enhanced keypoint features of all keypoints together constitute the local enhancement features;
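The similarity-weighted aggregation above can be sketched for a single keypoint. The cosine similarity, temperature, and Softmax follow the description; the residual form of the final aggregation (keypoint feature plus attention-weighted neighbors) and the default temperature value are assumptions, since the formula is not reproduced in this text. The names `cosine` and `aggregate_neighbors` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)


def aggregate_neighbors(kp_feat, neighbor_feats, tau=0.1):
    """Attention-weighted aggregation of neighborhood enhanced features.

    kp_feat:        feature of one keypoint (length C)
    neighbor_feats: k neighborhood enhanced features (each length C)
    tau:            temperature coefficient (illustrative default)
    Returns (attention weights, geometrically enhanced keypoint feature).
    """
    sims = [cosine(kp_feat, h) / tau for h in neighbor_feats]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    weights = [e / z for e in exps]  # Softmax over the k neighbors
    dim = len(kp_feat)
    pooled = [
        sum(w * h[c] for w, h in zip(weights, neighbor_feats)) for c in range(dim)
    ]
    # Assumed residual form: keypoint feature plus weighted neighbor features.
    enhanced = [kp_feat[c] + pooled[c] for c in range(dim)]
    return weights, enhanced
```

A smaller temperature sharpens the weights toward the most similar neighbor, while a larger one approaches uniform averaging over the neighborhood.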
[0166] Based on this, to enhance global contextual information, global average pooling is performed on all geometrically enhanced keypoint features to obtain the global mean feature, specifically achieved through the following formula:
[0167] ;
[0168] Here, the result is the global mean feature;
[0169] Building on local enhancement, the SGA module introduces a structure-aware mechanism that fuses the geometric position of each keypoint, the structural relationships between keypoints, and the overall contextual description, thereby improving the spatial consistency and discriminability of the features. As shown in Figure 4(b), the three-dimensional coordinates of each keypoint are geometrically encoded to obtain the absolute position encoding of the keypoint, using the following formula:
[0170] ;
[0171] Here, the result is the absolute position encoding of the i-th keypoint;
[0172] The relative positions between pairs of keypoints are calculated and used to compute the structural relationship code of each keypoint, using the following formula:
[0173] ;
[0174] Here, the formula yields the structural relationship encoding of the i-th keypoint and uses the three-dimensional coordinates of the m-th keypoint;
[0175] The absolute position code and the structural relationship code of the key points are added together to obtain the global structure-aware geometric code, which is achieved through the following formula:
[0176] ;
[0177] Here, the result is the global structure-aware geometric encoding of the i-th keypoint. The global structure-aware geometric encodings of all keypoints constitute the global structure-geometric features;
[0178] The local enhancement features, the global mean feature, and the global structure-geometric features are concatenated along the channel dimension, fused by an MLP, and then passed through a residual connection and a nonlinear activation to obtain the globally enhanced features, using the following formula:
[0179] ;
[0180] To further enhance the stability and fine-grained representation of keypoint features at the microscale, the SGA module introduces a local fine-grained aggregation mechanism in its final stage. Based on the updated keypoint feature set, a KNN search is performed to obtain a small-scale neighborhood feature set, and the mean of all features in this small neighborhood feature set is taken to obtain the local aggregated features, using the following formula:
[0181] ;
[0182] Mean represents the mean operation;
[0183] The local aggregated features and the local enhancement features are fused, and a multilayer perceptron (MLP) enhances the fused representation to obtain the fused features, using the following formula:
[0184] ;
[0185] Max pooling and a residual connection are applied to the fused features to obtain the updated keypoint feature set, using the following formula:
[0186] ;
[0187] Pooling represents max pooling.
[0188] To further enhance the ability of keypoint features to model the geometric structure and scale information of the target, this invention designs a geometry-aware reconstruction module (GARM) based on the proposed DPDN framework, which aims to recover the local geometric structure using keypoints as anchors; and uses a pose and size estimator module to regress the target's 6D pose and scale vectors based on the correlation between reconstruction and keypoints.
[0189] The core idea of the GARM module is to utilize a spatial structure perception mechanism to generate a dense point cloud from sparse keypoints to capture the complete geometric shape of an object. Unlike DPDN, which directly uses MLP to reconstruct keypoints, this invention introduces a multi-scale convolutional cross attention (MCA) mechanism to enhance the structural residual representation capability in the keypoint dimension.
[0190] The target point cloud and the updated keypoint feature set are input into the geometric perception reconstruction and scale estimation module to obtain the 6D pose estimate.
[0191] The target point cloud is fed into a multi-layer convolutional encoder to extract geometric embedding features, using the following formula:
[0192] ;
[0193] Here, the encoder in the formula is the multi-layer convolutional encoder;
[0194] To enhance spatial structure perception, this invention applies a multi-scale convolutional cross-attention (MCA) mechanism to the updated keypoint feature set for fine-grained enhancement. Specifically, MCA extracts spatial information using depthwise separable convolution kernels with different receptive fields (1×7, 7×1; 1×11, 11×1, etc.). The convolution results establish context awareness at different resolutions along two orthogonal directions, yielding the enhanced keypoint feature set.
[0195] The enhanced keypoint feature set is fused with the geometric embedding features and averaged to obtain the global semantic features. The global semantic features are then concatenated with the enhanced keypoint features to obtain the reconstruction input features, using the following formula:
[0196] ;
[0197] Mean represents the mean operation;
[0198] The reconstruction input features are fed into the shape residual decoder to generate dense point cloud residuals around the keypoints, and these residuals are superimposed on the target point cloud to obtain the reconstructed point cloud, using the following formulas:
[0199] ;
[0200] ;
[0201] Here, the formulas involve the dense point cloud residuals at the keypoints and an operation that translates and expands point by point around each keypoint;
[0202] The updated keypoint feature set is input into an MLP to obtain the set of normalized predicted coordinates in the NOCS frame, using the following formula:
[0203] ;
[0204] To achieve the final 6D pose estimation and scale regression, a pose and scale estimator is applied after the structure-aware keypoint features are obtained. Its input comprises three types of geometric perception information: the keypoint coordinates, the updated keypoint feature set, and the predicted normalized coordinates in the NOCS frame.
[0205] The normalized predicted coordinate set, the keypoint coordinate set, and the updated keypoint feature set are concatenated, and a global pose semantic vector is extracted through multi-layer convolution and global pooling, from which the 6D pose estimate is obtained. The 6D pose estimate comprises a rotation matrix R, a translation vector t, and a scale vector s, obtained using the following formulas:
[0206] ;
[0207] ;
[0208] Here, Concat denotes concatenation. MLP_R, MLP_t, and MLP_s denote three multilayer perceptrons that operate on the unified feature representation.
[0209] Ultimately, this invention achieves integrated modeling from point cloud reconstruction to pose size estimation based on sparse keypoints.
[0210] The total loss is calculated. Specifically, the pose regression loss is calculated first; it supervises the consistency of the predicted rotation matrix, translation vector, and scale vector with their ground-truth values, using the following formula:
[0211] ;
[0212] Here, the formula uses the ground-truth rotation matrix, the ground-truth translation vector, and the ground-truth scale vector;
[0213] The NOCS coordinate loss is calculated. It ensures that the predicted keypoints are precisely aligned with the ground-truth keypoints in the Normalized Object Coordinate Space (NOCS), using the following formula:
[0214] ;
[0215] Here, the formula uses a smoothness loss function and the ground-truth keypoint coordinate set;
[0216] The structural coverage loss is calculated. It measures whether the predicted keypoints adequately cover the structural region of the input point cloud: it is the mean, over all points in the point cloud, of the distance from each point to its nearest keypoint, using the following formula:
[0217] ;
[0218] Here, the formula uses the three-dimensional coordinates of the n-th point in the target point cloud;
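The structural coverage loss, as described in words above, can be sketched directly: for each point of the target point cloud, take the distance to its nearest keypoint, then average over all points. The Euclidean distance and the function name `coverage_loss` are assumptions; the original formula is not reproduced in this text.

```python
import math


def coverage_loss(points, keypoints):
    """Mean, over all points, of the distance to the nearest keypoint."""
    total = 0.0
    for p in points:
        # Distance from this point to its closest keypoint.
        total += min(math.dist(p, kp) for kp in keypoints)
    return total / len(points)
```

The value is zero only when every point coincides with some keypoint, so minimizing it pulls keypoints toward regions of the point cloud that are not yet covered.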
[0219] The keypoint diversity loss is calculated. To prevent keypoints from collapsing into a local area, a uniformity constraint is introduced, using the following formula:
[0220] ;
[0221] Here, the formula uses the three-dimensional coordinates of the m-th keypoint, the three-dimensional coordinates of the i-th keypoint, and a distance threshold;
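One plausible form of such a uniformity constraint is a hinge penalty on pairwise keypoint distances, sketched below. This form is an assumption: the text only states that the loss involves pairs of keypoints and a distance threshold. The normalization by the number of ordered pairs and the name `diversity_loss` are likewise illustrative.

```python
import math


def diversity_loss(keypoints, threshold):
    """Hinge penalty on keypoint pairs closer than `threshold` (assumed form).

    Requires at least two keypoints.
    """
    k = len(keypoints)
    total = 0.0
    for i in range(k):
        for m in range(k):
            if i != m:
                gap = threshold - math.dist(keypoints[i], keypoints[m])
                total += max(gap, 0.0)  # pairs farther than threshold cost nothing
    return total / (k * (k - 1))
```

Under this form, pairs that are already separated by more than the threshold contribute no gradient, so the loss only acts on clustered keypoints.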
[0222] The keypoint surface constraint loss is calculated. To constrain the predicted keypoints to lie on the object surface while suppressing keypoints that fall into background or outlier regions, an object-aware Chamfer distance loss is employed, using the following formula:
[0223] ;
[0224] Here, the foreground point cloud set is the target point cloud after outlier removal. Outliers are points in the target point cloud that belong to the background region; they are identified using semantic masking and geometric consistency constraints, that is, a point is removed if it belongs to the background region or if its distance from the object surface exceeds a threshold. The formula uses the three-dimensional coordinates of the a-th point in the foreground point cloud set;
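A minimal sketch of a one-directional, object-aware Chamfer term follows. It assumes that the foreground set is given as a boolean mask over the target point cloud and that the loss is the mean distance from each keypoint to its nearest foreground point; both the mask interface and the name `surface_loss` are hypothetical, and the outlier filtering itself is not shown.

```python
import math


def surface_loss(keypoints, points, foreground_mask):
    """Mean distance from each keypoint to its nearest foreground point.

    foreground_mask[n] is True when points[n] is kept after outlier removal
    (assumed interface; at least one point must be kept).
    """
    foreground = [p for p, keep in zip(points, foreground_mask) if keep]
    total = sum(min(math.dist(kp, p) for p in foreground) for kp in keypoints)
    return total / len(keypoints)
```

Restricting the nearest-point search to the foreground set is what makes the term object-aware: a keypoint sitting next to a background point still incurs a cost.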
[0225] The pose regression loss, NOCS coordinate loss, structural coverage loss, keypoint diversity loss, and keypoint surface constraint loss are combined by weighted summation to obtain the total loss, using the following formula:
[0226] ;
[0227] Here, the weights are hyperparameters that balance the loss terms; all loss terms jointly update the parameters of the entire network through end-to-end backpropagation.
[0228] Based on the total loss, the parameters in the end-to-end structured perception network are updated iteratively multiple times through end-to-end backpropagation to obtain the trained end-to-end structured perception network.
[0229] The image and point cloud of the object to be detected are acquired. The image and point cloud of the object to be detected are then input into the trained end-to-end structured perception network to obtain the 6D pose estimate of the object to be detected.
[0230] To fully verify the effectiveness of the network proposed in this invention in the category-level 6D pose estimation task, this invention conducted a systematic empirical analysis on three representative public datasets (CAMERA25, REAL275, and HouseCat6D).
[0231] The CAMERA25 dataset is a synthetic dataset generated by Blender rendering. It contains approximately 300,000 RGB-D images, comprising 1,085 distinct instances of six everyday objects (bowls, bottles, cameras, cans, mugs, and laptops). Of these, 25,000 images were used for evaluation, and the remainder for training. This dataset is commonly used for model pre-training due to its high-quality labels and large-scale instance diversity.
[0232] The REAL275 dataset is a real-world dataset containing 7,000 RGB-D images taken from 13 real-world indoor scenes. The test set consists of 2,750 real-world images from 6 scenes, covering 3 brand-new instances of each object class that have never appeared in the training set. This dataset is used to test the model's ability to generalize across instances in real-world environments.
[0233] The HouseCat6D dataset is a multimodal, category-level 6D object perception dataset containing 10 household categories and a total of 194 objects. The training set consists of 20,000 frames from 34 scenes, and the test set consists of 3,000 frames spanning 5 scenes. This highly challenging test set is primarily used to validate robustness under severe occlusion, close arrangement, and missing textures.
[0234] To comprehensively evaluate the performance of the proposed method in the category-level 6D pose estimation task, based on previous work, this invention uses the following two metrics to evaluate performance:
[0235] (1) 3D IoU mean average precision (mAP)
[0236] To measure the spatial overlap between the predicted 3D bounding boxes and the ground-truth bounding boxes, this invention uses the mean average precision (mAP) based on the 3D intersection-over-union (IoU) as the object detection metric. IoU is defined as the ratio of the intersection to the union of the predicted and ground-truth bounding boxes. If the IoU value exceeds a set threshold, the prediction is considered a correct match. This metric reflects the accuracy of the object's position and indirectly assesses the overall alignment of the pose and scale estimates. The average precision is calculated as follows:
[0237] ;
[0238] Here, the integrand is the precision of the i-th class at recall R, and n is the total number of target classes. The integral gives the average precision (AP) of each class, and mAP is the mean of the per-class AP values.
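The per-class integral and the class average can be illustrated numerically. The sketch below approximates the area under each precision-recall curve with the trapezoidal rule and then averages over classes; the trapezoidal rule, the sampled-curve input format, and the function names are assumptions, and public benchmark toolkits may use a different interpolation.

```python
def average_precision(recalls, precisions):
    """Trapezoidal area under a precision-recall curve.

    recalls must be sorted in increasing order and align with precisions.
    """
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2.0
    return area


def mean_average_precision(curves):
    """Mean of per-class AP; curves is a list of (recalls, precisions) pairs."""
    return sum(average_precision(r, p) for r, p in curves) / len(curves)
```

For example, a class with precision 1 at every recall level has AP 1, and a class whose precision falls linearly from 1 to 0 has AP 0.5, so the two together give an mAP of 0.75.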
[0239] (2) Joint rotation and translation error metrics
[0240] These metrics evaluate the joint prediction accuracy of a model in the rotation and translation subspaces. If the rotation error between the predicted pose and the ground-truth pose is below an angular threshold (in degrees) and the translation error is below a distance threshold (in centimeters), the sample is counted as a correct prediction. This invention evaluates the mean average precision (mAP) under four such criteria, which range from high-precision pose estimation suitable for precision tasks, through stability assessment under moderate error, to overall adaptability under relaxed precision requirements.
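The joint criterion can be sketched as follows. The rotation error is taken as the geodesic angle between the two rotation matrices, and the translation error as the Euclidean distance converted to centimeters under the assumption that translations are stored in meters. The thresholds are left as parameters because the specific values are not reproduced in this text, and the sketch ignores any special handling of symmetric objects. All function names are hypothetical.

```python
import math


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_pred^T R_gt) equals the sum of element-wise products.
    trace = sum(R_pred[r][c] * R_gt[r][c] for r in range(3) for c in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for safety
    return math.degrees(math.acos(cos_theta))


def translation_error_cm(t_pred, t_gt):
    """Euclidean translation error in centimeters (inputs assumed in meters)."""
    return 100.0 * math.dist(t_pred, t_gt)


def pose_is_correct(R_pred, t_pred, R_gt, t_gt, max_deg, max_cm):
    """True when both errors fall below their thresholds."""
    return (
        rotation_error_deg(R_pred, R_gt) < max_deg
        and translation_error_cm(t_pred, t_gt) < max_cm
    )
```

Counting the fraction of correct samples at each threshold pair, per class, gives the accuracy values from which the reported mAP figures are derived.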
[0241] The network of this invention was developed and deployed in the Ubuntu 18.04 operating system environment, based on the PyTorch 1.12.1 deep learning framework and CUDA 11.3 acceleration library, and all experiments were conducted on a single NVIDIA RTX 3090 GPU.
[0242] During training, the network accepts two input modalities: RGB images and their corresponding point cloud data. Images are uniformly cropped and scaled to 224×224, and the number of sampled points per point cloud is set to 1024. To extract image features, this invention uses a pre-trained DINOv2 ViT-S/14 model as a frozen image encoder, and the point cloud feature extraction module uses a PointNet++ architecture. To keep the experiments consistent and comparable, this invention adopts the same segmentation masks as SPD and DPDN, with initial segmentation of each instance performed by Mask R-CNN. The number of keypoints K is set to 96, and the neighborhood ranges of each keypoint in the spatial geometric attention module are chosen to encode both short-range and long-range geometric dependencies. The feature dimensions are set to C1=128, C2=128, and C=256, with a query dimension D=256. The weighting coefficients for the loss function are: , , , and .
[0243] To improve the model's generalization in real-world scenarios, a 3D data augmentation strategy based on a three-part perturbation is introduced during training: random rotation, which rotates each sample about the three axes within a bounded angle range; random translation, sampled from a uniform distribution to simulate object position offsets; and random scaling by a scaling factor to simulate changes in target size. The perturbation is implemented by constructing a rotation matrix, a translation vector, and a scaling factor, which are applied together to the original point cloud as a standard affine transformation, enhancing the model's robustness to geometric deformation of the target.
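The three-part perturbation can be sketched as a single transform p' = s·R·p + t. The angle range, shift range, and scale range are parameters here because the specific values are not reproduced in this text; the Z-Y-X Euler composition order and the function names are also assumptions.

```python
import math
import random


def rotation_matrix(ax, ay, az):
    """R = Rz(az) @ Ry(ay) @ Rx(ax), angles in radians (assumed order)."""
    cx, sx = math.cos(ax), math.sin(ax)
    cy, sy = math.cos(ay), math.sin(ay)
    cz, sz = math.cos(az), math.sin(az)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def augment(points, max_angle_deg, max_shift, scale_range, rng=random):
    """Apply a random rotation, uniform translation, and scaling to points.

    Returns (augmented points, R, t, s) so that labels can be updated too.
    """
    angles = [
        math.radians(rng.uniform(-max_angle_deg, max_angle_deg)) for _ in range(3)
    ]
    R = rotation_matrix(*angles)
    t = [rng.uniform(-max_shift, max_shift) for _ in range(3)]
    s = rng.uniform(*scale_range)
    out = []
    for p in points:
        rotated = [sum(R[r][c] * p[c] for c in range(3)) for r in range(3)]
        out.append([s * rotated[r] + t[r] for r in range(3)])
    return out, R, t, s
```

Returning R, t, and s alongside the points matters in practice, because the ground-truth pose and size labels of the sample have to be transformed consistently with the point cloud.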
[0244] For the training configuration, this invention uses the Adam optimizer. The learning rate is adjusted dynamically between an initial value and a maximum value using a CyclicLR (triangular cyclic scheduling) strategy to improve training stability. Training runs for at most 100 epochs, and the batch sizes for synthetic and real data are 30 and 10, respectively.
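The triangular cyclic schedule can be written as a small closed-form function. This is a framework-independent sketch of the triangular policy, not the training code of this invention; the base rate, maximum rate, and half-cycle length are parameters because their values are not reproduced in this text.

```python
def triangular_lr(step, base_lr, max_lr, half_cycle_steps):
    """Learning rate at `step` under a triangular cyclic schedule.

    The rate rises linearly from base_lr to max_lr over half_cycle_steps
    steps, falls back linearly over the next half_cycle_steps, then repeats.
    """
    cycle_pos = step % (2 * half_cycle_steps)
    if cycle_pos <= half_cycle_steps:
        frac = cycle_pos / half_cycle_steps        # rising half of the cycle
    else:
        frac = 2.0 - cycle_pos / half_cycle_steps  # falling half of the cycle
    return base_lr + (max_lr - base_lr) * frac
```

In PyTorch, `torch.optim.lr_scheduler.CyclicLR` with `mode="triangular"` provides this policy, with `base_lr`, `max_lr`, and `step_size_up` playing the roles of the three parameters above.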
[0245] To comprehensively evaluate the robustness and accuracy of the proposed method in category-level 6D pose estimation, comparative experiments were conducted against several existing methods on three mainstream public datasets: REAL275, CAMERA25, and HouseCat6D. The experimental results are shown in Tables 1-3, with evaluation metrics covering the 3D IoU metrics and the 6D pose evaluation criteria. The results of the proposed method are denoted as "Ours".
[0246] Table 1 compares the method of this invention with current mainstream 6D pose estimation methods on the real-world dataset REAL275. The method of this invention performs best on most metrics, demonstrating robustness under complex occlusion, viewpoint changes, and instance diversity. Compared to the representative prior-based method PENet, this invention improves two key metrics by 5.9% and 5.5%, respectively, demonstrating that the proposed cross-modal fusion mechanism and geometry perception module substantially improve pose estimation accuracy under prior-free conditions. This improvement is attributed to the dynamic keypoint generation mechanism, which alleviates structural occlusion and incomplete geometric information. Notably, the method of this invention eliminates the dependence on offline CAD models and therefore generalizes better in practical applications. Compared to AG-Pose, the baseline closest to this invention, this invention improves two metrics by 5% and 2.9%, respectively. Although AG-Pose also employs adaptive keypoint modeling, it lacks effective feature extraction and hierarchical geometry perception mechanisms. By constructing a cross-modal feature fusion module, this invention enhances the semantic integrity and spatial expressiveness of the keypoint representations. Furthermore, the proposed geometry perception reconstruction module captures the relationship between global and local geometric structures more robustly, making it particularly suitable for pose regression in complex scenarios such as occlusion and deformation. Overall, without requiring shape priors and relying on a carefully designed modular structure, the method of this invention surpasses almost all existing state-of-the-art methods in both pose accuracy and scale estimation. Its strong performance under a stringent criterion verifies the practicality and robustness of the proposed method in real-world scenarios.
[0247] Table 1 Performance comparison of various methods on the REAL275 dataset
[0248]
[0249] In the table, bold indicates the best result, underline indicates the second-best result, and a dash indicates that the corresponding experiment was not reported in the relevant literature.
[0250] Table 2 compares the method of this invention with several representative methods on the synthetic dataset CAMERA25. On the two 3D IoU metrics, the proposed method achieves 94.5% and 92.7%, respectively, ranking among the top performers of all compared methods. Compared to mainstream prior-based methods such as Query6DoF and CatFormer, the proposed method improves the IoU75 metric by 4.6% and 2.8%, respectively, validating the effectiveness of the proposed geometry perception module in object scale estimation. On the four pose metrics, this invention achieves 79.6%, 83.5%, 87.4%, and 92.7%, respectively, with overall accuracy superior to most existing methods. In particular, under one of these criteria it improves on SpherePose by 2.6%. The proposed method slightly outperforms both prior-based and prior-free methods, demonstrating the advantage of its dynamic keypoint and reconstruction mechanisms in modeling spatial structural details. The proposed method achieves slight improvements over AG-Pose on all metrics, indicating that the spatial geometric attention mechanism integrates RGB-D features better while maintaining semantic consistency. Although SpherePose has a slight advantage on some metrics, its overall performance fluctuates significantly, with one metric dropping to 84.3%. In contrast, the method of this invention performs more evenly across all metrics, with strong results on both the pose estimation metrics and the structural perception metrics, reflecting its combined advantages in keypoint extraction, geometric perception, and multimodal fusion.
[0251] Table 2 Performance comparison of various methods on the CAMERA25 dataset
[0252]
[0253] In the table, bold indicates the best result, underline indicates the second-best result, and a dash indicates that the corresponding experiment was not reported in the relevant literature.
[0254] Table 3 compares the quantitative performance of the proposed method with several prior-free methods on the recently proposed large-scale real-world category-level dataset HouseCat6D. On one metric, this invention reaches 88.0%, slightly lower than SpherePose's 88.8%, but on a more challenging metric it achieves 79.3%, surpassing SpherePose by 7.1% and giving the highest value among currently known methods. Compared to AG-Pose and SecondPose, the improvements are 13.3% and 13.2%, respectively, demonstrating the scale-aware advantage of the proposed geometry aggregation strategy in complex real-world scenes. On a further metric, the method of this invention achieves 49.2%, outperforming SpherePose (40.9%) and AG-Pose (37.4%) and making it the best-performing model under this metric. On three other metrics, it is 0.3%, 5%, and 0.8% lower than SpherePose, respectively.
[0255] Table 3 Performance comparison of various methods on the HouseCat6D dataset
[0256]
[0257] In the table, bold indicates the best result, underline indicates the second-best result, and a dash indicates that the corresponding experiment was not reported in the relevant literature.
[0258] To further verify the accuracy and robustness of the proposed method for 6D pose estimation in real-world scenarios, this invention visualizes the pose predictions of AG-Pose and of the proposed method on the REAL275 dataset. The visualization covers the six scenes in the dataset, with three representative images randomly selected from each scene, for a total of 18 sample pairs, as shown in Figure 5. In the figure, (a1) to (a18) are the pose predictions of AG-Pose on the REAL275 dataset, and (b1) to (b18) are the pose predictions of the present invention on the same dataset. Red denotes the model predictions, and green denotes the corresponding ground truth. Compared with the existing prior-free method AG-Pose, the method of the present invention shows more accurate bounding box alignment in multiple challenging scenarios. Specifically, when objects are occluded or stacked, AG-Pose often exhibits pose shifts or inaccurate scale estimates, whereas the method of the present invention still matches the pose and size of the real objects. In low-texture regions or under complex lighting (e.g., column 6), the method of the present invention maintains stable predictions, indicating strong generalization ability. The visualization results support the robustness of the pose estimation and the accuracy of the geometric alignment in complex real-world environments, further confirming the method's performance in the prior-free paradigm.
[0259] To verify the effectiveness of the method proposed in this invention, a series of ablation experiments were designed on the REAL275 dataset. Using AG-Pose as the baseline model, the three core components proposed in this invention were introduced step by step: Cross-Modal Feature Fusion (CMFF), Spatial Geometric Attention (SGA), and the Geometry-Aware Reconstruction Module (GARM). The ablation results are shown in Table 4.
[0260] Table 4 Ablation Experiment
[0261]
[0262] In the table, bold indicates the best result, underline indicates the second-best result, and a dash indicates that the corresponding experiment was not reported in the relevant literature.
[0263] This invention addresses the challenges of rigid keypoint distribution, insufficient geometric modeling capabilities, and inaccurate scale prediction in category-level 6D pose estimation. It proposes a geometrically perceptive keypoint estimation framework that integrates multimodal information, achieving high-precision and robust joint pose and scale regression without requiring prior CAD models. Specifically, this invention designs an end-to-end trainable structure-aware network, including a cross-modal feature fusion module to align semantic and geometric information between images and point clouds; a dynamic keypoint proposal module to generate category-robust and structurally consistent keypoints; a spatial geometric attention mechanism to enhance structural consistency and perceptual robustness among keypoints; and a geometrically perceptive reconstruction module to jointly reconstruct local point clouds and estimate scale, assisting in the construction of explicit structural representations. Extensive experiments demonstrate the excellent generalization ability and practical potential of this method, especially for objects with large intra-class deformations, weak textures, or significant scale distribution differences. Future work will focus on improving robustness in occluded scenarios, efficient inference capabilities in multi-object scenarios, and lightweight deployment and accelerated inference strategies for industrial applications, further promoting the widespread application of 6D pose estimation in practical scenarios such as robot manipulation and augmented reality.
[0264] The above description is merely a preferred embodiment of this disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in the embodiments of this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described inventive concept. For example, technical solutions formed by substituting the above-described features with (but not limited to) technical features with similar functions disclosed in the embodiments of this disclosure.
Claims
1. A category-level 6D pose estimation method based on geometrically perceived keypoints, characterized in that the method comprises:
obtaining a target image containing a single target object and a target point cloud of the target object;
inputting the target image and the target point cloud into an end-to-end structured perception network to obtain a 6D pose estimate, the 6D pose estimate including a rotation matrix R, a translation vector t, and a scale vector s, and the end-to-end structured perception network including a cross-modal feature fusion module, a dynamic keypoint proposal network, a spatial geometric attention module, and a geometric perception reconstruction and scale estimation module;
wherein inputting the target image and the target point cloud into the end-to-end structured perception network to obtain the 6D pose estimate comprises:
inputting the target image and the target point cloud into the cross-modal feature fusion module to obtain a multimodal joint representation, wherein the multimodal joint representation contains multiple multimodal features, each corresponding to the three-dimensional coordinates of one point in the target point cloud;
inputting the multimodal joint representation into the dynamic keypoint proposal network to obtain a set of keypoint coordinates and a keypoint feature set;
inputting the set of keypoint coordinates, the keypoint feature set, the multimodal joint representation, and the target point cloud into the spatial geometric attention module to obtain an updated keypoint feature set;
inputting the target point cloud and the updated keypoint feature set into the geometric perception reconstruction and scale estimation module to obtain the 6D pose estimate, which comprises:
inputting the target point cloud into a multi-layer convolutional encoder to extract geometric embedding features, specifically by the formula [formula], where the encoder term denotes the multi-layer convolutional encoder;
applying a multi-scale convolutional cross-attention mechanism to the updated keypoint feature set for fine-grained enhancement, to obtain an enhanced keypoint feature set;
fusing the enhanced keypoint feature set with the geometric embedding features and averaging the result to obtain global semantic features, then concatenating the global semantic features with the enhanced keypoint features to obtain reconstruction input features, specifically by the formula [formula], where Mean denotes the mean operation;
inputting the reconstruction input features into a shape residual decoder to generate dense point cloud residuals at the keypoints, and superimposing these residuals on the target point cloud to obtain a reconstructed point cloud, specifically by the formulas [formula]; [formula], where the formulas define the dense point cloud residual at the keypoints and an operation that translates and expands point by point at the keypoints;
inputting the updated keypoint feature set into an MLP to obtain a set of normalized predicted coordinates in the NOCS coordinate system, specifically by the formula [formula];
concatenating the set of normalized predicted coordinates, the set of keypoint coordinates, and the updated keypoint feature set, and extracting a global pose semantic vector through multi-layer convolution and global pooling, which yields the 6D pose estimate including the rotation matrix R, the translation vector t, and the scale vector s, specifically by the formulas [formula]; [formula], where Concat denotes concatenation and, to unify the feature representation, MLP_R, MLP_t, and MLP_s each denote a three-layer perceptron;
calculating a total loss and, based on the total loss, updating the parameters of the end-to-end structured perception network over multiple iterations of end-to-end backpropagation to obtain a trained end-to-end structured perception network;
acquiring an image and a point cloud of an object to be detected, and inputting them into the trained end-to-end structured perception network to obtain the 6D pose estimate of the object to be detected.
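As an illustration of how the three outputs named in claim 1 can be used together, the following minimal sketch applies a pose (R, t, s) to points expressed in normalized canonical coordinates. The composition p_cam = R (s ⊙ p) + t is a common convention in category-level pose estimation and is an assumption here; the claim text does not state it. The function names apply_pose and rot_z are hypothetical.

```python
import math


def apply_pose(points, R, t, s):
    """Map canonical points to camera space as R @ (s * p) + t (assumed convention)."""
    out = []
    for p in points:
        scaled = [s[k] * p[k] for k in range(3)]  # per-axis scale vector s
        rotated = [sum(R[r][c] * scaled[c] for c in range(3)) for r in range(3)]
        out.append([rotated[r] + t[r] for r in range(3)])  # translation t
    return out


def rot_z(theta):
    """Rotation matrix about the z axis, used only to build a test pose."""
    cs, sn = math.cos(theta), math.sin(theta)
    return [[cs, -sn, 0.0], [sn, cs, 0.0], [0.0, 0.0, 1.0]]


# Usage: scale a unit x-axis point by 2, rotate it 90 degrees about z, lift it by 1.
posed = apply_pose([[1.0, 0.0, 0.0]], rot_z(math.pi / 2), [0.0, 0.0, 1.0], [2.0, 2.0, 2.0])
```

The scale is applied before the rotation so that s stays expressed along the canonical object axes; other conventions exist, and the order would need to match the one used in training.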
2. The category-level 6D pose estimation method based on geometrically perceived keypoints according to claim 1, characterized in that inputting the target image and the target point cloud into the cross-modal feature fusion module to obtain the multimodal joint representation comprises:
inputting the target image into an image encoder to obtain an image patch feature sequence, i.e., a patch token sequence, characterized by the number of patches and the dimension of each patch token, and performing upsampling interpolation and a 1×1 convolution on the patch token sequence to obtain an image feature tensor, where N denotes the number of spatial positions of the image features and C1 denotes the number of channels of the image features;
inputting the target point cloud into a PointNet++ architecture to obtain point cloud geometric features at multiple scales, upsampling and interpolating these geometric features to obtain multiple sampled features, and concatenating all sampled features and passing them through two 1×1 convolutional layers to obtain the final point cloud structural features, where C2 denotes the number of channels of the final point cloud structural features;
concatenating the final point cloud structural features with the image feature tensor to obtain the multimodal joint representation, specifically by the formula [formula], where C denotes the channel dimension of the multimodal joint representation.
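The concatenation step of claim 2 can be pictured with the short sketch below. It assumes that both feature sets have already been brought to the same N positions and that the joint channel count is C = C1 + C2, which is what plain concatenation implies but is not stated explicitly in the recovered text. The function name fuse_features is hypothetical.

```python
def fuse_features(image_feats, point_feats):
    """Concatenate N x C1 image features with N x C2 point features into N x (C1 + C2)."""
    if len(image_feats) != len(point_feats):
        raise ValueError("both feature sets must cover the same N positions")
    # Row n of the result is the image feature of position n followed by its geometric feature.
    return [list(f_img) + list(f_geo) for f_img, f_geo in zip(image_feats, point_feats)]


# Usage: N = 2 positions, C1 = 2 image channels, C2 = 1 geometric channel.
joint = fuse_features([[1.0, 2.0], [3.0, 4.0]], [[5.0], [6.0]])
```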
3. The category-level 6D pose estimation method based on geometrically perceived keypoints according to claim 1, characterized in that inputting the multimodal joint representation into the dynamic keypoint proposal network to obtain the set of keypoint coordinates and the keypoint feature set comprises:
given K keypoints in the dynamic keypoint proposal network, initializing a query vector for each keypoint and forming a query matrix from the query vectors of all keypoints, each query vector having a fixed dimension;
mapping the multimodal joint representation to key-value pairs through linear projection, the pairs consisting of a key vector and a value vector V;
inputting the query matrix, the key vector, and the value vector V into the multi-layer self-attention modules and iteratively updating the query matrix multiple times to obtain keypoint descriptors, wherein each row vector of the keypoint descriptors is a fused semantic and structural representation of one keypoint;
constructing an attention heatmap from the keypoint descriptors and the multimodal joint representation, specifically by the formula [formula];
obtaining, based on the attention heatmap, the set of keypoint coordinates from the target point cloud and the keypoint feature set, specifically by the formulas [formula]; [formula], where the set of keypoint coordinates contains the three-dimensional coordinates of the i-th keypoint for each i, and the keypoint feature set contains the keypoint feature corresponding to each keypoint.
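The heatmap step of claim 3 might look like the following sketch. Because the claim's formulas are not reproduced in this text, the specific form used here is an assumption: the heatmap is a row-wise softmax over scaled dot products between keypoint descriptors and per-point features, and each keypoint's coordinates and feature are heatmap-weighted averages over the target point cloud. The function names softmax and propose_keypoints are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def propose_keypoints(descriptors, feats, points):
    """descriptors: K x C, feats: N x C, points: N x 3.

    Returns (K x 3 coordinates, K x C features, K x N heatmap).
    """
    dim = len(feats[0])
    scale = math.sqrt(dim)  # scaled dot-product attention (assumed form)
    heatmap = [
        softmax([sum(q * f for q, f in zip(desc, feat)) / scale for feat in feats])
        for desc in descriptors
    ]
    # Each keypoint is a convex combination of cloud points, weighted by its heatmap row.
    coords = [[sum(w * p[k] for w, p in zip(row, points)) for k in range(3)] for row in heatmap]
    kp_feats = [[sum(w * f[c] for w, f in zip(row, feats)) for c in range(dim)] for row in heatmap]
    return coords, kp_feats, heatmap


# Usage: one descriptor aligned with the first point's feature selects that point.
coords, kp_feats, heatmap = propose_keypoints(
    [[10.0, 0.0]],
    [[10.0, 0.0], [0.0, 10.0]],
    [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
)
```

Under this assumed form every keypoint lies inside the convex hull of the observed points, which is one way such proposals can stay tied to the object surface.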
4. The category-level 6D pose estimation method based on geometrically perceived keypoints according to claim 1, characterized in that inputting the set of keypoint coordinates, the keypoint feature set, the multimodal joint representation, and the target point cloud into the spatial geometric attention module to obtain the updated keypoint feature set comprises:
for each keypoint in the set of keypoint coordinates, searching the target point cloud with KNN for the k neighboring points of the keypoint to obtain a set of neighboring point coordinates, and obtaining the corresponding neighborhood point feature set from the multimodal joint representation;
calculating the relative positional difference between the keypoint and each neighboring point, and calculating a geometric code from this relative positional difference with a small multilayer perceptron (MLP), specifically by the formulas [formula]; [formula], where the formulas involve the three-dimensional coordinates of the j-th neighboring point in the set of neighboring point coordinates, the relative positional difference between the i-th keypoint and the j-th neighboring point, and the geometric code of the i-th keypoint and the j-th neighboring point;
concatenating and fusing each geometric code with its corresponding neighborhood point feature to obtain a neighborhood enhancement feature, specifically by the formula [formula], where the fusion network consists of multiple linear layers, ReLU activation, and normalization modules, the input feature is that of the j-th neighboring point in the neighborhood point feature set, and the output is the neighborhood enhancement feature between the i-th keypoint and the j-th neighboring point;
calculating the cosine similarity between the keypoint feature and the neighborhood enhancement features, and obtaining attention weights through Softmax normalization, specifically by the formula [formula], where the formula involves the attention weight between the i-th keypoint and the j-th neighboring point, the keypoint feature of the i-th keypoint, and a temperature coefficient;
weighting and aggregating the neighborhood enhancement features and the keypoint features to obtain the geometrically enhanced keypoint feature of the i-th keypoint, specifically by the formula [formula], the geometrically enhanced keypoint features of all keypoints together constituting the local enhancement features;
performing global average pooling over all geometrically enhanced keypoint features to obtain a global mean feature, specifically by the formula [formula];
performing absolute geometric encoding of the three-dimensional coordinates of each keypoint to obtain the absolute position code of the i-th keypoint, specifically by the formula [formula];
calculating the relative position between pairs of keypoints and, from it, the structural relationship code of the i-th keypoint, specifically by the formula [formula], where the formula involves the three-dimensional coordinates of the m-th keypoint;
adding the absolute position code and the structural relationship code of each keypoint to obtain the global structure-aware geometric code of the i-th keypoint, specifically by the formula [formula], the global structure-aware geometric codes of all keypoints constituting the global structural geometric features;
concatenating the local enhancement features, the global mean feature, and the global structural geometric features along the channel dimension, fusing them with an MLP, and applying a residual connection and nonlinear activation to obtain globally enhanced features, specifically by the formula [formula];
performing a KNN search on the local enhancement features to obtain a small-range neighborhood feature set, and taking the mean of all features in the small-range neighborhood feature set to obtain local aggregated features, specifically by the formula [formula], where Mean denotes the mean operation;
fusing the local aggregated features with the local enhancement features and enhancing the representation with a multilayer perceptron (MLP) to obtain fused features, specifically by the formula [formula];
applying max pooling and a residual connection to the fused features to obtain the updated keypoint feature set, specifically by the formula [formula], where Pooling denotes max pooling.
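The neighborhood attention at the core of claim 4 can be sketched as follows: a brute-force KNN search, then a temperature-scaled Softmax over cosine similarities, then a weighted sum. This is a simplified sketch under stated assumptions: the geometric-code MLP and the fusion network are omitted, so the raw neighbor features stand in for the neighborhood enhancement features, and the default temperature of 0.1 is an arbitrary illustrative value. How the keypoint feature itself enters the aggregation is not recoverable from the text and is left out. All function names are hypothetical.

```python
import math


def knn_indices(query, points, k):
    """Indices of the k points nearest to query, by brute-force Euclidean distance."""
    order = sorted(range(len(points)), key=lambda n: math.dist(query, points[n]))
    return order[:k]


def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv + 1e-12)


def attend_neighbors(kp_feat, neighbor_feats, tau=0.1):
    """Softmax over cosine similarities (temperature tau), then a weighted feature sum."""
    sims = [cosine(kp_feat, f) / tau for f in neighbor_feats]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(kp_feat)
    aggregated = [sum(w * f[c] for w, f in zip(weights, neighbor_feats)) for c in range(dim)]
    return weights, aggregated


# Usage: find the 2 nearest cloud points, then attend over two candidate features.
cloud = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
nearest = knn_indices([0.9, 0.0, 0.0], cloud, k=2)
weights, aggregated = attend_neighbors([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], tau=0.1)
```

A smaller temperature sharpens the weights toward the most similar neighbor, while a larger one approaches a plain neighborhood average.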
5. The category-level 6D pose estimation method based on geometrically perceived keypoints according to claim 1, characterized in that calculating the total loss comprises:
calculating a pose regression loss, specifically by the formula [formula], where the formula involves the ground-truth rotation matrix, the ground-truth translation vector, and the ground-truth scale vector;
calculating a NOCS coordinate loss, specifically by the formula [formula], where the formula involves a smooth loss function and the set of ground-truth keypoint coordinates;
calculating a structural coverage loss, specifically by the formula [formula], where the formula involves the three-dimensional coordinates of the n-th point of the target point cloud;
calculating a keypoint diversity loss, specifically by the formula [formula], where the formula involves the three-dimensional coordinates of the m-th keypoint, the three-dimensional coordinates of the i-th keypoint, and a distance threshold;
calculating a keypoint surface constraint loss, specifically by the formula [formula], where the formula involves the foreground point cloud set obtained by removing outliers from the target point cloud, the outliers being points of the target point cloud that belong to the background region, and the three-dimensional coordinates of the a-th point in the foreground point cloud set;
performing a weighted summation of the pose regression loss, the NOCS coordinate loss, the structural coverage loss, the keypoint diversity loss, and the keypoint surface constraint loss to obtain the total loss, specifically by the formula [formula], where the weighting hyperparameters balance the contributions of the individual loss terms.
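The geometric loss terms of claim 5 admit simple distance-based forms, sketched below. Since the claim's formulas are not reproduced in this text, every form here is an assumption: the coverage loss is the mean distance from each cloud point to its nearest keypoint, the diversity loss is a hinge on keypoint pairs closer than the distance threshold, the surface constraint loss is the mean distance from each keypoint to its nearest foreground point, and the smooth loss is taken to be Smooth L1 with a hypothetical beta parameter. All function names and the dictionary keys in total_loss are hypothetical.

```python
import math


def smooth_l1(x, beta=1.0):
    """Smooth L1 of a scalar residual: quadratic below beta, linear above (assumed form)."""
    a = abs(x)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta


def coverage_loss(points, keypoints):
    """Mean distance from each cloud point to its nearest keypoint."""
    return sum(min(math.dist(p, q) for q in keypoints) for p in points) / len(points)


def diversity_loss(keypoints, delta):
    """Mean hinge penalty over keypoint pairs that are closer than delta."""
    total, pairs = 0.0, 0
    for i in range(len(keypoints)):
        for m in range(i + 1, len(keypoints)):
            total += max(0.0, delta - math.dist(keypoints[i], keypoints[m]))
            pairs += 1
    return total / max(pairs, 1)


def surface_loss(keypoints, foreground):
    """Mean distance from each keypoint to its nearest foreground point."""
    return sum(min(math.dist(q, a) for a in foreground) for q in keypoints) / len(keypoints)


def total_loss(terms, weights):
    """Weighted sum of named loss terms; weights are the balancing hyperparameters."""
    return sum(weights[name] * value for name, value in terms.items())


# Usage with two keypoints on a three-point cloud.
kps = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
cloud = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.0, 0.0]]
terms = {"coverage": coverage_loss(cloud, kps), "diversity": diversity_loss(kps, delta=2.0)}
loss = total_loss(terms, {"coverage": 2.0, "diversity": 0.5})
```

The coverage and diversity terms pull in opposite directions only when keypoints cluster: coverage rewards spreading over the object, and the hinge is zero once all pairs are at least delta apart.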