A 6D pose estimation method based on cross-modal reconstruction self-supervised training

By employing a cross-modal reconstruction self-supervised training method that combines image and point cloud features, the problems of resolution inconsistency and insufficient network interaction in the fusion of depth and image information are solved, achieving high-precision and robust 6D pose estimation.

CN120510218B · Active · Publication Date: 2026-06-19 · NANTONG MARINE ADVANCED RESEARCH INSTITUTE SOUTHEAST UNIVERSITY +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: NANTONG MARINE ADVANCED RESEARCH INSTITUTE SOUTHEAST UNIVERSITY
Filing Date: 2025-05-23
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

In existing 6D pose estimation methods, the fusion of depth information and image information suffers from spatial resolution loss due to resolution inconsistencies. The network interaction behavior is limited, lacking flexibility and adaptability, and it is difficult to maintain high-precision estimation under complex conditions such as occlusion and lighting changes.

Method used

A cross-modal reconstruction self-supervised training method is adopted. By actively occluding image information and reconstructing occlusion information through a self-supervised algorithm, combined with image and point cloud features, a channel selection fusion strategy is used to remove redundant information, thereby improving the accuracy and robustness of pose estimation.

Benefits of technology

By effectively fusing modal information from different sensors, the accuracy and robustness of pose estimation are improved, noise interference is reduced, the model's adaptability to complex environments is enhanced, computational load is reduced, and training and inference speeds are increased.



Figure CN120510218B_ABST

Patent Text Reader

Abstract

This invention discloses a 6D pose estimation method based on self-supervised training using cross-modal reconstruction. The method involves acquiring a scene map and a depth map, cropping them to obtain a target map and a depth map containing the target object. After processing the target map into a mask image, image features and point cloud features are extracted. A cross-modal cross-attention mechanism is used to train the image and point cloud fusion features. The image is then reconstructed, and a pose estimation feature pair is output for self-supervision. After training, the model weights are combined with the pose estimation feature pair and output as a 6D estimated pose through a pose decoder. This method effectively fuses information from different sensor modalities and removes redundant information using a selective channel fusion strategy, resulting in a more accurate 6D pose. This provides precise positional information for subsequent control and recognition. It can understand the characteristics of objects from different angles and dimensions, and combines information from multiple sensors in situations such as occlusion and insufficient lighting, exhibiting better robustness compared to single-modal input.


Description

Technical Field

[0001] This invention relates to a 6D pose estimation method based on cross-modal reconstruction self-supervised training, belonging to the fields of computer vision and multimodal information fusion.

Background Technology

[0002] Multimodal instance-level 6D pose estimation is a key technology that achieves high-precision estimation of the position and orientation of a known object instance in 3D space by fusing data from multiple sensors (such as RGB images, depth information, point clouds, etc.). It determines the object's position and orientation in 3D space by analyzing depth maps and RGB images captured by a depth camera, totaling six degrees of freedom. Therefore, mastering these six degrees of freedom is crucial for computers to accurately understand and manipulate target objects and to deeply analyze complex scenes. Although multimodal object pose estimation is very important and has received some research, there is still room for improvement in the fusion of RGB and depth information in pose estimation algorithms.

[0003] Currently, there are two main methods for fusing depth and image information. One method involves processing relevant information within the transformed depth map and then stitching it together with the RGB image as an additional supplementary channel. However, the resolutions of the depth map and the RGB image may differ, leading to a loss of spatial resolution during stitching. This is detrimental to detail restoration and edge preservation, resulting in reduced algorithm accuracy. The other method converts the depth map into a point cloud image, extracts spatial features using a point cloud processing network, and then stitches it together with image features. Point cloud data exhibits good robustness to illumination changes and occlusion, and point cloud processing networks can be flexibly combined with other types of network structures; therefore, this fusion method is widely used.

[0004] However, most applications of fusing depth and image information extract features from the two modalities with their corresponding networks, perform global or hierarchical concatenation, and finally supervise the network's learning through a task loss. This training method results in limited interaction between the two networks, which lack sufficient flexibility and adaptability during feature extraction. The network may not adequately learn how to adjust its processing of the original modal information based on the heterogeneous input. Under occlusion, lighting variations, and lack of texture, RGB information cannot fully describe object features. If the network cannot adequately adjust the extraction of image features based on point cloud features, the model's performance in these complex situations is limited. Therefore, the interaction between the networks of the different modalities needs further consideration so that point cloud and image features can be fused better for pose estimation. In addition, redundant information generated during fusion needs to be removed to avoid introducing noise. To address these two technical requirements, a 6D pose estimation method based on cross-modal reconstruction and self-supervised training is urgently needed.

Summary of the Invention

[0005] The summary section of this application is intended to provide a brief overview of the concepts, which will be described in detail in the detailed description section below. This summary section is not intended to identify key or essential features of the claimed technical solutions, nor is it intended to limit the scope of the claimed technical solutions.

[0006] To address the problems and shortcomings of existing technologies, this invention aims to provide a 6D pose estimation method based on self-supervised training using cross-modal reconstruction. This method addresses the issues of background interference, object occlusion, and poor fusion performance between different modalities in current 6D pose estimation methods. By introducing a self-supervised algorithm that combines active occlusion from image information with occlusion information from cross-modal reconstruction, information from different sensor modalities is effectively fused to improve the accuracy and robustness of pose estimation. Furthermore, a channel selection fusion strategy is employed to remove redundant information and reduce noise interference, thereby solving the problems mentioned in the background technology.

[0007] To achieve the above objectives, the present invention provides the following technical solution:

[0008] This invention discloses a 6D pose estimation method based on cross-modal reconstruction self-supervised training, comprising the following steps:

[0009] Step 1: Obtain the scene map and depth map, and use the detection anchor box to select and crop to obtain the target map and depth map containing the target object;

[0010] Step 2: Perform piecewise masking on the target image to obtain a mask image, and input the mask image and depth map into the encoder to extract image features and point cloud features;

[0011] Step 3: The image features and point cloud features are trained using a cross-modal cross-attention mechanism to obtain image-point cloud fusion features;

[0012] Step 4: Output pose estimation feature pairs based on the image point cloud fusion features, and reconstruct the image through a self-supervised training strategy;

[0013] Step 5: Introduce reconstruction loss and feature pair prediction loss to optimize neural network parameters and then save the trained model weights;

[0014] Step 6: Load the trained model weights and combine them with the pose estimation features to output the 6D estimated pose through the pose decoder.

[0015] Preferably, step 3 includes the following steps:

[0016] Step 3.1: Extract the correlation features between the point cloud features and the image features based on the cross-attention mechanism;

[0017] Step 3.2: Select the relevant feature with the highest similarity to the point cloud feature and the image feature from the relevant features;

[0018] Step 3.3: Concatenate and fuse the relevant features with the highest similarity into cross-modal features;

[0019] Step 3.4: The cross-modal features are fed into the point cloud and image decoder to decode and recover the point cloud data and image data;

[0020] Step 3.5: Integrate the point cloud data and image data to generate image point cloud fusion features.

[0021] Preferably, step 3.1 further includes the following steps:

[0022] Step 3.1.1: Calculate the point cloud to image and image to point cloud attention heatmaps based on the point cloud features and image features;

[0023] Step 3.1.2: Multiply the point cloud features and the point cloud-to-image attention heatmap to obtain the point cloud-to-image features;

[0024] Step 3.1.3: Multiply the image features and the image-to-point-cloud attention heatmap to obtain the image-to-point-cloud features.

[0025] Preferably, step 3.2 further includes the following steps:

[0026] Step 3.2.1: Calculate the cosine similarity between the point-cloud-to-image features and the image features, and between the image-to-point-cloud features and the point cloud features, to obtain the respective deep similarity matrices;

[0027] Step 3.2.2: Calculate the mean similarity score for each row of the deep similarity matrix using a channel-wise mean;

[0028] Step 3.2.3: Extract the top K related features with the highest mean similarity scores.

[0029] Preferably, in step 3.3, the most similar related features are concatenated and fused into cross-modal features, specifically as follows:

[0030] Extract, from the image-to-point-cloud features, the K related features most similar to the point cloud features, then concatenate and fuse them with the point cloud features;

[0031] Extract, from the point-cloud-to-image features, the K related features most similar to the image features, then concatenate and fuse them with the image features.

[0032] Preferably, the reconstruction loss L_{Rec} in step 5 is expressed as:

[0033]

[0034] where one term denotes the fully reconstructed image and X_t is the original image before any masking is applied.

[0035] Preferably, the feature pair prediction loss function L_{Code} is expressed as:

[0036]

[0037] where w_j is the weight assigned to the importance of each binary code element, and the two code terms are, respectively, the binary vertex code predicted at training step t and the actual binary vertex code at training step t.

[0038] Preferably, after obtaining the 6D estimated pose through step 6, the 6D estimated pose needs to be compared with the true pose and the error between them needs to be calculated to optimize the accuracy of the estimation.

[0039] As a second aspect of this application, the present invention also discloses an electronic device, comprising:

[0040] At least one processor, and a memory communicatively connected to said at least one processor;

[0041] The memory stores instructions that can be executed by the at least one processor, which are executed by the at least one processor to enable the at least one processor to perform the steps of the above-described 6D pose estimation method based on cross-modal reconstruction self-supervised training.

[0042] As a third aspect of this application, the present invention also discloses a computer storage medium storing a computer program thereon, characterized in that the computer program, when executed by a processor, implements the steps of the above-described 6D pose estimation method based on cross-modal reconstruction self-supervised training.

[0043] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0044] This invention provides a 6D pose estimation method based on cross-modal reconstruction and self-supervised training. The method handles occluded environments effectively, deeply fuses point cloud and image features, and improves the robustness and accuracy of 6D pose estimation. By combining image and point cloud data, it can understand the features of objects from different angles and dimensions. It combines information from multiple sensors in situations such as occlusion and insufficient lighting, exhibiting better robustness than single-modal input. Because it uses self-supervised learning, the model can be trained without a large amount of labeled data, which enhances its adaptability to different environments and conditions and reduces the cost of data collection and labeling. After the input image is partially occluded, the original image is reconstructed from the remaining image and the point cloud. Reconstructing the image at the occluded areas provides self-supervision and promotes bidirectional communication between the networks of the different modalities. A hierarchical cross-modal channel selection and cross-attention mechanism effectively fuses image and point cloud features, improving feature representation capability. In addition, a similarity-based feature selection strategy is used, and by optimizing the algorithm and model structure, the computational load is reduced, improving the speed of training and inference.

Description of the Drawings

[0045] The accompanying drawings, which form part of this application, are used to provide a further understanding of the application and to make other features, objects, and advantages of the application more apparent. The illustrative embodiments and descriptions of this application are used to explain the application and do not constitute an undue limitation of the application. In the drawings:

[0046] Figure 1 is a block diagram showing the steps of the 6D pose estimation method in an embodiment of the present invention;

[0047] Figure 2 is a flowchart illustrating the steps of the 6D pose estimation method in an embodiment of the present invention;

[0048] Figure 3 is a flowchart illustrating the steps of fused feature output in an embodiment of the present invention;

[0049] Figure 4 is a flowchart illustrating the steps of similarity feature fusion in an embodiment of the present invention;

[0050] Figure 5 is a visualization of the cropped scene image and depth image in an embodiment of the present invention;

[0051] Figure 6 is a visualization of the scene image, mask image, and reconstructed image in an embodiment of the present invention;

[0052] Figure 7 is a visualization of the 6D pose estimation results in an embodiment of the present invention;

[0053] Figure 8 is a schematic diagram of the structure of an electronic device in an embodiment of the present invention.

Detailed Implementation

[0054] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.

[0055] It should also be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings. Unless otherwise specified, the embodiments and features described herein can be combined with each other. This invention discloses a 6D pose estimation method based on cross-modal reconstruction self-supervised training, which will be described in detail below with reference to the accompanying drawings and embodiments.

[0056] As shown in Figure 1 and Figure 2, the method includes the following steps:

[0057] Step 1: Obtain the scene map and depth map, and use the detection anchor box to select and crop to obtain the target map and depth map containing the target object;

[0058] Step 2: Perform piecewise masking on the target image to obtain a mask image. Input the mask image and depth map into the encoder to extract image features and point cloud features.

[0059] Step 3: Image features and point cloud features are trained using a cross-modal cross-attention mechanism to obtain image-point cloud fusion features;

[0060] Step 4: Output pose estimation feature pairs based on the image point cloud fusion features, and reconstruct the image through a self-supervised training strategy;

[0061] Step 5: Introduce reconstruction loss and feature pair prediction loss to optimize neural network parameters and then save the trained model weights;

[0062] Step 6: Load the trained model weights and combine them with the pose estimation features to output the 6D estimated pose through the pose decoder.

[0063] For step 1, the scene map and depth map are acquired using a depth camera, which can capture color (RGB) images and depth maps simultaneously. The scene map is a standard RGB image, while the depth map contains the distance from each pixel in the target scene to the camera. An analogy to the scene map is also required to ensure accurate identification of target objects for subsequent processing. Next, a target detection algorithm is used to crop the target objects from the image. Training weights are obtained by training a well-performing target detection algorithm on the target dataset. Using these weights and the detection algorithm, the target objects are identified and located in the scene map. Based on the detection results, a small area containing the target object and its depth data is selected using detection anchor boxes, and the corresponding anchor box coordinates are output. Cropping the scene map and depth map acquired by the depth camera according to the anchor box coordinates yields the cropped target map and depth map containing the target object. The cropping result is shown in Figure 5. This operation reduces the impact of background noise on subsequent processing, allowing the model to focus more on the features of the target object. The cropped depth map is also converted into point cloud data, which represents the 3D structure of the scene more intuitively.
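The cropping in step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `crop_to_box` and the `(x_min, y_min, x_max, y_max)` box layout are assumptions, and the detector that produces the box is not shown.

```python
import numpy as np

def crop_to_box(scene_rgb, depth, box):
    """Crop an RGB scene map and its pixel-aligned depth map to one detection box.

    box is assumed to be (x_min, y_min, x_max, y_max) in pixel coordinates.
    """
    h, w = depth.shape[:2]
    x0, y0, x1, y1 = box
    # Clip the box to the image bounds so slicing never wraps around.
    x0, x1 = max(0, int(x0)), min(w, int(x1))
    y0, y1 = max(0, int(y0)), min(h, int(y1))
    return scene_rgb[y0:y1, x0:x1], depth[y0:y1, x0:x1]

# Hypothetical 640x480 frame and a box returned by some detector.
scene = np.zeros((480, 640, 3), dtype=np.uint8)
depth = np.ones((480, 640), dtype=np.float32)
target_map, target_depth = crop_to_box(scene, depth, (100, 50, 300, 250))
```

Cropping both maps with the same box keeps the RGB pixels and depth values aligned, which the later image and point cloud encoders rely on.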

[0064] Next, the target image obtained by anchor-box selection and cropping is masked patch by patch. Using its height H and width W, the target image is divided into n×n patches, and patches are randomly discarded to create a mask. The remaining patches are reassembled into an image of height H and width W, with the discarded regions filled with black RGB values. This produces the masked target image, which provides the content for subsequent self-supervised learning and reconstruction of the lost information. Specifically, the target image is divided into 10×10 patches, each of which is discarded with a probability of 60%, and the remaining patches are reassembled to the size of the original target image as input. Randomly discarding image information allows the feature extraction network to ignore background noise and focus more on target-related information.
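The patch masking described above can be sketched as follows, using the 10×10 grid and 60% drop probability from the text. The function name `mask_patches` and the use of `np.linspace` to handle image sizes not divisible by n are assumptions for illustration.

```python
import numpy as np

def mask_patches(image, n=10, drop_prob=0.6, rng=None):
    """Split an H x W image into an n x n grid and black out randomly dropped patches."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    # Patch boundaries along each axis; linspace copes with sizes not divisible by n.
    ys = np.linspace(0, h, n + 1).astype(int)
    xs = np.linspace(0, w, n + 1).astype(int)
    # Each patch is dropped independently with probability drop_prob.
    dropped = rng.random((n, n)) < drop_prob
    masked = image.copy()
    for i in range(n):
        for j in range(n):
            if dropped[i, j]:
                masked[ys[i]:ys[i + 1], xs[j]:xs[j + 1]] = 0  # black RGB fill
    return masked, dropped

image = np.full((100, 100, 3), 255, dtype=np.uint8)
masked, dropped = mask_patches(image, rng=np.random.default_rng(0))
```

Returning the boolean `dropped` grid alongside the masked image makes it possible to evaluate the reconstruction only on the occluded patches, if that is desired.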

[0065] The cropped depth map is then converted into point cloud data. Point cloud data is a high-precision digital representation of an object's surface geometry and attributes in three-dimensional space, recorded as discrete points. It is represented as a set of points in three-dimensional space, each containing position information (x, y, z). RandLA-Net is used as the point cloud encoder to obtain the point cloud features f_p. RandLA-Net is an efficient, lightweight neural network architecture that can directly infer the semantics of each point in a large-scale point cloud. It significantly improves model efficiency and reduces computational resource consumption by replacing time-consuming sampling methods such as farthest point sampling (FPS) with random sampling. The masked target image is processed by a ConvNeXt model, used as the image encoder, to obtain the image features f_r. ConvNeXt is a pure convolutional neural network that can process occluded information in the input image. It can outperform Transformer models on various tasks while retaining the simplicity and efficiency of a convolutional network.
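The depth-to-point-cloud conversion can be sketched with the standard pinhole back-projection. The patent text does not give the camera model, so the intrinsics `fx`, `fy`, `cx`, `cy` and the function name are assumptions. For a cropped depth map, `cx` and `cy` would need to be shifted by the crop origin.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map into an (N, 3) array of (x, y, z) points.

    Assumes a pinhole camera with focal lengths fx, fy and principal point (cx, cy).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row indices
    z = depth.astype(np.float64)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels with no valid depth

toy_depth = np.array([[1.0, 0.0],
                      [2.0, 2.0]])
toy_points = depth_to_point_cloud(toy_depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

The resulting point set is what a point cloud encoder such as RandLA-Net would consume; the encoder itself is outside the scope of this sketch.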

[0066] Referring to Figure 3 and Figure 4, in step 3 the image features and point cloud features are input into the model, which is trained using a cross-modal cross-attention mechanism to obtain the fused features. This further includes the following steps:

[0067] Step 3.1: Extract the correlation features between point cloud features and image features based on the cross-attention mechanism;

[0068] Step 3.2: Select the relevant features that have the highest similarity to the point cloud features and image features;

[0069] Step 3.3: Concatenate and fuse the most similar related features into cross-modal features;

[0070] Step 3.4: Input the cross-modal features into the point cloud and image decoders to decode and recover the point cloud data and image data;

[0071] Step 3.5: Integrate point cloud data and image data to generate image point cloud fusion features.

[0072] Specifically, key information characterizing each modality is extracted from the encoded point cloud features and image features, and relevant features between the two are identified based on this information. Image feature descriptors include information such as color, texture, edges, shape, position, and relative position; point cloud feature descriptors include information such as point position, normals, curvature, spatial relationships between points, surface shape, and the three-dimensional structure of objects. The process includes the following steps:

[0073] Step 3.1.1: Calculate the attention heatmaps from point cloud to image and from image to point cloud based on point cloud features and image features;

[0074] Step 3.1.2: Multiply the point cloud features and the point cloud-to-image attention heatmap to obtain the point cloud-to-image features;

[0075] Step 3.1.3: Multiply the image features and the image-to-point-cloud attention heatmap to obtain the image-to-point-cloud features.

[0076] This invention employs a cross-modal attention mechanism to find relevant features between the two modalities (point cloud and image). Instead of directly using the common features of the two modalities as the fusion output, it applies a novel hierarchical similarity feature search to these common features to select more representative, highly similar heterogeneous information for fusion. First, the attention heatmaps between the two modalities are calculated, giving the point-cloud-to-image attention heatmap H_{p→r} and the image-to-point-cloud attention heatmap H_{r→p}, expressed as:

[0077]

[0078] where W_Q maps point cloud or image features to a query space for similarity calculation; W_K maps point cloud or image features to a key space, where they are matched against the query vectors; W_V maps point cloud or image features to a value space, where the values are weighted according to the attention scores; and √d_k is a scaling factor used to stabilize gradients during training, with d_k equal to the dimension of W_K. Softmax converts the similarity scores into a probability distribution representing the relative importance of different features. Through the attention mechanism, the model can learn the relationship between point cloud and image features, understanding which point cloud features are related to which regions of the image. The attention heatmaps can be used to enhance or suppress certain features, making the model focus more on information useful for the task.
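A cross-modal attention heatmap of this kind can be sketched with standard scaled dot-product attention. The heatmap equation is not reproduced in the text, so this is an assumption about its form: the function names, the token-by-channel feature layout, and the sharing of `W_Q` and `W_K` across both directions are all hypothetical, and the value projection W_V is left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_heatmap(f_query, f_key, W_Q, W_K):
    """Scaled dot-product attention weights from one modality to the other.

    f_query: (N_q, C) features of the querying modality
    f_key:   (N_k, C) features of the attended modality
    Returns an (N_q, N_k) heatmap whose rows sum to 1.
    """
    q = f_query @ W_Q            # project into the query space
    k = f_key @ W_K              # project into the key space
    d_k = W_K.shape[1]           # key dimension used for scaling
    return softmax(q @ k.T / np.sqrt(d_k), axis=-1)

rng = np.random.default_rng(0)
f_p = rng.normal(size=(5, 8))    # 5 hypothetical point cloud tokens, 8 channels
f_r = rng.normal(size=(7, 8))    # 7 hypothetical image tokens, 8 channels
W_Q = rng.normal(size=(8, 4))
W_K = rng.normal(size=(8, 4))
H_p2r = attention_heatmap(f_p, f_r, W_Q, W_K)   # point cloud to image
H_r2p = attention_heatmap(f_r, f_p, W_Q, W_K)   # image to point cloud
```

Each row of a heatmap is a probability distribution over the other modality's tokens, which matches the role of Softmax described above.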

[0079] In multimodal learning, this attention mechanism helps to effectively fuse information from different modalities. After the attention heatmaps are obtained, the point cloud features f_p and the image features f_r are multiplied by the corresponding attention heatmaps to obtain the transformed related features:

[0080] f_{p→r} = f_p * H_{p→r},

[0081] f_{r→p} = f_r * H_{r→p};

[0082] where f_{p→r} denotes the point-cloud-to-image features. This transformation weights the point cloud features f_p according to the importance of the image features f_r, yielding richer point cloud features that carry more image information. f_{r→p} denotes the image-to-point-cloud features. This transformation weights the image features f_r according to the importance of the point cloud features f_p, yielding richer image features that include point cloud characteristics. Because the attention mechanism can identify the most task-relevant parts of each modality, the transformed features f_{p→r} and f_{r→p} achieve cross-modal feature fusion and enhancement. The transformation combines the attention heatmap with the original features by multiplication, so that the feature representation focuses on task-useful information while suppressing unimportant information.

[0083] After the related features f_{p→r} and f_{r→p} are obtained, hierarchical similarity feature selection is used to search for and select, among the related features, those most similar to the original modality, thereby enriching the point cloud and image data. Existing methods directly concatenate the related features f_{p→r} and f_{r→p}, obtained through attention similarity weights, with the original features f_r and f_p, without further differentiation or evaluation. As a result, transformed features that differ greatly from the original features are also concatenated into the modal features and passed to the next feature extractor. This introduces noise from the heterogeneous modality, impairs the purity of the original features, makes the processed features noisier, and reduces the effectiveness of feature extraction. The hierarchical similarity feature selection proposed in this invention is therefore crucial for multimodal feature fusion in 6D pose estimation. Specifically, it includes the following steps:

[0084] Step 3.2.1: Calculate the cosine similarity between the point-cloud-to-image features and the image features, and between the image-to-point-cloud features and the point cloud features, to obtain the respective deep similarity matrices;

[0085] Step 3.2.2: Calculate the mean similarity score for each row of the deep similarity matrix using a channel-wise mean;

[0086] Step 3.2.3: Extract the top K related features with the highest mean similarity scores.

[0087] Specifically, in this embodiment of the invention, hierarchical similarity feature selection computes the similarity between the transformed features f_{p→r} and f_{r→p} and the original features f_r and f_p to generate a deeper similarity matrix. By comparing the elements of this matrix, the features most similar to the original modality features are selected. The selected features are then fused with the original modality features to generate the final cross-modal feature representation. This method improves the expressive power of the features because it not only considers the similarity between features through the attention score matrix but also refines feature selection through a hierarchical cosine similarity matrix. The method can also be applied to multimodal data fusion in general to improve the expressive power and practical effectiveness of the data.

[0088] First, the cosine similarity between the point-cloud-to-image features f_{p→r} and the image features f_r, and between the image-to-point-cloud features f_{r→p} and the point cloud features f_p, is calculated to obtain the deep similarity matrices. A deep similarity matrix does not simply relate the point cloud features f_p to the image features f_r. Instead, it relates f_{p→r} to f_r and f_{r→p} to f_p, which is a deeper level of similarity consideration. Once the similarity matrices are obtained, the correlation between the transformed features f_{p→r}, f_{r→p} and the features to be fused, f_r and f_p, can be measured. This correlation is represented by the point-cloud-to-image correlation matrix S_{p→r} and the image-to-point-cloud correlation matrix S_{r→p}, computed as follows:

[0089]

[0090] Each element of a correlation matrix represents the similarity between the point-cloud-to-image features f_{p→r} and the image features f_r, or between the image-to-point-cloud features f_{r→p} and the point cloud features f_p. For each row of the similarity matrix, the mean of all similarity scores in that row is calculated using a channel-wise mean. This mean reflects the average similarity between a related feature and the original features. Specifically, if S is a similarity matrix, the mean M_i of the i-th channel can be expressed as M_i = (1/n) Σ_{j=1}^{n} S_{ij}, where S_{ij} is the similarity between related feature i and original feature j, and n is the dimension of the feature. The K related features with the highest means, that is, the channels whose cosine similarity is closest to 1, are selected as the transformed features, denoted TopK(S_{r→p}, K) and TopK(S_{p→r}, K). A feature whose channel-wise mean similarity is not within TopK(S_{r→p}, K) or TopK(S_{p→r}, K) is discarded. For the features whose channel-wise mean similarity is within TopK(S_{r→p}, K) or TopK(S_{p→r}, K), the K most similar features are extracted and concatenated and fused with the image features f_r and the point cloud features f_p respectively, to enhance the expressive power of the data.
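The channel selection described above can be sketched as follows. The sketch assumes, for illustration only, that the transformed and original features are arranged as (channels, positions) arrays with the same number of positions, so that a cosine similarity between channels is defined. The function name and the small epsilon guard are assumptions.

```python
import numpy as np

def topk_similar_channels(f_trans, f_orig, k):
    """Rank transformed channels by their mean cosine similarity to the original channels.

    f_trans: (C_t, N) transformed (cross-modal) features, one row per channel
    f_orig:  (C_o, N) original modality features, one row per channel
    Returns the similarity matrix S, the row means M, and the indices of the top-k rows.
    """
    eps = 1e-12  # guards against division by zero for all-zero channels
    a = f_trans / (np.linalg.norm(f_trans, axis=1, keepdims=True) + eps)
    b = f_orig / (np.linalg.norm(f_orig, axis=1, keepdims=True) + eps)
    S = a @ b.T                 # S[i, j]: cosine similarity of channel i and channel j
    M = S.mean(axis=1)          # channel-wise mean over each row
    idx = np.argsort(-M)[:k]    # indices of the k largest means, best first
    return S, M, idx

f_orig = np.array([[1.0, 0.0],
                   [1.0, 0.0]])
f_trans = np.array([[1.0, 0.0],     # identical direction: similarity 1
                    [0.0, 1.0],     # orthogonal: similarity 0
                    [-1.0, 0.0]])   # opposite direction: similarity -1
S, M, idx = topk_similar_channels(f_trans, f_orig, k=2)
```

Channels outside `idx` are simply not carried forward, which corresponds to discarding features whose mean similarity falls outside the top K.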

[0091] Furthermore, extracting image features into point cloud f r→p In the point cloud feature f p The K most relevant features. The channel numbers of these K features are related to the image-to-point cloud correlation matrix S. r→p The former

[0092] TopK(S) features with K similarities r→p (K) have the same number. Extract image to point cloud features f r→p In the point cloud feature f p The K most similar features and point cloud features f p The splicing and fusion process is used to enhance the expressive power of point cloud data, resulting in the final output feature representation.

[0093] $f_{poutput} = \mathrm{Concat}(f_p, f_{r \to p}(n)), \quad n = \mathrm{TopK}(S_{r \to p}, K)$,

[0094] The concatenated feature vector contains point cloud features that enhance the image's expressive power, improving the representational ability of two-dimensional image data and enabling it to contain more geometric information. Similarly, the $K$ features of the point-cloud-to-image features $f_{p \to r}$ that are most relevant to the image features $f_r$ are extracted. The channel indices of these $K$ features are the same as the indices of the top $K$ similarities of the correlation matrix $S_{p \to r}$, namely $\mathrm{TopK}(S_{p \to r}, K)$.

[0095] The $K$ features of $f_{p \to r}$ that are most similar to $f_r$ are then concatenated and fused with the image features $f_r$ to enhance the expressive power of the image data, resulting in the final output feature representation:

[0096] $f_{routput} = \mathrm{Concat}(f_r, f_{p \to r}(n)), \quad n = \mathrm{TopK}(S_{p \to r}, K)$;
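As a hedged illustration of the Concat operations in [0093] and [0096], the following Python sketch appends the selected cross-modal channels to the original channels. The helper name `concat_selected`, the toy values, and the hard-coded index list `n` (standing in for the result of the TopK selection) are assumptions for illustration and do not come from the patent.

```python
def concat_selected(original, related, indices):
    """Concat(original, related(n)): append the selected related channels
    to the original channels along the channel axis."""
    return [list(ch) for ch in original] + [list(related[i]) for i in indices]


# Toy features (hypothetical values): each inner list is one channel.
f_p = [[0.1, 0.2], [0.3, 0.4]]                 # point cloud features, 2 channels
f_r2p = [[1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]   # image-to-point-cloud features, 3 channels
n = [0, 2]                                     # stand-in for TopK(S_{r->p}, K) with K = 2

# Fused point cloud feature: the 2 original channels followed by the 2 selected ones.
f_p_output = concat_selected(f_p, f_r2p, n)
```

The image branch would follow the same pattern, with the image features and the selected point-cloud-to-image channels in place of `f_p` and `f_r2p`.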

[0097] The concatenated feature vector contains geometric information of the point cloud features and image-related texture information, which can improve the expressive power of point cloud data and enable it to contain more visual information. The cross-modal features $f_{poutput}$ and $f_{routput}$ are then fed into the point cloud decoder and the image decoder, respectively, to decode the fused cross-modal features and recover the point cloud data and image data. The decoded point cloud data and image data are then integrated to generate an image-point cloud fusion feature containing information from both data modalities. This improves the expressive power of the data, enabling it to reflect the features of the original scene more comprehensively. The image-point cloud fusion feature is used in the subsequent image reconstruction and in the calculation of pose estimation matching pairs. This is a crucial step in the point cloud and image fusion process: it ensures that the fused features can be converted back into data from both modalities, which guarantees the information validity of the image-point cloud fusion feature.

[0098] The previously processed data, including the fused features, point cloud features, and image features, are used to calculate the output pose estimation feature pairs. These feature pairs are crucial for the subsequent calculation of the target's 6D pose, because they are the features extracted by the model that represent the target's pose. A self-supervised training strategy is used to reconstruct the image, reducing the model's reliance on labeled data. At the same time, cross-modal point cloud and image reconstruction enhances the network model's understanding of both modalities. The visualization results of probabilistic partial image occlusion and reconstruction are shown in Figure 6.

[0099] A reconstruction loss and a feature-pair prediction loss are introduced to optimize the neural network parameters, and the trained model weights are saved. The reconstruction loss $L_{Rec}$ measures the difference between the model's predicted output and the actual value, expressed as:

[0100]

[0101] The mean squared error (MSE) loss is used to evaluate the quality of the reconstruction. Here the prediction is the fully reconstructed image, and $X_t$ is the original image before any masking is applied. The feature-pair prediction loss $L_{Code}$ focuses on the model's prediction accuracy for feature pairs. Based on the pairing results, a hierarchical binary prediction loss is calculated using a Hamming-distance-based loss function, expressed as follows:

[0102]

[0103] Here, $w_j$ is the weight assigned to the importance of each binary code element. The predicted term is the binary vertex code predicted at training step $t$, and the target term is the true binary vertex code at training step $t$. $\lambda$ is a constant used to balance the difference between the current step and the previous step, and avg denotes the average over all pixels within the predicted object mask. Through these two loss functions, the model can learn more accurate feature representations, which improves its ability to estimate the 6D pose of the target. Saving the model weights allows the model to retain its predictive ability and accuracy when faced with new data, and also facilitates model deployment and application.
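The two losses can be illustrated with a minimal Python sketch. Because the patent's equations are not reproduced in this text, the forms below are assumptions rather than the claimed formulas: the reconstruction loss is written as a plain per-pixel MSE, and the feature-pair loss as a weighted Hamming-style distance averaged over the pixels inside the object mask. The $\lambda$-based balancing between training steps is not modeled, and all function names and toy values are hypothetical.

```python
def mse_loss(reconstructed, original):
    """Mean squared error between two equally sized flat lists of pixel values."""
    if len(reconstructed) != len(original):
        raise ValueError("images must have the same number of pixels")
    return sum((a - b) ** 2 for a, b in zip(reconstructed, original)) / len(original)


def weighted_hamming_loss(pred_codes, true_codes, weights, mask):
    """Average, over pixels inside the mask, of sum_j w_j * |pred_j - true_j|.

    pred_codes / true_codes: one binary code (list of 0/1) per pixel.
    weights: one weight w_j per code bit.
    mask: one boolean per pixel, True if the pixel lies inside the object mask.
    """
    total = 0.0
    count = 0
    for pred, true, inside in zip(pred_codes, true_codes, mask):
        if not inside:
            continue  # pixels outside the object mask do not contribute
        total += sum(w * abs(p - t) for w, p, t in zip(weights, pred, true))
        count += 1
    return total / count if count else 0.0


# Toy reconstruction example: three pixels, one of them reconstructed poorly.
rec_loss = mse_loss([0.0, 0.5, 1.0], [0.0, 1.0, 1.0])

# Toy code example: three pixels with 3-bit codes; the last pixel is outside the mask.
pred = [[1, 0, 1], [0, 0, 1], [1, 1, 1]]
true = [[1, 1, 1], [0, 0, 1], [0, 0, 0]]
code_loss = weighted_hamming_loss(pred, true, weights=[1.0, 0.5, 0.25],
                                  mask=[True, True, False])
```

In this sketch only the second bit of the first pixel is wrong among the masked pixels, so the code loss depends only on that bit's weight and the number of masked pixels.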

[0104] Finally, the trained model weights are combined with the pose estimation feature pairs to output the 6D estimated pose through the pose prediction decoder. Pose estimation feature pairs typically include keypoints, edges, and corners of objects in the image. This information helps the system identify the object's position and orientation in space. By analyzing these feature pairs, the system can more accurately calculate the 6D pose of the target object, i.e., its position and orientation in three-dimensional space: three position coordinates (x, y, z) and three rotational degrees of freedom (usually represented as Euler angles or a quaternion). The output 6D pose estimation result can be visualized by projecting the point cloud, transformed by the estimated pose matrix, onto the input 2D image. The red projection in Figure 7 shows this result. Finally, the system compares the estimated pose with the true pose to determine the accuracy of the estimation. Calculating the error indicates the performance of the current model and provides a basis for further optimization.
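The projection and error comparison described above can be sketched in Python as follows. The patent does not specify a camera model or an error metric, so both are assumptions here: a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, and the average distance between model points under the two poses (often called the ADD metric in the pose estimation literature). Function names and values are illustrative only.

```python
import math


def transform(points, R, t):
    """Apply the rigid transform x' = R x + t to each 3D point."""
    return [
        tuple(sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3))
        for p in points
    ]


def project(points_cam, fx, fy, cx, cy):
    """Pinhole projection of camera-frame points (z > 0) to pixel coordinates."""
    return [(fx * x / z + cx, fy * y / z + cy) for x, y, z in points_cam]


def add_error(points, R_est, t_est, R_gt, t_gt):
    """Mean Euclidean distance between model points under the estimated and true poses."""
    est = transform(points, R_est, t_est)
    gt = transform(points, R_gt, t_gt)
    return sum(math.dist(a, b) for a, b in zip(est, gt)) / len(points)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
model_points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]   # toy object point cloud (metres)

t_gt = (0.0, 0.0, 1.0)     # hypothetical true translation
t_est = (0.0, 0.01, 1.0)   # hypothetical estimate, 1 cm off along y

# Pixels where the point cloud lands after applying the estimated pose.
pixels = project(transform(model_points, identity, t_est),
                 fx=500.0, fy=500.0, cx=320.0, cy=240.0)

# Scalar error between estimated and true pose.
error = add_error(model_points, identity, t_est, identity, t_gt)
```

Because the two poses in this toy case differ only by a 1 cm translation, every model point is displaced by the same amount and the error equals that displacement.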

[0105] To implement the above embodiments, this application also discloses an electronic device. As shown in Figure 8, the electronic device 800 may include a processing unit (e.g., a central processing unit, a graphics processor, etc.) 801, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage device 808 into a random access memory (RAM) 803. The RAM 803 also stores various programs and data required for the operation of the electronic device 800. The processing unit 801, ROM 802, and RAM 803 are interconnected via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

[0106] Typically, the following devices can be connected to the I/O interface 805: input devices 806 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 807 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 808 including, for example, magnetic tapes, hard disks, etc.; and communication devices 809. The communication device 809 allows the electronic device 800 to communicate with other devices, wirelessly or by wire, to exchange data. Although Figure 8 shows an electronic device 800 with various devices, it should be understood that it is not required to implement or possess all of the devices shown. More or fewer devices may alternatively be implemented or possessed. Each box shown in Figure 8 can represent one device or multiple devices as needed.

[0107] In particular, according to some embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, some embodiments of this disclosure include a computer program product comprising a computer program carried on a computer storage medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via the communication device 809, installed from the storage device 808, or installed from the ROM 802. When the computer program is executed by the processing unit 801, it performs the functions defined above in the methods of some embodiments of this disclosure.

[0108] It should be noted that, in some embodiments of this disclosure, the computer storage medium described above can be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof.

[0109] In some embodiments of this disclosure, a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in connection with an instruction execution system, apparatus, or device. In some embodiments of this disclosure, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can also be any computer storage medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer storage medium can be transmitted using any suitable medium, including but not limited to: wires, optical fibers, RF (radio frequency), etc., or any suitable combination thereof.

[0110] The aforementioned computer storage medium may be included within the aforementioned electronic device, or it may exist independently without being assembled into the electronic device. The computer storage medium carries one or more programs that, when executed by the electronic device, enable the electronic device to implement the 6D pose estimation method based on cross-modal reconstruction self-supervised training.

[0111] Computer program code for performing operations of some embodiments of this disclosure can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network—including a local area network (LAN) or a wide area network (WAN)—or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0112] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings.

[0113] The functions described above in this document can be performed, at least in part, by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used include, without limitation: Field Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems-on-Chip (SoCs), Complex Programmable Logic Devices (CPLDs), and so on.

[0114] The above description is merely a selection of preferred embodiments of this disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in the embodiments of this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described inventive concept. One example is a technical solution formed by substituting the above-described features with (but not limited to) technical features with similar functions disclosed in the embodiments of this disclosure.

Claims

1. A 6D pose estimation method based on cross-modal reconstruction self-supervised training, characterized in that it includes the following steps:
Step 1: Obtain the scene image and depth map, and use the detection anchor box to select and crop them to obtain the target image and depth map containing the target object;
Step 2: Perform piecewise masking on the target image to obtain a mask image, and input the mask image and depth map into the encoder to extract image features and point cloud features;
Step 3: Train the image features and point cloud features using a cross-modal cross-attention mechanism to obtain image-point cloud fusion features;
Step 4: Output pose estimation feature pairs based on the image-point cloud fusion features, and reconstruct the image through a self-supervised training strategy;
Step 5: Introduce a reconstruction loss and a feature-pair prediction loss to optimize the neural network parameters, and then save the trained model weights;
Step 6: Load the trained model weights and combine them with the pose estimation feature pairs to output the 6D estimated pose through the pose decoder;
wherein Step 3 includes the following steps:
Step 3.1: Extract the related features between the point cloud features and the image features based on the cross-attention mechanism;
Step 3.2: Select, from the related features, those that have the highest similarity to the point cloud features and image features;
Step 3.3: Concatenate and fuse the related features with the highest similarity into cross-modal features;
Step 3.4: Feed the cross-modal features into the point cloud decoder and image decoder to decode and recover the point cloud data and image data;
Step 3.5: Integrate the point cloud data and image data to generate the image-point cloud fusion features.

2. The 6D pose estimation method based on cross-modal reconstruction self-supervised training as described in claim 1, characterized in that Step 3.1 further includes the following steps:
Step 3.1.1: Calculate the point-cloud-to-image and image-to-point-cloud attention heatmaps based on the point cloud features and image features;
Step 3.1.2: Multiply the point cloud features by the point-cloud-to-image attention heatmap to obtain the point-cloud-to-image features;
Step 3.1.3: Multiply the image features by the image-to-point-cloud attention heatmap to obtain the image-to-point-cloud features.

3. The 6D pose estimation method based on cross-modal reconstruction self-supervised training as described in claim 2, characterized in that Step 3.2 further includes the following steps:
Step 3.2.1: Calculate the cosine similarity between the point-cloud-to-image features and the image features, and between the image-to-point-cloud features and the point cloud features, to obtain the respective depth similarity matrices;
Step 3.2.2: Calculate the mean similarity score for each row of the depth similarity matrix using a channelized mean;
Step 3.2.3: Extract, according to the means, the top K related features with the highest similarity scores.

4. The 6D pose estimation method based on cross-modal reconstruction self-supervised training as described in claim 3, characterized in that, in Step 3.3, concatenating and fusing the most similar related features into cross-modal features comprises: extracting the K related features of the image-to-point-cloud features that are most similar to the point cloud features, and concatenating and fusing them with the point cloud features; and extracting the K related features of the point-cloud-to-image features that are most similar to the image features, and concatenating and fusing them with the image features.

5. The 6D pose estimation method based on cross-modal reconstruction self-supervised training as described in claim 1, characterized in that the reconstruction loss $L_{Rec}$ described in Step 5 is the mean squared error between the fully reconstructed image and $X_t$, the original image before any masking is applied.

6. The 6D pose estimation method based on cross-modal reconstruction self-supervised training as described in claim 5, characterized in that the feature-pair prediction loss $L_{Code}$ is a Hamming-distance-based loss in which $w_j$ is the weight assigned to the importance of each binary code element, and which compares the binary vertex code predicted at training step $t$ with the true binary vertex code at training step $t$.

7. The 6D pose estimation method based on cross-modal reconstruction self-supervised training as described in claim 4, characterized in that: After obtaining the 6D estimated pose through step 6, it is necessary to compare the 6D estimated pose with the true pose and calculate the error between them in order to optimize the accuracy of the estimation.

8. An electronic device, characterized in that it includes: at least one processor, and a memory communicatively connected to the at least one processor; the memory stores instructions that can be executed by the at least one processor to enable the at least one processor to perform the steps of the method according to any one of claims 1 to 7.

9. A computer storage medium storing a computer program thereon, characterized in that, when the computer program is executed by a processor, it performs the steps of the method according to any one of claims 1 to 7.