A 6D pose estimation method based on cross-modal information fusion

The method fuses cross-modal information across the encoding and decoding stages of an RGB network and a point cloud network, combining a geometric context feature aggregation module with a cross-modal attention fusion module. It addresses the accuracy and computational cost problems of RGB-D pose estimation and improves pose estimation performance in occluded scenes.

CN118135553B · Active · Publication Date: 2026-06-26 · XIDIAN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: XIDIAN UNIV
Filing Date: 2024-03-20
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing RGB-D-based 6D pose estimation methods suffer from low accuracy and high computational cost under weak texture, occlusion, and lighting variation, and they fail to effectively integrate the global semantic relevance of RGB and depth information.

Method used

A cross-modal information fusion-based approach is adopted, which integrates RGB and point cloud features through RGB network branches and point cloud network branches in the encoding and decoding stages, and utilizes a geometric context feature aggregation module and a cross-modal attention fusion module to perform 6D pose estimation.
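
As background for the term used above, a 6D pose is commonly understood as a 3D rotation R together with a 3D translation t, which map points from the object frame into the camera frame as p' = R p + t. The short sketch below only illustrates that definition. It is not taken from the patent, and the helper names `rotation_z` and `apply_pose` are hypothetical.

```python
import math

def rotation_z(theta):
    """3x3 rotation matrix for a rotation of theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def apply_pose(R, t, points):
    """Apply a 6D pose (rotation R, translation t) to 3D points: p' = R p + t."""
    return [
        [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
        for p in points
    ]

# Illustrative values only: rotate 90 degrees about z, then shift by t.
R = rotation_z(math.pi / 2)
t = [1.0, 2.0, 3.0]
model_points = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
camera_points = apply_pose(R, t, model_points)
```

A pose estimator of the kind described here would output R and t; the accuracy of that output is usually judged by how closely the transformed model points match the observed ones.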

Benefits of technology

It improves the accuracy of pose estimation in occluded scenarios, reduces computational costs, and achieves high-performance end-to-end pose estimation, making it suitable for fields such as robot manipulation and autonomous driving.

✦ Generated by Eureka AI based on patent content.

Figure CN118135553B_ABST (abstract figure)

Abstract

The application discloses a 6D pose estimation method based on cross-modal information fusion. In the encoding stage, an RGB network branch and a point cloud network branch each use an encoder to extract, layer by layer, the RGB features of the RGB image and the point cloud features of the depth image; during point cloud feature extraction, a geometric context feature aggregation module is applied at every encoding layer. In the decoding stage, the RGB branch uses a first multilayer decoder and the point cloud branch uses a second multilayer decoder to decode the features. Between the corresponding encoding layers, and likewise between the corresponding decoding layers, of the two branches, a cross-modal attention fusion module fuses the RGB features with the point cloud features, and the fused features are then split back into RGB features and point cloud features according to the order in which the two were arranged. In the pose calculation stage, the RGB features output by the first multilayer decoder and the point cloud features output by the second multilayer decoder are concatenated, and the 6D pose is estimated from the concatenated features. The method strengthens feature representation in occluded scenes and thereby improves pose estimation performance in those scenes.
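
The fuse-then-split step and the final feature joining described in the abstract can be pictured with a minimal sketch. The abstract does not specify the internals of the cross-modal attention fusion module, so the code below makes several assumptions: both modalities share the same channel width, fusion is a single-head scaled dot-product self-attention over the joined token sequence with no learned projections, and the decoder outputs are aligned per point so they can be joined channel-wise. All function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention (assumed, no learned weights)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

def fuse_and_split(rgb_feats, pc_feats):
    """Fuse both modalities jointly, then split back by arrangement order."""
    n_rgb = len(rgb_feats)
    joint = rgb_feats + pc_feats        # RGB tokens first, point cloud tokens second
    fused = self_attention(joint)       # every token attends to both modalities
    return fused[:n_rgb], fused[n_rgb:]  # same order in, same order out

def concat_for_pose(rgb_dec, pc_dec):
    """Join aligned RGB and point cloud features channel-wise (assumed alignment)."""
    return [r + p for r, p in zip(rgb_dec, pc_dec)]

# Toy features: 3 RGB tokens and 2 point cloud tokens, 2 channels each.
rgb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pc = [[0.5, 0.5], [0.0, 2.0]]
rgb_fused, pc_fused = fuse_and_split(rgb, pc)
pose_input = concat_for_pose(rgb_fused[:2], pc_fused)
```

Because the split relies only on the known ordering of the joined sequence, each branch gets back a feature set of its original size, which is what lets the two branches continue through their own layers after every fusion.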