Position recognition model construction method and system based on multi-view cross-modal matching
A multi-view cross-modal matching method that combines panoramic images with natural language descriptions addresses the low accuracy and high computational complexity of existing visual position recognition technologies in complex environments. The result is high-precision, robust position recognition applicable to fields such as autonomous driving and robot navigation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- QILU UNIVERSITY OF TECHNOLOGY (SHANDONG ACADEMY OF SCIENCES)
- Filing Date
- 2024-12-25
- Publication Date
- 2026-06-12
AI Technical Summary
Existing visual position recognition technologies suffer from low accuracy, loss of feature detail, and high computational complexity in complex environments. Some methods rely on costly point cloud data acquisition. Furthermore, the language descriptions used by existing cross-modal position recognition methods are oversimplified and cannot meet the needs of complex scenarios.
360° images acquired by a panoramic camera are segmented into multiple viewpoints, and natural language descriptions are generated for them. Text features are extracted using GPT-4 and a frozen T5 model, and image features are extracted by combining a Vision Transformer (ViT) with the Sinkhorn algorithm. Cross-modal matching is achieved through contrastive learning and multi-view feature stitching (concatenation), optimizing the distance and similarity between text and image features.
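The combination of multi-view feature stitching and contrastive learning described above can be illustrated with a minimal sketch. It is not the patented implementation: the function names (`concat_views`, `info_nce`), the symmetric InfoNCE loss, the temperature of 0.1, and the toy 2-dimensional per-view features are all assumptions chosen for illustration. In a real system the per-view vectors would come from the text and image encoders.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def concat_views(view_features):
    """Stitch per-view features into one unit-length location descriptor."""
    joined = [x for view in view_features for x in l2_normalize(view)]
    return l2_normalize(joined)

def info_nce(image_descs, text_descs, temperature=0.1):
    """Symmetric InfoNCE loss: image i and text i describe the same place."""
    n = len(image_descs)
    sims = [[dot(i, t) / temperature for t in text_descs] for i in image_descs]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct class of row k is column k.
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[k]
        return total / n

    cols = [list(c) for c in zip(*sims)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(sims) + cross_entropy_diag(cols))

# Toy data (hypothetical): two locations, two views each, 2-D features per view.
images = [concat_views([[1.0, 0.0], [0.0, 1.0]]),
          concat_views([[0.0, 1.0], [1.0, 0.0]])]
texts_aligned = [concat_views([[0.9, 0.1], [0.1, 0.9]]),
                 concat_views([[0.1, 0.9], [0.9, 0.1]])]
texts_swapped = texts_aligned[::-1]

loss_aligned = info_nce(images, texts_aligned)
loss_swapped = info_nce(images, texts_swapped)
print(loss_aligned < loss_swapped)
```

The loss is lower when each text descriptor is paired with the image descriptor of the same location, which is the property that contrastive training exploits to pull matching pairs together and push non-matching pairs apart.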
The method achieves high-precision, highly adaptable position recognition in complex scenarios, improves the robustness and computational efficiency of the system, and supports matching when some text or images are missing. It significantly improves positioning accuracy and is applicable to fields such as autonomous driving, robot navigation, and logistics delivery.
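The summary does not say how matching proceeds when some text or images are missing. One plausible approach, shown here purely as an assumption, is to score a candidate location using only the views that are present on both sides. The names `masked_similarity` and `retrieve`, the use of `None` to mark a missing view, and the toy database are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def masked_similarity(text_views, image_views):
    """Mean cosine similarity over views present in both lists (None = missing)."""
    scores = [cosine(t, i)
              for t, i in zip(text_views, image_views)
              if t is not None and i is not None]
    return sum(scores) / len(scores) if scores else 0.0

def retrieve(query_text_views, database):
    """Return the name of the database location most similar to the query."""
    return max(database,
               key=lambda name: masked_similarity(query_text_views, database[name]))

# Toy database (hypothetical): three views per location, one image view missing.
database = {
    "crossing": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "parking_lot": [[0.0, 1.0], [1.0, 0.0], None],
}
# The query has no description for the second view.
query = [[0.9, 0.1], None, [1.0, 0.8]]

best = retrieve(query, database)
print(best)
```

Averaging over the shared views keeps scores comparable between candidates with different numbers of available views, at the cost of giving sparse matches the same weight as complete ones.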
Figure CN119887911B_ABST