Position recognition model construction method and system based on multi-view cross-modal matching

This method combines panoramic images with natural language descriptions through multi-view cross-modal matching to address the low accuracy and high computational complexity of existing visual position recognition technologies in complex environments. The result is high-precision, robust position recognition applicable to fields such as autonomous driving and robot navigation.

CN119887911B · Active · Publication Date: 2026-06-12 · QILU UNIVERSITY OF TECHNOLOGY (SHANDONG ACADEMY OF SCIENCES)


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: QILU UNIVERSITY OF TECHNOLOGY (SHANDONG ACADEMY OF SCIENCES)
Filing Date: 2024-12-25
Publication Date: 2026-06-12

Patent Timeline

25 Dec 2024: Application
12 Jun 2026: Publication (CN119887911B)

IPC: G06T7/73; G06V10/82; G06V10/74; G06V10/774; G06V10/80; G06V10/42; G06V10/44; G06V20/56; G06N3/0455; G06N3/0475; G06N3/084; G06N3/0895; G06N3/0985

CPC: G06T7/74; G06V10/82; G06V10/761; G06V10/7753; G06V10/806; G06V10/42; G06V10/454; G06V20/56

AI Tagging

Application Domain

Image enhancement; Image analysis


AI Technical Summary

Technical Problem

Existing visual position recognition technologies suffer from low accuracy, loss of feature detail, and high computational complexity in complex environments. Some methods rely on point cloud data, which is costly to acquire. Furthermore, the language descriptions used by cross-modal position recognition methods are oversimplified and cannot meet the needs of complex scenarios.

Method used

360° images acquired by a panoramic camera are segmented into multiple viewpoints, and natural language descriptions are generated. Text features are extracted using GPT-4 and frozen T5 models, and image features are extracted by combining a ViT with the Sinkhorn algorithm. Cross-modal matching is achieved through contrastive learning and multi-view feature stitching (concatenation), which optimize feature distance and similarity.
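The contrastive learning step can be illustrated with a minimal sketch. It assumes a symmetric InfoNCE-style loss over cosine similarities, which is a common choice for cross-modal matching; the summary above does not state the exact loss used in the patent. The function names and the temperature value 0.07 are hypothetical, and text and image features are assumed to have already been projected to the same dimension.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_similarity_matrix(text_feats, image_feats):
    """Return S where S[i][j] is the cosine similarity of text i and image j."""
    t = [l2_normalize(v) for v in text_feats]
    g = [l2_normalize(v) for v in image_feats]
    return [[sum(a * b for a, b in zip(ti, gj)) for gj in g] for ti in t]


def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(text_feats, image_feats, temperature=0.07):
    """Average of the text-to-image and image-to-text losses.

    Assumption: pair i (text_feats[i], image_feats[i]) describes the same
    position, and every other pairing in the batch is a negative.
    """
    if len(text_feats) != len(image_feats):
        raise ValueError("expected one image feature per text feature")
    sim = cosine_similarity_matrix(text_feats, image_feats)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # image-to-text direction
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(logits_t))
```

Minimizing this quantity pulls each text feature toward the image feature of the same position and pushes it away from the other positions in the batch, which is one way to read "optimizing feature distance and similarity".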

Benefits of technology

The method achieves high-precision, highly adaptable position recognition in complex scenarios and improves the robustness and computational efficiency of the system. It supports matching when some text or images are missing and significantly improves positioning accuracy. It is applicable to fields such as autonomous driving, robot navigation, and logistics delivery.

✦ Generated by Eureka AI based on patent content.

[Figure: CN119887911B_ABST]

Patent Text Reader

Abstract

The application relates to a method and system for constructing a position recognition model based on multi-view cross-modal matching, in the technical field of computer vision and natural language processing. It addresses the problem that traditional visual position recognition methods struggle to maintain high precision in complex environments and multi-view scenes and cannot effectively process natural language descriptions. To solve this problem, the application combines multi-view images with natural language text descriptions: text encoding and visual encoding extract features from the text and the images respectively, a clustering algorithm clusters the image features, and the multi-view image features of each position are spliced into a global image feature. Position matching is then performed by calculating the similarity between the text features and the image features. By combining visual and textual information, the application overcomes the poor robustness and accuracy of traditional methods under complex scenes and viewpoint changes, and can be widely applied to navigation for unmanned systems.
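The last two steps of the abstract (splicing per-view features into a global image feature, then matching by similarity) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the clustering step is omitted and each view is assumed to already have one aggregated feature vector, the similarity is assumed to be cosine similarity, the text feature is assumed to have the same length as the spliced image feature, and all names and toy values are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def global_descriptor(view_features):
    """Concatenate per-view vectors in a fixed view order, then normalize."""
    flat = [x for view in view_features for x in view]
    return l2_normalize(flat)


def rank_positions(text_feature, position_views):
    """Return (position_id, similarity) pairs sorted from best to worst match."""
    query = l2_normalize(text_feature)
    scores = []
    for position_id, views in position_views.items():
        descriptor = global_descriptor(views)
        if len(descriptor) != len(query):
            raise ValueError("text and image descriptors must have equal length")
        similarity = sum(a * b for a, b in zip(query, descriptor))
        scores.append((position_id, similarity))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy database: two positions, each with two views of 2-D features.
database = {
    "crossroads": [[1.0, 0.0], [0.0, 1.0]],
    "car_park": [[0.0, 1.0], [1.0, 0.0]],
}
# Toy text feature, assumed to live in the same 4-D space as the descriptors.
query = [0.9, 0.1, 0.1, 0.9]
ranking = rank_positions(query, database)
best_position = ranking[0][0]
```

Because the views are concatenated in a fixed order, the global descriptor keeps track of which direction each feature came from, so two positions with the same content seen from different headings produce different descriptors.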
