A visual dialogue method based on optimal-transport multimodal semantic alignment

Explicitly training the visual dialogue model with an optimal-transport-based multimodal semantic alignment method addresses the fine-grained alignment problem and improves the model's alignment quality and question-answering accuracy.

CN116303965B · Active · Publication Date: 2026-06-19 · EAST CHINA NORMAL UNIV


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: EAST CHINA NORMAL UNIV
Filing Date: 2023-03-23
Publication Date: 2026-06-19

Smart Images

Figure CN116303965B_ABST

Patent Text Reader

Abstract

This invention discloses a visual dialogue method based on optimal-transport multimodal semantic alignment. Its key feature is a model comprising three components: multimodal feature extraction, optimal-transport-based textual semantic alignment, and optimal-transport-based cross-modal semantic alignment. Given an image, a description of that image, and the dialogue history of the preceding t-1 rounds about that image, the model selects the correct answer to the round-t question from a set of candidate answers. Compared with existing techniques, the invention introduces optimal-transport-based semantic alignment, which explicitly provides training signals for both intra-modal and inter-modal alignment. This improves the model's ability to align different textual mentions that refer to the same entity and to align entities across modalities, which helps the model understand the textual information and improves the accuracy of its predicted answers. The invention is described as having high practical value for visual dialogue in a range of real-world application scenarios.
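To make the idea of optimal-transport-based alignment concrete, the following is a minimal sketch in Python with NumPy, not the patented implementation. The abstract does not specify the cost function, the solver, or the marginals, so everything here is an assumption chosen for illustration: a cosine-distance cost, uniform marginals, and an entropy-regularized Sinkhorn solver. The function names (`cosine_cost`, `sinkhorn_plan`, `ot_alignment_loss`), the feature shapes, and the random features are all hypothetical.

```python
import numpy as np


def cosine_cost(x, y):
    """Cost matrix of shape (n, m): 1 - cosine similarity between rows of x and y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized transport plan between two uniform distributions.

    Assumption: uniform marginals and a Sinkhorn solver; the source names neither.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # mass on each source item (e.g. text token)
    b = np.full(m, 1.0 / m)  # mass on each target item (e.g. image region)
    K = np.exp(-cost / eps)  # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)      # rescale rows toward marginal a
        v = b / (K.T @ u)    # rescale columns toward marginal b
    # The last update was on v, so column sums equal b exactly;
    # row sums approach a as the iteration converges.
    return u[:, None] * K * v[None, :]


def ot_alignment_loss(source_feats, target_feats, eps=0.1):
    """Transport cost <plan, cost>, usable as an explicit alignment signal."""
    cost = cosine_cost(source_feats, target_feats)
    plan = sinkhorn_plan(cost, eps=eps)
    return float(np.sum(plan * cost)), plan


# Hypothetical features: 5 text tokens and 7 image regions, 16 dimensions each.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(5, 16))
region_feats = rng.normal(size=(7, 16))

loss, plan = ot_alignment_loss(text_feats, region_feats)
print("alignment loss:", loss)
print("soft alignment of token 0 over regions:", plan[0] / plan[0].sum())
```

Under these assumptions, the same function could serve both alignment components named in the abstract: passing two sets of text features (for example, question tokens and dialogue-history tokens) gives an intra-modal signal, while passing text features and image-region features gives a cross-modal one. Each row of `plan` can be read as a soft assignment of one source item to the target items.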
