A visual dialogue method based on optimal transport multimodal semantic alignment

A visual dialogue model is trained explicitly with a multimodal semantic alignment method based on optimal transport. This addresses the fine-grained alignment problem and improves both the model's alignment and its question-answering accuracy.

CN116303965B · Active · Publication Date: 2026-06-19 · EAST CHINA NORMAL UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: EAST CHINA NORMAL UNIV
Filing Date: 2023-03-23
Publication Date: 2026-06-19

Abstract

This invention discloses a visual dialogue method based on optimal transport multimodal semantic alignment. Its key feature is a model with three components: multimodal feature extraction, optimal transport-based textual semantic alignment, and optimal transport-based cross-modal semantic alignment. Given an image, a description of that image, and the dialogue history of the previous t-1 rounds about that image, the model selects the correct answer to the current question in round t from a set of candidate answers. Compared with existing technologies, the invention introduces optimal transport-based semantic alignment, which explicitly provides training signals for both intra-modal and inter-modal alignment. This improves the model's ability to align different textual mentions that refer to the same entity, and to align entities across modalities. The model therefore understands the textual information better and answers questions more reliably, which improves the accuracy of the predicted answers. The invention has high practical value and promising prospects for visual dialogue in a range of real-world application scenarios.
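
To make the idea of an optimal transport alignment signal concrete, the following is a minimal sketch and not the patented implementation. The abstract does not specify the cost function, the marginals, or the solver, so the cosine cost, the uniform marginals, the entropic (Sinkhorn) solver, and the function names `cosine_cost`, `sinkhorn_plan`, and `ot_alignment_loss` are all assumptions made for illustration. The two inputs could be two sets of text token features (intra-modal) or text features and image region features (cross-modal).

```python
import numpy as np


def cosine_cost(x, y):
    """Pairwise cost 1 - cosine similarity between rows of x and rows of y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized transport plan with uniform marginals (assumed)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # mass on each source feature
    b = np.full(m, 1.0 / m)   # mass on each target feature
    K = np.exp(-cost / eps)   # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)       # rescale rows toward marginal a
        v = b / (K.T @ u)     # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]


def ot_alignment_loss(x, y, eps=0.1):
    """Transport cost <plan, cost>, usable as an alignment training signal."""
    cost = cosine_cost(x, y)
    plan = sinkhorn_plan(cost, eps)
    return float((plan * cost).sum()), plan
```

In this sketch a low loss means the two feature sets can be matched cheaply, and the largest entry in each row of the plan indicates which target feature a source feature is aligned to. In a trainable model the same computation would be written in a framework with automatic differentiation so the loss can be backpropagated.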