A visual dialogue method based on optimal transport multimodal semantic alignment
A visual dialogue model is trained explicitly with a multimodal semantic alignment method based on optimal transport. This addresses the fine-grained alignment problem and improves both the model's alignment quality and its question-answering accuracy.
CN116303965B · Active · Publication Date: 2026-06-19 · EAST CHINA NORMAL UNIV
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: EAST CHINA NORMAL UNIV
- Filing Date: 2023-03-23
- Publication Date: 2026-06-19
Figure CN116303965B_ABST
Abstract
This invention discloses a visual dialogue method based on optimal transport multimodal semantic alignment. Its key feature is a model built from three components: multimodal feature extraction, optimal-transport-based textual semantic alignment, and optimal-transport-based cross-modal semantic alignment. Given an image, a description of that image, and the dialogue history of the previous t-1 rounds about that image, the model selects the correct answer to the current question in the t-th round from a set of candidate answers. Compared with existing technologies, the invention introduces optimal-transport-based semantic alignment, which provides explicit training signals for both intra-modal and inter-modal alignment. This improves the model's ability to align different textual mentions that refer to the same entity, as well as entities expressed in different modalities. As a result, the model understands the textual information better and predicts answers more accurately. The method is of practical value for visual dialogue in real-world application scenarios.
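The abstract does not say how the optimal transport alignment is computed. As a rough illustration only, the sketch below assumes an entropy-regularised formulation solved with Sinkhorn iterations, a cosine-distance cost, and uniform marginals; none of these choices is confirmed by the patent text. The function names (`cosine_cost`, `sinkhorn`, `ot_alignment_loss`) and the toy feature vectors are hypothetical.

```python
import math


def cosine_cost(xs, ys):
    """Cost matrix C[i][j] = 1 - cosine similarity between xs[i] and ys[j]."""
    def norm(v):
        return math.sqrt(sum(a * a for a in v)) or 1.0

    return [
        [1.0 - sum(a * b for a, b in zip(x, y)) / (norm(x) * norm(y)) for y in ys]
        for x in xs
    ]


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularised optimal transport with uniform marginals (assumed).

    Returns the transport plan as a list of rows.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each source item (e.g. text token)
    b = [1.0 / m] * m  # mass on each target item (e.g. image region)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_alignment_loss(xs, ys, eps=0.1):
    """Transport cost <plan, cost>, usable as an explicit alignment signal."""
    cost = cosine_cost(xs, ys)
    plan = sinkhorn(cost, eps)
    loss = sum(
        plan[i][j] * cost[i][j] for i in range(len(xs)) for j in range(len(ys))
    )
    return loss, plan


# Toy example: two "text" features and two "region" features (made-up values).
text_feats = [[1.0, 0.0], [0.0, 1.0]]
region_feats = [[0.9, 0.1], [0.1, 0.9]]
loss, plan = ot_alignment_loss(text_feats, region_feats)
```

In this toy case the plan puts most of its mass on the pairs whose features point in nearly the same direction, so the transport cost is small. The same routine could in principle be applied within one modality (text to text) or across modalities (text to image), which matches the two alignment components named in the abstract, though the actual cost function and solver used in the patent may differ.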