Video question answering method based on self-driven twinborn sampling and reasoning

A self-driven twin sampling and reasoning technology, applied in inference methods, video data retrieval, video data indexing and the like; it solves the problem that context information is not well utilized and achieves an enhanced learning effect.

Pending Publication Date: 2022-03-22
SUN YAT SEN UNIV

AI Technical Summary

Problems solved by technology

[0006] Aiming at the deficiencies of the prior art, the present invention aims to provide a video question answering method based on self-driven twin sampling and reasoning, which solves the problem that the context information in the same video cannot be well utilized.



Examples


Embodiment

[0062] Figure 1 shows the overall framework of the invention. For a video F, the present invention constructs a reference video segment c_anchor by sampling, together with the corresponding twin video segments {c_i} (i = 1, ..., N), where N represents the number of all video segments. Each video segment c has a length of B frames, and B feature maps are obtained through the video encoder. The encoded features of the reference segment and the twin segments are denoted as v_anchor and {v_i} (i = 1, ..., N), respectively.
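As a rough illustration of this sampling step, the following Python sketch draws one anchor segment and its twin segments of B frame indices each from a single video. The helper name and the exact frame-indexing scheme are assumptions for illustration, not the patented sampling rule.

```python
import numpy as np

def sample_segments(num_frames, num_segments, seg_len, rng=None):
    """Split a video of num_frames frames into num_segments temporal windows,
    sparsely sample seg_len (B) frame indices inside each window, pick one
    window as the reference (anchor) segment and treat the rest as twins."""
    rng = rng or np.random.default_rng()
    windows = np.array_split(np.arange(num_frames), num_segments)
    segments = [np.sort(rng.choice(w, size=seg_len, replace=len(w) < seg_len))
                for w in windows]
    anchor_idx = int(rng.integers(num_segments))                      # index of c_anchor
    twins = [s for i, s in enumerate(segments) if i != anchor_idx]    # {c_i}
    return segments[anchor_idx], twins

anchor_frames, twin_frames = sample_segments(num_frames=300, num_segments=4, seg_len=16)
```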

[0063] The text representation produced by the text encoder and the video segment representation produced by the video encoder are stitched together as input, and a multimodal Transformer is used to generate video segment-text features. The reference video segment-text feature is denoted as f_anchor, and the corresponding twin video segment-text features are denoted as {f_i}.
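A minimal PyTorch sketch of this fusion step is given below, assuming the encoders output fixed-dimensional token features; the module sizes and the mean pooling are illustrative choices, not the configuration specified by the invention.

```python
import torch
import torch.nn as nn

class SegmentTextFusion(nn.Module):
    """Stitch text-encoder tokens and video-encoder frame features together
    and fuse them with a Transformer encoder into a segment-text feature."""
    def __init__(self, dim=512, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, text_feats, video_feats):
        # text_feats: (batch, T_text, dim), video_feats: (batch, B, dim)
        tokens = torch.cat([text_feats, video_feats], dim=1)  # concatenate both modalities
        fused = self.encoder(tokens)                           # multimodal Transformer
        return fused.mean(dim=1)                               # pooled f_anchor / f_i

fusion = SegmentTextFusion()
f_anchor = fusion(torch.randn(1, 20, 512), torch.randn(1, 16, 512))   # -> (1, 512)
```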

[0064] Similarly, the soft label predictions inferred from these features are denoted as p_anchor and {p_i}, where each p is a vector of dimension ...
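One simple way to realize such soft label predictions is a linear classification head followed by a softmax over the candidate answers, as in the sketch below; the answer-vocabulary size and head architecture are placeholder assumptions.

```python
import torch
import torch.nn as nn

num_answers = 1000                 # assumed size of the candidate-answer set
head = nn.Linear(512, num_answers)

def predict_soft_labels(fused_feature):
    # fused_feature: (batch, 512) segment-text feature -> probability vector over answers
    return torch.softmax(head(fused_feature), dim=-1)

p_anchor = predict_soft_labels(torch.randn(1, 512))   # soft label for the anchor segment
```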


Abstract

The invention discloses a video question answering method based on self-driven twin sampling and reasoning. The method comprises video segment sampling, feature extraction and a reasoning strategy. The video segment sampling obtains a reference video segment through sparse sampling and obtains twin video segments through twin sampling. The feature extraction encodes a plurality of video segment-text pairs into corresponding semantic feature representations through a video encoder, a text encoder and a multimodal Transformer. In the reasoning strategy, a twin knowledge generation module generates refined knowledge labels for the video segments, and a twin knowledge reasoning module propagates the labels to all twin samples and fuses them. The beneficial effect of the method is that a framework based on self-driven twin sampling and reasoning is provided for extracting contextual semantic information from different video segments of the same video, thereby enhancing the learning effect of the network.
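To make the propagate-and-fuse idea of the reasoning strategy concrete, the sketch below fuses the soft labels of the anchor segment and its twin segments by simple averaging. This is only an assumed illustration of the data flow; the actual generation and fusion rules of the twin knowledge modules are not reproduced here.

```python
import torch

def fuse_twin_labels(p_anchor, p_twins):
    """p_anchor: (num_answers,) soft label of the reference segment;
    p_twins: list of (num_answers,) soft labels of its twin segments.
    Returns one refined label shared by all twin samples."""
    stacked = torch.stack([p_anchor] + list(p_twins), dim=0)
    fused = stacked.mean(dim=0)        # propagate and fuse by simple averaging (assumed rule)
    return fused / fused.sum()         # keep the result a valid probability distribution

refined_label = fuse_twin_labels(torch.rand(1000).softmax(-1),
                                 [torch.rand(1000).softmax(-1) for _ in range(3)])
```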

Description

Technical field

[0001] The invention relates to the technical fields of computer vision and pattern recognition, and in particular to a video question answering method based on self-driven twin sampling and reasoning.

Background technique

[0002] Video question answering is a visual-language reasoning task with great application prospects, and it has therefore attracted increasing attention from researchers. The video question answering task needs to acquire and use the temporal and spatial features of the visual signal in the video, guided by the combined semantics of linguistic cues, to generate answers.

[0003] Some existing works extract general visual information as well as motion features from videos to represent video content, and design different attention mechanisms to integrate these features. These methods focus on how to better understand the overall content of the video, but they easily ignore the details within individual video segments. Some researchers have also explored...


Application Information

Patent Timeline
IPC (IPC8): G06F16/783; G06F16/78; G06F16/75; G06F16/71; G06F16/332; G06V20/40; G06V10/764; G06K9/62; G06N5/04
CPC: G06F16/783; G06F16/7867; G06F16/75; G06F16/71; G06F16/3329; G06N5/04; G06F18/2415
Inventor: 余伟江, 卢宇彤, 李孟非, 陈志广
Owner: SUN YAT SEN UNIV