
Video time sequence positioning method based on intra-modal collaborative multilinear pooling

A multilinear pooling video technology, applied in character and pattern recognition, biological neural network models, special data processing applications, etc.; it addresses problems such as the task's strong dependence on temporal modeling and its sensitivity to temporal information.

Pending Publication Date: 2020-07-03
HANGZHOU DIANZI UNIV

AI Technical Summary

Problems solved by technology

Since the output of video temporal grounding is a time interval, this task is especially sensitive to the video's temporal information and especially dependent on temporal modeling.




Embodiment Construction

[0085] The detailed parameters of the present invention are further described below.

[0086] As shown in Figures 1 and 2, the present invention provides a video temporal grounding method (Video Temporal Grounding) based on intra- and inter-modal multilinear pooling.

[0087] The core of the present invention is an intra- and inter-modal multilinear pooling model (IIM) that achieves effective fusion of multimedia representations; its superiority is verified on the cross-modal deep learning task of video temporal grounding. For the first time, this method models the features within each modality while simultaneously modeling the interaction between the video and natural-language modalities. The resulting fused features capture not only the correlation between modalities but also a deep understanding of, and interaction within, each modality. Given the excellent performance of the IIM model, the present invention further proposes a gen...
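Since the exact formulation of the IIM module is not reproduced at this point in the text, the following is only a minimal sketch of one plausible realization, assuming PyTorch. The low-rank bilinear (MFB-style) pooling form, the dimensions, and all class and layer names (MultilinearPooling, IIMFusion, proj_a, proj_b) are illustrative assumptions, not the patent's exact design.

```python
# Sketch of intra- then inter-modal multilinear pooling (assumed form).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultilinearPooling(nn.Module):
    """Low-rank bilinear pooling of two feature vectors (assumed MFB-style form)."""
    def __init__(self, dim_a, dim_b, joint_dim=1024, k=5):
        super().__init__()
        self.k = k
        self.proj_a = nn.Linear(dim_a, joint_dim * k)
        self.proj_b = nn.Linear(dim_b, joint_dim * k)

    def forward(self, a, b):
        # Element-wise product in a joint space, then sum-pool over the rank k.
        z = self.proj_a(a) * self.proj_b(b)             # (B, joint_dim * k)
        z = z.view(z.size(0), -1, self.k).sum(dim=2)    # (B, joint_dim)
        # Signed square-root and L2 normalization, common in bilinear pooling.
        z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-8)
        return F.normalize(z, dim=-1)

class IIMFusion(nn.Module):
    """Pool each modality with itself (intra), then pool across modalities (inter)."""
    def __init__(self, video_dim=1024, text_dim=512, joint_dim=1024):
        super().__init__()
        self.intra_v = MultilinearPooling(video_dim, video_dim, joint_dim)
        self.intra_t = MultilinearPooling(text_dim, text_dim, joint_dim)
        self.inter = MultilinearPooling(joint_dim, joint_dim, joint_dim)

    def forward(self, v, t):
        v_intra = self.intra_v(v, v)          # interactions within the video modality
        t_intra = self.intra_t(t, t)          # interactions within the text modality
        return self.inter(v_intra, t_intra)   # cross-modal interaction

# Example: fusing a batch of 8 video/text feature vectors into one joint vector.
fused = IIMFusion()(torch.randn(8, 1024), torch.randn(8, 512))  # -> (8, 1024)
```

Pooling each modality with itself before the cross-modal step mirrors the stated idea of modeling intra-modal interactions alongside inter-modal fusion.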



Abstract

The invention discloses a video temporal grounding method based on intra-modal collaborative multilinear pooling. The method comprises the following steps: 1, preprocessing the video and text data and extracting features; 2, fusing the video and text features through an intra-modal collaborative multilinear pooling module or a generalized intra-modal collaborative multilinear pooling module; 3, building a neural network structure for the video temporal grounding task; and 4, model training, in which the multi-task loss function is fed into an optimizer and the network parameters are updated by gradient back-propagation. The invention provides a deep neural network for video temporal grounding. It relates to the technical field of video temporal grounding, and in particular to a module that performs cross-modal fusion on video-text data: the deep features of every modality are fully utilized, an interaction method for video temporal information extends the module, the expressive capacity of the fused features is improved, and good results are achieved in the field of video temporal grounding.
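The four steps above can be made concrete with a short sketch, again assuming PyTorch. The toy network, feature dimensions, dummy batch, cross-entropy loss composition, and Adam optimizer are all assumptions for illustration; the abstract only specifies that a multi-task loss is handed to an optimizer and that the parameters are updated by back-propagation.

```python
# Minimal end-to-end sketch of steps 1-4 (all concrete choices are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoTextGroundingNet(nn.Module):
    """Toy stand-in for the grounding network of step 3 (hypothetical)."""
    def __init__(self, video_dim=1024, text_dim=512, num_clips=128):
        super().__init__()
        self.fuse = nn.Linear(video_dim + text_dim, 256)  # placeholder fusion
        self.start_head = nn.Linear(256, num_clips)       # start-frame scores
        self.end_head = nn.Linear(256, num_clips)         # end-frame scores

    def forward(self, v, t):
        h = torch.relu(self.fuse(torch.cat([v, t], dim=-1)))
        return self.start_head(h), self.end_head(h)

model = VideoTextGroundingNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Step 1 (stand-in): pretend these are extracted video/text features.
v = torch.randn(8, 1024)
t = torch.randn(8, 512)
start_gt = torch.randint(0, 128, (8,))
end_gt = torch.randint(0, 128, (8,))

# Steps 2-3: forward pass through fusion and the grounding heads.
start_pred, end_pred = model(v, t)

# Step 4: multi-task loss (one term per boundary, assumed composition),
# gradient back-propagation, and parameter update.
loss = F.cross_entropy(start_pred, start_gt) + F.cross_entropy(end_pred, end_gt)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```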

Description

technical field

[0001] The present invention proposes a video temporal grounding method (Video Temporal Grounding) based on intra- and inter-modal multilinear pooling.

Background technique

[0002] Video temporal grounding is an emerging task in the multimedia field that aims to locate, in a given video, the time interval corresponding to a provided text description. Specifically, a sentence and a video file are input, and the model locates the temporal position (start frame and end frame) at which the sentence occurs in the video. For example, a video of a person taking an onion out in the kitchen and shredding it may contain text descriptions such as "take out the chopping board", "take out the onion", "rinse the onion", "cut the onion", "rinse the chopping board", and "put back the chopping board". When a specific text is given, such as "take out the onion", the video temporal grounding model needs to output the time at which that text occurs i...
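As a toy illustration of the input/output contract just described (the helper below is hypothetical, not from the patent): given a query such as "take out the onion", the model's job reduces to emitting a (start frame, end frame) pair, which can then be converted to timestamps.

```python
# Hypothetical helper illustrating the task's output format.
from typing import Tuple

def interval_to_seconds(video_frames: int, fps: float,
                        start_frame: int, end_frame: int) -> Tuple[float, float]:
    """Convert a predicted (start_frame, end_frame) interval to seconds."""
    assert 0 <= start_frame <= end_frame < video_frames
    return start_frame / fps, end_frame / fps

# Query: "take out the onion". Suppose the model predicts frames 120-240
# in a 25 fps video; the action is then grounded to the span 4.8 s - 9.6 s.
print(interval_to_seconds(video_frames=3000, fps=25.0,
                          start_frame=120, end_frame=240))  # (4.8, 9.6)
```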


Application Information

IPC(8): G06K9/62; G06F16/783; G06N3/04
CPC: G06F16/7844; G06N3/045; G06F18/253; G06F18/214; Y02T10/40
Inventors: 余宙, 俞俊, 宋怡君
Owner: HANGZHOU DIANZI UNIV