Cross-modal time sequence behavior positioning method and device for multi-granularity cascade interaction network

A behavior localization and multi-granularity technology, applied in the field of visual-language cross-modal learning. It addresses the problems that existing methods neither make full use of multi-granularity text query information nor fully model the temporal dependence of local video context, with the effect of improving localization accuracy.

Active Publication Date: 2022-02-18
ZHEJIANG LAB


Problems solved by technology

However, the existing methods do not make full use of multi-granularity text query information in visual-language cross-modal interaction, and do not fully model the temporal dependence of local video context.




Detailed Description of the Embodiments

[0075] Specific embodiments of the present invention will be described in detail below in conjunction with the accompanying drawings. It should be understood that the specific embodiments described here are only used to illustrate and explain the present invention, and are not intended to limit the present invention.

[0076] The invention discloses a cross-modal temporal behavior localization method and device based on a multi-granularity cascaded interaction network, which addresses the problem of localizing, in an untrimmed video, the temporal segment described by a given text query. The method proposes a simple and effective multi-granularity cascaded cross-modal interaction network to improve the cross-modal alignment ability of the model. In addition, the invention introduces a local-global context-aware video encoder to improve the contextual temporal dependence modeling capability of the video encoder.
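To make the coarse-to-fine cascade concrete, here is a minimal PyTorch sketch of one cascaded cross-modal interaction block. This is an illustration under stated assumptions, not the patent's exact architecture: the module names, the use of multi-head attention, the mean-pooled sentence vector, and all dimensions are hypothetical; the source only states that interaction is cascaded from coarse to fine granularity.

```python
# Minimal sketch of coarse-to-fine cascaded cross-modal interaction.
# All names and layer choices are assumptions for illustration only.
import torch
import torch.nn as nn


class CascadedCrossModalInteraction(nn.Module):
    """Video features first attend to a coarse sentence-level query
    embedding, then the refined features attend to fine word-level
    query embeddings (a coarse-to-fine cascade)."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Coarse stage: cross-attention against a pooled sentence vector.
        self.coarse_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Fine stage: cross-attention against individual word embeddings.
        self.fine_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # video: (B, T, dim) clip features; words: (B, L, dim) word features.
        sentence = words.mean(dim=1, keepdim=True)   # (B, 1, dim) coarse query
        coarse, _ = self.coarse_attn(video, sentence, sentence)
        video = self.norm1(video + coarse)           # sentence-guided features
        fine, _ = self.fine_attn(video, words, words)
        return self.norm2(video + fine)              # word-refined features
```

For example, fusing 128 clip features with a 20-word query, `CascadedCrossModalInteraction()(torch.randn(2, 128, 256), torch.randn(2, 20, 256))`, returns a (2, 128, 256) tensor of query-conditioned video features that a localization head could consume.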



Abstract

The invention discloses a cross-modal temporal behavior localization method and device based on a multi-granularity cascaded interaction network, aimed at the problem of localizing, in an untrimmed video, the temporal behavior described by a given text query. The invention implements a new multi-granularity cascaded cross-modal interaction network that performs cascaded cross-modal interaction in a coarse-to-fine manner, improving the cross-modal alignment capability of the model. In addition, the invention introduces a local-global context-aware video encoder to improve the contextual temporal dependence modeling capability of the video encoder. The method is simple to implement and flexible in means, improves visual-language cross-modal alignment precision, and a model trained with it can markedly improve temporal localization accuracy on paired video-query test data.
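As an illustration of the encoder idea, the following is a minimal, hypothetical PyTorch sketch of a local-global context-aware video encoder: a temporal convolution captures local context between neighboring clips while self-attention captures global context over the whole sequence. The fusion scheme, kernel size, and dimensions are assumptions; the abstract does not specify them.

```python
# Minimal sketch of a local-global context-aware video encoder.
# Layer choices and the residual fusion are assumptions for illustration.
import torch
import torch.nn as nn


class LocalGlobalVideoEncoder(nn.Module):
    def __init__(self, dim: int = 256, kernel_size: int = 3, num_heads: int = 4):
        super().__init__()
        # Local branch: temporal convolution over neighboring clips.
        self.local_conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        # Global branch: self-attention over the full clip sequence.
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (B, T, dim) per-clip features from a video backbone.
        local = self.local_conv(clips.transpose(1, 2)).transpose(1, 2)
        global_ctx, _ = self.global_attn(clips, clips, clips)
        # Residual fusion of local and global temporal context.
        return self.norm(clips + local + global_ctx)
```

Feeding, say, a (2, 128, 256) clip-feature tensor returns context-enriched features of the same shape, which could then enter the cascaded interaction block sketched above.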

Description

Technical Field

[0001] The invention relates to the field of visual-language cross-modal learning, and in particular to a cross-modal temporal behavior localization method and device.

Background

[0002] With the rapid development of multimedia and network technology, and the growing deployment of large-scale video surveillance in transportation, campuses, shopping malls, and other settings, video data is accumulating at a geometric rate, and video understanding has become an important and urgent problem. Temporal behavior localization is a foundation of and an important part of video understanding. Research on temporal behavior localization based on the visual modality alone restricts the behaviors to be localized to a predefined behavior set; however, real-world behaviors are complex and diverse, and a predefined set can hardly meet real-world needs. As shown in Figure 1, the visual-language cross-mod...


Application Information

IPC(8): G06F16/735; G06F16/78; G06F16/783; G06N3/04; G06N3/08; H04N19/149; H04N19/21
CPC: G06F16/735; G06F16/7844; G06F16/7867; H04N19/21; H04N19/149; G06N3/08; G06N3/044
Inventors: 王聪 (Wang Cong), 鲍虎军 (Bao Hujun), 宋明黎 (Song Mingli)
Owner: ZHEJIANG LAB