A method for generating natural language descriptions of open-domain videos based on multimodal feature fusion

A feature-fusion, natural-language technology applied in the field of video analysis. It addresses the problems that prior models use only RGB image features, leave other video information largely unexplored, and do not consider other feature modalities, achieving the effects of increased robustness and speed, improved accuracy, and high robustness of the generated descriptions.

Active Publication Date: 2020-11-20
NANJING UNIV OF AERONAUTICS & ASTRONAUTICS

AI Technical Summary

Problems solved by technology

The S2VT model achieved a METEOR score of 29.8% on a standard video description dataset, higher than all previous models, but S2VT considers only the RGB image features and optical flow features of the video; other information in the video has not been studied further.

[0004] Later models were proposed, such as the bidirectional LSTM model (Yi B, Yang Y, Shen F, et al. Bidirectional Long-Short Term Memory for Video Description[C]//ACM Multimedia Conference. ACM, 2016: 436-440.) and the multi-scale multiple-instance model (Xu H, Venugopalan S, Ramanishka V, et al. A Multi-scale Multiple Instance Video Description Network[J]. Computer Science, 2015, 6738: 272-279.), but these did not consider features beyond images and optical flow.

In 2017, Pasunuru et al. proposed a multi-task model (Pasunuru R, Bansal M. Multi-Task Video Captioning with Video and Entailment Generation[J]. 2017.) that shares parameters between an unsupervised video prediction task (encoding) and a language generation task (decoding). It achieved the best result to date, with a METEOR score of 36%, but the model uses only RGB image features.




Embodiment Construction

[0033] As shown in Figure 1, the open-domain video natural language description model based on multimodal feature fusion is divided into two main models: a feature extraction model and a natural language model. The present invention mainly studies the feature extraction model, which is introduced below in four parts.
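The natural language model itself is not detailed in this excerpt. As a minimal, hypothetical sketch of what such a decoder could look like, in the spirit of the LSTM-based captioning models cited in the background (PyTorch is assumed; the class name, dimensions, and teacher-forcing setup below are illustrative assumptions, not the patent's specification):

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Hypothetical language model: an LSTM that decodes a fused
    multimodal video feature vector into a word sequence."""
    def __init__(self, feat_dim, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, embed_dim)  # project fused feature
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, fused_feats, captions):
        # Feed the projected video feature as the first input step,
        # then teacher-force the ground-truth caption tokens.
        v = self.feat_proj(fused_feats).unsqueeze(1)   # (B, 1, E)
        w = self.embed(captions)                       # (B, T, E)
        h, _ = self.lstm(torch.cat([v, w], dim=1))     # (B, T+1, H)
        return self.out(h)                             # (B, T+1, vocab)
```

In this sketch the fused video feature is projected into the word-embedding space and consumed as the LSTM's first step, after which the model predicts each word of the description.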

[0034] The first part: ResNet152 extracts RGB image features and optical flow features.

[0035] (1) Extraction of RGB image features.

[0036] The ResNet model is pre-trained on the ImageNet image database. ImageNet contains 12,000,000 images in 1,000 categories, which makes the model more accurate at identifying objects in open-domain videos. The batch size of the neural network model is set to 50, and the initial learning rate is set to 0.0001. The MSVD (Microsoft Research Video Description Corpus) dataset contains 1970 video clips, each 8 to 25 seconds long, corres...
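As a minimal sketch of this extraction step (PyTorch and torchvision are assumed; the patent does not name an implementation), the 2048-dimensional output of ResNet152's global average-pooling layer can serve as the per-frame RGB feature, and the same extractor can be reused on optical-flow images rendered as three-channel inputs:

```python
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# ResNet152 pre-trained on ImageNet (torchvision >= 0.13 weights API);
# dropping the final classification layer keeps the 2048-d pooled feature.
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
feature_extractor.eval()

# Standard ImageNet preprocessing applied to each sampled frame.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_rgb_features(frame_paths):
    """Return a (num_frames, 2048) tensor of ResNet152 frame features."""
    feats = []
    with torch.no_grad():
        for path in frame_paths:
            img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
            feats.append(feature_extractor(img).flatten(1))  # (1, 2048)
    return torch.cat(feats, dim=0)
```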



Abstract

An open-domain video natural language description method based on multimodal feature fusion. A deep convolutional neural network model extracts RGB image features and grayscale optical-flow image features, and video spatiotemporal information and audio information are added to form a multimodal feature system. When extracting C3D features, the coverage of the continuous frame blocks fed into the 3D convolutional neural network is adjusted dynamically, which overcomes the limitation of training data size and makes the method robust to the length of the video being processed; the audio information compensates for what the visual modalities miss. Finally, the multimodal features are fused. The invention uses data standardization to normalize the feature values of each modality within a fixed range, which resolves the differences between feature values, and uses PCA to reduce the dimensionality of each modality's features while effectively retaining 99% of the important information, which resolves the problem of excessive dimensionality. The method effectively improves the accuracy of the generated open-domain video description sentences and is highly robust to scenes, characters, and events.
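The abstract names three concrete operations: dynamically adjusting the coverage of the frame blocks fed to the 3D CNN, standardizing each modality's feature values into a common range, and reducing each modality with PCA while retaining 99% of the information before fusion by concatenation. A minimal sketch of these steps, assuming NumPy and scikit-learn (the clip length, clip count, and feature dimensionalities below are illustrative assumptions, not values from the patent):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

def clip_starts(num_frames, clip_len=16, num_clips=10):
    """Dynamically adjust the stride (i.e. the coverage/overlap) of the
    consecutive frame blocks fed to the 3D CNN, so a fixed number of
    clips spans a video of any length (clip_len/num_clips are assumed)."""
    if num_frames <= clip_len:
        return [0]
    stride = max(1, (num_frames - clip_len) // max(1, num_clips - 1))
    return [min(i * stride, num_frames - clip_len) for i in range(num_clips)]

def standardize_and_reduce(feats):
    """Scale one modality's feature values into [0, 1], then keep enough
    principal components to retain 99% of the variance."""
    scaled = MinMaxScaler().fit_transform(feats)
    return PCA(n_components=0.99).fit_transform(scaled)

# Illustrative per-modality matrices (one row per video); the
# dimensionalities are typical for these networks, not from the patent.
rng = np.random.default_rng(0)
rgb   = rng.random((1970, 2048))   # ResNet152 RGB features
flow  = rng.random((1970, 2048))   # ResNet152 optical-flow features
c3d   = rng.random((1970, 4096))   # C3D spatiotemporal features
audio = rng.random((1970, 128))    # audio features

print(clip_starts(240))  # e.g. a 10 s video at 24 fps -> 10 clip offsets

# Multimodal fusion: standardize and reduce each modality, then concatenate.
fused = np.concatenate([standardize_and_reduce(m)
                        for m in (rgb, flow, c3d, audio)], axis=1)
```

Passing a float to `PCA(n_components=...)` keeps just enough components to explain that fraction of the variance, which matches the abstract's "retain 99% of important information" while cutting dimensionality before concatenation.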

Description

Technical Field

[0001] The invention belongs to the field of video analysis technology, and in particular relates to an open-domain video natural language description generation method based on multimodal feature fusion.

Background

[0002] With the popularity of smart mobile devices in recent years, a large amount of video data on network platforms urgently needs to be analyzed and managed, so research on natural language description technology for videos has great practical value. Illegal videos emerge endlessly on social platforms such as Weibo and WeChat, yet their spread is currently controlled mainly by manual means such as user reports, which is not effective. Beyond curbing the spread of pornographic, violent, reactionary, and other illegal videos and maintaining network security, language description of videos can also provide intelligent technology for the blind and other people with visual impairments t...


Application Information

Patent Type & Authority: Patent (China)
IPC (8): G10L15/00; G10L15/02; G10L15/06; G10L15/18; G10L15/26; G10L17/26; G06K9/46; G06K9/62; G06N3/04
CPC: G10L15/005; G10L15/02; G10L15/063; G10L15/18; G10L15/26; G10L17/26; G06V10/56; G06N3/045; G06F18/214
Inventor: 袁家斌 (Yuan Jiabin), 杜晓童 (Du Xiaotong)
Owner: NANJING UNIV OF AERONAUTICS & ASTRONAUTICS