A method for generating natural language descriptions of open-domain videos based on multimodal feature fusion

A feature-fusion, natural-language technology applied in the field of video analysis. It addresses the problems that prior models use only RGB image features, leave other video information largely unexplored, and do not consider other feature modalities, achieving the effects of increased robustness and speed, improved accuracy, and high robustness of the generated descriptions.

Active Publication Date: 2020-11-20
NANJING UNIV OF AERONAUTICS & ASTRONAUTICS

AI Technical Summary

Problems solved by technology

The S2VT model achieved a METEOR score of 29.8% on a standard video description dataset, higher than all previous models, but S2VT considers only the RGB image features and optical flow features of the video; other information in the video has not been studied further.

[0004] Later models were proposed, such as the bidirectional LSTM model (Yi B, Yang Y, Shen F, et al. Bidirectional Long-Short Term Memory for Video Description[C]//ACM Multimedia Conference. ACM, 2016: 436-440.) and the multi-scale multiple-instance model (Xu H, Venugopalan S, Ramanishka V, et al. A Multi-scale Multiple Instance Video Description Network[J]. Computer Science, 2015, 6738: 272-279.), but these did not consider features beyond images and optical flow.

In 2017, Pasunuru et al. proposed a multi-task model (Pasunuru R, Bansal M. Multi-Task Video Captioning with Video and Entailment Generation[J]. 2017.) that shares parameters between an unsupervised video prediction task (encoding) and a language generation task (decoding). It achieved the best result to date, with a METEOR score of 36%, but the model uses only RGB image features.




Embodiment Construction

[0033] As shown in Figure 1, the open-domain video natural language description model based on multimodal feature fusion is divided into two main models: a feature extraction model and a natural language model. The present invention mainly studies the feature extraction model, which is introduced below in four parts.
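The natural language model itself is not detailed in this excerpt. As a minimal, hypothetical sketch of what such a decoder could look like, in the spirit of the LSTM-based captioning models cited in the background (PyTorch is assumed; the class name, dimensions, and teacher-forcing setup below are illustrative assumptions, not the patent's specification):

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Hypothetical language model: an LSTM that decodes a fused
    multimodal video feature vector into a word sequence."""
    def __init__(self, feat_dim, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, embed_dim)  # project fused feature
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, fused_feats, captions):
        # Feed the projected video feature as the first input step,
        # then teacher-force the ground-truth caption tokens.
        v = self.feat_proj(fused_feats).unsqueeze(1)   # (B, 1, E)
        w = self.embed(captions)                       # (B, T, E)
        h, _ = self.lstm(torch.cat([v, w], dim=1))     # (B, T+1, H)
        return self.out(h)                             # (B, T+1, vocab)
```

In this sketch the fused video feature is projected into the word-embedding space and consumed as the LSTM's first step, after which the model predicts each word of the description.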

[0034] The first part: ResNet152 extracts RGB image features and optical flow features.

[0035] (1) Extraction of RGB image features.

[0036] The ResNet model is pre-trained on the ImageNet image database. ImageNet contains 12,000,000 images in 1,000 categories, which makes the model more accurate at identifying objects in open-domain videos. The batch size of the neural network model is set to 50, and the initial learning rate is set to 0.0001. The MSVD (Microsoft Research Video Description Corpus) dataset contains 1970 video clips, each 8 to 25 seconds long, corres...
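As a minimal sketch of this extraction step (PyTorch and torchvision are assumed; the patent does not name an implementation), the 2048-dimensional output of ResNet152's global average-pooling layer can serve as the per-frame RGB feature, and the same extractor can be reused on optical-flow images rendered as three-channel inputs:

```python
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# ResNet152 pre-trained on ImageNet (torchvision >= 0.13 weights API);
# dropping the final classification layer keeps the 2048-d pooled feature.
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
feature_extractor.eval()

# Standard ImageNet preprocessing applied to each sampled frame.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_rgb_features(frame_paths):
    """Return a (num_frames, 2048) tensor of ResNet152 frame features."""
    feats = []
    with torch.no_grad():
        for path in frame_paths:
            img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
            feats.append(feature_extractor(img).flatten(1))  # (1, 2048)
    return torch.cat(feats, dim=0)
```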



Abstract

An open-domain video natural language description method based on multimodal feature fusion. A deep convolutional neural network model extracts RGB image features and grayscale optical-flow image features, and video spatiotemporal information and audio information are added to form a multimodal feature system. When extracting C3D features, the coverage of the continuous frame blocks fed into the 3D convolutional neural network is adjusted dynamically, which overcomes the limitation of training data size and makes the method robust to the length of the video being processed; the audio information compensates for what the visual modalities miss. Finally, the multimodal features are fused. The invention uses data standardization to normalize the feature values of each modality within a fixed range, which resolves the differences between feature values, and uses PCA to reduce the dimensionality of each modality's features while effectively retaining 99% of the important information, which resolves the problem of excessive dimensionality. The method effectively improves the accuracy of the generated open-domain video description sentences and is highly robust to scenes, characters, and events.
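The abstract names three concrete operations: dynamically adjusting the coverage of the frame blocks fed to the 3D CNN, standardizing each modality's feature values into a common range, and reducing each modality with PCA while retaining 99% of the information before fusion by concatenation. A minimal sketch of these steps, assuming NumPy and scikit-learn (the clip length, clip count, and feature dimensionalities below are illustrative assumptions, not values from the patent):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

def clip_starts(num_frames, clip_len=16, num_clips=10):
    """Dynamically adjust the stride (i.e. the coverage/overlap) of the
    consecutive frame blocks fed to the 3D CNN, so a fixed number of
    clips spans a video of any length (clip_len/num_clips are assumed)."""
    if num_frames <= clip_len:
        return [0]
    stride = max(1, (num_frames - clip_len) // max(1, num_clips - 1))
    return [min(i * stride, num_frames - clip_len) for i in range(num_clips)]

def standardize_and_reduce(feats):
    """Scale one modality's feature values into [0, 1], then keep enough
    principal components to retain 99% of the variance."""
    scaled = MinMaxScaler().fit_transform(feats)
    return PCA(n_components=0.99).fit_transform(scaled)

# Illustrative per-modality matrices (one row per video); the
# dimensionalities are typical for these networks, not from the patent.
rng = np.random.default_rng(0)
rgb   = rng.random((1970, 2048))   # ResNet152 RGB features
flow  = rng.random((1970, 2048))   # ResNet152 optical-flow features
c3d   = rng.random((1970, 4096))   # C3D spatiotemporal features
audio = rng.random((1970, 128))    # audio features

print(clip_starts(240))  # e.g. a 10 s video at 24 fps -> 10 clip offsets

# Multimodal fusion: standardize and reduce each modality, then concatenate.
fused = np.concatenate([standardize_and_reduce(m)
                        for m in (rgb, flow, c3d, audio)], axis=1)
```

Passing a float to `PCA(n_components=...)` keeps just enough components to explain that fraction of the variance, which matches the abstract's "retain 99% of important information" while cutting dimensionality before concatenation.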

Description

Technical Field

[0001] The invention belongs to the field of video analysis technology, and in particular relates to an open-domain video natural language description generation method based on multimodal feature fusion.

Background

[0002] With the popularity of smart mobile devices in recent years, a large amount of video data on network platforms urgently needs to be analyzed and managed, so research on natural language description technology for videos has great practical value. Illegal videos emerge endlessly on social platforms such as Weibo and WeChat, yet their spread is currently controlled mainly by manual means such as user reports, which is not effective. Beyond curbing the spread of pornographic, violent, reactionary, and other illegal videos and maintaining network security, language description of videos can also provide intelligent technology for the blind and other people with visual impairments t...


Application Information

Patent Type & Authority: Patent (China)
IPC (8): G10L15/00; G10L15/02; G10L15/06; G10L15/18; G10L15/26; G10L17/26; G06K9/46; G06K9/62; G06N3/04
CPC: G10L15/005; G10L15/02; G10L15/063; G10L15/18; G10L15/26; G10L17/26; G06V10/56; G06N3/045; G06F18/214
Inventor: 袁家斌 (Yuan Jiabin), 杜晓童 (Du Xiaotong)
Owner: NANJING UNIV OF AERONAUTICS & ASTRONAUTICS