Video description method based on deep learning and text summarization

A text summarization and video description technology, applied in the field of video description, which addresses problems such as sentence patterns that are overly rigid and uniform, the lack of natural human language expressiveness, and difficulty of implementation, and achieves a good video description effect and accuracy.

Active Publication Date: 2016-01-27


Problems solved by technology

[0004] However, this method has certain limitations. For example, using language templates to generate sentences tends to produce a relatively fixed sentence pattern, which is overly uniform and lacks the expressiveness of natural human language.
At the same time, different features are required to identi...



Examples


Embodiment 1

[0037] A video description method based on deep learning and text summarization, see Figure 1. The method includes the following steps:

[0038] 101: Download videos from the Internet and describe each video in English, forming <video, text description> pairs that constitute a text description training set, where each video corresponds to multiple sentence descriptions, thereby forming a text description sequence;

[0039] 102: Use an existing image data set to train a convolutional neural network (CNN) model on an image classification task;

[0040] For example: ImageNet.
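The following is a minimal sketch of step 102, assuming PyTorch/torchvision (the patent does not name a framework) and an AlexNet-style model, which matches the layer parameters given later in Embodiment 3:

    # Sketch of step 102: train a CNN on an image classification task.
    # PyTorch/torchvision are assumptions; the patent names no framework.
    import torch
    import torch.nn as nn
    import torchvision

    model = torchvision.models.alexnet(num_classes=1000)  # AlexNet-style CNN
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    def train_step(images, labels):
        # images: (batch, 3, 224, 224) crops taken from the 256x256 inputs
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        return loss.item()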

[0041] 103: Extract the video frame sequence from each video and use the convolutional neural network (CNN) model to extract CNN features; form <video frame sequence, text description sequence> pairs as the input of a recurrent neural network (RNN) model, and train the RNN model;
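A minimal sketch of step 103, again assuming PyTorch: the per-frame CNN features are read first by the RNN, which is then trained to predict each word of the paired description (all layer sizes and names below are illustrative assumptions):

    # Sketch of step 103: per-frame CNN features condition an RNN language model.
    import torch
    import torch.nn as nn

    class FrameToSentence(nn.Module):
        def __init__(self, feat_dim=4096, hidden=512, vocab=10000):
            super().__init__()
            self.proj = nn.Linear(feat_dim, hidden)   # CNN feature -> RNN space
            self.embed = nn.Embedding(vocab, hidden)  # word id -> vector
            self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab)       # hidden state -> word logits

        def forward(self, frame_feats, tokens):
            # frame_feats: (batch, n_frames, feat_dim); tokens: (batch, n_words)
            v = self.proj(frame_feats)
            w = self.embed(tokens)
            h, _ = self.rnn(torch.cat([v, w], dim=1)) # frames first, then words
            return self.out(h[:, v.size(1):])         # next-word logits per word

Training minimizes cross-entropy between these logits and the description tokens shifted by one position.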

[0042] 104: Use the trained RNN model to describe the video frame sequence of the video to be described, obtaining a description sequence;
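A minimal sketch of step 104, decoding greedily with the model sketched above (the BOS/EOS token ids and the length limit are illustrative assumptions):

    # Sketch of step 104: greedy decoding with the trained RNN.
    import torch

    @torch.no_grad()
    def describe(model, frame_feats, bos=1, eos=2, max_len=20):
        tokens = torch.full((frame_feats.size(0), 1), bos, dtype=torch.long)
        for _ in range(max_len):
            logits = model(frame_feats, tokens)          # (batch, t, vocab)
            nxt = logits[:, -1].argmax(-1, keepdim=True) # most likely next word
            tokens = torch.cat([tokens, nxt], dim=1)
            if (nxt == eos).all():
                break
        return tokens  # word-id sequence, mapped back to words by the vocabulary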

[...

Embodiment 2

[0046] 201: Download videos from the Internet and describe each video, forming <video, text description> pairs that constitute a text description training set;

[0047] This step specifically includes:

[0048] (1) Download the Microsoft Research Video Description Corpus from the Internet. This data set includes 1970 video segments collected from YouTube. The data set can be expressed as VID = {Video_1, ..., Video_{N_d}}, where N_d is the total number of videos in the set VID.

[0049] (2) Each video has multiple corresponding descriptions; the sentence descriptions of each video are Sentences = {Sentence_1, ..., Sentence_N}, where N is the number of sentence descriptions corresponding to the video (a data-pairing sketch follows this list).

[0050] (3) The ...
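As referenced above, a minimal sketch of the pairing in steps (1)-(2), assuming the corpus has been parsed into (video id, sentence) rows (the row format and function name are assumptions; MSVD ships its sentences as an annotation file):

    # Sketch of steps (1)-(2): pair each video with its N sentence descriptions.
    from collections import defaultdict

    def build_training_pairs(rows):
        # rows: iterable of (video_id, english_sentence) parsed from the corpus
        sentences = defaultdict(list)  # Video_i -> [Sentence_1, ..., Sentence_N]
        for video_id, sent in rows:
            sentences[video_id].append(sent)
        # one <video, description list> pair per clip (~1970 clips in MSVD)
        return [(vid, sents) for vid, sents in sentences.items()]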

Embodiment 3

[0124] Here, two videos are selected as the videos to be described, as shown in Figure 5. The method based on deep learning and text summarization of the present invention is used to predict and output the corresponding video descriptions:

[0125] (1) Use ImageNet as the training set and resample each picture in the data set to a size of 256*256, taking IMAGE = {Image_1, ..., Image_{N_m}} as input, where N_m is the number of pictures.

[0126] (2) Build the first convolutional layer: set the convolution kernel cov1 size to 11 and the stride to 4; select ReLU, i.e. max(0, x), as the activation; perform a pooling operation on the convolutional feature map with kernel size 3 and stride 2; and normalize the convolutional output using local response normalization. As in AlexNet, k=2, n=5, α=10...
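A minimal sketch of this first block in PyTorch (an assumption; the filter count of 96 and the LRN α/β values are AlexNet's published defaults, since the paragraph truncates at α=10...):

    # Sketch of [0126]: AlexNet-style first convolutional block.
    import torch.nn as nn

    first_block = nn.Sequential(
        nn.Conv2d(3, 96, kernel_size=11, stride=4),  # cov1: 11x11 kernel, stride 4
        nn.ReLU(inplace=True),                       # ReLU(x) = max(0, x)
        nn.MaxPool2d(kernel_size=3, stride=2),       # pooling kernel 3, stride 2
        nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),  # n=5, k=2
    )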



Abstract

The invention discloses a video description method based on deep learning and text summarization. The video description method comprises the following steps: using an existing image data set, training a convolutional neural network model on an image classification task; extracting the video frame sequence of a video and using the convolutional neural network model to extract convolutional neural network features, forming <video frame sequence, text description sequence> pairs that serve as the input of a recurrent neural network model, and training the recurrent neural network model; describing the video frame sequence of the video to be described with the trained recurrent neural network model to obtain description sequences; and ranking the description sequences by using graph-based lexical centrality as the salience measure of the text summarization, and outputting the final description result of the video. The events occurring in a video and the object attributes associated with those events are described in natural language, so as to describe and summarize the video content.
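The final ranking step computes graph-based lexical centrality over the candidate descriptions. A minimal sketch in the LexRank style, assuming TF-IDF cosine similarity and a PageRank-style power iteration (scikit-learn/numpy and the threshold value are assumptions, not taken from the patent):

    # Sketch of the ranking step: graph-based lexical centrality over candidates.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    def centrality_rank(descriptions, threshold=0.1, damping=0.85, iters=50):
        tfidf = TfidfVectorizer().fit_transform(descriptions)
        sim = (tfidf @ tfidf.T).toarray()          # cosine similarity graph
        adj = (sim >= threshold).astype(float)     # keep sufficiently similar pairs
        adj /= adj.sum(axis=1, keepdims=True)      # row-stochastic transition matrix
        n = len(descriptions)
        p = np.full(n, 1.0 / n)
        for _ in range(iters):                     # PageRank-style power iteration
            p = (1 - damping) / n + damping * (adj.T @ p)
        return [descriptions[i] for i in np.argsort(-p)]  # most central first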

Description

Technical Field

[0001] The invention relates to the field of video description, and in particular to a video description method based on deep learning and text summarization.

Background Technique

[0002] Using natural language to describe a video is extremely important both for understanding the video and for retrieving it on the Web. At the same time, the language description of video is also a key research topic in the fields of multimedia and computer vision. Video description means that, for a given video, the video features are obtained by observing the content it contains, and corresponding sentences are generated according to that content. When people see a video, especially an action video, they gain a certain degree of understanding after watching it and can use language to tell what happened in the video, for example describing a video with a sentence like "A man is riding a motorcycle." However, in the face of a large ...


Application Information

IPC(8): G06K9/00, G06N3/08
CPC: G06N3/08, G06V20/41
Inventor: 李广, 马书博, 韩亚洪
Owner: 广州葳尔思克自动化科技有限公司