
Video description text generation method based on multi-modal fusion

A video description text generation technology in the field of image processing. It addresses problems such as unstable semantic direction, failure to reflect a video's dynamic content and temporal information, and excessive divergence in the generated description text, thereby improving the accuracy and robustness of text generation.

Pending Publication Date: 2020-12-11
Xinhua Zhiyun Technology Co., Ltd. (新华智云科技有限公司)
  • Summary
  • Abstract
  • Description
  • Claims
  • Application Information

AI Technical Summary

Problems solved by technology

[0004] The existing techniques described above extract frames from the video and output description text by treating each extracted image as an independent feature, but independent extracted frames cannot reflect the dynamic content and temporal information of the video. Moreover, generating natural-language description text requires the support of text-level information, which these techniques do not incorporate; as a result, the output description text diverges widely in content and its semantic direction is unstable.



Examples


Embodiment Construction

[0045] The following will clearly and completely describe the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without creative efforts fall within the protection scope of the present invention.

[0046] It should be noted that, in the case of no conflict, the embodiments of the present invention and the features in the embodiments can be combined with each other.

[0047] The present invention will be further described below in conjunction with the accompanying drawings and specific embodiments, but not as a limitation of the present invention.

[0048] The present invention includes a method for generating video description text based on multi-modal fusion ...



Abstract

The invention provides a video description text generation method based on multi-modal fusion. The method comprises the steps of: obtaining a to-be-described video comprising video frames, the to-be-described video having a corresponding video description statement; obtaining text theme information from the video description statement and assigning a text theme information code to each piece of text theme information; obtaining, respectively, the dynamic time-domain information codes, static information codes, and audio feature vector codes of the to-be-described video; fusing the dynamic time-domain information code, the static information code, and the audio feature vector code to obtain a fusion result; and inputting the fusion result together with the text theme information code into a first recurrent neural network for iterative processing, so as to determine the video content description text of the to-be-described video. The beneficial effect of the invention is that it generates a natural-language description of a video by fusing the video, audio, and text modalities, improving the accuracy and robustness of the generated text.
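The abstract above describes a pipeline: encode each modality, fuse the per-modality codes, then iterate a recurrent network over the fusion result plus the text theme code. The following is a minimal sketch of that data flow only, assuming concatenation as the fusion operation and a plain tanh recurrent cell; all names, dimensions, and the random stand-in encoders are illustrative assumptions, not details from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the patent).
D_DYN, D_STAT, D_AUD, D_TOPIC, D_HID = 8, 8, 4, 4, 16

def fuse(dynamic_code, static_code, audio_code):
    """Fuse the three modality codes; concatenation is one common
    choice of multi-modal fusion (the patent does not specify one here)."""
    return np.concatenate([dynamic_code, static_code, audio_code])

class SimpleRNN:
    """Minimal recurrent step: h' = tanh(W_x x + W_h h)."""
    def __init__(self, d_in, d_hid):
        self.W_x = rng.normal(0, 0.1, (d_hid, d_in))
        self.W_h = rng.normal(0, 0.1, (d_hid, d_hid))

    def step(self, x, h):
        return np.tanh(self.W_x @ x + self.W_h @ h)

# Random stand-ins for the real per-video encoders.
dyn = rng.normal(size=D_DYN)      # dynamic time-domain information code
stat = rng.normal(size=D_STAT)    # static information code
aud = rng.normal(size=D_AUD)      # audio feature vector code
topic = rng.normal(size=D_TOPIC)  # text theme information code

fused = fuse(dyn, stat, aud)
rnn = SimpleRNN(fused.size + D_TOPIC, D_HID)

# Iterative processing of the fusion result and the theme code;
# a real decoder would emit one word of the description per step.
h = np.zeros(D_HID)
for _ in range(5):
    h = rnn.step(np.concatenate([fused, topic]), h)

print(fused.shape, h.shape)  # fused code size, hidden-state size
```

The sketch shows only the shapes flowing through the pipeline; the actual invention would replace the random codes with learned video, image, and audio encoders and map the hidden state to a vocabulary at each step.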

Description

Technical Field

[0001] The invention relates to the technical field of image processing, and in particular to a method for generating video description text based on multi-modal fusion.

Background

[0002] Video has become the most popular way for people to obtain information; especially since the emergence of video apps, watching videos every day has become an indispensable form of leisure and entertainment for many people. To better serve users, the core information of a video must be expressed in text form for recommendation and display. There must therefore be a method capable of outputting the core content information of a given video.

[0003] At present, video content description (video captioning) is usually performed on a video: given a piece of video, a passage of text describing its content is generated. The video content description needs to describe the video content in a ...

Claims


Application Information

Patent Type & Authority: Application (China)
IPC(8): G06F16/78
CPC: G06F16/7867
Inventor: Liu Hui (刘辉)
Owner: Xinhua Zhiyun Technology Co., Ltd. (新华智云科技有限公司)