Video natural language text retrieval method based on space time sequence characteristics

A natural language text retrieval technology for video, applied in the physical field, addressing problems such as the difficulty of accurately modeling the spatiotemporal semantic features of videos, which limits the accuracy of retrieving videos by natural language text.

Pending Publication Date: 2021-11-26
XIDIAN UNIV

Problems solved by technology

[0007] The purpose of the present invention is to address the deficiencies of the above-mentioned prior art by proposing a video natural language text retrieval method based on spatial and temporal features. The method aims to solve two problems that degrade the accuracy of video natural language text retrieval: the difficulty of accurately modeling the complex spatiotemporal semantic features of video, and the fact that the semantic features of different modal data follow heterogeneous underlying manifold distributions with differing semantic gaps.


Embodiment Construction

[0043] The present invention is described in further detail below with reference to Figure 1 and an embodiment.

[0044] Step 1, generate a sample set.

[0045] Select at least 6,000 multi-category dynamic-behavior videos to be retrieved, together with their corresponding natural language text annotations, to form a sample set. Each video carries at least 20 human-labeled natural language text annotations, each at most 30 characters long, so the sample set contains at least 120,000 video-text pairs (6,000 videos × 20 annotations each).

[0046] Step 2, use three neural networks to extract three levels of spatial and temporal features from the video samples.

[0047] Input the videos in the sample set into a trained deep residual network, ResNet-152, and extract a feature from every frame image of each video. Average the image features of all frames within a video and output the resulting 2048-dimensional frame-level feature as the first-level feature of that video.
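As a minimal sketch of this step (the patent does not name a framework), the following PyTorch/torchvision code extracts per-frame ResNet-152 features and averages them into a 2048-dimensional first-level feature. The helper name `frame_level_feature` and the assumption that frames arrive already decoded and normalized are illustrative, not from the patent.

```python
import torch
import torchvision.models as models

# Pretrained ResNet-152 with the final classification layer removed, so the
# 2048-d output of the global average pool is exposed as the frame feature.
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()
resnet.eval()

@torch.no_grad()
def frame_level_feature(frames: torch.Tensor) -> torch.Tensor:
    """frames: (num_frames, 3, 224, 224), already normalized.
    Returns one 2048-d video feature: per-frame ResNet-152 features
    averaged over all frames, as paragraph [0047] describes."""
    per_frame = resnet(frames)      # (num_frames, 2048)
    return per_frame.mean(dim=0)    # (2048,)
```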

[0048] Use the trained 3D conv...
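The excerpt breaks off here, but the second-level feature evidently comes from a trained 3D convolutional network applied to stacks of consecutive frames. The sketch below is a stand-in only: it uses torchvision's R3D-18 (the patent's actual 3D network is not visible in this excerpt) to pool clip-level features that capture short-range temporal dynamics.

```python
import torch
from torchvision.models.video import r3d_18, R3D_18_Weights

# Stand-in 3D CNN; the classifier head is replaced so the 512-d clip
# feature from the final pooling layer is exposed directly.
r3d = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
r3d.fc = torch.nn.Identity()
r3d.eval()

@torch.no_grad()
def clip_level_feature(clips: torch.Tensor) -> torch.Tensor:
    """clips: (num_clips, 3, 16, 112, 112) stacks of consecutive frames.
    Returns one pooled feature summarizing short-range temporal dynamics."""
    per_clip = r3d(clips)           # (num_clips, 512)
    return per_clip.mean(dim=0)     # (512,)
```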


Abstract

The invention relates to a video text retrieval method based on spatiotemporal features, comprising the following steps: using three different types of neural networks to build a hierarchical, fine-grained, unified video representation of the spatiotemporal semantic information of a video; and constructing a video-text common semantic embedding network to bridge the semantic gap between cross-modal data, training the network with a contrastive ranking loss function. The method can be used for mutual retrieval between videos and natural language texts. The hierarchical feature extraction fully mines the more discriminative, complex spatiotemporal semantic information of the video modality, and the common semantic embedding network learns a common-space feature representation under which the semantic features of heterogeneous data from different modalities are identically distributed. Measuring the semantic association between high-order video features and natural language texts in this common space improves the retrieval precision of video natural language text.
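To make the embedding-and-loss idea concrete, here is a minimal PyTorch sketch. The dimensions (2048-d video features, 768-d text features, a 512-d common space) and the margin value are illustrative assumptions, and the "contrastive ranking loss" is rendered as the bidirectional max-margin ranking loss commonly used in video-text retrieval; none of these specifics are confirmed by the visible excerpt.

```python
import torch
import torch.nn.functional as F

# Linear projections mapping each modality into a shared 512-d space
# (sizes are assumptions for illustration).
video_proj = torch.nn.Linear(2048, 512)
text_proj = torch.nn.Linear(768, 512)

def ranking_loss(video_feats, text_feats, margin: float = 0.2):
    """video_feats: (B, 2048), text_feats: (B, 768); row i of each is a
    matched video-text pair. Pushes each matched pair's cosine similarity
    above every mismatched pair's by at least `margin`, in both retrieval
    directions (text->video and video->text)."""
    v = F.normalize(video_proj(video_feats), dim=1)
    t = F.normalize(text_proj(text_feats), dim=1)
    sim = v @ t.T                   # (B, B) cosine similarity matrix
    d = sim.diag()                  # matched-pair similarities
    # Hinge terms over all mismatched pairs, one per direction.
    cost_v2t = (margin + sim - d.unsqueeze(1)).clamp(min=0)
    cost_t2v = (margin + sim - d.unsqueeze(0)).clamp(min=0)
    mask = torch.eye(sim.size(0), dtype=torch.bool)  # ignore matched pairs
    return (cost_v2t.masked_fill(mask, 0).sum()
            + cost_t2v.masked_fill(mask, 0).sum())

# Usage with random stand-in features:
# loss = ranking_loss(torch.randn(8, 2048), torch.randn(8, 768))
# loss.backward()
```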

Description

Technical field

[0001] The invention belongs to the technical field of physics, and further relates to a video natural language text retrieval method based on spatial and temporal features within the technical field of image and data processing. The invention can be used for mutual semantic retrieval between the large-scale video modal data and natural language text modal data emerging on the Internet and social media, and for video topic detection and content recommendation in video applications.

Background technique

[0002] The emergence of a large number of user-generated videos on the Internet has increased the demand for video retrieval systems based on natural language text descriptions, and users' requirements for retrieval accuracy have brought unprecedented challenges to the precise retrieval of video content. Traditional methods mainly support concept-based retrieval for simple natural language text queries, which is ineffective for complex long natural language text qu...


Application Information

Patent Type & Authority: Application (China)
IPC(8): G06F16/78; G06F16/783; G06F16/33; G06F40/30; G06K9/46; G06K9/62; G06N3/04; G06N3/08
CPC: G06F16/783; G06F16/7867; G06F16/3344; G06F40/30; G06N3/08; G06N3/044; G06N3/045; G06F18/241
Inventor: 王笛; 田玉敏; 罗雪梅; 丁子芮; 万波; 王义峰; 赵辉
Owner: XIDIAN UNIV