
A method of locating sound source from video

A video and sound-source localization technology, applied in the field of cross-modal learning, that addresses the problems of blurred localization edges and low localization accuracy, achieving high-precision localization.

Active Publication Date: 2020-12-11
TSINGHUA UNIV
Cites: 5 · Cited by: 0
  • Summary
  • Abstract
  • Description
  • Claims
  • Application Information

AI Technical Summary

Problems solved by technology

[0003] At present, most existing work on video sound-source localization operates at the pixel level: a convolutional neural network learns the relationship between the sound and different positions in the image, and a heat map marks the likely sounding regions in the original frame. This approach produces blurred region edges, its localization accuracy is low, and it still emits localization output on video frames where the sound and picture are not synchronized.
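The heat-map approach described above can be illustrated with a minimal sketch: score every spatial cell of a CNN visual feature map against an audio embedding, and treat the resulting score grid as the localization heat map. All names, vector sizes, and values here are illustrative toys, not the patent's actual networks or features.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sound_heatmap(visual_grid, audio_vec):
    """For each spatial cell of an H x W x D visual feature map,
    score its similarity to the audio embedding; the H x W result
    plays the role of the localization heat map."""
    return [[cosine(cell, audio_vec) for cell in row] for row in visual_grid]

# Toy 2x2 feature map with 3-dim features; the top-left cell
# matches the audio embedding exactly and should score highest.
grid = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.5, 0.5, 0.0]],
]
audio = [1.0, 0.0, 0.0]
heat = sound_heatmap(grid, audio)
```

Because every cell gets a continuous score, the map has soft boundaries — which is exactly the blurred-edge limitation the background section criticizes.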

Method used



Examples


Embodiment Construction

[0039] The present invention proposes a method for locating a sound source from a video, which is described in further detail below with reference to specific embodiments.

[0040] The present invention proposes a method for locating a sound source from a video, comprising the following steps:

[0041] (1) Training stage;

[0042] (1-1) Obtain training samples: obtain J video segments from any source as training samples, each 10 seconds long. There is no special requirement on the content of the training videos beyond that they cover a variety of object categories; the object categories appearing in each training-sample video are manually labeled.

[0043] In the present embodiment, the training-sample videos come from 10 categories of the AudioSet dataset (automobile, motorcycle, helicopter, yacht, speech, dog, cat, pig, alarm clock, guitar); the embodiment selects a total of J = 32469 video clips...
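The sample-preparation step above can be sketched as building a labeled clip index: fixed-length 10-second clips, each carrying a manually assigned category. The category names are taken from the embodiment; the index structure, field names, and random label assignment are toy assumptions, not the patent's data format.

```python
import random

# The 10 AudioSet categories named in the embodiment.
CATEGORIES = ["automobile", "motorcycle", "helicopter", "yacht", "speech",
              "dog", "cat", "pig", "alarm clock", "guitar"]

def make_training_index(num_clips, clip_seconds=10, seed=0):
    """Build a toy index of labeled clips: each entry records a clip id,
    the fixed 10-second duration, and a category label (here chosen at
    random to stand in for the manual annotation)."""
    rng = random.Random(seed)
    return [{"clip_id": i,
             "duration_s": clip_seconds,
             "label": rng.choice(CATEGORIES)}
            for i in range(num_clips)]

# J = 32469 clips, matching the count used in the embodiment.
index = make_training_index(32469)
```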


PUM

No PUM

Abstract

The invention proposes a method for locating a sound source from a video, belonging to the field of cross-modal learning. In the training phase, the method obtains and preprocesses training-sample videos, constructs a neural network composed of a fully connected layer and a localization network, and trains this network on the preprocessed samples to obtain a trained sound-source localization network. In the test phase, a test video is obtained, preprocessed, and fed to the trained network to compute a similarity score; the similarity is used both to synchronize the sound with the video images and to localize the sound source on the synchronized frames, thereby solving the sound-source localization problem for out-of-sync video. The invention automatically discovers the correspondence between each object in the video frame and the sound, achieves high localization accuracy, and has high application value.
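The test-phase logic the abstract describes — use the audio-visual similarity both to decide whether a frame is synchronized and to localize the source — can be sketched as follows. The threshold value, function names, and toy embeddings are assumptions for illustration; the patent's actual network computes its similarity internally.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def localize_if_synchronized(audio_vec, visual_cells, threshold=0.5):
    """Score every spatial cell against the audio embedding. If the best
    score clears the (assumed) threshold, the frame is treated as
    synchronized and the best cell index is returned; otherwise None,
    so out-of-sync frames produce no localization output."""
    scores = [cosine(audio_vec, cell) for cell in visual_cells]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= threshold else None

cells = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # two toy spatial cells
in_sync = localize_if_synchronized([1.0, 0.0, 0.0], cells)   # matches cell 0
out_sync = localize_if_synchronized([0.0, 0.0, 1.0], cells)  # matches nothing
```

The key design point mirrored here is that a single similarity score serves two purposes: gating (synchronization check) and localization (arg-max over positions).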

Description

technical field [0001] The invention proposes a method for locating a sound source from a video, which belongs to the field of cross-modal learning. Background technique [0002] In recent years, with the spread of the Internet and television, people encounter more and more video clips. Videos contain rich sounds and images, and finding the connections between them is meaningful in many ways, for example in making human-machine interaction friendlier. Automatically discovering the correspondence between the objects in a video frame and the sounds is becoming increasingly important, as it helps people quickly identify the sounding parts of a video. A robot can also determine the location of a target in scenarios such as rescue by locating the sound source in the video. [0003] At present, most existing work on video sound-source localization operates at the pixel level, using convolutional neural networks to learn the relationsh...

Claims


Application Information

Patent Timeline
no application
Patent Type & Authority: Patent (China)
IPC(8): G06K9/00; G06N3/04; G06N3/08; G10L25/30
CPC: G06N3/08; G10L25/30; G06V20/41; G06V2201/07; G06N3/045
Inventors: 刘华平 (Liu Huaping), 王峰 (Wang Feng), 郭迪 (Guo Di), 周峻峰 (Zhou Junfeng), 孙富春 (Sun Fuchun)
Owner TSINGHUA UNIV