An adversarial double-contrast self-supervised learning method for cross-modal lip reading

A self-supervised and adversarial learning technology, applied in the field of image processing, which solves problems such as dependence on the validity of manually selected negative samples and the neglect of additional self-supervised signals, and achieves the effect of optimized representation learning

Active Publication Date: 2021-10-01
NAT UNIV OF DEFENSE TECH

AI Technical Summary

Problems solved by technology

Previous studies in this area adopt a pairwise contrastive strategy that pulls visual embeddings closer to their corresponding audio embeddings and pushes them away from non-corresponding audio embeddings. This approach has two shortcomings. First, such pairwise contrastive learning requires manual selection of negative samples, and its effect largely depends on the validity of those negatives. Second, representation learning relies only on synchronized audio-video data pairs, while other self-supervised signals, such as speaker identity and modality information, could also be used to improve the quality of the learned representations; these signals are usually ignored in previous work.

Method used




Embodiment Construction

[0041] As shown in Figure 1, given a video of a talking mouth and its corresponding audio, a visual encoder and an audio encoder are first introduced to extract the audio-visual (A-V) embeddings. To ensure consistent A-V embeddings, both the audio encoder network and the visual encoder network ingest clips of the same duration, typically 0.2 seconds. Specifically, the input to the audio encoder is 13-dimensional Mel-frequency cepstral coefficients (MFCCs), extracted every 10 ms with a frame length of 25 ms. The input to the visual encoder is 5 consecutive mouth-centered cropped video frames (at 25 fps, so 5 frames span 0.2 seconds).
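As a concrete illustration of this preprocessing, the sketch below prepares one aligned 0.2-second A-V clip. It is not the patent's code: the 16 kHz sample rate, the librosa MFCC call, and all function names are assumptions chosen only to match the stated 13-dimension / 25 ms / 10 ms / 25 fps settings.

```python
# Minimal sketch (assumptions: 16 kHz audio, librosa for MFCCs; all names
# are illustrative, not from the patent) of one aligned 0.2 s A-V clip.
import numpy as np
import librosa

SR = 16_000          # assumed audio sample rate
FPS = 25             # video frame rate: 0.2 s -> 5 frames
CLIP_SEC = 0.2

def audio_clip_to_mfcc(wav: np.ndarray, start_sec: float) -> np.ndarray:
    """13-dim MFCCs with 25 ms frames and a 10 ms hop over one 0.2 s clip."""
    seg = wav[int(start_sec * SR): int((start_sec + CLIP_SEC) * SR)]
    mfcc = librosa.feature.mfcc(
        y=seg, sr=SR, n_mfcc=13,
        n_fft=int(0.025 * SR),       # 25 ms frame length
        hop_length=int(0.010 * SR),  # extracted every 10 ms
    )
    return mfcc.T                    # (time, 13)

def video_clip_frames(frames: np.ndarray, start_sec: float) -> np.ndarray:
    """5 consecutive mouth-centered crops covering the same 0.2 s window."""
    i = int(start_sec * FPS)
    return frames[i: i + int(CLIP_SEC * FPS)]   # (5, H, W, C)
```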

[0042] To learn effective visual representations for lip reading, three pretext tasks are introduced. The dual contrastive learning objectives aim to bring the visual embeddings closer to the corresponding audio embeddings on both short and long time scales. The adversarial learning objectives make the learned embeddings independent of modality information and speaker identity.
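The abstract below states that noise-contrastive estimation is the training objective for the contrastive part. As a hedged illustration, the following InfoNCE-style sketch uses in-batch negatives and cosine similarity; the temperature value and all names are assumptions rather than the patent's implementation. Applying the same loss once to clip-level embeddings and once to temporally pooled embeddings would give the short- and long-time-scale terms.

```python
# Minimal InfoNCE-style sketch (PyTorch; in-batch negatives and the
# temperature are assumptions) of a synchrony-based contrastive objective.
import torch
import torch.nn.functional as F

def nce_sync_loss(v_emb: torch.Tensor, a_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """v_emb, a_emb: (B, D) embeddings of synchronized visual/audio clips."""
    v = F.normalize(v_emb, dim=1)
    a = F.normalize(a_emb, dim=1)
    logits = v @ a.t() / temperature               # (B, B) similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Diagonal entries are real (synchronized) pairs; off-diagonal entries
    # act as noise samples that the objective learns to score lower.
    return F.cross_entropy(logits, targets)
```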



Abstract

The present invention proposes an adversarial double-contrast self-supervised learning method for cross-modal lip reading, which comprises a visual encoder, an audio encoder, two multi-scale temporal convolutional networks with average pooling, an identity discriminator, and a modality classifier. The method learns effective visual representations by combining audio-visual synchrony-based dual contrastive learning, identity adversarial training, and modality adversarial training. In dual contrastive learning, noise-contrastive estimation is used as the training objective to distinguish real samples from noise samples. In the adversarial training, an identity discriminator and a modality classifier are introduced for the audio-visual representations: the identity discriminator judges whether the input visual features share a common identity, and the modality classifier predicts whether the input features belong to the visual modality or the audio modality. Both are then used for adversarial training through a momentum gradient reversal layer.
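The gradient reversal layer is what turns the discriminator and classifier losses into adversarial training signals for the encoders. The sketch below is one plausible reading, not the patent's code: the forward pass is the identity, the backward pass negates and scales gradients, and the "momentum" is interpreted here as exponential smoothing of the reversal coefficient, which is an assumption since the abstract does not spell the mechanism out.

```python
# Gradient reversal layer sketch in PyTorch. The "momentum" behavior is an
# assumed interpretation: the reversal coefficient is smoothed across steps.
import torch

class _GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)                      # identity forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None      # reversed, scaled gradient

class MomentumGRL(torch.nn.Module):
    def __init__(self, lam: float = 1.0, momentum: float = 0.9):
        super().__init__()
        self.momentum = momentum
        self.lam = lam                           # smoothed reversal strength

    def update(self, target_lam: float) -> None:
        # Exponential-moving-average update of the reversal coefficient
        # (hypothetical reading of "momentum" in the abstract above).
        self.lam = self.momentum * self.lam + (1 - self.momentum) * target_lam

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return _GradReverse.apply(x, self.lam)
```

Placing this layer between the shared encoders and the identity discriminator (or the modality classifier) makes the encoders maximize the loss that the discriminator minimizes, driving the learned features toward identity- and modality-invariance.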

Description

Technical Field

[0001] The invention belongs to the field of image processing, and in particular relates to an adversarial double-contrast self-supervised learning method for cross-modal lip reading.

Background Technique

[0002] Supervised deep learning has made revolutionary progress in many fields such as image classification, object detection and segmentation, speech recognition, and machine translation. Although supervised learning has made remarkable progress in the past few years, its success largely relies on large amounts of human-annotated training data. However, for some specific tasks, such as lip reading, the cost of annotation can be prohibitively expensive. In recent years, self-supervised learning has attracted increasing attention due to its label efficiency and good generalization ability. Self-supervised learning methods have shown great potential in natural language processing, computer vision, and cross-modal representation learning.

[0003]...


Application Information

Patent Type & Authority: Patent (China)
IPC(8): G06K9/00, G06K9/62, G06N3/08, G10L15/06, G10L15/16, G10L15/25
CPC: G06N3/084, G10L15/063, G10L15/16, G10L15/25, G06V40/20, G06F18/22, G06F18/214, G06F18/24
Inventors: 张雪毅, 刘丽, 常冲, 刘忠, 龙云利
Owner: NAT UNIV OF DEFENSE TECH