An adversarial double-contrast self-supervised learning method for cross-modal lip reading

A self-supervised and adversarial learning technology, applied in the field of image processing, which solves problems such as dependence on the validity of manually selected negative samples and the neglect of additional self-supervised signals, and achieves the effect of optimized representation learning

Active Publication Date: 2021-10-01
NAT UNIV OF DEFENSE TECH

AI Technical Summary

Problems solved by technology

Previous studies in this area adopt a pairwise contrastive strategy that pulls visual embeddings closer to their corresponding audio embeddings and pushes them away from non-corresponding audio embeddings. This approach has two shortcomings. First, such pairwise contrastive learning requires manual selection of negative samples, and its effect largely depends on the validity of those negatives. Second, representation learning relies only on synchronized audio-video data pairs, while other self-supervised signals, such as speaker identity and modality information, could also be used to improve the quality of the learned representations; these signals are usually ignored in previous work.

Method used




Embodiment Construction

[0041] As shown in Figure 1, given a video of a talking mouth and its corresponding audio, a visual encoder and an audio encoder are first introduced to extract the audio-visual (A-V) embeddings. To ensure consistent A-V embeddings, both the audio encoder network and the visual encoder network ingest clips of the same duration, typically 0.2 seconds. Specifically, the input to the audio encoder is 13-dimensional Mel-frequency cepstral coefficients (MFCCs), extracted every 10 ms with a frame length of 25 ms. The input to the visual encoder is 5 consecutive mouth-centered cropped video frames (at 25 fps, so 5 frames span 0.2 seconds).
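As a concrete illustration of this preprocessing, the sketch below prepares one aligned 0.2-second A-V clip. It is not the patent's code: the 16 kHz sample rate, the librosa MFCC call, and all function names are assumptions chosen only to match the stated 13-dimension / 25 ms / 10 ms / 25 fps settings.

```python
# Minimal sketch (assumptions: 16 kHz audio, librosa for MFCCs; all names
# are illustrative, not from the patent) of one aligned 0.2 s A-V clip.
import numpy as np
import librosa

SR = 16_000          # assumed audio sample rate
FPS = 25             # video frame rate: 0.2 s -> 5 frames
CLIP_SEC = 0.2

def audio_clip_to_mfcc(wav: np.ndarray, start_sec: float) -> np.ndarray:
    """13-dim MFCCs with 25 ms frames and a 10 ms hop over one 0.2 s clip."""
    seg = wav[int(start_sec * SR): int((start_sec + CLIP_SEC) * SR)]
    mfcc = librosa.feature.mfcc(
        y=seg, sr=SR, n_mfcc=13,
        n_fft=int(0.025 * SR),       # 25 ms frame length
        hop_length=int(0.010 * SR),  # extracted every 10 ms
    )
    return mfcc.T                    # (time, 13)

def video_clip_frames(frames: np.ndarray, start_sec: float) -> np.ndarray:
    """5 consecutive mouth-centered crops covering the same 0.2 s window."""
    i = int(start_sec * FPS)
    return frames[i: i + int(CLIP_SEC * FPS)]   # (5, H, W, C)
```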

[0042] To learn effective visual representations for lip reading, three pretext tasks are introduced. The dual contrastive learning objectives aim to bring the visual embeddings closer to the corresponding audio embeddings on both short and long time scales. The adversarial learning objectives make the learned embeddings independent of modality information and speaker identity.
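The abstract below states that noise-contrastive estimation is the training objective for the contrastive part. As a hedged illustration, the following InfoNCE-style sketch uses in-batch negatives and cosine similarity; the temperature value and all names are assumptions rather than the patent's implementation. Applying the same loss once to clip-level embeddings and once to temporally pooled embeddings would give the short- and long-time-scale terms.

```python
# Minimal InfoNCE-style sketch (PyTorch; in-batch negatives and the
# temperature are assumptions) of a synchrony-based contrastive objective.
import torch
import torch.nn.functional as F

def nce_sync_loss(v_emb: torch.Tensor, a_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """v_emb, a_emb: (B, D) embeddings of synchronized visual/audio clips."""
    v = F.normalize(v_emb, dim=1)
    a = F.normalize(a_emb, dim=1)
    logits = v @ a.t() / temperature               # (B, B) similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Diagonal entries are real (synchronized) pairs; off-diagonal entries
    # act as noise samples that the objective learns to score lower.
    return F.cross_entropy(logits, targets)
```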



Abstract

The present invention proposes an adversarial double-contrast self-supervised learning method for cross-modal lip reading, which comprises a visual encoder, an audio encoder, two multi-scale temporal convolutional networks with average pooling, an identity discriminator, and a modality classifier. The method learns effective visual representations by combining audio-visual synchrony-based dual contrastive learning, identity adversarial training, and modality adversarial training. In dual contrastive learning, noise-contrastive estimation is used as the training objective to distinguish real samples from noise samples. In the adversarial training, an identity discriminator and a modality classifier are introduced for the audio-visual representations: the identity discriminator judges whether the input visual features share a common identity, and the modality classifier predicts whether the input features belong to the visual modality or the audio modality. Both are then used for adversarial training through a momentum gradient reversal layer.
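The gradient reversal layer is what turns the discriminator and classifier losses into adversarial training signals for the encoders. The sketch below is one plausible reading, not the patent's code: the forward pass is the identity, the backward pass negates and scales gradients, and the "momentum" is interpreted here as exponential smoothing of the reversal coefficient, which is an assumption since the abstract does not spell the mechanism out.

```python
# Gradient reversal layer sketch in PyTorch. The "momentum" behavior is an
# assumed interpretation: the reversal coefficient is smoothed across steps.
import torch

class _GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)                      # identity forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None      # reversed, scaled gradient

class MomentumGRL(torch.nn.Module):
    def __init__(self, lam: float = 1.0, momentum: float = 0.9):
        super().__init__()
        self.momentum = momentum
        self.lam = lam                           # smoothed reversal strength

    def update(self, target_lam: float) -> None:
        # Exponential-moving-average update of the reversal coefficient
        # (hypothetical reading of "momentum" in the abstract above).
        self.lam = self.momentum * self.lam + (1 - self.momentum) * target_lam

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return _GradReverse.apply(x, self.lam)
```

Placing this layer between the shared encoders and the identity discriminator (or the modality classifier) makes the encoders maximize the loss that the discriminator minimizes, driving the learned features toward identity- and modality-invariance.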

Description

Technical Field

[0001] The invention belongs to the field of image processing, and in particular relates to an adversarial double-contrast self-supervised learning method for cross-modal lip reading.

Background Technique

[0002] Supervised deep learning has made revolutionary progress in many fields such as image classification, object detection and segmentation, speech recognition, and machine translation. Although supervised learning has made remarkable progress in the past few years, its success largely relies on large amounts of human-annotated training data. However, for some specific tasks, such as lip reading, the cost of annotation can be prohibitively expensive. In recent years, self-supervised learning has attracted increasing attention due to its label efficiency and good generalization ability. Self-supervised learning methods have shown great potential in natural language processing, computer vision, and cross-modal representation learning.

[0003]...


Application Information

Patent Type & Authority: Patent (China)
IPC(8): G06K9/00, G06K9/62, G06N3/08, G10L15/06, G10L15/16, G10L15/25
CPC: G06N3/084, G10L15/063, G10L15/16, G10L15/25, G06V40/20, G06F18/22, G06F18/214, G06F18/24
Inventors: 张雪毅, 刘丽, 常冲, 刘忠, 龙云利
Owner: NAT UNIV OF DEFENSE TECH