Voice-driven 3D virtual human expression voice and picture synchronization method and system based on deep learning

A deep-learning technology for audio-video synchronization, applied in the fields of speech recognition, computer graphics, computer vision, and speech synthesis, with good scalability.

Pending Publication Date: 2020-11-27
超维视界(北京)传媒科技有限公司

AI Technical Summary

Problems solved by technology

[0008] In order to overcome the problems that existing virtual humans lack naturalness in synchronizing facial expressions with audio and video, lack real-time interaction capability, and lack the ability to learn so as to improve the synchronization of facial expressions, audio and video…




Embodiment Construction

[0047] In order to make the above objects, features and advantages of the present invention more comprehensible, the present invention will be further described in detail below through specific embodiments and accompanying drawings.

[0048] The deep-learning-based voice-driven 3D virtual human facial expression audio-video synchronization system of the present invention includes a video analysis module, a parameter extraction module, a speech synthesis module, a speech signal processing module, a parameter prediction module, a parameter filtering module and a rendering module. The modules are divided into two groups, used in a training mode and a working mode respectively. The modules used in the training mode are the video analysis module, parameter extraction module, speech signal processing module and parameter prediction module. The modules used in the working mode are the speech synthesis module, speech signal processing module, parameter prediction module, parameter filtering module and rendering module.
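To make the division of labor between the two modes concrete, here is a minimal structural sketch in Python. Every function name, array shape, and the Blend Shape count are hypothetical stand-ins (the patent publishes no API); the stubs exist only so the control flow of both modes reads and runs end to end. A fuller sketch of the signal-processing steps follows the Abstract below.

```python
import numpy as np

N_BLENDSHAPES = 52  # assumption: a typical Blend Shape count, not from the patent

def video_analysis(video):
    # Training mode: split a clip into image frames and the audio track.
    return video["frames"], video["audio"]

def parameter_extraction(frames):
    # Training mode: per-frame Blend Shape weights from face images (stub).
    return np.zeros((len(frames), N_BLENDSHAPES))

def speech_signal_processing(audio):
    # Both modes: log-amplitude spectrum features (stub here; see the
    # fuller sketch after the Abstract).
    return np.zeros((max(len(audio) // 256, 1), 257))

def speech_synthesis(text):
    # Working mode: TTS placeholder returning 1 s of silence at 16 kHz.
    return np.zeros(16000)

def parameter_filtering(params):
    # Working mode: temporal smoothing of predicted parameters (stub).
    return params

def rendering(params, audio):
    # Working mode: placeholder for the 3D renderer.
    return {"blendshapes": params, "audio": audio}

def run_training_mode(video, predictor):
    """Training mode: video analysis, parameter extraction, speech signal
    processing, parameter prediction model fitting. Assumes image frames
    and audio feature frames are time-aligned one to one."""
    frames, audio = video_analysis(video)
    predictor.fit(speech_signal_processing(audio), parameter_extraction(frames))
    return predictor

def run_working_mode(text, predictor):
    """Working mode: speech synthesis, speech signal processing, parameter
    prediction, parameter filtering, rendering."""
    audio = speech_synthesis(text)
    raw = predictor.predict(speech_signal_processing(audio))
    return rendering(parameter_filtering(raw), audio)
```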



Abstract

The invention relates to a deep-learning-based voice-driven 3D virtual human expression sound and picture synchronization method and system. The method comprises the following steps: extracting the logarithmic amplitude spectrum of a voice signal as the voice signal feature; inputting the voice signal feature into a trained parameter prediction model, which outputs expression parameter values, wherein the parameter prediction model is a neural network model trained on the naturally labeled pairs of voice signals and image signals in video data; filtering the expression parameter values output by the parameter prediction model; and rendering a 3D figure model with the filtered expression parameter values to realize sound and picture synchronization of the 3D virtual human's expressions. The system comprises a video analysis module, a parameter extraction module, a voice synthesis module, a voice signal processing module, a parameter prediction module, a parameter filtering module and a rendering module. By learning from a large amount of face video data, the invention improves the lip movement of the virtual human, making it more natural and human-like.
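The two signal-processing steps named in the abstract can be sketched concretely. The sketch below is a minimal interpretation in Python: it assumes an STFT front end for the log-amplitude spectrum and a moving-average smoother for the filtering step; the patent fixes neither the window and hop sizes nor the filter design, so all numeric parameters here are illustrative.

```python
import numpy as np

def log_amplitude_spectrum(audio, n_fft=512, hop=160):
    """Per-frame log-amplitude spectrum of a mono waveform.
    n_fft and hop are assumed values, not taken from the patent."""
    window = np.hanning(n_fft)
    frames = np.lib.stride_tricks.sliding_window_view(audio, n_fft)[::hop]
    magnitude = np.abs(np.fft.rfft(frames * window, axis=-1))
    return np.log(magnitude + 1e-8)   # epsilon guards against log(0)

def filter_expression_params(params, k=5):
    """Smooth per-frame expression parameters over time with a moving
    average to suppress frame-to-frame jitter before rendering. The
    patent only says 'filtering'; this is one plausible design."""
    kernel = np.ones(k) / k
    return np.apply_along_axis(
        lambda track: np.convolve(track, kernel, mode="same"), 0, params)

# Example: 1 s of 16 kHz audio -> a (frames, 257) feature matrix
features = log_amplitude_spectrum(np.random.randn(16000))
```

A moving average of length k adds roughly k/2 frames of latency, so a real-time system would trade smoothing strength against responsiveness.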

Description

Technical Field

[0001] The invention relates to the fields of computer graphics, computer vision, speech recognition and speech synthesis, and specifically to a method and system that use a deep neural network to fit the relationship between speech and 3D model Blend Shape values, thereby realizing voice-driven synchronization of a 3D virtual human's expressions, sound and picture. A minimal sketch of such a network follows the background discussion below.

Background Technique

[0002] At present, there are several types of voice-driven methods for generating virtual human facial animation:

[0003] (1) Speech generates, through a neural network, the vertex coordinates of a 3D model with a fixed topology; these vertex coordinates can drive facial animation on the DI4D PRO system.

[0004] (2) Speech drives the avatar through an adversarial network that generates different 2D images, which are views of a 3D model from different angles.

[0005] (3) Speech is split into phonemes, and each phoneme corresponds to an animation clip…
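The core of the method, per paragraph [0001], is a deep neural network fitted to the mapping from speech features to 3D model Blend Shape values. The patent does not disclose the architecture, so the sketch below assumes a small feed-forward PyTorch network mapping one log-spectrum frame to a vector of Blend Shape weights, trained with a mean-squared-error loss on the naturally paired data described in the abstract.

```python
import torch
import torch.nn as nn

class SpeechToBlendShape(nn.Module):
    """Hypothetical parameter prediction model: one log-amplitude-spectrum
    frame in, one vector of Blend Shape weights out. Layer sizes and the
    sigmoid output range are assumptions, not disclosed in the patent."""
    def __init__(self, n_bins=257, n_blendshapes=52, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_blendshapes), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

# Training uses the "natural label pairs" described in the abstract:
# spectra from the audio track as inputs, Blend Shape values extracted
# from the image track as regression targets.
model = SpeechToBlendShape()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

spectra = torch.randn(8, 257)   # toy batch: 8 aligned audio/image frame pairs
targets = torch.rand(8, 52)

optimizer.zero_grad()
loss = loss_fn(model(spectra), targets)
loss.backward()
optimizer.step()
```

In working mode the same network is fed features computed from synthesized speech, so the mapping learned from natural video carries over to TTS audio.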


Application Information

IPC(8): G06T13/40; G06T15/00; G06N3/04; G10L13/02; G11B27/10
CPC: G06T13/40; G06T15/005; G10L13/02; G11B27/10; G06N3/045
Inventors: 梁宏华, 彭超
Owner: 超维视界(北京)传媒科技有限公司