Deep-learning-based method and system for synchronizing voice-driven 3D virtual human expressions with sound and picture

An audio-visual synchronization technique based on deep learning, applied in the fields of speech recognition, computer graphics, computer vision, and speech synthesis, with the stated aim of good scalability.
CN112001992A · Pending · Publication Date: 2020-11-27 · 超维视界(北京)传媒科技有限公司

Patent Information

Authority / Receiving Office
CN · China
Current Assignee / Owner
超维视界(北京)传媒科技有限公司
Publication Date
2020-11-27


Abstract

The invention relates to a deep-learning-based method and system for synchronizing the facial expressions of a voice-driven 3D virtual human with sound and picture. The method comprises the following steps: extracting the logarithmic magnitude spectrum of a speech signal as the speech feature; inputting the speech feature into a trained parameter prediction model, which outputs expression parameter values, the parameter prediction model being a neural network trained on the naturally paired speech and image signals in video data; filtering the expression parameter values output by the parameter prediction model; and rendering a 3D character model with the filtered expression parameter values, so that the 3D virtual character's expressions are synchronized with the sound. The system comprises a video analysis module, a parameter extraction module, a speech synthesis module, a speech signal processing module, a parameter prediction module, a parameter filtering module and a rendering module. By learning from a large amount of face video data, the invention makes the virtual human's lip movements more natural and human-like.
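
The feature-extraction and filtering steps described above can be sketched as follows. This is only an illustration under stated assumptions, not the patent's implementation: the text visible here does not give the frame length, hop size, window type, or the kind of filter, so the Hann window, the 512/256 framing, and the moving-average filter are assumptions, and the function names log_magnitude_spectrum and smooth_parameters are hypothetical. The trained parameter prediction model is not shown; random values stand in for its output.

```python
import numpy as np


def log_magnitude_spectrum(signal, frame_len=512, hop=256, eps=1e-8):
    """Return the log-magnitude spectrum of each frame of a mono signal.

    frame_len, hop and the Hann window are assumed values; the source
    does not specify them. Output shape: (n_frames, frame_len // 2 + 1).
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        # Zero-pad very short inputs so that at least one frame exists.
        signal = np.pad(signal, (0, frame_len - signal.size))
    n_frames = 1 + (signal.size - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [signal[i * hop: i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # eps avoids log(0) on silent frames.
    return np.log(magnitude + eps)


def smooth_parameters(params, width=5):
    """Moving-average filter over time for each expression parameter.

    params has shape (n_frames, n_parameters). A moving average is an
    assumed stand-in for the unspecified filter in the source.
    """
    params = np.asarray(params, dtype=np.float64)
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    pad = width // 2
    # Repeat the edge frames so the output keeps the input length.
    padded = np.pad(params, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(width) / width
    return np.stack(
        [np.convolve(padded[:, j], kernel, mode="valid")
         for j in range(params.shape[1])],
        axis=1,
    )


# Usage: one second of a 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
features = log_magnitude_spectrum(np.sin(2 * np.pi * 1000 * t))

# Placeholder for the model output: one row of expression parameter
# values per audio frame (8 parameters is an arbitrary choice).
rng = np.random.default_rng(0)
predicted = rng.uniform(0.0, 1.0, size=(features.shape[0], 8))
filtered = smooth_parameters(predicted, width=5)
print(features.shape, filtered.shape)
```

Filtering along the time axis reduces frame-to-frame jitter in the predicted values before they reach the renderer; a wider window gives smoother motion at the cost of slower response to fast mouth movements.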

Description

Technical field

[0001] The invention relates to the fields of computer graphics, computer vision, speech recognition, speech synthesis, etc., and specifically to a method and system that use a deep neural network to fit the relationship between speech and the Blend Shape values of a 3D model, thereby synchronizing the expressions of a voice-driven 3D virtual human with sound and picture.

Background art

[0002] At present, there are several types of voice-driven methods for generating virtual human facial animations:

[0003] (1) A neural network maps speech to the vertex coordinates of a 3D model with a fixed topology; these vertex coordinates can be displayed as facial animation on the DI4D PRO system.

[0004] (2) Speech drives the avatar through a generative adversarial network to generate different 2D images, which show a 3D model from different angles.

[0005] (3) Speech is split by phonemes, and each phoneme corresponds to an animation cl...
