Semantic-based audio-driven digital human generation method and system

A semantic, audio-driven technology in the field of machine learning. It addresses problems of prior methods, namely their significant limitations, errors in locating facial feature points, and failure to account for individual differences in speech, with the effect of improving the viewing experience.

Pending Publication Date: 2021-03-26
新华智云科技有限公司 (Xinhua Zhiyun Technology Co., Ltd.)

Problems solved by technology

[0005] Since facial movement during speech is a very delicate and complex process, feature point coordinates can only roughly represent facial movement, the localization of facial feature points is error-prone, and facial movement during speech varies from person to person. The prior method obtains motion parameters from the difference Vel between the feature point coordinates and the standard frame coordinates, together with the corresponding scale reference P on the face, without considering individual differences in speech; yet tone, language, and speaking rate are all related to facial movement, so the method has significant limitations.

Examples

Embodiment 1

[0050] Embodiment 1. A semantic-based audio-driven digital human generation method, as shown in Figure 1, comprising the following steps:

[0051] S100. Obtain a target audio and a target face image sequence, and mask the mouth area of each target face image in the target face image sequence to obtain a corresponding first face image sequence;

[0052] Masking the mouth area of each target face image yields a corresponding face image to be rendered; the face images to be rendered, in one-to-one correspondence with the target face images, constitute the first face image sequence.
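As a rough illustration of S100, the sketch below zeroes out a fixed lower-face rectangle in each frame. The rectangle coordinates, the function names, and the use of NumPy are all assumptions for illustration; the patent does not specify how the mouth area is located (a landmark detector would be the more robust choice).

```python
import numpy as np

def mask_mouth_region(face: np.ndarray, box=(0.25, 0.55, 0.75, 0.95)) -> np.ndarray:
    """Zero out a rectangular mouth region of a roughly aligned face image.

    `box` is (x0, y0, x1, y1) in relative coordinates; the default
    lower-center rectangle is a heuristic stand-in for landmark-based
    mouth localization, which the patent does not specify.
    """
    h, w = face.shape[:2]
    x0, y0 = int(box[0] * w), int(box[1] * h)
    x1, y1 = int(box[2] * w), int(box[3] * h)
    masked = face.copy()
    masked[y0:y1, x0:x1] = 0  # occlude the mouth area
    return masked

def build_first_face_sequence(target_frames):
    # One masked "face image to be rendered" per target face image.
    return [mask_mouth_region(f) for f in target_frames]
```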

[0053] S200. Perform feature extraction on the target audio to obtain corresponding audio features;
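The excerpt does not name the audio features extracted in S200. A common choice in frame-synchronous talking-head pipelines is a log-mel spectrogram with one feature column per video frame; the sketch below assumes that choice, and the sample rate, frame rate, and mel-bin count are likewise illustrative.

```python
import librosa
import numpy as np

def extract_audio_features(wav_path: str, fps: int = 25,
                           sr: int = 16000, n_mels: int = 80) -> np.ndarray:
    """Return one log-mel feature vector per video frame.

    The hop length is chosen so feature frames align with the video
    frame rate; all concrete settings here are assumptions, not taken
    from the patent.
    """
    audio, _ = librosa.load(wav_path, sr=sr)
    hop = sr // fps  # one spectrogram column per video frame
    mel = librosa.feature.melspectrogram(y=audio, sr=sr,
                                         hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel).T  # shape: (num_frames, n_mels)
```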

[0054] S300. Input the audio features into a pre-trained semantic conversion network, which performs semantic conversion on the audio features to obtain a corresponding semantic motion sequence, the semantic motion sequence including several mouth semantic maps;
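The excerpt does not disclose the architecture of the semantic conversion network. As a hedged sketch of the idea — a window of audio features in, one mouth semantic map out — a small recurrent encoder followed by a convolutional decoder might look like this; every layer choice, size, and the four-class label set are assumptions:

```python
import torch
import torch.nn as nn

class SemanticConversionNet(nn.Module):
    """Illustrative stand-in for the patent's semantic conversion network.

    Maps a window of per-frame audio features to one mouth semantic map,
    here a 4-class (e.g. background / lips / teeth / inner mouth) logit
    grid of size 64x64. None of these choices come from the patent.
    """
    def __init__(self, n_mels: int = 80, hidden: int = 256, n_classes: int = 4):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 128 * 4 * 4)
        self.decoder = nn.Sequential(             # 4x4 -> 16x16 -> 64x64
            nn.ConvTranspose2d(128, 64, 4, stride=4), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=4), nn.ReLU(),
            nn.Conv2d(32, n_classes, 3, padding=1),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, n_mels) -> (batch, n_classes, 64, 64)
        _, h = self.encoder(audio_feats)
        x = self.fc(h[-1]).view(-1, 128, 4, 4)
        return self.decoder(x)
```

Running the network over a sliding window of audio features, one window per video frame, would yield the semantic motion sequence of mouth semantic maps described in [0054].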

Embodiment 2

[0116] Embodiment 2. A semantic-based audio-driven digital human generation system, as shown in Figure 4, comprising:

[0117] The data acquisition module 100 is configured to obtain the target audio and the target face image sequence, and to mask the mouth area of each target face image in the target face image sequence to obtain the corresponding first face image sequence;

[0118] The feature extraction module 200 is used to perform feature extraction on the target audio to obtain corresponding audio features;

[0119] The semantic conversion module 300 is used to input the audio features into a pre-trained semantic conversion network, which performs semantic conversion on the audio features to obtain a corresponding semantic motion sequence, the semantic motion sequence including several mouth semantic maps;

[0120] The composite rendering module 400 is configured to construct a second human face image sequence based on the ...
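Paragraph [0120] is truncated here, but the abstract describes the final stage as synthesizing faces from the mouth semantic maps and the masked, to-be-rendered images. Below is a minimal compositing sketch under that reading; `rendered_mouth` stands in for whatever the renderer produces from a mouth semantic map, and both function names are illustrative:

```python
import numpy as np

def composite_face(masked_face: np.ndarray,
                   rendered_mouth: np.ndarray,
                   mouth_mask: np.ndarray) -> np.ndarray:
    """Paste a rendered mouth region back into a masked face image.

    `mouth_mask` is the binary (H, W) mask from S100; only the masked
    region is taken from `rendered_mouth`.
    """
    m = mouth_mask[..., None].astype(masked_face.dtype)  # (H, W, 1)
    return masked_face * (1 - m) + rendered_mouth * m

def synthesize_sequence(masked_frames, rendered_mouths, mouth_mask):
    # One composited frame per mouth semantic map, forming the
    # synthesized face sequence described in the abstract.
    return [composite_face(f, r, mouth_mask)
            for f, r in zip(masked_frames, rendered_mouths)]
```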

Embodiment 3

[0128] Embodiment 3. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method described in Embodiment 1.

Abstract

The invention discloses a semantic-based audio-driven digital human generation method and system. The generation method comprises the following steps: obtaining a target audio and a first human face image sequence; performing feature extraction on the target audio to obtain corresponding audio features; inputting the audio features into a pre-trained semantic conversion network, which performs semantic conversion on the audio features to obtain a corresponding semantic motion sequence comprising a plurality of mouth semantic maps; and acquiring face images to be rendered, equal in number to the mouth semantic maps, from the first face image sequence, shielding the mouth areas of the face images to be rendered, performing face synthesis based on the mouth semantic maps and the face images to be rendered, and generating a synthesized face sequence. The invention realizes conversion between audio and facial semantics through the semantic conversion network, and uses the facial semantics to achieve accurate expression of mouth shapes.

Description

Technical field

[0001] The invention relates to the field of machine learning, in particular to a semantic-based audio-driven digital human generation method and system.

Background technique

[0002] Videos of digital human speaking movements generated by audio driving are widely used in video sharing scenarios such as news broadcasting, training sharing, and advertising;

[0003] Publication number CN1032188842 describes a method for voice-synchronously driving three-dimensional face mouth-shape and facial-gesture animation. It extracts, for each consonant in the video frames, mouth-shape feature parameters and facial-gesture feature parameters based on the MPEG-4 definitions, then calculates the difference Vel between the coordinates of each feature point and the coordinates of the standard frame, calculates the corresponding scale reference P on the face as defined by MPEG-4, and computes the face motion pa...
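The prior-art computation is cut off in this excerpt, but MPEG-4 facial animation parameters are conventionally expressed as feature-point displacements normalized by a facial scale unit, so one plausible (not source-confirmed) reading of the Vel/P step is:

```latex
% Speculative reconstruction: motion parameter of feature point i as its
% displacement from the standard (neutral) frame, normalized by the
% MPEG-4 scale reference P.
M_i = \frac{\mathrm{Vel}_i}{P}, \qquad
\mathrm{Vel}_i = \left\lVert c_i - c_i^{\mathrm{std}} \right\rVert
```

where $c_i$ is the coordinate of feature point $i$ in the current frame and $c_i^{\mathrm{std}}$ its coordinate in the standard frame.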

Application Information

Patent Type & Authority: Application (China)
IPC (8): G10L21/10; G10L15/18; G10L15/16; G10L15/02; G10L25/57; G06K9/00; G06T15/00; G06N3/04; G06N3/08
CPC: G10L21/10; G10L15/1822; G10L15/16; G10L15/02; G10L25/57; G06T15/005; G06N3/08; G10L2021/105; G06V40/171; G06V40/161; G06N3/045
Inventor: 王涛 (Wang Tao); 徐常亮 (Xu Changliang)
Owner: 新华智云科技有限公司 (Xinhua Zhiyun Technology Co., Ltd.)