Semantic-based audio-driven digital human generation method and system

A semantic, audio-driven technology in the field of machine learning. It addresses problems of prior methods, namely their significant limitations, errors in locating facial feature points, and failure to account for individual differences in speech, with the effect of improving the viewing experience.

Pending Publication Date: 2021-03-26
新华智云科技有限公司 (Xinhua Zhiyun Technology Co., Ltd.)

Problems solved by technology

[0005] Since facial movement during speech is a very delicate and complex process, feature point coordinates can only roughly represent facial movement, the localization of facial feature points is error-prone, and facial movement during speech varies from person to person. The prior method obtains motion parameters from the difference Vel between the feature point coordinates and the standard frame coordinates, together with the corresponding scale reference P on the face, without considering individual differences in speech; yet tone, language, and speaking rate are all related to facial movement, so the method has significant limitations.

Examples

Embodiment 1

[0050] Embodiment 1. A semantic-based audio-driven digital human generation method, as shown in Figure 1, comprising the following steps:

[0051] S100. Obtain a target audio and a target face image sequence, and mask the mouth area of each target face image in the target face image sequence to obtain a corresponding first face image sequence;

[0052] Masking the mouth area of each target face image yields a corresponding face image to be rendered; the face images to be rendered, in one-to-one correspondence with the target face images, constitute the first face image sequence.
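As a rough illustration of S100, the sketch below zeroes out a fixed lower-face rectangle in each frame. The rectangle coordinates, the function names, and the use of NumPy are all assumptions for illustration; the patent does not specify how the mouth area is located (a landmark detector would be the more robust choice).

```python
import numpy as np

def mask_mouth_region(face: np.ndarray, box=(0.25, 0.55, 0.75, 0.95)) -> np.ndarray:
    """Zero out a rectangular mouth region of a roughly aligned face image.

    `box` is (x0, y0, x1, y1) in relative coordinates; the default
    lower-center rectangle is a heuristic stand-in for landmark-based
    mouth localization, which the patent does not specify.
    """
    h, w = face.shape[:2]
    x0, y0 = int(box[0] * w), int(box[1] * h)
    x1, y1 = int(box[2] * w), int(box[3] * h)
    masked = face.copy()
    masked[y0:y1, x0:x1] = 0  # occlude the mouth area
    return masked

def build_first_face_sequence(target_frames):
    # One masked "face image to be rendered" per target face image.
    return [mask_mouth_region(f) for f in target_frames]
```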

[0053] S200. Perform feature extraction on the target audio to obtain corresponding audio features;
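The excerpt does not name the audio features extracted in S200. A common choice in frame-synchronous talking-head pipelines is a log-mel spectrogram with one feature column per video frame; the sketch below assumes that choice, and the sample rate, frame rate, and mel-bin count are likewise illustrative.

```python
import librosa
import numpy as np

def extract_audio_features(wav_path: str, fps: int = 25,
                           sr: int = 16000, n_mels: int = 80) -> np.ndarray:
    """Return one log-mel feature vector per video frame.

    The hop length is chosen so feature frames align with the video
    frame rate; all concrete settings here are assumptions, not taken
    from the patent.
    """
    audio, _ = librosa.load(wav_path, sr=sr)
    hop = sr // fps  # one spectrogram column per video frame
    mel = librosa.feature.melspectrogram(y=audio, sr=sr,
                                         hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel).T  # shape: (num_frames, n_mels)
```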

[0054] S300. Input the audio features into a pre-trained semantic conversion network, which performs semantic conversion on the audio features to obtain a corresponding semantic motion sequence, the semantic motion sequence including several mouth semantic maps;
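The excerpt does not disclose the architecture of the semantic conversion network. As a hedged sketch of the idea — a window of audio features in, one mouth semantic map out — a small recurrent encoder followed by a convolutional decoder might look like this; every layer choice, size, and the four-class label set are assumptions:

```python
import torch
import torch.nn as nn

class SemanticConversionNet(nn.Module):
    """Illustrative stand-in for the patent's semantic conversion network.

    Maps a window of per-frame audio features to one mouth semantic map,
    here a 4-class (e.g. background / lips / teeth / inner mouth) logit
    grid of size 64x64. None of these choices come from the patent.
    """
    def __init__(self, n_mels: int = 80, hidden: int = 256, n_classes: int = 4):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 128 * 4 * 4)
        self.decoder = nn.Sequential(             # 4x4 -> 16x16 -> 64x64
            nn.ConvTranspose2d(128, 64, 4, stride=4), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=4), nn.ReLU(),
            nn.Conv2d(32, n_classes, 3, padding=1),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, n_mels) -> (batch, n_classes, 64, 64)
        _, h = self.encoder(audio_feats)
        x = self.fc(h[-1]).view(-1, 128, 4, 4)
        return self.decoder(x)
```

Running the network over a sliding window of audio features, one window per video frame, would yield the semantic motion sequence of mouth semantic maps described in [0054].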

Embodiment 2

[0116] Embodiment 2. A semantic-based audio-driven digital human generation system, as shown in Figure 4, comprising:

[0117] The data acquisition module 100 is configured to obtain the target audio and the target face image sequence, and to mask the mouth area of each target face image in the target face image sequence to obtain the corresponding first face image sequence;

[0118] The feature extraction module 200 is used to perform feature extraction on the target audio to obtain corresponding audio features;

[0119] The semantic conversion module 300 is used to input the audio features into a pre-trained semantic conversion network, which performs semantic conversion on the audio features to obtain a corresponding semantic motion sequence, the semantic motion sequence including several mouth semantic maps;

[0120] The composite rendering module 400 is configured to construct a second human face image sequence based on the ...
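Paragraph [0120] is truncated here, but the abstract describes the final stage as synthesizing faces from the mouth semantic maps and the masked, to-be-rendered images. Below is a minimal compositing sketch under that reading; `rendered_mouth` stands in for whatever the renderer produces from a mouth semantic map, and both function names are illustrative:

```python
import numpy as np

def composite_face(masked_face: np.ndarray,
                   rendered_mouth: np.ndarray,
                   mouth_mask: np.ndarray) -> np.ndarray:
    """Paste a rendered mouth region back into a masked face image.

    `mouth_mask` is the binary (H, W) mask from S100; only the masked
    region is taken from `rendered_mouth`.
    """
    m = mouth_mask[..., None].astype(masked_face.dtype)  # (H, W, 1)
    return masked_face * (1 - m) + rendered_mouth * m

def synthesize_sequence(masked_frames, rendered_mouths, mouth_mask):
    # One composited frame per mouth semantic map, forming the
    # synthesized face sequence described in the abstract.
    return [composite_face(f, r, mouth_mask)
            for f, r in zip(masked_frames, rendered_mouths)]
```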

Embodiment 3

[0128] Embodiment 3. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method described in Embodiment 1.

Abstract

The invention discloses a semantic-based audio-driven digital human generation method and system. The generation method comprises the following steps: obtaining a target audio and a first human face image sequence; performing feature extraction on the target audio to obtain corresponding audio features; inputting the audio features into a pre-trained semantic conversion network, which performs semantic conversion on the audio features to obtain a corresponding semantic motion sequence comprising a plurality of mouth semantic maps; and acquiring face images to be rendered, equal in number to the mouth semantic maps, from the first face image sequence, shielding the mouth areas of the face images to be rendered, performing face synthesis based on the mouth semantic maps and the face images to be rendered, and generating a synthesized face sequence. The invention realizes conversion between audio and facial semantics through the semantic conversion network, and uses the facial semantics to achieve accurate expression of mouth shapes.

Description

Technical field

[0001] The invention relates to the field of machine learning, in particular to a semantic-based audio-driven digital human generation method and system.

Background technique

[0002] Videos of digital human speaking movements generated by audio driving are widely used in video sharing scenarios such as news broadcasting, training sharing, and advertising;

[0003] Publication number CN1032188842 describes a method for voice-synchronously driving three-dimensional face mouth-shape and facial-gesture animation. It extracts, for each consonant in the video frames, mouth-shape feature parameters and facial-gesture feature parameters based on the MPEG-4 definitions, then calculates the difference Vel between the coordinates of each feature point and the coordinates of the standard frame, calculates the corresponding scale reference P on the face as defined by MPEG-4, and computes the face motion pa...
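The prior-art computation is cut off in this excerpt, but MPEG-4 facial animation parameters are conventionally expressed as feature-point displacements normalized by a facial scale unit, so one plausible (not source-confirmed) reading of the Vel/P step is:

```latex
% Speculative reconstruction: motion parameter of feature point i as its
% displacement from the standard (neutral) frame, normalized by the
% MPEG-4 scale reference P.
M_i = \frac{\mathrm{Vel}_i}{P}, \qquad
\mathrm{Vel}_i = \left\lVert c_i - c_i^{\mathrm{std}} \right\rVert
```

where $c_i$ is the coordinate of feature point $i$ in the current frame and $c_i^{\mathrm{std}}$ its coordinate in the standard frame.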

Application Information

Patent Type & Authority: Application (China)
IPC (8): G10L21/10; G10L15/18; G10L15/16; G10L15/02; G10L25/57; G06K9/00; G06T15/00; G06N3/04; G06N3/08
CPC: G10L21/10; G10L15/1822; G10L15/16; G10L15/02; G10L25/57; G06T15/005; G06N3/08; G10L2021/105; G06V40/171; G06V40/161; G06N3/045
Inventor: 王涛 (Wang Tao); 徐常亮 (Xu Changliang)
Owner: 新华智云科技有限公司 (Xinhua Zhiyun Technology Co., Ltd.)