Cross-modal generation method based on voice and face images

A face image and speech synthesis technology, applied in the field of deep learning, that achieves accelerated convergence, strong robustness, and a scientific and reasonable design.

Active Publication Date: 2021-02-19

TIANJIN UNIV

Cites: 4. Cited by: 2.

AI Technical Summary

Problems solved by technology

At present, the most common approach to generating face images is to use a generative adversarial network (GAN), which can produce high-quality face images that are very close to the original photos.

Embodiment Construction

[0026] The present invention is described in further detail below through specific examples. The following examples are descriptive only, not restrictive, and do not limit the protection scope of the present invention.

[0027] A cross-modal generation method based on voice and face images, characterized in that the method includes speech-based face reconstruction with a residual prior and personalized speech synthesis from face images with a residual prior.

[0028] For the speech-based face reconstruction model with a residual prior: to alleviate the mismatch between speech and face in speech-based face generation, an end-to-end face reconstruction model based on an encoder-decoder structure is proposed. This structure supplements the speech features from the speech extraction network with additional prior facial features. Two kinds of prior facial features were explored: a neutral prior and gender-specific priors. Furthermore, the encoder and ...
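
The idea of supplementing speech features with a prior facial feature can be illustrated with a minimal sketch. This is not the patented model: the feature size, the prior vectors, the `male`/`female` keys, and the use of element-wise addition to combine prior and speech features are all assumptions made for illustration, since the text above only states that prior facial features complement the speech features.

```python
import random

FEATURE_DIM = 8  # hypothetical feature size, chosen only for illustration


def make_prior(seed):
    """Build a placeholder prior vector (a stand-in for a real facial prior)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(FEATURE_DIM)]


# Hypothetical prior facial features: one neutral prior and two gender priors.
PRIORS = {
    "neutral": make_prior(0),
    "male": make_prior(1),
    "female": make_prior(2),
}


def select_prior(gender=None):
    """Return the gender prior when the gender is known, else the neutral prior."""
    return PRIORS.get(gender, PRIORS["neutral"])


def add_residual_prior(speech_features, gender=None):
    """Combine speech features with a prior by element-wise addition.

    Addition is an assumption of this sketch; the source does not state
    how the two feature sets are combined.
    """
    prior = select_prior(gender)
    if len(speech_features) != len(prior):
        raise ValueError("speech features and prior must have the same length")
    return [s + p for s, p in zip(speech_features, prior)]


# Usage: a zero speech vector yields the prior itself, so the decoder input
# starts from a plausible face and the speech features only adjust it.
combined = add_residual_prior([0.0] * FEATURE_DIM, gender="female")
```

Under this reading, the network only has to learn the deviation from a prior face rather than a whole face, which is one plausible explanation for the faster convergence the abstract reports.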

Abstract

The invention relates to a cross-modal generation method based on voice and face images. The method comprises two parts: reconstructing a face from voice, and synthesizing personalized voice from a face image. For face reconstruction, a voice-to-face reconstruction model based on a residual prior is proposed, which generates a person's face from an input segment of unknown voice. For personalized voice synthesis, a face-image-based speech synthesis model based on a residual prior is proposed, which synthesizes the person's voice from a given face image and a segment of text. The invention is scientific and reasonable in design. The voice-to-face reconstruction model can generate face images very similar to the original face and is highly robust. The set of faces it can generate is not fixed: given the voice of any speaker, a face similar to that speaker can be reconstructed. Likewise, the residual-prior speech synthesis model can synthesize a person's voice from any face image. In addition, the proposed residual prior knowledge method accelerates model convergence and achieves better results.

Description

Technical field

[0001] The invention belongs to the technical field of deep learning and relates to a cross-modal generation method based on voice and face images.

Background

[0002] Cross-modal deep learning has long been a hot topic in academia and industry. One of its focuses is the mapping between knowledge and information in different modalities, where mapping between modalities is the process of transferring an entity from one modality to another. Reconstructing face images from speech and synthesizing personalized speech from face images are both cross-modal learning methods: the former reconstructs face-modality information from the speech modality, and the latter synthesizes speech-modality information from the face-image modality.

[0003] In the research of image generation, the most commonly used method is the transposed convolutional network, which passes the...

Application Information

IPC(8): G06K9/00, G06K9/62, G10L13/027, G10L13/08, G06N3/04

CPC: G10L13/027, G10L13/08, G06V40/168, G06N3/045, G06F18/214

Inventor: 喻梅, 胡晓晟, 王建荣, 徐天一, 赵满坤

Owner: TIANJIN UNIV
