Face image generation method and device, equipment, medium and product
By employing cross-layer connectivity and feature fusion techniques in the facial expression generation model, the generalization and accuracy issues of speech-driven lip-sync synthesis in existing technologies have been resolved, achieving high-quality facial image generation suitable for various virtual human application scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BIGO TECH PTE LTD
- Filing Date
- 2022-11-21
- Publication Date
- 2026-06-23
AI Technical Summary
Existing speech-driven lip-syncing methods have poor generalization ability and low accuracy in terms of facial expressions, languages, intonations and facial features, and cannot adapt to the facial expression styles of multiple speakers.
An expression generation model is adopted, which integrates semantic feature information through cross-layer connections between feature extraction network and feature decoding network to generate facial images synchronized with audio features. Multimodal features are generated using digital human facial data templates and audio feature information to achieve end-to-end expression and lip-syncing.
The generated facial images have more delicate, accurate, natural and smooth expressions. The facial expressions and mouth movements are synchronized with the audio. The method is suitable for multiple digital humans and can be used in scenarios such as virtual assistants, virtual tour guides, virtual customer service and virtual anchors.
Figure CN115761075B_ABST
Abstract
Description
Technical Field
[0001] This application relates to digital human virtual technology, and more particularly to a method, apparatus, device, medium, and product for generating facial images.
Background Technology
[0002] Speech-based lip-syncing technology is widely used in virtual digital humans and animation production, and is one of the key technologies for realizing real-time, realistic communication and interaction between virtual and real spaces.
[0003] Speech-driven lip-syncing refers to using speech audio signals to drive the lip movements of a digital human, generating the correct lip shape corresponding to the speech audio information, thus achieving synchronization between speech and lip movement. Currently, speech-driven lip-syncing methods fall into three main categories: the first is traditional language-based model methods; the second is audio-driven speaker head synthesis; and the third is machine learning-based 3D face model methods.
[0004] Although current research shows that various traditional models can be trained to create facial synthesis for specific speakers, there is still no universal method that can capture facial expression styles corresponding to various speech patterns without being limited to a specific speaker. Specifically, the shortcomings of traditional techniques are that specific models have poor generalization and low accuracy in many aspects such as expression, language, tone, and facial features.
Summary of the Invention
[0005] The purpose of this application is to solve the above-mentioned problems by providing a method for generating facial images and corresponding apparatus, devices, non-volatile readable storage media, and computer program products.
[0006] According to one aspect of this application, a method for generating a face image is provided, comprising the following steps:
[0007] The method involves acquiring a facial data template of a digital human and audio feature information of an audio segment of spoken audio. The facial data template contains mesh vertex data of the facial region of the digital human, and the audio feature information contains audio features of the audio segment obtained in the frequency domain.
[0008] The feature extraction network in the facial expression generation model is used to extract the semantic feature information corresponding to the facial data template and the audio feature information respectively, and then fuse them into multimodal feature information;
[0009] The feature decoding network in the expression generation model is used to generate a face data frame synchronized with the audio segment based on the multimodal feature information and the semantic feature information of the face data template. The face data frame contains mesh vertex data obtained by transforming the face data template.
[0010] The digital human's 3D model is rendered based on the mesh vertex data of the facial data frame to obtain a facial image synchronized with the audio clip.
[0011] According to another aspect of this application, a face image generation apparatus is provided, comprising:
[0012] The data acquisition module is configured to acquire a facial data template of a digital human and audio feature information of an audio segment of spoken audio. The facial data template contains mesh vertex data of the facial region of the digital human, and the audio feature information contains audio features of the audio segment obtained in the frequency domain.
[0013] The feature extraction module is configured to use the feature extraction network in the expression generation model to extract the semantic feature information corresponding to the facial data template and the audio feature information respectively, and then fuse them into multimodal feature information;
[0014] The feature decoding module is configured to use the feature decoding network in the expression generation model to generate a face data frame synchronized with the audio segment based on the multimodal feature information and the semantic feature information of the face data template. The face data frame contains mesh vertex data obtained according to the transformation of the face data template.
[0015] The image rendering module is configured to render the three-dimensional model of the digital human based on the mesh vertex data of the face data frame, thereby obtaining a face image synchronized with the audio clip.
[0016] According to another aspect of this application, a face image generation device is provided, including a central processing unit and a memory, wherein the central processing unit is configured to invoke and run a computer program stored in the memory to perform the steps of the face image generation method described in this application.
[0017] According to another aspect of this application, a non-volatile readable storage medium is provided, which stores a computer program implemented according to the face image generation method in the form of computer-readable instructions, wherein the computer program, when invoked by a computer, performs the steps included in the method.
[0018] According to another aspect of this application, a computer program product is provided, including a computer program / instructions that, when executed by a processor, implement the steps of the method described in any embodiment of this application.
[0019] Compared with existing technologies, this application has several technological advantages, including but not limited to:
[0020] First, in the process of generating facial images, the expression generation model of this application has a cross-layer connection link between its feature extraction network and feature decoding network, which transmits the semantic feature information obtained in the middle to the feature decoding network. This allows the feature decoding network to perform layer-by-layer comprehensive upsampling decoding of the multimodal feature information generated by the feature extraction network by integrating the semantic feature information of the digital human's facial data template and the semantic feature information of the audio feature information during the generation of the corresponding facial data frame. This utilizes both the global comprehensive semantic information corresponding to the entire face and the local information of each detail in the face, thereby comprehensively and finely adjusting the three-dimensional spatial position data of each vertex of the digital human's facial region. This results in more accurate generation of facial data frames corresponding to the audio feature information, making the facial expressions of the final generated facial image more delicate, accurate, natural, and smooth, with facial expressions and mouth movements synchronized with the sound of the audio segment.
[0021] Secondly, this application provides an expression generation model that generates facial data frames for each audio segment of the digital human based on the audio feature information of the audio segments of the audio and the facial data template of the digital human. This model can obtain corresponding facial images, form facial animations, and achieve end-to-end expression and mouth shape driving effects in one step. Furthermore, it can use a single model to handle the voice-driven business of multiple digital humans, achieving economies of scale.
[0022] Furthermore, the facial images generated by this application based on audio and digital human facial data templates have broad adaptability and can be applied to various business scenarios such as virtual assistants, virtual tour guides, virtual customer service, and virtual anchors.
Brief Description of the Drawings
[0023] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0024] Figure 1 is a schematic diagram of the network architecture for an exemplary application scenario of this application;
[0025] Figure 2 is a schematic diagram of an exemplary network architecture for the facial expression generation model of this application;
[0026] Figure 3 is a schematic flowchart of one embodiment of the face image generation method of this application;
[0027] Figure 4 is a schematic diagram illustrating the process of feature representation of audio feature information and facial data templates in an embodiment of this application;
[0028] Figure 5 is a schematic diagram illustrating the process of decoding multimodal feature information in an embodiment of this application;
[0029] Figure 6 is a schematic diagram illustrating the process of injecting semantic feature information of the corresponding scale and obtaining convolutional feature information by upsampling in an embodiment of this application;
[0030] Figure 7 is a schematic diagram illustrating the process of training the facial expression generation model in an embodiment of this application;
[0031] Figure 8 is a schematic diagram illustrating the data correspondence of a set of training samples in an embodiment of this application;
[0032] Figure 9 is a schematic diagram illustrating the process of constructing a training dataset in an embodiment of this application;
[0033] Figure 10 is a schematic diagram illustrating the process of obtaining input data for the facial expression generation model in an embodiment of this application;
[0034] Figure 11 is a schematic block diagram of the face image generation apparatus of this application;
[0035] Figure 12 is a schematic diagram of the structure of a face image generation device used in this application.
Detailed Implementation
[0036] Referring to Figure 1, the network architecture used in an exemplary application scenario of this application includes a terminal device 80, a media server 81, and an application server 82.
[0037] The application server 82 can be used to deploy application layer services based on digital humans, such as virtual pronunciation assistant services, virtual tour guide services, virtual customer service, and virtual anchor services.
[0038] The virtual pronunciation assistant service can provide a video stream of specific text to a user on a terminal device 80. The video stream can include a digital facial animation presented according to the pronunciation of the specific text, and can also generate machine-synthesized speech of the specific text on demand. In the facial animation, each image frame is synchronized with the sound of the machine-synthesized speech in terms of digital facial expressions and mouth movements, so as to realize pronunciation correction teaching.
[0039] The virtual tour guide service can provide users on terminal device 80 with a video stream corresponding to specific narration text. The video stream typically includes digital human facial animation and machine-synthesized speech of the specific text. Similarly, in the facial animation, each image frame is synchronized with the sound of the machine-synthesized speech in terms of the digital human's facial expressions and mouth movements, and is usually also matched with image information corresponding to the content being narrated.
[0040] The virtual customer service is mainly applied in intelligent customer service scenarios. Similarly, based on the reply text generated by the customer service robot in response to user questions, a corresponding digital human facial animation and machine-synthesized speech are generated. Each image frame in the facial animation is synchronized with the sound of the machine-synthesized speech in terms of the digital human's facial expressions and mouth movements, forming a video stream that is pushed to the user's terminal device 80 and giving the user the feeling of chatting face-to-face with a digital human.
[0041] The virtual anchor service is mainly used in online live streaming scenarios. It can generate digital facial animation based on the audio data of the anchor users in the online live streaming room. The animation frames are synchronized with the audio data of the spoken words in terms of facial expressions and mouth movements, forming a live video stream that is pushed to the terminal devices of various viewers in the online live streaming room for playback.
[0042] The media server 81 is used to implement a digital human face animation generation service. Its open interface is available for application layer services of the application server to call. An expression generation model is deployed to generate synchronized facial data frames for each audio segment based on the digital human's facial data template. Upon receiving the call, the digital human face animation generation service obtains the digital human's feature identifier and the audio. Based on the digital human's feature identifier, it determines the corresponding digital human's facial data template, divides the audio into multiple audio segments, and then inputs each audio segment along with the digital human's facial data template into the expression generation model to generate a facial image synchronized with that audio segment. Corresponding facial images are generated for each audio segment, and these facial images are synthesized in chronological order to form a facial animation, which is returned to the application layer service as a video stream.
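The service flow described in this paragraph can be sketched roughly as follows. This is only an illustrative outline, not the deployed service: the `TEMPLATES` registry, `generate_animation`, `stub_model`, and the trivial `render` callable are all hypothetical names, and the stub stands in for the trained expression generation model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vertex = Tuple[float, float, float]
FaceFrame = List[Vertex]

# Hypothetical registry mapping a digital human's feature identifier to its template.
TEMPLATES: Dict[str, FaceFrame] = {
    "avatar_a": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
}


def generate_animation(
    avatar_id: str,
    segments: Sequence[Sequence[float]],
    model: Callable[[FaceFrame, Sequence[float]], FaceFrame],
    render: Callable[[FaceFrame], object],
) -> List[object]:
    """Return one rendered image per audio segment, in chronological order."""
    template = TEMPLATES[avatar_id]       # look up the template by identifier
    images = []
    for segment in segments:              # segments are assumed to be in time order
        frame = model(template, segment)  # the same template is reused for every segment
        images.append(render(frame))
    return images


def stub_model(template: FaceFrame, segment: Sequence[float]) -> FaceFrame:
    """Stand-in for the real model: lowers every vertex by the mean signal level."""
    level = sum(abs(s) for s in segment) / len(segment)
    return [(x, y - level, z) for (x, y, z) in template]


segments = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
images = generate_animation("avatar_a", segments, stub_model, render=lambda frame: frame)
```

The point of the sketch is the control flow: one template lookup per call, one model invocation per segment, and output order following segment order.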
[0043] The facial data template is essentially a facial data frame. This facial data frame contains mesh vertex data describing the three-dimensional spatial position information of each vertex in the facial region of the digital human's 3D model. The mesh vertex data is described by the motion vectors or static position data of corresponding preset facial key points, also known as vertices. The specific data format can be pre-standardized or arbitrarily set to adapt to the output format of the expression generation model. For example, in one embodiment, when the mesh vertex data represents each facial image, it can provide the offset corresponding to the rotation values of each preset vertex on the three axes of the 3D image space coordinate system. In another embodiment, when the mesh vertex data represents each facial image, it can provide the position coordinates of each preset vertex in the 3D image space coordinate system. Therefore, the data structure and data representation of the facial data frame can be flexibly defined and used.
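As a minimal sketch of the two representations mentioned above, the following assumes a hypothetical `FaceDataFrame` container whose vertices are either absolute positions or per-axis offsets relative to the template. The class, its `is_offset` flag, and `to_absolute` are illustrative names only, and treating the offsets as plain positional displacements is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class FaceDataFrame:
    """Mesh vertex data for the facial region (layout is an assumption)."""
    vertices: List[Vec3]
    is_offset: bool = False  # True: per-vertex offsets; False: absolute positions


def to_absolute(frame: FaceDataFrame, template: FaceDataFrame) -> FaceDataFrame:
    """Resolve an offset-style frame against the template's absolute positions."""
    if not frame.is_offset:
        return frame
    if len(frame.vertices) != len(template.vertices):
        raise ValueError("frame and template must describe the same vertex set")
    moved = [
        (tx + dx, ty + dy, tz + dz)
        for (tx, ty, tz), (dx, dy, dz) in zip(template.vertices, frame.vertices)
    ]
    return FaceDataFrame(moved, is_offset=False)


template = FaceDataFrame([(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)])
offsets = FaceDataFrame([(0.5, 0.0, 0.0), (0.0, -0.25, 0.0)], is_offset=True)
absolute = to_absolute(offsets, template)
```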
[0044] The vertices in the face region of the digital human are mainly distributed at various key positions in the face region. The displacement of these vertices in the three-dimensional image space coordinate system can drive the corresponding facial details of the digital human to produce corresponding motion effects. By progressively representing the changes in the positional information of the same set of vertices through multiple sequential facial data frames, the three-dimensional model of the digital human can be controlled to produce corresponding facial expression changes. By controlling and rendering the three-dimensional model of the digital human using these facial data frames, the corresponding facial images of each facial data frame can be obtained. By organizing these facial images together in time sequence, the facial animation of the digital human is formed, which can visually present the image motion effect of the changes in the facial expression details of the digital human corresponding to the corresponding vertices.
[0045] The 3D model of the digital human can be pre-modeled to adjust the position and viewpoint of each vertex according to the position information of each vertex in the facial data frame, so that a corresponding digital human facial animation can be generated through rendering. Furthermore, depending on actual needs, background images or other foreground images can be added when generating the digital human's facial image; this can be flexibly implemented by those skilled in the art.
[0046] Referring to Figure 2, the exemplary network architecture of the facial expression generation model adopted in this application includes a feature extraction network and a feature decoding network. The feature extraction network includes an audio encoder, a facial expression encoder, and a feature fusion network.
[0047] The audio encoder is used to extract semantic features from the input audio feature information, so as to obtain deep semantic features related to the mouth shape of the face image, namely the audio semantic feature information. It can be implemented with a network architecture capable of processing sequential information, which may include convolutional neural networks (CNN), recurrent neural networks (RNN), gated recurrent units (GRU), long short-term memory networks (LSTM), and so on. One or more such basic neural network models can be combined to represent the audio features as a sequence and obtain the corresponding audio semantic feature information. For example, the DeepSpeech model is a known model that can be used to extract audio semantic features from audio features.
[0048] The facial expression encoder can also be constructed based on any one or more basic neural network models such as recurrent neural networks, long short-term memory networks, and residual networks. In one embodiment, the facial expression encoder uses multiple fully connected layers to perform feature transformations at multiple feature scales on the input facial data frames to obtain semantic feature information corresponding to each feature scale. Then, the semantic feature information at the smallest scale is used to extract its deep semantics through a long short-term memory network to obtain the facial expression semantic feature information of the facial data frame. The semantic feature information at different feature scales obtained by the facial expression encoder has rich shallow semantics, and therefore can be transmitted to the feature decoding network through cross-layer connections as reference information required for decoding at each scale.
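The multi-scale part of the expression encoder might look like the following NumPy sketch. It covers only the stack of fully connected layers that yields one feature vector per scale; the long short-term memory step is omitted. The layer sizes, the ReLU activation, the random initialization, and the names `make_fc_layers` and `encode_template` are assumptions made for illustration.

```python
import numpy as np


def make_fc_layers(sizes, seed=0):
    """Create (weight, bias) pairs for consecutive fully connected layers."""
    rng = np.random.default_rng(seed)
    return [
        (rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def encode_template(flat_vertices, layers):
    """Return the feature vector produced at each scale, largest scale first."""
    features = []
    x = flat_vertices
    for weight, bias in layers:
        x = np.maximum(x @ weight + bias, 0.0)  # fully connected layer + ReLU (assumed)
        features.append(x)                      # kept for the cross-layer connection
    return features


num_vertices = 8
template = np.linspace(-1.0, 1.0, num_vertices * 3)        # flattened (x, y, z) values
layers = make_fc_layers([num_vertices * 3, 16, 8, 4])
skips = encode_template(template, layers)
```

Every intermediate output is retained rather than discarded, which is what allows each scale to be handed to the decoder later.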
[0049] The feature fusion network is used to fuse the facial expression semantic feature information of the face data frame obtained by the expression encoder with the audio semantic feature information of the audio feature information obtained by the audio encoder to form multimodal feature information, which is then input into the feature decoder for further processing.
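The source does not specify the fusion operator. One common choice, shown here purely as an assumption, is to concatenate the two feature vectors and apply a learned projection. The function `fuse`, the feature sizes, and the tanh nonlinearity are illustrative and not taken from the source.

```python
import numpy as np


def fuse(expression_feat, audio_feat, weight, bias):
    """Concatenate both modalities and project them into one multimodal vector."""
    joint = np.concatenate([expression_feat, audio_feat])
    return np.tanh(joint @ weight + bias)


rng = np.random.default_rng(0)
expression_feat = rng.standard_normal(4)   # deep expression semantics (size assumed)
audio_feat = rng.standard_normal(6)        # deep audio semantics (size assumed)
weight = rng.standard_normal((10, 4)) * 0.1
bias = np.zeros(4)
multimodal = fuse(expression_feat, audio_feat, weight, bias)
```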
[0050] The feature decoder comprises multiple sequentially cascaded subnetworks, each containing a residual module (ResNet). The first cascaded subnetwork receives the multimodal feature information as input through its residual module and performs an equal-scale convolution operation on it to obtain the corresponding convolutional feature information. Each subsequent cascaded subnetwork corresponds to a specific scale. In addition to a residual module that upsamples the convolutional feature information output by the previous stage, each such subnetwork places a multilayer perceptron before the residual module to extract perceptual feature information from the semantic feature information of the facial data frame obtained by the expression encoder at the corresponding scale. This perceptual feature information is normalized by a normalization layer and provided to the residual module as reference information for upsampling the convolutional feature information of the previous scale, so that the convolutional feature information at the current scale is reconstructed with reference to the shallow semantics of that scale. The resulting convolutional feature information is then passed to the next cascaded subnetwork for further upsampling. The convolutional feature information obtained by the residual module of the last cascaded subnetwork constitutes the face data frame obtained through inference.
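A single subsequent decoder stage could be sketched as below, under several assumptions: features are flat vectors, upsampling is a linear map, the multilayer perceptron has two layers, and the residual module adds a tanh-transformed branch to an identity path. The parameter names (`mlp_w1`, `mlp_w2`, `up_w`, `res_w`) and `decoder_stage` are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)


def decoder_stage(prev_feat, skip_feat, params):
    """Upsample prev_feat to the current scale, guided by the skip feature."""
    # Multilayer perceptron on the cross-layer (skip) feature, then normalization.
    hidden = np.maximum(skip_feat @ params["mlp_w1"], 0.0)
    reference = layer_norm(hidden @ params["mlp_w2"])
    # Upsample the previous stage's output to this stage's scale (linear map assumed).
    upsampled = prev_feat @ params["up_w"]
    # Residual module: identity path plus a branch conditioned on the reference.
    return upsampled + np.tanh((upsampled + reference) @ params["res_w"])


rng = np.random.default_rng(0)
prev_feat = rng.standard_normal(4)   # lower-scale output of the previous stage
skip_feat = rng.standard_normal(8)   # encoder feature at the current scale
params = {
    "mlp_w1": rng.standard_normal((8, 8)) * 0.1,
    "mlp_w2": rng.standard_normal((8, 8)) * 0.1,
    "up_w": rng.standard_normal((4, 8)) * 0.1,
    "res_w": rng.standard_normal((8, 8)) * 0.1,
}
out = decoder_stage(prev_feat, skip_feat, params)
```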
[0051] To enable the expression generation model to generate facial data frames that are synchronized with the sound of the audio segment in terms of facial expressions and lip movements, based on the given digital human facial data frames and the audio feature information of the given audio segment, a corresponding training dataset can be used to pre-train the expression generation model to a convergent state before putting it into online inference.
[0052] Based on the above explanation of principles, and referring to Figure 3, a face image generation method provided in this application includes, in one embodiment, the following steps:
[0053] Step S1100: Obtain the facial data template of the digital human and the audio feature information of the audio segment of the audio. The facial data template contains the mesh vertex data of the facial region of the digital human, and the audio feature information contains the audio features of the audio segment obtained in the frequency domain.
[0054] When it is necessary to generate facial animation of a digital human from audio, the audio feature information of the audio segment can be obtained by taking the standardized duration audio segment in the audio as a unit, and at the same time, the facial data template of the target digital human can be obtained. The two are then used as the input data for the expression generation model of this application.
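Cutting audio into segments of a standardized duration can be sketched as follows. The zero-padding of the final segment and the name `split_into_segments` are assumptions; the source does not say how a trailing partial segment is handled.

```python
from typing import List, Sequence


def split_into_segments(
    samples: Sequence[float],
    sample_rate: int,
    segment_seconds: float,
    pad_value: float = 0.0,
) -> List[List[float]]:
    """Split samples into equal-length segments, padding the last one if needed."""
    segment_len = int(round(sample_rate * segment_seconds))
    if segment_len <= 0:
        raise ValueError("segment duration is too short for this sample rate")
    segments = []
    for start in range(0, len(samples), segment_len):
        chunk = list(samples[start:start + segment_len])
        chunk.extend([pad_value] * (segment_len - len(chunk)))  # pad the tail (assumed)
        segments.append(chunk)
    return segments


samples = [float(i) for i in range(10)]
segments = split_into_segments(samples, sample_rate=4, segment_seconds=1.0)
```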
[0055] The audio feature information of the audio segment can be information composed of audio features extracted from the spectrogram of the audio segment. In one embodiment, the audio segment is first preprocessed, that is, windowed and framed, to obtain multiple speech frames. Short-time Fourier transform is performed on the multiple speech frames to transform them from the time domain to the frequency domain. Then, a Mel filter is applied to the speech frames in the frequency domain to obtain the Mel spectrum. The corresponding audio features are then extracted from the Mel spectrum. The audio features of the multiple speech frames are organized in an orderly manner to constitute the audio feature information of the audio segment.
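The preprocessing chain in this embodiment (framing and windowing, short-time Fourier transform, Mel filtering) can be sketched with NumPy as follows. The frame length, hop size, number of Mel bands, Hann window, and final log compression are assumed values and choices, and the function names are illustrative; the source does not fix any of them.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale: (n_mels, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def log_mel_features(signal, sample_rate, frame_len=400, hop=160, n_mels=26):
    """Frame, window, transform, mel-filter and log-compress a mono signal."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)        # framing + Hann window
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2    # time domain -> frequency domain
    mel = power @ mel_filterbank(n_mels, frame_len, sample_rate).T
    return np.log(mel + 1e-10)                          # one feature row per speech frame


sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
features = log_mel_features(np.sin(2.0 * np.pi * 440.0 * t), sample_rate)
```

Each row of `features` corresponds to one speech frame, so the rows taken in order play the role of the audio feature information of the segment.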
[0056] In one embodiment, the operation network for extracting the corresponding audio feature information of the audio segment can be integrated into the feature extraction network to achieve standardized operation.
[0057] The facial data template is a facial data frame corresponding to a 3D model of a digital human. The facial data frame is formed by representing the 3D spatial position information of each vertex in the mesh of the facial region of the 3D model of the digital human, constituting the mesh vertex data corresponding to the facial region of the digital human. The mesh vertex data included in the facial data template can be data corresponding to the facial region of the 3D model of the digital human in any static state, that is, data corresponding to a static expression state.
[0058] Step S1200: Using the feature extraction network in the expression generation model, the semantic feature information corresponding to the facial data template and the audio feature information is extracted respectively, and then fused into multimodal feature information;
[0059] Referring again to Figure 2, the previously obtained facial data template of the digital human and the audio feature information of the audio segment are input into the expression encoder and the audio encoder, respectively, in the feature extraction network of the expression generation model. The expression encoder performs feature representation on the facial data template, transforming it through multiple fully connected layers to extract semantic feature information of the facial data template at multiple feature scales. The semantic feature information at each feature scale preserves the shallow original semantics of the facial data template, and the deep semantic features are further extracted from the semantic feature information at the smallest scale to obtain the corresponding expression semantic feature information. All of this information is then transmitted forward to the feature decoding network, including the semantic feature information representing the shallow original semantics at each scale and the expression semantic feature information extracted from the smallest scale. Simultaneously, the feature extraction network uses its audio encoder to perform feature representation on the audio feature information, extracting deep semantic feature information as the audio semantic feature information corresponding to the audio segment. Finally, the feature fusion network of the expression generation model fuses the expression semantic feature information of the facial data template output by the expression encoder with the audio semantic feature information output by the audio encoder to obtain the multimodal feature information. In this multimodal feature information, the deep semantic features of the audio feature information of the audio segment are introduced on top of the deep semantic features of the facial data template (i.e., the expression semantic feature information), so that the deep semantic features of the original mesh vertex data in the facial data template are modulated by the deep semantic features of the audio feature information. This makes it possible to decode and transform the mesh vertex data.
[0060] Step S1300: Using the feature decoding network in the expression generation model, generate a face data frame synchronized with the audio segment based on the multimodal feature information and the semantic feature information of the face data template. The face data frame contains mesh vertex data obtained by transforming the face data template.
[0061] Referring again to Figure 2, the multimodal feature information and the semantic feature information corresponding to each intermediate feature scale output by the feature extraction network are all passed to the feature decoding network for decoding. In the feature decoding network, the residual module in the first cascaded subnetwork, i.e., the first residual module, takes the multimodal feature information as input and performs an equal-scale convolution operation on it to obtain the corresponding convolutional feature information.
[0062] The second cascaded subnetwork is connected in series after the first, followed in turn by the third, fourth, and any further cascaded subnetworks. The total number of cascaded subnetworks depends on the number of intermediate feature scales at which semantic feature information is transmitted through the cross-layer connections of the feature extraction network; in other words, it is set to match the number of outputs of the feature extraction network. Each subsequent cascaded subnetwork is implemented with the same network architecture and therefore operates according to the same logic.
[0063] Each subsequent cascaded subnetwork receives two inputs. The first input is convolutional feature information corresponding to a lower scale obtained from the residual module of its predecessor cascaded subnetwork. The second input is semantic feature information of the face data template at its corresponding scale obtained from the feature extraction network through cross-layer connections. Each subsequent cascaded subnetwork uses the semantic feature information from its second input as a reference for the shallow original semantics, and upsamples the convolutional feature information from the first input to accurately reconstruct the convolutional feature information at its current scale. The scale at this level is higher than the scale at the previous cascaded subnetwork but lower than the scale at the next cascaded subnetwork.
[0064] After the subsequent cascaded subnetworks have upsampled sequentially from low scale to high scale, the result is a face data frame whose scale corresponds to the original scale of the face data template. The obtained face data frame likewise describes the mesh vertex data of the digital human's facial region. Because this mesh vertex data has been modulated by the deep semantic features of the audio segment's audio feature information, its facial expression and lip movement remain synchronized with the audio segment. In this way, the face data template is transformed, based on the audio feature information of the audio segment, into a face data frame whose facial expression and lip movement are synchronized with that segment.
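The scale bookkeeping of the cascade in paragraphs [0061] to [0064] can be checked with a small planning helper. This sketch assumes that feature sizes are plain integers and that the largest cross-layer scale equals the template's own size; `Stage` and `plan_cascade` are hypothetical names.

```python
from typing import List, NamedTuple


class Stage(NamedTuple):
    index: int
    in_size: int
    out_size: int
    uses_skip: bool


def plan_cascade(multimodal_size: int, skip_sizes: List[int], template_size: int) -> List[Stage]:
    """List the decoder stages implied by the cross-layer scales (low to high)."""
    if not skip_sizes or skip_sizes[-1] != template_size:
        raise ValueError("the largest skip scale must match the template scale")
    previous = multimodal_size
    for size in skip_sizes:
        if size <= previous:
            raise ValueError("scales must strictly increase from stage to stage")
        previous = size
    # First stage: equal-scale operation on the multimodal feature, no skip input.
    stages = [Stage(1, multimodal_size, multimodal_size, False)]
    size = multimodal_size
    # One further stage per cross-layer scale, each upsampling to that scale.
    for index, skip in enumerate(skip_sizes, start=2):
        stages.append(Stage(index, size, skip, True))
        size = skip
    return stages


stages = plan_cascade(multimodal_size=4, skip_sizes=[8, 16, 24], template_size=24)
```

With three cross-layer scales the plan contains four stages, and the last stage's output size equals the template size, matching the description of the final face data frame.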
[0065] Step S1400: Render the three-dimensional model of the digital human based on the mesh vertex data of the face data frame to obtain a face image synchronized with the audio clip.
[0066] After obtaining the facial data frame corresponding to the audio segment, the facial data frame can be applied to the three-dimensional model of the digital human, and the three-dimensional model can be rendered based on the facial data frame to generate the facial image of the digital human corresponding to the facial data frame.
[0067] In one embodiment, by dividing the same audio into multiple audio segments, and using each audio segment in conjunction with the facial data template of the digital human as input data for the expression generation model, a corresponding facial image can be generated by the expression generation model. Finally, these facial images are combined in an orderly manner according to the order of the audio segments and converted into a video format to generate a corresponding facial animation. The facial animation can be stored or transmitted in a streaming media format.
[0068] As can be seen from the above embodiments, this application has a variety of technical advantages, including but not limited to:
[0069] First, in the expression generation model of this application, a cross-layer connection links the feature extraction network to the feature decoding network and transmits the intermediate semantic feature information obtained during feature extraction to the feature decoding network. When generating a facial data frame, the feature decoding network can therefore upsample and decode the multimodal feature information layer by layer while integrating the semantic feature information of the digital human's facial data template with the semantic feature information of the audio feature information. This uses both the global semantic information of the entire face and the local information of each facial detail, so the three-dimensional spatial position data of each vertex of the digital human's facial region is adjusted comprehensively and finely. The facial data frames generated for the audio feature information are therefore more accurate, the expressions in the final facial image are more delicate, accurate, natural, and smooth, and the facial expressions and mouth movements are synchronized with the sound of the audio segment.
[0070] Secondly, this application provides an expression generation model that generates facial data frames for each audio segment of the digital human based on the audio feature information of the audio segments of the audio and the facial data template of the digital human. This model can obtain corresponding facial images, form facial animations, and achieve end-to-end expression and mouth shape driving effects in one step. Furthermore, it can use a single model to handle the voice-driven business of multiple digital humans, achieving economies of scale.
[0071] Furthermore, the facial images generated by this application based on audio and digital human facial data templates have broad adaptability and can be applied to various business scenarios such as virtual assistants, virtual tour guides, virtual customers, and virtual anchors.
[0072] Based on any embodiment of this application, please refer to Figure 4. The step in which the expression generation model employs a feature extraction network to extract the semantic features of the facial data template and of the audio feature information, respectively, and then fuses them into multimodal feature information includes:
[0073] Step S1210: Using the audio encoder in the feature extraction network, extract the semantic feature information of the audio feature information to obtain the audio semantic feature information;
[0074] In an exemplary network architecture, the audio encoder is built on DeepSpeech. It first applies multiple convolutional layers to the audio feature information of the audio segment to obtain intermediate feature information. It then extracts deep semantics from the intermediate feature information through a sequence feature representation network composed of multiple gated recurrent units (GRUs), finally obtaining the corresponding audio semantic feature information.
[0075] Step S1220: Using the facial expression encoder in the feature extraction network, extract the semantic feature information corresponding to the facial data template at multiple preset scales, and extract its deep semantics as facial expression semantic feature information based on the semantic feature information at the smallest scale.
[0076] In the facial expression encoder, for example, three fully connected layers can be set up, each performing feature transformation from a higher scale to a lower scale. When the facial data template is input into the first fully connected layer at its original scale, the first fully connected layer transforms it to a lower feature scale to obtain semantic feature information at the corresponding scale. Then, it is transmitted to the second fully connected layer for transformation at the next scale, and so on. Multiple semantic feature information at multiple corresponding scales can be obtained through multiple fully connected layers. Since the fully connected layers mainly perform linear mapping based on the original semantics, the semantic feature information at each scale can preserve the shallow original semantics of the facial data template. Furthermore, the semantic feature information at different scales can highlight the semantic features of the facial data template at different granularities from comprehensive to local, thus possessing rich original semantic reference value.
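The multi-scale transformation described above can be pictured with a minimal sketch. It assumes a flattened template vector, three bias-free linear layers with random weights standing in for trained fully connected layers, and hypothetical sizes (12, 8, 4, 2); none of these values come from the application.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def make_linear(in_dim: int, out_dim: int, rng: random.Random) -> Matrix:
    """Random weight matrix standing in for a trained fully connected layer."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Apply a bias-free linear map y = W x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def encode_multiscale(template: Vector, layers: List[Matrix]) -> List[Vector]:
    """Return the semantic features produced at every scale, largest first."""
    features = []
    x = template
    for weights in layers:
        x = linear(weights, x)
        features.append(x)  # kept so it can be passed on via a cross-layer connection
    return features


rng = random.Random(0)
dims = [12, 8, 4, 2]  # hypothetical: template size, then three decreasing scales
layers = [make_linear(dims[i], dims[i + 1], rng) for i in range(3)]
template = [float(i) for i in range(dims[0])]  # toy flattened vertex coordinates
scales = encode_multiscale(template, layers)
print([len(f) for f in scales])  # [8, 4, 2]
```

Every intermediate output is retained rather than only the last one, which is what allows each scale to be reused later as a shallow-semantics reference.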
[0077] After the last fully connected layer outputs the semantic feature information at the minimum scale, a feature representation model placed at the end of the expression encoder, for example a long short-term memory (LSTM) network, further extracts the deep semantics of that minimum-scale semantic feature information. The resulting context-aware deep semantic information serves as the expression semantic feature information of the face data template.
[0078] Step S1230: Using the feature fusion network in the feature extraction network, the audio semantic feature information and the facial expression semantic feature information are synthesized into multimodal feature information.
[0079] The audio semantic feature information and facial expression semantic feature information obtained by the audio encoder and facial expression encoder are normalized to data of the corresponding scale during their processing, and then passed to the feature fusion network in the feature extraction network.
[0080] The feature fusion network performs fusion processing based on the audio semantic feature information and the facial expression semantic feature information. Specifically, it can use any method such as element-wise product or channel concatenation to achieve fusion, so that the two are fused into the same multimodal feature information. It is easy to understand that the multimodal feature information is based on the deep semantics of the facial data template, and the deep semantics of the audio segment is introduced as a modulating factor, which guides the transformation of the facial data template into facial data frames with different expressions.
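The two fusion options named above differ mainly in the shape of their output. The sketch below illustrates this on plain lists; the function names and the toy feature values are assumptions made for illustration only.

```python
from typing import List


def fuse_product(audio: List[float], expression: List[float]) -> List[float]:
    """Element-wise product; both vectors must have the same length."""
    if len(audio) != len(expression):
        raise ValueError("element-wise fusion needs equal-length features")
    return [a * e for a, e in zip(audio, expression)]


def fuse_concat(audio: List[float], expression: List[float]) -> List[float]:
    """Channel concatenation; output length is the sum of both lengths."""
    return list(audio) + list(expression)


audio_feat = [0.5, -1.0, 2.0]  # toy audio semantic feature
expr_feat = [2.0, 3.0, 0.5]    # toy expression semantic feature
print(fuse_product(audio_feat, expr_feat))  # [1.0, -3.0, 1.0]
print(fuse_concat(audio_feat, expr_feat))   # [0.5, -1.0, 2.0, 2.0, 3.0, 0.5]
```

The element-wise product keeps the feature size unchanged but requires both inputs to share a scale, whereas concatenation places no such requirement and leaves the mixing to later layers.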
[0081] As can be seen from the above embodiments, the feature extraction network processes the audio feature information of the audio segment and the facial data template separately to obtain audio semantic feature information and facial expression semantic feature information, and then fuses the two into multimodal feature information, realizing effective feature fusion. With the audio features as a reference, the multimodal feature information can represent the facial data template and the trend of its expression changes. Moreover, the facial expression semantic feature information in the multimodal feature information is derived from the smallest-scale semantic feature information of the facial data template, which can represent the salient features of detailed parts. The multimodal feature information is therefore a more accurate feature representation for obtaining facial data frames whose facial expressions and mouth movements match the audio segment.
[0082] Based on any embodiment of this application, please refer to Figure 5. The step in which the expression generation model employs a feature decoding network to generate facial data frames synchronized with the audio segment, based on the multimodal feature information and the semantic feature information of the facial data template, includes:
[0083] Step S1310: Using the first residual module in the feature decoding network, perform equal-scale convolution operation on the multimodal feature information to obtain the corresponding convolutional feature information;
[0084] The number of residual modules in the feature decoding network is mainly determined by the number of outputs of the feature extraction network. For example, if the feature extraction network has four outputs (from three fully connected layers and one feature fusion network), four residual modules can be set accordingly. The first residual module corresponds to the multimodal feature information output by the feature fusion network. It performs equal-scale feature extraction on the multimodal feature information and obtains the corresponding convolutional feature information through residual convolution operations.
[0085] Step S1320: Use multiple residual modules, one per scale, cascaded after the first residual module. The residual module at each scale refers to the semantic feature information of the face data template at its corresponding scale and upsamples the convolutional feature information output by the residual module of the previous stage to obtain the convolutional feature information at the corresponding scale.
[0086] The second residual module cascaded to the first residual module, and the third, fourth, and so on, are all part of their respective cascaded sub-networks, working in conjunction with other components within their cascaded sub-networks. Each cascaded residual module following the first corresponds to a scale of semantic feature information output by the feature extraction network. It receives the semantic feature information of the face data template at the corresponding scale, and also receives the smaller-scale convolutional feature information obtained from the previous residual module as its input data. Then, it upsamples using the deep semantics carried by the smaller-scale convolutional feature information. During upsampling, the shallow original semantics carried by the corresponding scale of the semantic feature information are used as reference information to obtain the convolutional feature information of the corresponding scale at this level. This information is then output to the next-level residual module for the same processing, and so on, until the last residual module outputs convolutional feature information at the same scale as the face data template.
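The data flow of the cascade can be sketched as follows. This is only an illustration under stated assumptions: nearest-neighbour repetition replaces the residual upsampling, a simple addition replaces the reference step, and the function names and toy values are hypothetical.

```python
from typing import List


def upsample(x: List[float], size: int) -> List[float]:
    """Nearest-neighbour upsampling of a 1-D feature vector to `size` entries."""
    return [x[i * len(x) // size] for i in range(size)]


def decode(multimodal: List[float], skips: List[List[float]]) -> List[float]:
    """Walk the cascade from the lowest scale to the highest.

    `skips` holds per-scale template features ordered from low scale to high scale.
    """
    feat = list(multimodal)  # first stage: equal-scale processing (identity here)
    for skip in skips:
        up = upsample(feat, len(skip))
        # Toy reference step: add the shallow template feature at this scale.
        feat = [u + s for u, s in zip(up, skip)]
    return feat


multimodal = [1.0, 2.0]           # toy lowest-scale multimodal feature
skips = [[0.1] * 4, [0.01] * 8]   # toy template features at two higher scales
frame = decode(multimodal, skips)
print(len(frame))  # 8
```

Each stage consumes the output of the stage before it together with the template feature of its own scale, so the output of the final stage has the size of the highest-scale reference.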
[0087] Step S1330: Use the convolutional feature information obtained by the residual module corresponding to the highest scale as the face data frame.
[0088] The last residual module, which is also the residual module corresponding to the highest scale, obtains convolutional feature information that corresponds to the audio segment. It is the facial data frame obtained by transforming facial expressions and mouth movements based on the facial data template.
[0089] As can be seen from the above embodiments, during the decoding process based on multimodal feature information, the semantic feature information corresponding to each scale obtained by the feature extraction network is used as reference information. During the decoding process, original semantic information related to facial expressions from the facial data template is injected at multiple corresponding scales, thereby guiding the feature decoding network to accurately decode the facial data frames. The expression injection operation at multiple scales ensures that the generation of facial data frames progresses from shallow to deep, from local to global, and obtains effective reference information at all times, ensuring that the final generated facial data frames have high-quality results and can maintain audio synchronization with the corresponding audio segments in terms of facial expression features, such as mouth shape.
[0090] Based on any embodiment of this application, please refer to Figure 6. The step in which the residual module at each scale references the semantic feature information of the facial data template at its corresponding scale and upsamples the convolutional feature information output by the previous residual module to obtain the convolutional feature information at the corresponding scale includes:
[0091] Step S1321: The residual module at each scale extracts the perceptual feature information of the semantic feature information of the face data template obtained at the corresponding scale through the multilayer perceptron at its corresponding scale.
[0092] As mentioned above, a cascaded subnetwork is set in the feature decoding network for each feature scale. In addition to the residual module, each cascaded subnetwork contains a multilayer perceptron (MLP) and a normalization layer, which process the semantic feature information of the face data template at the corresponding scale.
[0093] When the semantic feature information of the face data template obtained by the feature extraction network at one scale needs to be passed to the residual module at the corresponding scale, the semantic feature information first reaches the multilayer perceptron of the cascaded sub-network at that scale. The multilayer perceptron performs feature mapping and extracts the corresponding perceptual feature information from the semantic feature information at that scale. The perceptual feature information then enters the normalization layer for standardization.
[0094] Step S1322: The residual module at each scale performs channel normalization preprocessing on the perceptual feature information obtained at the corresponding scale through the normalization layer at that scale to obtain the corresponding standardized perceptual feature information.
[0095] The perceptual feature information obtained in a cascaded sub-network is standardized by the normalization layer that follows the multilayer perceptron. This layer performs channel-wise adaptive instance normalization (AdaIN): the preceding multilayer perceptron generates its weight parameters, which are then used to standardize the perceptual feature information at the corresponding scale. An example formula is shown below:
[0096] AdaIN(z) = γ · (z − μ) / σ + β
[0097] Where z is the perceptual feature information obtained by the multilayer perceptron, μ and σ are the feature mean and variance in the channel direction, and γ and β are the parameters output by the MLP from the semantic feature information at the corresponding scale, carrying facial expression feature information.
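A minimal sketch of this channel standardization is given below. It assumes the standard AdaIN form γ · (z − μ) / σ + β, takes σ as the per-channel standard deviation (the square root of the variance), and adds a small `eps` for numerical stability; the `[channel][position]` layout and the toy values are likewise assumptions.

```python
import math
from typing import List


def adain(z: List[List[float]], gamma: List[float], beta: List[float],
          eps: float = 1e-5) -> List[List[float]]:
    """Normalize each channel of z, then scale by gamma and shift by beta.

    z is laid out as [channel][position]; gamma and beta hold one value per channel.
    """
    out = []
    for channel, g, b in zip(z, gamma, beta):
        mu = sum(channel) / len(channel)
        var = sum((v - mu) ** 2 for v in channel) / len(channel)
        sigma = math.sqrt(var + eps)  # eps avoids division by zero
        out.append([g * (v - mu) / sigma + b for v in channel])
    return out


z = [[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]]  # two toy channels
styled = adain(z, gamma=[2.0, 1.0], beta=[0.5, -1.0])
```

After the operation each channel has a mean equal to its β and a spread controlled by its γ, which is how the parameters derived from the template can steer the features without being concatenated to them.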
[0098] It is evident that by using channel standardization to standardize semantic feature information at the corresponding scale, rather than using simple feature splicing, a more accurate expression injection effect can be achieved, obtaining standardized perceptual feature information.
[0099] Step S1323: The residual module at each scale uses the standardized perceptual feature information obtained at its corresponding scale as reference information, and upsamples the convolutional feature information output by the residual module of the previous stage to obtain the convolutional feature information at the corresponding scale.
[0100] When the residual module in a cascaded sub-network performs upsampling, it uses the standardized perceptual feature information obtained at its scale as the reference information. Because the standardized perceptual feature information represents facial features well, the residual module, when upsampling the smaller-scale convolutional feature information, restores the original semantics more accurately by referring to it, and thereby obtains the convolutional feature information at the corresponding scale.
[0101] As can be seen from the above embodiments, when decoding based on the multimodal feature information and the semantic feature information of the facial data template at each scale, the expression features corresponding to the original semantics are injected at each scale. Combining a multilayer perceptron with a normalization layer based on channel standardization realizes this injection more effectively, making the obtained facial data frames more accurate and effective.
[0102] Based on any embodiment of this application, please refer to Figure 7. Before the facial data template of the digital human and the audio feature information of an audio segment of the audio are acquired, the method includes:
[0103] Step S2100: From the training dataset, retrieve the audio feature information and the face data frames of the digital human for any two time steps, cross-combine them into two training samples of the same group so that the audio feature information and the face data frame within each training sample are temporally asynchronous, and associate each training sample with a face data frame of either time step as its label sample.
[0104] A training dataset can be prepared, providing a set of audio feature information and a corresponding set of facial data frames to form a mapping relationship data. This data serves as the training samples and labeled samples for the expression generation model of this application. The training dataset can contain multiple such mapping relationship data sets, and the digital humans and sample audio recordings used to generate these mapping relationship data sets can be different to facilitate feature generalization.
[0105] A facial data frame set may include the facial data frames that make up the facial animation of one digital human. The corresponding audio feature information set is obtained by dividing the audio that accompanies the facial expressions and lip movements in that facial animation into multiple audio segments and then performing audio preprocessing and audio feature extraction on each audio segment. Therefore, in the training dataset, each piece of audio feature information has a corresponding facial data frame, and the two correspond one-to-one in temporal order, forming a synchronized data unit. Specifically, the sound of the audio segment behind the audio feature information at a given time step is synchronized with the facial expressions and lip movements of the digital human presented in the facial data frame at that time step.
[0106] Based on the synchronized audio feature information and facial data frames in the training dataset, training samples and labeled samples that serve as training supervision can be constructed by flexibly combining the audio feature information and facial data frames.
[0107] To construct training samples, for example, each time any two different time-series synchronized data units are selected. Each synchronized data unit contains synchronized audio feature information and facial data frames. Then, the audio feature information from the first time-series synchronized data unit and the facial data frames from the second time-series synchronized data unit are used to form a first training sample. The facial data frames from the first time-series synchronized data unit and the audio feature information from the second time-series synchronized data unit are used to form a second training sample. This cross-construction of samples yields two training samples in the same set. Further, corresponding label samples are assigned to these two training samples.
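The cross-construction described above can be sketched as follows. The `SyncUnit` and `TrainingSample` types and the string placeholders are hypothetical; in practice the fields would hold audio feature tensors and mesh vertex data.

```python
from typing import NamedTuple, Tuple


class SyncUnit(NamedTuple):
    audio: str  # stands in for audio feature information
    frame: str  # stands in for the synchronized face data frame


class TrainingSample(NamedTuple):
    audio: str
    frame: str


def cross_pair(first: SyncUnit, second: SyncUnit) -> Tuple[TrainingSample, TrainingSample]:
    """Swap the frames of two synchronized units to build one sample group."""
    return (TrainingSample(first.audio, second.frame),
            TrainingSample(second.audio, first.frame))


unit_a = SyncUnit("audio_t1", "frame_t1")
unit_b = SyncUnit("audio_t2", "frame_t2")
sample_1, sample_2 = cross_pair(unit_a, unit_b)
print(sample_1)  # TrainingSample(audio='audio_t1', frame='frame_t2')
print(sample_2)  # TrainingSample(audio='audio_t2', frame='frame_t1')
```

Within each resulting sample the audio and the frame come from different time steps, so the frame acts only as the input face data rather than as the answer for that audio.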
[0108] In one embodiment, as shown in Figure 8, when setting label samples for the two training samples in the same group, the face data frame of either of the two synchronized data units can be fixed as the label sample for both training samples. When the loss value is later calculated, a different mandatory constraint is applied depending on whether the face data frame of the label sample is temporally synchronized with the face data frame in the training sample. It follows that two synchronized data units yield four mappings between a training sample and a label sample, obtained by combining the two training samples with the two possible label samples. These four mappings can be divided into two training sample groups of two mappings each. To support the subsequent mandatory constraints, the four mappings are grouped so that, in each group, the audio feature information of one mapping is temporally synchronized with the face data frame in the label sample. In that case, the mouth region of the face data frame in the label sample provides effective supervision information. Conversely, the audio feature information of the other training sample is temporally asynchronous with the face data frame in the label sample. In that case, the mouth region of the face data frame in the label sample usually does not match the sound represented by the audio feature information, but the facial regions outside the mouth region can still provide effective supervision information for facial expressions.
Each training sample group constructed according to this principle includes two training samples: one training sample can be used to train the ability to generate mouth movements, and the other training sample can be used to train the ability to generate facial expressions in other facial expression regions besides mouth movements.
[0109] In one embodiment, the training samples of the same group are all sampled from the same mapping relationship data in the training dataset, so that the face data frames they rely on correspond to the same digital human. This helps determine the corresponding loss value reasonably.
[0110] Step S2200: Input the two training samples in the same group into the expression generation model to perform inference and predict the predicted face data frame that is synchronized with the audio feature information in each training sample.
[0111] When training the expression generation model, training samples from the same set can be successively input into the model for inference. The expression generation model then uses the corresponding training samples—namely, the audio feature information and facial data frames in the training samples—to perform feature representation and feature decoding, thereby obtaining the corresponding predicted facial data frames. It is easy to understand that for the same set of training samples, two predicted facial data frames can be obtained.
[0112] Step S2300: Calculate the single-frame loss value of each predicted face data frame in the same training sample group relative to its corresponding label sample, aggregate the single-frame loss values within each training sample group into a single-group loss value, and aggregate the single-group loss values of multiple training sample groups into the total loss value.
[0113] To supervise the training of the facial expression generation model, it is necessary to calculate the single-frame loss value corresponding to each training sample. The single-frame loss value of a single predicted face data frame can be determined by summing the L2 norm loss calculated between each vertex in the predicted face data frame and the corresponding vertex in the corresponding label sample.
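A minimal sketch of such a per-vertex loss follows. It assumes the L2 norm is taken as the Euclidean distance between matching vertices (a squared variant would be an equally plausible reading), and the function names and toy frames are hypothetical.

```python
import math
from typing import List, Sequence

Vertex = Sequence[float]


def vertex_distance(p: Vertex, q: Vertex) -> float:
    """Euclidean (L2) distance between two 3-D vertices."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_frame_loss(predicted: List[Vertex], label: List[Vertex]) -> float:
    """Sum of per-vertex L2 distances between a predicted frame and its label."""
    if len(predicted) != len(label):
        raise ValueError("frames must contain the same number of vertices")
    return sum(vertex_distance(p, q) for p, q in zip(predicted, label))


predicted = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
label = [[3.0, 4.0, 0.0], [1.0, 1.0, 1.0]]
print(single_frame_loss(predicted, label))  # 5.0
```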
[0114] Considering that the facial data frames in the label samples of the same set of two training samples have different supervisory effects on facial regions in different training samples, in one embodiment, the single-frame loss value is determined as follows to adapt to different training samples:
[0115] When the face data frame in the label sample is synchronized with the audio feature information in the training sample, the single-frame loss value of the predicted face data frame relative to the face data frame in the label sample is calculated only from the mesh vertex data of the mouth region of the digital human.
[0116] When the face data frame in the label sample is the same as the face data frame in the training sample, the single-frame loss value of the predicted face data frame relative to the face data frame in the label sample is calculated only from the mesh vertex data of the expression regions outside the mouth region of the digital human.
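The two cases above amount to restricting the loss to a subset of vertices. The sketch below illustrates this, assuming the mouth region is given as a set of vertex indices and the per-vertex loss is the Euclidean distance; the index set, function names, and toy frames are hypothetical.

```python
import math
from typing import List, Sequence, Set

Vertex = Sequence[float]


def vertex_distance(p: Vertex, q: Vertex) -> float:
    """Euclidean (L2) distance between two 3-D vertices."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def masked_frame_loss(predicted: List[Vertex], label: List[Vertex],
                      mouth: Set[int], label_synced_with_audio: bool) -> float:
    """Supervise only the mouth when the label matches the audio, else the rest."""
    all_indices = set(range(len(label)))
    region = mouth if label_synced_with_audio else all_indices - mouth
    return sum(vertex_distance(predicted[i], label[i]) for i in sorted(region))


mouth = {0}  # hypothetical: vertex 0 belongs to the mouth region
predicted = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
label = [[3.0, 4.0, 0.0], [0.0, 0.0, 2.0], [1.0, 0.0, 0.0]]

mouth_loss = masked_frame_loss(predicted, label, mouth, label_synced_with_audio=True)
other_loss = masked_frame_loss(predicted, label, mouth, label_synced_with_audio=False)
group_loss = mouth_loss + other_loss  # the two samples of one group are summed
print(mouth_loss, other_loss, group_loss)  # 5.0 3.0 8.0
```

Because the two regions never overlap, an error in one sample's mouth prediction does not leak into the expression supervision of the other sample.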
[0117] After determining the single-frame loss value corresponding to the predicted face data frame of each training sample in the same training sample group, the single-frame loss values of the two training samples can be summed to obtain the single-group loss value.
[0118] During model training, batch training is typically used to improve efficiency. In this case, the single-group loss values of the multiple training sample groups in the same batch are further averaged or summed to form a total loss value, which is used to decide whether to continue iterating with the next batch. In one embodiment, a batch may also contain only one training sample group; this can be determined as needed.
[0119] Step S2400: Determine whether the expression generation model has converged based on the total loss value. If it has not converged, perform gradient updates and iterative training on the expression generation model until the expression generation model reaches a convergent state.
[0120] To determine whether the facial expression generation model has reached convergence and whether further iterative training is needed, a target threshold corresponding to whether the model has reached convergence can be preset. After determining the total loss value corresponding to a batch, it is compared with the target threshold. When the total loss value reaches the target threshold, it indicates that the facial expression generation model has reached convergence, and the training task is terminated. Otherwise, if the total loss value does not reach the target threshold, it indicates that the facial expression generation model has not reached convergence. Gradient updates are performed on the facial expression generation model based on the total loss value, and the weight parameters of each component are corrected through backpropagation to make it further approach convergence. Then, the next batch of training begins, and new training samples are called from the training dataset for iterative training until the facial expression generation model reaches convergence.
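The threshold-based stopping rule can be illustrated with a toy loop. The sketch replaces the expression generation model with a single scalar weight and a quadratic loss, and adds a `max_batches` guard; all of these, including the learning rate, are assumptions made so the example stays self-contained.

```python
from typing import Tuple


def train(weight: float, target_threshold: float, lr: float = 0.1,
          max_batches: int = 1000) -> Tuple[float, int]:
    """Iterate until the total loss reaches the threshold or batches run out."""
    for batch in range(max_batches):
        total_loss = (weight - 3.0) ** 2    # toy total loss for this batch
        if total_loss <= target_threshold:  # converged: stop training
            return weight, batch
        gradient = 2.0 * (weight - 3.0)     # d(loss)/d(weight)
        weight -= lr * gradient             # gradient update
    return weight, max_batches


final_weight, batches_used = train(weight=0.0, target_threshold=1e-6)
print(batches_used < 1000)  # True
```

The loss is checked before each update, so training ends on the first batch whose total loss already meets the target.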
[0121] In one embodiment, the facial expression generation model can be trained in two stages. First, in the pre-training stage, the model is trained to convergence using training samples corresponding to the pronunciation of a first language. Then, Chinese language data is used to resample the training samples in the training dataset to achieve data augmentation. Finally, the augmented training samples are used to fine-tune the model until convergence. This approach leverages the advantage of easily acquiring training samples for foreign languages while rapidly developing a fine-tuned model adapted to a specific language.
[0122] As can be seen from the above embodiments, by using audio feature information and facial data frames corresponding to different time sequences to construct the same set of training samples, and using corresponding facial data frames as label samples for two training samples in the same set of training samples, it is determined that the label samples in some training samples can be used to supervise the generation of the mouth region, and the label samples in other training samples can be used to supervise the generation of other facial regions besides the mouth region. The loss values corresponding to the same label sample in these two cases are summarized to determine the total loss value, and then supervised updates are performed based on the total loss, thereby achieving self-supervised training without relying on manual labels, which can save training costs. In addition, since the loss calculation is subject to mandatory constraints when calculating the loss value, the calculation of the loss value between the mouth region and other facial regions is decoupled, making the model weight update process more accurate and efficient, improving the training efficiency of the model, and enabling the model to converge faster. The resulting expression generation model can delicately and accurately control the generation of mouth movements when put into online inference, obtain accurate facial data frames, and make the facial images obtained from these facial data frames more delicate, accurate, natural and smooth.
[0123] Based on any embodiment of this application, please refer to Figure 9. Before the training samples and label samples are retrieved from the training dataset, the method includes the following steps:
[0124] Step S3100: Obtain the basic dataset, which includes sample audio and sample facial animation. The sample facial animation is described by three-dimensional model data and is synchronized with the sound in the sample audio in terms of facial expressions and mouth movements.
[0125] Open-source data can be used as the base dataset to create the training dataset required for this application, thereby improving the efficiency of sample production and reducing costs.
[0126] The aforementioned basic dataset includes sample audio recordings and sample facial animations. The sample audio recordings are audio data recorded from natural human speech, or machine-synthesized human voice audio data. The sample facial animations are digital facial animation data pre-created based on the facial expressions and lip movements corresponding to the sounds in the sample audio recordings, and can be described in the form of corresponding 3D model data of a digital human. For each sample facial animation, the facial expressions and lip movements are synchronized with the sounds in the corresponding sample audio recordings, thus making it one of the ideal materials for creating the training dataset of this application. Of course, in other embodiments, the sample audio recordings can also be manually recorded, and the corresponding sample facial animations can be manually created; these can also be included in the scope of the aforementioned basic dataset.
[0127] Step S3200: Convert each image frame in the sample facial animation into a digital human face data frame to form a face data frame set. The face data frame contains the mesh vertex data of the face region of the digital human, which is used to describe the three-dimensional spatial position data of each vertex of the face region of the digital human.
[0128] The sample facial animation includes multiple image frames. These image frames describe the corresponding 3D model data of the digital human in a certain format. To avoid format incompatibility and adapt to the format specifications of the digital human's mesh vertex data in this application, the image frames in the sample facial animation can be converted into facial data frames represented in the mesh vertex data format of the digital human in this application, so that the representation of each vertex data is compatible with the 3D model of the digital human in this application. The 3D spatial position data of each vertex in the facial region of the converted digital human is described through the mesh vertex data of the facial data frames. In this way, a sample facial animation can obtain a set of facial data frames, which contains multiple facial data frames.
[0129] Step S3300: Corresponding to the time period occupied by each image frame of the sample facial animation, the sample audio is divided into multiple audio segments, so that each image frame corresponds synchronously with each audio segment. Audio preprocessing is performed on each audio segment to obtain its frequency domain audio feature information, thus forming an audio feature information set.
[0130] For the sample audio corresponding to the sample facial animation, the sample audio can be segmented and sampled according to the time period occupied by each image frame in the sample facial animation, and then divided into multiple audio segments. Thus, each image frame has an audio segment with a corresponding time sequence, so that each image frame in the sample facial animation corresponds to each audio segment in the sample audio.
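As an illustration of this per-frame segmentation, the sketch below splits a list of audio samples into consecutive segments, one per image frame. It assumes a constant frame rate that divides the sample rate evenly and drops any trailing partial segment; the values 16000 Hz and 25 fps are hypothetical.

```python
from typing import List


def split_by_frames(samples: List[float], sample_rate: int, fps: int) -> List[List[float]]:
    """Cut audio into one segment per animation frame, in temporal order."""
    per_frame = sample_rate // fps  # assumes sample_rate is divisible by fps
    count = len(samples) // per_frame  # a trailing partial segment is dropped
    return [samples[i * per_frame:(i + 1) * per_frame] for i in range(count)]


samples = [float(i) for i in range(1600)]  # 0.1 s of toy audio at 16 kHz
segments = split_by_frames(samples, sample_rate=16000, fps=25)
print(len(segments), len(segments[0]))  # 2 640
```

Segment `i` then lines up with image frame `i`, which is the one-to-one temporal correspondence the training data relies on.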
[0131] For each audio segment in the sample audio, the audio preprocessing and feature extraction methods described above can be applied, including windowing, framing, time-frequency transformation, and filtering, to obtain the frequency domain features of each audio segment, thus forming the corresponding audio feature information. In this way, a sample audio can obtain an audio feature information set composed of the audio feature information of its multiple audio segments.
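The segmentation and frequency-domain steps above can be sketched as follows. This is a minimal illustration, not the implementation of this application: the function names, the Hann window, the log-magnitude spectrum, and the numbers (1600 Hz sample rate, 25 frames per second, 32-sample windows) are all assumptions chosen for the example, and the filtering step (for instance a filter bank) is omitted.

```python
import cmath
import math


def split_audio_by_frames(samples, sample_rate, fps):
    """Cut mono PCM samples into one audio segment per animation frame."""
    seg_len = sample_rate // fps  # samples spanned by one image frame
    n_frames = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_frames)]


def hann_window(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrum(frame):
    """Naive O(N^2) DFT magnitudes for bins 0..N/2; adequate for a sketch."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def segment_features(segment, win_len, hop):
    """Window the segment, transform each window, return log-magnitude rows."""
    window = hann_window(win_len)
    rows = []
    for start in range(0, len(segment) - win_len + 1, hop):
        frame = [s * w for s, w in zip(segment[start:start + win_len], window)]
        rows.append([math.log(m + 1e-8) for m in magnitude_spectrum(frame)])
    return rows


# Hypothetical input: 1 s of a 200 Hz tone at 1600 Hz, animated at 25 fps.
sample_rate, fps = 1600, 25
audio = [math.sin(2 * math.pi * 200 * t / sample_rate)
         for t in range(sample_rate)]
segments = split_audio_by_frames(audio, sample_rate, fps)
feature_set = [segment_features(seg, win_len=32, hop=16) for seg in segments]
```

Here `feature_set[i]` plays the role of the audio feature information of the i-th audio segment, aligned with the i-th image frame. A production pipeline would normally use an FFT routine rather than the naive transform shown.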
[0132] Step S3400: Construct mapping relationship data between the audio feature information set and the face data frame set, and store it in the training dataset.
[0133] Finally, for each sample, a mapping relationship is established between the audio feature information set of the sample audio and the face data frame set of the corresponding sample facial animation, forming mapping relationship data, which is stored in the training dataset of this application. Subsequently, the training samples and label samples required for training the expression generation model of this application can be constructed from this mapping relationship data.
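One way to picture the mapping relationship data is as a record that pairs the two sets index by index. The sketch below is illustrative only: `MappingRecord`, `build_mapping`, the `sample_id` key, and the in-memory dictionary standing in for the training dataset are hypothetical names, and the storage format is not specified in the text.

```python
from dataclasses import dataclass


@dataclass
class MappingRecord:
    """Hypothetical record tying one sample's audio features to its face frames."""
    sample_id: str
    audio_features: list  # audio_features[i]: features of the i-th audio segment
    face_frames: list     # face_frames[i]: mesh vertices [[x, y, z], ...] of frame i

    def pair(self, i):
        """Return the time-aligned (audio features, face frame) pair at index i."""
        return self.audio_features[i], self.face_frames[i]


def build_mapping(sample_id, audio_features, face_frames):
    """Pair the two sets index by index; they must have equal length."""
    if len(audio_features) != len(face_frames):
        raise ValueError("each image frame needs exactly one audio segment")
    return MappingRecord(sample_id, list(audio_features), list(face_frames))


# Toy data: two audio segments and two single-vertex face data frames.
training_dataset = {}
record = build_mapping(
    "sample-0001",
    audio_features=[[0.1, 0.2], [0.3, 0.4]],
    face_frames=[[[0.0, 0.0, 0.0]], [[0.0, 0.1, 0.0]]],
)
training_dataset[record.sample_id] = record
```

Keeping the pairing explicit makes it straightforward to draw a synchronized pair with `record.pair(i)`, or to combine entries at different indices when constructing training and label samples.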
[0134] As can be seen from the above embodiments, the training dataset of this application can be created using sample audio and corresponding sample facial animation. These basic data are easy to obtain. By applying technical means to preprocess these data, the training samples and label samples required for training this application can be obtained. No manual intervention is required during the process, and it can be automated, which is very economical and efficient.
[0135] Based on any embodiment of this application, referring to Figure 10, acquiring the facial data template of the digital human and the audio feature information of an audio segment of the spoken audio includes:
[0136] Step S1110: Obtain the audio recording and digital human feature identifier;
[0137] When it is necessary to use the facial expression generation model of this application to generate facial animation, a digital human can be specified for it, and corresponding audio can be provided as input data.
[0138] The input data can be adapted to different business scenarios and can come from various sources. For example, the audio data can be real-time recording data submitted by the anchor user in the live broadcast room, machine-synthesized speech data obtained by machine speech synthesis based on preset text in the virtual tour guide scenario, or other forms of audio data.
[0139] Step S1120: Divide the audio into multiple audio segments according to a preset duration, perform audio preprocessing on each audio segment, and obtain its frequency domain audio feature information;
[0140] When processing audio data, the aforementioned facial expression generation model typically segments audio data into segments according to a certain duration standard. This establishes a fixed correspondence between the duration of the facial data frame and the duration of the audio segment, achieving standardization. Therefore, the audio can be segmented according to the preset duration determined after standardization to obtain multiple audio segments. Then, each audio segment is processed accordingly using the audio processing method described above to obtain the corresponding audio feature information in the frequency domain for each audio segment.
[0141] Step S1130: Determine the corresponding digital human's facial data template based on the digital human feature identifier;
[0142] Since the facial expression generation model of this application has been trained on a large number of samples and has the ability to generate facial data frames for different digital humans, a digital human facial data template library can be provided in advance, and a mapping relationship between the facial data templates and digital human feature identifiers can be established. When it is necessary to create a facial animation of a corresponding digital human, the corresponding digital human feature identifier can be provided, and then the corresponding digital human facial data template can be called according to the digital human feature identifier.
[0143] Step S1140: Call the audio feature information of each audio segment of the audio one by one, and construct the input data of the expression generation model with the facial data template of the digital human, so as to generate each facial image through the expression generation model and obtain the facial animation corresponding to the audio.
[0144] The audio is segmented into multiple audio segments, while the digital human's facial data template is essentially a single facial data frame. According to the processing logic of the expression generation model, the audio feature information of each audio segment, together with the digital human's facial data template, constitutes one set of input data for the expression generation model. These sets of input data are called one by one and provided to the expression generation model to generate the corresponding facial data frames. That is, steps S1100 to S1400 of this application are executed iteratively, once for each set of input data. Each iteration obtains the facial data frame corresponding to one audio segment, and its facial image is rendered accordingly. Finally, multiple facial images are obtained, which are assembled in temporal order to form the facial animation corresponding to the audio. Based on the capabilities learned by the expression generation model of this application during the training phase, the facial expressions and lip movements in the facial animation generated in this way are precisely synchronized with the sound of the audio, resulting in a natural and smooth overall picture.
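The iteration over steps S1120 to S1140 can be sketched as a loop that reuses one facial data template for every audio segment. Everything named here is an assumption for illustration: `split_by_duration`, `generate_face_frames`, the dictionary used as a template library, and especially `toy_features` and `toy_model`, which are stand-ins and bear no relation to the trained expression generation model. Rendering of each facial data frame is left out.

```python
def split_by_duration(samples, sample_rate, segment_ms):
    """Cut the audio into consecutive segments of a preset duration."""
    seg_len = sample_rate * segment_ms // 1000
    return [samples[i:i + seg_len]
            for i in range(0, len(samples) - seg_len + 1, seg_len)]


def generate_face_frames(samples, sample_rate, segment_ms, digital_human_id,
                         template_library, extract_features, model):
    """Run one model call per audio segment against a fixed face template."""
    template = template_library[digital_human_id]  # step S1130: look up template
    frames = []
    for segment in split_by_duration(samples, sample_rate, segment_ms):  # S1120
        features = extract_features(segment)
        frames.append(model(template, features))   # step S1140: one input set
    return frames


# Stand-ins, for illustration only: a one-number "feature" and a "model" that
# lowers vertex 1 (a pretend jaw vertex) in proportion to segment loudness.
def toy_features(segment):
    return sum(abs(s) for s in segment) / len(segment)


def toy_model(template, loudness):
    frame = [list(v) for v in template]  # copy so the template stays unchanged
    frame[1][1] -= loudness
    return frame


library = {"human-a": [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]}
audio = [0.0] * 40 + [0.5] * 40  # 80 samples at 1000 Hz: silence, then sound
frames = generate_face_frames(audio, 1000, 40, "human-a", library,
                              toy_features, toy_model)
```

The returned list is ordered by audio segment, so rendering the frames in list order yields the image sequence for the facial animation.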
[0145] As can be seen from the above embodiments, the facial expression generation model of this application can serve the production of digital facial animation, thereby providing efficient basic services for various downstream application layer services.
[0146] Referring to Figure 11, according to one aspect of this application, a facial image generation apparatus is provided. In one embodiment, it includes a data acquisition module 1100, a feature extraction module 1200, a feature decoding module 1300, and an image rendering module 1400. The data acquisition module 1100 is configured to acquire a facial data template of a digital human and audio feature information of an audio segment of spoken audio, where the facial data template includes mesh vertex data of the facial region of the digital human, and the audio feature information includes audio features of the audio segment obtained in the frequency domain. The feature extraction module 1200 is configured to use a feature extraction network in an expression generation model to extract the semantic feature information corresponding to the facial data template and to the audio feature information respectively, and then fuse them into multimodal feature information. The feature decoding module 1300 is configured to use the feature decoding network in the expression generation model to generate a facial data frame synchronized with the audio segment based on the multimodal feature information and the semantic feature information of the facial data template, the facial data frame containing mesh vertex data obtained by transforming the facial data template. The image rendering module 1400 is configured to render the three-dimensional model of the digital human based on the mesh vertex data of the facial data frame to obtain a facial image synchronized with the audio segment.
[0147] Based on any embodiment of this application, the feature extraction module 1200 includes: an audio encoding unit configured to use an audio encoder in the feature extraction network to extract semantic feature information of the audio feature information to obtain audio semantic feature information; an image encoding unit configured to use an expression encoder in the feature extraction network to extract semantic feature information corresponding to the face data template at multiple preset scales, and extract its deep semantics as expression semantic feature information based on the semantic feature information at the smallest scale; and a feature fusion unit configured to use a feature fusion network in the feature extraction network to synthesize the audio semantic feature information and the expression semantic feature information into multimodal feature information.
[0148] Based on any embodiment of this application, the feature decoding module 1300 includes: a primary upsampling unit, configured to use the first residual module in the feature decoding network to perform equal-scale convolution operations on the multimodal feature information to obtain corresponding convolutional feature information; a multi-scale upsampling unit, configured to use multiple residual modules cascaded with the first residual module corresponding to each scale, wherein each scale residual module refers to the semantic feature information of the face data template at its corresponding scale, and upsamples the convolutional feature information output by the residual module of its predecessor to obtain convolutional feature information at the corresponding scale; and a data frame output unit, configured to use the convolutional feature information obtained by the residual module corresponding to the highest scale as the face data frame.
[0149] Based on any embodiment of this application, the multi-scale upsampling unit includes: a feature perception subunit, configured to extract, through the multilayer perceptron of each scale's residual module, perceptual feature information from the semantic feature information of the face data template obtained at the corresponding scale; a normalization processing subunit, configured to perform channel normalization preprocessing on the perceptual feature information obtained at the corresponding scale through the normalization layer of each scale's residual module to obtain the corresponding normalized perceptual feature information; and an upsampling processing subunit, configured to use the normalized perceptual feature information obtained at the corresponding scale as reference information, and upsample the convolutional feature information output by the residual module of the preceding stage to obtain the convolutional feature information at the corresponding scale.
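The three subunits can be pictured with a deliberately simplified sketch. It shows only one possible reading: a two-layer perceptron produces perceptual features from the template semantics, these are normalized across channels, and the result is added as a per-channel offset to nearest-neighbor upsampled features. The text does not state how the reference information is applied, so the additive use, the one-dimensional feature layout, and all function names are assumptions, and the convolutions and skip connection of a real residual module are omitted.

```python
import math


def mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron with ReLU, written on plain lists."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


def channel_normalize(values, eps=1e-5):
    """Zero-mean, unit-variance normalization across channels."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def upsample_nearest(channels, factor=2):
    """Repeat every element `factor` times along the length axis."""
    return [[v for v in ch for _ in range(factor)] for ch in channels]


def reference_guided_upsample(prev_features, template_semantics, mlp_params):
    """Upsample prev_features, then add a per-channel reference offset."""
    reference = channel_normalize(mlp(template_semantics, *mlp_params))
    upsampled = upsample_nearest(prev_features)
    return [[v + r for v in ch] for ch, r in zip(upsampled, reference)]


# Toy shapes: 2 channels of length 3 and a 2-dimensional template feature.
params = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0],   # first layer (identity)
          [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])   # second layer (identity)
out = reference_guided_upsample([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
                                [1.0, 3.0], params)
```

With these toy inputs, `out` has two channels of length 6. Other ways of using the reference, such as a learned per-channel scale and shift, would fit the same description equally well.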
[0150] Based on any embodiment of this application, the apparatus further includes the following modules, which operate before the data acquisition module 1100: a data retrieval module, configured to retrieve audio feature information of any two audio segments with asynchronous temporal relationships and digital human facial data frames from the training dataset to form two training samples in the same group, and associate each training sample with a facial data frame of arbitrary temporal order as its label sample; a training prediction module, configured to input the two training samples in the same group into the expression generation model to perform inference and predict the predicted facial data frames that are synchronized with the audio feature information in each training sample; a loss determination module, configured to calculate the single-frame loss value of each predicted facial data frame in the same group of training samples relative to its corresponding label sample, summarize the single-frame loss values of the training samples in the same group to obtain a single-group loss value, and summarize the single-group loss values of multiple groups of training samples to obtain a total loss value; and an iterative decision module, configured to decide whether the expression generation model has converged based on the total loss value, and if it has not converged, to perform gradient updates and iterative training on the expression generation model until the expression generation model reaches a convergent state.
[0151] Based on any embodiment of this application, the loss determination module is further configured to: when the face data frame in the label sample is synchronized with the audio feature information in the training sample, calculate the single-frame loss value of the predicted face data frame relative to the face data frame in the label sample based only on the mesh vertex data of the mouth region of the digital human; and when the face data frame in the label sample is the same as the face data frame in the training sample, calculate the single-frame loss value of the predicted face data frame relative to the face data frame in the label sample based only on the mesh vertex data of the other expression regions outside the mouth region of the digital human.
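The region-dependent loss and its aggregation can be sketched as below. The specifics are assumptions: mean squared vertex distance is used as the distance measure, a plain sum is used wherever the text says "summarize", and `mouth_ids` is a hypothetical list of the vertex indices belonging to the mouth region.

```python
def masked_mse(pred, label, vertex_ids):
    """Mean squared distance between pred and label over selected vertices."""
    total = 0.0
    for i in vertex_ids:
        total += sum((p - q) ** 2 for p, q in zip(pred[i], label[i]))
    return total / len(vertex_ids)


def single_frame_loss(pred, label, label_kind, mouth_ids):
    """Pick the vertex region according to how the label relates to the input."""
    if label_kind == "synced_with_audio":
        ids = list(mouth_ids)                      # mouth region only
    elif label_kind == "same_as_input_face":
        mouth = set(mouth_ids)                     # every vertex except the mouth
        ids = [i for i in range(len(pred)) if i not in mouth]
    else:
        raise ValueError(f"unknown label kind: {label_kind}")
    return masked_mse(pred, label, ids)


def aggregate(losses):
    """Assumed aggregation: a plain sum (a mean would be equally plausible)."""
    return sum(losses)


# Toy frame with three vertices; vertex 1 is treated as the mouth region.
pred = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
label = [[0.0, 0.0, 0.0]] * 3
mouth_ids = [1]
loss_a = single_frame_loss(pred, label, "synced_with_audio", mouth_ids)
loss_b = single_frame_loss(pred, label, "same_as_input_face", mouth_ids)
group = aggregate([loss_a, loss_b])  # single-group loss for the two samples
total = aggregate([group, 1.0])      # total loss over two groups (second is made up)
```

Restricting each label to its own region lets an audio-synchronized label supervise only the mouth while a label identical to the input face supervises only the remaining expression regions, which matches the split described above.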
[0152] Based on any embodiment of this application, the apparatus further includes the following modules, which operate before the data acquisition module 1100: a material acquisition module, configured to acquire a basic dataset, the basic dataset including sample audio and sample facial animation, the sample facial animation being described by three-dimensional model data and maintaining synchronization with the sound in the sample audio in terms of facial expressions and mouth movements; an image processing module, configured to convert each image frame in the sample facial animation into a digital human face data frame, constituting a face data frame set, the face data frame containing mesh vertex data of the digital human's face region, used to describe the three-dimensional spatial position data of each vertex of the digital human's face region; an audio processing module, configured to divide the sample audio into multiple audio segments corresponding to the time period occupied by each image frame of the sample facial animation, so that each image frame corresponds synchronously with its audio segment, and to perform audio preprocessing on each audio segment to obtain its frequency domain audio feature information, constituting an audio feature information set; and a construction and storage module, configured to construct mapping relationship data between the audio feature information set and the face data frame set, and store it in the training dataset.
[0153] Based on any embodiment of this application, the data acquisition module 1100 includes: an input acquisition unit configured to acquire the audio and digital human feature identifier; an audio segmentation unit configured to segment the audio into multiple audio segments according to a preset duration, perform audio preprocessing on each audio segment, and obtain its frequency domain audio feature information; a template calling unit configured to determine the corresponding digital human's facial data template based on the digital human feature identifier; and an iteration preparation unit configured to sequentially call the audio feature information of each audio segment of the audio, and construct the input data of the expression generation model with the digital human's facial data template, so as to generate each facial image corresponding to the audio through the expression generation model and obtain the facial animation corresponding to the audio.
[0154] Another embodiment of this application also provides a face image generation device. Figure 12 shows a schematic diagram of the internal structure of the face image generation device. The device includes a processor, a computer-readable storage medium, a memory, and a network interface connected via a system bus. The computer-readable, non-volatile storage medium stores an operating system, a database, and computer-readable instructions. The database can store information sequences, and the computer-readable instructions, when executed by the processor, enable the processor to implement a face image generation method.
[0155] The processor of the face image generation device provides computing and control capabilities to support the operation of the entire device. The memory of the face image generation device can store computer-readable instructions, which, when executed by the processor, cause the processor to perform the face image generation method of this application. The network interface of the face image generation device is used for communication with a terminal.
[0156] Those skilled in the art will understand that the structure shown in Figure 12 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the face image generation device to which the present application is applied. A specific face image generation device may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.
[0157] In this embodiment, the processor is used to execute the specific functions of each module in Figure 11, and the memory stores the program code and various data required to execute the above modules or sub-modules. The network interface is used for data transmission with user terminals or servers. The non-volatile readable storage medium in this embodiment stores the program code and data required to execute all modules in the face image generation apparatus of this application, and the server can call this program code and data to execute the functions of all modules.
[0158] This application also provides a non-volatile readable storage medium storing computer-readable instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the face image generation method of any embodiment of this application.
[0159] This application also provides a computer program product, including a computer program / instructions that, when executed by one or more processors, implement the steps of the method described in any embodiment of this application.
[0160] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments of this application can be implemented by a computer program instructing related hardware. This computer program can be stored in a non-volatile readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. The aforementioned storage medium can be a computer-readable storage medium such as a magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM).
[0161] In summary, this application can generate facial images corresponding to audio segments of the audio in an end-to-end manner based on the audio and the facial data frame template of the digital human. It can be used to construct facial animations, and the generated facial animations have delicate and smooth expressions, with facial movements and mouth movements synchronized with the sound of the audio segments. It is suitable for various business scenarios and has economic potential.
Claims
1. A face image generation method, characterized by comprising the following steps: calling audio feature information of any two audio clips with different time sequences and digital human face data frames from a training data set to form two training samples in a same group, and associating a face data frame of any time sequence with each training sample as a label sample; inputting the two training samples in the same group into an expression generation model to perform inference, and predicting a predicted face data frame synchronized with the audio feature information in each training sample; calculating a single-frame loss value of each predicted face data frame relative to the corresponding label sample in the same group of training samples, summarizing the single-frame loss values of the training samples in the same group to obtain a single-group loss value, and summarizing the single-group loss values of multiple groups of training samples to obtain a total loss value; determining whether the expression generation model converges according to the total loss value, and if it has not converged, performing gradient update and iterative training on the expression generation model until the expression generation model converges; obtaining a digital human face data template and audio feature information of an audio clip with sound, wherein the face data template comprises mesh vertex data of a face region of the digital human, and the audio feature information comprises audio features obtained in a frequency domain; extracting semantic feature information corresponding to the face data template and the audio feature information respectively by using a feature extraction network in the expression generation model, and fusing the semantic feature information into multi-modal feature information; generating a face data frame synchronized with the audio clip according to the multi-modal feature information and the semantic feature information of the face data template by using a feature decoding network in the expression generation model, wherein the face data frame comprises mesh vertex data obtained by transforming the face data template; and rendering a three-dimensional model of the digital human according to the mesh vertex data of the face data frame to obtain a face image synchronized with the audio clip.
2. The face image generation method according to claim 1, characterized in that extracting semantic feature information corresponding to the face data template and the audio feature information respectively by using a feature extraction network in the expression generation model, and fusing the semantic feature information into multi-modal feature information, comprises: extracting semantic feature information of the audio feature information by using an audio encoder in the feature extraction network to obtain audio semantic feature information; extracting semantic feature information corresponding to the face data template at multiple preset scales by using an expression encoder in the feature extraction network, and extracting deep semantic information of the face data template based on the semantic feature information at the smallest scale as expression semantic feature information; and fusing the audio semantic feature information and the expression semantic feature information into multi-modal feature information by using a feature fusion network in the feature extraction network.
3. The face image generation method according to claim 2, characterized in that generating a face data frame synchronized with the audio clip according to the multi-modal feature information and the semantic feature information of the face data template by using a feature decoding network in the expression generation model comprises: performing an equal-scale convolution operation on the multi-modal feature information by using a first residual module in the feature decoding network to obtain corresponding convolutional feature information; using multiple residual modules, one for each scale, cascaded with the first residual module, wherein the residual module at each scale refers to the semantic feature information of the face data template at its corresponding scale and upsamples the convolutional feature information output by the preceding residual module to obtain the convolutional feature information at the corresponding scale; and using the convolutional feature information obtained by the residual module corresponding to the highest scale as the face data frame.
4. The face image generation method according to claim 3, characterized in that the residual module at each scale referring to the semantic feature information of the face data template at its corresponding scale and upsampling the convolutional feature information output by the preceding residual module to obtain the convolutional feature information at the corresponding scale comprises: the residual module at each scale extracting, through the multilayer perceptron at its corresponding scale, perceptual feature information from the semantic feature information of the face data template obtained at the corresponding scale; the residual module at each scale performing channel normalization preprocessing on the perceptual feature information obtained at the corresponding scale through the normalization layer at its corresponding scale to obtain the corresponding normalized perceptual feature information; and the residual module at each scale using the normalized perceptual feature information obtained at its corresponding scale as reference information, and upsampling the convolutional feature information output by the preceding residual module to obtain the convolutional feature information at the corresponding scale.
5. The face image generation method according to any one of claims 1 to 4, characterized in that calculating the single-frame loss value of each predicted face data frame relative to its corresponding label sample in the same group of training samples comprises: when the face data frame in the label sample is synchronized with the audio feature information in the training sample, calculating the single-frame loss value of the predicted face data frame relative to the face data frame in the label sample based only on the mesh vertex data of the mouth region of the digital human; and when the face data frame in the label sample is the same as the face data frame in the training sample, calculating the single-frame loss value of the predicted face data frame relative to the face data frame in the label sample based only on the mesh vertex data of the other expression regions outside the mouth region of the digital human.
6. The face image generation method according to any one of claims 1 to 4, characterized in that, before retrieving training samples and label samples from the training dataset, the method comprises: obtaining a basic dataset, which includes sample audio and sample facial animation, the sample facial animation being described using three-dimensional model data and being synchronized with the sound in the sample audio in terms of facial expressions and lip movements; converting each image frame in the sample facial animation into a digital human face data frame to form a face data frame set, the face data frame containing the mesh vertex data of the digital human's face region, which is used to describe the three-dimensional spatial position data of each vertex of the digital human's face region; dividing the sample audio into multiple audio segments corresponding to the time period occupied by each image frame of the sample facial animation, so that each image frame corresponds synchronously with its audio segment, and performing audio preprocessing on each audio segment to obtain its frequency domain audio feature information, thus forming an audio feature information set; and constructing mapping relationship data between the audio feature information set and the face data frame set, and storing it in the training dataset.
7. The face image generation method according to any one of claims 1 to 4, characterized in that obtaining the digital human face data template and the audio feature information of an audio clip with sound comprises: obtaining the audio and a digital human feature identifier; dividing the audio into multiple audio segments according to a preset duration, and performing audio preprocessing on each audio segment to obtain its frequency domain audio feature information; determining the face data template of the corresponding digital human based on the digital human feature identifier; and calling the audio feature information of each audio segment of the audio one by one and constructing, together with the face data template of the digital human, the input data of the expression generation model, so as to generate each face image through the expression generation model and obtain the facial animation corresponding to the audio.
8. A face image generation apparatus, characterized by comprising: a data retrieval module, configured to retrieve audio feature information of any two audio segments with asynchronous temporal relationships and digital human face data frames from the training dataset to form two training samples in the same group, and associate each training sample with a face data frame of arbitrary temporal order as its label sample; a training prediction module, configured to input the two training samples in the same group into the expression generation model to perform inference and predict the predicted face data frame that is synchronized with the audio feature information in each training sample; a loss determination module, configured to calculate the single-frame loss value of each predicted face data frame in the same group of training samples relative to its corresponding label sample, summarize the single-frame loss values of the training samples in the same group to obtain a single-group loss value, and summarize the single-group loss values of multiple groups of training samples to obtain a total loss value; an iterative decision module, configured to determine whether the expression generation model has converged based on the total loss value, and if the model has not converged, to perform gradient updates and iterative training on the expression generation model until the expression generation model reaches a convergent state; a data acquisition module, configured to acquire a face data template of a digital human and audio feature information of an audio segment of spoken audio, wherein the face data template contains mesh vertex data of the face region of the digital human, and the audio feature information contains audio features of the audio segment obtained in the frequency domain; a feature extraction module, configured to use the feature extraction network in the expression generation model to extract the semantic feature information corresponding to the face data template and the audio feature information respectively, and then fuse them into multimodal feature information; a feature decoding module, configured to use the feature decoding network in the expression generation model to generate a face data frame synchronized with the audio segment based on the multimodal feature information and the semantic feature information of the face data template, wherein the face data frame contains mesh vertex data obtained by transforming the face data template; and an image rendering module, configured to render the three-dimensional model of the digital human based on the mesh vertex data of the face data frame, thereby obtaining a face image synchronized with the audio segment.
9. A face image generation device comprising a central processing unit and a memory, characterized in that the central processing unit is configured to invoke and run a computer program stored in the memory to perform the steps of the method as described in any one of claims 1 to 7.
10. A non-volatile readable storage medium, characterized in that it stores a computer program in the form of computer-readable instructions, which, when invoked and run by a computer, performs the steps included in the method as described in any one of claims 1 to 7.
11. A computer program product, characterized in that it includes a computer program / instructions which, when executed by a processor, perform the steps of the method as described in any one of claims 1 to 7.