Digital human multi-language content generation system based on cross-language timbre cloning
The digital human multilingual content generation system, which uses cross-language timbre cloning, solves the problems of cross-language timbre transfer and facial expression synchronization, and achieves high-quality multilingual digital human video generation, while reducing the training data requirements and costs.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- 北京泽桥传媒科技股份有限公司
- Filing Date
- 2026-03-18
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies struggle to achieve high-quality voice transfer across languages, resulting in speech with noticeable accents, unnatural rhythms, stiff facial expressions, and difficulty in precise synchronization. Furthermore, the process of building digital humans is complex and costly, making it difficult to meet the needs for rapid deployment and personalization.
The digital human multilingual content generation system employs cross-language timbre cloning. It decouples timbre from language through a multilingual timbre cloning module, a multimodal identity preservation module, and a multilingual lip-sync and facial expression driving module, generating digital human videos with consistent voice, synchronized lip movements, and natural facial expressions.
It improves the naturalness and realism of cross-language speech synthesis, expands the range of languages supported by digital human systems, reduces training data requirements and costs, and ensures the synchronization and consistency of facial expressions and speech.
Figure CN122245284A_ABST
Description
Technical Field
[0001] This application belongs to the field of artificial intelligence technology, specifically relating to a digital human multilingual content generation system based on cross-language timbre cloning.
Background Technology
[0002] With the rapid development of artificial intelligence and digital human technology, the demand for multilingual digital content generation in fields such as news broadcasting, education and entertainment, virtual customer service, and cross-cultural communication is increasing. Current technical solutions typically rely on high-quality speech synthesis in a single language and pre-recorded facial motion libraries, which present the following technical bottlenecks when facing multilingual and personalized needs:
[0003] First, existing speech cloning technologies are mostly trained on specific languages, and their timbre features are strongly bound to the training language. When applied to new languages, the synthesized speech often exhibits obvious accents and unnatural rhythms, making it difficult to achieve high-quality timbre transfer across languages. In particular, for low-resource languages with scarce training data (such as some minority languages and dialects), the synthesis quality drops sharply, severely limiting the widespread adoption of digital human technology.
[0004] In multimodal content generation, existing systems typically handle speech synthesis and facial animation generation independently. This results in the difficulty of accurately synchronizing the generated digital human's facial movements (especially lip movements) with the content and rhythm of the synthesized speech, leading to stiff and unnatural expressions. Although existing technologies have attempted to use audio to directly drive general facial models, it is difficult to maintain the stability of specific speaker identity features, which easily leads to the phenomenon of "identity drift," where the digital human's facial appearance becomes inconsistent in different language or expression scenarios.
[0005] To achieve the rapid construction of new digital identities, existing solutions often require the collection of massive amounts of multi-angle audio and video data of the target person, as well as long-term, high-computing-power model training. This process is complex and costly, making it difficult to meet the commercial needs for rapid deployment and personalized customization.
[0006] Therefore, the industry urgently needs to develop a multilingual digital human content generation system that can effectively decouple timbre from language, adapt to low-resource languages, and efficiently generate lip-synced, vivid, natural, and highly consistent digital human content.
Summary of the Invention
[0007] This application provides a digital human multilingual content generation system based on cross-language timbre cloning. It aims to solve three problems in existing technologies: high-quality cross-language timbre transfer is difficult to achieve, facial movements are difficult to synchronize accurately with speech content and rhythm, and facial expressions are stiff and unnatural.
[0008] A digital human multilingual content generation system based on cross-language timbre cloning includes:
[0009] The input interface module receives multimodal input containing target language content and a specified target speaker identifier;
[0010] A multilingual timbre cloning module is connected to the input interface module. Based on the target language content, it generates or extracts the corresponding target language speech data, phoneme sequences and prosodic features, and integrates the timbre of the source speaker.
[0011] A multimodal identity preservation module is used to store and load a drivable three-dimensional facial neural representation based on neural radiance field (NeRF) or three-dimensional Gaussian splatting technology, corresponding to the target speaker identifier;
[0012] The multilingual lip-sync and facial expression driving module is connected to the multilingual timbre cloning module and the multimodal identity preservation module, respectively. It uses the target language speech data, phoneme sequence and prosodic features as driving sources to generate a facial driving parameter sequence that is highly synchronized with the speech content and decoupled from the identity.
[0013] An end-to-end multilingual content generation engine integrates the multilingual timbre cloning module, the multimodal identity preservation module, and the multilingual lip-sync and facial expression driving module to generate digital human multilingual videos with consistent voice, synchronized lip movements, and natural facial expressions based on multimodal input.
[0014] Optionally, the multilingual timbre cloning module includes: a timbre coding network for extracting speaker timbre embedding vectors from the speech samples of the source speaker;
[0015] A multilingual speech synthesis network, connected to the timbre encoding network, is used to synthesize target language speech with source speaker timbre features based on target language text, using the timbre embedding vector as a condition;
[0016] A low-resource language adaptation unit is used to convert text in a low-resource language into a sequence of phonetic units that can be recognized by the multilingual speech synthesis network.
[0017] Optionally, the multilingual speech synthesis network includes: a multilingual phoneme alignment unit, which has a built-in phoneme set covering the pronunciation units of multiple languages, and introduces an attention mechanism to learn the association mapping between phonemes of different languages and shared timbre features;
[0018] A cross-linguistic duration-prosodic predictor is used to generate duration, fundamental frequency, and energy parameters that conform to the prosodic pattern of the target language and the rhythmic characteristics of a specific speaker, under the joint guidance of the timbre embedding vector and the language type encoding.
[0019] Optionally, the low-resource language adaptation unit adopts a two-level adaptation strategy, including:
[0020] A phoneme similarity mapping strategy is used to map the articulation units of low-resource languages to phoneme sequences with similar acoustic features in high-resource languages;
[0021] The parameter fine-tuning strategy is used to fine-tune some layers of the multilingual speech synthesis network using a small amount of speech data from a low-resource language after initialization with a high-resource language model.
[0022] Optionally, the multimodal identity preservation module includes:
[0023] An identity-specific 3D neural field modeling unit is used to construct a 3D neural representation encoding the facial features of the target speaker based on video input.
[0024] An audio-driven facial expression parameter generation unit is used to generate an identity-independent facial expression parameter sequence based on a speech feature sequence.
[0025] The identity and dynamic fusion rendering unit is used to fuse the three-dimensional neural representation with the expression parameter sequence to render a sequence of facial images with consistent identity and corresponding expressions.
[0026] Optionally, the multimodal identity preservation module further includes:
[0027] The Fast Identity Modeling and Adaptation Unit is used to quickly reconstruct and drive the three-dimensional neural representation of the target speaker based on a pre-trained general face base model through a lightweight parameter adaptation mechanism, given a short video of the target speaker.
[0028] Optionally, the multilingual lip-sync and facial expression driving module includes: a front-end speech-phoneme analysis unit, used to receive the input audio stream, identify and output the phoneme sequences and their boundaries of multiple languages;
[0029] The core cross-language driving signal generation network adopts an encoder-decoder architecture. Based on the phoneme sequence and prosodic features, it predicts the weight coefficients for mixing a set of predefined basic visemes, together with expression style offsets, so as to generate a facial driving parameter sequence.
[0030] The post-processing and fusion unit is used to smooth and optimize the driving parameter sequence and adapt it to the downstream rendering model.
[0031] Optionally, the multilingual lip-sync and facial expression driving module further includes:
[0032] The low-resource language phoneme-visual mapping enhancement strategy module is used to enhance the ability to generate driving signals for low-resource languages through linguistic rule-based phoneme mapping, knowledge distillation-based teacher-student collaborative training, and adaptive mapping and fine-tuning.
[0033] Optionally, the end-to-end multilingual content generation engine includes: a unified input interface and a preprocessing scheduling unit, used to parse multimodal input types and generate or extract time-aligned audio-phoneme streams;
[0034] The identity model loading and conditional injection unit is used to dynamically load the identity parameter model of the target speaker and inject the timbre embedding vector extracted from the audio.
[0035] A parallelized driving signal generation and optimization unit is used to schedule the multilingual lip-sync and facial expression driving module to generate a high-precision facial driving parameter sequence and perform online fine-tuning.
[0036] The multimodal fusion and high-fidelity rendering unit is used to fuse the driving parameter sequence with the three-dimensional neural representation, synthesize a face image sequence and encapsulate it in sync with the audio, and output a digital human video.
[0037] Optionally, the multilingual lip-sync and facial expression driving module further includes a real-time adaptive prosodic feedback and lip-sync correction module, which is used to analyze the phoneme-visual alignment delay and prosodic intensity mismatch in real time during the generation of driving parameters, and to dynamically correct the subsequent driving parameters through a feedback loop.
[0038] Compared with the prior art, this application has at least the following beneficial effects:
[0039] This application employs a decoupled timbre coding network and a conditional multilingual speech synthesis network to extract pure timbre features from a small number of source language speech samples and transfer them to speech synthesis of multiple target languages. The generated speech retains the unique timbre of the source speaker while conforming to the pronunciation habits and prosodic patterns of the target language, thus improving the naturalness and realism of cross-language speech synthesis.
[0040] This application, through an integrated low-resource language adaptation unit, utilizes a two-level strategy of phoneme similarity mapping and parameter fine-tuning to effectively leverage knowledge from high-resource languages to assist in the modeling of low-resource languages. This solves the problems of poor synthesized speech quality and low robustness caused by a lack of training data, and expands the range of languages supported by digital human systems.
Attached Figure Description
[0041] Figure 1 is a schematic diagram of the module connections of a digital human multilingual content generation system based on cross-language timbre cloning, provided as an embodiment of this application.
Detailed Implementation
[0042] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments.
[0043] The digital human multilingual content generation system based on cross-language timbre cloning provided in this application includes:
[0044] The input interface module receives multimodal input containing target language content and a specified target speaker identifier;
[0045] A multilingual timbre cloning module is connected to the input interface module. Based on the target language content, it generates or extracts the corresponding target language speech data, phoneme sequences and prosodic features, and integrates the timbre of the source speaker.
[0046] The multilingual timbre cloning module consists of a timbre coding network, a multilingual speech synthesis network, and a low-resource language adaptation unit. It is used to extract and transfer the speaker's timbre from a small number of source language speech samples to speech synthesis of multiple target languages.
[0047] Specifically, the timbre encoding network is constructed based on convolutional neural networks and recurrent neural networks. Its input is the speech waveform or Mel spectrum of the source speaker, and its output is a fixed-dimensional speaker timbre embedding vector. During the training phase, the network uses a large-scale multi-speaker dataset and is pre-trained through an auxiliary speaker classification task to ensure that the extracted features are speaker-discriminative and decoupled from the speech content.
[0048] The multilingual speech synthesis network, conditioned on the aforementioned timbre embedding vectors, employs an improved non-autoregressive end-to-end structure. Its core improvement lies in the following: the network front end incorporates a multilingual phoneme alignment unit, which includes a scalable phoneme set covering articulation units of multiple languages and introduces an attention mechanism to learn the correlation mapping between phonemes of different languages and shared timbre features. The network back end is a cross-lingual duration-prosodic predictor, which is jointly guided by timbre embedding and language type encoding to generate duration, fundamental frequency, and energy parameters that conform to the prosodic pattern of the target language and the rhythmic characteristics of a specific speaker.
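In a non-autoregressive synthesizer, predicted durations are commonly consumed by a length regulator that repeats each phoneme-level item for its predicted number of frames. The following is a minimal sketch of that idea only. The function name `expand_by_duration`, the toy phoneme labels, and the frame counts are illustrative assumptions and are not taken from this application.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def expand_by_duration(units: Sequence[T], durations: Sequence[int]) -> List[T]:
    """Repeat each phoneme-level item for its predicted number of frames."""
    if len(units) != len(durations):
        raise ValueError("units and durations must have the same length")
    frames: List[T] = []
    for unit, n_frames in zip(units, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        frames.extend([unit] * n_frames)
    return frames


# Hypothetical example: three phonemes with predicted durations in frames.
phonemes = ["n", "i", "h"]
predicted_durations = [2, 3, 1]
frame_sequence = expand_by_duration(phonemes, predicted_durations)
```

In a real network the repeated items would be hidden-state vectors rather than labels, and the fundamental frequency and energy predictions would be added at the frame level afterwards.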
[0049] The low-resource language adaptation unit addresses the problem of scarce training data for the target language. It employs a two-level adaptation strategy: The first level is a phoneme similarity mapping strategy. For low-resource languages (such as Hmong and Thai), it first utilizes their linguistic pronunciation rules (e.g., designing a dedicated G2P model for Thai to handle inconsistencies between written and spoken order; designing subsyllable representations for Hmong to distinguish tones) to convert the text into International Phonetic Alphabet symbols or designed basic phoneme units. Then, this phoneme unit sequence is mapped to the phoneme sequence with the most similar acoustic features in high-resource languages (such as English and Chinese) as the initial input to the synthesis network. The second level is a parameter fine-tuning strategy. After initialization using a high-resource language model, a small amount of speech data from the low-resource language is used to fine-tune some layers of the synthesis network (especially layers related to duration and prosody prediction). Simultaneously, timbre embedding decoupling regularization technology constrains the impact of the fine-tuning process on the output vector of the pre-trained timbre encoding network, preventing timbre features from drifting during adaptation to the new language.
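The first-level phoneme similarity mapping can be pictured as a nearest-neighbor lookup in an acoustic feature space. The sketch below assumes each phoneme is described by a small feature vector. The feature values, the two inventories, and the function names are hypothetical toy data chosen for illustration, not measurements or details from this application.

```python
import math
from typing import Dict, List, Tuple

Features = Tuple[float, ...]

# Hypothetical feature vectors for a high-resource phoneme inventory (toy values).
HIGH_RESOURCE: Dict[str, Features] = {
    "a": (1.0, 0.0, 0.0),
    "i": (0.0, 1.0, 0.0),
    "u": (0.0, 0.0, 1.0),
}

# Hypothetical feature vectors for phonemes found only in the low-resource language.
LOW_RESOURCE: Dict[str, Features] = {
    "ɯ": (0.1, 0.3, 0.8),
    "ɛ": (0.6, 0.5, 0.0),
}


def nearest_phoneme(features: Features, inventory: Dict[str, Features]) -> str:
    """Return the inventory phoneme with the smallest Euclidean distance."""
    return min(inventory, key=lambda p: math.dist(features, inventory[p]))


def map_sequence(sequence: List[str]) -> List[str]:
    """Map a low-resource phoneme sequence onto the high-resource inventory."""
    mapped = []
    for phoneme in sequence:
        if phoneme in HIGH_RESOURCE:
            mapped.append(phoneme)  # shared phonemes pass through unchanged
        else:
            mapped.append(nearest_phoneme(LOW_RESOURCE[phoneme], HIGH_RESOURCE))
    return mapped


mapped = map_sequence(["ɯ", "a", "ɛ"])
```

The mapped sequence serves only as the initial input to the synthesis network. The second-level fine-tuning is what adapts the model to the actual sounds of the low-resource language.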
[0050] The workflow of the multilingual timbre cloning module is as follows: For a given source speaker's speech and target language text, the speaker's timbre embedding vector is first extracted through a timbre encoding network. Simultaneously, the target language text is processed by a low-resource language adaptation unit, converting it into a sequence of articulatory units recognizable by the synthesis network. Finally, the multilingual speech synthesis network, using the timbre embedding vector as a condition, drives this sequence of articulatory units to synthesize target language speech with the timbre characteristics of the source speaker. This design achieves the technical effect of transferring specific timbres across languages while maintaining the naturalness of the synthesized speech under limited data, solving the problems of strong timbre-language binding and poor low-resource language synthesis quality in existing technologies.
[0051] A multimodal identity preservation module is used to store and load a drivable three-dimensional facial neural representation based on neural radiance field (NeRF) or three-dimensional Gaussian splatting technology, corresponding to the target speaker identifier;
[0052] The multimodal identity preservation module specifically includes the following sub-units:
[0053] An identity-specific 3D neural field modeling unit takes a multi-view or single-view high-definition video of the target speaker as input. If NeRF technology is used, a conditional neural radiance field is constructed, with network weights encoding the speaker's specific facial geometry, skin material, and texture details. If 3D Gaussian splatting technology is used, a set of 3D Gaussian ellipsoids associated with facial feature points is initialized, and their attributes (position, covariance, color, opacity) are optimized to accurately reflect the speaker's facial appearance. This process produces a facial identity code or a set of static 3D Gaussian units as the basis for identity rendering.
[0054] An audio-driven facial expression parameter generation unit is a lightweight neural network that takes a sequence of speech features in the target language (such as Mel spectrograms or Wav2Vec features) as input and outputs a continuous, identity-independent sequence of facial expression parameters. These parameters can be based on a standardized parametric face representation (such as the pose and expression coefficients of the FLAME model) or on latent deformation codes that drive a neural radiance field or 3D Gaussians. The unit is trained on a large-scale dataset containing multi-speaker, multilingual audio-expression pairs, learning a mapping from general speech features to general facial actions, thus enabling it to generalize across speakers and languages.
[0055] Identity and dynamic fusion rendering unit: This unit is central to the module. It fuses the static identity representation with the dynamic facial expression parameters to generate a final sequence of facial images with consistent identity. For the NeRF-based scheme, this unit feeds the facial expression parameters as conditions into the neural radiance field network, dynamically modulating the radiance field to render facial images with the corresponding expressions and lip movements while keeping the identity network weights unchanged. For the 3D Gaussian-based scheme, this unit dynamically adjusts the position, rotation, and other transformation properties of the 3D Gaussian ellipsoids based on the facial expression parameters to simulate facial muscle movements, and then generates images through an efficient rasterization renderer.
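For the 3D Gaussian-based scheme, moving one ellipsoid amounts to transforming its mean and covariance while leaving color and opacity untouched. The sketch below assumes the rotation and translation for a Gaussian have already been derived from the expression parameters. How that derivation is done is not shown here, and all names and values are illustrative assumptions.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small dense matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m: Matrix) -> Matrix:
    return [list(row) for row in zip(*m)]


def rotation_z(angle: float) -> Matrix:
    """Rotation about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def transform_gaussian(
    mean: Vector, cov: Matrix, rot: Matrix, trans: Vector
) -> Tuple[Vector, Matrix]:
    """Apply a rigid motion: mean' = R mean + t, cov' = R cov R^T."""
    new_mean = [sum(rot[i][k] * mean[k] for k in range(3)) + trans[i] for i in range(3)]
    new_cov = matmul(matmul(rot, cov), transpose(rot))
    return new_mean, new_cov


# Hypothetical Gaussian elongated along x, rotated 90 degrees and lifted along z.
mean = [1.0, 0.0, 0.0]
cov = [[4.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
new_mean, new_cov = transform_gaussian(mean, cov, rotation_z(math.pi / 2), [0.0, 0.0, 0.5])
```

After the transform the ellipsoid's long axis points along y, which is the kind of per-Gaussian update a rasterization renderer would consume for each frame.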
[0056] Furthermore, to facilitate rapid identity modeling for new individuals, a rapid identity modeling and adaptation unit is provided. This unit is based on a general neural radiance field or three-dimensional Gaussian splatting model pre-trained on a large-scale multi-speaker facial video dataset. By introducing a lightweight parameter adaptation mechanism, it enables rapid and efficient reconstruction and driving of a highly accurate three-dimensional neural representation of a new speaker when only a single short video (usually 1-5 minutes) of the target speaker is available.
[0057] The general underlying model consists of a shared, parameter-frozen geometry and appearance decoder, which has the general ability to generate 3D geometry and textures from implicitly encoded or explicit parameters. To enable rapid adaptation to new speakers, this unit adds a learnable speaker-specific adapter, which includes two optional implementation methods:
[0058] Adaptive weights generated by a hypernetwork: A lightweight hypernetwork is set up, whose input is the identity feature vector obtained by encoding several keyframes extracted from the target speaker's short video. This hypernetwork outputs a set of sparse, low-rank adaptive weight matrices or bias vectors. During the inference phase, these adaptive weights are dynamically injected into specific layers of the general base decoder (e.g., weights of certain feedforward network layers or biases after activation functions), thereby fine-tuning the general model with almost no increase in computational burden, so that its radiance field or Gaussian properties are mapped onto the facial features of the new speaker;
[0059] Conditional implicit encoding: A fixed, low-dimensional implicit identity encoding vector is learned for each new speaker. During the model's forward propagation, this encoding vector, along with general spatial coordinates and viewpoint orientation, is input into a general decoder with frozen parameters. Internally, the decoder uses conditional layer normalization or feature modulation techniques to dynamically adjust the statistical properties of intermediate feature maps using this identity encoding, thereby controlling the generated facial geometry and appearance to be consistent with the target speaker.
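Both adapter options leave the base decoder untouched and add a small speaker-specific term. The sketch below illustrates the two ideas in their simplest form: a low-rank weight update added to a frozen weight matrix, and a conditional layer normalization whose scale and shift depend on the speaker. The matrix sizes, the values, and the function names are assumptions for illustration. In the described system the low-rank factors would come from the hypernetwork and the scale and shift from the identity encoding.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small dense matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def adapted_weight(frozen_w: Matrix, up: Matrix, down: Matrix) -> Matrix:
    """Return W + up @ down without modifying the frozen base weight W."""
    delta = matmul(up, down)
    return [[w + d for w, d in zip(row_w, row_d)] for row_w, row_d in zip(frozen_w, delta)]


def conditional_layer_norm(
    features: List[float], scale: List[float], shift: List[float], eps: float = 1e-5
) -> List[float]:
    """Normalize features, then apply speaker-dependent scale and shift."""
    mean = sum(features) / len(features)
    var = sum((x - mean) ** 2 for x in features) / len(features)
    return [
        s * (x - mean) / math.sqrt(var + eps) + b
        for x, s, b in zip(features, scale, shift)
    ]


# Hypothetical 2x2 frozen layer with a rank-1 speaker-specific update.
base_w = [[1.0, 0.0], [0.0, 1.0]]
speaker_w = adapted_weight(base_w, up=[[1.0], [0.0]], down=[[0.5, 0.5]])

# Hypothetical identity-dependent modulation of an intermediate feature vector.
modulated = conditional_layer_norm([1.0, 3.0], scale=[2.0, 2.0], shift=[1.0, 1.0])
```

In both cases only the small adapter tensors differ between speakers, which is why a new identity adds little storage or computation.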
[0060] The rapid adaptive process is performed within a unified optimization framework: all weights of the general base decoder are fixed, and optimization is performed only on the speaker-specific adapter (parameters of the hypernetwork or implicit coding vectors). The loss function combines photometric reconstruction loss (comparing the differences between the rendered image and the input video frame), facial keypoint alignment loss, and optional identity preservation loss (using a pre-trained face recognition network to ensure the rendered image matches the target speaker's identity). This process typically converges within a finite number of iterations (e.g., thousands of steps), minimizing data and computational costs.
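One way to write the adaptation objective described above is as a weighted sum minimized over the adapter parameters alone. The symbols below are notational assumptions introduced here: $\theta$ denotes the frozen base decoder weights, $\phi$ the speaker-specific adapter parameters, and $\lambda_{\text{kp}}$, $\lambda_{\text{id}}$ are weighting factors that the application does not specify ($\lambda_{\text{id}} = 0$ when the optional identity preservation loss is not used).

```latex
\phi^{*} = \arg\min_{\phi} \;
  \mathcal{L}_{\text{photo}}(\phi;\theta)
  + \lambda_{\text{kp}} \, \mathcal{L}_{\text{kp}}(\phi;\theta)
  + \lambda_{\text{id}} \, \mathcal{L}_{\text{id}}(\phi;\theta),
\qquad \theta \text{ fixed}
```

Here $\mathcal{L}_{\text{photo}}$ is the photometric reconstruction loss, $\mathcal{L}_{\text{kp}}$ the facial keypoint alignment loss, and $\mathcal{L}_{\text{id}}$ the identity preservation loss.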
[0061] As a further optimization, the large-scale pre-trained dataset not only includes diverse identities but also covers different expressions, poses, and lighting conditions. The general base model learns decoupled representations of face shape, texture priors, and expression variations on this data. Therefore, when combined with the newly learned lightweight adapter, it can accurately capture the unique identity features of new speakers (such as face shape, skin color, moles, etc.) while naturally inheriting the base model's responsiveness to speech-driven expressions and lip movements, ensuring the robustness and naturalness of the adapted model under dynamic driving conditions.
[0062] The multilingual lip-sync and facial expression driving module is connected to the multilingual timbre cloning module and the multimodal identity preservation module, respectively. It uses the target language speech data, phoneme sequence and prosodic features as driving sources to generate a facial driving parameter sequence that is highly synchronized with the speech content and decoupled from the identity.
[0063] The multilingual lip-sync and facial expression driving module consists of a three-tiered structure: a front-end speech-phoneme analysis unit, a core cross-language driving signal generation network, and a post-processing and fusion unit.
[0064] Front-end speech-phoneme analysis unit: This unit receives the input audio stream and first extracts speech features through a pre-trained general speech recognition front-end. The features are then passed to a multilingual phoneme classifier. This classifier is based on an extended International Phonetic Alphabet (IPA) set and can recognize and output phoneme sequences and their boundaries for multiple languages, including low-resource languages. For languages with unique writing and pronunciation rules (such as Thai and Hmong), this unit integrates corresponding rule-based preprocessing submodules (e.g., a Thai G2P module and a Hmong subsyllable segmenter) to convert the text or audio into a representation in this extended phoneme set, ensuring that language inputs from different sources are represented in a uniform, standardized way at the phoneme level.
[0065] Core cross-language driving signal generation network: This network is the main body of this module and adopts a non-autoregressive encoder-decoder architecture to achieve high-efficiency inference;
[0066] Encoder: Receives the normalized phoneme sequence and prosodic features (such as fundamental frequency and energy envelope) of the original audio from the aforementioned unit, and outputs a context-dependent speech code rich in speech content and prosodic information;
[0067] Decoder: This decoder is trained with audiovisual data from multiple languages and multiple speakers. Its key design feature is the introduction of a language-independent hybrid viseme basis and a parameter decoupling mechanism.
[0068] Hybrid viseme basis: A predefined set of basic visemes (the visual counterparts of phonemes) that is not tied to any single language and covers the common mouth shapes of human pronunciation. Each basic viseme corresponds to a set of standardized parameters for facial action units (such as lip opening and closing, mouth-corner stretching, jaw rotation, etc.).
[0069] Parameter decoupling: The core function of the decoder is to predict, from the input speech code, a set of time-varying weighting coefficients and an expression style offset. The weighting coefficients are used to linearly or nonlinearly mix the viseme basis to generate core lip movements that precisely match the phoneme sequence. The expression style offset is an auxiliary parameter that is weakly correlated with the speech content but strongly correlated with speaking habits and emotional state. It is used to modulate facial expressions in non-lip areas (such as subtle movements of eyebrows, cheeks, and periorbital muscles), thereby giving the face a natural and vivid overall expression while ensuring accurate lip movements.
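The linear case of this mixing can be sketched for a single frame as a weighted sum of viseme vectors, with the style offset kept in separate non-lip channels. The three visemes, their parameter values, the channel layout, and the function name are hypothetical and chosen only to make the arithmetic visible.

```python
from typing import Dict, List

# Hypothetical viseme basis. Each vector holds lip-region action parameters in the
# order (lip_opening, mouth_corner_stretch, jaw_rotation).
VISEME_BASIS: Dict[str, List[float]] = {
    "closed": [0.0, 0.0, 0.0],
    "open": [1.0, 0.0, 0.5],
    "spread": [0.25, 1.0, 0.0],
}


def blend_frame(weights: Dict[str, float], style_offset: List[float]) -> List[float]:
    """Linearly mix viseme vectors for the lips, then append non-lip style channels."""
    lip_params = [0.0, 0.0, 0.0]
    for name, weight in weights.items():
        for i, value in enumerate(VISEME_BASIS[name]):
            lip_params[i] += weight * value
    # The style offset drives non-lip channels, assumed here to be
    # (brow_raise, cheek_raise), so it does not disturb the lip shape.
    return lip_params + list(style_offset)


# One frame that is halfway between the "open" and "spread" visemes.
frame = blend_frame({"open": 0.5, "spread": 0.5}, style_offset=[0.25, 0.0])
```

Keeping the two outputs in separate channels is one simple way to realize the decoupling: changing the style offset alters brow and cheek motion but leaves the phoneme-driven lip parameters unchanged.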
[0070] Post-processing and fusion unit: This unit smooths the driving signals generated by the core network, optimizes them for temporal consistency, and adapts them to different downstream rendering models. Specifically, it includes:
[0071] Timing filtering and interpolation: A physiologically constrained smoothing filter is applied to eliminate high-frequency jitter in the driving signal, ensuring the continuity of the action. Context-based interpolation is performed at phoneme boundaries to simulate realistic coarticulation transitions.
[0072] Identity-based adaptation: The generated general facial motion parameters (lip shape and expression) are combined with the speaker-specific identity code provided by the "multimodal identity preservation module". Through a lightweight adaptation layer, the general motion parameters are fine-tuned according to individual speaking characteristics (such as mouth opening and closing habits, facial expression activity) to ensure that the final driving signal not only conforms to the language content, but also matches the unique expression of the target speaker;
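The timing filtering and interpolation step can be illustrated on a single driving parameter. The sketch below swaps in plain exponential smoothing and linear interpolation as stand-ins. The physiologically constrained filter and the context-based interpolation described above are not specified in enough detail to reproduce, so the function names and the `alpha` value are assumptions.

```python
from typing import List


def exponential_smooth(signal: List[float], alpha: float = 0.5) -> List[float]:
    """Low-pass filter: y[t] = alpha * x[t] + (1 - alpha) * y[t - 1]."""
    if not signal:
        return []
    smoothed = [signal[0]]
    for value in signal[1:]:
        smoothed.append(alpha * value + (1.0 - alpha) * smoothed[-1])
    return smoothed


def transition_frames(start: float, end: float, steps: int) -> List[float]:
    """Evenly spaced values strictly between two phoneme targets."""
    return [start + (end - start) * k / (steps + 1) for k in range(1, steps + 1)]


# A jittery lip-opening track and a three-frame transition across a phoneme boundary.
jittery = [0.0, 1.0, 0.0, 1.0]
smoothed = exponential_smooth(jittery)
boundary = transition_frames(0.0, 1.0, steps=3)
```

A lower `alpha` suppresses more jitter at the cost of lagging behind fast articulations, which is the trade-off a physiologically motivated filter would tune per facial region.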
[0073] Furthermore, because low-resource languages lack sufficient audiovisual training data, transfer learning and data augmentation are applied when training the core driving signal generation network. To this end, a phoneme-visual mapping enhancement strategy module for low-resource languages is provided. This module combines rule-based mapping derived from linguistic priors with transfer learning based on large-scale pre-trained models to construct a hierarchical and scalable framework for cross-language articulation unit alignment and driving signal generation.
[0074] The specific implementation of this strategy module includes the following three collaborative levels:
[0075] The first layer is a phoneme mapping and data construction layer based on linguistic rules. This layer aims to establish a structured bridge between the pronunciation systems of low-resource languages and high-resource languages. First, with the help of linguistic experts or authoritative pronunciation dictionaries, each phoneme in the low-resource language is labeled with one or more best-matching phonemes or phoneme combinations from the high-resource language phonemes in the extended International Phonetic Alphabet (IPA) set, together with a mapping confidence level, thus forming a prior cross-language phoneme mapping matrix. For example, a mapping is established between specific vowels in Thai and specific vowels or diphthongs in English. This matrix serves as a non-trainable system parameter. During the training data preparation phase, for each audio clip and its phoneme sequence in the low-resource language, a corresponding pseudo high-resource language phoneme sequence and soft labels are automatically generated based on this mapping matrix. Simultaneously, the acoustic feature extractor of the model pre-trained on high-resource languages is used to extract robust cross-language acoustic features from the low-resource language audio, thereby constructing enhanced training sample pairs for subsequent training.
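Generating the pseudo sequence and soft labels from the prior mapping matrix can be sketched as a lookup followed by normalization. The mapping entries and confidence values below are hypothetical placeholders rather than real linguistic annotations, and the function names are assumptions.

```python
from typing import Dict, List

# Hypothetical prior mapping: low-resource phoneme -> {high-resource phoneme: confidence}.
PRIOR_MAPPING: Dict[str, Dict[str, float]] = {
    "ɯ": {"u": 3.0, "i": 1.0},
    "kʰ": {"k": 1.0},
}


def soft_label(phoneme: str) -> Dict[str, float]:
    """Normalize mapping confidences into a probability distribution."""
    candidates = PRIOR_MAPPING[phoneme]
    total = sum(candidates.values())
    return {target: conf / total for target, conf in candidates.items()}


def pseudo_sequence(phonemes: List[str]) -> List[str]:
    """Pick the highest-confidence high-resource phoneme for each input phoneme."""
    return [max(PRIOR_MAPPING[p], key=lambda t, p=p: PRIOR_MAPPING[p][t]) for p in phonemes]


low_resource_seq = ["kʰ", "ɯ"]
labels = [soft_label(p) for p in low_resource_seq]
pseudo = pseudo_sequence(low_resource_seq)
```

The hard pseudo sequence lets the high-resource pipeline process the clip unchanged, while the soft labels keep the annotators' uncertainty available as a training target.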
[0076] The second layer is a knowledge distillation-based teacher-student collaborative training layer. This layer utilizes a mature driving signal generation network trained on abundant high-resource language data as a fixed-parameter teacher model. A two-stage distillation method is used to train the student model in low-resource language scenarios.
[0077] Output layer distillation: Low-resource language audio is input into the teacher model, which generates pseudo-visual-driven signals as "soft targets" based on its own high-resource language phoneme understanding. During training, the student model's loss function includes not only the difference from the real low-resource language video frames but also the distillation loss between its predictions and the teacher model's "soft targets." This enables the student model to learn the general physical mappings from acoustic features to facial movements inherent in the teacher model.
[0078] Feature layer guidance: This method mandates that the feature representations of the intermediate layers in the student model maintain similarity to the feature representations of the corresponding layers in the teacher model on the mapped phoneme-aligned frames (through feature imitation loss). This aligns the feature space structure within the student model with that of the teacher model, accelerating convergence and improving generalization ability.
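The two distillation terms described above can be combined with the ground-truth term into a single training objective. The following is a minimal sketch, assuming mean squared error for all three terms and hypothetical weights `alpha` and `beta`; the specification does not state which distance measure or weights are used.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def student_loss(pred, ground_truth, teacher_soft, student_feat, teacher_feat,
                 alpha=0.5, beta=0.1):
    """Total loss = task loss + alpha * output distillation + beta * feature imitation.

    pred          student-predicted driving parameters
    ground_truth  parameters derived from real low-resource video frames
    teacher_soft  teacher "soft targets" for the same audio
    *_feat        intermediate features on the mapped phoneme-aligned frames
    alpha, beta   hypothetical weighting factors (assumed, not from the source)
    """
    task = mse(pred, ground_truth)
    distill = mse(pred, teacher_soft)
    feature = mse(student_feat, teacher_feat)
    return task + alpha * distill + beta * feature


loss = student_loss(
    pred=[1.0, 1.0],
    ground_truth=[1.0, 1.0],
    teacher_soft=[0.0, 0.0],
    student_feat=[0.0],
    teacher_feat=[1.0],
)
print(loss)
```

The teacher's parameters stay fixed throughout, so only the student receives gradient updates from this objective.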
[0079] The third layer is the adaptive mapping and fine-tuning layer. After the student model's initial training, this layer introduces a learnable adaptive mapping network. This network takes as input the prior mapping matrix generated in the first layer and statistical information learned from real audiovisual data of the low-resource language, and dynamically optimizes and adjusts the mapping relationships between phonemes. Specifically, it outputs a refined phoneme-viseme association weight matrix, which is used during the inference phase for context-based post-processing calibration of the initial driving signals generated by the student model, making them more consistent with the coarticulation habits unique to the low-resource language. Finally, using the limited real low-resource language data, the entire student model (or its key layers) is lightly fine-tuned to fully adapt to the target language.
[0080] To improve the real-time performance of interactive applications, a real-time adaptive prosodic feedback and lip-sync correction module is provided. This module introduces a lightweight online prosodic analysis and feedforward correction loop into the generation flow of the multilingual lip-sync and facial expression driving module, forming a closed-loop control system with real-time self-adjustment capability.
[0081] Specifically, the real-time adaptive prosodic feedback and lip-sync correction module consists of three parts: a forward-driven parameter prediction unit, a parallel prosodic feature extraction and analysis unit, and a feedback-driven parameter correction unit.
[0082] Forward-driven parameter prediction unit: As the main pathway, it takes the features of the current and historical speech frames (such as the Mel spectrum and phoneme labels) as input, and predicts and outputs the preliminary facial driving parameter sequence (including viseme weights, expression offsets, etc.) in real time, frame by frame, through a hybrid structure of causal convolution and a recurrent neural network.
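The causal property that allows frame-by-frame prediction can be illustrated with a minimal one-dimensional causal convolution. This is a sketch only: the kernel values are arbitrary, and a practical predictor would use a trained multi-channel network rather than this scalar example.

```python
def causal_conv1d(frames, kernel):
    """Causal 1-D convolution: output[t] depends only on frames[t], frames[t-1], ...

    Positions before the start of the stream are treated as zero, so the
    output can be emitted as soon as each new frame arrives.
    """
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = t - j
            if idx >= 0:
                acc += weight * frames[idx]
        out.append(acc)
    return out


energy = [0.0, 1.0, 0.0, 0.0]
kernel = [0.5, 0.3, 0.2]  # arbitrary illustrative weights
print(causal_conv1d(energy, kernel))  # [0.0, 0.5, 0.3, 0.2]
```

Changing a later input frame leaves all earlier outputs untouched, which is what makes the pathway usable in a streaming setting.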
[0083] Parallel prosodic feature extraction and analysis unit: This unit operates in parallel with the main path. It receives the same raw input audio stream and calculates features at two levels in real time:
[0084] Phoneme-level synchronicity measurement: A pre-trained, lightweight phoneme boundary detector is used to estimate the timing of phoneme occurrences in the input audio in real time. Simultaneously, the current dominant viseme category is decoded from the real-time driving parameters output by the forward prediction unit. The phoneme-viseme alignment delay error within a short time window is calculated by comparing each phoneme occurrence time with the occurrence time of the corresponding dominant viseme.
[0085] Prosodic energy and fundamental frequency envelope: The short-time energy and fundamental frequency contour of the input audio are calculated in real time and correlated with the expected lip opening and jaw movement trajectory estimated from the current driving parameters. When a sharp increase in energy (for example, on a stressed syllable) is detected in the actual audio and the expected movement trajectory fails to match it, a prosodic intensity mismatch signal is generated.
[0086] Feedback-driven parameter correction unit: This unit is the core controller of the module. It receives the real-time error signals (alignment delay error and prosodic intensity mismatch) from the analysis unit and calculates the dynamic adjustment for the driving parameters of subsequent frames, using either a digital corrector designed on the proportional-integral-derivative (PID) principle or a lightweight recurrent neural network corrector.
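The two measurements above might be computed as in the following sketch. The event format, the class labels, and the thresholds `rise_ratio` and `min_opening_gain` are assumptions chosen for illustration; the fundamental frequency contour is omitted for brevity.

```python
def alignment_delay_ms(phoneme_events, viseme_events):
    """Mean delay between audio onsets and the matching visual (viseme) onsets.

    Each event is a (visual_class, onset_ms) tuple. Events of the same class
    are paired in order of appearance. A positive result means the lips lag.
    """
    pending = {}
    for cls, onset in viseme_events:
        pending.setdefault(cls, []).append(onset)
    delays = []
    for cls, audio_onset in phoneme_events:
        queue = pending.get(cls)
        if queue:
            delays.append(queue.pop(0) - audio_onset)
    return sum(delays) / len(delays) if delays else 0.0


def short_time_energy(samples, frame_len):
    """Mean squared amplitude over consecutive non-overlapping frames."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def intensity_mismatch(energy, lip_opening, rise_ratio=2.0, min_opening_gain=0.1):
    """Flag frames where energy rises sharply but the expected lip opening does not."""
    flags = [False]
    for t in range(1, len(energy)):
        sharp_rise = energy[t] > rise_ratio * max(energy[t - 1], 1e-8)
        lip_follows = lip_opening[t] - lip_opening[t - 1] >= min_opening_gain
        flags.append(sharp_rise and not lip_follows)
    return flags


audio = [("bilabial", 100), ("open", 220), ("rounded", 340)]
video = [("bilabial", 130), ("open", 240), ("rounded", 370)]
delay = alignment_delay_ms(audio, video)

energy = short_time_energy([0.1] * 4 + [0.5] * 4, frame_len=4)
flags = intensity_mismatch(energy, lip_opening=[0.2, 0.22])
print(delay, flags)
```

Both outputs are scalar or per-frame signals, so they can be passed directly to a downstream corrector as error inputs.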
[0087] The adjustment is injected into the forward prediction unit in real time. Specifically, it can be implemented as an additional condition, concatenated with the original speech features and input into the next time step of the prediction network; or directly as an additive residual, superimposed on the driving parameters of the prediction unit for the next few frames to be output.
[0088] For example, when a positive delay in phoneme-viseme alignment (lip lag) is detected, the corrector generates an adjustment that slightly advances the subsequent viseme switching; when an emphasis mismatch is detected, it generates an adjustment command that temporarily increases the amplitude and speed of lip movements.
[0089] The system completes a "prediction-analysis-correction" cycle within a short sliding time window (typically 100-500 milliseconds), thereby achieving millisecond-level tracking and adaptive compensation for dynamic changes in speech rhythm.
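A corrector of the proportional-integral-derivative type, applied as an additive residual to upcoming switching times, could look like the sketch below. The gains `kp`, `ki`, and `kd`, the class name `PIDCorrector`, and the choice of onset times as the corrected quantity are assumptions for illustration, not values from the specification.

```python
class PIDCorrector:
    """Discrete PID controller acting on the alignment delay error (in ms)."""

    def __init__(self, kp=0.5, ki=0.1, kd=0.05):
        # Gains are hypothetical; a real system would tune them.
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error, dt=1.0):
        """Return the control output for the current error sample."""
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def apply_time_shift(onsets_ms, shift_ms):
    """Additive residual: shift the planned switching times of upcoming frames."""
    return [t + shift_ms for t in onsets_ms]


corrector = PIDCorrector()
# Lips lag the audio by 30 ms, so the shift is negative (switch earlier).
shift = -corrector.step(30.0)
upcoming = apply_time_shift([400.0, 520.0], shift)
print(shift, upcoming)
```

The integral term lets the corrector remove a persistent lag, while the derivative term damps the response when the error is already shrinking.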
[0090] The end-to-end multilingual content generation engine takes text or speech in the target language as input and outputs a digital human multilingual video with consistent timbre, synchronized lip movements, and natural expressions. It integrates the aforementioned functional modules (the multilingual timbre cloning module, the multimodal identity preservation module, and the multilingual lip-sync and facial expression driving module with its enhancement sub-modules) through a hierarchical, configurable pipeline architecture and a unified data bus protocol, forming an automated, integrated generation system from multimodal input to high-quality video stream output.
[0091] The workflow and key integration mechanisms of the end-to-end multilingual content generation engine are as follows:
[0092] A unified input interface and preprocessing scheduling are implemented. The engine provides a multimodal input interface that accepts target language text, target language audio, or a combination thereof as input sources. The system has a built-in intelligent input parser that automatically determines the input type. If it is plain text input, the multilingual timbre cloning module (or its integrated TTS subunit) is first invoked to generate a target language speech waveform with the target timbre and the corresponding phoneme sequence and prosodic features by combining the specified source speaker's timbre features. If it is audio input, the audio is directly extracted as the driving source, and the phoneme and prosodic features are obtained using the speech-phoneme parsing unit. This step ensures that all subsequent module processing is based on the same time-aligned audio-phoneme stream.
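The input routing described in this step can be sketched as a small dispatcher. The type `EngineInput`, the step names, and the rule that audio takes precedence when both text and audio are supplied are assumptions for this example; the specification does not define how combined input is resolved.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EngineInput:
    """Hypothetical container for the multimodal input."""
    text: Optional[str] = None
    audio_path: Optional[str] = None


def route_input(item: EngineInput) -> List[str]:
    """Return the ordered preprocessing steps for the given input type."""
    if item.audio_path:
        # Audio input: use the audio as the driving source, then parse phonemes.
        # Assumption: audio also wins when text is supplied alongside it.
        return ["extract_audio", "speech_phoneme_parsing"]
    if item.text:
        # Plain text: synthesize speech in the target timbre, which also
        # yields the phoneme sequence and prosodic features.
        return ["timbre_cloning_tts"]
    raise ValueError("input must contain text, audio, or both")


print(route_input(EngineInput(text="hello")))
print(route_input(EngineInput(audio_path="speech.wav")))
```

Either route ends with the same artifact, a time-aligned audio-phoneme stream, so downstream modules do not need to know which input type was given.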
[0093] Identity model loading and condition injection: Based on the target speaker identifier specified by the user, the engine dynamically loads the lightweight identity adaptation parameters (such as adapter weights or implicit encoding) generated by the multimodal identity preservation module for that speaker, as well as static neural field / 3D Gaussian model data. Simultaneously, the speaker timbre embedding vector extracted from the input audio or cloned speech is injected into the subsequent driving module as an important identity-related condition for personalized fine-tuning of lip movements and facial expressions, ensuring consistency in audiovisual identity perception.
[0094] Parallelized driving signal generation and optimization: The engine's core scheduler distributes the time-aligned audio feature streams, phoneme sequences, and loaded identity conditions in parallel to the multilingual lip-sync and facial expression driving module and its integrated sub-modules (such as the low-resource language enhancement strategy module and the real-time adaptive prosodic feedback module). These modules collaboratively generate high-precision, identity-decoupled facial driving parameter sequences (including basic viseme weights, expression offsets, head pose parameters, etc.) in a pipelined or parallel computing manner. During this process, the prosodic feedback module runs in real time, fine-tuning the driving parameters online to ensure synchronization accuracy.
[0095] Multimodal fusion and high-fidelity rendering: In the final stage, the engine fuses the temporal driving parameter sequence from the driving module with the static 3D neural representation from the identity preservation module in the rendering space. Specifically, for each frame, the renderer (based on NeRF or Gaussian splatting rendering technology) synthesizes the corresponding 2D face image by querying or deforming the 3D neural representation according to the driving parameters (controlling facial expressions, lip movements, and head pose) and identity parameters of the current frame. Simultaneously, the engine strictly time-aligns the synthesized face image sequence with the synchronously generated or original target language audio waveform and encapsulates them together, ultimately outputting a digital human video stream with synchronized lip movements, natural facial expressions, and consistent audio and video.
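The time alignment between rendered frames and the audio waveform reduces to mapping each frame index to a range of audio samples. The sketch below assumes a constant frame rate and sample rate; the default values of 25 frames per second and 16 kHz are illustrative and do not come from the specification.

```python
def audio_span_for_frame(frame_index, fps=25, sample_rate=16000):
    """Return the [start, end) audio sample range covered by one video frame.

    Integer arithmetic keeps consecutive spans contiguous even when the
    sample rate is not an exact multiple of the frame rate.
    """
    start = (frame_index * sample_rate) // fps
    end = ((frame_index + 1) * sample_rate) // fps
    return start, end


def frames_needed(num_samples, fps=25, sample_rate=16000):
    """Number of video frames required to cover the whole waveform."""
    return -(-num_samples * fps // sample_rate)  # ceiling division


print(audio_span_for_frame(10))   # (6400, 7040)
print(frames_needed(16000 * 3))   # 75
```

Deriving both boundaries from the same formula avoids gaps or overlaps between neighbouring frames, which would otherwise accumulate into audible drift over long videos.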
[0096] In a specific embodiment, the specific application process of this solution includes:
[0097] An expert records a 15-minute Chinese speech video as source material. The system quickly constructs a 3D neural face model of the expert through the multimodal identity preservation module, preserving the expert's facial features, expressions, and habits, and extracts the expert's timbre embedding vector through the timbre encoding network. The speech transcript is translated into English, Japanese, and Spanish by a professional medical translation team. The multilingual timbre cloning module transfers the expert's timbre into the speech synthesis of each language, generating natural multilingual speech in the expert's voice. Where less common languages (such as Thai and Arabic) are required, the low-resource language adaptation unit ensures natural pronunciation. The multilingual lip-sync and facial expression driving module adjusts lip shape and facial expression according to the articulation of each language, combining phoneme sequences and prosodic features to generate precisely synchronized facial motion parameters. Through the real-time adaptive prosodic feedback module, the system dynamically adjusts lip movements so that facial expressions match speech intensity on stressed syllables. The end-to-end multilingual content generation engine integrates the speech, facial movements, and 3D face model to generate multilingual versions of the digital human speech video. In the output videos, the expert appears to deliver fluent speeches in English, Japanese, and Spanish in person, with natural facial expressions and synchronized lip movements. This allows the expert to address a global audience without learning foreign languages, while maintaining a professional image and a high degree of consistency between timbre, expression, and lip movements, which enhances credibility and persuasiveness.
[0098] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
Claims
1. A multilingual content generation system for digital humans based on cross-language timbre cloning, characterized in that it comprises: an input interface module, which receives multimodal input containing target language content and a specified target speaker identifier; a multilingual timbre cloning module, connected to the input interface module, which generates or extracts, based on the target language content, the corresponding target language speech data, phoneme sequences and prosodic features, and integrates the timbre of the source speaker; a multimodal identity preservation module, used to store and load a drivable three-dimensional facial neural representation, based on neural radiance field or three-dimensional Gaussian splatting technology, corresponding to the target speaker identifier; a multilingual lip-sync and facial expression driving module, connected to the multilingual timbre cloning module and the multimodal identity preservation module respectively, which uses the target language speech data, phoneme sequences and prosodic features as driving sources to generate a facial driving parameter sequence that is highly synchronized with the speech content and decoupled from the identity; and an end-to-end multilingual content generation engine, which integrates the multilingual timbre cloning module, the multimodal identity preservation module, and the multilingual lip-sync and facial expression driving module to generate, based on the multimodal input, digital human multilingual videos with consistent timbre, synchronized lip movements, and natural facial expressions.
2. The digital human multilingual content generation system based on cross-language timbre cloning according to claim 1, characterized in that the multilingual timbre cloning module includes: a timbre encoding network, used to extract speaker timbre embedding vectors from the speech samples of the source speaker; a multilingual speech synthesis network, connected to the timbre encoding network, used to synthesize target language speech with the source speaker's timbre features based on target language text, using the timbre embedding vector as a condition; and a low-resource language adaptation unit, used to convert text in a low-resource language into a sequence of phonetic units that can be recognized by the multilingual speech synthesis network.
3. The digital human multilingual content generation system based on cross-language timbre cloning according to claim 2, characterized in that, The multilingual speech synthesis network includes: a multilingual phoneme alignment unit, which has a built-in phoneme set covering the pronunciation units of multiple languages, and introduces an attention mechanism to learn the association mapping between phonemes of different languages and shared timbre features; A cross-linguistic duration-prosodic predictor is used to generate duration, fundamental frequency, and energy parameters that conform to the prosodic pattern of the target language and the rhythmic characteristics of a specific speaker, under the joint guidance of the timbre embedding vector and the language type encoding.
4. The digital human multilingual content generation system based on cross-language timbre cloning according to claim 2, characterized in that, The low-resource language adaptation unit adopts a two-level adaptation strategy, including: A phoneme similarity mapping strategy is used to map the articulation units of low-resource languages to phoneme sequences with similar acoustic features in high-resource languages; The parameter fine-tuning strategy is used to fine-tune some layers of the multilingual speech synthesis network using a small amount of speech data from a low-resource language after initialization with a high-resource language model.
5. The digital human multilingual content generation system based on cross-language timbre cloning according to claim 1, characterized in that, The multimodal identity preservation module includes: An identity-specific 3D neural field modeling unit is used to construct a 3D neural representation encoding the facial features of the target speaker based on video input. An audio-driven facial expression parameter generation unit is used to generate an identity-independent facial expression parameter sequence based on a speech feature sequence. The identity and dynamic fusion rendering unit is used to fuse the three-dimensional neural representation with the expression parameter sequence to render a sequence of facial images with consistent identity and corresponding expressions.
6. The digital human multilingual content generation system based on cross-language timbre cloning according to claim 5, characterized in that, The multimodal identity preservation module also includes: The Fast Identity Modeling and Adaptation Unit is used to quickly reconstruct and drive the three-dimensional neural representation of the target speaker based on a pre-trained general face base model through a lightweight parameter adaptation mechanism, given a short video of the target speaker.
7. The digital human multilingual content generation system based on cross-language timbre cloning according to claim 1, characterized in that the multilingual lip-sync and facial expression driving module includes: a front-end speech-phoneme parsing unit, used to receive input audio streams and to recognize and output phoneme sequences and their boundaries in multiple languages; a core cross-language driving signal generation network, which adopts an encoder-decoder architecture to predict, based on the phoneme sequence and prosodic features, the weight coefficients for mixing a set of predefined basic viseme actions and the expression style offsets, so as to generate a facial driving parameter sequence; and a post-processing and fusion unit, used to smooth and optimize the driving parameter sequence and adapt it to the downstream rendering model.
8. The digital human multilingual content generation system based on cross-language timbre cloning according to claim 7, characterized in that the multilingual lip-sync and facial expression driving module also includes: a low-resource language phoneme-viseme mapping enhancement strategy module, used to enhance the ability to generate driving signals for low-resource languages through linguistic rule-based phoneme mapping, knowledge distillation-based teacher-student collaborative training, and adaptive mapping and fine-tuning.
9. The digital human multilingual content generation system based on cross-language timbre cloning according to claim 1, characterized in that, The end-to-end multilingual content generation engine includes: a unified input interface and a preprocessing scheduling unit, used to parse multimodal input types and generate or extract time-aligned audio-phoneme streams; The identity model loading and conditional injection unit is used to dynamically load the identity parameter model of the target speaker and inject the timbre embedding vector extracted from the audio. A parallelized driving signal generation and optimization unit is used to schedule the multilingual lip-sync and facial expression driving module to generate a high-precision facial driving parameter sequence and perform online fine-tuning. The multimodal fusion and high-fidelity rendering unit is used to fuse the driving parameter sequence with the three-dimensional neural representation, synthesize a face image sequence and encapsulate it in sync with the audio, and output a digital human video.
10. The digital human multilingual content generation system based on cross-language timbre cloning according to claim 9, characterized in that the multilingual lip-sync and facial expression driving module also includes a real-time adaptive prosodic feedback and lip-sync correction module, which is used to analyze phoneme-viseme alignment delay and prosodic intensity mismatch in real time during the generation of driving parameters, and to dynamically correct subsequent driving parameters through a feedback loop.