Digital human real-time generation method, computer device, and storage medium

By combining a cross-modal spatiotemporal average flow field network and a flow field residual network, the multimodal coordination problem and the real-time performance deficiency in digital human generation are solved. This enables real-time, highly natural digital human image generation on lightweight terminal devices, improving the user interaction experience.

CN122244243A · Pending · Publication Date: 2026-06-19 · SHENZHEN QIANHAI HAND-PAINTED TECH & CULTURE CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SHENZHEN QIANHAI HAND-PAINTED TECH & CULTURE CO LTD
Filing Date: 2026-01-27
Publication Date: 2026-06-19

Application Information


IPC: G06T13/40; G06N3/045; G06N3/0464; G06N3/0442; G06N3/08

AI Tagging

Application Domain

Animation; Neural learning methods


Similar Technology Patents

Multi-motion generation
US20260162226A1Image enhancement Image analysis
Behavior control system, control device, electronic device, and avatar display device
US20260162345A1Image enhancement Input/output for user-computer interaction
A method and apparatus for generating a sequence of lip images, and an electronic device
CN122199755ACharacter and pattern recognition Animation
Hybrid mode 3D gaussian splat signaling
WO2026125092A1Animation Digital video signal modification
Image animation adapters for diffusion models
WO2026123183A1Animation


Abstract

This application relates to the field of digital human technology and discloses a method, computer device, and storage medium for real-time digital human generation, comprising: acquiring user multimodal data at the current moment; converting the user multimodal data into a multimodal joint state vector; inputting the multimodal joint state vector into a pre-trained cross-modal spatiotemporal average flow field network, and predicting the cross-modal spatiotemporal average flow field through single-step inference; wherein the cross-modal spatiotemporal average flow field is used to describe the joint evolution of the digital human multimodal state from the current moment to the target moment; rendering the cross-modal spatiotemporal average flow field to generate a digital human image frame at the target moment; transmitting the digital human image frame to a display interface for display, wherein the cross-modal spatiotemporal average flow field network obtains a dynamic digital human image frame through one-step inference.


Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, and in particular to a method for real-time generation of digital humans, a computer device, and a storage medium.

Background Technology

[0002] In the field of digital human generation technology, existing methods are mainly divided into two categories: traditional modeling-driven methods and deep learning-based generation methods. Traditional techniques rely on 3D modeling software for manual modeling and skeleton binding, or on building facial expression capture systems using optical / inertial sensors and combining phoneme-lip mapping to achieve voice-driven operation. However, these methods have inherent drawbacks such as cumbersome processes, high costs, poor real-time performance, and insufficient flexibility.

[0003] In recent years, deep learning-based methods have made significant progress, but they still face many challenges. Generative Adversarial Networks (GANs) can generate high-quality images, but their training is unstable and it is difficult to coordinate relationships between multiple modalities. Diffusion models offer good diversity, but the 20-50 iterative inference steps they require make them too slow to meet real-time requirements. Flow matching methods, while theoretically elegant and stable to train, mostly focus on single-modal generation and do not fully consider the fusion of the multimodal characteristics of digital humans. These technologies share five core defects: first, the multimodal coordination problem, where simple splicing after independent generation leads to mismatches between lip movements and speech, and between facial expressions and speech emotions; second, temporal inconsistency, where frame-by-frame generation easily produces image jitter and flicker; third, insufficient real-time performance, especially the high latency of diffusion models, which makes it difficult to support interactive scenarios; fourth, lack of interactivity, as the generation process cannot be dynamically adjusted based on user feedback; and fifth, waste of computational resources, as a uniform resolution is used to process all regions without optimizing resource allocation based on visual importance.

[0004] The aforementioned shortcomings are particularly pronounced in lightweight terminal applications such as AI photo frames: traditional methods struggle to achieve low-latency generation with limited computing power; multimodal asynchrony leads to a disconnect between the digital human's lip movements and voice, severely impacting the realism of the interaction; static generation modes cannot respond to real-time voice commands; and high-power rendering mechanisms restrict the battery life of terminal devices. Therefore, there is an urgent need for a lightweight digital human generation solution that balances real-time performance, multimodal coordination, and interactivity to support a highly natural human-computer interaction experience for consumer-grade terminals such as AI photo frames.

Summary of the Invention

[0005] Based on this, it is necessary to address the technical problems of existing technologies such as "difficulty in multimodal coordination, inconsistent timing, and insufficient real-time performance" by proposing a method, computer equipment, and storage medium for real-time generation of digital humans.

[0006] In a first aspect, a method for real-time generation of digital humans is provided, the method comprising: obtaining the user's multimodal data at the current moment; converting the user's multimodal data into a multimodal joint state vector; inputting the multimodal joint state vector into a pre-trained cross-modal spatiotemporal average flow field network, and obtaining the cross-modal spatiotemporal average flow field through single-step inference prediction, wherein the cross-modal spatiotemporal average flow field is used to describe the joint evolution of the digital human's multimodal state from the current time to the target time; rendering the cross-modal spatiotemporal average flow field to generate a digital human image frame at the target time; and transmitting the digital human image frame to the display interface for display.

[0007] Preferably, the cross-modal spatiotemporal average flow field network adopts a multi-branch Transformer coding structure; The multi-branch Transformer encoder encodes the multimodal joint state vector and models the dynamic coupling relationship between different modal features through a cross-modal attention mechanism.

[0008] Preferably, the method further includes: acquiring real-time user feedback information and encoding the feedback information into a correction vector; The flow field residual network corrects the cross-modal spatiotemporal average flow field based on the correction vector; Based on the corrected cross-modal spatiotemporal averaged flow field, a corrected digital human image frame is generated; The corrected digital human image frame is transmitted to the display interface for display.

[0009] Preferably, the user feedback information includes at least one of voice, gesture, and text.
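The correction step in [0008] can be illustrated with a minimal sketch. The specification does not disclose the internals of the flow field residual network, so the function below is a hypothetical linear stand-in: it assumes the encoded feedback has already been mapped to a correction vector of the same dimension as the flow field, and it adds a scaled residual. The names `correct_flow` and `gain` are illustrative assumptions, not terms from the source.

```python
def correct_flow(flow, correction, gain=0.5):
    """Hypothetical residual correction: U' = U + gain * correction.

    A trained flow field residual network would replace this linear rule;
    `gain` is an assumed scaling factor, not a parameter from the source.
    """
    if len(flow) != len(correction):
        raise ValueError("flow and correction must have the same length")
    return [u + gain * c for u, c in zip(flow, correction)]


# Toy 4-component flow field and a correction vector encoded from user feedback.
flow = [0.2, -0.1, 0.0, 0.4]
correction = [0.0, 0.2, 0.0, -0.4]
corrected = correct_flow(flow, correction)
```

The corrected flow field would then be rendered in place of the original one; only the components touched by the feedback change.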

[0010] Preferably, the method further includes: The time offset between two modes in the multimodal joint state vector is calculated using a time offset estimation network. Based on the time offset, time compensation is performed on modes with time delays.
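As a rough illustration of the time compensation in [0010], the sketch below replaces the learned time offset estimation network with plain cross-correlation, which is a different and much simpler technique. It assumes each modality has been reduced to a one-dimensional per-frame signal (for example audio energy and lip opening); `estimate_offset` and `compensate` are hypothetical helper names.

```python
def estimate_offset(reference, delayed, max_lag):
    """Return the lag (in frames) at which `delayed` best matches `reference`.

    Cross-correlation stand-in for the learned offset estimator (assumption).
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(reference)):
            j = i + lag
            if 0 <= j < len(delayed):
                score += reference[i] * delayed[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def compensate(signal, lag):
    """Shift `signal` earlier by `lag` frames, padding with zeros."""
    n = len(signal)
    return [signal[i + lag] if 0 <= i + lag < n else 0.0 for i in range(n)]


audio_energy = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
lip_opening = [0, 0, 0, 0, 1, 3, 1, 0, 0, 0]  # same pattern, two frames late
lag = estimate_offset(audio_energy, lip_opening, max_lag=4)
aligned = compensate(lip_opening, lag)
```

In this toy case the estimated lag is two frames, and shifting the delayed modality by that amount lines it up with the reference.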

[0011] Preferably, the method further includes: acquiring a training dataset, the training dataset containing multiple sets of temporally consecutive digital human multimodal state sequences; The temporally continuous multimodal state pairs in the training dataset are input into the cross-modal spatiotemporal average flow field network to obtain the cross-modal spatiotemporal average flow field predicted by the cross-modal spatiotemporal average flow field network. Based on the difference between the predicted cross-modal spatiotemporal average flow field and the actual state changes in the multimodal state pair, a flow field matching loss is constructed; wherein, the flow field matching loss is the degree of multimodal synchronization. Backpropagation is performed based on the flow field matching loss to update the parameters of the cross-modal spatiotemporal average flow field network and the parameters of the time offset estimation network.

[0012] Preferably, the step of rendering the cross-modal spatiotemporal averaged flow field to generate a digital human image frame at the target time includes: The target multimodal joint state vector is obtained based on the cross-modal spatiotemporal average flow field; The target multimodal joint state vector is rendered into a digital human image frame by using a dynamic resolution flow field. The digital human image frame has different resolutions in different regions.

[0013] Preferably, the method for generating a digital human image frame by rendering the image corresponding to the target multimodal joint state vector through a dynamic resolution flow field includes: Visual importance scores are assigned to each position in the image corresponding to the multimodal joint state vector of the target based on the visual attention mechanism. The resolution at each location is assigned based on the visual importance score; By using Gaussian weighting, positions at different resolutions are fused to generate digital human image frames.
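One way to picture the dynamic-resolution rendering of [0012] and [0013] is the sketch below. It makes several assumptions that are not in the source: the image is a small grayscale grid, only two resolutions are used (full resolution and a block-averaged proxy), and visual importance is modeled as a single Gaussian centered on a region of interest such as the mouth. The helper names are hypothetical.

```python
import math


def block_average(img, block):
    """Low-resolution proxy: replace each block x block tile by its mean."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            ys = range(by, min(by + block, h))
            xs = range(bx, min(bx + block, w))
            mean = sum(img[y][x] for y in ys for x in xs) / (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out


def gaussian_weight(h, w, center, sigma):
    """Importance map that is 1 at `center` and decays with distance."""
    cy, cx = center
    return [[math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
             for x in range(w)] for y in range(h)]


def fuse(high, low, weight):
    """Per-pixel blend: important pixels keep detail, others use the proxy."""
    return [[wt * hi + (1 - wt) * lo for hi, lo, wt in zip(hr, lr, wr)]
            for hr, lr, wr in zip(high, low, weight)]


high = [[float((x + y) % 2) for x in range(8)] for y in range(8)]  # checkerboard
low = block_average(high, block=4)
weight = gaussian_weight(8, 8, center=(4, 4), sigma=1.5)
frame = fuse(high, low, weight)
```

Near the assumed center the output keeps the full-resolution checkerboard, while far from it the output approaches the smooth low-resolution value, which is where a real renderer could save computation.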

[0014] In a second aspect, a computer device is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the real-time digital human generation method as described in any of the preceding claims.

[0015] In a third aspect, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the steps of the above-described real-time digital human generation method.

[0016] Beneficial effects: This application achieves single-step deterministic inference, abandoning the generation path that relies on multi-step iterative denoising. Instead, it constructs and predicts a physical quantity called the "cross-modal spatiotemporal average flow field." This flow field defines the overall evolutionary trend from the current state to the target state. Therefore, the model can directly output the final generated result in a single forward propagation, eliminating the inherent delay caused by multiple iterations. Especially in lightweight interactive terminals such as AI photo frames, this can greatly enhance the user's interactive experience.

Attached Figure Description

[0017] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] In the drawings: Figure 1 is an application environment diagram of a real-time digital human generation method in one embodiment; Figure 2 is a flowchart of a real-time digital human generation method in one embodiment; Figure 3 is a structural block diagram of a computer device in one embodiment.

Detailed Implementation

[0019] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0020] The real-time digital human generation method provided in this invention can be applied, for example, in the application environment shown in Figure 1, in which the client communicates with the server via a network. The server can obtain the user's multimodal data at the current moment through the client; convert the user's multimodal data into a multimodal joint state vector; input the multimodal joint state vector into a pre-trained cross-modal spatiotemporal average flow field network, and predict the cross-modal spatiotemporal average flow field through single-step inference, wherein the cross-modal spatiotemporal average flow field is used to describe the joint evolution of the digital human's multimodal state from the current moment to the target moment; render the cross-modal spatiotemporal average flow field to generate a digital human image frame at the target moment; and transmit the digital human image frame to the display interface for display. In this invention, by implementing single-step deterministic inference, the generation path relying on multi-step iterative denoising is abandoned, and instead a physical quantity called the "cross-modal spatiotemporal average flow field" is constructed and predicted. This flow field defines the overall evolution trend from the current state to the target state. Therefore, the model can directly output the final generation result in one forward propagation, eliminating the inherent delay caused by multiple iterations. The client can be, but is not limited to, various personal computers, laptops, smartphones, tablets, AI photo frames, and portable wearable devices. The server can be implemented as a standalone server or as a server cluster consisting of multiple servers. The invention will now be described in detail through specific embodiments.

[0021] Referring to Figure 2, which is a flowchart of the real-time digital human generation method provided in this embodiment of the invention, the method includes the following steps: S1: Obtain the user's multimodal data at the current moment.

[0022] Specifically, user multimodal data refers to raw signals containing multiple modalities collected in real time from the user's end. It includes at least audio data acquired through an audio receiving port and video data acquired through a video receiving port. Audio data includes speech segments, and video data includes facial expressions, head or upper body posture, lip movements, etc. This data serves as the raw input driving digital human generation, representing multiple states of the task.

[0023] For example, in a practical implementation, the system continuously captures the user's voice and image streams through sensors such as microphones and cameras. This raw data is then sent to a preprocessing module, for example, converting the audio waveform into a Mel spectrogram, or detecting faces from the video and extracting preliminary visual features, decomposing them into features such as facial expressions, head or upper body posture, and lip movements, in preparation for subsequent deep feature encoding.

[0024] For example, a user says, "How's the weather today?" (audio data). The microphone array in the picture frame continuously records this speech, obtaining the raw audio waveform. Simultaneously, the picture frame's camera captures video of the user's face (video data), recording the user's natural facial expressions and lip movements. The system performs preliminary preprocessing on the raw data: converting the audio waveform into a more easily processed Mel spectrogram sequence; detecting faces frame by frame from the video stream, and extracting visual features such as facial motion units, head pose Euler angles, and lip keypoint coordinates.

[0025] S2: Convert user multimodal data into a multimodal joint state vector.

[0026] Specifically, the multimodal joint state vector $z_k$ represents, in a unified feature space, the complete state of the digital human at time step $k$. This vector is formed by concatenating or fusing the feature vectors of each modality, i.e. $z_k = [a_k; e_k; p_k; l_k]$, where $a_k \in \mathcal{A}$, with $\mathcal{A}$ the audio modal feature space, such as the speech Mel spectrum, phoneme embeddings, or semantic vectors; $e_k \in \mathcal{E}$, with $\mathcal{E}$ the facial expression feature space, such as expression parameters or latent vectors; $p_k \in \mathcal{P}$, with $\mathcal{P}$ the head or upper body posture parameter space; $l_k \in \mathcal{L}$, with $\mathcal{L}$ the lip shape related feature space, such as lip keypoints or lip shape embeddings; and $k$ represents the discrete frame time index in the digital human generation process.

[0027] For example, the system needs to encode the multimodal information corresponding to the sentence "What's the weather like today?" into a joint state. The pre-processed features are fed into their respective encoding networks. The audio spectrogram is encoded into a 128-dimensional audio feature vector $a_k$ by a one-dimensional convolutional neural network; the visual features are encoded into a 256-dimensional visual feature vector by a small Transformer and can be further decoupled into expression features $e_k$, pose features $p_k$, and lip movement features $l_k$. The aligned modal feature vectors are concatenated or fused to form the multimodal joint state vector at the current moment k, which comprehensively represents the "auditory state" and "visual state" of the digital human at that moment.
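The concatenation of per-modality features into one joint state can be sketched as follows. This is a minimal illustration rather than the encoder of the embodiment: it assumes the features are already encoded as plain lists of floats, uses the example dimensions given later in [0039] (80, 64, 6, and 128), and records where each modality sits in the joint vector. `joint_state` is a hypothetical helper name.

```python
def joint_state(audio, expression, pose, lip):
    """Concatenate per-modality features and remember each modality's slice."""
    state, slices, start = [], {}, 0
    for name, feat in (("a", audio), ("e", expression), ("p", pose), ("l", lip)):
        state.extend(feat)
        slices[name] = slice(start, start + len(feat))
        start += len(feat)
    return state, slices


# Placeholder feature values; the dimensions follow the examples in [0039].
audio = [0.1] * 80
expression = [0.0] * 64
pose = [0.0, 0.1, -0.2, 0.0, 0.0, 0.0]
lip = [0.5] * 128

state, slices = joint_state(audio, expression, pose, lip)
```

Keeping the slices alongside the vector lets later stages read back a single modality, for example `state[slices["p"]]` for the pose parameters.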

[0028] S3: Input the multimodal joint state vector into the pre-trained cross-modal spatio-temporal average flow field network, and predict the cross-modal spatio-temporal average flow field through single-step inference, where the cross-modal spatio-temporal average flow field is used to describe the joint evolution of the multimodal states of the digital human from the current moment to the target moment.

[0029] Specifically, the cross-modal spatio-temporal average flow field network is a deep learning network based on a multi-branch Transformer architecture, and its core function is to perform single-step inference. Single-step inference means that the network directly predicts the state evolution over the entire time interval with only one forward propagation, thus bypassing multi-step iterative methods such as traditional diffusion models; this is the key to achieving real-time performance. The cross-modal spatio-temporal average flow field, denoted U, is the output of the network. It is not the final image, but a "displacement instruction" or "change route" that describes how the multimodal states (lip movement, expression, pose, etc.) of the digital human should cooperate and evolve on average from the current moment to the target moment.

[0030] For example, the system knows the state of the current frame (neutral expression, closed mouth), receives the speech features of the next moment (a strong audio signal containing the syllable "tian"), and needs to generate the next frame. The current joint state and the target information are input into the cross-modal spatio-temporal average flow field network. The cross-modal attention mechanism inside the network works dynamically. For example, it recognizes the syllable "tian", so it increases the coupling weight from the audio modality to the lip movement modality to ensure accurate lip movement; at the same time, to match the interrogative tone, it also increases the weight from the audio modality to the expression modality to generate a slightly puzzled expression. Finally, the cross-modal spatio-temporal average flow field U is output. U is not an image, but a command vector whose meaning includes: a lip shape component (the lips need to open to a specific shape to pronounce the sound "tian"); an expression component (eyebrows slightly raised, corners of the mouth slightly lifted); and a posture component (a slight head nod). All these changes need to be coordinated within 33 milliseconds. This is achieved through the single-step inference formula $z_{k+1} = z_k + U$, from which the target state $z_{k+1}$ at the next moment is obtained. Specifically, within the normalized time interval $[s_0, s_1] \subseteq [0, 1]$, the cross-modal spatiotemporal averaged flow field is defined as:

$$U(z_k, s_0, s_1) = \frac{1}{s_1 - s_0} \int_{s_0}^{s_1} C(\tau)\, v(z_\tau, \tau)\, d\tau$$

[0031] where $z_k$ represents the multimodal joint state at discrete-time index k; $s_0$ and $s_1$ are the normalized continuous-time start and end points; $\tau$ is the normalized continuous-time integration variable; $v(z_\tau, \tau)$ is the instantaneous multimodal velocity field, used to describe the direction and magnitude of the change of the multimodal states over continuous time; $U(z_k, s_0, s_1)$ is the cross-modal spatiotemporal averaged flow field within the normalized time interval $[s_0, s_1]$, used to predict the overall evolution of the multimodal states in a single step; and $C(\tau)$ is the cross-modal coupling matrix, used to model the information transmission relationships between different modes.
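The relationship between an averaged flow field and multi-step integration can be checked numerically with a toy example. The sketch below rests on stated assumptions: the instantaneous velocity is a hand-written function of time only, the coupling matrix is taken as the identity, and the average is computed by numerical integration, whereas in the method the network predicts the averaged flow field directly in one forward pass. All names are illustrative.

```python
def velocity(tau):
    """Toy instantaneous velocity of a 3-component state (assumed for illustration)."""
    return [2.0 * tau, 1.0, 3.0 * tau ** 2]


def average_flow(s0, s1, steps=1000):
    """Midpoint-rule estimate of U = 1/(s1 - s0) * integral of v over [s0, s1]."""
    h = (s1 - s0) / steps
    total = [0.0, 0.0, 0.0]
    for n in range(steps):
        v = velocity(s0 + (n + 0.5) * h)
        total = [t + vi * h for t, vi in zip(total, v)]
    return [t / (s1 - s0) for t in total]


def multi_step(z, s0, s1, steps):
    """Iterative integration, the kind of loop single-step inference avoids."""
    h = (s1 - s0) / steps
    for n in range(steps):
        v = velocity(s0 + (n + 0.5) * h)
        z = [zi + vi * h for zi, vi in zip(z, v)]
    return z


z_k = [0.5, -1.0, 2.0]
U = average_flow(0.0, 1.0)
# One update with the averaged flow; the interval length is 1 on [0, 1].
z_next_single = [zi + (1.0 - 0.0) * ui for zi, ui in zip(z_k, U)]
z_next_multi = multi_step(z_k, 0.0, 1.0, steps=50)
```

Both routes land on nearly the same target state, which is the point of predicting the average: a single addition replaces the 50 small steps.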

[0032] The cross-modal coupling matrix is represented as follows:

$$C(\tau) = \begin{bmatrix} w_{aa}(\tau) & w_{ae}(\tau) & w_{ap}(\tau) & w_{al}(\tau) \\ w_{ea}(\tau) & w_{ee}(\tau) & w_{ep}(\tau) & w_{el}(\tau) \\ w_{pa}(\tau) & w_{pe}(\tau) & w_{pp}(\tau) & w_{pl}(\tau) \\ w_{la}(\tau) & w_{le}(\tau) & w_{lp}(\tau) & w_{ll}(\tau) \end{bmatrix}$$

[0033] where $w_{ij}(\tau)$ represents the influence weight from mode j to mode i, which changes over time. The row subscript i indicates the target mode that is "influenced"; the column subscript j indicates the source mode that "exerts influence".

[0034] The subscripts a, e, p, l represent the audio, facial expression, posture, and lip shape modalities, respectively.

[0035] $w_{ij}(\tau)$ represents the influence weight of mode j on mode i at time τ. The weight values are adaptively learned by the cross-modal attention network and change dynamically over time, typically ranging from 0 to 1. The diagonal terms $w_{ii}(\tau)$ represent the temporal consistency constraint of the modality itself.
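A small sketch can make the row and column convention of the coupling matrix concrete. It assumes, beyond what the source states, that each row is normalized with a softmax so that every weight lies between 0 and 1 and each target modality's incoming weights sum to 1; the raw scores are invented for illustration and would come from the cross-modal attention network in practice.

```python
import math

MODALITIES = ("a", "e", "p", "l")  # audio, expression, posture, lip shape


def coupling_matrix(scores):
    """Row-wise softmax: row i is the target modality, column j the source."""
    matrix = {}
    for i in MODALITIES:
        exps = {j: math.exp(scores[i][j]) for j in MODALITIES}
        total = sum(exps.values())
        matrix[i] = {j: exps[j] / total for j in MODALITIES}
    return matrix


# Hypothetical raw scores at one instant tau.
scores = {i: {j: 0.0 for j in MODALITIES} for i in MODALITIES}
scores["l"]["a"] = 3.0  # audio strongly drives lip shape
scores["e"]["a"] = 1.5  # audio moderately drives expression
for m in MODALITIES:
    scores[m][m] += 1.0  # diagonal terms: each modality's own consistency

C = coupling_matrix(scores)
```

Here `C["l"]["a"]` is the weight with which audio (source) influences lip shape (target), matching the convention that the first subscript names the influenced modality.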

[0036] These weights $w_{ij}(\tau)$ are not preset; they are learned and computed adaptively in real time by the cross-modal attention network during the generation process. That is, the system can intelligently determine which modality should dominate at a given moment and how the modalities should cooperate, based on the current specific context (such as the content and emotion of the speech).

[0037] For example, in a scenario where a digital human says "Wow!" (expressing surprise), consider the initial moment of uttering the "W" sound. Audio-driven lip shape ($w_{la}$, very high weight): the system detects that the consonant "W" needs to be pronounced, requiring a "pursed lips" mouth shape. At this point, the cross-modal attention network strongly raises the weight $w_{la}$ (audio-driven lip shape). This means that the audio modality issues a strong driving instruction to the lip shape modality: prepare to pucker the lips immediately. Audio-driven facial expression ($w_{ea}$, increasing weight): "Wow!" is often accompanied by a surprised expression, so the cross-modal attention network also raises the weight $w_{ea}$ (audio-driven facial expression), driving the facial expression modality towards "surprise" (raised eyebrows, wide eyes). Coordination between facial expression and lip shape ($w_{le}$ and $w_{el}$, possibly moderate weights): when making a "surprised" expression, the mouth usually opens slightly involuntarily, coordinating with the "W" shaped pout, so there is a moderate mutual influence weight between facial expression and mouth shape. Posture following ($w_{pa}$ or $w_{pe}$, lower weights): the head may tilt slightly backward or forward when expressing strong surprise. This movement may be directly driven by the rhythm of the audio or indirectly by strong facial expressions; therefore, the posture modality receives less weight from either the audio or the facial expression. The diagonal weights of all modalities ensure that their respective states do not change abruptly but evolve smoothly based on the previous frame. At the core moment of uttering the "ow" vowel, the weight $w_{la}$ reaches its peak: the mouth shape quickly transitions from "W" to a wide, round "ow", at which point the audio's control over the mouth shape has the highest weight. The weight $w_{ea}$ remains high, and the surprised expression persists and is amplified. The weight $w_{pa}$ may increase, and the head may make a noticeable nodding motion accompanied by accentuation.

[0038] The cross-modal coupling matrix $C(\tau)$ acts as a real-time influence adjustment center: it increases the weights between the corresponding modalities when precise alignment is needed (such as phoneme to lip movement), and it establishes and strengthens the corresponding connections when emotional expression is required (such as tone of voice to facial expression). It ensures that all changes occur collaboratively rather than independently, thereby generating natural and realistic sequences of digital human movements.

[0039] The network architecture of the cross-modal spatiotemporal averaged flow field network is as follows. Input layer, modal feature concatenation: receives feature vectors from the multimodal encoder, including audio features (such as an 80-dimensional Mel spectrum), facial expression features (such as 64-dimensional facial expression parameters), pose features (such as 6-dimensional head posture parameters), and lip shape features (such as a 128-dimensional lip-sync embedding). These features are concatenated into a joint input vector. Input layer, temporal embedding: a normalized temporal embedding vector for (s0, s1) (e.g., sinusoidal position coding) is introduced to describe continuous time intervals.

[0040] Multi-branch Transformer encoder: an independent Transformer encoder branch is configured for each modality. For example, the audio branch has 4 Transformer layers, 8 attention heads, and a hidden layer dimension of 256. The visual branches (expression / lip shape / posture) each have 4 Transformer layers with a hidden layer dimension of 128. A cross-modal attention layer is added after the output of each branch and dynamically calculates the weights of the coupling matrix C(τ). For example, the attention weights from the audio modality to the lip shape modality can be calculated through a query, key, and value mechanism.

[0041] Flow prediction head: receives the fused multimodal features and outputs the cross-modal spatiotemporal averaged flow field through a two-layer fully connected network (MLP). The output dimension is the same as the dimension of the multimodal joint state.
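The shape contract of the flow prediction head can be sketched with a tiny two-layer network. Everything beyond "two fully connected layers whose output matches the joint state dimension" is an assumption here: the ReLU activation, the random initialization, and the toy fused and hidden sizes are chosen only to keep the example small, and the state dimension reuses the example feature sizes from [0039].

```python
import random


def linear(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(rng, n_in, n_out):
    """Random weights and zero bias (untrained, for shape illustration only)."""
    scale = (1.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def flow_head(fused, params):
    """Two-layer MLP mapping fused features to a flow field vector."""
    (w1, b1), (w2, b2) = params
    return linear(relu(linear(fused, w1, b1)), w2, b2)


rng = random.Random(0)
STATE_DIM = 80 + 64 + 6 + 128   # example feature sizes from [0039]
FUSED_DIM, HIDDEN_DIM = 32, 16  # toy sizes, assumed
params = (init_layer(rng, FUSED_DIM, HIDDEN_DIM),
          init_layer(rng, HIDDEN_DIM, STATE_DIM))
fused = [rng.uniform(-1.0, 1.0) for _ in range(FUSED_DIM)]
U = flow_head(fused, params)
```

Because the output has the same length as the joint state, it can be added to the current state element by element to obtain the target state.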

[0042] The total number of parameters in the cross-modal spatiotemporal averaged flow field network is approximately 50M-100M, ensuring lightweight design to meet real-time requirements (single-step inference <100ms).

[0043] When training the cross-modal spatiotemporal average flow field network, high-precision facial expression and pose sequences can be recorded using motion capture devices (such as Vicon), with audio acquired simultaneously. Alternatively, large-scale publicly available audio-video data can be used directly. Dataset construction: VoxCeleb2 (containing video and audio) and LRW (a lip-reading dataset) are used, providing face videos and synchronized audio. During data preprocessing, the audio features are an 80-dimensional Mel spectrum extracted with a frame shift of 10 ms and a window length of 25 ms. The visual features come from facial keypoint detection (e.g., 68 points) and the extraction of motion parameters for regions such as the lips and eyebrows. Facial expression parameterization: facial coefficients (e.g., 50-dimensional) are fitted using a 3DMM (3D Morphable Model). The pose parameters are the Euler angles of head rotation (pitch, yaw, roll). Audio and video timestamps are rigorously calibrated for temporal alignment, with errors controlled within ±5 ms. Each data sample contains 2 seconds of time-series data (corresponding to 60 frames at 30 fps). By combining these mainstream public datasets, approximately 3,000-5,000 hours of high-quality audio-video training data can be obtained. The flow field matching loss function directly guides the network to predict an accurate flow field U. A large flow field matching loss indicates that the motion trend predicted by the network is incorrect. Through backpropagation, the network adjusts its internal parameters (especially those that generate the weights of the cross-modal coupling matrix C(τ)) so that the next predicted U is closer to the actual change, directly training the network's core prediction capability until the intended effect is achieved.
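The flow field matching loss can be illustrated with a short sketch. The source states only that the loss is built from the difference between the predicted flow field and the actual state change of a consecutive state pair; the mean squared error used below is one plausible choice and should be read as an assumption, as should the function name.

```python
def flow_matching_loss(u_pred, z_k, z_next):
    """Mean squared error between predicted flow and actual state change.

    The squared-error form is an assumption; the source does not give the
    exact formula of the flow field matching loss.
    """
    target = [b - a for a, b in zip(z_k, z_next)]
    return sum((u - t) ** 2 for u, t in zip(u_pred, target)) / len(target)


# Toy consecutive states and two candidate predictions.
z_k = [0.0, 1.0]
z_next = [1.0, 3.0]          # actual change is [1.0, 2.0]
loss_bad = flow_matching_loss([1.0, 1.0], z_k, z_next)
loss_good = flow_matching_loss([1.0, 2.0], z_k, z_next)
```

A prediction that matches the actual change drives the loss to zero, while a wrong motion trend yields a larger value for backpropagation to reduce.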

[0044] S4: Render the cross-modal spatiotemporal average flow field to generate a digital human image frame at the target time.

[0045] Specifically, rendering refers to the process of converting an abstract, cross-modal, spatiotemporal average flow field into a concrete pixel image. An image frame is a single static image that constitutes a digital human video. It is the basic unit of a video stream; multiple consecutive image frames played rapidly within a very short time, due to the persistence of vision in the human eye, create the smooth, dynamic video we see. A target-moment digital human image frame refers to an image that can be output and displayed at a future point in time; multiple image frames form a continuous video stream.

[0046] For example, the cross-modal spatiotemporal averaged flow field network outputs a cross-modal spatiotemporal averaged flow field U. This vector encodes coordinated change instructions: the expression component drives the corners of the mouth to turn up and the eyes to narrow (a smiling expression); the lip shape component controls lip movement, transitioning from a closed state to the preparatory mouth shape for the consonant of "you"; and the posture component controls a slight head nod.

[0047] The render generator calculates $z_{k+1} = z_k + U$ to obtain the target state $z_{k+1}$, which includes encoded features such as the smiling expression, the specific lip movement, and the nodding gesture; here $z_k$ is the initial state. The target state $z_{k+1}$ is rendered as a single image frame. Typically, multiple target states are generated in one inference step, and each target state is rendered into an image frame.

[0048] S5: Transmit the digital human image frame to the display interface for display.

[0049] Specifically, this step involves transmitting each generated image frame to the user terminal in real time for display. Multiple frames form a video stream, which is then integrated with audio features to create a unified video stream of both visuals and sound. For example, after acquiring multimodal data from a user, the computer device processes the data using the above steps and displays the corresponding digital human image frames in real time on the display module. Simultaneously, the data can be transmitted to other terminals for storage or display. The display interface can be a smartphone, an AI photo frame, a computer, or other intelligent display terminals.

[0050] By implementing single-step deterministic inference, the generation path that relies on multi-step iterative denoising is abandoned. Instead, a physical quantity called the "cross-modal spatiotemporal average flow field" is constructed and predicted. This flow field defines the overall evolution trend from the current state to the target state. Therefore, the model can directly output the final generation result in a single forward propagation, eliminating the inherent delay caused by multiple iterations.

[0051] In some implementations, the cross-modal spatiotemporal average flow field network employs a multi-branch Transformer encoding structure; the multi-branch Transformer encoder encodes the multimodal joint state vector and models the dynamic coupling relationship between different modal features through a cross-modal attention mechanism.

[0052] Specifically, a multi-branch Transformer encoding structure refers to a network architecture containing multiple independent, parallel Transformer encoder branches, each specializing in processing the feature sequence of a specific modality (such as audio, facial expressions, or lip movements). A cross-modal attention mechanism is a variant of the attention mechanism used to calculate correlation weights between features from different modalities, thereby enabling information interaction and fusion between modalities. Each modal feature sequence (such as the audio Mel spectrum or the facial expression parameters) is first input into its dedicated Transformer branch for deep encoding, extracting high-order temporal features. These features are then fed into a cross-modal attention layer. In this layer, audio features are used as queries, and lip movement features are used as keys and values. By calculating attention weights, the network can dynamically determine how strongly the audio should influence the lip movement features when a specific phoneme is pronounced. For example, when the plosive "P" is pronounced, the audio→lip movement attention weight is significantly increased to ensure that a closed lip shape is generated. The multi-branch Transformer encoding structure thus achieves deep understanding and fusion of multimodal information, dynamically establishes semantic relationships between modalities, and ensures that the generated digital human actions (lip movements, facial expressions) are highly coordinated with the driving signal (speech).
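The cross-modal attention step described above can be illustrated with a minimal sketch of scaled dot-product attention in which audio frames act as queries and lip frames act as keys and values. This is an illustration under stated assumptions, not the network of this application: the function names and toy data are hypothetical, the learned query, key, and value projections are omitted, and both modalities are assumed to share one feature dimension.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(audio_feats, lip_feats):
    """Scaled dot-product attention: audio frames are queries, lip frames are keys and values.

    Learned projection matrices are omitted for brevity (an assumption of this
    sketch), so both modalities must share the same feature dimension.
    Returns (fused, weights); weights[i][j] is how strongly audio frame i
    attends to lip frame j.
    """
    dim = len(audio_feats[0])
    fused, weights = [], []
    for q in audio_feats:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in lip_feats]
        w = softmax(scores)
        weights.append(w)
        fused.append([sum(w[j] * lip_feats[j][d] for j in range(len(lip_feats)))
                      for d in range(dim)])
    return fused, weights

# Toy data (hypothetical): 2 audio frames, 3 lip frames, 2-D features.
audio = [[1.0, 0.0], [0.0, 1.0]]
lips = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
fused, weights = cross_modal_attention(audio, lips)
```

In this toy case each audio frame puts most of its attention on the lip frame whose features point in the same direction, which mirrors the increase in audio→lip weight described for the plosive "P".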

[0053] In some implementations, the following steps are also included: (1) Obtain real-time user feedback information and encode the feedback information into a correction vector; (2) The flow field residual network corrects the cross-modal spatiotemporal average flow field based on the correction vector; (3) Generate corrected digital human image frames based on the corrected cross-modal spatiotemporal average flow field; (4) Transmit the corrected digital human image frame to the display interface for display.

[0054] Specifically, user feedback refers to information input by the user through natural interaction, used to adjust the digital human generation process. The flow field residual network is a lightweight neural network used to calculate corrections to the original predicted flow field based on user feedback.

[0055] In some implementations, user feedback information includes at least one of voice, gesture, and text.

[0056] Specifically, the system listens for user feedback in real time and converts it into a feedback vector $f$ using an encoder (such as speech recognition, gesture recognition, or text recognition). Subsequently, $f$ is input into the flow field residual network together with the original flow field $U$ and the current state $S$, and the network outputs a flow field correction $\Delta U$.

[0057] The expression for the flow field residual network is as follows: $\Delta U = R(U, f, S)$.

[0058] Where $U$ represents the original predicted flow field; $f$ is the encoded user feedback vector (voice, text, or gesture); $S$ represents the current multimodal state; $R$ is the flow field residual network; and $\Delta U$ is the flow field correction amount.

[0059] The corrected flow field is $U' = U + \alpha \, \Delta U$.

[0060] Here, $\alpha$ is the learning rate, with a value in [0, 1]. It can be adjusted online and determines how strongly user feedback takes effect and how quickly. $\alpha = 0$ means that user feedback is completely ignored and the digital human moves strictly according to the original predicted $U$. $\alpha = 1$ means that the correction $\Delta U$ calculated by the flow field residual network is fully adopted. In most cases $0 < \alpha < 1$; values within this range allow for smooth, gradual adjustments and avoid abrupt changes. For example, $\alpha = 0.3$ indicates "slight adoption": when the user says "smile," the digital human might only slightly raise the corners of its mouth, a gentle and natural change. $\alpha = 0.8$ indicates "strong adoption": under the same command, the digital human displays a larger, more pronounced smile.
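The effect of α can be made concrete with a short sketch. It assumes the update takes the form U' = U + αΔU, which is consistent with the α = 0 and α = 1 cases described above; the function name, the three-component layout of the flow field, and the numeric values are hypothetical.

```python
def apply_feedback(flow, delta, alpha):
    """Blend the predicted flow field U with the correction dU: U' = U + alpha * dU."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(flow) != len(delta):
        raise ValueError("U and dU must have the same dimension")
    return [u + alpha * d for u, d in zip(flow, delta)]

# Hypothetical 3-component flow field: [expression, lip shape, posture].
U = [0.10, 0.40, 0.05]
delta_U = [0.50, 0.00, 0.00]   # "smile": only the expression component changes

ignored = apply_feedback(U, delta_U, 0.0)   # feedback ignored
slight = apply_feedback(U, delta_U, 0.3)    # "slight adoption"
strong = apply_feedback(U, delta_U, 0.8)    # "strong adoption"
```

With these values the expression component moves from 0.10 to 0.25 at α = 0.3 and to 0.50 at α = 0.8, while the lip shape and posture components are unchanged.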

[0061] For example, when the digital human is introducing a product with a serious expression, the user might say, "Make your expression more natural." The system encodes this voice command as the feedback vector f. Based on this, the flow field residual network calculates a correction ΔU that primarily affects the facial expression modality. Ultimately, the new flow field U' = U + αΔU allows the digital human to smoothly transition to a smiling expression in subsequent frames. By adjusting the output of digital human image frames based on real-time user feedback, the digital human generation process is transformed from one-way playback into an interactive, guided process, greatly enhancing the application's flexibility and user engagement.

[0062] In some implementations, the flow field residual network is a trainable, lightweight neural network (such as a small MLP). An MLP (Multi-Layer Perceptron) is a classic feedforward neural network composed of stacked fully connected layers and activation functions. Its simple structure and high computational efficiency make it well suited as a small, dedicated module in a large system. Its relatively few layers and neurons per layer ensure extremely fast forward propagation, meeting the real-time requirement that feedback correction take less than 30 ms.

[0063] In this application, the structure of the flow field residual network is as follows: the input layer takes three input vectors, namely the original flow field U, the feedback vector f, and the current multimodal state; these are concatenated into a longer composite feature vector, which serves as the network input. The network contains 1 to 3 hidden layers, each with hundreds of neurons. Non-linearity is introduced using activation functions such as ReLU, enabling the network to learn complex correction strategies.

[0064] The output layer must have the exact same dimension as the original flow field U, directly outputting the correction amount ΔU. A linear activation function is typically used.
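The layer layout described above (concatenated input, a few ReLU hidden layers, linear output with the dimension of U) can be sketched in plain Python. This is a minimal, untrained illustration: the class name, the layer sizes, the random initialization, and the input values are all assumptions, and the hidden layers are far smaller than the hundreds of neurons mentioned above so that the example stays short.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, weights, bias):
    """Dense layer: weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

class FlowResidualMLP:
    """Concatenate (U, f, S) -> hidden ReLU layers -> linear output with dim(U)."""

    def __init__(self, flow_dim, feedback_dim, state_dim, hidden=(16, 16), seed=0):
        rng = random.Random(seed)
        sizes = [flow_dim + feedback_dim + state_dim, *hidden, flow_dim]
        self.layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

    def __call__(self, flow, feedback, state):
        x = list(flow) + list(feedback) + list(state)   # composite input vector
        for weights, bias in self.layers[:-1]:
            x = relu(linear(x, weights, bias))
        weights, bias = self.layers[-1]
        return linear(x, weights, bias)                 # linear output: the correction

# Hypothetical sizes: 3-D flow field, 2-D feedback vector, 4-D state.
net = FlowResidualMLP(flow_dim=3, feedback_dim=2, state_dim=4)
delta_U = net([0.1, 0.4, 0.05], [1.0, 0.0], [0.0, 0.2, 0.1, 0.3])
```

Because the last layer has exactly `flow_dim` outputs and no activation, the returned correction can be added to U component by component.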

[0065] For example, a digital human is introducing a product with a neutral expression, and the user gives the voice command "Please smile".

[0066] The voice command is encoded as the feedback vector f. This vector, the current state corresponding to the neutral expression, and the original flow field U are concatenated and input into the flow field residual network. After internal calculations, a correction vector ΔU is output. The ΔU vector shows a significant positive change in the "expression"-related dimensions (driving the corners of the mouth to turn upwards), while the change is small or zero in other dimensions such as "lip shape" and "posture." Applying the update formula U' = U + αΔU yields the new cross-modal spatiotemporal average flow field, which guides the generator to make the digital human naturally display a smiling expression in subsequent frames.

[0067] In some implementations, the method further includes: calculating the time offset between two modalities in the multimodal joint state vector using a time offset estimation network; and performing time compensation on the delayed modality based on the time offset.

[0068] Specifically, the time offset estimation network is a neural network used to compute the inherent delay between the feature sequences of two modalities. Time compensation refers to time-shifting and interpolating the feature sequences to eliminate the delay between modalities. Given feature sequences from two modalities (such as audio and lip movements), the time offset estimation network predicts the temporal offset $\Delta t$ between them by calculating their cross-correlation or by using an attention mechanism.

[0069] The specific expression for the time offset estimation network is as follows: $\Delta t_{ij} = T(F_i, F_j)$.

[0070] Where $F_i$ and $F_j$ represent the feature sequences of the $i$-th and $j$-th modalities; $T$ is the time offset estimation network; and $\Delta t_{ij}$ is the time offset of modality $i$ relative to modality $j$, in milliseconds.

[0071] For example, the time offset estimation network finds that the audio feature sequence is 50 ms ahead of the lip-sync feature sequence. The system then interpolates the visual feature sequence and shifts it forward by 50 ms, ensuring that at the precise moment the 'A' sound is pronounced, the corresponding visual feature is an open mouth, thus achieving audio-visual synchronization.

[0072] The time delay is calculated by the time offset estimation network, and the delayed modalities are compensated accordingly, fundamentally solving the problem of audio-visual asynchrony caused by different signal acquisition and processing speeds. Providing high-quality, time-aligned input for cross-modal fusion is the foundation for ensuring the naturalness of the generated content.
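As a rough illustration of the cross-correlation option mentioned above, the following sketch estimates the offset between two one-dimensional feature sequences by exhaustive search over integer lags. It is a hand-written stand-in for the learned time offset estimation network, not that network itself; the function name, the 10 ms frame spacing, and the toy signals are assumptions. The sign convention follows the text: a negative result means the first sequence leads the second.

```python
def estimate_offset_ms(seq_i, seq_j, frame_ms, max_lag):
    """Return the lag of seq_i relative to seq_j in ms (negative: seq_i leads)."""
    mean_i = sum(seq_i) / len(seq_i)
    mean_j = sum(seq_j) / len(seq_j)
    xi = [v - mean_i for v in seq_i]
    xj = [v - mean_j for v in seq_j]
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Correlate seq_i[t] with seq_j[t - lag] over the overlapping samples.
        score = sum(xi[t] * xj[t - lag]
                    for t in range(len(xi)) if 0 <= t - lag < len(xj))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag * frame_ms

# Toy data (hypothetical): one frame every 10 ms, the same pulse in both streams.
audio_energy = [0.0] * 20
mouth_opening = [0.0] * 20
audio_energy[5] = 1.0     # audio event at t = 50 ms
mouth_opening[10] = 1.0   # matching lip event at t = 100 ms
delta_t = estimate_offset_ms(audio_energy, mouth_opening, frame_ms=10, max_lag=8)
```

Here `delta_t` is -50, matching the example in which the audio is 50 ms ahead of the lip movements.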

[0073] In some implementations, the time offset estimation network is a regression model whose core objective is to accurately calculate the time offset Δt (typically in milliseconds) between two feature sequences of different modalities. The network is based on a one-dimensional convolutional neural network or a recurrent neural network. Taking a one-dimensional convolutional neural network as an example, the input layer receives two feature sequences $F_i$ and $F_j$; typically, these are concatenated along the feature dimension to form a wider temporal feature tensor. The hidden layers consist of multiple stacked one-dimensional convolutional layers, pooling layers, and activation functions (such as rectified linear units), which extract discriminative local temporal patterns from the sequence. The features extracted by the convolutional layers are then flattened and passed through several fully connected layers for non-linear transformation. The output layer is a linear fully connected layer that outputs a scalar value. Taking a recurrent neural network as an example, the two feature sequences $F_i$ and $F_j$ are input and processed step by step. The hidden layer uses LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit) units to process the sequence, and its hidden states can capture long-range temporal dependencies. The output layer takes the hidden state of the last time step and regresses Δt through a fully connected layer.

[0074] For example, the system extracts temporal features of the same duration (e.g., 2 seconds) from the raw audio and video streams. The audio is converted into a Mel spectrogram sequence $F_{\text{audio}}$. Faces in the video stream are detected and encoded as a lip-sync sequence $F_{\text{lip}}$. $F_{\text{audio}}$ and $F_{\text{lip}}$ are input into the network. Inside the network (taking a one-dimensional convolutional neural network as an example), the convolutional kernels slide along the time axis to compute the cross-correlation function, or similar features, of the two sequences and find the optimal alignment point. The network finds that when $F_{\text{audio}}$ is shifted backward by 50 ms (or, equivalently, when $F_{\text{lip}}$ is shifted forward by 50 ms), the two sequences score highest on a consistency metric. Therefore, it outputs Δt = -50 ms. Based on the predicted Δt, the system performs interpolation and time-shifting on the feature sequences: since Δt = -50 ms, the $F_{\text{lip}}$ sequence is interpolated (to maintain smoothness) and then shifted forward by 50 ms. This moves visual features (such as mouth opening) originally at t = 100 ms to t = 50 ms, aligning them on the time axis with the audio features they correspond to. The time-compensated, synchronized feature sequences are then provided to the cross-modal spatiotemporal average flow field network.
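The interpolation and time-shifting step can be sketched as a resampling of the delayed sequence. The example below is a minimal illustration that assumes linear interpolation and nearest-sample padding at the sequence ends; the function name, the 40 ms frame spacing, and the mouth-opening values are hypothetical. Shifting "forward" is taken to mean that a feature originally at time t appears at t minus the shift.

```python
def shift_earlier(seq, frame_ms, shift_ms):
    """Resample seq so that a feature originally at time t appears at t - shift_ms.

    Fractional frame positions are linearly interpolated; positions past the
    ends of the sequence reuse the nearest sample.
    """
    out = []
    last = len(seq) - 1
    for n in range(len(seq)):
        pos = n + shift_ms / frame_ms            # source position, in frames
        pos = min(max(pos, 0.0), float(last))    # clamp to the valid range
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1.0 - frac) * seq[lo] + frac * seq[hi])
    return out

# Toy data (hypothetical): lip frames every 40 ms, mouth opening peaks at t = 120 ms.
frame_ms = 40
mouth_opening = [0.0, 0.0, 0.5, 1.0, 0.5, 0.0]
delta_t = -50                                    # audio leads the lips by 50 ms
aligned = shift_earlier(mouth_opening, frame_ms, -delta_t)
```

Because 50 ms is not a whole number of 40 ms frames, the shifted samples fall between original frames, which is why interpolation is needed before the shift can be applied.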

[0075] In some implementations, the method also includes: acquiring a training dataset containing multiple sets of temporally consecutive digital human multimodal state sequences; inputting the temporally consecutive multimodal state pairs in the training dataset into the cross-modal spatiotemporal average flow field network to obtain the cross-modal spatiotemporal average flow field predicted by the network; constructing a flow field matching loss based on the difference between the predicted cross-modal spatiotemporal average flow field and the actual state changes in the multimodal state pairs, wherein the flow field matching loss characterizes the degree of multimodal synchronization; and performing backpropagation based on the flow field matching loss to update the parameters of the cross-modal spatiotemporal average flow field network and the parameters of the time offset estimation network.

[0076] Specifically, flow field matching loss is a loss function used to measure the difference between the predicted cross-modal spatiotemporal average flow field and the actual state changes. Its goal is to enable the predicted cross-modal spatiotemporal average flow field to accurately push the current state toward the target state.

[0077] The training of the cross-modal spatiotemporal average flow field network is divided into a pre-training stage and a fine-tuning and interactive adaptation stage. In the pre-training stage, a large-scale digital human multimodal dataset is input into the cross-modal spatiotemporal average flow field network for learning. At the same time, a loss function is constructed to continuously correct the network, resulting in the pre-trained cross-modal spatiotemporal average flow field network.

[0078] The expression for the pre-training loss function is: $L_{\text{pre}} = \lambda_1 L_{\text{flow}} + \lambda_2 L_{\text{rec}}$.

[0079] Where $L_{\text{flow}}$ is the flow field matching loss, $L_{\text{rec}}$ is the reconstruction loss, and $\lambda_1$ and $\lambda_2$ are weight hyperparameters that control the contributions of the flow field matching loss and the reconstruction loss to the final loss, respectively.

[0080] Alignment loss is used to measure the temporal consistency of different modalities (such as audio, facial expressions, gestures, and lip movements). This loss ensures synchronization between multimodal data and avoids inconsistencies, for example between lip movements and speech, or between facial expressions and speech emotion. Common alignment losses include time window-based alignment and cross-modal synchronization error.

[0081] $\lambda_1$ controls the weight of the flow field matching loss $L_{\text{flow}}$ in the total loss. A higher weight makes the model pay more attention to the coordination and consistency between modalities, avoiding obvious jitter and desynchronization. $\lambda_2$ controls the weight of the reconstruction loss $L_{\text{rec}}$ in the total loss. A higher weight makes the model pay more attention to the quality of the generated content, ensuring a high degree of similarity to the input data. The values of $\lambda_1$ and $\lambda_2$ are usually adjusted dynamically. Generally speaking, when $\lambda_1$ is large, the system pays more attention to the consistency of the flow field; when $\lambda_2$ is large, the system pays more attention to the similarity between the generated image and the real data.
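The weighted combination of the two pre-training losses can be sketched as follows. The source states only that the flow field matching loss measures the difference between the predicted flow field and the actual state change, and that the reconstruction loss measures similarity to real data; the use of mean squared error for both terms, the function and parameter names, and the toy values are assumptions of this sketch.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def pretraining_loss(pred_flow, state_now, state_next, rendered, reference,
                     lambda_flow=100.0, lambda_rec=1.0):
    """Weighted sum of a flow field matching loss and a reconstruction loss."""
    actual_change = [n - c for n, c in zip(state_next, state_now)]
    flow_loss = mse(pred_flow, actual_change)   # predicted flow vs. actual state change
    rec_loss = mse(rendered, reference)         # generated output vs. real data
    total = lambda_flow * flow_loss + lambda_rec * rec_loss
    return total, flow_loss, rec_loss

# Toy state pair and outputs (hypothetical values).
total, flow_loss, rec_loss = pretraining_loss(
    pred_flow=[0.9, 0.5],
    state_now=[0.0, 0.0],
    state_next=[1.0, 0.5],
    rendered=[0.2, 0.4],
    reference=[0.2, 0.6],
)
```

With a flow weight of 100 and a reconstruction weight of 1, the small flow mismatch in this toy case dominates the total, which reflects the emphasis on cross-modal consistency described above.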

[0082] The pre-trained cross-modal spatiotemporal average flow field network can be directly applied to real-world scenarios, and a feedback loss function can be built for fine-tuning and interactive adaptation during use.

[0083] The expression for the feedback loss function is as follows: $L_{\text{ft}} = \lambda_3 L_{\text{fb}} + \lambda_4 L_{\text{delay}}$, where $L_{\text{fb}}$ is the user feedback loss and $L_{\text{delay}}$ is the delay penalty.

[0084] $\lambda_3$ and $\lambda_4$ are weight hyperparameters, which control the impact of user feedback and of the delay penalty on training, respectively.

[0085] $\lambda_3$ controls the weight of the user feedback loss $L_{\text{fb}}$ in the total loss. A higher weight makes the model focus more on adjusting the generated results based on user feedback, thereby improving user satisfaction. $\lambda_4$ controls the weight of the delay penalty $L_{\text{delay}}$ in the total loss. A higher weight increases the penalty for latency, ensuring that the system generates digital humans with low latency (typically required to be less than 30 ms). $\lambda_3$ and $\lambda_4$ need to be adjusted according to the requirements of real-time interaction. Generally speaking, setting a larger $\lambda_3$ makes the system more sensitive to user feedback, while $\lambda_4$ is used to ensure that the model's latency during generation stays below 30 ms.

[0086] Generally speaking, the pre-training weights $\lambda_1$ and $\lambda_2$ (for the flow field matching loss and the reconstruction loss) are usually set between $10^{-5}$ and $10^{3}$, with the optimal values selected through grid search or random search. The feedback-stage weights $\lambda_3$ and $\lambda_4$ (for the user feedback loss and the delay penalty) are usually adjusted in the range of 1 to 100, depending on the requirements of real-time interaction and the tolerance for system latency.

[0087] For example, $\lambda_1 = 100$ and $\lambda_2 = 1$ make the model pay more attention to the consistency of the flow field and avoid inconsistencies between different modalities; $\lambda_3 = 10$ increases the importance attached to user feedback, making the generated content more in line with the user's real-time instructions; and $\lambda_4 = 1$ ensures low latency and smooth real-time interaction.

[0088] Through the above adjustments, the system can provide a better user experience while ensuring the quality of digital human generation.

[0089] In some implementations, the method for generating digital human image frames by rendering the cross-modal spatiotemporal average flow field is as follows: (1) Obtain the target multimodal joint state vector based on the cross-modal spatiotemporal average flow field; (2) Render the image corresponding to the target multimodal joint state vector through dynamic resolution flow field generation to produce a digital human image frame, in which different regions have different resolutions.

[0090] Specifically, dynamic resolution flow field generation allocates computational resources differently according to the visual importance of different regions of an image, thereby achieving a balance between high quality and high efficiency. Dynamic resolution flow field generation does not refer to generating a new object named "dynamic resolution flow field," but rather to a rendering strategy. This strategy employs a dynamic resolution mechanism when generating images from the cross-modal spatiotemporal average flow field U. The direct object of its rendering is the target multimodal joint state calculated from the cross-modal spatiotemporal average flow field U: $S_k = S_{k-1} + U$, that is, the transition from the previous multimodal joint state $S_{k-1}$ to the target joint state $S_k$.

[0091] The target multimodal joint state vector encodes the various states of the digital human at time t, including the head, upper body, mouth shape and other parts. After fusing the various parts, the corresponding initial image is formed. Then, the initial image is rendered through a dynamic resolution mechanism to obtain the digital human image frame.

[0092] In some implementations, a method for generating digital human image frames by rendering the image corresponding to the target multimodal joint state vector through a dynamic resolution flow field includes: (1) Based on the visual attention mechanism, the visual importance of each position in the image corresponding to the multimodal joint state vector of the target is scored; (2) Assign resolution at each location based on visual importance score; (3) The different resolution positions are fused by Gaussian weighting to generate digital human image frames.

[0093] Specifically, the visual attention mechanism is a subnetwork used to predict the visual importance of different regions in an image. Gaussian weighted fusion is an image fusion technique used to smoothly stitch together generated regions of different resolutions, eliminating seams. The visual attention network analyzes the target state and outputs a heatmap A(x,y) of the same size as the image. Each value is between 0 and 1, and the larger the value, the more important the position (such as the lips or eyes).

[0094] The expression for the visual attention mechanism is as follows: $A(x, y, k) = \sigma\big(\mathrm{MLP}([\mathrm{PE}(x, y);\ T_k;\ S_k])\big)$.

[0095] Where $(x, y)$ are pixel or region location coordinates.

[0096] $T_k$ is a vector representing the time information of frame $k$.

[0097] $\mathrm{PE}(x, y)$ is the position encoding, a vector used to identify the absolute or relative position of a pixel or region (x, y) in an image. It typically uses sine / cosine encoding or learnable embedding vectors.

[0098] $S_k$ is the multimodal joint state of the $k$-th frame, determined by the formula $S_k = S_{k-1} + U$. It is the complete internal state description of the digital human in the next frame (the k-th frame), and it encodes the "semantics" that the digital human should have at that moment, such as: "smiling, opening its mouth to make an 'ah' sound, and nodding slightly."

[0099] $A(x, y, k)$ is the visual importance score for this location, with a value in [0,1].

[0100] MLP() stands for Multilayer Perceptron, a small, fully connected neural network responsible for learning how to combine the position, time, and state information into an initial importance score. Through training, the system learns to determine the importance of each location in the image for representing the current target state.

[0101] σ() is the activation function, specifically the Sigmoid activation function in this application, which maps any real number to the interval (0, 1). Normalizing the raw scores output by the MLP to between 0 and 1 creates a standard, interpretable probability or weight value, facilitating subsequent resource allocation decisions.
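The scoring pipeline described above (position encoding, time information, and target state fed to a small MLP, followed by a Sigmoid) can be sketched as follows. This is an untrained illustration under several assumptions: the class and function names are hypothetical, sine / cosine encoding is used for both position and frame index, the MLP has a single hidden ReLU layer with random weights, and the three-component state vector is a toy stand-in for the target state of frame k.

```python
import math
import random

def sinusoidal(value, n_freqs=2):
    """Sine/cosine encoding of a scalar at a few frequencies."""
    out = []
    for i in range(n_freqs):
        out += [math.sin(value * 2 ** i), math.cos(value * 2 ** i)]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class AttentionScorer:
    """Score = sigmoid(MLP([encoding of (x, y); encoding of k; state]))."""

    def __init__(self, state_dim, hidden=8, seed=0):
        rng = random.Random(seed)
        n_in = 4 + 4 + 4 + state_dim   # encodings of x, y, k plus the state vector
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]

    def __call__(self, x, y, k, state):
        feats = sinusoidal(x) + sinusoidal(y) + sinusoidal(k) + list(state)
        hidden = [max(0.0, sum(w * f for w, f in zip(row, feats))) for row in self.w1]
        return sigmoid(sum(w * h for w, h in zip(self.w2, hidden)))

scorer = AttentionScorer(state_dim=3)
state_k = [0.6, 0.9, 0.1]   # hypothetical target state: [expression, lip shape, posture]
heatmap = [[scorer(x / 4, y / 4, 5, state_k) for x in range(4)] for y in range(4)]
```

The Sigmoid guarantees that every entry of `heatmap` lies strictly between 0 and 1, so the map can be compared directly against resolution thresholds; meaningful scores would of course require trained weights.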

[0102] For example, when the system detects A(x,y,k) > 0.9 in the mouth area, that area is automatically generated at high resolution; the background area, where A(x,y,k) < 0.3, is generated at a lower resolution; and Gaussian blending is used to eliminate resolution boundaries and ensure image continuity. The specific resolution allocation strategy is as follows:

$$\text{Resolution}(x, y, k) = \begin{cases} \text{high}, & A(x, y, k) \ge \tau_{\text{high}} \\ \text{medium}, & \tau_{\text{low}} \le A(x, y, k) < \tau_{\text{high}} \\ \text{low}, & A(x, y, k) < \tau_{\text{low}} \end{cases}$$

Where $\tau_{\text{high}}$ typically ranges from 0.7 to 0.9. Areas with visual importance scores above this threshold (such as the lips and eyes) are considered critical areas and must be rendered with the highest computing resources to ensure detail. $\tau_{\text{low}}$ typically ranges from 0.3 to 0.5. Areas with scores between $\tau_{\text{low}}$ and $\tau_{\text{high}}$ (such as the cheeks and forehead) are considered secondary areas, and moderate resources are used to balance quality and efficiency.

[0103] If $\tau_{\text{high}}$ is set too high (e.g., 0.95), only a very small number of pixels may be allocated high resolution, leading to insufficient detail in important areas. If $\tau_{\text{low}}$ is set too low (e.g., 0.1), too many areas may be allocated medium resolution, reducing the optimization effect.
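The three-tier allocation can be sketched as a simple threshold function. The parameter names `tau_high` and `tau_low`, their default values (chosen from within the typical ranges given above), and the treatment of scores exactly equal to a threshold are assumptions of this sketch.

```python
def allocate_resolution(score, tau_high=0.8, tau_low=0.4):
    """Map a visual importance score in [0, 1] to a resolution tier."""
    if not 0.0 <= tau_low < tau_high <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= tau_low < tau_high <= 1")
    if score >= tau_high:
        return "high"      # critical areas such as lips and eyes
    if score >= tau_low:
        return "medium"    # secondary areas such as cheeks and forehead
    return "low"           # non-focus areas such as hair and background

# Hypothetical scores for four regions of one frame.
scores = {"mouth": 0.95, "cheek": 0.60, "hair": 0.25, "background": 0.10}
tiers = {region: allocate_resolution(a) for region, a in scores.items()}
```

Raising `tau_high` shrinks the set of regions rendered at high resolution, and lowering `tau_low` enlarges the medium tier, which is the trade-off described in the preceding paragraph.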

[0104] Allocating resolution by region reduces computation by 30–50% and peak memory usage by 40%, while applying Gaussian weights when fusing the regions of different resolution eliminates seam artifacts.

[0105] For example, when the target state includes the information "saying 'wow'", the attention map value A(x,y) in the mouth region approaches 1.0. This region is assigned to the high-resolution branch for generation so that clear teeth and tongue textures are rendered, while the hair region, with an A(x,y) value below 0.3, is assigned to the low-resolution branch for faster generation. Finally, Gaussian weighted fusion is used to create a natural transition between the high-resolution lips and the lower-resolution cheeks, eliminating any visible seams. This significantly reduces computational and memory consumption, improving generation efficiency while maintaining high definition in the visual focus area.

[0106] Gaussian weighted fusion is a fusion technique based on multi-scale image representation. Its core idea is to construct multi-resolution versions of an image using a Gaussian pyramid and to apply a Gaussian weighting strategy at different scales to combine information from multiple input images, thereby generating a detailed fusion result with natural transitions. The method works as follows: for a pixel near a boundary, its final color value is a weighted average of the results generated by the high-resolution branch and the low-resolution branch. The weights follow a Gaussian function of distance. At the boundary line, the two weights are approximately 0.5 each. Moving toward the center of the high-resolution region, the weight of the high-resolution result increases steadily until it reaches 1; moving toward the center of the low-resolution region, the weight of the low-resolution result increases until it reaches 1. This distance-based Gaussian weight variation ensures smooth and natural transitions in color and detail, so that no harsh seams are visible to the human eye.
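The per-pixel blending rule described above can be sketched for a single pixel. This is a simplified illustration that does not build a Gaussian pyramid: it assumes the weight of the high-resolution result is the Gaussian cumulative distribution of the signed distance to the boundary (equivalent to smoothing a hard region mask with a Gaussian kernel), which gives 0.5 on the boundary and tends to 1 deep inside the high-resolution region. The function names, the `sigma` value, and the pixel intensities are hypothetical.

```python
import math

def high_res_weight(signed_distance, sigma):
    """Weight of the high-resolution result at a signed distance from the boundary.

    Positive distances lie inside the high-resolution region. The weight is
    0.5 on the boundary, tends to 1 deep inside and to 0 deep outside.
    """
    return 0.5 * (1.0 + math.erf(signed_distance / (sigma * math.sqrt(2.0))))

def fuse(high_value, low_value, signed_distance, sigma=2.0):
    """Weighted average of the two branch outputs for one pixel."""
    w = high_res_weight(signed_distance, sigma)
    return w * high_value + (1.0 - w) * low_value

# Hypothetical intensities: 200 from the high-resolution branch, 100 from the low one.
row = [fuse(200.0, 100.0, d) for d in range(-6, 7)]
```

Across the boundary the values in `row` rise smoothly from about 100 to about 200 instead of jumping, which is the seam-free transition the fusion step is meant to provide; a larger `sigma` widens the transition band.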

[0107] This application fundamentally avoids the computational overhead of iteration by implementing single-step deterministic inference. This technique abandons the generation path that relies on multi-step iterative denoising and instead constructs and predicts the cross-modal spatiotemporal average flow field, which captures the overall evolution trend from the current state to the target state. Therefore, the model can directly output the final generation result in a single forward propagation, eliminating the inherent delay caused by multiple iterations.

[0108] A dynamic resolution mechanism enables precise on-demand allocation of computing resources. This mechanism incorporates a visual attention prediction network that evaluates the importance of different regions of the image in real time based on the currently generated content. High-computational-complexity branches are used for critical regions such as the mouth and eyes to preserve details, while lightweight branches are used for non-critical regions. Finally, Gaussian-weighted fusion technology ensures visual continuity between regions of different resolutions. This strategy effectively reduces the overall computational cost required for a single forward propagation.

[0109] In terms of interactive experience, a real-time feedback mechanism is introduced. Traditional digital human generation processes are unidirectional and static. This invention integrates an interactive flow field correction mechanism, enabling the system to respond in real-time to natural feedback from users, such as voice, text, or gestures. By dynamically fine-tuning the predetermined generation trajectory through a flow field residual network, the system can dynamically adjust the generated content within milliseconds (e.g., "make the digital human smile"), transforming the digital human from a pre-set player into a real-time guided intelligent agent, greatly enhancing the flexibility of interaction and user engagement.

[0110] Intelligent on-demand allocation of computing resources is achieved. Addressing the varying visual importance of different regions in a digital human, this invention employs a dynamic resolution generation strategy. Based on a visual attention map, this strategy allocates high computing resources to critical areas such as the mouth and eyes to generate rich details, while using lightweight generation for non-focus areas such as hair and background. Furthermore, Gaussian weighted fusion technology seamlessly eliminates resolution boundaries, saving computational overhead and peak memory without affecting visual perception, thus enabling high-quality digital human deployment on consumer-grade hardware.

[0111] In terms of system architecture, technologies such as time alignment, multimodal fusion, and high-quality rendering are integrated into a unified, trainable framework. This end-to-end design significantly reduces technical complexity and integration barriers, freeing developers from the need to worry about tedious intermediate module optimization and allowing them to focus on upper-layer application innovation, thus powerfully promoting the popularization and application of digital human technology.

[0112] In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 3. The computer device includes a processor, memory, network interface, and database connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile and / or volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and database. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The network interface is used to communicate with external clients via a network connection. When the computer program is executed by the processor, it implements the functions or steps of a real-time digital human generation method on the server side.

[0113] In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to perform the following steps: Obtain the user's multimodal data at the current moment; Convert user multimodal data into a multimodal joint state vector; The multimodal joint state vector is input into a pre-trained cross-modal spatiotemporal average flow field network, and the cross-modal spatiotemporal average flow field is obtained through single-step inference prediction; wherein, the cross-modal spatiotemporal average flow field is used to describe the joint evolution of the digital human's multimodal state from the current time to the target time. The cross-modal spatiotemporal averaged flow field is rendered to generate digital human image frames at the target time. Transmit the digital human image frame to the display interface for display.

[0114] This application achieves single-step deterministic inference, abandoning the generation path that relies on multi-step iterative denoising, and instead constructs and predicts a physical quantity called the "cross-modal spatiotemporal average flow field." This flow field defines the overall evolution trend from the current state to the target state. Therefore, the model can directly output the final generation result in a single forward propagation, eliminating the inherent delay caused by multiple iterations. In one embodiment, a computer-readable storage medium is proposed, which stores a computer program that, when executed by a processor, performs the following steps: Obtain the user's multimodal data at the current moment; Convert user multimodal data into a multimodal joint state vector; The multimodal joint state vector is input into a pre-trained cross-modal spatiotemporal average flow field network, and the cross-modal spatiotemporal average flow field is obtained through single-step inference prediction; wherein, the cross-modal spatiotemporal average flow field is used to describe the joint evolution of the digital human's multimodal state from the current time to the target time. The cross-modal spatiotemporal average flow field is rendered to generate digital human image frames at the target time. Transmit the digital human image frame to the display interface for display.

[0115] This application achieves single-step deterministic inference. Instead of following a generation path that relies on multi-step iterative denoising, it constructs and predicts a physical quantity called the "cross-modal spatiotemporal average flow field." This flow field defines the overall evolution trend from the current state to the target state. Therefore, the model can directly output the final generation result in a single forward pass, eliminating the inherent delay caused by multiple iterations.
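
As a minimal sketch of the single-step inference described above, assume the average flow field acts as an average velocity, so that the target state equals the current state plus the elapsed time multiplied by the flow. This update rule, the function names, and the toy `predict_average_flow` stand-in for the pre-trained network are assumptions made for illustration and are not taken from this application.

```python
def predict_average_flow(state, t_now, t_target):
    """Hypothetical stand-in for the pre-trained flow field network.

    A real network would be learned; this toy rule simply pulls every
    component of the state halfway toward zero per unit of time.
    """
    return [-0.5 * x for x in state]


def single_step_generate(state, t_now, t_target, flow_fn=predict_average_flow):
    """Jump from the current state to the target state in one forward pass."""
    avg_flow = flow_fn(state, t_now, t_target)  # the only model call
    dt = t_target - t_now
    # Assumed update rule: target = current + elapsed time * average flow.
    return [x + dt * u for x, u in zip(state, avg_flow)]


state_now = [2.0, -4.0]
state_target = single_step_generate(state_now, t_now=0.0, t_target=1.0)
print(state_target)
```

Under this assumption the model is evaluated exactly once per frame, whereas an iterative denoising scheme would call it once per denoising step.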

[0116] It should be noted that, for the functions or steps that can be implemented by the computer-readable storage medium or computer device described above, reference may be made to the relevant descriptions of the server side and client side in the foregoing method embodiments. To avoid repetition, they are not described again here.

[0117] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

[0118] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.

[0119] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.

Claims

1. A method for real-time generation of digital humans, characterized in that the method includes: obtaining the user's multimodal data at the current moment; converting the user's multimodal data into a multimodal joint state vector; inputting the multimodal joint state vector into a pre-trained cross-modal spatiotemporal average flow field network, and obtaining the cross-modal spatiotemporal average flow field through single-step inference prediction, wherein the cross-modal spatiotemporal average flow field describes the joint evolution of the digital human's multimodal state from the current time to the target time; rendering the cross-modal spatiotemporal average flow field to generate a digital human image frame at the target time; and transmitting the digital human image frame to the display interface for display.

2. The method for real-time generation of digital humans according to claim 1, characterized in that the cross-modal spatiotemporal average flow field network adopts a multi-branch Transformer encoder structure; the multi-branch Transformer encoder encodes the multimodal joint state vector and models the dynamic coupling relationship between the features of different modalities through a cross-modal attention mechanism.
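
For illustration only, the cross-modal attention named in claim 2 can be sketched as plain scaled dot-product attention in which the features of one modality query the features of another. The sketch omits the learned projection matrices and the multi-branch encoder, and the names `cross_modal_attention`, `audio`, and `face` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(query_feats, context_feats):
    """Let each query-modality token attend over the context modality.

    Both inputs are lists of equal-length feature vectors. Returns the
    fused features and the attention weights (one row per query token).
    """
    dim = len(query_feats[0])
    fused, all_weights = [], []
    for q in query_feats:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in context_feats]
        weights = softmax(scores)
        fused.append([sum(w * v[j] for w, v in zip(weights, context_feats))
                      for j in range(dim)])
        all_weights.append(weights)
    return fused, all_weights


audio = [[1.0, 0.0], [0.0, 1.0]]             # two hypothetical audio tokens
face = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]  # three hypothetical face tokens
fused, attn = cross_modal_attention(audio, face)
print(attn[0])  # the first audio token weights the first face token most
```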

3. The method for real-time generation of digital humans according to claim 1, characterized in that the method further includes: obtaining real-time user feedback information and encoding the feedback information into a correction vector; correcting, by a flow field residual network, the cross-modal spatiotemporal average flow field based on the correction vector; generating a corrected digital human image frame based on the corrected cross-modal spatiotemporal average flow field; and transmitting the corrected digital human image frame to the display interface for display.
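
One possible reading of the correction in claim 3 is an additive residual: a small network maps the correction vector into the flow field's space, and the result is added to the predicted flow. The sketch below assumes this additive form and uses a fixed linear map as a hypothetical stand-in for the flow field residual network; neither is specified by this application.

```python
def flow_residual(correction, weight):
    """Hypothetical linear residual head: maps the correction vector
    (length m) into flow space (length n) using an n x m weight matrix."""
    return [sum(w * c for w, c in zip(row, correction)) for row in weight]


def correct_flow(avg_flow, correction, weight):
    """Add the residual to the predicted average flow (assumed additive form)."""
    residual = flow_residual(correction, weight)
    return [u + r for u, r in zip(avg_flow, residual)]


avg_flow = [1.0, 2.0, 3.0]
weight = [[0.5, 0.0],
          [0.0, 0.5],
          [1.0, 1.0]]
corrected = correct_flow(avg_flow, [1.0, -1.0], weight)
unchanged = correct_flow(avg_flow, [0.0, 0.0], weight)
print(corrected, unchanged)
```

With an additive residual, a zero correction vector leaves the predicted flow untouched, so the main network never needs to be re-run when no feedback arrives.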

4. The method for real-time generation of digital humans according to claim 3, characterized in that the user feedback information includes at least one of voice, gesture, and text.

5. The method for real-time generation of digital humans according to claim 1, characterized in that the method further includes: calculating the time offset between two modalities in the multimodal joint state vector by using a time offset estimation network; and performing time compensation, based on the time offset, on the modality that has a time delay.
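
The estimation and compensation steps of claim 5 can be illustrated with a deliberately simpler technique. The sketch below replaces the learned time offset estimation network with a brute-force cross-correlation search over integer frame lags, then advances the delayed modality by the estimated lag. The function names and the sample signals are hypothetical.

```python
def estimate_offset(reference, delayed, max_lag):
    """Return the lag (in frames) by which `delayed` trails `reference`.

    Stand-in for the learned estimator: picks the lag with the highest
    mean product over the overlapping samples. Requires
    max_lag < len(reference).
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        n = len(reference) - lag
        score = sum(reference[i] * delayed[i + lag] for i in range(n)) / n
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def compensate(delayed, lag):
    """Advance the delayed modality by `lag` frames, holding the last
    sample to keep the sequence length unchanged."""
    if lag == 0:
        return list(delayed)
    return list(delayed[lag:]) + [delayed[-1]] * lag


audio = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
video = [0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0]  # audio, 2 frames late
lag = estimate_offset(audio, video, max_lag=4)
aligned_video = compensate(video, lag)
print(lag, aligned_video)
```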

6. The method for real-time generation of digital humans according to claim 5, characterized in that the method further includes: obtaining a training dataset, which contains multiple sets of temporally continuous digital human multimodal state sequences; inputting the temporally continuous multimodal state pairs in the training dataset into the cross-modal spatiotemporal average flow field network to obtain the cross-modal spatiotemporal average flow field predicted by the network; constructing a flow field matching loss based on the difference between the predicted cross-modal spatiotemporal average flow field and the actual state change in the multimodal state pair, wherein the flow field matching loss characterizes the degree of multimodal synchronization; and performing backpropagation based on the flow field matching loss to update the parameters of the cross-modal spatiotemporal average flow field network and the parameters of the time offset estimation network.
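
A minimal sketch of the training objective in claim 6, assuming the flow field matching loss is a mean squared error between the predicted flow and the observed state change per unit time. The one-parameter "network" (`w * state`) and its hand-derived gradient are hypothetical simplifications; the update of the time offset estimation network is omitted.

```python
def target_flow(state_now, state_next, dt):
    """Observed average rate of change between two consecutive states."""
    return [(b - a) / dt for a, b in zip(state_now, state_next)]


def flow_matching_loss(pred_flow, state_now, state_next, dt):
    """Mean squared difference between predicted and observed flow (assumed form)."""
    target = target_flow(state_now, state_next, dt)
    return sum((p - t) ** 2 for p, t in zip(pred_flow, target)) / len(target)


def train_gain(pairs, dt, lr=0.1, steps=50):
    """Fit a toy one-parameter model pred = w * state by gradient descent."""
    w = 0.0
    for _ in range(steps):
        for state_now, state_next in pairs:
            target = target_flow(state_now, state_next, dt)
            # Analytic gradient of the mean squared error with respect to w.
            grad = 2.0 * sum((w * x - t) * x
                             for x, t in zip(state_now, target)) / len(target)
            w -= lr * grad
    return w


pairs = [([1.0, 2.0], [1.5, 3.0])]  # one hypothetical consecutive state pair
dt = 0.5
w = train_gain(pairs, dt)
loss_before = flow_matching_loss([0.0, 0.0], pairs[0][0], pairs[0][1], dt)
loss_after = flow_matching_loss([w * x for x in pairs[0][0]],
                                pairs[0][0], pairs[0][1], dt)
print(w, loss_before, loss_after)
```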

7. The method for real-time generation of digital humans according to claim 1, characterized in that the step of rendering the cross-modal spatiotemporal average flow field to generate a digital human image frame at the target time includes: obtaining a target multimodal joint state vector based on the cross-modal spatiotemporal average flow field; and rendering the target multimodal joint state vector into a digital human image frame by using a dynamic resolution flow field, wherein the digital human image frame has different resolutions in different regions.

8. The method for real-time generation of digital humans according to claim 7, characterized in that the step of rendering the target multimodal joint state vector into a digital human image frame by using a dynamic resolution flow field includes: assigning a visual importance score to each position in the image corresponding to the target multimodal joint state vector based on a visual attention mechanism; assigning a resolution to each position based on its visual importance score; and fusing the positions at different resolutions by Gaussian weighting to generate the digital human image frame.
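
The three steps of claim 8 can be sketched as follows, under several assumptions not stated in this application: only two resolution levels are used, importance scores are thresholded to choose between them, and "Gaussian weighting" is read as blurring the resulting high-resolution mask with a Gaussian window so that the two levels blend smoothly at region boundaries. All names, the threshold, and the kernel parameters are hypothetical.

```python
import math


def low_resolution(img, factor=2):
    """Average factor x factor blocks and repeat them back to full size."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for by in range(0, h, factor):
        for bx in range(0, w, factor):
            ys = range(by, min(by + factor, h))
            xs = range(bx, min(bx + factor, w))
            mean = sum(img[y][x] for y in ys for x in xs) / (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out


def gaussian_weights(mask, sigma=1.0, radius=2):
    """Blur a 0/1 mask with a normalized Gaussian window (edges clamped)."""
    h, w = len(mask), len(mask[0])
    offsets = range(-radius, radius + 1)
    kernel = {(dy, dx): math.exp(-(dy * dy + dx * dx) / (2 * sigma * sigma))
              for dy in offsets for dx in offsets}
    total = sum(kernel.values())
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for (dy, dx), k in kernel.items():
                yy = min(max(y + dy, 0), h - 1)
                xx = min(max(x + dx, 0), w - 1)
                acc += k * mask[yy][xx]
            out[y][x] = acc / total
    return out


def fuse(img, importance, threshold=0.5):
    """Blend full-resolution and low-resolution pixels by importance."""
    mask = [[1.0 if s >= threshold else 0.0 for s in row] for row in importance]
    weight = gaussian_weights(mask)
    low = low_resolution(img)
    return [[weight[y][x] * img[y][x] + (1.0 - weight[y][x]) * low[y][x]
             for x in range(len(img[0]))] for y in range(len(img))]


img = [[float(4 * y + x) for x in range(4)] for y in range(4)]
importance = [[1.0, 1.0, 0.0, 0.0] for _ in range(4)]  # left half matters
frame = fuse(img, importance)
print(frame[0])
```

Pixels deep inside an important region keep their full-resolution values, pixels deep inside an unimportant region take the block average, and pixels near the boundary receive a weighted mix of the two.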

9. A computer device, characterized in that the computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method for real-time generation of digital humans according to any one of claims 1 to 8.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps of the method for real-time generation of digital humans according to any one of claims 1 to 8.